DT Data Scientists win 2022 Kaggle Days Paris Competition


Two members of DT’s Data Science team, Aliaksandr Varshyau and Wojciech Rosa, were recently awarded a prestigious first place position at the Kaggle Days 2022 Paris edition competition.

Kaggle is the world’s largest data science community, offering powerful tools and resources, and it now has more than 14M user accounts. Kaggle Days are a series of events for experienced data scientists and Kagglers: international events and local meetups that bring together Kagglers and people interested in Data Science, Machine Learning, and Artificial Intelligence.

The key values for every event are knowledge, high quality content and speakers, passionate people, networking and development. Every Kaggle Days Event and Meetup brings something new and important to the table.

LVMH was the main sponsor of the event.

About 350 data science specialists came to the amazing Google offices in downtown Paris. Among them were also the best and most recognized figures from the world of Data Science and Kaggle: Philipp Singer (aka Psi), Andrada Olteanu, Gilberto Titericz (aka Giba), Jean-Francois Puget (aka CPMP) and many others.

The competition combined e-commerce recommender systems with textual data.

Participants were given 9 hours to understand both the problem and data, explore datasets, write code, test, evaluate their models and finally submit solutions/predictions.

The winning solution was developed by Aliaksandr and Wojciech and uses feature engineering, LGBM and NLP.

The following challenge was presented to the competition participants.

A user enters a keyword in an e-commerce search engine and receives a list of results. The user then clicks on the selected items. The goal of the competition was (for each user) to predict a list of items and present the items on which they are most likely to click (based on historical data).

To solve it, Aliaksandr and Wojciech decided the most effective approach would be Python with pandas and LGBM classification trees, framing the task as binary classification: for each item, label = 1 when the user clicks on the item and label = 0 otherwise.
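A minimal sketch of this labelling step, not the competition code: it assumes a hypothetical search log with the columns session_id, keyword, search_results and clicked, and expands it into one row per shown item with a binary label.

```python
import pandas as pd

# Hypothetical toy log: one row per search, with the items shown and the items clicked.
searches = pd.DataFrame({
    "session_id": [1, 2],
    "keyword": ["red dress", "leather bag"],
    "search_results": [[10, 11, 12], [20, 21]],
    "clicked": [[11], [20, 21]],
})

def build_training_rows(searches: pd.DataFrame) -> pd.DataFrame:
    """Expand each search into one row per shown item with a binary click label."""
    rows = searches.explode("search_results").rename(
        columns={"search_results": "product_id"}
    )
    # label = 1 if the shown item was clicked in that search, else 0
    rows["label"] = [
        int(pid in clicked)
        for pid, clicked in zip(rows["product_id"], rows["clicked"])
    ]
    return rows.drop(columns="clicked").reset_index(drop=True)

train = build_training_rows(searches)
print(train)
```

Each row of the resulting table is one (search, item) pair, which is the shape a binary classifier such as LGBM expects.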

For inference, we used user-level features (session length, location, shopping cart length, etc.) and item-level features (NLP similarity between the keyword entered and the item description from the database).
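The article does not say which NLP model produced the similarity feature. As a stand-in, the sketch below computes a simple bag-of-words cosine similarity between the keyword and an item description; the function names are hypothetical, and a real solution would more likely use learned text embeddings.

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> Counter:
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(keyword: str, description: str) -> float:
    """Cosine similarity between the token-count vectors of two texts."""
    a, b = tokenize(keyword), tokenize(description)
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    return dot / norm if norm else 0.0

# Item-level feature: how closely the description matches the search keyword.
print(cosine_similarity("red dress", "red summer dress"))
print(cosine_similarity("red dress", "brown leather bag"))
```

The resulting score can be stored as one more numeric column next to the user-level features.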

The winning solution predicted, for each product_id in the list of search_results, a click probability between 0 and 1, and then kept the 12 product_ids with the highest probabilities.
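The ranking step can be sketched with pandas as follows. The column names session_id, product_id and score are assumptions for illustration, and the toy example keeps the top 2 items rather than 12 so the effect is visible.

```python
import pandas as pd

def top_k_per_session(scored: pd.DataFrame, k: int = 12) -> dict:
    """Return, per session, the k product_ids with the highest predicted score."""
    ranked = scored.sort_values(["session_id", "score"], ascending=[True, False])
    top = ranked.groupby("session_id").head(k)
    return top.groupby("session_id")["product_id"].agg(list).to_dict()

# Hypothetical predicted click probabilities for two searches.
scored = pd.DataFrame({
    "session_id": [1, 1, 1, 2, 2],
    "product_id": [10, 11, 12, 20, 21],
    "score": [0.2, 0.9, 0.5, 0.1, 0.7],
})

recommendations = top_k_per_session(scored, k=2)
print(recommendations)
```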

We wrote the solution in Python. The challenge was to create a recommendation system with textual data, so we needed a way to combine two methods on tabular data: LGBM and NLP-based embeddings. A common solution is to blend the predictions of two models; instead, we decided to use the embeddings as features for LGBM. That idea turned out to be powerful enough to help us win the competition.

The main problem was extracting the data and manually feeding it into LGBM. We ran experiments on the data we had, using local validation. This model is not normally used for this type of problem, but given the time constraints of the competition, it was important that we could extract and validate the data as quickly as possible.

Under the competition rules, we were allowed only 20 submissions. We therefore took a slightly different approach and validated on local test samples only, which is very similar to the way Digital Turbine works today, since local evaluation is critical.
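A local validation loop of this kind could look like the sketch below. The article does not name the competition metric, so mean average precision at 12 is used here purely as an assumed example, and the helper names are hypothetical. Whole sessions are held out so that no search appears in both the training and validation sets.

```python
import random

def split_sessions(session_ids, holdout_fraction=0.2, seed=0):
    """Split whole sessions into (train, validation) sets of session ids."""
    ids = sorted(set(session_ids))
    random.Random(seed).shuffle(ids)
    n_holdout = max(1, int(len(ids) * holdout_fraction))
    return set(ids[n_holdout:]), set(ids[:n_holdout])

def average_precision_at_k(clicked, ranked, k=12):
    """Average precision of a ranked list (no duplicates) against clicked items."""
    hits, score = 0, 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in clicked:
            hits += 1
            score += hits / rank
    return score / min(len(clicked), k) if clicked else 0.0

def mean_average_precision(clicks_by_session, ranking_by_session, k=12):
    """Mean of average_precision_at_k over all sessions with known clicks."""
    scores = [
        average_precision_at_k(clicks, ranking_by_session.get(session, []), k)
        for session, clicks in clicks_by_session.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0

train_ids, valid_ids = split_sessions(range(10))
clicks = {1: {11}, 2: {20, 21}}
rankings = {1: [10, 11, 12], 2: [20, 21]}
print(mean_average_precision(clicks, rankings))
```

Scoring every experiment this way on held-out sessions makes it possible to compare ideas without spending any of the limited leaderboard submissions.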

As the competition progressed and our submissions became more accurate, we slowly moved up the leaderboard until we were in first place, ending the competition with a winning margin of some 2% over our closest rivals.

Additional Links

The full solution code can be found in the links below.

By Wojciech Rosa
