TalkingData, China’s largest third-party mobile data platform, understands that everyday choices and behaviors paint a picture of who we are and what we value. Currently, TalkingData is seeking to leverage behavioral data from more than 70% of the 500 million mobile devices active daily in China to help its clients better understand and interact with their audiences.
In this competition, Kagglers are challenged to build a model predicting users’ demographic characteristics based on their app usage, geolocation, and mobile device properties. Doing so will help millions of developers and brand advertisers around the world to pursue data-driven marketing efforts which are relevant to their users and catered to their preferences.
The data is collected from the TalkingData SDK integrated within mobile apps. TalkingData serves under the service terms agreed between TalkingData and mobile app developers. Full recognition and consent from the individual users of those apps have been obtained, and appropriate anonymization has been performed to protect privacy. Due to confidentiality, we won’t provide details on how the gender and age data were obtained. Please treat them as accurate ground truth for prediction.
The data schema can be represented in the following chart:
Given –
Training Data and Test Data.
- The main file in the training data is gender_age_train.csv, which consists of 74,645 records and includes device IDs, gender, age, and group [the field that needs to be predicted for the test data].
- The phone_brand_device_model.csv file consists of 187,245 entries and includes device IDs, brands, and models. Device IDs are given for both training and test data.
- The events.csv file has 3,252,950 events. When a user uses TalkingData SDK, the event gets logged in this data. Each event has an event ID and a location (lat/long), and the event corresponds to a list of apps in app_events.csv.
- The app_labels.csv and label_categories.csv files provide more information, i.e., the labels and categories of apps.
To Find –
In this competition, you have to predict the demographics of a user (gender and age) based on their app download and usage behaviors.
The feature to be predicted in this competition is the ‘group’ feature. It has 12 categories that combine gender and age range, six for females and six for males, as follows:
F23-, F24-26, F27-28, F29-32, F33-42, F43+
M22-, M23-26, M27-28, M29-31, M32-38, M39+
Solution
The Synerzip team took up the challenge of the TalkingData Kaggle competition, where the focus was on predicting users’ demographic characteristics based on their app usage, geolocation and mobile device properties. TalkingData provided a good amount of data, which needed to be analysed.
The team concluded that this was a typical multi-class classification problem, in which the model needs to predict the probability of a user belonging to each of these categories. The evaluation metric is the multi-class logarithmic loss [mlogloss]; lower is better, and a perfect prediction scores 0.
We used Python with SciPy, NumPy, and XGBoost.
Analysis & Strategy
The team started analyzing this data from a statistical perspective.
To start with, they needed a baseline result, one that could be obtained without taking any particular feature into consideration.
There are 12 classes, so we can assign 1/12 = 0.08333 as the probability for each class, which gives mlogloss = -ln(1/12) = 2.4849.
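The baseline figure can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the function name `mlogloss`, the integer class indices, and the toy labels are illustrative assumptions, not the team's code.

```python
import math

NUM_CLASSES = 12  # twelve gender/age groups, indexed 0..11 here for brevity


def mlogloss(true_labels, predicted_probs, eps=1e-15):
    """Mean negative log of the probability assigned to each true class."""
    total = 0.0
    for label, probs in zip(true_labels, predicted_probs):
        # Clip so that a probability of exactly 0 does not produce log(0).
        p = min(max(probs[label], eps), 1.0 - eps)
        total -= math.log(p)
    return total / len(true_labels)


# Uniform prediction: every class gets probability 1/12.
uniform = [1.0 / NUM_CLASSES] * NUM_CLASSES

# Made-up true labels; with a uniform prediction the labels do not matter.
labels = [0, 5, 11, 3]

baseline = mlogloss(labels, [uniform] * len(labels))
print(round(baseline, 4))  # 2.4849
```

Because every device receives the same 1/12 for its true class, the score is the same on any data set, which is what makes it a convenient reference point.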
The Synerzip team wanted to improve this result further…
The first feature we can use is Brand and Model information.
We calculated the probabilities for each of the 12 classes, by brand and by model, and assigned each device the probabilities of its model.
For the test data, we have to take into account that a particular model may have no entries in the training data; in that case we used the brand probabilities.
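The lookup described above can be sketched as follows. This is an illustration under stated assumptions, not the team's implementation: the toy rows, the brand and model names, and the helper names `class_probabilities` and `predict` are all hypothetical, and models are keyed by name alone.

```python
from collections import Counter, defaultdict

# Hypothetical toy training rows: (brand, model, group).
train_rows = [
    ("BrandA", "A1", "M23-26"),
    ("BrandA", "A1", "M23-26"),
    ("BrandA", "A1", "F24-26"),
    ("BrandA", "A2", "F24-26"),
    ("BrandB", "B1", "M39+"),
]


def class_probabilities(rows, key_index):
    """Map each brand (key_index=0) or model (key_index=1) to its group frequencies."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[key_index]][row[2]] += 1
    return {
        key: {group: n / sum(c.values()) for group, n in c.items()}
        for key, c in counts.items()
    }


brand_probs = class_probabilities(train_rows, 0)
model_probs = class_probabilities(train_rows, 1)


def predict(brand, model):
    """Use the model's probabilities; fall back to the brand's if the model is unseen."""
    if model in model_probs:
        return model_probs[model]
    # The source does not say what to do for an unseen brand; None is returned here.
    return brand_probs.get(brand)


print(predict("BrandA", "A1"))  # model seen in training
print(predict("BrandA", "A9"))  # unseen model, brand fallback
```

Groups that never occur for a given model are simply absent from its dictionary, which means they carry an implied probability of zero.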
After applying the probabilities, the results (mlogloss) are:
On training data: 2.26992
On test data: 3.64634
As we can see, the result improved on the training data but deteriorated on the test data, where it is now worse than the 2.48 baseline found earlier. (Any useful result has to beat that mark.)
Re-analyzing the data and looking at the distribution, we see that there are many cases where a Model/Brand has no entries in some classes. This creates a bias: those classes get a probability of zero, so a user of that model can only fall into the classes seen in training, which may not be right in the real world. Since mlogloss heavily penalizes a true class that was given a near-zero probability, a few such cases are enough to inflate the test score.
To rectify this bias, we simply add a dummy record for the missing data, i.e., wherever we observe zero records for a class, we add one record for that class to that particular Model/Brand.
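A minimal sketch of this dummy-record step, assuming the per-model counts are held in a dictionary. The helper names and the four stand-in classes are hypothetical and chosen only to keep the example short.

```python
def fill_zero_classes(counts, classes):
    """Add one dummy record to every class that has no records at all."""
    return {c: max(counts.get(c, 0), 1) for c in classes}


def to_probabilities(counts):
    """Turn per-class counts into probabilities that sum to 1."""
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}


# Four stand-in classes instead of twelve, with made-up counts for one model.
classes = ["c0", "c1", "c2", "c3"]
observed = {"c0": 3, "c1": 1}  # c2 and c3 were never seen for this model

filled = fill_zero_classes(observed, classes)
probs = to_probabilities(filled)

print(filled)  # {'c0': 3, 'c1': 1, 'c2': 1, 'c3': 1}
print(probs)
```

Note that this differs slightly from textbook add-one (Laplace) smoothing, which adds one to every class rather than only to the empty ones.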
Now with these new probabilities, we get new results (mlogloss):
On training data: 2.31621
On test data: 2.40066
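We can see that the result on test data improved substantially (from 3.64634 to 2.40066) and now beats the 2.4849 baseline, since the bias has been removed. The training result is slightly worse (2.31621 versus 2.26992), which is expected because the dummy records move some probability away from the classes actually observed in training.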
Result Validation
To verify and validate these results, we used XGBoost.
We provided the training data (brand and model as input features, group as the target variable), using 80% of it for training and 20% for validation.
The results (mlogloss) are:
On Training data: 2.372958
On Test data: 2.39906
This is very close to the figures we got using the statistical approach we started with, so we can assume we are heading in the right direction.
Next, we looked into events data.
First, we needed to find out how many devices have events data. After analyzing the data, we found that events data is available for only 31.4% of the devices, in both the training and the test data sets.
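The coverage check can be sketched as a set intersection. This assumes each event row carries the ID of the device that produced it; the function name `event_coverage` and the toy IDs are made up for illustration.

```python
def event_coverage(device_ids, event_device_ids):
    """Fraction of the given devices that appear at least once in the events data."""
    devices = set(device_ids)
    with_events = devices & set(event_device_ids)
    return len(with_events) / len(devices)


# Made-up IDs: five devices, two of which have logged events.
train_devices = [101, 102, 103, 104, 105]
event_devices = [101, 101, 104, 999]  # repeats and unknown devices are ignored

coverage = event_coverage(train_devices, event_devices)
print(coverage)  # 0.4
```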
Assuming we predict the right “Group” with full confidence for every device that has events data (an mlogloss of 0) and stay at about 2.4 for the remaining 68.6% of devices, simple math gives the following result (mlogloss):
0.314*0+0.686*2.4 = 1.6464
As derived above, the best mlogloss we can expect for this problem is ~1.6464, as we have no further data for the devices without events.
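The arithmetic works because mlogloss is an average over devices, so the overall score is the average of the two groups' scores weighted by their share of devices. A small sketch, with a hypothetical helper name:

```python
def blended_mlogloss(coverage, loss_with_events, loss_without_events):
    """Overall mlogloss as a weighted mean of the two device groups."""
    return coverage * loss_with_events + (1.0 - coverage) * loss_without_events


# 31.4% of devices predicted perfectly, the rest at the brand/model level of about 2.4.
bound = blended_mlogloss(0.314, 0.0, 2.4)
print(round(bound, 4))  # 1.6464
```

Plugging in a more realistic, non-zero loss for the devices with events shows how much of the gap to this bound remains open.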
This is a simple explanation of how Synerzip tried to solve this problem of probability. You can try and solve the problem with the guidelines given above. Maybe you will be able to dive deeper and solve many more on similar lines.
Next, we will start working on feature engineering based on the events data. Stay tuned for our upcoming blog!