Genpact Machine Learning Hackathon - 5th Place solution
Here is my 5th place solution to the Genpact Machine Learning Hackathon conducted by Analytics Vidhya in December 2018.
The full Python code is available on my github repository.
Problem Statement
The task in this ML hackathon was to predict the number of food orders for an online food delivery business at each of their branches on a particular week in the future.
Solving such a problem is useful for planning just-in-time procurement of ingredients so as to reduce wastage and costs.
A look at the data
Here’s the training data we were asked to work with.
Column | Description |
---|---|
id |
Unique transaction id |
week |
Week number; training data had weeks 1 through 145 |
center_id |
Unique identifier for the branch of the online food delivery business |
meal_id |
Unique identifier for the meal |
checkout_price |
Price of the meal after discounts, coupons, etc |
base_price |
Base price of the meal |
emailer_for_promotion |
Boolean indicating whether the meal was promoted via email |
homepage_featured |
Boolean indicating whether the meal was featured on the website’s homepage |
num_orders |
The target (or dependent) variable we were asked to predict |
There was also the following information about the branch of the food delivery business.
Column | Description |
---|---|
center_id |
Unique identifier for the branch of the online food delivery business |
city_code |
Unique identifier for the city in which the branch operates |
region_code |
Unique identifier for the region in which the branch operates |
center_type |
Categorical variable for the branch type |
op_area |
Operating area of the branch |
Then, there was some information about the meal’s themselves.
Column | Description |
---|---|
meal_id |
Unique identifier for the meal |
category |
The meal category |
cuisine |
The meal cuisine (categorical variable) |
Machine Learning Model
I decided to use the LightGBM regressor for this challenge since from my experience in such competitions, gradient boosted trees are very powerful and popular.
Feature Engineering and Data Transformations
I decided to use most of the given features as it is apart from the following new features I designed.
Feature | Description |
---|---|
week_sin |
Sine transform of the ‘week’ to capture cyclic dependency |
week_cos |
Cosine transform of the ‘week’ to capture cyclic dependency |
price_diff_percent |
Percentage difference between checkout_price and base_price |
Sine and cosine transform’s
are very frequently used to represent cyclic features like the ‘week’ in our case.
This is useful when you are trying to capture dependencies like increased demand during a particular
month every year due to a festival, for example.
The formula for the sine and cosine transform for the ‘week’ variable is as below:
week_sin = np.sin(2 * np.pi * week / 52.143)
week_cos = np.cos(2 * np.pi * week / 52.143)
Ofcourse, I decided to keep the original ‘week’ feature as well to capture long-term dependency (for example, increase in demand over the years).
I used scikit-learn’s label encoder
to encode categorical variables since that is how LightGBM prefers it.
Transforming the target variable
I used the log transform (np.log1p()) on the target variable - num_orders
- so that
it looked more like a gaussian distribution (bell-shaped curve). The original ‘num_orders’
had values ranging from a few hunders to several thousands with a majority of the values
in the lower range.
Another reason for the log transformation of the target variable was that the metric for the competition was RMSLE (root mean squared log error) which means after the log transformation of the target variable, I could simply use the build-in “mse” or “rmse” metric of LightGBM.
Hyperparameter tuning
I used scikit-learn’s Parameter Grid to systematically search through hyperparameter values for the LightGBM model.
The hyperparameters I tuned with this method are:
- colsample_bytree - Also called feature fraction, it is the fraction of features to consider while building a single gradient boosted tree. Reducing its value reduces overfitting by considering fewer features while building each tree.
- min_child_samples - The number of samples in the leaf node of the tree.
- num_leaves - The number of leaf nodes. Higher the number, the more complex and deeper the tree is going to be making the model overfit.
Choosing the cross-validation set
Since we are trying to predict the number of orders on a future date, it makes sense to order the training data by the ‘week’ in ascending order and then pick samples at the end of the list as our cross-validation set. For example, since we are given training data for week’s 1 through 145, we can consider data for week’s 1 through 140 as our training data and week’s 141 through 145 as our cross-validation data.
For this, I used scikit-learn’s train test split
to split the given training data into a train and cross-validation set.
Note that I explicitly set shuffle=False
since we want the data to be ordered
by week and we want to take samples towards the end as our cross-validation set.
Solution
The full Python code is available on my github repository.
Read the training and test datasets.
df_train = pd.read_csv('train_GzS76OK/train.csv')
df_center_info = pd.read_csv('train_GzS76OK/fulfilment_center_info.csv')
df_meal_info = pd.read_csv('train_GzS76OK/meal_info.csv')
df_test = pd.read_csv('test_QoiMO9B.csv')
Merge with branch and meal information.
df_train = pd.merge(df_train, df_center_info,
how="left",
left_on='center_id',
right_on='center_id')
df_train = pd.merge(df_train, df_meal_info,
how='left',
left_on='meal_id',
right_on='meal_id')
df_test = pd.merge(df_test, df_center_info,
how="left",
left_on='center_id',
right_on='center_id')
df_test = pd.merge(df_test, df_meal_info,
how='left',
left_on='meal_id',
right_on='meal_id')
Feature engineering - Convert ‘city_code’ and ‘region_code’ into a single feature - ‘city_region’.
df_train['city_region'] = \
df_train['city_code'].astype('str') + '_' + \
df_train['region_code'].astype('str')
df_test['city_region'] = \
df_test['city_code'].astype('str') + '_' + \
df_test['region_code'].astype('str')
Label encode categorical features (label encoded features will have suffix encoded).
label_encode_columns = ['center_id',
'meal_id',
'city_code',
'region_code',
'city_region',
'center_type',
'category',
'cuisine']
le = preprocessing.LabelEncoder()
for col in label_encode_columns:
le.fit(df_train[col])
df_train[col + '_encoded'] = le.transform(df_train[col])
df_test[col + '_encoded'] = le.transform(df_test[col])
Feature engineering - Sine and Cosine transform for ‘week’ - Capture cyclic dependency.
df_train['week_sin'] = np.sin(2 * np.pi * df_train['week'] / 52.143)
df_train['week_cos'] = np.cos(2 * np.pi * df_train['week'] / 52.143)
df_test['week_sin'] = np.sin(2 * np.pi * df_test['week'] / 52.143)
df_test['week_cos'] = np.cos(2 * np.pi * df_test['week'] / 52.143)
Feature engineering - Price difference percentage.
df_train['price_diff_percent'] = \
(df_train['base_price'] - df_train['checkout_price']) / df_train['base_price']
df_test['price_diff_percent'] = \
(df_test['base_price'] - df_test['checkout_price']) / df_test['base_price']
Feature engineering - Convert the ad campaign features - ‘emailer_for_promotion’ and ‘homepage_featured’ into a single feature.
Both these features were originally boolean (0 and 1). So, adding them up to create a new feature does not require label encoding.
df_train['email_plus_homepage'] = df_train['emailer_for_promotion'] + df_train['homepage_featured']
df_test['email_plus_homepage'] = df_test['emailer_for_promotion'] + df_test['homepage_featured']
Prepare a list of features to train on. Split them into categorical and numerical features.
columns_to_train = ['week',
'week_sin',
'week_cos',
'checkout_price',
'base_price',
'price_diff_percent',
'email_plus_homepage',
'city_region_encoded',
'center_type_encoded',
'op_area',
'category_encoded',
'cuisine_encoded',
'center_id_encoded',
'meal_id_encoded']
categorical_columns = ['email_plus_homepage',
'city_region_encoded',
'center_type_encoded',
'category_encoded',
'cuisine_encoded',
'center_id_encoded',
'meal_id_encoded']
numerical_columns = [col for col in columns_to_train if col not in categorical_columns]
Log transform the target variable - num_orders.
df_train['num_orders_log1p'] = np.log1p(df_train['num_orders'])
I used the np.log1p()
instead of np.log()
because it is more numerically stable (i.e, log(0) is not defined).
Train + Cross-validation split.
The original dataset was already sorted by week number. I just had to pick the samples towards the end
as the cross validation set. This corresponds to week numbers 141 through 145. Since we’re trying
to predict orders at a future date, random shuffling of the dataset before split does not make sense
and hence the shuffle=False
.
X = df_train[categorical_columns + numerical_columns]
y = df_train['num_orders_log1p']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.02, shuffle=False)
Hyperparameter grid search.
scores = []
params = []
param_grid = {'num_leaves': [31, 127, 255],
'min_child_samples': [5, 10, 30],
'colsample_bytree': [0.4, 0.6, 0.8]}
for i, g in enumerate(ParameterGrid(param_grid)):
print("param grid {}/{}".format(i, len(ParameterGrid(param_grid)) - 1))
pprint.pprint(g)
estimator = LGBMRegressor(learning_rate=0.003,
n_estimators=10000,
silent=False,
**g)
fit_params = {'feature_name': categorical_columns + numerical_columns,
'categorical_feature': categorical_columns,
'eval_set': [(X_train, y_train), (X_test, y_test)]}
estimator.fit(X_train, y_train, **fit_params)
scores.append(estimator.best_score_['valid_1']['l2'])
params.append(g)
print("Best score = {}".format(np.min(scores)))
print("Best params =")
print(params[np.argmin(scores)])
LightGBM is able to natively work with categorical features by specifying the categorical_feature
parameter to the fit
method. Also, I’ve stayed with the default evaluation metric for LightGBM regressor
which is L2 (or MSE or Mean Squared Error).
Training the final LightGBM regression model on the entire dataset.
I used a method called early stopping
to reduce overfitting. As a result, I cannot
use the entire dataset for training. I will have to keep aside a test set for the purpose
of early stopping.
The following model was trained using the best hyperparameters obtained by the parameter grid search step above.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.02, shuffle=False)
g = {'colsample_bytree': 0.4,
'min_child_samples': 5,
'num_leaves': 255}
estimator = LGBMRegressor(learning_rate=0.003,
n_estimators=40000,
silent=False,
**g)
fit_params = {'early_stopping_rounds': 1000,
'feature_name': categorical_columns + numerical_columns,
'categorical_feature': categorical_columns,
'eval_set': [(X_train, y_train), (X_test, y_test)]}
estimator.fit(X_train, y_train, **fit_params)
Get predictions on the test data and prepare a submission file for the contest.
Since the target variable was log transformed using np.log1p()
, the predicted num_orders
will have to be inverse transformed using np.expm1()
.
X = df_test[categorical_columns + numerical_columns]
pred = estimator.predict(X)
pred = np.expm1(pred)
submission_df = df_test.copy()
submission_df['num_orders'] = pred
submission_df = submission_df[['id', 'num_orders']]
submission_df.to_csv('submission.csv', index=False)