Applying Different Classification Methods to an Advertisement Click Prediction Problem

In this article, we use a coded dataset from an online auction company to predict user behavior (click or no click) on online ads. This work was the final project for a Machine Learning course that we took. The dataset contains a few features and a label called clicked, which denotes whether the corresponding user clicked on the corresponding ad. Without further ado, let’s start the EDA part and get a better sense of the data. First, we import the essential packages and load the dataset:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

train_data = pd.read_csv("data/train_data.csv")

We used the pandas read_csv function to load the dataset as a data frame. Each row of the table is a training sample, and train_data holds this table with the following columns (features and output):

  • user: a unique ID assigned to each user
  • display: an ID related to the frame on which this ad was being shown
  • creativeId: the ad’s ID
  • campaignId: the advertisement’s campaign ID
  • advertiserId: an ID showing the advertiser of this particular ad
  • widgetId: the place of the ad’s frame on the display
  • device: user’s device model
  • OS: user’s operating system
  • Browser: user’s browser
  • doc: web page’s ID
  • source: website’s ID
  • clicked: the actual target value (1 if the user clicked on the ad, 0 otherwise)

Other features like hourOfDay are self-explanatory.

A serious issue here is that the data is heavily imbalanced: the number of clicked samples in the training set is far below the number of unclicked samples. This can bias the model toward the unclicked class. To get a better sense of the issue, consider a model that ignores its input and always predicts unclicked. Because the data is imbalanced, this trivial model reaches high accuracy and low loss without learning anything useful. As a solution, we upsampled the clicked class: we kept all unclicked samples from the original dataset and then randomly sampled, with replacement, as many clicked samples as there are unclicked samples. The implementation of this part is as follows:

# Doing the upsampling
clicked_samples = train_data[train_data['clicked'] == 1]
non_clicked_samples = train_data[train_data['clicked'] == 0]
clicked_samples = clicked_samples.sample(replace=True, n=len(non_clicked_samples), random_state=1)  # upsampling
train_data = pd.concat([clicked_samples, non_clicked_samples])
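To make the imbalance problem concrete, here is a minimal sketch on made-up labels (the 5% click rate is an assumption for illustration, not the rate in our dataset). A classifier that always predicts unclicked reaches 95% accuracy while never finding a single click:

```python
# Hypothetical labels: 5 clicked samples out of 100 (assumed rate, for illustration only)
y_true = [1] * 5 + [0] * 95

# A "model" that ignores its input and always predicts unclicked
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the clicked class: the fraction of real clicks that the model finds
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print("Accuracy:", accuracy)  # 0.95
print("Recall:  ", recall)    # 0.0
```

This is why accuracy alone is a misleading metric on the raw data, and why the classes are balanced before training.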

After that, we checked whether there were any null values in the dataset, and luckily, there were none. We then shuffled the data before splitting it into train, validation, and test parts so that the distribution of the samples in these three sets is almost the same (this property helps to reduce the bias of our models). We also added a bias feature to the data samples, which is always 1. For linear models, for example, this bias feature makes it possible to learn decision boundaries that do not pass through the origin.

sr = train_data.isna().sum()  # there is no null data in training
train_data['bias'] = np.ones(train_data.shape[0])  # add bias term
train_data = train_data.sample(frac=1)  # shuffle data
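The effect of the bias feature can be seen in a tiny one-dimensional sketch. The weights below are hand-picked for illustration, not learned from our data. Without the constant feature, a linear rule of the form w·x > 0 can only switch class at x = 0; with it, the boundary can sit anywhere, here at x = 2:

```python
def predict(weights, features):
    """Linear classifier: predict 1 when the weighted sum is positive."""
    score = sum(w * f for w, f in zip(weights, features))
    return 1 if score > 0 else 0

xs = [0.5, 1.5, 2.5, 3.5]

# Without a bias feature: score = 1.0 * x, so every positive x is classified as 1
no_bias = [predict([1.0], [x]) for x in xs]

# With a constant bias feature of 1: score = 1.0 * x - 2.0, so the boundary is x = 2
with_bias = [predict([1.0, -2.0], [x, 1.0]) for x in xs]

print(no_bias)    # [1, 1, 1, 1]
print(with_bias)  # [0, 0, 1, 1]
```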

With the following code, we can see the correlation of the different features with the output:

correlation_matrix = train_data.corr().round(2)
sns.heatmap(data=correlation_matrix, annot=False)
plt.show()

The result is as follows:

It is true that the features are currently IDs, and we would need to transform them into one-hot vectors to judge their correlation with the output (clicked) properly. However, because of our computational limitations, we were not able to encode every single feature as one-hot vectors. Consequently, we simply dropped the features that had too many unique values in their column or had a low correlation with the output:

# Features which we won't use
train_data = train_data.drop(['displayId', 'timestamp', 'docId', 'userId'], axis=1)

# Features which we will use -- doing one hot encoding
features = ['dayOfWeek', 'hourOfDay', 'advertiserId', 'campaignId', 'creativeId', 'publisher', 'widgetId', 'device', 'os', 'browser', 'source']
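To illustrate why the correlation of a raw ID column with the output is hard to interpret, the sketch below uses made-up IDs and labels. Relabeling the same three categories flips the sign of the Pearson correlation, even though the information in the column is unchanged:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical data: ads with ID 3 are always clicked, the others never
ad_ids = [1, 2, 3, 1, 2, 3]
clicked = [0, 0, 1, 0, 0, 1]

# Same categories, but the ad that used to be called 3 is now called 0
relabeled_ids = [0 if i == 3 else i for i in ad_ids]

print(pearson(ad_ids, clicked))         # positive
print(pearson(relabeled_ids, clicked))  # negative
```

A one-hot encoding does not depend on the numeric value of the ID, so it avoids this arbitrariness.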

Now, to generate meaningful features out of these ID columns, we encode their values as one-hot vectors. Note that we use a sparse representation to save memory, since some columns have a large number of unique values:

from scipy.sparse import *

one_hot = pd.get_dummies(train_data, columns=features, sparse=True)
one_hot.shape
one = one_hot.astype('Sparse')
matrix = one.sparse.to_coo().tocsr()  # Converting to SciPy sparse matrix
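To give a sense of how much the sparse representation saves, the following sketch one-hot encodes a made-up ID column directly with NumPy and SciPy. This is not the pandas code path used above, just an illustration of the storage idea: each row has exactly one non-zero entry, so a sparse matrix stores one value per row instead of one value per row and category.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical ID column: 10,000 rows with 1,000 distinct IDs (sizes chosen for illustration)
ids = np.arange(10_000) % 1000

# Map each ID to a column index
unique_ids, column_index = np.unique(ids, return_inverse=True)
n_rows, n_cols = len(ids), len(unique_ids)

# One-hot matrix: a single 1 per row, placed in the column of that row's ID
one_hot = csr_matrix(
    (np.ones(n_rows), (np.arange(n_rows), column_index)),
    shape=(n_rows, n_cols),
)

print("Shape:           ", one_hot.shape)    # (10000, 1000)
print("Stored values:   ", one_hot.nnz)      # 10000
print("Dense cell count:", n_rows * n_cols)  # 10000000
```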

After that, we extract the input and output parts from the dataset:

matrix_X = matrix[:, 1:]  # X
matrix_y = matrix[:, 0]  # Y

Before training our models, we split the dataset into train, validation, and test sets with a 60:20:20 ratio:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(matrix_X, np.squeeze(np.asarray(matrix_y.todense())), test_size=0.2)  # This function would work efficiently.
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.25)  # 0.25 * 0.8 = 0.2
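One caveat of upsampling before the split is that copies of the same clicked row can end up in both the training and the test set. The sketch below shows an alternative ordering on a small made-up data frame (it is not the pipeline used for the results in this article): split first, then upsample only the training part. The column name feature and the use of stratify are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 10 clicked rows out of 100
df = pd.DataFrame({"feature": range(100), "clicked": [1] * 10 + [0] * 90})

# 1) Split first, keeping the click ratio similar in both parts
train_part, test_part = train_test_split(
    df, test_size=0.2, stratify=df["clicked"], random_state=0
)

# 2) Upsample the clicked rows of the training part only
clicked = train_part[train_part["clicked"] == 1]
non_clicked = train_part[train_part["clicked"] == 0]
clicked_upsampled = clicked.sample(n=len(non_clicked), replace=True, random_state=1)
balanced_train = pd.concat([clicked_upsampled, non_clicked]).sample(frac=1, random_state=1)

# The training part is balanced, and no original row appears in both parts
print(balanced_train["clicked"].mean())
print(set(balanced_train.index).isdisjoint(test_part.index))
```

With this ordering the test set keeps the original class ratio, so metrics such as accuracy have to be read with the imbalance in mind.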

Now, we are ready to train different models on this specific dataset and compare the results.

Let’s start with Random Forest:

from sklearn.metrics import *
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(max_depth=4, random_state=0, n_estimators=80)
clf.fit(X_train, y_train)
print("end training")

preds = clf.predict(X_test)
preds_proba = clf.predict_proba(X_test)
print("F1:           ", f1_score(y_test, preds))
fpr, tpr, thresholds = roc_curve(y_test, preds_proba[:, 1], pos_label=None)
print("AUC:          ", auc(fpr, tpr))
print("Cross Entropy:", log_loss(y_test, preds_proba))
print("Accuracy:     ", accuracy_score(y_test, preds))

The output of the above cell was as follows:

end training 
F1: 0.6087770079244672
AUC: 0.6028115864061809
Cross Entropy: 0.6893415826593268
Accuracy: 0.5632055069625873
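For reference when reading the cross entropy values, recall that a model that always outputs a click probability of 0.5 has a cross entropy of ln 2 ≈ 0.693. The sketch below computes this from the definition on made-up labels, so values only slightly below 0.693 indicate a modest improvement over that uninformative baseline:

```python
import math

def cross_entropy(y_true, p_clicked):
    """Mean binary cross entropy (log loss) for predicted click probabilities."""
    losses = [
        -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for y, p in zip(y_true, p_clicked)
    ]
    return sum(losses) / len(losses)

# Hypothetical balanced labels and a model that always predicts 0.5
y_true = [0, 1] * 50
uninformative = [0.5] * len(y_true)

print(cross_entropy(y_true, uninformative))  # about 0.6931, i.e. ln 2
```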

Then, we used an SVM. Because of computational limitations, however, we trained it only on the first 1000 training samples:

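from sklearn.svm import SVC

clf = SVC(C=1)
# X_train is a SciPy sparse matrix and y_train a NumPy array, so we slice
# the first 1000 rows instead of calling the DataFrame-only .head()
clf.fit(X_train[:1000], y_train[:1000])
print("end training")

preds = clf.predict(X_test)
# predict_proba is only available when SVC is created with probability=True,
# so it is not called here
print("F1:           ", f1_score(y_test, preds))
fpr, tpr, thresholds = roc_curve(y_test, preds, pos_label=None)
print("AUC:          ", auc(fpr, tpr))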

The output:

end training 
F1: 0.6057987208137474
AUC: 0.5338544225761602

As you can see, the Random Forest model was slightly better than this one. However, we need to keep in mind that in the SVM case, we only used the first 1000 samples out of 3511489 samples!
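Note that the SVM snippet (like the logistic regression and XGBoost snippets later on) passes hard label predictions to roc_curve, whereas the Random Forest snippet passes predicted probabilities, so the AUC values are not computed on the same kind of input. The sketch below, on made-up labels and scores, uses a small pairwise AUC helper (a hypothetical function written for this illustration, not scikit-learn’s implementation) to show that thresholding the scores first can change the result:

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (clicked, unclicked) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and predicted click probabilities
y_true = [0, 0, 1, 1, 0, 1]
proba = [0.10, 0.30, 0.45, 0.80, 0.20, 0.60]

# Hard labels obtained by thresholding the probabilities at 0.5
labels = [1 if p >= 0.5 else 0 for p in proba]

print(pairwise_auc(y_true, proba))   # 1.0: every clicked sample is ranked above every unclicked one
print(pairwise_auc(y_true, labels))  # 7.5 of 9 pairs: thresholding introduces ties
```

In this made-up example the ranking is perfect, but the hard labels hide that, because the clicked sample scored at 0.45 becomes indistinguishable from the unclicked ones.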

Next, let’s train a logistic regression model:

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(random_state=0, C=1, max_iter=1200)
clf.fit(X_train, y_train)
print("end training")

preds = clf.predict(X_test)
preds_proba = clf.predict_proba(X_test)
print("F1:           ", f1_score(y_test, preds))
fpr, tpr, thresholds = roc_curve(y_test, preds, pos_label=None)
print("AUC:          ", auc(fpr, tpr))
print("Cross Entropy:", log_loss(y_test, preds_proba))

And the output was:

end training 
F1: 0.6244800254253621
AUC: 0.6204446745676685
Cross Entropy: 0.6487833714971616

It is so far the best model in our experiments.

The next model that we used was the XGBoost classifier:

import xgboost as xgb

clf = xgb.XGBClassifier(max_depth=4, n_estimators=80)
clf.fit(X_train, y_train)
print("end training")

preds = clf.predict(X_test)
preds_proba = clf.predict_proba(X_test)
print("F1:           ", f1_score(y_test, preds))
fpr, tpr, thresholds = roc_curve(y_test, preds, pos_label=None)
print("AUC:          ", auc(fpr, tpr))
print("Cross Entropy:", log_loss(y_test, preds_proba))

And again, you can see the output here:

end training 
F1: 0.6058749030148753
AUC: 0.6267827426157575
Cross Entropy: 0.6713597341509303

Finally, we tried the Factorization Machine¹ approach.
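As a brief reminder of what the model computes, a second-order factorization machine [1] scores an input x as w0 + Σ_i w_i x_i + Σ_{i<j} ⟨v_i, v_j⟩ x_i x_j, where each feature i has a latent vector v_i. The NumPy sketch below evaluates this formula on random made-up parameters in two ways: with the explicit sum over pairs, and with the equivalent linear-time reformulation from the paper. It only illustrates the formula and is not xlearn’s implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_factors = 6, 3

# Hypothetical parameters and input (random values, for illustration only)
w0 = 0.1                                       # global bias
w = rng.normal(size=n_features)                # linear weights
V = rng.normal(size=(n_features, n_factors))   # one latent vector per feature
x = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 1.0])   # e.g. a concatenation of one-hot vectors

def fm_score_naive(x):
    """Direct evaluation: sum over all feature pairs i < j."""
    pairwise = sum(
        (V[i] @ V[j]) * x[i] * x[j]
        for i in range(n_features)
        for j in range(i + 1, n_features)
    )
    return w0 + w @ x + pairwise

def fm_score_fast(x):
    """Equivalent form: 0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2]."""
    linear_terms = V.T @ x                 # shape (n_factors,)
    squared_terms = (V ** 2).T @ (x ** 2)  # shape (n_factors,)
    return w0 + w @ x + 0.5 * np.sum(linear_terms ** 2 - squared_terms)

print(fm_score_naive(x), fm_score_fast(x))  # the two values agree up to rounding
```

The pairwise term is what lets the model capture interactions between one-hot features, such as a particular advertiser on a particular device, without learning a separate weight for every pair.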

We start by converting the dataset into the right format for the Factorization Machine model:

from sklearn.datasets import dump_svmlight_file

dump_svmlight_file(X_train, y_train, 'data/FM_train.libsvm')
dump_svmlight_file(X_test, y_test, 'data/FM_test.libsvm')
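For readers unfamiliar with the format, each line of a libsvm (svmlight) file holds a label followed by index:value pairs for the non-zero features only, which suits one-hot data well. The helper below is a hand-written illustration of that layout, not the code that dump_svmlight_file runs; the name to_libsvm_line and the example row are made up. It uses zero-based feature indices, which is the default of dump_svmlight_file.

```python
def to_libsvm_line(label, row):
    """Format one sample as '<label> <index>:<value> ...', skipping zero features."""
    pairs = [f"{index}:{value:g}" for index, value in enumerate(row) if value != 0]
    return " ".join([str(label)] + pairs)

# Hypothetical one-hot encoded sample with 8 features, labeled as clicked
row = [0, 1, 0, 0, 0, 1, 0, 1]
print(to_libsvm_line(1, row))  # 1 1:1 5:1 7:1
```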

Then, we train the model with the following code:

# !pip install xlearn
import xlearn as xl

fm_model = xl.create_fm()
fm_model.setTrain("data/FM_train.libsvm")
param = {'task': 'binary', 'lr': 0.2, 'lambda': 0.002, 'metric': 'acc'}
fm_model.fit(param, 'data/model.out')

fm_model.setTest('data/FM_test.libsvm')  # Test data
fm_model.setSigmoid()  # Convert output to 0-1

# Start to predict
# The output result will be stored in output.txt
fm_model.predict("data/model.out", "data/output.txt")

The resulting metrics:

AUC:           0.6463157655406329
Cross Entropy: 0.675787874373196

As you can see, this final model achieved the highest AUC of all the models that we tried, although its cross entropy was higher than that of logistic regression.
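To compare the models side by side, the snippet below simply collects the test metrics reported above (rounded to four decimals) and picks the best model by each metric. No new results are computed; the SVM is left out of the cross entropy comparison because none was reported for it.

```python
# Test metrics as reported in this article, rounded to four decimals
results = {
    "Random Forest":         {"auc": 0.6028, "cross_entropy": 0.6893},
    "SVM (1000 samples)":    {"auc": 0.5339, "cross_entropy": None},
    "Logistic Regression":   {"auc": 0.6204, "cross_entropy": 0.6488},
    "XGBoost":               {"auc": 0.6268, "cross_entropy": 0.6714},
    "Factorization Machine": {"auc": 0.6463, "cross_entropy": 0.6758},
}

# Higher AUC is better
best_auc = max(results, key=lambda name: results[name]["auc"])

# Lower cross entropy is better; skip models without a reported value
with_ce = {name: m for name, m in results.items() if m["cross_entropy"] is not None}
best_ce = min(with_ce, key=lambda name: with_ce[name]["cross_entropy"])

print("Highest AUC:         ", best_auc)  # Factorization Machine
print("Lowest cross entropy:", best_ce)   # Logistic Regression
```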

This whole project was done by Arad Mohammadi, Aryo Lotfi, and me (Ashkan Mirzaei).

[1] Factorization Machines
https://ieeexplore.ieee.org/document/5694074