## 1. Introduction

In this tutorial, we will try to predict the click-through rate of ads with the Decision Tree algorithm we learnt in the last post. Before continuing, I recommend reading that post first for a theoretical understanding of Decision Trees.

**What does Click-Through Rate Prediction mean?**

Let’s assume that you are designing an algorithm for a search engine, and your task is to maximize revenue by displaying the best ads — ads that are both related to search results and are most likely to be clicked — at the top of the search results. How would you do it?

You might say that revenue can be maximized by first filtering on relevancy and then displaying the ad from the highest bidder at the top. There is a problem with this solution. Most advertisers pay for clicks and not for ad views (they want to be sure that ads lead to their website), so just displaying the ad from the highest bidder wouldn’t maximize revenue. Under this cost-per-click model, advertisers are only charged if users actually click on their ads. So how should we approach this problem?

Let’s say you have two ads:

- A $1.00 ad with a 10% probability of being clicked
- A $2.00 ad with a 1% probability of being clicked

Which one would provide the most revenue? The expected revenue (*cost × click-probability*) is $0.10 for the first ad and $0.02 for the second. The first ad comes out ahead even though its dollar-cost is lower, because there is a much higher chance that the user will click it. In other words, we can look at the problem of maximizing revenue in terms of accurately predicting the probability that a given ad will be clicked, i.e. the “click-through rate” (CTR).

In this tutorial, I’ll walk you through an example of predicting CTR. We will use the dataset from a Kaggle competition, *Click-Through Rate Prediction*, sponsored by Avazu.

## 2. Exploratory Data Analysis

### The dataset

Our dataset comprises the following features:

- `id`: ad identifier
- `click`: 0 for non-click, 1 for click
- `hour`: in the format of YYMMDDHH
- `C1`: some anonymized categorical variable e.g. `1002`
- `banner_pos`: where a banner is located, `1` and `0`
- `site_id`: site identifier
- `site_domain`: hashed site domain
- `site_category`: hashed site category e.g. `28905ebd`
- `app_id`: mobile app identifier
- `app_domain`
- `app_category`
- `device_id`: mobile device identifier
- `device_ip`: IP address
- `device_model`: hashed model e.g. iPhone 6, Samsung
- `device_type`: hashed device type e.g. tablet, smartphone
- `device_conn_type`: hashed type of connection e.g. Wi-Fi, 4G
- `C14-C21`: some more anonymized categorical variables

*Anonymized and hashed values*: these are categorical features and their values correspond to some real and meaningful values. They are presented this way for privacy concerns.

### Data Exploration

We will be using Python 3 and the following packages:

- `numpy`: allows efficient numerical computations
- `pandas`: provides data structures for data analysis
- `scikit-learn`: for machine learning algorithms
- `matplotlib`: for plotting data
- `seaborn`: for extra plot types and for more elegant and readable plots

So let’s start by importing libraries:
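The original import cell is not shown here, so the following is only a plausible sketch covering the packages listed above and the scikit-learn pieces used later in the tutorial. The aliases (`np`, `pd`, `plt`, `sns`) are the conventional ones, assumed rather than taken from the post.

```python
# Conventional aliases; assumed, since the original import cell is not available
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# scikit-learn pieces used later in the tutorial
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
```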

Our source files are in compressed format but no worries! We can use some python magic to unzip files into .csv format as here:
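One way to do this with only the standard library is sketched below. The helper name `gunzip` and the output file name `train.csv` are assumptions for illustration; the call is guarded so the snippet does nothing if `train.gz` is not in the working directory.

```python
import gzip
import os
import shutil


def gunzip(src_path, dst_path):
    """Decompress a .gz file to dst_path, streaming it in chunks."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)


# Assumed file locations; adjust to wherever the Kaggle download lives
if os.path.exists("train.gz"):
    gunzip("train.gz", "train.csv")
```

Streaming with `shutil.copyfileobj` avoids loading the whole (large) file into memory at once.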

Now that we have unzipped the source file, let’s try to understand what story the data is telling us. We are going to take only the first 100,000 samples from the *train.csv* file (unzipped from `train.gz`). Working with a smaller subset makes the calculations faster, which is sufficient for the purpose of this tutorial.
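A minimal sketch of loading that subset, assuming the unzipped file is called `train.csv`. The helper `load_sample` is a hypothetical name introduced here; the `nrows` argument of `pandas.read_csv` is what limits the read.

```python
import os

import pandas as pd

N_ROWS = 100_000  # size of the subset used in this tutorial


def load_sample(csv_path, n_rows=N_ROWS):
    # nrows makes pandas stop after reading n_rows data rows
    return pd.read_csv(csv_path, nrows=n_rows)


# Guarded so the snippet is harmless when the file is not present
if os.path.exists("train.csv"):
    df = load_sample("train.csv")
    print(df.shape)
```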

We can obtain the column headers of our data by using the *.columns* attribute:
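As a quick illustration, here is `.columns` on a tiny DataFrame. The rows are made up and only a handful of the real column names are included; with the real data, `df` would be the 100,000-row sample.

```python
import pandas as pd

# Made-up rows standing in for the real data; only a few columns are shown
df = pd.DataFrame({
    "id": [1, 2, 3],
    "click": [0, 1, 0],
    "hour": [14102100, 14102100, 14102101],
    "banner_pos": [0, 1, 0],
    "device_type": [1, 1, 0],
})

print(df.columns.tolist())  # column headers as a plain list

# Everything except the target is a potential feature
target = "click"
features = [col for col in df.columns if col != target]
print(features)
```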

The *click* column, in the snippet above, is our target variable, and the other columns are our potential features.

Let’s take a quick look at the first rows to see what our data looks like:

From the data description provided to us, we know that *click=0* means the ad was not clicked, and *click=1* means the ad was clicked.

We can use the pandas *df.describe()* function to see the statistical distribution of our variables (count, mean, min, max, quartiles). These can also be computed separately: *df.count(), df.min(), df.max(), df.median(), df.quantile(q)*. In particular, we would like to see the summary statistics of our target variable, *click*.
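The sketch below shows the mechanics on a made-up 0/1 column (the values are illustrative, not taken from the dataset). For a binary column, the mean reported by `describe()` is simply the share of 1s.

```python
import pandas as pd

# Made-up 0/1 labels, only to show the mechanics
clicks = pd.Series([0, 0, 1, 0, 0, 0, 1, 0], name="click")

summary = clicks.describe()  # count, mean, std, min, quartiles, max
click_ratio = clicks.mean()  # for a 0/1 column the mean is the share of 1s

print(summary)
print(click_ratio)
```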

From the mean value, we can see that only 17.5% of the ads were clicked. The value of *click* can only be 0 or 1, so the mean value is also the click ratio.

Now let’s explore our data a bit more. *device_type* seems to be an interesting variable, so let’s try to understand it better. We can separate out our feature of interest, *device_type*, and our target variable, *click*, as below:
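One possible way to do this is with `value_counts()` and a `groupby`. The rows below are made up for illustration, so the counts and ratios will not match the figures quoted for the real data.

```python
import pandas as pd

# Made-up rows; the real sample has four device types
df = pd.DataFrame({
    "device_type": [0, 0, 1, 1, 1, 1, 4, 4],
    "click":       [1, 0, 0, 1, 0, 0, 0, 0],
})

# Number of impressions per device type
counts = df["device_type"].value_counts()

# Click ratio per device type: the mean of a 0/1 column within each group
click_ratio = df.groupby("device_type")["click"].mean()

print(counts)
print(click_ratio)
```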

We can see that there are four types of devices, with type 1 being the most prevalent.

If we compare the click ratios per *device_type*, we can see that there is a marked difference in the average click-through rate depending on the device type: the highest click ratio is for *device_type* 0 (22.7%), and the lowest is for *device_type* 4 (7.3%). This suggests that *device_type* is a good feature for predicting our target variable.

As an **exercise**, try to analyze some other features, and see if you can find some interesting insights or some other good features.

#### Correlations between attributes

Correlation is a statistical technique that can tell us whether pairs of variables are related. For example, your calorie intake and weight are related; people with a high calorie intake tend to be heavier. In an ideal situation, we would have an independent set of features, but real data is unfortunately not ideal. So it is useful to know whether some pairs of attributes are correlated and by how much.

*Pandas* allows us to easily get correlation coefficients. The Pearson correlation coefficient is the most common one, and it measures the strength of linear relationships in the data. In short, we get a coefficient in the range [-1, 1] for every pair of attributes, where 1 indicates a perfect positive correlation (i.e. both variables follow the same trend), -1 a perfect negative correlation (i.e. when one variable increases, the other decreases) and 0 means no linear relationship between the variables.

We can obtain correlations among variables with a single line of code (*Python awesomeness!*). We can also plot the value of these correlations as a heatmap — a great way to get a quick overview of the relationships between attributes.
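A minimal sketch of both steps, assuming seaborn's `heatmap` is used for the plot. The data is random stand-in data, so the resulting correlations carry no meaning; the `coolwarm` colormap is an assumption chosen so that positive values appear red and negative values blue.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Random stand-in data; with the real DataFrame, use it in place of `df`
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "click": rng.integers(0, 2, size=200),
    "banner_pos": rng.integers(0, 2, size=200),
    "device_type": rng.integers(0, 5, size=200),
})

# Pearson correlations between all numeric columns
corr = df.select_dtypes("number").corr(method="pearson")

# Heatmap with a fixed [-1, 1] colour scale
fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr, cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
fig.tight_layout()
```

In a notebook the figure is displayed automatically; in a script, `fig.savefig(...)` can be used to write it to disk.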

The heatmap allows us to easily see that:

- the highest positive correlation (red squares) is between *C1*, *banner_pos* and *device_type*. Correlation between the position of the banner and type of device seems to make logical sense as well.
- we have negatively correlated variables (blue squares): *C18* and *C21*.
- the variables *C16* and *C21* are highly correlated to our target variable, *click*.
- *id* is not correlated to any of the features, as one would expect.

Now we can drop some non-informative variables:
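Which columns were dropped is not shown here. The sketch below assumes `id` (a unique identifier, uncorrelated with everything) and `hour` (the raw timestamp mentioned in the end notes); treat that choice as an assumption. The rows are made up.

```python
import pandas as pd

# Made-up rows with a few of the real column names
df = pd.DataFrame({
    "id": [10, 11, 12],
    "hour": [14102100, 14102100, 14102101],
    "click": [0, 1, 0],
    "device_type": [1, 1, 0],
})

# Assumed choice: drop the unique identifier and the raw timestamp
df = df.drop(columns=["id", "hour"])
print(df.columns.tolist())
```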

Let’s check if we have any missing values:
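A common way to count missing values per column is `isnull().sum()`. The toy DataFrame below deliberately contains two gaps so that the output is not all zeros; the values are made up.

```python
import numpy as np
import pandas as pd

# Made-up rows with two deliberate gaps, to show what a non-zero count looks like
df = pd.DataFrame({
    "click": [0, 1, 0],
    "site_category": ["28905ebd", None, "28905ebd"],
    "C14": [15706.0, 15704.0, np.nan],
})

# isnull() marks missing cells as True; sum() counts them per column
missing_per_column = df.isnull().sum()
print(missing_per_column)
```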

All the *0*s above (counts of missing values per variable) indicate that we do not need to worry about dealing with missing data.

#### Categorical variables

If we look at our data types, we can see that not all of our variables are numerical: we have some categorical variables (ones with type = *object*).

**Categorical variables** are variables that fall into a specific category; they have descriptive and non-quantitative values. *‘device_model’* and *‘site_category’* are such examples. It is possible to represent each possible value for a category as a separate feature by a technique called **One-hot encoding**.

For example, if we assume that the *site_category* feature has three possible values, *news*, *education*, and *sports*, one-hot encoding will convert it into three binary features: *is_news*, *is_education*, and *is_sports*.

Before we can proceed, we need to transform our categorical variables (stored as text values) into vectors. We need such transformations because many ML algorithms, including Decision Trees in *Scikit-Learn*, do not support text values directly and require numerical inputs. *Pandas* can help us to easily perform these transformations via the *get_dummies()* function as below:
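Here is a small sketch of `get_dummies()` on the hypothetical three-value *site_category* example from above. Note that pandas names the new columns `<column>_<value>` (e.g. `site_category_news`) rather than `is_news`.

```python
import pandas as pd

# Made-up rows using the hypothetical news/education/sports categories
df = pd.DataFrame({
    "click": [0, 1, 0],
    "site_category": ["news", "education", "sports"],
})

print(df.dtypes)  # text columns typically show up with dtype object

# One binary column per category value; numeric columns pass through unchanged
encoded = pd.get_dummies(df, columns=["site_category"])
print(encoded.columns.tolist())
```

Applied to every categorical column of the real data, this is what makes the number of columns grow so quickly.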

As you can see, the number of columns has now increased from a mere 24 to 4060!

For more detailed data exploration and data visualization, please refer to the previous hands-on tutorial. In this tutorial, I have only gone through the essentials to solve the problem at hand. I have skipped some of the steps that were thoroughly explained in the past post (e.g. an in-depth analysis of correlations among attributes and a detailed look at the distribution of variables).

## 3. Predictive Modelling

So let’s move to predictive modelling. We want to create a model that generalizes well to new or future data. To this aim, we will split our data into a *training set* and a *testing set*.

We are going to use an 80/20 split, where 80% of the data is used for training and 20% for testing:
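A minimal sketch of such a split with scikit-learn's `train_test_split`. The data comes from `make_classification` as a synthetic stand-in for the encoded features and click labels, with roughly 17% positives to mimic the class imbalance; the `random_state` values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded features (X) and the click labels (Y)
X, Y = make_classification(
    n_samples=1000, n_features=10, weights=[0.83], random_state=0
)

# test_size=0.2 keeps 20% of the rows aside for testing
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```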

Now let’s create a basic Decision Tree model, fitted on `X_train` and `Y_train`, using Scikit-learn. In the first step, we are going to use the default values of the parameters for the *DecisionTreeClassifier*. For example, the default value for the *max_depth* parameter is *None*. Please refer to the theoretical post on Decision Trees for understanding the meaning of the main parameters.
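The defaults can be inspected directly with `get_params()`, without fitting anything. This is just a quick check, not part of the modelling pipeline.

```python
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()  # no arguments: every parameter at its default
params = clf.get_params()

# max_depth=None lets the tree keep splitting until leaves are pure
# or hold fewer than min_samples_split samples
print(params["max_depth"])
print(params["min_samples_split"])
```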

Now we will predict the likelihood of clicks for the test set (unseen cases) by passing the test data to the “fitted model”, and we will store the output probabilities in *y_predict*. Then we are going to evaluate the model by comparing the predictions with the actual values (*y_predict* vs *Y_test*) using appropriate evaluation metrics:
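The sketch below fits a default tree and computes accuracy and a confusion matrix. Accuracy needs hard 0/1 labels, so `predict` is used here; whether the original stored labels or probabilities in `y_predict` is not shown, so treat that as an assumption. The data is again a synthetic stand-in, so the scores will differ from the ones quoted in the text.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with roughly 17% positives
X, Y = make_classification(
    n_samples=1000, n_features=10, weights=[0.83], random_state=0
)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42
)

# Default parameters apart from a fixed seed for reproducibility
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, Y_train)

y_predict = tree.predict(X_test)              # hard 0/1 labels
accuracy = accuracy_score(Y_test, y_predict)
cm = confusion_matrix(Y_test, y_predict)      # rows: actual, columns: predicted

print(accuracy)
print(cm)
```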

We can see that we have achieved an accuracy score of 0.81.

Let’s look at another evaluation metric, the **AUC** value. A ROC curve plots the **True Positive Rate** (the share of actual clicks that the model predicts as clicked) against the **False Positive Rate** (the share of actual non-clicks that the model predicts as clicked) as the decision threshold varies. One way to compare classifiers is to measure the **area under the ROC curve**, the AUC value. A perfect classifier will have an *AUC* value equal to 1, whereas a purely random classifier will have an AUC equal to 0.5. Scikit-Learn provides a function, *roc_auc_score*, to compute the AUC value:
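A sketch of the AUC computation on synthetic stand-in data. `roc_auc_score` works best with scores rather than hard labels, so the probability of the positive class from `predict_proba` is passed in; the variable name `click_proba` is introduced here for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with roughly 17% positives
X, Y = make_classification(
    n_samples=1000, n_features=10, weights=[0.83], random_state=0
)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42
)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, Y_train)

# Column 1 of predict_proba is the probability of the positive class (click=1)
click_proba = tree.predict_proba(X_test)[:, 1]
auc = roc_auc_score(Y_test, click_proba)
print(auc)
```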

The AUC we have achieved is **0.65**. Let’s see if we can improve our performance by tweaking the values of the parameters for the Decision Tree i.e. by not using default values.

### Parameter tuning

**GridSearchCV** is a technique that performs an exhaustive, cross-validated search over a specified grid of parameter values to find the best combination. Our only job is to specify the parameters we want to tune and the values we want to explore. Then we can use the optimal values (output from *GridSearchCV*) as input parameters for our classifier; this in turn allows us to obtain the “best” model.

Let’s try to explore the *max_depth* parameter:
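A minimal sketch of such a search. The candidate depths, the `roc_auc` scoring and the 3-fold cross-validation are all assumptions made for illustration, and the data is a synthetic stand-in, so the best depth found here need not match the one reported for the real data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with roughly 17% positives
X, Y = make_classification(
    n_samples=1000, n_features=10, weights=[0.83], random_state=0
)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42
)

# Assumed candidate values for max_depth
param_grid = {"max_depth": [3, 5, 10, None]}

grid_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",  # assumed metric: pick the depth with the best AUC
    cv=3,               # assumed 3-fold cross-validation on the training set
)
grid_search.fit(X_train, Y_train)

print(grid_search.best_params_)
best_tree = grid_search.best_estimator_  # refitted on the full training set
```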

The best value for *max_depth* according to GridSearch is 5. Now let’s get the evaluation metrics for our “optimal” model (given by *grid_search.best_estimator_*):

We have improved our performance by tweaking just one parameter: the AUC value has increased from **0.65 to 0.68**! It is still not a great value, but click-through prediction is a difficult problem.

Note that although accuracy has increased, the True Positive Rate has decreased, so optimizing for accuracy is not always the correct approach. Accuracy is an intuitive and simple measure, but when we have highly skewed data, it is not a good measure. A model that always assigns the value of the most prevalent class to new instances (e.g. the model takes the easy path and always predicts *not-clicked*) would have a higher accuracy than a model that tries to assign some values from the other class as well (e.g. it predicts some output labels as *clicked*).

Also note that Decision Trees tend to **overfit**: it is likely that the optimal values for the splits at each step only work for the training instances and not on new data. In practice, one could compare different algorithms and then pick the best model.

Anyway, now that we have the click-probabilities, finding the ad that maximizes revenue is easy — **multiply these probabilities by the cost-per-click** **and sort the results in descending order**.
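As a sketch of that last step, here is the ranking on three made-up ads. The bids and probabilities are illustrative; in practice `p_click` would come from the model's `predict_proba` output.

```python
import pandas as pd

# Made-up candidate ads: cost-per-click bids and predicted click probabilities
ads = pd.DataFrame({
    "ad": ["A", "B", "C"],
    "cpc": [1.00, 2.00, 0.50],
    "p_click": [0.10, 0.01, 0.30],
})

# Expected revenue per impression = cost-per-click * click probability
ads["expected_revenue"] = ads["cpc"] * ads["p_click"]

# Highest expected revenue first
ranked = ads.sort_values("expected_revenue", ascending=False)
print(ranked)
```

Note that the cheapest ad ranks first here, because its high click probability more than makes up for its low bid.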

### Visualizing the Decision Tree

As a last step, we are going to use an open source visualization library called *Graphviz* for visualizing our Decision Tree. Graphviz is widely used in networking applications to visualize connections between switch hubs and networks. In ML, it can be used to visualize Decision Trees and neural networks.
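One way to get a tree into Graphviz is scikit-learn's `export_graphviz`, which produces a description in Graphviz's DOT language. The sketch below uses a shallow tree on synthetic stand-in data; the feature names `f0`..`f3` are placeholders, and the class names follow the *not-clicked* / *clicked* wording used above.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Small synthetic stand-in problem and a shallow tree that is easy to read
X, Y = make_classification(n_samples=200, n_features=4, random_state=0)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, Y)

dot_source = export_graphviz(
    tree,
    out_file=None,  # return the DOT text instead of writing a file
    feature_names=[f"f{i}" for i in range(X.shape[1])],  # placeholder names
    class_names=["not-clicked", "clicked"],
    filled=True,    # colour nodes by majority class
)
print(dot_source[:200])
```

If the DOT text is saved to a file such as `tree.dot`, the Graphviz command line tool can render it, for example with `dot -Tpng tree.dot -o tree.png`.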

### End Notes

This is not really the end, it’s your turn now! For example, we have not done any feature engineering in this post. We just removed some variables like timestamp — one could apply some transformations and see if there is some interesting correlation between e.g. *TimeOfDay* and *clicks*. So try to do some feature engineering by creating new features. You could also try some other algorithms, and use the evaluation metrics to compare their relative performances. Have fun!

**Thanks for reading** and remember to click **follow** for receiving the latest posts! 🙂