In this post, we will deconstruct the basics of working with a dataset to solve ML problems. The first part gives a quick overview of the general workflow for ML problems. The second part is a practical end-to-end example in Python.
The process of working with datasets to solve ML problems can be categorized into two main steps:
- Exploratory Data Analysis
- Predictive Modelling
Exploratory Data Analysis:
- Understand the problem, both formally and informally
- Figure out the right questions to ask and how to frame them
- Understand the assumptions
- Summarize the data: find type of variables or map out the underlying data structure, find correlation among variables, identify the most important variables, check for missing values and mistakes in the data
- Visualize the data to take a broad look at patterns, trends, anomalies and outliers, i.e. use data summarization and data visualization to understand the story the data is telling you
- Split the dataset into training, test and validation sets
- Choose the most appropriate algorithm. If you are a beginner working in Python, the scikit-learn module provides implementations of the common ML algorithms, and the project publishes a cheat sheet (figure below) that you can use as a starting point.
- Start with a very simplistic model with minimal and most prominent set of features
- Plot learning curves and see how the error varies with changes in parameters and features
- Identify your quantitative evaluation measure
- Optimize the learning algorithm by including additional features, creating new features, tuning parameters, etc.
- Present results in a most appropriate form, depending on your final goal
Now let’s get our hands dirty with a practical example. I will use the House Prices dataset from Kaggle. The dataset comprises 1460 observations and 79 variables describing houses in Ames, Iowa. Here’s a description of a few variables:
- SalePrice – the property’s sale price in dollars. This is the target variable that we are trying to predict.
- MSSubClass: The building class
- LotFrontage: Linear feet of street connected to property
- LotArea: Lot size in square feet
- Street: Type of road access
- Alley: Type of alley access
- LotShape: General shape of property
- LandContour: Flatness of the property
- LotConfig: Lot configuration
- LandSlope: Slope of property
- Neighborhood: Physical locations within Ames city limits
- Condition1: Proximity to main road or railroad
- HouseStyle: Style of dwelling
- OverallQual: Overall material and finish quality
- OverallCond: Overall condition rating
- YearBuilt: Original construction date
We will be using Python 3. We will also utilize some useful packages:
- numpy – efficient numerical computations
- pandas – data structures for data analysis
- scikit-learn – machine learning algorithms
- matplotlib – plotting data
- seaborn – extra plot types and more elegant, readable plots
So let’s start by importing the libraries and reading in the data:
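A minimal setup might look like the sketch below. The file path is an assumption; adjust it to wherever you saved the Kaggle training file (the snippet falls back to a tiny stand-in frame so it runs without the download):

```python
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Assumed location of the Kaggle "House Prices" training file.
data_path = Path('train.csv')
if data_path.exists():
    df = pd.read_csv(data_path)
else:
    # Tiny stand-in frame so the snippet runs without the download.
    df = pd.DataFrame({'SalePrice': [208500, 181500, 223500],
                       'GrLivArea': [1710, 1262, 1786]})

print(df.shape)  # (number of observations, number of variables)
```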
Now let’s try to understand what story the data is telling us.
We can use pandas df.describe() function to see the statistical distribution of our variables (count, mean, min, max, quartiles). These can also be computed separately:
df.count(), df.min(), df.max(), df.median(), df.quantile(q). In particular, we would like to see the summary statistics of our target variable, SalePrice.
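For instance (shown here on a small stand-in frame rather than the full Kaggle data):

```python
import pandas as pd

# Small stand-in for the Kaggle frame loaded earlier.
df = pd.DataFrame({'SalePrice': [208500, 181500, 223500, 140000, 250000]})

# count, mean, std, min, quartiles and max in one call.
print(df['SalePrice'].describe())

# The same pieces can be computed separately.
print(df['SalePrice'].median())
print(df['SalePrice'].quantile(0.25))
```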
We can also visualize SalePrice as an elegant histogram using seaborn.
We can see that the price distribution deviates from the normal distribution: there is a marked positive skew (a long right tail, with the mean to the right of the peak) and noticeable peakedness (kurtosis).
Correlations between attributes
Correlation is a statistical technique that can tell us whether and how strongly pairs of variables are related. For example, your calorie intake and weight are related; people with a high calorie intake tend to be heavier. In an ideal situation, we would have an independent set of features, but real data is unfortunately not ideal. So it is useful to know whether some pairs of attributes are correlated and by how much.
Pandas allows us to easily compute the standard Pearson correlation coefficient, the Spearman rank correlation and the Kendall Tau correlation coefficient. The Pearson correlation coefficient is the most common one; it tests for linear relationships between variables. In short, it returns, for each pair of attributes, a coefficient in the range [-1, 1], where 1 indicates perfect positive correlation, -1 perfect negative correlation and 0 no linear relationship at all.
So now we will select some strong correlations between attribute pairs, with a bit of Python magic!
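One way to do this is to scan the correlation matrix for pairs above a threshold. The helper below is a sketch (the threshold and the small demo frame are illustrative, not from the original post):

```python
import pandas as pd

def strong_pairs(frame, threshold=0.6):
    """Return (col_a, col_b, r) for pairs with |Pearson r| above threshold."""
    corr = frame.corr()
    cols = corr.columns
    out = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                out.append((cols[i], cols[j], round(float(r), 3)))
    # Strongest correlations first.
    return sorted(out, key=lambda t: -abs(t[2]))

# Small stand-in frame: GarageCars and GarageArea move together.
demo = pd.DataFrame({'GarageCars': [1, 2, 2, 3, 3],
                     'GarageArea': [280, 480, 520, 720, 800],
                     'YrSold':     [2008, 2006, 2010, 2007, 2009]})
print(strong_pairs(demo))
```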
Not much of a surprise: we can see that the highest correlation is between
- GarageCars: Size of garage in car capacity
- GarageArea: Size of garage in square feet
Having correlated features is not ideal, and it can be a good idea to engineer features. There are methods for deriving features that are as uncorrelated as possible (PCA, autoencoders and other dimensionality-reduction techniques). Strong correlations can be an indicator of multicollinearity. Multicollinearity generally occurs when there are highly correlated predictor variables; one predictor variable can be used to predict another – redundant information.
Actually, from the description of the variables, we can conclude that they give almost the same information and this is really a case of multicollinearity – garage size in terms of car capacity can be derived from size in square feet.
Visualization – Because a picture is worth a thousand words!
In order to better understand the dataset we can try to make things visual. To this aim, Python comes in very handy. In particular, we use the matplotlib and seaborn packages. (In my opinion, seaborn is just awesome!)
We can plot the correlations between pairs of attributes as a heatmap – a great way to get a quick overview of the relationships between attributes.
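A sketch with seaborn's heatmap, again on synthetic stand-in columns (on the real frame you would pass df.corr()):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(1)
area = rng.uniform(500, 3000, 200)
demo = pd.DataFrame({
    'GrLivArea': area,
    # Price driven mostly by area, plus noise -> high positive correlation.
    'SalePrice': area * 100 + rng.normal(0, 20000, 200),
    'YrSold': rng.integers(2006, 2011, 200),
})

corr = demo.corr()
plt.figure(figsize=(6, 5))
sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.tight_layout()
plt.savefig('corr_heatmap.png')
```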
From the heatmap we can easily spot the variables that are highly correlated (red squares) with our target variable, SalePrice. It is a quick and easy way of deciding which variables to take into account in our predictive model.
Let’s take a quick look at the correlation between the various input attributes and our target variable, with the results sorted in descending order, i.e. roughly in order of predictive power.
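On the real frame this is just df.corr()['SalePrice'] sorted; here is a sketch on synthetic stand-in columns so the ranking is visible:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 300
quality = rng.integers(1, 11, n).astype(float)
area = rng.uniform(500, 3000, n)
demo = pd.DataFrame({
    'OverallQual': quality,
    'GrLivArea': area,
    'MiscVal': rng.normal(0, 1, n),  # unrelated noise column
    'SalePrice': 15000 * quality + 50 * area + rng.normal(0, 20000, n),
})

# Correlation of every attribute with the target, strongest first.
ranking = demo.corr()['SalePrice'].drop('SalePrice')
print(ranking.sort_values(ascending=False))
```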
So it looks like our most important features are ‘OverallQual’, ‘GrLivArea’, the garage variables (‘GarageCars’, ‘GarageArea’) and ‘TotalBsmtSF’ – all strongly correlated with ‘SalePrice’.
Now let’s look at these relationships in more detail via some plots.
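A sketch of the two plots discussed next – a scatter plot of GrLivArea against SalePrice and a box plot of SalePrice per OverallQual level – on synthetic stand-in data:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(3)
n = 200
demo = pd.DataFrame({
    'GrLivArea': rng.uniform(500, 3000, n),
    'OverallQual': rng.integers(1, 11, n),
})
demo['SalePrice'] = (50 * demo['GrLivArea'] + 15000 * demo['OverallQual']
                     + rng.normal(0, 15000, n))

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
# Scatter: living area vs price.
axes[0].scatter(demo['GrLivArea'], demo['SalePrice'], s=10)
axes[0].set_xlabel('GrLivArea')
axes[0].set_ylabel('SalePrice')
# Box plot: price distribution per quality level.
sns.boxplot(x='OverallQual', y='SalePrice', data=demo, ax=axes[1])
fig.savefig('relationships.png')
```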
We can conclude that ‘GrLivArea’ seems to be linearly related with ‘SalePrice’. The relationship is positive – as one variable increases, the other also increases. ‘OverallQual’ and ‘YearBuilt’ also seem to be related with ‘SalePrice’. The relationship seems stronger in the case of ‘OverallQual’, where the box plot shows how sale prices increase with overall quality.
Just for fun: let’s see how, and if, the housing style in the area has changed over the years.
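One way to sketch this is to count houses of each style per decade built (the tiny demo frame below is illustrative, not the Kaggle data):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend
import pandas as pd

# Small stand-in: a handful of houses with their style and build year.
demo = pd.DataFrame({
    'YearBuilt':  [1920, 1925, 1950, 1972, 1978, 1999, 2004, 2006, 2008],
    'HouseStyle': ['2Story', '2Story', '1Story', 'SLvl', 'SFoyer',
                   '2Story', '1Story', 'SFoyer', '2Story'],
})

# Bucket build years into decades and count styles per decade.
demo['Decade'] = (demo['YearBuilt'] // 10) * 10
counts = pd.crosstab(demo['Decade'], demo['HouseStyle'])
print(counts)

# Stacked bars: one bar per decade, segmented by style.
counts.plot(kind='bar', stacked=True).figure.savefig('styles_by_decade.png')
```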
We can see that ‘SFoyer’ and ‘SLvl’ are more recent styles, while ‘2Story’ houses have been around throughout the century.
Pair-wise scatter matrix
With N variables we have N * (N - 1) / 2 unique pairs, and their joint distributions can be used to look for relationships between them. The whole point here is to look for a relationship between different variables, two at a time.
For a bigger picture or for the sake of completeness, we might want to display a rough joint distribution plot for each pair of variables. This can be done by using pairplot() from seaborn (sns). But since we have a large N, here is an overview with just a few important pairs of variables.
So as you can see, the figure above gives us a reasonable idea about variables’ relationships.
Missing data and outliers
Important considerations when dealing with missing data:
- How prevalent is it?
- Is data missing at random or is there a pattern?
ML algorithms can fail when data is missing, so there are different techniques for dealing with it. For example:
- Remove Rows With Missing Values
- Impute Missing Values: replace missing values with some other reasonable value, e.g., a value from another randomly selected record, a mean, median or mode value for the column or a default value like 0.
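Both options above can be sketched with pandas (small demo frame; the column names are from the dataset, the values are made up):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'LotFrontage': [65.0, np.nan, 68.0, 60.0, 70.0],
    'Alley':       [np.nan, 'Grvl', np.nan, np.nan, 'Pave'],
})

# How prevalent is the missing data, per column?
print(demo.isnull().sum())

# Option 1: remove rows with any missing value.
dropped = demo.dropna()

# Option 2: impute -- median for a numeric column, a default for a categorical.
filled = demo.copy()
filled['LotFrontage'] = filled['LotFrontage'].fillna(
    filled['LotFrontage'].median())
filled['Alley'] = filled['Alley'].fillna('None')
print(filled)
```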
Outliers in input data can skew and mislead the training process of ML algorithms. They can skew the summary statistics of attribute values in descriptive statistics (the ones we obtained using df.describe()) like the mean and standard deviation. However, they can also be a valuable source of information by providing insights about specific behaviors. Extreme value analysis is one approach to finding outliers. For example, from the scatter plot above, we can easily spot an out-of-place large value for ‘GrLivArea’. We can define it as an outlier and delete it.
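Deleting such a point can be sketched as a simple filter (the cutoff of 4000 square feet and the demo values are illustrative assumptions, chosen by eye from a scatter plot):

```python
import pandas as pd

demo = pd.DataFrame({
    'GrLivArea': [1500, 1700, 1200, 5600, 1400],
    'SalePrice': [180000, 210000, 140000, 160000, 170000],
})

cut = 4000  # illustrative threshold, chosen by eye from the scatter plot
outliers = demo[demo['GrLivArea'] > cut]
print(outliers)

# Keep only the rows inside the threshold.
cleaned = demo[demo['GrLivArea'] <= cut]
print(cleaned.shape)
```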
How we deal with outliers and missing data is quite important. The approaches I have used were merely for demonstration purposes. When dealing with a real problem, one should experiment with many different techniques and analyze how they impact the model.
Note: I am skipping data normalization. Briefly, when you normalize data you eliminate the units of measurement for data so that you can more easily compare different variables. For example, in feature scaling you rescale data to have values between 0 and 1. You can read more about it here.
Categorical variables are variables that fall into a specific set of categories; they have descriptive rather than quantitative values. ‘HouseStyle’, gender and city are such examples. Gender is generally binary, with two categories, Male and Female, that can be represented as 0 and 1. It is also possible to represent each possible value of a category as a separate feature, a technique called one-hot encoding. Before we can proceed, we need to convert our categorical variables, which are stored as text values, because many ML algorithms do not support them directly. Pandas makes it very easy to transform categorical features into vectors via the get_dummies() function.
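For example (small demo frame with made-up values; the column names come from the dataset):

```python
import pandas as pd

demo = pd.DataFrame({'HouseStyle': ['2Story', '1Story', 'SLvl', '2Story'],
                     'LotArea': [8450, 9600, 11250, 9550]})

# One indicator column per category value; numeric columns pass through.
encoded = pd.get_dummies(demo, columns=['HouseStyle'])
print(encoded.columns.tolist())
```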
Finally, we can look at creating a basic Linear Regression model to predict the housing prices. For a quick primer on Linear Regression, you can read this quick intro first.
First we split the dataset into two: predictor values, X, and target values, Y. Then we split the data into training and test sets.
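With scikit-learn this is a one-liner; the sketch below uses synthetic stand-in data in place of the prepared frame:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
n = 100
demo = pd.DataFrame({'GrLivArea': rng.uniform(500, 3000, n),
                     'OverallQual': rng.integers(1, 11, n)})
demo['SalePrice'] = 50 * demo['GrLivArea'] + 15000 * demo['OverallQual']

X = demo.drop(columns='SalePrice')  # predictors
Y = demo['SalePrice']               # target

# Hold out 20% of the rows for testing.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)
```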
If we check the shapes of our variables, we can see that we got train and test datasets in a proportion of 80% training data to 20% test data. (I am skipping the creation of a third validation set.)
Now let’s create a basic linear regression fitted on X_train and Y_train using scikit-learn, and use it to predict the sale values for the test set, stored in Y_pred.
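A sketch with scikit-learn's LinearRegression, again on a synthetic stand-in feature matrix rather than the prepared frame:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(6)
X = rng.uniform(500, 3000, (200, 1))         # stand-in feature matrix
Y = 50 * X[:, 0] + rng.normal(0, 5000, 200)  # stand-in target

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, Y_train)     # fit on the training portion
Y_pred = model.predict(X_test)  # predicted sale values for the test set
print(Y_pred[:3])
```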
We can create a scatter plot to visualize the differences between the actual prices and the predicted values, and also compute the Root Mean Squared Error (RMSE, also known as RMSD) – a measure of the differences between the values predicted by our model and the actual prices. RMSE is the standard deviation of the prediction errors (residuals); it measures how spread out the residuals are around the line of best fit.
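Both pieces can be sketched as follows (the actual and predicted arrays are illustrative stand-ins for Y_test and Y_pred):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error

# Stand-in actual vs predicted values.
Y_test = np.array([180000., 210000., 140000., 250000.])
Y_pred = np.array([175000., 215000., 150000., 240000.])

rmse = np.sqrt(mean_squared_error(Y_test, Y_pred))
print('RMSE:', rmse)

plt.scatter(Y_test, Y_pred, s=20)
lims = [Y_test.min(), Y_test.max()]
plt.plot(lims, lims, color='green')  # perfect-prediction line
plt.xlabel('Actual price')
plt.ylabel('Predicted price')
plt.savefig('pred_vs_actual.png')
```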
Ideally, the points in the scatter plot would fall along the diagonal (green line). This was a simple end-to-end example, and the model created is far from the best one.
Phew! This was a long post, so without going into further details, I will leave you with some ideas to try out. If you have come this far and are still curious, it should be easy to play with the features and try out other ML models in scikit-learn, like Random Forest. You can also read about approaches to minimizing prediction error. For example, you can play with K-Fold cross-validation, Stochastic Gradient Descent and Ridge Regression.
Now that you have witnessed the power of Python and its libraries, I hope you found something useful and inspiring here. And on that note, roll up your sleeves and get busy!
- A good book that you might want to read: “Hands-On Machine Learning with Scikit-Learn and TensorFlow”
- Amazing work by the Kaggle community