Machine Learning Algorithms 101

Posted by

This blog post is for the newbies who want to quickly understand the basics of the most popular Machine Learning algorithms. I am assuming that you have read the previous article regarding types or categories of algorithms. If not, I would suggest you to read that first.

In this quick tour you will learn about:

  1. Linear Regression
  2. Logistic Regression
  3. Decision Trees
  4. Naive Bayes
  5. Support Vector Machines, SVM
  6. K-Nearest Neighbors, KNN
  7. K-Means
  8. Random Forest
  9. Dimensionality Reduction
  10. Artificial Neural Networks, ANN

Let’s get started!

1. Linear Regression

Linear Regression is one of the most, or perhaps the most, popular algorithm. It attempts to establish a relationship between one or more independent variables and a numeric outcome or dependent variable by fitting the equation of a line to observed data:

Y = a*X + b

For example, you might want to relate the weights (Y) of individuals to their heights (X) using linear regression. This approach assumes a strong linear relationship between input and output variables – we would assume that if height increases then weight also increases proportionally in a linear fashion.

The goal is to derive optimal coefficients, a and b, so that our estimated values, Y, can be as close as possible to their correct values given in the training set.

800px-Linear_regression.svg

Different techniques can be used to learn the linear regression model, i.e. to derive the coefficients a and b from data. The two most popular methods:

  • Ordinary least squares: The method of least squares calculates the best-fitting line by minimizing the vertical distances from each data point to the line. If a point lies on the fitted line exactly, then its vertical distance is 0. To be specific, the distance is the sum of the squares of the vertical distances. The main idea is to fit a model by minimizing this squared error – blue points as close as possible to the red line.
  • Gradient descent is a method that minimizes the amount of error at each step of the model training process. I will go into its details in a separate post.

When there is only one independent variable, like in the figure above, then we call it Simple Linear Regression. When there are more than one independent variables (e.g., if we want to predict weight using more variables than just height) then the type is called Multiple Linear Regression. While finding the line of best fit, you can also use a polynomial instead of a straight line and this is known as polynomial regression.

2. Logistic Regression

Logistic Regression has the same main idea as linear regression – the goal is to find values for the coefficients for each input variable. This technique is used when the output/dependent variable is binary. For example: does age have an influence on the probability of having a heart attack (yes vs. no)?

In this case the regression line is not straight anymore; the prediction for the output is transformed using a non-linear S-shaped function called the logistic function, g(). The logistic function maps the values into an outcome variable Y ranging from [0,1] which can be interpreted as the probability of occurrence of an event.

The properties of the S-shaped logistic function make Logistic Regression suitable for classification tasks. In the heart attack example we have two class labels to be predicted (yes/no labelled as 1/0). If we set a threshold of 0.5 then all the instances with a probability higher than the threshold will be classified as instances belonging to class 1 while all those having probability below the threshold will be assigned to class 0.

Logistic Regression

 

3. Decision Trees

Decision Trees algorithm belongs to the category of supervised learning algorithms and it can be used for solving both regression and classification tasks.

In this algorithm the training model can learn to predict class or value of target variables by learning decision rules, using a tree representation. A tree is made up of nodes, corresponding to a feature/attribute. At each node we ask a question about the data. The left and right branches represent the possible answers. The final nodes, leaf nodescorrespond to a class label. The importance for each feature is determined in a top-down approach – the higher the node, the more important its attribute/feature. This is more easy to understand by a visual representation of an example problem:

decsion tree.png

 

In the image above, you can see that data instances are classified into ‘will accept the offer or not’ based on attributes like salary range and commute time. Salary is at a higher node than ‘offers free-coffee’ implying that it is more important in determining the final outcome.

4. Naive Bayes

Naive Bayes is a simple and widely used Machine Learning algorithm based on the Bayes Theorem. It is called naive because the classifier assumes that the input variables are independent of each other (a strong and unrealistic assumption for real data). The Bayes theorem is given by the equation below:

P(c|x) = P(x|c) . P(c) / P(x)

P(c|x) is the probability of the event of class c, given the predictor x; P(x|c) is the probability of x, given cP(c) is the probability of the class; P(x) is the probability of the predictor. To put it simply, the model is composed of two types of probabilities:

  • The probability of each class;
  • The conditional probability for each class given each value of x

Example: If we have a training data set of weather conditions (x) and corresponding target variable ‘Play’ (c), we might want to obtain the probability of ‘Players will play if it is rainy’, P(c|x) . (Even if the answer is a numerical value ranging from 0 to 1, this is a classification problem – we can use the probabilities to reach a binary or yes/no outcome.)

5. Support Vector Machine, SVM

Support Vector Machines is a supervised Machine Learning algorithm, used mainly for classification problems. In this algorithm, we plot each data item as a point in n-dimensional space, where n is number of input features. For example, with two input variables, we would have a two-dimensional space. Based on these transformations, SVM finds an optimal boundary, a hyperplane, that best separates the possible outputs by their class label. In a two-dimensional space, this hyperplane can be visualized as a line (not necessarily a straight line). The task of the SVM algorithm is to find the coefficients that provide the best separation of classes by this hyperplane.

The distance between the hyperplane (line) and the closest class points is called the margin. The optimal hyperplane is one that has the largest margin, or in other words, one that classifies points in such a way that the distance between the closest data point from both classes is maximum.

In simple words, suppose you are given a plot of two classes (red and blue dots) on a graph as shown in the figure below. The job of the SVM classifier would be to decide the best line that can separate the blue dots from the red dots.

SVM classifier

6. K-Nearest Neighbors, KNN

KNN algorithm is a very simple and popular technique. It classifies an object by searching through the entire training set for the K most similar instances (the K neighbors) and creating a common output variable for those K instances. KNN example from real life: ‘You are the average of the five people you most associate with’!

The figure below represents a classification example. If we classify a new instance (red circle) based on its 5 nearest neighbors, we would assign it to the ‘minus’ category. But if we look only at its 1 nearest neighbor then it would end up in the ‘plus’ category. Hence, the selection of K is critical here: a small value can result in a lot of noise and inaccurate results, while a large value is not feasible and defeats the purpose of the algorithm.

Although mostly used for classification, this technique can also be used for regression problems. For example, when dealing with a regression task, the output variable might be the mean of the K instances; for classification problems this might be the mode (or most common) class value.

The distance functions for assessing similarity between instances can be the so called Euclidean, Manhattan and Minkowski distances. Euclidean distance is simply an “ordinary” straight-line distance between two points (to be specific, the square root of the sum of the squares of the differences between corresponding coordinates of the points).

KNN

 

7. K-Means

K-means is a type of unsupervised algorithm which deals with data clustering. It follows a simple procedure to classify a given data set by finding a certain number of  clusters (k clusters). Since we are dealing with unsupervised learning scenario, all we have is the training data X, number of clusters K and no labelled training instances (i.e., data without defined categories or groups that we can use to train our model). The goal of this algorithm is to find K groups in the data. For example, K-Means could be used to segment users into K groups based on their purchase history.

The algorithm iteratively assigns each data point to one of K groups based on their features. Initially, it picks k points for each each of the K-clusters known as the centroid. A new data point from the dataset is put into the cluster having the closest centroid based on feature similarity.  As new elements are added to the cluster, the cluster centroid is recomputed and it keeps changing – the new centroid becomes the average location of all the data points currently in the cluster.  This process is continued iteratively until the centroids  stop changing. At the end, each centroid is a collection of feature values which define the resulting group.

Continuing with the purchase history example, the red cluster might represent users that like to buy tech gadgets and the blue one might be users interested in buying sports equipment. Now the algorithm will keep moving the centroid (plus sign in the figure below) for each user-segment until it is able to create K groups. It will do so by trying to maximize the separation between groups and users outside of the group.

kmeansclustering

8. Random Forest

Random Forest is one of the most popular and powerful Machine Learning algorithms. It is a type of ensemble algorithm. The underlying idea for ensemble learning is ‘wisdom of crowds‘ – collective opinion of many is more likely to be accurate than that of one. The outcome of each of the models is combined and a prediction is made.

In Random Forest, we have an ensemble of decision trees (seen above, in algorithm 3). When we want to classify a new object, we take the vote of each decision tree and combine the outcome to make a final decision – majority vote wins.

Random Forest

 

9. Dimensionality Reduction

In the last years, there has been an exponential increase in the amount of data captured and many Machine Learning problems involve thousands or even millions of features for each training instance, making not only training extremely slow but finding a good solution also becomes much harder. This problem is often referred to as the curse of dimensionality. In real-world problems, it is often possible to reduce the number of features considerably, making problems tractable.

For example, in an image classification problems if the pixels on the image borders are almost always white, these pixels can completely be dropped from the training set without losing much information.

Principal Component Analysis (PCA) is by far the most popular dimensionality reduction algorithm. PCA requires a blog post on its own, so I will leave the details for some future post.

10. Artificial Neural Networks, ANN

Last but not the least, a sneak peek into Artificial Neural Networks which are at the very core of Deep Learning. ANN are ideal to tackle large and highly complex Machine Learning tasks, such as recommending the best videos to watch to hundreds of millions of users every day (e.g., YouTube), powering speech recognition services (e.g., Siri, Cortana) or learning to beat the world champion at the game of Go (DeepMind’s AlphaGo).

The key idea behind ANN is to use brain’s architecture for inspiration on how to build intelligent machines. ANN require a huge amount of training data, high computational power and long training time but, in the end, they are able make very accurate predictions.

To train a neural network, a set of neurons are mapped out and assigned a random weight which determines how the neurons process new data (images, text, sounds). The correct relationship between inputs and outputs is learned from training the neural network on input data. And since during the training phase the system gets to see the correct answers, if the network doesn’t accurately identify the input – doesn’t see a face in an image, for example — then the system adjusts the weights. Eventually, after sufficient training, the neural network will consistently recognize the correct patterns in speech, text or images.

So a neural network is a essentially a set of interconnected layers with weighted edges and nodes called neurons. Between the input and output layers you can insert multiple hidden layers. ANN make use of only two hidden layers. However, if we increase the depth of these layers then we are dealing with the famous Deep Learning.

ANN

End Notes

By now, I am sure, you would have a good idea of the most commonly used Machine Learning algorithms. The idea behind this post was to give you a walk-through. For details regarding each of these algorithms, stay tuned!

 

One comment

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s