Gaussian mixture models can be used to cluster unlabeled data in much the same way as k-means. There are, however, a couple of advantages to using Gaussian mixture models over k-means.
First and foremost, k-means does not account for variance. By variance, we are referring to the width of the bell-shaped curve.
In two dimensions, variance (covariance to be exact) determines the shape of the distribution.
One way to think about the k-means model is that it places a circle (or, in higher dimensions, a hyper-sphere) at the center of each cluster, with a radius defined by the most distant point in the cluster.
This works fine when your clusters are circular. However, when your data takes on a different shape, the circular boundary becomes a poor fit.
In contrast, Gaussian mixture models can handle even very oblong clusters.
The second difference between k-means and Gaussian mixture models is that the former performs hard classification whereas the latter performs soft classification. In other words, k-means tells us which data point belongs to which cluster but won't provide us with the probabilities that a given data point belongs to each of the possible clusters.
When we call the predict function, the model assigns every data point to exactly one cluster.
On the other hand, we can call the predict_proba function to return the probabilities that a data point belongs to each of the K clusters.
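To sketch the difference between the two (using toy data from make_blobs and a fitted scikit-learn GaussianMixture):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Toy data: two well-separated blobs
X, _ = make_blobs(n_samples=200, centers=2, cluster_std=0.5, random_state=0)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Hard assignment: one cluster label per point
labels = gmm.predict(X)        # shape (200,)

# Soft assignment: one probability per cluster, each row sums to 1
probs = gmm.predict_proba(X)   # shape (200, 2)
print(labels[:3])
print(probs[:3].round(3))
```

For well-separated blobs the probabilities will be close to 0 or 1, but for overlapping clusters predict_proba exposes the uncertainty that predict hides.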
As the name implies, a Gaussian mixture model involves the mixture (i.e. superposition) of multiple Gaussian distributions. For the sake of explanation, suppose we had three distributions made up of samples from three distinct classes.
The blue Gaussian represents the level of education of people that make up the lower class. The red Gaussian represents the level of education of people that make up the middle class, and the green Gaussian represents the level of education of people that make up the upper class.
Not knowing what samples came from which class, our goal will be to use Gaussian Mixture Models to assign the data points to the appropriate cluster.
After training the model, we’d ideally end up with three distributions on the same axis. Then, depending on the level of education of a given sample (where it is located on the axis), we’d place it in one of the three categories.
Every distribution is multiplied by a weight π to account for the fact that we do not have an equal number of samples from each category. In other words, we might only have included 1000 people from the upper class and 100,000 people from the middle class.
Since we're dealing with probabilities, the weights must sum to 1.
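For instance, the mixture density at any point is just the weighted sum of the component densities, with weights that sum to 1. A minimal sketch with three hypothetical education-level distributions (the means, standard deviations and weights below are made up for illustration):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical component parameters: (mean, std dev) per class
means, stds = [8, 12, 16], [1.0, 1.5, 1.0]
weights = np.array([0.1, 0.8, 0.1])   # pi_k: must sum to 1
assert np.isclose(weights.sum(), 1.0)

# Mixture density at a single point x: sum of weighted component densities
x = 11.0
density = sum(w * norm.pdf(x, mu, sd) for w, mu, sd in zip(weights, means, stds))
print(density)
```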
If we decided to add another dimension such as the number of children, then, it might look something like this.
For those of you who aren’t mathematically inclined, I apologize in advance as the next section is quite heavy.
Let's suppose we wanted to know the likelihood that the ith sample came from Gaussian k. We can express this as p(zi = k | xi, θ), where theta (θ) represents the mean, covariance and weight for each Gaussian.
You may also come across this quantity written using the symbol π. This is not to be confused with the weight associated with each Gaussian (confusing, I know).
Next, we express the likelihood of observing a data point given that it came from Gaussian k as p(xi | zi = k, θ). The latter is sometimes written as N(xi | μk, Σk) (I believe the N comes from the Normal distribution).
Suppose we had a Gaussian distribution where the horizontal axis is the different IQ scores an individual could possibly get, lowest through highest. We can find out how likely it is for an individual to have an IQ of 120 by drawing a vertical line from the position along the x-axis to the curve and then looking at the corresponding value on the y-axis. The value of y at any point is equal to the equation above.
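As a concrete sketch of reading that y value off the curve, we can evaluate the normal density at 120 directly (assuming the usual IQ scale with mean 100 and standard deviation 15):

```python
from scipy.stats import norm

# Height of the N(mu=100, sigma=15) curve at x=120:
# the value of y we'd read off by drawing a vertical line at 120
likelihood = norm.pdf(120, loc=100, scale=15)
print(likelihood)   # ≈ 0.0109
```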
If we'd like to know the likelihood of observing the sample i while taking into account all the different distributions, we simply sum the likelihoods of observing the sample given that it came from each of the possible Gaussians.
Said differently, we take one sample (row) from our dataset, look at a single feature (i.e. level of education), plot its position on the x-axis and sum the corresponding y values (likelihood) for each distribution.
To extend this to all samples in our dataset, we assume that the likelihood of observing one sample is independent of all the others, and then we can simply multiply them.
We can rewrite the equation using the nomenclature we saw previously as follows:
More often than not, we take the log of the likelihood, because the log of a product is equal to the sum of the logs of its factors, and it's easier to add numbers than to multiply them.
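A sketch of that log-likelihood computation, using logsumexp for numerical stability (the data and mixture parameters below are hypothetical):

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

rng = np.random.default_rng(0)
x = rng.normal(10, 2, size=50)         # toy 1-D samples

weights = np.array([0.4, 0.6])         # hypothetical pi_k
means = np.array([8.0, 12.0])
stds = np.array([2.0, 2.0])

# log of each weighted component density: log pi_k + log N(x_i | mu_k, sigma_k)
log_component = np.log(weights) + norm.logpdf(x[:, None], means, stds)  # (50, 2)

# log p(x_i) = log sum_k pi_k N(x_i | mu_k, sigma_k)
log_lik_per_sample = logsumexp(log_component, axis=1)

# Independence turns the product over samples into a sum of logs
total_log_likelihood = log_lik_per_sample.sum()
print(total_log_likelihood)
```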
We have yet to address the fact that we need the parameters of each Gaussian (i.e. variance, mean and weight) in order to cluster our data, but we need to know which sample belongs to which Gaussian in order to estimate those very same parameters.
This is where expectation maximization comes into play. At a high level, the expectation maximization algorithm can be described as follows:
- Start with random Gaussian parameters (θ)
- Repeat the following until we converge:
a) Expectation Step: Compute p(zi = k | xi, θ). In other words, does sample i look like it came from cluster k?
b) Maximization Step: Update the Gaussian parameters (θ) to fit points assigned to them.
In the maximization step, we want to maximize the likelihood that each sample came from the distribution. Recall, that the likelihood is the height of the curve at a point along the x-axis. Therefore, we want to modify the variance and mean of the distribution such that the height of the plot at each data point is maximized.
This brings up the question, “How should we go about selecting the optimal values for the variance and mean?” Using the general form of expectation maximization, we can derive a set of equations for the mean, variance and weight.
We can follow the same process to obtain the equations for the covariance and weight.
Once we have the equations, we simply apply them during the maximization step. That is to say, we plug the numbers into each equation to determine the best mean, covariance and weight, and then set the Gaussian parameters accordingly.
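Putting the two steps together, a minimal 1-D EM loop might look like the following sketch (two components, toy data drawn from two known Gaussians, and a simple initialization at the extremes of the data rather than a fully random one):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
# Toy 1-D data: two Gaussian clusters with true means 5 and 10
x = np.concatenate([rng.normal(5, 1, 300), rng.normal(10, 1.5, 700)])

K = 2
weights = np.full(K, 1 / K)                 # pi_k
means = np.array([x.min(), x.max()])        # simple initialization
stds = np.full(K, x.std())

for _ in range(100):
    # E-step: responsibilities r_ik = p(z_i = k | x_i, theta)
    dens = weights * norm.pdf(x[:, None], means, stds)   # (n, K)
    resp = dens / dens.sum(axis=1, keepdims=True)

    # M-step: re-estimate each Gaussian from its soft assignments
    Nk = resp.sum(axis=0)
    means = (resp * x[:, None]).sum(axis=0) / Nk
    stds = np.sqrt((resp * (x[:, None] - means) ** 2).sum(axis=0) / Nk)
    weights = Nk / len(x)

print(np.sort(means))   # should land near the true means (5 and 10)
```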
Let’s take a look at the math in action. Initially, we don’t know what points are associated with which distribution.
We start off with K Gaussians (in this case, K=2) with random mean, variance and weight.
Then, we repeat the expectation and maximization steps until there is little to no change in theta (θ).
It's worth noting that the algorithm is susceptible to local maxima.
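In practice, a common mitigation is to run EM several times from different random starts and keep the highest-likelihood solution; scikit-learn's GaussianMixture exposes this via the n_init parameter:

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# n_init=10: run EM from 10 initializations, keep the best fit
gmm = GaussianMixture(n_components=4, n_init=10, random_state=0).fit(X)
print(gmm.means_.shape)
```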
Now, that we have a grasp of how Gaussian mixture models work, let’s take a look at how we could go about implementing them. To start, import the following libraries.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from matplotlib import pyplot as plt
import seaborn as sns
sns.set()
We randomly generate 4 clusters.
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
plt.scatter(X[:, 0], X[:, 1])
The optimal number of clusters (K) is the value that minimizes the Akaike information criterion (AIC) or the Bayesian information criterion (BIC).
n_components = np.arange(1, 21)
models = [GaussianMixture(n_components=n, covariance_type='full', random_state=0).fit(X)
          for n in n_components]
plt.plot(n_components, [m.bic(X) for m in models], label='BIC')
plt.plot(n_components, [m.aic(X) for m in models], label='AIC')
plt.legend(loc='best')
plt.xlabel('n_components');
We train our model using the optimal number of clusters (in this case, 4).
gmm = GaussianMixture(n_components=4)
gmm.fit(X)
We use the predict method to obtain the cluster assignment for each data point.
labels = gmm.predict(X)
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis');
Unlike k-means, Gaussian mixture models account for variance and return the probability that a data point belongs to each of the K clusters.