%% Cell type:markdown id:6a764d92 tags:

# Mixture Models
   
A common objective is to find similarities in data and cluster it. There are several methods for doing so. We will focus on a few methods with strong theoretical motivation; other methods, which rely on heuristics to identify similarities in the data, are mentioned at the end.
   
For the purposes of this example, we will generate a fake dataset.

%% Cell type:markdown id:fce4d8e8 tags:

We start by loading the necessary Python modules. If you have not yet installed them, run the following cell to install them with pip:

%% Cell type:code id:44ca341e tags:

``` python
!pip install numpy scikit-learn pandas matplotlib
```

%% Output
   
Requirement already satisfied: numpy in /home/daniloefl/miniconda3/envs/ml2/lib/python3.6/site-packages (1.19.2)
Requirement already satisfied: scikit-learn in /home/daniloefl/miniconda3/envs/ml2/lib/python3.6/site-packages (0.24.2)
Requirement already satisfied: pandas in /home/daniloefl/miniconda3/envs/ml2/lib/python3.6/site-packages (1.1.5)
Requirement already satisfied: matplotlib in /home/daniloefl/miniconda3/envs/ml2/lib/python3.6/site-packages (3.3.4)
Requirement already satisfied: scipy>=0.19.1 in /home/daniloefl/miniconda3/envs/ml2/lib/python3.6/site-packages (from scikit-learn) (1.5.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in /home/daniloefl/miniconda3/envs/ml2/lib/python3.6/site-packages (from scikit-learn) (2.2.0)
Requirement already satisfied: joblib>=0.11 in /home/daniloefl/miniconda3/envs/ml2/lib/python3.6/site-packages (from scikit-learn) (1.0.1)
Requirement already satisfied: python-dateutil>=2.7.3 in /home/daniloefl/miniconda3/envs/ml2/lib/python3.6/site-packages (from pandas) (2.8.2)
Requirement already satisfied: pytz>=2017.2 in /home/daniloefl/miniconda3/envs/ml2/lib/python3.6/site-packages (from pandas) (2021.3)
Requirement already satisfied: pillow>=6.2.0 in /home/daniloefl/miniconda3/envs/ml2/lib/python3.6/site-packages (from matplotlib) (8.3.1)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.3 in /home/daniloefl/miniconda3/envs/ml2/lib/python3.6/site-packages (from matplotlib) (3.0.4)
Requirement already satisfied: cycler>=0.10 in /home/daniloefl/miniconda3/envs/ml2/lib/python3.6/site-packages (from matplotlib) (0.11.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /home/daniloefl/miniconda3/envs/ml2/lib/python3.6/site-packages (from matplotlib) (1.3.1)
Requirement already satisfied: six>=1.5 in /home/daniloefl/miniconda3/envs/ml2/lib/python3.6/site-packages (from python-dateutil>=2.7.3->pandas) (1.16.0)
   
%% Cell type:code id:300cf8d3 tags:

``` python
%matplotlib notebook

import matplotlib.pyplot as plt

import pandas as pd
import numpy as np
from sklearn.mixture import GaussianMixture, BayesianGaussianMixture
from sklearn.cluster import KMeans
```

%% Cell type:markdown id:0ecd6a69 tags:

Let's generate the fake data now to have something to cluster.

%% Cell type:code id:4959a292 tags:
   
``` python
def generate_clusters(mu: list, sigma: list, N: int) -> np.ndarray:
    """Draw N samples from each Gaussian with mean mu[k] and covariance sigma[k],
    appending a `source` column recording which cluster generated each sample."""
    assert len(mu) == len(sigma)
    assert N > 1
    rng = np.random.default_rng()
    # stack the samples of all clusters on top of each other
    data = np.concatenate([rng.multivariate_normal(mean=mu_k, cov=sigma_k, size=N)
                           for mu_k, sigma_k in zip(mu, sigma)], axis=0)
    # label each sample with the index of the cluster that generated it
    source = np.concatenate([k*np.ones([N, 1]) for k in range(len(mu))], axis=0)
    return np.concatenate([data, source], axis=1)
```
   
%% Cell type:code id:82929490 tags:

``` python
data = generate_clusters(mu=[np.array([5.0, -2.0]),
                             np.array([1.0, 5.0]),
                             np.array([-5.0, -1.0])],
                         sigma=[np.array([[0.1, 0.2],
                                          [0.2, 0.1]]),
                                np.array([[1.0, 0.5],
                                          [0.5, 1.0]]),
                                np.array([[2.0, 0.0],
                                          [0.0, 5.0]])],
                         N=1000)
data = pd.DataFrame(data, columns=["x", "y", "source"])
```

%% Output
   
/home/daniloefl/miniconda3/envs/ml2/lib/python3.6/site-packages/ipykernel_launcher.py:6: RuntimeWarning: covariance is not positive-semidefinite.
   
%% Cell type:markdown id:d8295e8a tags:

Let's print out the generated dataset first.
   
%% Cell type:code id:024fb65a tags:

``` python
data
```

%% Output
   
             x         y  source
0     6.128013 -1.708080     0.0
1     3.938043 -2.788596     0.0
2     4.221406 -1.813124     0.0
3     4.530432 -2.052744     0.0
4     4.903167 -1.418898     0.0
...        ...       ...     ...
2995 -4.874779 -1.486904     2.0
2996 -6.650308 -0.551687     2.0
2997 -6.388364 -2.193081     2.0
2998 -6.324778 -0.540765     2.0
2999 -5.787230 -0.515902     2.0

[3000 rows x 3 columns]
   
%% Cell type:markdown id:1c178424 tags:

We can plot this fairly easily using Matplotlib.

%% Cell type:code id:e63b38c5 tags:

``` python
fig, ax = plt.subplots(figsize=(8, 8))
data.plot.scatter(x="x", y="y", c="source", colormap='viridis', ax=ax)
ax.set(xlabel="x", ylabel=r"y", title="")
plt.show()
```

%% Output

%% Cell type:markdown id:a376636d tags:
   
There are several ways of finding similarities in the data. The Gaussian Mixture Model assumes that the data was produced exactly as in this example. Each sample is assumed to be generated by the following procedure (a short code sketch of it follows this list):
* a random integer is drawn from a discrete probability distribution; it identifies the cluster the sample belongs to (this is the `source` variable in the generation procedure above);
* that integer selects the mean and covariance matrix of a Gaussian random variable, from which the observed sample is then drawn.
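A minimal sketch of this per-sample procedure (the mixing probabilities `pi` and the simple diagonal covariances below are illustrative choices for this sketch, not the values used by `generate_clusters` above):

``` python
rng = np.random.default_rng(seed=0)

# illustrative parameters: mixing probabilities, means and covariances per cluster
pi = np.array([0.2, 0.3, 0.5])   # p(z = k): probability of drawing cluster k
mu = [np.array([5.0, -2.0]), np.array([1.0, 5.0]), np.array([-5.0, -1.0])]
sigma = [np.eye(2) * 0.5, np.eye(2) * 1.0, np.eye(2) * 2.0]

def sample_one():
    z = rng.choice(len(pi), p=pi)                              # 1. draw the cluster index
    return rng.multivariate_normal(mean=mu[z], cov=sigma[z])   # 2. draw the sample from that Gaussian

samples = np.array([sample_one() for _ in range(5)])
```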
   
Our objective is then to find the probability of producing a sample from each cluster (the probability with which a sample belongs to one cluster rather than the others), as well as the means and covariance matrices of the model. Using Bayes' theorem, one may write down the posterior probability of the true means, covariances and cluster probabilities ($\theta$) given the data ($x$), but it is very hard to calculate the parameters from it directly. Let us call the cluster that a specific sample belongs to $z$.
   
The true posterior would look like this:

$p(\theta|x) = \frac{p(x|\theta) p(\theta)}{p(x)}$
   
We assume $p(\theta)$ is constant for all $\theta$, and we know $p(x)$ is independent of $\theta$ (it is just the probability of obtaining the existing data). We also know that $p(x|\theta) = \sum_z p(x,z|\theta)$, since by summing over all $z$ we account for every possible cluster assignment.
   
$p(\theta|x) \propto \sum_z p(x,z|\theta) = \sum_z p(x|\theta,z) p(z|\theta)$

where we have expanded it in $z$ using $p(a,b) = p(a|b)p(b)$.
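Concretely, if we write the cluster probabilities as $\pi_j = p(z=j|\theta)$ and collect $\theta = \{\pi_j, \mu_j, \Sigma_j\}$ for the $K$ clusters, the sum above is just the familiar mixture density for each sample $x_k$:

$p(x_k|\theta) = \sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_k \mid \mu_j, \Sigma_j)$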
   
Now, if we want to find the most probable $\theta$ under this posterior (the "maximum a posteriori", or "MAP"), we just need to maximize the right-hand side of the last equation. This is very hard to do directly, but we can use the Expectation-Maximization (EM) algorithm, in which we iteratively improve our estimate of the parameters. The method works as follows.
   
First we define the log-likelihood $LL(\theta|z) = \log L(\theta|z) = \log p(x|\theta,z) = \sum_{k} \log p(x_k|\theta,z_k)$, where the sum runs over the samples $k$. Since we are assuming that the probability for each data sample $x_k$ is a Gaussian distribution, each term in the sum is a Gaussian probability, so if we know $\theta$ and $z_k$, this full sum can be calculated. The issue is that we do not know $z_k$ (that is the whole point!).

We start by assuming $\theta=\theta_0$ for some random initial value of the parameters. The EM method iterates over two steps:
   
1. Calculate the *average* value of $LL(\theta|z)$ over $z$, using the current $\theta_i$ to evaluate the weights $p(z|x,\theta_i)$, that is: $Q(\theta) = \sum_z LL(\theta|z)\, p(z|x,\theta_i)$. In other words, we calculate a weighted sum of the log-likelihoods that a point belongs to each Gaussian, weighted by the probability that that Gaussian was the one that generated the sample. This avoids needing to know the correct $z_k$, since we simply sum over all possibilities. The weights follow from the current parameters $\theta_i$ (in particular the cluster probabilities) via Bayes' theorem.
   
2. Find the $\theta$ that maximizes $Q(\theta)$ and use it as the next estimate $\theta_{i+1}$.

These two steps are repeated for several iterations to progressively improve $\theta$. It can be shown that an improvement in $Q$ obtained with this procedure leads to a new $\theta$ that improves the posterior probability $p(\theta|x)$ above (see https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm#Proof_of_correctness ).
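To make the two steps concrete, here is a minimal, illustrative EM sketch for a two-component, one-dimensional Gaussian mixture (the synthetic data, variable names and the fixed number of iterations are choices made for this sketch, not part of any library API):

``` python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(seed=0)
# toy 1-D data drawn from two Gaussians
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 0.5, 200)])

# initial guess theta_0 = (cluster probabilities, means, standard deviations)
w, mu, sd = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

for _ in range(50):
    # E-step: responsibility of each component j for each sample k, p(z_k = j | x_k, theta_i)
    dens = np.stack([w[j] * norm.pdf(x, loc=mu[j], scale=sd[j]) for j in range(2)], axis=1)
    resp = dens / dens.sum(axis=1, keepdims=True)

    # M-step: parameters that maximize the expected log-likelihood Q(theta)
    n = resp.sum(axis=0)
    w = n / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / n
    sd = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / n)

print(w, mu, sd)
```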
   
In practice we will not write all of this ourselves. Instead, we use the `GaussianMixture` class in `scikit-learn`, which has this ready for us. It is nevertheless important to understand the assumptions made here: the underlying samples are assumed to come from a discrete mixture of Gaussians. Other mixture models are possible using other underlying distributions, but note that this method will not work well if the samples do not follow this generative pattern.
   
%% Cell type:code id:0837b3ff tags:

``` python
gmm = GaussianMixture(n_components=3, covariance_type="full", max_iter=20)
```

%% Cell type:code id:8798f857 tags:

``` python
gmm.fit(data.loc[:, ["x", "y"]])
```

%% Output

GaussianMixture(max_iter=20, n_components=3)

%% Cell type:markdown id:e928f498 tags:
   
Now that we have a fitted model, we can inspect the fitted parameters $\theta$, which include the means, the covariance matrices and the weights given to each Gaussian in the mixture:
   
%% Cell type:code id:fb5796e5 tags:

``` python
gmm.means_
```

%% Output

array([[ 4.98067655, -2.00779217],
       [-4.96087396, -1.06118412],
       [ 1.02445445,  5.03277158]])

%% Cell type:code id:182904d1 tags:

``` python
gmm.covariances_
```

%% Output

array([[[ 0.19237869,  0.09361426],
        [ 0.09361426,  0.20047796]],
       [[ 1.95114689, -0.01346582],
        [-0.01346582,  5.10415062]],
       [[ 0.91835643,  0.48388716],
        [ 0.48388716,  1.03723078]]])

%% Cell type:code id:2faa1f72 tags:

``` python
gmm.weights_
```

%% Output

array([0.33333333, 0.33282526, 0.33384141])

%% Cell type:markdown id:b54001fc tags:
   
We can now predict to which cluster each sample belongs by selecting, for each sample, the cluster that maximizes the probability $p(z|\theta,x)$, now that we know $\theta$.
   
%% Cell type:code id:cc8fc1f1 tags:

``` python
guess = gmm.predict(data.loc[:, ["x", "y"]])
```
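The full per-sample posterior probabilities $p(z|\theta,x)$ are also available through `predict_proba`; `predict` simply returns the component with the largest posterior probability. A short check of this on the same data:

``` python
# posterior probability of each component for every sample, shape (n_samples, n_components)
proba = gmm.predict_proba(data.loc[:, ["x", "y"]])

# taking the most probable component per sample reproduces gmm.predict
assert (proba.argmax(axis=1) == guess).all()
```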
   
%% Cell type:markdown id:68560fb4 tags:

Let's plot it!

%% Cell type:code id:88982b21 tags:

``` python
data.loc[:, "guess"] = guess
```

%% Cell type:code id:333581b5 tags:

``` python
fig, ax = plt.subplots(figsize=(10, 10), ncols=2)
data.plot.scatter(x="x", y="y", c="guess", colormap='viridis', ax=ax[0])
data.plot.scatter(x="x", y="y", c="source", colormap='viridis', ax=ax[1])
ax[0].set(xlabel="x", ylabel=r"y", title="Guessed source")
ax[1].set(xlabel="x", ylabel=r"y", title="True association")
plt.show()
```

%% Output

%% Cell type:markdown id:9b1363ec tags:
   
Note that if the sample clusters were not "blobs" of data but, for example, concentric circles, the assumption of this method would be violated and it would simply not work well. This is why it is important to understand the underlying assumptions of the method.
   
%% Cell type:markdown id:fbee128c tags:

## Determining the number of clusters

So far, we have assumed that the number of clusters is known. This is often not the case. One way to determine the number of clusters is to fit the GMM with different numbers of components and, after each fit, call `gmm.bic()`. This returns the Bayesian Information Criterion (see also the AIC), which should be small when the fit worked well and the number of components is compatible with your sample.

See a full example here: https://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_selection.html#sphx-glr-auto-examples-mixture-plot-gmm-selection-py
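A minimal sketch of such a scan (reusing the `data` DataFrame from above; the candidate range of one to six components is an arbitrary choice for illustration):

``` python
# fit GMMs with different numbers of components and keep their BIC values
X = data.loc[:, ["x", "y"]]
bic_values = {}
for k in range(1, 7):
    candidate = GaussianMixture(n_components=k, covariance_type="full", max_iter=100)
    candidate.fit(X)
    bic_values[k] = candidate.bic(X)

# the number of components with the smallest BIC is a reasonable choice
best_k = min(bic_values, key=bic_values.get)
print(best_k, bic_values)
```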
%% Cell type:markdown id:7076f779 tags:

## K-Means
   
Another common method used for clustering is K-Means, in which one simply tries to find cluster centres that minimize the in-cluster distances (in the Euclidean sense) while maximizing the distances between the centres. It can be shown that this method is a limiting special case of the Gaussian Mixture Model in which all covariance matrices are equal and proportional to the identity, which would mean that within each blob there is no correlation between the variables and all blobs have the same spread (see https://en.wikipedia.org/wiki/K-means_clustering#Gaussian_mixture_model and references therein).
   
While this is only an approximation of the GMM, it is still a very useful approach, since faster algorithms exist for this type of clustering than for GMMs.
   
The scikit-learn module also provides an easy-to-use implementation of this algorithm:

%% Cell type:code id:2f280e1d tags:

``` python
kmeans = KMeans(n_clusters=3)
kmeans.fit(data.loc[:, ["x", "y"]])
```

%% Output

KMeans(n_clusters=3)

%% Cell type:code id:950a7eec tags:

``` python
data.loc[:, "guess_kmeans"] = kmeans.predict(data.loc[:, ["x", "y"]])
```

%% Cell type:code id:16a56489 tags:

``` python
fig, ax = plt.subplots(figsize=(10, 10), ncols=2)
data.plot.scatter(x="x", y="y", c="guess_kmeans", colormap='viridis', ax=ax[0])
data.plot.scatter(x="x", y="y", c="source", colormap='viridis', ax=ax[1])
ax[0].set(xlabel="x", ylabel=r"y", title="Guessed source using K-Means")
ax[1].set(xlabel="x", ylabel=r"y", title="True association")
plt.show()
```

%% Output

%% Cell type:markdown id:2fe3cad2 tags:
   
## Variational Inference on Gaussian Mixture Models
   
While the Gaussian Mixture Model presented above has a very strong theoretical foundation, it still assumes that there is a single correct value for the Gaussian parameters. This may not be the case if the Gaussian model is only approximate (as is very often the case!).
   
One improvement to the Gaussian Mixture Model is to allow for the use of prior knowledge (a prior) on the parameters to be discovered. This helps us, for instance, to determine the ideal number of components automatically, since the number of components itself can be fit as well (assuming a prior distribution on it).
   
Optimizing this model is even more involved than the derivation shown before, and if we wanted to be fully general we would need very slow algorithms, such as Monte Carlo sampling, to obtain uncertainties with the fewest extra assumptions. This is often undesirable, since we need fast clustering, and Monte Carlo sampling becomes even slower with more data! An alternative is to assume some underlying prior probability for the means and covariances and to find those parameters as well in an optimization algorithm. This is what is done in Variational Inference. Further details can be found in Bishop (2006), or, in a more practical approach, here: https://scikit-learn.org/stable/modules/mixture.html#bgmm
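As a small illustration of the automatic component selection (a sketch reusing the toy `data` from above): if we deliberately give the variational model more components than the data needs, the weights of the superfluous components should be driven close to zero, which is how the effective number of clusters is selected.

``` python
# fit with intentionally too many components; the prior suppresses the unused ones
bgmm_many = BayesianGaussianMixture(n_components=10, max_iter=500)
bgmm_many.fit(data.loc[:, ["x", "y"]])

# only a few components should carry a non-negligible weight
print(np.round(bgmm_many.weights_, 3))
```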
   
We can also easily apply this method to our toy data, using the implementation from scikit-learn:
   
%% Cell type:code id:0a8642bd tags:

``` python
bgmm = BayesianGaussianMixture(n_components=3)
bgmm.fit(data.loc[:, ["x", "y"]])
data.loc[:, "guess_bgmm"] = bgmm.predict(data.loc[:, ["x", "y"]])
```

%% Cell type:code id:ccbf4019 tags:
   
``` python
fig, ax = plt.subplots(figsize=(10, 10), ncols=2)
data.plot.scatter(x="x", y="y", c="guess_bgmm", colormap='viridis', ax=ax[0])
data.plot.scatter(x="x", y="y", c="source", colormap='viridis', ax=ax[1])
ax[0].set(xlabel="x", ylabel=r"y", title="Guessed source using Variational Inference GMM")
ax[1].set(xlabel="x", ylabel=r"y", title="True association")
plt.show()
```

%% Output

%% Cell type:markdown id:72508ead tags:
   
## Other heuristic approaches
   
Many other heuristic approaches to clustering have been developed. Some rely on a strong statistical foundation; others take a more practical approach and make assumptions about how the data samples are expected to be distributed in the $xy$ scatter plot. The following methods deserve a look if you are not already familiar with them (a short usage sketch with all three follows the list):
   
* DBSCAN (see https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html ). Uses the Euclidean distances between samples to decide whether a sample belongs to one cluster or another. The assumption is related to that of GMMs. While the GMM requires fixing the number of clusters, DBSCAN relies on a maximum-distance parameter to define which samples belong to a cluster.
* Agglomerative clustering (see https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html#sklearn.cluster.AgglomerativeClustering). It uses the nearest neighbours of each sample to decide which samples belong together in the same cluster. Since the cluster association is topological, it does not assume the samples form a "blob" of some sort, as in the case of the GMM. One could, for example, have samples close to one another forming a circle in the $xy$ scatter plane, and those samples would belong to the same cluster (depending on the definition of closeness and the other parameters of the algorithm).
* Spectral clustering (see https://scikit-learn.org/stable/modules/clustering.html#spectral-clustering). Uses K-Means on another representation of the data, which transforms the data points from the $xy$ space shown before into a space where connectedness of nearest neighbours is the criterion for similarity. For the mathematical foundation, this document may be useful: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.165.9323&rep=rep1&type=pdf
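A minimal usage sketch of these three estimators on the toy data (the parameter values are illustrative choices, not tuned recommendations):

``` python
from sklearn.cluster import DBSCAN, AgglomerativeClustering, SpectralClustering

X = data.loc[:, ["x", "y"]].to_numpy()

# DBSCAN: eps is the maximum neighbourhood distance; samples labelled -1 are treated as noise
labels_dbscan = DBSCAN(eps=0.5, min_samples=10).fit_predict(X)

# Agglomerative clustering: merge nearest samples/clusters until n_clusters remain
labels_agglo = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# Spectral clustering: K-Means in a nearest-neighbour-graph embedding of the data
labels_spectral = SpectralClustering(n_clusters=3, affinity="nearest_neighbors").fit_predict(X)
```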
   
   
%% Cell type:markdown id:488df5eb tags:

### Contact us at the EuXFEL Data Analysis group at any time if you need help analysing your data!

#### Data Analysis group: da@xfel.eu

%% Cell type:code id:941c7cb8 tags:

``` python
```