%% Cell type:markdown id:6a764d92 tags:

# Mixture Models
   
A common objective is to find similarities in data and cluster it. There are several methods for doing so. We will focus on a few methods with strong theoretical motivation; other methods, which rely on heuristics to identify similarities in the data, are mentioned at the end.
   
For the purposes of this example, we will generate a fake dataset.

%% Cell type:markdown id:fce4d8e8 tags:

We start by loading the necessary Python modules. If you have not yet installed them, run the following cell to install them with pip:

%% Cell type:code id:44ca341e tags:

``` python
!pip install numpy scikit-learn pandas matplotlib
```

%% Output
   
Requirement already satisfied: numpy in /home/daniloefl/miniconda3/envs/ml2/lib/python3.6/site-packages (1.19.2)
Requirement already satisfied: scikit-learn in /home/daniloefl/miniconda3/envs/ml2/lib/python3.6/site-packages (0.24.2)
Requirement already satisfied: pandas in /home/daniloefl/miniconda3/envs/ml2/lib/python3.6/site-packages (1.1.5)
Requirement already satisfied: matplotlib in /home/daniloefl/miniconda3/envs/ml2/lib/python3.6/site-packages (3.3.4)
Requirement already satisfied: scipy>=0.19.1 in /home/daniloefl/miniconda3/envs/ml2/lib/python3.6/site-packages (from scikit-learn) (1.5.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in /home/daniloefl/miniconda3/envs/ml2/lib/python3.6/site-packages (from scikit-learn) (2.2.0)
Requirement already satisfied: joblib>=0.11 in /home/daniloefl/miniconda3/envs/ml2/lib/python3.6/site-packages (from scikit-learn) (1.0.1)
Requirement already satisfied: python-dateutil>=2.7.3 in /home/daniloefl/miniconda3/envs/ml2/lib/python3.6/site-packages (from pandas) (2.8.2)
Requirement already satisfied: pytz>=2017.2 in /home/daniloefl/miniconda3/envs/ml2/lib/python3.6/site-packages (from pandas) (2021.3)
Requirement already satisfied: pillow>=6.2.0 in /home/daniloefl/miniconda3/envs/ml2/lib/python3.6/site-packages (from matplotlib) (8.3.1)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.3 in /home/daniloefl/miniconda3/envs/ml2/lib/python3.6/site-packages (from matplotlib) (3.0.4)
Requirement already satisfied: cycler>=0.10 in /home/daniloefl/miniconda3/envs/ml2/lib/python3.6/site-packages (from matplotlib) (0.11.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /home/daniloefl/miniconda3/envs/ml2/lib/python3.6/site-packages (from matplotlib) (1.3.1)
Requirement already satisfied: six>=1.5 in /home/daniloefl/miniconda3/envs/ml2/lib/python3.6/site-packages (from python-dateutil>=2.7.3->pandas) (1.16.0)
   
%% Cell type:code id:300cf8d3 tags:

``` python
%matplotlib notebook

import matplotlib.pyplot as plt

import pandas as pd
import numpy as np
from sklearn.mixture import GaussianMixture, BayesianGaussianMixture
from sklearn.cluster import KMeans
```

%% Cell type:markdown id:0ecd6a69 tags:

Let's generate the fake data now to have something to cluster.

%% Cell type:code id:4959a292 tags:
   
``` python
def generate_clusters(mu: list, sigma: list, N: int) -> np.ndarray:
    """Draw N samples from each Gaussian with mean mu[k] and covariance sigma[k],
    appending a `source` column recording which cluster generated each sample."""
    assert len(mu) == len(sigma)
    assert N > 1
    rng = np.random.default_rng()
    # stack the samples of all clusters on top of each other
    data = np.concatenate([rng.multivariate_normal(mean=mu_k, cov=sigma_k, size=N)
                           for mu_k, sigma_k in zip(mu, sigma)], axis=0)
    # label each sample with the index of the cluster that generated it
    source = np.concatenate([k*np.ones([N, 1]) for k in range(len(mu))], axis=0)
    return np.concatenate([data, source], axis=1)
```
   
%% Cell type:code id:82929490 tags:

``` python
data = generate_clusters(mu=[np.array([5.0, -2.0]),
                             np.array([1.0, 5.0]),
                             np.array([-5.0, -1.0])],
                         sigma=[np.array([[0.1, 0.2],
                                          [0.2, 0.1]]),
                                np.array([[1.0, 0.5],
                                          [0.5, 1.0]]),
                                np.array([[2.0, 0.0],
                                          [0.0, 5.0]])],
                         N=1000)
data = pd.DataFrame(data, columns=["x", "y", "source"])
```

%% Output
   
/home/daniloefl/miniconda3/envs/ml2/lib/python3.6/site-packages/ipykernel_launcher.py:6: RuntimeWarning: covariance is not positive-semidefinite.
   
%% Cell type:markdown id:d8295e8a tags:

Let's print out the generated dataset first.
   
%% Cell type:code id:024fb65a tags:

``` python
data
```

%% Output
   
             x         y  source
0     6.128013 -1.708080     0.0
1     3.938043 -2.788596     0.0
2     4.221406 -1.813124     0.0
3     4.530432 -2.052744     0.0
4     4.903167 -1.418898     0.0
...        ...       ...     ...
2995 -4.874779 -1.486904     2.0
2996 -6.650308 -0.551687     2.0
2997 -6.388364 -2.193081     2.0
2998 -6.324778 -0.540765     2.0
2999 -5.787230 -0.515902     2.0

[3000 rows x 3 columns]
   
%% Cell type:markdown id:1c178424 tags:

We can plot this fairly easily using Matplotlib.

%% Cell type:code id:e63b38c5 tags:

``` python
fig, ax = plt.subplots(figsize=(8, 8))
data.plot.scatter(x="x", y="y", c="source", colormap='viridis', ax=ax)
ax.set(xlabel="x", ylabel=r"y", title="")
plt.show()
```

%% Output

%% Cell type:markdown id:a376636d tags:
   
There are several ways of finding similarities in the data. The Gaussian Mixture Model assumes that the data was produced exactly as in this example. Each sample is assumed to be generated by the following procedure (a short code sketch of it follows this list):
* a random integer is drawn from a discrete probability distribution; it identifies the cluster the sample belongs to (this is the `source` variable in the generation procedure above);
* that integer selects the mean and covariance matrix of a Gaussian random variable, from which the observed sample is then drawn.
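A minimal sketch of this per-sample procedure (the mixing probabilities `pi` and the simple diagonal covariances below are illustrative choices for this sketch, not the values used by `generate_clusters` above):

``` python
rng = np.random.default_rng(seed=0)

# illustrative parameters: mixing probabilities, means and covariances per cluster
pi = np.array([0.2, 0.3, 0.5])   # p(z = k): probability of drawing cluster k
mu = [np.array([5.0, -2.0]), np.array([1.0, 5.0]), np.array([-5.0, -1.0])]
sigma = [np.eye(2) * 0.5, np.eye(2) * 1.0, np.eye(2) * 2.0]

def sample_one():
    z = rng.choice(len(pi), p=pi)                              # 1. draw the cluster index
    return rng.multivariate_normal(mean=mu[z], cov=sigma[z])   # 2. draw the sample from that Gaussian

samples = np.array([sample_one() for _ in range(5)])
```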
   
Our objective is then to find the probability of producing a sample from each cluster (the probability with which a sample belongs to one cluster rather than the others), as well as the means and covariance matrices of the model. Using Bayes' theorem, one may write down the posterior probability of the true means, covariances and cluster probabilities ($\theta$) given the data ($x$), but it is very hard to calculate the parameters from it directly. Let us call the cluster that a specific sample belongs to $z$.
   
The true posterior would look like this:

$p(\theta|x) = \frac{p(x|\theta) p(\theta)}{p(x)}$
   
We assume $p(\theta)$ is constant for all $\theta$, and we know $p(x)$ is independent of $\theta$ (it is just the probability of obtaining the existing data). We also know that $p(x|\theta) = \sum_z p(x,z|\theta)$, since by summing over all $z$ we account for every possible cluster assignment.
   
$p(\theta|x) \propto \sum_z p(x,z|\theta) = \sum_z p(x|\theta,z) p(z|\theta)$

where we have expanded it in $z$ using $p(a,b) = p(a|b)p(b)$.
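Concretely, if we write the cluster probabilities as $\pi_j = p(z=j|\theta)$ and collect $\theta = \{\pi_j, \mu_j, \Sigma_j\}$ for the $K$ clusters, the sum above is just the familiar mixture density for each sample $x_k$:

$p(x_k|\theta) = \sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_k \mid \mu_j, \Sigma_j)$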
   
Now, if we want to find the most probable $\theta$ under this posterior (the "maximum a posteriori", or "MAP"), we just need to maximize the right-hand side of the last equation. This is very hard to do directly, but we can use the Expectation-Maximization (EM) algorithm, in which we iteratively improve our estimate of the parameters. The method works as follows.
   
First we define the log-likelihood $LL(\theta|z) = \log L(\theta|z) = \log p(x|\theta,z) = \sum_{k} \log p(x_k|\theta,z_k)$, where the sum runs over the samples $k$. Since we are assuming that the probability for each data sample $x_k$ is a Gaussian distribution, each term in the sum is a Gaussian probability, so if we know $\theta$ and $z_k$, this full sum can be calculated. The issue is that we do not know $z_k$ (that is the whole point!).

We start by assuming $\theta=\theta_0$ for some random initial value of the parameters. The EM method iterates over two steps:
   
1. Calculate the *average* value of $LL(\theta|z)$ over $z$, using the current $\theta_i$ to evaluate the weights $p(z|x,\theta_i)$, that is: $Q(\theta) = \sum_z LL(\theta|z)\, p(z|x,\theta_i)$. In other words, we calculate a weighted sum of the log-likelihoods that a point belongs to each Gaussian, weighted by the probability that that Gaussian was the one that generated the sample. This avoids needing to know the correct $z_k$, since we simply sum over all possibilities. The weights follow from the current parameters $\theta_i$ (in particular the cluster probabilities) via Bayes' theorem.
   
2. Find the $\theta$ that maximizes $Q(\theta)$ and use it as the next estimate $\theta_{i+1}$.

These two steps are repeated for several iterations to progressively improve $\theta$. It can be shown that an improvement in $Q$ obtained with this procedure leads to a new $\theta$ that improves the posterior probability $p(\theta|x)$ above (see https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm#Proof_of_correctness ).
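To make the two steps concrete, here is a minimal, illustrative EM sketch for a two-component, one-dimensional Gaussian mixture (the synthetic data, variable names and the fixed number of iterations are choices made for this sketch, not part of any library API):

``` python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(seed=0)
# toy 1-D data drawn from two Gaussians
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 0.5, 200)])

# initial guess theta_0 = (cluster probabilities, means, standard deviations)
w, mu, sd = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

for _ in range(50):
    # E-step: responsibility of each component j for each sample k, p(z_k = j | x_k, theta_i)
    dens = np.stack([w[j] * norm.pdf(x, loc=mu[j], scale=sd[j]) for j in range(2)], axis=1)
    resp = dens / dens.sum(axis=1, keepdims=True)

    # M-step: parameters that maximize the expected log-likelihood Q(theta)
    n = resp.sum(axis=0)
    w = n / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / n
    sd = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / n)

print(w, mu, sd)
```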
   
In practice we will not write all of this ourselves. Instead, we use the `GaussianMixture` class in `scikit-learn`, which has this ready for us. It is nevertheless important to understand the assumptions made here: the underlying samples are assumed to come from a discrete mixture of Gaussians. Other mixture models are possible using other underlying distributions, but note that this method will not work well if the samples do not follow this generative pattern.
   
%% Cell type:code id:0837b3ff tags:

``` python
gmm = GaussianMixture(n_components=3, covariance_type="full", max_iter=20)
```

%% Cell type:code id:8798f857 tags:

``` python
gmm.fit(data.loc[:, ["x", "y"]])
```

%% Output

GaussianMixture(max_iter=20, n_components=3)

%% Cell type:markdown id:e928f498 tags:
   
Now that we have a fitted model, we can inspect the fitted parameters $\theta$, which include the means, the covariance matrices and the weights given to each Gaussian in the mixture:
   
%% Cell type:code id:fb5796e5 tags:

``` python
gmm.means_
```

%% Output

array([[ 4.98067655, -2.00779217],
       [-4.96087396, -1.06118412],
       [ 1.02445445,  5.03277158]])

%% Cell type:code id:182904d1 tags:

``` python
gmm.covariances_
```

%% Output

array([[[ 0.19237869,  0.09361426],
        [ 0.09361426,  0.20047796]],
       [[ 1.95114689, -0.01346582],
        [-0.01346582,  5.10415062]],
       [[ 0.91835643,  0.48388716],
        [ 0.48388716,  1.03723078]]])

%% Cell type:code id:2faa1f72 tags:

``` python
gmm.weights_
```

%% Output

array([0.33333333, 0.33282526, 0.33384141])

%% Cell type:markdown id:b54001fc tags:
   
We can now predict to which cluster each sample belongs by selecting, for each sample, the cluster that maximizes the probability $p(z|\theta,x)$, now that we know $\theta$.
   
%% Cell type:code id:cc8fc1f1 tags:

``` python
guess = gmm.predict(data.loc[:, ["x", "y"]])
```
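The full per-sample posterior probabilities $p(z|\theta,x)$ are also available through `predict_proba`; `predict` simply returns the component with the largest posterior probability. A short check of this on the same data:

``` python
# posterior probability of each component for every sample, shape (n_samples, n_components)
proba = gmm.predict_proba(data.loc[:, ["x", "y"]])

# taking the most probable component per sample reproduces gmm.predict
assert (proba.argmax(axis=1) == guess).all()
```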
   
%% Cell type:markdown id:68560fb4 tags:

Let's plot it!

%% Cell type:code id:88982b21 tags:

``` python
data.loc[:, "guess"] = guess
```

%% Cell type:code id:333581b5 tags:

``` python
fig, ax = plt.subplots(figsize=(10, 10), ncols=2)
data.plot.scatter(x="x", y="y", c="guess", colormap='viridis', ax=ax[0])
data.plot.scatter(x="x", y="y", c="source", colormap='viridis', ax=ax[1])
ax[0].set(xlabel="x", ylabel=r"y", title="Guessed source")
ax[1].set(xlabel="x", ylabel=r"y", title="True association")
plt.show()
```

%% Output

%% Cell type:markdown id:9b1363ec tags:
   
Note that if the sample clusters were not "blobs" of data but, for example, concentric circles, the assumption of this method would be violated and it would simply not work well. This is why it is important to understand the underlying assumptions of the method.
   
%% Cell type:markdown id:fbee128c tags:

## Determining the number of clusters

So far, we have assumed that the number of clusters is known. This is often not the case. One way to determine the number of clusters is to fit the GMM with different numbers of components and, after each fit, call `gmm.bic()`. This returns the Bayesian Information Criterion (see also the AIC), which should be small when the fit worked well and the number of components is compatible with your sample.

See a full example here: https://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_selection.html#sphx-glr-auto-examples-mixture-plot-gmm-selection-py
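A minimal sketch of such a scan (reusing the `data` DataFrame from above; the candidate range of one to six components is an arbitrary choice for illustration):

``` python
# fit GMMs with different numbers of components and keep their BIC values
X = data.loc[:, ["x", "y"]]
bic_values = {}
for k in range(1, 7):
    candidate = GaussianMixture(n_components=k, covariance_type="full", max_iter=100)
    candidate.fit(X)
    bic_values[k] = candidate.bic(X)

# the number of components with the smallest BIC is a reasonable choice
best_k = min(bic_values, key=bic_values.get)
print(best_k, bic_values)
```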
%% Cell type:markdown id:7076f779 tags:

## K-Means
   
Another common method used for clustering is K-Means, in which one simply tries to find cluster centres that minimize the in-cluster distances (in the Euclidean sense) while maximizing the distances between the centres. It can be shown that this method is a limiting special case of the Gaussian Mixture Model in which all covariance matrices are equal and proportional to the identity, which would mean that within each blob there is no correlation between the variables and all blobs have the same spread (see https://en.wikipedia.org/wiki/K-means_clustering#Gaussian_mixture_model and references therein).
   
While this is only an approximation of the GMM, it is still a very useful approach, since faster algorithms exist for this type of clustering than for GMMs.
   
The scikit-learn module also provides an easy-to-use implementation of this algorithm:

%% Cell type:code id:2f280e1d tags:

``` python
kmeans = KMeans(n_clusters=3)
kmeans.fit(data.loc[:, ["x", "y"]])
```

%% Output

KMeans(n_clusters=3)

%% Cell type:code id:950a7eec tags:

``` python
data.loc[:, "guess_kmeans"] = kmeans.predict(data.loc[:, ["x", "y"]])
```

%% Cell type:code id:16a56489 tags:

``` python
fig, ax = plt.subplots(figsize=(10, 10), ncols=2)
data.plot.scatter(x="x", y="y", c="guess_kmeans", colormap='viridis', ax=ax[0])
data.plot.scatter(x="x", y="y", c="source", colormap='viridis', ax=ax[1])
ax[0].set(xlabel="x", ylabel=r"y", title="Guessed source using K-Means")
ax[1].set(xlabel="x", ylabel=r"y", title="True association")
plt.show()
```

%% Output

%% Cell type:markdown id:2fe3cad2 tags:
   
## Variational Inference on Gaussian Mixture Models
   
While the Gaussian Mixture Model presented above has a very strong theoretical foundation, it still assumes that there is a single correct value for the Gaussian parameters. This may not be the case if the Gaussian model is only approximate (as is very often the case!).
   
One improvement to the Gaussian Mixture Model is to allow for the use of prior knowledge (a prior) on the parameters to be discovered. This helps us, for instance, to determine the ideal number of components automatically, since the number of components itself can be fit as well (assuming a prior distribution on it).
   
Optimizing this model is even more involved than the derivation shown before, and if we wanted to be fully general we would need very slow algorithms, such as Monte Carlo sampling, to obtain uncertainties with the fewest extra assumptions. This is often undesirable, since we need fast clustering, and Monte Carlo sampling becomes even slower with more data! An alternative is to assume some underlying prior probability for the means and covariances and to find those parameters as well in an optimization algorithm. This is what is done in Variational Inference. Further details can be found in Bishop (2006), or, in a more practical approach, here: https://scikit-learn.org/stable/modules/mixture.html#bgmm
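As a small illustration of the automatic component selection (a sketch reusing the toy `data` from above): if we deliberately give the variational model more components than the data needs, the weights of the superfluous components should be driven close to zero, which is how the effective number of clusters is selected.

``` python
# fit with intentionally too many components; the prior suppresses the unused ones
bgmm_many = BayesianGaussianMixture(n_components=10, max_iter=500)
bgmm_many.fit(data.loc[:, ["x", "y"]])

# only a few components should carry a non-negligible weight
print(np.round(bgmm_many.weights_, 3))
```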
   
We can also easily apply this method to our toy data, using the implementation from scikit-learn:
   
%% Cell type:code id:0a8642bd tags:

``` python
bgmm = BayesianGaussianMixture(n_components=3)
bgmm.fit(data.loc[:, ["x", "y"]])
data.loc[:, "guess_bgmm"] = bgmm.predict(data.loc[:, ["x", "y"]])
```

%% Cell type:code id:ccbf4019 tags:
   
``` python
fig, ax = plt.subplots(figsize=(10, 10), ncols=2)
data.plot.scatter(x="x", y="y", c="guess_bgmm", colormap='viridis', ax=ax[0])
data.plot.scatter(x="x", y="y", c="source", colormap='viridis', ax=ax[1])
ax[0].set(xlabel="x", ylabel=r"y", title="Guessed source using Variational Inference GMM")
ax[1].set(xlabel="x", ylabel=r"y", title="True association")
plt.show()
```

%% Output

%% Cell type:markdown id:72508ead tags:
   
## Other heuristic approaches
   
Many other heuristic approaches to clustering have been developed. Some rely on a strong statistical foundation; others take a more practical approach and make assumptions about how the data samples are expected to be distributed in the $xy$ scatter plot. The following methods deserve a look if you are not already familiar with them (a short usage sketch with all three follows the list):
   
* DBSCAN (see https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html ). Uses the Euclidean distances between samples to decide whether a sample belongs to one cluster or another. The assumption is related to that of GMMs. While the GMM requires fixing the number of clusters, DBSCAN relies on a maximum-distance parameter to define which samples belong to a cluster.
* Agglomerative clustering (see https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html#sklearn.cluster.AgglomerativeClustering). It uses the nearest neighbours of each sample to decide which samples belong together in the same cluster. Since the cluster association is topological, it does not assume the samples form a "blob" of some sort, as in the case of the GMM. One could, for example, have samples close to one another forming a circle in the $xy$ scatter plane, and those samples would belong to the same cluster (depending on the definition of closeness and the other parameters of the algorithm).
* Spectral clustering (see https://scikit-learn.org/stable/modules/clustering.html#spectral-clustering). Uses K-Means on another representation of the data, which transforms the data points from the $xy$ space shown before into a space where connectedness of nearest neighbours is the criterion for similarity. For the mathematical foundation, this document may be useful: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.165.9323&rep=rep1&type=pdf
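A minimal usage sketch of these three estimators on the toy data (the parameter values are illustrative choices, not tuned recommendations):

``` python
from sklearn.cluster import DBSCAN, AgglomerativeClustering, SpectralClustering

X = data.loc[:, ["x", "y"]].to_numpy()

# DBSCAN: eps is the maximum neighbourhood distance; samples labelled -1 are treated as noise
labels_dbscan = DBSCAN(eps=0.5, min_samples=10).fit_predict(X)

# Agglomerative clustering: merge nearest samples/clusters until n_clusters remain
labels_agglo = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# Spectral clustering: K-Means in a nearest-neighbour-graph embedding of the data
labels_spectral = SpectralClustering(n_clusters=3, affinity="nearest_neighbors").fit_predict(X)
```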
   
   
%% Cell type:markdown id:488df5eb tags:

### Contact us at the EuXFEL Data Analysis group at any time if you need help analysing your data!

#### Data Analysis group: da@xfel.eu

%% Cell type:code id:941c7cb8 tags:

``` python
```