Representation Learning.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "6a764d92",
   "metadata": {},
   "source": [
    "# Representation Learning\n",
    "\n",
    "It is often very hard to find similarities within data, or even to classify data into different labels, when the data contains many features. One example is images, which may contain millions of pixels. Even low-dimensional data may not be easy to visualize in a 2D plane, so projecting the dominant effects onto a 2D plane for visualization is in itself a very helpful starting point.\n",
    "\n",
    "There are different methods to obtain another view of the data, by forming linear or even non-linear combinations of the data features. The price paid by such methods is that the new representation of the data may not be straightforward to digest, and it may therefore lose some of its scientific interpretation. On the other hand, if one understands the assumptions made in such methods, one can picture the mathematical process required to transform to and from this new view, and gain insight from it without losing track of the scientific background.\n",
    "\n",
    "We are going to go through a few methods of obtaining an alternative view of the data here, and discuss what their assumptions might be."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fce4d8e8",
   "metadata": {},
   "source": [
    "We start by loading the necessary Python modules. If you have not yet installed them, run the following cell to install them with pip:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "44ca341e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Requirement already satisfied: numpy in /home/daniloefl/miniconda3/envs/ml2/lib/python3.6/site-packages (1.19.2)\r\n",
      "Requirement already satisfied: scikit-learn in /home/daniloefl/miniconda3/envs/ml2/lib/python3.6/site-packages (0.24.2)\r\n",
      "Requirement already satisfied: pandas in /home/daniloefl/miniconda3/envs/ml2/lib/python3.6/site-packages (1.1.5)\r\n",
      "Requirement already satisfied: matplotlib in /home/daniloefl/miniconda3/envs/ml2/lib/python3.6/site-packages (3.3.4)\r\n",
      "Requirement already satisfied: scipy>=0.19.1 in /home/daniloefl/miniconda3/envs/ml2/lib/python3.6/site-packages (from scikit-learn) (1.5.2)\r\n",
      "Requirement already satisfied: joblib>=0.11 in /home/daniloefl/miniconda3/envs/ml2/lib/python3.6/site-packages (from scikit-learn) (1.0.1)\r\n",
      "Requirement already satisfied: threadpoolctl>=2.0.0 in /home/daniloefl/miniconda3/envs/ml2/lib/python3.6/site-packages (from scikit-learn) (2.2.0)\r\n",
      "Requirement already satisfied: python-dateutil>=2.7.3 in /home/daniloefl/miniconda3/envs/ml2/lib/python3.6/site-packages (from pandas) (2.8.2)\r\n",
      "Requirement already satisfied: pytz>=2017.2 in /home/daniloefl/miniconda3/envs/ml2/lib/python3.6/site-packages (from pandas) (2021.3)\r\n",
      "Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.3 in /home/daniloefl/miniconda3/envs/ml2/lib/python3.6/site-packages (from matplotlib) (3.0.4)\r\n",
      "Requirement already satisfied: cycler>=0.10 in /home/daniloefl/miniconda3/envs/ml2/lib/python3.6/site-packages (from matplotlib) (0.11.0)\r\n",
      "Requirement already satisfied: kiwisolver>=1.0.1 in /home/daniloefl/miniconda3/envs/ml2/lib/python3.6/site-packages (from matplotlib) (1.3.1)\r\n",
      "Requirement already satisfied: pillow>=6.2.0 in /home/daniloefl/miniconda3/envs/ml2/lib/python3.6/site-packages (from matplotlib) (8.3.1)\r\n",
      "Requirement already satisfied: six>=1.5 in /home/daniloefl/miniconda3/envs/ml2/lib/python3.6/site-packages (from python-dateutil>=2.7.3->pandas) (1.16.0)\r\n"
     ]
    }
   ],
   "source": [
    "!pip install numpy scikit-learn pandas matplotlib"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "300cf8d3",
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib notebook\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "from sklearn.decomposition import PCA"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0ecd6a69",
   "metadata": {},
   "source": [
    "Let's now generate some fake data to have something to analyze."
   ]
  },
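  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "4959a292",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Fixed seed so that the generated sample is reproducible\n",
    "rng = np.random.RandomState(0)\n",
    "n_samples = 500\n",
    "# Covariance matrix with a strong positive correlation between x1 and x2\n",
    "cov = [[3, 3], [3, 4]]\n",
    "# Draw the samples from a zero-mean two-dimensional Gaussian\n",
    "data = rng.multivariate_normal(mean=[0, 0], cov=cov, size=n_samples)\n",
    "data = pd.DataFrame(data, columns=[\"x1\", \"x2\"])"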
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d8295e8a",
   "metadata": {},
   "source": [
    "Let's first print out the generated dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "024fb65a",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>x1</th>\n",
       "      <th>x2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>-3.123062</td>\n",
       "      <td>-3.267402</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>-2.775958</td>\n",
       "      <td>-0.929101</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>-2.582416</td>\n",
       "      <td>-4.072345</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>-1.492453</td>\n",
       "      <td>-1.920361</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>-0.041529</td>\n",
       "      <td>0.381166</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>495</th>\n",
       "      <td>-0.821492</td>\n",
       "      <td>-0.782416</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>496</th>\n",
       "      <td>1.197165</td>\n",
       "      <td>1.665481</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>497</th>\n",
       "      <td>-0.691309</td>\n",
       "      <td>-0.383494</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>498</th>\n",
       "      <td>0.279317</td>\n",
       "      <td>0.428408</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>499</th>\n",
       "      <td>2.082251</td>\n",
       "      <td>2.082815</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>500 rows × 2 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "           x1        x2\n",
       "0   -3.123062 -3.267402\n",
       "1   -2.775958 -0.929101\n",
       "2   -2.582416 -4.072345\n",
       "3   -1.492453 -1.920361\n",
       "4   -0.041529  0.381166\n",