{ "cells": [ { "cell_type": "markdown", "id": "9cb10ea7-ba09-44b5-b8f5-16164464afc9", "metadata": {}, "source": [ "# **Lecture 6: Examples for Logistic Regression**\n", "\n", "**Two problems related to logistic regression are demonstrated**\n", "- Logistic classifier\n", "- Logistic regression" ] }, { "cell_type": "markdown", "id": "d0e26c6d-aa8e-49b4-9f79-2b9ede3e72e5", "metadata": {}, "source": [ "## **Part I: logistic classification**" ] }, { "cell_type": "code", "execution_count": 1, "id": "099644bb-883e-402e-9503-ddacd42eacb5", "metadata": {}, "outputs": [], "source": [ "# Code source: Gaël Varoquaux\n", "# Modified for documentation by Jaques Grobler\n", "# License: BSD 3 clause\n", "\n", "import matplotlib.pyplot as plt\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn import datasets\n", "from sklearn.inspection import DecisionBoundaryDisplay" ] }, { "cell_type": "markdown", "id": "c64a8687-4610-4202-a37d-a9b4d64e8e45", "metadata": {}, "source": [ "### 1.1, Import some data to play with" ] }, { "cell_type": "code", "execution_count": 2, "id": "ecdbdf12-dc29-4c06-92d9-d8f413f808dc", "metadata": {}, "outputs": [], "source": [ "# import some data to play with\n", "iris = datasets.load_iris()\n", "X = iris.data[:, :2] # we only take the first two features.\n", "Y = iris.target" ] }, { "cell_type": "markdown", "id": "f8b51501-19d5-4835-b7a8-bb73f0a678e6", "metadata": {}, "source": [ "### 1.2, Create an instance of Logistic Regression Classifier and fit the data." ] }, { "cell_type": "code", "execution_count": 6, "id": "6b98424d-ab84-4089-a477-efa873e56dc2", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
LogisticRegression(C=100000.0)
" ], "text/plain": [ "LogisticRegression(C=100000.0)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Create an instance of Logistic Regression Classifier and fit the data.\n", "logreg = LogisticRegression(C=1e5)\n", "logreg.fit(X, Y)\n" ] }, { "cell_type": "markdown", "id": "007863a0-cfc2-4a62-bde9-b8f0e5253788", "metadata": {}, "source": [ "### 1.3, Plot resluts" ] }, { "cell_type": "code", "execution_count": 7, "id": "77b4c453-0f25-46b4-a9e6-f06cffba769c", "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "_, ax = plt.subplots(figsize=(8, 6))\n", "\n", "DecisionBoundaryDisplay.from_estimator(\n", " logreg,\n", " X,\n", " cmap=plt.cm.Paired,\n", " ax=ax,\n", " response_method=\"predict\",\n", " plot_method=\"pcolormesh\",\n", " shading=\"auto\",\n", " xlabel=\"Sepal length\",\n", " ylabel=\"Sepal width\",\n", " eps=0.5,\n", ")\n", "\n", "# Plot also the training points\n", "plt.scatter(X[:, 0], X[:, 1], c=Y, edgecolors=\"k\", cmap=plt.cm.Paired)\n", "\n", "\n", "plt.xticks(())\n", "plt.yticks(())\n", "\n", "plt.show()\n" ] }, { "cell_type": "markdown", "id": "d4a84213-16c6-44ec-9df6-adb7e40d516e", "metadata": {}, "source": [ "## **Part II: logistic regression**" ] }, { "cell_type": "markdown", "id": "e915d96c-c1f9-4db1-8aef-2262cbaf2957", "metadata": {}, "source": [ "### 2.1, a simple explanation case: numbers connected to logistics" ] }, { "cell_type": "code", "execution_count": 9, "id": "4b517ded-99b2-4569-baf1-6cd904240c45", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "19.33672738021736\n", "0.048945750079555296\n" ] } ], "source": [ "from __future__ import print_function, division\n", "from builtins import range\n", "# Note: you may need to update your version of future\n", "# sudo pip install -U future\n", "\n", "\n", "\n", "import numpy as np\n", "\n", "N = 100\n", "D = 2\n", "\n", "\n", "X = np.random.randn(N,D)\n", "\n", "# center the first 50 points at (-2,-2)\n", "X[:50,:] = X[:50,:] - 2*np.ones((50,D))\n", "\n", "# center the last 50 points at (2, 2)\n", "X[50:,:] = X[50:,:] + 2*np.ones((50,D))\n", "\n", "# labels: first 50 are 0, last 50 are 1\n", "T = np.array([0]*50 + [1]*50)\n", "\n", "# add a column of ones\n", "# ones = np.array([[1]*N]).T # old\n", "ones = np.ones((N, 1))\n", "Xb = np.concatenate((ones, X), axis=1)\n", "\n", "# randomly initialize the weights\n", "w = np.random.randn(D + 1)\n", "\n", "# calculate the model output\n", "z = 
Xb.dot(w)\n", "\n", "def sigmoid(z):\n", "    return 1/(1 + np.exp(-z))\n", "\n", "Y = sigmoid(z)\n", "\n", "# calculate the cross-entropy error\n", "def cross_entropy(T, Y):\n", "    E = 0\n", "    for i in range(len(T)):\n", "        if T[i] == 1:\n", "            E -= np.log(Y[i])\n", "        else:\n", "            E -= np.log(1 - Y[i])\n", "    return E\n", "\n", "print(cross_entropy(T, Y))\n", "\n", "# try it with our closed-form solution\n", "w = np.array([0, 4, 4])\n", "\n", "# calculate the model output\n", "z = Xb.dot(w)\n", "Y = sigmoid(z)\n", "\n", "# calculate the cross-entropy error\n", "print(cross_entropy(T, Y))\n", "\n" ] }, { "cell_type": "markdown", "id": "d59897c3-6681-4095-bcf5-cccca217979f", "metadata": {}, "source": [ "### 2.2, A more complex case using sklearn, from the scikit-learn website\n", "**Multiclass sparse logistic regression on 20newsgroups**\n", "\n", "- Comparison of multinomial logistic L1 vs one-versus-rest L1 logistic regression\n", "to classify documents from the 20newsgroups dataset. Multinomial logistic\n", "regression yields more accurate results and is faster to train on the larger\n", "scale dataset.\n", "\n", "- Here we use the l1 sparsity that trims the weights of uninformative\n", "features to zero. This is good if the goal is to extract the strongly\n", "discriminative vocabulary of each class. 
If the goal is to get the best\n", "predictive accuracy, it is better to use the non sparsity-inducing l2 penalty\n", "instead.\n", "\n", "- A more traditional (and possibly better) way to predict on a sparse subset of\n", "input features would be to use univariate feature selection followed by a\n", "traditional (l2-penalised) logistic regression model.\n" ] }, { "cell_type": "code", "execution_count": 8, "id": "c876b423-9f1f-4be1-9b2f-4de6f22d7b50", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Dataset 20newsgroup, train_samples=4500, n_features=130107, n_classes=20\n", "[model=One versus Rest, solver=saga] Number of epochs: 1\n", "[model=One versus Rest, solver=saga] Number of epochs: 2\n", "[model=One versus Rest, solver=saga] Number of epochs: 3\n", "Test accuracy for model ovr: 0.5960\n", "% non-zero coefficients for model ovr, per class:\n", " [0.26593496 0.43348936 0.26362917 0.31973683 0.37815029 0.2928359\n", " 0.27054655 0.62717609 0.19522393 0.30897646 0.34586917 0.28207552\n", " 0.34125758 0.29898468 0.34279478 0.59489497 0.38353048 0.35278655\n", " 0.19829832 0.14603365]\n", "Run time (3 epochs) for model ovr:1.12\n", "[model=Multinomial, solver=saga] Number of epochs: 1\n", "[model=Multinomial, solver=saga] Number of epochs: 2\n", "[model=Multinomial, solver=saga] Number of epochs: 5\n", "Test accuracy for model multinomial: 0.6440\n", "% non-zero coefficients for model multinomial, per class:\n", " [0.36047253 0.1268187 0.10606655 0.17985197 0.5395559 0.07993421\n", " 0.06686804 0.21443888 0.11528972 0.2075215 0.10914094 0.11144673\n", " 0.13988486 0.09684337 0.26286057 0.11682692 0.55800226 0.17370318\n", " 0.11452112 0.14603365]\n", "Run time (5 epochs) for model multinomial:0.96\n", "Example run in 103.000 s\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Author: Arthur Mensch\n", "\n", "import timeit\n", "import warnings\n", "\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "\n", "from sklearn.datasets import fetch_20newsgroups_vectorized\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.exceptions import ConvergenceWarning\n", "\n", "warnings.filterwarnings(\"ignore\", category=ConvergenceWarning, module=\"sklearn\")\n", "t0 = timeit.default_timer()\n", "\n", "# We use SAGA solver\n", "solver = \"saga\"\n", "\n", "# Turn down for faster run time\n", "n_samples = 5000\n", "\n", "X, y = fetch_20newsgroups_vectorized(subset=\"all\", return_X_y=True)\n", "X = X[:n_samples]\n", "y = y[:n_samples]\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " X, y, random_state=42, stratify=y, test_size=0.1\n", ")\n", "train_samples, n_features = X_train.shape\n", "n_classes = np.unique(y).shape[0]\n", "\n", "print(\n", " \"Dataset 20newsgroup, train_samples=%i, n_features=%i, n_classes=%i\"\n", " % (train_samples, n_features, n_classes)\n", ")\n", "\n", "models = {\n", " \"ovr\": {\"name\": \"One versus Rest\", \"iters\": [1, 2, 3]},\n", " \"multinomial\": {\"name\": \"Multinomial\", \"iters\": [1, 2, 5]},\n", "}\n", "\n", "for model in models:\n", " # Add initial chance-level values for plotting purpose\n", " accuracies = [1 / n_classes]\n", " times = [0]\n", " densities = [1]\n", "\n", " model_params = models[model]\n", "\n", " # Small number of epochs for fast runtime\n", " for this_max_iter in model_params[\"iters\"]:\n", " print(\n", " \"[model=%s, solver=%s] Number of epochs: %s\"\n", " % (model_params[\"name\"], solver, this_max_iter)\n", " )\n", " lr = LogisticRegression(\n", " solver=solver,\n", " multi_class=model,\n", " penalty=\"l1\",\n", " max_iter=this_max_iter,\n", " random_state=42,\n", " )\n", " t1 = 
timeit.default_timer()\n", " lr.fit(X_train, y_train)\n", " train_time = timeit.default_timer() - t1\n", "\n", " y_pred = lr.predict(X_test)\n", " accuracy = np.sum(y_pred == y_test) / y_test.shape[0]\n", " density = np.mean(lr.coef_ != 0, axis=1) * 100\n", " accuracies.append(accuracy)\n", " densities.append(density)\n", " times.append(train_time)\n", " models[model][\"times\"] = times\n", " models[model][\"densities\"] = densities\n", " models[model][\"accuracies\"] = accuracies\n", " print(\"Test accuracy for model %s: %.4f\" % (model, accuracies[-1]))\n", " print(\n", " \"%% non-zero coefficients for model %s, per class:\\n %s\"\n", " % (model, densities[-1])\n", " )\n", " print(\n", " \"Run time (%i epochs) for model %s:%.2f\"\n", " % (model_params[\"iters\"][-1], model, times[-1])\n", " )\n", "\n", "fig = plt.figure()\n", "ax = fig.add_subplot(111)\n", "\n", "for model in models:\n", " name = models[model][\"name\"]\n", " times = models[model][\"times\"]\n", " accuracies = models[model][\"accuracies\"]\n", " ax.plot(times, accuracies, marker=\"o\", label=\"Model: %s\" % name)\n", " ax.set_xlabel(\"Train time (s)\")\n", " ax.set_ylabel(\"Test accuracy\")\n", "ax.legend()\n", "fig.suptitle(\"Multinomial vs One-vs-Rest Logistic L1\\nDataset %s\" % \"20newsgroups\")\n", "fig.tight_layout()\n", "fig.subplots_adjust(top=0.85)\n", "run_time = timeit.default_timer() - t0\n", "print(\"Example run in %.3f s\" % run_time)\n", "plt.show()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 5 }