{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lab 5 Data Science: K-NN" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Exercise 1__ K-NN classification\n", "\n", "Given the following small dataset:\n", "\n", "| name | sweetness | crunchiness | type |\n", "|-------------|:--------:|:-------------:|---------:|\n", "| grapefruit | 8 | 5 | fruit |\n", "| green bean | 3 | 7 | vegetable |\n", "| nut | 3 | 6 | protein |\n", "| orange | 7 | 3 | fruit |\n", "\n", "We now want to decide, for 2 unknown ingredients, which category they belong to: _fruit, vegetable or protein_. These ingredients are:\n", "\n", "| name | sweetness | crunchiness | type |\n", "|-----------|:--------:|:-------------:|---------:|\n", "| tomato | 6 | 4 | ? |\n", "| carrot | 4 | 9 | ? |\n", "\n", "Use K-NN to perform this classification.\n", "\n", "* First do this visually: plot the training and test data in the plane and determine the classification by eye. Give data from the same class the same colour.\n", "\n", "* Then use the KNeighborsClassifier from the sklearn module to make the predictions. Do this first for $k=1$ and then for $k=4$. 
Can you explain these classifications logically?\n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'names': ['displacement', 'horsepower', 'mpg'], 'dataset': [['307', '130', '18'], ['350', '165', '15'], ['318', '150', '18'], ['304', '150', '16'], ['302', '140', '17'], ['429', '198', '15'], ['454', '220', '14'], ['440', '215', '14'], ['455', '225', '14'], ['390', '190', '15'], ['383', '170', '15'], ['340', '160', '14'], ['400', '150', '15'], ['455', '225', '14'], ['113', '95', '24'], ['198', '95', '22'], ['199', '97', '18'], ['200', '85', '21'], ['97', '88', '27'], ['97', '46', '26'], ['110', '87', '25'], ['107', '90', '24'], ['104', '95', '25'], ['121', '113', '26'], ['199', '90', '21'], ['360', '215', '10'], ['307', '200', '10'], ['318', '210', '11'], ['304', '193', '9'], ['97', '88', '27'], ['140', '90', '28'], ['113', '95', '25'], ['232', '100', '19'], ['225', '105', '16'], ['250', '100', '17'], ['250', '88', '19'], ['232', '100', '18'], ['350', '165', '14'], ['400', '175', '14'], ['351', '153', '14'], ['318', '150', '14'], ['383', '180', '12'], ['400', '170', '13'], ['400', '175', '13'], ['258', '110', '18'], ['140', '72', '22'], ['250', '100', '19'], ['250', '88', '18'], ['122', '86', '23'], ['116', '90', '28'], ['79', '70', '30'], ['88', '76', '30'], ['71', '65', '31'], ['72', '69', '35'], ['97', '60', '27'], ['91', '70', '26'], ['113', '95', '24'], ['97.5', '80', '25'], ['97', '54', '23'], ['140', '90', '20'], ['122', '86', '21'], ['350', '165', '13'], ['400', '175', '14'], ['318', '150', '15'], ['351', '153', '14'], ['304', '150', '17'], ['429', '208', '11'], ['350', '155', '13'], ['350', '160', '12'], ['400', '190', '13'], ['70', '97', '19'], ['304', '150', '15'], ['307', '130', '13'], ['302', '140', '13'], ['318', '150', '14'], ['121', '112', '18'], ['121', '76', '22'], ['120', '87', '21'], ['96', '69', '26'], ['122', '86', '22'], ['97', '92', '28'], ['120', '97', 
'23'], ['98', '80', '28'], ['97', '88', '27'], ['350', '175', '13'], ['304', '150', '14'], ['350', '145', '13'], ['302', '137', '14'], ['318', '150', '15'], ['429', '198', '12'], ['400', '150', '13'], ['351', '158', '13'], ['318', '150', '14'], ['440', '215', '13'], ['455', '225', '12'], ['360', '175', '13'], ['225', '105', '18'], ['250', '100', '16'], ['232', '100', '18'], ['250', '88', '18'], ['198', '95', '23'], ['97', '46', '26'], ['400', '150', '11'], ['400', '167', '12'], ['360', '170', '13'], ['350', '180', '12'], ['232', '100', '18'], ['97', '88', '20'], ['140', '72', '21'], ['108', '94', '22'], ['70', '90', '18'], ['122', '85', '19'], ['155', '107', '21'], ['98', '90', '26'], ['350', '145', '15'], ['400', '230', '16'], ['68', '49', '29'], ['116', '75', '24'], ['114', '91', '20'], ['121', '112', '19'], ['318', '150', '15'], ['121', '110', '24'], ['156', '122', '20'], ['350', '180', '11'], ['198', '95', '20'], ['232', '100', '19'], ['250', '100', '15'], ['79', '67', '31'], ['122', '80', '26'], ['71', '65', '32'], ['140', '75', '25'], ['250', '100', '16'], ['258', '110', '16'], ['225', '105', '18'], ['302', '140', '16'], ['350', '150', '13'], ['318', '150', '14'], ['302', '140', '14'], ['304', '150', '14'], ['98', '83', '29'], ['79', '67', '26'], ['97', '78', '26'], ['76', '52', '31'], ['83', '61', '32'], ['90', '75', '28'], ['90', '75', '24'], ['116', '75', '26'], ['120', '97', '24'], ['108', '93', '26'], ['79', '67', '31'], ['225', '95', '19'], ['250', '105', '18'], ['250', '72', '15'], ['250', '72', '15'], ['400', '170', '16'], ['350', '145', '15'], ['318', '150', '16'], ['351', '148', '14'], ['231', '110', '17'], ['250', '105', '16'], ['258', '110', '15'], ['225', '95', '18'], ['231', '110', '21'], ['262', '110', '20'], ['302', '129', '13'], ['97', '75', '29'], ['140', '83', '23'], ['232', '100', '20'], ['140', '78', '23'], ['134', '96', '24'], ['90', '71', '25'], ['119', '97', '24'], ['171', '97', '18'], ['90', '70', '29'], ['232', '90', '19'], ['115', 
'95', '23'], ['120', '88', '23'], ['121', '98', '22'], ['121', '115', '25'], ['91', '53', '33'], ['107', '86', '28'], ['116', '81', '25'], ['140', '92', '25'], ['98', '79', '26'], ['101', '83', '27'], ['305', '140', '17.5'], ['318', '150', '16'], ['304', '120', '15.5'], ['351', '152', '14.5'], ['225', '100', '22'], ['250', '105', '22'], ['200', '81', '24'], ['232', '90', '22.5'], ['85', '52', '29'], ['98', '60', '24.5'], ['90', '70', '29'], ['91', '53', '33'], ['225', '100', '20'], ['250', '78', '18'], ['250', '110', '18.5'], ['258', '95', '17.5'], ['97', '71', '29.5'], ['85', '70', '32'], ['97', '75', '28'], ['140', '72', '26.5'], ['130', '102', '20'], ['318', '150', '13'], ['120', '88', '19'], ['156', '108', '19'], ['168', '120', '16.5'], ['350', '180', '16.5'], ['350', '145', '13'], ['302', '130', '13'], ['318', '150', '13'], ['98', '68', '31.5'], ['111', '80', '30'], ['79', '58', '36'], ['122', '96', '25.5'], ['85', '70', '33.5'], ['305', '145', '17.5'], ['260', '110', '17'], ['318', '145', '15.5'], ['302', '130', '15'], ['250', '110', '17.5'], ['231', '105', '20.5'], ['225', '100', '19'], ['250', '98', '18.5'], ['400', '180', '16'], ['350', '170', '15.5'], ['400', '190', '15.5'], ['351', '149', '16'], ['97', '78', '29'], ['151', '88', '24.5'], ['97', '75', '26'], ['140', '89', '25.5'], ['98', '63', '30.5'], ['98', '83', '33.5'], ['97', '67', '30'], ['97', '78', '30.5'], ['146', '97', '22'], ['121', '110', '21.5'], ['80', '110', '21.5'], ['90', '48', '43.1'], ['98', '66', '36.1'], ['78', '52', '32.8'], ['85', '70', '39.4'], ['91', '60', '36.1'], ['260', '110', '19.9'], ['318', '140', '19.4'], ['302', '139', '20.2'], ['231', '105', '19.2'], ['200', '95', '20.5'], ['200', '85', '20.2'], ['140', '88', '25.1'], ['225', '100', '20.5'], ['232', '90', '19.4'], ['231', '105', '20.6'], ['200', '85', '20.8'], ['225', '110', '18.6'], ['258', '120', '18.1'], ['305', '145', '19.2'], ['231', '165', '17.7'], ['302', '139', '18.1'], ['318', '140', '17.5'], ['98', '68', '30'], 
['134', '95', '27.5'], ['119', '97', '27.2'], ['105', '75', '30.9'], ['134', '95', '21.1'], ['156', '105', '23.2'], ['151', '85', '23.8'], ['119', '97', '23.9'], ['131', '103', '20.3'], ['163', '125', '17'], ['121', '115', '21.6'], ['163', '133', '16.2'], ['89', '71', '31.5'], ['98', '68', '29.5'], ['231', '115', '21.5'], ['200', '85', '19.8'], ['140', '88', '22.3'], ['232', '90', '20.2'], ['225', '110', '20.6'], ['305', '130', '17'], ['302', '129', '17.6'], ['351', '138', '16.5'], ['318', '135', '18.2'], ['350', '155', '16.9'], ['351', '142', '15.5'], ['267', '125', '19.2'], ['360', '150', '18.5'], ['89', '71', '31.9'], ['86', '65', '34.1'], ['98', '80', '35.7'], ['121', '80', '27.4'], ['183', '77', '25.4'], ['350', '125', '23'], ['141', '71', '27.2'], ['260', '90', '23.9'], ['105', '70', '34.2'], ['105', '70', '34.5'], ['85', '65', '31.8'], ['91', '69', '37.3'], ['151', '90', '28.4'], ['173', '115', '28.8'], ['173', '115', '26.8'], ['151', '90', '33.5'], ['98', '76', '41.5'], ['89', '60', '38.1'], ['98', '70', '32.1'], ['86', '65', '37.2'], ['151', '90', '28'], ['140', '88', '26.4'], ['151', '90', '24.3'], ['225', '90', '19.1'], ['97', '78', '34.3'], ['134', '90', '29.8'], ['120', '75', '31.3'], ['119', '92', '37'], ['108', '75', '32.2'], ['86', '65', '46.6'], ['156', '105', '27.9'], ['85', '65', '40.8'], ['90', '48', '44.3'], ['90', '48', '43.4'], ['121', '67', '36.4'], ['146', '67', '30'], ['91', '67', '44.6'], ['97', '67', '33.8'], ['89', '62', '29.8'], ['168', '132', '32.7'], ['70', '100', '23.7'], ['122', '88', '35'], ['107', '72', '32.4'], ['135', '84', '27.2'], ['151', '84', '26.6'], ['156', '92', '25.8'], ['173', '110', '23.5'], ['135', '84', '30'], ['79', '58', '39.1'], ['86', '64', '39'], ['81', '60', '35.1'], ['97', '67', '32.3'], ['85', '65', '37'], ['89', '62', '37.7'], ['91', '68', '34.1'], ['105', '63', '34.7'], ['98', '65', '34.4'], ['98', '65', '29.9'], ['105', '74', '33'], ['107', '75', '33.7'], ['108', '75', '32.4'], ['119', '100', '32.9'], 
['120', '74', '31.6'], ['141', '80', '28.1'], ['145', '76', '30.7'], ['168', '116', '25.4'], ['146', '120', '24.2'], ['231', '110', '22.4'], ['350', '105', '26.6'], ['200', '88', '20.2'], ['225', '85', '17.6'], ['112', '88', '28'], ['112', '88', '27'], ['112', '88', '34'], ['112', '85', '31'], ['135', '84', '29'], ['151', '90', '27'], ['140', '92', '24'], ['105', '74', '36'], ['91', '68', '37'], ['91', '68', '31'], ['105', '63', '38'], ['98', '70', '36'], ['120', '88', '36'], ['107', '75', '36'], ['108', '70', '34'], ['91', '67', '38'], ['91', '67', '32'], ['91', '67', '38'], ['181', '110', '25'], ['262', '85', '38'], ['156', '92', '26'], ['232', '112', '22'], ['144', '96', '32'], ['135', '84', '36'], ['151', '90', '27'], ['140', '86', '27'], ['97', '52', '44'], ['135', '84', '32'], ['120', '79', '28'], ['119', '82', '31']]}\n" ] } ], "source": [ "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.model_selection import train_test_split\n", "\n", "lines = []\n", "with open(\"auto.csv\", \"r\") as infile:\n", " lines = infile.readlines()\n", "\n", "data = {}\n", "data[\"names\"] = lines.pop(0).replace(\"\\n\", '').split(\",\")\n", "data[\"dataset\"] = []\n", "for line in lines:\n", " data[\"dataset\"] += [line.replace('\\n', '').split(',')]\n", "print(data)\n", "\n", "#xtrain, xtest, ytrain, ytest = train_test_split(, random_state=0)\n", "\n", "#knn = KNeighborsClassifier(n_neighbors=1)\n", "#knn.fit(xtrain, ytrain)\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Exercise 2** K-NN regression: The goal is to build a model that predicts a car's fuel economy (in miles per gallon) given its engine displacement and horsepower.\n", "\n", "1. Read the file _auto.csv_ into a dataframe. Check what exactly this data contains and how large it is.\n", "\n", "2. Use the $train\\_test\\_split$ method to split your data into training and test data. 
Use 30% of the data as test data and 70% as training data.\n", "\n", "3. First determine what a good value for $k$ would be in this case. Use the elbow method for this. As the error, compute the mean squared error (mse) for each value of $k$ you test. Plot the _elbow_ in a graph (i.e. for each tested $k$, the corresponding mse). Take odd values of $k$, as follows: `k_waarden = np.arange(1, 20, 2)`\n", "The mse is the mean of the squared differences between each predicted value and its actual value:\n", " \n", " \\begin{equation}\n", " mse = \\frac{1}{len(testset)} \\sum_i (y_{i\\;predicted} - y_{i\\;expected})^2\n", " \\end{equation}\n", " Fortunately, Python can also compute this for you, via the $mean\\_squared\\_error$ function from $sklearn.metrics$.\n", "Also test the effect of the $random\\_state$ parameter in your call to the $train\\_test\\_split$ method that you used above to generate your test set.\n", "\n", "4. Continue with the value of $k$ that gives the minimal error in your graph. Train your model and compute the score (for a regressor, the $R^2$ value returned by its `score` method, since accuracy is only defined for classification) and the mse on your test set. Make a plot showing the predicted and actual mpg values for the test data. For the x-axis, simply use range(1,119), a numbering of the elements in your test set.\n", "\n" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "392\n", "392\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/beppe/.local/lib/python3.7/site-packages/sklearn/utils/validation.py:563: FutureWarning: Beginning in version 0.22, arrays of bytes/strings will be converted to decimal numbers if dtype='numeric'. 
It is recommended that you convert the array to a float dtype before using it in scikit-learn, for example by using your_array = your_array.astype(np.float64).\n", " FutureWarning)\n", "/home/beppe/.local/lib/python3.7/site-packages/sklearn/utils/validation.py:563: FutureWarning: Beginning in version 0.22, arrays of bytes/strings will be converted to decimal numbers if dtype='numeric'. It is recommended that you convert the array to a float dtype before using it in scikit-learn, for example by using your_array = your_array.astype(np.float64).\n", " FutureWarning)\n" ] }, { "data": { "text/plain": [ "KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',\n", " metric_params=None, n_jobs=None, n_neighbors=1, p=2,\n", " weights='uniform')" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.model_selection import train_test_split\n", "\n", "# Read the raw CSV lines; the first line holds the column names.\n", "lines = []\n", "with open(\"auto.csv\", \"r\") as infile:\n", "    lines = infile.readlines()\n", "\n", "data = {}\n", "data[\"names\"] = lines.pop(0).replace(\"\\n\", '').split(\",\")\n", "data[\"dataset\"] = []\n", "for line in lines:\n", "    data[\"dataset\"] += [line.replace('\\n', '').split(',')]\n", "#print(data)\n", "\n", "# Split each row into features (displacement, horsepower) and target (mpg).\n", "# The values are still strings here, which triggers the FutureWarning in the output.\n", "setjen = []\n", "target = []\n", "for line in data[\"dataset\"]:\n", "    target.append(line.pop())\n", "    setjen += [line]\n", "\n", "#print(len(target))\n", "#print(len(setjen))\n", "xtrain, xtest, ytrain, ytest = train_test_split(setjen, target, random_state=0)\n", "\n", "# 1-nearest-neighbour classifier; each distinct mpg string is treated as a class label.\n", "knn = KNeighborsClassifier(n_neighbors=1)\n", "knn.fit(xtrain, ytrain)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": 
"text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }