beppe.vanrolleghem
2019-03-15 14:51:38 +01:00
parent 209fee95e3
commit 24cd306c29
2 changed files with 284 additions and 9 deletions

4/Labo4.dasnogneboyipynb Normal file

@@ -0,0 +1,195 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Lab 4 Data Science: pandas"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1. Pandas Series"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__Exercise 1__ \n",
"* Create series 1 from a list (containing all letters of the alphabet)\n",
"* Create series 2 from a numpy array (containing the numbers 1 to 26)\n",
"* Create series 3 from a dict (with the numbers 1 to 26 as keys and the letters of the alphabet as values)\n",
"\n",
"Give the index and the data values of these series a name (name attribute) and print the first and last elements of each series.\n",
"Then also give:\n",
"\n",
"* all elements of series 1 at positions $0, 4, 8, 14$ and $20$\n",
"* all elements of series 1 that are also present in series 3\n",
"* all elements of series 2 that are not present in series 3\n",
"* every value of series 2, doubled\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__Exercise 2__ \n",
"\n",
"Given a series filled with 10 pieces of fruit drawn from the kinds apple, lemon, banana and kiwi:\n",
"\n",
"```\n",
"fruit = pd.Series(np.random.choice(['apple','lemon', 'banana','kiwi'], 10))\n",
"```\n",
"Generate a weight for each element in the series, then give the mean weight per kind of fruit. Consider the `groupby` function, which can be applied to a series. Use one of the numpy `random` methods to generate the 10 random weights."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2. Pandas Data Frames"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df1 = pd.concat([ser2, ser3], axis=0)\n",
"print(df1)\n",
"print()\n",
"\n",
"df2 = pd.DataFrame({'col1': ser1, 'col2': ser2})\n",
"print(df2.head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__Exercise 3__\n",
"Given 3 csv files: _ratings.csv_, _payments.csv_ and _genres.csv_. \n",
"\n",
"* Load these files into 3 separate data frames. Inspect the data in these frames using, among others, `head()`, `tail()`, `describe()` and `info()`\n",
"\n",
"* Plot the top 10 restaurant genres, i.e. which types of restaurant are most represented in this data; show only the top 10. Put the types on the x-axis and the counts on the y-axis. (_Tip_: use the `value_counts` method) \n",
"\n",
"* Plot the payment methods against how often each one is used in the data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__Exercise 4__\n",
"Merge the 3 data frames above into 1 data frame. Make sure that all user ratings are included in this table."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__Exercise 5__ Data cleaning: detecting missing data\n",
"\n",
"Check how much data is missing in your merged result (number of NAs). Create a new data frame for this. The first column is the number of NAs for each column of your merged result. The second column is the percentage of NAs. _Tip:_ use the `isnull` method.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__Exercise 5 (continued)__ \n",
"\n",
"The _Upayment_ column has a negligible percentage of missing values. Set these to the default payment method, 'cash', using the `fillna` method.\n",
"\n",
"The _Rtype_ column, however, has a high percentage of missing data. You may remove the rows where this is the case."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__Exercise 6__ Data grouping: \n",
"\n",
"Group your cleaned data table by restaurant type. Then build a data table that shows the ratings per restaurant type, sorted from high to low.\n",
"\n",
"For each restaurant type, plot the ratings in a bar plot.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
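
Exercise 1 asks for pandas Series built from a list, a numpy array and a dict. The following is a minimal sketch of one way to approach it, not the reference solution. The variable names (`ser1`, `ser2`, `ser3`) and the `name` values are assumptions chosen for illustration, and "present in series 3" is interpreted here as "present among the values of series 3".

```python
import string

import numpy as np
import pandas as pd

letters = list(string.ascii_lowercase)

# Series 1 from a list, series 2 from a numpy array, series 3 from a dict
# (numbers 1..26 as keys, letters as values).
ser1 = pd.Series(letters, name="letter")
ser2 = pd.Series(np.arange(1, 27), name="number")
ser3 = pd.Series(dict(zip(range(1, 27), letters)), name="letter_by_number")

# Name the index of each series in addition to its values (illustrative names).
ser1.index.name = "position"
ser2.index.name = "position"
ser3.index.name = "number"

# First and last elements of one series; the others work the same way.
print(ser1.head(), ser1.tail(), sep="\n")

# Elements of series 1 at positions 0, 4, 8, 14 and 20.
selected = ser1.iloc[[0, 4, 8, 14, 20]]

# Elements of series 1 that also occur among the values of series 3.
in_both = ser1[ser1.isin(ser3)]

# Elements of series 2 that do not occur among the values of series 3.
only_in_ser2 = ser2[~ser2.isin(ser3)]

# Double every value of series 2 (vectorised, no explicit loop).
doubled = ser2 * 2
```

Because `isin` compares values and ignores the index, comparing the integers of `ser2` with the letters of `ser3` yields no matches, so every element of `ser2` is kept.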

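Exercise 5 and its continuation describe a missing-data summary followed by `fillna` and row removal. The sketch below illustrates those steps on a small made-up data frame. Only the column names `Upayment` and `Rtype` come from the exercise text; `userID`, `rating`, the sample values and the variable names are hypothetical stand-ins for the real merged result.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the merged data frame (made-up sample values).
merged = pd.DataFrame({
    "userID": ["U1", "U2", "U3", "U4"],
    "rating": [2, 1, 0, 2],
    "Upayment": ["cash", np.nan, "VISA", "cash"],
    "Rtype": ["Mexican", np.nan, np.nan, "Bar"],
})

# One row per column of `merged`: absolute NA count and NA percentage.
na_count = merged.isnull().sum()
na_summary = pd.DataFrame({
    "na_count": na_count,
    "na_percent": 100 * na_count / len(merged),
})
print(na_summary)

# Fill the missing payment methods with the default 'cash', then drop
# the rows that have no restaurant type.
cleaned = merged.copy()
cleaned["Upayment"] = cleaned["Upayment"].fillna("cash")
cleaned = cleaned.dropna(subset=["Rtype"])
```

Working on a copy keeps `merged` untouched, so the summary can be recomputed on `cleaned` to confirm that no NAs remain in the two columns.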

@@ -35,10 +35,57 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 60,
"metadata": {},
"outputs": [],
"source": []
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']\n",
"[ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24\n",
" 25 26]\n",
"{'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 6, 'g': 7, 'h': 8, 'i': 9, 'j': 10, 'k': 11, 'l': 12, 'm': 13, 'n': 14, 'o': 15, 'p': 16, 'q': 17, 'r': 18, 's': 19, 't': 20, 'u': 21, 'v': 22, 'w': 23, 'x': 24, 'y': 25, 'z': 26}\n",
"a,e,i,o,u\n",
"[]\n",
"[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26]\n",
"[ 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48\n",
" 50 52]\n"
]
}
],
"source": [
"import numpy as np\n",
"\n",
"alfabet = list(\"abcdefghijklmnopqrstuvwxyz\")\n",
"print(alfabet)\n",
"f = np.array(range(1,27))\n",
"print(f)\n",
"dic = {}\n",
"for i in alfabet:\n",
" dic[alfabet.index(i)+1] = i\n",
"\n",
"dic = dict(zip(alfabet,f)) # shorter construction; note that it maps letter -> number, the reverse of the loop above\n",
"print(dic)\n",
"\n",
"\n",
"print(\"{},{},{},{},{}\".format(alfabet[0],alfabet[4],alfabet[8],alfabet[14],alfabet[20]))\n",
"\n",
"\n",
"print([a for a in alfabet if a in dic.values()])\n",
"\n",
"\n",
"print([a for a in f if a not in dic.keys()])\n",
"\n",
"\n",
"\n",
"print(f * 2)\n",
"\n",
"\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
@@ -56,10 +103,31 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 59,
"metadata": {},
"outputs": [],
"source": []
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" gewicht\n",
"fruit \n",
"apple 0.650914\n",
"banana 0.241583\n",
"kiwi 0.493209\n",
"lemon 0.564981\n"
]
}
],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"s1 = pd.Series(np.random.choice(['apple','lemon', 'banana','kiwi'], 10))\n",
"s2 = pd.Series(np.random.rand(10))\n",
"f = pd.concat([s1, s2], keys=['fruit','gewicht'], axis=1)\n",
"print(f.groupby(['fruit']).mean())\n",
"\n"
]
},
{
"cell_type": "markdown",
@@ -70,9 +138,21 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 34,
"metadata": {},
"outputs": [],
"outputs": [
{
"ename": "NameError",
"evalue": "name 'ser2' is not defined",
"output_type": "error",
"traceback": [
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[1;31mNameError\u001b[0m Traceback (most recent call last)",
"\u001b[1;32m<ipython-input-34-97a3c6f78260>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m()\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0mdf1\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mpd\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mconcat\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m[\u001b[0m\u001b[0mser2\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mser3\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;36m0\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 2\u001b[0m \u001b[0mprint\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mdf1\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 3\u001b[0m \u001b[0mprint\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 4\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 5\u001b[0m \u001b[0mdf2\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mpd\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mDataFrame\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m{\u001b[0m\u001b[1;34m'col1'\u001b[0m\u001b[1;33m:\u001b[0m \u001b[0mser1\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;34m'col2'\u001b[0m\u001b[1;33m:\u001b[0m \u001b[0mser2\u001b[0m\u001b[1;33m}\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;31mNameError\u001b[0m: name 'ser2' is not defined"
]
}
],
"source": [
"df1 = pd.concat([ser2, ser3], axis=0)\n",
"print(df1)\n",
@@ -187,7 +267,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
"version": "3.6.5"
}
},
"nbformat": 4,