{
"cells": [
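{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exercise - Traffic Accident Claim Review Prediction\n",
"\n",
"\n",
"Competition link: http://sofasofa.io/competition.php?id=2\n",
"\n",
"\n",
"* Task type: binary classification\n",
"\n",
"* Background: After a minor traffic collision, a claims adjuster visits the scene to inspect it and collect information, which often affects whether the vehicle owner receives compensation from the insurer. The training set contains 36 pieces of information (already encoded) that adjusters collected on site about the party involved, along with whether that party was ultimately compensated. Our task is to predict, from these 36 features, the probability that the party was not compensated.\n",
"\n",
"* Data: the training set contains 200,000 samples and the prediction set contains 80,000 samples.  \n",
"![data_description](images/data_description.png)\n",
"\n",
"* Evaluation metric: Precision-Recall AUC\n"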
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Demo code\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import os\n",
"import matplotlib.pyplot as plt\n",
"%matplotlib inline"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [],
"source": [
"# read data\n",
"homePath = \"data\"\n",
"trainPath = os.path.join(homePath, \"train.csv\")\n",
"testPath = os.path.join(homePath, \"test.csv\")\n",
"submitPath = os.path.join(homePath, \"sample_submit.csv\")\n",
"trainData = pd.read_csv(trainPath)\n",
"testData = pd.read_csv(testPath)\n",
"submitData = pd.read_csv(submitPath)"
]
},
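{
"cell_type": "markdown",
"metadata": {},
"source": [
"According to the data description, the `CaseId` column is an identifier with no predictive meaning, so we drop it here.\n",
"\n",
"`drop()`: `axis` specifies which axis to drop along (0 for rows, 1 for columns); `inplace` specifies whether to modify the original data directly.\n"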
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [],
"source": [
"# Drop the meaningless ID column\n",
"trainData.drop(\"CaseId\", axis=1, inplace=True)\n",
"testData.drop(\"CaseId\", axis=1, inplace=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# A quick look at the data\n",
"\n",
"`head()`: shows the first 5 rows by default; pass a number to show more, for example `.head(15)` shows the first 15 rows.\n"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Q1</th>\n",
" <th>Q2</th>\n",
" <th>Q3</th>\n",
" <th>Q4</th>\n",
" <th>Q5</th>\n",
" <th>Q6</th>\n",
" <th>Q7</th>\n",
" <th>Q8</th>\n",
" <th>Q9</th>\n",
" <th>Q10</th>\n",
" <th>...</th>\n",
" <th>Q28</th>\n",
" <th>Q29</th>\n",
" <th>Q30</th>\n",
" <th>Q31</th>\n",
" <th>Q32</th>\n",
" <th>Q33</th>\n",
" <th>Q34</th>\n",
" <th>Q35</th>\n",
" <th>Q36</th>\n",
" <th>Evaluation</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>1</td>\n",
" <td>4</td>\n",
" <td>2</td>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>5</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>6</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>7</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>2</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>8</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>9</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>10</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>1</td>\n",
" <td>6</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>15 rows × 37 columns</p>\n",
"</div>"
],
"text/plain": [
" Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 ... Q28 Q29 Q30 Q31 \\\n",
"0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 \n",
"1 0 0 0 0 0 0 0 0 0 0 ... 0 1 1 1 \n",
"2 0 0 0 0 0 0 0 1 0 0 ... 1 2 2 2 \n",
"3 0 0 0 0 0 0 0 0 0 0 ... 1 3 2 3 \n",
"4 0 0 0 0 0 0 0 0 0 0 ... 1 4 2 4 \n",
"5 0 0 0 0 0 0 0 0 0 0 ... 1 2 3 5 \n",
"6 0 0 0 0 0 0 0 0 0 1 ... 0 3 1 6 \n",
"7 0 0 0 0 0 0 0 0 0 0 ... 1 3 1 3 \n",
"8 0 0 0 0 0 0 0 2 0 0 ... 0 2 1 2 \n",
"9 0 0 0 0 0 0 0 0 0 0 ... 0 2 1 7 \n",
"10 0 0 0 0 0 0 0 0 0 0 ... 2 5 0 8 \n",
"11 0 0 0 0 0 0 0 0 0 0 ... 0 2 1 1 \n",
"12 1 0 0 0 0 0 0 0 0 0 ... 3 3 3 9 \n",
"13 0 0 0 0 0 0 0 0 0 0 ... 0 1 1 10 \n",
"14 0 0 0 0 0 0 0 3 0 0 ... 1 6 1 2 \n",
"\n",
" Q32 Q33 Q34 Q35 Q36 Evaluation \n",
"0 0 0 0 0 0 0 \n",
"1 1 0 0 0 0 0 \n",
"2 1 0 0 0 0 0 \n",
"3 1 0 0 1 1 0 \n",
"4 1 0 0 1 1 0 \n",
"5 1 0 0 0 0 0 \n",
"6 1 0 0 1 1 1 \n",
"7 1 0 0 1 1 1 \n",
"8 1 0 0 0 0 0 \n",
"9 1 0 0 0 0 0 \n",
"10 1 0 0 1 1 0 \n",
"11 1 0 0 0 0 0 \n",
"12 1 0 0 1 1 0 \n",
"13 1 0 0 0 0 0 \n",
"14 1 0 0 1 1 0 \n",
"\n",
"[15 rows x 37 columns]"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"trainData.head(15)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Show a concise summary of the data: how many non-null values each column has and the data type of each column.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 200000 entries, 0 to 199999\n",
"Data columns (total 37 columns):\n",
"Q1 200000 non-null int64\n",
"Q2 200000 non-null int64\n",
"Q3 200000 non-null int64\n",
"Q4 200000 non-null int64\n",
"Q5 200000 non-null int64\n",
"Q6 200000 non-null int64\n",
"Q7 200000 non-null int64\n",
"Q8 200000 non-null int64\n",
"Q9 200000 non-null int64\n",
"Q10 200000 non-null int64\n",
"Q11 200000 non-null int64\n",
"Q12 200000 non-null int64\n",
"Q13 200000 non-null int64\n",
"Q14 200000 non-null int64\n",
"Q15 200000 non-null int64\n",
"Q16 200000 non-null int64\n",
"Q17 200000 non-null int64\n",
"Q18 200000 non-null int64\n",
"Q19 200000 non-null int64\n",
"Q20 200000 non-null int64\n",
"Q21 200000 non-null int64\n",
"Q22 200000 non-null int64\n",
"Q23 200000 non-null int64\n",
"Q24 200000 non-null int64\n",
"Q25 200000 non-null int64\n",
"Q26 200000 non-null int64\n",
"Q27 200000 non-null int64\n",
"Q28 200000 non-null int64\n",
"Q29 200000 non-null int64\n",
"Q30 200000 non-null int64\n",
"Q31 200000 non-null int64\n",
"Q32 200000 non-null int64\n",
"Q33 200000 non-null int64\n",
"Q34 200000 non-null int64\n",
"Q35 200000 non-null int64\n",
"Q36 200000 non-null int64\n",
"Evaluation 200000 non-null int64\n",
"dtypes: int64(37)\n",
"memory usage: 56.5 MB\n"
]
}
],
"source": [
"trainData.info()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`hist()`: plots a histogram of each numeric column; the `figsize` parameter sets the size of the output figure.\n"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1440x1440 with 42 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"trainData.hist(figsize=(20, 20))\n"
]
},
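{
"cell_type": "markdown",
"metadata": {},
"source": [
"To understand how the features correlate, compute the correlation matrix, then sort the correlations with respect to one column (here the label `Evaluation`).\n",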
"\n"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Evaluation 1.000000\n",
"Q28 0.410700\n",
"Q30 0.324421\n",
"Q36 0.302709\n",
"Q35 0.224996\n",
"Q34 0.152743\n",
"Q32 0.049397\n",
"Q21 0.034897\n",
"Q33 0.032248\n",
"Q13 0.023603\n",
"Q8 0.021922\n",
"Q19 0.019694\n",
"Q20 0.013903\n",
"Q4 0.011626\n",
"Q27 0.004262\n",
"Q23 0.002898\n",
"Q7 0.001143\n",
"Q31 -0.000036\n",
"Q14 -0.000669\n",
"Q29 -0.002014\n",
"Q10 -0.002711\n",
"Q12 -0.005287\n",
"Q1 -0.006511\n",
"Q16 -0.007184\n",
"Q18 -0.007643\n",
"Q26 -0.008188\n",
"Q11 -0.009252\n",
"Q24 -0.010891\n",
"Q22 -0.011821\n",
"Q25 -0.012660\n",
"Q6 -0.016072\n",
"Q2 -0.018307\n",
"Q15 -0.019570\n",
"Q9 -0.021261\n",
"Q5 -0.023893\n",
"Q3 -0.026349\n",
"Q17 -0.028461\n",
"Name: Evaluation, dtype: float64"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"corr_matrix = trainData.corr()\n",
"corr_matrix[\"Evaluation\"].sort_values(ascending=False) # ascending=False sorts in descending order"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Separate the labels from the training set"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [],
"source": [
"y = trainData['Evaluation']\n",
"trainData.drop(\"Evaluation\", axis=1, inplace=True)"
]
},
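{
"cell_type": "markdown",
"metadata": {},
"source": [
"Train a model with K-Means\n",
"\n",
"`KMeans()` parameters:\n",
"* `n_clusters`: the number of clusters to form;\n",
"* `init`: the method used to initialize the centroids. The default is `k-means++` rather than the random sampling of classic K-means; you can set it to `random` to use random initialization;\n",
"* `n_jobs`: the number of CPU cores to use; -1 uses all of them. Note that this parameter was deprecated in scikit-learn 0.23 and removed in 1.0."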
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [],
"source": [
"# do k-means\n",
"from sklearn.cluster import KMeans\n",
"# n_jobs is omitted: KMeans dropped that parameter in scikit-learn 1.0\n",
"est = KMeans(n_clusters=2, init=\"k-means++\")\n",
"# K-Means is unsupervised, so fit() does not use the labels\n",
"est.fit(trainData)\n",
"\n",
"y_train = est.predict(trainData)\n",
"y_pred = est.predict(testData)\n",
"\n",
"# Save the predictions\n",
"submitData['Evaluation'] = y_pred\n",
"submitData.to_csv(\"submit_data.csv\", index=False)"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"acc_train = 0.682140\n"
]
}
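],
"source": [
"# calculate accuracy\n",
"# cluster ids are arbitrary, so a score below 0.5 means the flipped assignment matches the labels better\n",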
"from sklearn.metrics import accuracy_score\n",
"\n",
"acc_train = accuracy_score(y, y_train)\n",
"print(\"acc_train = %f\" % (acc_train))"
]
},
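{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Random Forest\n",
"\n",
"K-means is an unsupervised clustering method that ignores the labels, so its results may be less than ideal. On the competition website the organizers provide two benchmark models, of which the random forest performs best. The code is given below so that readers can test it themselves.\n",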
"\n"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.metrics import accuracy_score\n",
"\n",
"# Read the data\n",
"train = pd.read_csv(\"data/train.csv\")\n",
"test = pd.read_csv(\"data/test.csv\")\n",
"submit = pd.read_csv(\"data/sample_submit.csv\")\n",
"\n",
"# Drop the ID column\n",
"train.drop('CaseId', axis=1, inplace=True)\n",
"test.drop('CaseId', axis=1, inplace=True)\n",
"\n",
"# Extract the training labels\n",
"y_train = train.pop('Evaluation')\n",
"\n",
"# Build the random forest model\n",
"clf = RandomForestClassifier(n_estimators=100, random_state=0)\n",
"clf.fit(train, y_train)\n",
"y_pred = clf.predict_proba(test)[:, 1]\n",
"\n",
"# Write the predictions to my_RF_prediction.csv\n",
"submit['Evaluation'] = y_pred\n",
"submit.to_csv('my_RF_prediction.csv', index=False)\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[0.00177294 0.00207449 0.00187096 0.00471492 0.00443815 0.0029538\n",
" 0.00364967 0.00652341 0.00235713 0.00739511 0.00245106 0.00106103\n",
" 0.0007513 0.00090631 0.00150727 0.0037793 0.00183821 0.00196833\n",
" 0.00209665 0.00726069 0.00816243 0.00107563 0.00559247 0.00766561\n",
" 0.00760666 0.00028462 0.00025573 0.18472067 0.25559838 0.21436631\n",
" 0.0425301 0.00662325 0.00297955 0.03148822 0.03907383 0.13060584]\n",
"[0 0 0 ... 0 0 0]\n",
"acc_train = 0.931525\n"
]
}
],
"source": [
"# feature importances\n",
"print(clf.feature_importances_)\n",
"\n",
"# Train accuracy\n",
"from sklearn.metrics import accuracy_score\n",
"y_train_pred = clf.predict(train)\n",
"print(y_train_pred)\n",
"\n",
"acc_train = accuracy_score(y_train, y_train_pred)\n",
"print(\"acc_train = %f\" % (acc_train))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
},
"main_language": "python"
},
"nbformat": 4,
"nbformat_minor": 2
}