Add some references

dev
Shuhui Bu 7 years ago
parent 881a67acf5
commit 2a37c01599

@@ -20,13 +20,20 @@ git pull upstream master
## Homework
1. [Python basics](homework_01_python/README.md)
2. [numpy & matplotlib](homework_02_numpy_matplotlib/README.md)
## Reports
1. [Traffic accident claim approval prediction](report_01_交通事故理赔审核预测/)
3. [Titanic](report_03_Titanic/)
## Help
* Git
  * [How to submit homework via Gitee](help/gitee_homework_usage.md)
  * [Git tutorial (PILAB)](help/Git使用教程_PILAB.pdf)
  * [Git quick start - a first taste of Git](https://my.oschina.net/dxqr/blog/134811)
  * [Basic Git operations with TortoiseGit on Windows 7](https://my.oschina.net/longxuu/blog/141699)
  * [Learning Git systematically - Liao Xuefeng's Git tutorial](https://my.oschina.net/dxqr/blog/134811)
* Markdown
  * [Markdown - a beginner's guide](https://www.jianshu.com/p/1e402922ee32)

@@ -0,0 +1,66 @@
## 1. Numerical computing with numpy
### 1. How do you add a border of zeros around an existing array?
For example, turn the 2-D matrix
```
10, 34, 54, 23
31, 87, 53, 68
98, 49, 25, 11
84, 32, 67, 88
```
into
```
0, 0, 0, 0, 0, 0
0, 10, 34, 54, 23, 0
0, 31, 87, 53, 68, 0
0, 98, 49, 25, 11, 0
0, 84, 32, 67, 88, 0
0, 0, 0, 0, 0, 0
```
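One possible approach (a sketch using `np.pad`; indexing-based solutions work too):
```python
import numpy as np

a = np.array([[10, 34, 54, 23],
              [31, 87, 53, 68],
              [98, 49, 25, 11],
              [84, 32, 67, 88]])

# Add one row/column of zeros on every side
padded = np.pad(a, pad_width=1, mode='constant', constant_values=0)
print(padded)
```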
### 2. Create a 5x5 matrix and place the values 1, 2, 3, 4 just below its diagonal
### 3. Create an 8x8 matrix with a chessboard pattern (use 0 for black and 1 for white)
### 4. Solve a system of linear equations
Given a system of equations, how do you find its solution? There are several methods; analyze the advantages and disadvantages of each (the simplest is Gaussian elimination).
For example:
```
3x + 4y + 2z = 10
5x + 3y + 4z = 14
8x + 2y + 7z = 20
```
Write a program that solves it.
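For reference, a sketch of one approach using numpy's built-in direct solver (elimination by hand or matrix inversion are alternatives whose pros and cons you should compare):
```python
import numpy as np

# Coefficient matrix and right-hand side of the example system
A = np.array([[3, 4, 2],
              [5, 3, 4],
              [8, 2, 7]], dtype=float)
b = np.array([10, 14, 20], dtype=float)

x = np.linalg.solve(A, b)  # LU-factorization-based direct solve
print(x)                   # solution vector [x, y, z]
```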
### 5. Reverse an array (the first element becomes the last)
### 6. Generate a 10x10 array of random numbers and find its maximum and minimum values
## 2. Plotting with Matplotlib
### 1. Plot a quadratic function together with the trapezoids used when integrating it with the trapezoidal rule
For example:
![matplot_ex1](images/matplot_ex1.png)
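A minimal sketch of this kind of figure (the quadratic, interval, and number of trapezoids are arbitrary choices for illustration):
```python
import numpy as np
import matplotlib.pyplot as plt

f = lambda x: x**2 + 1          # example quadratic
a, b, n = 0.0, 4.0, 8           # integration interval and number of trapezoids
nodes = np.linspace(a, b, n + 1)

# Smooth curve of the function
x_fine = np.linspace(a, b, 200)
plt.plot(x_fine, f(x_fine), 'b-', label='f(x)')

# Each trapezoid drawn as an outlined polygon between consecutive nodes
for x0, x1 in zip(nodes[:-1], nodes[1:]):
    plt.fill([x0, x0, x1, x1], [0, f(x0), f(x1), 0],
             edgecolor='r', facecolor='none')

plt.title('Trapezoidal rule')
plt.legend()
plt.show()
```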
### 2. Plot the function $f(x) = \sin^2(x - 2)\, e^{-x^2}$
Include a title and labels for the x and y axes. The range of x is [0, 2].
![matplot_ex2](images/matplot_ex2.png)
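A possible sketch (title and axis labels included, x restricted to [0, 2]):
```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2, 200)
y = np.sin(x - 2)**2 * np.exp(-x**2)

plt.plot(x, y)
plt.title(r'$f(x) = \sin^2(x - 2)\, e^{-x^2}$')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.show()
```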
## Reference
* [100 numpy exercises](https://github.com/rougier/numpy-100)

@@ -0,0 +1,169 @@
# -*- coding: utf-8 -*-
# ---
# jupyter:
# jupytext_format_version: '1.2'
# kernelspec:
# display_name: Python 3
# language: python
# name: python3
# language_info:
# codemirror_mode:
# name: ipython
# version: 3
# file_extension: .py
# mimetype: text/x-python
# name: python
# nbconvert_exporter: python
# pygments_lexer: ipython3
# version: 3.5.2
# ---
# # Exercise - Traffic Accident Claim Approval Prediction
#
#
# Competition link: http://sofasofa.io/competition.php?id=2
#
#
# * Task type: binary classification
#
# * Background: after a minor traffic accident, a claims adjuster visits the scene to inspect it and collect information, and this information largely determines whether the car owner is compensated by the insurance company. The training data contain 36 (already encoded) pieces of information collected on site by the adjuster for each party to an accident, together with whether that party was eventually compensated. Our task is to predict, from these 36 features, the probability that a party is not compensated.
#
# * Data: the training set contains 200,000 samples and the test set contains 80,000 samples.
# ![data_description](images/data_description.png)
#
# * Evaluation metric: Precision-Recall AUC (a quick local computation is sketched at the end of this demo)
#
# ## Demo code
#
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
# %matplotlib inline
# read data
homePath = "data"
trainPath = os.path.join(homePath, "train.csv")
testPath = os.path.join(homePath, "test.csv")
submitPath = os.path.join(homePath, "sample_submit.csv")
trainData = pd.read_csv(trainPath)
testData = pd.read_csv(testPath)
submitData = pd.read_csv(submitPath)
# According to the data description, the CaseId column is just a meaningless identifier, so we drop it here.
#
# `drop()`: `axis` selects the axis to drop along (0 for rows, 1 for columns); `inplace` controls whether the data is modified in place.
#
# Drop the meaningless column
trainData.drop("CaseId", axis=1, inplace=True)
testData.drop("CaseId", axis=1, inplace=True)
# # A quick look at the data
#
# `head()` shows the first 5 rows by default; you can ask for more, e.g. `.head(15)` shows the first 15 rows.
#
trainData.head(15)
# `info()` prints a concise summary of the data: how many non-null values each column has and each column's data type.
#
#
trainData.info()
# `hist()`: plots a histogram for each column; the `figsize` parameter sets the size of the output figure.
#
trainData.hist(figsize=(20, 20))
# To see how the features are correlated, compute the correlation matrix and then sort it by a particular feature.
#
#
corr_matrix = trainData.corr()
corr_matrix["Evaluation"].sort_values(ascending=False)  # ascending=False sorts in descending order
# Separate the label from the training features
y = trainData['Evaluation']
trainData.drop("Evaluation", axis=1, inplace=True)
# Train a model with K-Means
#
# KMeans()
# * `n_clusters`: the number of clusters to predict;
# * `init`: the method for initializing the cluster centers; the default is `k-means++` rather than the random-sampling initialization of classic K-means (you can set it to `random` to use random initialization);
# * `n_jobs`: the number of CPU cores to use; -1 means use all cores.
# +
# do k-means
from sklearn.cluster import KMeans
est = KMeans(n_clusters=2, init="k-means++", n_jobs=-1)
est.fit(trainData, y)
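# Note: KMeans is unsupervised, so the `y` passed to fit() is ignored; the cluster
# indices it returns (0/1) are not guaranteed to line up with the Evaluation labels,
# which is one reason the accuracy computed below can look poor.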
y_train = est.predict(trainData)
y_pred = est.predict(testData)
# Save the predictions
submitData['Evaluation'] = y_pred
submitData.to_csv("submit_data.csv", index=False)
# +
# calculate accuracy
from sklearn.metrics import accuracy_score
acc_train = accuracy_score(y, y_train)
print("acc_train = %f" % (acc_train))
# -
# ## Random forest
#
# The results obtained with K-means may not be that good. On the competition site, the organizers provide two benchmark models, of which the better one is a random forest. The code is below; readers can try it themselves.
#
#
# +
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Read the data
train = pd.read_csv("data/train.csv")
test = pd.read_csv("data/test.csv")
submit = pd.read_csv("data/sample_submit.csv")
# Drop the id column
train.drop('CaseId', axis=1, inplace=True)
test.drop('CaseId', axis=1, inplace=True)
# Extract the training labels
y_train = train.pop('Evaluation')
# Build the random forest model
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(train, y_train)
y_pred = clf.predict_proba(test)[:, 1]
# Write the predictions to my_RF_prediction.csv
submit['Evaluation'] = y_pred
submit.to_csv('my_RF_prediction.csv', index=False)
# +
# Feature importances
print(clf.feature_importances_)
# Train accuracy
from sklearn.metrics import accuracy_score
y_train_pred = clf.predict(train)
print(y_train_pred)
acc_train = accuracy_score(y_train, y_train_pred)
print("acc_train = %f" % (acc_train))
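# -
# The competition scores submissions with Precision-Recall AUC rather than accuracy. As a rough local check (a sketch added here, not part of the official benchmark code), scikit-learn's `average_precision_score` summarizes the precision-recall curve computed from predicted probabilities; note that a training-set score is optimistic compared with the leaderboard score.
#
# +
# PR-AUC on the training set, using the positive-class probabilities
from sklearn.metrics import average_precision_score

y_train_proba = clf.predict_proba(train)[:, 1]
pr_auc_train = average_precision_score(y_train, y_train_proba)
print("PR-AUC (train) = %f" % pr_auc_train)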

@@ -0,0 +1,6 @@
{
"cells": [],
"metadata": {},
"nbformat": 4,
"nbformat_minor": 2
}

@@ -0,0 +1,71 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Titanic\n",
"\n",
"## Competition Description\n",
"The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.\n",
"\n",
"One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.\n",
"\n",
"In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.\n",
"\n",
"## Practice Skills\n",
"* Binary classification\n",
"* Python & SKLearn\n",
"\n",
"## Data\n",
"The data has been split into two groups:\n",
"\n",
"* training set (train.csv)\n",
"* test set (test.csv)\n",
"\n",
"The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the `ground truth`) for each passenger. Your model will be based on `features` like passengers' gender and class. You can also use feature engineering to create new features.\n",
"\n",
"The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.\n",
"\n",
"We also include `gender_submission.csv`, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.\n",
"\n",
"### Data description\n",
"![data description1](images/data_description1.png)\n",
"![data description2](images/data_description2.png)\n",
"\n",
"\n",
"### Variable Notes\n",
"pclass: A proxy for socio-economic status (SES)\n",
"* 1st = Upper\n",
"* 2nd = Middle\n",
"* 3rd = Lower\n",
"\n",
"\n",
"## Links\n",
"* [Titanic: Machine Learning from Disaster](https://www.kaggle.com/c/titanic)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
},
"main_language": "python"
},
"nbformat": 4,
"nbformat_minor": 2
}

@@ -0,0 +1,58 @@
# ---
# jupyter:
# jupytext_format_version: '1.2'
# kernelspec:
# display_name: Python 3
# language: python
# name: python3
# language_info:
# codemirror_mode:
# name: ipython
# version: 3
# file_extension: .py
# mimetype: text/x-python
# name: python
# nbconvert_exporter: python
# pygments_lexer: ipython3
# version: 3.5.2
# ---
# # Titanic
#
# ## Competition Description
# The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
#
# One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
#
# In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.
#
# ## Practice Skills
# * Binary classification
# * Python & SKLearn
#
# ## Data
# The data has been split into two groups:
#
# * training set (train.csv)
# * test set (test.csv)
#
# The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the `ground truth`) for each passenger. Your model will be based on `features` like passengers' gender and class. You can also use feature engineering to create new features.
#
# The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.
#
# We also include `gender_submission.csv`, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.
#
# ### Data description
# ![data description1](images/data_description1.png)
# ![data description2](images/data_description2.png)
#
#
# ### Variable Notes
# pclass: A proxy for socio-economic status (SES)
# * 1st = Upper
# * 2nd = Middle
# * 3rd = Lower
#
#
# ## Links
# * [Titanic: Machine Learning from Disaster](https://www.kaggle.com/c/titanic)
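#
# ## Getting started
#
# As a rough, hedged starting point only (not the report's actual solution; the `data/` folder location and the feature choice are assumptions for illustration), the sketch below loads the data with pandas, fits a simple scikit-learn classifier on two encoded features, and writes a submission file in the expected `PassengerId,Survived` format.

# +
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Load the competition data (assumed to sit in a local data/ folder)
train = pd.read_csv("data/train.csv")
test = pd.read_csv("data/test.csv")

# Minimal feature set: passenger class and an integer-encoded Sex column
for df in (train, test):
    df["Sex"] = df["Sex"].map({"male": 0, "female": 1})

features = ["Pclass", "Sex"]
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(train[features], train["Survived"])

# Predict on the test set and write a submission in the required two-column format
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": clf.predict(test[features]),
})
submission.to_csv("my_titanic_prediction.csv", index=False)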
