Titanic

Workflow stages (reference link, competition link)

The competition solution workflow goes through seven stages described in the Data Science Solutions book.

1. Question or problem definition.
2. Acquire training and testing data.
3. Wrangle, prepare, cleanse the data.
4. Analyze, identify patterns, and explore the data.
5. Model, predict and solve the problem.
6. Visualize, report, and present the problem solving steps and final solution.
7. Supply or submit the results.

The workflow indicates the general sequence in which each stage may follow the other. However, there are use cases with exceptions.

a. We may combine multiple workflow stages. We may analyze by visualizing data.
b. Perform a stage earlier than indicated. We may analyze data before and after wrangling.
c. Perform a stage multiple times in our workflow. The visualize stage may be used multiple times.
d. Drop a stage altogether. We may not need the supply stage to productize or service-enable our dataset for a competition.

Question and problem definition

Competition sites like Kaggle define the problem to solve or questions to ask while providing the datasets for training your data science model and testing the model results against a test dataset. The question or problem definition for Titanic Survival competition is described here at Kaggle.

Given a training set of samples listing passengers who survived or did not survive the Titanic disaster, can our model determine, from a test dataset that does not contain the survival information, whether these passengers in the test dataset survived or not?

We may also want to develop some early understanding about the domain of our problem. This is described on the Kaggle competition description page here. Here are the highlights to note.

On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This translates to a survival rate of roughly 32%.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew.
Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

Workflow goals

The data science solutions workflow solves for seven major goals.

Classifying. We may want to classify or categorize our samples. We may also want to understand the implications or correlation of different classes with our solution goal.

Correlating. One can approach the problem based on available features within the training dataset. Which features within the dataset contribute significantly to our solution goal? Statistically speaking, is there a correlation between a feature and the solution goal? As the feature values change, does the solution state change as well, and vice versa? This can be tested both for numerical and categorical features in the given dataset. We may also want to determine correlation among features other than survival for subsequent goals and workflow stages. Correlating certain features may help in creating, completing, or correcting features.

Converting. For the modeling stage, one needs to prepare the data. Depending on the choice of model algorithm, one may require all features to be converted to numerical equivalent values, for instance converting text categorical values to numeric values.

Completing. Data preparation may also require us to estimate any missing values within a feature. Model algorithms may work best when there are no missing values.

Correcting. We may also analyze the given training dataset for errors or possibly inaccurate values within features and try to correct these values or exclude the samples containing the errors. One way to do this is to detect any outliers among our samples or features. We may also completely discard a feature if it is not contributing to the analysis or may significantly skew the results.

Creating. Can we create new features based on an existing feature or a set of features, such that the new feature follows the correlation, conversion, and completeness goals?

Charting. How to select the right visualization plots and charts depending on the nature of the data and the solution goals.
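
As a quick illustration of the correcting goal, here is a minimal sketch of an interquartile-range (IQR) outlier check on a numeric feature such as Fare. It assumes the train_df DataFrame loaded later in this notebook and is not part of the original workflow:

#IQR-based outlier check on Fare (illustrative sketch, not part of the original workflow)
q1,q3=train_df['Fare'].quantile([0.25,0.75])
iqr=q3-q1
outliers=train_df[(train_df['Fare']<q1-1.5*iqr)|(train_df['Fare']>q3+1.5*iqr)]
print(len(outliers),'potential Fare outliers')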

#data analysis
import pandas as pd
import numpy as np
import random as rnd

#visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

#machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC,LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
#load the data
train_df=pd.read_csv('./data/Titanic/train.csv')
test_df=pd.read_csv('./data/Titanic/test.csv')
combine=[train_df,test_df]

#data analysis
print(train_df.columns.values)
['PassengerId' 'Survived' 'Pclass' 'Name' 'Sex' 'Age' 'SibSp' 'Parch'
'Ticket' 'Fare' 'Cabin' 'Embarked']
train_df.head(2)
test_df.head(2)
train_df.tail()
train_df.info()
print('_'*40)
test_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
________________________________________
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId 418 non-null int64
Pclass 418 non-null int64
Name 418 non-null object
Sex 418 non-null object
Age 332 non-null float64
SibSp 418 non-null int64
Parch 418 non-null int64
Ticket 418 non-null object
Fare 417 non-null float64
Cabin 91 non-null object
Embarked 418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
train_df.describe()
train_df.describe(include=['O'])

Analysis of categorical, ordinal, and discrete features

train_df[['Pclass','Survived']].groupby(['Pclass'],as_index=False).mean().sort_values(by='Survived',ascending=False)
train_df[["SibSp", "Survived"]].groupby(['SibSp'], as_index=False).mean().sort_values(by='Survived', ascending=False)
train_df[["Parch", "Survived"]].groupby(['Parch'], as_index=False).mean().sort_values(by='Survived', ascending=False

Visualizing numerical features

g=sns.FacetGrid(train_df,col='Survived')
g.map(plt.hist,'Age',bins=20)
<seaborn.axisgrid.FacetGrid at 0xc574f60>

[figure: histograms of Age, faceted by Survived]

grid=sns.FacetGrid(train_df,col='Survived',row='Pclass',size=2.2,aspect=1.6)
grid.map(plt.hist,'Age',alpha=0.5,bins=20)
grid.add_legend()
<seaborn.axisgrid.FacetGrid at 0xc619278>

[figure: histograms of Age, faceted by Pclass and Survived]

grid=sns.FacetGrid(train_df,row='Embarked',size=2.2,aspect=1.6)
grid.map(sns.pointplot,'Pclass','Survived','Sex',palette='deep')
grid.add_legend()
<seaborn.axisgrid.FacetGrid at 0xcd8f358>

[figure: pointplots of Survived versus Pclass by Sex, faceted by Embarked]

grid=sns.FacetGrid(train_df,row='Embarked',col='Survived',size=2.2,aspect=1.6)
grid.map(sns.barplot,'Sex','Fare',alpha=0.5,ci=None)
grid.add_legend()
<seaborn.axisgrid.FacetGrid at 0xcf12f60>

[figure: barplots of Fare by Sex, faceted by Embarked and Survived]

Dropping features

print("Before",train_df.shape,test_df.shape,combine[0].shape,combine[1].shape)
Before (891, 12) (418, 11) (891, 12) (418, 11)
train_df=train_df.drop(['Ticket','Cabin'],axis=1)
test_df=test_df.drop(['Ticket','Cabin'],axis=1)
combine=[train_df,test_df]
print("After",train_df.shape,test_df.shape,combine[0].shape,combine[1].shape)
After (891, 10) (418, 9) (891, 10) (418, 9)
train_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 10 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Fare 891 non-null float64
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(3)
memory usage: 69.7+ KB
train_df.head(1)

Creating new features extracted from existing ones

for dataset in combine:
    dataset['Title']=dataset.Name.str.extract('([A-Za-z]+)\.',expand=False)
dataset.shape
(418, 10)
pd.crosstab(train_df['Title'],train_df['Sex'])
#replace many titles with a more common name or classify them as Rare
for dataset in combine:
    dataset['Title']=dataset['Title'].replace(['Lady','Countess','Capt','Col',\
                                               'Don','Dr','Major','Rev','Sir',\
                                               'Jonkheer','Dona'],'Rare')
    dataset['Title']=dataset['Title'].replace('Mlle','Miss')
    dataset['Title']=dataset['Title'].replace('Ms','Miss')
    dataset['Title']=dataset['Title'].replace('Mme','Mrs')

train_df[['Title','Survived']].groupby(['Title'],as_index=False).mean()
#convert the categorical Title feature to ordinal values
title_mapping={"Mr":1,"Miss":2,"Mrs":3,"Master":4,"Rare":5}
for dataset in combine:
    dataset['Title']=dataset['Title'].map(title_mapping)
    dataset['Title']=dataset['Title'].fillna(0)
train_df.head()
test_df.head(1)
#drop the Name feature from both datasets and the PassengerId feature from the training set
train_df=train_df.drop(['Name','PassengerId'],axis=1)
test_df=test_df.drop(['Name'],axis=1)
combine=[train_df,test_df]
train_df.shape,test_df.shape
((891, 9), (418, 9))
for dataset in combine:
    dataset['Sex']=dataset['Sex'].map({'female':1,'male':0}).astype(int)
train_df.head()
train_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
Survived 891 non-null int64
Pclass 891 non-null int64
Sex 891 non-null int32
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Fare 891 non-null float64
Embarked 889 non-null object
Title 891 non-null int64
dtypes: float64(2), int32(1), int64(5), object(1)
memory usage: 59.2+ KB

Completing missing values (continuous numerical features)

1. Generate random numbers between the mean and the standard deviation.
2. Guess the missing value from correlated features. For example, Age correlates with Sex and Pclass, so we can guess a missing age using the median age of samples with the same Sex and Pclass.
3. Combine methods 1 and 2: generate random numbers between the mean and standard deviation computed over sets of correlated features.
Methods 1 and 3 introduce randomness into the data, so we generally choose method 2.
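
As a more concise sketch of method 2 (an optional alternative to the nested loops used below, assuming the combine list defined above), the missing ages could be filled with the per-group median via groupby and transform:

#sketch: fill missing Age with the median age of passengers sharing the same Sex and Pclass
for dataset in combine:
    dataset['Age']=dataset.groupby(['Sex','Pclass'])['Age'].transform(lambda s:s.fillna(s.median()))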

grid=sns.FacetGrid(train_df,row='Pclass',col='Sex',size=2.2,aspect=1.6)
grid.map(plt.hist,'Age',alpha=.5,bins=20)
grid.add_legend()
<seaborn.axisgrid.FacetGrid at 0xe211f28>

[figure: histograms of Age, faceted by Pclass and Sex]

guess_ages=np.zeros((2,3))
guess_ages
array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])
for dataset in combine:
    for i in range(0,2):
        for j in range(0,3):
            guess_df=dataset[(dataset['Sex']==i) & (dataset['Pclass']==j+1)]['Age'].dropna()
            #age_mean=guess_df.mean()
            #age_std=guess_df.std()
            #age_guess=rnd.uniform(age_mean-age_std,age_mean+age_std)
            age_guess=guess_df.median()
            #convert random age float to nearest .5 age
            guess_ages[i,j]=int(age_guess/0.5+0.5)*0.5
    for i in range(0,2):
        for j in range(0,3):
            dataset.loc[(dataset.Age.isnull()) & (dataset.Sex==i) & (dataset.Pclass==j+1),\
                        'Age']=guess_ages[i,j]
    dataset['Age']=dataset['Age'].astype(int)
train_df.head()
#let us create Age bands and determine correlations with Survived
train_df['AgeBand']=pd.cut(train_df['Age'],5)
train_df[['AgeBand','Survived']].groupby(['AgeBand'],as_index=False).mean().sort_values(by='AgeBand',ascending=True)
#let us replace Age with ordinals based on these bands
for dataset in combine:
    dataset.loc[dataset['Age']<=16,'Age']=0
    dataset.loc[(dataset['Age']>16) & (dataset['Age']<=32),'Age']=1
    dataset.loc[(dataset['Age']>32) & (dataset['Age']<=48),'Age']=2
    dataset.loc[(dataset['Age']>48) & (dataset['Age']<=64),'Age']=3
    dataset.loc[dataset['Age']>64,'Age']=4
train_df.head()
#we can now remove the AgeBand feature
train_df=train_df.drop(['AgeBand'],axis=1)
combine=[train_df,test_df]
train_df.head()
#create new feature combining existing features
'''
create a new feature for FamilySize which combines Parch and SibSp. This will enable us
to drop Parch and SibSp from our datasets
'''
for dataset in combine:
    dataset['FamilySize']=dataset['SibSp']+dataset['Parch']+1
train_df[['FamilySize','Survived']].groupby(['FamilySize'],as_index=False).mean().sort_values(by='Survived',ascending=False)
#we can create another feature called IsAlone
for dataset in combine:
    dataset['IsAlone']=0
    dataset.loc[dataset['FamilySize']==1,'IsAlone']=1

train_df[['IsAlone','Survived']].groupby(['IsAlone'],as_index=False).mean()
#let us drop Parch, SibSp, and FamilySize features in favor of IsAlone
train_df=train_df.drop(['Parch','SibSp','FamilySize'],axis=1)
test_df=test_df.drop(['Parch','SibSp','FamilySize'],axis=1)
combine=[train_df,test_df]
train_df.head()
#we can also create an artificial feature combining Pclass and Age
for dataset in combine:
    dataset['Age*Class']=dataset.Age*dataset.Pclass
train_df.loc[:,['Age*Class','Age','Pclass']].head(10)
#completing a categorical feature
train_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
Survived 891 non-null int64
Pclass 891 non-null int64
Sex 891 non-null int32
Age 891 non-null int32
Fare 891 non-null float64
Embarked 889 non-null object
Title 891 non-null int64
IsAlone 891 non-null int64
Age*Class 891 non-null int64
dtypes: float64(1), int32(2), int64(5), object(1)
memory usage: 55.8+ KB
#Embarked feature takes S,Q,C values based on port of embarkation.
#our training dataset has two missing values,
#we simply fill these with the most common occurrence
freq_port=train_df.Embarked.dropna().mode()[0]
freq_port
'S'
for dataset in combine:
    dataset['Embarked']=dataset['Embarked'].fillna(freq_port)
train_df[['Embarked','Survived']].groupby(['Embarked'],as_index=False).mean().sort_values(by='Survived',ascending=False)
#Converting categorical feature to numeric
#we can now convert the Embarked feature to a numeric feature
for dataset in combine:
    dataset['Embarked']=dataset['Embarked'].map({'S':0,'C':1,'Q':2}).astype(int)

train_df.head()
#Quick completing and converting a numeric feature
test_df.head()
test_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 9 columns):
PassengerId 418 non-null int64
Pclass 418 non-null int64
Sex 418 non-null int32
Age 418 non-null int32
Fare 417 non-null float64
Embarked 418 non-null int32
Title 418 non-null int64
IsAlone 418 non-null int64
Age*Class 418 non-null int64
dtypes: float64(1), int32(3), int64(5)
memory usage: 24.6 KB
'''
We can now complete the Fare feature for the single missing value in the test dataset using
the median of this feature. We do this in a single line of code.
Note that we are not creating an intermediate new feature or doing any further analysis
for correlation to guess the missing value, as we are replacing only a single value. The completion goal
achieves the desired requirement for the model algorithm to operate on non-null values.
We may also want to round off the fare to two decimals as it represents currency.
'''
test_df['Fare'].fillna(test_df['Fare'].dropna().median(),inplace=True)
test_df.head()
#we can now create the FareBand feature
train_df['FareBand']=pd.qcut(train_df['Fare'],4)
train_df[['FareBand','Survived']].groupby(['FareBand'],as_index=False).mean().sort_values(by='FareBand',ascending=True)
#convert the Fare feature to ordinal values based on the FareBand
for dataset in combine:
    dataset.loc[dataset['Fare']<=7.91,'Fare']=0
    dataset.loc[(dataset['Fare']>7.91) & (dataset['Fare']<=14.454),'Fare']=1
    dataset.loc[(dataset['Fare']>14.454) & (dataset['Fare']<=31),'Fare']=2
    dataset.loc[dataset['Fare']>31,'Fare']=3
    dataset['Fare']=dataset['Fare'].astype(int)

train_df=train_df.drop(['FareBand'],axis=1)
combine=[train_df,test_df]
train_df.head()
#And the test dataset
test_df.head()

Model, predict and solve

Now we are ready to train a model and make predictions. There are 60+ predictive modelling algorithms to choose from, so we must understand the type of problem in order to narrow the solution down to a few candidate models. Our problem is a classification and regression problem: we want to identify the relationship between the output (survived or not) and the feature variables. We are performing supervised learning, since the model is trained with a given dataset. With these two criteria, supervised learning plus classification and regression, we can narrow the choice down to the following models:

1. Logistic Regression
2. KNN (k-Nearest Neighbors)
3. Support Vector Machines
4. Naive Bayes
5. Decision Tree
6. Random Forest
7. Perceptron
8. Artificial Neural Network
9. RVM (Relevance Vector Machine)

X_train=train_df.drop("Survived",axis=1)
Y_train=train_df["Survived"]
X_test=test_df.drop("PassengerId",axis=1).copy()
X_train.shape,Y_train.shape,X_test.shape
((891, 8), (891,), (418, 8))
train_df.head()
test_df.head()
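
Each classifier below is fit on X_train/Y_train and then scored on the same training set. As an optional sketch (the helper name fit_and_score is ours, not part of the original workflow), the shared pattern could be captured once:

#optional helper (a sketch, not used by the cells below): fit a classifier and
#return its rounded training accuracy, mirroring the per-model cells that follow
def fit_and_score(model,X,y):
    model.fit(X,y)
    return round(model.score(X,y)*100,2)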

1.Logistic Regression

logreg=LogisticRegression()
logreg.fit(X_train,Y_train)
Y_pred=logreg.predict(X_test)
acc_log=round(logreg.score(X_train,Y_train)*100, 2)
acc_log
80.579999999999998
(Note: although we round the score to two decimals, the notebook displays it with full floating-point precision; the value is still 80.58.)

We can use Logistic Regression to validate our assumptions and decisions for the feature creating and completing goals.
This can be done by calculating the coefficients of the features in the decision function.

Positive coefficients increase the log-odds of the response (and thus increase the probability),
and negative coefficients decrease the log-odds of the response (and thus decrease the probability).

1. Sex has the highest positive coefficient, implying that as the Sex value increases (male: 0 to female: 1), the probability of Survived=1 increases the most.
2. Inversely, as Pclass increases, the probability of Survived=1 decreases the most.
3. This makes Age*Class a good artificial feature to model, as it has the second highest negative correlation with Survived.
4. So is Title, which has the second highest positive correlation.

coeff_df=pd.DataFrame(train_df.columns.delete(0))
coeff_df.columns=['Feature']
coeff_df["Correlation"]=pd.Series(logreg.coef_[0])
coeff_df.sort_values(by='Correlation',ascending=False)
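
Since these coefficients are on the log-odds scale, exponentiating them gives odds ratios, which can be easier to read (an optional sketch; the OddsRatio column is ours, not part of the original workflow):

#optional: express the log-odds coefficients as odds ratios (illustrative sketch)
coeff_df['OddsRatio']=np.exp(coeff_df['Correlation'])
coeff_df.sort_values(by='OddsRatio',ascending=False)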

SVM

SVM training algorithm builds a model that assigns new test samples to one category or the other, making it a non-probabilistic binary linear classifier.
Note that the model generates a confidence score which is higher than that of the Logistic Regression model.

svc=SVC()
svc.fit(X_train,Y_train)
Y_pred=svc.predict(X_test)
acc_svc=round(svc.score(X_train,Y_train)*100,2)
acc_svc
84.290000000000006

KNN

In pattern recognition, the k-Nearest Neighbors algorithm (or k-NN for short) is a non-parametric method used for classification and regression. A sample is classified by a majority vote of its neighbors, with the sample being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor.
The KNN confidence score is better than Logistic Regression and not worse than SVM.

knn=KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train,Y_train)
Y_pred=knn.predict(X_test)
acc_knn=round(knn.score(X_train,Y_train)*100,2)
acc_knn
84.849999999999994

Naive Bayes

In machine learning, naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes’ theorem with strong (naive) independence assumptions between the features. Naive Bayes classifiers are highly scalable, requiring a number of parameters linear in the number of variables (features) in a learning problem.
The model generated confidence score is the lowest among the models evaluated so far.

#Gaussian Naive Bayes
gaussian=GaussianNB()
gaussian.fit(X_train,Y_train)
Y_pred=gaussian.predict(X_test)
acc_gaussian=round(gaussian.score(X_train,Y_train)*100,2)
acc_gaussian
71.379999999999995

Perceptron

The perceptron is an algorithm for supervised learning of binary classifiers (functions that can decide whether an input, represented by a vector of numbers, belongs to some specific class or not). It is a type of linear classifier, i.e. a classification algorithm that makes its predictions based on a linear predictor function combining a set of weights with the feature vector. The algorithm allows for online learning, in that it processes elements in the training set one at a time.

perceptron=Perceptron()
perceptron.fit(X_train,Y_train)
Y_pred=perceptron.predict(X_test)
acc_perceptron=round(perceptron.score(X_train,Y_train)*100,2)
acc_perceptron
D:\noSystem\software\Anaconda3\lib\site-packages\sklearn\linear_model\stochastic_gradient.py:128: FutureWarning: max_iter and tol parameters have been added in <class 'sklearn.linear_model.perceptron.Perceptron'> in 0.19. If both are left unset, they default to max_iter=5 and tol=None. If tol is not None, max_iter defaults to max_iter=1000. From 0.21, default max_iter will be 1000, and default tol will be 1e-3.
"and default tol will be 1e-3." % type(self), FutureWarning)
79.569999999999993

Linear SVC

linear_svc=LinearSVC()
linear_svc.fit(X_train,Y_train)
Y_pred=linear_svc.predict(X_test)
acc_linear_svc=round(linear_svc.score(X_train,Y_train)*100,2)
acc_linear_svc
80.129999999999995

Stochastic Gradient Descent

sgd=SGDClassifier()
sgd.fit(X_train,Y_train)
Y_pred=sgd.predict(X_test)
acc_sgd=round(sgd.score(X_train,Y_train)*100,2)
acc_sgd
D:\noSystem\software\Anaconda3\lib\site-packages\sklearn\linear_model\stochastic_gradient.py:128: FutureWarning: max_iter and tol parameters have been added in <class 'sklearn.linear_model.stochastic_gradient.SGDClassifier'> in 0.19. If both are left unset, they default to max_iter=5 and tol=None. If tol is not None, max_iter defaults to max_iter=1000. From 0.21, default max_iter will be 1000, and default tol will be 1e-3.
"and default tol will be 1e-3." % type(self), FutureWarning)

80.25

Decision Tree

This model uses a decision tree as a predictive model which maps features (tree branches) to conclusions about the target value (tree leaves). Tree models where the target variable can take a finite set of values are called classification trees; in these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees.
The model confidence score is the highest among models evaluated so far.

decision_tree=DecisionTreeClassifier()
decision_tree.fit(X_train,Y_train)
Y_pred=decision_tree.predict(X_test)
acc_decision_tree=round(decision_tree.score(X_train,Y_train)*100,2)
acc_decision_tree
87.209999999999994

Random Forest

The next model Random Forests is one of the most popular. Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks, that operate by constructing a multitude of decision trees (n_estimators=100) at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees.
The model confidence score is the highest among models evaluated so far. We decide to use this model’s output (Y_pred) for creating our competition submission of results.

random_forest=RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train,Y_train)
Y_pred=random_forest.predict(X_test)
random_forest.score(X_train,Y_train)
acc_random_forest=round(random_forest.score(X_train,Y_train)*100,2)
acc_random_forest
87.209999999999994

Model evaluation

We can now rank our evaluation of all the models to choose the best one for our problem. While both Decision Tree and Random Forest score the same, we choose to use Random Forest as they correct for decision trees’ habit of overfitting to their training set.

models=pd.DataFrame({
    'Model':['Support Vector Machines','KNN','Logistic Regression','Random Forest',
             'Naive Bayes','Perceptron','Stochastic Gradient Descent','Linear SVC',
             'Decision Tree'],
    'Score':[acc_svc,acc_knn,acc_log,acc_random_forest,acc_gaussian,acc_perceptron,
             acc_sgd,acc_linear_svc,acc_decision_tree]
})
models.sort_values(by='Score',ascending=False)
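
Training-set accuracy flatters models that can memorize the data, such as decision trees and random forests. As an optional sanity check (a sketch using the same X_train and Y_train; it is not part of the original workflow), k-fold cross-validation gives a less optimistic estimate:

#optional sanity check (a sketch): 5-fold cross-validation accuracy for the chosen model
from sklearn.model_selection import cross_val_score
cv_scores=cross_val_score(RandomForestClassifier(n_estimators=100),X_train,Y_train,cv=5)
print(round(cv_scores.mean()*100,2))
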
submission=pd.DataFrame({
    "PassengerId":test_df["PassengerId"],
    "Survived":Y_pred
})
submission.to_csv('./data/Titanic/output/submission.csv',index=False)

Our submission to the competition site Kaggle results in a ranking of 3,883 out of 6,082 competition entries. This result is indicative while the competition is running and only accounts for part of the submission dataset. Not bad for our first attempt. Any suggestions to improve our score are most welcome.