Let’s use Python to predict who survived the Titanic disaster!
In this project we will cover some basics of data science and machine learning:
- cleaning data
- leveraging the data for predictions
In short, we will retrieve data from CSV files, clean the data, and train an estimator to perform binary classification.
Step 1: get the data
train.csv from Kaggle
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
This is the training data: we know details about some passengers and whether they survived.
For instance Miss Heikkinen, a 26-year-old, 3rd-class passenger, did survive!
In the testing data, we don’t have the “Survived” information; we have to predict it for every passenger in the test set.
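Loading both files into pandas gives us the train_df and test_df DataFrames used throughout the rest of this post (a minimal sketch; the test file is assumed to be named test.csv, as in the Kaggle download):

import pandas as pd

# Read the Kaggle CSV files into DataFrames
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")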
Step 2: clean the data
We will drop the “Name”, “Ticket” and “Cabin” columns as we won’t use them. Some values are missing: some passengers have no “Age”, “Fare”, or “Embarked” information.
We can either drop the rows where information is missing, or we can fill in some “forged” values (imputation); this is what we will do here.
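Here is one way the cleaning could look (a sketch: the imputation choices — the training set’s median “Age” and “Fare”, and “S” as the most common “Embarked” port — are assumptions, and the integer encoding of “Sex” and “Embarked” is added because scikit-learn estimators need numeric input):

# Medians computed on the training data only
age_median = train_df["Age"].median()
fare_median = train_df["Fare"].median()

for df in (train_df, test_df):
    # Drop the columns we won't use
    df.drop(columns=["Name", "Ticket", "Cabin"], inplace=True)
    # Fill in "forged" values instead of dropping rows
    df["Age"] = df["Age"].fillna(age_median)
    df["Fare"] = df["Fare"].fillna(fare_median)
    df["Embarked"] = df["Embarked"].fillna("S")  # assumed: most common port
    # Encode categorical columns as integers
    df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
    df["Embarked"] = df["Embarked"].map({"S": 0, "C": 1, "Q": 2})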
Step 3: Extract numpy arrays from pandas dataframe
We will create 3 sets of data:
- X_train, Y_train: used to train our classifier
- X_val, Y_val: used to check that our classifier can generalize to unseen data
- X_test: used to predict the fate of passengers for whom we don’t have the “Survived” info
The training labels are in the 2nd column of our training DataFrame, and we don’t need the 1st column, which is the passenger ID.
from sklearn.model_selection import train_test_split

# Features start at the 3rd column (Pclass onwards); labels are the "Survived" column
X = train_df.iloc[:, 2:].values
Y = train_df['Survived'].values
# The test set has no "Survived" column, so its features start at the 2nd column
X_test = test_df.iloc[:, 1:].values

# Hold out 30% of the training data for validation
X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.3)
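A quick sanity check on the split (Kaggle’s train.csv has 891 rows, so a 70/30 split should give roughly 623 training and 268 validation samples, each with 7 feature columns after our cleaning):

print(X_train.shape, X_val.shape)  # e.g. (623, 7) (268, 7)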
Step 4: train an estimator
A random forest will do the trick:
from sklearn.ensemble import RandomForestClassifier

# Train a forest of 100 trees on the training split
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, Y_train)

# Mean accuracy on the held-out validation data
print(clf.score(X_val, Y_val))
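As a bonus, a random forest can tell us which features mattered most (a sketch; the column slice assumes the cleaned train_df from step 2):

# Pair each feature column name with its learned importance
for name, importance in zip(train_df.columns[2:], clf.feature_importances_):
    print(f"{name}: {importance:.3f}")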
The classifier’s score on the validation data is not perfect, but not bad either.
Step 5: predict the output
We will predict from the testing data, and create a file that we can upload to Kaggle.
# Predict survival for the test passengers
Y_test = clf.predict(X_test)

# Build the submission file: PassengerId plus the predicted "Survived" value
output = pd.DataFrame(test_df['PassengerId'])
output.insert(1, "Survived", Y_test)
print(output.head())
output.to_csv(path_or_buf="titanic_predict.csv", index=False)
And that’s it!
Some things could be improved:
- Cross-Validation could help us find a better model and/or more suitable parameters (see the sketch after this list).
- Training the classifier on the whole training data before applying it to the test data could improve the classifier’s performance.
- And many other things…
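For instance, the cross-validation idea could look like this (a minimal sketch using scikit-learn’s cross_val_score on the full training arrays):

from sklearn.model_selection import cross_val_score

# 5-fold cross-validation: the classifier is trained and scored on 5 different splits
scores = cross_val_score(clf, X, Y, cv=5)
print(scores.mean(), scores.std())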
But with a few lines of Python and the help of NumPy, pandas, and scikit-learn, we successfully built a machine learning system!