The Titanic disaster dataset is a commonly used entry-level dataset on Kaggle, where the task is to predict whether a passenger survived the sinking. I am going to try to do it properly by taking a detailed and thorough look at the data. Unfortunately, a lot of people on the Kaggle leaderboard have perfect scores because they overfitted their models, so those scores do not mean anything. I am only competing against myself to build a better and better model, so I do not mind.
Predict whether Titanic passengers survived or not, based on the data in the training set.
The data is split into two CSV files: train.csv and test.csv.
The training set is what is used to train the machine learning model and the test set to confirm results. Normally we would have to manually make this split, but conveniently the creators have done this for us. Let’s take a look at the first five entries:
import pandas as pd

df_train = pd.read_csv('../input/titanic/train.csv')
df_test = pd.read_csv('../input/titanic/test.csv')
# Show the first five rows of the training set
print(df_train.head(5))
Note that the above output is all one table just split into three horizontally.
So we can see straight away that we have a mix of variable types and some null values in the Cabin column. Quite a few of the discrete numerical variables are really categorical variables, and some of those are ordinal.
PassengerId is our primary key. It uniquely identifies each passenger.
Survived records whether that passenger survived or not: 0 for not surviving, 1 for surviving. Categorical (binary), but could be considered ordinal.
Pclass is the passenger's ticket class: 1 for 1st, 2 for 2nd and 3 for 3rd. Ordinal.
Name is just the name of the passenger. Categorical.
Sex is the passenger's gender (male or female). Categorical.
Age is the passenger's age in years. Numerical. It is mostly whole numbers, but the dataset uses fractional values for infants and for estimated ages, so it is not strictly discrete.
SibSp is the number of siblings and spouses aboard the ship. Discrete numerical.
Parch is the number of parents/children aboard. Discrete numerical.
Ticket is the ticket number. It seems to be categorical, but it could be numerical in some sense.
Fare is how much the passenger paid for their ticket. Continuous numerical.
Cabin is the cabin number. Although it has letters in it, it seems to be somewhat numerical as well, with a potential ordering.
Embarked is the port that the passenger left from. C = Cherbourg, Q = Queenstown, S = Southampton. Categorical.
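The column types and missing values described above can be checked directly in pandas. The snippet below is a minimal sketch on a handful of made-up rows (the `toy` frame and its values are my own illustration, not rows from the real file), and it also shows one way to tell pandas that Pclass is ordinal.

```python
import numpy as np
import pandas as pd

# Toy rows that mimic some of the Titanic columns.
# The values are made up for illustration only.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", "S", "Q"],
})

# Data type pandas inferred for each column.
print(toy.dtypes)

# Number of missing values per column.
null_counts = toy.isna().sum()
print(null_counts)

# Mark Pclass as an ordered categorical so that 3rd < 2nd < 1st.
toy["Pclass"] = pd.Categorical(toy["Pclass"], categories=[3, 2, 1], ordered=True)
print(toy["Pclass"].cat.categories)
```

Running the same three checks on `df_train` instead of `toy` should show where the real nulls are.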
We’re not interested here in whether or not our ideas line up with the stats. We’re just developing some concepts and ideas about the data. Most of my ideas about the Titanic admittedly come from James Cameron’s movie, but he did a lot of research, so the ideas might stand up quite well.
The first expectation is that survival depends on class. Higher-class passengers should be more likely to survive and lower-class passengers less likely.
We’d expect gender and age to have some effect. This was the time of women and children first, after all.
The fare and cabin numbers could be interesting. Presumably a higher fare means a better cabin, and a better cabin may mean more safety? This may require looking at a blueprint/floor plan. The real question here is whether there is any potential for feature engineering at all, particularly using outside data.
On a little bit further investigation and thanks to information assembled by a nuclear physicist named Paul Lee, it turns out there is quite a bit more complexity than expected concerning cabins (http://www.paullee.com/titanic/belowdecks.php).
As the above picture shows, there seem to be people with different duty types at different levels. We know now from the floor plans that the letter in a cabin name is its deck, and that the decks get lower as the alphabet continues: deck A is the highest deck with passengers and deck G is the lowest. There is also a class relationship occurring here, in that the lower the deck someone is on, the lower their social class (https://en.wikipedia.org/wiki/Titanic#:~:text=All%20three%20of%20the%20Olympic,which%20the%20lifeboats%20were%20housed). We expect to see some correlation here with survival. Ultimately this means we can start to ask whether or not the location of a passenger’s cabin is a factor in their survival. The intuition is that the earlier a cabin letter comes in the alphabet, the more likely survival is. We can use a bit of feature engineering here to check which letter is in a cabin number and assign it an ordinal value.
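As a rough sketch of that feature engineering step, the deck letter can be pulled off the front of the cabin string and mapped to a rank. The `DECK_RANK` mapping, the `add_deck_rank` helper and the `Deck` and `DeckRank` column names are my own hypothetical choices, and the cabin values below are made up. I am assuming the first character of Cabin is the deck letter, and that missing cabins should stay missing.

```python
import pandas as pd

# Assumed ranking: deck A (highest) gets 7, deck G (lowest) gets 1.
DECK_RANK = {"A": 7, "B": 6, "C": 5, "D": 4, "E": 3, "F": 2, "G": 1}


def add_deck_rank(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with hypothetical Deck and DeckRank columns."""
    out = df.copy()
    # Take the first character of the cabin string as the deck letter.
    out["Deck"] = out["Cabin"].str[0]
    # Missing cabins and letters outside A to G end up as NaN.
    out["DeckRank"] = out["Deck"].map(DECK_RANK)
    return out


# Made-up cabin values, for illustration only.
toy = pd.DataFrame({"Cabin": ["C85", None, "E46", "G6", "A10"]})
result = add_deck_rank(toy)
print(result)
```

Whether a higher number should mean a higher or lower deck is arbitrary. What matters is that the order is consistent, so the sign of any correlation with Survived can be read correctly.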
# Pairwise correlations between the numeric columns of the training set
correlations = df_train.select_dtypes(include="number").corr()
# Keep each column's correlation with Survived, strongest positive first
correlations = correlations["Survived"].sort_values(ascending=False)
correlations
The correlation analysis seems to support the hypothesis that class is associated with survival: as Pclass increases numerically, and social class therefore decreases, the odds of survival fall, with a correlation of -0.338481. This is reinforced by Fare having a positive correlation with survival, so the more a passenger paid for their ticket, the higher their odds of surviving were.
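A correlation coefficient can be hard to picture, so another way to look at the same relationship is the survival rate within each class. This is only a sketch on made-up rows (the `toy` frame is mine, so its numbers say nothing about the real data). Swapping `toy` for `df_train` would give the real rates.

```python
import pandas as pd

# Made-up rows for illustration only; these are not real passengers.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "Fare":     [80.0, 72.5, 60.0, 26.0, 13.0, 8.05, 7.25, 7.9, 9.5],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the share of passengers who survived.
survival_by_class = toy.groupby("Pclass")["Survived"].mean()
print(survival_by_class)

# Correlation of each remaining column with Survived.
corr_with_survived = toy.corr()["Survived"].drop("Survived")
print(corr_with_survived)
```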
The main feature we are going to engineer for this dataset is a combination of Pclass and Cabin, which will be named CabinClass. We can create this by simply multiplying Pclass with our ???
Making Predictions/Model Fitting
To start with, let’s have a go at using just three predictors to predict the outcome: pclass, gender and age. This should give us an interesting result, because we expect women, children and upper-class people to have a higher survival rate, as per our previous hypothesizing.
Y = B0 + B1*pclass + B2*gender + B3*age
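To make the equation concrete, here is a minimal sketch that fits it by ordinary least squares with numpy. Several things here are my own assumptions rather than anything settled above: the `fit_linear` helper, encoding gender as 1 for female and 0 for male, filling missing ages with the median, and the made-up `toy` rows. Because Y is 0 or 1, a straight linear fit like this can predict values outside that range, which is the usual argument for trying logistic regression later.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0, 1, 1],
    "Pclass":   [3, 1, 3, 1, 3, 2, 2, 3],
    "Sex":      ["male", "female", "female", "female",
                 "male", "male", "female", "male"],
    "Age":      [22.0, 38.0, 26.0, 35.0, np.nan, 54.0, 14.0, 2.0],
})


def fit_linear(df: pd.DataFrame) -> pd.Series:
    """Least-squares fit of Y = B0 + B1*pclass + B2*gender + B3*age."""
    data = df[["Survived", "Pclass", "Sex", "Age"]].copy()
    # Assumption: encode gender as 1.0 for female and 0.0 for male.
    data["Sex"] = (data["Sex"] == "female").astype(float)
    # Assumption: fill missing ages with the median age.
    data["Age"] = data["Age"].fillna(data["Age"].median())
    predictors = data[["Pclass", "Sex", "Age"]].to_numpy(dtype=float)
    # A leading column of ones provides the intercept B0.
    X = np.column_stack([np.ones(len(data)), predictors])
    y = data["Survived"].to_numpy(dtype=float)
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return pd.Series(coefs, index=["B0", "B1_pclass", "B2_gender", "B3_age"])


coefs = fit_linear(toy)
print(coefs)
```

Calling `fit_linear(df_train)` should work the same way on the real training set, since it only relies on the Survived, Pclass, Sex and Age columns.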