Titanic Disaster Analysis¶
Introduction¶
one of most famous disasters in human history, happened in April 1912.
people who didn't know about it from books, knew about it from the high budget production of tragedic romantic movie with the same name.
for me, it represents how human arrogance can lead to severe consequences. the captin underestimated the iceburg mass and manufacturer overestimated the ship power to the limit that they didn't provide enough survival boats. which decreased the probability of survive when disaster happened.
Data Analysis Process Phases¶
1- Questioning the data
2- Data Wrangling (data acquisition , data cleaning, datastoring)
3- Data Exploration (finding patterns)
4- Drawing conclusions (or even prediction)
5- Communicate findings
Questions¶
although survival in such disaster is almost random, but in the following analysis, we will try to answer the following questions:
1- is any factors that increased the probability of servival ?
and if so,
2- what are these factors and which factors have more impact than other?
Procedure (Work done to investigate questions)¶
this analysis will be conducted by comparing the ratios of surviving persons with different criteria within same category together, for example, comparing ratio for females survival versus males in terms of Sex category, and comparing ratio of survivals of passengers for different age periods.
a simple python function is self explaining in code below, used to determine this ratio value, and it will be called for almost each criteria serparately.
finding should be confirmation or rejection of the hypothes that some factors affect or correlate with survival probablitiy. and hence to determine which criteria had more impact than other, and which is neutral.
Pre Analysis¶
reading files and declaring a comparison function that will be used later to compare different data aspects.
i will name 2 data sets here:
one with survival attribute as titanictraindf and the other as testtitanictraindf
# importing the necessary libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
plt.style.use('seaborn-deep')
# reading data from files
titanicfilename = './titanic-data.csv'
titanictraindf = pd.read_csv(titanicfilename)
testtitanicfilename = './test.csv'
testtitanictraindf = pd.read_csv(testtitanicfilename)
# this function compares the survived vs non-survived ratio for the passed columns
def compare_operators_with_merge(survived,nonservived,columnName):
# 1- making 2 groups survived / non-survived based on passed column
survivedgroup=survived[['PassengerId','Survived',columnName]].groupby(['Survived',columnName],as_index=False).count()
nonsurvivedgroup=nonsurvived[['PassengerId','Survived',columnName]].groupby(['Survived',columnName],as_index=False).count()
# 2- join the 2 groups on the passed column to compare from left group vs right group
merged=survivedgroup.merge(nonsurvivedgroup, on=[columnName], how='inner')
# 3- get the relative ratio between different items of passed column , and normalizing it
result=merged['PassengerId_x']/merged['PassengerId_y']
return (pd.DataFrame({columnName:merged[columnName].astype(str) ,'result':result/result.sum()}))
Data Wrangling¶
before wrangling, data summarization is needed to guide our wrangling and analysis¶
here we will compare train data set summary versus test data set summary to check if our train data is representing a good sample or biased sample. summary will show me statistics of numerical data Cabin and Sex should be compared separately.
# displaying summary of numbers
print('\r\n\x1b[1;31m'+'Summary of train and test data sets (numeric fields)'+'\x1b[0m')
print(titanictraindf.describe())
print(testtitanictraindf.describe())
ratio=titanictraindf[['Pclass','Age','SibSp','Parch','Fare']].describe()/testtitanictraindf[['Pclass','Age','SibSp','Parch','Fare']].describe()
print('\r\n\x1b[1;31m'+'Ratio between train and test data sets'+'\x1b[0m')
print (ratio)
print('\r\n\x1b[1;31m'+'Exploring Sex and Cabin attributes(non numeric)'+'\x1b[0m')
print (titanictraindf['Sex'].describe())
print (testtitanictraindf['Sex'].describe())
print ('')
print (titanictraindf['Cabin'].describe())
print (testtitanictraindf['Cabin'].describe())
print('\r\n\x1b[1;31m'+'missing data in Age and Cabin'+'\x1b[0m')
print('number of persons with missing age data',titanictraindf[titanictraindf['Age'].isnull()]['Age'].size )
print('number of persons with missing cabin data',titanictraindf[titanictraindf['Cabin'].isnull()]['Cabin'].size )
nullAgeData=titanictraindf[titanictraindf['Age'].isnull()]
print ('survived persons with missing age',nullAgeData[nullAgeData['Survived']==1]['Age'].size)
print ('non survived persons with missing age',nullAgeData[nullAgeData['Survived']==0]['Age'].size)
print ('sample survived',titanictraindf[titanictraindf['Survived']==1]['PassengerId'].size)
print ('sample nonsurvived',titanictraindf[titanictraindf['Survived']==0]['PassengerId'].size)
print ('total survived ',722)
print ('total nonsurvived ',1502)
conclusion from first data exploration¶
data sampling looks good as ratio between train data set and test data set is almost 1 (mean and std) in most attributes, especially for Pclass and Age , ratio is less in other attributes
majority of passingers are male, 577 out of 891 in train data set, 266 out of 418 in test data set ratio is also similar between 63 -64 % of passengers so it's also a good unbiased sample.
most of data is existing , except for Age and Cabin attributes
- 177 out of 891 age data is missing, survival ratio of missing age is (52/177)=29.4%
- 687 out of 891 cabin data is missing, so it won't be considered in the analysis
survival ratio in train dataset is 38.4% , total survival ratio is 32.5%
- so survival ratio in train data set is higher than actual ratio, this should be considered when predicting the survival probability in test data set
Outliers.¶
in both train and test data summary, data looks good with acceptable distance from the mean (less than 2* std) except for parent/children sibling/spouse. where there is values of 8 and 9, which is very far from the mean. so, extracting data for SibSp column where value =8, one can find it's logical, it looks like a big family traveling together their names: Sage, Master. Thomas Henry
Sage, Miss. Constance Gladys
Sage, Mr. Frederick
Sage, Mr. George John Jr
Sage, Miss. Stella Anna
Sage, Mr. Douglas Bullen
Sage, Miss. Dorothy Edith
they all share first name "Sage" , Ticket: CA. 2343 Fare: 69.55 Embark: S and they all miss age and Cabin data.¶
so their data is acceptable and could be included.
- for Parch column, also the value 6 is far from mean, the value belongs to a lady with 43 years old, it looks logical also at that time to have 6 children , especially she rided from Embark, same as previous outlier( it seems people near that city are having big families)
the following table summarizes the survival versus casualty.
| Set | Total | Survived | Non-Survived |
|---|---|---|---|
| Train Data | 891 | 342 | 549 |
| Original numbers | 2224 | 722 | 1502 |
Data Wrangling¶
train data set can be found in Kaggle website https://www.kaggle.com/c/titanic/data
1- name field will be excluded in analysis as it's not applicable to detect any pattern from it.
2- age will be groups to periods of 10's of years to simplify analysis, undefined age will be replaced with mean when comparing calculating ratio between survivals and non-survivals as number of missing fields in age is 177, but the ratio of survival is the close to ratio of survival of total data but it won't be included when exploring data separately as it could impact the analysis.
3- fares will be grouped also in packages of 25 each.
4- undefined cabins will be represented as 'na' as there is 687 cabin data missing out of 891, it's unlikely that cabin will give any solid result
5- remaining fields (Pclass, Parch, SibSp) will be represented as it's.
6- after investigating outliers in previous section, no data will be excluded.
7- data will be splitted to survived and non-survived to simplify comparisons
# wrangling data
# 1- grouping age by groups of ten years,unknown Age will be replaced with mean age
# 2- grouping ticket fare, as it may include a hidden pattern about the location of cabin within the same class
# making groups of 25
agemean=titanictraindf['Age'].mean()
age=(titanictraindf['Age'].fillna(agemean)/10).astype(int)*10
#agewithoutnull=titanictraindf['Age'].dropna()
Fare=(titanictraindf['Fare']/25).astype(int)*25 #.fillna(-25)
titanictrainwrangleddf=pd.DataFrame(
{'PassengerId':titanictraindf['PassengerId'], 'Survived':titanictraindf['Survived'],
'SibSp':titanictraindf['SibSp'], 'Age':age ,'OriginalAge':titanictraindf['Age'] ,
'Pclass':titanictraindf['Pclass'],'Sex':titanictraindf['Sex'],
'Name':titanictraindf['Name'],'Parch':titanictraindf['Parch'],
'Ticket':titanictraindf['Ticket'],'Fare':Fare,
'Cabin':titanictraindf['Cabin'].fillna('na'),'Embarked':titanictraindf['Embarked']} )
# splitting data to survived and nonsurvived to compare
survived=titanictrainwrangleddf[titanictrainwrangleddf['Survived']==1]
nonsurvived=titanictrainwrangleddf[titanictrainwrangleddf['Survived']==0]
1. Age Analysis¶
resultOfAgeFactor=compare_operators_with_merge(survived,nonsurvived,'Age')
n_bins = 10
fig, axes = plt.subplots(nrows=1, ncols=3 , figsize=(9, 3))
ax0, ax1 ,ax2 = axes.flatten()
ax0.hist(survived['OriginalAge'].dropna(), 8, histtype='bar', label=survived ,color='green')
ax0.legend(prop={'size': 10})
ax0.set_ylabel('number of persons')
ax0.set_xlabel('survived persons per age')
ax0.set_title('survivals')
ax1.hist(nonsurvived['OriginalAge'].dropna(), 8, histtype='bar', label=nonsurvived,color='red')
ax1.legend(prop={'size': 10})
ax1.set_title('nonsurvivals')
ax1.set_xlabel('non-survived persons per age')
ax2.hist(titanictraindf['Age'].dropna(), 8, histtype='bar',color='blue')
ax2.legend(prop={'size': 10})
ax2.set_title('complete population')
ax2.set_xlabel('total persons per age')
fig.tight_layout()
#fig.suptitle('Categorical Plotting')
plt.show()
Graph 1.1 Conclusion¶
- the three histograms looks similar as they are positively skewed graphs.
- however survival graph has bigger ratio for kids period less than 10 years. this will be clearer when comparing both (survived/nonsurvived) graphes beside each other in the following section.
hist_sex=plt.hist([survived['Age'], nonsurvived['Age']], color=['g','r'], alpha=0.8, bins=8,normed=True,label=['Survived','Non-Survived'])
plt.title("Normalized comparison between survived vs Non-survived over decades.")
plt.xlabel="Age periods"
plt.ylabel("Ratio of Survival")
plt.legend()
plt.show()
Graph 1.2 Conclusion¶
in this graph, comparison is simpler.
- kids below 10 had higher probability of survive.
- in most age periods, survive and non-survive are close, except for period (20-30), probability of survival is less than other ages, and period between (70-80) also.
2. Gender Analysis¶
resultOfSexFactor=compare_operators_with_merge(survived,nonsurvived,'Sex')
labels =resultOfSexFactor['Sex']
sex_ratio = resultOfSexFactor['result']*100
explode = (0, 0.1)
fig1, ax1 = plt.subplots()
ax1.pie(sex_ratio, explode=explode, labels=labels, autopct='%1.1f%%',shadow=True, startangle=90,colors=['g','blue'])
ax1.axis('equal')
ax1.legend()
ax1.set_title('Gender Survival Ratio Graph')
#ax1.set_xlabel("survival ratio (femals vs. males)")
plt.show()
titanictrainwrangleddf['Sex'].describe()
Conclusion¶
in this graph, it shows the survival ratio comparison between females vs males.
- it's quite clear that females ratio dominates males ratio by several multiples, so although percentage of females in train sample is about 35.24% and male percentage is 577/891 =64.76% but survival ratio in females is 92.5% while it's 7.5% in males. for me it shows a noble behavior.
3. Class Analysis¶
resultOfClassFactor=compare_operators_with_merge(survived,nonsurvived,'Pclass')
resultOfClassFactor.columns=['Class','Normalized Ratio']
ax =resultOfClassFactor.plot.bar(resultOfClassFactor['Class'],legend=True).set_ylabel('relative survival ratio ')
plt.suptitle("Class Normalized Survival Ratio Graph")
plt.show()
Conclusion¶
in this graph, Normalized survival ratio shows that people in first class have much higher survival propability than second class (arround the double) and second class people have higher survival probability than third class people (almost trible)
- it's very sad fact that poor people have one sixth survival probability of rich people chance.
4. Sibling/Spouse Analysis¶
resultOfSiblingSpouseFactor=compare_operators_with_merge(survived,nonsurvived,'SibSp')
resultOfSiblingSpouseFactor.columns=['SibSp','Normalized Ratio']
ax =resultOfSiblingSpouseFactor.plot.bar(resultOfSiblingSpouseFactor['SibSp'],legend=True).set_ylabel('relative survival ratio ')
plt.suptitle("Sibling/Spouse Normalized Survival Ratio Graph")
plt.show()
Conclusion¶
This graph shows the normalized survival ratio for different categories of persons based on the number of siblings/spouse they have.
- persons with 1 Sibling/spouse have higher probability to survive, then persons with 2 then with None.
- graph didn't show ratio for people who have more than 4 sibling/spouse, it means none of those having more than 4 sibling/spouse have survived.
5. Parent/Child Analysis¶
resultOfParentChildFactor=compare_operators_with_merge(survived,nonsurvived,'Parch')
resultOfParentChildFactor.columns=['Parch','Normalized Ratio']
ax =resultOfParentChildFactor.plot.bar(resultOfParentChildFactor['Parch'],legend=True).set_ylabel('relative survival ratio ')
plt.suptitle("Parent/Child Normalized Survival Ratio Graph")
plt.show()
Conclusion¶
in this graph, it shows the survival ratio comparison between people with different number of parent/children.
- persons who have 3 related parent/children have higher survival probability than others.
- graph didn't show ratio for people who have more than 5 parent/children, it means none of those having more than 5 have survived.
6. Cabin Analysis¶
resultOfCabinFactor=compare_operators_with_merge(survived,nonsurvived,'Cabin')
resultOfCabinFactor.columns=['Cabin','Normalized Ratio']
ax =resultOfCabinFactor.plot.bar(resultOfCabinFactor['Cabin'],legend=True).set_ylabel('relative survival ratio ')
plt.suptitle("Cabin Normalized Survival Ratio Graph")
plt.show()
Conclusion¶
This graph shows the survival ratio comparison between different cabins. however there is 687 missing cabin data out of 891, so this analysis is useless, just done for a complete analysis coverage.
7. Fare Analysis¶
resultOfFareFactor=compare_operators_with_merge(survived,nonsurvived,'Fare')
resultOfFareFactor.columns=['Fare','Normalized Ratio']
ax =resultOfFareFactor.plot.bar(resultOfFareFactor['Fare'],legend=True).set_ylabel('relative survival ratio ')
plt.suptitle("Fare Normalized Survival Ratio Graph")
plt.show()
titanictrainwrangleddf.groupby(['Pclass'])['Fare'].describe()
Conclusion¶
This graph shows the survival ratio for different fare categories.
- it's quite clear that fares within category arround 125 has dominant survival ratio than all other categries, at least double of the category coming next to it.
- it's quite logical to have all high ratios of survivals with fares of 75 and above, it complies with the class fares as explained in table below the graph, as we concluded before that people in first class had about double survival ratio higher than people in second class and 6 times survival ratio than people in third class.
8. Embark Analysis¶
resultOfPortFactor=compare_operators_with_merge(survived,nonsurvived,'Embarked')
resultOfPortFactor.columns=['Embarked','Normalized Ratio']
ax =resultOfPortFactor.plot.bar(resultOfPortFactor['Embarked'],legend=True).set_ylabel('relative survival ratio ')
plt.suptitle("Embark Normalized Survival Ratio Graph")
plt.show()
Conclusion¶
in this graph, it shows the survival ratio comparison between different embarks.
- people from embark Cherbourg had double higher survival probability than people from Queenstown or Southampton
Findings¶
from previous simple analysis, it's clear that there are existing factors that increased the probability of survival.
1- the most important factor is passenger's gender, it's very clear that probability of survival for females exceeds males more than 10 times.
2- second factor is the passenger's class. first class passengers are almost 200% probable to survive more than second class but more than 500% than third class
3- third important factor is age, kids less than 10 years are more probable to be rescued than any other age periods 200% at least.
after that , there are some factors are almost similar in importace but not very important like previous 3
4- having single sibling or spouse has higher probability of survival than others. 5- having 3 children is strangely has higher probability of survival.
Unknown Finding¶
6- Cabins { D & F2 } has almost twice higher probability of survival than other cabins, this can be ignored as majority of cabin data is missing
Strange findings¶
7- Cherbourg port passengers had twice probability of survival than Southampton and Queenstown ports
actually, the tragedy of Titanic movie can be clearer now, as it clarifies love relationship between 2 persons, one of them with highest probability of survival versus the other with lowest probability. it's the same difference between the same 2 persons in real life, where there is a big difference on social scale between both persons.
References¶
https://www.kaggle.com/c/titanic/data http://pandas.pydata.org/pandas-docs/stable/dsintro.html#panel http://matplotlib.org/api/pyplot_api.html http://pandas.pydata.org/pandas-docs/stable/groupby.html https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html#pandas.DataFrame.plot