Sunday, December 30, 2018

Tableau (create multiline graph without date, and with discrete dimension) USA Flights Delay

Steps:
1- drag the dimension to Columns shelf
2- drag the measures to row shelf , it will create multi bar graphs vertically
3- change the chart to "side-by-side" bars

4-change in marks panel, from bar to line
5- remove measure names from the columns shelf. (this is the trick)



Sunday, September 10, 2017

Titanic (survival analysis)

Titanic Project

Titanic Disaster Analysis

Introduction

one of most famous disasters in human history, happened in April 1912.

people who didn't know about it from books, knew about it from the high budget production of tragedic romantic movie with the same name.

for me, it represents how human arrogance can lead to severe consequences. the captin underestimated the iceburg mass and manufacturer overestimated the ship power to the limit that they didn't provide enough survival boats. which decreased the probability of survive when disaster happened.

Data Analysis Process Phases

1- Questioning the data

2- Data Wrangling (data acquisition , data cleaning, datastoring)

3- Data Exploration (finding patterns)

4- Drawing conclusions (or even prediction)

5- Communicate findings

Questions

although survival in such disaster is almost random, but in the following analysis, we will try to answer the following questions:

1- is any factors that increased the probability of servival ?

and if so,

2- what are these factors and which factors have more impact than other?

Procedure (Work done to investigate questions)

this analysis will be conducted by comparing the ratios of surviving persons with different criteria within same category together, for example, comparing ratio for females survival versus males in terms of Sex category, and comparing ratio of survivals of passengers for different age periods.

a simple python function is self explaining in code below, used to determine this ratio value, and it will be called for almost each criteria serparately.

finding should be confirmation or rejection of the hypothes that some factors affect or correlate with survival probablitiy. and hence to determine which criteria had more impact than other, and which is neutral.

Pre Analysis

reading files and declaring a comparison function that will be used later to compare different data aspects.

i will name 2 data sets here:

one with survival attribute as titanictraindf and the other as testtitanictraindf

In [7]:
# importing the necessary libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
%matplotlib inline
%config InlineBackend.figure_format = 'retina' 
plt.style.use('seaborn-deep')

# reading data from files
titanicfilename = './titanic-data.csv'
titanictraindf = pd.read_csv(titanicfilename)
testtitanicfilename = './test.csv'
testtitanictraindf = pd.read_csv(testtitanicfilename)

# this function compares the survived vs non-survived ratio for the passed columns
def compare_operators_with_merge(survived,nonservived,columnName):
    # 1- making 2 groups survived / non-survived based on passed column
    survivedgroup=survived[['PassengerId','Survived',columnName]].groupby(['Survived',columnName],as_index=False).count()
    nonsurvivedgroup=nonsurvived[['PassengerId','Survived',columnName]].groupby(['Survived',columnName],as_index=False).count()
    # 2- join the 2 groups on the passed column to compare from left group vs right group
    merged=survivedgroup.merge(nonsurvivedgroup, on=[columnName], how='inner')
    # 3- get the relative ratio between different items of passed column , and normalizing it 
    result=merged['PassengerId_x']/merged['PassengerId_y']
    return (pd.DataFrame({columnName:merged[columnName].astype(str) ,'result':result/result.sum()}))

Data Wrangling

before wrangling, data summarization is needed to guide our wrangling and analysis

here we will compare train data set summary versus test data set summary to check if our train data is representing a good sample or biased sample. summary will show me statistics of numerical data Cabin and Sex should be compared separately.

In [8]:
# displaying summary of numbers
print('\r\n\x1b[1;31m'+'Summary of train and test data sets (numeric fields)'+'\x1b[0m')
print(titanictraindf.describe())
print(testtitanictraindf.describe())
ratio=titanictraindf[['Pclass','Age','SibSp','Parch','Fare']].describe()/testtitanictraindf[['Pclass','Age','SibSp','Parch','Fare']].describe()
print('\r\n\x1b[1;31m'+'Ratio between train and test data sets'+'\x1b[0m')
print (ratio)

print('\r\n\x1b[1;31m'+'Exploring Sex and Cabin attributes(non numeric)'+'\x1b[0m')
print (titanictraindf['Sex'].describe())
print (testtitanictraindf['Sex'].describe())
print ('')
print (titanictraindf['Cabin'].describe())
print (testtitanictraindf['Cabin'].describe())
print('\r\n\x1b[1;31m'+'missing data in Age and Cabin'+'\x1b[0m')
print('number of persons with missing age data',titanictraindf[titanictraindf['Age'].isnull()]['Age'].size )
print('number of persons with missing cabin data',titanictraindf[titanictraindf['Cabin'].isnull()]['Cabin'].size )
nullAgeData=titanictraindf[titanictraindf['Age'].isnull()]
print ('survived persons with missing age',nullAgeData[nullAgeData['Survived']==1]['Age'].size)
print ('non survived persons with missing age',nullAgeData[nullAgeData['Survived']==0]['Age'].size)

print ('sample survived',titanictraindf[titanictraindf['Survived']==1]['PassengerId'].size)
print ('sample nonsurvived',titanictraindf[titanictraindf['Survived']==0]['PassengerId'].size)

print ('total survived ',722)
print ('total nonsurvived ',1502)
Summary of train and test data sets (numeric fields)
       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000   
mean    446.000000    0.383838    2.308642   29.699118    0.523008   
std     257.353842    0.486592    0.836071   14.526497    1.102743   
min       1.000000    0.000000    1.000000    0.420000    0.000000   
25%     223.500000    0.000000    2.000000   20.125000    0.000000   
50%     446.000000    0.000000    3.000000   28.000000    0.000000   
75%     668.500000    1.000000    3.000000   38.000000    1.000000   
max     891.000000    1.000000    3.000000   80.000000    8.000000   

            Parch        Fare  
count  891.000000  891.000000  
mean     0.381594   32.204208  
std      0.806057   49.693429  
min      0.000000    0.000000  
25%      0.000000    7.910400  
50%      0.000000   14.454200  
75%      0.000000   31.000000  
max      6.000000  512.329200  
       PassengerId      Pclass         Age       SibSp       Parch        Fare
count   418.000000  418.000000  332.000000  418.000000  418.000000  417.000000
mean   1100.500000    2.265550   30.272590    0.447368    0.392344   35.627188
std     120.810458    0.841838   14.181209    0.896760    0.981429   55.907576
min     892.000000    1.000000    0.170000    0.000000    0.000000    0.000000
25%     996.250000    1.000000   21.000000    0.000000    0.000000    7.895800
50%    1100.500000    3.000000   27.000000    0.000000    0.000000   14.454200
75%    1204.750000    3.000000   39.000000    1.000000    0.000000   31.500000
max    1309.000000    3.000000   76.000000    8.000000    9.000000  512.329200

Ratio between train and test data sets
         Pclass       Age     SibSp     Parch      Fare
count  2.131579  2.150602  2.131579  2.131579  2.136691
mean   1.019020  0.981056  1.169076  0.972599  0.903922
std    0.993150  1.024348  1.229698  0.821310  0.888850
min    1.000000  2.470588       NaN       NaN       NaN
25%    2.000000  0.958333       NaN       NaN  1.001849
50%    1.000000  1.037037       NaN       NaN  1.000000
75%    1.000000  0.974359  1.000000       NaN  0.984127
max    1.000000  1.052632  1.000000  0.666667  1.000000

Exploring Sex and Cabin attributes(non numeric)
count      891
unique       2
top       male
freq       577
Name: Sex, dtype: object
count      418
unique       2
top       male
freq       266
Name: Sex, dtype: object

count         204
unique        147
top       B96 B98
freq            4
Name: Cabin, dtype: object
count                  91
unique                 76
top       B57 B59 B63 B66
freq                    3
Name: Cabin, dtype: object

missing data in Age and Cabin
number of persons with missing age data 177
number of persons with missing cabin data 687
survived persons with missing age 52
non survived persons with missing age 125
sample survived 342
sample nonsurvived 549
total survived  722
total nonsurvived  1502

conclusion from first data exploration

  • data sampling looks good as ratio between train data set and test data set is almost 1 (mean and std) in most attributes, especially for Pclass and Age , ratio is less in other attributes

  • majority of passingers are male, 577 out of 891 in train data set, 266 out of 418 in test data set ratio is also similar between 63 -64 % of passengers so it's also a good unbiased sample.

  • most of data is existing , except for Age and Cabin attributes

    • 177 out of 891 age data is missing, survival ratio of missing age is (52/177)=29.4%
    • 687 out of 891 cabin data is missing, so it won't be considered in the analysis
  • survival ratio in train dataset is 38.4% , total survival ratio is 32.5%

    • so survival ratio in train data set is higher than actual ratio, this should be considered when predicting the survival probability in test data set

Outliers.

  • in both train and test data summary, data looks good with acceptable distance from the mean (less than 2* std) except for parent/children sibling/spouse. where there is values of 8 and 9, which is very far from the mean. so, extracting data for SibSp column where value =8, one can find it's logical, it looks like a big family traveling together their names: Sage, Master. Thomas Henry

    Sage, Miss. Constance Gladys

    Sage, Mr. Frederick

    Sage, Mr. George John Jr

    Sage, Miss. Stella Anna

    Sage, Mr. Douglas Bullen

    Sage, Miss. Dorothy Edith

they all share first name "Sage" , Ticket: CA. 2343 Fare: 69.55 Embark: S and they all miss age and Cabin data.

so their data is acceptable and could be included.

  • for Parch column, also the value 6 is far from mean, the value belongs to a lady with 43 years old, it looks logical also at that time to have 6 children , especially she rided from Embark, same as previous outlier( it seems people near that city are having big families)

the following table summarizes the survival versus casualty.

Set Total Survived Non-Survived
Train Data 891 342 549
Original numbers 2224 722 1502

Data Wrangling

train data set can be found in Kaggle website https://www.kaggle.com/c/titanic/data

1- name field will be excluded in analysis as it's not applicable to detect any pattern from it.

2- age will be groups to periods of 10's of years to simplify analysis, undefined age will be replaced with mean when comparing calculating ratio between survivals and non-survivals as number of missing fields in age is 177, but the ratio of survival is the close to ratio of survival of total data but it won't be included when exploring data separately as it could impact the analysis.

3- fares will be grouped also in packages of 25 each.

4- undefined cabins will be represented as 'na' as there is 687 cabin data missing out of 891, it's unlikely that cabin will give any solid result

5- remaining fields (Pclass, Parch, SibSp) will be represented as it's.

6- after investigating outliers in previous section, no data will be excluded.

7- data will be splitted to survived and non-survived to simplify comparisons

In [9]:
# wrangling data 
# 1- grouping age by groups of ten years,unknown Age will be replaced with mean age
# 2- grouping ticket fare, as it may include a hidden pattern about the location of cabin within the same class
#    making groups of 25 
agemean=titanictraindf['Age'].mean()
age=(titanictraindf['Age'].fillna(agemean)/10).astype(int)*10
#agewithoutnull=titanictraindf['Age'].dropna()
Fare=(titanictraindf['Fare']/25).astype(int)*25  #.fillna(-25)

titanictrainwrangleddf=pd.DataFrame(
    {'PassengerId':titanictraindf['PassengerId'],  'Survived':titanictraindf['Survived'],  
     'SibSp':titanictraindf['SibSp'], 'Age':age ,'OriginalAge':titanictraindf['Age'] ,
     'Pclass':titanictraindf['Pclass'],'Sex':titanictraindf['Sex'],
    'Name':titanictraindf['Name'],'Parch':titanictraindf['Parch'],
    'Ticket':titanictraindf['Ticket'],'Fare':Fare,
    'Cabin':titanictraindf['Cabin'].fillna('na'),'Embarked':titanictraindf['Embarked']} ) 

# splitting data to survived and nonsurvived to compare
survived=titanictrainwrangleddf[titanictrainwrangleddf['Survived']==1]
nonsurvived=titanictrainwrangleddf[titanictrainwrangleddf['Survived']==0]

1. Age Analysis

In [10]:
resultOfAgeFactor=compare_operators_with_merge(survived,nonsurvived,'Age')
n_bins = 10
fig, axes = plt.subplots(nrows=1, ncols=3 , figsize=(9, 3))
ax0, ax1 ,ax2 = axes.flatten()

ax0.hist(survived['OriginalAge'].dropna(), 8, histtype='bar', label=survived ,color='green')
ax0.legend(prop={'size': 10})
ax0.set_ylabel('number of persons')
ax0.set_xlabel('survived persons per age')
ax0.set_title('survivals')

ax1.hist(nonsurvived['OriginalAge'].dropna(), 8, histtype='bar', label=nonsurvived,color='red')
ax1.legend(prop={'size': 10})
ax1.set_title('nonsurvivals')
ax1.set_xlabel('non-survived persons per age')

ax2.hist(titanictraindf['Age'].dropna(), 8, histtype='bar',color='blue')
ax2.legend(prop={'size': 10})
ax2.set_title('complete population')
ax2.set_xlabel('total persons per age')

fig.tight_layout()
#fig.suptitle('Categorical Plotting')
plt.show()

Graph 1.1 Conclusion

  • the three histograms looks similar as they are positively skewed graphs.
  • however survival graph has bigger ratio for kids period less than 10 years. this will be clearer when comparing both (survived/nonsurvived) graphes beside each other in the following section.
In [11]:
hist_sex=plt.hist([survived['Age'], nonsurvived['Age']], color=['g','r'], alpha=0.8, bins=8,normed=True,label=['Survived','Non-Survived'])
plt.title("Normalized comparison between survived vs Non-survived over decades.")
plt.xlabel="Age periods"
plt.ylabel("Ratio of Survival")
plt.legend()

plt.show()

Graph 1.2 Conclusion

in this graph, comparison is simpler.

  • kids below 10 had higher probability of survive.
  • in most age periods, survive and non-survive are close, except for period (20-30), probability of survival is less than other ages, and period between (70-80) also.

2. Gender Analysis

In [12]:
resultOfSexFactor=compare_operators_with_merge(survived,nonsurvived,'Sex')
labels =resultOfSexFactor['Sex']
sex_ratio = resultOfSexFactor['result']*100
explode = (0, 0.1)  
fig1, ax1 = plt.subplots()
ax1.pie(sex_ratio, explode=explode, labels=labels, autopct='%1.1f%%',shadow=True, startangle=90,colors=['g','blue'])
ax1.axis('equal') 
ax1.legend()
ax1.set_title('Gender Survival Ratio Graph')
#ax1.set_xlabel("survival ratio  (femals vs. males)")
plt.show()

titanictrainwrangleddf['Sex'].describe()
Out[12]:
count      891
unique       2
top       male
freq       577
Name: Sex, dtype: object

Conclusion

in this graph, it shows the survival ratio comparison between females vs males.

  • it's quite clear that females ratio dominates males ratio by several multiples, so although percentage of females in train sample is about 35.24% and male percentage is 577/891 =64.76% but survival ratio in females is 92.5% while it's 7.5% in males. for me it shows a noble behavior.

3. Class Analysis

In [13]:
resultOfClassFactor=compare_operators_with_merge(survived,nonsurvived,'Pclass')
resultOfClassFactor.columns=['Class','Normalized Ratio']
ax =resultOfClassFactor.plot.bar(resultOfClassFactor['Class'],legend=True).set_ylabel('relative survival ratio ')
plt.suptitle("Class Normalized Survival Ratio Graph")
plt.show()

Conclusion

in this graph, Normalized survival ratio shows that people in first class have much higher survival propability than second class (arround the double) and second class people have higher survival probability than third class people (almost trible)

  • it's very sad fact that poor people have one sixth survival probability of rich people chance.

4. Sibling/Spouse Analysis

In [14]:
resultOfSiblingSpouseFactor=compare_operators_with_merge(survived,nonsurvived,'SibSp')
resultOfSiblingSpouseFactor.columns=['SibSp','Normalized Ratio']
ax =resultOfSiblingSpouseFactor.plot.bar(resultOfSiblingSpouseFactor['SibSp'],legend=True).set_ylabel('relative survival ratio ')
plt.suptitle("Sibling/Spouse Normalized Survival Ratio Graph")
plt.show()

Conclusion

This graph shows the normalized survival ratio for different categories of persons based on the number of siblings/spouse they have.

  • persons with 1 Sibling/spouse have higher probability to survive, then persons with 2 then with None.
  • graph didn't show ratio for people who have more than 4 sibling/spouse, it means none of those having more than 4 sibling/spouse have survived.

5. Parent/Child Analysis

In [15]:
resultOfParentChildFactor=compare_operators_with_merge(survived,nonsurvived,'Parch')
resultOfParentChildFactor.columns=['Parch','Normalized Ratio']
ax =resultOfParentChildFactor.plot.bar(resultOfParentChildFactor['Parch'],legend=True).set_ylabel('relative survival ratio ')
plt.suptitle("Parent/Child Normalized Survival Ratio Graph")
plt.show()

Conclusion

in this graph, it shows the survival ratio comparison between people with different number of parent/children.

  • persons who have 3 related parent/children have higher survival probability than others.
  • graph didn't show ratio for people who have more than 5 parent/children, it means none of those having more than 5 have survived.

6. Cabin Analysis

In [16]:
resultOfCabinFactor=compare_operators_with_merge(survived,nonsurvived,'Cabin')
resultOfCabinFactor.columns=['Cabin','Normalized Ratio']
ax =resultOfCabinFactor.plot.bar(resultOfCabinFactor['Cabin'],legend=True).set_ylabel('relative survival ratio ')
plt.suptitle("Cabin  Normalized Survival Ratio Graph")
plt.show()

Conclusion

This graph shows the survival ratio comparison between different cabins. however there is 687 missing cabin data out of 891, so this analysis is useless, just done for a complete analysis coverage.

7. Fare Analysis

In [17]:
resultOfFareFactor=compare_operators_with_merge(survived,nonsurvived,'Fare')
resultOfFareFactor.columns=['Fare','Normalized Ratio']
ax =resultOfFareFactor.plot.bar(resultOfFareFactor['Fare'],legend=True).set_ylabel('relative survival ratio ')
plt.suptitle("Fare Normalized Survival Ratio Graph")
plt.show()

titanictrainwrangleddf.groupby(['Pclass'])['Fare'].describe()
Out[17]:
count mean std min 25% 50% 75% max
Pclass
1 216.0 76.504630 76.443808 0.0 25.0 50.0 75.0 500.0
2 184.0 10.190217 14.126351 0.0 0.0 0.0 25.0 50.0
3 491.0 3.564155 10.595795 0.0 0.0 0.0 0.0 50.0

Conclusion

This graph shows the survival ratio for different fare categories.

  • it's quite clear that fares within category arround 125 has dominant survival ratio than all other categries, at least double of the category coming next to it.
  • it's quite logical to have all high ratios of survivals with fares of 75 and above, it complies with the class fares as explained in table below the graph, as we concluded before that people in first class had about double survival ratio higher than people in second class and 6 times survival ratio than people in third class.

8. Embark Analysis

In [18]:
resultOfPortFactor=compare_operators_with_merge(survived,nonsurvived,'Embarked')
resultOfPortFactor.columns=['Embarked','Normalized Ratio']
ax =resultOfPortFactor.plot.bar(resultOfPortFactor['Embarked'],legend=True).set_ylabel('relative survival ratio ')
plt.suptitle("Embark Normalized Survival Ratio Graph")
plt.show()

Conclusion

in this graph, it shows the survival ratio comparison between different embarks.

  • people from embark Cherbourg had double higher survival probability than people from Queenstown or Southampton

Findings

from previous simple analysis, it's clear that there are existing factors that increased the probability of survival.

1- the most important factor is passenger's gender, it's very clear that probability of survival for females exceeds males more than 10 times.

2- second factor is the passenger's class. first class passengers are almost 200% probable to survive more than second class but more than 500% than third class

3- third important factor is age, kids less than 10 years are more probable to be rescued than any other age periods 200% at least.

after that , there are some factors are almost similar in importace but not very important like previous 3

4- having single sibling or spouse has higher probability of survival than others. 5- having 3 children is strangely has higher probability of survival.

Unknown Finding

6- Cabins { D & F2 } has almost twice higher probability of survival than other cabins, this can be ignored as majority of cabin data is missing

Strange findings

7- Cherbourg port passengers had twice probability of survival than Southampton and Queenstown ports

actually, the tragedy of Titanic movie can be clearer now, as it clarifies love relationship between 2 persons, one of them with highest probability of survival versus the other with lowest probability. it's the same difference between the same 2 persons in real life, where there is a big difference on social scale between both persons.