[Kaggle] Titanic Competition

SiLaure's Data

data_soin 2021. 7. 30. 01:46

Reference : https://www.kaggle.com/c/titanic

1. Loading the Titanic data and setting up the environment

1) Import libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('ggplot') # Use the ggplot style.

2) Load the Titanic data

train = pd.read_csv("../../../../../Kaggle/data/titanic/train.csv")
test = pd.read_csv("../../../../../Kaggle/data/titanic/test.csv")

3) Drop the columns that are not needed for the survival analysis

Ⅰ. Checking and handling missing values (imputation)

Check for missing values

train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
#   Column       Non-Null Count  Dtype
---  ------       --------------  -----
0   PassengerId  891 non-null    int64
1   Survived     891 non-null    int64
2   Pclass       891 non-null    int64
3   Name         891 non-null    object
4   Sex          891 non-null    object
5   Age          714 non-null    float64
6   SibSp        891 non-null    int64
7   Parch        891 non-null    int64
8   Ticket       891 non-null    object
9   Fare         891 non-null    float64
10  Cabin        204 non-null    object
11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

Columns that are not needed: PassengerId, Name, Cabin, Ticket

train.PassengerId.nunique()
train.Name.nunique()

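Both calls return 891, so every row holds a different value and the rows cannot be grouped by these columns.
If every value is unique (each one identifies a single passenger), the column carries no information for this analysis.
cf) Ticket.nunique() is less than 891 because passengers travelling together share a ticket.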
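The uniqueness check above can be applied to all columns at once. The following is a minimal sketch on a made-up toy frame (the values are not from the Titanic data); the name id_like is a hypothetical helper variable.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Ticket": ["A1", "A1", "B2", "C3"],  # two passengers share ticket A1
    "Pclass": [1, 1, 3, 3],
})

# Number of distinct values per column.
unique_counts = toy.nunique()

# Columns whose distinct-value count equals the row count behave like identifiers.
id_like = unique_counts[unique_counts == len(toy)].index.tolist()
print(unique_counts)
print(id_like)
```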

Drop the columns

train.drop(columns = ["PassengerId", "Name", "Cabin", "Ticket"], inplace=True)

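By default, drop() returns a new DataFrame and leaves the original train unchanged.

To change train itself, either reassign the result, train = train.drop(columns = [""]), or pass inplace=True, train.drop(columns = [""], inplace=True).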
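The difference between the two forms can be seen on a small example. This is a sketch on a made-up toy frame, not on the actual train data.

```python
import pandas as pd

# Toy frame standing in for train; the values are made up.
df = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Sex": ["male", "female", "female"],
})

# Without inplace=True, drop() returns a new DataFrame and df keeps all its columns.
dropped = df.drop(columns=["PassengerId"])
cols_before = list(df.columns)
cols_dropped = list(dropped.columns)

# Reassigning the result is one way to make the change stick.
df = df.drop(columns=["PassengerId"])
cols_after = list(df.columns)
```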

Ⅱ. Encoding categorical features (the columns whose dtype is "object")
Ordinal Encoding / One-hot Encoding

💡 If one-hot encoding is done first,
the per-feature analysis becomes more complicated,
so it is done at the very end.💡

● Categorical features cannot be used in arithmetic, so they are converted to numbers.
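The post does not show the encoding step itself, so the following is only a sketch of what the two encodings could look like in pandas. The toy values and the 0/1 mapping for Sex are assumptions made for illustration.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# Ordinal encoding: map each category to an integer (the mapping is an arbitrary choice).
toy["Sex_code"] = toy["Sex"].map({"male": 0, "female": 1})

# One-hot encoding: one indicator column per category.
onehot = pd.get_dummies(toy["Embarked"], prefix="Embarked")

encoded = pd.concat([toy[["Sex_code"]], onehot], axis=1)
print(encoded)
```

Ordinal codes imply an order between categories, which is why one-hot encoding is often preferred for features such as Embarked that have no natural order.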

2. Per-feature analysis using pivot tables

Create a pivot table to look at the data mainly by sex.
A pivot table is a data table rearranged around a chosen key (the pivot).
The default aggregation is the mean.

grouped = pd.pivot_table(data=train, index=["Sex"])

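Q1. Why are some columns missing from the pivot table?

A1. Embarked holds strings, so its mean cannot be computed. (Older pandas versions silently leave such columns out; newer versions may raise an error instead, in which case the numeric columns have to be selected with values=.)

Q2. What does the number in the Survived column of the pivot table mean?

A2. Survived is 1 for passengers who survived and 0 for those who died, so its mean (the sum divided by the number of passengers) is the survival rate.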
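A small check of A2, as a sketch on made-up toy data: the default mean of a 0/1 column equals survivors divided by passengers in each group.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "male", "female", "female"],
    "Survived": [0, 0, 0, 1, 1, 0],
})

# Default aggfunc is the mean; values= restricts the table to the numeric column.
rate = pd.pivot_table(data=toy, index=["Sex"], values=["Survived"])

# The same number computed by hand: survivors / passengers per group.
by_sex = toy.groupby("Sex")["Survived"]
manual = by_sex.sum() / by_sex.count()
print(rate)
```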

bar plot

grouped.plot()
grouped.plot(kind='barh')

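1) To compare features with different scales in one plot, they need to be normalized, e.g. standardized to zero mean and unit variance.
2) Check that each group has enough samples for the comparison to be meaningful.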
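One common way to do the normalization in point 1 is z-score standardization. The sketch below uses made-up Age and Fare values; the choice of z-scores is an assumption, since the post does not say which method was used.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "Age": [20.0, 30.0, 40.0],
    "Fare": [10.0, 100.0, 400.0],
})

# Subtract each column's mean and divide by its standard deviation.
# pandas uses the sample standard deviation (ddof=1) by default.
standardized = (toy - toy.mean()) / toy.std()
print(standardized)
```

After this step both columns have mean 0 and standard deviation 1, so their bars can share one axis.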

- Pivot Table with multi-index (viewing several features at once)

Create a pivot table grouped by sex and passenger class.

grouped = pd.pivot_table(data=train, index=["Sex", "Pclass"])
grouped

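With only grouped = pd.pivot_table(data=train, index=["Pclass"]), the table shows the figures for all passengers in each class, without the split by sex.

The survival rate of female passengers in Pclass 1 is close to 1.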

- Aggregation function in Pivot Tables

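Compute whichever statistics are needed in the pivot table,
e.g. mean(), sum(), min(), and so on.

aggfunc = {"column name" : np.function}

The function is passed as an object rather than called, so no () follows its name. String names such as "mean" or "sum" are accepted as well.

Using the pivot table indexed by sex and passenger class, compute the mean of Age and the sum of Survived.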

grouped = pd.pivot_table(data=train, index=["Sex", "Pclass"], aggfunc={"Age" : np.mean, "Survived" : np.sum})
grouped
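The same call can be tried on a few rows to see what each entry of the aggfunc dictionary does. This sketch uses made-up toy data and the string names "mean" and "sum" in place of the NumPy functions.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male"],
    "Pclass": [1, 1, 3, 3],
    "Age": [30.0, 40.0, 20.0, 22.0],
    "Survived": [1, 1, 0, 1],
})

# One aggregation per column: average age, number of survivors.
summary = pd.pivot_table(
    data=toy,
    index=["Sex", "Pclass"],
    aggfunc={"Age": "mean", "Survived": "sum"},
)
print(summary)
```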

Calling value_counts() at the start would have shown the size of each group up front.

train.Sex.value_counts()

Output:
male 577
female 314
Name: Sex, dtype: int64

The values= option restricts the pivot table to specific columns.

grouped = pd.pivot_table(data=train, index=["Sex", "Pclass"], 
                         values=["Survived"], aggfunc=np.sum)

grouped
# multi-index indexing

bar plot

grouped.plot(kind="barh")

Each index entry is a (Sex, Pclass) tuple.
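Because the index entries are tuples, a single row is selected by passing a tuple to .loc. A minimal sketch on made-up toy data:

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male"],
    "Pclass": [1, 1, 3, 1, 3],
    "Survived": [1, 1, 0, 0, 1],
})

counts = pd.pivot_table(data=toy, index=["Sex", "Pclass"],
                        values=["Survived"], aggfunc="sum")

# The index is a MultiIndex whose entries are (Sex, Pclass) tuples.
index_entries = list(counts.index)

# Select one row with a tuple, or all classes of one sex with the first level only.
first_class_women = counts.loc[("female", 1), "Survived"]
women = counts.loc["female"]
print(counts)
```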

The columns= option spreads Pclass across the columns, so the table shows the number of survivors by sex (rows) and passenger class (columns).

grouped = pd.pivot_table(data=train, 
                         index=["Sex"], 
                         columns=["Pclass"], 
                         values=["Survived"], 
                         aggfunc=np.sum)
grouped
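To see how columns= changes the layout, the sketch below builds the same kind of table on made-up toy data and reads one cell. Since values= is given as a list, the column labels are (value, Pclass) pairs.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male"],
    "Pclass": [1, 1, 3, 1, 3],
    "Survived": [1, 1, 0, 0, 1],
})

# Rows are Sex, columns are Pclass, cells are survivor counts.
wide = pd.pivot_table(data=toy, index=["Sex"], columns=["Pclass"],
                      values=["Survived"], aggfunc="sum")

# A cell is addressed by the row label and the (value, Pclass) column label.
women_first_class = wide.loc["female", ("Survived", 1)]
print(wide)
```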

grouped.plot(kind="barh")

- Handling missing data in Pivot Tables

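With real datasets, missing information (lost or never collected) always has to be handled,
because ML methods cannot be applied after preprocessing while values are still missing.

Cases in which missing values are handled

1) Missing values in the original data
2) Missing values that appear in the pivot table

Code for finding missing values: isnull()
To see which rows (passengers) contain a missing value, use any(axis=1); axis=1 checks across the columns of each row, whereas a plain any() reports per column.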

train.isnull()
train.isnull().any(axis=1)

train[train.isnull().any(axis=1)]
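The effect of the axis argument is easiest to see on a tiny frame. This is a sketch on made-up toy data.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0],
    "Embarked": ["S", "C", None],
})

# Default axis=0: one True/False per column.
per_column = toy.isnull().any()

# axis=1: one True/False per row (passenger).
per_row = toy.isnull().any(axis=1)

# Boolean indexing keeps only the rows that have at least one missing value.
rows_with_gaps = toy[per_row]

# Number of missing values per column.
missing_counts = toy.isnull().sum()
print(rows_with_gaps)
```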

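pandas pivot_table offers two options for handling NaN values.

1) dropna - do not include columns whose entries are all NaN in the pivot table: drop
2) fill_value - replace missing values in the resulting pivot table with the specified value: fill

Alternatively, instead of discarding the missing values in the original data, they can be replaced with the mean or the minimum/maximum.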

Fill the missing values in the Age column with the mean of the Age column.

train["Age"] = train["Age"].fillna(train["Age"].mean())
train
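A short sketch of the mean fill on made-up toy data. It assigns the result back to the column, which does not rely on inplace=True acting on a selected column.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({"Age": [20.0, np.nan, 40.0]})

# mean() skips NaN by default, so this is the mean of the known ages.
mean_age = toy["Age"].mean()

# fillna() returns a new Series; assign it back to update the frame.
toy["Age"] = toy["Age"].fillna(mean_age)
print(toy)
```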

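# Fill the missing Embarked values with the most frequent value among Pclass 1, female passengers
# 1) Check the missing values
# train[train.isnull().any(axis=1)]

# 2) Check the most frequent value
train[(train.Pclass == 1) & (train.Sex == "female")].Embarked.value_counts()

# 3-1) Fill with fillna (assign the result back to the column)
train["Embarked"] = train["Embarked"].fillna("S")

# 3-2) Or assign directly to the rows where Embarked is missing
train.loc[train["Embarked"].isnull(), "Embarked"] = "S"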
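Instead of reading the most frequent value off value_counts() and typing it in, it can be computed with mode(). This is a sketch on made-up toy data; the name most_common is a hypothetical helper variable.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 1, 3],
    "Sex": ["female", "female", "female", "female", "male"],
    "Embarked": ["S", "S", "C", None, "Q"],
})

# Most frequent Embarked value among first-class women (mode() ignores missing values).
subset = toy[(toy["Pclass"] == 1) & (toy["Sex"] == "female")]
most_common = subset["Embarked"].mode().iloc[0]

# Fill only the rows where Embarked itself is missing.
toy.loc[toy["Embarked"].isnull(), "Embarked"] = most_common
print(toy)
```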

Verify

train[train.isnull().any(axis=1)]
pd.pivot_table(data=train, index=["Sex"])

- Building the pivot table

How to fill the NaN values that appear in the pivot table

grouped = pd.pivot_table(data=train,
                        index=["Sex", "Survived", "Pclass"],
                        columns=["Embarked"],
                        values=["Age"]) # , fill_value=train.Age.mean()
grouped
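A pivot table can contain NaN even when the source data has none, because some index/column combinations have no rows. The sketch below, on made-up toy data, shows such a cell with and without fill_value.

```python
import pandas as pd

# Toy data, invented for illustration only: there is no male passenger with Embarked == "C".
toy = pd.DataFrame({
    "Sex": ["female", "female", "male"],
    "Embarked": ["C", "S", "S"],
    "Age": [30.0, 40.0, 20.0],
})

# The (male, C) cell has no rows behind it, so it comes out as NaN.
with_nan = pd.pivot_table(data=toy, index=["Sex"], columns=["Embarked"], values="Age")

# fill_value replaces such empty cells, here with the overall mean age.
filled = pd.pivot_table(data=toy, index=["Sex"], columns=["Embarked"], values="Age",
                        fill_value=toy["Age"].mean())
print(with_nan)
print(filled)
```

Whether the overall mean is a sensible filler depends on the analysis; it is used here only because the commented-out option in the code above suggests it.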

If the other columns show the same values after the missing values are handled,
the values in those columns do not depend on the column that was filled.

License: Attribution, NonCommercial, NoDerivatives
