[Pandas] 04. DataFrame Indexing

SiLaure's Data

data_soin 2021. 7. 26. 00:10

Pandas has only two core data structures: DataFrame and Series.

- DataFrame Indexing

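Indexing: a way to find the elements of your data that satisfy a particular condition.

It is surprisingly powerful ("wait, this actually works?"):
you can easily find the rows of a whole DataFrame that satisfy a condition and then manipulate them.

DataFrame indexing mixes Python list indexing with NumPy fancy indexing, so reviewing those two first makes it easier to follow!

- A pandas DataFrame supports basic indexing by column name.

Indexing a DataFrame directly with [] selects columns.
To select several columns, pass the names as a list.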

# Index column A
df["A"]

Output:

2021-01-01    1.007072
2021-01-02    0.000000
2021-01-03    0.000000
2021-01-04    1.645999
2021-01-05    1.600783
2021-01-06    0.382107
Freq: D, Name: A, dtype: float64
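The same selection can be tried on a small stand-in frame. This is a minimal sketch: the post does not show how df was built, so the construction below (a 6-day date index, columns A to D, made-up values) is an assumption, and its numbers will not match the output above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: 6 days x 4 columns (values are made up).
dates = pd.date_range("2021-01-01", periods=6)
df = pd.DataFrame(
    np.arange(24, dtype=float).reshape(6, 4) - 10,
    index=dates,
    columns=list("ABCD"),
)

single = df["A"]          # one column name -> Series
several = df[["A", "B"]]  # list of column names -> DataFrame

print(type(single).__name__)   # Series
print(type(several).__name__)  # DataFrame
print(several.shape)           # (6, 2)
```

Note the double brackets in the second call: the outer pair is the indexing operator and the inner pair is the list of names.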

Indexing with a row label this way raises a KeyError.

df["2021-01-01"]

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~\anaconda3\envs\datascience\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3079             try:
-> 3080                 return self._engine.get_loc(casted_key)
   3081             except KeyError as err:

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: '2021-01-01'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
<ipython-input-102-1e1d21feff80> in <module>
      5
      6 # Indexing with a row label raises a KeyError.
----> 7 df["2021-01-01"]
      8 # -- Basic pandas indexing works like dictionary indexing.
      9 # == indexing by "key" == "key" == "column"

~\anaconda3\envs\datascience\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
   3022             if self.columns.nlevels > 1:
   3023                 return self._getitem_multilevel(key)
-> 3024             indexer = self.columns.get_loc(key)
   3025             if is_integer(indexer):
   3026                 indexer = [indexer]

~\anaconda3\envs\datascience\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3080                 return self._engine.get_loc(casted_key)
   3081             except KeyError as err:
-> 3082                 raise KeyError(key) from err
   3083
   3084         if tolerance is not None:

KeyError: '2021-01-01'

The KeyError tells us that a pandas DataFrame is organized by keys.
Basic pandas indexing works like dictionary indexing:
the "key" you index with is a column name.
Each column is a Series, and that column's name is its key.
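The dictionary analogy can be checked directly. A minimal sketch on a tiny made-up frame (the data and the column name "Z" are hypothetical); it uses the `in` operator, `keys()` and `get()`, which a DataFrame shares with dict:

```python
import pandas as pd

# Hypothetical two-row frame with a date index.
df = pd.DataFrame(
    {"A": [1.0, 2.0], "B": [3.0, 4.0]},
    index=pd.date_range("2021-01-01", periods=2),
)

print("A" in df)            # True: column names are the keys
print("2021-01-01" in df)   # False: row labels are not keys
print(list(df.keys()))      # ['A', 'B']

# dict-style lookup with a default value instead of a KeyError
print(df.get("Z", "no such column"))  # no such column
```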

Indexing by a specific date (by index label)

Use df.loc when you know the index label, that is, when what you are looking up is a row label.

df.loc["2021-01-01"]

A    1.007072
B    1.173345
C   -0.452918
D    0.285477
Name: 2021-01-01 00:00:00, dtype: float64

The result is a pd.Series.

type(df.loc["2021-01-01"])

Output:

pandas.core.series.Series

Indexing by position (by integer position in the index, not by label)

Use df.iloc when you know the position of the row.

df.iloc[0]

Output:

A    1.007072
B    1.173345
C   -0.452918
D    0.285477
Name: 2021-01-01 00:00:00, dtype: float64
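Because the first row here has the label 2021-01-01, loc and iloc return the same row. A short sketch on made-up data (the frame below is an assumed stand-in for df):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: 6 days x 4 columns.
dates = pd.date_range("2021-01-01", periods=6)
df = pd.DataFrame(
    np.arange(24, dtype=float).reshape(6, 4),
    index=dates,
    columns=list("ABCD"),
)

by_label = df.loc["2021-01-01"]  # look the row up by its index label
by_position = df.iloc[0]         # look the row up by its integer position

print(by_label.equals(by_position))  # True

# Negative positions count from the end, as with Python lists.
last = df.iloc[-1]
print(last.name)
```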

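Slicing a DataFrame with [] cuts it by rows.
Columns cannot be sliced with plain []; you have to name the columns you want.

A plain integer slice selects rows by position (zero-based).

# Slice the first 3 rows
df[:3]

# Since [] cannot slice columns, select them in one of these ways instead
# 1) List the column names
df[["A", "B", "C"]]

# 2) Slice the column Index, then select with the result
df.columns[:3]
# The result above holds "A", "B", "C"

df[df.columns[:3]]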
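Plain [] cannot slice columns, but the loc and iloc indexers can. A brief sketch on made-up data (the frame is hypothetical); note that a label slice with loc includes its end label, while a positional slice with iloc excludes its end position.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: 6 days x 4 columns.
dates = pd.date_range("2021-01-01", periods=6)
df = pd.DataFrame(
    np.arange(24, dtype=float).reshape(6, 4),
    index=dates,
    columns=list("ABCD"),
)

by_name = df[df.columns[:3]]   # the approach from the text
by_label = df.loc[:, "A":"C"]  # label slice: "C" is included
by_position = df.iloc[:, :3]   # positional slice: position 3 is excluded

print(by_name.equals(by_label))     # True
print(by_name.equals(by_position))  # True
```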

A DataFrame can also be sliced by index values (still row by row).

Slicing with index values selects rows by label.

# Slice from 2021-01-02 through 2021-01-04.
df['2021-01-02':'2021-01-04']

df.loc['2021-01-02']

Output:

A   -1.664352
B    1.038657
C   -0.644446
D    0.049940
Name: 2021-01-02 00:00:00, dtype: float64
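One detail worth checking: a label slice includes its end label, while a positional slice excludes its end. A minimal sketch, assuming a stand-in frame with the same 6-day index (the values are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: 6 days x 4 columns.
dates = pd.date_range("2021-01-01", periods=6)
df = pd.DataFrame(
    np.arange(24, dtype=float).reshape(6, 4),
    index=dates,
    columns=list("ABCD"),
)

label_slice = df["2021-01-02":"2021-01-04"]  # end label included -> 3 rows
position_slice = df[1:3]                     # end position excluded -> 2 rows

print(len(label_slice), len(position_slice))  # 3 2

# The explicit loc form selects the same rows as the label slice.
print(label_slice.equals(df.loc["2021-01-02":"2021-01-04"]))  # True
```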

# With df.loc, the trailing ", :" is optional when selecting a row.
df.loc['20210102', :]   # same as df.loc['20210102']

# To select a column with df.loc, the leading ":" is required.
df.loc[:, "A"]          # not df.loc["A"], which looks for a row labeled "A"

df.loc indexes by a specific label (key - value).

# Get the row whose index value is 2021-01-01.
df.loc[dates[0]]

Output:

A    1.007072
B    1.173345
C   -0.452918
D    0.285477
Name: 2021-01-01 00:00:00, dtype: float64

df.loc also supports 2D indexing.
In NumPy, a 2D array is indexed as [row, column];
in a pandas DataFrame, the two axes are the row index labels and the column names.
For 2D indexing on a DataFrame, the columns can be passed as a list.

df.loc[ : , ['A', 'B']]

[ : , ["A", "B"] ] means: for all rows, take only columns A and B.

Next, a row slice combined with columns A and B gives a DataFrame restricted to those rows.

df.loc['2021-01-03' : '2021-01-05', ['A', 'B']]

Indexing a single row by its index value:

df.loc['2021-01-02',  ['A', 'B']]

Output:

A   -1.664352
B    1.038657
Name: 2021-01-02 00:00:00, dtype: float64

The result is a single row, so its type is Series.

type(df.loc['2021-01-02',  ['A', 'B']])

Output:

pandas.core.series.Series

This works on the same principle as indexing a 2D list.

The value of a specific column at a specific row (index):

df.loc['2021-01-01', 'C']

Output:

-0.452917846960589
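The type that df.loc returns depends on whether each axis gets a single label or a list/slice. A small sketch on made-up data (the frame is an assumed stand-in) covering the three cases above, plus a one-row DataFrame:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: 6 days x 4 columns.
dates = pd.date_range("2021-01-01", periods=6)
df = pd.DataFrame(
    np.arange(24, dtype=float).reshape(6, 4),
    index=dates,
    columns=list("ABCD"),
)

scalar = df.loc["2021-01-01", "C"]                     # label, label   -> single value
row = df.loc["2021-01-02", ["A", "B"]]                 # label, list    -> Series
block = df.loc["2021-01-03":"2021-01-05", ["A", "B"]]  # slice, list    -> DataFrame
one_row = df.loc[[dates[1]], ["A", "B"]]               # list, list     -> DataFrame with 1 row

print(scalar)                  # 2.0
print(type(row).__name__)      # Series
print(block.shape)             # (3, 2)
print(one_row.shape)           # (1, 2)
```

Wrapping the row label in a list is a handy way to keep the result two-dimensional when later code expects a DataFrame.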

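df.iloc indexes by integer position (rows first).

Positions are zero-based, so 1 means the 2nd row and 2 means the 3rd column.

df.iloc[1, 2]

With 2D indexing on iloc, the first number picks the row position and the second picks the column position.
Chaining two lookups reaches the same element: iloc[1] returns row 1 as a Series, and [2] then takes its 3rd value. The single call df.iloc[1, 2] is the preferred form.

df.iloc[1][2]

df.iloc indexing behaves the same as 2D indexing on a NumPy array.

Listing positions explicitly instead of slicing works like a filter: only the listed rows and columns are returned.

df.iloc[[1, 2, 4], [0, 3]]

Q. What does a trailing : mean in 2D indexing?

A. The same as in 2D indexing on a NumPy array: every element along that axis.

df.iloc[1:3, : ]

df.iloc[: , 1:3]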
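The NumPy comparison can be verified against the underlying array. A minimal sketch on made-up data (the frame is hypothetical). Scalars and slices line up exactly; the one difference is two index lists, where iloc selects a rows-by-columns grid and NumPy needs np.ix_ to do the same.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: 6 days x 4 columns.
dates = pd.date_range("2021-01-01", periods=6)
df = pd.DataFrame(
    np.arange(24, dtype=float).reshape(6, 4),
    index=dates,
    columns=list("ABCD"),
)
arr = df.to_numpy()

# Scalars and slices behave the same in iloc and in NumPy.
print(df.iloc[1, 2] == arr[1, 2])                               # True
print(np.array_equal(df.iloc[1:3, :].to_numpy(), arr[1:3, :]))  # True
print(np.array_equal(df.iloc[:, 1:3].to_numpy(), arr[:, 1:3]))  # True

# Two lists: iloc returns the 3 x 2 grid of listed rows and columns.
grid = df.iloc[[1, 2, 4], [0, 3]]
print(grid.shape)  # (3, 2)

# Plain arr[[1, 2, 4], [0, 3]] would pair the lists elementwise instead,
# so np.ix_ is used to build the equivalent grid selection.
print(np.array_equal(grid.to_numpy(), arr[np.ix_([1, 2, 4], [0, 3])]))  # True
```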

- fancy indexing

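pandas supports fancy indexing. (pandas is built on NumPy, which supports it.)
Here, fancy indexing means indexing with a condition, more precisely called boolean indexing:
the condition produces a sequence of True and False values that acts as a mask.

df > 0

Mask for the elements of column A that are greater than 0:

df.A > 0    # same as df['A'] > 0

Output: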

2021-01-01     True
2021-01-02    False
2021-01-03    False
2021-01-04     True
2021-01-05     True
2021-01-06     True
Freq: D, Name: A, dtype: bool

Getting the elements of column A that are greater than 0:

df[df['A'] > 0]['A']
df.loc[df['A'] > 0, 'A'] # recommended
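For reading, the two forms return the same values; the difference shows up when assigning. A sketch on made-up data (the frame is an assumed stand-in): the chained form goes through an intermediate object, so an assignment such as df[mask]['A'] = 0 may act on a temporary copy, and pandas warns about it, whereas a single loc call writes into df itself.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with negative and positive values.
dates = pd.date_range("2021-01-01", periods=6)
df = pd.DataFrame(
    np.arange(24, dtype=float).reshape(6, 4) - 10,
    index=dates,
    columns=list("ABCD"),
)

positive = df["A"] > 0

chained = df[positive]["A"]     # two steps: filter rows, then pick the column
single = df.loc[positive, "A"]  # one step: rows and column together
print(chained.equals(single))   # True

# Assignment through one loc call modifies df directly.
df.loc[positive, "A"] = 0.0
print((df["A"] > 0).any())      # False
```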

When a DataFrame is masked and the result is still a DataFrame
-- fancy indexing

df[df['A'] > 0]

df[df > 0]

Replacing the negative values (the ones that show up as NaN in df[df > 0]) with 0:

df[df < 0] = 0
df
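The masked assignment changes df in place. If the original should stay untouched, clip and where give the same result as a new frame. A minimal sketch on made-up data (the frame and variable names are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with negative and positive values.
dates = pd.date_range("2021-01-01", periods=6)
df = pd.DataFrame(
    np.arange(24, dtype=float).reshape(6, 4) - 10,
    index=dates,
    columns=list("ABCD"),
)

replaced = df.copy()
replaced[replaced < 0] = 0            # masked assignment, as in the text

clipped = df.clip(lower=0)            # new frame, values below 0 raised to 0
where_version = df.where(df >= 0, 0)  # keep values where the condition holds, else 0

print(replaced.equals(clipped))        # True
print(replaced.equals(where_version))  # True
print((df < 0).any().any())            # True: df itself still has negatives
```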

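What if the indexing is applied directly to a Series?

chain indexing
: indexing operations written one after another
-- they are applied in order, from left to right.

df['A'][df['A'] > 0]

Select 'A' first, then mask it where 'A' is greater than 0.

Output: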

2021-01-01    1.007072
2021-01-04    1.645999
2021-01-05    1.600783
2021-01-06    0.382107
Name: A, dtype: float64

Copy the DataFrame.

df2 = df.copy()

A DataFrame supports assignment in much the same way as a dictionary.
-- It is like looking up or replacing an item in a dictionary.

Add a column E to df2 whose values are the list ['one', 'one','two','three','four','three'].
-- If column E does not exist it is created; if it already exists it is overwritten.

df2['E'] = ['one', 'one','two','three','four','three']
df2

isin returns a boolean Series that is True for each row whose value is one of the given values.

df2['E'].isin(['two','four'])

True where the value is two or four.

Output:

2021-01-01    False
2021-01-02    False
2021-01-03     True
2021-01-04    False
2021-01-05     True
2021-01-06    False
Freq: D, Name: E, dtype: bool

Use the mask to filter the DataFrame:

df2[df2['E'].isin(['two','four'])]
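Boolean masks can also be negated and combined before filtering. A short sketch, assuming a made-up df2 with a numeric column A and the same E values as above; it uses ~ for "not", & for "and", | for "or", with parentheses around each comparison.

```python
import pandas as pd

# Hypothetical frame: column E as in the text, column A made up.
dates = pd.date_range("2021-01-01", periods=6)
df2 = pd.DataFrame(
    {
        "A": [1.0, -1.0, -2.0, 2.0, 3.0, -3.0],
        "E": ["one", "one", "two", "three", "four", "three"],
    },
    index=dates,
)

in_list = df2["E"].isin(["two", "four"])

not_in_list = df2[~in_list]             # rows where E is neither two nor four
both = df2[in_list & (df2["A"] > 0)]    # E is two or four AND A is positive
either = df2[in_list | (df2["A"] > 0)]  # E is two or four OR A is positive

print(len(not_in_list))    # 4
print(both["E"].tolist())  # ['four']
print(len(either))         # 4
```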

Summary

# 1)
df['A'] # selects a column


# 2)
df[ : 3] # slices rows


# 3) 2D indexing -- row (index) / column
df.iloc[0, 2]
# same element as df.loc['2021-01-01', 'C']
# same element as df.values[0, 2]
