SiLaure's Data
[EDA] Kaggle - Instacart Market Basket Analysis 본문
나는 EDA라는 말을 처음 들었다. EDM이면 몰라도...(ㅎ)
아무튼 EDA는 Exploratory Data Analysis의 준말로 탐색적 자료 분석을 뜻한다.
존 튜키라는 미국의 통계학자가 창안한 자료분석 방법론이다.
주어진 자료만 가지고 충분한 정보를 찾을 수 있도록 여러가지 탐색적 자료 분석 방법을 개발했는데,
boxplot가 대표적인 예이다.나는 EDA를 통계 정보를 가설이 아닌 객관적 자료 정보의 분석을 통해 자료를 이해하는 것이라고 해석했다.이번 미니 프로젝트에서 EDA를 수행할 대상으로 다음 5가지 주제가 있었다.
1. 신용카드 사용자 연체 예측 AI 경진대회 데이터셋
2. All Lending Club loan data (사용자 대출 여부 예측 대회 데이터셋)
3. Groceries Dataset
4. Instacart Market Basket Analysis (고객의 물품 재구매 예측 여부 대회 데이터셋)
5. Prudential Life Insurance Assessment (보험 고객의 리스크 예측 대회 데이터셋)
나는 주제를 선정하기 전날 마켓컬리에서 4만원을 주문한 전적이 있으므로,,내가 가장 재미를 느낄 것 같은 4번을 선택했다.잘 모르는 걸 하면 어려움+흥미없음 으로 드랍하게 될까봐...나의 EDA를 수행하기 전 코드 필사 3~4개 정도를 하라고 하셔서, 3개정도 했다.아직 블로그에는 1개밖에 포스팅 하지 않았지만..이제 앞으로 어떻게 EDA를 수행했는지 가보자고~~~~!
필요한 4대 lib import¶
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
파일 읽어오기¶
In [2]:
order_products_train_df=pd.read_csv("order_products__train.csv")
order_products_prior_df=pd.read_csv("order_products__prior.csv")
orders_df=pd.read_csv("orders.csv")
products_df=pd.read_csv("products.csv")
aisles_df=pd.read_csv("aisles.csv")
departments_df=pd.read_csv("departments.csv")
살펴보기(info,describe등)¶
In [3]:
orders_df.info()
orders_df.head(10)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3421083 entries, 0 to 3421082
Data columns (total 7 columns):
# Column Dtype
--- ------ -----
0 order_id int64
1 user_id int64
2 eval_set object
3 order_number int64
4 order_dow int64
5 order_hour_of_day int64
6 days_since_prior_order float64
dtypes: float64(1), int64(5), object(1)
memory usage: 182.7+ MB
Out[3]:
order_id | user_id | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order | |
---|---|---|---|---|---|---|---|
0 | 2539329 | 1 | prior | 1 | 2 | 8 | NaN |
1 | 2398795 | 1 | prior | 2 | 3 | 7 | 15.0 |
2 | 473747 | 1 | prior | 3 | 3 | 12 | 21.0 |
3 | 2254736 | 1 | prior | 4 | 4 | 7 | 29.0 |
4 | 431534 | 1 | prior | 5 | 4 | 15 | 28.0 |
5 | 3367565 | 1 | prior | 6 | 2 | 7 | 19.0 |
6 | 550135 | 1 | prior | 7 | 1 | 9 | 20.0 |
7 | 3108588 | 1 | prior | 8 | 1 | 14 | 14.0 |
8 | 2295261 | 1 | prior | 9 | 1 | 16 | 0.0 |
9 | 2550362 | 1 | prior | 10 | 4 | 8 | 30.0 |
In [4]:
products_df.info()
products_df.head(10)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49688 entries, 0 to 49687
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 product_id 49688 non-null int64
1 product_name 49688 non-null object
2 aisle_id 49688 non-null int64
3 department_id 49688 non-null int64
dtypes: int64(3), object(1)
memory usage: 1.5+ MB
Out[4]:
product_id | product_name | aisle_id | department_id | |
---|---|---|---|---|
0 | 1 | Chocolate Sandwich Cookies | 61 | 19 |
1 | 2 | All-Seasons Salt | 104 | 13 |
2 | 3 | Robust Golden Unsweetened Oolong Tea | 94 | 7 |
3 | 4 | Smart Ones Classic Favorites Mini Rigatoni Wit... | 38 | 1 |
4 | 5 | Green Chile Anytime Sauce | 5 | 13 |
5 | 6 | Dry Nose Oil | 11 | 11 |
6 | 7 | Pure Coconut Water With Orange | 98 | 7 |
7 | 8 | Cut Russet Potatoes Steam N' Mash | 116 | 1 |
8 | 9 | Light Strawberry Blueberry Yogurt | 120 | 16 |
9 | 10 | Sparkling Orange Juice & Prickly Pear Beverage | 115 | 7 |
In [5]:
products_df.describe()
Out[5]:
product_id | aisle_id | department_id | |
---|---|---|---|
count | 49688.000000 | 49688.000000 | 49688.000000 |
mean | 24844.500000 | 67.769582 | 11.728687 |
std | 14343.834425 | 38.316162 | 5.850410 |
min | 1.000000 | 1.000000 | 1.000000 |
25% | 12422.750000 | 35.000000 | 7.000000 |
50% | 24844.500000 | 69.000000 | 13.000000 |
75% | 37266.250000 | 100.000000 | 17.000000 |
max | 49688.000000 | 134.000000 | 21.000000 |
In [6]:
order_products_prior_df.info()
order_products_prior_df.head(10)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32434489 entries, 0 to 32434488
Data columns (total 4 columns):
# Column Dtype
--- ------ -----
0 order_id int64
1 product_id int64
2 add_to_cart_order int64
3 reordered int64
dtypes: int64(4)
memory usage: 989.8 MB
Out[6]:
order_id | product_id | add_to_cart_order | reordered | |
---|---|---|---|---|
0 | 2 | 33120 | 1 | 1 |
1 | 2 | 28985 | 2 | 1 |
2 | 2 | 9327 | 3 | 0 |
3 | 2 | 45918 | 4 | 1 |
4 | 2 | 30035 | 5 | 0 |
5 | 2 | 17794 | 6 | 1 |
6 | 2 | 40141 | 7 | 1 |
7 | 2 | 1819 | 8 | 1 |
8 | 2 | 43668 | 9 | 0 |
9 | 3 | 33754 | 1 | 1 |
In [7]:
order_products_prior_df.describe()
Out[7]:
order_id | product_id | add_to_cart_order | reordered | |
---|---|---|---|---|
count | 3.243449e+07 | 3.243449e+07 | 3.243449e+07 | 3.243449e+07 |
mean | 1.710749e+06 | 2.557634e+04 | 8.351076e+00 | 5.896975e-01 |
std | 9.873007e+05 | 1.409669e+04 | 7.126671e+00 | 4.918886e-01 |
min | 2.000000e+00 | 1.000000e+00 | 1.000000e+00 | 0.000000e+00 |
25% | 8.559430e+05 | 1.353000e+04 | 3.000000e+00 | 0.000000e+00 |
50% | 1.711048e+06 | 2.525600e+04 | 6.000000e+00 | 1.000000e+00 |
75% | 2.565514e+06 | 3.793500e+04 | 1.100000e+01 | 1.000000e+00 |
max | 3.421083e+06 | 4.968800e+04 | 1.450000e+02 | 1.000000e+00 |
In [8]:
order_products_train_df.info()
order_products_train_df.head(10)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1384617 entries, 0 to 1384616
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 order_id 1384617 non-null int64
1 product_id 1384617 non-null int64
2 add_to_cart_order 1384617 non-null int64
3 reordered 1384617 non-null int64
dtypes: int64(4)
memory usage: 42.3 MB
Out[8]:
order_id | product_id | add_to_cart_order | reordered | |
---|---|---|---|---|
0 | 1 | 49302 | 1 | 1 |
1 | 1 | 11109 | 2 | 1 |
2 | 1 | 10246 | 3 | 0 |
3 | 1 | 49683 | 4 | 0 |
4 | 1 | 43633 | 5 | 1 |
5 | 1 | 13176 | 6 | 0 |
6 | 1 | 47209 | 7 | 0 |
7 | 1 | 22035 | 8 | 1 |
8 | 36 | 39612 | 1 | 0 |
9 | 36 | 19660 | 2 | 1 |
In [9]:
aisles_df.info()
aisles_df.head(10)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 134 entries, 0 to 133
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 aisle_id 134 non-null int64
1 aisle 134 non-null object
dtypes: int64(1), object(1)
memory usage: 2.2+ KB
Out[9]:
aisle_id | aisle | |
---|---|---|
0 | 1 | prepared soups salads |
1 | 2 | specialty cheeses |
2 | 3 | energy granola bars |
3 | 4 | instant foods |
4 | 5 | marinades meat preparation |
5 | 6 | other |
6 | 7 | packaged meat |
7 | 8 | bakery desserts |
8 | 9 | pasta sauce |
9 | 10 | kitchen supplies |
In [10]:
departments_df.info()
departments_df.head(10)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21 entries, 0 to 20
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 department_id 21 non-null int64
1 department 21 non-null object
dtypes: int64(1), object(1)
memory usage: 464.0+ bytes
Out[10]:
department_id | department | |
---|---|---|
0 | 1 | frozen |
1 | 2 | other |
2 | 3 | bakery |
3 | 4 | produce |
4 | 5 | alcohol |
5 | 6 | international |
6 | 7 | beverages |
7 | 8 | pets |
8 | 9 | dry goods pasta |
9 | 10 | bulk |
- orders_df는 주문에 대한 모든 정보를 가지고 있다.
- order_products_train과 order_products_prior에 있는 열이 동일하다.
- 고객의 모든 이전 주문 정보는 order_products_prior 파일에 있다.
In [12]:
# orders_df에서 user_id 당 eval_set의 개수가 어떻게 되어 있는지 확인
orders_df.groupby("eval_set")['user_id'].nunique()
Out[12]:
eval_set
prior 206209
test 75000
train 131209
Name: user_id, dtype: int64
- 총 user_id 개수는 206,209
- 이 중 131,209명이 train set이고 나머지 75000명을 예측해야 함
In [14]:
cnt_srs = orders_df.eval_set.value_counts()
plt.figure(figsize=(12, 8))
# cnt_srs.index는 Index(['prior', 'train', 'test'], dtype='object')
# cnt_srs.values는 4~100의 order_number
sns.barplot(cnt_srs.index, cnt_srs.values)
plt.xlabel('Eval set type', fontsize=12)
plt.ylabel('Number of Occurrences', fontsize=12)
plt.title('Countof rows in each dataset', fontsize=15)
plt.show()
C:\Users\user\anaconda3\envs\datascience\lib\site-packages\seaborn\_decorators.py:36: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
warnings.warn(
In [15]:
temp = orders_df.groupby("user_id")["order_number"]
temp
Out[15]:
<pandas.core.groupby.generic.SeriesGroupBy object at 0x000001B615BD7D00>
In [16]:
# orders_df를 user_id를 기준으로
# user_id가 가진 order_number의 최댓값을 구하는 함수를 적용하고
# index를 colomn으로 변환
cnt_srs = orders_df.groupby("user_id")["order_number"].agg(np.max).reset_index()
cnt_srs
Out[16]:
user_id | order_number | |
---|---|---|
0 | 1 | 11 |
1 | 2 | 15 |
2 | 3 | 13 |
3 | 4 | 6 |
4 | 5 | 5 |
... | ... | ... |
206204 | 206205 | 4 |
206205 | 206206 | 68 |
206206 | 206207 | 17 |
206207 | 206208 | 50 |
206208 | 206209 | 14 |
206209 rows × 2 columns
In [17]:
cnt_srs = cnt_srs.order_number.value_counts()
cnt_srs
Out[17]:
4 23986
5 19590
6 16165
7 13850
8 11700
...
94 57
91 56
97 54
98 50
99 47
Name: order_number, Length: 97, dtype: int64
- order_number는 4~100으로, 총 97건의 고객 주문 데이터가 있다
In [19]:
plt.figure(figsize=(12, 8))
sns.barplot(cnt_srs.index, cnt_srs.values)
C:\Users\user\anaconda3\envs\datascience\lib\site-packages\seaborn\_decorators.py:36: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
warnings.warn(
Out[19]:
<AxesSubplot:>
- 4개 미만을 주문한 데이터는 없다.
- 다음은 요일, 시간대에 따른 주문 분포
In [22]:
plt.figure(figsize=(12, 8))
sns.countplot(x="order_dow", data = orders_df)
plt.xlabel('Day of week', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.title("Frequency of order by week day")
plt.show()
- 0이 일요일일까 월요일일까?
- 미국에서는 일요일이 일주일의 시작이니 0이 일요일일까, 1이 일요일일까...?
일요일이었음
In [24]:
plt.figure(figsize=(12, 8))
sns.countplot(x="order_hour_of_day", data=orders_df)
plt.xlabel('Hour of day', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.title("Frequency of order by hour of day", fontsize=15)
plt.show()
- 10~16시까지 제일 주문이 많다. (나는 자기 전에 시키는데ㅎㅎ)
- 재주문은 며칠 간격으로 이루어 질까?
In [26]:
plt.figure(figsize=(12, 8))
sns.countplot(x="days_since_prior_order", data=orders_df)
plt.xlabel('Days since prior order', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.title("Frequency distribution by das since prior order", fontsize=15)
plt.show()
- 한달 주기가 제일 많고, 그 다음이 1주일 뒤 > 6> ... > 2일 그 이후로는 일주일 단위로 전후 날짜보다 더 많이 주문한다.
- 재주문 비율로 확인하기
In [29]:
order_products_prior_df.reordered
Out[29]:
0 1
1 1
2 0
3 1
4 0
..
32434484 1
32434485 0
32434486 0
32434487 1
32434488 1
Name: reordered, Length: 32434489, dtype: int64
In [30]:
sum_ord_prod_prior = order_products_prior_df.reordered.sum()
In [31]:
order_products_prior_df.shape
Out[31]:
(32434489, 4)
In [32]:
order_products_prior_df.shape[0]
Out[32]:
32434489
In [33]:
# order_products_prior의 재주문 비율
cnt_ord_prod_prior = order_products_prior_df.shape[0]
pct_ord_prod_prior = sum_ord_prod_prior/cnt_ord_prod_prior *100
pct_ord_prod_prior.round(2)
Out[33]:
58.97
In [34]:
# order_products_train의 재주문 비율
sum_ord_prod_train = order_products_train_df.reordered.sum()
cnt_ord_prod_train = order_products_train_df.shape[0]
pct_ord_prod_train = sum_ord_prod_train/cnt_ord_prod_train*100
pct_ord_prod_train.round(2)
Out[34]:
59.86
- 약 59~60%가 재주문
- 재주문된 적이 없는 건 확인하기
- order_products_prior_df에서 order_id를 기준으로 reordered한 합을 구해서 0인 것들을 찾으면 된다.
In [37]:
# order_products_prior_df에서 order_id를 기준으로 reordered한 합
sum_reord_prior_df = order_products_prior_df.groupby("order_id")["reordered"].agg("sum").reset_index()
sum_reord_prior_df
Out[37]:
order_id | reordered | |
---|---|---|
0 | 2 | 6 |
1 | 3 | 8 |
2 | 4 | 12 |
3 | 5 | 21 |
4 | 6 | 0 |
... | ... | ... |
3214869 | 3421079 | 0 |
3214870 | 3421080 | 4 |
3214871 | 3421081 | 0 |
3214872 | 3421082 | 4 |
3214873 | 3421083 | 4 |
3214874 rows × 2 columns
In [38]:
# 재주문 된 적이 없는 건들
non_reord_prior_df = sum_reord_prior_df["reordered"].loc[sum_reord_prior_df["reordered"]<1]
non_reord_prior_df
Out[38]:
4 0
5 0
11 0
17 0
21 0
..
3214842 0
3214863 0
3214867 0
3214869 0
3214871 0
Name: reordered, Length: 388513, dtype: int64
In [39]:
# 재주문된 적 없는 건들은 0이므로, 1번이라도 재주문된 적 있는 건들을 1로 치환한다
sum_reord_prior_df["reordered"].loc[sum_reord_prior_df["reordered"]>=1] = 1
sum_reord_prior_df
# 재주문된 건수(1)와 재주문 되지 않은 건수(0)
sum_reord_prior_df.reordered.value_counts()
Out[39]:
1 2826361
0 388513
Name: reordered, dtype: int64
In [40]:
# user_id를 기준으로 한 재주문된 건수(1)와 재주문 되지 않은 건수(0)의 비율
sum_reord_prior_df.reordered.value_counts()/sum_reord_prior_df.shape[0] *100
Out[40]:
1 87.915141
0 12.084859
Name: reordered, dtype: float64
In [41]:
# train set의 경우도 실행
sum_reord_train_df = order_products_train_df.groupby("order_id")["reordered"].agg("sum").reset_index()
sum_reord_train_df["reordered"].loc[sum_reord_train_df["reordered"]>=1] = 1
sum_reord_train_df.reordered.value_counts()/sum_reord_train_df.shape[0] *100
Out[41]:
1 93.444047
0 6.555953
Name: reordered, dtype: float64
- train set의 경우 재주문 하지 않은 비율이 약 6.5%
- 주문 당 구매한 제품 수는?
In [43]:
cnt_each_prod_prior = order_products_prior_df.groupby("order_id")["add_to_cart_order"].agg(np.max).reset_index()
cnt_each_prod_prior
Out[43]:
order_id | add_to_cart_order | |
---|---|---|
0 | 2 | 9 |
1 | 3 | 8 |
2 | 4 | 13 |
3 | 5 | 26 |
4 | 6 | 3 |
... | ... | ... |
3214869 | 3421079 | 1 |
3214870 | 3421080 | 9 |
3214871 | 3421081 | 7 |
3214872 | 3421082 | 7 |
3214873 | 3421083 | 10 |
3214874 rows × 2 columns
- 수량만 뽑아보자
In [44]:
cnt_each_prod_prior = cnt_each_prod_prior.add_to_cart_order.value_counts()
cnt_each_prod_prior
Out[44]:
5 228330
6 227675
4 222081
7 220006
3 207027
...
114 1
112 1
94 1
90 1
145 1
Name: add_to_cart_order, Length: 113, dtype: int64
In [45]:
# 따로 그리기
plt.figure(figsize=(20, 10))
sns.barplot(cnt_each_prod_prior.index, cnt_each_prod_prior.values)
plt.xlabel('Number of add to cart order', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.title("Count of Each products in Prior set", fontsize=15)
Out[45]:
Text(0.5, 1.0, 'Count of Each products in Prior set')
In [46]:
cnt_each_prod_train = order_products_train_df.groupby("order_id")["add_to_cart_order"].agg(np.max).reset_index()
cnt_each_prod_train = cnt_each_prod_train.add_to_cart_order.value_counts()
plt.figure(figsize=(20, 10))
sns.barplot(cnt_each_prod_train.index, cnt_each_prod_train.values)
plt.xlabel('Number of add to cart order', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.title("Count of Each products in Train set", fontsize=15)
Out[46]:
Text(0.5, 1.0, 'Count of Each products in Train set')
In [47]:
# 한꺼번에 그리기
fig, (ax1, ax2) = plt.subplots(nrows=2)
fig.set_size_inches(20, 15)
cnt_each_prod_prior = order_products_prior_df.groupby("order_id")["add_to_cart_order"].count().to_frame()["add_to_cart_order"].value_counts().to_frame()
cnt_each_prod_prior["count"] = cnt_each_prod_prior["add_to_cart_order"]
cnt_each_prod_prior["nvr_ord"] = cnt_each_prod_prior.index
cnt_each_prod_train = order_products_train_df.groupby("order_id")["add_to_cart_order"].count().to_frame()["add_to_cart_order"].value_counts().to_frame()
cnt_each_prod_train["count"] = cnt_each_prod_train["add_to_cart_order"]
cnt_each_prod_train["nvr_ord"] = cnt_each_prod_train.index
sns.barplot(data=cnt_each_prod_prior.head(50), x="nvr_ord", y="count", ax=ax1)
sns.barplot(data=cnt_each_prod_train.head(50), x="nvr_ord", y="count", ax=ax2)
ax1.set(xlabel='Order Count', title="Count of Items People Buy In Prior")
ax2.set(xlabel='Order Count', title="Count of Items People Buy In Train")
Out[47]:
[Text(0.5, 0, 'Order Count'),
Text(0.5, 1.0, 'Count of Items People Buy In Train')]
- 5번 주문이 제일 많고 개수가 많아질 수록 주문이 감소한다.
- 제품 상세 목록을 확인하자. (제일 재밌는 부분)
In [49]:
products_df.head(10)
Out[49]:
product_id | product_name | aisle_id | department_id | |
---|---|---|---|---|
0 | 1 | Chocolate Sandwich Cookies | 61 | 19 |
1 | 2 | All-Seasons Salt | 104 | 13 |
2 | 3 | Robust Golden Unsweetened Oolong Tea | 94 | 7 |
3 | 4 | Smart Ones Classic Favorites Mini Rigatoni Wit... | 38 | 1 |
4 | 5 | Green Chile Anytime Sauce | 5 | 13 |
5 | 6 | Dry Nose Oil | 11 | 11 |
6 | 7 | Pure Coconut Water With Orange | 98 | 7 |
7 | 8 | Cut Russet Potatoes Steam N' Mash | 116 | 1 |
8 | 9 | Light Strawberry Blueberry Yogurt | 120 | 16 |
9 | 10 | Sparkling Orange Juice & Prickly Pear Beverage | 115 | 7 |
In [50]:
aisles_df.head(10)
Out[50]:
aisle_id | aisle | |
---|---|---|
0 | 1 | prepared soups salads |
1 | 2 | specialty cheeses |
2 | 3 | energy granola bars |
3 | 4 | instant foods |
4 | 5 | marinades meat preparation |
5 | 6 | other |
6 | 7 | packaged meat |
7 | 8 | bakery desserts |
8 | 9 | pasta sauce |
9 | 10 | kitchen supplies |
In [51]:
departments_df.head(10)
Out[51]:
department_id | department | |
---|---|---|
0 | 1 | frozen |
1 | 2 | other |
2 | 3 | bakery |
3 | 4 | produce |
4 | 5 | alcohol |
5 | 6 | international |
6 | 7 | beverages |
7 | 8 | pets |
8 | 9 | dry goods pasta |
9 | 10 | bulk |
- prior 정보와 매치하기 위해 표 merge 작업
In [52]:
mg_ord_prod_df = pd.merge(order_products_prior_df, products_df, on='product_id', how='left')
mg_ord_prod_df = pd.merge(mg_ord_prod_df, aisles_df, on='aisle_id', how='left')
mg_ord_prod_df = pd.merge(mg_ord_prod_df, departments_df, on='department_id', how='left')
In [53]:
mg_ord_prod_df.info()
mg_ord_prod_df.head(50)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 32434489 entries, 0 to 32434488
Data columns (total 9 columns):
# Column Dtype
--- ------ -----
0 order_id int64
1 product_id int64
2 add_to_cart_order int64
3 reordered int64
4 product_name object
5 aisle_id int64
6 department_id int64
7 aisle object
8 department object
dtypes: int64(6), object(3)
memory usage: 2.4+ GB
Out[53]:
order_id | product_id | add_to_cart_order | reordered | product_name | aisle_id | department_id | aisle | department | |
---|---|---|---|---|---|---|---|---|---|
0 | 2 | 33120 | 1 | 1 | Organic Egg Whites | 86 | 16 | eggs | dairy eggs |
1 | 2 | 28985 | 2 | 1 | Michigan Organic Kale | 83 | 4 | fresh vegetables | produce |
2 | 2 | 9327 | 3 | 0 | Garlic Powder | 104 | 13 | spices seasonings | pantry |
3 | 2 | 45918 | 4 | 1 | Coconut Butter | 19 | 13 | oils vinegars | pantry |
4 | 2 | 30035 | 5 | 0 | Natural Sweetener | 17 | 13 | baking ingredients | pantry |
5 | 2 | 17794 | 6 | 1 | Carrots | 83 | 4 | fresh vegetables | produce |
6 | 2 | 40141 | 7 | 1 | Original Unflavored Gelatine Mix | 105 | 13 | doughs gelatins bake mixes | pantry |
7 | 2 | 1819 | 8 | 1 | All Natural No Stir Creamy Almond Butter | 88 | 13 | spreads | pantry |
8 | 2 | 43668 | 9 | 0 | Classic Blend Cole Slaw | 123 | 4 | packaged vegetables fruits | produce |
9 | 3 | 33754 | 1 | 1 | Total 2% with Strawberry Lowfat Greek Strained... | 120 | 16 | yogurt | dairy eggs |
10 | 3 | 24838 | 2 | 1 | Unsweetened Almondmilk | 91 | 16 | soy lactosefree | dairy eggs |
11 | 3 | 17704 | 3 | 1 | Lemons | 123 | 4 | packaged vegetables fruits | produce |
12 | 3 | 21903 | 4 | 1 | Organic Baby Spinach | 123 | 4 | packaged vegetables fruits | produce |
13 | 3 | 17668 | 5 | 1 | Unsweetened Chocolate Almond Breeze Almond Milk | 91 | 16 | soy lactosefree | dairy eggs |
14 | 3 | 46667 | 6 | 1 | Organic Ginger Root | 83 | 4 | fresh vegetables | produce |
15 | 3 | 17461 | 7 | 1 | Air Chilled Organic Boneless Skinless Chicken ... | 35 | 12 | poultry counter | meat seafood |
16 | 3 | 32665 | 8 | 1 | Organic Ezekiel 49 Bread Cinnamon Raisin | 112 | 3 | bread | bakery |
17 | 4 | 46842 | 1 | 0 | Plain Pre-Sliced Bagels | 93 | 3 | breakfast bakery | bakery |
18 | 4 | 26434 | 2 | 1 | Honey/Lemon Cough Drops | 11 | 11 | cold flu allergy | personal care |
19 | 4 | 39758 | 3 | 1 | Chewy 25% Low Sugar Chocolate Chip Granola | 3 | 19 | energy granola bars | snacks |
20 | 4 | 27761 | 4 | 1 | Oats & Chocolate Chewy Bars | 48 | 14 | breakfast bars pastries | breakfast |
21 | 4 | 10054 | 5 | 1 | Kellogg's Nutri-Grain Apple Cinnamon Cereal | 48 | 14 | breakfast bars pastries | breakfast |
22 | 4 | 21351 | 6 | 1 | Nutri-Grain Soft Baked Strawberry Cereal Break... | 48 | 14 | breakfast bars pastries | breakfast |
23 | 4 | 22598 | 7 | 1 | Kellogg's Nutri-Grain Blueberry Cereal | 48 | 14 | breakfast bars pastries | breakfast |
24 | 4 | 34862 | 8 | 1 | Tiny Twists Pretzels | 107 | 19 | chips pretzels | snacks |
25 | 4 | 40285 | 9 | 1 | Traditional Snack Mix | 125 | 19 | trail mix snack mix | snacks |
26 | 4 | 17616 | 10 | 1 | Goldfish Cheddar Baked Snack Crackers | 78 | 19 | crackers | snacks |
27 | 4 | 25146 | 11 | 1 | Original Orange Juice | 31 | 7 | refrigerated | beverages |
28 | 4 | 32645 | 12 | 1 | Sugarfree Energy Drink | 64 | 7 | energy sports drinks | beverages |
29 | 4 | 41276 | 13 | 1 | Energy Drink | 64 | 7 | energy sports drinks | beverages |
30 | 5 | 13176 | 1 | 1 | Bag of Organic Bananas | 24 | 4 | fresh fruits | produce |
31 | 5 | 15005 | 2 | 1 | Just Crisp, Parmesan | 89 | 13 | salad dressing toppings | pantry |
32 | 5 | 47329 | 3 | 1 | Fresh Fruit Salad | 1 | 20 | prepared soups salads | deli |
33 | 5 | 27966 | 4 | 1 | Organic Raspberries | 123 | 4 | packaged vegetables fruits | produce |
34 | 5 | 23909 | 5 | 1 | 2% Reduced Fat Milk | 84 | 16 | milk | dairy eggs |
35 | 5 | 48370 | 6 | 1 | Sensitive Toilet Paper | 54 | 17 | paper goods | household |
36 | 5 | 13245 | 7 | 1 | Natural Artesian Water, Mini & Mobile | 115 | 7 | water seltzer sparkling water | beverages |
37 | 5 | 9633 | 8 | 1 | Matzos, Thin, Tea | 33 | 6 | kosher foods | international |
38 | 5 | 27360 | 9 | 1 | Boneless Skinless Chicken Breast Fillets | 49 | 12 | packaged poultry | meat seafood |
39 | 5 | 6348 | 10 | 1 | Mini Original Babybel Cheese | 21 | 16 | packaged cheese | dairy eggs |
40 | 5 | 40878 | 11 | 1 | Macaroni And Cheese | 4 | 9 | instant foods | dry goods pasta |
41 | 5 | 6184 | 12 | 1 | Clementines | 32 | 4 | packaged produce | produce |
42 | 5 | 48002 | 13 | 1 | Biscuits Orange Pim's | 61 | 19 | cookies cakes | snacks |
43 | 5 | 20914 | 14 | 1 | Dairy Milk Fruit & Nut Chocolate Bar | 45 | 19 | candy chocolate | snacks |
44 | 5 | 37011 | 15 | 1 | Artichokes | 83 | 4 | fresh vegetables | produce |
45 | 5 | 12962 | 16 | 1 | Apricot Preserves | 88 | 13 | spreads | pantry |
46 | 5 | 45698 | 17 | 1 | One Ply Choose A Size Big Roll Paper Towel Rolls | 54 | 17 | paper goods | household |
47 | 5 | 24773 | 18 | 1 | Wafer, Chocolate | 61 | 19 | cookies cakes | snacks |
48 | 5 | 18569 | 19 | 1 | French Lavender Hand Wash | 127 | 11 | body lotions soap | personal care |
49 | 5 | 41176 | 20 | 1 | American Slices Cheese | 21 | 16 | packaged cheese | dairy eggs |
표가 넘나리 길지만,,, 블로그에서 편집하기 너무 번거로우므로 그냥 둠
In [54]:mg_ord_prod_df['product_name'].value_counts().reset_index
Out[54]:
<bound method Series.reset_index of Banana 472565
Bag of Organic Bananas 379450
Organic Strawberries 264683
Organic Baby Spinach 241921
Organic Hass Avocado 213584
...
Chewy Chocolate Chip 1
Indulgent Cherry & Dark Chocolate Whole Milk Greek Yogurt 1
Salted Caramel Craze Ice Cream 1
Flax Oil, Organic, Omega-3, Original Formula 1
Seltzer Blueberry Lemonade 1
Name: product_name, Length: 49677, dtype: int64>
- 제품 별 구매 수 내림차순으로 뽑아내고 column renaming
In [55]:
cnt_srs_prod=mg_ord_prod_df['product_name'].value_counts().reset_index()
cnt_srs_prod.columns = ['product_name', 'frequency_count']
cnt_srs_prod.head(20)
Out[55]:
product_name | frequency_count | |
---|---|---|
0 | Banana | 472565 |
1 | Bag of Organic Bananas | 379450 |
2 | Organic Strawberries | 264683 |
3 | Organic Baby Spinach | 241921 |
4 | Organic Hass Avocado | 213584 |
5 | Organic Avocado | 176815 |
6 | Large Lemon | 152657 |
7 | Strawberries | 142951 |
8 | Limes | 140627 |
9 | Organic Whole Milk | 137905 |
10 | Organic Raspberries | 137057 |
11 | Organic Yellow Onion | 113426 |
12 | Organic Garlic | 109778 |
13 | Organic Zucchini | 104823 |
14 | Organic Blueberries | 100060 |
15 | Cucumber Kirby | 97315 |
16 | Organic Fuji Apple | 89632 |
17 | Organic Lemon | 87746 |
18 | Apple Honeycrisp Organic | 85020 |
19 | Organic Grape Tomatoes | 84255 |
- 20위까지만 뽑아봤는데, 대부분이 유기농 제품, 그것도 과일이 대부분이었다.
- 품목 별로 묶어보자
In [56]:
cnt_srs_prod_2 = mg_ord_prod_df['aisle'].value_counts().reset_index().head(20)
cnt_srs_prod_2.columns = ['aisle', 'frequency_counts']
cnt_srs_prod_2
Out[56]:
aisle | frequency_counts | |
---|---|---|
0 | fresh fruits | 3642188 |
1 | fresh vegetables | 3418021 |
2 | packaged vegetables fruits | 1765313 |
3 | yogurt | 1452343 |
4 | packaged cheese | 979763 |
5 | milk | 891015 |
6 | water seltzer sparkling water | 841533 |
7 | chips pretzels | 722470 |
8 | soy lactosefree | 638253 |
9 | bread | 584834 |
10 | refrigerated | 575881 |
11 | frozen produce | 522654 |
12 | ice cream ice | 498425 |
13 | crackers | 458838 |
14 | energy granola bars | 456386 |
15 | eggs | 452134 |
16 | lunch meat | 395130 |
17 | frozen meals | 390299 |
18 | baby food formula | 382456 |
19 | fresh herbs | 377741 |
- 역시 20위까지 뽑아봤는데 전부 식료품이며 과일이 제일 많고 그 다음이 야채였다.
- 카테고리 별로 묶는다면??
In [57]:
cnt_srs_prod_3 = mg_ord_prod_df['department'].value_counts().reset_index().head(20)
cnt_srs_prod_3.columns = ['department', 'frequency_counts']
cnt_srs_prod_3
Out[57]:
department | frequency_counts | |
---|---|---|
0 | produce | 9479291 |
1 | dairy eggs | 5414016 |
2 | snacks | 2887550 |
3 | beverages | 2690129 |
4 | frozen | 2236432 |
5 | pantry | 1875577 |
6 | bakery | 1176787 |
7 | canned goods | 1068058 |
8 | deli | 1051249 |
9 | dry goods pasta | 866627 |
10 | household | 738666 |
11 | breakfast | 709569 |
12 | meat seafood | 708931 |
13 | personal care | 447123 |
14 | babies | 423802 |
15 | international | 269253 |
16 | alcohol | 153696 |
17 | pets | 97724 |
18 | missing | 69145 |
19 | other | 36291 |
- 1위는 농산물, 2위는 유제품/알류임을 확인
- 카테고리 별 재주문 비율을 확인해보자
In [58]:
reord_dept = mg_ord_prod_df.groupby(['department'])['reordered'].agg('mean').reset_index()
reord_dept
Out[58]:
department | reordered | |
---|---|---|
0 | alcohol | 0.569924 |
1 | babies | 0.578971 |
2 | bakery | 0.628141 |
3 | beverages | 0.653460 |
4 | breakfast | 0.560922 |
5 | bulk | 0.577040 |
6 | canned goods | 0.457405 |
7 | dairy eggs | 0.669969 |
8 | deli | 0.607719 |
9 | dry goods pasta | 0.461076 |
10 | frozen | 0.541885 |
11 | household | 0.402178 |
12 | international | 0.369229 |
13 | meat seafood | 0.567674 |
14 | missing | 0.395849 |
15 | other | 0.407980 |
16 | pantry | 0.346721 |
17 | personal care | 0.321129 |
18 | pets | 0.601285 |
19 | produce | 0.649913 |
20 | snacks | 0.574180 |
In [59]:
# 제품 이름에 organic이 들어간 농산물의 재구매 주기는 어떻게 될까?
organic_ord = mg_ord_prod_df["product_name"].loc[mg_ord_prod_df["product_name"].str.contains("Organic")]
organic_ord
# 왼쪽의 0, 1, 12, ...는 뭐지
Out[59]:
0 Organic Egg Whites
1 Michigan Organic Kale
12 Organic Baby Spinach
14 Organic Ginger Root
15 Air Chilled Organic Boneless Skinless Chicken ...
...
32434481 Organic Mixed Berry Yogurt & Fruit Snack
32434482 Organic Strawberry Yogurt & Fruit Snack
32434483 Organic Strawberry & Mango Dried Tiny Fruits
32434485 Organic Mini Sandwich Crackers Peanut Butter
32434488 Organic Sweet & Salty Peanut Pretzel Granola ...
Name: product_name, Length: 10250392, dtype: object
여기서부터 오류 대잔치
In [60]:
# organic을 포함하지 않는 제품은 organic 제품이 아니라고 가정하고 Non organic으로 바꾼다
# if mg_ord_prod_df["product_name"].loc[mg_ord_prod_df["product_name"].str.contains("organic")] :
# mg_ord_prod_df["product_name"].loc[mg_ord_prod_df["product_name"]].str.replace("organic")
# elif mg_ord_prod_df["product_name"].loc[mg_ord_prod_df["product_name"]].str.replace("non organic")
In [61]:
mg_ord_prod_df["organic"] = 'a'
In [62]:
mg_ord_prod_df["product_name"].loc[mg_ord_prod_df["product_name"].str.contains("organic")]
Out[62]:
Series([], Name: product_name, dtype: object)
In [63]:
# if mg_ord_prod_df["product_name"].loc[mg_ord_prod_df["product_name"].str.contains("Organic")] :
# mg_ord_prod_df["organic"].replace('a', 'yes')
# else :
# mg_ord_prod_df["organic"].replace('a', 'no')
In [ ]:
# def organic(mg_ord_prod_df) :
# if mg_ord_prod_df["product_name"].loc[mg_ord_prod_df["product_name"].str.contains("rganic")] :
# return "organic"
# else :
# return "non organic"
In [64]:
import numpy as np
mg_ord_prod_df["organic"]=np.where(mg_ord_prod_df["product_name"].str.contains("rganic"), "organic", "non organic")
In [65]:
mg_ord_prod_df
# 반나절 만에 성공,,,,,ㅠㅠ눈물남 ㅠ
Out[65]:
order_id | product_id | add_to_cart_order | reordered | product_name | aisle_id | department_id | aisle | department | organic | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 2 | 33120 | 1 | 1 | Organic Egg Whites | 86 | 16 | eggs | dairy eggs | organic |
1 | 2 | 28985 | 2 | 1 | Michigan Organic Kale | 83 | 4 | fresh vegetables | produce | organic |
2 | 2 | 9327 | 3 | 0 | Garlic Powder | 104 | 13 | spices seasonings | pantry | non organic |
3 | 2 | 45918 | 4 | 1 | Coconut Butter | 19 | 13 | oils vinegars | pantry | non organic |
4 | 2 | 30035 | 5 | 0 | Natural Sweetener | 17 | 13 | baking ingredients | pantry | non organic |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
32434484 | 3421083 | 39678 | 6 | 1 | Free & Clear Natural Dishwasher Detergent | 74 | 17 | dish detergents | household | non organic |
32434485 | 3421083 | 11352 | 7 | 0 | Organic Mini Sandwich Crackers Peanut Butter | 78 | 19 | crackers | snacks | organic |
32434486 | 3421083 | 4600 | 8 | 0 | All Natural French Toast Sticks | 52 | 1 | frozen breakfast | frozen | non organic |
32434487 | 3421083 | 24852 | 9 | 1 | Banana | 24 | 4 | fresh fruits | produce | non organic |
32434488 | 3421083 | 5020 | 10 | 1 | Organic Sweet & Salty Peanut Pretzel Granola ... | 3 | 19 | energy granola bars | snacks | organic |
32434489 rows × 10 columns
In [ ]:
# 이건 필요 없숴
# organic_ord_df = organic_ord.value_counts().reset_index()
# organic_ord_df.columns = ['organic_product_name', 'frequency_count']
# organic_ord_df
In [ ]:
# pip install 'traitlets==4.3.3' --force-reinstall
# 왜 깔았노
In [66]:
order_products_prior_df['organic'] = mg_ord_prod_df['organic']
In [67]:
# grouped = order_products_prior_df.set_index(['order_id', 'organic'])
# grouped
# 안됨
In [68]:
order_products_prior_df['organic'].unique()
order_products_prior_df['organic'].value_counts()
Out[68]:
non organic 22184097
organic 10250392
Name: organic, dtype: int64
In [69]:
# prior set에서 organic 제품과 non organic 제품의 비율은 약 3:7
ord_prior_organic_df = order_products_prior_df['organic'].value_counts(normalize=True)
In [70]:
plt.figure(figsize=(12, 8))
sns.barplot(ord_prior_organic_df.index, ord_prior_organic_df.values)
plt.xlabel('Organic?', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.title("Count of Each products of Organic or non organic", fontsize=15)
Out[70]:
Text(0.5, 1.0, 'Count of Each products of Organic or non organic')
In [71]:
pct_ord_prod_prior
ord_prior_organic_df.organic
Out[71]:
0.316033713372207
In [72]:
ord_prior_organic_df["organic"]
Out[72]:
0.316033713372207
In [73]:
# 위에서 계산한 user_id를 기준으로 한 재주문된 건수(1)와 재주문 되지 않은 건수(0)의 비율
# sum_reord_prior_df.reordered.value_counts()/sum_reord_prior_df.shape[0] *100
# ==> 재주문된 건수의 비율은 87.91
# pct_ord_prod_prior가
pct_ord_prod_prior = sum_reord_prior_df.reordered.value_counts()/sum_reord_prior_df.shape[0] *100
pct_ord_prod_prior_or = ord_prior_organic_df["organic"]/pct_ord_prod_prior[1]*100
pct_ord_prod_prior_nor = ord_prior_organic_df["non organic"]/pct_ord_prod_prior[1]*100
In [74]:
pct_ord_prod_prior_or.round(2)
Out[74]:
0.36
In [75]:
pct_ord_prod_prior_nor.round(2)
Out[75]:
0.78
In [76]:
ratio = [pct_ord_prod_prior_or.round(2),pct_ord_prod_prior_nor.round(2) ]
labels = ['Organic Products', 'Non Organic Products']
explode = [0.10, 0.10]
plt.figure(figsize=(10, 10))
plt.title("Ratio of Organic/Non Organic Products Reordered", fontsize=25)
plt.pie(ratio, labels=labels, autopct='%.1f%%', startangle=90, counterclock=False, explode=explode, textprops = {'fontsize':20})
plt.show()
In [ ]:
'Records of > Projects' 카테고리의 다른 글
[EDA] Instacart Market Basket Analysis - 코드 필사(1) (0) | 2021.08.13 |
---|---|
[ToyPrj-Crawling] CGV 리뷰 크롤링 하기(final) (1) | 2021.08.03 |
[ToyPrj-Crawling] CGV 리뷰 크롤링 하기(4) (0) | 2021.08.03 |
[ToyPrj-Crawling] CGV 리뷰 크롤링 하기(3) (0) | 2021.08.03 |
[ToyPrj-Crawling] CGV 리뷰 크롤링 하기(2) (0) | 2021.08.03 |
Comments