SiLaure's Data

[EDA] Instacart Market Basket Analysis - Code Transcription (1)

data_soin 2021. 8. 13. 00:35

Code transcription, part one

Source: https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-instacart

In [11]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
color = sns.color_palette()

In [12]:

%matplotlib inline
pd.options.mode.chained_assignment = None

============================================================================

In [28]:

# from subprocess import check_output
# print(check_output(["ls", "/content/drive/MyDrive/instacart-market-basket-analysis"]).decode("utf8"))

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-28-8ad670734474> in <module>
      1 from subprocess import check_output
----> 2 print(check_output(["ls", "input"]).decode("utf8"))

~\anaconda3\envs\datascience\lib\subprocess.py in check_output(timeout, *popenargs, **kwargs)
    413         kwargs['input'] = empty
    414 
--> 415     return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
    416                **kwargs).stdout
    417 

~\anaconda3\envs\datascience\lib\subprocess.py in run(input, capture_output, timeout, check, *popenargs, **kwargs)
    491         kwargs['stderr'] = PIPE
    492 
--> 493     with Popen(*popenargs, **kwargs) as process:
    494         try:
    495             stdout, stderr = process.communicate(input, timeout=timeout)

~\anaconda3\envs\datascience\lib\subprocess.py in __init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags, restore_signals, start_new_session, pass_fds, encoding, errors, text)
    856                             encoding=encoding, errors=errors)
    857 
--> 858             self._execute_child(args, executable, preexec_fn, close_fds,
    859                                 pass_fds, cwd, env,
    860                                 startupinfo, creationflags, shell,

~\anaconda3\envs\datascience\lib\subprocess.py in _execute_child(self, args, executable, preexec_fn, close_fds, pass_fds, cwd, env, startupinfo, creationflags, shell, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite, unused_restore_signals, unused_start_new_session)
   1309             # Start the process
   1310             try:
-> 1311                 hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
   1312                                          # no special security
   1313                                          None, None,

FileNotFoundError: [WinError 2] The system cannot find the file specified

============================================================================

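subprocess: a module that makes it possible to run shell commands from Python.

The traceback above is raised because "ls" is not an executable on Windows, so the child process cannot be created.

Separately, when code that runs fine on Linux fails on Windows because of a decoding problem, changing the code page resolves it.

[Source] python subprocess (Windows) | by 용용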
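
Since the only purpose of the failing cell is to list the data files, a platform-independent alternative is to skip the external command entirely. The following is a minimal sketch using the standard library; the function name `list_data_files` and the `data_dir` parameter are hypothetical and not part of the original notebook.

```python
from pathlib import Path


def list_data_files(data_dir="."):
    """Return the sorted file names in data_dir without spawning a shell command."""
    return sorted(p.name for p in Path(data_dir).iterdir() if p.is_file())


# Works the same way on Windows, Linux, and macOS.
print(list_data_files("."))
```

Pointing `data_dir` at the folder that holds the CSV files should show the same names that `ls` would print on Linux.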

============================================================================

In [29]:

order_products_train_df = pd.read_csv("order_products__train.csv")
order_products_prior_df = pd.read_csv("order_products__prior.csv")
orders_df = pd.read_csv("orders.csv")
products_df = pd.read_csv("products.csv")
aisles_df = pd.read_csv("aisles.csv")
departments_df = pd.read_csv("departments.csv")

In [30]:

# Take a look at the DataFrames
orders_df.head()

Out[30]:

	order_id	user_id	eval_set	order_number	order_dow	order_hour_of_day	days_since_prior_order
0	2539329	1	prior	1	2	8	NaN
1	2398795	1	prior	2	3	7	15.0
2	473747	1	prior	3	3	12	21.0
3	2254736	1	prior	4	4	7	29.0
4	431534	1	prior	5	4	15	28.0

In [31]:

order_products_prior_df.head()

Out[31]:

	order_id	product_id	add_to_cart_order	reordered
0	2	33120	1	1
1	2	28985	2	1
2	2	9327	3	0
3	2	45918	4	1
4	2	30035	5	0

In [32]:

order_products_train_df.head()

Out[32]:

	order_id	product_id	add_to_cart_order	reordered
0	1	49302	1	1
1	1	11109	2	1
2	1	10246	3	0
3	1	49683	4	0
4	1	43633	5	1

In [18]:

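# As we can see, orders.csv holds all the information about a given order ID: the user who placed it, when it was placed, the days since the prior order, and so on.

# The columns in order_products_train and order_products_prior are identical. So what is the difference between these files?

# As the data page states, this dataset provides 4 to 100 orders per customer (we will verify this later), and the task is to predict which products will be reordered.

# To that end, each user's last order was taken out and split into train and test sets. All of a customer's earlier orders are in the order_products_prior file.

# The eval_set column in orders.csv tells us which of the three sets (prior, train, or test) a given row belongs to.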
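
The split described above can be illustrated on a toy table. This is only a sketch: the rows are invented, and the check assumes that a user's last order is the one with the highest order_number.

```python
import pandas as pd

# Toy stand-in for orders.csv (values invented for illustration only).
orders = pd.DataFrame({
    "order_id": [10, 11, 12, 20, 21, 22],
    "user_id": [1, 1, 1, 2, 2, 2],
    "eval_set": ["prior", "prior", "train", "prior", "prior", "test"],
    "order_number": [1, 2, 3, 1, 2, 3],
})

# Pick each user's last order (the row with the highest order_number).
last_orders = orders.loc[orders.groupby("user_id")["order_number"].idxmax()]

# The last order should be labelled train or test, never prior.
last_is_eval = last_orders["eval_set"].isin(["train", "test"]).all()

# Every earlier order should be labelled prior.
earlier = orders.drop(last_orders.index)
earlier_is_prior = (earlier["eval_set"] == "prior").all()

print(last_is_eval, earlier_is_prior)
```

Running the same two checks on the real orders_df would confirm or refute the description.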

# The order_products*.csv files hold the details of the products bought in a given order, along with their reordered status.

# First, let's count the rows in each of the three sets.

In [19]:

cnt_srs = orders_df.eval_set.value_counts()

plt.figure(figsize=(12,8))
# x and y are passed by keyword; seaborn 0.12+ no longer accepts them positionally
sns.barplot(x=cnt_srs.index, y=cnt_srs.values, alpha=0.8, color=color[1])

plt.xlabel('Eval set type', fontsize=12)
plt.ylabel('Number of Occurrences', fontsize=12)
plt.title('Count of rows in each dataset', fontsize = 15)
plt.xticks(rotation=0)

plt.show()

In [20]:

#orders_df

Out[20]:

	order_id	user_id	eval_set	order_number	order_dow	order_hour_of_day	days_since_prior_order
0	2539329	1	prior	1	2	8	NaN
1	2398795	1	prior	2	3	7	15.0
2	473747	1	prior	3	3	12	21.0
3	2254736	1	prior	4	4	7	29.0
4	431534	1	prior	5	4	15	28.0
...	...	...	...	...	...	...	...
3421078	2266710	206209	prior	10	5	18	29.0
3421079	1854736	206209	prior	11	4	10	30.0
3421080	626363	206209	prior	12	1	12	18.0
3421081	2977660	206209	prior	13	1	12	7.0
3421082	272231	206209	train	14	6	14	30.0

3421083 rows × 7 columns

In [35]:

# orders_df.groupby("eval_set")["user_id"].aggregate(get_unique_count)
orders_df.groupby("eval_set")["user_id"].nunique()

Out[35]:

eval_set
prior    206209
test      75000
train    131209
Name: user_id, dtype: int64

In [ ]:

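# So there are 206,209 customers in total.

# For 131,209 of them the last purchase is given as the train set, and we need to predict for the remaining 75,000 customers.

# Now let's verify the claim that 4 to 100 orders are provided per customer.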

In [36]:

cnt_srs = orders_df.groupby("user_id")["order_number"].aggregate("max").reset_index()
cnt_srs

Out[36]:

	user_id	order_number
0	1	11
1	2	15
2	3	13
3	4	6
4	5	5
...	...	...
206204	206205	4
206205	206206	68
206206	206207	17
206207	206208	50
206208	206209	14

206209 rows × 2 columns

In [37]:

cnt_srs = cnt_srs.order_number.value_counts()

cnt_srs

Out[37]:

4     23986
5     19590
6     16165
7     13850
8     11700
      ...  
94       57
91       56
97       54
98       50
99       47
Name: order_number, Length: 97, dtype: int64

In [38]:

plt.figure(figsize=(12, 8))
sns.barplot(x=cnt_srs.index, y=cnt_srs.values, alpha=0.8, color=color[2])

Out[38]:

<AxesSubplot:>

In [ ]:

# So, as stated on the data page, no customer has fewer than 4 orders, and the maximum is capped at 100.
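
The same claim can be checked numerically rather than read off a bar chart. A minimal sketch on invented rows, assuming order_number counts up from 1 without gaps so that its maximum equals the number of orders:

```python
import pandas as pd

# Toy stand-in for orders.csv (values invented for illustration only).
orders = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 2, 2, 2, 2, 2],
    "order_number": [1, 2, 3, 4, 1, 2, 3, 4, 5],
})

# Highest order_number per user, taken as that user's number of orders.
orders_per_user = orders.groupby("user_id")["order_number"].max()

# On the real data these two numbers would be compared against 4 and 100.
fewest, most = orders_per_user.min(), orders_per_user.max()
print(fewest, most)
```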

# Now let's see how ordering habits change with the day of the week.

In [39]:

plt.figure(figsize=(12,8))
sns.countplot(x="order_dow", data = orders_df, color=color[0])

plt.xlabel('Day of Week', fontsize=12)
plt.ylabel('Count', fontsize = 12)
plt.xticks(rotation=0)
plt.title("Frequency of order by week day", fontsize = 15)

plt.show()

In [ ]:

# It seems 0 and 1 are Saturday and Sunday, when orders are high, and orders are low on Wednesday.

# Now let's look at the distribution by hour of day.

In [40]:

plt.figure(figsize=(12,8))
sns.countplot(x="order_hour_of_day", data = orders_df, color=color[1])

plt.xlabel('Hour of day', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.xticks(rotation=0)
plt.title("Frequency of order by hour of day", fontsize=15)
plt.show()

In [ ]:

# So most orders are placed during the day. Now let's combine day of week and hour of day to see the distribution.

In [41]:

grouped_df = orders_df.groupby(["order_dow", "order_hour_of_day"])["order_number"].aggregate("count").reset_index()
# Keyword arguments are required by DataFrame.pivot in pandas 2.0 and later
grouped_df = grouped_df.pivot(index='order_dow', columns='order_hour_of_day', values='order_number')

grouped_df

Out[41]:

order_hour_of_day	0	1	2	3	4	5	6	7	8	9	...	14	15	16	17	18	19	20	21	22	23
order_dow
0	3936	2398	1409	963	813	1168	3329	12410	28108	40798	...	54552	53954	49463	39753	29572	22654	18277	14423	11246	6887
1	3674	1830	1105	748	809	1607	5370	16571	34116	51908	...	46764	46403	44761	36792	28977	22145	16281	11943	8992	5620
2	3059	1572	943	719	744	1399	4758	13245	24635	36314	...	37173	37469	37541	32151	26470	20084	15039	10653	8146	5358
3	2952	1495	953	654	719	1355	4562	12396	22553	32312	...	34773	35990	35273	30368	25001	19249	13795	10278	8242	5181
4	2642	1512	899	686	730	1330	4401	12493	21814	31409	...	33625	34222	34093	29378	24425	19350	14186	10796	8812	5645
5	3189	1672	1016	841	910	1574	4866	13434	24015	34232	...	37407	37508	35860	29955	24310	18741	13322	9515	7498	5265
6	3306	1919	1214	863	802	1136	3243	11319	22960	30839	...	38748	38093	35562	30398	24157	18346	13392	10501	8532	6087

7 rows × 24 columns

In [42]:

plt.figure(figsize=(12,8))
sns.heatmap(grouped_df)
plt.title("Frequency of Day of week VS. Hour of day")
plt.show()

In [ ]:

# Saturday evenings and Sunday mornings seem to be the prime time for orders.
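
Reading the busiest slot off a heatmap is approximate, so it can help to locate it programmatically. The sketch below uses only a small excerpt of the table shown earlier (days 0 and 1, hours 8 and 9), so its answer applies to that excerpt and not to the full grid.

```python
import pandas as pd

# Excerpt of the day-of-week x hour table shown above (hours 8 and 9 only).
counts = pd.DataFrame(
    {8: [28108, 34116], 9: [40798, 51908]},
    index=pd.Index([0, 1], name="order_dow"),
)
counts.columns.name = "order_hour_of_day"

# Column-wise maxima first, then the row that holds the overall maximum.
busiest_hour = counts.max().idxmax()
busiest_dow = counts[busiest_hour].idxmax()
print(busiest_dow, busiest_hour)
```

Applied to the full pivoted grouped_df, the same two lines would return the peak day and hour for the whole dataset.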

# Now let's check the time interval between orders.

In [43]:

plt.figure(figsize=(12, 8))
sns.countplot(x="days_since_prior_order", data=orders_df, color=color[3])
plt.xlabel('Days since prior order', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.xticks(rotation=0)
plt.title("Frequency distribution by days since prior order", fontsize=15)
plt.show()

In [ ]:

# Customers appear to order once a week (see the peak at 7 days) or once a month (peak at 30 days).

# Smaller peaks are also visible at 14, 21, and 28 days (weekly intervals).
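
The weekly pattern can also be pulled out of the counts directly. A minimal sketch on invented values, standing in for the days_since_prior_order column:

```python
import pandas as pd

# Toy stand-in for orders_df["days_since_prior_order"] (values invented for illustration).
days = pd.Series([7, 7, 7, 14, 14, 30, 30, 30, 3, 5, None])

# value_counts() drops NaN by default (a first order has no prior order).
freq = days.value_counts().sort_index()

# Keep only the gaps that fall on whole weeks.
weekly = freq[freq.index % 7 == 0]
print(weekly)
```

On the real column, comparing the `weekly` counts with their neighbours would show whether the 7, 14, 21, and 28 day bars really stand out.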

# Our goal is to identify reorders, so let's check the reorder ratio in the prior set and the train set.

In [44]:

order_products_prior_df.reordered
order_products_prior_df.reordered.sum()

Out[44]:

19126536

In [45]:

order_products_prior_df.shape
order_products_prior_df.shape[0]

Out[45]:

32434489

In [46]:

# percentage of re-orders in prior set #
order_products_prior_df.reordered.sum() / order_products_prior_df.shape[0]

Out[46]:

0.5896974667922161

In [47]:

# percentage of re-orders in train set #
order_products_train_df.reordered.sum() / order_products_train_df.shape[0]

Out[47]:

0.5985944127509629

In [ ]:

# On average, about 59% of ordered products are reorders.
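
Because reordered is a 0/1 flag, the ratio computed above as sum divided by row count is the same as the column mean. A small sketch on invented values:

```python
import pandas as pd

# Toy stand-in for the reordered column (1 = reordered, 0 = first purchase).
reordered = pd.Series([1, 1, 0, 1, 0])

ratio_by_sum = reordered.sum() / reordered.shape[0]
ratio_by_mean = reordered.mean()
print(ratio_by_sum, ratio_by_mean)
```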

In [ ]:

# No reordered products: since 59% of products are reorders, there will also be orders with no reordered items at all. Let's check that now.

In [49]:

grouped_df = order_products_prior_df.groupby("order_id")["reordered"].aggregate("sum").reset_index()
grouped_df

Out[49]:

	order_id	reordered
0	2	6
1	3	8
2	4	12
3	5	21
4	6	0
...	...	...
3214869	3421079	0
3214870	3421080	4
3214871	3421081	0
3214872	3421082	4
3214873	3421083	4

3214874 rows × 2 columns

In [50]:

# Mark an order as 1 if it contains at least one reordered item (a single .loc call avoids chained assignment)
grouped_df.loc[grouped_df["reordered"] > 1, "reordered"] = 1
grouped_df.reordered.value_counts() / grouped_df.shape[0]

Out[50]:

1    0.879151
0    0.120849
Name: reordered, dtype: float64

In [ ]:

# About 12% of orders in the prior set contain no reordered items, while in the train set the figure is 6.5%.
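
The share of orders without any reordered item can also be computed without overwriting the summed column. This is an alternative sketch on invented rows, not the notebook's own method: it takes the per-order maximum of the 0/1 flag instead of summing and capping.

```python
import pandas as pd

# Toy stand-in for order_products__prior.csv (values invented for illustration).
order_products = pd.DataFrame({
    "order_id": [1, 1, 2, 2, 3, 4],
    "reordered": [1, 0, 0, 0, 1, 0],
})

# max() over a 0/1 column is 1 when at least one item in the order was reordered.
has_reorder = order_products.groupby("order_id")["reordered"].max()

# Share of orders with no reordered item at all.
no_reorder_share = (has_reorder == 0).mean()
print(no_reorder_share)
```

Running the same lines on order_products_train_df would be one way to reproduce the train-set figure quoted above.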

# Now let's look at the number of products bought in each order.

In [51]:

grouped_df = order_products_train_df.groupby("order_id")["add_to_cart_order"].aggregate("max").reset_index()
cnt_srs = grouped_df.add_to_cart_order.value_counts()

In [52]:

grouped_df

Out[52]:

	order_id	add_to_cart_order
0	1	8
1	36	8
2	38	9
3	96	7
4	98	49
...	...	...
131204	3421049	6
131205	3421056	5
131206	3421058	8
131207	3421063	4
131208	3421070	3

131209 rows × 2 columns

In [53]:

cnt_srs

Out[53]:

5     8895
6     8708
7     8541
4     8218
3     8033
      ... 
68       2
66       2
75       1
77       1
67       1
Name: add_to_cart_order, Length: 75, dtype: int64

In [54]:

plt.figure(figsize=(12,8))
sns.barplot(x=cnt_srs.index, y=cnt_srs.values, alpha=0.8)

plt.xlabel('Number of Products in the given order', fontsize=12)
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xticks(rotation=0)
plt.show()

In [ ]:

# A right-tailed distribution with its peak at 5 products per order!
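
Both parts of that statement, the peak and the right tail, can be expressed as numbers. A minimal sketch on invented basket sizes; on the real data the input would be the per-order maximum of add_to_cart_order.

```python
import pandas as pd

# Toy stand-in for the number of products per order (values invented for illustration).
basket_sizes = pd.Series([3, 4, 5, 5, 5, 6, 7, 9, 15, 40])

# The mode is the most frequent basket size, i.e. the peak of the bar chart.
most_common_size = basket_sizes.mode().iloc[0]

# Sample skewness is positive when the tail extends to the right.
skewness = basket_sizes.skew()
print(most_common_size, skewness > 0)
```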

# Before exploring the product details, let's also look at the other three files.

In [55]:

products_df.head()

Out[55]:

	product_id	product_name	aisle_id	department_id
0	1	Chocolate Sandwich Cookies	61	19
1	2	All-Seasons Salt	104	13
2	3	Robust Golden Unsweetened Oolong Tea	94	7
3	4	Smart Ones Classic Favorites Mini Rigatoni Wit...	38	1
4	5	Green Chile Anytime Sauce	5	13

In [56]:

aisles_df.head()

Out[56]:

	aisle_id	aisle
0	1	prepared soups salads
1	2	specialty cheeses
2	3	energy granola bars
3	4	instant foods
4	5	marinades meat preparation

In [57]:

departments_df.head()

Out[57]:

	department_id	department
0	1	frozen
1	2	other
2	3	bakery
3	4	produce
4	5	alcohol

In [ ]:

# Now let's merge these product details with the order_prior details.

In [58]:

order_products_prior_df = pd.merge(order_products_prior_df, products_df, on='product_id', how='left')
order_products_prior_df

Out[58]:

	order_id	product_id	add_to_cart_order	reordered	product_name	aisle_id	department_id
0	2	33120	1	1	Organic Egg Whites	86	16
1	2	28985	2	1	Michigan Organic Kale	83	4
2	2	9327	3	0	Garlic Powder	104	13
3	2	45918	4	1	Coconut Butter	19	13
4	2	30035	5	0	Natural Sweetener	17	13
...	...	...	...	...	...	...	...
32434484	3421083	39678	6	1	Free & Clear Natural Dishwasher Detergent	74	17
32434485	3421083	11352	7	0	Organic Mini Sandwich Crackers Peanut Butter	78	19
32434486	3421083	4600	8	0	All Natural French Toast Sticks	52	1
32434487	3421083	24852	9	1	Banana	24	4
32434488	3421083	5020	10	1	Organic Sweet & Salty Peanut Pretzel Granola ...	3	19

32434489 rows × 7 columns

In [59]:

order_products_prior_df = pd.merge(order_products_prior_df, aisles_df, on='aisle_id', how='left')
order_products_prior_df

Out[59]:

	order_id	product_id	add_to_cart_order	reordered	product_name	aisle_id	department_id	aisle
0	2	33120	1	1	Organic Egg Whites	86	16	eggs
1	2	28985	2	1	Michigan Organic Kale	83	4	fresh vegetables
2	2	9327	3	0	Garlic Powder	104	13	spices seasonings
3	2	45918	4	1	Coconut Butter	19	13	oils vinegars
4	2	30035	5	0	Natural Sweetener	17	13	baking ingredients
...	...	...	...	...	...	...	...	...
32434484	3421083	39678	6	1	Free & Clear Natural Dishwasher Detergent	74	17	dish detergents
32434485	3421083	11352	7	0	Organic Mini Sandwich Crackers Peanut Butter	78	19	crackers
32434486	3421083	4600	8	0	All Natural French Toast Sticks	52	1	frozen breakfast
32434487	3421083	24852	9	1	Banana	24	4	fresh fruits
32434488	3421083	5020	10	1	Organic Sweet & Salty Peanut Pretzel Granola ...	3	19	energy granola bars

32434489 rows × 8 columns

In [60]:

order_products_prior_df = pd.merge(order_products_prior_df, departments_df, on='department_id', how='left')
order_products_prior_df

Out[60]:

	order_id	product_id	add_to_cart_order	reordered	product_name	aisle_id	department_id	aisle	department
0	2	33120	1	1	Organic Egg Whites	86	16	eggs	dairy eggs
1	2	28985	2	1	Michigan Organic Kale	83	4	fresh vegetables	produce
2	2	9327	3	0	Garlic Powder	104	13	spices seasonings	pantry
3	2	45918	4	1	Coconut Butter	19	13	oils vinegars	pantry
4	2	30035	5	0	Natural Sweetener	17	13	baking ingredients	pantry
...	...	...	...	...	...	...	...	...	...
32434484	3421083	39678	6	1	Free & Clear Natural Dishwasher Detergent	74	17	dish detergents	household
32434485	3421083	11352	7	0	Organic Mini Sandwich Crackers Peanut Butter	78	19	crackers	snacks
32434486	3421083	4600	8	0	All Natural French Toast Sticks	52	1	frozen breakfast	frozen
32434487	3421083	24852	9	1	Banana	24	4	fresh fruits	produce
32434488	3421083	5020	10	1	Organic Sweet & Salty Peanut Pretzel Granola ...	3	19	energy granola bars	snacks

32434489 rows × 9 columns
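
The three left merges above keep the row count at 32434489, which is what one expects when each lookup table has unique keys. A hedged sketch of the same chain with that expectation made explicit through the `validate` argument of `DataFrame.merge`; the toy frames reuse the first two rows of the output above.

```python
import pandas as pd

# Toy stand-ins for the four tables (two rows copied from the output above).
order_products = pd.DataFrame({"order_id": [2, 2], "product_id": [33120, 28985]})
products = pd.DataFrame({
    "product_id": [33120, 28985],
    "product_name": ["Organic Egg Whites", "Michigan Organic Kale"],
    "aisle_id": [86, 83],
    "department_id": [16, 4],
})
aisles = pd.DataFrame({"aisle_id": [86, 83], "aisle": ["eggs", "fresh vegetables"]})
departments = pd.DataFrame({"department_id": [16, 4], "department": ["dairy eggs", "produce"]})

# validate="many_to_one" raises MergeError if a lookup table has duplicate keys,
# which would otherwise silently multiply rows in a left merge.
merged = (
    order_products
    .merge(products, on="product_id", how="left", validate="many_to_one")
    .merge(aisles, on="aisle_id", how="left", validate="many_to_one")
    .merge(departments, on="department_id", how="left", validate="many_to_one")
)
print(merged)
```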

In [61]:

cnt_srs = order_products_prior_df['product_name'].value_counts().reset_index().head(20)
cnt_srs

Out[61]:

	index	product_name
0	Banana	472565
1	Bag of Organic Bananas	379450
2	Organic Strawberries	264683
3	Organic Baby Spinach	241921
4	Organic Hass Avocado	213584
5	Organic Avocado	176815
6	Large Lemon	152657
7	Strawberries	142951
8	Limes	140627
9	Organic Whole Milk	137905
10	Organic Raspberries	137057
11	Organic Yellow Onion	113426
12	Organic Garlic	109778
13	Organic Zucchini	104823
14	Organic Blueberries	100060
15	Cucumber Kirby	97315
16	Organic Fuji Apple	89632
17	Organic Lemon	87746
18	Apple Honeycrisp Organic	85020
19	Organic Grape Tomatoes	84255

In [62]:

cnt_srs.columns = ['product_name', 'frequency_count']
cnt_srs

Out[62]:

	product_name	frequency_count
0	Banana	472565
1	Bag of Organic Bananas	379450
2	Organic Strawberries	264683
3	Organic Baby Spinach	241921
4	Organic Hass Avocado	213584
5	Organic Avocado	176815
6	Large Lemon	152657
7	Strawberries	142951
8	Limes	140627
9	Organic Whole Milk	137905
10	Organic Raspberries	137057
11	Organic Yellow Onion	113426
12	Organic Garlic	109778
13	Organic Zucchini	104823
14	Organic Blueberries	100060
15	Cucumber Kirby	97315
16	Organic Fuji Apple	89632
17	Organic Lemon	87746
18	Apple Honeycrisp Organic	85020
19	Organic Grape Tomatoes	84255

In [63]:

# Most of them are organic products. Also, the majority of them are fruits.
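
The "mostly organic" observation can be quantified with a simple name match. The sketch below uses only the first five rows of the top-20 table, so the resulting share describes those five rows; matching on the word "organic" in the product name is an assumption about how organic items are labelled.

```python
import pandas as pd

# First five rows of the top-20 table above.
top = pd.DataFrame({
    "product_name": ["Banana", "Bag of Organic Bananas", "Organic Strawberries",
                     "Organic Baby Spinach", "Organic Hass Avocado"],
    "frequency_count": [472565, 379450, 264683, 241921, 213584],
})

# Case-insensitive substring match on the product name.
is_organic = top["product_name"].str.contains("organic", case=False)
organic_share = is_organic.mean()
print(organic_share)
```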

# Now let's look at the important aisles.

In [64]:

cnt_srs = order_products_prior_df['aisle'].value_counts().head(20)
cnt_srs

Out[64]:

fresh fruits                     3642188
fresh vegetables                 3418021
packaged vegetables fruits       1765313
yogurt                           1452343
packaged cheese                   979763
milk                              891015
water seltzer sparkling water     841533
chips pretzels                    722470
soy lactosefree                   638253
bread                             584834
refrigerated                      575881
frozen produce                    522654
ice cream ice                     498425
crackers                          458838
energy granola bars               456386
eggs                              452134
lunch meat                        395130
frozen meals                      390299
baby food formula                 382456
fresh herbs                       377741
Name: aisle, dtype: int64

In [65]:

plt.figure(figsize=(12,8))
sns.barplot(x=cnt_srs.index, y=cnt_srs.values, alpha=0.8, color=color[5])

plt.xlabel('Aisle', fontsize=12)
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xticks(rotation='vertical')
plt.show()

In [66]:

# The top two aisles are fresh fruits and fresh vegetables.

# Department distribution:

# Now let's check the distribution by department (section? category?).

In [67]:

plt.figure(figsize=(30, 30))
temp_series = order_products_prior_df['department'].value_counts()
temp_series

Out[67]:

produce            9479291
dairy eggs         5414016
snacks             2887550
beverages          2690129
frozen             2236432
pantry             1875577
bakery             1176787
canned goods       1068058
deli               1051249
dry goods pasta     866627
household           738666
breakfast           709569
meat seafood        708931
personal care       447123
babies              423802
international       269253
alcohol             153696
pets                 97724
missing              69145
other                36291
bulk                 34573
Name: department, dtype: int64

<Figure size 2160x2160 with 0 Axes>

In [68]:

labels = (np.array(temp_series.index))
labels

Out[68]:

array(['produce', 'dairy eggs', 'snacks', 'beverages', 'frozen', 'pantry',
       'bakery', 'canned goods', 'deli', 'dry goods pasta', 'household',
       'breakfast', 'meat seafood', 'personal care', 'babies',
       'international', 'alcohol', 'pets', 'missing', 'other', 'bulk'],
      dtype=object)

In [69]:

sizes = (np.array((temp_series / temp_series.sum())) *100 )
sizes

Out[69]:

array([29.22596067, 16.69215754,  8.90271464,  8.29403848,  6.8952281 ,
        5.7826624 ,  3.62819652,  3.29297002,  3.24114556,  2.67193049,
        2.27740909,  2.18769903,  2.18573198,  1.37854184,  1.30663998,
        0.83014411,  0.47386595,  0.30129656,  0.21318357,  0.11189015,
        0.10659332])

In [70]:

plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=200)
plt.title("Departments distribution", fontsize=15)
plt.show()

In [71]:

plt.figure(figsize=(15, 15))
temp_series = order_products_prior_df['department'].value_counts()
labels = (np.array(temp_series.index))
sizes = (np.array((temp_series / temp_series.sum())) *100 )
plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=200)
plt.title("Departments distribution", fontsize=20)
plt.show()

In [72]:

# Produce is the largest department. Now let's check the reorder ratio of each department.

# Department-wise reorder ratio:

In [73]:

grouped_df = order_products_prior_df.groupby(["department"])["reordered"].aggregate("mean").reset_index()
grouped_df

Out[73]:

	department	reordered
0	alcohol	0.569924
1	babies	0.578971
2	bakery	0.628141
3	beverages	0.653460
4	breakfast	0.560922
5	bulk	0.577040
6	canned goods	0.457405
7	dairy eggs	0.669969
8	deli	0.607719
9	dry goods pasta	0.461076
10	frozen	0.541885
11	household	0.402178
12	international	0.369229
13	meat seafood	0.567674
14	missing	0.395849
15	other	0.407980
16	pantry	0.346721
17	personal care	0.321129
18	pets	0.601285
19	produce	0.649913
20	snacks	0.574180
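
The extremes of the table above can be picked out without scanning it by eye. A short sketch using four rows copied from that table:

```python
import pandas as pd

# Four rows copied from the department table above.
ratios = pd.Series(
    {"dairy eggs": 0.669969, "produce": 0.649913, "pantry": 0.346721, "personal care": 0.321129},
    name="reordered",
)

highest = ratios.idxmax()
lowest = ratios.idxmin()
print(highest, lowest)
```

With the full grouped_df, `grouped_df.set_index("department")["reordered"]` would give a Series of the same shape to which the two calls apply directly.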

In [74]:

plt.figure(figsize=(12, 8))
sns.pointplot(x=grouped_df['department'].values, y=grouped_df['reordered'].values, alpha=0.8, color=color[2])

plt.xlabel('Department', fontsize=12)
plt.ylabel('Reorder Ratio', fontsize=12)
plt.title("Department-wise reorder ratio", fontsize=15)
plt.xticks(rotation='vertical')
plt.show()

In [75]:

# Personal care has the lowest reorder ratio and dairy eggs has the highest.

# Aisle-wise reorder ratio:

In [76]:

grouped_df = order_products_prior_df.groupby(["department_id", "aisle"])["reordered"].aggregate("mean").reset_index()

fig, ax = plt.subplots(figsize=(12,20))
ax.scatter(grouped_df.reordered.values, grouped_df.department_id.values)

for i, txt in enumerate(grouped_df.aisle.values) :
    ax.annotate(txt, (grouped_df.reordered.values[i],
                      grouped_df.department_id.values[i]),
                rotation=45, ha='center', va='center', color='green')
    
plt.xlabel('Reorder Ratio')
plt.ylabel('Department_id')
plt.title("Reorder Ratio of Different Aisles", fontsize=15)
plt.show()

In [77]:

# Add to cart order - reorder ratio:

# Now let's explore how the order in which products are added to the cart affects the reorder ratio.

In [78]:

order_products_prior_df["add_to_cart_order_mod"] = order_products_prior_df["add_to_cart_order"].copy()

# Cap cart positions above 70 at 70; a single .loc call avoids chained assignment
order_products_prior_df.loc[order_products_prior_df["add_to_cart_order_mod"] > 70, "add_to_cart_order_mod"] = 70

grouped_df = order_products_prior_df.groupby(["add_to_cart_order_mod"])["reordered"].aggregate("mean").reset_index()

grouped_df

Out[78]:

	add_to_cart_order_mod	reordered
0	1	0.677533
1	2	0.676251
2	3	0.658037
3	4	0.636958
4	5	0.617383
...	...	...
65	66	0.407002
66	67	0.397059
67	68	0.398352
68	69	0.393846
69	70	0.435714

70 rows × 2 columns

In [79]:

plt.figure(figsize=(12, 8))
sns.pointplot(x=grouped_df['add_to_cart_order_mod'].values, y=grouped_df['reordered'].values, alpha=0.8, color=color[2])

plt.xlabel('Add to cart order', fontsize=12)
plt.ylabel('Reorder Ratio', fontsize=12)
plt.title("Add to cart order - Reorder Ratio", fontsize=15)
plt.xticks(rotation='vertical')
plt.show()

In [80]:

# Products added to the cart first seem more likely to be reordered than those added later.
# This makes sense to me as well, since we tend to order all the products we buy frequently first and then look for new ones.
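
The capping step behind this chart pools every cart position above 70 into one bucket so the sparse tail does not produce noisy ratios. A compact sketch of the same idea on invented rows, using `Series.clip` as an alternative to the mask-and-assign form:

```python
import pandas as pd

# Toy stand-in for order_products_prior_df (values invented for illustration).
df = pd.DataFrame({
    "add_to_cart_order": [1, 1, 2, 2, 71, 90],
    "reordered": [1, 1, 1, 0, 0, 1],
})

# clip(upper=70) pools every position above 70 into the 70 bucket.
df["add_to_cart_order_mod"] = df["add_to_cart_order"].clip(upper=70)

# Mean of the 0/1 flag per cart position is the reorder ratio at that position.
ratio_by_position = df.groupby("add_to_cart_order_mod")["reordered"].mean()
print(ratio_by_position)
```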

In [81]:

# Reorder ratio by time-based variables:
order_products_train_df = pd.merge(order_products_train_df, orders_df, on='order_id', how='left')
grouped_df = order_products_train_df.groupby(["order_dow"])["reordered"].aggregate("mean").reset_index()

plt.figure(figsize=(12,8))
sns.barplot(x=grouped_df['order_dow'].values, y=grouped_df['reordered'].values, alpha=0.8, color=color[3])
plt.ylabel('Reorder ratio', fontsize=12)
plt.xlabel('Day of week', fontsize=12)
plt.title("Reorder ratio across day of week", fontsize=15)
plt.xticks(rotation='vertical')
plt.ylim(0.5, 0.7)
plt.show()

In [82]:

grouped_df = order_products_train_df.groupby(["order_hour_of_day"])["reordered"].aggregate("mean").reset_index()

plt.figure(figsize=(12, 8))
sns.barplot(x=grouped_df['order_hour_of_day'].values, y=grouped_df['reordered'].values, alpha=0.8, color=color[4])

plt.xlabel('Hour of day', fontsize=12)
plt.ylabel('Reorder Ratio', fontsize=12)
plt.title("Reorder Ratio Across Hour of Day", fontsize=15)
plt.xticks(rotation='vertical')
plt.ylim(0.5, 0.7)
plt.show()

In [83]:

grouped_df = order_products_train_df.groupby(["order_dow", "order_hour_of_day"])["reordered"].aggregate("mean").reset_index()

grouped_df = grouped_df.pivot(index='order_dow', columns='order_hour_of_day', values='reordered')

plt.figure(figsize=(12, 6))
sns.heatmap(grouped_df)
plt.title("Reorder Ratio of Day of Week VS. Hour of Day")
plt.show()

In [84]:

# Reorder ratios seem quite high in the early morning compared to the later half of the day.

-End-

License: Attribution, NonCommercial, NoDerivatives
