Mini Project Solutions

Mini Project 1 Problems

1. Analysis of the tmdb_5000_movies dataset

1) What is the relationship between budget and genre?

2) Which words are used most often as keywords?

3) What is the relationship between genres and keywords?

4) What is the relationship between average rating and genre?

5) Which movie genres were produced most, by year?

6) What is the relationship between popularity and budget?

7) Is there a relationship between a movie's runtime and its popularity?


2. Building a recommendation system from the tmdb_5000_movies data (implemented below via the genre_recommendations function)




3. Analysis of the groceries dataset and association rule mining

1) Analyze the dataset file

2) Which items sold the most / least per year?

3) Analysis of purchase year (month, weekday) and items by member id

- On which weekday did members buy the most?

4) Generate association rules

5) Which products should be recommended to VIP members based on the association rules?

- VIPs are the top 100 members by number of purchases



4. This dataset uses the chemical composition of wines to predict the wine type. It is loaded with the load_wine() command and is structured as follows. Build a model that can predict the wine type.

from sklearn.datasets import load_wine

Problems 1 and 2

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
t_df = pd.read_csv('../Downloads/tmdb_5000_movies.csv')
g_df = pd.read_csv('../Downloads/Groceries_dataset.csv')
pd.options.display.max_columns=50
t_df.drop(columns=['homepage','tagline','status'],inplace=True)
t_df.dropna(inplace=True)
t_df.reset_index(drop=True,inplace=True)
# Extract only the 'name' values from the JSON-like columns.
# keywords is handled separately because spaces inside a keyword are meaningful,
# so its entries are joined with commas instead of spaces.
json_col = ['genres','production_companies','production_countries','spoken_languages']
for i in json_col:
    t_df[i] = t_df[i].apply(lambda x : eval(x))
    t_df[i] = t_df[i].apply(lambda x : [d['name'] for d in x]).apply(lambda x : ' '.join(x))
t_df['keywords'] = t_df['keywords'].apply(lambda x : eval(x))
t_df['keywords'] = t_df['keywords'].apply(lambda x : [d['name'] for d in x]).apply(lambda x : ','.join(x))
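These columns are plain list-of-dict literals, so ast.literal_eval is a safer drop-in for eval here; a minimal sketch on a toy string (not the real data):

from ast import literal_eval

# literal_eval parses the same list-of-dict strings but refuses anything that is not a literal
sample = "[{'id': 28, 'name': 'Action'}]"
print([d['name'] for d in literal_eval(sample)])  # ['Action']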
# 1) What is the relationship between budget and genre?

t_df['budget_cat'] = pd.qcut(t_df['budget'],5,duplicates='drop')
t_df.pivot_table(index='budget_cat',columns='genres',aggfunc='count')

# Westerns tend to be low-budget; adventure films tend to be high-budget.

[truncated output: 4 rows × 18800 columns — movie counts per budget bin ((-0.001, 7.5M], (7.5M, 22M], (22M, 50M], (50M, 380M]) across every genre-combination column]
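pd.qcut bins by sample quantiles rather than fixed widths, and duplicates='drop' merges repeated edges, which is why the many zero budgets collapse five requested bins into four. A minimal sketch with toy numbers:

import pandas as pd

s = pd.Series([0, 0, 0, 1, 2, 3, 4, 5])
# the repeated 0s make the lowest quantile edges identical, so they are merged
print(pd.qcut(s, 4, duplicates='drop').value_counts())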

# 2) Which words are used most often as keywords?
t_df.keywords
from collections import Counter
a = []
for i in range(len(t_df['keywords'])):
    a.append(t_df['keywords'][i])
a = ','.join(a)  # join rows with ',' so keywords at row boundaries do not merge
words = a.split(',')
counter = Counter(words)
print(counter.most_common(5))

[('independent film', 192), ('murder', 172), ('violence', 146), ('dystopia', 120), ('duringcreditsstinger', 116)]
# 3) What is the relationship between genres and keywords?
"""
1) TF (Term Frequency)

How many times a given term appears in a single document (sentence).

2) IDF (Inverse Document Frequency)

DF (Document Frequency) counts how many documents (sentences) a given term appears in;
IDF is its inverse. Rather than taking a plain reciprocal, it is usually smoothed --
scikit-learn's default is idf(t) = ln((1 + n) / (1 + df(t))) + 1, with n the number of documents.
The point of inverting is that a term appearing in few documents gets a large weight,
while a term appearing in many documents gets a small one, down-weighting words that
are used everywhere without carrying much meaning.
"""
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tfidf_vector = TfidfVectorizer()

# vectorize genres plus keywords
tfidf_matrix = tfidf_vector.fit_transform(t_df['genres']+ ' ' + t_df['keywords']).toarray()

# recover the feature (term) names
tfidf_matrix_feature = tfidf_vector.get_feature_names_out()

# build a DataFrame
tfidf_matrix = pd.DataFrame(tfidf_matrix, columns=tfidf_matrix_feature, index = t_df.title)

# measure how related titles are with cosine similarity
cosine_sim = cosine_similarity(tfidf_matrix)
cosine_sim_df = pd.DataFrame(cosine_sim, index = t_df.title, columns = t_df.title)
cosine_sim_df

[truncated output: 4799 rows × 4799 columns — pairwise cosine similarity between all titles; e.g. Avatar vs. John Carter ≈ 0.249, Pirates of the Caribbean: At World's End vs. Dead Man's Chest ≈ 0.474]
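A self-contained sketch of what the vectorize-then-cosine step does, on a toy corpus (hypothetical strings, not the movie data):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ['action adventure space', 'action adventure pirates', 'romance drama']
m = TfidfVectorizer().fit_transform(docs)
# rows 0 and 1 share two terms, so their similarity is high; row 2 shares none, so it is 0
print(cosine_similarity(m).round(2))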

# 4) What is the relationship between average rating and genre?

tfidf_vector = TfidfVectorizer()

# vectorize the genres
tfidf_matrix = tfidf_vector.fit_transform(t_df['genres']).toarray()

# recover the feature (term) names
tfidf_matrix_feature = tfidf_vector.get_feature_names_out()

# binarize instead of keeping tf-idf weights
tfidf_matrix = pd.DataFrame(tfidf_matrix, columns=tfidf_matrix_feature)
tfidf_matrix = tfidf_matrix.applymap(lambda x : 0 if x == 0 else 1)

# bin the ratings into categories (5 is the highest)
t_df['cat_voteav'] = pd.qcut(t_df['vote_average'],5,labels=[1,2,3,4,5])
tfidf_matrix['vote_average'] = t_df['cat_voteav']
tfidf_matrix.groupby('vote_average').sum()

# Horror and comedy tend toward low ratings; music, history, and drama tend toward high ratings.

action adventure animation comedy crime documentary drama family fantasy fiction foreign history horror movie music mystery romance science thriller tv war western
vote_average
1 266 149 31 427 95 19 273 122 99 137 7 10 203 4 22 59 152 137 288 4 7 15
2 291 203 54 458 146 9 416 124 109 122 3 21 130 1 32 70 207 122 309 1 19 12
3 279 162 45 397 190 13 503 102 83 104 5 44 95 1 39 102 192 104 321 1 29 13
4 173 144 51 267 139 24 578 92 67 98 13 54 55 1 50 55 200 98 200 1 36 16
5 145 132 53 173 126 43 526 73 66 74 6 68 36 1 42 62 143 74 156 1 53 26
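Raw counts favor big genres (drama dwarfs western), so a share-based view makes the claim easier to check; a sketch reusing the binarized matrix above:

# fraction of each genre's movies falling in each rating bin
genre_counts = tfidf_matrix.groupby('vote_average').sum()
genre_share = genre_counts.div(genre_counts.sum(axis=0), axis=1)
print(genre_share[['horror', 'drama', 'music']].round(2))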
# 5) Which movie genres were produced most, by year?

# Caution: this qcut bins vote_average (a leftover from problem 4), not the release year,
# so 'cat_year' here actually tracks rating deciles; see the year-based sketch after the table.
t_df['cat_year'] = pd.qcut(t_df['vote_average'],10,labels=range(1,11))
tfidf_matrix['cat_year'] = t_df['cat_year']
tfidf_matrix.groupby('cat_year').sum(numeric_only=True)

# Across these bins, action counts fall while drama counts rise.

action adventure animation comedy crime documentary drama family fantasy fiction foreign history horror movie music mystery romance science thriller tv war western
cat_year
1 135 69 12 203 45 14 136 61 42 76 2 2 114 2 12 29 69 76 145 2 3 10
2 131 80 19 224 50 5 137 61 57 61 5 8 89 2 10 30 83 61 143 2 4 5
3 127 79 24 205 59 5 156 56 44 59 1 4 56 0 17 27 83 59 136 0 3 5
4 164 124 30 253 87 4 260 68 65 63 2 17 74 1 15 43 124 63 173 1 16 7
5 117 68 10 154 85 3 187 37 26 43 4 13 44 1 18 43 81 43 136 1 13 4
6 162 94 35 243 105 10 316 65 57 61 1 31 51 0 21 59 111 61 185 0 16 9
7 95 72 28 130 59 7 231 41 33 52 4 27 32 0 24 27 87 52 95 0 17 8
8 78 72 23 137 80 17 347 51 34 46 9 27 23 1 26 28 113 46 105 1 19 8
9 74 59 18 88 59 17 235 24 26 31 2 31 19 0 23 32 68 31 85 0 21 10
10 71 73 35 85 67 26 291 49 40 43 4 37 17 1 19 30 75 43 71 1 32 16
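Since the binning above keyed off vote_average, the table tracks rating deciles rather than production years. A hedged sketch of the year-based grouping the question actually asks for (assuming release_date parses cleanly, which it should after the earlier dropna):

# bin the release year into 10 roughly equal-sized periods (1 = oldest, 10 = newest)
year = pd.to_datetime(t_df['release_date']).dt.year
t_df['cat_year'] = pd.qcut(year, 10, labels=False, duplicates='drop') + 1
tfidf_matrix['cat_year'] = t_df['cat_year']
print(tfidf_matrix.groupby('cat_year').sum(numeric_only=True))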
# 6) What is the relationship between popularity and budget?

# the vote-average categories are used here as a proxy for popularity (5 = highest)
t_df.pivot_table(index = t_df['cat_voteav'], columns='budget_cat',aggfunc='count')['budget']

# A high rating does not imply a high budget, but a low rating makes a low-budget film more likely.

budget_cat (-0.001, 7500000.0] (7500000.0, 22000000.0] (22000000.0, 50000000.0] (50000000.0, 380000000.0]
cat_voteav
1 508 174 177 138
2 352 212 250 234
3 345 217 249 214
4 375 202 188 157
5 341 180 139 147
# 7) Is there a relationship between a movie's runtime and its popularity?

t_df['cat_runtime'] = pd.qcut(t_df['runtime'],5,labels=range(1,6))
t_df.pivot_table(index = t_df['cat_runtime'], columns='cat_voteav',aggfunc='count')['budget']

# Clearly, the longer the runtime, the higher the rating tends to be.

cat_voteav 1 2 3 4 5
cat_runtime
1 389 229 156 111 87
2 278 241 202 158 87
3 195 248 240 169 123
4 90 217 238 245 167
5 45 113 189 239 343
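The same trend can be quantified in one line; a sketch using a rank correlation, which tolerates the skew in both columns:

# Spearman correlation between runtime and average rating
print(t_df['runtime'].corr(t_df['vote_average'], method='spearman'))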
def genre_recommendations(title, mat, items):
    # top-20 most similar titles, skipping index 0 (the title itself)
    a = mat.loc[title].sort_values(ascending=False)[1:21]
    scale_score = pd.DataFrame(a)
    scale_score = scale_score.reset_index()
    for i in range(20):
        scale_score.loc[i,['score']] = (items[items['title'] == a.index[i]]['vote_average']).values

    # scale the rating so it is comparable with the similarity, then combine
    scale_score['total_score'] = scale_score.iloc[:,1]+(scale_score.iloc[:,2]/0.8)
    return scale_score.sort_values(by='total_score',ascending=False)[:10]


# Recommendation example: 10 movies similar to Avatar
genre_recommendations('Avatar',cosine_sim_df,t_df).reset_index(drop=True)

title Avatar score total_score
0 Alien 0.377950 7.9 10.252950
1 Aliens 0.427672 7.7 10.052672
2 Star Trek Into Darkness 0.420840 7.4 9.670840
3 Gravity 0.417716 7.3 9.542716
4 Treasure Planet 0.320138 7.2 9.320138
5 Stargate: The Ark of Truth 0.315874 6.9 8.940874
6 Spaceballs 0.329561 6.7 8.704561
7 Starship Troopers 0.298873 6.7 8.673873
8 Silent Running 0.410423 6.3 8.285423
9 Space Dogs 0.391469 6.3 8.266469

Problem 3

# 3. Analysis of the groceries dataset and association rule mining

# 1) Analyze the dataset file

# 2) Which items sold the most / least per year?

# 3) Analysis of purchase year (month, weekday) and items by member id

# - On which weekday did members buy the most?

# 4) Generate association rules

# 5) Which products should be recommended to VIP members based on the association rules?

# - VIPs are the top 100 members by number of purchases
# Convert the Date column to datetime
g_df['Date'] = pd.to_datetime(g_df['Date'],format='%d-%m-%Y' )
# no missing values
g_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38765 entries, 0 to 38764
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   Member_number    38765 non-null  int64         
 1   Date             38765 non-null  datetime64[ns]
 2   itemDescription  38765 non-null  object        
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 908.7+ KB
# 2) Which items sold the most / least per year?
g_df['year'] = g_df['Date'].dt.year
a = g_df.pivot_table(index='year',columns='itemDescription',aggfunc='count')['Date']
a.max(axis=1)
year
2014    1038.0
2015    1464.0
dtype: float64
a
# Whole milk tops both years;
# bags and toilet cleaner sell the least.

[truncated output: 2 rows × 167 columns — per-item sales counts for 2014 and 2015; whole milk leads both years (1038 and 1464), while items such as bags and toilet cleaner barely register]
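Rather than scanning 167 columns by eye, the extremes can be read off directly; a sketch over the pivot table a from above:

# best- and least-selling item per year
print(a.idxmax(axis=1))  # whole milk in both years
print(a.idxmin(axis=1))  # the rarest recorded item per year (NaN columns are skipped)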

# 3) Purchase year (month, weekday) and item analysis by member id

g_df['weekday'] = g_df['Date'].dt.weekday  # 0 = Monday, 6 = Sunday
g_df['month'] = g_df['Date'].dt.month
g_df.groupby('Member_number').sum(numeric_only=True)

year weekday month
Member_number
1000 26192 43 76
1001 24175 35 51
1002 16116 46 36
1003 16114 24 30
1004 42296 27 150
... ... ... ...
4996 20150 24 90
4997 12088 36 50
4998 4030 4 20
4999 32236 58 62
5000 14101 27 34

3898 rows × 3 columns
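Summing year per member mostly just multiplies the purchase count by roughly 2014, so a plain count is easier to read; a sketch:

# number of purchases per member
print(g_df.groupby('Member_number').size().sort_values(ascending=False).head())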

g_df_dummies = pd.get_dummies(g_df,columns=['weekday','month'])
g_df_dummies.groupby('Member_number').sum(numeric_only=True).iloc[:,1:].sum()
# 0 = Monday, 6 = Sunday
# Purchase volume is about the same on every weekday.
# The months do not differ much either.
weekday_0    5382
weekday_1    5558
weekday_2    5562
weekday_3    5620
weekday_4    5562
weekday_5    5551
weekday_6    5530
month_1      3324
month_2      2997
month_3      3133
month_4      3260
month_5      3408
month_6      3264
month_7      3300
month_8      3496
month_9      3059
month_10     3261
month_11     3254
month_12     3009
dtype: int64
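The same weekday totals fall out without the get_dummies detour; a sketch:

# identical totals to the weekday_* sums above
print(g_df['weekday'].value_counts().sort_index())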
# 4) and 5) Association rules and VIP recommendations
# Tag the top 100 members by purchase count as VIPs
b = g_df.groupby('Member_number')['itemDescription'].size().sort_values(ascending=False).iloc[:100]
g_df['vip']= g_df['Member_number'].apply(lambda x : 1 if x in b.index else 0)

# VIP transactions only (.copy() avoids SettingWithCopyWarning on later writes)
vip = g_df[g_df['vip']==1].copy()
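The membership test can also be vectorized; a sketch producing the same 0/1 column:

# vectorized equivalent of the row-wise lambda above
g_df['vip'] = g_df['Member_number'].isin(b.index).astype(int)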
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules


# Group each VIP's purchases into a basket
vip['itemDescription'] = vip['itemDescription'].apply(lambda x : x+',')
vip_item = vip.groupby('Member_number')['itemDescription'].sum().to_list()

items = [0 for x in range(len(vip_item))]
for i in range(len(vip_item)):
    items[i] = vip_item[i].split(',')
# note: the trailing comma leaves an empty-string "item" in every basket; it is
# stripped from the rule strings below, and it is why some rules appear in triplicate

# TransactionEncoder (one-hot encode the items; the default bool dtype suits apriori)
te = TransactionEncoder()
te_result = te.fit(items).transform(items)
df = pd.DataFrame(te_result, columns=te.columns_)

# apriori
apr_result = apriori(df, min_support=0.09, use_colnames=True)
lift_based = association_rules(apr_result, metric='lift', min_threshold=3)

# round lift and confidence to 1 and 2 decimals for display, then sort
lift_based['lift'] = lift_based['lift'].apply(lambda x : f'{x:.1f}')
lift_based['confidence'] = lift_based['confidence'].apply(lambda x : f'{x:.2f}')

# frozenset -> string (and drop the bogus whitespace entries)
lift_based['antecedents'] = lift_based['antecedents'].apply(lambda x: ', '.join(list(x))).str.replace(' ','')
lift_based['consequents'] = lift_based['consequents'].apply(lambda x: ', '.join(list(x))).str.replace(' ','')

# strip the leading comma left by the empty item
lift_based['antecedents'] = lift_based['antecedents'].apply(lambda x : x if x[0] != ',' else x[1:])
lift_based['consequents'] = lift_based['consequents'].apply(lambda x : x if x[0] != ',' else x[1:])

# sort by lift, then confidence
lift_sort = lift_based.sort_values(by = ['lift','confidence'], ascending=False)

# {yogurt, tropical fruit, rolls/buns} and {whole milk, other vegetables, canned beer}
# are the most strongly associated item groups among this store's VIPs. Placing the two
# groups slightly apart and putting promotional items along the path between them
# should lift sales of other products as well.
lift_sort

antecedents consequents antecedent support consequent support support confidence lift leverage conviction
8 yogurt,tropicalfruit,rolls/buns othervegetables,cannedbeer,wholemilk 0.11 0.21 0.09 0.82 3.9 0.0669 4.345000
14 yogurt,tropicalfruit,rolls/buns othervegetables,cannedbeer,wholemilk 0.11 0.21 0.09 0.82 3.9 0.0669 4.345000
20 yogurt,tropicalfruit,rolls/buns othervegetables,cannedbeer,wholemilk 0.11 0.21 0.09 0.82 3.9 0.0669 4.345000
9 othervegetables,cannedbeer,wholemilk yogurt,tropicalfruit,rolls/buns 0.21 0.11 0.09 0.43 3.9 0.0669 1.557500
15 othervegetables,cannedbeer,wholemilk yogurt,tropicalfruit,rolls/buns 0.21 0.11 0.09 0.43 3.9 0.0669 1.557500
21 othervegetables,cannedbeer,wholemilk yogurt,tropicalfruit,rolls/buns 0.21 0.11 0.09 0.43 3.9 0.0669 1.557500
6 yogurt,rolls/buns,tropicalfruit,wholemilk othervegetables,cannedbeer 0.10 0.25 0.09 0.90 3.6 0.0650 7.500000
12 tropicalfruit,wholemilk,yogurt,rolls/buns othervegetables,cannedbeer 0.10 0.25 0.09 0.90 3.6 0.0650 7.500000
17 yogurt,rolls/buns,tropicalfruit,wholemilk othervegetables,cannedbeer 0.10 0.25 0.09 0.90 3.6 0.0650 7.500000
11 othervegetables,cannedbeer yogurt,rolls/buns,tropicalfruit,wholemilk 0.25 0.10 0.09 0.36 3.6 0.0650 1.406250
18 othervegetables,cannedbeer yogurt,rolls/buns,tropicalfruit,wholemilk 0.25 0.10 0.09 0.36 3.6 0.0650 1.406250
23 othervegetables,cannedbeer tropicalfruit,wholemilk,yogurt,rolls/buns 0.25 0.10 0.09 0.36 3.6 0.0650 1.406250
0 yogurt,tropicalfruit,rolls/buns othervegetables,cannedbeer 0.11 0.25 0.09 0.82 3.3 0.0625 4.125000
2 yogurt,tropicalfruit,rolls/buns othervegetables,cannedbeer 0.11 0.25 0.09 0.82 3.3 0.0625 4.125000
4 yogurt,tropicalfruit,rolls/buns othervegetables,cannedbeer 0.11 0.25 0.09 0.82 3.3 0.0625 4.125000
1 othervegetables,cannedbeer yogurt,tropicalfruit,rolls/buns 0.25 0.11 0.09 0.36 3.3 0.0625 1.390625
3 othervegetables,cannedbeer yogurt,tropicalfruit,rolls/buns 0.25 0.11 0.09 0.36 3.3 0.0625 1.390625
5 othervegetables,cannedbeer yogurt,tropicalfruit,rolls/buns 0.25 0.11 0.09 0.36 3.3 0.0625 1.390625
10 yogurt,cannedbeer,othervegetables rolls/buns,tropicalfruit,wholemilk 0.13 0.22 0.09 0.69 3.1 0.0614 2.535000
16 yogurt,cannedbeer,othervegetables rolls/buns,tropicalfruit,wholemilk 0.13 0.22 0.09 0.69 3.1 0.0614 2.535000
22 yogurt,cannedbeer,othervegetables rolls/buns,tropicalfruit,wholemilk 0.13 0.22 0.09 0.69 3.1 0.0614 2.535000
7 rolls/buns,tropicalfruit,wholemilk yogurt,cannedbeer,othervegetables 0.22 0.13 0.09 0.41 3.1 0.0614 1.472308
13 rolls/buns,tropicalfruit,wholemilk yogurt,cannedbeer,othervegetables 0.22 0.13 0.09 0.41 3.1 0.0614 1.472308
19 rolls/buns,tropicalfruit,wholemilk yogurt,cannedbeer,othervegetables 0.22 0.13 0.09 0.41 3.1 0.0614 1.472308
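For reference, the three metrics in this table are simple frequency ratios; a hand-computed sketch on toy baskets (hypothetical items, not the VIP data):

baskets = [{'milk', 'bread'}, {'milk', 'beer'}, {'milk', 'bread', 'beer'}, {'bread'}]
n = len(baskets)

# support(A -> B): fraction of baskets containing both A and B
support = sum(('milk' in b) and ('bread' in b) for b in baskets) / n  # 2/4 = 0.5
# confidence(A -> B) = support(A and B) / support(A)
confidence = support / (sum('milk' in b for b in baskets) / n)        # 0.5 / 0.75
# lift(A -> B) = confidence / support(B); > 1 means A and B co-occur more than chance
lift = confidence / (sum('bread' in b for b in baskets) / n)          # 0.667 / 0.75
print(round(support, 3), round(confidence, 3), round(lift, 3))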

Problem 4

# 4. This dataset uses the chemical composition of wines to predict the wine type.
# It is loaded with load_wine(). Build a model that can predict the wine type.

from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.utils import to_categorical


wine = load_wine()

w_df = wine.data
w_target = wine.target

# standardize the features
scaler = StandardScaler()
w_df = scaler.fit_transform(w_df)
xtrain , xtest, ytrain, ytest = train_test_split(w_df,w_target,test_size= 0.3)

# the target has 3 classes, so one-hot encode it
ytrain = to_categorical(ytrain)
ytest = to_categorical(ytest)

model=Sequential()
model.add(Dense(3, input_dim=13, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

hist = model.fit(xtrain, ytrain, epochs=300, batch_size=1, validation_data=(xtest, ytest))

epochs = range(1, len(hist.history['accuracy']) + 1)
plt.plot(epochs, hist.history['loss'])
plt.plot(epochs, hist.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc='upper left')
plt.show()

# accuracy: 1.0000
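A quick held-out check to back up that figure (a sketch; model, xtest, ytest as defined above):

# evaluate on the held-out split
loss, acc = model.evaluate(xtest, ytest, verbose=0)
print(f'test loss: {loss:.4f}, test accuracy: {acc:.4f}')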