Mini Project Solutions

Mini Project 1 Problems

1. Analysis of the tmdb_5000_movies dataset

1) What is the relationship between budget and genre?

2) Which words are used most often as keywords?

3) What is the relationship between genres and keywords?

4) What is the relationship between average rating and genre?

5) Which movie genres were produced most, by year?

6) What is the relationship between popularity and budget?

7) Is there a relationship between a movie's runtime and its popularity?


2. Building a recommendation system from the tmdb_5000_movies data (implemented below via the genre_recommendations function)




3. Analysis of the groceries dataset and association rule mining

1) Analyze the dataset file

2) Which items sold the most / least per year?

3) Analysis of purchase year (month, weekday) and items by member id

- On which weekday did members buy the most?

4) Generate association rules

5) Which products should be recommended to VIP members based on the association rules?

- VIPs are the top 100 members by number of purchases



4. This dataset uses the chemical composition of wines to predict the wine type. It is loaded with the load_wine() command and is structured as follows. Build a model that can predict the wine type.

from sklearn.datasets import load_wine

Problems 1 and 2

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
t_df = pd.read_csv('../Downloads/tmdb_5000_movies.csv')
g_df = pd.read_csv('../Downloads/Groceries_dataset.csv')
pd.options.display.max_columns=50
t_df.drop(columns=['homepage','tagline','status'],inplace=True)
t_df.dropna(inplace=True)
t_df.reset_index(drop=True,inplace=True)
# Extract only the 'name' values from the JSON-like columns.
# keywords is handled separately because spaces inside a keyword are meaningful,
# so its entries are joined with commas instead of spaces.
json_col = ['genres','production_companies','production_countries','spoken_languages']
for i in json_col:
    t_df[i] = t_df[i].apply(lambda x : eval(x))
    t_df[i] = t_df[i].apply(lambda x : [d['name'] for d in x]).apply(lambda x : ' '.join(x))
t_df['keywords'] = t_df['keywords'].apply(lambda x : eval(x))
t_df['keywords'] = t_df['keywords'].apply(lambda x : [d['name'] for d in x]).apply(lambda x : ','.join(x))
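These columns are plain list-of-dict literals, so ast.literal_eval is a safer drop-in for eval here; a minimal sketch on a toy string (not the real data):

from ast import literal_eval

# literal_eval parses the same list-of-dict strings but refuses anything that is not a literal
sample = "[{'id': 28, 'name': 'Action'}]"
print([d['name'] for d in literal_eval(sample)])  # ['Action']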
# 1) What is the relationship between budget and genre?

t_df['budget_cat'] = pd.qcut(t_df['budget'],5,duplicates='drop')
t_df.pivot_table(index='budget_cat',columns='genres',aggfunc='count')

# Westerns tend to be low-budget; adventure films tend to be high-budget.

[truncated output: 4 rows × 18800 columns — movie counts per budget bin ((-0.001, 7.5M], (7.5M, 22M], (22M, 50M], (50M, 380M]) across every genre-combination column]
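pd.qcut bins by sample quantiles rather than fixed widths, and duplicates='drop' merges repeated edges, which is why the many zero budgets collapse five requested bins into four. A minimal sketch with toy numbers:

import pandas as pd

s = pd.Series([0, 0, 0, 1, 2, 3, 4, 5])
# the repeated 0s make the lowest quantile edges identical, so they are merged
print(pd.qcut(s, 4, duplicates='drop').value_counts())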

# 2) Which words are used most often as keywords?
t_df.keywords
from collections import Counter
a = []
for i in range(len(t_df['keywords'])):
    a.append(t_df['keywords'][i])
a = ','.join(a)  # join rows with ',' so keywords at row boundaries do not merge
words = a.split(',')
counter = Counter(words)
print(counter.most_common(5))

[('independent film', 192), ('murder', 172), ('violence', 146), ('dystopia', 120), ('duringcreditsstinger', 116)]
# 3) What is the relationship between genres and keywords?
"""
1) TF (Term Frequency)

How many times a given term appears in a single document (sentence).

2) IDF (Inverse Document Frequency)

DF (Document Frequency) counts how many documents (sentences) a given term appears in;
IDF is its inverse. Rather than taking a plain reciprocal, it is usually smoothed --
scikit-learn's default is idf(t) = ln((1 + n) / (1 + df(t))) + 1, with n the number of documents.
The point of inverting is that a term appearing in few documents gets a large weight,
while a term appearing in many documents gets a small one, down-weighting words that
are used everywhere without carrying much meaning.
"""
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tfidf_vector = TfidfVectorizer()

# vectorize genres plus keywords
tfidf_matrix = tfidf_vector.fit_transform(t_df['genres']+ ' ' + t_df['keywords']).toarray()

# recover the feature (term) names
tfidf_matrix_feature = tfidf_vector.get_feature_names_out()

# build a DataFrame
tfidf_matrix = pd.DataFrame(tfidf_matrix, columns=tfidf_matrix_feature, index = t_df.title)

# measure how related titles are with cosine similarity
cosine_sim = cosine_similarity(tfidf_matrix)
cosine_sim_df = pd.DataFrame(cosine_sim, index = t_df.title, columns = t_df.title)
cosine_sim_df

[truncated output: 4799 rows × 4799 columns — pairwise cosine similarity between all titles; e.g. Avatar vs. John Carter ≈ 0.249, Pirates of the Caribbean: At World's End vs. Dead Man's Chest ≈ 0.474]
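A self-contained sketch of what the vectorize-then-cosine step does, on a toy corpus (hypothetical strings, not the movie data):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ['action adventure space', 'action adventure pirates', 'romance drama']
m = TfidfVectorizer().fit_transform(docs)
# rows 0 and 1 share two terms, so their similarity is high; row 2 shares none, so it is 0
print(cosine_similarity(m).round(2))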

# 4) What is the relationship between average rating and genre?

tfidf_vector = TfidfVectorizer()

# vectorize the genres
tfidf_matrix = tfidf_vector.fit_transform(t_df['genres']).toarray()

# recover the feature (term) names
tfidf_matrix_feature = tfidf_vector.get_feature_names_out()

# binarize instead of keeping tf-idf weights
tfidf_matrix = pd.DataFrame(tfidf_matrix, columns=tfidf_matrix_feature)
tfidf_matrix = tfidf_matrix.applymap(lambda x : 0 if x == 0 else 1)

# bin the ratings into categories (5 is the highest)
t_df['cat_voteav'] = pd.qcut(t_df['vote_average'],5,labels=[1,2,3,4,5])
tfidf_matrix['vote_average'] = t_df['cat_voteav']
tfidf_matrix.groupby('vote_average').sum()

# Horror and comedy tend toward low ratings; music, history, and drama tend toward high ratings.

action adventure animation comedy crime documentary drama family fantasy fiction foreign history horror movie music mystery romance science thriller tv war western
vote_average
1 266 149 31 427 95 19 273 122 99 137 7 10 203 4 22 59 152 137 288 4 7 15
2 291 203 54 458 146 9 416 124 109 122 3 21 130 1 32 70 207 122 309 1 19 12
3 279 162 45 397 190 13 503 102 83 104 5 44 95 1 39 102 192 104 321 1 29 13
4 173 144 51 267 139 24 578 92 67 98 13 54 55 1 50 55 200 98 200 1 36 16
5 145 132 53 173 126 43 526 73 66 74 6 68 36 1 42 62 143 74 156 1 53 26
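Raw counts favor big genres (drama dwarfs western), so a share-based view makes the claim easier to check; a sketch reusing the binarized matrix above:

# fraction of each genre's movies falling in each rating bin
genre_counts = tfidf_matrix.groupby('vote_average').sum()
genre_share = genre_counts.div(genre_counts.sum(axis=0), axis=1)
print(genre_share[['horror', 'drama', 'music']].round(2))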
# 5) Which movie genres were produced most, by year?

# Caution: this qcut bins vote_average (a leftover from problem 4), not the release year,
# so 'cat_year' here actually tracks rating deciles; see the year-based sketch after the table.
t_df['cat_year'] = pd.qcut(t_df['vote_average'],10,labels=range(1,11))
tfidf_matrix['cat_year'] = t_df['cat_year']
tfidf_matrix.groupby('cat_year').sum(numeric_only=True)

# Across these bins, action counts fall while drama counts rise.

action adventure animation comedy crime documentary drama family fantasy fiction foreign history horror movie music mystery romance science thriller tv war western
cat_year
1 135 69 12 203 45 14 136 61 42 76 2 2 114 2 12 29 69 76 145 2 3 10
2 131 80 19 224 50 5 137 61 57 61 5 8 89 2 10 30 83 61 143 2 4 5
3 127 79 24 205 59 5 156 56 44 59 1 4 56 0 17 27 83 59 136 0 3 5
4 164 124 30 253 87 4 260 68 65 63 2 17 74 1 15 43 124 63 173 1 16 7
5 117 68 10 154 85 3 187 37 26 43 4 13 44 1 18 43 81 43 136 1 13 4
6 162 94 35 243 105 10 316 65 57 61 1 31 51 0 21 59 111 61 185 0 16 9
7 95 72 28 130 59 7 231 41 33 52 4 27 32 0 24 27 87 52 95 0 17 8
8 78 72 23 137 80 17 347 51 34 46 9 27 23 1 26 28 113 46 105 1 19 8
9 74 59 18 88 59 17 235 24 26 31 2 31 19 0 23 32 68 31 85 0 21 10
10 71 73 35 85 67 26 291 49 40 43 4 37 17 1 19 30 75 43 71 1 32 16
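Since the binning above keyed off vote_average, the table tracks rating deciles rather than production years. A hedged sketch of the year-based grouping the question actually asks for (assuming release_date parses cleanly, which it should after the earlier dropna):

# bin the release year into 10 roughly equal-sized periods (1 = oldest, 10 = newest)
year = pd.to_datetime(t_df['release_date']).dt.year
t_df['cat_year'] = pd.qcut(year, 10, labels=False, duplicates='drop') + 1
tfidf_matrix['cat_year'] = t_df['cat_year']
print(tfidf_matrix.groupby('cat_year').sum(numeric_only=True))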
# 6) What is the relationship between popularity and budget?

# the vote-average categories are used here as a proxy for popularity (5 = highest)
t_df.pivot_table(index = t_df['cat_voteav'], columns='budget_cat',aggfunc='count')['budget']

# A high rating does not imply a high budget, but a low rating makes a low-budget film more likely.

budget_cat (-0.001, 7500000.0] (7500000.0, 22000000.0] (22000000.0, 50000000.0] (50000000.0, 380000000.0]
cat_voteav
1 508 174 177 138
2 352 212 250 234
3 345 217 249 214
4 375 202 188 157
5 341 180 139 147
# 7) Is there a relationship between a movie's runtime and its popularity?

t_df['cat_runtime'] = pd.qcut(t_df['runtime'],5,labels=range(1,6))
t_df.pivot_table(index = t_df['cat_runtime'], columns='cat_voteav',aggfunc='count')['budget']

# Clearly, the longer the runtime, the higher the rating tends to be.

cat_voteav 1 2 3 4 5
cat_runtime
1 389 229 156 111 87
2 278 241 202 158 87
3 195 248 240 169 123
4 90 217 238 245 167
5 45 113 189 239 343
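The same trend can be quantified in one line; a sketch using a rank correlation, which tolerates the skew in both columns:

# Spearman correlation between runtime and average rating
print(t_df['runtime'].corr(t_df['vote_average'], method='spearman'))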
def genre_recommendations(title, mat, items):
    # top-20 most similar titles, skipping index 0 (the title itself)
    a = mat.loc[title].sort_values(ascending=False)[1:21]
    scale_score = pd.DataFrame(a)
    scale_score = scale_score.reset_index()
    for i in range(20):
        scale_score.loc[i,['score']] = (items[items['title'] == a.index[i]]['vote_average']).values

    # scale the rating so it is comparable with the similarity, then combine
    scale_score['total_score'] = scale_score.iloc[:,1]+(scale_score.iloc[:,2]/0.8)
    return scale_score.sort_values(by='total_score',ascending=False)[:10]


# Recommendation example: 10 movies similar to Avatar
genre_recommendations('Avatar',cosine_sim_df,t_df).reset_index(drop=True)

title Avatar score total_score
0 Alien 0.377950 7.9 10.252950
1 Aliens 0.427672 7.7 10.052672
2 Star Trek Into Darkness 0.420840 7.4 9.670840
3 Gravity 0.417716 7.3 9.542716
4 Treasure Planet 0.320138 7.2 9.320138
5 Stargate: The Ark of Truth 0.315874 6.9 8.940874
6 Spaceballs 0.329561 6.7 8.704561
7 Starship Troopers 0.298873 6.7 8.673873
8 Silent Running 0.410423 6.3 8.285423
9 Space Dogs 0.391469 6.3 8.266469

Problem 3

# 3. Analysis of the groceries dataset and association rule mining

# 1) Analyze the dataset file

# 2) Which items sold the most / least per year?

# 3) Analysis of purchase year (month, weekday) and items by member id

# - On which weekday did members buy the most?

# 4) Generate association rules

# 5) Which products should be recommended to VIP members based on the association rules?

# - VIPs are the top 100 members by number of purchases
# Convert the Date column to datetime
g_df['Date'] = pd.to_datetime(g_df['Date'],format='%d-%m-%Y' )
# no missing values
g_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38765 entries, 0 to 38764
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   Member_number    38765 non-null  int64         
 1   Date             38765 non-null  datetime64[ns]
 2   itemDescription  38765 non-null  object        
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 908.7+ KB
# 2) Which items sold the most / least per year?
g_df['year'] = g_df['Date'].dt.year
a = g_df.pivot_table(index='year',columns='itemDescription',aggfunc='count')['Date']
a.max(axis=1)
year
2014    1038.0
2015    1464.0
dtype: float64
a
# Whole milk tops both years;
# bags and toilet cleaner sell the least.

[truncated output: 2 rows × 167 columns — per-item sales counts for 2014 and 2015; whole milk leads both years (1038 and 1464), while items such as bags and toilet cleaner barely register]
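Rather than scanning 167 columns by eye, the extremes can be read off directly; a sketch over the pivot table a from above:

# best- and least-selling item per year
print(a.idxmax(axis=1))  # whole milk in both years
print(a.idxmin(axis=1))  # the rarest recorded item per year (NaN columns are skipped)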

# 3) Purchase year (month, weekday) and item analysis by member id

g_df['weekday'] = g_df['Date'].dt.weekday  # 0 = Monday, 6 = Sunday
g_df['month'] = g_df['Date'].dt.month
g_df.groupby('Member_number').sum(numeric_only=True)

year weekday month
Member_number
1000 26192 43 76
1001 24175 35 51
1002 16116 46 36
1003 16114 24 30
1004 42296 27 150
... ... ... ...
4996 20150 24 90
4997 12088 36 50
4998 4030 4 20
4999 32236 58 62
5000 14101 27 34

3898 rows × 3 columns
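Summing year per member mostly just multiplies the purchase count by roughly 2014, so a plain count is easier to read; a sketch:

# number of purchases per member
print(g_df.groupby('Member_number').size().sort_values(ascending=False).head())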

g_df_dummies = pd.get_dummies(g_df,columns=['weekday','month'])
g_df_dummies.groupby('Member_number').sum(numeric_only=True).iloc[:,1:].sum()
# 0 = Monday, 6 = Sunday
# Purchase volume is about the same on every weekday.
# The months do not differ much either.
weekday_0    5382
weekday_1    5558
weekday_2    5562
weekday_3    5620
weekday_4    5562
weekday_5    5551
weekday_6    5530
month_1      3324
month_2      2997
month_3      3133
month_4      3260
month_5      3408
month_6      3264
month_7      3300
month_8      3496
month_9      3059
month_10     3261
month_11     3254
month_12     3009
dtype: int64
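The same weekday totals fall out without the get_dummies detour; a sketch:

# identical totals to the weekday_* sums above
print(g_df['weekday'].value_counts().sort_index())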
# 4) and 5) Association rules and VIP recommendations
# Tag the top 100 members by purchase count as VIPs
b = g_df.groupby('Member_number')['itemDescription'].size().sort_values(ascending=False).iloc[:100]
g_df['vip']= g_df['Member_number'].apply(lambda x : 1 if x in b.index else 0)

# VIP transactions only (.copy() avoids SettingWithCopyWarning on later writes)
vip = g_df[g_df['vip']==1].copy()
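The membership test can also be vectorized; a sketch producing the same 0/1 column:

# vectorized equivalent of the row-wise lambda above
g_df['vip'] = g_df['Member_number'].isin(b.index).astype(int)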
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules


# Group each VIP's purchases into a basket
vip['itemDescription'] = vip['itemDescription'].apply(lambda x : x+',')
vip_item = vip.groupby('Member_number')['itemDescription'].sum().to_list()

items = [0 for x in range(len(vip_item))]
for i in range(len(vip_item)):
    items[i] = vip_item[i].split(',')
# note: the trailing comma leaves an empty-string "item" in every basket; it is
# stripped from the rule strings below, and it is why some rules appear in triplicate

# TransactionEncoder (one-hot encode the items; the default bool dtype suits apriori)
te = TransactionEncoder()
te_result = te.fit(items).transform(items)
df = pd.DataFrame(te_result, columns=te.columns_)

# apriori
apr_result = apriori(df, min_support=0.09, use_colnames=True)
lift_based = association_rules(apr_result, metric='lift', min_threshold=3)

# round lift and confidence to 1 and 2 decimals for display, then sort
lift_based['lift'] = lift_based['lift'].apply(lambda x : f'{x:.1f}')
lift_based['confidence'] = lift_based['confidence'].apply(lambda x : f'{x:.2f}')

# frozenset -> string (and drop the bogus whitespace entries)
lift_based['antecedents'] = lift_based['antecedents'].apply(lambda x: ', '.join(list(x))).str.replace(' ','')
lift_based['consequents'] = lift_based['consequents'].apply(lambda x: ', '.join(list(x))).str.replace(' ','')

# strip the leading comma left by the empty item
lift_based['antecedents'] = lift_based['antecedents'].apply(lambda x : x if x[0] != ',' else x[1:])
lift_based['consequents'] = lift_based['consequents'].apply(lambda x : x if x[0] != ',' else x[1:])

# sort by lift, then confidence
lift_sort = lift_based.sort_values(by = ['lift','confidence'], ascending=False)

# {yogurt, tropical fruit, rolls/buns} and {whole milk, other vegetables, canned beer}
# are the most strongly associated item groups among this store's VIPs. Placing the two
# groups slightly apart and putting promotional items along the path between them
# should lift sales of other products as well.
lift_sort

antecedents consequents antecedent support consequent support support confidence lift leverage conviction
8 yogurt,tropicalfruit,rolls/buns othervegetables,cannedbeer,wholemilk 0.11 0.21 0.09 0.82 3.9 0.0669 4.345000
14 yogurt,tropicalfruit,rolls/buns othervegetables,cannedbeer,wholemilk 0.11 0.21 0.09 0.82 3.9 0.0669 4.345000
20 yogurt,tropicalfruit,rolls/buns othervegetables,cannedbeer,wholemilk 0.11 0.21 0.09 0.82 3.9 0.0669 4.345000
9 othervegetables,cannedbeer,wholemilk yogurt,tropicalfruit,rolls/buns 0.21 0.11 0.09 0.43 3.9 0.0669 1.557500
15 othervegetables,cannedbeer,wholemilk yogurt,tropicalfruit,rolls/buns 0.21 0.11 0.09 0.43 3.9 0.0669 1.557500
21 othervegetables,cannedbeer,wholemilk yogurt,tropicalfruit,rolls/buns 0.21 0.11 0.09 0.43 3.9 0.0669 1.557500
6 yogurt,rolls/buns,tropicalfruit,wholemilk othervegetables,cannedbeer 0.10 0.25 0.09 0.90 3.6 0.0650 7.500000
12 tropicalfruit,wholemilk,yogurt,rolls/buns othervegetables,cannedbeer 0.10 0.25 0.09 0.90 3.6 0.0650 7.500000
17 yogurt,rolls/buns,tropicalfruit,wholemilk othervegetables,cannedbeer 0.10 0.25 0.09 0.90 3.6 0.0650 7.500000
11 othervegetables,cannedbeer yogurt,rolls/buns,tropicalfruit,wholemilk 0.25 0.10 0.09 0.36 3.6 0.0650 1.406250
18 othervegetables,cannedbeer yogurt,rolls/buns,tropicalfruit,wholemilk 0.25 0.10 0.09 0.36 3.6 0.0650 1.406250
23 othervegetables,cannedbeer tropicalfruit,wholemilk,yogurt,rolls/buns 0.25 0.10 0.09 0.36 3.6 0.0650 1.406250
0 yogurt,tropicalfruit,rolls/buns othervegetables,cannedbeer 0.11 0.25 0.09 0.82 3.3 0.0625 4.125000
2 yogurt,tropicalfruit,rolls/buns othervegetables,cannedbeer 0.11 0.25 0.09 0.82 3.3 0.0625 4.125000
4 yogurt,tropicalfruit,rolls/buns othervegetables,cannedbeer 0.11 0.25 0.09 0.82 3.3 0.0625 4.125000
1 othervegetables,cannedbeer yogurt,tropicalfruit,rolls/buns 0.25 0.11 0.09 0.36 3.3 0.0625 1.390625
3 othervegetables,cannedbeer yogurt,tropicalfruit,rolls/buns 0.25 0.11 0.09 0.36 3.3 0.0625 1.390625
5 othervegetables,cannedbeer yogurt,tropicalfruit,rolls/buns 0.25 0.11 0.09 0.36 3.3 0.0625 1.390625
10 yogurt,cannedbeer,othervegetables rolls/buns,tropicalfruit,wholemilk 0.13 0.22 0.09 0.69 3.1 0.0614 2.535000
16 yogurt,cannedbeer,othervegetables rolls/buns,tropicalfruit,wholemilk 0.13 0.22 0.09 0.69 3.1 0.0614 2.535000
22 yogurt,cannedbeer,othervegetables rolls/buns,tropicalfruit,wholemilk 0.13 0.22 0.09 0.69 3.1 0.0614 2.535000
7 rolls/buns,tropicalfruit,wholemilk yogurt,cannedbeer,othervegetables 0.22 0.13 0.09 0.41 3.1 0.0614 1.472308
13 rolls/buns,tropicalfruit,wholemilk yogurt,cannedbeer,othervegetables 0.22 0.13 0.09 0.41 3.1 0.0614 1.472308
19 rolls/buns,tropicalfruit,wholemilk yogurt,cannedbeer,othervegetables 0.22 0.13 0.09 0.41 3.1 0.0614 1.472308
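For reference, the three metrics in this table are simple frequency ratios; a hand-computed sketch on toy baskets (hypothetical items, not the VIP data):

baskets = [{'milk', 'bread'}, {'milk', 'beer'}, {'milk', 'bread', 'beer'}, {'bread'}]
n = len(baskets)

# support(A -> B): fraction of baskets containing both A and B
support = sum(('milk' in b) and ('bread' in b) for b in baskets) / n  # 2/4 = 0.5
# confidence(A -> B) = support(A and B) / support(A)
confidence = support / (sum('milk' in b for b in baskets) / n)        # 0.5 / 0.75
# lift(A -> B) = confidence / support(B); > 1 means A and B co-occur more than chance
lift = confidence / (sum('bread' in b for b in baskets) / n)          # 0.667 / 0.75
print(round(support, 3), round(confidence, 3), round(lift, 3))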

Problem 4

# 4. This dataset uses the chemical composition of wines to predict the wine type.
# It is loaded with load_wine(). Build a model that can predict the wine type.

from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.utils import to_categorical


wine = load_wine()

w_df = wine.data
w_target = wine.target

# standardize the features
scaler = StandardScaler()
w_df = scaler.fit_transform(w_df)
xtrain , xtest, ytrain, ytest = train_test_split(w_df,w_target,test_size= 0.3)

# the target has 3 classes, so one-hot encode it
ytrain = to_categorical(ytrain)
ytest = to_categorical(ytest)

model=Sequential()
model.add(Dense(3, input_dim=13, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

hist = model.fit(xtrain, ytrain, epochs=300, batch_size=1, validation_data=(xtest, ytest))

epochs = range(1, len(hist.history['accuracy']) + 1)
plt.plot(epochs, hist.history['loss'])
plt.plot(epochs, hist.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc='upper left')
plt.show()

# accuracy: 1.0000
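A quick held-out check to back up that figure (a sketch; model, xtest, ytest as defined above):

# evaluate on the held-out split
loss, acc = model.evaluate(xtest, ytest, verbose=0)
print(f'test loss: {loss:.4f}, test accuracy: {acc:.4f}')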