๋ณธ๋ฌธ ๋ฐ”๋กœ๊ฐ€๊ธฐ
๐Ÿš— Major Study (Bachelor)/๐ŸŸฅ Machine Learning

Clustering using Centroid-based approach. What is k-Means Algorithm

by UKHYUN22 2022. 11. 1.
728x90

Centroid-based Approach๋Š” k-Means ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์‚ฌ์šฉํ•œ๋‹ค. ์ด ์ ‘๊ทผ ๋ฐฉ์‹์€ ๊ธฐ์กด์˜ Hierachicalํ•œ ๋ฐฉ์‹์ด Time๊ณผ Space๊ฐ€ ๋„ˆ๋ฌด ๋งŽ์ด ์†Œ์š”๋œ๋‹ค๋Š” ์ ์— ์ฐฉ์•ˆํ•˜์—ฌ ๋“ฑ์žฅํ•œ๋‹ค. ์ข€ ๋” ํšจ์œจ์ ์ธ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์‚ฌ์šฉํ•œ Clustering ๋ฐฉ์‹์ด ํ•„์š”ํ•˜๋‹ค.

 

k-Means์˜ ์žฅ์ ์€ ์ดํ•ดํ•˜๊ณ  ๊ตฌํ˜„ํ•˜๊ธฐ ์‰ฝ๋‹ค๋Š” ๊ฒƒ์ด์ง€๋งŒ ์˜ค์ง local minimun๋งŒ์ด ์–ป์–ด์ง„๋‹ค๋Š” ๊ฒƒ์ด ๋‹จ์ ์ด๋‹ค. ๋˜ํ•œ ์ดˆ๊ธฐํ™”ํ•˜๊ธฐ ํž˜๋“ค๊ณ  outlier์— ๋ฏผ๊ฐํ•˜๋‹ค๋Š” ๊ฒƒ์ด๋‹ค.

 

k-Means ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ์•„์ด๋””์–ด์— ๋Œ€ํ•ด ์•Œ์•„๋ณด์ž. ๊ฐ๊ฐ์˜ ๋ฐ์ดํ„ฐ๋“ค์ด ๊ฐ๊ฐ์˜ Cluster๋กœ ๋…๋ฆฝ์ ์œผ๋กœ ํ• ๋‹น์‹œํ‚จ๋‹ค. Dissimilariy ๋ฐฉ์‹์œผ๋กœ ๊ฑฐ๋ฆฌ ์ธก์ • ๋ฐฉ์‹์„ ์ •์˜ํ•˜๊ณ , within cluster์˜ ํ‘œ์ค€ํŽธ์ฐจ๋ฅผ ์ตœ์†Œํ™”ํ•  ์ˆ˜ ์žˆ๋Š” Cluster๋กœ ๊ฒฐ์ •๋œ๋‹ค.

 

k-Means ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ์ ˆ์ฐจ๋ฅผ ๋ณด์ž. ์ฒ˜์Œ์— Cluster์˜ ๊ฐœ์ˆ˜๋ฅผ k๊ฐœ๋กœ ์„ค์ •ํ•˜๊ณ  ์‹œ์ž‘ํ•œ๋‹ค. ๊ฐ๊ฐ์˜ Point๋ฅผ ํ•˜๋‚˜์˜ Cluster๋กœ ์ดˆ๊ธฐํ™”ํ•œ๋‹ค. ๊ฐ๊ฐ์˜ Data point์— ๋Œ€ํ•ด์„œ cluster์˜ ์ค‘์‹ฌ๊ณผ ๊ฐ€์žฅ ๊ฐ€๊นŒ์šด ์ฃผ๋ณ€ Data ๋“ค์„ ๊ฒฐ์ •ํ•˜๊ฒŒ ๋œ๋‹ค. ์ดํ›„ ๋‹ค์‹œ Cluster์˜ ์ค‘์‹ฌ์„ ๊ณ„์‚ฐํ•œ๋‹ค. ํ•ด๋‹น ๋ฐฉ์‹์„ Centroid๊ฐ€ ์ˆ˜๋ ดํ•  ๋•Œ๊นŒ์ง€ ๋ฐ˜๋ณต์ ์œผ๋กœ ์ˆ˜ํ–‰ํ•œ๋‹ค.

 

Optimization Problem์œผ๋กœ Sum of squared Error๊ฐ€ ์กด์žฌํ•œ๋‹ค. ๊ฐ๊ฐ์˜ Data point๊ณผ Cluster์˜ Centroid ๊ฐ„์˜ ์ฐจ์ด๋ฅผ ์ œ๊ณฑํ•˜์—ฌ ๋”ํ•œ ๊ฒƒ์˜ ์ตœ์†Ÿ๊ฐ’์„ ๊ณ„์‚ฐํ•˜๊ฒŒ ๋œ๋‹ค. ์ด๋ฅผ SSE๋ผ๊ณ  ๋ถ€๋ฅธ๋‹ค. ์ตœ์†Ÿ๊ฐ’์„ ๊ฐ–๋Š” S๊ฐ€ ์ •ํ•ด์กŒ์„ ๋•Œ, centroid์˜ ํŽธ์ฐจ๋ฅผ 0์œผ๋กœ ๋‘์–ด ๋‹ค์Œ Centroid์˜ ์ง€์ ์„ ์ตœ์‹ ํ™”์‹œํ‚จ๋‹ค.

 

 

k-Means ์•Œ๊ณ ๋ฆฌ์ฆ˜์—๋Š” ์–ด๋–ค ๋ฌธ์ œ์ ์ด ์žˆ์„๊นŒ? Distance Metric ๋ฌธ์ œ๊ฐ€ ๋ฐœ์ƒํ•œ๋‹ค. Metric์ด ์ •์˜๋˜์ง€ ์•Š์•˜์„ ๋•Œ์™€ Cluster๊ฐ€ non-convexํ•œ ๊ฒฝ์šฐ Distance๋ฅผ ๊ตฌํ•˜๋Š” ๊ฒƒ์ด ์ ํ•ฉํ•˜์ง€ ์•Š์„ ์ˆ˜ ์žˆ๋‹ค๋Š” ์ ์ด๋‹ค.

 

Data Normalization ๊ฐœ๋…์ด ๋“ฑ์žฅํ•œ๋‹ค. Euclidean ๊ฑฐ๋ฆฌ๋ฅผ ์ธก์ •ํ–ˆ์„ ๋•Œ ์ ˆ๋Œ€์  ์ฐจ์ด๊ฐ€ ๋” ํฐ Data instance์— ๋Œ€ํ•ด์„œ ๋” ํฐ ์˜ค์ฐจ์˜ ๊ฐ’์„ ๋„ฃ๊ฒŒ ๋˜๋ฉด Clusterํ•˜๋Š”๋ฐ ์žˆ์–ด์„œ ์ƒ๋Œ€์ ์œผ๋กœ ๊ณ„์‚ฐ์ด ์•ˆ๋˜๋Š” ๋ฌธ์ œ๊ฐ€ ๋ฐœ์ƒํ•œ๋‹ค.

 

๊ทธ๋ ‡๊ธฐ ๋•Œ๋ฌธ์— ๊ฐ๊ฐ์˜ Data ์†์„ฑ๋“ค์„ ๋น„๊ต ๊ฐ€๋Šฅํ•˜๋„๋ก Mean-Std Normalization์„ ์‹œํ‚จ๋‹ค. Min-max scaling์ด๋ผ๊ณ  ๋ถˆ๋ฆฌ๊ธฐ๋„ ํ•œ๋‹ค.

 

๊ทธ๋ ‡๋‹ค๋ฉด k์˜ ๊ฐ’์„ ์–ด๋–ป๊ฒŒ ์„ค์ •ํ•˜๋Š”๊ฐ€? K์— ๋”ฐ๋ผ ๊ฒฐ๊ณผ๋ฅผ ๋„์ถœํ•ด์•ผ ์–ด๋Š K๊ฐ€ ๋” ์ ํ•ฉํ•œ์ง€๋ฅผ ์•Œ ์ˆ˜ ์žˆ๋‹ค. Elbow Method๋ฅผ ์‚ฌ์šฉํ•˜๋Š”๋ฐ ์ด๋Š” ๊ณ„์‚ฐํ•  ์ˆ˜ ์žˆ๋Š” ๋ฒ”์œ„์˜ k๋ฅผ ๋ชจ๋‘ ๊ณ„์‚ฐํ•˜๋Š” ๊ฒƒ์ด๋‹ค. ๊ทธ ์ค‘์— within variance๊ฐ€ ๊ฐ€์žฅ ์ž‘์€ ๊ฐ’์ด ๋‚˜์˜ค๊ฒŒ ํ•˜๋Š” k๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๋ฐฉ์‹์ด๋‹ค.

 

K์˜ ๊ฐ’์„ 1๋ถ€ํ„ฐ ์˜ฌ๋ ค์„œ SSE๊ฐ€ ์ค„์–ด๋“œ๋Š” ์–‘์ƒ์„ ํ™•์ธํ•œ๋‹ค. ์ด๋•Œ ์ฃผ์˜ํ•  ๊ฒƒ์€ Elbow Method๋Š” SSE์˜ ์ตœ์†Œ์ ์„ ์ฐพ๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋‹ค. ์ •์ฒด๊ฐ€ ์‹œ์ž‘๋˜๋Š” ๊ตฌ๊ฐ„์„ ์ฐพ๋Š”๋‹ค๊ณ  ์ƒ๊ฐํ•˜๋ฉด ๋œ๋‹ค.

 

k-Means ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ISSUE๋กœ outlier์— ๋ฏผ๊ฐํ•˜๋‹ค๊ณ  ํ–ˆ๋‹ค. ์ด์— ๋Œ€์•ˆ์ฑ…์œผ๋กœ๋Š” Sotf assignment ๋ฐฉ์‹์ด ์žˆ๋‹ค. Mixture of Gaussian์˜ ๋ฐฉ์‹์œผ๋กœ ํด๋Ÿฌ์Šคํ„ฐ๋ฅผ ๊ฐ€์งˆ ํ™•๋ฅ ์— ๋Œ€ํ•œ ์ ‘๊ทผ์„ ํ•˜๋Š” ๊ฒƒ์ด๋‹ค. ๊ทธ๋ฆฌ๊ณ  k-means๊ฐ€ ์•„๋‹Œ k-medoids ๋ฐฉ์‹์œผ๋กœ Cluster์˜ ํ‰๊ท ์ด ์•„๋‹Œ Median์„ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์ด๋‹ค.

 

๋‹ค์Œ์œผ๋กœ ์–ด๋–ป๊ฒŒ Clusterdml centroid๋ฅผ ์ดˆ๊ธฐํ™”ํ•  ๊ฒƒ์ธ๊ฐ€? Randomํ•˜๊ฒŒ ํ•˜๋Š” ๋ฐฉ์‹, Heuristicํ•˜๊ฒŒ ํ•˜๋Š” ๋ฐฉ์‹์ด ์กด์žฌํ•œ๋‹ค. k-Mean++ ๋ฐฉ์‹๋„ ์žˆ๋Š”๋ฐ ์ด๋Š” Initial Point๋กœ๋ถ€ํ„ฐ ํ•˜๋‚˜ํ•˜๋‚˜ ์ง€์ ์„ ์žก๋Š” ๊ฒƒ์ด๋‹ค.

 

๊ทธ๋ ‡๋‹ค๋ฉด ์–ด๋–ค Clutser๊ฐ€ ์ข‹์€ Cluster์ธ๊ฐ€? Intra-cluster์˜ Similarity๊ฐ€ ๋†’์„ ๋•Œ, ์ฆ‰ ๊ตฐ์ง‘ ๋‚ด์˜ ์œ ์‚ฌ๋„๊ฐ€ ๋†’๊ณ  ๊ตฐ์ง‘ ๊ฐ„์˜ ์œ ์‚ฌ๋„๊ฐ€ ๋‚ฎ์„ ๋•Œ์ด๋‹ค. (Inter-cluster similarity is low)

 

Internal Criterion ์˜ˆ์‹œ๋กœ Silhouette coefficient ๊ฐ€ ์žˆ๋‹ค. SSE๋Œ€์‹  Silhouette๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒฝ์šฐ๋„ ์žˆ์Œ์„ ์•Œ๊ณ ์žˆ์ž.

 

External Criterion์˜ ๊ฒฝ์šฐ Clustering์˜ ๊ฒฐ๊ณผ๋ฌผ์„ ์ด๋ฏธ ๊ฒฐ์ •๋œ ๋‹ต์•ˆ๊ณผ ๋น„๊ต๋ฅผ ํ•˜์—ฌ ์„ฑ๋Šฅ์„ ์ธก์ •ํ•˜๋Š” ๊ฒƒ์ด๋‹ค. ์ด๋Š” Labeled Data๊ฐ€ ํ•„์š”ํ•˜๋‹ค.