Lecture 6: Training Neural Networks 1

Activation Functions

An activation function takes the weighted input of a layer — the weights multiplied by the input data, plus a bias — and determines what value is passed on to the next layer. Representative activation functions include Sigmoid, tanh, ReLU, Leaky ReLU, Maxout, and ELU.
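
As a rough sketch of this definition (the sizes below are illustrative assumptions, not values fixed by the lecture), a layer computes $Wx+b$ and the activation function decides what is passed on:

import numpy as np

x = np.random.randn(3072)                 # input vector (e.g., a flattened 32x32x3 image)
W = 0.01 * np.random.randn(50, 3072)      # weights
b = np.zeros(50)                          # bias

z = np.dot(W, x) + b                      # weighted input, Wx + b
a = np.maximum(0, z)                      # activation (ReLU here) passed to the next layer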

Sigmoid

์‹œ๊ทธ๋ชจ์ด๋“œ ํ•จ์ˆ˜์˜ ๋ชจ์Šต์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

$\sigma(x)=1/(1+e^{-x})$

This function squashes its input into the range between 0 and 1. However, it is rarely used in practice because it has three critical drawbacks:

  1. ์ž…๋ ฅ ๋ฐ์ดํ„ฐ๊ฐ€ ํŠน์ • ๊ฐ’์„ ๋„˜์–ด๊ฐ€๊ฑฐ๋‚˜ ๋„˜์ง€ ๋ชปํ•˜๋ฉด gradient๊ฐ€ 0์œผ๋กœ ์ˆ˜๋ ดํ•œ๋‹ค. ์ฆ‰, gradient descent๋ฅผ ํ•˜์ง€ ๋ชปํ•˜๊ธฐ ๋•Œ๋ฌธ์— ์—ญ์ „ํŒŒ๋ฅผ ํ•˜์ง€ ๋ชปํ•œ๋‹ค.
  2. ์‹œ๊ทธ๋ชจ์ด๋“œ ํ•จ์ˆ˜์˜ ์ถœ๋ ฅ๊ฐ’์˜ ์ค‘์‹ฌ์ด 0์ด ๋˜์ง€ ์•Š๋Š”๋‹ค. ์ถœ๋ ฅ๊ฐ’์˜ ์ค‘์‹ฌ์ด 0์ด ๋˜์ง€ ์•Š์œผ๋ฉด ๊ฐ€์ค‘์น˜ ํ–‰๋ ฌ์˜ gradient๊ฐ€ ๋ชจ๋‘ ์–‘์ˆ˜์ด๊ฑฐ๋‚˜ ์Œ์ˆ˜์˜ ํ˜•ํƒœ๋ฅผ ๋„๊ฒŒ๋œ๋‹ค. ์ด ๋ง์ธ์ฆ‰์Šจ ์—ญ์ „ํŒŒ๋ฅผ ํ• ๋•Œ ๊ฐ€์ค‘์น˜ ๊ฐ’์ด ๋ชจ๋‘ ์ฆ๊ฐ€ํ•˜๊ฑฐ๋‚˜ ๊ฐ์†Œํ•˜์—ฌ ์ตœ์ ์˜ ๊ฐ€์ค‘์น˜ ํ–‰๋ ฌ์„ ์ฐพ๊ธฐ ์–ด๋ ต๋‹ค๋Š” ๋ง์ด ๋œ๋‹ค.
  3. ์ž์—ฐํ•จ์ˆ˜ $e$ ๊ฐ€ ๋“ค์–ด๊ฐ€๋Š” ๊ณ„์‚ฐ์ด๋ฏ€๋กœ ๊ณ„์‚ฐ๊ณผ์ •์ด ์˜ค๋ž˜ ๊ฑธ๋ฆฐ๋‹ค.

tanh

tanh ํ•จ์ˆ˜์˜ ๋ชจ์Šต์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

์ด ํ•จ์ˆ˜๋Š” ์ถœ๋ ฅ๊ฐ’์˜ ์ค‘์‹ฌ์ด 0์ด ๋˜์ง€ ์•Š๋Š”๋‹ค๋Š” ์ ๋งŒ ์ œ์™ธํ•˜๋ฉด ์‹œ๊ทธ๋ชจ์ด๋“œ ํ•จ์ˆ˜์™€ ๊ฐ™์€ ๋‹จ์ ์„ ๊ฐ€์ง€๊ณ  ์žˆ์–ด ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ ์ž์ฃผ ์‚ฌ์šฉ๋˜์ง€๋Š” ์•Š๋Š”๋‹ค.

ReLU

ReLU์˜ ๋ชจ์Šต์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

$f(x)=\max(0,x)$

ReLU์˜ ์žฅ์ ์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

  1. ์ž…๋ ฅ ๋ฐ์ดํ„ฐ๊ฐ€ ์–‘์ˆ˜์ผ๋•Œ gradient๊ฐ€ 0๋กœ ์ˆ˜๋ ดํ•˜์ง€ ์•Š๋Š”๋‹ค.
  2. ๊ณ„์‚ฐ๊ณผ์ •์ด ๋‹จ์ˆœํ•˜๊ณ  ๋น ๋ฅด๋‹ค.
  3. ์‹ค์ œ ๋‰ด๋Ÿฐ๊ณผ ์ž‘๋™ํ•˜๋Š” ๋ฐฉ๋ฒ•์ด ๋น„์Šทํ•˜๋‹ค.

๋ฐ˜๋ฉด ๋‹จ์ ์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

  1. ์ž…๋ ฅ ๋ฐ์ดํ„ฐ๊ฐ€ ์Œ์ˆ˜์ผ๋•Œ gradient๊ฐ€ 0์œผ๋กœ ์ˆ˜๋ ดํ•œ๋‹ค.
  2. ์ถœ๋ ฅ๊ฐ’์˜ ์ค‘์‹ฌ์ด 0์ด ๋˜์ง€ ์•Š๋Š”๋‹ค.
  3. ์•„๋ž˜ ์‚ฌ์ง„๊ณผ ๊ฐ™์ด ๋ฐ์ดํ„ฐ๋“ค์ด ์กด์žฌํ•˜๋Š” ์ง€์—ญ ๋ฐ–์— ํ•จ์ˆ˜๊ฐ€ ์กด์žฌํ•˜๋ฉด dead ReLU, ์ฆ‰ ์ฃฝ์€ ReLU๊ฐ€ ๋˜์–ด ๋” ์ด์ƒ ์—…๋ฐ์ดํŠธ๊ฐ€ ๋˜์ง€ ์•Š๋Š” ํ•จ์ˆ˜๊ฐ€ ๋œ๋‹ค.

Leaky ReLU, PReLU

Leaky ReLU์˜ ๋ชจ์Šต์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

$f(x)=\max(0.01x,x)$

๊ธฐ๋ณธ์ ์ธ ํ˜•ํƒœ๋Š” ReLU์™€ ๋น„์Šทํ•˜๋‚˜ ReLU์— ๋น„ํ•ด ์ž…๋ ฅ ๋ฐ์ดํ„ฐ๊ฐ€ ์Œ์ˆ˜์ผ๋•Œ gradient๊ฐ€ 0์œผ๋กœ ์ˆ˜๋ ดํ•˜์ง€ ์•Š๋Š”๋‹ค๋Š” ์žฅ์ ์„ ๊ฐ€์ง€๊ณ  ์žˆ๋‹ค.

PreReLU๋„ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ReLU์™€ ๋น„์Šทํ•œ ํ•จ์ˆ˜์ด๋‹ค. $f(x)=max(\alpha x,x)$

Here, $\alpha$ is a parameter learned through backpropagation.

ELU

ELU์˜ ๋ชจ์Šต์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

ReLU์˜ ๋ชจ๋“  ์žฅ์ ์„ ๊ณต์œ ํ•˜๊ณ  ์žˆ์œผ๋‚˜ ์ž์—ฐํ•จ์ˆ˜ $e$๊ฐ€ ํฌํ•จ๋œ ๊ณ„์‚ฐ๊ณผ์ •์ด ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ๊ณ„์‚ฐํ•˜๋Š” ์‹œ๊ฐ„์ด ์˜ค๋ž˜ ๊ฑธ๋ฆฐ๋‹ค๋Š” ๋‹จ์ ์ด ์žˆ๋‹ค.

Maxout

Maxout์˜ ๋ชจ์Šต์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

$\max(w_1^{T}x+b_1,w_2^{T}x+b_2)$

Its significance is that it generalizes ReLU and Leaky ReLU, overcoming all of ReLU's drawbacks. Its own drawback is that it doubles the computation (and parameters) of a plain ReLU.
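
A sketch of a maxout layer with two linear pieces (shapes are illustrative assumptions); the doubled parameter count is directly visible in W1 and W2:

import numpy as np

x = np.random.randn(3072)                      # flattened input image
W1, b1 = 0.01 * np.random.randn(50, 3072), np.zeros(50)
W2, b2 = 0.01 * np.random.randn(50, 3072), np.zeros(50)

# Elementwise max of two affine functions: twice the parameters of ReLU,
# but no saturation and no dead units.
out = np.maximum(np.dot(W1, x) + b1, np.dot(W2, x) + b2)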

Choosing Activation Functions

Here are some simple tips on which activation function to use.

  • ์ฒ˜์Œ์—๋Š” ReLU ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค. ์ด๋•Œ learning rate๋ฅผ ์ž˜ ์„ค์ •ํ•˜๋Š” ๊ฒƒ์ด ์ค‘์š”ํ•˜๋‹ค.
  • ๋‹ค์Œ์œผ๋กœ Leaky ReLU, Maxout, ELU๋ฅผ ์‚ฌ์šฉํ•ด๋ณธ๋‹ค.
  • tanh๋Š” ํ•œ๋ฒˆ์ฏค ์‹œ๋„ํ•ด๋ณด๋‚˜ ๊ฒฐ๊ณผ์— ๋Œ€ํ•œ ๊ธฐ๋Œ€๋Š” ํ•˜์ง€ ์•Š๋Š” ๊ฒƒ์ด ์ข‹๋‹ค.
  • sigmoid๋Š” ์ ˆ๋Œ€ ์‚ฌ์šฉํ•˜์ง€ ์•Š๋Š”๋‹ค.

Data Preprocessing

It is good practice to preprocess images before training. The recommended preprocessing step is zero-centering, i.e., shifting the data so its mean is 0. Normalization of the scale, by contrast, is rarely applied to images, in order to preserve the distribution and scale of the image data.

zero centering ์™ธ์˜ ์ถ”์ฒœํ•˜๋Š” ์ „์ฒ˜๋ฆฌ๋Š” ์ด๋ฏธ์ง€์˜ ์ฐจ์›์„ ์ค„์ด๋Š” PCA, ๋ถ„์‚ฐ์„ 1๋กœ ๋งŒ๋“œ๋Š” whitening ๋“ฑ์ด ์žˆ๋‹ค.

There are two different ways to compute the mean used for zero-centering, as sketched below.

  1. ์ด๋ฏธ์ง€์˜ ์ฐจ์›์€ ์ƒ๊ฐํ•˜์ง€ ์•Š๊ณ  ๊ฐ ํ”ฝ์…€์˜ ํ‰๊ท ์„ ๊ตฌํ•œ๋‹ค. (ex: AlexNet)
  2. ์ฐจ์›๋ณ„๋กœ ๊ฐ ํ”ฝ์…€์˜ ํ‰๊ท ์„ ๊ตฌํ•œ๋‹ค. (ex: VGGNet)

Weight Initialization

Setting good initial weights is crucial for training an accurate model. Let's look at several ways of choosing them.

  1. ์ž‘์€ ๋žœ๋ค ์ˆซ์ž๋กœ ์„ค์ •ํ•œ๋‹ค
    • ๋ ˆ์ด์–ด๊ฐ€ ์ค‘์ฒฉ๋˜๋ฉฐ ๊ฐ€์ค‘์น˜๋ฅผ ๊ณ„์† ๊ณฑํ–ˆ์„๋•Œ ํ‘œ์ค€ํŽธ์ฐจ๊ฐ€ 0์— ๊ฐ€๊นŒ์›Œ์ง€๋ฏ€๋กœ ์ข‹์€ ๋ฐฉ๋ฒ•์ด ์•„๋‹ˆ๋‹ค.


2. ํฐ ๋žœ๋ค ์ˆซ์ž๋กœ ์„ค์ •ํ•œ๋‹ค.

  • ๋ ˆ์ด์–ด๊ฐ€ ์ค‘์ฒฉ๋˜๋ฉฐ ๊ฐ€์ค‘์น˜๋ฅผ ๊ณ„์† ๊ณฑํ–ˆ์„๋•Œ ํ‘œ์ค€ํŽธ์ฐจ๊ฐ€ -1 ๋˜๋Š” 1์— ๊ฐ€๊นŒ์›Œ์ง€๋ฏ€๋กœ ์ข‹์€ ๋ฐฉ๋ฒ•์ด ์•„๋‹ˆ๋‹ค.


3. Xavier initalization

  • ์ž…๋ ฅ ๋ฐ์ดํ„ฐ์˜ ๋ถ„์‚ฐ๊ณผ ์ถœ๋ ฅ ๋ฐ์ดํ„ฐ์˜ ๋ถ„์‚ฐ์„ ๊ฐ™๊ฒŒ ๋งŒ๋“œ๋Š” ๋ฐฉ๋ฒ•์ด๋‹ค.
  • ์ถœ๋ ฅ ๋ฐ์ดํ„ฐ์˜ ํฌ๊ธฐ๊ฐ€ ํฌ๋ฉด ๊ฐ€์ค‘์น˜์˜ ๊ฐ’์„ ์ž‘๊ฒŒ ์„ค์ •ํ•˜๊ณ , ์ถœ๋ ฅ ๋ฐ์ดํ„ฐ์˜ ํฌ๊ธฐ๊ฐ€ ์ž‘์œผ๋ฉด ๊ฐ€์ค‘์น˜์˜ ๊ฐ’์„ ํฌ๊ฒŒ ์„ค์ •ํ•œ๋‹ค.
  • ๊ฐ€์žฅ ์ด์ƒ์ ์ธ ๋ฐฉ๋ฒ•์œผ๋กœ, ๋ ˆ์ด์–ด๋ฅผ ์ค‘์ฒฉํ• ์ˆ˜๋ก ํ‘œ์ค€ํŽธ์ฐจ๊ฐ€ ๊ฐ์†Œํ•˜๋ฉฐ ํŠน์ • ๊ฐ’์œผ๋กœ ์ˆ˜๋ ดํ•˜๋Š” ํ˜•ํƒœ์ด๋‹ค.

  • ReLU๋ฅผ ์‚ฌ์šฉํ• ๋•Œ๋Š” ์ž…๋ ฅ ๋ฐ์ดํ„ฐ๋ฅผ 2๋กœ ๋‚˜๋ˆ„์–ด์•ผ์ง€ ์ •์ƒ์ ์œผ๋กœ ์ž‘๋™ํ•œ๋‹ค.
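
The following sketch reproduces the spirit of the lecture's experiment under illustrative assumptions (10 tanh layers of 500 units each): push random data through the stack and watch the activation statistics.

import numpy as np

np.random.seed(0)
x0 = np.random.randn(1000, 500)

for scheme in ("small random", "xavier"):
    h = x0
    stds = []
    for _ in range(10):
        fan_in = h.shape[1]
        if scheme == "small random":
            W = 0.01 * np.random.randn(fan_in, 500)             # small random numbers
        else:
            W = np.random.randn(fan_in, 500) / np.sqrt(fan_in)  # Xavier
        h = np.tanh(np.dot(h, W))
        stds.append(h.std())
    print(scheme, ["%.3f" % s for s in stds])

# "small random" collapses toward std 0 layer by layer; "xavier" stays
# roughly stable. For ReLU layers use np.sqrt(fan_in / 2.0) instead
# (He initialization), matching the note above.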

Batch Normalization

Batch Normalization is a method for keeping the distribution of the data from drifting as the network trains. What distinguishes it from ordinary normalization is that it normalizes each dimension over the current mini-batch. The basic formula of batch normalization is as follows.

$\hat{x}^{(k)}=\frac{x^{(k)}-E[x^{(k)}]}{\sqrt{Var[x^{(k)}]}}$

Batch normalization is usually applied immediately after fully connected or convolutional layers, before the nonlinearity.

It also frequently happens that the batch-normalized data needs to be converted back toward its pre-normalization form. Batch normalization therefore also learns two parameters, $\gamma$ and $\beta$, that scale and shift the normalized data so the original distribution can be recovered:

$y^{(k)}=\gamma^{(k)} \hat{x}^{(k)}+\beta^{(k)}$
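
A minimal forward-pass sketch of these two formulas for a fully connected layer; the eps term and the initial values of gamma and beta are common conventions, assumed here rather than fixed by the lecture.

import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                      # per-dimension mean over the batch
    var = x.var(axis=0)                      # per-dimension variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize each dimension
    return gamma * x_hat + beta              # learnable scale and shift

x = np.random.randn(64, 50) * 3 + 7          # a batch with mean ~7, std ~3
gamma, beta = np.ones(50), np.zeros(50)
y = batchnorm_forward(x, gamma, beta)
print(y.mean(axis=0)[:3], y.std(axis=0)[:3]) # each dimension now has mean ~0, std ~1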

๋ฐฐ์น˜ ์ •๊ทœํ™”์˜ ์žฅ์ ์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

  1. It improves gradient flow through the network.
  2. It allows higher learning rates.
  3. It reduces the burden of setting good initial weights.
  4. It acts as a mild regularizer, somewhat reducing the need for dropout.


Monitoring the training

์ด๋ฏธ์ง€์˜ ํ›ˆ๋ จ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๊ณผ์ •์œผ๋กœ ์ด๋ฃจ์–ด์ง„๋‹ค.

  1. ์ด๋ฏธ์ง€ ๋ฐ์ดํ„ฐ๋ฅผ ์ „์ฒ˜๋ฆฌํ•œ๋‹ค.
# zero-centering ๋ฐฉ๋ฒ•์„ ์ด์šฉ
X -= np.mean(X, axis=0)


2. ํ›ˆ๋ จ ๋ชจ๋ธ์˜ ๊ตฌ์กฐ๋ฅผ ์„ค๊ณ„ํ•œ๋‹ค. ์ด๋•Œ, ์ธต์˜ ๊ฐœ์ˆ˜, ๋…ธ๋“œ์˜ ๊ฐœ์ˆ˜๋“ฑ์„ ์ •ํ•œ๋‹ค.

# ๋ ˆ์ด์–ด์˜ ์ˆ˜๋Š” 50๊ฐœ, ๋ ˆ์ด์–ด๋‹น ๋…ธ๋“œ ์ˆ˜๋Š” 10๊ฐœ๋กœ ์„ค์ •ํ•œ๋‹ค.


3. ๋ชจ๋ธ์„ ํ›ˆ๋ จ์‹œํ‚จ๋‹ค. ์ด๋•Œ, ์ •์งํ™”(regularization)๊ฐ’์€ 0, learning rate๋Š” ์ ๋‹นํžˆ ์ž‘๊ฒŒ ์„ค์ •ํ•œ๋‹ค. ๋˜ํ•œ ์ด๋ฏธ์ง€ ๋ฐ์ดํ„ฐ์˜ ์ผ๋ถ€์— ๊ณผ์ ํ•ฉ ํ•  ์ˆ˜ ์žˆ๋„๋ก 20๊ฐœ์˜ ๋ฐ์ดํ„ฐ๋งŒ ํ›ˆ๋ จ์‹œํ‚จ๋‹ค.

model = init_two_layer_model(32*32*3,50,10) #input size, hidden layer size, number of classes
trainer = ClassifierTrainer()
X_tiny = X_train[:20]   # train on only 20 examples
y_tiny = y_train[:20]
best_model, stats = trainer.train(X_tiny,y_tiny,X_tiny,y_tiny,
                                        model,two_layer_net,
                                        num_epochs=200, reg=0.0,
                                        update='sgd', learning_rate_decay=1,
                                        sample_batch=False,
                                        learning_rate=1e-3, verbose=True)

Running the code above prints per-epoch training results like the following.

ํ›ˆ๋ จ์„ ์ง„ํ–‰ํ• ์ˆ˜๋ก ์†์‹ค๊ฐ’(cost)์€ ๊ฐ์†Œํ•˜๊ณ  ์ •ํ™•๋„(train)์€ ์ฆ๊ฐ€ํ•˜์—ฌ ํ›ˆ๋ จ์ด ์ •์ƒ์ ์œผ๋กœ ์ด๋ฃจ์–ด์ง€๊ณ  ์žˆ๋‹ค๋Š” ์‚ฌ์‹ค์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.


4. ์ •์งํ™” ๊ฐ’์„ ๋ณ€๊ฒฝํ•˜๊ณ  learning rate ๊ฐ’์„ ๋” ์ž‘๊ฒŒ ์„ค์ •ํ•˜์—ฌ ๋ชจ๋ธ์„ ๋‹ค์‹œ ํ›ˆ๋ จ์‹œ์ผœ๋ณธ๋‹ค.

model = init_two_layer_model(32*32*3,50,10) #input size, hidden layer size, number of classes
trainer = ClassifierTrainer()
best_model, stats = trainer.train(X_train,y_train,X_val,y_val,
                                        model,two_layer_net,
                                        num_epochs=10, reg=0.000001,
                                        update='sgd', learning_rate_decay=1,
                                        sample_batch=True,
                                        learning_rate=1e-6, verbose=True)

ํ›ˆ๋ จ์ด ์ง„ํ–‰ํ• ์ˆ˜๋ก ์†์‹ค๊ฐ’๊ณผ ์ •ํ™•๋„ ๊ฐ’์ด ๊ฑฐ์˜ ๊ทธ๋Œ€๋กœ ์ธ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค. ์ด๋Ÿฌํ•œ ํ˜„์ƒ์€ learning rate์„ ๋„ˆ๋ฌด ๋‚ฎ๊ฒŒ ํ•˜์˜€๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค.


5. Then let's set the learning rate very high and train the model again.

model = init_two_layer_model(32*32*3,50,10) #input size, hidden layer size, number of classes
trainer = ClassifierTrainer()
best_model, stats = trainer.train(X_train,y_train,X_val,y_val,
                                        model,two_layer_net,
                                        num_epochs=10, reg=0.000001,
                                        update='sgd', learning_rate_decay=1,
                                        sample_batch=True,
                                        learning_rate=1e6, verbose=True)

์†์‹ค๊ฐ’์ด NaN์ด ๋‚˜์™”๋‹ค. ์ด๋Ÿฌํ•œ ํ˜„์ƒ์€ learning rate๊ฐ€ ์ง€๋‚˜์น˜๊ฒŒ ํฌ๊ฒŒ ์„ค์ •๋˜์—ˆ๊ธฐ ๋•Œ๋ฌธ์— ๋‚˜ํƒ€๋‚˜์˜€๋‹ค.

Hyperparameter Optimization

Cross validation

์ตœ์ ์˜ parameter์„ ์ฐพ๊ธฐ ์œ„ํ•œ ๊ณผ์ •์œผ๋กœ ๊ต์ฐจ ๊ฒ€์ฆ(cross-validation)๊ฐ€ ์žˆ๋‹ค. ์ •์งํ™” ๋ณ€์ˆ˜์™€ learning rate ๋ณ€์ˆ˜๋ฅผ ์ผ์ •ํ•œ ๋ฒ”์œ„๋กœ ์„ค์ •ํ•œ ํ›„ ๊ทธ ๋ฒ”์œ„ ์•ˆ์—์„œ ๊ฐ€์žฅ ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์ด๋Š” ๋ชจ๋ธ์˜ parameter์„ ์ฐพ๋Š” ๋ฐฉ๋ฒ•์ด๋‹ค.

from random import uniform

max_count = 100
for count in xrange(max_count):
    # sample the regularization strength and learning rate on a log scale
    reg = 10**uniform(-4, 0)
    lr = 10**uniform(-3, -4)
    model = init_two_layer_model(32*32*3, 50, 10) # input size, hidden layer size, number of classes
    trainer = ClassifierTrainer()
    best_model_local, stats = trainer.train(X_train, y_train, X_val, y_val,
                                        model, two_layer_net,
                                        num_epochs=5, reg=reg,
                                        update='momentum', learning_rate_decay=0.9,
                                        sample_batches=True, batch_size=100,
                                        learning_rate=lr, verbose=True)

The runs highlighted with a red box show high validation accuracy (val_acc), so you can take the regularization and learning rate values used in those runs, set a narrower range around them, and run cross-validation again.

์ด์ „ ๋‹จ์›์—์„œ ๋žœ๋ค์„œ์น˜(Random search)๋Š” ์ƒ๋‹นํžˆ ๋ฌด์‹ํ•˜๊ณ  ์ž˜ ์“ฐ์ด์ง€ ์•Š๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด๋ผ๊ณ  ํ•œ์ ์ด ์žˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ์ด๋ฏธ์ง€ ๋ฐ์ดํ„ฐ๋ฅผ ํ›ˆ๋ จ์‹œํ‚ฌ๋•Œ๋Š” ๋žœ๋ค์„œ์น˜๊ฐ€ ์ƒ๋‹นํžˆ ์œ ์šฉํ•  ์ˆ˜ ์žˆ๋Š”๋ฐ, ์ผ์ •ํ•œ ๊ทœ์น™์— ๋”ฐ๋ผ ์ •๋ ฌ๋œ parameter๋ณด๋‹ค ์™„์ „ํžˆ ๋žœ๋คํ•œ parameter์„ ์“ฐ๋Š” ๊ฒƒ์ด ์˜ˆ์ƒ์น˜ ๋ชปํ•˜๊ฒŒ ๋” ํšจ๊ณผ์ ์ธ ๊ฒฐ๊ณผ๋ฅผ ๋ณด์ผ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค.

Visualization of loss curve

๋‹ค์Œ์€ learning rate๋ฅผ ์–ด๋–ป๊ฒŒ ์„ค์ •ํ•˜๋Š๋ƒ์— ๋”ฐ๋ผ ๋‚˜ํƒ€๋‚˜๋Š” ์†์‹ค๊ฐ’์„ ๊ทธ๋ž˜ํ”„๋กœ ํ‘œํ˜„ํ•œ ๊ฒƒ์ด๋‹ค.

If the loss stays flat for a while before it starts to fall, that is a sign the initial weights were set badly.

Now look at the accuracy curves. If the gap between the red curve (training accuracy) and the green curve (validation accuracy) is large, the model is overfitting the training data and stronger regularization is needed.

Summary

  • Use ReLU as the activation function.
  • Use zero-centering for preprocessing.
  • Use Xavier initialization for the initial weights.
  • Using batch normalization is recommended.
  • Use cross-validation to find the best hyperparameters, and try random search where possible.
