Algorithms & Models

[📌 namdarine's AI Review] Attention is All You Need

namdarine •

📚 Transformer Paper Review. The Transformer is one of the most influential architectures in the history of deep learning and forms the foundation of most of today's generative AI, including ChatGPT. This review walks through the paper that started it all, 'Attention Is All You Need'.

[Image: multi-head attention flow, AI-generated illustration]

Paper Summary

Proposes the Transformer architecture: a breakthrough in parallelization and long-range dependency handling.

Key Takeaways

  • The Transformer replaces recurrence with attention, enabling parallel training and long-range dependency learning without sequential computation.
  • Its core components are scaled dot-product attention, multi-head attention, position-wise feed-forward networks, and sine/cosine-based positional encoding.
  • In the encoder, self-attention lets every token attend to every other token; in the decoder, self-attention with masked future tokens preserves auto-regressive generation, while encoder-decoder attention connects the decoder to the entire input.
  • The overall architecture combines residual connections and layer normalization in the encoder/decoder stacks, enabling stable and scalable training.
  • On the WMT 2014 English-German and English-French translation experiments, both the Base and Big models achieved state-of-the-art BLEU scores at the time, demonstrating high quality relative to their efficient training.

Limitations of Prior Models and Problem Definition

RNNs, LSTMs, and GRUs are still strong models, but they are hard to parallelize, which brings several constraints; in particular, memory constraints limit batching. Factorization tricks and conditional computation have been proposed in response, but the fundamental constraint of sequential computation remains.
To solve this, the authors propose a new architecture, the Transformer, which avoids recurrence and instead uses an attention mechanism to draw global dependencies between input and output.

Recurrent Models

RNN์„ ํฌํ•จํ•œ ์ˆœํ™˜ ๋ชจ๋ธ์˜ ์ˆœ์ฐจ์ ์ธ ๊ณ„์‚ฐ ๋ฐฉ์‹์—๋Š” ์•„๋ž˜์™€ ๊ฐ™์ด ์ง„ํ–‰๋œ๋‹ค.

  1. Factor the computation along the symbol positions of the input and output sequences.
  2. Aligning the positions with steps in computation time, generate the current hidden state h_t as a function of the previous hidden state h_{t-1} and the input at the current position t.
    -> A hidden state h_t is something like a memory device that temporarily stores the information learned from the preceding words.
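To make the recurrence concrete, here is a minimal Python/NumPy sketch of this update rule. The tanh cell and the weight names (W_h, W_x, b) are illustrative assumptions, not details from the paper; the point is only that h_t cannot be computed before h_{t-1}.

```python
import numpy as np

def rnn_forward(inputs, W_h, W_x, b):
    """Sequentially compute hidden states: h_t = f(h_{t-1}, x_t).

    inputs: (seq_len, d_in); W_h: (d_h, d_h); W_x: (d_in, d_h); b: (d_h,).
    Hypothetical tanh cell for illustration only. The loop is the point:
    each step depends on the previous one, so steps cannot run in parallel.
    """
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in inputs:                        # strictly sequential over positions t
        h = np.tanh(h @ W_h + x_t @ W_x + b)  # h_t is a function of h_{t-1} and x_t
        states.append(h)
    return np.stack(states)

x = np.random.default_rng(0).standard_normal((5, 8))   # 5 time steps, 8 input features
H = rnn_forward(x, np.zeros((16, 16)), np.zeros((8, 16)), np.zeros(16))
assert H.shape == (5, 16)
```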

This approach has two major constraints.
First, parallelization within training examples is impossible. This matters more as sequences grow longer, and longer sequences also make it harder to learn long-range dependencies.
Second, memory constraints limit batching across examples.

Core Proposals

The paper introduces three main techniques: attention, position-wise feed-forward networks, and positional encoding.

Attention

Before looking at how, and which kinds of, attention are applied in the Transformer, let's briefly go over how attention works and the two attention variants involved.
In general, attention maps a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors.
The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function between the query and the corresponding key.
The fact that attention lets every word look at every other word is similar to a meeting in which everyone simultaneously references one another's remarks while forming their own view.

Scaled dot-product Attention

This is the specific attention used in the Transformer. The input consists of queries (Q), keys (K), and values (V): queries and keys of dimension d_k and values of dimension d_v. In practice, many queries, keys, and values are packed into matrices and processed together.

Computation steps:

  1. Dot product: compute the dot products of the query with all keys, written as QK^T.
  2. Scaling: divide each dot product by sqrt(d_k). This prevents the dot products from growing so large for big d_k that the softmax is pushed into regions with extremely small gradients.
  3. Softmax: apply a softmax to the scaled scores to obtain the weight for each value.
  4. Weighted sum: multiply the weights by V and sum to compute the final output: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V

The advantage of this attention is that, while dot-product attention is similar in theoretical complexity to additive attention (which computes the compatibility function with a feed-forward network that has a single hidden layer), it can be implemented using highly optimized matrix multiplication code, making it much faster and more space-efficient in practice.
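As a concrete illustration, here is a minimal NumPy sketch of scaled dot-product attention that follows the four steps above. The optional mask argument is an addition used later for the decoder; everything else mirrors the formula directly.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)        # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v).
    mask (optional): boolean (n_q, n_k), True where attending is NOT allowed.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # steps 1-2: dot products, then scaling
    if mask is not None:
        scores = np.where(mask, -np.inf, scores)   # blocked positions get -inf before softmax
    weights = softmax(scores, axis=-1)             # step 3: softmax over the keys
    return weights @ V                             # step 4: weighted sum of the values
```

For example, with Q of shape (5, 64) and K, V of shape (7, 64), the output has shape (5, 64): one d_v-dimensional vector per query.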

Multi-head Attention

๋‹จ์ผ attention ํ•จ์ˆ˜๋ฅผ dmodeld_{model} ์ฐจ์›์˜ k, v, Q๋กœ ์ˆ˜ํ–‰ํ•˜๋Š” ๋Œ€์‹  Q, k, v๋ฅผ h๋ฒˆ ๊ฐ๊ฐ ๋‹ค๋ฅธ ํ•™์Šต๋œ linear projection์„ ํ†ตํ•ด dkd_{k}, dkd_{k}, dvd_{v} ์ฐจ์›์œผ๋กœ ์„ ํ˜•์ ์œผ๋กœ ํˆฌ์˜๋œ๋‹ค. ๊ฐ ํˆฌ์˜๋œ Q, k, v ๋ฒ„์ „์— ๋Œ€ํ•ด attention ํ•จ์ˆ˜๋ฅผ ๋ณ‘๋ ฌ๋กœ ์ˆ˜ํ–‰ํ•˜์—ฌ dvd_{v} ์ฐจ์›์˜ ์ถœ๋ ฅ ๊ฐ’์„ ์–ป๋Š”๋‹ค. ์ด๋ ‡๊ฒŒ ์–ป์€ h๊ฐœ์˜ ์ถœ๋ ฅ ๊ฐ’๋“ค์„ concatenated ํ•œ ๋‹ค์Œ ๋‹ค์‹œ ํ•œ๋ฒˆ ์„ ํ˜• ํˆฌ์˜ํ•˜์—ฌ ์ตœ์ข…๊ฐ’์„ ์–ป๋Š”๋‹ค.
์ด attention์˜ ๋ชฉ์ ์€ ๋ชจ๋ธ์ด ์„œ๋กœ ๋‹ค๋ฅธ ํ‘œํ˜„ ๋ถ€๋ถ„ ๊ณต๊ฐ„ (different representation subspaces)์—์„œ ๋‹ค๋ฅธ ์œ„์น˜์˜ ์ •๋ณด์— ๋™์‹œ์— ์ง‘์ค‘ํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•œ๋‹ค. ๋‹จ์ผ attention head์˜ ๊ฒฝ์šฐ ํ‰๊ท ํ™”๋กœ ์ธํ•ด ์ด๋Ÿฌํ•œ ๋Šฅ๋ ฅ์ด ์ œํ•œ ๋  ์ˆ˜ ์žˆ๊ธฐ๋•Œ๋ฌธ์ด๋‹ค.

  • The paper uses h = 8 parallel attention layers (heads) with d_k = d_v = d_model / h = 64 for each head. Because the dimensionality of each head is reduced, the total computational cost is similar to that of single-head attention with full dimensionality.

Multi-head attention is like one person taking on several roles at once (designer, developer, accountant) and viewing the problem from multiple perspectives.
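Here is a minimal sketch of multi-head attention that reuses the scaled_dot_product_attention function from the sketch above; the randomly initialized matrices stand in for the learned projections and the final output projection.

```python
import numpy as np

def multi_head_attention(Q, K, V, projections, W_O):
    """Project Q, K, V h times, attend per head in parallel, concatenate, project again.

    projections: list of h tuples (W_q, W_k, W_v), each mapping d_model -> d_k (or d_v).
    W_O: (h * d_v, d_model) final output projection.
    """
    heads = [scaled_dot_product_attention(Q @ W_q, K @ W_k, V @ W_v)
             for W_q, W_k, W_v in projections]     # one attention per head
    return np.concatenate(heads, axis=-1) @ W_O    # concatenate h outputs, then project

# Illustrative shapes only: d_model = 512, h = 8, d_k = d_v = 64, sequence of 10 tokens.
d_model, h, d_k = 512, 8, 64
rng = np.random.default_rng(0)
projections = [tuple(rng.standard_normal((d_model, d_k)) * 0.05 for _ in range(3))
               for _ in range(h)]
W_O = rng.standard_normal((h * d_k, d_model)) * 0.05
x = rng.standard_normal((10, d_model))
out = multi_head_attention(x, x, x, projections, W_O)   # self-attention: Q = K = V = x
assert out.shape == (10, d_model)
```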

So how is attention applied inside the Transformer?

  • Encoder-decoder attention
    The queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder, allowing every position in the decoder to attend over all positions in the input sequence.
  • Encoder self-attention
    In the encoder's self-attention layers, all of the keys, values, and queries come from the output of the previous encoder layer, so each position in the encoder can attend to all positions in the previous layer.
  • Decoder self-attention
    In the decoder's self-attention layers, each position in the decoder can attend to all positions in the decoder up to and including that position. To preserve the auto-regressive property, the flow of information to future positions is prevented by masking.

This attention setup lets the Transformer model dependencies between symbols in the input and output sequences regardless of their distance, without recurrence or convolution. It is the key ingredient that overcomes the sequential computation and lack of parallelization of recurrent models, leading to shorter training time and better translation quality.

The Encoder-Decoder inside the Transformer

[Figure 1. The Transformer model architecture. Source: Vaswani et al., 2017, "Attention Is All You Need"]

Encoder stack of Transformer

The encoder is composed of a stack of N = 6 identical layers, each with two sub-layers. The first sub-layer is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. A residual connection is applied around each of the two sub-layers, followed by layer normalization; that is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension d_model = 512.

  • A residual connection is like solving a problem with hints. When a student gets stuck on a math problem, the teacher doesn't hand over the answer; the student keeps the work done so far (the existing input) and only receives a hint (an additional computation) for the missing part. In the same way, each layer of the Transformer receives the previous result as-is, adds a small correction on top, and moves on, which improves the efficiency and stability of training.
    That is exactly the residual connection: x + f(x) = existing information + newly learned change

Decoder stack of Transformer

Like the encoder, the decoder is composed of a stack of N = 6 identical layers. In addition to the two sub-layers found in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. As in the encoder, a residual connection is used around each sub-layer, followed by layer normalization.
The self-attention sub-layers in the decoder stack are modified to prevent the flow of information to future positions. This is called masking, and it preserves the auto-regressive property: the prediction for position i can depend only on the known outputs at positions less than i. It is implemented inside scaled dot-product attention by masking out (setting to negative infinity) all values in the input of the softmax that correspond to illegal connections.
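A minimal sketch of that causal mask. Entries above the diagonal are the illegal future connections; passing this matrix as the mask argument of the scaled_dot_product_attention sketch above sets them to negative infinity before the softmax.

```python
import numpy as np

def causal_mask(n):
    """True where attention must be blocked: position i may not attend to positions j > i."""
    return np.triu(np.ones((n, n), dtype=bool), k=1)

# Row i allows keys 0..i and blocks i+1..n-1, preserving the auto-regressive property.
print(causal_mask(4))
# [[False  True  True  True]
#  [False False  True  True]
#  [False False False  True]
#  [False False False False]]
```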

This design removes the fundamental constraint of sequential computation, the inability to parallelize, and allows far more parallelization, playing a central role in achieving excellent translation quality while reducing training time.
The encoder's role is to summarize the whole sentence; the decoder's role is to generate words based on that information. Each is a stack of six layers, with mechanisms that filter and refine the flow of information.

Why self-attention? (See Table 1.)

[Image: self-attention, AI-generated illustration]
  1. ๋ ˆ์ด์–ด๋‹น ์ด ๊ณ„์‚ฐ ๋ณต์žก์„ฑ

    • Self-attention: O(n^2 · d) complexity; faster than a recurrent layer when the sequence length n is smaller than the representation dimensionality d.
    • Recurrent layer: O(n · d^2) complexity.
    • Convolutional layer: O(k · n · d^2) complexity, depending on the kernel size k.

    For very long sequences, the paper notes that self-attention can be restricted to a neighborhood of size r around each position, reducing the complexity to O(r · n · d). It also explains that separable convolutions reduce the convolutional complexity to O(k · n · d + n · d^2); even with k = n, however, this equals the combination of a self-attention layer and a point-wise feed-forward layer, which is exactly the approach the Transformer takes.

  2. Amount of computation that can be parallelized (minimum number of sequential operations)

    • Self-attention: O(1) sequential operations; all positions are connected through a constant number of sequentially executed operations, enabling a high degree of parallelization.
    • Recurrent layer: requires O(n) sequential operations; it is inherently sequential, which makes parallelization within a training example difficult and is especially problematic at long sequence lengths.
    • Convolutional layer: O(1) sequential operations, but multiple layers must be stacked to connect all input and output positions.
  3. Path length between long-range dependencies

    • Self-attention: O(1) maximum path length; any two positions in the sequence can be connected directly through a constant number of operations, which makes learning long-range dependencies easier.
    • Recurrent layer: O(n) maximum path length.
    • Convolutional layer: path length of O(n/k) for contiguous kernels or O(log_k(n)) for dilated convolutions; because multiple layers must be stacked to connect all positions, the paths become longer.

Summarized in one table:

| Criterion | Self-Attention | RNN | CNN |
| --- | --- | --- | --- |
| Complexity per layer | O(n^2 · d) | O(n · d^2) | O(k · n · d^2) |
| Parallelizability (sequential operations) | Very high (O(1)) | Low (O(n)) | Moderate (O(1)) |
| Long-range dependency path length | O(1) | O(n) | O(n/k) to O(log_k(n)) |
  • Side benefit: self-attention can also yield more interpretable models. Individual attention heads clearly learn to perform different tasks, and many heads exhibit behavior related to the syntactic and semantic structure of sentences.

By exploiting these advantages of self-attention, the Transformer can be trained much faster than previous models while achieving higher translation quality.
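To make the complexity comparison above concrete, here is a small worked example. The sequence length n = 50 and kernel size k = 3 are hypothetical illustration values; d = 512 is the d_model used in the paper, and the counts are order-of-magnitude only.

```python
# Rough per-layer operation counts using the asymptotic formulas from the table above.
n, d, k = 50, 512, 3        # n and k are hypothetical; d = d_model from the paper

self_attention = n**2 * d       # O(n^2 * d)     -> 1,280,000
recurrent      = n * d**2       # O(n * d^2)     -> 13,107,200
convolutional  = k * n * d**2   # O(k * n * d^2) -> 39,321,600

print(self_attention, recurrent, convolutional)
# For short sequences (n < d), self-attention does roughly 10x fewer operations per layer,
# and, unlike the recurrent layer, none of them have to run one position at a time.
```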

Position-wise Feed-Forward Networks

In addition to the attention sub-layers, each layer of the encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. It consists of two linear transformations with a ReLU activation in between: FFN(x) = max(0, xW_1 + b_1)W_2 + b_2
While the linear transformations are the same across different positions, they use different parameters from layer to layer. Another way to describe this is as two convolutions with kernel size 1. The input and output dimensionality is the model dimension d_model = 512, and the inner layer has dimensionality d_ff = 2048. The paper does not say why 2048 was chosen, but larger inner dimensions performed better: BLEU was 25.4 with d_ff = 1024, 25.8 with d_ff = 2048, and 26.2 with d_ff = 4096 (see Table 3).

  • The feed-forward network can be viewed as a 'filtering device' in which each word adjusts its meaning to fit the context.
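A minimal NumPy sketch of the position-wise feed-forward network, applying FFN(x) = max(0, xW_1 + b_1)W_2 + b_2 independently at every position; the random weights are placeholders for learned parameters.

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each position independently.

    x: (seq_len, d_model); W1: (d_model, d_ff); W2: (d_ff, d_model).
    """
    hidden = np.maximum(0.0, x @ W1 + b1)   # first linear transformation + ReLU
    return hidden @ W2 + b2                 # second linear transformation back to d_model

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)) * 0.02, np.zeros(d_model)
x = rng.standard_normal((10, d_model))      # a sequence of 10 positions
assert position_wise_ffn(x, W1, b1, W2, b2).shape == (10, d_model)
```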

Positional Encoding

์ด ๋…ผ๋ฌธ์€ Transformer๊ฐ€ ์ˆœํ™˜ (recurrence)๊ณผ ์ปจ๋ณผ๋ฃจ์…˜ (convolution)์ด ์—†์œผ๋ฏ€๋กœ ์‹œํ€€์Šค ๋‚ด ํ† ํฐ์˜ ์ƒ๋Œ€์  ๋˜๋Š” ์ ˆ๋Œ€์  ์œ„์น˜์— ๋Œ€ํ•œ ์ •๋ณด๋ฅผ ์ฃผ์ž…ํ•˜๋Š” โ€˜positional embeddingsโ€™๋ฅผ ์†Œ๊ฐœํ•ฉ๋‹ˆ๋‹ค. ์ €์ž๋Š” positional encoding์„ ์ธ์ฝ”๋”์™€ ๋””์ฝ”๋” ์Šคํƒ์˜ ๊ฐ€์žฅ ์•„๋ž˜์ธต์— ์žˆ๋Š” ์ž…๋ ฅ ์ž„๋ฒ ๋”ฉ์— ์ถ”๊ฐ€ํ•ฉ๋‹ˆ๋‹ค. Positional encoding์˜ ์ฐจ์›์€ ๋‹ค๋ฅธ embedding๊ณผ ๋”ํ•  ์ˆ˜ ์žˆ๊ฒŒ ๊ฐ™์€ dmodeld_{model} ์ฐจ์›์ด๋‹ค.

How it is implemented

  • ์‚ฌ์ธ, ์ฝ”์‚ฌ์ธ ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.
  • PEpos,2i=sinโกpos100002idmodelPE_{pos, 2i} = \sin{\frac{pos}{10000^{\frac{2i}{d_{model}}}}} -> ์ง์ˆ˜ ์ฐจ์›์—์„œ ์‚ฌ์šฉ
  • PEpos,2i+1=cosโกpos100002idmodelPE_{pos, 2i + 1} = \cos{\frac{pos}{10000^{\frac{2i}{d_{model}}}}} -> ํ™€์ˆ˜ ์ฐจ์›์—์„œ ์‚ฌ์šฉ
  • ๋ฌธ์žฅ ๋‚ด์—์„œ ์ ˆ๋Œ€์  ์œ„์น˜ (pos)๋ฅผ ํŒŒ์•…ํ•˜๊ณ  ์ด ์œ„์น˜๋ฅผ ์œ„ ์‚ฌ์ธ, ์ฝ”์‚ฌ์ธ ํ•จ์ˆ˜์— ๋Œ€์ž…ํ•œ๋‹ค. 0โ‰คiโ‰ฅ5120 \leq i \geq 512
    -> ์ˆ˜์‹์ด ๋ณต์žกํ•ด ๋ณด์ด์ง€๋งŒ ๋‹จ์–ด์˜ ์œ„์น˜์— ๋”ฐ๋ผ ๊ณ ์œ ํ•œ ํŒจํ„ด์„ ์ž…ํžˆ๋Š” ๋ฐฉ์‹์ด๋‹ค. ์ผ์ข…์˜ ๋‹จ์–ด โ€˜์ขŒํ‘œโ€™๋ฅผ ์ˆ˜ํ•™์ ์œผ๋กœ ๋งŒ๋“ค์–ด์ฃผ๋Š” ์…ˆ์ด๋‹ค.

=> Why: for any fixed offset k, PE(pos + k) can be represented as a linear function of PE(pos), so the authors hypothesized that this would make it easy for the model to learn to attend by relative positions. It also allows the model to extrapolate to sequence lengths longer than those encountered during training.
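A minimal sketch of the sinusoidal positional encoding following the two formulas above: rows are positions, columns are dimensions, and the resulting matrix is simply added to the input embeddings.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))."""
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]           # (1, d_model / 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions
    return pe

# The encoding is added element-wise to the (max_len, d_model) token embeddings.
pe = sinusoidal_positional_encoding(max_len=100, d_model=512)
assert pe.shape == (100, 512)
```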

Positional encoding์€ Transformer๊ฐ€ ๋‹จ์–ด์˜ ์ˆœ์„œ๋ฅผ ์ธ์ง€ํ•˜๊ฒŒ ํ•˜๋Š” ๋‚ด๋น„๊ฒŒ์ดํ„ฐ์™€ ๊ฐ™๋‹ค. ๊ฐ ๋‹จ์–ด ์ž„๋ฒ ๋”ฉ์— ๊ณ ์œ ํ•œ ์ขŒํ‘œ๋ฅผ ๋ถ€์—ฌํ•˜์—ฌ ๋ชจ๋ธ์ด ๋‹จ์–ด์˜ ์˜๋ฏธ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ์‹œํ€€์Šค ๋‚ด์—์„œ์˜ ์œ„์น˜ ๊ด€๊ณ„๊นŒ์ง€ ์ดํ•ดํ•˜๋„๋ก ๋•๋Š”๋‹ค. ๋งˆ์น˜ ์ฑ…์„ ์ฝ์„ ๋•Œ ๋‚ด์šฉ (์ž„๋ฒ ๋”ฉ)๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ๋ช‡ ๋ฒˆ์งธ ํŽ˜์ด์ง€ (positional encoding)์— ์žˆ๋Š”์ง€ ์•Œ์•„์•ผ ์ „์ฒด ํ๋ฆ„์„ ํŒŒ์•…ํ•  ์ˆ˜ ์žˆ๋Š” ๊ฒƒ๊ณผ ์œ ์‚ฌํ•˜๋‹ค.
์˜ˆ๋ฅผ ๋“ค์–ด โ€œ๊ทธ๋Š” ๋Œ์•„์™”๋‹คโ€๋Š” ๋ฌธ์žฅ์ด ์†Œ์„ค์˜ ์ดˆ๋ฐ˜๊ณผ ๋งˆ์ง€๋ง‰์— ๋“ฑ์žฅํ•  ๋•Œ ๊ทธ ๋ฌธ์žฅ์˜ ํ•ด์„์€ ์™„์ „ํžˆ ๋‹ฌ๋ผ์ง„๋‹ค.
Transformer๋Š” ๋ฐ˜๋ณต ๊ตฌ์กฐ๊ฐ€ ์—†๊ธฐ ๋•Œ๋ฌธ์— ๋‹จ์–ด ์ˆœ์„œ๋ฅผ ์ง์ ‘ ์•Œ์ง€ ๋ชปํ•œ๋‹ค. ๊ทธ๋ž˜์„œ ๊ฐ ๋‹จ์–ด์— โ€œํŽ˜์ด์ง€ ๋ฒˆํ˜ธโ€๊ฐ™์€ ์œ„์น˜ ์ •๋ณด (์ขŒํ‘œ) ๋ฅผ ์‚ฌ์ธ/์ฝ”์‚ฌ์ธ ํŒจํ„ด์œผ๋กœ ๋ถ€์—ฌํ•œ๋‹ค. ์ด ์œ„์น˜ ์ •๋ณด ๋•๋ถ„์— ๋ชจ๋ธ์€ โ€œ๋ˆ„๊ฐ€ ๋ˆ„๊ตฌ๋ฅผ ์ˆ˜์‹ํ•˜๋Š”์ง€โ€, โ€œ๋ฌธ์žฅ ํ๋ฆ„์ด ์–ด๋–ป๊ฒŒ ์ด์–ด์ง€๋Š”์ง€โ€ ๊ฐ™์€ ์ˆœ์„œ ๊ธฐ๋ฐ˜ ์˜๋ฏธ๋ฅผ ์ดํ•ดํ•  ์ˆ˜ ์žˆ๋‹ค. ๋งˆ์น˜ ๋‹จ์–ด์— ์ง€๋„ ์œ„์˜ GPS ์ขŒํ‘œ๋ฅผ ์ฐ์–ด์ฃผ๋Š” ๊ฒƒ๊ณผ ๊ฐ™๋‹ค.

Training

๋ฐ์ดํ„ฐ

The models were trained on two datasets: the standard WMT 2014 English-German dataset, consisting of about 4.5 million sentence pairs, and the much larger WMT 2014 English-French dataset, consisting of 36 million sentences.
For WMT 2014 English-German, sentences were encoded with byte-pair encoding (BPE) using a shared source-target vocabulary of about 37,000 tokens.
For WMT 2014 English-French, tokens were split using a 32,000 word-piece vocabulary.

Batching and Schedule

Sentence pairs were batched together by approximate sequence length. Each training batch contained roughly 25,000 source tokens and 25,000 target tokens. The models were trained on 8 NVIDIA P100 GPUs. The base model was trained for a total of 100,000 steps (about 12 hours), with each training step taking about 0.4 seconds. The big model was trained for 300,000 steps (3.5 days) at 1.0 seconds per step.

Optimization and Regularization

  • Optimizer: the Adam optimizer with beta_1 = 0.9, beta_2 = 0.98, and epsilon = 10^-9.
  • Learning rate: the learning rate was varied over the course of training, increasing linearly for the first warmup_steps = 4000 steps and then decreasing proportionally to the inverse square root of the step number.
  • Regularization: residual dropout (P_drop = 0.1 for the base model) and label smoothing (epsilon_ls = 0.1) were used.
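For reference, the schedule just described corresponds to the formula lrate = d_model^-0.5 · min(step^-0.5, step · warmup_steps^-1.5) from the paper; a minimal sketch:

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5).

    Increases linearly for the first warmup_steps steps, then decays
    proportionally to the inverse square root of the step number.
    """
    step = max(step, 1)   # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The learning rate peaks at step == warmup_steps and decays afterwards.
print(transformer_lr(100), transformer_lr(4000), transformer_lr(100000))
```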

Results

The Transformer demonstrated excellent performance on machine translation and on English constituency parsing.

  1. Machine translation results
  • Transformer (base model): 27.3 BLEU (EN-DE), 38.1 BLEU (EN-FR), training cost 2.2 · 10^18 FLOPs (EN-DE); it outperformed previous competing models at a lower training cost.
  • Transformer (big model): 28.4 BLEU (EN-DE), 41.8 BLEU (EN-FR), training cost 2.3 · 10^19 FLOPs (EN-DE); it achieved new state-of-the-art results on both tasks.
  • WMT 2014 English-German:
    • The big Transformer model achieved 28.4 BLEU, outperforming the previous best models (including ensembles) by more than 2.0 BLEU and establishing a new state-of-the-art BLEU score.
    • Even the base model surpassed all previously published models and ensembles at a fraction of the competing models' training cost.
  • WMT 2014 English-French:
    • The big Transformer model achieved a BLEU score of 41.8, setting a new single-model state of the art, at a small fraction of the training cost of the best models reported in the literature.
  2. Importance of model components (model variations)
    The importance of different components of the Transformer architecture was evaluated on a development set (newstest2013).
  • Multi-head attention: single-head attention was 0.9 BLEU worse than the best setting, and quality also dropped with too many heads.
  • Key size (d_k): reducing the attention key size hurt model quality.
  • Scale: as expected, bigger models performed better, and dropout was very helpful in avoiding overfitting.
  • Positional encoding: replacing the sinusoidal positional encoding with learned positional embeddings gave nearly identical results. The sinusoidal version was chosen because it may allow the model to extrapolate to sequence lengths longer than those encountered during training.
  3. English constituency parsing
    The Transformer was also successfully applied to English constituency parsing, showing that it generalizes well to other tasks.
  • Using only the WSJ (Wall Street Journal) training set (about 40K sentences), it achieved 91.3 F1.
  • In a semi-supervised setting with larger data (about 17 million sentences), it achieved 92.7 F1, better than all previously reported models except the Recurrent Neural Network Grammar.
  • Unlike RNN sequence-to-sequence models, the Transformer outperformed the Berkeley Parser even when trained only on the WSJ training set.

Conclusion

The Transformer presented in this paper is the first sequence transduction model to replace recurrent layers entirely with multi-headed self-attention.

Key conclusions:

  • Performance: the Transformer achieved new state-of-the-art results on both the WMT 2014 English-German and English-French translation tasks. On English-German, the best model even outperformed all previously reported ensembles.
  • Efficiency: the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers.
  • Parallelization: by eliminating recurrence and convolution entirely and relying solely on the attention mechanism, the Transformer allows significantly more parallelization.

Wrapping Up

The Transformer was a groundbreaking architecture that overcame the limitations of RNNs and made both parallel processing and long-range dependency learning possible. The modern generative AI models that followed, such as GPT, BERT, T5, and Gemini, all started from the ideas in this paper. As the title says, the essence of this architecture is that "attention is all you need".
namdarine will keep creating content that makes important technology like this easier to understand and put to use.


📌 namdarine's AI Review is a series that breaks down papers, algorithms, and architectures so that anyone can understand the core technologies of AI.

Let's build it like it's already happened.
→ See you in the next review!