[NLP] Attention is all you need

• Paper Review

Please note that this post is a reconstruction of the 'Attention is all you need' paper review presented in Jiphyeonjeon season 3, a 2022 NLP study group.
You can also watch the presentation as a video at the link below.

๋งํฌ : Attention is all you nedd ๋…ผ๋ฌธ ๋ฆฌ๋ทฐ ๋ฐœํ‘œ ์˜์ƒ

In this post, I would like to introduce 'Attention is all you need', the paper that serves as the foundation of research in NLP (natural language processing).


Background

Before going into the details of the paper, let us first cover some background.

Transformer์˜ ๊ธฐ์—ฌ

The Transformer is a seq2seq model proposed by Google in 2017. Conventional sequence transduction models are built on an encoder-decoder structure that relies on recurrent and convolutional layers. The Transformer instead connects the encoder and decoder using only the attention mechanism. Based on this new architecture, the Transformer showed very strong performance in machine translation; it is not only highly parallelizable during training but also requires significantly less training time. It additionally performed well on constituency parsing and was shown to generalize well.

BERT & GPT

The Transformer also made a major contribution to the architectures of the recently popular BERT and GPT.

As shown in fig 1 below, the Transformer consists largely of an encoder and a decoder.

A detailed explanation of this architecture follows later in the post.

Untitled

BERT uses only the Transformer's encoder, dropping the decoder, while GPT does the opposite and uses only the decoder. The former is strong at extracting the meaning of text, while the latter is geared toward generating sentences.

Coming back to the main point: what made both of these models possible was, in the end, the Transformer. I will cover the two models themselves in future paper reviews.

Untitled 1

Seq2seq: the RNN/seq2seq architecture and its problems

seq2seq์˜ ๊ตฌ์กฐ๋ฅผ ์‚ดํŽด๋ณด๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.

Untitled 2

RNN์˜ ๋ชจ๋ธ ์ค‘ ํ•˜๋‚˜์ธ seq2seq๋Š” ์ˆœ์ฐจ์ ์œผ๋กœ ์—ฐ์‚ฐ์„ ์ง„ํ–‰ํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. ๋˜ํ•œ context vector๋ฅผ ์ถ”์ถœํ•˜๋Š” ๊ณผ์ •์—์„œ ์ •๋ณด๋ฅผ ์••์ถ•ํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

However, this makes parallelization impossible and slows computation down. It also causes the long-term dependency problem: as the distance between related tokens grows, information fades and the model can no longer make accurate predictions. On top of that, squeezing everything into a single context vector leads to information loss.

Untitled 3

Seq2seq with Attention model

To address these problems, a model combining seq2seq with attention appeared; it takes as input the outputs of every encoder step for the source sentence (one per token). In short, you can think of attention as keeping the full information of each word instead of a single compressed summary.

Untitled 4

Keeping the overall structure, the weighted-sum vector (a weighted combination of the encoder hidden states $h_1, h_2, \dots$) is fed into the decoder's RNN cell and FC layer. The probabilities obtained here, the attention weights, let us see which source information each output relied on most.

Transformer

The Transformer, which appeared afterward, needs no CNNs or RNNs at all and uses only attention. This makes it hard to convey the order of the words in a sentence the way an RNN does, so positional encoding is used to inject that order information.

์ธ์ฝ”๋”์™€ ๋””์ฝ”๋”๋กœ ๊ตฌ์„ฑ๋˜๋Š” ๊ฒƒ์€ ๋™์ผํ•˜๋‚˜ attention๊ณผ์ •์„ ์—ฌ๋Ÿฌ ๋ ˆ์ด์–ด์—์„œ ๋ฐ˜๋ณต, ์ฆ‰ ์ธ์ฝ”๋”๊ฐ€ N๊ฐœ ์ค‘์ฒฉ๋˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

RNN, LSTM์€ ์ž…๋ ฅ ๋‹จ์–ด ๊ฐฏ์ˆ˜๋งŒํผ ์ธ์ฝ”๋” ๋ ˆ์ด์–ด๋ฅผ ๊ฑฐ์ณ hidden state๋ฅผ ๋งŒ๋“ค์ง€๋งŒ, transformer๋Š” ๋‹จ์–ด๊ฐ€ ํ•˜๋‚˜๋กœ ์—ฐ๊ฒฐ๋˜์–ด ๋ณ‘๋ ฌ์ ์œผ๋กœ ํ•œ๋ฒˆ์˜ ์ธ์ฝ”๋”๋ฅผ ๊ฑฐ์ณ ๋ณ‘๋ ฌ์ ์œผ๋กœ ์ถœ๋ ฅ๊ฐ’์„ ์ƒ์„ฑํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด ๊ณ„์‚ฐ ๋ณต์žก๋„๋ฅผ ์ค„์ด๊ฒŒ ๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

Untitled 5

Model Architecture

Now let's discuss the structure of the Transformer model in detail.

Inputs

Outline

As mentioned earlier, the Transformer consists of an encoder and a decoder. Looking first at the encoder's input embedding, the process proceeds as in fig 8. The input data consists of millions of sentences. For a computer to understand them, the words, which are text, must be converted into numbers. Through embedding, each word is mapped to a numeric vector that represents it well, and these vectors are what the model receives as input.

Untitled 6
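
As a toy illustration of this lookup (the vocabulary, dimension, and random values below are made up for illustration, not taken from the paper):

```python
import numpy as np

# Hypothetical toy vocabulary and embedding table; a real model learns these values.
vocab = {"i": 0, "am": 1, "a": 2, "boy": 3}
d_model = 8
embedding_table = np.random.default_rng(0).normal(size=(len(vocab), d_model))

sentence = "i am a boy".split()
token_ids = [vocab[w] for w in sentence]   # words -> integer ids
X = embedding_table[token_ids]             # ids -> d_model-dimensional vectors
print(token_ids, X.shape)                  # [0, 1, 2, 3] (4, 8)
```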

Byte Pair Encoding(BPE)

Transformer ๋ชจ๋ธ์€ ์ž์—ฐ์–ด ๋ฌธ์žฅ์„ ๋ถ„์ ˆํ•œ ํ† ํฐ ์‹œํ€€์Šค๋ฅผ ์ž…๋ ฅ์œผ๋กœ ๋ฐ›์Šต๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ ๋ฌธ์žฅ์— ํ† ํฐํ™”๋ฅผ ์ˆ˜ํ–‰ํ•ด์ฃผ์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ํ† ํฐํ™” ๋ฐฉ๋ฒ•์€ ๋‹จ์–ด ๋‹จ์œ„, ๋ฌธ์ž ๋‹จ์œ„, ์„œ๋ธŒ ๋‹จ์œ„ ๋“ฑ ํฌ๊ฒŒ 3๊ฐ€์ง€๊ฐ€ ์กด์žฌํ•ฉ๋‹ˆ๋‹ค. BPE๋Š” 1994๋…„ ์ œ์•ˆ๋œ ๋ฐ์ดํ„ฐ ์••์ถ• ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด๋ฉฐ, ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ์˜ ์„œ๋ธŒ์›Œ๋“œ ๋ถ„๋ฆฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์œผ๋กœ ์‘์šฉ๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ์ฆ‰, ๊ธฐ์กด์— ์žˆ๋˜ ๋‹จ์–ด๋ฅผ ๋ถ„๋ฆฌํ•˜๋Š” ๊ฒƒ์œผ๋กœ,๊ธ€์ž ๋‹จ์œ„์—์„œ ์ ์ฐจ์ ์œผ๋กœ ๋‹จ์–ด ์ง‘ํ•ฉ์„ ๋งŒ๋“ค์–ด ๋‚ด๋Š” ๋ฐฉํ–ฅ์œผ๋กœ ์ ‘๊ทผํ•ฉ๋‹ˆ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด OOV(Out of Vocabulary)๋ฌธ์ œ๋ฅผ ์™„ํ™”ํ•˜์˜€์Šต๋‹ˆ๋‹ค.

Positional Encoding

๋‹ค์Œ์œผ๋กœ positional encoding์ž…๋‹ˆ๋‹ค. fig 9์—์„œ ๋ณด์ด๋Š” ๋ฐ”์™€ ๊ฐ™์ด ์ค‘๊ฐ„์— ์‚ฝ์ž…๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค. Positional encoding์€ ์ฃผ๊ธฐ ํ•จ์ˆ˜๋ฅผ ์ด์šฉํ•˜์—ฌ ๊ฐ ๋‹จ์–ด์˜ ์ƒ๋Œ€์ ์ธ ์œ„์น˜ ์ •๋ณด๋ฅผ ์ž…๋ ฅํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. ์•ž์—์„œ ์ด์•ผ๊ธฐํ•˜์˜€๋“ฏ์ด, Transformer๋Š” RNN์˜ ๋‹จ์ ์„ ํ•ด๊ฒฐํ•˜๊ณ ์ž ์ œ์‹œ๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ positional encoding ๊ณ„์ธต์„ ์‚ฌ์šฉํ•˜๋ฏ€๋กœ์จ ๋‹จ์–ด๋“ค์ด ์ˆœ์ฐจ์ ์œผ๋กœ ๋“ค์–ด์˜ค์ง€ ์•Š๊ณ  ๋ญ‰ํƒœ๊ธฐ๋กœ ๋“ค์–ด์™€๋„ ๋‹จ์–ด๋“ค์˜ ์ˆœ์„œ๋ฅผ ์ดํ•ดํ•˜๋ฉด์„œ ๋ณ‘๋ ฌ์ ์œผ๋กœ ์—ฐ์‚ฐ์ด ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค. ๋‹ค์‹œ ๋งํ•ด, ๋‹จ์–ด ๋ฐ์ดํ„ฐ๋“ค์˜ ์ƒ๋Œ€์ ์ธ ์œ„์น˜ ๋ฐ์ดํ„ฐ๋ฅผ ์ œ๊ณตํ•˜๋ฏ€๋กœ์จ ๋ณ‘๋ ฌ ์—ฐ์‚ฐ์ด ๊ฐ€๋Šฅํ•ด์ง€๋Š” ๊ฒƒ ์ž…๋‹ˆ๋‹ค.

Untitled 7

fig 10์„ ํ†ตํ•ด positional encoding์˜ ์‹์„ ์‚ดํŽด๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” sinusoidal version์„ ์‚ฌ์šฉํ–ˆ๋Š”๋ฐ, ๊ทธ ์ด์œ ๋Š” ๊ฐ ํฌ์ง€์…˜์˜ ์ƒ๋Œ€์ ์ธ ์ •๋ณด๋ฅผ ๋‚˜ํƒ€๋‚ด์•ผํ•˜๋ฉฐ, ์„ ํ˜•๋ณ€ํ™˜ ํ˜•ํƒœ๋กœ ๋‚˜์™€ ํ•™์Šต์ด ํŽธ๋ฆฌํ•˜๊ธฐ ๋•Œ๋ฌธ์ด๋ผ ์ดํ•ดํ•˜์˜€์Šต๋‹ˆ๋‹ค.

Untitled 8
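
For reference, the sinusoidal positional encoding defined in the paper is:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

where $pos$ is the token position and $i$ indexes the embedding dimension. For any fixed offset $k$, $PE_{pos+k}$ can be written as a linear function of $PE_{pos}$, which is why relative positions are easy for the model to pick up.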

Multi-head Attention

Outline

๋…ผ๋ฌธ์—์„œ๋Š” self-attention(scaled dot-product attention) layer๋ฅผ ๋‹ค์ค‘์œผ๋กœ ๊ตฌํ˜„ํ•œ multi-head attention์„ ์ œ์‹œํ•˜์˜€์Šต๋‹ˆ๋‹ค. Scaled dot-product attention์˜ ๊ตฌ์กฐ๋Š” fig 12์—์„œ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Untitled 9

Dot-product Attention & Scaling

Scaled-dot์„ ์ˆ˜์‹์œผ๋กœ ํ‘œํ˜„ํ•˜๋ฉด fig 14์™€ ๊ฐ™์Šต๋‹ˆ๋‹ค. ์ด๋•Œ, ๋…ผ๋ฌธ์—์„  attention์„ concat์ด ์•„๋‹Œ dot์œผ๋กœ ๊ตฌํ˜„ํ•˜์˜€์Šต๋‹ˆ๋‹ค. ๊ทธ ์ด์œ ๋ฅผ ์‚ดํŽด๋ณด์ž๋ฉด, dim์˜ ์ฆ๊ฐ€๊ฐ€ ์—†์Œ์— ๋”ฐ๋ผ space๊ฐ€ efficientํ•˜๋ฉฐ, matrix multiplication๋งŒ์œผ๋กœ ๊ตฌํ˜„์ด ๊ฐ€๋Šฅํ•˜์—ฌ ๋น ๋ฅด๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค.

Untitled 10
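
For reference, the formula shown there is:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

The division by $\sqrt{d_k}$ is the scaling discussed next.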

So why was scaling added to the original dot-product attention?

๊ทธ ์ด์œ ๋Š” fig 15์—์„œ ์‚ดํŽด๋ณผ ์ˆ˜ ์žˆ๋“ฏ์ด, QK์˜ ๋‚ด์  ๊ฐ’์ด ๋งค์šฐ ์ปค์งˆ ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค. ์™œ ์ปค์ง€๋Š”๊ฐ€์— ์˜๋ฌธ์„ ํ’ˆ์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๊ทธ์— ๋Œ€ํ•ด ๋ณด์ถฉ ์„ค๋ช…์„ ํ•˜์ž๋ฉด, Q,KQ, K๊ฐ€ ๊ฐ๊ฐ gaussian(๊ฐ€์šฐ์‹œ์•ˆ) ๋ถ„ํฌ๋ฅผ ๋”ฐ๋ฅธ๋‹ค๊ณ  ๊ฐ€์ •ํ–ˆ์„ ๋•Œ ๋ถ„์‚ฐ์ด dkd_k๊ฐ€ ๋จ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Untitled 11

QK์˜ ๋‚ด์  ๊ฐ’์ด ๋งค์šฐ ์ปค์ง€๋ฉด softmax์˜ scale variantํ•œ ํŠน์„ฑ์„ ๋งŒ๋‚˜ gradient vanishing์ด ๋ฐœ์ƒํ•ฉ๋‹ˆ๋‹ค. ์‹ค์ œ๋กœ dim=4dim = 4์ธ ๊ฒฝ์šฐ, softmax์˜ jacobian(์ž์ฝ”๋น„์•ˆ)์€ fig 16๊ณผ ๊ฐ™์€๋ฐ, ์ด ๊ฒฝ์šฐ scale์ด ํฌ๋ฉด S=(1,0,0,0)S = (1, 0, 0, 0)๊ณผ ๊ฐ™์€ ํ˜•ํƒœ๊ฐ€ ๋˜์–ด gradient vanishing์ด ๋ฐœ์ƒํ•ฉ๋‹ˆ๋‹ค.

Untitled 12
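
A quick numerical sketch of this saturation effect (the score values are made up for illustration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

scores = np.array([1.0, 0.5, 0.2, 0.1])
print(softmax(scores))        # [0.41 0.25 0.18 0.17] -> reasonably spread out
print(softmax(scores * 100))  # [1. 0. 0. 0.]         -> saturated; gradients are ~0
```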

Scaled Dot-product Attention

So what do the Q, K, and V discussed in the paper actually mean?

  • Q = Query vector : the vector we are searching with; the vector that receives influence
  • K = Key vector : the vector indicating what kind of information is present; the vector that gives influence
  • V = Value vector : the vector carrying the content whose influence is weighted and passed on

Summarizing the scaled dot-product attention process shown in fig 17: the input is first projected into $Q$, $K$, and $V$. Multiple inputs are processed at once as matrices, and the dot product between $Q$ and $K$ measures the similarity between queries and keys. The scores are then passed through a softmax, as in fig 18, to obtain the final attention value matrix. In other words, it can be summarized as $\mathrm{softmax}(QK^{\top})V = \text{attention value matrix}$, with the $\sqrt{d_k}$ scaling applied to $QK^{\top}$. A minimal code sketch is given after the figures below.

Untitled 13

Untitled 14
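
Here is the minimal numpy sketch mentioned above (a simplified illustration, not the paper's implementation; the optional mask argument anticipates the decoder masking described later):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q, K, V: (seq_len, d_k) matrices. Returns the attention output and weights."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # query-key similarity, scaled by sqrt(d_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)    # blocked positions get a huge negative score
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape, w.sum(axis=-1))   # (4, 8) [1. 1. 1. 1.]
```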

Process

์œ„์—์„œ ์„ค๋ช…ํ•˜์˜€๋“ฏ์ด Q,K,VQ, K, V๋ฅผ ๊ณ„์‚ฐํ•˜์—ฌ concat(=concatenate)๋ฅผ ์ง„ํ–‰ํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. concatenate์€ ์‚ฌ์Šฌ๋กœ ์ž‡๋‹ค๋ผ๋Š” ์˜๋ฏธ๋กœ attentionํ•œ ๊ฐ’๋“ค์„ ๋ง ๊ทธ๋Œ€๋กœ ์ด์–ด์ค๋‹ˆ๋‹ค.

Untitled 15
Looking at the benefits of multi-head attention: because each head attends to a different part of the input, the model obtains diverse representations. This also yields an ensemble-like effect, which is a major advantage. The ensemble effect here means handling a problem by training and combining several models rather than a single one, in other words looking at the data from multiple perspectives.
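
A compact sketch of the split-attend-concatenate flow (the dimensions and random weights are illustrative; a real implementation learns the projection matrices):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """X: (seq_len, d_model); Wq, Wk, Wv, Wo: (d_model, d_model) projection matrices."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    # Project the input, then split the feature dimension into heads: (n_heads, seq_len, d_head)
    def project(W):
        return (X @ W).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    Q, K, V = project(Wq), project(Wk), project(Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # one attention pattern per head
    heads = softmax(scores) @ V                           # (n_heads, seq_len, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)   # concatenate the heads
    return concat @ Wo                                    # final linear projection

rng = np.random.default_rng(0)
d_model, n_heads = 8, 2
X = rng.normal(size=(4, d_model))
Wq, Wk, Wv, Wo = (0.1 * rng.normal(size=(d_model, d_model)) for _ in range(4))
print(multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads).shape)   # (4, 8)
```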

Untitled 16

Three Types of Attention

transformer์—๋Š” encoder self-attention, masked decoder self-attention, encoder-decoder attention์˜ ์ด 3๊ฐ€์ง€ Attention layer๊ฐ€ ์กด์žฌํ•ฉ๋‹ˆ๋‹ค. Encoder self-attention์€ ๊ฐ ๋‹จ์–ด์˜ ์ˆœ์—ด ์ „๋ถ€๋ฅผ ์ฐธ๊ณ ํ•˜๋ฉฐ, masked decoder self-attention์€ ์น˜ํŒ…์„ ๋ฐฉ์ง€ํ•˜์ง€ ์œ„ํ•ด ์•ž์ชฝ ๋‹จ์–ด๋“ค๋งŒ์„ ์ฐธ๊ณ ํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. Encoder-decoder attention์—์„œ query๋Š” ๋””์ฝ”๋”, key์™€ value๋Š” ์ธ์ฝ”๋”์— ์žˆ์œผ๋ฉฐ positional encoding์—์„œ ์ฃผ๊ธฐํ•จ์ˆ˜๋ฅผ ํ™œ์šฉํ•ด ๊ฐ ๋‹จ์–ด์˜ ์ƒ๋Œ€์ ์ธ ์œ„์น˜ ์ •๋ณด๋ฅผ ์ž…๋ ฅํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

Untitled 17

๊ฐ ์–ดํ…์…˜์˜ ํƒ€์ž…์˜ Q,K,VQ, K, V๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์ด ํ‘œ์‹œํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

  • ์ธ์ฝ”๋”์˜ self-attention
    • Query=Key=ValueQuery = Key = Value
  • ๋””์ฝ”๋”์˜ masked self-attention
    • Query=Key=ValueQuery = Key = Value
  • ๋””์ฝ”๋”์˜ encoder-decoder attention
    • Query: ๋””์ฝ”๋” ๋ฒกํ„ฐ / Key = Value: ์ธ์ฝ”๋” ๋ฒกํ„ฐ

์ธ์ฝ”๋”์˜ self-attention๊ณผ ๋””์ฝ”๋”์˜ masked self-attention์—์„œ๋Š” QQ์™€ K,VK, V๊ฐ€ ๋ชจ๋‘ ๋™์ผํ•ฉ๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ๋””์ฝ”๋”์˜ encoder-decoder attention์€ QQ๋กœ decoder vector๋ฅผ, K&VK\&V๋กœ encoder vector๋ฅผ ๊ฐ–์Šต๋‹ˆ๋‹ค.

Untitled 18

Also, within multi-head attention the decoder requires masking because it is an auto-regressive model; masking is what preserves the auto-regressive property. As fig 24 shows, $y_1$ attends only to $x_1$ and $y_2$ attends to $x_1$ and $x_2$; that is, values from later steps cannot be referenced.

Untitled 19

Untitled 20

Let's take a closer look at this masked decoder self-attention.

Attention inside the decoder

As explained above, the Transformer's decoder, unlike the encoder, performs masked multi-head attention. The encoder's task is to understand what is in the input, whereas the decoder's task is to predict the output based on that input. Masking therefore keeps the decoder from cheating by peeking at the output in advance.

It is basically the same as multi-head attention, with one difference: when computing self-attention, only the positions up to the current time step are used, and later positions are not referenced. In RNN models such as the original seq2seq, the sequence is fed in one step at a time, so the hidden state, updated sequentially from the front, is all that is available for predicting the next step. The Transformer, however, receives the entire input sequence at once, so without masking it would also have access to information from positions after the current time step. To prevent this, masking is applied before self-attention is performed.

Untitled 21

Let's now walk through that process in a bit more detail.

Untitled 22

fig 26์˜ a์™€ ๊ฐ™์ด โ€˜I am a boyโ€™๋ผ๋Š” ๋ฒกํ„ฐ๋ฅผ ๊ฐ€์ •ํ•ฉ๋‹ˆ๋‹ค. b๋Š” ์ด ๋ฒกํ„ฐ์˜ score๋ฅผ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค. ์—ฌ๊ธฐ์„œ ํฐ ์ƒ‰ ๋ถ€๋ถ„, ์ฆ‰ ํ˜„์žฌ ์‹œ์ ๋ณด๋‹ค ๋’ค์— ์˜ค๋Š” ๊ฐ’์— ๋งˆ์Šคํฌ๋ฅผ ์”Œ์›Œ์ค๋‹ˆ๋‹ค. ์ด๋ฅผ ํ‘œ์‹œํ•œ ๊ฒƒ์ด c์ž…๋‹ˆ๋‹ค. masking์„ ์ˆ˜ํ•™์ ์œผ๋กœ ๊ตฌํ˜„ํ•  ๋•Œ์—๋Š” ํฌ์ง€์…˜์— ํ•ด๋‹นํ•˜๋Š” score ๊ฐ’์„ -inf(๋งˆ์ด๋„ˆ์Šค ๋ฌดํ•œ๋Œ€) ๊ฐ’์œผ๋กœ ํ‘œ๊ธฐํ•จ์œผ๋กœ ์ ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋งˆ์Šคํฌ ์ฒ˜๋ฆฌ๋ฅผ ๋จผ์ € ํ•œ ํ›„ softmax๋ฅผ ์ทจํ•˜๋ฉด, d์™€ ๊ฐ™์€ Masked score vector๊ฐ€ ์ƒ์„ฑ๋ฉ๋‹ˆ๋‹ค.

Untitled 23
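
A small numpy sketch of steps (b) through (d), with made-up score values:

```python
import numpy as np

scores = np.array([[0.9, 0.1, 0.3, 0.2],     # hypothetical scores for "I am a boy"
                   [0.4, 1.2, 0.7, 0.1],
                   [0.2, 0.5, 0.8, 0.6],
                   [0.3, 0.2, 0.4, 1.0]])
causal = np.tril(np.ones_like(scores, dtype=bool))   # True at or before the current step
masked = np.where(causal, scores, -np.inf)           # future positions -> -inf
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
print(np.round(weights, 2))                          # strictly upper-triangular entries are 0
```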

The masked values then go through self-attention, completing masked multi-head attention. The values produced in the decoder must next be combined with the values produced earlier by the encoder, and attention is applied once more in this step where information passes from the encoder to the decoder.

Layer normalization

๋‹ค์Œ์œผ๋กœ Layer normalization์ž…๋‹ˆ๋‹ค. layer normalization์„ ์‚ฌ์šฉํ•œ ์ด์œ ๋ฅผ ์•Œ์•„๋ณด๊ธฐ ์œ„ํ•ด Batch normalization๊ณผ ๋น„๊ตํ•˜์—ฌ ํ•จ๊ป˜ ์‚ดํŽด๋ณด๋„๋ก ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค.

Untitled 24

Batch normalization and layer normalization are described as follows.

Batch Normalization : Estimate the normalization statistics from the summed inputs to the neurons over a mini-batch of training cases.

Layer Normalization : Estimate the normalization statistics from the summed inputs to the neurons within a hidden layer.

In other words, batch normalization normalizes over a mini-batch, whereas layer normalization computes the mean and variance over the inputs to a hidden layer, i.e., per sample across its features.

Because the unit of normalization differs in this way, batch normalization has the following drawbacks:

  1. It is dependent on the mini-batch size.
  2. It is difficult to apply to recurrence-based models.

Layer normalization, in contrast, has the following advantages:

  1. ๋ฐ์ดํ„ฐ๋งˆ๋‹ค ๊ฐ๊ฐ ๋‹ค๋ฅธ normalization term(ฮผ,ฯƒ\mu, \sigma)๋ฅผ ๊ฐ–๋Š”๋‹ค
  2. mini-batch ํฌ๊ธฐ์— ์˜ํ–ฅ์„ ๋ฐ›์ง€ ์•Š๋Š”๋‹ค. (์ฆ‰, size=1size = 1 ์ด์–ด๋„ ์ž‘๋™ํ•œ๋‹ค.)
  3. ์„œ๋กœ ๋‹ค๋ฅธ ๊ธธ์ด๋ฅผ ๊ฐ–๋Š” sequence๊ฐ€ batch ๋‹จ์œ„์˜ ์ž…๋ ฅ์œผ๋กœ ๋“ค์–ด์˜ค๋Š” ๊ฒฝ์šฐ์—๋„ ์ ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค. (1๋ฒˆ ํŠน์ง• ๋•Œ๋ฌธ)

๋”ฐ๋ผ์„œ ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” layer normalization์„ ์‚ฌ์šฉํ•˜๊ฒŒ ๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

FFN

FFN์˜ ์—ญํ• ์— ๋Œ€ํ•ด์„œ ์•Œ์•„๋ณด๊ธฐ ์ „์— Residual connection์„ ์‚ฌ์šฉํ•˜๋Š” ์ด์œ ๋ฅผ ๊ฐ„๋‹จํžˆ ์ด์•ผ๊ธฐํ•˜๊ณ  ๋„˜์–ด๊ฐ€๋„๋ก ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค.

Transformer์˜ ๊ฒฝ์šฐ, ์—ฐ์‚ฐ๋Ÿ‰์ด ๋งŽ๊ณ  ์ธต์ด ๊นŠ์–ด ์ผ๋ฐ˜ํ™”๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ ์ฐจ์›์ด ๊ฐ™์€ ์„œ๋ธŒ์ธต์˜ ์ž…๋ ฅ๊ณผ ์ถœ๋ ฅ์„ ๋”ํ•ด ์—ฐ์‚ฐ๋Ÿ‰์„ ์ค„์—ฌ ๋ชจ๋ธ์˜ ํ•™์Šต์„ ๋•๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

Untitled 25

FFN์€ Fully-connected feed forward network๋ฅผ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค. FFN์˜ ์ˆ˜์‹๊ณผ ๊ทธ ์ˆ˜์‹์„ ๋„์‹ํ™”ํ•˜๋ฉด fig 30๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค. multi-head attention์˜ ๊ฒฝ์šฐ ์„ ํ˜• ๋ณ€ํ™˜๋งŒ ๋“ค์–ด๊ฐ€์žˆ๊ธฐ ๋•Œ๋ฌธ์— ํ™œ์„ฑํ™” ํ•จ์ˆ˜(activation function)๋ฅผ ์ถ”๊ฐ€ํ•˜๊ณ  ํ™œ์„ฑํ™” ํ•จ์ˆ˜ ์ด์ „๊ณผ ์ดํ›„์— fully-connected layer๋ฅผ ์‚ฝ์ž…ํ•จ์œผ๋กœ์จ ๋น„์„ ํ˜•์„ฑ์„ ์ถ”๊ฐ€ํ•˜๋Š” ์—ญํ• ์„ ํ•ฉ๋‹ˆ๋‹ค.

Untitled 26
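
For reference, the position-wise FFN from the paper is:

$$\mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2$$

that is, a linear layer expanding to the inner dimension ($d_{ff} = 2048$ in the paper), a ReLU, and a linear layer back down to $d_{model}$.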


Training

Data & Batching

The data and batching used for training are as follows.

  • Standard WMT 2014 English-German dataset
    • 4.5 million sentence pairs
    • encoded using BPE, 37,000 tokens
  • Larger WMT 2014 English-French dataset
    • 36 million sentences
    • split tokens into 32,000 word-piece vocabulary
  • Batched together by approximate sequence length
    • Each training batch contained a set of sentence pairs
    • containing approximately 25,000 source tokens and 25,000 target tokens

Optimizer & Scheduler

Training์—๋Š” Adam optimizer๋ฅผ ์‚ฌ์šฉํ•˜์˜€์Šต๋‹ˆ๋‹ค. ๋˜ํ•œ fig 31์—์„œ ๋ณผ ์ˆ˜ ์žˆ๋“ฏ์ด ์‹œ์ ์— ๋”ฐ๋ผ learning rate๋ฅผ ๋‹ค๋ฅด๊ฒŒ ํ•˜์˜€์Šต๋‹ˆ๋‹ค.

Untitled 27

warmup-step์ธ ์ฒซ๋ฒˆ์งธ training step์—์„œ๋Š” lr์„ ์„ ํ˜•์ ์œผ๋กœ ์ฆ๊ฐ€์‹œ์ผฐ์œผ๋ฉฐ, ๊ทธ ํ›„๋กœ๋Š” step number์˜ inverse square root์— ๋น„๋ก€ํ•˜๊ฒŒ ๊ฐ์†Œ์‹œ์ผฐ์Šต๋‹ˆ๋‹ค.

Regularization

Residual dropout and label smoothing were used for regularization. Label smoothing is a technique that deliberately turns hard targets into soft targets by adding noise to the labels.

๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ๋‘ ๊ธฐ๋ฒ•์„ ํ†ตํ•ด accuracy์™€ BLEU score๋ฅผ ํ–ฅ์ƒ์‹œ์ผฐ๋‹ค๊ณ  ์–ธ๊ธ‰ํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

Untitled 28


Experiments

BLEU & PPL

๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” BLEU์™€ PPL์„ ์ด์šฉํ•˜์—ฌ ์„ฑ๋Šฅ์„ ํ‰๊ฐ€ํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. BLEU๋Š” ๊ธฐ๊ณ„ ๋ฒˆ์—ญ ๊ฒฐ๊ณผ์™€ ์‚ฌ๋žŒ์ด ์ง์ ‘ ๋ฒˆ์—ญํ•œ ๊ฒฐ๊ณผ๊ฐ€ ์–ผ๋งˆ๋‚˜ ์œ ์‚ฌํ•œ์ง€ ๋น„๊ตํ•˜์—ฌ ๋ฒˆ์—ญ์— ๋Œ€ํ•œ ์„ฑ๋Šฅ์„ ์ธก์ •ํ•˜๋Š” ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค. ์ธก์ • ๊ธฐ์ค€์€ n-gram์„ ๊ธฐ๋ฐ˜ํ•ฉ๋‹ˆ๋‹ค. (n-gram์˜ ์ •์˜๋Š” ๋ชจ๋“  ๋‹จ์–ด๋ฅผ ๊ณ ๋ คํ•˜๋Š” ๊ฒƒ์ด ์•„๋‹Œ, ์ผ๋ถ€ ๋‹จ์–ด ๋ช‡ ๊ฐœ๋ฅผ ๋ณด๋Š”๋ฐ, ์ด๋•Œ ๋ช‡ ๊ฐœ๊ฐ€ ๊ณง n-gram์˜ n์ž…๋‹ˆ๋‹ค.)

Let me say a bit more about PPL, which is less familiar than n-grams. Suppose we have two models, A and B, and want to compare their performance. We could put both models to work on tasks such as typo correction or machine translation and evaluate the results. However, there is a simpler evaluation method that, although somewhat less accurate, can be computed quickly from a formula over the test data: perplexity (PPL), in which the model quantifies its own uncertainty. The lower the PPL, the better the language model. The formula shown in fig 33 may look quite daunting, but a simple example makes it clear.

Untitled 29
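
For reference, perplexity over a test sequence $W = w_1 w_2 \dots w_N$ is commonly defined as:

$$PPL(W) = P(w_1, w_2, \dots, w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1, w_2, \dots, w_N)}}$$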

PPL์ด 10์ด ๋‚˜์™”๋‹ค๊ณ  ๊ฐ€์ •ํ•ด๋ด…์‹œ๋‹ค. ๊ทธ๋ ‡๋‹ค๋ฉด ํ•ด๋‹น ์–ธ์–ด ๋ชจ๋ธ์€ ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด์„œ ๋‹ค์Œ ๋‹จ์–ด๋ฅผ ์˜ˆ์ธกํ•˜๋Š” ๋ชจ๋“  ์‹œ์ (time step)๋งˆ๋‹ค ํ‰๊ท  10๊ฐœ์˜ ๋‹จ์–ด, ์ฆ‰ ์„ ํƒ์ง€๋ฅผ ๊ฐ€์ง€๊ณ  ์–ด๋–ค ๊ฒƒ์ด ์ •๋‹ต์ธ์ง€ ๊ณ ๋ฏผํ•˜๊ณ  ์žˆ๋‹ค๊ณ  ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ฆ‰, ๋ณด๋‹ค ์ ์€ ์„ ํƒ์ง€๋ฅผ ๊ฐ€์ง€๊ณ  ๊ณ ๋ฏผํ•˜๋Š” ๊ฒƒ์ด ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์ด ๋†’๋‹ค๋Š” ๊ฒƒ์„ ์˜๋ฏธํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค.

๋‹จ, PPL์€ ์ด์ฒ˜๋Ÿผ ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ์— ์˜์กดํ•˜๋ฏ€๋กœ ๋‘ ๊ฐœ ์ด์ƒ์˜ ์–ธ์–ด ๋ชจ๋ธ์„ ๋น„๊ตํ•  ๋•Œ๋Š” ์ •๋Ÿ‰์ ์œผ๋กœ ์–‘์ด ๋งŽ๊ณ , ๋˜ํ•œ ๋„๋ฉ”์ธ์— ์•Œ๋งž๋Š” ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ•ด์•ผ ํ•  ๊ฒƒ์ž…๋‹ˆ๋‹ค.

Translation

Fig 34 shows two Transformer models.

Untitled 30

  • Transformer (base model) : a single model obtained by averaging 5 checkpoints
  • Transformer (big) : averaged 20 checkpoints and also used beam search with a beam size of 4 and a length penalty of 0.6

big model์˜ ๊ฒฝ์šฐ, BLEU ์Šค์ฝ”์–ด์—์„œ EN-DE(์˜์–ด-๋…์ผ์–ด), EN-FR(์˜์–ด-ํ”„๋ž‘์Šค์–ด) ๋ฒˆ์—ญ task์—์„œ ๊ฐ€์žฅ ์ข‹์€ ์ ์ˆ˜์ธ SOTA๋ฅผ ๊ธฐ๋กํ–ˆ์Šต๋‹ˆ๋‹ค. ๋˜ํ•œ training cost์˜ ๊ฒฝ์šฐ, base model์ด ๊ธฐํƒ€ ๋‹ค๋ฅธ ๋ชจ๋ธ(ConcS2S)๊ณผ ๋น„๊ตํ–ˆ์„ ๋•Œ cost๊ฐ€ ๋‚ฎ์Œ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. big model๋„ ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ cost๊ฐ€ ๋†’๋‹ค๊ณ ๋Š” ๋ณผ ์ˆ˜ ์—†์Šต๋‹ˆ๋‹ค.

Parsing

parsing์˜ ๊ฒฝ์šฐ์—์„œ๋„ fig 35์™€ ๊ฐ™์ด 2๊ฐ€์ง€๋กœ ๋‚˜๋ˆ„์–ด ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

  • Transformer (WSJ only) : trained only on the WSJ dataset (40K sentences)
  • Transformer (semi-supervised) : trained on the BerkeleyParser corpora (about 17M sentences)

Untitled 31

Even though it was designed for translation, the Transformer (WSJ only) performed well, beaten only by the RNN Grammar (Dyer et al., 2016), which was built specifically for parsing. In the semi-supervised setting, the Transformer showed even better performance compared with the other models.

This experiment showed that the Transformer is useful for tasks beyond translation as well.


In summary

The Transformer introduced the attention mechanism to solve the problems that limited earlier RNN and seq2seq models: the lack of parallelization and slow computation. Parallel computation reduced the cost of training. The parts of the architecture worth paying the most attention to are, of course, the use of attention itself, self-attention, and masked self-attention.

The Transformer was a turning point after which NLP made great strides. BERT and GPT were built using only the Transformer's encoder and decoder, respectively, and ChatGPT, which is booming these days, is based on GPT, so it too can be said to descend from the Transformer.



PS. Further questions and inquiries are welcome; they help me grow as well. Thank you for reading this long post.
