[NLP] Training a Sentence Pair Classification Model
Let's work through a natural language processing example.
The following implements the sentence pair classification model described in the previous post.
This file is based on Gichang Lee's 'Do it! 자연어 처리'. :)
Training a Sentence Pair Classification Model
Building a natural language inference model that checks a hypothesis against a premise
1. Configuring the Environment
Installing TPU-related packages
If you selected TPU as the hardware accelerator when initializing the Colab notebook, run the following code; if you selected GPU, skip it.
code 3-0
!pip install cloud-tpu-client==0.10 https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.9-cp37-cp37m-linux_x86_64.whl
Installing dependency packages
Run code 3-1 to install the remaining dependency packages other than the TPU ones.
code 3-1
!pip install ratsnlp
▶Code output
Requirement already satisfied: ratsnlp in /usr/local/lib/python3.7/dist-packages (1.0.1)
Requirement already satisfied: pytorch-lightning==1.3.4 in /usr/local/lib/python3.7/dist-packages (from ratsnlp) (1.3.4)
Requirement already satisfied: torch>=1.9.0 in /usr/local/lib/python3.7/dist-packages (from ratsnlp) (1.10.0+cu111)
Requirement already satisfied: Korpora>=0.2.0 in /usr/local/lib/python3.7/dist-packages (from ratsnlp) (0.2.0)
Requirement already satisfied: flask>=1.1.4 in /usr/local/lib/python3.7/dist-packages (from ratsnlp) (1.1.4)
Requirement already satisfied: flask-cors>=3.0.10 in /usr/local/lib/python3.7/dist-packages (from ratsnlp) (3.0.10)
Requirement already satisfied: transformers==4.10.0 in /usr/local/lib/python3.7/dist-packages (from ratsnlp) (4.10.0)
Requirement already satisfied: flask-ngrok>=0.0.25 in /usr/local/lib/python3.7/dist-packages (from ratsnlp) (0.0.25)
Requirement already satisfied: numpy>=1.17.2 in /usr/local/lib/python3.7/dist-packages (from pytorch-lightning==1.3.4->ratsnlp) (1.21.5)
Requirement already satisfied: fsspec[http]>=2021.4.0 in /usr/local/lib/python3.7/dist-packages (from pytorch-lightning==1.3.4->ratsnlp) (2022.2.0)
Requirement already satisfied: tqdm>=4.41.0 in /usr/local/lib/python3.7/dist-packages (from pytorch-lightning==1.3.4->ratsnlp) (4.62.3)
Requirement already satisfied: pyDeprecate==0.3.0 in /usr/local/lib/python3.7/dist-packages (from pytorch-lightning==1.3.4->ratsnlp) (0.3.0)
Requirement already satisfied: torchmetrics>=0.2.0 in /usr/local/lib/python3.7/dist-packages (from pytorch-lightning==1.3.4->ratsnlp) (0.7.2)
Requirement already satisfied: future>=0.17.1 in /usr/local/lib/python3.7/dist-packages (from pytorch-lightning==1.3.4->ratsnlp) (0.18.2)
Requirement already satisfied: packaging in /usr/local/lib/python3.7/dist-packages (from pytorch-lightning==1.3.4->ratsnlp) (21.3)
Requirement already satisfied: tensorboard!=2.5.0,>=2.2.0 in /usr/local/lib/python3.7/dist-packages (from pytorch-lightning==1.3.4->ratsnlp) (2.8.0)
Requirement already satisfied: PyYAML<=5.4.1,>=5.1 in /usr/local/lib/python3.7/dist-packages (from pytorch-lightning==1.3.4->ratsnlp) (5.4.1)
Requirement already satisfied: tokenizers<0.11,>=0.10.1 in /usr/local/lib/python3.7/dist-packages (from transformers==4.10.0->ratsnlp) (0.10.3)
Requirement already satisfied: huggingface-hub>=0.0.12 in /usr/local/lib/python3.7/dist-packages (from transformers==4.10.0->ratsnlp) (0.4.0)
Requirement already satisfied: sacremoses in /usr/local/lib/python3.7/dist-packages (from transformers==4.10.0->ratsnlp) (0.0.47)
Requirement already satisfied: importlib-metadata in /usr/local/lib/python3.7/dist-packages (from transformers==4.10.0->ratsnlp) (4.11.1)
Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.7/dist-packages (from transformers==4.10.0->ratsnlp) (2019.12.20)
Requirement already satisfied: filelock in /usr/local/lib/python3.7/dist-packages (from transformers==4.10.0->ratsnlp) (3.6.0)
Requirement already satisfied: requests in /usr/local/lib/python3.7/dist-packages (from transformers==4.10.0->ratsnlp) (2.23.0)
Requirement already satisfied: click<8.0,>=5.1 in /usr/local/lib/python3.7/dist-packages (from flask>=1.1.4->ratsnlp) (7.1.2)
Requirement already satisfied: itsdangerous<2.0,>=0.24 in /usr/local/lib/python3.7/dist-packages (from flask>=1.1.4->ratsnlp) (1.1.0)
Requirement already satisfied: Werkzeug<2.0,>=0.15 in /usr/local/lib/python3.7/dist-packages (from flask>=1.1.4->ratsnlp) (1.0.1)
Requirement already satisfied: Jinja2<3.0,>=2.10.1 in /usr/local/lib/python3.7/dist-packages (from flask>=1.1.4->ratsnlp) (2.11.3)
Requirement already satisfied: Six in /usr/local/lib/python3.7/dist-packages (from flask-cors>=3.0.10->ratsnlp) (1.15.0)
Requirement already satisfied: aiohttp in /usr/local/lib/python3.7/dist-packages (from fsspec[http]>=2021.4.0->pytorch-lightning==1.3.4->ratsnlp) (3.8.1)
Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.7/dist-packages (from huggingface-hub>=0.0.12->transformers==4.10.0->ratsnlp) (3.10.0.2)
Requirement already satisfied: MarkupSafe>=0.23 in /usr/local/lib/python3.7/dist-packages (from Jinja2<3.0,>=2.10.1->flask>=1.1.4->ratsnlp) (2.0.1)
Requirement already satisfied: xlrd>=1.2.0 in /usr/local/lib/python3.7/dist-packages (from Korpora>=0.2.0->ratsnlp) (2.0.1)
Requirement already satisfied: dataclasses>=0.6 in /usr/local/lib/python3.7/dist-packages (from Korpora>=0.2.0->ratsnlp) (0.6)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /usr/local/lib/python3.7/dist-packages (from packaging->pytorch-lightning==1.3.4->ratsnlp) (3.0.7)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests->transformers==4.10.0->ratsnlp) (1.24.3)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests->transformers==4.10.0->ratsnlp) (2.10)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests->transformers==4.10.0->ratsnlp) (2021.10.8)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests->transformers==4.10.0->ratsnlp) (3.0.4)
Requirement already satisfied: google-auth<3,>=1.6.3 in /usr/local/lib/python3.7/dist-packages (from tensorboard!=2.5.0,>=2.2.0->pytorch-lightning==1.3.4->ratsnlp) (1.35.0)
Requirement already satisfied: protobuf>=3.6.0 in /usr/local/lib/python3.7/dist-packages (from tensorboard!=2.5.0,>=2.2.0->pytorch-lightning==1.3.4->ratsnlp) (3.17.3)
Requirement already satisfied: setuptools>=41.0.0 in /usr/local/lib/python3.7/dist-packages (from tensorboard!=2.5.0,>=2.2.0->pytorch-lightning==1.3.4->ratsnlp) (57.4.0)
Requirement already satisfied: google-auth-oauthlib<0.5,>=0.4.1 in /usr/local/lib/python3.7/dist-packages (from tensorboard!=2.5.0,>=2.2.0->pytorch-lightning==1.3.4->ratsnlp) (0.4.6)
Requirement already satisfied: absl-py>=0.4 in /usr/local/lib/python3.7/dist-packages (from tensorboard!=2.5.0,>=2.2.0->pytorch-lightning==1.3.4->ratsnlp) (1.0.0)
Requirement already satisfied: tensorboard-plugin-wit>=1.6.0 in /usr/local/lib/python3.7/dist-packages (from tensorboard!=2.5.0,>=2.2.0->pytorch-lightning==1.3.4->ratsnlp) (1.8.1)
Requirement already satisfied: wheel>=0.26 in /usr/local/lib/python3.7/dist-packages (from tensorboard!=2.5.0,>=2.2.0->pytorch-lightning==1.3.4->ratsnlp) (0.37.1)
Requirement already satisfied: grpcio>=1.24.3 in /usr/local/lib/python3.7/dist-packages (from tensorboard!=2.5.0,>=2.2.0->pytorch-lightning==1.3.4->ratsnlp) (1.44.0)
Requirement already satisfied: markdown>=2.6.8 in /usr/local/lib/python3.7/dist-packages (from tensorboard!=2.5.0,>=2.2.0->pytorch-lightning==1.3.4->ratsnlp) (3.3.6)
Requirement already satisfied: tensorboard-data-server<0.7.0,>=0.6.0 in /usr/local/lib/python3.7/dist-packages (from tensorboard!=2.5.0,>=2.2.0->pytorch-lightning==1.3.4->ratsnlp) (0.6.1)
Requirement already satisfied: cachetools<5.0,>=2.0.0 in /usr/local/lib/python3.7/dist-packages (from google-auth<3,>=1.6.3->tensorboard!=2.5.0,>=2.2.0->pytorch-lightning==1.3.4->ratsnlp) (4.2.4)
Requirement already satisfied: pyasn1-modules>=0.2.1 in /usr/local/lib/python3.7/dist-packages (from google-auth<3,>=1.6.3->tensorboard!=2.5.0,>=2.2.0->pytorch-lightning==1.3.4->ratsnlp) (0.2.8)
Requirement already satisfied: rsa<5,>=3.1.4 in /usr/local/lib/python3.7/dist-packages (from google-auth<3,>=1.6.3->tensorboard!=2.5.0,>=2.2.0->pytorch-lightning==1.3.4->ratsnlp) (4.8)
Requirement already satisfied: requests-oauthlib>=0.7.0 in /usr/local/lib/python3.7/dist-packages (from google-auth-oauthlib<0.5,>=0.4.1->tensorboard!=2.5.0,>=2.2.0->pytorch-lightning==1.3.4->ratsnlp) (1.3.1)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.7/dist-packages (from importlib-metadata->transformers==4.10.0->ratsnlp) (3.7.0)
Requirement already satisfied: pyasn1<0.5.0,>=0.4.6 in /usr/local/lib/python3.7/dist-packages (from pyasn1-modules>=0.2.1->google-auth<3,>=1.6.3->tensorboard!=2.5.0,>=2.2.0->pytorch-lightning==1.3.4->ratsnlp) (0.4.8)
Requirement already satisfied: oauthlib>=3.0.0 in /usr/local/lib/python3.7/dist-packages (from requests-oauthlib>=0.7.0->google-auth-oauthlib<0.5,>=0.4.1->tensorboard!=2.5.0,>=2.2.0->pytorch-lightning==1.3.4->ratsnlp) (3.2.0)
Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.7/dist-packages (from aiohttp->fsspec[http]>=2021.4.0->pytorch-lightning==1.3.4->ratsnlp) (6.0.2)
Requirement already satisfied: asynctest==0.13.0 in /usr/local/lib/python3.7/dist-packages (from aiohttp->fsspec[http]>=2021.4.0->pytorch-lightning==1.3.4->ratsnlp) (0.13.0)
Requirement already satisfied: charset-normalizer<3.0,>=2.0 in /usr/local/lib/python3.7/dist-packages (from aiohttp->fsspec[http]>=2021.4.0->pytorch-lightning==1.3.4->ratsnlp) (2.0.12)
Requirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.7/dist-packages (from aiohttp->fsspec[http]>=2021.4.0->pytorch-lightning==1.3.4->ratsnlp) (1.2.0)
Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.7/dist-packages (from aiohttp->fsspec[http]>=2021.4.0->pytorch-lightning==1.3.4->ratsnlp) (21.4.0)
Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.7/dist-packages (from aiohttp->fsspec[http]>=2021.4.0->pytorch-lightning==1.3.4->ratsnlp) (1.3.0)
Requirement already satisfied: yarl<2.0,>=1.0 in /usr/local/lib/python3.7/dist-packages (from aiohttp->fsspec[http]>=2021.4.0->pytorch-lightning==1.3.4->ratsnlp) (1.7.2)
Requirement already satisfied: async-timeout<5.0,>=4.0.0a3 in /usr/local/lib/python3.7/dist-packages (from aiohttp->fsspec[http]>=2021.4.0->pytorch-lightning==1.3.4->ratsnlp) (4.0.2)
Requirement already satisfied: joblib in /usr/local/lib/python3.7/dist-packages (from sacremoses->transformers==4.10.0->ratsnlp) (1.1.0)
Connecting Google Drive
If a Colab notebook sits idle for a certain amount of time, everything produced up to that point can be lost. To save model checkpoints and other outputs, connect your own Google Drive to the Colab notebook.
code 3-2
from google.colab import drive
drive.mount('/gdrive', force_remount=True)
▶Code output
Mounted at /gdrive
Configuring the model environment
We fine-tune the kcbert-base model on the KLUE-NLI data* released by the AI technology company Upstage.
*klue-benchmark.com/tasks/68/data/description
code 3-3
import torch
from ratsnlp.nlpbook.classification import ClassificationTrainArguments
args = ClassificationTrainArguments(
    pretrained_model_name="beomi/kcbert-base",
    downstream_task_name="pair-classification",
    downstream_corpus_name="klue-nli",
    downstream_model_dir="/gdrive/My Drive/nlpbook/checkpoint-paircls",
    batch_size=32 if torch.cuda.is_available() else 4,
    learning_rate=5e-5,
    max_seq_length=64,
    epochs=5,
    tpu_cores=0 if torch.cuda.is_available() else 8,
    seed=7,
)
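The two conditional expressions in code 3-3 can be read in isolation. The sketch below reproduces them in a helper called `pick_hardware_settings`, a hypothetical name used only for illustration. It takes a plain boolean in place of `torch.cuda.is_available()` so that it runs without PyTorch.

```python
def pick_hardware_settings(has_gpu: bool) -> dict:
    """Mirror the two conditional values in code 3-3 (hypothetical helper)."""
    return {
        # With a GPU the batch size is 32; otherwise it is 4.
        "batch_size": 32 if has_gpu else 4,
        # tpu_cores=0 means no TPU is used; otherwise 8 cores are requested.
        "tpu_cores": 0 if has_gpu else 8,
    }


gpu_settings = pick_hardware_settings(True)
tpu_settings = pick_hardware_settings(False)
print(gpu_settings)
print(tpu_settings)
```

On a GPU runtime this yields the values that appear in the logged arguments further down (batch_size=32, tpu_cores=0).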
Fixing the random seed
Set the random seed.
code 3-4 fixes the random seed to the value specified in args (seed=7 in code 3-3).
code 3-4
from ratsnlp import nlpbook
nlpbook.set_seed(args)
▶Code output
set seed: 7
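Why fixing the seed matters can be shown with Python's standard `random` module alone. This is a minimal sketch, not the body of `nlpbook.set_seed`: what that function seeds internally is not shown in this post, and libraries such as NumPy and PyTorch keep their own generators that usually need seeding too. The helper name `draw` is made up for the example.

```python
import random


def draw(seed: int, n: int = 3) -> list:
    """Seed Python's generator, then draw n pseudo-random integers."""
    random.seed(seed)
    return [random.randint(0, 99) for _ in range(n)]


first = draw(7)
second = draw(7)
# Re-seeding with the same value replays the same sequence.
print(first == second)
```

With the seed fixed, shuffling order and weight initialization become repeatable, which makes two training runs comparable.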
Configuring the logger
Set up the logger that prints the various log messages.
code 3-5
nlpbook.set_logger(args)
▶Code output
INFO:ratsnlp:Training/evaluation parameters ClassificationTrainArguments(pretrained_model_name='beomi/kcbert-base', downstream_task_name='pair-classification', downstream_corpus_name='klue-nli', downstream_corpus_root_dir='/content/Korpora', downstream_model_dir='/gdrive/My Drive/nlpbook/checkpoint-paircls', max_seq_length=64, save_top_k=1, monitor='min val_loss', seed=7, overwrite_cache=False, force_download=False, test_mode=False, learning_rate=5e-05, epochs=5, batch_size=32, cpu_workers=2, fp16=False, tpu_cores=0)
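The `INFO:ratsnlp:` prefix in the output above has the `LEVEL:name:message` layout produced by Python's standard `logging` module. The sketch below reproduces that layout, assuming (this is not confirmed by the post) that `set_logger` builds on `logging`. The logger name `demo` and the in-memory stream are choices made for the example.

```python
import io
import logging

# Write into an in-memory buffer so the result can be inspected.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s:%(name)s:%(message)s"))

logger = logging.getLogger("demo")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the message out of the root logger

logger.info("set seed: %d", 7)
logger.debug("not shown")  # DEBUG is below the INFO threshold

output = stream.getvalue()
print(output, end="")
```

Only the INFO message reaches the stream; raising or lowering the level controls how much of the training log is printed.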
2. Downloading the Corpus
Downloading the corpus
Download the KLUE-NLI data. The corpus named by downstream_corpus_name (klue-nli) is stored under downstream_corpus_root_dir (/content/Korpora in the arguments logged by code 3-5).
code 3-6
nlpbook.download_downstream_dataset(args)
▶Code output
Downloading: 100%|██████████| 12.3M/12.3M [00:00<00:00, 42.3MB/s]
Downloading: 100%|██████████| 1.47M/1.47M [00:00<00:00, 35.6MB/s]
3. Preparing the Tokenizer
Preparing the tokenizer
Run code 3-7 to declare the tokenizer used by the model specified in pretrained_model_name (kcbert-base).
code 3-7
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained(
    args.pretrained_model_name,
    do_lower_case=False,
)
▶Code output
Downloading: 0%| | 0.00/250k [00:00<?, ?B/s]
Downloading: 0%| | 0.00/49.0 [00:00<?, ?B/s]
Downloading: 0%| | 0.00/619 [00:00<?, ?B/s]
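BertTokenizer splits words into WordPiece subwords, which is why the example log in section 4 contains tokens that start with `##` (such as `10 ##1`). The sketch below shows the greedy longest-match-first idea on a single word. The four-entry vocabulary is a toy assumption; the real kcbert-base vocabulary is far larger, and the real tokenizer also handles whitespace, punctuation, and special tokens.

```python
def wordpiece(word: str, vocab: set, unk: str = "[UNK]") -> list:
    """Greedy longest-match-first split of one word into subword tokens."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no vocabulary entry covers this position
        tokens.append(piece)
        start = end
    return tokens


toy_vocab = {"10", "##1", "##빌", "##딩"}  # toy vocabulary, assumed for the example
tokens = wordpiece("101빌딩", toy_vocab)
print(tokens)
```

A word with no matching pieces falls back to the unknown token, so `wordpiece("xyz", toy_vocab)` returns `["[UNK]"]`.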
4. Preprocessing the Data
Building the training dataset
Running code 3-8 creates the training dataset. The KlueNLICorpus class reads the KLUE-NLI data, stored as JSON files, into sentences (premise + hypothesis) and labels (true, false, neutral; that is, entailment, contradiction, neutral). Whenever ClassificationDataset requests them, KlueNLICorpus hands these sentences and labels over to ClassificationDataset.
code 3-8
from ratsnlp.nlpbook.paircls import KlueNLICorpus
from ratsnlp.nlpbook.classification import ClassificationDataset
corpus = KlueNLICorpus()
train_dataset = ClassificationDataset(
    args=args,
    corpus=corpus,
    tokenizer=tokenizer,
    mode="train",
)
▶Code output
INFO:ratsnlp:Creating features from dataset file at /content/Korpora/klue-nli
INFO:ratsnlp:loading train data... LOOKING AT /content/Korpora/klue-nli/klue_nli_train.json
INFO:ratsnlp:tokenize sentences, it could take a lot of time...
INFO:ratsnlp:tokenize sentences [took 15.747 s]
INFO:ratsnlp:*** Example ***
INFO:ratsnlp:sentence A, B: 100๋ถ๊ฐ ์๊ป ๊ทธ๋๋ ์๋๋ถ๋์ 2์ ์ค๋ค + 100๋ถ๊ฐ ์ค๋ค.
INFO:ratsnlp:tokens: [CLS] 100 ##๋ถ๊ฐ ์ ##๊ป ๊ทธ๋๋ ์ ##๋ ##๋ถ ##๋์ 2 ##์ ##์ค๋ค [SEP] 100 ##๋ถ๊ฐ ์ค ##๋ค . [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
INFO:ratsnlp:label: contradiction
INFO:ratsnlp:features: ClassificationFeatures(input_ids=[2, 8327, 15760, 2483, 4260, 8446, 1895, 5623, 5969, 10319, 21, 4213, 10172, 3, 8327, 15760, 2491, 4020, 17, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], attention_mask=[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], token_type_ids=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], label=1)
INFO:ratsnlp:*** Example ***
INFO:ratsnlp:sentence A, B: 100๋ถ๊ฐ ์๊ป ๊ทธ๋๋ ์๋๋ถ๋์ 2์ ์ค๋ค + ์๋๋ถ์ด ์ ๋ง ๋ฉ์์๋ค.
INFO:ratsnlp:tokens: [CLS] 100 ##๋ถ๊ฐ ์ ##๊ป ๊ทธ๋๋ ์ ##๋ ##๋ถ ##๋์ 2 ##์ ##์ค๋ค [SEP] ์ ##๋ ##๋ถ ##์ด ์ ๋ง ๋ฉ ##์ ##์๋ค . [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
INFO:ratsnlp:label: neutral
INFO:ratsnlp:features: ClassificationFeatures(input_ids=[2, 8327, 15760, 2483, 4260, 8446, 1895, 5623, 5969, 10319, 21, 4213, 10172, 3, 1895, 5623, 5969, 4017, 8050, 1348, 4188, 8217, 17, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], attention_mask=[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], token_type_ids=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], label=2)
INFO:ratsnlp:*** Example ***
INFO:ratsnlp:sentence A, B: 100๋ถ๊ฐ ์๊ป ๊ทธ๋๋ ์๋๋ถ๋์ 2์ ์ค๋ค + 100๋ถ๊ฐ ์๋๊ฒ ๋ ๋์์ ๊ฒ ๊ฐ๋ค.
INFO:ratsnlp:tokens: [CLS] 100 ##๋ถ๊ฐ ์ ##๊ป ๊ทธ๋๋ ์ ##๋ ##๋ถ ##๋์ 2 ##์ ##์ค๋ค [SEP] 100 ##๋ถ๊ฐ ์๋ ##๊ฒ ๋ ๋ ##์์ ๊ฒ ๊ฐ๋ค . [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
INFO:ratsnlp:label: neutral
INFO:ratsnlp:features: ClassificationFeatures(input_ids=[2, 8327, 15760, 2483, 4260, 8446, 1895, 5623, 5969, 10319, 21, 4213, 10172, 3, 8327, 15760, 15095, 4199, 832, 587, 25331, 258, 8604, 17, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], attention_mask=[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], token_type_ids=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], label=2)
INFO:ratsnlp:*** Example ***
INFO:ratsnlp:sentence A, B: 101๋น๋ฉ ๊ทผ์ฒ์ ๋๋ฆ ์ฆ๊ธธ๊ฑฐ๋ฆฌ๊ฐ ๋ง์ต๋๋ค. + 101๋น๋ฉ ๊ทผ์ฒ์์ ์ฆ๊ธธ๊ฑฐ๋ฆฌ ์ฐพ๊ธฐ๋ ์ด๋ ต์ต๋๋ค.
INFO:ratsnlp:tokens: [CLS] 10 ##1 ##๋น ##๋ฉ ๊ทผ์ฒ์ ๋๋ฆ ์ฆ ##๊ธธ ##๊ฑฐ๋ฆฌ๊ฐ ๋ง์ต๋๋ค . [SEP] 10 ##1 ##๋น ##๋ฉ ๊ทผ์ฒ์ ##์ ์ฆ ##๊ธธ ##๊ฑฐ๋ฆฌ ์ฐพ ##๊ธฐ๋ ์ด๋ ต ##์ต๋๋ค . [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
INFO:ratsnlp:label: contradiction
INFO:ratsnlp:features: ClassificationFeatures(input_ids=[2, 8240, 4068, 4647, 4389, 29671, 13715, 2676, 4583, 14516, 14617, 17, 3, 8240, 4068, 4647, 4389, 29671, 4072, 2676, 4583, 8181, 2851, 8189, 9775, 8046, 17, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], attention_mask=[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], token_type_ids=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], label=1)
INFO:ratsnlp:*** Example ***
INFO:ratsnlp:sentence A, B: 101๋น๋ฉ ๊ทผ์ฒ์ ๋๋ฆ ์ฆ๊ธธ๊ฑฐ๋ฆฌ๊ฐ ๋ง์ต๋๋ค. + 101๋น๋ฉ ์ฃผ๋ณ์ ์ ์์ด๋ค์ด ์ฆ๊ธธ๊ฑฐ๋ฆฌ๊ฐ ๋ง์ต๋๋ค.
INFO:ratsnlp:tokens: [CLS] 10 ##1 ##๋น ##๋ฉ ๊ทผ์ฒ์ ๋๋ฆ ์ฆ ##๊ธธ ##๊ฑฐ๋ฆฌ๊ฐ ๋ง์ต๋๋ค . [SEP] 10 ##1 ##๋น ##๋ฉ ์ฃผ๋ณ์ ์ ์์ด๋ค์ด ์ฆ ##๊ธธ ##๊ฑฐ๋ฆฌ๊ฐ ๋ง์ต๋๋ค . [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
INFO:ratsnlp:label: neutral
INFO:ratsnlp:features: ClassificationFeatures(input_ids=[2, 8240, 4068, 4647, 4389, 29671, 13715, 2676, 4583, 14516, 14617, 17, 3, 8240, 4068, 4647, 4389, 12298, 22790, 2676, 4583, 14516, 14617, 17, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], attention_mask=[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], token_type_ids=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], label=2)
INFO:ratsnlp:Saving features into cached file, it could take a lot of time...
INFO:ratsnlp:Saving features into cached file /content/Korpora/klue-nli/cached_train_BertTokenizer_64_klue-nli_pair-classification [took 1.934 s]
What the ClassificationDataset class does
This class holds the KlueNLICorpus and the tokenizer declared in code 3-7.
Using the tokenizer, ClassificationDataset converts each sentence pair and label it receives into a form the model can train on (ClassificationFeatures).
In other words, it tokenizes the two sentences, premise and hypothesis, converts the tokens to vocabulary indices, and also maps the label to an integer.
(entailment: 0, contradiction: 1, neutral: 2)
For the roles of KlueNLICorpus and ClassificationDataset and the details of their implementation, see the link below!
(For now this points to the textbook; I plan to add my own implementation on GitHub later.)
- ratsgo.github.io/nlpbook/docs/pair_cls/detail
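The feature layout in the log above can be rebuilt by hand: `[CLS] premise [SEP] hypothesis [SEP]`, padded to `max_seq_length`, with `token_type_ids` set to 1 only for the hypothesis segment. The sketch below is a simplified stand-in, not the ratsnlp implementation. The function name `encode_pair` is hypothetical and truncation is left out. The IDs 2, 3, and 0 for `[CLS]`, `[SEP]`, and `[PAD]` are read off the logged `input_ids`.

```python
LABELS = {"entailment": 0, "contradiction": 1, "neutral": 2}
PAD_ID = 0


def encode_pair(premise_ids, hypothesis_ids, label, max_seq_length, cls_id=2, sep_id=3):
    """Build BERT-style features for one premise/hypothesis pair."""
    first = [cls_id] + premise_ids + [sep_id]  # segment 0
    second = hypothesis_ids + [sep_id]         # segment 1
    input_ids = first + second
    if len(input_ids) > max_seq_length:
        raise ValueError("pair exceeds max_seq_length; truncation is omitted in this sketch")
    pad = max_seq_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * pad,
        # 1 for real tokens, 0 for padding
        "attention_mask": [1] * len(input_ids) + [0] * pad,
        # 0 for the premise segment and padding, 1 for the hypothesis segment
        "token_type_ids": [0] * len(first) + [1] * len(second) + [0] * pad,
        "label": LABELS[label],
    }


# Token IDs borrowed from the first logged example, shortened for readability.
features = encode_pair([8327, 15760], [2491, 4020, 17], "contradiction", max_seq_length=10)
print(features)
```

The attention mask lets the model ignore padding, while the segment IDs tell it which tokens belong to the premise and which to the hypothesis.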
Building the training data loader
Running code 3-9 creates the data loader used during training. From all the instances held by ClassificationDataset, the training data loader draws batch-size many instances (batch_size in the args defined in code 3-3) at random (RandomSampler) without replacement (replacement=False), collates them into a batch (nlpbook.data_collator), and feeds the batch to the model.
code 3-9
from torch.utils.data import DataLoader, RandomSampler
train_dataloader = DataLoader(
    train_dataset,
    batch_size=args.batch_size,
    sampler=RandomSampler(train_dataset, replacement=False),
    collate_fn=nlpbook.data_collator,
    drop_last=False,
    num_workers=args.cpu_workers,
)
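Sampling without replacement means that, within one epoch, every instance is visited exactly once in shuffled order. The sketch below imitates that behavior with the standard library only. It is an illustration of the idea, not how `DataLoader` is implemented, and `random_batches` is a made-up name.

```python
import random


def random_batches(num_instances: int, batch_size: int, seed: int = 7):
    """Yield index batches drawn at random without replacement."""
    indices = list(range(num_instances))
    # Shuffling produces a permutation, so each index appears exactly once.
    random.Random(seed).shuffle(indices)
    for start in range(0, num_instances, batch_size):
        # The last batch may be shorter, which corresponds to drop_last=False.
        yield indices[start:start + batch_size]


batches = list(random_batches(num_instances=10, batch_size=4))
print(batches)
```

With 10 instances and a batch size of 4, the loop yields batches of 4, 4, and 2 indices that together cover every instance once.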
Building the evaluation data loader
Running code 3-10 builds the evaluation data loader. It draws batch-size many instances (batch_size in the args defined in code 3-3) in order (SequentialSampler), collates them into a batch (nlpbook.data_collator), and feeds the batch to the model.
code 3-10
from torch.utils.data import SequentialSampler
val_dataset = ClassificationDataset(
    args=args,
    corpus=corpus,
    tokenizer=tokenizer,
    mode="test",
)
val_dataloader = DataLoader(
    val_dataset,
    batch_size=args.batch_size,
    sampler=SequentialSampler(val_dataset),
    collate_fn=nlpbook.data_collator,
    drop_last=False,
    num_workers=args.cpu_workers,
)
▶Code output
INFO:ratsnlp:Loading features from cached file /content/Korpora/klue-nli/cached_test_BertTokenizer_64_klue-nli_pair-classification [took 0.116 s]
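For evaluation there is no need to shuffle, since the metrics do not depend on instance order, and a fixed order keeps runs comparable. A minimal sketch of sequential batching with the short final batch kept (as with drop_last=False) follows. `sequential_batches` is a hypothetical helper, not part of PyTorch.

```python
import math


def sequential_batches(num_instances: int, batch_size: int):
    """Yield index batches in dataset order, keeping the short final batch."""
    for start in range(0, num_instances, batch_size):
        yield list(range(start, min(start + batch_size, num_instances)))


batches = list(sequential_batches(num_instances=10, batch_size=4))
num_batches = math.ceil(10 / 4)  # batches per pass when nothing is dropped
print(batches, num_batches)
```

Keeping the final short batch matters for evaluation because dropping it would leave some test instances unscored.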
5. Loading the Model
Initializing the model
Run code 3-11 to initialize the model. The pretrained BERT used here is kcbert-base, because pretrained_model_name was set to beomi/kcbert-base in code 3-3. Any other model registered on the Hugging Face model hub can be used as well.
BertForSequenceClassification is a model class that attaches a document classification task module on top of a pretrained BERT model. It is the same class used for the document classification model.
code 3-11
from transformers import BertConfig, BertForSequenceClassification
pretrained_model_config = BertConfig.from_pretrained(
    args.pretrained_model_name,
    num_labels=corpus.num_labels,
)
model = BertForSequenceClassification.from_pretrained(
    args.pretrained_model_name,
    config=pretrained_model_config,
)
▶Code output
Downloading: 0%| | 0.00/438M [00:00<?, ?B/s]
Some weights of the model checkpoint at beomi/kcbert-base were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at beomi/kcbert-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
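The warning above lists `classifier.weight` and `classifier.bias` as newly initialized: these are the parameters of the task module placed on top of BERT, a linear layer that maps the pooled sentence vector to one score per label. The sketch below shows that mapping followed by a softmax, in plain Python with toy numbers. The 4-dimensional vector and the weights are invented for the example and stand in for BERT's much larger pooled output.

```python
import math


def classify(pooled, weight, bias):
    """Apply a linear layer to a pooled vector, then softmax over the labels."""
    logits = [sum(w * x for w, x in zip(row, pooled)) + b for row, b in zip(weight, bias)]
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


# One weight row per label: entailment (0), contradiction (1), neutral (2).
pooled = [0.5, -1.0, 0.25, 2.0]
weight = [[0.1, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0], [0.0, 0.2, 0.0, 0.0]]
bias = [0.0, 0.0, 0.0]

probs = classify(pooled, weight, bias)
predicted = probs.index(max(probs))
print(predicted)
```

Because this layer starts from random values, the model cannot make meaningful predictions until it has been fine-tuned, which is what the last line of the warning says.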
6. Training the Model
Defining the task
Running code 3-12 defines the task for sentence pair classification. The model prepared in code 3-11 is wrapped inside ClassificationTask. The ClassificationTask class defines an optimizer and a learning rate scheduler: Adam as the optimizer and ExponentialLR as the learning rate scheduler.
code 3-12
from ratsnlp.nlpbook.classification import ClassificationTask
task = ClassificationTask(model, args)
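An exponential schedule multiplies the learning rate by a constant factor `gamma` at every scheduler step. The sketch below tabulates that decay starting from the learning rate in code 3-3. The value gamma=0.9 and the assumption of one step per epoch are illustrative only; the post does not state what ClassificationTask actually uses.

```python
def exponential_lr(initial_lr: float, gamma: float, steps: int) -> list:
    """Learning rate at each step under exponential decay: lr_t = lr_0 * gamma**t."""
    return [initial_lr * gamma ** step for step in range(steps)]


# gamma=0.9 is an assumed value for illustration; it is not given in the post.
schedule = exponential_lr(initial_lr=5e-5, gamma=0.9, steps=5)
for step, lr in enumerate(schedule):
    print(step, lr)
```

The rate shrinks steadily, so later updates to the pretrained weights are smaller than early ones.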
Defining the trainer
Running code 3-13 defines the trainer. With the help of the PyTorch Lightning library, this trainer takes care of tedious settings such as GPU/TPU configuration, logging, and checkpointing.
code 3-13
trainer = nlpbook.get_trainer(args)
▶Code output
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
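The arguments logged by code 3-5 include `save_top_k=1` and `monitor='min val_loss'`, which suggests that only the checkpoint with the lowest validation loss is kept. The sketch below illustrates that selection rule on made-up loss values. `keep_top_k` is a hypothetical helper and not the PyTorch Lightning checkpoint callback.

```python
def keep_top_k(val_losses: list, k: int = 1) -> list:
    """Return the epochs whose checkpoints survive when keeping the k lowest losses."""
    ranked = sorted(range(len(val_losses)), key=lambda epoch: val_losses[epoch])
    return sorted(ranked[:k])


# Invented validation losses for five epochs, for illustration only.
losses = [0.92, 0.71, 0.64, 0.69, 0.75]
kept = keep_top_k(losses, k=1)
print(kept)
```

With these numbers only the checkpoint from epoch index 2 would remain, even though training continues for two more epochs.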
Starting training
Calling the trainer's fit() function as in code 3-14 starts training.
code 3-14
trainer.fit(
    task,
    train_dataloader=train_dataloader,
    val_dataloaders=val_dataloader,
)
▶Code output
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
| Name | Type | Params
--------------------------------------------------------
0 | model | BertForSequenceClassification | 108 M
--------------------------------------------------------
108 M Trainable params
0 Non-trainable params
108 M Total params
435.683 Total estimated model params size (MB)
Training: 0it [00:00, ?it/s]
Validating: 0it [00:00, ?it/s]
Validating: 0it [00:00, ?it/s]
Validating: 0it [00:00, ?it/s]
Validating: 0it [00:00, ?it/s]
Validating: 0it [00:00, ?it/s]
