Transformers documentation

๐Ÿค— Tokenizers ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์˜ ํ† ํฌ๋‚˜์ด์ € ์‚ฌ์šฉํ•˜๊ธฐ

You are viewing v4.52.3 version. A newer version v4.53.3 is available.
Hugging Face's logo
Join the Hugging Face community

and get access to the augmented documentation experience

to get started

๐Ÿค— Tokenizers ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์˜ ํ† ํฌ๋‚˜์ด์ € ์‚ฌ์šฉํ•˜๊ธฐ

PreTrainedTokenizerFast๋Š” ๐Ÿค— Tokenizers ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์— ๊ธฐ๋ฐ˜ํ•ฉ๋‹ˆ๋‹ค. ๐Ÿค— Tokenizers ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์˜ ํ† ํฌ๋‚˜์ด์ €๋Š” ๐Ÿค— Transformers๋กœ ๋งค์šฐ ๊ฐ„๋‹จํ•˜๊ฒŒ ๋ถˆ๋Ÿฌ์˜ฌ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๊ตฌ์ฒด์ ์ธ ๋‚ด์šฉ์— ๋“ค์–ด๊ฐ€๊ธฐ ์ „์—, ๋ช‡ ์ค„์˜ ์ฝ”๋“œ๋กœ ๋”๋ฏธ ํ† ํฌ๋‚˜์ด์ €๋ฅผ ๋งŒ๋“ค์–ด ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค:

>>> from tokenizers import Tokenizer
>>> from tokenizers.models import BPE
>>> from tokenizers.trainers import BpeTrainer
>>> from tokenizers.pre_tokenizers import Whitespace

>>> tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
>>> trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

>>> tokenizer.pre_tokenizer = Whitespace()
>>> files = [...]
>>> tokenizer.train(files, trainer)

์šฐ๋ฆฌ๊ฐ€ ์ •์˜ํ•œ ํŒŒ์ผ์„ ํ†ตํ•ด ์ด์ œ ํ•™์Šต๋œ ํ† ํฌ๋‚˜์ด์ €๋ฅผ ๊ฐ–๊ฒŒ ๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ์ด ๋Ÿฐํƒ€์ž„์—์„œ ๊ณ„์† ์‚ฌ์šฉํ•˜๊ฑฐ๋‚˜ JSON ํŒŒ์ผ๋กœ ์ €์žฅํ•˜์—ฌ ๋‚˜์ค‘์— ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

ํ† ํฌ๋‚˜์ด์ € ๊ฐ์ฒด๋กœ๋ถ€ํ„ฐ ์ง์ ‘ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ

๐Ÿค— Transformers ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์—์„œ ์ด ํ† ํฌ๋‚˜์ด์ € ๊ฐ์ฒด๋ฅผ ํ™œ์šฉํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค. PreTrainedTokenizerFast ํด๋ž˜์Šค๋Š” ์ธ์Šคํ„ด์Šคํ™”๋œ ํ† ํฌ๋‚˜์ด์ € ๊ฐ์ฒด๋ฅผ ์ธ์ˆ˜๋กœ ๋ฐ›์•„ ์‰ฝ๊ฒŒ ์ธ์Šคํ„ด์Šคํ™”ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

>>> from transformers import PreTrainedTokenizerFast

>>> fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)

์ด์ œ fast_tokenizer ๊ฐ์ฒด๋Š” ๐Ÿค— Transformers ํ† ํฌ๋‚˜์ด์ €์—์„œ ๊ณต์œ ํ•˜๋Š” ๋ชจ๋“  ๋ฉ”์†Œ๋“œ์™€ ํ•จ๊ป˜ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค! ์ž์„ธํ•œ ๋‚ด์šฉ์€ ํ† ํฌ๋‚˜์ด์ € ํŽ˜์ด์ง€๋ฅผ ์ฐธ์กฐํ•˜์„ธ์š”.

JSON ํŒŒ์ผ์—์„œ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ

JSON ํŒŒ์ผ์—์„œ ํ† ํฌ๋‚˜์ด์ €๋ฅผ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ ์œ„ํ•ด, ๋จผ์ € ํ† ํฌ๋‚˜์ด์ €๋ฅผ ์ €์žฅํ•ด ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค:

>>> tokenizer.save("tokenizer.json")

JSON ํŒŒ์ผ์„ ์ €์žฅํ•œ ๊ฒฝ๋กœ๋Š” tokenizer_file ๋งค๊ฐœ๋ณ€์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ PreTrainedTokenizerFast ์ดˆ๊ธฐํ™” ๋ฉ”์†Œ๋“œ์— ์ „๋‹ฌํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

>>> from transformers import PreTrainedTokenizerFast

>>> fast_tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")

์ด์ œ fast_tokenizer ๊ฐ์ฒด๋Š” ๐Ÿค— Transformers ํ† ํฌ๋‚˜์ด์ €์—์„œ ๊ณต์œ ํ•˜๋Š” ๋ชจ๋“  ๋ฉ”์†Œ๋“œ์™€ ํ•จ๊ป˜ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค! ์ž์„ธํ•œ ๋‚ด์šฉ์€ ํ† ํฌ๋‚˜์ด์ € ํŽ˜์ด์ง€๋ฅผ ์ฐธ์กฐํ•˜์„ธ์š”.

< > Update on GitHub

pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy