The Significance of LLM Tokenization

The document discusses the significance of tokenization in Large Language Models (LLMs), highlighting how tokenization can lead an LLM to misinterpret words. It suggests that inserting special separator characters between the letters of a word can help mitigate these issues, because the LLM is then more likely to process each letter as an individual token. This approach improves performance in tasks that require letter-level tokenization, such as word games.

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.
This file is meant for personal use by vaniyellamraju@gmail.com only.
Sharing or publishing the contents in part or full is liable for legal action.

Agenda

In this session, we will discuss:

● Problems caused by Tokenization in LLMs
● Mitigating Tokenization Problems

The Significance of Tokenization

Input 1: Tokenization Agnostic
Ask the LLM to perform word games on a full word.
Prompt: Can you reverse the letters of the word “Performance”?

Output 1: Poor Performance
The LLM will likely fail at performing the task on the word.
Response: Sure, the reversed letters of the word “Performance” are “emerfnsreP”.

Input 2: Tokenization Cognizant
Ask the LLM to do word games by separating the letters of the word.
Prompt: Can you reverse the letters of the word “P-E-R-F-O-R-M-A-N-C-E”?

Output 2: Better Performance
The LLM will now be more likely to complete the word task successfully.
Response: The reversed letters are E-C-N-A-M-R-O-F-R-E-P.

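The separator trick from Input 2 can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the helper names (`separate_letters`, `build_reverse_prompt`, `expected_reversal`) are hypothetical, and the hyphen is simply the separator used in the slide's example. The sketch only builds the prompt and an answer key; it does not call any LLM.

```python
def separate_letters(word: str, sep: str = "-") -> str:
    """Upper-case a word and place a separator between its letters."""
    return sep.join(word.upper())


def build_reverse_prompt(word: str, sep: str = "-") -> str:
    """Build the tokenization-cognizant prompt from the slide's Input 2."""
    spelled = separate_letters(word, sep)
    return f'Can you reverse the letters of the word "{spelled}"?'


def expected_reversal(word: str, sep: str = "-") -> str:
    """Compute the correct reversed spelling, to check an LLM's answer against."""
    return sep.join(reversed(word.upper()))


prompt = build_reverse_prompt("Performance")
answer_key = expected_reversal("Performance")
print(prompt)      # the text that would be sent to the model
print(answer_key)  # the reference answer for grading the response
```

Computing the answer key in ordinary code makes it easy to compare the model's responses to the plain prompt and the separated prompt on the same words.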
The Significance of Tokenization (Cont.)

Input 1
The Significance of Tokenization (Cont.)

Input 2
Summary

Here’s a brief recap:

● Tokenization can cause Large Language Models (LLMs) to misinterpret words in ways a human reader would not, especially in tasks like word games that require letter-level tokenization.

● One way to mitigate these issues is to add special separator characters between the letters of a word, which makes it more likely that the LLM processes each letter as an individual token.
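To see why separators change what the model receives, the following toy sketch splits text with a greedy longest-match rule. Everything here is an assumption made for illustration: `TOY_VOCAB` is a made-up vocabulary and `toy_tokenize` is a hypothetical helper. Real LLM tokenizers are learned from data (for example with byte-pair encoding) and will split text differently, so treat this only as a picture of the idea.

```python
# Hypothetical toy vocabulary, NOT taken from any real model.
TOY_VOCAB = {"Per", "form", "ance", "-"}


def toy_tokenize(text, vocab=TOY_VOCAB):
    """Greedy longest-match split; unknown characters become single tokens."""
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate starting at position i first.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                match = text[i:j]
                break
        if match is None:
            match = text[i]  # fall back to one character
        tokens.append(match)
        i += len(match)
    return tokens


whole = toy_tokenize("Performance")
spelled = toy_tokenize("P-E-R-F-O-R-M-A-N-C-E")
letters_only = [t for t in spelled if t != "-"]
print(whole)
print(letters_only)
```

In this toy setting the whole word arrives as a few multi-letter chunks, while the separated form arrives one letter per token (with the separators as tokens of their own), which is the situation the recap describes.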
