Understanding Tokenization in GPT
Language models like GPT do not process text directly; instead, they divide the text into smaller units before performing any computation. This process is known as Tokenization.
In this lesson, we will explore what tokenization is and how tokens are used in GPT.
What is Tokenization?
A Token is a small unit into which a sentence is divided, such as a word, punctuation mark, or number.
When the AI receives a prompt such as "The cat climbed up the tree.", it splits the sentence into tokens:
The / cat / climbed / up / the / tree / .
In English tokenization, words are primarily split based on spaces or punctuation (symbols used within a sentence, such as periods).
For instance, the sentence "The quick brown fox jumps over the lazy dog." is tokenized as follows:
['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
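For illustration, here is a minimal sketch of this word-and-punctuation splitting using a regular expression. The function name simple_tokenize is just for this example; real GPT tokenizers use a learned subword vocabulary rather than a simple rule like this.

import re

def simple_tokenize(text):
    # Simplified illustration: take runs of word characters, or single
    # punctuation marks, as separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("The quick brown fox jumps over the lazy dog."))
# ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']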
A single word can also be divided into multiple tokens based on prefixes, patterns, and suffixes.
For example, the word "unconscious" can be divided into sub-components: un (a prefix indicating negation), consc (a common pattern in English words), and ious (a common suffix in English words), resulting in three tokens.
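To see real subword splits, you can use OpenAI's open-source tiktoken library, which exposes the vocabularies used by GPT models. The sketch below assumes tiktoken is installed; the exact pieces it produces for "unconscious" depend on the vocabulary and may differ from the un / consc / ious split described above.

import tiktoken  # pip install tiktoken

# Load a vocabulary used by recent GPT models.
enc = tiktoken.get_encoding("cl100k_base")

token_ids = enc.encode("unconscious")
pieces = [enc.decode([tid]) for tid in token_ids]

print(token_ids)  # integer IDs the model actually sees
print(pieces)     # the subword pieces; the exact split depends on the vocabulary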
How text is tokenized varies by AI model and by the kind of characters being processed. For English text, ChatGPT generally allocates one token per roughly 1-4 letters, while text in other languages is often tokenized in morpheme-like units.
Note: Most text-generating AIs, like ChatGPT, charge based on the number of input and output tokens, so it is important to avoid unnecessary tokens.
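Because billing is per token, it can be useful to count tokens before sending a prompt. The helper below, count_tokens, is a hypothetical utility built on the same tiktoken library used above.

import tiktoken

def count_tokens(text, encoding_name="cl100k_base"):
    # Count how many tokens a prompt will consume; fewer tokens generally
    # means lower cost and more room in the context window.
    enc = tiktoken.get_encoding(encoding_name)
    return len(enc.encode(text))

print(count_tokens("The cat climbed up the tree."))  # number of input tokens billed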
AI models learn the statistical relationships between these tokens and use them to generate new text based on the input prompt.
In the next lesson, we will look into Hallucination, one of the critical issues in generative AI.