Understanding Tokenization in GPT
Language models like GPT do not process text directly; instead, they divide the text into smaller units before performing any computation. This process is known as Tokenization.
In this lesson, we will explore what tokenization is and how tokens are used in GPT.
What is Tokenization?
A Token is a small unit into which a sentence is divided, such as a word, punctuation mark, or number.
When the AI receives a prompt such as "The cat climbed up the tree.", it splits the sentence into tokens:
The / cat / climbed / up / the / tree / .
In English tokenization, words are primarily split based on spaces or punctuation (symbols used within a sentence, such as periods).
For instance, the sentence "The quick brown fox jumps over the lazy dog." is tokenized as follows:
['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
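For illustration, here is a minimal sketch of this word-and-punctuation splitting using a regular expression. The function name simple_tokenize is just for this example; real GPT tokenizers use a learned subword vocabulary rather than a simple rule like this.

import re

def simple_tokenize(text):
    # Simplified illustration: take runs of word characters, or single
    # punctuation marks, as separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("The quick brown fox jumps over the lazy dog."))
# ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']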
A single word can also be divided into multiple tokens based on prefixes, patterns, and suffixes.
For example, the word "unconscious" can be divided into sub-components: un (a prefix indicating negation), consc (a common pattern in English words), and ious (a common suffix in English words), resulting in three tokens.
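To see real subword splits, you can use OpenAI's open-source tiktoken library, which exposes the vocabularies used by GPT models. The sketch below assumes tiktoken is installed; the exact pieces it produces for "unconscious" depend on the vocabulary and may differ from the un / consc / ious split described above.

import tiktoken  # pip install tiktoken

# Load a vocabulary used by recent GPT models.
enc = tiktoken.get_encoding("cl100k_base")

token_ids = enc.encode("unconscious")
pieces = [enc.decode([tid]) for tid in token_ids]

print(token_ids)  # integer IDs the model actually sees
print(pieces)     # the subword pieces; the exact split depends on the vocabulary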
How text is tokenized varies by AI model and by the kind of characters being processed. For English text, ChatGPT generally allocates one token per roughly 1-4 letters, while text in other languages is often tokenized in morpheme-like units.
Note: Most text-generating AIs, like ChatGPT, charge based on the number of input and output tokens, so it is important to avoid unnecessary tokens.
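Because billing is per token, it can be useful to count tokens before sending a prompt. The helper below, count_tokens, is a hypothetical utility built on the same tiktoken library used above.

import tiktoken

def count_tokens(text, encoding_name="cl100k_base"):
    # Count how many tokens a prompt will consume; fewer tokens generally
    # means lower cost and more room in the context window.
    enc = tiktoken.get_encoding(encoding_name)
    return len(enc.encode(text))

print(count_tokens("The cat climbed up the tree."))  # number of input tokens billed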
AI models learn the statistical relationships between these tokens and use them to generate new text based on the input prompt.
In the next lesson, we will look into Hallucination, one of the critical issues in generative AI.