Categorical Data Encoding

Most AI and machine learning models can only take numbers as input.

However, much of the data we work with is text-based.

This kind of data, which can be grouped into certain categories without numerical meaning, is called categorical data.

Example of Categorical Data

| ID | Color  | Region      | Occupation |
|----|--------|-------------|------------|
| 1  | Red    | New York    | Student    |
| 2  | Blue   | Chicago     | Employee   |
| 3  | Green  | Los Angeles | Student    |
| 4  | Yellow | New York    | Doctor     |

In the data above, color, region, and occupation are categorical data.

These values cannot be used directly in calculations, and for columns like these, comparing magnitude or order is not meaningful.
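To make the idea of "grouping into categories" concrete, here is a minimal sketch that stores the example table as plain Python dictionaries and lists the distinct categories in each column. The helper name `unique_categories` is hypothetical and chosen only for this illustration.

```python
# Example rows from the table above, stored as plain dictionaries.
rows = [
    {"ID": 1, "Color": "Red", "Region": "New York", "Occupation": "Student"},
    {"ID": 2, "Color": "Blue", "Region": "Chicago", "Occupation": "Employee"},
    {"ID": 3, "Color": "Green", "Region": "Los Angeles", "Occupation": "Student"},
    {"ID": 4, "Color": "Yellow", "Region": "New York", "Occupation": "Doctor"},
]


def unique_categories(rows, column):
    """Return the distinct values of a column, in order of first appearance."""
    seen = []
    for row in rows:
        value = row[column]
        if value not in seen:
            seen.append(value)
    return seen


# Each categorical column takes its values from a small, fixed set of labels.
for column in ("Color", "Region", "Occupation"):
    print(column, unique_categories(rows, column))
```

Each column draws its values from a small set of labels, which is what makes it categorical.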

Categorical data can be divided into two main types.


Nominal Data

This is categorical data without any order. Colors (red, blue, green) and regions (New York, Chicago, Los Angeles) are examples of nominal data.


Ordinal Data

This is categorical data with an order. Education level (elementary, middle, high school) and customer satisfaction (low, medium, high) are examples of ordinal data.
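The following sketch shows what "having an order" means in practice, using the customer satisfaction example. The ratings list is made-up sample data, and the `rank` helper is a hypothetical name used only here.

```python
# The order of the categories is part of the data's meaning: low < medium < high.
SATISFACTION_ORDER = ["low", "medium", "high"]


def rank(level):
    """Return the position of a satisfaction level within the defined order."""
    return SATISFACTION_ORDER.index(level)


# Made-up sample ratings, for illustration only.
ratings = ["medium", "high", "low", "medium"]

# Ordinal values can be sorted and compared in a meaningful way.
sorted_ratings = sorted(ratings, key=rank)
print(sorted_ratings)              # ['low', 'medium', 'medium', 'high']
print(rank("high") > rank("low"))  # True
```

No comparable ordering exists for nominal values such as colors or regions, so sorting them this way would be arbitrary.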

Categorical data needs to be converted into numerical form for machine learning, a process known as encoding.


What is Data Encoding?

Data encoding is the transformation of categorical values into numbers that a machine learning model can take as input.

For example, let's convert the color data above into numbers.

Color Data Encoding

| ID | Color  | Color (Encoded) |
|----|--------|-----------------|
| 1  | Red    | 0               |
| 2  | Blue   | 1               |
| 3  | Green  | 2               |
| 4  | Yellow | 3               |

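With this conversion, the model can process the color data as numbers. The integers act only as identifiers: color is nominal data, so Yellow (3) is not "greater than" Red (0).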
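The color table above can be reproduced with a few lines of plain Python. This is a minimal sketch that assumes codes are handed out in order of first appearance. The variable names are illustrative, and library encoders may assign codes in a different order.

```python
colors = ["Red", "Blue", "Green", "Yellow"]

# Assign each distinct color the next unused integer, in order of first appearance.
color_to_code = {}
for color in colors:
    if color not in color_to_code:
        color_to_code[color] = len(color_to_code)

# Replace every color with its integer code.
encoded = [color_to_code[color] for color in colors]

print(color_to_code)  # {'Red': 0, 'Blue': 1, 'Green': 2, 'Yellow': 3}
print(encoded)        # [0, 1, 2, 3]
```

Keeping the `color_to_code` mapping around matters: the same mapping has to be applied to any new data so that each color always receives the same code.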

Common methods for this transformation include Label Encoding and One-Hot Encoding.
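As a rough preview, one-hot encoding represents each category as its own 0/1 position instead of a single integer. The sketch below is a hand-written illustration under that assumption, not a library implementation, and the `one_hot` helper is a hypothetical name.

```python
colors = ["Red", "Blue", "Green", "Yellow"]

# Fix the category order once so every row uses the same layout.
categories = []
for color in colors:
    if color not in categories:
        categories.append(color)


def one_hot(value, categories):
    """Return a list with 1 at the position of `value` and 0 everywhere else."""
    return [1 if value == category else 0 for category in categories]


one_hot_rows = [one_hot(color, categories) for color in colors]
for color, vector in zip(colors, one_hot_rows):
    print(color, vector)
```

Because every category gets its own position, no category appears numerically larger than another, which suits nominal data such as colors.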

We will discuss each method in more detail in the following lessons.
