Categorical Data Encoding
AI and machine learning models operate on numerical inputs.
However, much of the data we work with is text-based.
This kind of data, which can be grouped into certain categories without numerical meaning, is called categorical data.
| ID | Color | Region | Occupation |
|-----|-------|--------|------------|
| 1 | Red | New York | Student |
| 2 | Blue | Chicago | Employee |
| 3 | Green | Los Angeles | Student |
| 4 | Yellow| New York | Doctor |
In the data above, Color, Region, and Occupation are categorical columns.
Their values cannot be used directly in arithmetic, and comparing their magnitude or order is not meaningful.
Categorical data can be divided into two main types.
Nominal Data
This is categorical data without any order. Colors (red, blue, green) and regions (New York, Chicago, Los Angeles) are examples of nominal data.
Ordinal Data
This is categorical data with an order. Education level (elementary, middle, high school) and customer satisfaction (low, medium, high) are examples of ordinal data.
Categorical data needs to be converted into numerical form for machine learning, a process known as encoding.
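To make the nominal/ordinal distinction concrete, here is a minimal sketch of how ordinal data might be converted while keeping its order. The `responses` list is hypothetical sample data, and the names `satisfaction_order` and `satisfaction_rank` are chosen for illustration only.

```python
# Ordinal categories listed from lowest to highest (taken from the example above).
satisfaction_order = ["low", "medium", "high"]

# Map each level to its position so that a larger number means a higher level.
satisfaction_rank = {level: rank for rank, level in enumerate(satisfaction_order)}

# Hypothetical survey responses, used only to illustrate the mapping.
responses = ["medium", "low", "high", "medium"]
encoded_responses = [satisfaction_rank[response] for response in responses]

print(satisfaction_rank)   # {'low': 0, 'medium': 1, 'high': 2}
print(encoded_responses)   # [1, 0, 2, 1]
```

Because the order is written out explicitly, comparisons such as `satisfaction_rank["high"] > satisfaction_rank["low"]` stay meaningful. Nominal data has no such order, so any numbers assigned to it act only as identifiers.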
What is Data Encoding?
Categorical data must be transformed into numbers so that machine learning models can comprehend it. This transformation process is known as data encoding.
For example, let's convert the color data above into numbers.
| ID | Color | Color (Encoded) |
|-----|--------|----------------|
| 1 | Red | 0 |
| 2 | Blue | 1 |
| 3 | Green | 2 |
| 4 | Yellow | 3 |
With this conversion, the model can process the color data as numbers.
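The conversion in the table can be sketched in a few lines of plain Python. This is one possible approach, assuming codes are assigned in order of first appearance; the variable names `color_to_code` and `encoded_colors` are illustrative and not part of any library.

```python
# Color column from the table above, in row order (ID 1 to 4).
colors = ["Red", "Blue", "Green", "Yellow"]

# Give each distinct category the next unused integer, in order of first appearance.
color_to_code = {}
for color in colors:
    if color not in color_to_code:
        color_to_code[color] = len(color_to_code)

# Replace every color with its integer code.
encoded_colors = [color_to_code[color] for color in colors]

print(color_to_code)    # {'Red': 0, 'Blue': 1, 'Green': 2, 'Yellow': 3}
print(encoded_colors)   # [0, 1, 2, 3]
```

The integers here are only identifiers: Yellow being 3 does not mean Yellow is greater than Red, since color is nominal data.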
There are methods like Label Encoding and One-Hot Encoding for this transformation.
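As a brief preview, the sketch below shows the basic idea behind One-Hot Encoding using the same color column: each category gets its own 0/1 column instead of a single integer. It is written in plain Python for illustration, and the names `categories` and `one_hot_rows` are assumptions rather than a library API.

```python
# Color column from the table above, in row order (ID 1 to 4).
colors = ["Red", "Blue", "Green", "Yellow"]

# Distinct categories in order of first appearance; each becomes one output column.
categories = list(dict.fromkeys(colors))

# For every row, put 1 in the column of its category and 0 everywhere else.
one_hot_rows = [
    [1 if color == category else 0 for category in categories]
    for color in colors
]

print(categories)  # ['Red', 'Blue', 'Green', 'Yellow']
for color, row in zip(colors, one_hot_rows):
    print(color, row)
# Red [1, 0, 0, 0]
# Blue [0, 1, 0, 0]
# Green [0, 0, 1, 0]
# Yellow [0, 0, 0, 1]
```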
We will discuss each method in more detail in the following lessons.