Preprocessing: Preparing Data for Consumption
Data preprocessing refers to the process of cleaning and transforming data before it is analyzed or used to train AI models. In simple terms, it is about taking raw data, which may be messy or incomplete, and making it clean and consistent.
Why is Preprocessing Necessary?
Data can have the following issues:
- Missing Values: When parts of the data are absent
- Duplicate Values: When the same data is included multiple times
- Inconsistent Data: When data formats are not uniform
JSONL Data Preprocessing Example
Here's how you can handle missing values, ensure consistency, and remove duplicates in a JSONL dataset.
Original JSONL Data
{"name": "John Doe", "age": "30", "city": "New York"}
{"name": "Jane Smith", "age": 40, "city": "Los Angeles"}
{"name": "Jim Brown", "city": "Chicago"}
{"name": "John Doe", "age": "thirty", "city": "New York"}
⬇
JSONL Data with Missing Values Handled
{"name": "John Doe", "age": "30", "city": "New York"}
{"name": "Jane Smith", "age": 40, "city": "Los Angeles"}
{"name": "Jim Brown", "age": 0, "city": "Chicago"} // Replace missing age with 0
{"name": "John Doe", "age": "thirty", "city": "New York"}
⬇
JSONL Data with Consistent Formatting
{"name": "John Doe", "age": 30, "city": "New York"}
{"name": "Jane Smith", "age": 40, "city": "Los Angeles"}
{"name": "Jim Brown", "age": 0, "city": "Chicago"}
{"name": "John Doe", "age": 30, "city": "New York"} // Convert 'thirty' to the number 30
⬇
JSONL Data with Duplicates Removed
{"name": "John Doe", "age": 30, "city": "New York"}
{"name": "Jane Smith", "age": 40, "city": "Los Angeles"}
{"name": "Jim Brown", "age": 0, "city": "Chicago"}
// Removed the duplicate record {"name": "John Doe", "age": 30, "city": "New York"}
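Finally, duplicates can be dropped by keeping only the first occurrence of each identical record, and the cleaned data written back out. This continues the earlier sketches; the output filename people_clean.jsonl is an assumption:

def remove_duplicates(records):
    # Keep only the first occurrence of each identical record
    seen = set()
    unique = []
    for record in records:
        key = json.dumps(record, sort_keys=True)  # order-independent fingerprint of the record
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

records = remove_duplicates(records)

with open("people_clean.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")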
When creating additional training datasets for fine-tuning, it is crucial to preprocess the data meticulously.