Preprocessing: Preparing Data for Consumption
Data preprocessing refers to the process of cleaning and transforming data before it is analyzed or used to train AI models. In simple terms, it is about taking raw data, which may be messy or incomplete, and making it clean and consistent.
Why is Preprocessing Necessary?
Data can have the following issues:
- Missing Values: When parts of the data are absent
- Duplicate Values: When the same data is included multiple times
- Inconsistent Data: When data formats are not uniform
JSONL Data Preprocessing Example
Here's how you can handle missing values, ensure consistency, and remove duplicates in a JSONL dataset.
Original JSONL Data
{"name": "John Doe", "age": "30", "city": "New York"}
{"name": "Jane Smith", "age": 40, "city": "Los Angeles"}
{"name": "Jim Brown", "city": "Chicago"}
{"name": "John Doe", "age": "thirty", "city": "New York"}
⬇
JSONL Data with Missing Values Handled
{"name": "John Doe", "age": "30", "city": "New York"}
{"name": "Jane Smith", "age": 40, "city": "Los Angeles"}
{"name": "Jim Brown", "age": 0, "city": "Chicago"} // Replace missing age with 0
{"name": "John Doe", "age": "thirty", "city": "New York"}
⬇
JSONL Data with Consistent Formatting
{"name": "John Doe", "age": 30, "city": "New York"}
{"name": "Jane Smith", "age": 40, "city": "Los Angeles"}
{"name": "Jim Brown", "age": 0, "city": "Chicago"}
{"name": "John Doe", "age": 30, "city": "New York"} // Convert 'thirty' to the number 30
⬇
JSONL Data with Duplicates Removed
{"name": "John Doe", "age": 30, "city": "New York"}
{"name": "Jane Smith", "age": 40, "city": "Los Angeles"}
{"name": "Jim Brown", "age": 0, "city": "Chicago"}
// Removed the duplicate record {"name": "John Doe", "age": 30, "city": "New York"}
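Finally, duplicates can be dropped by keeping only the first occurrence of each identical record, and the cleaned data written back out. This continues the earlier sketches; the output filename people_clean.jsonl is an assumption:

def remove_duplicates(records):
    # Keep only the first occurrence of each identical record
    seen = set()
    unique = []
    for record in records:
        key = json.dumps(record, sort_keys=True)  # order-independent fingerprint of the record
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

records = remove_duplicates(records)

with open("people_clean.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")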
When creating additional training datasets for fine-tuning, it is crucial to preprocess the data meticulously.