How to Split a Dataset

To evaluate if an AI model functions well for its intended purpose, the dataset should be divided into training set, validation set, and test set.

In this lesson, we'll explore how to split datasets effectively for optimal learning.

How Should We Split?

There are various ways to split a dataset, but it's generally divided using the following ratios:

Training Set: About 60-80% of the entire dataset
Validation Set: About 10-20% of the entire dataset
Test Set: About 10-20% of the entire dataset

For instance, you might use 70% of the entire dataset as the training set, 15% as the validation set, and the remaining 15% as the test set.

Example of Dataset Splitting: Student Exam Scores

If you have exam score data for 100 students, you can split the data as follows:

Training Set (70%):

Use the exam scores of 70 students to train the AI model.
The AI model learns patterns in the scores through this data.

Validation Set (15%):

During AI model training, use the exam scores of 15 students to evaluate the model's performance.
Ensure the model isn't overly specialized on the training data (Overfitting) and that it learns enough from the training data (Underfitting).

Test Set (15%):

Use the remaining exam scores of 15 students to test the model's final performance.
Assess how well the model predicts new data using this set.

Methods of Dataset Splitting

Random Sampling

Select data randomly to split into training, validation, and test sets.

Advantages: It's simple to implement, and all data has an equal chance of being selected.
Disadvantages: There's a risk that the data's important characteristics might not be evenly distributed across the sets. For example, there may be too many or too few instances of a specific category.

Stratified Sampling

Select data by considering key characteristics of the dataset to ensure each stratum (or group) is proportionally represented. For instance, in a disease prediction model, you can maintain the male-to-female ratio to ensure that the dataset's relevant characteristics are reflected uniformly across all sets.

Advantages: Allows you to maintain the important characteristics of the dataset while splitting.
Disadvantages: It can be complex to implement and requires a precise understanding of the dataset's characteristics.

Want to learn more?

Join CodeFriends Plus membership or enroll in a course to start your journey.

How Should We Split?​

Example of Dataset Splitting: Student Exam Scores​

Methods of Dataset Splitting​

Random Sampling​

Stratified Sampling​