How to Split a Dataset
To evaluate if an AI model functions well for its intended purpose, the dataset should be divided into training set
, validation set
, and test set
.
In this lesson, we'll explore how to split datasets effectively for optimal learning.
How Should We Split?
There are various ways to split a dataset, but it's generally divided using the following ratios:
-
Training Set: About 60-80% of the entire dataset
-
Validation Set: About 10-20% of the entire dataset
-
Test Set: About 10-20% of the entire dataset
For instance, you might use 70% of the entire dataset as the training set, 15% as the validation set, and the remaining 15% as the test set.
Example of Dataset Splitting: Student Exam Scores
If you have exam score data for 100 students, you can split the data as follows:
Training Set (70%):
- Use the exam scores of 70 students to train the AI model.
- The AI model learns patterns in the scores through this data.
Validation Set (15%):
- During AI model training, use the exam scores of 15 students to evaluate the model's performance.
- Ensure the model isn't overly specialized on the training data (Overfitting) and that it learns enough from the training data (Underfitting).
Test Set (15%):
- Use the remaining exam scores of 15 students to test the model's final performance.
- Assess how well the model predicts new data using this set.
Methods of Dataset Splitting
Random Sampling
Select data randomly to split into training, validation, and test sets.
-
Advantages: It's simple to implement, and all data has an equal chance of being selected.
-
Disadvantages: There's a risk that the data's important characteristics might not be evenly distributed across the sets. For example, there may be too many or too few instances of a specific category.
Stratified Sampling
Select data by considering key characteristics of the dataset to ensure each stratum (or group) is proportionally represented. For instance, in a disease prediction model, you can maintain the male-to-female ratio to ensure that the dataset's relevant characteristics are reflected uniformly across all sets.
-
Advantages: Allows you to maintain the important characteristics of the dataset while splitting.
-
Disadvantages: It can be complex to implement and requires a precise understanding of the dataset's characteristics.
Want to learn more?
Join CodeFriends Plus membership or enroll in a course to start your journey.