Test Dataset for Final Performance Evaluation
In this lesson, we'll explore the test dataset used for evaluating the final performance of a machine learning model.
A test dataset is used to assess whether the model can make accurate predictions on new, unseen data after the training is completed.
Once the model has been trained and tuned using the training and validation datasets, the test dataset is used to confirm the model's generalization performance. Generalization refers to a model's ability to perform well on new data, not just the data it was trained on.
Differences Between Validation and Test Datasets
Both validation and test datasets are used to evaluate a model's performance, but their purposes and timing of use differ as follows:
- Validation Dataset: Used to tune the model's performance and select the best model.
- Test Dataset: Used to evaluate the model's final performance and confirm its effectiveness in real-world scenarios.
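The different roles of the two datasets can be seen in a small end-to-end sketch. This example uses scikit-learn with synthetic data, and the k-NN model and candidate `k` values are illustrative choices, not part of the lesson: the validation set picks the best setting, and the test set is touched only once at the end.

```python
# Sketch: validation set for tuning, test set for one final evaluation.
# Synthetic data and k-NN are illustrative choices (assumes scikit-learn).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# Split off the test set first, then carve a validation set from the rest.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.15, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.15, random_state=0)

# Validation set: compare candidate settings and keep the best one.
best_k, best_score = None, 0.0
for k in (1, 3, 5, 7):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    score = model.score(X_val, y_val)
    if score > best_score:
        best_k, best_score = k, score

# Test set: used once, after tuning, to report final performance.
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print(f"chosen k={best_k}, test accuracy={final_model.score(X_test, y_test):.2f}")
```

Note that the test set plays no part in choosing `k`; it only measures how well the chosen model generalizes.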
The Role of the Test Dataset
The test dataset is used only once, at the final evaluation stage, to verify that the trained and validated model performs well even in real-world scenarios.
For instance, in an AI model that classifies dogs and cats, the test dataset may be structured as follows:
- Input Data: Completely new images of dogs and cats not used for training or validation
- Label (Ground Truth): Information indicating whether each image is of a dog or a cat
After training and validating the model, the test dataset is used to evaluate whether the model can accurately classify dogs and cats in a real-world setting.
If the performance on the test dataset is low, additional adjustments, such as improving the quality of the training data or modifying the model's structure, may be necessary.
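This final evaluation boils down to comparing the model's predictions against the ground-truth labels. The sketch below assumes the predictions have already been produced by some trained classifier; only the accuracy computation is shown.

```python
# Minimal sketch of final test-set evaluation for a dog/cat classifier.
# The prediction list stands in for a trained model's outputs (hypothetical).
def evaluate(predictions, labels):
    """Return accuracy: the fraction of test images classified correctly."""
    correct = sum(p == t for p, t in zip(predictions, labels))
    return correct / len(labels)

# Example: model predictions vs. ground-truth labels for 5 test images.
predictions = ["dog", "cat", "cat", "dog", "cat"]
labels      = ["dog", "cat", "dog", "dog", "cat"]
print(evaluate(predictions, labels))  # → 0.8
```

A low number here (say, well below the validation accuracy) is the signal that the adjustments described above may be needed.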
Conditions for a Good Test Dataset
To construct an effective test dataset, keep the following considerations in mind:
1. Completely Independent from Training and Validation Data
The test data should be new data that the model has never seen before.
If the test data overlaps with the training data, there is a high chance that the model's true performance will be overestimated.
2. Sufficient Volume of Data
A proper size for the test dataset is typically about 10-15% of the total dataset.
If it's too small, it becomes difficult to accurately assess the model's generalization performance, while too large a portion may leave insufficient training data.
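As a quick illustration of what that 10-15% guideline means in practice, here is how the counts work out for a hypothetical 10,000-sample dataset:

```python
# Sketch: how a 10-15% test share divides a 10,000-sample dataset.
total = 10_000
for test_fraction in (0.10, 0.15):
    n_test = int(total * test_fraction)
    print(f"{test_fraction:.0%} test split -> {n_test} test / {total - n_test} remaining samples")
```

The remaining samples are then divided between training and validation data.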
3. Reflective of Real-World Scenarios
The test dataset should resemble the data that will be input in the actual service environment.
For instance, when developing a dog and cat classification model, it would be advantageous to include not just regular photos but also blurred images, those taken in poor lighting conditions, and partially obscured pictures.
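Such harder variants can also be generated from clean photos. The sketch below uses the Pillow imaging library; the input image is a synthetic placeholder rather than a real photo, and the blur radius and brightness factor are illustrative values.

```python
# Sketch: producing harder test variants (blur, poor lighting) from a clean
# image with Pillow. The input image here is a synthetic placeholder.
from PIL import Image, ImageEnhance, ImageFilter

img = Image.new("RGB", (64, 64), color=(180, 140, 100))  # placeholder photo

blurred = img.filter(ImageFilter.GaussianBlur(radius=3))   # blurred shot
dark = ImageEnhance.Brightness(img).enhance(0.3)           # poor lighting
print(blurred.size, dark.size)
```

Including such degraded images in the test set gives a more honest estimate of how the classifier will behave in the actual service environment.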
After evaluating the final accuracy of the model using the test dataset, repeat the training process as needed to enhance performance.
In the next lesson, we will go through a simple quiz to review what we have learned so far.