
Model Selection and Cross-Validation


Choosing the right machine learning model is one of the most important steps in any ML project.
Even if two models achieve similar scores during training, how well they generalize to unseen data can differ dramatically.


Why Model Selection Matters

  • Avoid Overfitting – Some models perform exceptionally well on training data but fail on new data.
  • Balance Accuracy and Complexity – A simpler model might generalize better than a complex one.
  • Optimize Resources – The most accurate model is not always the most computationally efficient; model selection lets you weigh predictive performance against training and inference cost.

What is Cross-Validation?

Cross-validation is a technique for evaluating model performance by splitting the dataset into multiple subsets (folds) and repeatedly training on some folds while testing on the rest.

For example, in k-fold cross-validation:

  1. The data is divided into k folds.
  2. For each fold:
    • Train the model on k-1 folds.
    • Test it on the remaining fold.
  3. Average the results to get a more reliable performance estimate (the sketch after this list walks through these steps by hand).
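To make these steps concrete, here is a minimal sketch of the same loop written by hand with scikit-learn's KFold splitter. The iris dataset and logistic regression model are assumptions chosen only for illustration; the example later in this lesson uses the same data.

from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

# Load a small example dataset (assumption: iris, as in the example below)
X, y = load_iris(return_X_y=True)

# Split the data into k = 5 folds
kf = KFold(n_splits=5, shuffle=True, random_state=42)

scores = []
for train_idx, test_idx in kf.split(X):
    model = LogisticRegression(max_iter=200)
    model.fit(X[train_idx], y[train_idx])                  # train on k-1 folds
    scores.append(model.score(X[test_idx], y[test_idx]))   # test on the held-out fold

# Average the per-fold accuracies for a more reliable estimate
print(f"Average accuracy across folds: {sum(scores) / len(scores):.3f}")

In practice, cross_val_score (used later in this lesson) wraps exactly this loop for you.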

Common Cross-Validation Types

  • K-Fold Cross-Validation – Most common, splits into k equal folds.
  • Stratified K-Fold – Maintains class proportions in each fold (important for classification).
  • Leave-One-Out (LOO) – Each observation is held out as the test set exactly once, giving as many folds as samples; thorough but expensive on large datasets.
  • ShuffleSplit – Repeated random train/test splits of a fixed size; unlike K-Fold, the same sample can appear in the test set of several splits (all four splitters are shown in the sketch after this list).
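The sketch below shows how each of these strategies can be plugged into cross_val_score through its cv argument. It is a minimal illustration; the iris dataset and logistic regression model are assumptions, not requirements.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    cross_val_score, KFold, StratifiedKFold, LeaveOneOut, ShuffleSplit
)

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200)

# One splitter object per strategy listed above
splitters = {
    "K-Fold": KFold(n_splits=5, shuffle=True, random_state=0),
    "Stratified K-Fold": StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    "Leave-One-Out": LeaveOneOut(),
    "ShuffleSplit": ShuffleSplit(n_splits=5, test_size=0.2, random_state=0),
}

for name, cv in splitters.items():
    scores = cross_val_score(model, X, y, cv=cv)
    print(f"{name}: mean accuracy {scores.mean():.3f} over {len(scores)} splits")

Note that Leave-One-Out produces as many splits as there are samples, so expect it to be noticeably slower on larger datasets.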

Example: Comparing Models with Cross-Validation

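# piplite installs packages inside the browser-based (JupyterLite) Python environment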
import piplite
await piplite.install('scikit-learn')

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Define models
log_reg = LogisticRegression(max_iter=200)
knn = KNeighborsClassifier(n_neighbors=5)

# Cross-validation
log_scores = cross_val_score(log_reg, X, y, cv=5)
knn_scores = cross_val_score(knn, X, y, cv=5)

print(f"Logistic Regression mean score: {log_scores.mean():.3f}")
print(f"KNN mean score: {knn_scores.mean():.3f}")

This example uses 5-fold cross-validation to compare the two models; the one with the higher mean accuracy would be selected. Note that when cv is given as an integer and the estimator is a classifier, scikit-learn uses stratified folds by default, so class proportions are preserved in each split.


Key Takeaways

  • Model selection ensures the chosen model is the best fit for both accuracy and efficiency.
  • Cross-validation gives a more robust estimate of real-world performance.
  • Always use the same cross-validation strategy (ideally the same splitter with a fixed random seed) when comparing models, so the comparison is fair, as sketched below.
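As a minimal sketch of that last point, the snippet below passes one shared splitter with a fixed random_state to both models, so they are scored on exactly the same folds. The models and dataset mirror the example above and are assumptions for illustration.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# One shared splitter: both models see exactly the same train/test folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for name, model in [("Logistic Regression", LogisticRegression(max_iter=200)),
                    ("KNN", KNeighborsClassifier(n_neighbors=5))]:
    scores = cross_val_score(model, X, y, cv=cv)
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")

Reporting the standard deviation alongside the mean also shows how stable each model's performance is across folds.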

What’s Next?

In the next lesson, we’ll wrap up the chapter with the Final Quiz – Machine Learning Essentials to review everything you’ve learned.
