
Splitting Data: Train vs Test


In machine learning, we split datasets into training and testing sets to evaluate how well a model generalizes to unseen data.

  • Training set – Used by the model to learn patterns.
  • Testing set – Used to check performance on data the model has never seen before.

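If we evaluate a model on the same data it was trained on, overfitting goes unnoticed: a model that has memorized the training data instead of learning general rules will still score well, and the reported performance will be too optimistic.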
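To make the point concrete, here is a minimal sketch that compares accuracy on the training rows with accuracy on held-out rows. The synthetic dataset from make_classification, the DecisionTreeClassifier model, and the names X_demo, y_demo, train_acc, and test_acc are assumptions chosen for illustration; they are not part of this lesson's Iris example. It assumes scikit-learn is already installed.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical noisy dataset: flip_y randomly reassigns some labels
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5,
    flip_y=0.2, random_state=0
)

# Hold out 20% of the rows for evaluation
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# An unconstrained decision tree can memorize the training rows
model = DecisionTreeClassifier(random_state=0)
model.fit(Xd_train, yd_train)

train_acc = model.score(Xd_train, yd_train)
test_acc = model.score(Xd_test, yd_test)
print("Accuracy on training data:", train_acc)
print("Accuracy on held-out data:", test_acc)
```

The training score alone would suggest a near-perfect model; only the held-out score reveals how much of that comes from memorized noise.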


Using train_test_split in Scikit-learn

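train_test_split() shuffles the rows (by default) and divides them into training and test sets. The test_size parameter sets the fraction of rows that goes to the test set.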

Basic Train-Test Split
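# JupyterLite only: install scikit-learn into the browser kernel.
# Skip these two lines in a regular Python environment, where piplite is unavailable.
import piplite
await piplite.install('scikit-learn')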

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into train (80%) and test (20%)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# .shape is (rows, features): Iris has 150 rows and 4 features
print("Train size:", X_train.shape)  # (120, 4)
print("Test size:", X_test.shape)    # (30, 4)
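Conceptually, a random split like the one above amounts to shuffling the row indices and slicing them into two groups. The following is a rough sketch of that idea using NumPy only; it is not scikit-learn's actual implementation, and the toy arrays and the names X_toy, y_toy, and shuffled are made up for illustration.

```python
import math
import numpy as np

# Hypothetical toy data: 10 samples with 2 features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Shuffle the row indices with a seeded generator
rng = np.random.default_rng(42)
shuffled = rng.permutation(len(X_toy))

# Reserve 20% of the rows for testing
n_test = math.ceil(0.2 * len(X_toy))
test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]

# Index features and labels with the same indices so they stay aligned
X_toy_train, X_toy_test = X_toy[train_idx], X_toy[test_idx]
y_toy_train, y_toy_test = y_toy[train_idx], y_toy[test_idx]

print("Train size:", X_toy_train.shape)
print("Test size:", X_toy_test.shape)
```

The key detail is that features and labels are indexed with the same shuffled indices, which is also why X and y are passed to train_test_split() together.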

Controlling Randomness

Setting random_state to a fixed integer makes the shuffle reproducible: the same rows land in the same sets on every run. Without it, each run may produce a different split.

Fixed Random State
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123
)

print("Train size:", X_train.shape)
print("Test size:", X_test.shape)
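One way to check the effect of random_state is to split the same data several times and compare which rows end up in the test set. The sketch below splits an array of row indices rather than the Iris features, so the comparison is easy to read; the names idx, same_seed_match, and diff_seed_match, and the seed 7, are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Row indices for a dataset with 150 rows (the size of Iris)
idx = np.arange(150)

# Two splits with the same seed, one with a different seed
_, test_a = train_test_split(idx, test_size=0.3, random_state=123)
_, test_b = train_test_split(idx, test_size=0.3, random_state=123)
_, test_c = train_test_split(idx, test_size=0.3, random_state=7)

same_seed_match = np.array_equal(test_a, test_b)
diff_seed_match = np.array_equal(test_a, test_c)

print("Same seed, same test rows:", same_seed_match)
print("Different seed, same test rows:", diff_seed_match)
```

A fixed seed is mainly useful for tutorials, debugging, and comparing models on identical splits.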

Stratified Splits

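For classification tasks, use stratify=y so that the training and test sets keep approximately the same class proportions as the full dataset. This matters most when the dataset is small or the classes are imbalanced.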

Stratified Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Check distribution
import numpy as np
unique_train, counts_train = np.unique(y_train, return_counts=True)
unique_test, counts_test = np.unique(y_test, return_counts=True)

print("Train distribution:", dict(zip(unique_train, counts_train)))
print("Test distribution:", dict(zip(unique_test, counts_test)))
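Iris is perfectly balanced, so the benefit of stratification is easier to see on imbalanced labels. The sketch below uses a made-up label array with 90 samples of class 0 and 10 of class 1, then counts how many minority samples reach the test set with and without stratify. The arrays X_imb and y_imb and the list names are hypothetical, not part of the lesson's dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90% class 0, 10% class 1
y_imb = np.array([0] * 90 + [1] * 10)
X_imb = np.arange(len(y_imb)).reshape(-1, 1)

plain_counts = []   # minority samples in the test set, no stratification
strat_counts = []   # minority samples in the test set, with stratification

for seed in range(5):
    _, _, _, y_te = train_test_split(
        X_imb, y_imb, test_size=0.2, random_state=seed
    )
    plain_counts.append(int(np.sum(y_te == 1)))

    _, _, _, y_te_s = train_test_split(
        X_imb, y_imb, test_size=0.2, stratify=y_imb, random_state=seed
    )
    strat_counts.append(int(np.sum(y_te_s == 1)))

print("Minority count in test set (plain):     ", plain_counts)
print("Minority count in test set (stratified):", strat_counts)
```

With stratification, every 20-row test set contains 2 minority samples, matching the 10% share in the full data. Without it, the count depends on the seed.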

Key Takeaways

  • Always split data before training so that overfitting shows up in the evaluation instead of going unnoticed.
  • train_test_split() is a common and flexible way to create the split.
  • Use stratify=y for classification tasks to preserve label proportions.
  • Fix random_state for reproducibility.

What’s Next?

In the next lesson, we’ll explore the ML Workflow and Model Lifecycle.
