
Dataset Structure: Features and Labels


In machine learning, a dataset is typically organized into:

  • Features (X) – The input variables used by the model to make predictions.
    Examples: age, height, number of purchases.
  • Labels (y) – The target variable the model is trying to predict.
    Examples: whether an email is spam, the price of a house.

In supervised learning, a model learns the relationship between features and labels from examples where both are known.
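As a toy illustration (the data here is made up for demonstration), a feature matrix and label vector might look like this:

```python
import numpy as np

# Hypothetical feature matrix X: each row is one sample,
# each column one feature (e.g. [age, height_cm, num_purchases])
X = np.array([
    [25, 170, 3],
    [40, 160, 10],
    [31, 182, 1],
])

# Labels y: one target value per sample
# (e.g. 1 = repeat customer, 0 = not)
y = np.array([0, 1, 0])

print(X.shape)  # (3, 3) -> 3 samples, 3 features
print(y.shape)  # (3,)   -> one label per sample
```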


Loading a Dataset in Scikit-learn

Scikit-learn provides built-in datasets for practice. One of the most famous is the Iris dataset.

Loading the Iris Dataset
from sklearn.datasets import load_iris

iris = load_iris()

# Features (X) - shape: (samples, features)
X = iris.data
print("Feature shape:", X.shape)
print("First row of features:", X[0])

# Labels (y) - shape: (samples,)
y = iris.target
print("Label shape:", y.shape)
print("First label:", y[0])

Inspecting Feature and Label Names

Feature and Label Names
print("Feature names:", iris.feature_names)
print("Target names:", iris.target_names)
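Because `y` stores each class as an integer (0, 1, or 2), you can map labels back to human-readable species names by indexing `target_names` with the label array, for example:

```python
from sklearn.datasets import load_iris

iris = load_iris()
y = iris.target

# target_names is ordered by class index,
# so NumPy fancy indexing maps each integer label to its name
species = iris.target_names[y]
print(species[:3])  # first three samples are all 'setosa'
```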

Why This Matters

  • Features are the information your model uses to make predictions.
  • Labels define the correct answers during training.
  • Organizing data correctly into X and y is essential for Scikit-learn functions like train_test_split() and .fit().
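Once `X` and `y` are separated, they plug directly into those functions. A minimal sketch (using a logistic-regression classifier purely as an example model):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X, y = iris.data, iris.target

# Split features and labels together so rows stay aligned
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)          # learn from features + labels
print(model.score(X_test, y_test))   # accuracy on held-out data
```

Note that `train_test_split()` takes `X` and `y` in a single call, which keeps each feature row paired with its label.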

Key Takeaways

  • X → input features, 2D array shape (n_samples, n_features).
  • y → target labels, 1D array shape (n_samples,).
  • Proper separation of features and labels is the first step in preparing data for training.

What’s Next?

In the next lesson, we’ll learn how to split data into training and testing sets to evaluate model performance.
