Data Preprocessing and Model Evaluation Using Scikit-Learn

Preprocessing refers to transforming data to make it suitable for use in a machine learning model.

Before training a machine learning model, it's necessary to prepare and preprocess the training data.

Scikit-Learn provides tools for data preprocessing as well as metrics for model evaluation.

Data Preprocessing

The process of data preprocessing includes the following steps.

Handling Missing Values

Missing values are entries in the dataset for which no value was recorded.

When a dataset contains missing values, a common approach is to fill (impute) them with the mean or median of the corresponding feature.

Handling Missing Values
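from sklearn.impute import SimpleImputer
import numpy as np

# Two samples, three features; np.nan marks the missing entries
data = np.array([[1, 2, np.nan], [4, np.nan, 6]])

# Replace each missing entry with the mean of its column
imputer = SimpleImputer(strategy="mean")
filled_data = imputer.fit_transform(data)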
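
The text above mentions the median as an alternative to the mean. As a minimal sketch, the snippet below compares the two strategies on a made-up single-feature array. The values, including the outlier of 100, are illustrative assumptions and not data from this lesson.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-feature data with one missing entry and one outlier (100)
data = np.array([[1.0], [2.0], [np.nan], [3.0], [100.0]])

# Fill the missing entry with the column mean, then with the column median
mean_filled = SimpleImputer(strategy="mean").fit_transform(data)
median_filled = SimpleImputer(strategy="median").fit_transform(data)

print(mean_filled[2, 0])    # mean of the observed values 1, 2, 3, 100
print(median_filled[2, 0])  # median of the observed values 1, 2, 3, 100
```

Here the mean of the observed values is 26.5 while the median is 2.5, which suggests the median may be the safer choice when a feature contains outliers.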

Feature Scaling

Feature scaling is the task of bringing all features onto a comparable scale.

If the magnitudes of feature values differ significantly, model performance can degrade, so scaling techniques such as standardization are used to make feature ranges consistent.

Feature Scaling
from sklearn.preprocessing import StandardScaler
import numpy as np

# Two features on very different scales
X = np.array([[1, 100], [2, 200], [3, 300]])
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled)

In the code above, StandardScaler standardizes each feature (column) independently, so every column of X_scaled has a mean of 0 and a standard deviation of 1.
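
To make the transformation concrete, the following sketch reproduces the standardization by hand with NumPy and compares it with the StandardScaler result. The name X_manual is introduced here only for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1, 100], [2, 200], [3, 300]])

# Standardize manually: subtract the column mean, divide by the column std
# (np.std defaults to the population standard deviation, ddof=0)
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

X_scaled = StandardScaler().fit_transform(X)

print(np.allclose(X_manual, X_scaled))  # True if both approaches agree
print(X_scaled.mean(axis=0))            # approximately 0 for each column
print(X_scaled.std(axis=0))             # approximately 1 for each column
```

Although the two columns start on very different scales, they end up with the same standardized values, because the second column is simply the first multiplied by 100.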

Model Evaluation

There are various metrics available to evaluate model performance.

Classification Model Evaluation

Classification models assign each data point to one of two or more classes.

Accuracy measures how often the model's predictions are correct.

Accuracy Evaluation
from sklearn.metrics import accuracy_score

y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]

accuracy = accuracy_score(y_true, y_pred)
print(f"Accuracy: {accuracy:.2f}")

Accuracy is the ratio of correct predictions to the total number of predictions. In the code above, 3 of the 4 predictions match the actual labels, so the accuracy is 0.75. Higher accuracy indicates better model performance.
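
As a rough illustration of that ratio, the sketch below counts the matching predictions by hand and compares the result with accuracy_score. The names correct and manual_accuracy are introduced here for illustration only.

```python
from sklearn.metrics import accuracy_score

y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]

# Count the positions where the prediction matches the true label
correct = sum(t == p for t, p in zip(y_true, y_pred))
manual_accuracy = correct / len(y_true)

print(manual_accuracy)                 # 3 correct out of 4
print(accuracy_score(y_true, y_pred))  # should match the manual value
```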

Regression Model Evaluation

Regression analysis models the relationship between variables in order to predict a continuous value.

A regression model is evaluated by measuring the difference between its predicted values and the actual values.

Regression Model Evaluation
from sklearn.metrics import mean_squared_error

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

mse = mean_squared_error(y_true, y_pred)
print(f"MSE: {mse:.2f}")

The code above calculates the Mean Squared Error (MSE), which is the average of the squared differences between predicted and actual values. A lower MSE indicates better performance.
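
The same quantity can be computed step by step with NumPy, which may help show what the metric measures. This is a minimal sketch using the values from the example above; squared_errors and manual_mse are names chosen for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Squared error per sample: 0.25, 0.25, 0.0, 1.0
squared_errors = (y_true - y_pred) ** 2

# Average of the squared errors
manual_mse = squared_errors.mean()

print(manual_mse)
print(mean_squared_error(y_true, y_pred))  # should match the manual value
```

Because the errors are squared, the single prediction that is off by 1.0 contributes more to the result than the two predictions that are off by 0.5 combined.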

