Calculate Summary Statistics with Pandas
How can we calculate the mean, standard deviation, and other statistics of a large dataset all at once?
Writing and calling a separate function for each statistic quickly becomes cumbersome.
The describe() method of a DataFrame calculates the summary statistics in a single call: the number of entries, mean, standard deviation, minimum, quartiles, and maximum.
import pandas as pd
data_frame = pd.DataFrame({
    'Item': ['Apple', 'Banana', 'Strawberry', 'Grapes'],
    'Sales': [1000, 2000, 1500, 3000]
})
# Calculate summary statistics
summary_stats = data_frame.describe()
print(summary_stats)
data_frame.describe() returns a new DataFrame containing summary statistics (mean, standard deviation, minimum, maximum, etc.) for each numeric column. By default, the non-numeric Item column is left out, so only Sales appears in the output.
             Sales
count     4.000000
mean   1875.000000
std     853.912564
min    1000.000000
25%    1375.000000
50%    1750.000000
75%    2250.000000
max    3000.000000
The rows have the following meanings:
- count: Number of non-missing entries
- mean: Mean value
- std: Sample standard deviation
- min: Minimum value
- 25%, 50%, 75%: 25th, 50th (median), and 75th percentiles
- max: Maximum value
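As a minimal sketch of how the rows above relate to individual pandas methods, the example below computes a few statistics one at a time and looks up a single value in the summary. The variable names (sales, mean_sales, std_sales, q1_sales, full_summary) are chosen for illustration only. It also shows the include='all' option, which asks describe() to summarize non-numeric columns such as Item as well.

```python
import pandas as pd

data_frame = pd.DataFrame({
    'Item': ['Apple', 'Banana', 'Strawberry', 'Grapes'],
    'Sales': [1000, 2000, 1500, 3000]
})

summary_stats = data_frame.describe()

# Each row of the summary can also be computed on its own
sales = data_frame['Sales']
mean_sales = sales.mean()        # matches the 'mean' row
std_sales = sales.std()          # sample standard deviation, matches the 'std' row
q1_sales = sales.quantile(0.25)  # matches the '25%' row

# Look up a single value in the summary by row label and column name
print(summary_stats.loc['mean', 'Sales'])

# include='all' also summarizes non-numeric columns such as 'Item'
full_summary = data_frame.describe(include='all')
print(full_summary)
```

For text columns, the full summary adds rows such as unique, top, and freq, and leaves the numeric-only rows (mean, std, and so on) as NaN for those columns.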
Handling Missing Values
Missing values are entries in a dataset where the data is absent. Pandas displays them as NaN or None.
Pandas provides several methods for detecting, replacing, and removing missing values.
import pandas as pd
data_frame = pd.DataFrame({
    'Item': ['Apple', 'Banana', 'Strawberry', None],
    'Sales': [1000, 2000, 1500, None]
})
# Check for missing values
missing_values = data_frame.isnull()
# Replace missing values with 0
data_frame_filled = data_frame.fillna(0)
print(data_frame_filled)
         Item   Sales
0       Apple  1000.0
1      Banana  2000.0
2  Strawberry  1500.0
3           0     0.0
Code Explanation
- data_frame.isnull() returns a DataFrame of the same shape in which each missing value is marked True and every other value False.
- data_frame.fillna(0) returns a new DataFrame in which missing values are replaced with 0. The original data_frame is not modified.
- Instead of data_frame.fillna(0), you can use data_frame.dropna() to remove rows that contain missing values.
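The sketch below illustrates the three methods side by side on the same data. It counts the missing values per column, drops the incomplete rows, and fills each column with its own replacement value. The 'Unknown' placeholder and the variable names missing_counts and data_frame_dropped are assumptions made for this example, not part of the lesson's code.

```python
import pandas as pd

data_frame = pd.DataFrame({
    'Item': ['Apple', 'Banana', 'Strawberry', None],
    'Sales': [1000, 2000, 1500, None]
})

# Count missing values per column (True counts as 1 when summed)
missing_counts = data_frame.isnull().sum()
print(missing_counts)

# Drop every row that contains at least one missing value
data_frame_dropped = data_frame.dropna()
print(data_frame_dropped)

# Fill each column with its own replacement value
# ('Unknown' is a placeholder chosen for this example)
data_frame_filled = data_frame.fillna({'Item': 'Unknown', 'Sales': 0})
print(data_frame_filled)
```

Passing a dictionary to fillna() avoids putting the number 0 into a text column such as Item, which is what happens when a single value is used for the whole DataFrame.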