Calculate Summary Statistics with Pandas

How can we calculate the mean, standard deviation, and other summary statistics of a large dataset all at once?

Writing and calling a separate function for each statistic quickly becomes cumbersome.

The describe() method of a DataFrame calculates the summary statistics in a single call: the number of entries, mean, standard deviation, minimum, quartiles, and maximum of each numeric column.

Calculate Summary Statistics
import pandas as pd

data_frame = pd.DataFrame({
    'Item': ['Apple', 'Banana', 'Strawberry', 'Grapes'],
    'Sales': [1000, 2000, 1500, 3000]
})

# Calculate summary statistics
summary_stats = data_frame.describe()
print(summary_stats)

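data_frame.describe() returns a new DataFrame containing the summary statistics (mean, standard deviation, minimum, maximum, etc.) of each numeric column. By default the text column Item is skipped, which is why the output below contains only Sales.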

describe() Method Output
             Sales
count     4.000000
mean   1875.000000
std     853.912564
min    1000.000000
25%    1375.000000
50%    1750.000000
75%    2250.000000
max    3000.000000

Each row of the output has the following meaning:

  • count: Number of non-missing entries

  • mean: Arithmetic mean

  • std: Sample standard deviation (divides by n - 1)

  • min: Minimum value

  • 25%, 50%, 75%: 25th, 50th (median), and 75th percentiles

  • max: Maximum value
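
As a cross-check, each row that describe() reports can also be computed on its own with the corresponding Series method. The sketch below reuses the Sales figures from the example above; the variable names sales and manual are only illustrative.

```python
import pandas as pd

data_frame = pd.DataFrame({
    'Item': ['Apple', 'Banana', 'Strawberry', 'Grapes'],
    'Sales': [1000, 2000, 1500, 3000]
})

sales = data_frame['Sales']

# Compute each statistic individually, in the order describe() reports them
manual = pd.Series({
    'count': sales.count(),        # number of non-missing entries
    'mean': sales.mean(),
    'std': sales.std(),            # sample standard deviation (ddof=1)
    'min': sales.min(),
    '25%': sales.quantile(0.25),
    '50%': sales.quantile(0.50),   # same value as sales.median()
    '75%': sales.quantile(0.75),
    'max': sales.max(),
})

print(manual)

# describe() skips the text column 'Item' by default;
# include='all' adds rows such as 'unique', 'top', and 'freq' for it
print(data_frame.describe(include='all'))
```

The values in manual should match the Sales column of data_frame.describe(), which is a quick way to confirm what each row represents.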


Handling Missing Values

A missing value is an entry for which no data was recorded. Pandas treats both None and NaN as missing.

Pandas provides several methods for handling missing values, including isnull(), fillna(), and dropna().

Handling Missing Values Example
import pandas as pd

data_frame = pd.DataFrame({
    'Item': ['Apple', 'Banana', 'Strawberry', None],
    'Sales': [1000, 2000, 1500, None]
})

# Check for missing values
missing_values = data_frame.isnull()

# Replace missing values with 0
data_frame_filled = data_frame.fillna(0)

print(data_frame_filled)

Missing Values Replacement Result
         Item   Sales
0       Apple  1000.0
1      Banana  2000.0
2  Strawberry  1500.0
3           0     0.0

Code Explanation

  • data_frame.isnull() returns a DataFrame of the same shape that contains True where a value is missing and False everywhere else.

  • data_frame.fillna(0) returns a new DataFrame in which every missing value, in both Item and Sales, is replaced with 0.

  • Instead of data_frame.fillna(0), you can use data_frame.dropna() to remove rows containing missing values.
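
The options above can be compared side by side. The sketch below counts the missing values in each column, drops incomplete rows with dropna(), and, as an alternative to a single fill value, passes a dictionary to fillna() so that each column gets its own replacement. The replacement values ('Unknown' and 0) and the variable names are illustrative choices, not part of the original example.

```python
import pandas as pd

data_frame = pd.DataFrame({
    'Item': ['Apple', 'Banana', 'Strawberry', None],
    'Sales': [1000, 2000, 1500, None]
})

# Count the missing values in each column
missing_counts = data_frame.isnull().sum()
print(missing_counts)

# Drop every row that contains at least one missing value
data_frame_dropped = data_frame.dropna()
print(data_frame_dropped)

# Fill each column with its own replacement value (illustrative choices)
data_frame_custom = data_frame.fillna({'Item': 'Unknown', 'Sales': 0})
print(data_frame_custom)
```

Both dropna() and fillna() return new DataFrames here, so data_frame itself still contains the missing values afterwards.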
