Handling Missing and Duplicate Data

Real-world datasets are rarely perfect. You’ll often encounter missing values or duplicate rows that can skew your analysis. Pandas provides powerful tools to identify and handle these issues efficiently.

Dealing with Missing Data

Pandas usually represents missing values as NaN (Not a Number); None and, in datetime columns, NaT are treated as missing too. You can:

Detect missing values using .isnull() or .notnull()
Drop missing data with .dropna()
Fill missing data using .fillna() with a default value, or forward-fill from the previous value using .ffill()
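As a minimal sketch of the three options above, the snippet below builds a small DataFrame and detects, drops, and fills its missing values. The data and the column names (city, temp) are made up for illustration and are not from any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
df = pd.DataFrame({
    "city": ["Oslo", "Lima", None, "Pune"],
    "temp": [4.0, np.nan, 21.5, np.nan],
})

# Detect: count the missing values in each column.
missing_per_column = df.isnull().sum()

# Drop: keep only the rows that have no missing values at all.
complete_rows = df.dropna()

# Fill: replace missing temperatures with a default value (0.0 here).
filled_default = df.fillna({"temp": 0.0})

# Forward-fill: carry the last seen temperature down the column.
filled_forward = df["temp"].ffill()

print(missing_per_column)
print(complete_rows)
print(filled_forward)
```

Which option fits depends on the data: dropping here leaves a single complete row, while filling keeps every row but introduces values that were never observed.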

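Decide how to handle missing values before computing statistics such as mean, sum, or correlation. By default these methods skip NaN rather than raise an error, so the result describes only the non-missing values and can be misleading when many values are absent.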
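To see how missing values affect a computation, here is a small sketch with a made-up Series. It compares the default mean, a mean that refuses to skip NaN, and a mean taken after filling with zero; the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical values with one missing entry.
s = pd.Series([10.0, np.nan, 30.0])

# Default behaviour: NaN is skipped, so this averages 10 and 30.
mean_skipping = s.mean()

# skipna=False makes the missing value propagate into the result.
mean_strict = s.mean(skipna=False)

# Filling with 0 first changes the answer: (10 + 0 + 30) / 3.
mean_zero_filled = s.fillna(0).mean()

# The sum of a Series containing only NaN is 0.0 by default.
all_missing_sum = pd.Series([np.nan, np.nan]).sum()

print(mean_skipping, mean_strict, mean_zero_filled, all_missing_sum)
```

The three means differ even though the underlying data is the same, which is why the choice of strategy should be deliberate.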

Handling Duplicate Entries

Duplicate rows can occur due to data entry errors or when merging datasets.

Use .duplicated() to flag duplicates (by default, the first occurrence is not flagged)
Use .drop_duplicates() to remove them (by default, the first occurrence is kept)

Always check whether duplicates make sense in the context of your data. Not all repetition is bad: two customers can legitimately place identical orders, for example.
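The following sketch flags and removes duplicates in a small, made-up table of purchases. The column names (customer, item) are hypothetical; the subset and keep arguments show one way to control which columns define a duplicate and which copy survives.

```python
import pandas as pd

# Hypothetical purchase records; rows 0 and 2 are identical.
df = pd.DataFrame({
    "customer": ["ana", "ben", "ana", "ana"],
    "item": ["pen", "ink", "pen", "ink"],
})

# Flag rows that repeat an earlier row across all columns.
dup_mask = df.duplicated()

# Remove fully duplicated rows, keeping the first occurrence.
deduped = df.drop_duplicates()

# Treat rows as duplicates when only "customer" matches,
# and keep the last row seen for each customer.
by_customer = df.drop_duplicates(subset="customer", keep="last")

print(dup_mask)
print(deduped)
print(by_customer)
```

Inspecting the flagged rows with df[dup_mask] before dropping anything is a cheap way to confirm that the repetition really is an error.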

Summary Table

Task	Method	Description
Detect missing	`df.isnull()`	Shows True for missing values
Drop missing rows	`df.dropna()`	Removes rows with any NaN
Fill missing values	`df.fillna(value)`	Replaces NaN with the specified value
Detect duplicates	`df.duplicated()`	Returns a Boolean Series
Drop duplicates	`df.drop_duplicates()`	Removes duplicate rows

What’s Next?

Let’s apply these techniques in a Jupyter notebook and practice cleaning some messy data!
