Handling Missing and Duplicate Data

Real-world datasets are rarely perfect. You’ll often encounter missing values or duplicate rows that can skew your analysis. Pandas provides powerful tools to identify and handle these issues efficiently.


Dealing with Missing Data

Missing values are usually represented as NaN (Not a Number) in Pandas; None in object columns and NaT in datetime columns are also treated as missing. You can:

  • Detect missing values using .isnull() or .notnull()
  • Drop missing data with .dropna()
  • Fill missing data using .fillna() (e.g., with a default value), or forward-fill from the previous valid value using .ffill()
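
The three operations above can be sketched on a small example. The DataFrame and its column names (name, age, city) are hypothetical, made up only for illustration, and the fill values are arbitrary choices rather than recommendations.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the column names are for illustration only.
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Cara", "Dev"],
    "age": [28.0, np.nan, 35.0, np.nan],
    "city": ["Lima", "Oslo", None, "Pune"],
})

# Detect: count the missing values in each column.
missing_per_column = df.isnull().sum()

# Drop: keep only the rows that have no missing value in any column.
complete_rows = df.dropna()

# Fill with a default value, chosen separately for each column.
filled_default = df.fillna({"age": 0, "city": "unknown"})

# Forward-fill: copy the last valid value down into each gap.
filled_forward = df.ffill()

print(missing_per_column)
print(complete_rows)
print(filled_default)
print(filled_forward)
```

Note that none of these calls modify df; each returns a new object, so the original data stays available for comparison.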

Decide how to handle missing values before computing statistics like mean, sum, or correlation. By default, Pandas skips NaN in these computations, so a result can silently rest on fewer observations than you expect.
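
As a rough illustration, the sketch below averages the same three values in three ways. The Series named scores is a hypothetical example, and filling with 0 is shown only to make the effect visible, not as a suggested strategy.

```python
import numpy as np
import pandas as pd

# Hypothetical scores with one missing entry.
scores = pd.Series([10.0, np.nan, 30.0])

# Default behaviour (skipna=True): the NaN is ignored, so this is (10 + 30) / 2.
mean_skipping_nan = scores.mean()

# With skipna=False the NaN propagates and the result is NaN.
mean_with_nan = scores.mean(skipna=False)

# Filling with 0 first changes the answer: (10 + 0 + 30) / 3.
mean_after_zero_fill = scores.fillna(0).mean()

print(mean_skipping_nan, mean_with_nan, mean_after_zero_fill)
```

The three results differ, which is why the choice of strategy should be deliberate rather than left to a default.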


Handling Duplicate Entries

Duplicate rows can occur due to data entry errors or when merging datasets.

  • Use .duplicated() to flag duplicates
  • Use .drop_duplicates() to remove them
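
A minimal sketch of both methods follows. The orders DataFrame and its columns are hypothetical. With default arguments, both methods compare entire rows and treat the first occurrence as the original.

```python
import pandas as pd

# Hypothetical order records; the third row repeats the first row exactly.
orders = pd.DataFrame({
    "order_id": [101, 102, 101, 103],
    "customer": ["Ana", "Ben", "Ana", "Ana"],
    "amount": [25.0, 40.0, 25.0, 25.0],
})

# Flag: True marks each row that repeats an earlier row in every column.
is_duplicate = orders.duplicated()

# Remove: keep the first occurrence of each distinct row.
deduplicated = orders.drop_duplicates()

print(is_duplicate)
print(deduplicated)
```

The last row shares a customer and an amount with the first, but its order_id differs, so it is not flagged.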

Always check whether duplicates make sense in the context of your data. Not all repetition is bad!
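
One way to review repetition before deleting anything is to compare only the columns that define "the same record" and to mark every copy. The sketch below assumes a hypothetical purchases table in which a customer buying the same item twice is legitimate; the subset and keep parameters are part of .duplicated(), while the column names are invented.

```python
import pandas as pd

# Hypothetical purchases: repeat purchases of the same item may be valid data.
purchases = pd.DataFrame({
    "customer": ["Ana", "Ana", "Ben", "Ben"],
    "item": ["pen", "pen", "ink", "pad"],
    "purchase_date": ["2024-01-05", "2024-02-10", "2024-01-07", "2024-01-07"],
})

# Compare only customer and item; keep=False marks every member of a
# duplicate group, so all copies can be inspected side by side.
repeat_mask = purchases.duplicated(subset=["customer", "item"], keep=False)
repeat_purchases = purchases[repeat_mask]

# Comparing whole rows finds no duplicates here, because the dates differ.
exact_duplicates = purchases.duplicated().sum()

print(repeat_purchases)
print(exact_duplicates)
```

Here the two "pen" rows look like duplicates by customer and item, but their dates show they are separate purchases, so dropping one would lose real data.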


Summary Table

| Task                | Method               | Description                           |
|---------------------|----------------------|---------------------------------------|
| Detect missing      | df.isnull()          | Shows True for missing values         |
| Drop missing rows   | df.dropna()          | Removes rows with any NaN             |
| Fill missing values | df.fillna(value)     | Replaces NaN with the specified value |
| Detect duplicates   | df.duplicated()      | Returns a Boolean Series              |
| Drop duplicates     | df.drop_duplicates() | Removes duplicate rows                |

What’s Next?

Let’s apply these techniques in a Jupyter notebook and practice cleaning some messy data!