Handling Missing and Duplicate Data
Real-world datasets are rarely perfect. You’ll often encounter missing values or duplicate rows that can skew your analysis. Pandas provides powerful tools to identify and handle these issues efficiently.
Dealing with Missing Data
Missing values are usually represented as NaN (Not a Number) in Pandas. You can:
- Detect missing values using .isnull() or .notnull()
- Drop missing data with .dropna()
- Fill missing data using .fillna() (e.g., fill with a default value), or forward-fill from previous values with .ffill()
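As a minimal sketch of these three steps, the example below builds a small made-up DataFrame (the column names and values are purely illustrative) and then detects, drops, and fills its gaps. Forward-filling is done with .ffill() here, since recent Pandas versions deprecate passing a method argument to .fillna().

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with a few gaps
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Cara", "Dev"],
    "age": [28.0, np.nan, 35.0, np.nan],
    "city": ["Lima", "Oslo", None, "Pune"],
})

# Detect: count missing values in each column
missing_per_column = df.isnull().sum()

# Drop: keep only rows that have no missing values
complete_rows = df.dropna()

# Fill: replace gaps with a default value chosen per column
filled_default = df.fillna({"age": 0, "city": "Unknown"})

# Fill: carry the previous valid value forward into each gap
filled_forward = df.ffill()

print(missing_per_column)
print(complete_rows)
print(filled_default)
print(filled_forward)
```

Which option fits depends on the data: dropping is simple but discards rows, while filling keeps every row at the cost of introducing values that were never observed.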
Handle missing values before computing statistics such as mean, sum, or correlation. By default, Pandas skips NaN in these calculations, so the result reflects only the values that are present and may be distorted without any warning.
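To make the effect concrete, here is a small sketch with made-up numbers showing how the same Series gives three different means depending on how its missing value is treated.

```python
import numpy as np
import pandas as pd

# Hypothetical scores with one missing entry
scores = pd.Series([10.0, np.nan, 30.0])

# Default behaviour: NaN is skipped, so this averages 10 and 30
mean_skipping_nan = scores.mean()

# With skipna=False the missing value propagates and the result is NaN
mean_strict = scores.mean(skipna=False)

# Filling with 0 first changes the answer: (10 + 0 + 30) / 3
mean_after_zero_fill = scores.fillna(0).mean()

print(mean_skipping_nan, mean_strict, mean_after_zero_fill)
```

None of these is automatically the right answer; the point is to choose deliberately rather than rely on the default.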
Handling Duplicate Entries
Duplicate rows can occur due to data entry errors or when merging datasets.
- Use .duplicated() to flag duplicates
- Use .drop_duplicates() to remove them
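A short sketch of both methods on a made-up orders table (the column names and values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical orders where one row was entered twice
orders = pd.DataFrame({
    "order_id": [101, 102, 102, 103],
    "item": ["pen", "ink", "ink", "pen"],
})

# Flag rows that repeat an earlier row exactly
is_duplicate = orders.duplicated()

# Remove the repeats, keeping the first occurrence of each row
deduplicated = orders.drop_duplicates()

print(is_duplicate)
print(deduplicated)
```

By default both methods compare entire rows and treat the first occurrence as the original, so only later copies are flagged or removed.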
Always check if duplicates make sense in the context of your data — not all repetition is bad!
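One way to run that check is to look at the suspect rows before deleting anything, and to spell out which columns define a duplicate. The sketch below uses a made-up visits table; the keep and subset parameters belong to Pandas, while the data and the choice of columns are assumptions.

```python
import pandas as pd

# Hypothetical visit log: Ana's visit appears twice, Ben visited on two days
visits = pd.DataFrame({
    "customer": ["Ana", "Ana", "Ben", "Ben"],
    "date": ["2024-01-05", "2024-01-05", "2024-01-05", "2024-01-06"],
    "amount": [20, 20, 15, 15],
})

# keep=False marks every copy, so all rows involved can be inspected
suspect_rows = visits[visits.duplicated(keep=False)]

# Treat rows as duplicates only when customer and date both match
per_visit = visits.drop_duplicates(subset=["customer", "date"], keep="first")

print(suspect_rows)
print(per_visit)
```

Here Ben's two rows survive because they fall on different dates, even though the customer and amount repeat.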
Summary Table
Task | Method | Description |
---|---|---|
Detect missing | df.isnull() | Shows True for missing values |
Drop missing rows | df.dropna() | Removes rows with any NaN |
Fill missing values | df.fillna(value) | Replaces NaN with the specified value |
Detect duplicates | df.duplicated() | Returns a Boolean Series that is True for repeated rows |
Drop duplicates | df.drop_duplicates() | Removes duplicate rows |
What’s Next?
Let’s apply these techniques in a Jupyter notebook and practice cleaning some messy data!