🟪 1-Minute Summary
Duplicates are rows that appear multiple times in a dataset. Types: (1) Exact duplicates (all columns identical), (2) Partial duplicates (key columns identical). Detection: df.duplicated(). Treatment depends on context: remove if errors, keep if legitimate (e.g., multiple purchases by same customer). Always investigate before blindly dropping.
🟦 Core Notes (Must-Know)
Types of Duplicates
- Exact duplicates: every column matches across two or more rows.
- Partial (key) duplicates: key columns (e.g., customer_id, email) match but other columns differ, often conflicting versions of the same record.
- Near/fuzzy duplicates: the same entity with small textual differences ("Jon Smith" vs "John Smith"); df.duplicated() will not catch these, so they need normalization or fuzzy matching.
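A minimal sketch (toy data, invented values) showing how the first two types differ in pandas:

```python
import pandas as pd

# Rows 0 and 1 are exact duplicates (all columns match);
# rows 2 and 3 are partial duplicates (same customer_id, different amount).
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 2],
    "amount": [10.0, 10.0, 5.0, 7.5],
})

print(df.duplicated().sum())                        # 1 exact duplicate
print(df.duplicated(subset=["customer_id"]).sum())  # 2 repeated-key rows
```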
Why Duplicates Occur
- Data entry errors: the same record is submitted or keyed in twice.
- Merges/joins on non-unique keys: a many-to-many join multiplies rows.
- Pipeline reruns: an ETL job appends the same batch a second time.
- Combining overlapping sources: two systems both report the same event.
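The join cause is easy to reproduce: merging on a key that is not unique on one side multiplies rows (toy frames, invented names):

```python
import pandas as pd

# One order, but two address rows for the same customer:
orders = pd.DataFrame({"customer_id": [1], "order_id": [100]})
addresses = pd.DataFrame({"customer_id": [1, 1], "city": ["Leeds", "York"]})

merged = orders.merge(addresses, on="customer_id")
print(len(merged))  # 2: the single order now appears on two rows
```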
Detection Methods
- df.duplicated(): flags exact duplicate rows; by default (keep='first') the first occurrence is marked False and only repeats are flagged True.
- df.duplicated(subset=['key_col']): restricts the comparison to key columns, catching partial duplicates.
- keep=False: flags every member of a duplicate group, the right setting when you want to inspect all copies side by side.
- df['key_col'].value_counts() or df.groupby('key_col').size(): shows how many times each key appears.
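As a complement to df.duplicated(), counting key frequencies makes the scale of the problem visible (toy data):

```python
import pandas as pd

df = pd.DataFrame({"customer_id": [1, 1, 2, 3, 3, 3]})

counts = df["customer_id"].value_counts()
repeated = counts[counts > 1]   # keys appearing more than once
print(repeated.to_dict())       # {3: 3, 1: 2}
```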
Treatment Strategies
- df.drop_duplicates(): removes exact duplicates, keeping the first occurrence.
- drop_duplicates(subset=[...], keep='first'/'last'): key-based deduplication; sort by a timestamp first so that "last" means most recent.
- Aggregate instead of drop: when partial duplicates carry different values (e.g., separate purchase amounts), groupby and sum/max rather than discarding rows.
- Audit the change: record row counts before and after so the removal is reproducible and explainable.
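A sketch of the sort-then-keep-last strategy, assuming a hypothetical updated_at column marks recency:

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "updated_at": pd.to_datetime(["2024-01-01", "2024-03-01", "2024-02-01"]),
    "balance": [100, 150, 80],
})

# Sort by recency, then keep the last (most recent) row per customer.
latest = (df.sort_values("updated_at")
            .drop_duplicates(subset=["customer_id"], keep="last"))
print(dict(zip(latest["customer_id"], latest["balance"])))  # {2: 80, 1: 150}
```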
When NOT to Remove Duplicates
- Transaction or event data: the same customer legitimately buying the same product twice produces near-identical rows.
- Time series: repeated measurements are the data itself; the true key usually includes a timestamp.
- Before the unit of analysis is settled: rows can look duplicated only because the real key is a combination of columns you have not identified yet.
🟨 Interview Triggers (What Interviewers Actually Test)
Common Interview Questions
- "How do you identify duplicates in a dataset?"
  - Answer: df.duplicated() for exact duplicates; pass subset= with the key columns to check partial duplicates, and keep=False to surface every copy for inspection.
- "When would duplicates be legitimate?"
  - Answer: time series, transaction data, multi-part records; the repeated rows are real events, not errors.
- "What if you have partial duplicates (same ID, different values)?"
  - Answer: investigate before dropping. If the rows are versions of one record, keep the most recent (sort by timestamp, keep='last'); if the values are additive (e.g., purchase amounts), aggregate them; if they genuinely conflict, flag them for the data owner rather than guessing.
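For the partial-duplicates question, one resolution can be sketched in code (hypothetical column names): aggregate additive values, keep the latest version of conflicting ones.

```python
import pandas as pd

# Two rows for customer 7: emails conflict, amounts are separate purchases.
df = pd.DataFrame({
    "customer_id": [7, 7],
    "email": ["old@x.com", "new@x.com"],
    "amount": [20, 30],
})

resolved = df.groupby("customer_id", as_index=False).agg(
    email=("email", "last"),   # assume the later row is the update
    amount=("amount", "sum"),  # purchases are additive, not errors
)
print(resolved.to_dict("records"))
```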
🟥 Common Mistakes (Traps to Avoid)
Mistake 1: Removing duplicates without investigation
Dropping rows before understanding why they exist can silently delete valid data (e.g., two identical cash purchases on the same day). Count duplicates first, inspect them with keep=False, and spot-check a few groups by hand before removing anything.
Mistake 2: Only checking exact duplicates
df.duplicated() with no arguments requires every column to match, so a differing timestamp, ID suffix, or stray whitespace hides a real duplicate. Also check key-column subsets, and normalize strings (strip, lowercase) before comparing.
Mistake 3: Not preserving any duplicate record
drop_duplicates(keep=False) removes every member of each duplicate group, so the record disappears entirely. Usually you want keep='first' or keep='last' (after sorting appropriately) so exactly one copy survives.
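The difference between the keep settings in two lines (toy data):

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 2]})

print(len(df.drop_duplicates(keep="first")))  # 2: one copy of id=1 survives
print(len(df.drop_duplicates(keep=False)))    # 1: both id=1 rows are dropped
```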
🟩 Mini Example (Quick Application)
Scenario
A customer transactions file (data.csv) contains both exact duplicate rows (e.g., from a double export) and repeated customer_ids from legitimate multiple purchases. Quantify both kinds of duplication, inspect them, then deduplicate.
Solution
import pandas as pd
# Load data
df = pd.read_csv('data.csv')
# Detect exact duplicates
print(f"Exact duplicates: {df.duplicated().sum()}")
# Detect partial duplicates (based on key columns)
print(f"Duplicate IDs: {df.duplicated(subset=['customer_id']).sum()}")
# View duplicates
duplicates = df[df.duplicated(keep=False)]
print(duplicates)
# Remove exact duplicates
df_exact = df.drop_duplicates()
# Remove based on key subset, keeping the first occurrence per customer_id
df_by_key = df.drop_duplicates(subset=['customer_id'], keep='first')
🔗 Related Topics
Navigation: