🟪 1-Minute Summary

Duplicates are rows that appear multiple times in a dataset. Types: (1) exact duplicates (all columns identical), (2) partial duplicates (key columns identical but other columns differ). Detection: df.duplicated(). Treatment depends on context: remove if they are errors, keep if legitimate (e.g., multiple purchases by the same customer). Always investigate before blindly dropping.


🟦 Core Notes (Must-Know)

Types of Duplicates

  • Exact duplicates: every column is identical across two or more rows. These are almost always errors (double loads, repeated form submissions).
  • Partial duplicates: the key columns (e.g., customer_id) match, but other columns differ. These need investigation: one copy may be stale, or both may be legitimate.
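The distinction is easiest to see on a tiny hypothetical table (column names invented for illustration):

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 are exact duplicates;
# rows 2 and 3 share customer_id but differ in amount (partial duplicate)
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 2],
    "amount": [10.0, 10.0, 25.0, 30.0],
})

print(df.duplicated().sum())                        # exact duplicates → 1
print(df.duplicated(subset=["customer_id"]).sum())  # key-based duplicates → 2
```

Note that the key-based count is larger: it flags rows 2 and 3 as duplicates even though their amounts differ, which is exactly the case that needs investigation rather than deletion.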

Why Duplicates Occur

  • Data entry errors (the same record keyed in twice)
  • Merges/joins that multiply rows (e.g., joining on a non-unique key)
  • Repeated loads or ETL reruns appending the same batch twice
  • Multiple source systems exporting overlapping records

Detection Methods

  • df.duplicated(): boolean mask marking rows that repeat an earlier row (all columns)
  • df.duplicated(subset=['customer_id']): check only the key columns
  • keep=False: flag every copy, not just the later ones, for inspection
  • df['key'].value_counts(): any count greater than 1 indicates a repeated key
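A minimal detection sketch, assuming a small invented orders table:

```python
import pandas as pd

df = pd.DataFrame({
    "order_id": [100, 100, 101, 102],
    "customer_id": [1, 1, 2, 3],
    "amount": [10.0, 10.0, 25.0, 5.0],
})

# Boolean mask of later occurrences (the first copy is marked False by default)
mask = df.duplicated()

# keep=False flags *every* row in a duplicate group, which is
# what you want when inspecting duplicates manually
all_copies = df[df.duplicated(keep=False)]

# Frequency of each key: any count > 1 is a repeated key
counts = df["order_id"].value_counts()
print(all_copies)
print(counts[counts > 1])
```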

Treatment Strategies

  • Exact duplicates from errors: df.drop_duplicates()
  • Partial duplicates: decide per case. Keep the most recent record (sort by timestamp, keep='last'), aggregate (e.g., sum transaction amounts), or go back to the source system.
  • Document how many rows were removed and why, so the cleaning step is auditable.
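A sketch of two common strategies, assuming an invented table with an updated_at timestamp:

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "updated_at": pd.to_datetime(["2024-01-01", "2024-03-01", "2024-02-01"]),
    "amount": [10.0, 12.0, 25.0],
})

# Strategy A: keep the most recent record per key.
# Sort by timestamp so drop_duplicates(keep='last') retains the newest row.
latest = (df.sort_values("updated_at")
            .drop_duplicates(subset=["customer_id"], keep="last"))

# Strategy B: aggregate instead of dropping (e.g., total amount per customer)
totals = df.groupby("customer_id", as_index=False)["amount"].sum()
```

Which strategy applies depends on semantics: "latest wins" suits slowly changing attributes, while aggregation suits transactional amounts.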

When NOT to Remove Duplicates

  • Transaction data: the same customer legitimately buys the same item twice
  • Time series: repeated readings for the same entity at different timestamps
  • Multi-part records: one order split across several shipment rows
Rule of thumb: if a natural key (e.g., order_id + timestamp) distinguishes the rows, they are not duplicates.


🟨 Interview Triggers (What Interviewers Actually Test)

Common Interview Questions

  1. “How do you identify duplicates in a dataset?”

    • Answer: use df.duplicated() for exact duplicates; pass subset= with the key columns for partial duplicates, and keep=False to view every copy of a duplicated row.
  2. “When would duplicates be legitimate?”

    • Answer: when repeats reflect real events: repeated readings in a time series, multiple transactions by the same customer, or multi-part records such as one order split across shipments.
  3. “What if you have partial duplicates (same ID, different values)?”

    • Answer: investigate before dropping. Check whether the non-key values actually conflict (groupby + nunique); if a timestamp exists, keep the most recent record; if both rows are valid, aggregate; if the correct value is unknown, escalate to the data owner.
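One way to find keys whose values genuinely conflict, as opposed to merely repeated rows, on hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 1, 2, 2],
    "email": ["a@x.com", "b@x.com", "c@x.com", "c@x.com"],
})

# Count distinct values per key: more than one means a real conflict.
# customer 2's repeated identical email is not a conflict.
nunique = df.groupby("customer_id")["email"].nunique()
conflicting = nunique[nunique > 1].index.tolist()
print(conflicting)  # customer 1 has two different emails
```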

🟥 Common Mistakes (Traps to Avoid)

Mistake 1: Removing duplicates without investigation

Dropping rows before checking why they exist can silently delete valid data, e.g., a customer's second purchase of the same item on the same day. Always inspect a sample of the flagged rows and trace them back to the source before removing anything.

Mistake 2: Only checking exact duplicates

df.duplicated() with default arguments only catches rows identical in every column. Near-duplicates slip through: the same entity with a trailing space, different casing, or one stale field. Also check key-column subsets, and normalize strings before comparing.

Mistake 3: Not preserving any duplicate record

drop_duplicates(keep=False) deletes every copy of a duplicated row, so the record vanishes entirely. Use keep='first' or keep='last' so exactly one row per duplicate group survives.
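A quick sketch of how the keep parameter changes what survives, on invented data:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 2], "v": ["old", "new", "x"]})

first = df.drop_duplicates(subset=["id"], keep="first")  # retains "old"
last = df.drop_duplicates(subset=["id"], keep="last")    # retains "new"
gone = df.drop_duplicates(subset=["id"], keep=False)     # drops both copies of id 1
```

Only keep=False loses the record for id 1 entirely, which is the trap this mistake describes.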


🟩 Mini Example (Quick Application)

Scenario

A customer transactions table was accidentally loaded twice, creating exact duplicates. Separately, some customer_ids appear on multiple rows because those customers made several legitimate purchases. The task is to separate the two cases before cleaning.

Solution

import pandas as pd

# Load data
df = pd.read_csv('data.csv')

# Detect exact duplicates
print(f"Exact duplicates: {df.duplicated().sum()}")

# Detect partial duplicates (based on key columns)
print(f"Duplicate IDs: {df.duplicated(subset=['customer_id']).sum()}")

# View duplicates
duplicates = df[df.duplicated(keep=False)]
print(duplicates)

# Remove exact duplicates
df_clean = df.drop_duplicates()

# Remove partial duplicates based on key columns, keeping the first occurrence
df_clean_keys = df.drop_duplicates(subset=['customer_id'], keep='first')

