🟪 1-Minute Summary
Missing values are inevitable in real datasets. Treatment options: (1) Drop (if < 5% missing and MCAR), (2) Impute (mean/median/mode for numerical, mode for categorical, or advanced methods like KNN/MICE), (3) Create a missing indicator (if missingness is informative). The choice depends on the missingness mechanism (MCAR, MAR, MNAR) and the percentage missing.
🟦 Core Notes (Must-Know)
Types of Missingness
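- MCAR (Missing Completely At Random): the chance that a value is missing is unrelated to any data, observed or unobserved.
- MAR (Missing At Random): the chance that a value is missing depends only on other observed variables, not on the missing value itself.
- MNAR (Missing Not At Random): the chance that a value is missing depends on the unobserved value itself, for example high earners declining to report income.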
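The three mechanisms can be made concrete with a small simulation. This is an illustrative sketch only: the column names (`age`, `income`), the sample size, and the masking probabilities are assumptions chosen for the demonstration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000
age = rng.integers(20, 70, size=n)
income = 20_000 + 1_000 * (age - 20) + rng.normal(0, 5_000, size=n)

# MCAR: every income value has the same 20% chance of being missing.
mcar_mask = rng.random(n) < 0.20

# MAR: missingness depends on an observed column (age), not on income itself.
mar_mask = rng.random(n) < np.where(age < 40, 0.35, 0.05)

# MNAR: missingness depends on the hidden value (high incomes go missing more often).
mnar_mask = rng.random(n) < np.where(income > np.median(income), 0.35, 0.05)

df = pd.DataFrame({
    "age": age,
    "income_mcar": np.where(mcar_mask, np.nan, income),
    "income_mar": np.where(mar_mask, np.nan, income),
    "income_mnar": np.where(mnar_mask, np.nan, income),
})

# Compare the observed mean of each variant with the true mean.
print("true mean:", round(income.mean()))
print(df[["income_mcar", "income_mar", "income_mnar"]].mean().round())
```

In this setup the MCAR mean should stay close to the true mean, while the MAR and MNAR means drift. The MAR drift can in principle be corrected using the observed `age` column; the MNAR drift cannot be corrected from the observed data alone.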
Detection Strategies
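Count missing values per column, compute the percentage missing, and visualize the missingness pattern (for example with missingno) to see whether gaps cluster in particular rows or columns.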
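Beyond counting nulls, checking whether the missing rate of one column varies with another observed column can hint at MAR rather than MCAR. A minimal sketch, assuming a hypothetical DataFrame with `age_group` and `income` columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age_group": ["young", "young", "young", "old", "old", "old"],
    "income": [np.nan, np.nan, 30_000, 52_000, 61_000, np.nan],
})

# Share of missing values per column (0.0 to 1.0).
missing_share = df.isna().mean()
print(missing_share)

# Missing rate of income within each age group; a large gap between groups
# suggests the missingness is related to age_group.
rate_by_group = df["income"].isna().groupby(df["age_group"]).mean()
print(rate_by_group)
```

With only six rows this proves nothing statistically; on real data the same comparison is usually paired with a significance test or a plot.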
Treatment Options
Option 1: Drop
Drop rows when the data are MCAR and less than about 5% of values are missing.
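For the drop option, a minimal pandas sketch might look like the following. The column names and the 50% column cutoff are assumptions for illustration, not thresholds taken from these notes.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": ["x", "y", "z", None],
})

# Drop columns where more than half of the values are missing
# (the 50% cutoff is an illustrative choice).
col_missing = df.isna().mean()
df_cols = df.loc[:, col_missing <= 0.5]

# Drop rows that are missing a value in the key column "a".
df_rows = df_cols.dropna(subset=["a"])
print(df_rows)
```

Using `subset` keeps rows that are only missing values in less important columns, which avoids throwing away more data than necessary.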
Option 2: Simple Imputation
Fill numerical features with the mean or median and categorical features with the mode.
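Simple imputation can be sketched directly in pandas. The `age` and `city` columns below are hypothetical example data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0, 90.0],
    "city": ["Paris", "Lyon", None, "Paris", "Paris"],
})

# Numerical: the median is less sensitive to outliers (such as 90) than the mean.
age_median = df["age"].median()
df["age"] = df["age"].fillna(age_median)

# Categorical: fill with the most frequent category (the mode).
city_mode = df["city"].mode().iloc[0]
df["city"] = df["city"].fillna(city_mode)
print(df)
```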
Option 3: Advanced Imputation
Use model-based methods such as KNN or MICE, which estimate each missing value from the other features.
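A minimal sketch of both advanced approaches using scikit-learn. `IterativeImputer` is scikit-learn's MICE-inspired imputer and is still marked experimental, so it needs the explicit enabling import. The toy array is made up for illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must precede the IterativeImputer import)
from sklearn.impute import IterativeImputer, KNNImputer

X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, np.nan],
    [4.0, 8.0],
    [5.0, 10.0],
])

# KNN: fill the gap with the average of the 2 nearest rows
# (distance is computed on the features that are present).
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)

# MICE-style: model each feature from the others and iterate.
X_mice = IterativeImputer(random_state=0).fit_transform(X)

print(X_knn[2, 1], X_mice[2, 1])
```

Both methods use the relationship between the two columns, which mean imputation would ignore. KNN is distance-based, so features on very different scales usually need scaling first.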
Option 4: Missing Indicator Feature
Add a binary flag marking which values were missing when the missingness itself is informative.
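A missing indicator can be sketched in two lines of pandas. The `income` column and the `income_missing` name are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [52_000.0, np.nan, 61_000.0, np.nan, 48_000.0]})

# Record which rows were missing BEFORE imputing, so the signal is not lost.
df["income_missing"] = df["income"].isna().astype(int)

# Then impute the original column (median used here for illustration).
df["income"] = df["income"].fillna(df["income"].median())
print(df)
```

In scikit-learn, `SimpleImputer(add_indicator=True)` offers a similar effect inside a pipeline.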
Decision Framework
Choose a treatment based on the missingness mechanism (MCAR, MAR, MNAR) and the percentage missing.
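One way to make the decision logic explicit is a small helper function. This is a rough sketch, not a definitive rule set: `suggest_treatment` is a hypothetical name, the 5% threshold comes from the summary above, and mapping MAR to advanced imputation is an assumption.

```python
def suggest_treatment(pct_missing: float, mechanism: str) -> str:
    """Rough rule of thumb; thresholds are illustrative, not definitive."""
    mechanism = mechanism.upper()
    if mechanism not in {"MCAR", "MAR", "MNAR"}:
        raise ValueError(f"unknown mechanism: {mechanism}")
    if mechanism == "MNAR":
        # Missingness itself carries information, so keep it as a feature.
        return "impute + missing indicator"
    if mechanism == "MCAR" and pct_missing < 5:
        return "drop rows"
    if mechanism == "MAR":
        # Other observed features can help predict the missing values.
        return "advanced imputation (KNN/MICE)"
    return "simple imputation (median/mode)"


print(suggest_treatment(3, "MCAR"))
print(suggest_treatment(30, "MNAR"))
```

In practice the mechanism is rarely known for certain, so the output is a starting point for discussion rather than a verdict.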
🟨 Interview Triggers (What Interviewers Actually Test)
Common Interview Questions
- “30% of values are missing in a key feature. What do you do?”
  - [Answer framework: Check if missingness is informative first]
- “When would you drop rows vs impute?”
  - [Answer: Drop if MCAR and < 5%, otherwise impute]
- “What’s wrong with always using mean imputation?”
  - [Answer: Reduces variance, doesn’t work for MNAR, ignores relationships]
🟥 Common Mistakes (Traps to Avoid)
Mistake 1: Always using mean/median imputation
Mean imputation reduces variance, ignores relationships between features, and does not work for MNAR data.
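The variance shrinkage caused by mean imputation is easy to see on synthetic data. The distribution parameters and the 30% missing rate below are arbitrary choices for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(loc=50, scale=10, size=1_000)

# Hide about 30% of the values completely at random.
mask = rng.random(x.size) < 0.30
observed = x[~mask]

# Mean imputation: every missing value becomes the observed mean.
x_imputed = x.copy()
x_imputed[mask] = observed.mean()

print("std before:", round(x.std(), 2))
print("std after mean imputation:", round(x_imputed.std(), 2))
```

The mean is preserved, but the spread shrinks because a block of identical values is stacked at the center of the distribution.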
Mistake 2: Dropping rows before checking patterns
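Dropping rows first discards any signal in the missingness pattern. Check whether missingness is informative before removing anything.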
Mistake 3: Imputing before train-test split
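Fitting an imputer on the full dataset lets test-set statistics leak into training (data leakage). Split first, fit the imputer on the training set only, then apply it to the test set.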
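A leakage-safe ordering can be sketched as follows. The toy arrays are made up; the point is only the order of `train_test_split`, `fit_transform`, and `transform`.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

X = np.array([[1.0], [2.0], [np.nan], [4.0], [100.0], [np.nan], [3.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1, 0, 1])

# Split first, so the test rows never influence the imputation statistic.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

imputer = SimpleImputer(strategy="median")
X_train_imp = imputer.fit_transform(X_train)  # learn the median from train only
X_test_imp = imputer.transform(X_test)        # reuse that median on test

print("median learned from train:", imputer.statistics_[0])
```

Wrapping the imputer and the model in a scikit-learn `Pipeline` gives the same guarantee automatically during cross-validation.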
🟩 Mini Example (Quick Application)
Scenario
[Missing value treatment example]
Solution
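import numpy as np
import pandas as pd
import missingno as msno  # third-party package: pip install missingno
from sklearn.impute import SimpleImputer, KNNImputer

# Small illustrative DataFrame (hypothetical data) so the snippet runs on its own
df = pd.DataFrame({"age": [25, np.nan, 40, 31], "city": ["Paris", None, "Lyon", "Paris"]})

# Detection: count of missing values per column
print(df.isnull().sum())
# Detection: percentage of missing values per column
print(df.isnull().sum() / len(df) * 100)

# Visualization: one row per record, gaps mark missing values
msno.matrix(df)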
# Treatment examples to be filled in...
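One possible treatment step, sketched under assumptions: the DataFrame below is hypothetical, and the choice of median plus indicator for numerical columns and most-frequent for the categorical column is just one reasonable combination of the options in these notes.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "income": [30_000.0, 42_000.0, np.nan, 38_000.0],
    "city": ["Paris", np.nan, "Lyon", "Paris"],
})

treatment = ColumnTransformer([
    # Numerical columns: median imputation plus one missing-indicator column each.
    ("num", SimpleImputer(strategy="median", add_indicator=True), ["age", "income"]),
    # Categorical column: most frequent value.
    ("cat", SimpleImputer(strategy="most_frequent"), ["city"]),
])

# Output order: imputed numerics, their indicators, then the categorical column.
filled = pd.DataFrame(
    treatment.fit_transform(df),
    columns=["age", "income", "age_missing", "income_missing", "city"],
)
print(filled)
```

On real data, `treatment` would be fitted on the training split only and then applied to the test split with `transform`.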
🔗 Related Topics