🟪 1-Minute Summary

Missing values are inevitable in real datasets. There are three main treatment options: (1) drop rows or columns (a common rule of thumb: only when the data are MCAR and less than 5% is missing), (2) impute (mean or median for numerical features, mode for categorical features, or advanced methods such as KNN or MICE), and (3) add a missing-indicator feature (when the missingness itself is informative). The right choice depends on the missingness mechanism (MCAR, MAR, MNAR) and on the percentage of values missing.


🟦 Core Notes (Must-Know)

Types of Missingness

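Missingness is classified by what drives it:

  • MCAR (Missing Completely At Random): the probability that a value is missing is unrelated to any data, observed or unobserved.
  • MAR (Missing At Random): the probability depends only on other observed variables, not on the missing value itself.
  • MNAR (Missing Not At Random): the probability depends on the unobserved value itself, so the gap carries information.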
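
The three mechanisms are easier to tell apart when simulated. The sketch below is illustrative only: the columns (`age`, `income`), the thresholds, and the missing rates are all made-up assumptions. It punches holes into the same synthetic column under each mechanism and compares the observed means.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical columns: age is always observed, income receives the gaps
age = rng.integers(18, 80, size=n)
income = rng.normal(50_000, 15_000, size=n)

# MCAR: every row has the same 20% chance of being missing
mcar_mask = rng.random(n) < 0.20

# MAR: the chance depends on an observed column (older respondents skip more)
mar_mask = rng.random(n) < np.where(age > 60, 0.40, 0.10)

# MNAR: the chance depends on the unobserved value itself (high earners skip more)
mnar_mask = rng.random(n) < np.where(income > 65_000, 0.50, 0.10)

df = pd.DataFrame({
    "age": age,
    "income_mcar": np.where(mcar_mask, np.nan, income),
    "income_mar": np.where(mar_mask, np.nan, income),
    "income_mnar": np.where(mnar_mask, np.nan, income),
})

# Under MNAR the observed mean is pulled down because high values vanish more often
print(df[["income_mcar", "income_mar", "income_mnar"]].mean())
print("true mean:", income.mean())
```

In real data the mechanism cannot be read off the observed values alone. MNAR in particular usually has to be argued from domain knowledge.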

Detection Strategies

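Count missing values per column (df.isnull().sum()), convert the counts to percentages, and visualize the pattern (for example with missingno's msno.matrix) to see whether gaps cluster in particular rows or columns.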

Treatment Options

Option 1: Drop

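Drop the affected rows (or a column that is mostly empty) when the data are MCAR and less than 5% is missing. Dropping under MAR or MNAR can bias the remaining sample.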
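
A minimal sketch of the drop option in pandas. The data and column names are hypothetical, and the 50% column cutoff is an arbitrary assumption chosen for the demo, not a standard. The toy frame is far too small for the percentages to mean anything; it only shows the mechanics.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "age" has one gap, "notes" is mostly empty
df = pd.DataFrame({
    "age": [25, 31, np.nan, 47, 52, 38],
    "income": [40_000, 52_000, 61_000, 58_000, 75_000, 49_000],
    "notes": [None, None, "vip", None, None, None],
})

missing_share = df.isna().mean()  # fraction missing per column

# Drop columns that are mostly empty (the 50% cutoff is an assumption)
sparse_cols = missing_share[missing_share > 0.5].index
df_cols = df.drop(columns=sparse_cols)

# Drop the remaining rows that contain any missing value
df_clean = df_cols.dropna()

print(f"rows kept: {len(df_clean)} of {len(df)}")
```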

Option 2: Simple Imputation

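Replace missing values with a single summary statistic: mean or median for numerical features, mode for categorical features. It is fast, but it shrinks variance and ignores relationships between features.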
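
A small sketch of simple imputation on hypothetical data. The numeric column goes through scikit-learn's `SimpleImputer` with the median, which is less sensitive to the outlier than the mean. The categorical column is filled with its mode using plain pandas.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data: 100.0 is an outlier that would drag the mean upward
df = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, 31.0, 100.0],
    "city": ["NY", "LA", None, "NY", "SF"],
})

df_imputed = df.copy()

# Numerical feature: fill with the median of the observed values
num_imputer = SimpleImputer(strategy="median")
df_imputed["age"] = num_imputer.fit_transform(df[["age"]]).ravel()

# Categorical feature: fill with the most frequent observed category
city_mode = df["city"].mode().iloc[0]
df_imputed["city"] = df["city"].fillna(city_mode)

print(df_imputed)
```

`SimpleImputer(strategy="most_frequent")` is the scikit-learn counterpart of the pandas mode fill and fits more naturally inside a pipeline.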

Option 3: Advanced Imputation

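Estimate each missing value from the other features, for example with KNN (nearest neighbors) or MICE (Multiple Imputation by Chained Equations).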
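
A hedged sketch of both approaches with scikit-learn, on a made-up numeric array. `KNNImputer` averages the values of the nearest rows. `IterativeImputer` is scikit-learn's MICE-inspired imputer: it is still marked experimental (hence the extra import) and by default returns a single imputed dataset rather than multiple imputations.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer, KNNImputer

# Hypothetical numeric data: the second column is roughly twice the first
X = np.array([
    [1.0, 2.0],
    [2.0, 4.1],
    [3.0, np.nan],
    [4.0, 8.2],
    [5.0, 9.9],
    [np.nan, 12.0],
])

# KNN: fill each gap with the mean of that feature over the 2 nearest rows
knn = KNNImputer(n_neighbors=2)
X_knn = knn.fit_transform(X)

# Iterative: model each feature from the others, cycling for up to 10 rounds
iterative = IterativeImputer(max_iter=10, random_state=0)
X_iter = iterative.fit_transform(X)

print("KNN:\n", X_knn)
print("Iterative:\n", X_iter)
```

Both imputers work on numeric input only, so categorical features need encoding first.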

Option 4: Missing Indicator Feature

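Add a binary column that flags where a value was missing, usually alongside imputation. Use it when the missingness itself is informative.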
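
A minimal sketch of the indicator approach, assuming a hypothetical `income` column. The flag is created before imputing so that it records the original gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical data
df = pd.DataFrame({"income": [40_000.0, np.nan, 61_000.0, np.nan, 75_000.0]})

# Flag first, then impute, so the flag reflects the original missingness
df["income_missing"] = df["income"].isna().astype(int)
df["income"] = df["income"].fillna(df["income"].median())

print(df)
```

In a scikit-learn workflow, `SimpleImputer(add_indicator=True)` appends equivalent indicator columns to the imputed output.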

Decision Framework

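Start with the missingness mechanism (MCAR, MAR, MNAR), then look at the percentage missing: drop if the data are MCAR and less than 5% is missing, otherwise impute, and add a missing indicator when the missingness itself is informative.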


🟨 Interview Triggers (What Interviewers Actually Test)

Common Interview Questions

  1. “30% of values are missing in a key feature. What do you do?”

    • [Answer framework: Check if missingness is informative first]
  2. “When would you drop rows vs impute?”

    • [Answer: Drop if MCAR and < 5%, otherwise impute]
  3. “What’s wrong with always using mean imputation?”

    • [Answer: Reduces variance, doesn’t work for MNAR, ignores relationships]
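
The variance point in the last answer can be shown on synthetic data. In this sketch the distribution parameters and the 30% missing rate are arbitrary assumptions. Mean imputation adds points exactly at the mean, so the standard deviation shrinks while the mean stays put.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic feature with a known spread
x = pd.Series(rng.normal(loc=50, scale=10, size=1_000))

# Remove about 30% of values completely at random
x_missing = x.copy()
x_missing[rng.random(len(x)) < 0.30] = np.nan

# Mean imputation: every gap receives the same value
x_mean_imputed = x_missing.fillna(x_missing.mean())

print(f"std of observed values:    {x_missing.std():.2f}")
print(f"std after mean imputation: {x_mean_imputed.std():.2f}")
```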

🟥 Common Mistakes (Traps to Avoid)

Mistake 1: Always using mean/median imputation

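Mean or median imputation shrinks the feature's variance, ignores relationships with other features, and does not correct the bias when data are MNAR.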

Mistake 2: Dropping rows before checking patterns

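Dropping rows before inspecting the missingness pattern can discard informative missingness and, unless the data are MCAR, bias the remaining sample.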

Mistake 3: Imputing before train-test split

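Fitting an imputer on the full dataset lets statistics from the test rows influence the training data, which causes data leakage. Split first, fit the imputer on the training set only, then apply it to the test set.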
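
A short sketch of the leakage-safe order on synthetic data. The array shapes, the 10% missing rate, and the median strategy are assumptions for illustration. The imputer learns its statistics from the training split only and reuses them on the test split.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic features with about 10% of cells missing, plus a binary target
X = rng.normal(size=(200, 3))
X[rng.random(X.shape) < 0.10] = np.nan
y = rng.integers(0, 2, size=200)

# 1. Split first
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 2. Fit the imputer on the training rows only
imputer = SimpleImputer(strategy="median")
X_train_imp = imputer.fit_transform(X_train)

# 3. Apply the training medians to the test rows (no refitting)
X_test_imp = imputer.transform(X_test)
```

Placing the imputer inside a scikit-learn `Pipeline` applies the same discipline automatically within each cross-validation fold.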


🟩 Mini Example (Quick Application)

Scenario

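A DataFrame `df` has missing values in several columns. Quantify the gaps per column and inspect their pattern before choosing a treatment.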

Solution

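import numpy as np
import pandas as pd
import missingno as msno  # third-party package: pip install missingno
from sklearn.impute import SimpleImputer, KNNImputer

# Hypothetical data for illustration; replace with your own DataFrame
df = pd.DataFrame({
    "age": [25, np.nan, 47, 31, np.nan],
    "city": ["NY", "LA", None, "NY", "SF"],
})

# Detection: count of missing values per column
print(df.isnull().sum())
print(df.isnull().sum() / len(df) * 100)  # Percentage

# Visualization: one line per row, gaps mark missing cells
msno.matrix(df)

# Treatment examples to be filled in...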

