🟪 1-Minute Summary

Outliers are data points that differ significantly from the rest of the data. Detection methods: (1) IQR method (flag values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR), (2) Z-score (flag |z| > 3), (3) Visual inspection (box plots, scatter plots). Treatment: (1) Remove if they are errors, (2) Cap/floor (winsorization), (3) Transform (e.g. log), (4) Keep if legitimate. NEVER remove blindly: investigate first!


🟦 Core Notes (Must-Know)

What are Outliers?

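Outliers are data points that differ significantly from the rest of the data. They may be errors or legitimate extreme values, so investigate before treating them.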

Types of Outliers

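  • Univariate outliers: extreme values in a single variable
  • Multivariate outliers: unusual combinations of values across two or more variables, even when each value looks ordinary on its own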
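
To make the univariate/multivariate distinction concrete, here is a minimal sketch on synthetic data. It uses Mahalanobis distance, which these notes do not name, so treat the technique as an illustrative assumption; the data, the seed, and the injected point are all made up.

```python
import numpy as np

# Synthetic data (assumption): y roughly follows 2 * x.
rng = np.random.default_rng(0)
x = np.arange(20, dtype=float)
y = 2 * x + rng.normal(0, 1, size=20)
X = np.column_stack([x, y])

# Inject one point whose x and y are each in range but do not fit together.
X = np.vstack([X, [5.0, 30.0]])

# Univariate view: per-column absolute z-scores.
mean = X.mean(axis=0)
z = np.abs((X - mean) / X.std(axis=0))

# Multivariate view: Mahalanobis distance from the mean.
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mean
mahalanobis = np.sqrt(np.einsum("ij,jk,ik->i", diff, inv_cov, diff))

print("per-column |z| of injected point:", z[-1])
print("row with largest Mahalanobis distance:", int(np.argmax(mahalanobis)))
```

The injected point stays under the usual |z| > 3 cutoff in each column, yet it has the largest joint distance, which is the sense in which it is a multivariate outlier.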

Detection Methods

IQR Method

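Compute the first and third quartiles (Q1, Q3) and the interquartile range IQR = Q3 - Q1. Flag any value below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.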
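
As a small reusable sketch of the IQR rule, the helper below returns the two fences for a series. The function name `iqr_bounds`, the multiplier argument `k`, and the toy salaries (in thousands) are assumptions for illustration.

```python
import pandas as pd


def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy salaries in thousands (made-up values).
salary = pd.Series([42, 45, 47, 50, 52, 55, 58, 60, 63, 400])
lower, upper = iqr_bounds(salary)
flagged = salary[(salary < lower) | (salary > upper)]
print(f"fences: [{lower}, {upper}]")
print(flagged)
```

Because quartiles barely move when one value is extreme, the fences stay close to the bulk of the data and only the 400 entry is flagged.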

Z-Score Method

Standardize each value as z = (x - mean) / standard deviation, then flag values with z > 3 or z < -3.
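
One caveat worth seeing in code: the mean and standard deviation are themselves pulled by extreme values, so on small samples the |z| > 3 rule can miss an obvious outlier. With the population standard deviation (ddof=0, the `scipy.stats.zscore` default), |z| cannot exceed sqrt(n - 1) for n points. The sketch below uses made-up salaries in thousands.

```python
import pandas as pd

# Toy salaries in thousands (made-up values); 400 is the obvious outlier.
salary = pd.Series([42, 45, 47, 50, 52, 55, 58, 60, 63, 400])

# Z-score with the population standard deviation (ddof=0).
z = (salary - salary.mean()) / salary.std(ddof=0)

flagged = salary[z.abs() > 3]
print(f"largest |z|: {z.abs().max():.3f}")
print(f"flagged by |z| > 3: {len(flagged)}")
```

Here n = 10, so |z| tops out at 3 and the 400 entry slips just under the cutoff. On samples this small, the IQR rule or a visual check is a useful cross-check.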

Visual Methods

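Plot the data and look for points that sit far from the rest: box plots and histograms for a single variable, scatter plots for pairs of variables.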

Treatment Strategies

Option 1: Remove

Remove an outlier only when investigation shows it is an error. Never remove blindly.

Option 2: Cap/Floor (Winsorization)

Replace values beyond a lower or upper bound with the bound itself, so every row is kept but extreme values are limited.
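
A minimal sketch of percentile-based capping with pandas follows. The 5th and 95th percentile cutoffs and the toy salaries (in thousands) are assumptions; pick cutoffs that suit the feature.

```python
import pandas as pd

# Toy salaries in thousands (made-up values).
salary = pd.Series([42, 45, 47, 50, 52, 55, 58, 60, 63, 400])

# Assumed cutoffs: 5th and 95th percentiles.
low = salary.quantile(0.05)
high = salary.quantile(0.95)

# Values below `low` become `low`; values above `high` become `high`.
capped = salary.clip(lower=low, upper=high)

print(pd.DataFrame({"original": salary, "capped": capped}))
```

All rows are kept and only the two tails change. On a sample this small the upper percentile is still pulled up by the extreme value, so the capped maximum remains well above the rest.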

Option 3: Transform

Apply a transformation such as the logarithm to compress large values and reduce their influence while keeping every row.
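
The effect of a log transform on a right-skewed feature can be sketched as below. The synthetic lognormal "income" data and the seed are assumptions; `np.log1p` (log of 1 + x) is used so that zeros would not break the transform.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data (assumption): lognormal "income" values.
rng = np.random.default_rng(42)
income = pd.Series(rng.lognormal(mean=10, sigma=1, size=1000), name="income")

# log1p compresses the long right tail.
income_log = np.log1p(income)

skew_before = income.skew()
skew_after = income_log.skew()
print(f"skew before: {skew_before:.2f}")
print(f"skew after:  {skew_after:.2f}")
```

A log transform only applies to non-negative values, and any model output on the transformed scale has to be mapped back (for example with `np.expm1`) before it is reported.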

Option 4: Keep

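Keep outliers that are legitimate observations, such as rare events, genuine extreme values, or the very cases of interest in fraud detection.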


🟨 Interview Triggers (What Interviewers Actually Test)

Common Interview Questions

  1. “How do you detect outliers?”
    • [Answer: IQR method, Z-score, visual inspection]
  2. “When would you keep outliers instead of removing them?”
    • [Answer: Fraud detection, rare events, legitimate extreme values]
  3. “What’s the difference between outliers and anomalies?”
    • [Answer framework to be filled in]

🟥 Common Mistakes (Traps to Avoid)

Mistake 1: Removing outliers without understanding why they exist

[Content to be filled in]

Mistake 2: Using same threshold for all features

[Content to be filled in]
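
One reading of this mistake is applying a single set of numeric bounds to every column. The sketch below contrasts that with per-feature IQR fences; the column names and values are made up (salary in thousands).

```python
import pandas as pd

# Toy employee data (made-up values).
df = pd.DataFrame({
    "age": [25, 28, 30, 32, 35, 38, 40, 42, 45, 47],
    "salary": [42, 45, 47, 50, 52, 55, 58, 60, 63, 400],
})

# Per-feature fences: quantile() returns one value per column.
q1 = df.quantile(0.25)
q3 = df.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
per_feature = (df < lower) | (df > upper)

# The mistake: reusing the salary fences for every column.
shared = (df < lower["salary"]) | (df > upper["salary"])

print("per-feature flags:\n", per_feature.sum())
print("shared-threshold flags:\n", shared.sum())
```

With per-feature fences only the 400 salary is flagged. With the shared fences, three perfectly ordinary ages are flagged simply because age and salary live on different scales.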

Mistake 3: Treating outliers before understanding the business context

[Content to be filled in]


🟩 Mini Example (Quick Application)

Scenario

Detect and treat salary outliers in an employee dataset.

Solution

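import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Assumes df is an existing DataFrame with a numeric 'salary' column.

# IQR Method: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers_iqr = df[(df['salary'] < lower_bound) | (df['salary'] > upper_bound)]
print(f"Outliers (IQR): {len(outliers_iqr)}")

# Z-Score Method: flag values more than 3 standard deviations from the mean
# nan_policy='omit' keeps missing salaries from turning every z-score into NaN
z_scores = np.abs(stats.zscore(df['salary'], nan_policy='omit'))
outliers_z = df[z_scores > 3]
print(f"Outliers (Z-score): {len(outliers_z)}")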

# Visualization
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
sns.boxplot(x=df['salary'])
plt.title('Box Plot')
plt.subplot(1, 2, 2)
sns.histplot(df['salary'])
plt.title('Histogram')
plt.show()

# Treatment: Capping
df['salary_capped'] = df['salary'].clip(lower=lower_bound, upper=upper_bound)

