🟪 1-Minute Summary
Outliers are data points that differ significantly from the rest of the data. Detection methods: (1) IQR method (flag values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR), (2) Z-score (|z| > 3), (3) visual inspection (box plots, scatter plots). Treatment: (1) remove if they are errors, (2) cap/floor (winsorization), (3) transform (e.g. log), (4) keep if legitimate. Never remove outliers blindly; investigate first.
🟦 Core Notes (Must-Know)
What are Outliers?
[Content to be filled in]
Types of Outliers
[Content to be filled in]
- Univariate outliers
- Multivariate outliers
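To make the univariate/multivariate distinction concrete, here is a minimal sketch using Mahalanobis distance, one common way to score multivariate outliers (the notes themselves do not name a method, so this choice is an assumption). The `mahalanobis_distances` helper and the sample points are hypothetical.

```python
import numpy as np

def mahalanobis_distances(X):
    """Distance of each row from the data mean, scaled by the covariance."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    # Quadratic form centered[i] @ cov_inv @ centered[i] for every row i
    d2 = np.einsum("ij,jk,ik->i", centered, cov_inv, centered)
    return np.sqrt(d2)

# Hypothetical (x, y) pairs that rise together; the last row breaks the pattern
points = np.array([
    [1, 1.1], [2, 1.9], [3, 3.2], [4, 3.8], [5, 5.1],
    [6, 5.9], [7, 7.2], [8, 7.8], [9, 9.1], [10, 9.9],
    [3, 8.0],
])
distances = mahalanobis_distances(points)
most_unusual = int(np.argmax(distances))  # row index with the largest distance
```

In the last row, x = 3 and y = 8 each sit inside the normal range of their own column, so a per-column check would not flag them. Only the combination is unusual, which is what makes the point a multivariate outlier.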
Detection Methods
IQR Method
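Compute Q1 (25th percentile) and Q3 (75th percentile), then IQR = Q3 - Q1. Flag any value below Q1 - 1.5*IQR or above Q3 + 1.5*IQR. The bounds come from quartiles rather than the mean, so the extreme values themselves barely move them.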
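A small sketch of the IQR rule as a reusable function may help here. The `iqr_bounds` helper and the salary figures are hypothetical, chosen only to illustrate the calculation.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) IQR fences for a numeric series."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical salaries (in thousands), made up for illustration
salaries = pd.Series([42, 45, 47, 50, 52, 55, 58, 60, 250])
lower, upper = iqr_bounds(salaries)

# Keep only the values that fall outside the fences
outliers = salaries[(salaries < lower) | (salaries > upper)]
```

With this sample, Q1 = 47 and Q3 = 58, so the fences are 30.5 and 74.5 and only the value 250 is flagged. The multiplier `k` is left as a parameter because 1.5 is a convention, not a law.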
Z-Score Method
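Standardize each value as z = (x - mean) / standard deviation and flag values with |z| > 3. The mean and standard deviation are themselves pulled by extreme values, so this method works best on roughly normal data.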
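The Z-score rule can be sketched with plain NumPy as follows. The `zscore_flags` helper and the sample salaries are hypothetical.

```python
import numpy as np

def zscore_flags(values, threshold=3.0):
    """Boolean mask marking values with |z| above the threshold."""
    values = np.asarray(values, dtype=float)
    # Population standard deviation (ddof=0), the same default scipy.stats.zscore uses
    z = (values - values.mean()) / values.std()
    return np.abs(z) > threshold

# Hypothetical salaries (in thousands): 20 typical values plus one extreme
salaries = np.array([48, 50, 52, 49, 51] * 4 + [250])
flags = zscore_flags(salaries)
flagged = salaries[flags]
```

One caveat worth remembering: with the population standard deviation, the largest possible |z| in a sample of n points is (n - 1) / sqrt(n), so a sample of 10 or fewer points can never produce |z| > 3. On small datasets the IQR rule is usually the safer choice.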
Visual Methods
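Plot the feature before deciding anything. A box plot draws points beyond the whiskers individually, a histogram shows isolated bars far from the bulk of the data, and a scatter plot exposes points that break the relationship between two features.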
Treatment Strategies
Option 1: Remove
[Content to be filled in]
Option 2: Cap/Floor (Winsorization)
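Replace values below the lower bound with the lower bound and values above the upper bound with the upper bound (for example, the IQR bounds), so that every row stays in the dataset.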
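As a sketch of capping at percentiles, which is the usual meaning of winsorization, the snippet below clips a series at its 5th and 95th percentiles. The `cap_at_percentiles` helper, the percentile choices, and the data are assumptions for illustration.

```python
import pandas as pd

def cap_at_percentiles(values: pd.Series, lower_pct: float = 0.05,
                       upper_pct: float = 0.95) -> pd.Series:
    """Clip values to the given lower and upper percentiles."""
    lo = values.quantile(lower_pct)
    hi = values.quantile(upper_pct)
    return values.clip(lower=lo, upper=hi)

# Hypothetical salaries (in thousands), made up for illustration
salaries = pd.Series([30, 42, 45, 47, 50, 52, 55, 58, 60, 62, 250])
capped = cap_at_percentiles(salaries)
```

No rows are dropped: the extreme values are pulled in to the percentile limits and everything between the limits is left unchanged. Capping at the IQR bounds instead works the same way, with a different pair of limits passed to `clip`.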
Option 3: Transform
[Content to be filled in]
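A log transform can be sketched like this. The income values are hypothetical, and `np.log1p` is used here as one reasonable choice because it tolerates zeros.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed incomes (in thousands), made up for illustration
incomes = pd.Series([20, 25, 30, 40, 50, 80, 120, 200, 500, 1000])

# log1p computes log(1 + x), so zeros are safe; inputs must be greater than -1
log_incomes = np.log1p(incomes)

# Skewness before and after: the transform compresses the long right tail
raw_skew = incomes.skew()
log_skew = log_incomes.skew()

# expm1 inverts log1p when results must be reported on the original scale
restored = np.expm1(log_incomes)
```

The transform keeps every row and preserves the ordering of values. It only shrinks the gap between the largest values and the rest, which is why it suits right-skewed features such as income.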
Option 4: Keep
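Keep outliers that are legitimate extreme values. In fraud detection or rare-event analysis, the extreme points are often the signal you are looking for.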
🟨 Interview Triggers (What Interviewers Actually Test)
Common Interview Questions
- “How do you detect outliers?”
  - [Answer: IQR method, Z-score, visual inspection]
- “When would you keep outliers instead of removing them?”
  - [Answer: Fraud detection, rare events, legitimate extreme values]
- “What’s the difference between outliers and anomalies?”
  - [Answer framework to be filled in]
🟥 Common Mistakes (Traps to Avoid)
Mistake 1: Removing outliers without understanding why they exist
[Content to be filled in]
Mistake 2: Using same threshold for all features
[Content to be filled in]
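One way to avoid a shared threshold is to compute the bounds per feature. The sketch below does this with the IQR rule. The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical employee data: age in years, salary in thousands
df = pd.DataFrame({
    "age":    [25, 28, 30, 32, 35, 38, 40, 42, 95],
    "salary": [42, 45, 47, 50, 52, 55, 58, 60, 62],
})

# Quartiles are computed column by column, so each feature gets its own fences
q1 = df.quantile(0.25)
q3 = df.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Comparing a DataFrame with a Series aligns on column names
mask = (df < lower) | (df > upper)
outlier_counts = mask.sum()
```

Here a single fixed cutoff such as "flag anything above 60" would mark a normal salary of 62 as an outlier, while the per-feature fences flag only the age of 95.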
Mistake 3: Treating outliers before understanding the business context
[Content to be filled in]
🟩 Mini Example (Quick Application)
Scenario
Detect and treat salary outliers in an employee dataset.
Solution
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Assumes df is a pandas DataFrame with a numeric 'salary' column
# and no missing values in that column.

# IQR method: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers_iqr = df[(df['salary'] < lower_bound) | (df['salary'] > upper_bound)]
print(f"Outliers (IQR): {len(outliers_iqr)}")

# Z-score method: flag values more than 3 standard deviations from the mean
z_scores = np.abs(stats.zscore(df['salary']))
outliers_z = df[z_scores > 3]
print(f"Outliers (Z-score): {len(outliers_z)}")

# Visualization: box plot and histogram side by side
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.boxplot(x=df['salary'], ax=axes[0])
axes[0].set_title('Box Plot')
sns.histplot(df['salary'], ax=axes[1])
axes[1].set_title('Histogram')
plt.tight_layout()
plt.show()

# Treatment: cap values at the IQR bounds (rows are kept, extremes are clipped)
df['salary_capped'] = df['salary'].clip(lower=lower_bound, upper=upper_bound)