🟪 1-Minute Summary
Imbalanced data occurs when classes have very different frequencies (e.g., 95% no-fraud, 5% fraud). Problems: model biased toward majority class, accuracy misleading. Solutions: (1) Resampling (over/undersample), (2) Different metrics (precision/recall/F1, not accuracy), (3) Class weights, (4) Anomaly detection. Choose based on data size and importance of minority class.
🟦 Core Notes (Must-Know)
What is Imbalanced Data?
Imbalanced data is a classification dataset in which the classes occur at very different frequencies, e.g. 95% legitimate transactions vs. 5% fraud. The rare class (the minority class) is usually the one we actually care about detecting.
Why It’s a Problem
Standard training objectives treat every sample equally, so the model learns to favor the majority class. On a 95/5 split, a classifier that always predicts "no fraud" already reaches 95% accuracy while catching zero fraud, so accuracy looks good even when the model is useless on the minority class.
Detection
Check the class distribution before modeling: count the labels (e.g. value_counts in pandas), look at the majority-to-minority ratio, and inspect a baseline model's confusion matrix. A minority class below roughly 10-20% of samples is a sign that imbalance handling is needed.
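A minimal distribution check, sketched below (the label values are synthetic placeholders, not from these notes):

import pandas as pd

# Hypothetical target column: 95 negatives, 5 positives
y = pd.Series([0] * 95 + [1] * 5)
print(y.value_counts(normalize=True))  # 0: 0.95, 1: 0.05 -> clearly imbalanced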
Solutions
The main options, which are often combined (a class-weight sketch follows this list):
- Resampling: oversample the minority class (e.g. SMOTE) or undersample the majority class
- Class weights: penalize minority-class errors more heavily during training
- Different algorithms: tree-based ensembles tend to tolerate imbalance better than linear models
- Anomaly detection: treat the minority class as outliers when it is extremely rare
- Different metrics: evaluate with precision, recall, F1, or ROC-AUC instead of accuracy
- Ensemble methods: bagging/boosting variants designed for imbalance, such as balanced random forests
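A class-weight sketch (LogisticRegression and RandomForestClassifier are illustrative choices, not prescribed by these notes):

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Most sklearn classifiers accept class_weight="balanced", which scales each
# class's contribution to the loss by the inverse of its frequency
clf_linear = LogisticRegression(class_weight="balanced", max_iter=1000)
clf_tree = RandomForestClassifier(class_weight="balanced", n_estimators=200, random_state=42)
# Fit as usual on the (unresampled) training data: clf_linear.fit(X_train, y_train)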
🟨 Interview Triggers (What Interviewers Actually Test)
Common Interview Questions
- "How do you handle imbalanced data?"
  - Answer: resample (SMOTE or undersampling), use class weights, switch to metrics like precision/recall/F1, or reframe the problem as anomaly detection; choose based on data size and how costly it is to miss the minority class.
- "Why is accuracy bad for imbalanced data?"
  - Answer: it is dominated by the majority class; predicting the majority label for every sample already yields high accuracy while missing every minority example.
- "What metrics would you use instead?"
  - Answer: precision, recall, F1, ROC-AUC, or PR-AUC, depending on whether false positives or false negatives are more costly (see the snippet below).
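Computing these with sklearn (y_true, y_pred, and y_score are assumed to already exist; y_score is the predicted probability of the positive class):

from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

# y_true/y_pred are hard labels, y_score is the positive-class probability (assumed)
precision = precision_score(y_true, y_pred)  # of predicted positives, how many are real
recall = recall_score(y_true, y_pred)        # of real positives, how many were found
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall
auc = roc_auc_score(y_true, y_score)         # threshold-free ranking quality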
🟥 Common Mistakes (Traps to Avoid)
Mistake 1: Using accuracy
On a 95/5 dataset, a model that never predicts the minority class still reaches 95% accuracy, so accuracy says almost nothing about whether the rare class is being detected. Report precision/recall/F1 (or ROC/PR-AUC) alongside, or instead of, accuracy.
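An illustration of the trap (DummyClassifier is used here as a stand-in baseline; X_train, X_test, y_train, y_test are assumed to exist):

from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# A "model" that always predicts the majority class
dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
y_pred = dummy.predict(X_test)
print(accuracy_score(y_test, y_pred))  # ~0.95+ on a heavily imbalanced set
print(f1_score(y_test, y_pred))        # 0.0 -- it never finds the minority class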
Mistake 2: Resampling before train-test split
Applying SMOTE (or undersampling) before the split leaks information: synthetic minority samples are interpolated from points that later end up in the test set, so test performance looks better than it really is. Split first, then resample only the training portion (or use a pipeline that resamples inside each cross-validation fold, as sketched below).
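A leakage-safe sketch using imblearn's Pipeline, which applies SMOTE only to the training folds during cross-validation (the LogisticRegression and scoring choices are illustrative):

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

pipe = Pipeline([
    ("smote", SMOTE(random_state=42)),          # resamples training folds only
    ("clf", LogisticRegression(max_iter=1000)),
])
# scores = cross_val_score(pipe, X, y, cv=5, scoring="f1")  # X, y assumed to exist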
🟩 Mini Example (Quick Application)
Scenario
A fraud-detection dataset where only about 2% of transactions are fraudulent, and missing a fraud is far more costly than raising a false alarm.
Solution
from imblearn.over_sampling import SMOTE
from sklearn.utils.class_weight import compute_class_weight
# Oversample the minority (fraud) class on the training split only, to avoid
# leakage (assumes X_train, y_train already exist from a prior train_test_split)
X_train_res, y_train_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
# compute_class_weight("balanced", classes=[0, 1], y=y_train) is the class-weight alternative
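A fuller, self-contained sketch of the same scenario (the make_classification data, the LogisticRegression model, and all variable names are illustrative assumptions):

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a fraud dataset: ~2% positive (fraud) class
X, y = make_classification(
    n_samples=20_000, n_features=20, weights=[0.98, 0.02], random_state=42
)

# Split first, then resample only the training data (prevents leakage)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_train_res, y_train_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

# Option A: train on the resampled data
model_smote = LogisticRegression(max_iter=1000).fit(X_train_res, y_train_res)

# Option B: skip resampling and let the model reweight classes instead
model_weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(
    X_train, y_train
)

# Evaluate with precision/recall/F1 on the untouched test set, not accuracy
print(classification_report(y_test, model_smote.predict(X_test)))
print(classification_report(y_test, model_weighted.predict(X_test)))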
🔗 Related Topics