Data Science Interview Cheat Sheets

Quick reference guides organized by topic. These are meant for last-minute review before interviews.


📊 Statistics & Probability

Cheat Sheet: Descriptive Statistics

| Metric | Formula | Use Case |
| --- | --- | --- |
| Mean | Σx / n | Central tendency, continuous data |
| Median | Middle value | Skewed distributions, outliers present |
| Mode | Most frequent value | Categorical data |
| Variance | Σ(x - μ)² / n (divide by n - 1 for a sample) | Data spread |
| Std Dev | √Variance | Spread in the same units as the data |
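
A minimal sketch of these metrics in NumPy, using a small made-up sample:

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])   # hypothetical sample

mean = np.mean(data)                          # Σx / n
median = np.median(data)                      # middle value
vals, counts = np.unique(data, return_counts=True)
mode = vals[counts.argmax()]                  # most frequent value
pop_var = np.var(data)                        # Σ(x - μ)² / n (population)
samp_var = np.var(data, ddof=1)               # divide by n - 1 (sample)
std = np.std(data, ddof=1)                    # same units as the data
print(mean, median, mode, pop_var, samp_var, std)
```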

Cheat Sheet: Probability Distributions

| Distribution | Type | Parameters | Use Case |
| --- | --- | --- | --- |
| Normal | Continuous | μ, σ | Natural phenomena, errors |
| Binomial | Discrete | n, p | Success/failure trials |
| Poisson | Discrete | λ | Rare events over time |
| Uniform | Continuous | a, b | Equal probability over an interval |
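
One way to work with each distribution is scipy.stats; the parameter values below are illustrative, and note that scipy's uniform is parameterized by loc/scale rather than a/b:

```python
from scipy import stats

print(stats.norm(loc=0, scale=1).pdf(0.5))    # Normal(μ=0, σ=1) density at 0.5
print(stats.binom(n=10, p=0.3).pmf(4))        # P(4 successes in 10 trials)
print(stats.poisson(mu=2.5).pmf(3))           # P(3 events) with rate λ = 2.5
print(stats.uniform(loc=0, scale=5).pdf(2))   # Uniform(a=0, b=5): loc=a, scale=b-a
print(stats.norm(0, 1).rvs(size=3))           # draw random samples
```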

🔬 Hypothesis Testing Decision Tree

```
Start: Do you have a question about relationships?
  │
  ├─ YES → What type of data?
  │   │
  │   ├─ Categorical vs Categorical → Chi-Square Test
  │   │
  │   ├─ Numerical vs Categorical (2 groups)
  │   │   ├─ Known population σ, large sample → Z-test
  │   │   └─ Unknown σ or small sample → t-test
  │   │
  │   ├─ Numerical vs Categorical (3+ groups) → ANOVA
  │   │
  │   └─ Numerical vs Numerical → Correlation / Regression
  │
  └─ NO → EDA / Descriptive Statistics
```

Cheat Sheet: Hypothesis Tests Comparison

| Test | Data Types | Null Hypothesis | When to Use |
| --- | --- | --- | --- |
| Chi-Square | Cat vs Cat | No association | Independence, goodness-of-fit |
| t-test | Num vs Cat (2 groups) | Means are equal | Compare 2 group means |
| Z-test | Num vs Cat (2 groups) | Means are equal | Large sample, known σ |
| ANOVA | Num vs Cat (3+ groups) | All means are equal | Compare 3+ group means |
| F-test | Num vs Cat (2 groups) | Variances are equal | Compare two group variances |
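
A quick sketch of running these tests with scipy.stats on toy data (all numbers made up):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(5.0, 1.0, 30)   # toy group samples
b = rng.normal(5.5, 1.0, 30)
c = rng.normal(6.0, 1.0, 30)

# Chi-square test of independence on a 2x2 contingency table
chi2, p, dof, expected = stats.chi2_contingency([[30, 10], [20, 40]])

# Two-sample t-test: are the means of a and b equal?
t_stat, p_t = stats.ttest_ind(a, b)

# One-way ANOVA: are all three group means equal?
f_stat, p_f = stats.f_oneway(a, b, c)

# Pearson correlation between two numerical variables
r, p_r = stats.pearsonr(a, b)
```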

🤖 Machine Learning Algorithms

Cheat Sheet: Supervised Learning Algorithm Selection

| Algorithm | Problem Type | Pros | Cons | When to Use |
| --- | --- | --- | --- | --- |
| Linear Regression | Regression | Fast, interpretable | Assumes linearity | Linear relationships |
| Logistic Regression | Classification | Interpretable, outputs probabilities | Linear decision boundary | Binary/multi-class, need probabilities |
| Decision Tree | Both | Non-linear, interpretable | Overfits easily | Complex patterns, explainability needed |
| Random Forest | Both | Reduces overfitting, robust | Slow, black box | High accuracy when interpretability matters less |
| KNN | Both | Simple, no training phase | Slow prediction, sensitive to scale | Small datasets, simple patterns |
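
A hedged sketch comparing several of these models on a synthetic scikit-learn dataset; the hyperparameters are illustrative, and KNN would normally also get feature scaling:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(max_depth=5),
    "forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=5),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, model.score(X_te, y_te))   # held-out accuracy
```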

Cheat Sheet: Clustering Algorithms

| Algorithm | Type | Pros | Cons | When to Use |
| --- | --- | --- | --- | --- |
| K-Means | Partitioning | Fast, scalable | Must set K, assumes spherical clusters | Large datasets, known # of clusters |
| Hierarchical | Agglomerative/Divisive | No need to set K, gives a dendrogram | Slow, memory-intensive | Small datasets, exploring # of clusters |
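
Both algorithms in scikit-learn on toy blobs; choosing 3 clusters is an assumption made for the example:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Partitioning: requires K up front
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Agglomerative: K given here, but a dendrogram cut could choose it instead
hier_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)
```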

📈 Model Evaluation Metrics

Cheat Sheet: Regression Metrics

| Metric | Formula | Range | Interpretation | When to Use |
| --- | --- | --- | --- | --- |
| RMSE | √(Σ(y - ŷ)² / n) | [0, ∞) | Same units as target | Penalize large errors |
| MAE | Σ\|y - ŷ\| / n | [0, ∞) | Same units as target | Treat all errors equally |
| MAPE | (100/n) * Σ\|y - ŷ\| / \|y\| | [0, ∞) % | Percentage error | Relative error important |
| R² | 1 - (SS_res / SS_tot) | (-∞, 1] | Variance explained | Model comparison |
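
The same metrics as a short sketch (made-up y values); RMSE and MAPE are computed directly from the formulas above:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0])   # hypothetical targets
y_pred = np.array([2.5, 5.0, 4.0, 8.0])   # hypothetical predictions

rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mae = mean_absolute_error(y_true, y_pred)
mape = 100 * np.mean(np.abs(y_true - y_pred) / np.abs(y_true))
r2 = r2_score(y_true, y_pred)
print(rmse, mae, mape, r2)
```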

Cheat Sheet: Classification Metrics

| Metric | Formula | Range | When to Use |
| --- | --- | --- | --- |
| Accuracy | (TP + TN) / Total | [0, 1] | Balanced classes |
| Precision | TP / (TP + FP) | [0, 1] | Minimize false alarms |
| Recall | TP / (TP + FN) | [0, 1] | Find all positives (e.g., disease detection) |
| F1-Score | 2 * (Prec * Rec) / (Prec + Rec) | [0, 1] | Balance precision & recall |
| AUC-ROC | Area under ROC curve | [0, 1] | Overall classifier performance |
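
A minimal sketch with scikit-learn; note that AUC-ROC needs scores or probabilities rather than hard labels (all values below are illustrative):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true  = [1, 0, 1, 1, 0, 1, 0, 0]                    # actual labels
y_pred  = [1, 0, 1, 0, 0, 1, 1, 0]                    # hard predictions
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.6, 0.3]    # predicted probabilities

print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred))    # TP / (TP + FP)
print(recall_score(y_true, y_pred))       # TP / (TP + FN)
print(f1_score(y_true, y_pred))
print(roc_auc_score(y_true, y_score))     # uses scores, not labels
```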

Confusion Matrix Quick Reference

```
                Predicted
              Pos     Neg
Actual  Pos   TP      FN
        Neg   FP      TN
```

  • Precision = “Of all predicted positives, how many were correct?”
  • Recall = “Of all actual positives, how many did we find?”
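
The same two definitions derived from the matrix itself; scikit-learn's confusion_matrix orders the cells as [[TN, FP], [FN, TP]] for labels [0, 1]:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)   # of predicted positives, how many were correct?
recall = tp / (tp + fn)      # of actual positives, how many did we find?
print(precision, recall)
```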

🎯 Overfitting vs Underfitting

| Aspect | Underfitting | Good Fit | Overfitting |
| --- | --- | --- | --- |
| Training Error | High | Low | Very Low |
| Validation Error | High | Low | High |
| Model Complexity | Too simple | Just right | Too complex |
| What's happening | Not learning patterns | Learning generalizable patterns | Memorizing noise |
| Fix | More features, more complex model | ✓ Good to go | Regularization, more data, simpler model |
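
One way to see this table in action: fit polynomials of increasing degree to noisy data and watch training and validation error diverge (a sketch; the degrees and noise level are arbitrary choices):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.2, 200)   # noisy nonlinear target
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):   # too simple, about right, too complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    train_err = mean_squared_error(y_tr, model.predict(X_tr))
    valid_err = mean_squared_error(y_va, model.predict(X_va))
    print(degree, train_err, valid_err)   # the gap widens as the model overfits
```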

🔧 Regularization

| Technique | Penalty Added to Loss | Effect | When to Use |
| --- | --- | --- | --- |
| Ridge (L2) | λΣβ² | Shrinks coefficients toward 0 | Multicollinearity, keep all features |
| Lasso (L1) | λΣ\|β\| | Sets some β exactly to 0 | Feature selection needed |
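
A sketch of both penalties in scikit-learn, where alpha plays the role of λ; the dataset is synthetic with only 3 truly informative features:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

X, y = make_regression(n_samples=100, n_features=10,
                       n_informative=3, noise=5.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

print(ridge.coef_)   # all coefficients kept, shrunk toward 0
print(lasso.coef_)   # some coefficients typically driven exactly to 0
print(np.sum(lasso.coef_ == 0), "features dropped by Lasso")
```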

🎲 Ensemble Methods

| Method | Type | How It Works | Best For |
| --- | --- | --- | --- |
| Random Forest | Bagging | Averages many trees | Reducing variance, high accuracy |
| AdaBoost | Boosting | Sequential, focuses on errors | Weak learners, binary classification |
| Gradient Boosting | Boosting | Sequential, fits residuals | High accuracy, regression/classification |
| XGBoost | Boosting | Optimized gradient boosting | Competitions, production systems |
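
A quick comparison sketch in scikit-learn (XGBoost itself lives in the separate xgboost package, so it is omitted here):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
                              GradientBoostingClassifier)

X, y = make_classification(n_samples=500, random_state=0)

for model in (RandomForestClassifier(n_estimators=200, random_state=0),   # bagging
              AdaBoostClassifier(n_estimators=200, random_state=0),       # boosting
              GradientBoostingClassifier(n_estimators=200, random_state=0)):
    scores = cross_val_score(model, X, y, cv=5)
    print(type(model).__name__, scores.mean())
```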

⚡ Quick Interview Formulas

Must-Know Formulas

```
# Standard Error
SE = σ / √n

# Z-Score
z = (x - μ) / σ

# Confidence Interval
CI = x̄ ± (z * SE)

# R² (coefficient of determination)
R² = 1 - (SS_residual / SS_total)

# Bias-Variance Tradeoff
Total Error = Bias² + Variance + Irreducible Error
```
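
The same formulas as runnable Python, with illustrative numbers standing in for real data:

```python
import math

sigma, n = 2.0, 100
se = sigma / math.sqrt(n)                  # standard error

x, mu = 7.5, 5.0
z = (x - mu) / sigma                       # z-score

x_bar, z_crit = 5.2, 1.96                  # 95% CI uses z ≈ 1.96
ci = (x_bar - z_crit * se, x_bar + z_crit * se)

ss_res, ss_tot = 12.0, 80.0
r_squared = 1 - ss_res / ss_tot            # R²

bias_sq, variance, irreducible = 0.5, 0.3, 0.1
total_error = bias_sq + variance + irreducible
print(se, z, ci, r_squared, total_error)
```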

Pro Tip: Print these cheat sheets and review them the night before your interview!