Data Science Interview Cheat Sheets
Quick reference guides organized by topic. These are meant for last-minute review before interviews.
📊 Statistics & Probability
Cheat Sheet: Descriptive Statistics
| Metric | Formula | Use Case |
|---|---|---|
| Mean | Σx / n | Central tendency, continuous data |
| Median | Middle value | Skewed distributions, outliers present |
| Mode | Most frequent value | Categorical data |
| Variance | Σ(x - μ)² / n | Data spread |
| Std Dev | √Variance | Same units as the data |
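A minimal Python sketch of these statistics, assuming NumPy and the standard-library `statistics` module; the sample data is made up for illustration:

```python
# Minimal sketch: the descriptive statistics above with NumPy and
# the standard-library statistics module. Sample data is made up.
import numpy as np
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

print("Mean:    ", np.mean(data))          # Σx / n
print("Median:  ", np.median(data))        # middle value
print("Mode:    ", statistics.mode(data))  # most frequent value
print("Variance:", np.var(data))           # Σ(x - μ)² / n  (population variance, ddof=0)
print("Std Dev: ", np.std(data))           # √Variance
```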
Cheat Sheet: Probability Distributions
| Distribution | Type | Parameters | Use Case |
|---|---|---|---|
| Normal | Continuous | μ, σ | Natural phenomena, errors |
| Binomial | Discrete | n, p | Success/failure trials |
| Poisson | Discrete | λ | Rare events over time |
| Uniform | Continuous | a, b | Equal probability |
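A minimal sketch of the same four distributions via `scipy.stats`; the parameter values are arbitrary and chosen only for illustration:

```python
# Minimal sketch: the four distributions above via scipy.stats.
from scipy import stats

normal   = stats.norm(loc=0, scale=1)      # μ = 0, σ = 1
binomial = stats.binom(n=10, p=0.3)        # 10 trials, p(success) = 0.3
poisson  = stats.poisson(mu=2.5)           # λ = 2.5 events per interval
uniform  = stats.uniform(loc=0, scale=5)   # equal probability on [0, 5]

print(normal.pdf(0))        # density at x = 0
print(binomial.pmf(3))      # P(exactly 3 successes)
print(poisson.pmf(0))       # P(no events in the interval)
print(uniform.rvs(size=3))  # three random draws
```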
🔬 Hypothesis Testing Decision Tree
```text
Start: Do you have a question about relationships?
│
├─ YES → What type of data?
│   │
│   ├─ Categorical vs Categorical → Chi-Square Test
│   │
│   ├─ Numerical vs Categorical (2 groups) → t-test
│   │     ├─ Known population σ? → Z-test
│   │     └─ Unknown σ, small sample → t-test
│   │
│   ├─ Numerical vs Categorical (3+ groups) → ANOVA
│   │
│   └─ Numerical vs Numerical → Correlation / Regression
│
└─ NO → EDA / Descriptive Statistics
```
Cheat Sheet: Hypothesis Tests Comparison
| Test | Data Types | Null Hypothesis | When to Use |
|---|---|---|---|
| Chi-Square | Cat vs Cat | No association | Independence, goodness-of-fit |
| t-test | Num vs Cat (2 groups) | Means are equal | Compare 2 group means |
| Z-test | Num vs Cat (2 groups) | Means are equal | Large sample, known σ |
| ANOVA | Num vs Cat (3+ groups) | All means equal | Compare 3+ group means |
| F-test | Num vs Cat (2 groups) | Variances are equal | Compare two groups' variances |
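A minimal sketch of the most common calls behind this table, using `scipy.stats`; the group data and contingency table are made up:

```python
# Minimal sketch of the tests in the comparison table, using scipy.stats.
# All data below is invented purely to show the function calls.
import numpy as np
from scipy import stats

group_a = np.array([5.1, 4.9, 6.2, 5.8, 5.5])
group_b = np.array([6.5, 7.1, 6.8, 7.4, 6.9])
group_c = np.array([4.2, 4.8, 4.5, 4.9, 4.4])

# t-test: compare the means of two groups
t_stat, p_ttest = stats.ttest_ind(group_a, group_b)

# ANOVA: compare the means of three or more groups
f_stat, p_anova = stats.f_oneway(group_a, group_b, group_c)

# Chi-square test of independence on a 2x2 contingency table
table = np.array([[30, 10], [20, 40]])
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)

print(p_ttest, p_anova, p_chi2)
```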
🤖 Machine Learning Algorithms
Cheat Sheet: Supervised Learning Algorithm Selection
| Algorithm | Problem Type | Pros | Cons | When to Use |
|---|---|---|---|---|
| Linear Regression | Regression | Fast, interpretable | Assumes linearity | Linear relationships |
| Logistic Regression | Classification | Interpretable, probabilities | Linear boundary | Binary/multi-class, need probabilities |
| Decision Tree | Both | Non-linear, interpretable | Overfits easily | Complex patterns, explainability needed |
| Random Forest | Both | Reduces overfitting, robust | Slow, black box | High accuracy, less interpretability OK |
| KNN | Both | Simple, no training | Slow prediction, sensitive to scale | Small datasets, simple patterns |
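A minimal sketch comparing these algorithms with scikit-learn and 5-fold cross-validation; the dataset and hyperparameters are arbitrary illustration choices, and scale-sensitive models are wrapped in `StandardScaler`:

```python
# Minimal sketch: comparing the supervised algorithms above with
# scikit-learn cross-validation on a built-in toy dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Decision Tree": DecisionTreeClassifier(max_depth=5),
    "Random Forest": RandomForestClassifier(n_estimators=200),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold CV accuracy
    print(f"{name:20s} accuracy = {scores.mean():.3f}")
```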
Cheat Sheet: Clustering Algorithms
| Algorithm | Type | Pros | Cons | When to Use |
|---|---|---|---|---|
| K-Means | Partitioning | Fast, scalable | Need to set K, spherical clusters | Large datasets, known # clusters |
| Hierarchical | Agglomerative/Divisive | No need to set K, dendrogram | Slow, memory intensive | Small datasets, explore # clusters |
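A minimal sketch of both clustering approaches in scikit-learn on synthetic blob data (the number of clusters is assumed known here):

```python
# Minimal sketch: K-Means vs. hierarchical (agglomerative) clustering
# with scikit-learn on synthetic blob data.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
X = StandardScaler().fit_transform(X)  # both methods are distance-based, so scale first

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
hier_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

print(kmeans_labels[:10])  # cluster assignments for the first 10 points
print(hier_labels[:10])
```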
📈 Model Evaluation Metrics
Cheat Sheet: Regression Metrics
| Metric | Formula | Range | Interpretation | When to Use |
|---|---|---|---|---|
| RMSE | √(Σ(y - ŷ)² / n) | [0, ∞) | Same units as target | Penalize large errors |
| MAE | Σ\|y - ŷ\| / n | [0, ∞) | Same units as target | Treat all errors equally |
| MAPE | (100/n) * Σ\|y - ŷ\| / \|y\| | [0, ∞)% | Percentage error | Relative error important |
| R² | 1 - (SS_res / SS_tot) | (-∞, 1] | Variance explained | Model comparison |
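A minimal sketch computing these metrics with scikit-learn and NumPy; `y_true` and `y_pred` are made-up values:

```python
# Minimal sketch: the regression metrics above via scikit-learn / NumPy.
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.8, 5.4, 2.9, 6.5])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))        # penalizes large errors
mae  = mean_absolute_error(y_true, y_pred)                # treats all errors equally
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100  # relative (%) error
r2   = r2_score(y_true, y_pred)                           # variance explained

print(f"RMSE={rmse:.3f}  MAE={mae:.3f}  MAPE={mape:.1f}%  R²={r2:.3f}")
```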
Cheat Sheet: Classification Metrics
| Metric | Formula | Range | When to Use |
|---|---|---|---|
| Accuracy | (TP + TN) / Total | [0, 1] | Balanced classes |
| Precision | TP / (TP + FP) | [0, 1] | Minimize false alarms |
| Recall | TP / (TP + FN) | [0, 1] | Find all positives (e.g., disease detection) |
| F1-Score | 2 * (Prec * Rec) / (Prec + Rec) | [0, 1] | Balance precision & recall |
| AUC-ROC | Area under ROC curve | [0, 1] | Overall classifier performance |
Confusion Matrix Quick Reference
```text
                 Predicted
                 Pos    Neg
Actual   Pos      TP     FN
         Neg      FP     TN
```
- Precision = “Of all predicted positives, how many were correct?”
- Recall = “Of all actual positives, how many did we find?”
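A minimal sketch of the classification metrics and confusion matrix in scikit-learn; the labels and scores are made up. Note that scikit-learn orders the matrix as [[TN, FP], [FN, TP]] (rows are actual, columns are predicted, class 0 first), which differs from the layout above:

```python
# Minimal sketch: classification metrics and the confusion matrix
# with scikit-learn. y_true, y_pred, and y_score are invented values.
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

y_true  = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred  = np.array([1, 0, 1, 0, 0, 1, 1, 0])
y_score = np.array([0.9, 0.2, 0.8, 0.4, 0.3, 0.7, 0.6, 0.1])  # predicted probabilities

print("Accuracy: ", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("Recall:   ", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("F1:       ", f1_score(y_true, y_pred))
print("AUC-ROC:  ", roc_auc_score(y_true, y_score))    # uses scores, not hard labels
print(confusion_matrix(y_true, y_pred))                # [[TN, FP], [FN, TP]]
```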
🎯 Overfitting vs Underfitting
| Aspect | Underfitting | Good Fit | Overfitting |
|---|---|---|---|
| Training Error | High | Low | Very low |
| Validation Error | High | Low | High |
| Model Complexity | Too simple | Just right | Too complex |
| What's happening | Not learning patterns | Learning generalizable patterns | Memorizing noise |
| Fix | Add features, use a more complex model | ✅ Good to go | Regularization, more data, simpler model |
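A minimal sketch that makes the table concrete: sweep a decision tree's `max_depth` and compare training vs. cross-validated accuracy (the dataset and depth grid are arbitrary illustration choices):

```python
# Minimal sketch: diagnosing under/overfitting by sweeping model complexity
# (tree depth) and comparing training vs. cross-validated accuracy.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import validation_curve

X, y = load_breast_cancer(return_X_y=True)
depths = [1, 2, 4, 8, 16]

train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name="max_depth", param_range=depths, cv=5)

for d, tr, va in zip(depths, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    # underfitting shows up as low scores on both; overfitting as a gap
    # between high training accuracy and lower validation accuracy
    print(f"max_depth={d:2d}  train={tr:.3f}  validation={va:.3f}")
```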
🔧 Regularization
| Technique | Type | Penalty Added to Loss | Effect | When to Use |
|---|---|---|---|---|
| Ridge (L2) | Linear | + λΣβ² | Shrinks coefficients | Multicollinearity, keep all features |
| Lasso (L1) | Linear | + λΣ\|β\| | Sets some β to 0 | Feature selection needed |
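A minimal sketch of Ridge vs. Lasso in scikit-learn, where `alpha` plays the role of λ; the toy regression data is randomly generated:

```python
# Minimal sketch: Ridge (L2) vs. Lasso (L1) regularization in scikit-learn.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=10, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)   # shrinks all coefficients toward 0
lasso = Lasso(alpha=1.0).fit(X, y)   # drives some coefficients exactly to 0

print("Ridge non-zero coefficients:", (ridge.coef_ != 0).sum())  # typically all 10
print("Lasso non-zero coefficients:", (lasso.coef_ != 0).sum())  # typically fewer
```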
🎲 Ensemble Methods
| Method | Type | How It Works | Best For |
|---|---|---|---|
| Random Forest | Bagging | Average of many trees | Reduce variance, high accuracy |
| AdaBoost | Boosting | Sequential, focus on errors | Weak learners, binary classification |
| Gradient Boosting | Boosting | Sequential, fit residuals | High accuracy, regression/classification |
| XGBoost | Boosting | Optimized gradient boosting | Competitions, production systems |
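A minimal sketch of the scikit-learn ensembles above; XGBoost is a separate library whose `XGBClassifier` exposes a similar fit/predict API. The dataset and `n_estimators` values are arbitrary choices:

```python
# Minimal sketch: comparing the ensemble methods above with scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
                              GradientBoostingClassifier)
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

ensembles = {
    "Random Forest (bagging)": RandomForestClassifier(n_estimators=200, random_state=0),
    "AdaBoost (boosting)": AdaBoostClassifier(n_estimators=200, random_state=0),
    "Gradient Boosting (boosting)": GradientBoostingClassifier(n_estimators=200, random_state=0),
}

for name, model in ensembles.items():
    score = cross_val_score(model, X, y, cv=5).mean()  # 5-fold CV accuracy
    print(f"{name:30s} accuracy = {score:.3f}")
```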
Cheat Sheet: Key Formulas
- Standard Error: SE = σ / √n
- Z-Score: z = (x - μ) / σ
- Confidence Interval: CI = x̄ ± z * SE
- R² (coefficient of determination): R² = 1 - (SS_residual / SS_total)
- Bias-Variance Tradeoff: Total Error = Bias² + Variance + Irreducible Error
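A minimal sketch of these formulas computed by hand with NumPy and SciPy; the sample values and the 95% confidence level are made up for illustration:

```python
# Minimal sketch: the key formulas above, computed on an invented sample.
import numpy as np
from scipy import stats

sample = np.array([4.8, 5.1, 5.5, 4.9, 5.3, 5.0, 5.2, 4.7])

# Standard error of the mean: SE = σ / √n (sample std as the estimate of σ)
se = sample.std(ddof=1) / np.sqrt(len(sample))

# Z-score of a single observation: z = (x - μ) / σ
z = (5.5 - sample.mean()) / sample.std(ddof=1)

# 95% confidence interval for the mean: x̄ ± z * SE
z_crit = stats.norm.ppf(0.975)
ci = (sample.mean() - z_crit * se, sample.mean() + z_crit * se)

print(f"SE={se:.3f}  z={z:.2f}  CI=({ci[0]:.2f}, {ci[1]:.2f})")
```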
Pro Tip: Print these cheat sheets and review them the night before your interview!