Machine Learning — Interview Prep
Targeted preparation for ML Engineer, Data Scientist, Applied Scientist, and Research Scientist roles. Covers classical ML, deep learning, evaluation, and production ML systems.
Roles covered: ML Engineer · Data Scientist · Applied Scientist · MLOps Engineer · ML Platform Engineer
1. Classical ML Algorithms
Decision Trees & Ensembles
Q: How does a decision tree split? What makes a good split?
At each node, the tree tries every feature and every threshold, picking the split that maximizes information gain (or minimizes Gini impurity).
Information Gain = H(parent) − weighted average H(children), where H = entropy = -Σ p log₂ p
Gini Impurity = 1 − Σ pᵢ² (default in sklearn)
A split is good when child nodes are more pure (lower impurity) than the parent. Stops when: max depth reached, min samples threshold, or no improvement.
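The two impurity measures above can be computed by hand on a toy node. This is a minimal sketch using only the standard library; the label lists and the two candidate splits are made-up illustrations, not output from any real tree implementation.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p))."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, left, right):
    """H(parent) minus the size-weighted average entropy of the children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

parent = [0, 0, 0, 0, 1, 1, 1, 1]
# Candidate split A separates the classes perfectly; split B leaves both children mixed.
gain_a = information_gain(parent, [0, 0, 0, 0], [1, 1, 1, 1])
gain_b = information_gain(parent, [0, 0, 1, 1], [0, 0, 1, 1])
print(gain_a, gain_b, gini(parent))
```

A tree would pick split A here, since its children are pure and the gain equals the full parent entropy.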
Q: Why does Random Forest reduce variance while Gradient Boosting reduces bias? When do you prefer each?
Random Forest: Bagging (bootstrap samples) + feature subsampling. Each tree is decorrelated from others. Averaging N decorrelated trees reduces variance without increasing bias. Parallelizable; robust to noisy features.
Gradient Boosting (XGBoost, LightGBM): Sequential ensemble — each tree fits the residuals of previous trees. Reduces bias (high-bias weak learners become strong together). More powerful but prone to overfitting; need careful tuning of n_estimators, learning_rate, max_depth.
Use Random Forest when: Quick baseline, noisy data, don’t have time to tune, interpretability via feature importance.
Use XGBoost/LightGBM when: Tabular data, maximizing performance, enough time to tune, low to moderate noise.
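To make "each tree fits the residuals of previous trees" concrete, here is a rough sketch of gradient boosting for squared loss with one-split regression stumps. The helper names (`fit_stump`, `boost`) and the toy data are assumptions for illustration; real libraries add regularization, subsampling, and far faster split finding.

```python
def fit_stump(xs, residuals):
    """Best single-threshold regression stump under squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean

def boost(xs, ys, n_rounds=20, learning_rate=0.3):
    """Each round fits a stump to the current residuals and adds a damped copy."""
    base = sum(ys) / len(ys)
    stumps = []

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    history = []
    for _ in range(n_rounds):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
        history.append(sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(ys))
    return predict, history

xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0.1, 0.9, 2.2, 2.8, 4.1, 5.2, 5.8, 7.1]
predict, mse_history = boost(xs, ys)
print(mse_history[0], mse_history[-1])
```

Training error keeps falling as rounds are added, which is the bias reduction described above and also why n_estimators and learning_rate need tuning against a validation set.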
Q: Explain XGBoost’s key innovations over vanilla gradient boosting.
- Regularization: L1 (α) and L2 (λ) penalties on leaf weights reduce overfitting; vanilla GBDT has no such explicit penalty
- Approximate split finding: Quantile sketch for finding splits on large datasets (faster than exact greedy)
- Column subsampling: Like Random Forest — randomly select features per tree (reduces variance)
- Sparsity awareness: Handles missing values by learning a default direction for missing data at each split
- Cache-aware access: Data layout and prefetching designed to reduce CPU cache misses during split finding
- Parallel tree construction: Split finding within each tree is parallelized across features; the trees themselves are still built sequentially
Q: What is the difference between bagging and boosting?
| Aspect | Bagging (Random Forest) | Boosting (XGBoost) |
|---|---|---|
| Training | Parallel (independent trees) | Sequential (each tree corrects previous) |
| Data sampling | With replacement (bootstrap) | Each tree fits the residuals (gradients) of the current ensemble; AdaBoost instead reweights samples |
| Goal | Reduce variance | Reduce bias |
| Risk | Underfitting | Overfitting |
| Tuning | Less sensitive | More sensitive |
Support Vector Machines
Q: Explain SVMs intuitively. When do kernels help?
SVM finds the hyperplane with maximum margin between classes. The “support vectors” are the training points closest to the boundary — only these determine the decision boundary.
Kernel trick: Instead of mapping to a higher-dimensional feature space explicitly (expensive), kernels compute dot products in that space cheaply:
- Linear: No transformation (fast, works for linearly separable data)
- RBF (Gaussian): Implicit infinite-dimensional space; can model very flexible decision boundaries (hyperparameter: γ controls smoothness; larger γ gives a more local, wigglier boundary)
- Polynomial: Degree-d polynomial features implicitly
Use kernels when: data is not linearly separable in original space, but you believe a transformation would separate it.
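A small numeric check of the kernel trick: for a degree-2 polynomial kernel on 2-D inputs, the kernel value equals the dot product of explicitly mapped features. The feature map `phi` below is the standard homogeneous degree-2 map; the sample points and the default γ are arbitrary choices for illustration.

```python
from math import exp, sqrt

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def phi(v):
    """Explicit degree-2 feature map for a 2-D input (homogeneous polynomial)."""
    x1, x2 = v
    return (x1 * x1, sqrt(2) * x1 * x2, x2 * x2)

def poly2_kernel(x, z):
    """Same inner product as dot(phi(x), phi(z)), computed in the original 2-D space."""
    return dot(x, z) ** 2

def rbf_kernel(x, z, gamma=0.5):
    """exp(-gamma * ||x - z||^2); no finite explicit feature map exists for this one."""
    return exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

x, z = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(x), phi(z))   # map first, then take the dot product
implicit = poly2_kernel(x, z)    # kernel shortcut, no mapping needed
print(explicit, implicit, rbf_kernel(x, z))
```

The explicit route grows quickly with input dimension and polynomial degree, while the kernel route stays at one dot product in the original space.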
k-Nearest Neighbors
Q: What are the computational challenges of kNN and how do you address them?
kNN at inference: compute distance to every training point → O(Nd) per query. Prohibitive for N=1M+.
Solutions:
- KD-trees: Exact search in roughly O(d log N) per query on average for low d (d ≤ ~20); performance degrades toward brute force as d grows
- Ball trees: Better for moderate d
- Approximate nearest neighbor (ANN): FAISS (Facebook AI), HNSW, ScaNN trade exact correctness for speed. Well-tuned indexes can reach 99%+ recall at speedups on the order of 100× over brute force, depending on data and parameters. Used in vector databases (Pinecone, Weaviate) for RAG.
2. Model Evaluation
Q: Explain the bias-variance trade-off. How do you diagnose which problem you have?
Total error = Bias² + Variance + Irreducible Noise
| Problem | Symptoms | Fix |
|---|---|---|
| High bias (underfitting) | High train error, high test error, similar errors | More complex model, more features, less regularization |
| High variance (overfitting) | Low train error, high test error, large gap | More data, more regularization, simpler model, dropout |
| High noise | Both errors high, similar → irreducible | Better data quality, feature engineering |
Learning curves: plot train/val error vs training set size. High bias: both errors plateau high. High variance: large gap between train and val error.
Q: When is AUC-ROC misleading? What do you use instead?
AUC-ROC can be misleadingly high when the positive class is rare. Example: 1% fraud rate. FPR = FP / (number of negatives), and negatives dominate, so even a small FPR hides a large absolute number of false alarms. At TPR = 90% and FPR = 5%, 10,000 transactions yield 90 true positives and 495 false positives: precision is only about 15%, yet that operating point looks strong on a ROC curve.
AUC-PR (precision-recall): More informative for imbalanced problems. A random classifier has AUC-PR ≈ prevalence rate (1%); a good model scores significantly above that.
Log loss / Brier score: For calibration — does P(fraud)=0.8 actually mean 80% of such cases are fraud? Critical for decision-making systems.
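The gap between a ROC operating point and precision under class imbalance is simple arithmetic. The sketch below assumes a hypothetical batch of 10,000 transactions at 1% prevalence and an operating point of TPR = 0.90, FPR = 0.05; the function name is illustrative.

```python
def confusion_rates(n_total, prevalence, tpr, fpr):
    """Translate a ROC operating point (TPR, FPR) into counts and precision."""
    positives = n_total * prevalence
    negatives = n_total - positives
    tp = tpr * positives
    fp = fpr * negatives
    precision = tp / (tp + fp)
    return tp, fp, precision

tp, fp, precision = confusion_rates(n_total=10_000, prevalence=0.01, tpr=0.90, fpr=0.05)
print(f"TP={tp:.0f} FP={fp:.0f} precision={precision:.3f}")

# Same operating point on balanced data, for comparison.
_, _, balanced_precision = confusion_rates(n_total=10_000, prevalence=0.5, tpr=0.90, fpr=0.05)
print(f"balanced precision={balanced_precision:.3f}")
```

The ROC point is identical in both cases; only precision reveals how much worse the rare-positive setting is for whoever reviews the flagged cases.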
Q: What is k-fold cross-validation? When would you use stratified k-fold?
k-fold CV: split data into k folds; train on k-1, validate on the remaining fold; rotate k times; average performance.
- Standard k-fold: Randomized split. Default for regression.
- Stratified k-fold: Each fold preserves the class distribution. Always use for classification (especially imbalanced).
- Group k-fold: Ensure all samples from the same group (e.g., same customer) are in the same fold — prevents data leakage when a customer has multiple rows.
- Time series split: No shuffling; train on past, validate on future. Prevents temporal leakage.
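A bare-bones sketch of stratified fold assignment, to show what "each fold preserves the class distribution" means in practice. It deals indices round-robin per class and does no shuffling, which is a simplification; in practice a library splitter such as scikit-learn's `StratifiedKFold` would normally be used.

```python
from collections import defaultdict

def stratified_kfold(labels, k):
    """Assign each sample index to a fold so every fold keeps the class mix."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal the indices of each class round-robin across the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds

labels = [1] * 5 + [0] * 45          # 10% positives
folds = stratified_kfold(labels, k=5)
for val_idx in folds:
    train_idx = [i for f in folds if f is not val_idx for i in f]
    positives_in_val = sum(labels[i] for i in val_idx)
    print(len(train_idx), len(val_idx), positives_in_val)
```

With a plain random split and only 5 positives, some folds could easily end up with none, which makes per-fold metrics such as recall undefined.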
Q: How do you calibrate a model’s probability outputs?
A well-calibrated model means: when it predicts P(y=1) = 0.7, 70% of those samples actually are positive.
Check calibration: reliability diagram (calibration curve) — plot mean predicted probability vs fraction of positives in each bin.
Calibration methods:
- Platt scaling: Fit a logistic regression on held-out validation predictions → calibrated probability
- Isotonic regression: Non-parametric; more flexible than Platt but requires more data
- Temperature scaling: For neural networks — scale logits by a temperature T (single parameter)
Why it matters: credit risk scores, medical diagnosis, fraud thresholds — any decision that depends on the probability magnitude (not just ranking) requires calibration.
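The reliability diagram and Brier score can be computed directly from predictions and labels. This is a minimal sketch with equal-width bins; the bin count and the eight example predictions are made up for illustration.

```python
def reliability_bins(probs, labels, n_bins=5):
    """Per non-empty bin: (mean predicted probability, observed positive rate, count)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        i = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[i].append((p, y))
    out = []
    for b in bins:
        if b:
            mean_pred = sum(p for p, _ in b) / len(b)
            frac_pos = sum(y for _, y in b) / len(b)
            out.append((mean_pred, frac_pos, len(b)))
    return out

def brier_score(probs, labels):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

probs = [0.1, 0.1, 0.3, 0.7, 0.7, 0.9, 0.9, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
curve = reliability_bins(probs, labels)
score = brier_score(probs, labels)
for mean_pred, frac_pos, count in curve:
    print(f"predicted={mean_pred:.2f} observed={frac_pos:.2f} n={count}")
```

A well-calibrated model has observed rates close to the mean predicted probability in every bin; bins far from the diagonal are where Platt scaling or isotonic regression would adjust the scores.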
3. Feature Engineering & Selection
Q: How do you handle missing values? Walk through the decision process.
Understand why the data is missing:
- MCAR (Missing Completely At Random): safe to drop or impute
- MAR (Missing At Random): conditional on observed data — impute carefully
- MNAR (Missing Not At Random): the fact that it’s missing is informative (e.g., income not reported for high earners) — add a missing indicator feature
Strategies:
| Method | When | Risk |
|---|---|---|
| Drop rows | MCAR, small fraction | Reduces sample size |
| Mean/median imputation | MAR, numerical | Distorts distribution, hides uncertainty |
| Mode imputation | Categorical | May introduce bias |
| Model-based (MICE) | MAR, complex patterns | Expensive but principled |
| Add missing indicator | MNAR | Additional feature |
| Forward fill | Time series | Only when temporal order justifies it |
For gradient boosting (XGBoost): Built-in handling — can learn optimal direction for missing values.
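Median imputation combined with a missing indicator might look like the sketch below. `impute_with_indicator` is a hypothetical helper and the income values are invented; the point it illustrates is that the fill value is learned on training data and reused unchanged at test or serving time.

```python
from statistics import median

def impute_with_indicator(values, fill=None):
    """Median-impute None entries and return a parallel 0/1 missing flag.

    Pass `fill` (the training-set median) for test or serving data so that
    statistics are never recomputed on data the model should not see.
    """
    if fill is None:
        fill = median(v for v in values if v is not None)
    imputed = [fill if v is None else v for v in values]
    was_missing = [1 if v is None else 0 for v in values]
    return imputed, was_missing, fill

train_income = [40_000, None, 55_000, 70_000, None]
train_imputed, train_flag, train_median = impute_with_indicator(train_income)

test_income = [None, 90_000]
test_imputed, test_flag, _ = impute_with_indicator(test_income, fill=train_median)
print(train_imputed, train_flag, test_imputed, test_flag)
```

The flag column lets the model treat "missing" as a signal in its own right, which matters in the MNAR case.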
Q: How do you handle categorical features with high cardinality (e.g., merchant_id with 50K unique values)?
| Method | When | Notes |
|---|---|---|
| Target encoding | High cardinality + strong signal | Risk of leakage; use cross-val folds |
| Frequency encoding | High cardinality + frequency matters | Replace category with count in training data |
| Embedding | Neural network, many categories | Learn dense representation end-to-end |
| Hashing trick | Very high cardinality, memory constraints | Hash to fixed-size vector; collisions merge unrelated categories but are usually tolerable with enough buckets |
| Grouping rare categories | Long tail | Group all categories with < N occurrences into “Other” |
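The "use cross-val folds" advice for target encoding can be sketched as follows: each row is encoded from the other folds only, with shrinkage toward the global mean for rare categories. The fold assignment by index modulo k, the smoothing constant, and the toy merchant data are all assumptions for illustration.

```python
from collections import defaultdict

def target_encode_oof(categories, targets, k=3, smoothing=5.0):
    """Out-of-fold target encoding with shrinkage toward the fold's prior.

    A row's own label never contributes to its encoded value.
    """
    n = len(categories)
    encoded = [0.0] * n
    for fold in range(k):
        val = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        prior = sum(targets[i] for i in train) / len(train)
        sums, counts = defaultdict(float), defaultdict(int)
        for i in train:
            sums[categories[i]] += targets[i]
            counts[categories[i]] += 1
        for i in val:
            c = categories[i]
            # Unseen or rare categories fall back toward the prior.
            encoded[i] = (sums[c] + smoothing * prior) / (counts[c] + smoothing)
    return encoded

merchants = ["a", "a", "b", "a", "b", "c", "a", "b", "a"]
is_fraud = [1, 0, 0, 1, 0, 1, 1, 0, 0]
encoded = target_encode_oof(merchants, is_fraud)
print([round(e, 3) for e in encoded])
```

Merchant "c" appears only once, so its out-of-fold encoding is exactly the prior; encoding it from its own label would have leaked the target.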
Q: What is feature leakage? Give three examples.
Leakage: the model learns from features that contain information about the target that wouldn’t be available at prediction time.
- Temporal leakage: Using current month’s transaction count to predict whether a loan from 6 months ago will default — the count wasn’t known at origination.
- Label-derived features: Including “outcome_category” (e.g., “charged off”) as a feature when the target is default — trivially predicts itself.
- Group leakage: Customer has 10 rows (monthly snapshots). Training on 8 rows and validating on 2 from the same customer lets the model memorize customer-level patterns, so the validation score is inflated and does not reflect performance on unseen customers.
Q: How do you select important features? What are the pitfalls of each approach?
| Method | Pros | Pitfalls |
|---|---|---|
| Gini/permutation importance (tree-based) | Fast, model-native | Gini importance is biased toward high-cardinality features; permutation importance is unreliable when features are correlated |
| SHAP values | Game-theoretically consistent, shows direction | Slow for large ensembles; correlated features split importance |
| L1 (LASSO) | Sparse, interpretable | Arbitrary selection among correlated features |
| Mutual information | Non-parametric, catches non-linear dependencies | Univariate (misses interactions) |
| Recursive Feature Elimination | Principled, model-validated | Computationally expensive |
4. Regularization & Optimization
Q: Explain L1 vs L2 regularization. When would you use each?
Both penalize large weights, preventing overfitting:
- L2 (Ridge): Adds λΣwᵢ² to loss. Penalizes large weights smoothly. Solution: all weights shrink toward zero, none exactly zero. Good for: multicollinearity (distributes weight across correlated features).
- L1 (LASSO): Adds λΣ|wᵢ| to loss. Produces sparse solutions (some weights exactly zero). Good for: feature selection when you believe few features are truly predictive.
- Elastic Net: L1 + L2 combined. Best of both — sparse solutions while handling correlated features better than pure L1.
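Why L1 produces exact zeros and L2 does not shows up already in one dimension. The sketch assumes a scalar problem: minimize 0.5·(w − z)² plus the penalty, where z is the unregularized estimate. Under that assumption the closed-form solutions are proportional shrinkage for L2 and soft-thresholding for L1.

```python
def ridge_shrink(z, lam):
    """argmin_w 0.5*(w - z)^2 + lam*w^2  ->  proportional shrinkage."""
    return z / (1.0 + 2.0 * lam)

def lasso_shrink(z, lam):
    """argmin_w 0.5*(w - z)^2 + lam*|w|  ->  soft-thresholding."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

unregularized = [3.0, 0.4, -0.2, -1.5]
ridge = [ridge_shrink(z, lam=0.5) for z in unregularized]
lasso = [lasso_shrink(z, lam=0.5) for z in unregularized]
print(ridge)
print(lasso)
```

Ridge scales every weight by the same factor, so small weights stay small but nonzero; lasso subtracts a fixed amount and clips at zero, which removes the weak features entirely.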
Q: What is the difference between SGD, Adam, and AdamW? When would each be preferred?
| Optimizer | Update Rule | Best For |
|---|---|---|
| SGD | w ← w - η∇L | CNNs / vision models (with momentum + LR schedule); often generalizes well with proper tuning |
| Adam | Adaptive per-parameter LR using 1st/2nd moment estimates | Fast convergence; good default for deep learning |
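| AdamW | Adam + decoupled weight decay | Transformer/LLM pre-training and fine-tuning (weight decay is decoupled from the adaptive LR) |
| RMSprop | Adaptive per-parameter LR from a running average of squared gradients (2nd moment only) | RNNs |
AdamW is the default for fine-tuning LLMs. In plain Adam, an L2 penalty is added to the gradient and then rescaled by the adaptive per-parameter step, so it no longer behaves like true weight decay; AdamW applies the decay directly to the weights instead.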
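The difference between Adam with an L2 penalty and AdamW can be seen in a single scalar update. This is a simplified sketch of the update rules, not any framework's implementation; the data gradient is set to zero so that only the regularizer moves the weight, and the hyperparameter values are arbitrary.

```python
from math import sqrt

def adam_step(w, grad, m, v, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8,
              l2=0.0, weight_decay=0.0):
    """One scalar Adam update.

    l2           -> penalty gradient added BEFORE the adaptive scaling (Adam + L2)
    weight_decay -> shrink applied directly to w, outside the scaling (AdamW)
    """
    g = grad + l2 * w
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)          # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)          # bias-corrected second moment
    w = w - lr * m_hat / (sqrt(v_hat) + eps) - lr * weight_decay * w
    return w, m, v

shrink = {}
for w0 in (1.0, 10.0):
    adam_l2, _, _ = adam_step(w0, grad=0.0, m=0.0, v=0.0, t=1, l2=0.01)
    adamw, _, _ = adam_step(w0, grad=0.0, m=0.0, v=0.0, t=1, weight_decay=0.01)
    shrink[w0] = (w0 - adam_l2, w0 - adamw)
    print(w0, shrink[w0])
```

With the L2 penalty routed through the adaptive scaling, the first step shrinks both weights by about the same absolute amount regardless of their size. The decoupled decay shrinks each weight in proportion to its magnitude, which is the behavior weight decay is meant to have.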
Q: What is batch normalization and what problem does it solve?
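Problem (the original motivation): As the network trains, the distribution of activations at each layer shifts with parameter updates, forcing each layer to constantly adapt to a changing input distribution (“internal covariate shift”). Later work suggests much of the practical benefit comes from a smoother optimization landscape.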
BatchNorm: normalize activations within each mini-batch (zero mean, unit variance), then apply learnable γ, β to rescale. Computed per-channel over the batch.
Benefits: enables higher learning rates, acts as mild regularization, reduces sensitivity to initialization.
LayerNorm vs BatchNorm: LLMs use LayerNorm (normalize over the feature dimension within a single sample — batch-independent). BatchNorm is standard for CNNs.
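The difference in normalization axis is easiest to see on a tiny matrix with rows as samples and columns as features. This sketch omits the learnable γ and β and the running statistics BatchNorm uses at inference; the function names and the 3×3 batch are illustrative.

```python
from math import sqrt

def normalize(values, eps=1e-5):
    """Shift to zero mean and scale to (approximately) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / sqrt(var + eps) for v in values]

def batch_norm(batch):
    """Normalize each feature (column) across the samples in the batch."""
    cols = [normalize(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*cols)]

def layer_norm(batch):
    """Normalize each sample (row) across its own features."""
    return [normalize(row) for row in batch]

batch = [[1.0, 200.0, 3.0],
         [2.0, 400.0, 6.0],
         [3.0, 600.0, 9.0]]
bn = batch_norm(batch)
ln = layer_norm(batch)
print(bn[0], ln[0])

# LayerNorm output for a sample does not depend on what else is in the batch.
alone = layer_norm([batch[0]])[0]
```

Because LayerNorm only looks within one sample, it behaves the same at batch size 1 and with variable-length sequences, which is one reason it suits LLMs.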
5. Deep Learning Fundamentals
Q: Explain backpropagation without using the word “gradient.”
Backprop is the chain rule applied to compute how much each parameter contributed to the final loss.
Forward pass: input flows through the network layer by layer → prediction → loss computed.
Backward pass: starting from the loss, work backwards. At each operation, compute: “how much did this input/parameter affect the output?” using the derivative of that operation. Multiply these effects together along every path back to each parameter (chain rule).
The result: for each weight, a value indicating whether increasing that weight increases or decreases the loss. Update weights in the direction that decreases loss.
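The forward and backward passes described above can be written out by hand for a two-parameter network, ŷ = w2 · tanh(w1 · x) with squared-error loss, and checked against finite differences. The network shape and the numeric values are arbitrary choices for the sketch.

```python
from math import tanh

def forward_backward(w1, w2, x, y):
    # Forward pass: keep intermediates for reuse in the backward pass.
    z = w1 * x
    h = tanh(z)
    y_hat = w2 * h
    loss = 0.5 * (y_hat - y) ** 2
    # Backward pass: multiply local derivatives along the path (chain rule).
    d_yhat = y_hat - y              # dloss/dy_hat
    d_w2 = d_yhat * h               # dy_hat/dw2 = h
    d_h = d_yhat * w2               # dy_hat/dh = w2
    d_z = d_h * (1.0 - h * h)       # dtanh(z)/dz = 1 - tanh(z)^2
    d_w1 = d_z * x                  # dz/dw1 = x
    return loss, d_w1, d_w2

def numeric_derivative(f, w, step=1e-6):
    """Central finite difference, used only as an independent check."""
    return (f(w + step) - f(w - step)) / (2 * step)

w1, w2, x, y = 0.5, -1.2, 2.0, 1.0
loss, d_w1, d_w2 = forward_backward(w1, w2, x, y)
check_w1 = numeric_derivative(lambda w: forward_backward(w, w2, x, y)[0], w1)
check_w2 = numeric_derivative(lambda w: forward_backward(w1, w, x, y)[0], w2)
print(d_w1, check_w1, d_w2, check_w2)
```

The finite-difference check needs two extra forward passes per parameter, while the backward pass gets every derivative from one sweep; that efficiency is the whole point of backprop.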
Q: What is the vanishing gradient problem and how do modern architectures address it?
In deep networks, gradients are multiplied through many layers. If each layer multiplies gradients by < 1 (e.g., sigmoid saturates at ~0 gradient in the tails), gradients shrink exponentially → early layers learn very slowly.
Solutions:
- Residual connections (ResNets): x → x + f(x). The identity path passes the gradient through unchanged, bypassing f(x), which gives early layers a gradient highway.
- ReLU activation: Gradient is exactly 1 for positive inputs (doesn’t saturate) and 0 for negative inputs. Replaced sigmoid/tanh in deep nets.
- Layer normalization: Normalizes activations, preventing exploding/vanishing values.
- Careful initialization (He, Xavier): Scale initial weights to maintain variance through layers.
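A back-of-the-envelope sketch of the exponential shrinkage: multiply the per-layer local derivatives along one path through a 20-layer network. It deliberately ignores weight matrices and treats each layer as a scalar, so it only illustrates the activation-function effect.

```python
from math import exp

def sigmoid_prime(z):
    s = 1.0 / (1.0 + exp(-z))
    return s * (1.0 - s)  # peaks at 0.25 when z == 0

def relu_prime(z):
    return 1.0 if z > 0 else 0.0

def backprop_scale(local_derivs):
    """Product of per-layer local derivatives along one path."""
    scale = 1.0
    for d in local_derivs:
        scale *= d
    return scale

depth = 20
sigmoid_best = backprop_scale([sigmoid_prime(0.0)] * depth)        # best case for sigmoid
sigmoid_saturated = backprop_scale([sigmoid_prime(10.0)] * depth)  # units stuck in the tail
relu_active = backprop_scale([relu_prime(1.0)] * depth)            # active ReLU units
# Residual block y = x + f(x): local derivative is 1 + f'(x), so the identity term survives
# even when f is saturated.
residual_saturated = backprop_scale([1.0 + sigmoid_prime(10.0)] * depth)
print(sigmoid_best, sigmoid_saturated, relu_active, residual_saturated)
```

Even in its best case the sigmoid path loses roughly twelve orders of magnitude over 20 layers, while the ReLU and residual paths keep the signal near 1.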
Q: When would you use a CNN vs RNN vs Transformer for sequence data?
| Architecture | Inductive Bias | Best For | Weakness |
|---|---|---|---|
| CNN | Local patterns, translation invariance | Short patterns (text classification), image features | Long-range dependencies |
| RNN/LSTM | Sequential order, recurrent state | Time series with clear temporal structure | Slow training, vanishing gradients at long range |
| Transformer | Global attention (all pairs) | Long-range dependencies, parallel training | O(N²) attention, no built-in position sense |
For most NLP today: Transformer wins. For short time series with clear temporal patterns: RNN still competitive. For image patches → now also Transformer (ViT).
Q: What is transfer learning and when does fine-tuning outperform training from scratch?
Transfer learning: use a model pre-trained on a large dataset as initialization for a downstream task.
Fine-tuning outperforms scratch when:
- Data is limited: Pre-trained weights provide a strong prior (e.g., ImageNet features often transfer to medical images)
- Task is similar: Text classification benefits from BERT’s language understanding
- Compute is limited: Fine-tuning typically costs a small fraction of pre-training compute
Training from scratch when:
- Domain is radically different: Molecular biology, specialized sensor data
- Data is abundant: If you have 100B tokens, pre-training a custom model may outperform general pre-trained models
- Architecture needs differ: Custom input/output structure
6. MLOps & Production ML
Q: What is the ML lifecycle? What breaks in production that tests don’t catch?
Lifecycle: Data collection → Feature engineering → Training → Evaluation → Deployment → Monitoring → Retraining
What breaks in production:
- Data drift: Input distribution changes (seasonality, macro events, user behavior changes)
- Concept drift: P(Y|X) changes — same features, different true labels (e.g., fraud patterns evolve)
- Pipeline failures: Upstream feature computation fails silently → model receives null/stale features
- Training-serving skew: Feature computed differently in training vs serving code
- Model staleness: Model trained 6 months ago on old patterns
Q: How do you detect and handle model drift in production?
Detection:
| Signal | How | Threshold |
|---|---|---|
| PSI (feature drift) | Compare feature distributions train vs current | PSI > 0.2 = significant |
| Score drift | Model’s output distribution shifted | Mean score ± 2σ from baseline |
| Label drift | Actual outcomes drift from predictions | Monitor calibration weekly |
| Covariate shift | Train classifier to distinguish train vs current data | AUC > 0.6 = detectable shift |
Response: Minor drift → recalibrate. Moderate drift → retrain on recent data. Severe drift → rebuild model + features.
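PSI from the detection table can be computed with a few lines of standard-library Python. The sketch assumes fixed bin edges chosen on the training data and a small floor `eps` to avoid log(0); the three samples are synthetic.

```python
from math import log

def psi(expected, actual, edges, eps=1e-6):
    """Population Stability Index between two samples using fixed bin edges."""
    def fractions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[sum(v > e for e in edges)] += 1           # index of the bin v falls in
        return [max(c / len(values), eps) for c in counts]   # eps avoids log(0)
    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * log(ai / ei) for ei, ai in zip(e, a))

train = [i / 100 for i in range(100)]            # roughly uniform on [0, 1)
same = [i / 50 for i in range(50)]               # same shape, fewer points
shifted = [min(v + 0.3, 0.99) for v in train]    # mass moved to the right
edges = [0.25, 0.5, 0.75]

psi_same = psi(train, same, edges)
psi_shifted = psi(train, shifted, edges)
print(psi_same, psi_shifted)
```

The unshifted sample scores near zero, while the shifted one lands far above the 0.2 rule of thumb. In production the same function would run per feature on a schedule, with the training distribution as `expected`.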
Q: What is training-serving skew and how do you prevent it?
Training-serving skew: a feature computed one way during training and differently in production (e.g., age_in_days computed as current_date - birth_date, but in production current_date is UTC while in training it was local time).
Prevention:
- Feature store: Single source of truth for feature computation — both training and serving read from the same feature definitions
- Shared feature computation library: Import the same Python module in training pipeline and serving code
- Shadow mode testing: Before deploying, run new model in parallel with old model; compare feature distributions
- Integration tests: Test feature pipeline with known inputs → known outputs (not just unit tests)
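For the age_in_days example, a shared feature function could pin down the time zone explicitly so training and serving cannot diverge. The function below is a hypothetical sketch of such a shared module, not a prescribed design; the dates are made up.

```python
from datetime import date, datetime, timedelta, timezone

def age_in_days(birth_date: date, as_of: datetime) -> int:
    """Single definition imported by both the training pipeline and the server.

    `as_of` must be timezone-aware; it is converted to UTC before taking the
    date, so local-time callers and UTC callers get the same answer.
    """
    if as_of.tzinfo is None:
        raise ValueError("as_of must be timezone-aware")
    return (as_of.astimezone(timezone.utc).date() - birth_date).days

birth = date(2000, 3, 1)
utc_now = datetime(2024, 3, 1, 2, 30, tzinfo=timezone.utc)
local_now = utc_now.astimezone(timezone(timedelta(hours=-8)))  # same instant, previous day locally

serving_value = age_in_days(birth, utc_now)
training_value = age_in_days(birth, local_now)
naive_local_value = (local_now.date() - birth).days  # the skewed computation
print(serving_value, training_value, naive_local_value)
```

Rejecting naive datetimes outright is a design choice here: failing loudly is easier to debug than a silent one-day offset in a feature.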
Q: What is a feature store? When do you need one?
A feature store is a centralized repository that stores, computes, and serves ML features consistently across training and inference.
Components:
- Offline store: Historical features for training (Spark/SQL on data warehouse)
- Online store: Low-latency feature lookup at inference time (Redis, DynamoDB)
- Feature registry: Metadata, lineage, versioning
Need one when:
- Multiple teams are computing the same features differently
- Features need to be served at < 50ms latency in production
- Training-serving skew is causing model degradation
- Feature reuse across multiple models would save significant engineering time
Early-stage (single model, single team): a simple Pandas/SQL pipeline is fine.
7. ML System Design
Q: Design a recommendation system for a streaming platform.
Requirements: 50M users, 1M content items, latency < 100ms, fresh recommendations daily.
Two-stage architecture:
Stage 1: Candidate Generation (fast, coarse)
├── Collaborative filtering (user-item matrix factorization)
├── Content-based (item embedding similarity)
└── → Top-500 candidates per user (precomputed offline, stored in cache)

Stage 2: Ranking (slow, precise)
├── Features: user context, time of day, device, watch history recency
├── Model: gradient boosted tree or neural ranker
└── → Top-20 recommendations (computed at request time)

Why two stages?
- Scoring all 1M items with a heavy ranker on every request, across 50M users, is computationally infeasible within a 100 ms budget
- Candidate generation reduces to 500 items; ranker can afford deep features on 500
Key features for ranking:
- User watch history (last 10 items’ genres, duration completed)
- Item freshness (hours since release)
- Context (time of day, weekend, device)
- Collaborative signals (similar users’ recent watches)
Evaluation: Offline: NDCG, hit rate. Online: click-through rate, watch time, completion rate.
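The offline metrics named above can be sketched in a few lines. This version of NDCG uses linear gain (the relevance value itself rather than 2^rel − 1), which is one common convention and an assumption here; the relevance lists are invented.

```python
from math import log2

def dcg(relevances):
    """Discounted cumulative gain with a log2 position discount."""
    return sum(rel / log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg_at_k(ranked_relevances, k):
    """DCG of the model's ordering divided by DCG of the ideal ordering."""
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg(ideal[:k])
    return dcg(ranked_relevances[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0

def hit_rate_at_k(ranked_relevances, k):
    """1.0 if any relevant item appears in the top k, else 0.0."""
    return 1.0 if any(r > 0 for r in ranked_relevances[:k]) else 0.0

# Relevance of each recommended item in ranked order (1 = user watched it).
perfect = [1, 1, 0, 0]
mediocre = [0, 1, 0, 1]
print(ndcg_at_k(perfect, 4), ndcg_at_k(mediocre, 4), hit_rate_at_k(mediocre, 1))
```

Both lists contain the same two relevant items; NDCG penalizes the second one only for placing them lower in the ranking.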
Q: How do you design an offline evaluation pipeline for a fraud model before deployment?
Historical transactions (12 months)
↓
Walk-forward splits (monthly):
  Train: months 1–6 → Validate: month 7
  Train: months 1–7 → Validate: month 8
  ...
  Train: months 1–11 → Validate: month 12
↓
Per-fold metrics:
  • AUC-PR (primary) and AUC-ROC
  • Precision@K (top K flagged transactions)
  • Gini coefficient
  • Calibration curve
↓
Stability check:
  • Feature PSI across folds (features shouldn't drift drastically month-over-month)
  • Score distribution stability
↓
Fairness check:
  • Approval/denial rates by demographic group (disparate impact test)
↓
Champion vs challenger:
  • Statistical significance test for AUC improvement
  • Business impact estimate: (TP × avg fraud amount) - (FP × customer service cost)

8. Common Interview Questions & Answers
Q: How do you approach a new ML problem from scratch?
- Understand the business problem: What decision will this model make? What’s the cost of a false positive vs false negative?
- Define the label: What exactly are we predicting? Ensure labels are clean and representative.
- Exploratory data analysis: Class distribution, missing values, feature correlations, temporal patterns.
- Baseline: Simplest model (logistic regression, majority-class predictor). Never skip this.
- Feature engineering: Domain knowledge + statistical analysis.
- Model selection: Start simple, escalate complexity as needed.
- Evaluation: Correct metrics for the problem (not just accuracy).
- Error analysis: Examine failure cases to find systematic patterns.
- Production plan: Monitoring, retraining cadence, fallback.
Q: Explain the difference between precision and recall to a non-technical person. Which is more important for spam filtering?
Precision: “Of all emails I flagged as spam, how many were actually spam?” (Avoid false alarms — blocking a real email is bad)
Recall: “Of all actual spam emails, how many did I catch?” (Avoid missing spam — letting spam through is annoying)
For spam filtering: precision matters more. Missing a spam email is annoying; blocking an important email (missed meeting invite, job offer) damages trust more. High-precision, lower-recall spam filter is preferred.
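The precision/recall trade-off for a spam filter comes down to where the decision threshold sits. The scores, labels, and thresholds below are invented for illustration.

```python
def precision_recall(predicted_spam, actual_spam):
    pairs = list(zip(predicted_spam, actual_spam))
    tp = sum(1 for p, a in pairs if p and a)        # spam correctly flagged
    fp = sum(1 for p, a in pairs if p and not a)    # real email wrongly blocked
    fn = sum(1 for p, a in pairs if not p and a)    # spam that got through
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def classify(scores, threshold):
    return [s >= threshold for s in scores]

scores = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
actual_spam = [True, True, False, True, False, False]

lenient = precision_recall(classify(scores, 0.35), actual_spam)
strict = precision_recall(classify(scores, 0.90), actual_spam)
print("lenient:", lenient)
print("strict: ", strict)
```

The lenient threshold catches all spam but blocks one real email; the strict one blocks no real email at the cost of letting some spam through, which is the trade a spam filter usually prefers.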
Q: Why would a model perform well on a test set but fail in production?
Five main causes:
- Distribution shift: Test set drawn from same distribution as train (historical data), but production data is different (new market segment, post-COVID behavior)
- Data leakage in evaluation: Test set contaminated with train-time information
- Incorrect evaluation metric: Test metric doesn’t reflect production cost function
- Non-representative test set: Test set too small, wrong time period, or curated (cherry-picked)
- Training-serving skew: Feature computation differs between offline and online environments
Quick-Reference: Algorithm Cheatsheet
| Algorithm | Type | Pros | Cons | Hyperparams |
|---|---|---|---|---|
| Logistic Regression | Classification | Interpretable, fast, calibrated | Linear boundary only | C (regularization) |
| Random Forest | Ensemble | Robust, handles nonlinearity, OOB estimate | Slow inference and large memory footprint with many deep trees | n_estimators, max_depth, max_features |
| XGBoost | Ensemble | Best tabular performance | Many hyperparams to tune | n_estimators, learning_rate, max_depth, subsample |
| LightGBM | Ensemble | Faster than XGBoost, good for large data | Less interpretable | Similar to XGBoost |
| SVM (RBF) | Kernel method | Non-linear boundary, effective high-d | Slow for large N, hard to tune | C, γ |
| k-NN | Instance-based | Simple, no training | Slow inference, sensitive to scale | k, distance metric |
| Neural Network | Deep learning | Universal approximator | Data hungry, black box | Architecture, LR, batch size, regularization |