Quantitative Methods
The Core Idea
Linear regression models the relationship between a response variable and one or more predictors as a linear function, then estimates that function from data by minimising prediction error. It is the workhorse of quantitative analytics — directly useful for modelling returns, risk factors, and economic relationships, and the conceptual foundation for nearly every more complex method.
Part 1: Ordinary Least Squares (OLS)
The Model
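The population regression model assumes:

$$y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \varepsilon_i$$

In matrix form with $n$ observations and $p$ predictors:

$$y = X\beta + \varepsilon$$

Where:
- $y \in \mathbb{R}^{n}$: response vector
- $X \in \mathbb{R}^{n \times (p+1)}$: design matrix (first column is a vector of ones for the intercept)
- $\beta \in \mathbb{R}^{p+1}$: coefficient vector (what we estimate)
- $\varepsilon \in \mathbb{R}^{n}$: error vector (unobservable)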
OLS Derivation
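OLS minimises the residual sum of squares:

$$\text{RSS}(\beta) = (y - X\beta)^\top (y - X\beta)$$

Taking the derivative with respect to $\beta$ and setting it to zero:

$$\frac{\partial\, \text{RSS}}{\partial \beta} = -2 X^\top (y - X\beta) = 0$$

This gives the normal equations:

$$X^\top X \hat{\beta} = X^\top y$$

Solving (when $X^\top X$ is invertible):

$$\hat{\beta} = (X^\top X)^{-1} X^\top y$$

This is the OLS estimator. It has a closed-form solution, so no iteration is required.
Intuition: OLS projects $y$ orthogonally onto the column space of $X$. The fitted values $\hat{y} = X\hat{\beta}$ are the point in that column space closest to $y$ under the Euclidean norm.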
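A minimal NumPy sketch of the closed-form estimator and the projection intuition. The function name `ols_fit` and the synthetic data are illustrative assumptions, not part of the original text; the normal equations are solved with `np.linalg.solve` rather than an explicit inverse.

```python
import numpy as np

def ols_fit(X, y):
    """OLS coefficients for a design matrix X (intercept column included) and response y."""
    # Solve the normal equations X'X b = X'y without forming an explicit inverse.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Synthetic data (illustrative): y = 1 + 2*x1 - 3*x2 + noise
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=(n, 2))
X = np.column_stack([np.ones(n), x])      # first column of ones = intercept
beta_true = np.array([1.0, 2.0, -3.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

beta_hat = ols_fit(X, y)
residuals = y - X @ beta_hat
# Orthogonal projection: residuals are orthogonal to every column of X.
orthogonality = X.T @ residuals
```

The entries of `orthogonality` should be zero up to floating-point error, which is the projection property in numerical form.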
The Gauss-Markov Theorem
Under the assumptions below, OLS is the Best Linear Unbiased Estimator (BLUE): it has the smallest variance among all linear unbiased estimators. Normality is listed for completeness; it is needed for exact finite-sample inference, not for the BLUE result.
| Assumption | Statement | What breaks it |
|---|---|---|
| L Linearity | $y = X\beta + \varepsilon$ is correctly specified | Omitted variables, wrong functional form |
| I Independence | Observations are independent | Time series autocorrelation, clustered data |
| H Homoscedasticity | $\operatorname{Var}(\varepsilon_i \mid X) = \sigma^2$ (constant) | Volatility clustering in financial returns |
| N Normality | $\varepsilon \mid X \sim \mathcal{N}(0, \sigma^2 I)$ | Fat tails, outliers (needed for exact inference, not BLUE) |
| E Exogeneity | $\mathbb{E}[\varepsilon \mid X] = 0$ | |
When these hold, $\hat{\beta}$ is unbiased ($\mathbb{E}[\hat{\beta}] = \beta$) and efficient.
Part 2: Inference and Model Evaluation
Coefficient Standard Errors
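The variance-covariance matrix of $\hat{\beta}$ under homoscedasticity:

$$\operatorname{Var}(\hat{\beta} \mid X) = \sigma^2 (X^\top X)^{-1}$$

Since $\sigma^2$ is unknown, replace it with the unbiased estimator:

$$\hat{\sigma}^2 = \frac{\text{RSS}}{n - p - 1}$$

Here $n - p - 1$ is the residual degrees of freedom ($n$ observations, $p$ predictors plus an intercept). The standard error of $\hat{\beta}_j$ is $\text{SE}(\hat{\beta}_j) = \sqrt{\hat{\sigma}^2 \left[(X^\top X)^{-1}\right]_{jj}}$.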
Hypothesis Testing
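t-test for individual coefficients:

$$t_j = \frac{\hat{\beta}_j}{\text{SE}(\hat{\beta}_j)} \sim t_{n-p-1} \quad \text{under } H_0: \beta_j = 0$$

F-test for joint significance:

$$F = \frac{(\text{RSS}_R - \text{RSS}_U)/q}{\text{RSS}_U/(n - p - 1)} \sim F_{q,\, n-p-1}$$

Where $q$ is the number of restrictions, and $\text{RSS}_R$ and $\text{RSS}_U$ are the residual sums of squares of the restricted and unrestricted models. Tests whether a group of coefficients are jointly zero.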
Goodness of Fit
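R² (coefficient of determination):

$$R^2 = 1 - \frac{\text{RSS}}{\text{TSS}}, \qquad \text{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

R² measures the fraction of total variance in $y$ explained by the model. It never decreases when you add predictors, regardless of whether they’re useful.
Adjusted R² penalises for adding irrelevant predictors:

$$\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}$$

AIC and BIC are information criteria for model selection:

$$\text{AIC} = 2k - 2\ln\hat{L}, \qquad \text{BIC} = k\ln n - 2\ln\hat{L}$$

Where $k$ is the number of estimated parameters and $\hat{L}$ is the maximised likelihood. Lower is better. BIC penalises complexity more heavily than AIC (whenever $\ln n > 2$, i.e. $n \geq 8$) and tends to select simpler models.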
Confidence Intervals
A 95% confidence interval for $\beta_j$:

$$\hat{\beta}_j \pm t_{0.975,\, n-p-1} \cdot \text{SE}(\hat{\beta}_j)$$

Interpretation: If you repeated the experiment many times and computed this interval each time, 95% of intervals would contain the true $\beta_j$. It is NOT “95% probability that the true parameter is in this interval” (that’s the Bayesian credible interval).
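The inference quantities in this part (standard errors, t-statistics, R² and adjusted R²) can be computed directly from the fitted residuals. The sketch below assumes homoscedastic errors; the helper name `ols_inference` and the synthetic data are hypothetical.

```python
import numpy as np

def ols_inference(X, y):
    """Coefficients, standard errors, t-stats and fit measures for OLS.

    X must already contain the intercept column, so k = p + 1 coefficients.
    """
    n, k = X.shape
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta
    rss = resid @ resid
    sigma2 = rss / (n - k)                      # unbiased error-variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)       # Var(beta_hat | X)
    se = np.sqrt(np.diag(cov))
    tss = np.sum((y - y.mean()) ** 2)
    r2 = 1 - rss / tss
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k)
    return {"beta": beta, "se": se, "t": beta / se, "r2": r2, "adj_r2": adj_r2}

# Synthetic data: only the first predictor matters.
rng = np.random.default_rng(0)
n = 300
x = rng.normal(size=(n, 2))
X = np.column_stack([np.ones(n), x])
y = 0.5 + 2.0 * x[:, 0] + rng.normal(size=n)
res = ols_inference(X, y)
```

A confidence interval then follows as `res["beta"] ± t_crit * res["se"]`, with the critical value taken from a t-table or a statistics library.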
Part 3: Regression Diagnostics
Diagnostics test whether the Gauss-Markov assumptions hold. Violating them doesn’t always invalidate the regression — but it changes what you can conclude.
Residual Analysis
Always start with residual plots:
- Residuals vs. fitted values: should be random scatter. Patterns indicate heteroscedasticity or non-linearity.
- Q-Q plot: residuals vs. theoretical normal quantiles. Deviations at tails indicate non-normality (common in financial data).
- Scale-location plot: $\sqrt{|\text{standardised residuals}|}$ vs. fitted values. Increasing spread = heteroscedasticity.
- Residuals vs. time: for time-ordered data. Patterns indicate autocorrelation.
Each pattern tells you something different and has a different fix, as the following subsections describe.
Heteroscedasticity
When $\operatorname{Var}(\varepsilon_i)$ varies across observations, OLS is still unbiased but no longer efficient. The usual standard errors are wrong, so t-tests and confidence intervals are invalid.
Detection:
- Breusch-Pagan test — regress squared residuals on the predictors. Significant F-stat = heteroscedasticity.
- White test — more general, includes squared terms and cross-products.
Fixes:
- Heteroscedasticity-consistent (HC) standard errors (White standard errors): correct the standard errors without changing $\hat{\beta}$.
- Weighted Least Squares (WLS): weight observations by the inverse of their error variance when the variance structure is known.
- Generalised Least Squares (GLS): the general fix when the error covariance structure $\Omega = \operatorname{Var}(\varepsilon \mid X)$ is known:

$$\hat{\beta}_{\text{GLS}} = (X^\top \Omega^{-1} X)^{-1} X^\top \Omega^{-1} y$$
Autocorrelation
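When errors are correlated across time ($\operatorname{Cov}(\varepsilon_t, \varepsilon_{t-k}) \neq 0$ for some $k \neq 0$), OLS standard errors are typically too small (the usual case of positive autocorrelation), so you over-reject the null.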
Detection:
- Durbin-Watson statistic — tests for first-order autocorrelation. DW ≈ 2 means no autocorrelation; DW < 2 means positive autocorrelation (very common in financial time series).
- Ljung-Box Q-test — tests for autocorrelation at multiple lags simultaneously.
- ACF/PACF plots of residuals — visual inspection of autocorrelation structure.
Fixes:
- Newey-West standard errors (HAC — heteroscedasticity and autocorrelation consistent)
- Explicitly model the autocorrelation structure (ARIMA residuals)
- Quasi-difference the data using an estimated AR(1) error coefficient (Cochrane-Orcutt), or include a lagged dependent variable as a predictor
Multicollinearity
When predictors are highly correlated, $X^\top X$ is close to singular and $\hat{\beta}$ becomes unstable. Coefficients can have large standard errors and even the wrong sign, so individual coefficients can’t be trusted even when the overall fit is good.
Detection:
- Variance Inflation Factor (VIF): $\text{VIF}_j = \frac{1}{1 - R_j^2}$, where $R_j^2$ is the R² from regressing $x_j$ on all other predictors. VIF > 10 (or > 5 conservatively) indicates a problem.
- Condition number of $X$: above 30 indicates moderate, above 100 indicates severe multicollinearity.
- Correlation matrix: pairwise correlations above 0.8 are a warning sign.
Fixes: Ridge regression (shrinks coefficients), PCA regression (transforms to orthogonal predictors), removing one of the collinear variables.
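As a rough illustration of the VIF check, the sketch below regresses each predictor on the others and converts the resulting R² into a VIF. The function name `vif` and the simulated predictors are assumptions made for the example.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (predictors only, no intercept)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # Regress predictor j on an intercept plus all other predictors.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = x1 + 0.1 * rng.normal(size=500)   # nearly collinear with x1
x3 = rng.normal(size=500)              # unrelated to the others
vifs = vif(np.column_stack([x1, x2, x3]))
```

In this setup the first two VIFs should be far above 10 while the third stays close to 1.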
Influential Observations
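Leverage $h_{ii}$, the $i$-th diagonal element of the hat matrix $H = X(X^\top X)^{-1}X^\top$, measures how far an observation’s $x$-values are from the mean. High-leverage points have the potential for outsized influence on the regression line; leverage itself does not depend on the $y$-value.
Cook’s Distance combines leverage and residual size into a single influence measure:

$$D_i = \frac{e_i^2}{(p+1)\,\hat{\sigma}^2} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}$$

Where $e_i$ is the $i$-th residual. $D_i > 1$ is a common threshold for “influential.” Examine these observations: data errors, legitimate outliers, or regime changes.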
Diagnostic Decision Tree

    flowchart TD
        Fit[Fit OLS] --> RvF[Plot residuals vs fitted]
        RvF --> Q1{Fan / funnel shape?}
        Q1 -->|Yes| A1[Heteroscedasticity\nFix: HC errors or WLS]
        Q1 -->|No| Q2{Curved or U-shape?}
        Q2 -->|Yes| A2[Non-linearity\nFix: polynomial or log transform]
        Q2 -->|No| Q3{Wave or drift over time?}
        Q3 -->|Yes| A3[Autocorrelation\nFix: HAC errors or ARIMA residuals]
        Q3 -->|No| VIF[Check VIF for each predictor]
        VIF --> Q4{Any VIF above 10?}
        Q4 -->|Yes| A4[Multicollinearity\nFix: Ridge regression or drop variable]
        Q4 -->|No| Cook[Compute Cook's D]
        Cook --> Q5{Any D above 1?}
        Q5 -->|Yes| A5[Influential observation\nFix: investigate or robust regression]
        Q5 -->|No| Done[OLS assumptions satisfied]

Part 4: Variable Transformations
Transformations serve two purposes: fixing violated OLS assumptions (non-normality, heteroscedasticity, non-linearity) and changing how coefficients are interpreted. Applying the wrong transformation — or not applying one when needed — is a common source of misleading models.
Dependent Variable Transformations
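Log transformation: $y^* = \log(y)$
Use when $y$ is strictly positive, right-skewed, or when variance grows with the mean (common in income, prices, exposure).
Interpretation: A one-unit increase in $x$ multiplies $y$ by $e^{\beta_1}$, or equivalently, changes $y$ by approximately $100 \cdot \beta_1\%$ (the approximation is accurate only for small $\beta_1$).
When to use: residuals fan outward (heteroscedasticity), $y$ is a monetary amount or count that can’t be negative.
Watch out: $\log(0)$ is undefined, so you need $y > 0$. A common fix is $\log(1 + y)$ or $\log(y + c)$ for small $c$.
Retransforming to the original scale: $\exp(\widehat{\log y})$ is biased downward as an estimate of $\mathbb{E}[y \mid x]$. The smearing estimator (Duan, 1983) corrects this:

$$\hat{y} = \exp(x^\top \hat{\beta}) \cdot \frac{1}{n} \sum_{i=1}^{n} \exp(\hat{\varepsilon}_i)$$

Square-root transformation: $y^* = \sqrt{y}$
Gentler than log; useful for count data or moderate right skew. Variance-stabilising for Poisson-distributed outcomes (where $\operatorname{Var}(y) = \mathbb{E}[y]$).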
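Box-Cox transformation
A family of power transformations parameterised by $\lambda$:

$$y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda} & \lambda \neq 0 \\ \log(y) & \lambda = 0 \end{cases}$$

Special cases: $\lambda = 1$ is no transformation, $\lambda = 0$ is log, $\lambda = 0.5$ is square root, $\lambda = -1$ is reciprocal.
Estimate $\lambda$ by maximum likelihood: the optimal $\lambda$ maximises the log-likelihood under the assumption that the transformed residuals are normal. In Python: `scipy.stats.boxcox(y)`.
Limitation: Requires $y > 0$. Yeo-Johnson extends this to allow $y \leq 0$.
Logit transformation: $y^* = \log\left(\frac{y}{1 - y}\right)$
For bounded outcomes $y \in (0, 1)$ such as rates or proportions. Expands the bounded range to $(-\infty, \infty)$, making OLS applicable. The fractional logit model is an alternative that avoids retransformation.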
Independent Variable Transformations
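Log transformation: $x^* = \log(x)$
Use when the relationship between $x$ and $y$ is concave (diminishing returns) or when $x$ spans several orders of magnitude (income, asset size, market cap).
Interpretation depends on the model form:
| Model | Equation | Interpretation of $\beta_1$ |
|---|---|---|
| Linear-linear | $y = \beta_0 + \beta_1 x$ | $y$ changes by $\beta_1$ per unit increase in $x$ |
| Log-linear | $\log y = \beta_0 + \beta_1 x$ | $\approx 100 \cdot \beta_1\%$ change in $y$ per unit increase in $x$ |
| Linear-log | $y = \beta_0 + \beta_1 \log x$ | $y$ changes by $\approx \beta_1 / 100$ per 1% increase in $x$ |
| Log-log | $\log y = \beta_0 + \beta_1 \log x$ | $\beta_1$ is the elasticity: 1% increase in $x$ → $\beta_1\%$ change in $y$ |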
The log-log form is widely used in economics and finance because elasticities are unit-free and directly comparable across variables.
Polynomial features
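Include $x^2$, $x^3$ to capture non-linear relationships while staying within the OLS framework:

$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon$$

The marginal effect is $\frac{\partial y}{\partial x} = \beta_1 + 2\beta_2 x$, so it varies with $x$. The turning point is at $x^* = -\frac{\beta_1}{2\beta_2}$.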
Caution: High-degree polynomials overfit at the extremes. Splines or piecewise linear functions are more robust.
Standardisation vs normalisation
Standardisation (z-score): $z = \frac{x - \bar{x}}{s_x}$. After standardising, $\beta_j$ is interpreted as the effect of a one-standard-deviation increase in $x_j$. Makes coefficients directly comparable across predictors with different units. Required before applying Ridge or Lasso (otherwise the penalty treats variables unequally).
Min-max normalisation: $x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$. Scales to $[0, 1]$. Sensitive to outliers. Rarely used for regression; more common in ML preprocessing.
Standardisation does not change $R^2$, $t$-statistics, or $p$-values of the slope coefficients; it changes only the scale of $\hat{\beta}$.
Categorical Variables: Dummy Encoding
A categorical variable with $k$ levels is encoded as $k - 1$ binary (0/1) dummy variables. The omitted level is the reference category; all coefficients are interpreted relative to it.
Example: Credit rating with levels AAA, AA, A, BBB → create dummies for AA, A, BBB; AAA is the reference.

$$y = \beta_0 + \beta_1 D_{\text{AA}} + \beta_2 D_{\text{A}} + \beta_3 D_{\text{BBB}} + \gamma x + \varepsilon$$

$\beta_1$ is the average difference in $y$ between AA and AAA, holding $x$ constant.
Dummy variable trap: Including all $k$ dummies creates perfect multicollinearity with the intercept (they sum to 1). Always use $k - 1$ dummies when the model has an intercept. Many formula interfaces handle this automatically, but be aware if constructing features manually.
Ordered categories: For ordinal variables (e.g., credit grades with a natural ranking), a single integer encoding can be appropriate if the spacing is roughly equal. Dummy encoding is safer when spacing is unequal.
Interaction Terms
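An interaction term $x_1 x_2$ allows the effect of $x_1$ to depend on the level of $x_2$:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon$$

The marginal effect of $x_1$ is $\beta_1 + \beta_3 x_2$, so it varies with $x_2$.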
When to include interactions:
- Theory suggests the effect of one variable depends on another (e.g., income × age in credit scoring)
- Residual plots show systematic patterns that disappear after adding the interaction
- You want to test whether a relationship differs across groups (equivalent to separate slopes)
Hierarchy principle: If you include an interaction $x_1 x_2$, include the main effects $x_1$ and $x_2$ too, even if their main-effect coefficients are insignificant. Omitting them changes the interpretation of the interaction.
Choosing the Right Transformation
| Symptom | Likely fix |
|---|---|
| Right-skewed $y$, variance grows with mean | $\log(y)$ |
| $y$ is a count (Poisson) | $\sqrt{y}$ or Poisson regression |
| $y$ is a proportion in $(0, 1)$ | Logit or fractional logit |
| $y$ is continuous, optimal $\lambda$ unknown | Box-Cox |
| Residuals show a curve (concave/convex) | $\log(x)$ or add $x^2$ |
| $x$ spans orders of magnitude | $\log(x)$ |
| Predictors on different scales (for regularisation) | Standardise all $x_j$ |
| Non-linear group differences | Interaction terms |
Part 5: Regularised Regression
When predictors are numerous or collinear, OLS over-fits. Regularisation adds a penalty term to the loss function, shrinking coefficients toward zero.
Ridge Regression (L2)
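Ridge adds an L2 penalty on the coefficients to the residual sum of squares:

$$\hat{\beta}_{\text{ridge}} = \arg\min_{\beta} \; \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2$$

Closed-form solution:

$$\hat{\beta}_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y$$

Adding $\lambda I$ (with $\lambda > 0$) makes the matrix invertible even under perfect multicollinearity. Ridge shrinks all coefficients toward zero but never exactly to zero, so it does not perform variable selection. Choose $\lambda$ via cross-validation.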
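To make the invertibility point concrete, here is a small sketch of the Ridge closed form applied to two identical predictors, a case where OLS has no unique solution. The name `ridge_fit` is hypothetical, and the example assumes centred predictors with no intercept.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge coefficients via the closed form (X'X + lam*I)^{-1} X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(2)
x1 = rng.normal(size=100)
X = np.column_stack([x1, x1])              # two identical columns: X'X is singular
y = 3.0 * x1 + rng.normal(scale=0.1, size=100)

beta_ridge = ridge_fit(X, y, lam=1.0)      # well-defined despite perfect collinearity
```

Ridge splits the effect evenly between the two identical columns, so each coefficient lands near 1.5 and their sum is close to the true slope of 3 (slightly shrunk).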
Lasso Regression (L1)
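$$\hat{\beta}_{\text{lasso}} = \arg\min_{\beta} \; \|y - X\beta\|_2^2 + \lambda \|\beta\|_1$$

The L1 penalty produces sparse solutions: it drives some coefficients exactly to zero. Lasso therefore does automatic variable selection. There is no closed-form solution in general (it is solved with coordinate descent or the LARS algorithm).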
Elastic Net
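Combines L1 and L2:

$$\hat{\beta}_{\text{EN}} = \arg\min_{\beta} \; \|y - X\beta\|_2^2 + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2$$

Useful when predictors number in the thousands (genomics, factor zoo in finance) and many are correlated: Lasso tends to select only one variable from a correlated group, whereas Elastic Net can include all of them with reduced coefficients.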
| Method | Penalty | Selects variables? | Handles multicollinearity? |
|---|---|---|---|
| OLS | None | No | No |
| Ridge | $\lambda \lVert \beta \rVert_2^2$ | No | Yes |
| Lasso | $\lambda \lVert \beta \rVert_1$ | Yes | Partially |
| Elastic Net | Both | Yes | Yes |
As $\lambda$ increases from 0, Ridge shrinks all coefficients smoothly toward (but never to) zero. Lasso drives coefficients to exactly zero at different thresholds, which is what makes it an automatic variable selector.
Quantile Regression
Standard OLS estimates the conditional mean of $y$ given $X$. Quantile regression estimates any conditional quantile: the median, the 5th percentile, the 95th percentile.
Minimises the asymmetric loss function (pinball loss):

$$\hat{\beta}_\tau = \arg\min_{\beta} \sum_{i=1}^{n} \rho_\tau\left(y_i - x_i^\top \beta\right)$$

Where $\rho_\tau(u) = u\,(\tau - \mathbb{1}\{u < 0\})$ and $\tau \in (0, 1)$ is the quantile.
Why it matters in finance: Asset returns have fat tails. OLS ignores tail behaviour. Quantile regression at $\tau = 0.05$ directly models Value-at-Risk; at $\tau = 0.95$ it models upside potential. No normality assumption required.
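The link between the pinball loss and quantiles can be seen in the simplest case, a model with only a constant: the constant that minimises the loss is the sample quantile. The sketch below uses a grid search for transparency; the data, the grid, and the name `pinball_loss` are illustrative assumptions.

```python
import numpy as np

def pinball_loss(u, tau):
    """Mean pinball loss: rho_tau(u) = u * (tau - 1{u < 0})."""
    u = np.asarray(u, dtype=float)
    return np.mean(u * (tau - (u < 0)))

rng = np.random.default_rng(3)
returns = rng.standard_t(df=4, size=2000)   # fat-tailed synthetic "returns"
tau = 0.05

# Intercept-only quantile regression: search constants c minimising the loss.
grid = np.linspace(-6.0, 2.0, 1601)
losses = [pinball_loss(returns - c, tau) for c in grid]
c_star = grid[int(np.argmin(losses))]
empirical_q = np.quantile(returns, tau)
```

`c_star` should agree with `empirical_q` up to the grid resolution, which is the sense in which quantile regression at $\tau = 0.05$ targets the 5% tail.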
Part 6: Panel Data and Fixed Effects
Panel data has both a cross-sectional dimension ($N$, e.g., stocks) and a time dimension ($T$). Pooled OLS ignores the panel structure.
The panel model:

$$y_{it} = \alpha_i + x_{it}^\top \beta + \varepsilon_{it}$$

Where $\alpha_i$ is an individual fixed effect: a time-invariant, unit-specific unobservable (e.g., a company’s management quality).
Fixed Effects (Within) Estimator
Demean each variable within its unit:

$$\ddot{y}_{it} = y_{it} - \bar{y}_i, \qquad \ddot{x}_{it} = x_{it} - \bar{x}_i$$

Then regress $\ddot{y}_{it}$ on $\ddot{x}_{it}$. This eliminates $\alpha_i$ entirely, so fixed effects are controlled for regardless of whether they’re correlated with $x_{it}$ (no endogeneity from time-invariant confounders).
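A small simulation can show why demeaning matters. In the sketch below the unit effect is deliberately correlated with the regressor, so pooled OLS is biased while the within estimator recovers the slope. All names (`within_estimator`, the simulated panel) are hypothetical, and only a single regressor is handled.

```python
import numpy as np

def within_estimator(y, x, unit):
    """Fixed-effects slope for one regressor: demean within each unit, then OLS."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    y_dm, x_dm = y.copy(), x.copy()
    for u in np.unique(unit):
        mask = unit == u
        y_dm[mask] -= y[mask].mean()
        x_dm[mask] -= x[mask].mean()
    return (x_dm @ y_dm) / (x_dm @ x_dm)

rng = np.random.default_rng(4)
N, T = 50, 20
unit = np.repeat(np.arange(N), T)
alpha = rng.normal(scale=3.0, size=N)                 # unit fixed effects
x = alpha[unit] + rng.normal(size=N * T)              # regressor correlated with alpha
y = alpha[unit] + 2.0 * x + rng.normal(size=N * T)    # true slope is 2

beta_fe = within_estimator(y, x, unit)
beta_pooled = np.polyfit(x, y, 1)[0]                  # pooled OLS slope, biased upward here
```

Here `beta_fe` should sit near 2, while `beta_pooled` absorbs part of the fixed effect and overshoots.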
Random Effects
Assumes $\alpha_i \sim \text{i.i.d.}(0, \sigma_\alpha^2)$ and $\operatorname{Cov}(\alpha_i, x_{it}) = 0$. More efficient than fixed effects when the assumption holds, but biased when it doesn’t.
Hausman test: tests whether random effects is consistent (i.e., whether $\alpha_i$ is uncorrelated with $x_{it}$). Significant → use fixed effects. Not significant → random effects is valid and more efficient.
Part 7: Time Series
OLS assumes independent observations. Financial time series violates this — returns and prices are autocorrelated. Time series methods model the temporal dependence explicitly.
Stationarity
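A time series $y_t$ is weakly stationary if:
- $\mathbb{E}[y_t] = \mu$ (constant mean)
- $\operatorname{Var}(y_t) = \sigma^2 < \infty$ (constant variance)
- $\operatorname{Cov}(y_t, y_{t-k})$ depends only on $k$, not on $t$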
Non-stationary series (trending prices, unit root processes) produce spurious regressions — high R² and significant t-stats between unrelated variables.
Testing for stationarity:
- ADF (Augmented Dickey-Fuller) test: $H_0$: unit root (non-stationary). Reject $H_0$ = stationary.
- KPSS test: $H_0$: stationary. Reject $H_0$ = non-stationary.
- Run both: if ADF rejects and KPSS doesn’t reject, strong evidence of stationarity.
Transformations to achieve stationarity:
- First-difference: $\Delta y_t = y_t - y_{t-1}$ (removes trend)
- Log transformation: stabilises variance
- Log-difference: $\Delta \log P_t = \log(P_t / P_{t-1})$, the log return in finance, typically stationary
ACF and PACF
Autocorrelation function (ACF): $\rho_k = \operatorname{Corr}(y_t, y_{t-k})$ is the correlation between the series and its $k$-period lag. It decays gradually (geometrically) for AR processes and cuts off sharply for MA processes.
Partial autocorrelation function (PACF): correlation between $y_t$ and $y_{t-k}$ after removing the effects of the intermediate lags $y_{t-1}, \dots, y_{t-k+1}$. Cuts off sharply for AR processes, decays slowly for MA.
Use ACF/PACF plots to identify model order before fitting ARIMA.
For an AR(1) process, the ACF decays geometrically (never cuts off), while the PACF has a single spike at lag 1 then drops to zero. This is the diagnostic signature of a pure AR(1).
| ACF | PACF | Model |
|---|---|---|
| Geometric decay | Cuts off at lag p | AR(p) |
| Cuts off at lag q | Geometric decay | MA(q) |
| Decays gradually | Decays gradually | ARMA(p,q) |
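The AR(1) signature in the table can be checked by simulation. The sketch below simulates an AR(1) series, computes the sample ACF by hand, and derives the lag-2 partial autocorrelation from the first two autocorrelations. The coefficient 0.7, the sample size, and the helper `sample_acf` are illustrative choices, not values from the text.

```python
import numpy as np

def sample_acf(y, max_lag):
    """Sample autocorrelations rho_0..rho_max_lag of a series."""
    y = np.asarray(y, dtype=float) - np.mean(y)
    denom = y @ y
    return np.array([1.0] + [(y[k:] @ y[:-k]) / denom for k in range(1, max_lag + 1)])

rng = np.random.default_rng(5)
phi, n = 0.7, 5000                      # illustrative AR(1) coefficient and length
eps = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = phi * y[t - 1] + eps[t]

acf = sample_acf(y, max_lag=5)
# Lag-2 partial autocorrelation implied by rho_1 and rho_2.
pacf2 = (acf[2] - acf[1] ** 2) / (1 - acf[1] ** 2)
```

The ACF values should fall off roughly like $0.7^k$, while `pacf2` should be close to zero, matching the AR(p) row with $p = 1$.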
ARIMA
AR(p), autoregressive:

$$y_t = c + \sum_{i=1}^{p} \phi_i y_{t-i} + \varepsilon_t$$

Current value is a linear function of past values. ACF decays geometrically; PACF cuts off at lag $p$.
MA(q), moving average:

$$y_t = \mu + \varepsilon_t + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j}$$

Current value is a linear combination of current and past shocks. ACF cuts off at lag $q$; PACF decays geometrically.
ARIMA(p, d, q): Apply AR($p$) and MA($q$) to the $d$-times differenced series. $d = 1$ handles a linear trend; $d = 2$ handles a quadratic trend.
Model selection:

    flowchart TD
        Raw[Raw time series] --> ADF{ADF + KPSS test\nStationary?}
        ADF -->|No| Diff[First-difference the series\nRepeat until stationary]
        Diff --> ADF
        ADF -->|Yes| Plots[Plot ACF and PACF\nof stationary series]
        Plots --> AR{PACF cuts off\nat lag p?}
        Plots --> MA{ACF cuts off\nat lag q?}
        AR -->|Yes| ARm[Include AR terms]
        MA -->|Yes| MAm[Include MA terms]
        ARm --> Fit[Fit ARIMA candidates]
        MAm --> Fit
        Fit --> IC[Compare AIC and BIC]
        IC --> Diag[Ljung-Box Q-test\non residuals]
        Diag -->|Autocorrelation remains| Fit
        Diag -->|White noise| Done[Final model selected]

GARCH (Volatility Modelling)
Financial return series exhibit volatility clustering — large moves follow large moves. ARIMA models the conditional mean; GARCH models the conditional variance.
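GARCH(1,1):

$$\sigma_t^2 = \omega + \alpha\,\varepsilon_{t-1}^2 + \beta\,\sigma_{t-1}^2$$

Where $\sigma_t^2$ is today’s conditional variance, $\varepsilon_{t-1}^2$ is yesterday’s squared shock (ARCH term), and $\sigma_{t-1}^2$ is yesterday’s variance (GARCH term). Stationarity requires $\alpha + \beta < 1$.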
GARCH is the standard model for VaR, option pricing (implied vol dynamics), and risk management.
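A short simulation makes the GARCH(1,1) recursion and volatility clustering tangible. This is a sketch with Gaussian shocks and arbitrarily chosen parameters ($\omega = 0.05$, $\alpha = 0.10$, $\beta = 0.85$); the function name `simulate_garch11` is hypothetical and no estimation is attempted.

```python
import numpy as np

def simulate_garch11(omega, alpha, beta, n, seed=0):
    """Simulate shocks eps_t and conditional variances sigma2_t from a GARCH(1,1)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n)
    sigma2 = np.empty(n)
    eps = np.empty(n)
    sigma2[0] = omega / (1 - alpha - beta)      # start at the unconditional variance
    eps[0] = np.sqrt(sigma2[0]) * z[0]
    for t in range(1, n):
        sigma2[t] = omega + alpha * eps[t - 1] ** 2 + beta * sigma2[t - 1]
        eps[t] = np.sqrt(sigma2[t]) * z[t]
    return eps, sigma2

omega, alpha, beta = 0.05, 0.10, 0.85           # alpha + beta < 1: stationary
eps, sigma2 = simulate_garch11(omega, alpha, beta, n=20000)
uncond_var = omega / (1 - alpha - beta)
# Volatility clustering: squared shocks are positively autocorrelated.
sq_autocorr = np.corrcoef(eps[1:] ** 2, eps[:-1] ** 2)[0, 1]
```

The sample variance of `eps` should hover around `uncond_var`, and `sq_autocorr` should be clearly positive even though the shocks themselves are serially uncorrelated.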
VAR (Vector Autoregression)
Extends AR to multiple time series, each equation regressing on lags of all variables:

$$y_t = c + A_1 y_{t-1} + \dots + A_p y_{t-p} + \varepsilon_t$$

Here $y_t$ is a vector of variables and each $A_i$ is a coefficient matrix. Useful for modelling interdependencies between variables (e.g., macro factors). Key tools:
- Granger causality: does $x$ help predict $y$ beyond $y$’s own history?
- Impulse response functions (IRF): trace the effect of a shock in one variable through the system over time
- Forecast error variance decomposition (FEVD): what fraction of variable $i$’s forecast error variance is attributable to shocks from variable $j$?
Cointegration
Two non-stationary series $x_t$ and $y_t$ are cointegrated if there exists a linear combination $y_t - \beta x_t$ that is stationary: they share a common stochastic trend and move together in the long run.
Engle-Granger test: Regress $y_t$ on $x_t$; test residuals for stationarity. If stationary, the series are cointegrated with cointegrating vector $(1, -\hat{\beta})$.
Error Correction Model (ECM): When series are cointegrated, model short-run dynamics and long-run equilibrium together:

$$\Delta y_t = \alpha\,(y_{t-1} - \beta x_{t-1}) + \gamma\,\Delta x_t + \varepsilon_t$$

The term $(y_{t-1} - \beta x_{t-1})$ is the error correction term: it measures how far the system deviated from long-run equilibrium last period, and $\alpha$ (negative when deviations are corrected) determines the speed of mean reversion back.
Applications: pairs trading (equity or fixed income), purchasing power parity, yield curve dynamics.
Part 8: Principal Component Analysis (PCA)
PCA finds directions of maximum variance in high-dimensional data. It’s used for dimensionality reduction, factor construction, and dealing with multicollinearity.
The Math
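Given a centred data matrix $X \in \mathbb{R}^{n \times p}$ (zero mean columns), compute the sample covariance matrix:

$$\hat{\Sigma} = \frac{1}{n - 1} X^\top X$$

Decompose via eigendecomposition:

$$\hat{\Sigma} = V \Lambda V^\top$$

Where $V$ contains eigenvectors (principal components) and $\Lambda$ contains eigenvalues $\lambda_1 \geq \dots \geq \lambda_p$ in decreasing order. Equivalently, via SVD of $X$: $X = U D V^\top$, with $\Lambda = D^2 / (n - 1)$.
Project onto the first $k$ components:

$$Z = X V_k$$

The fraction of variance explained by the first $k$ components is $\sum_{i=1}^{k} \lambda_i \big/ \sum_{i=1}^{p} \lambda_i$.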
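The equivalence between the covariance eigendecomposition and the SVD can be sketched in a few lines of NumPy. The function name `pca` and the one-factor synthetic "returns" are assumptions made for illustration.

```python
import numpy as np

def pca(X, k):
    """PCA via SVD of the centred data; returns scores, loadings, eigenvalues, explained share."""
    Xc = X - X.mean(axis=0)
    U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
    eigvals = d ** 2 / (Xc.shape[0] - 1)        # eigenvalues of the sample covariance
    loadings = Vt[:k].T                          # first k principal directions
    scores = Xc @ loadings                       # projection Z = X V_k
    explained = eigvals[:k].sum() / eigvals.sum()
    return scores, loadings, eigvals, explained

# Four synthetic series driven by one common factor plus noise.
rng = np.random.default_rng(6)
market = rng.normal(size=(500, 1))
X = market @ np.array([[1.0, 0.9, 1.1, 0.8]]) + 0.3 * rng.normal(size=(500, 4))

scores, loadings, eigvals, explained = pca(X, k=1)
cov_eigvals = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False)))[::-1]
```

`eigvals` from the SVD should match `cov_eigvals` from the covariance matrix, and the first component should explain most of the variance because the data were built from a single factor.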
Interpretation in Finance
The first principal component of a set of stock returns often approximates the market factor. The second and third components often capture sector or style effects. PCA on the yield curve typically extracts:
- PC1 — level (parallel shift, ~90% of variance)
- PC2 — slope (short vs. long rates)
- PC3 — curvature (butterfly)
PCA Regression
When predictors are collinear, regress $y$ on the first $k$ principal components instead of the original variables. Eliminates multicollinearity by construction (PCs are orthogonal). Trade-off: PCs may lack intuitive interpretation.
Part 9: Factor Models
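Factor models decompose returns into systematic and idiosyncratic components:

$$r_i = \alpha_i + \sum_{k=1}^{K} \beta_{ik} f_k + \varepsilon_i$$

Where $f_k$ are common factors, $\beta_{ik}$ are factor loadings, and $\varepsilon_i$ is idiosyncratic risk.
CAPM (Single Factor)

$$r_i - r_f = \alpha_i + \beta_i (r_m - r_f) + \varepsilon_i$$

$\beta_i$ measures systematic (market) risk. $\alpha_i$ is Jensen’s alpha: excess return above what CAPM predicts. Estimated by OLS regression of excess returns on excess market returns.
Fama-French Three-Factor Model

$$r_i - r_f = \alpha_i + \beta_i^{M} (r_m - r_f) + \beta_i^{S}\,\text{SMB} + \beta_i^{H}\,\text{HML} + \varepsilon_i$$

Where SMB (Small Minus Big) captures the size premium and HML (High Minus Low) captures the value premium. The Carhart four-factor model adds MOM (momentum). Fama-French five-factor adds RMW (profitability) and CMA (investment).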
Barra-Style Risk Models
Multi-factor risk models used by risk management:
- Style factors: value, momentum, quality, size, low volatility
- Industry factors: GICS sector exposures
- Country/currency factors: for global portfolios
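The factor return covariance matrix decomposes portfolio risk:

$$\Sigma = B F B^\top + D, \qquad \sigma_p^2 = w^\top \Sigma\, w$$

Where $B$ is the factor exposure matrix, $F$ is the factor covariance matrix, $D$ is the diagonal idiosyncratic variance matrix, and $w$ is the vector of portfolio weights.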
Part 10: Statistical Testing Framework
Hypothesis Testing
- State $H_0$ (null) and $H_1$ (alternative)
- Choose a test statistic and its null distribution
- Compute the p-value: probability of observing a test statistic at least as extreme as the one computed, given $H_0$ is true
- Compare p-value to significance level $\alpha$ (typically 0.05)
Type I error (false positive): Rejecting $H_0$ when it is true. Probability = $\alpha$. Type II error (false negative): Failing to reject $H_0$ when it is false. Probability = $\beta$. Power = $1 - \beta$: probability of correctly rejecting a false null.
The p-value is not the probability that $H_0$ is true. It is the probability of the data (or more extreme) given $H_0$.
Multiple Testing
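When testing $m$ hypotheses simultaneously, the probability of at least one false positive explodes. For independent tests, each at level $\alpha$:

$$P(\text{at least one false positive}) = 1 - (1 - \alpha)^m$$

For example, for $m = 20$ tests at $\alpha = 0.05$: $1 - 0.95^{20} \approx 64\%$ chance of a false positive.
Bonferroni correction: divide $\alpha$ by $m$, i.e. test each hypothesis at $\alpha / m$. Conservative (controls family-wise error rate).
Benjamini-Hochberg (FDR): controls the false discovery rate (expected proportion of false positives among rejections). Less conservative than Bonferroni; preferred when testing many hypotheses:
- Order p-values: $p_{(1)} \leq p_{(2)} \leq \dots \leq p_{(m)}$
- Find the largest $k$ such that $p_{(k)} \leq \frac{k}{m}\,\alpha$
- Reject all hypotheses with $p_{(i)}$, $i \leq k$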
In finance this matters enormously — Harvey, Liu & Zhu (2016) showed most published factor discoveries fail to survive multiple testing corrections.
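The Benjamini-Hochberg step-up rule is easy to get subtly wrong, so a compact sketch may help. The function name `benjamini_hochberg` and the eight p-values are made up for illustration; the Bonferroni comparison uses the same inputs.

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Boolean rejection mask from the Benjamini-Hochberg step-up procedure."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)                              # indices of sorted p-values
    thresholds = alpha * np.arange(1, m + 1) / m       # (k/m) * alpha for k = 1..m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()                 # largest rank meeting its bound
        reject[order[: k + 1]] = True                  # reject everything up to that rank
    return reject

pvals = [0.001, 0.008, 0.02, 0.03, 0.031, 0.06, 0.074, 0.205]   # hypothetical p-values
bh_reject = benjamini_hochberg(pvals, alpha=0.05)
bonferroni_reject = np.asarray(pvals) <= 0.05 / len(pvals)
```

With these inputs BH rejects the five smallest p-values (the fifth meets its threshold, which pulls in the third and fourth even though they miss their own), whereas Bonferroni rejects only the first.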
Key Tests Reference
| Test | Null hypothesis | Use when |
|---|---|---|
| t-test (one sample) | $\mu = 0$ | Testing if mean return differs from zero |
| t-test (two sample) | $\mu_1 = \mu_2$ | Comparing means of two groups |
| F-test | $\beta_1 = \dots = \beta_q = 0$ | Joint significance of predictors |
| Jarque-Bera | Normality ($S = 0$, $K = 3$) | Testing normality of returns |
| Breusch-Pagan | Homoscedasticity | Testing for heteroscedasticity |
| Durbin-Watson | No first-order autocorrelation | Time series residual checking |
| Ljung-Box | No autocorrelation up to lag $h$ | Residual diagnostics |
| ADF | Unit root (non-stationary) | Pre-testing time series |
| KPSS | Stationarity | Pre-testing time series |
| Hausman | RE consistent ($\operatorname{Cov}(\alpha_i, x_{it}) = 0$) | Fixed vs. random effects choice |
| Granger causality | $x$ does not Granger-cause $y$ | VAR causal inference |
| Chow test | No structural break | Testing regime changes |
Part 11: Distribution Statistics
Before running regressions or tests, understanding the shape of your data’s distribution matters — especially in finance where returns are decidedly non-normal.
Moments
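The first four moments of a distribution summarise its location, spread, and shape (they do not determine the distribution completely):
| Moment | Formula | What it measures |
|---|---|---|
| Mean | $\mu = \mathbb{E}[X]$ | Central tendency |
| Variance | $\sigma^2 = \mathbb{E}[(X - \mu)^2]$ | Spread |
| Skewness | $\mathbb{E}[(X - \mu)^3] / \sigma^3$ | Asymmetry |
| Kurtosis | $\mathbb{E}[(X - \mu)^4] / \sigma^4$ | Tail heaviness |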
Skewness: Zero = symmetric. Positive = right tail (large positive outliers). Negative = left tail (crash risk in equity returns — large negative outliers dominate).
Kurtosis: Normal distribution has kurtosis = 3. Excess kurtosis = kurtosis − 3. Positive excess kurtosis means fat tails — extreme events are far more common than a normal distribution predicts.
Financial returns typically show: negative skew + excess kurtosis > 0. This is why the Gaussian assumption in Black-Scholes systematically underprices out-of-the-money options.
Fat Tails vs Normal
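The Jarque-Bera test tests jointly for zero skewness and zero excess kurtosis:

$$JB = \frac{n}{6}\left(S^2 + \frac{(K - 3)^2}{4}\right)$$

Where $S$ is the sample skewness and $K$ the sample kurtosis. Under normality, $JB$ is asymptotically $\chi^2$ with 2 degrees of freedom.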
Part 12: Inequality and Concentration Measures
Gini Coefficient
The Gini coefficient measures inequality in a distribution — how concentrated values are among a subset of the population. It ranges from 0 (perfect equality) to 1 (perfect inequality).
Construction via the Lorenz Curve:
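The Lorenz curve $L(p)$ plots the cumulative share of total income (or wealth) held by the bottom $p$ fraction of the population:

$$L(p) = \frac{1}{\mu} \int_0^p F^{-1}(u)\,du$$

Where $F^{-1}$ is the quantile function (inverse CDF) and $\mu$ is the mean. Perfect equality means $L(p) = p$: the bottom 50% holds 50% of income.
The Gini coefficient is twice the area between the perfect equality line and the Lorenz curve:

$$G = 2\int_0^1 \bigl(p - L(p)\bigr)\,dp = 1 - 2\int_0^1 L(p)\,dp$$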
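For a finite sample, the Lorenz-curve construction reduces to a cumulative sum and a trapezoid-rule area. The sketch below is one simple way to do it; the function name `gini` is hypothetical, and non-negative values with a positive total are assumed.

```python
import numpy as np

def gini(values):
    """Sample Gini coefficient from the empirical Lorenz curve (trapezoid rule)."""
    x = np.sort(np.asarray(values, dtype=float))
    n = x.size
    # Lorenz curve evaluated at p = 0, 1/n, ..., 1.
    lorenz = np.concatenate([[0.0], np.cumsum(x) / x.sum()])
    area = np.sum((lorenz[1:] + lorenz[:-1]) / 2) / n
    return 1 - 2 * area

g_equal = gini([1, 1, 1, 1])        # everyone holds the same amount
g_one_holder = gini([0, 0, 0, 1])   # one of four holds everything
```

Equal holdings give a Gini of 0; with one holder out of $n$ the sample Gini is $(n-1)/n$, which approaches 1 as the population grows.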
Gini in Model Validation (Credit Scoring)
The Gini coefficient has a second life in quantitative model evaluation — particularly in credit risk. A credit model ranks borrowers by predicted default probability; the Gini measures how well it separates defaulters from non-defaulters.
The relationship to the AUC (Area Under the ROC Curve):

$$\text{Gini} = 2 \cdot \text{AUC} - 1$$
A random model has AUC = 0.5, Gini = 0. A perfect model has AUC = 1, Gini = 1. In practice, credit scorecards with Gini > 0.4 are considered good; > 0.6 is excellent.
Herfindahl-Hirschman Index (HHI)
HHI measures market concentration, i.e. how dominant the largest players are:

$$\text{HHI} = \sum_{i=1}^{N} s_i^2$$

Where $s_i$ is firm $i$’s market share (as a fraction). Ranges from $1/N$ (perfectly equal shares) to 1 (monopoly).
- HHI < 0.15: unconcentrated market
- 0.15–0.25: moderate concentration
- HHI > 0.25: highly concentrated (US DOJ merger review threshold)
Used in: antitrust analysis, portfolio concentration risk, factor concentration in quant portfolios.
Part 13: Non-parametric Methods
Non-parametric methods make no assumptions about the underlying distribution. Essential when data is ordinal, heavily skewed, or has fat tails.
Rank Correlations
Pearson correlation measures linear dependence between two variables. It can be misleading when relationships are monotonic but non-linear, or when outliers distort the picture.
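Spearman’s $\rho$ replaces values with their ranks, then computes Pearson correlation on the ranks:

$$\rho_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$

Where $d_i = \operatorname{rank}(x_i) - \operatorname{rank}(y_i)$ (this shortcut formula assumes no ties). Captures any monotonic relationship, not just linear.
Kendall’s $\tau$ counts concordant vs discordant pairs:

$$\tau = \frac{C - D}{n(n - 1)/2}$$

Where $C$ = concordant pairs (both $x$ and $y$ rank the same way) and $D$ = discordant pairs. More robust than Spearman to small samples and tied values.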
| Method | Measures | Sensitive to outliers? | Use when |
|---|---|---|---|
| Pearson | Linear dependence | Yes | Normal data, linear relationship |
| Spearman | Monotonic dependence | No | Ordinal data, non-linear monotone |
| Kendall | Ordinal association | No | Small samples, many ties |
Bootstrap
The bootstrap estimates the sampling distribution of any statistic by resampling with replacement from the observed data. No distributional assumptions required.
Algorithm:
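- Draw $B$ bootstrap samples of size $n$ from the data (with replacement)
- Compute the statistic $\hat{\theta}^*_b$ on each sample
- The distribution of $\hat{\theta}^*_b$ approximates the sampling distribution of $\hat{\theta}$
Bootstrap confidence interval (percentile method):

$$\left[\hat{\theta}^*_{(\alpha/2)},\; \hat{\theta}^*_{(1 - \alpha/2)}\right]$$

Where $\hat{\theta}^*_{(q)}$ is the $q$-quantile of the bootstrap statistics.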
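The resampling algorithm and percentile interval can be sketched as follows, using a per-period Sharpe ratio as the statistic. The helper names (`bootstrap_ci`, `sharpe`) and the synthetic returns are assumptions; the sketch resamples observations independently, so it ignores any serial dependence in real return series.

```python
import numpy as np

def bootstrap_ci(data, stat, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for an arbitrary statistic."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = data.size
    # Resample with replacement and recompute the statistic each time.
    stats = np.array([stat(data[rng.integers(0, n, size=n)]) for _ in range(n_boot)])
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])

def sharpe(r):
    """Per-period Sharpe ratio (mean over sample standard deviation)."""
    return r.mean() / r.std(ddof=1)

rng = np.random.default_rng(7)
returns = 0.1 + rng.standard_t(df=5, size=1000)   # fat-tailed synthetic returns
lo, hi = bootstrap_ci(returns, sharpe)
point = sharpe(returns)
```

If the interval `[lo, hi]` excludes zero, the Sharpe ratio is significantly different from zero at roughly the 5% level under this resampling scheme.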
The bootstrap is invaluable when:
- The statistic has no closed-form sampling distribution (e.g., Sharpe ratio, Gini)
- The data is clearly non-normal
- You want robust standard errors for complex estimators
In finance: Bootstrap is used to test whether a backtest’s Sharpe ratio is statistically significant without assuming normal returns. It does not by itself correct for look-ahead or selection bias in how the backtest was built.
Kernel Density Estimation (KDE)
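KDE estimates the probability density function of a dataset without assuming a parametric form:

$$\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)$$

Where $K$ is a kernel function (usually Gaussian) and $h$ is the bandwidth, the smoothing parameter.
- Small $h$: wiggly, overfits to noise
- Large $h$: over-smoothed, loses shape detail
- Optimal $h$ (Silverman’s rule of thumb, for a Gaussian kernel and roughly normal data): $h = 1.06\,\hat{\sigma}\,n^{-1/5}$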
KDE is used to visualise return distributions, compare empirical vs theoretical densities, and detect multimodality (e.g., bimodal return distributions suggesting regime changes).
Part 14: Logistic Regression and Classification
Linear regression predicts a continuous outcome. When the outcome is binary — default or no-default, fraud or not, churn or not — logistic regression is the standard tool.
The Model
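Instead of modelling $P(y = 1 \mid x)$ directly, logistic regression models the log-odds of the event:

$$\log\frac{p}{1 - p} = x^\top \beta, \qquad p = P(y = 1 \mid x)$$

Solving for the probability:

$$p = \sigma(x^\top \beta) = \frac{1}{1 + e^{-x^\top \beta}}$$

The sigmoid function $\sigma(\cdot)$ maps any real number to $(0, 1)$, giving a valid probability.
Estimation: Maximum Likelihood
Logistic regression has no closed-form solution. Parameters are found by maximising the log-likelihood:

$$\ell(\beta) = \sum_{i=1}^{n} \bigl[\, y_i \log p_i + (1 - y_i) \log(1 - p_i) \,\bigr]$$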
Solved iteratively with Newton-Raphson or gradient descent.
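A bare-bones Newton-Raphson fit shows what the iteration involves: each step solves a weighted least-squares-like system built from the current probabilities. This is a sketch on simulated data with hypothetical names (`logistic_newton`, `sigmoid`); it has no safeguards for separable data or ill-conditioned designs.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_newton(X, y, n_iter=25, tol=1e-10):
    """Maximum-likelihood logistic regression via Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        grad = X.T @ (y - p)                           # gradient of the log-likelihood
        info = X.T @ (X * (p * (1 - p))[:, None])      # negative Hessian (information matrix)
        step = np.linalg.solve(info, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

rng = np.random.default_rng(8)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-1.0, 2.0])
y = (rng.random(n) < sigmoid(X @ beta_true)).astype(float)

beta_hat = logistic_newton(X, y)
score_at_optimum = X.T @ (y - sigmoid(X @ beta_hat))
```

At the maximum the score vector vanishes, so `score_at_optimum` should be numerically zero and `beta_hat` should be near the simulated coefficients.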
Interpreting Coefficients
A one-unit increase in $x_j$ multiplies the odds by $e^{\beta_j}$:
- $\beta_j > 0$ → higher $x_j$ increases default probability
- $\beta_j < 0$ → higher $x_j$ decreases default probability
- $|\beta_j|$ → effect size (on the log-odds scale)
Unlike OLS, marginal effects on the probability scale depend on the values of all other variables — they are not constant.
Model Performance Metrics
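| Metric | Formula | Interpretation |
|---|---|---|
| AUC | Area under ROC curve | 0.5 = random; 1.0 = perfect |
| Gini | $2 \cdot \text{AUC} - 1$ | 0 = random; 1 = perfect |
| KS Statistic | $\max_s \lvert F_{\text{D}}(s) - F_{\text{ND}}(s) \rvert$ | Max separation between default/non-default CDFs |
| Log-loss | $-\frac{1}{n}\sum_i \left[y_i \log \hat{p}_i + (1 - y_i)\log(1 - \hat{p}_i)\right]$ | Lower is better; measures calibration |
| Brier Score | $\frac{1}{n}\sum_i (\hat{p}_i - y_i)^2$ | Mean squared error of probability forecasts |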
In credit risk, Gini > 0.4 is typically the minimum acceptable threshold; Gini > 0.6 is strong.
Part 15: Scorecard Development — WoE and Information Value
Credit scorecards translate continuous and categorical predictors into integer points. The standard preprocessing pipeline uses Weight of Evidence (WoE) encoding.
Weight of Evidence (WoE)
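For a predictor binned into $k$ groups, WoE for bin $i$ is:

$$\text{WoE}_i = \ln\!\left(\frac{b_i}{g_i}\right)$$

Where $b_i$ is the share of all defaults that fall in bin $i$ and $g_i$ is the share of all non-defaults that fall in bin $i$. This sign convention matches the interpretation below; some references define WoE with the ratio inverted.
- Positive WoE → bin has a higher proportion of defaults than the overall population (risky)
- Negative WoE → bin has lower proportion of defaults (safe)
- WoE = 0 → bin default rate equals the population average
WoE transforms all variables to a common, interpretable scale and handles non-linearity and missing values naturally.
Information Value (IV)
IV summarises a variable’s predictive power across all its bins:

$$\text{IV} = \sum_{i=1}^{k} (b_i - g_i) \cdot \text{WoE}_i$$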
| IV | Predictive Power |
|---|---|
| < 0.02 | Useless |
| 0.02–0.1 | Weak |
| 0.1–0.3 | Medium |
| 0.3–0.5 | Strong |
| > 0.5 | Suspicious (check for data leakage) |
IV is the primary variable selection criterion in scorecard development. Variables with IV < 0.02 are typically dropped; IV > 0.5 triggers a data quality review.
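As a worked illustration, WoE and IV can be computed from per-bin default and non-default counts. The sketch assumes the sign convention stated in this section (positive WoE = riskier than average); the function name `woe_iv` and the counts are hypothetical, and bins with zero counts would need smoothing or merging first.

```python
import numpy as np

def woe_iv(bad_counts, good_counts):
    """Weight of Evidence per bin and total Information Value.

    Convention (assumed here): WoE_i = ln(share of defaults / share of non-defaults),
    so positive values mark bins that are riskier than average.
    """
    bad = np.asarray(bad_counts, dtype=float)
    good = np.asarray(good_counts, dtype=float)
    bad_share = bad / bad.sum()
    good_share = good / good.sum()
    woe = np.log(bad_share / good_share)
    iv = np.sum((bad_share - good_share) * woe)
    return woe, iv

# Hypothetical counts for a three-bin predictor.
bad = [40, 35, 25]
good = [250, 350, 400]
woe, iv = woe_iv(bad, good)
```

Here the first bin is riskier than average, the middle bin is neutral, the last is safer, and the IV of about 0.14 falls in the "Medium" band of the table above.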
From WoE to Scorecard Points
Once logistic regression is fit on WoE-transformed variables, scorecard points are assigned by scaling coefficients to an integer range (e.g., 300–850 for consumer credit):

$$\text{Score} = \text{Offset} + \text{Factor} \times \ln(\text{odds})$$

Where Factor and Offset are chosen to anchor the score to a target odds at a target score (e.g., odds of 50:1 at score 600).
The final score is additive across characteristics — easy to explain to regulators and customers.
Part 16: Survival Analysis
Survival analysis models the time until an event occurs — time to default, time to prepayment, time to customer churn. Unlike logistic regression (which asks “will it happen?”), survival analysis asks “when will it happen?”
Core Functions
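Survival function: probability the event has not occurred by time $t$:

$$S(t) = P(T > t)$$

Hazard function: instantaneous rate of the event at time $t$, given survival to $t$:

$$h(t) = \lim_{\Delta t \to 0} \frac{P(t \leq T < t + \Delta t \mid T \geq t)}{\Delta t} = \frac{f(t)}{S(t)}$$

Cumulative hazard:

$$H(t) = \int_0^t h(u)\,du = -\log S(t)$$

Kaplan-Meier Estimator
The non-parametric estimate of $S(t)$ from censored data:

$$\hat{S}(t) = \prod_{t_i \leq t} \left(1 - \frac{d_i}{n_i}\right)$$

Where $d_i$ is the number of events and $n_i$ is the number at risk at time $t_i$. Censored observations (e.g., loans that were paid off before defaulting) are handled naturally: they contribute to the risk set up to their exit time, then drop out.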
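The Kaplan-Meier product can be computed by hand on a small censored sample, which also shows how censoring enters only through the risk set. The function name `kaplan_meier` and the eight toy observations are illustrative assumptions.

```python
import numpy as np

def kaplan_meier(durations, event_observed):
    """Kaplan-Meier survival estimate at each distinct event time."""
    t = np.asarray(durations, dtype=float)
    e = np.asarray(event_observed, dtype=bool)
    event_times = np.unique(t[e])
    surv = []
    s = 1.0
    for ti in event_times:
        n_i = np.sum(t >= ti)             # still at risk just before ti
        d_i = np.sum((t == ti) & e)       # events (not censorings) at ti
        s *= 1 - d_i / n_i
        surv.append(s)
    return event_times, np.array(surv)

# Toy data: 1 = default observed, 0 = censored (e.g. loan repaid early).
durations = [2, 3, 3, 5, 6, 8, 8, 10]
observed = [1, 1, 0, 1, 0, 1, 1, 0]
times, surv = kaplan_meier(durations, observed)
```

For these data the event times are 2, 3, 5 and 8, and the estimated survival probabilities step down through 0.875, 0.75, 0.6 and 0.2.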
Cox Proportional Hazards Model
The Cox model is the standard regression approach for survival data. It relates covariates to the hazard without specifying the baseline hazard shape (semi-parametric):

$$h(t \mid x) = h_0(t)\,\exp(x^\top \beta)$$

Where $h_0(t)$ is an unspecified baseline hazard. The proportional hazards assumption: the hazard ratio between two individuals with different covariates is constant over time.
Hazard ratio (HR): $e^{\beta_j}$. A one-unit increase in $x_j$ multiplies the hazard by $e^{\beta_j}$. HR > 1 means higher risk; HR < 1 means lower risk.
Estimated with partial likelihood — the baseline hazard cancels out, making estimation tractable without specifying it.
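The proportional hazards property can be checked numerically. The sketch below does not estimate anything; it takes a hypothetical baseline hazard and hypothetical coefficients and shows that the ratio of hazards for two borrowers differing by one unit in the first covariate is the same at every time point.

```python
import math

def hazard(t, x, beta, baseline):
    """Cox hazard h(t | x) = h0(t) * exp(beta . x)."""
    linear_predictor = sum(b * xi for b, xi in zip(beta, x))
    return baseline(t) * math.exp(linear_predictor)

def baseline(t):
    """Hypothetical baseline hazard, chosen only for illustration."""
    return 0.01 * (1 + 0.1 * t)

beta = [0.4, -0.25]  # hypothetical coefficients
x_a = [1.0, 2.0]
x_b = [2.0, 2.0]     # one unit higher on the first covariate

ratios = [
    hazard(t, x_b, beta, baseline) / hazard(t, x_a, beta, baseline)
    for t in (1, 12, 60)
]
print([round(r, 4) for r in ratios], round(math.exp(beta[0]), 4))
```

Every ratio equals $e^{0.4}$, whatever shape the baseline takes, which is why the baseline cancels out of the partial likelihood.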
Applications in credit:
- Probability of default over a 12-month horizon (IFRS 9 Stage migration)
- Lifetime probability of default (IFRS 9 ECL)
- Time to repayment / prepayment modelling
Part 17: Model Monitoring and Validation
Models degrade over time as the population they’re applied to drifts away from the development sample. Model monitoring is a regulatory requirement (SR 11-7, PRA SS1/23) and a risk management necessity.
Population Stability Index (PSI)
PSI measures how much a variable’s distribution has shifted between the development (reference) period and a monitoring period:
$$\mathrm{PSI} = \sum_{i} (A_i - E_i)\,\ln\frac{A_i}{E_i}$$
Where $A_i$ = actual proportion in bin $i$ (monitoring), $E_i$ = expected proportion in bin $i$ (development).
| PSI | Interpretation |
|---|---|
| < 0.10 | No significant shift — model still valid |
| 0.10–0.25 | Moderate shift — investigate |
| > 0.25 | Major shift — model may need redevelopment |
PSI is computed on the score distribution (overall stability) and on each input characteristic (Characteristic Stability Index, CSI). A high PSI on one characteristic identifies which variable is driving the drift.
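A minimal sketch of the PSI calculation on pre-binned data follows. The small floor `eps` applied to empty bins is an assumed workaround to keep the logarithm finite, not something prescribed above, and the distributions are invented. The same function serves for CSI when fed the bins of a single input characteristic.

```python
import math

def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two binned distributions.

    expected: development-sample counts or proportions per bin
    actual:   monitoring-sample counts or proportions per bin
    eps:      floor applied to proportions to avoid log(0) (an assumption)
    """
    e_total, a_total = sum(expected), sum(actual)
    total = 0.0
    for e, a in zip(expected, actual):
        e_p = max(e / e_total, eps)
        a_p = max(a / a_total, eps)
        total += (a_p - e_p) * math.log(a_p / e_p)
    return total

# Hypothetical score-band distributions.
development = [0.10, 0.20, 0.40, 0.20, 0.10]
monitoring = [0.08, 0.17, 0.38, 0.22, 0.15]
value = psi(development, monitoring)
print(round(value, 4))
```

For these toy distributions the PSI is about 0.033, below the 0.10 threshold in the table.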
Characteristic Stability Index (CSI)
CSI applies the same formula as PSI but to individual input variables. Workflow:

    flowchart LR
        Score[Score monitoring window] --> PSI{PSI above 0.10?}
        PSI -->|No| OK[Model stable - continue]
        PSI -->|Yes| CSI[Compute CSI for each variable]
        CSI --> Driver[Identify driver variable]
        Driver --> Root[Root cause: data issue or population shift]
        Root --> Fix[Recalibrate or redevelop model]

Performance Monitoring
Track discrimination and calibration separately — a model can remain discriminatory (Gini stable) while becoming poorly calibrated (predicted rates diverge from actuals):
| Metric | Monitors | Alert threshold |
|---|---|---|
| Gini / AUC | Discrimination (rank ordering) | Drop > 5 pp from development Gini |
| KS Statistic | Separation between default/non-default | Drop > 5 pp |
| Predicted vs Actual Default Rate | Calibration | Predicted/Actual ratio outside 0.8–1.2 |
| Hosmer-Lemeshow test | Calibration (formal) | p-value < 0.05 across score bands |
| PSI | Population drift | > 0.25 on score or key characteristic |
Backtesting
For through-the-cycle models (PD, LGD), backtesting compares predicted values against realised outcomes:
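Binomial test for PD: Under $H_0$ that the predicted PD is correct (and defaults are independent), the number of defaults in a cohort of $n$ obligors follows a $\text{Binomial}(n, \text{PD})$ distribution. Test whether the actual number of defaults is consistent with this distribution.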
Traffic light framework (Basel):
- Green zone: actual defaults within expected range
- Amber zone: borderline — increase monitoring
- Red zone: model materially over/underpredicts — regulatory notification required
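The binomial backtest described above can be sketched with the standard library alone. The example computes a one-sided p-value for underprediction, assuming independent defaults; in practice default correlation makes this test too strict. The cohort size and PD are hypothetical.

```python
from math import comb

def binomial_upper_pvalue(n, defaults, pd):
    """P(X >= defaults) for X ~ Binomial(n, pd).

    A small value suggests the predicted PD is too low for the cohort.
    """
    return sum(
        comb(n, k) * pd**k * (1 - pd) ** (n - k)
        for k in range(defaults, n + 1)
    )

# Hypothetical cohort: 1,000 obligors with a predicted PD of 2%.
n, pd_pred = 1000, 0.02
p_20 = binomial_upper_pvalue(n, 20, pd_pred)  # defaults equal to expectation
p_35 = binomial_upper_pvalue(n, 35, pd_pred)  # well above expectation
print(round(p_20, 3), round(p_35, 5))
```

Observing 20 defaults is unremarkable, while 35 defaults would be rejected at conventional significance levels. Mapping p-values to green, amber and red zones requires thresholds that are not given here.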
Part 18: Risk Metrics — VaR and Expected Shortfall
Value at Risk (VaR)
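VaR at tail probability $\alpha$ is the loss not exceeded with probability $1-\alpha$ over a given horizon:
$$P(L > \mathrm{VaR}_\alpha) = \alpha$$
where $L$ is the loss over the horizon. Equivalently, $\mathrm{VaR}_\alpha$ is the $(1-\alpha)$-quantile of the loss distribution (e.g., the 99th percentile for 1% VaR).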
Three estimation approaches:
| Method | How | Assumptions |
|---|---|---|
| Historical simulation | Sort past P&L; read off percentile | Distribution-free; captures fat tails and correlations |
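| Parametric (variance-covariance) | Assume normal returns; $\mathrm{VaR}_\alpha = z_{1-\alpha}\,\sigma - \mu$, with $\mu$ and $\sigma$ the mean and standard deviation of P&L | Fast; underestimates tail risk for non-normal returns |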
| Monte Carlo | Simulate thousands of scenarios from a model | Flexible; computationally expensive |
Limitations of VaR:
- Not subadditive — a portfolio of two positions can have higher VaR than the sum of their individual VaRs (violates diversification intuition)
- Tells you nothing about the magnitude of losses beyond the threshold
Expected Shortfall (CVaR / ES)
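Expected Shortfall is the expected loss conditional on exceeding VaR; for loss $L$ and tail probability $\alpha$:
$$\mathrm{ES}_\alpha = \mathbb{E}\left[L \mid L > \mathrm{VaR}_\alpha\right]$$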
ES is the average of all losses in the tail beyond VaR. It is:
- Subadditive — always rewards diversification
- More sensitive to tail shape — captures the severity, not just the threshold
- The regulatory standard for market risk under Basel IV (FRTB): ES at the 97.5% confidence level replaced VaR at the 99% level
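A historical-simulation estimate of both measures can be sketched as below. Several empirical quantile conventions exist; this one takes VaR as the k-th largest loss and ES as the mean of the k largest losses, with k the rounded tail count. That convention, the function name and the data are assumptions for illustration.

```python
def historical_var_es(pnl, alpha=0.01):
    """Historical-simulation VaR and ES at tail probability alpha.

    pnl: past P&L values (negative = loss).
    Returns (VaR, ES) as positive loss amounts.
    """
    losses = sorted((-x for x in pnl), reverse=True)  # largest loss first
    k = max(1, round(alpha * len(losses)))            # tail observations
    tail = losses[:k]
    var = tail[-1]       # k-th largest loss
    es = sum(tail) / k   # average of the k largest losses
    return var, es

# Toy P&L history: losses of 1, 2, ..., 100.
pnl = [-float(i) for i in range(1, 101)]
var_5, es_5 = historical_var_es(pnl, alpha=0.05)
print(var_5, es_5)
```

ES is never smaller than VaR under this convention, because it averages the VaR observation together with the losses beyond it.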
Duration and DV01 (Fixed Income Risk)
For fixed income portfolios, interest rate sensitivity is measured by:
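Modified Duration:
$$D_{\text{mod}} = -\frac{1}{P}\frac{\partial P}{\partial y}$$
where $P$ is the bond price and $y$ its yield. A bond with modified duration of 5 loses approximately 5% in value for a 1% (100bp) rise in yield.
DV01 (Dollar Value of a Basis Point):
$$\mathrm{DV01} = D_{\text{mod}} \times P \times 0.0001$$
DV01 is the P&L change for a 1 basis point (0.01%) move in yield. It is the standard unit for expressing interest rate risk on a trading desk.
Convexity measures the curvature of the price-yield relationship (duration is the first-order approximation; convexity is the second-order correction):
$$C = \frac{1}{P}\frac{\partial^2 P}{\partial y^2}, \qquad \frac{\Delta P}{P} \approx -D_{\text{mod}}\,\Delta y + \tfrac{1}{2}\,C\,(\Delta y)^2$$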
Positive convexity (standard bonds) means the bond gains more when yields fall than it loses when yields rise by the same amount.
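These sensitivities can be approximated numerically by bumping the yield and repricing. The sketch assumes an annual-pay fixed-coupon bond and central finite differences with a 1bp bump; the bond terms are hypothetical.

```python
def bond_price(face, coupon_rate, ytm, years):
    """Price of an annual-pay fixed-coupon bond."""
    coupon = face * coupon_rate
    pv_coupons = sum(coupon / (1 + ytm) ** t for t in range(1, years + 1))
    return pv_coupons + face / (1 + ytm) ** years

def risk_measures(face, coupon_rate, ytm, years, bump=1e-4):
    """Price, modified duration, DV01 and convexity by central differences."""
    p0 = bond_price(face, coupon_rate, ytm, years)
    p_up = bond_price(face, coupon_rate, ytm + bump, years)
    p_dn = bond_price(face, coupon_rate, ytm - bump, years)
    mod_dur = -(p_up - p_dn) / (2 * bump * p0)
    dv01 = mod_dur * p0 * 1e-4
    convexity = (p_up - 2 * p0 + p_dn) / (bump**2 * p0)
    return p0, mod_dur, dv01, convexity

# Hypothetical 5-year bond, 5% annual coupon, priced at a 5% yield.
p0, mod_dur, dv01, convexity = risk_measures(100, 0.05, 0.05, 5)
print(round(p0, 4), round(mod_dur, 4), round(dv01, 5), round(convexity, 2))
```

For this par bond the modified duration is about 4.33, so a 100bp rise in yield costs roughly 4.3% before the (positive) convexity correction.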
Expected Credit Loss (ECL — IFRS 9)
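Under IFRS 9, banks must recognise expected credit losses on in-scope financial instruments, over 12 months or the instrument's lifetime depending on its stage:
$$\mathrm{ECL} = \sum_{t=1}^{T} \mathrm{PD}_t \times \mathrm{LGD}_t \times \mathrm{EAD}_t \times \mathrm{DF}_t$$
Here $t$ indexes periods up to the horizon $T$, and $\mathrm{PD}_t$ is the marginal probability of default in period $t$.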
Where:
- PD — Probability of Default (from logistic/survival model)
- LGD — Loss Given Default (fraction of exposure lost; modelled via beta regression or OLS on logit-transformed LGD)
- EAD — Exposure at Default (outstanding balance at time of default)
- DF — Discount factor (to present value)
Staging under IFRS 9:
- Stage 1 — 12-month ECL (no significant credit deterioration since origination)
- Stage 2 — Lifetime ECL (significant increase in credit risk)
- Stage 3 — Lifetime ECL, credit-impaired
The transition between stages is the critical modelling decision — typically driven by PD relative to origination PD, delinquency triggers, or watchlist flags.
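The difference between 12-month and lifetime ECL can be illustrated with a short calculation. The sketch assumes annual periods, marginal (per-period, unconditional) PDs, a constant LGD, and discounting at a flat per-period rate; all inputs are hypothetical.

```python
def expected_credit_loss(marginal_pd, lgd, ead, rate, horizon=None):
    """Sum of PD_t * LGD * EAD_t * DF_t over periods 1..horizon.

    marginal_pd: probability of default in each period
    lgd:         loss given default (constant here, an assumption)
    ead:         exposure at default in each period
    rate:        per-period discount rate
    horizon:     number of periods to include; None means lifetime
    """
    periods = len(marginal_pd) if horizon is None else min(horizon, len(marginal_pd))
    total = 0.0
    for t in range(1, periods + 1):
        discount_factor = 1 / (1 + rate) ** t
        total += marginal_pd[t - 1] * lgd * ead[t - 1] * discount_factor
    return total

# Hypothetical three-year amortising loan.
marginal_pd = [0.02, 0.03, 0.04]
ead = [1000.0, 700.0, 400.0]
stage1_ecl = expected_credit_loss(marginal_pd, 0.45, ead, 0.05, horizon=1)
lifetime_ecl = expected_credit_loss(marginal_pd, 0.45, ead, 0.05)
print(round(stage1_ecl, 2), round(lifetime_ecl, 2))
```

Moving this loan from Stage 1 to Stage 2 would lift the provision from about 8.57 to about 23.36, which is why the staging criteria matter so much.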
Common Pitfalls
| Pitfall | What happens | Fix |
|---|---|---|
| Omitted variable bias | $\hat{\beta}$ is biased and inconsistent | Add the variable; use instrumental variables or fixed effects |
| Spurious regression | Fake significance between unrelated non-stationary series | Test stationarity; difference or use ECM |
| Look-ahead bias | Future data leaks into predictors | Align data carefully; use lagged values |
| P-hacking | Testing many models, reporting the best | Pre-register hypothesis; correct for multiple testing |
| Overfitting | Model fits in-sample noise | Cross-validate; use regularisation; hold-out test set |
| Ignoring autocorrelation | Standard errors too small; over-rejection | Use HAC standard errors or model residuals |
| Reverse causality | Causal direction is ambiguous | Instrumental variables; Granger causality tests (temporal precedence only, not proof of causation) |