Quantitative Methods

A comprehensive reference for quantitative analytics — regression, credit risk models, scorecard development, survival analysis, model monitoring, time series, factor models, and risk metrics.
Key Takeaways
  • OLS finds the coefficient vector that minimises the sum of squared residuals; the closed-form solution is β = (XᵀX)⁻¹Xᵀy.
  • Logistic regression models binary outcomes (default/no-default); coefficients are log-odds ratios and the output is a probability.
  • Weight of Evidence (WoE) and Information Value (IV) are the standard feature engineering and selection tools for credit scorecards.
  • Survival analysis models time-to-event (time to default); the Cox proportional hazards model estimates relative default risk.
  • PSI detects population drift post-deployment — it is the first check in any model monitoring framework.
  • VaR and Expected Shortfall quantify market and credit risk; ES replaces VaR as the market-risk capital measure under the Basel FRTB rules (often called Basel IV).

The Core Idea

Linear regression models the relationship between a response variable $y$ and one or more predictors $\mathbf{x}$ as a linear function, then estimates that function from data by minimising prediction error. It is the workhorse of quantitative analytics: directly useful for modelling returns, risk factors, and economic relationships, and the conceptual foundation for nearly every more complex method.


Part 1: Ordinary Least Squares (OLS)

The Model

The population regression model assumes:

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \varepsilon_i$$

In matrix form with $n$ observations and $k$ predictors:

$$\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}$$

Where:

  • $\mathbf{y} \in \mathbb{R}^n$: response vector
  • $X \in \mathbb{R}^{n \times (k+1)}$: design matrix (first column is a vector of ones for the intercept)
  • $\boldsymbol{\beta} \in \mathbb{R}^{k+1}$: coefficient vector (what we estimate)
  • $\boldsymbol{\varepsilon} \in \mathbb{R}^n$: error vector (unobservable)

OLS Derivation

OLS minimises the residual sum of squares:

$$\text{RSS}(\boldsymbol{\beta}) = \sum_{i=1}^n (y_i - \mathbf{x}_i^\top \boldsymbol{\beta})^2 = (\mathbf{y} - X\boldsymbol{\beta})^\top(\mathbf{y} - X\boldsymbol{\beta})$$

Taking the derivative with respect to $\boldsymbol{\beta}$ and setting it to zero:

$$\frac{\partial \text{RSS}}{\partial \boldsymbol{\beta}} = -2X^\top(\mathbf{y} - X\boldsymbol{\beta}) = 0$$

This gives the normal equations: $X^\top X \boldsymbol{\beta} = X^\top \mathbf{y}$

Solving (when $X^\top X$ is invertible):

$$\boxed{\hat{\boldsymbol{\beta}} = (X^\top X)^{-1} X^\top \mathbf{y}}$$

This is the OLS estimator. It has a closed-form solution, so no iteration is required.

Intuition: OLS projects $\mathbf{y}$ orthogonally onto the column space of $X$. The fitted values $\hat{\mathbf{y}} = X\hat{\boldsymbol{\beta}}$ are the point in that column space closest to $\mathbf{y}$ under the Euclidean norm.
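
As a minimal sketch of the normal equations, the snippet below solves the 2x2 system for one predictor plus an intercept in plain Python. The function name `ols_simple` and the toy data are illustrative assumptions, not part of the original text; a production fit would normally use a linear algebra library rather than a hand-written inverse.

```python
def ols_simple(x, y):
    """Solve the normal equations X'X b = X'y for the model y = b0 + b1 * x."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    # X'X = [[n, sx], [sx, sxx]] and X'y = [sy, sxy]
    det = n * sxx - sx * sx
    if det == 0:
        raise ValueError("X'X is singular: x has no variation")
    # Apply the explicit inverse of the 2x2 matrix X'X to X'y
    b0 = (sxx * sy - sx * sxy) / det
    b1 = (n * sxy - sx * sy) / det
    return b0, b1


x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]  # toy data lying exactly on y = 1 + 2x
b0, b1 = ols_simple(x, y)
```

Because the toy data lie exactly on a line, the recovered intercept and slope match the generating values, which is a convenient sanity check for the algebra.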

[Figure: observed points scattered around the fitted line ŷ = β₀ + β₁x, with vertical segments marking the residuals ε. Axes: predictor x, response y.]

The Gauss-Markov Theorem

Under the classical assumptions below, OLS is the Best Linear Unbiased Estimator (BLUE): it has the smallest variance among all linear unbiased estimators. Normality is listed for completeness; it is needed for exact finite-sample inference, not for BLUE.

| Assumption | Statement | What breaks it |
|---|---|---|
| L: Linearity | $y = X\beta + \varepsilon$ is correctly specified | Omitted variables, wrong functional form |
| I: Independence | Observations are independent | Time series autocorrelation, clustered data |
| H: Homoscedasticity | $\text{Var}(\varepsilon_i) = \sigma^2$ (constant) | Volatility clustering in financial returns |
| N: Normality | $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$ | Fat tails, outliers (needed for exact inference, not BLUE) |
| E: Exogeneity | $\mathbb{E}[\varepsilon \mid X] = 0$ | Endogeneity: omitted confounders, simultaneity, measurement error in $X$ |

When these hold, $\hat{\boldsymbol{\beta}}$ is unbiased ($\mathbb{E}[\hat{\boldsymbol{\beta}}] = \boldsymbol{\beta}$) and efficient.


Part 2: Inference and Model Evaluation

Coefficient Standard Errors

The variance-covariance matrix of $\hat{\boldsymbol{\beta}}$ under homoscedasticity:

$$\text{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2 (X^\top X)^{-1}$$

Since $\sigma^2$ is unknown, replace it with the unbiased estimator:

$$\hat{\sigma}^2 = \frac{\text{RSS}}{n - k - 1} = \frac{\sum_i \hat{\varepsilon}_i^2}{n-k-1}$$

The standard error of $\hat{\beta}_j$ is $\text{SE}(\hat{\beta}_j) = \hat{\sigma} \sqrt{[(X^\top X)^{-1}]_{jj}}$

Hypothesis Testing

t-test for individual coefficients:

$$t_j = \frac{\hat{\beta}_j}{\text{SE}(\hat{\beta}_j)} \sim t_{n-k-1} \quad \text{under } H_0: \beta_j = 0$$

F-test for joint significance:

$$F = \frac{(\text{RSS}_\text{restricted} - \text{RSS}_\text{unrestricted})/q}{\text{RSS}_\text{unrestricted}/(n-k-1)} \sim F_{q,\, n-k-1}$$

Where $q$ is the number of restrictions. Tests whether a group of coefficients are jointly zero.
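
The standard error and t statistic can be traced by hand for a single predictor ($k = 1$), where the slope entry of $(X^\top X)^{-1}$ reduces to $1 / \sum_i (x_i - \bar{x})^2$. The sketch below assumes that special case; `slope_inference` and the data are hypothetical names and values chosen for illustration.

```python
import math


def slope_inference(x, y):
    """Slope, its standard error and t statistic for y = b0 + b1 * x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    sigma2_hat = rss / (n - 2)            # n - k - 1 with k = 1
    se_b1 = math.sqrt(sigma2_hat / sxx)   # sigma_hat * sqrt of the slope entry of (X'X)^-1
    return b1, se_b1, b1 / se_b1


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]  # hypothetical observations
b1, se_b1, t_stat = slope_inference(x, y)
```

The t statistic would then be compared with a $t_{n-2}$ critical value; the standard library has no t quantile function, so that lookup is left out of the sketch.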

Goodness of Fit

R² (coefficient of determination):

$$R^2 = 1 - \frac{\text{RSS}}{\text{TSS}} = 1 - \frac{\sum_i \hat{\varepsilon}_i^2}{\sum_i (y_i - \bar{y})^2}$$

R² measures the fraction of total variance in $y$ explained by the model. It never decreases when you add predictors, regardless of whether they are useful.

Adjusted R² penalises for adding irrelevant predictors:

$$\bar{R}^2 = 1 - \frac{(1-R^2)(n-1)}{n-k-1}$$
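
A short sketch of both formulas follows. The fitted values are assumed to come from an intercept-plus-slope fit on five toy points; the function names are illustrative, not from the original.

```python
def r_squared(y, y_hat):
    """R^2 = 1 - RSS / TSS."""
    ybar = sum(y) / len(y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    tss = sum((yi - ybar) ** 2 for yi in y)
    return 1.0 - rss / tss


def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 for n observations and k predictors (intercept excluded)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


y = [2.0, 4.0, 5.0, 4.0, 5.0]
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]  # OLS fit y = 2.2 + 0.6 x for x = 1..5
r2 = r_squared(y, y_hat)
adj = adjusted_r_squared(r2, n=5, k=1)
```

With so few observations the adjustment is large, which shows why adjusted R² matters most in small samples.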

AIC and BIC are information criteria for model selection:

$$\text{AIC} = 2k - 2\ln(\hat{L}), \quad \text{BIC} = k\ln(n) - 2\ln(\hat{L})$$

Lower is better. Here $k$ counts the estimated parameters and $\hat{L}$ is the maximised likelihood. BIC penalises complexity more heavily than AIC whenever $\ln(n) > 2$ (that is, $n \geq 8$) and tends to select simpler models.

Confidence Intervals

A 95% confidence interval for $\beta_j$:

$$\hat{\beta}_j \pm t_{n-k-1,\, 0.025} \cdot \text{SE}(\hat{\beta}_j)$$

Interpretation: If you repeated the experiment many times and computed this interval each time, 95% of intervals would contain the true $\beta_j$. It is NOT “95% probability that the true parameter is in this interval” (that’s the Bayesian credible interval).


Part 3: Regression Diagnostics

Diagnostics test whether the Gauss-Markov assumptions hold. Violating them doesn’t always invalidate the regression — but it changes what you can conclude.

Residual Analysis

Always start with residual plots:

  • Residuals vs. fitted values: should be random scatter. Patterns indicate heteroscedasticity or non-linearity.
  • Q-Q plot: residuals vs. theoretical normal quantiles. Deviations at tails indicate non-normality (common in financial data).
  • Scale-location plot: $\sqrt{|\hat{\varepsilon}_i|}$ vs. fitted values. Increasing spread = heteroscedasticity.
  • Residuals vs. time: for time-ordered data. Patterns indicate autocorrelation.

Each pattern tells you something different and has a different fix:

[Figure: four residuals-vs-fitted panels. Good: random scatter. Heteroscedastic: fan shape. Non-linear: curved pattern. Autocorrelated: wave pattern.]

Heteroscedasticity

When $\text{Var}(\varepsilon_i) = \sigma_i^2$ varies across observations, OLS is still unbiased but no longer efficient. The usual standard errors are wrong, so t-tests and confidence intervals are invalid.

Detection:

  • Breusch-Pagan test: regress squared residuals on the predictors. A significant LM (or F) statistic indicates heteroscedasticity.
  • White test: more general, includes squared terms and cross-products.

Fixes:

  • Heteroscedasticity-consistent (HC) standard errors (White standard errors): correct the standard errors without changing $\hat{\boldsymbol{\beta}}$.
  • Weighted Least Squares (WLS): weight observations by the inverse of their error variance when the variance structure is known.
  • Generalised Least Squares (GLS): the general fix when the error covariance structure $\Sigma$ is known: $\hat{\boldsymbol{\beta}}_\text{GLS} = (X^\top \Sigma^{-1} X)^{-1} X^\top \Sigma^{-1} \mathbf{y}$

Autocorrelation

When errors are correlated across time ($\text{Cov}(\varepsilon_t, \varepsilon_{t-s}) \neq 0$), OLS standard errors are typically too small under positive autocorrelation, so you over-reject the null.

Detection:

  • Durbin-Watson statistic — tests for first-order autocorrelation. DW ≈ 2 means no autocorrelation; DW < 2 means positive autocorrelation (very common in financial time series).
  • Ljung-Box Q-test — tests for autocorrelation at multiple lags simultaneously.
  • ACF/PACF plots of residuals — visual inspection of autocorrelation structure.
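
The Durbin-Watson statistic is simple enough to compute directly. The sketch below uses the standard definition, the sum of squared successive residual differences divided by the sum of squared residuals; that formula is not stated in the text above and is supplied here as an assumption, as are the two toy residual series.

```python
def durbin_watson(residuals):
    """DW = sum_t (e_t - e_{t-1})^2 / sum_t e_t^2."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e * e for e in residuals)
    return num / den


# Slowly drifting residuals: positive autocorrelation, DW well below 2
smooth = [1.0, 0.8, 0.6, 0.4, 0.2, -0.2, -0.4, -0.6, -0.8, -1.0]
# Sign-flipping residuals: negative autocorrelation, DW well above 2
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

dw_smooth = durbin_watson(smooth)
dw_alt = durbin_watson(alternating)
```

The statistic lies between 0 and 4, so values near either end are the ones to worry about.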

Fixes:

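  • Newey-West standard errors (HAC: heteroscedasticity and autocorrelation consistent)
  • Explicitly model the autocorrelation structure (ARIMA residuals)
  • Cochrane-Orcutt or Prais-Winsten feasible GLS (quasi-difference the data using the estimated AR(1) coefficient of the errors), or include a lagged dependent variable as a predictor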

Multicollinearity

When predictors are highly correlated, $(X^\top X)^{-1}$ becomes numerically unstable. Coefficients have large standard errors and can flip sign between samples, so individual coefficients can’t be trusted even when the overall fit is good.

Detection:

  • Variance Inflation Factor (VIF): $\text{VIF}_j = \frac{1}{1 - R_j^2}$ where $R_j^2$ is the R² from regressing $x_j$ on all other predictors. VIF > 10 (or > 5 conservatively) indicates a problem.
  • Condition number of the column-scaled design matrix $X$ (the square root of the condition number of $X^\top X$): above 30 indicates moderate, above 100 indicates severe multicollinearity.
  • Correlation matrix: pairwise correlations above 0.8 are a warning sign.
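
To make the VIF formula concrete, the sketch below covers the two-predictor case, where $R_j^2$ from regressing one predictor on the other equals their squared Pearson correlation. With more predictors a full auxiliary regression is needed; the helper names and data here are hypothetical.

```python
def pearson_r(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    va = sum((ai - ma) ** 2 for ai in a)
    vb = sum((bi - mb) ** 2 for bi in b)
    return cov / (va * vb) ** 0.5


def vif_two_predictors(x1, x2):
    """VIF = 1 / (1 - R_j^2); with two predictors R_j^2 is just r^2."""
    r2 = pearson_r(x1, x2) ** 2
    return 1.0 / (1.0 - r2)


x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.1, 1.9, 3.2, 3.9, 5.1, 5.8]    # nearly a copy of x1
x3 = [2.0, -1.0, 0.5, 3.0, -2.0, 1.0]  # only weakly related to x1

vif_high = vif_two_predictors(x1, x2)
vif_low = vif_two_predictors(x1, x3)
```

The near-duplicate pair lands far above the VIF > 10 rule of thumb, while the weakly related pair stays close to the minimum value of 1.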

Fixes: Ridge regression (shrinks coefficients), PCA regression (transforms to orthogonal predictors), removing one of the collinear variables.

Influential Observations

Leverage measures how far an observation’s $x$-values are from the mean. High-leverage points have outsized influence on the regression line regardless of their $y$-value.

$$h_{ii} = [X(X^\top X)^{-1}X^\top]_{ii}$$

Cook’s Distance combines leverage and residual size into a single influence measure:

$$D_i = \frac{\hat{\varepsilon}_i^2}{(k+1)\hat{\sigma}^2} \cdot \frac{h_{ii}}{(1-h_{ii})^2}$$

$D_i > 1$ is a common threshold for “influential.” Examine these observations: data errors, legitimate outliers, or regime changes.
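
A rough sketch of both measures for a single-predictor fit is shown below. It assumes the textbook simplification $h_{ii} = 1/n + (x_i - \bar{x})^2 / \sum_j (x_j - \bar{x})^2$, which is the hat-matrix diagonal when $X$ holds an intercept and one predictor. The function name and data are illustrative.

```python
def cooks_distance(x, y):
    """Leverage and Cook's distance for the model y = b0 + b1 * x."""
    n, p = len(x), 2                      # p = k + 1 estimated coefficients
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    resid = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
    sigma2_hat = sum(e * e for e in resid) / (n - p)
    # Hat-matrix diagonal for an intercept plus one predictor
    lev = [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]
    cooks = [e * e / (p * sigma2_hat) * h / (1.0 - h) ** 2
             for e, h in zip(resid, lev)]
    return lev, cooks


x = [1.0, 2.0, 3.0, 4.0, 5.0, 20.0]
y = [1.1, 2.0, 2.9, 4.2, 5.0, 8.0]  # last point sits far out in x and off the trend
leverage, cooks = cooks_distance(x, y)
```

The leverages always sum to the number of coefficients, which is a handy check, and the isolated last point is the only one here that crosses the $D_i > 1$ threshold.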

Diagnostic Decision Tree

flowchart TD
Fit[Fit OLS] --> RvF[Plot residuals vs fitted]
RvF --> Q1{Fan / funnel shape?}
Q1 -->|Yes| A1[Heteroscedasticity\nFix: HC errors or WLS]
Q1 -->|No| Q2{Curved or U-shape?}
Q2 -->|Yes| A2[Non-linearity\nFix: polynomial or log transform]
Q2 -->|No| Q3{Wave or drift over time?}
Q3 -->|Yes| A3[Autocorrelation\nFix: HAC errors or ARIMA residuals]
Q3 -->|No| VIF[Check VIF for each predictor]
VIF --> Q4{Any VIF above 10?}
Q4 -->|Yes| A4[Multicollinearity\nFix: Ridge regression or drop variable]
Q4 -->|No| Cook[Compute Cook's D]
Cook --> Q5{Any D above 1?}
Q5 -->|Yes| A5[Influential observation\nFix: investigate or robust regression]
Q5 -->|No| Done[No assumption violations detected]

Part 4: Variable Transformations

Transformations serve two purposes: fixing violated OLS assumptions (non-normality, heteroscedasticity, non-linearity) and changing how coefficients are interpreted. Applying the wrong transformation — or not applying one when needed — is a common source of misleading models.


Dependent Variable Transformations

Log transformation: $\ln(y)$

Use when $y$ is strictly positive, right-skewed, or when variance grows with the mean (common in income, prices, exposure).

$$\ln(y_i) = \beta_0 + \beta_1 x_i + \varepsilon_i$$

Interpretation: A one-unit increase in $x$ multiplies $y$ by $e^{\beta_1}$, or equivalently changes $y$ by approximately $100 \cdot \beta_1\%$ (a good approximation only for small $\beta_1$; the exact change is $100(e^{\beta_1} - 1)\%$).

When to use: residuals fan outward (heteroscedasticity), $y$ is a monetary amount or count that can’t be negative.

Watch out: $\ln(0)$ is undefined, so you need $y > 0$. A common fix is $\ln(y + 1)$ or $\ln(y + c)$ for small $c$.

Retransforming to the original scale: $\hat{y} = e^{\widehat{\ln y}}$ is biased downward as an estimate of $\mathbb{E}[y \mid x]$. The smearing estimator (Duan, 1983) corrects this, with $\hat{\mu}$ the fitted value on the log scale and $\hat{\varepsilon}_i$ the log-scale residuals:

$$\hat{y} = e^{\hat{\mu}} \cdot \frac{1}{n} \sum_{i=1}^n e^{\hat{\varepsilon}_i}$$
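
The smearing correction is a one-liner once the log-scale residuals are available. The sketch below uses made-up residuals and a made-up fitted value purely to show the mechanics; `smearing_predict` is a hypothetical helper name.

```python
import math


def smearing_predict(mu_hat, log_residuals):
    """Duan-style smearing estimate of E[y | x] from a log-scale fit."""
    smear = sum(math.exp(e) for e in log_residuals) / len(log_residuals)
    return math.exp(mu_hat) * smear


log_residuals = [-0.5, -0.2, 0.0, 0.2, 0.5]  # hypothetical residuals with mean zero
mu_hat = 2.0                                  # hypothetical fitted log-scale value

naive = math.exp(mu_hat)                      # plain back-transform
corrected = smearing_predict(mu_hat, log_residuals)
```

Because the exponential is convex, the smearing factor is at least 1 whenever the residuals average to zero, so the corrected prediction is never below the naive one.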

Square-root transformation: $\sqrt{y}$

Gentler than log; useful for count data or moderate right skew. Variance-stabilising for Poisson-distributed outcomes (where $\text{Var}(y) \propto \mu$).

Box-Cox transformation

A family of power transformations parameterised by $\lambda$:

$$y^{(\lambda)} = \begin{cases} \dfrac{y^\lambda - 1}{\lambda} & \lambda \neq 0 \\ \ln(y) & \lambda = 0 \end{cases}$$

Special cases (up to a shift and rescaling): $\lambda = 1$ is no transformation, $\lambda = 0$ is log, $\lambda = 0.5$ is square root, $\lambda = -1$ is reciprocal.

Estimate $\lambda$ by maximum likelihood: the optimal $\lambda$ maximises the log-likelihood of the transformed residuals being normal. In Python: scipy.stats.boxcox(y), which returns the transformed values together with the fitted $\lambda$ when no $\lambda$ is supplied.

Limitation: Requires $y > 0$. Yeo-Johnson extends this to allow $y \leq 0$.
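
The transformation itself, as opposed to the maximum likelihood search for $\lambda$, fits in a few lines. This is a sketch of the piecewise definition above with a caller-supplied $\lambda$; the function name `box_cox` is an illustrative choice.

```python
import math


def box_cox(y, lam):
    """Apply the Box-Cox transform for a given lambda (no estimation)."""
    if any(v <= 0 for v in y):
        raise ValueError("Box-Cox requires strictly positive y")
    if lam == 0:
        return [math.log(v) for v in y]
    return [(v ** lam - 1.0) / lam for v in y]


y = [1.0, 2.0, 4.0, 8.0]
as_log = box_cox(y, 0)        # lambda = 0: natural log
as_shift = box_cox(y, 1)      # lambda = 1: y - 1, a pure shift
near_log = box_cox(y, 1e-8)   # tiny lambda approaches the log case
```

The last line illustrates why the $\lambda = 0$ branch is defined as the log: it is the limit of the power formula as $\lambda \to 0$.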

Logit transformation: $\ln\left(\frac{y}{1-y}\right)$

For bounded outcomes $y \in (0, 1)$ such as rates or proportions. Expands the bounded range to $(-\infty, +\infty)$, making OLS applicable. The fractional logit model is an alternative that avoids retransformation.


Independent Variable Transformations

Log transformation: $\ln(x)$

Use when the relationship between $x$ and $y$ is concave (diminishing returns) or when $x$ spans several orders of magnitude (income, asset size, market cap).

Interpretation depends on the model form:

| Model | Equation | Interpretation of $\beta_1$ |
|---|---|---|
| Linear-linear | $y = \beta_0 + \beta_1 x$ | $\Delta y = \beta_1$ per unit increase in $x$ |
| Log-linear | $\ln y = \beta_0 + \beta_1 x$ | $\approx 100\beta_1\%$ change in $y$ per unit increase in $x$ |
| Linear-log | $y = \beta_0 + \beta_1 \ln x$ | $\Delta y \approx \beta_1 / 100$ per 1% increase in $x$ |
| Log-log | $\ln y = \beta_0 + \beta_1 \ln x$ | $\beta_1$ is the elasticity: a 1% increase in $x$ gives a $\approx \beta_1\%$ change in $y$ |

The log-log form is widely used in economics and finance because elasticities are unit-free and directly comparable across variables.

Polynomial features

Include $x^2$, $x^3$ to capture non-linear relationships while staying within the OLS framework:

$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon$$

The marginal effect is $\partial y / \partial x = \beta_1 + 2\beta_2 x$, so it varies with $x$. The turning point is at $x^* = -\beta_1 / (2\beta_2)$.

Caution: High-degree polynomials overfit at the extremes. Splines or piecewise linear functions are more robust.

Standardisation vs normalisation

Standardisation (z-score): $x' = (x - \mu) / \sigma$. After standardising, $\beta_j$ is interpreted as the effect of a one-standard-deviation increase in $x_j$. Makes coefficients directly comparable across predictors with different units. Required before applying Ridge or Lasso (otherwise the penalty treats variables unequally).

Min-max normalisation: $x' = (x - x_{\min}) / (x_{\max} - x_{\min})$. Scales to $[0, 1]$. Sensitive to outliers. Rarely used for regression; more common in ML preprocessing.

Standardisation does not change $R^2$, or the $t$-statistics and $p$-values of the slope coefficients; it changes only the scale of $\hat{\beta}$.


Categorical Variables: Dummy Encoding

A categorical variable with $k$ levels is encoded as $k - 1$ binary (0/1) dummy variables. The omitted level is the reference category: all coefficients are interpreted relative to it.

Example: Credit rating with levels AAA, AA, A, BBB → create dummies for AA, A, BBB; AAA is the reference.

$$y = \beta_0 + \beta_1 D_{\text{AA}} + \beta_2 D_{\text{A}} + \beta_3 D_{\text{BBB}} + \gamma x + \varepsilon$$

$\beta_1$ is the average difference in $y$ between AA and AAA, holding $x$ constant.

Dummy variable trap: Including all $k$ dummies creates perfect multicollinearity with the intercept (they sum to 1). Always use $k - 1$ dummies. Software handles this automatically, but be aware if constructing features manually.
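
For manual feature construction, the credit rating example can be encoded as follows. This is a minimal sketch; `dummy_encode` is a hypothetical helper, and real pipelines would usually rely on a data-frame library's built-in encoder.

```python
def dummy_encode(values, levels, reference):
    """Encode a categorical variable as k - 1 indicator columns."""
    kept = [lv for lv in levels if lv != reference]  # drop the reference level
    rows = [[1 if v == lv else 0 for lv in kept] for v in values]
    return kept, rows


levels = ["AAA", "AA", "A", "BBB"]
ratings = ["AA", "AAA", "BBB", "A", "AA"]
columns, rows = dummy_encode(ratings, levels, reference="AAA")
```

An AAA observation becomes a row of zeros, so its effect is absorbed by the intercept, which is exactly what avoids the dummy variable trap.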

Ordered categories: For ordinal variables (e.g., credit grades with a natural ranking), a single integer encoding can be appropriate if the spacing is roughly equal. Dummy encoding is safer when spacing is unequal.


Interaction Terms

An interaction term $x_1 \cdot x_2$ allows the effect of $x_1$ to depend on the level of $x_2$:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 (x_1 \cdot x_2) + \varepsilon$$

The marginal effect of $x_1$ is $\partial y / \partial x_1 = \beta_1 + \beta_3 x_2$, so it varies with $x_2$.

When to include interactions:

  • Theory suggests the effect of one variable depends on another (e.g., income × age in credit scoring)
  • Residual plots show systematic patterns that disappear after adding the interaction
  • You want to test whether a relationship differs across groups (equivalent to separate slopes)

Hierarchy principle: If you include an interaction $x_1 x_2$, include the main effects $x_1$ and $x_2$ too, even if their main-effect coefficients are insignificant. Omitting them changes the interpretation of the interaction.


Choosing the Right Transformation

| Symptom | Likely fix |
|---|---|
| Right-skewed $y$, variance grows with mean | $\ln(y)$ |
| $y$ is a count (Poisson) | $\sqrt{y}$ or Poisson regression |
| $y$ is a proportion in $(0,1)$ | $\text{logit}(y)$ or fractional logit |
| $y$ is continuous, optimal $\lambda$ unknown | Box-Cox |
| Residuals show a curve (concave/convex) | $\ln(x)$ or add $x^2$ |
| $x$ spans orders of magnitude | $\ln(x)$ |
| Predictors on different scales (for regularisation) | Standardise all $x$ |
| Non-linear group differences | Interaction terms |

Part 5: Regularised Regression

When predictors are numerous or collinear, OLS over-fits. Regularisation adds a penalty term to the loss function, shrinking coefficients toward zero.

Ridge Regression (L2)

$$\hat{\boldsymbol{\beta}}_\text{ridge} = \arg\min_{\boldsymbol{\beta}} \left\{ \sum_i (y_i - \mathbf{x}_i^\top \boldsymbol{\beta})^2 + \lambda \sum_j \beta_j^2 \right\}$$

Closed-form solution:

$$\hat{\boldsymbol{\beta}}_\text{ridge} = (X^\top X + \lambda I)^{-1} X^\top \mathbf{y}$$

Adding $\lambda I$ with $\lambda > 0$ makes the matrix invertible even under perfect multicollinearity. Ridge shrinks all coefficients toward zero but never exactly to zero, so it does not perform variable selection. Choose $\lambda$ via cross-validation.

Lasso Regression (L1)

$$\hat{\boldsymbol{\beta}}_\text{lasso} = \arg\min_{\boldsymbol{\beta}} \left\{ \sum_i (y_i - \mathbf{x}_i^\top \boldsymbol{\beta})^2 + \lambda \sum_j |\beta_j| \right\}$$

The L1 penalty produces sparse solutions: it drives some coefficients exactly to zero, so Lasso performs automatic variable selection. There is no closed-form solution in general (it is solved with coordinate descent or the LARS algorithm).

Elastic Net

Combines L1 and L2:

$$\hat{\boldsymbol{\beta}}_\text{enet} = \arg\min_{\boldsymbol{\beta}} \left\{ \text{RSS} + \lambda_1 \sum_j |\beta_j| + \lambda_2 \sum_j \beta_j^2 \right\}$$

Useful when predictors number in the thousands (genomics, factor zoo in finance) — Lasso tends to select only one variable from a correlated group; Elastic Net can include all of them with reduced coefficients.

| Method | Penalty | Selects variables? | Handles multicollinearity? |
|---|---|---|---|
| OLS | None | No | No |
| Ridge | $\lambda \lVert\boldsymbol{\beta}\rVert_2^2$ | No | Yes |
| Lasso | $\lambda \lVert\boldsymbol{\beta}\rVert_1$ | Yes | Partially |
| Elastic Net | Both | Yes | Yes |

As $\lambda$ increases from 0, Ridge shrinks all coefficients smoothly toward (but never to) zero. Lasso drives coefficients to exactly zero at different thresholds, which is what gives automatic variable selection:
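
The contrast is easiest to see with a single centred predictor and no intercept, where both estimators have a closed form: ridge divides $X^\top y$ by $X^\top X + \lambda$, and lasso soft-thresholds $X^\top y$ at $\lambda/2$ (the minimiser of $\text{RSS} + \lambda|\beta|$). The sketch below assumes that one-dimensional special case; function names and data are hypothetical.

```python
def ridge_1d(x, y, lam):
    """Ridge closed form (X'X + lam)^-1 X'y for a single predictor."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + lam)


def lasso_1d(x, y, lam):
    """Minimiser of RSS + lam * |beta|: soft-threshold X'y at lam / 2."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    shrunk = max(abs(sxy) - lam / 2.0, 0.0)
    return (shrunk if sxy >= 0 else -shrunk) / sxx


x = [-2.0, -1.0, 0.0, 1.0, 2.0]   # centred predictor, so no intercept is needed
y = [-1.9, -1.2, 0.1, 0.9, 2.1]   # centred response

ols = ridge_1d(x, y, 0.0)          # lam = 0 recovers OLS
ridge_path = [ridge_1d(x, y, lam) for lam in (1.0, 10.0, 100.0)]
lasso_path = [lasso_1d(x, y, lam) for lam in (1.0, 10.0, 100.0)]
```

Along the ridge path the coefficient keeps shrinking but stays positive, while the lasso coefficient reaches exactly zero once $\lambda/2$ exceeds $|X^\top y|$.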

[Figure: coefficient paths β̂ for β₁ to β₄ against regularisation strength λ. Ridge (L2): coefficients shrink but never reach zero. Lasso (L1): coefficients hit exact zeros.]

Quantile Regression

Standard OLS estimates the conditional mean of $y$ given $X$. Quantile regression estimates any conditional quantile: the median, the 5th percentile, the 95th percentile.

It minimises an asymmetric loss function (the pinball loss):

$$\hat{\boldsymbol{\beta}}_\tau = \arg\min_{\boldsymbol{\beta}} \sum_i \rho_\tau(y_i - \mathbf{x}_i^\top \boldsymbol{\beta})$$

Where $\rho_\tau(u) = u(\tau - \mathbf{1}[u < 0])$ and $\tau \in (0,1)$ is the quantile.
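
A small sketch of the pinball loss follows, using an intercept-only model so that the fitted value is just a constant. For that special case a minimiser can always be found among the observed data points, so a brute-force search over them is enough; the helper names and return series are hypothetical.

```python
def pinball_loss(u, tau):
    """rho_tau(u) = u * (tau - 1[u < 0])."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def best_constant(y, tau):
    """Intercept-only quantile regression: search the data points for the minimiser."""
    return min(y, key=lambda c: sum(pinball_loss(v - c, tau) for v in y))


returns = [-0.08, -0.03, -0.01, 0.00, 0.01, 0.02, 0.02, 0.03, 0.04, 0.06, 0.09]
median_fit = best_constant(returns, 0.5)    # tau = 0.5 recovers the sample median
lower_tail = best_constant(returns, 0.05)   # tau = 0.05 targets the left tail
```

With predictors in the model the same loss is minimised over $\boldsymbol{\beta}$, typically by linear programming rather than a search.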

Why it matters in finance: Asset returns have fat tails. OLS describes only the conditional mean and says nothing about tail behaviour. Quantile regression at $\tau = 0.05$ models the conditional 5% return quantile, which is the 95% Value-at-Risk; at $\tau = 0.95$ it models upside potential. No normality assumption required.


Part 6: Panel Data and Fixed Effects

Panel data has both a cross-sectional dimension ($i$, e.g., stocks) and a time dimension ($t$). Pooled OLS ignores the panel structure.

The panel model:

$$y_{it} = \mathbf{x}_{it}^\top \boldsymbol{\beta} + \alpha_i + \varepsilon_{it}$$

Where $\alpha_i$ is an individual fixed effect: a time-invariant, unit-specific unobservable (e.g., a company’s management quality).

Fixed Effects (Within) Estimator

Demean each variable within its unit:

$$\tilde{y}_{it} = y_{it} - \bar{y}_i, \quad \tilde{\mathbf{x}}_{it} = \mathbf{x}_{it} - \bar{\mathbf{x}}_i$$

Then regress $\tilde{y}$ on $\tilde{\mathbf{x}}$. This eliminates $\alpha_i$ entirely: fixed effects are controlled for regardless of whether they’re correlated with $\mathbf{x}_{it}$ (no endogeneity from time-invariant confounders).
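
The within transformation can be sketched for a single regressor as below. The toy panel is constructed with a known slope of 2 and unit effects that differ sharply, so the demeaned regression should recover the slope despite those effects. The function and variable names are illustrative assumptions.

```python
from collections import defaultdict


def within_estimator(units, x, y):
    """Fixed-effects slope for one regressor via unit-level demeaning."""
    groups = defaultdict(list)
    for u, xi, yi in zip(units, x, y):
        groups[u].append((xi, yi))
    num = den = 0.0
    for obs in groups.values():
        xbar = sum(o[0] for o in obs) / len(obs)
        ybar = sum(o[1] for o in obs) / len(obs)
        for xi, yi in obs:
            num += (xi - xbar) * (yi - ybar)   # demeaned x times demeaned y
            den += (xi - xbar) ** 2
    return num / den


units = ["A", "A", "A", "B", "B", "B"]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
alpha = {"A": 10.0, "B": -5.0}                 # hypothetical unit effects
y = [2.0 * xi + alpha[u] for u, xi in zip(units, x)]
beta_fe = within_estimator(units, x, y)
```

Here the unit effect is negatively related to $x$ across units, which is the situation where ignoring the panel structure would distort the slope.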

Random Effects

Assumes $\alpha_i \sim \mathcal{N}(0, \sigma_\alpha^2)$ and $\text{Cov}(\alpha_i, \mathbf{x}_{it}) = 0$. More efficient than fixed effects when the assumption holds, but biased when it doesn’t.

Hausman test: tests whether random effects is consistent (i.e., whether $\alpha_i$ is uncorrelated with $\mathbf{x}_{it}$). Significant → use fixed effects. Not significant → random effects is valid and more efficient.


Part 7: Time Series

OLS assumes independent observations. Financial time series violate this: prices are strongly autocorrelated, and returns show serial dependence, most visibly in their volatility. Time series methods model the temporal dependence explicitly.

Stationarity

A time series $\{y_t\}$ is weakly stationary if:

  • $\mathbb{E}[y_t] = \mu$ (constant mean)
  • $\text{Var}(y_t) = \sigma^2$ (constant variance)
  • $\text{Cov}(y_t, y_{t-s})$ depends only on $s$, not on $t$

Non-stationary series (trending prices, unit root processes) produce spurious regressions — high R² and significant t-stats between unrelated variables.

Testing for stationarity:

  • ADF (Augmented Dickey-Fuller) test: $H_0$ is a unit root (non-stationary). Reject = stationary.
  • KPSS test: $H_0$ is stationarity. Reject = non-stationary.
  • Run both: if ADF rejects and KPSS doesn’t reject, strong evidence of stationarity.

Transformations to achieve stationarity:

  • First-difference: $\Delta y_t = y_t - y_{t-1}$ (removes trend)
  • Log transformation: stabilises variance
  • Log-difference: $\ln(y_t/y_{t-1})$, the log return in finance, typically stationary

ACF and PACF

Autocorrelation function (ACF): $\rho(s) = \text{Corr}(y_t, y_{t-s})$, the correlation between the series and its $s$-period lag. Decays gradually for AR processes, cuts off sharply for MA processes.

Partial autocorrelation function (PACF): correlation between $y_t$ and $y_{t-s}$ after removing the effects of $y_{t-1}, \ldots, y_{t-s+1}$. Cuts off sharply for AR processes, decays gradually for MA.

Use ACF/PACF plots to identify model order before fitting ARIMA.
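
A sample ACF can be computed in a few lines. The sketch below assumes the common estimator that divides each lagged sum of cross-products by the lag-0 sum of squares; the input is a noise-free geometric decay used only as a toy series, not a simulated stochastic process.

```python
def sample_acf(series, max_lag):
    """Sample autocorrelations for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [v - mean for v in series]
    c0 = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - s] for t in range(s, n)) / c0
            for s in range(max_lag + 1)]


# Toy input: y_t = 0.7 * y_{t-1} with no noise, starting from 1
series = [0.7 ** t for t in range(30)]
acf = sample_acf(series, max_lag=3)
```

The lag-0 value is 1 by construction, and the later lags fall away gradually rather than cutting off, which is the AR-type pattern described above.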

The chart below shows an AR(1) process with $\phi = 0.7$. ACF decays geometrically (never cuts off); PACF has a single spike at lag 1 then drops to zero, the diagnostic signature of a pure AR(1):

[Figure: ACF (geometric decay) and PACF (cuts off after lag 1) for lags 0 to 7, with ±1.96/√n significance bands.]

| ACF | PACF | Model |
|---|---|---|
| Geometric decay | Cuts off at lag p | AR(p) |
| Cuts off at lag q | Geometric decay | MA(q) |
| Decays slowly | Decays slowly | ARMA(p,q) |

ARIMA

AR(p), autoregressive:

$$y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \varepsilon_t$$

Current value is a linear function of $p$ past values. ACF decays geometrically; PACF cuts off at lag $p$.

MA(q), moving average:

$$y_t = \mu + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q}$$

Current value is a linear combination of $q$ past shocks. ACF cuts off at lag $q$; PACF decays geometrically.

ARIMA(p, d, q): Apply AR($p$) and MA($q$) to the $d$-times differenced series. $d=1$ handles a linear trend; $d=2$ handles a quadratic trend.

Model selection:

flowchart TD
Raw[Raw time series] --> ADF{ADF + KPSS test\nStationary?}
ADF -->|No| Diff[First-difference the series\nRepeat until stationary]
Diff --> ADF
ADF -->|Yes| Plots[Plot ACF and PACF\nof stationary series]
Plots --> AR{PACF cuts off\nat lag p?}
Plots --> MA{ACF cuts off\nat lag q?}
AR -->|Yes| ARm[Include AR terms]
MA -->|Yes| MAm[Include MA terms]
ARm --> Fit[Fit ARIMA candidates]
MAm --> Fit
Fit --> IC[Compare AIC and BIC]
IC --> Diag[Ljung-Box Q-test\non residuals]
Diag -->|Autocorrelation remains| Fit
Diag -->|White noise| Done[Final model selected]

GARCH (Volatility Modelling)

Financial return series exhibit volatility clustering — large moves follow large moves. ARIMA models the conditional mean; GARCH models the conditional variance.

GARCH(1,1):

$$\sigma_t^2 = \omega + \alpha \varepsilon_{t-1}^2 + \beta \sigma_{t-1}^2$$

Where $\sigma_t^2$ is today’s variance, $\varepsilon_{t-1}^2$ is yesterday’s squared shock (ARCH term), and $\sigma_{t-1}^2$ is yesterday’s variance (GARCH term). Covariance stationarity requires $\alpha + \beta < 1$, with $\omega > 0$ and $\alpha, \beta \geq 0$ keeping the variance positive.
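
The recursion is straightforward to run once parameters are given. The sketch below assumes the parameters are already estimated (the values are made up) and starts the recursion at the unconditional variance $\omega / (1 - \alpha - \beta)$, a standard result for a stationary GARCH(1,1) that the text above does not state.

```python
def garch_variance_path(shocks, omega, alpha, beta):
    """Conditional variances sigma_t^2 aligned with the shock series."""
    if alpha + beta >= 1:
        raise ValueError("alpha + beta must be below 1 for stationarity")
    long_run = omega / (1.0 - alpha - beta)   # unconditional variance
    sigma2 = [long_run]                        # initialise at the long-run level
    for eps in shocks[:-1]:
        # Next period's variance from this period's shock and variance
        sigma2.append(omega + alpha * eps ** 2 + beta * sigma2[-1])
    return sigma2


shocks = [0.01, -0.01, 0.05, -0.04, 0.01, 0.00, 0.01]  # hypothetical daily returns
sigma2 = garch_variance_path(shocks, omega=1e-5, alpha=0.10, beta=0.85)
```

The variance jumps the day after the large 5% shock and then decays back toward the long-run level, which is the volatility clustering the model is designed to capture.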

GARCH-family models are widely used for volatility forecasting in VaR calculations, option pricing (volatility dynamics), and risk management.

VAR (Vector Autoregression)

Extends AR to multiple time series, each equation regressing on lags of all variables:

$$\mathbf{y}_t = \mathbf{c} + A_1 \mathbf{y}_{t-1} + A_2 \mathbf{y}_{t-2} + \cdots + A_p \mathbf{y}_{t-p} + \boldsymbol{\varepsilon}_t$$

Useful for modelling interdependencies between variables (e.g., macro factors). Key tools:

  • Granger causality: does $x$ help predict $y$ beyond $y$’s own history?
  • Impulse response functions (IRF): trace the effect of a shock in one variable through the system over time
  • Forecast error variance decomposition (FEVD): what fraction of variable $i$’s forecast error variance is attributable to shocks from variable $j$?

Cointegration

Two non-stationary series $y_t$ and $x_t$ are cointegrated if there exists a linear combination $y_t - \gamma x_t$ that is stationary: they share a common stochastic trend and move together in the long run.

Engle-Granger test: Regress $y_t$ on $x_t$; test residuals for stationarity. If stationary, the series are cointegrated with cointegrating vector $(1, -\hat{\gamma})$.

Error Correction Model (ECM): When series are cointegrated, model short-run dynamics and long-run equilibrium together:

$$\Delta y_t = \alpha_0 + \alpha_1 (y_{t-1} - \gamma x_{t-1}) + \beta \Delta x_{t-1} + \varepsilon_t$$

The term $(y_{t-1} - \gamma x_{t-1})$ is the error correction term: it measures how far the system deviated from long-run equilibrium last period, and $\alpha_1$ (negative when the system corrects) determines the speed of mean reversion back.

Applications: pairs trading (equity or fixed income), purchasing power parity, yield curve dynamics.


Part 8: Principal Component Analysis (PCA)

PCA finds directions of maximum variance in high-dimensional data. It’s used for dimensionality reduction, factor construction, and dealing with multicollinearity.

The Math

Given a centred data matrix $X \in \mathbb{R}^{n \times p}$ (zero mean columns), compute the sample covariance matrix:

$$S = \frac{1}{n-1} X^\top X$$

Decompose via eigendecomposition:

$$S = V \Lambda V^\top$$

Where $V$ contains eigenvectors (principal components) and $\Lambda = \text{diag}(\lambda_1, \ldots, \lambda_p)$ contains eigenvalues in decreasing order. Equivalently, via SVD of $X$: $X = U D V^\top$.

Project onto the first $k$ components:

$$Z = X V_k \in \mathbb{R}^{n \times k}$$

The fraction of variance explained by the first $k$ components is $\sum_{i=1}^k \lambda_i / \sum_{i=1}^p \lambda_i$.
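
For two variables the eigendecomposition has a closed form, which makes the steps above easy to follow without a linear algebra library. The sketch below assumes exactly two columns; `pca_2d` and the data are illustrative, and larger problems would use an SVD routine instead.

```python
import math


def pca_2d(x1, x2):
    """Eigenvalues and first principal direction of a 2-variable covariance matrix."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    a = sum((v - m1) ** 2 for v in x1) / (n - 1)                     # S[0][0]
    c = sum((v - m2) ** 2 for v in x2) / (n - 1)                     # S[1][1]
    b = sum((u - m1) * (v - m2) for u, v in zip(x1, x2)) / (n - 1)   # S[0][1]
    # Eigenvalues of the symmetric matrix [[a, b], [b, c]]
    half_trace = (a + c) / 2.0
    radius = math.sqrt(((a - c) / 2.0) ** 2 + b ** 2)
    lam1, lam2 = half_trace + radius, half_trace - radius
    # Eigenvector for lam1 solves (a - lam1) v0 + b v1 = 0
    v = (b, lam1 - a) if b != 0 else ((1.0, 0.0) if a >= c else (0.0, 1.0))
    norm = math.hypot(v[0], v[1])
    return (lam1, lam2), (v[0] / norm, v[1] / norm)


x1 = [2.0, 4.0, 6.0, 8.0, 10.0]
x2 = [1.0, 2.1, 2.9, 4.2, 4.8]   # closely tracks x1
(lam1, lam2), pc1 = pca_2d(x1, x2)
explained = lam1 / (lam1 + lam2)
```

Because the two toy series move almost in lockstep, nearly all the variance loads on the first component.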

Interpretation in Finance

The first principal component of a set of stock returns often approximates the market factor. The second and third components often capture sector or style effects. PCA on the yield curve typically extracts:

  • PC1 — level (parallel shift, ~90% of variance)
  • PC2 — slope (short vs. long rates)
  • PC3 — curvature (butterfly)

PCA Regression

When predictors are collinear, regress on the first $k$ principal components instead of the original variables. Eliminates multicollinearity by construction (PCs are orthogonal). Trade-off: PCs may lack intuitive interpretation.


Part 9: Factor Models

Factor models decompose returns into systematic and idiosyncratic components:

$$r_i = \alpha_i + \beta_{i1} F_1 + \beta_{i2} F_2 + \cdots + \beta_{ik} F_k + \varepsilon_i$$

Where $F_j$ are common factors, $\beta_{ij}$ are factor loadings, and $\varepsilon_i$ is idiosyncratic risk.

CAPM (Single Factor)

$$r_i - r_f = \alpha_i + \beta_i (r_m - r_f) + \varepsilon_i$$

$\beta_i$ measures systematic (market) risk. $\alpha_i$ is Jensen’s alpha: excess return above what CAPM predicts. Estimated by OLS regression of excess returns on excess market returns.
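
Since the CAPM regression has a single regressor, the OLS slope is the covariance of the two excess return series divided by the variance of the market series. The sketch below builds a noise-free asset series from assumed values of alpha and beta and checks that the regression recovers them; all names and numbers are hypothetical.

```python
def capm_fit(asset_excess, market_excess):
    """OLS alpha and beta from excess return series."""
    n = len(asset_excess)
    ma = sum(asset_excess) / n
    mm = sum(market_excess) / n
    cov = sum((a - ma) * (m - mm) for a, m in zip(asset_excess, market_excess))
    var = sum((m - mm) ** 2 for m in market_excess)
    beta = cov / var            # OLS slope
    alpha = ma - beta * mm      # OLS intercept (Jensen's alpha)
    return alpha, beta


market_excess = [0.02, -0.01, 0.03, -0.02, 0.01]          # hypothetical r_m - r_f
asset_excess = [0.001 + 1.5 * m for m in market_excess]   # built with alpha=0.001, beta=1.5
alpha, beta = capm_fit(asset_excess, market_excess)
```

With real data the residual term is not zero, so the estimates come with standard errors and alpha in particular is rarely significant.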

Fama-French Three-Factor Model

$$r_i - r_f = \alpha_i + \beta_1 \text{MKT} + \beta_2 \text{SMB} + \beta_3 \text{HML} + \varepsilon_i$$

Where SMB (Small Minus Big) captures the size premium and HML (High Minus Low) captures the value premium. The Carhart four-factor model adds MOM (momentum). Fama-French five-factor adds RMW (profitability) and CMA (investment).

Barra-Style Risk Models

Multi-factor risk models used by risk management:

  • Style factors: value, momentum, quality, size, low volatility
  • Industry factors: GICS sector exposures
  • Country/currency factors: for global portfolios

The asset return covariance matrix $\Sigma$ splits portfolio risk into factor and idiosyncratic parts:

$$\text{Var}(r_p) = \mathbf{w}^\top \Sigma \mathbf{w} = \mathbf{w}^\top (B \Sigma_F B^\top + D) \mathbf{w}$$

Where $\mathbf{w}$ is the vector of portfolio weights, $B$ is the factor exposure matrix, $\Sigma_F$ is the factor covariance matrix, and $D$ is the diagonal idiosyncratic variance matrix.
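
The decomposition can be evaluated without forming the full asset covariance matrix: compute the portfolio's factor exposures $B^\top \mathbf{w}$ first, then the two variance pieces. The sketch below uses a made-up three-asset, two-factor example; every number and name in it is an assumption for illustration.

```python
def portfolio_variance(w, B, sigma_f, d):
    """Factor and idiosyncratic variance of a portfolio under r = B f + e."""
    n_assets, n_factors = len(w), len(sigma_f)
    # Portfolio factor exposures: B' w
    exposure = [sum(B[i][j] * w[i] for i in range(n_assets))
                for j in range(n_factors)]
    factor_var = sum(exposure[j] * sigma_f[j][k] * exposure[k]
                     for j in range(n_factors) for k in range(n_factors))
    idio_var = sum(w[i] ** 2 * d[i] for i in range(n_assets))  # D is diagonal
    return factor_var, idio_var


w = [0.5, 0.3, 0.2]                            # hypothetical portfolio weights
B = [[1.0, 0.2], [0.8, -0.1], [1.2, 0.5]]      # asset-by-factor exposures
sigma_f = [[0.04, 0.0], [0.0, 0.01]]           # factor covariance matrix
d = [0.02, 0.03, 0.05]                         # idiosyncratic variances

factor_var, idio_var = portfolio_variance(w, B, sigma_f, d)
total_var = factor_var + idio_var
```

Splitting the total this way is what lets a risk report attribute variance to individual factors versus stock-specific risk.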


Part 10: Statistical Testing Framework

Hypothesis Testing

  1. State $H_0$ (null) and $H_1$ (alternative)
  2. Choose a test statistic and its null distribution
  3. Compute the p-value: probability of observing a test statistic at least as extreme as the one computed, given $H_0$ is true
  4. Compare p-value to significance level $\alpha$ (typically 0.05)

Type I error (false positive): Rejecting $H_0$ when it is true. Probability = $\alpha$.

Type II error (false negative): Failing to reject $H_0$ when it is false. Probability = $\beta$.

Power = $1 - \beta$: probability of correctly rejecting a false null.

The p-value is not the probability that $H_0$ is true. It is the probability of the data (or more extreme) given $H_0$.

Multiple Testing

When testing $m$ hypotheses simultaneously, the probability of at least one false positive grows quickly. For independent tests with all nulls true:

$$P(\text{at least one false positive}) = 1 - (1-\alpha)^m$$

For $m=20$ tests at $\alpha=0.05$: $1 - 0.95^{20} \approx 64\%$ chance of a false positive.

Bonferroni correction: divide $\alpha$ by $m$, testing each hypothesis at $\alpha/m$. Conservative (controls family-wise error rate).

Benjamini-Hochberg (FDR) — controls the false discovery rate (expected proportion of false positives among rejections). Less conservative than Bonferroni; preferred when testing many hypotheses:

  1. Order p-values: $p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(m)}$
  2. Find the largest $k$ such that $p_{(k)} \leq \frac{k}{m} \alpha$
  3. Reject all hypotheses with $p \leq p_{(k)}$
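The two corrections can be compared with a short sketch. The p-values below are invented for illustration, and the function names are hypothetical; the example also reproduces the 64% family-wise error figure for 20 independent tests.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H_i when p_i <= alpha / m (controls family-wise error rate)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up BH procedure (controls false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    # Largest rank k with p_(k) <= (k / m) * alpha
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected

# Probability of at least one false positive across 20 independent tests
fwer = 1 - (1 - 0.05) ** 20

# Illustrative p-values (assumption, not real results)
p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
n_bonf = sum(bonferroni(p_vals))
n_bh = sum(benjamini_hochberg(p_vals))
print(f"FWER={fwer:.3f}, Bonferroni rejects {n_bonf}, BH rejects {n_bh}")
```

On these inputs Bonferroni rejects one hypothesis and Benjamini-Hochberg rejects two, which illustrates why BH is described as less conservative.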

In finance this matters enormously — Harvey, Liu & Zhu (2016) showed most published factor discoveries fail to survive multiple testing corrections.

Key Tests Reference

| Test | Null hypothesis | Use when |
|---|---|---|
| t-test (one sample) | $\mu = \mu_0$ | Testing if mean return differs from zero |
| t-test (two sample) | $\mu_1 = \mu_2$ | Comparing means of two groups |
| F-test | $\beta_j = \cdots = \beta_k = 0$ | Joint significance of predictors |
| Jarque-Bera | Normality ($\text{skew}=0$, $\text{kurt}=3$) | Testing normality of returns |
| Breusch-Pagan | Homoscedasticity | Testing for heteroscedasticity |
| Durbin-Watson | No first-order autocorrelation | Time series residual checking |
| Ljung-Box | No autocorrelation up to lag $h$ | Residual diagnostics |
| ADF | Unit root (non-stationary) | Pre-testing time series |
| KPSS | Stationarity | Pre-testing time series |
| Hausman | RE consistent ($\alpha_i \perp X$) | Fixed vs. random effects choice |
| Granger causality | $x$ does not Granger-cause $y$ | VAR causal inference |
| Chow test | No structural break | Testing regime changes |

Part 11: Distribution Statistics

Before running regressions or tests, understanding the shape of your data’s distribution matters — especially in finance where returns are decidedly non-normal.

Moments

The first four moments summarise a distribution's location, spread, asymmetry, and tail weight (they do not, in general, determine the distribution uniquely):

| Moment | Formula | What it measures |
|---|---|---|
| Mean | $\mu = \mathbb{E}[X]$ | Central tendency |
| Variance | $\sigma^2 = \mathbb{E}[(X-\mu)^2]$ | Spread |
| Skewness | $\gamma_1 = \mathbb{E}\!\left[\left(\frac{X-\mu}{\sigma}\right)^3\right]$ | Asymmetry |
| Kurtosis | $\gamma_2 = \mathbb{E}\!\left[\left(\frac{X-\mu}{\sigma}\right)^4\right]$ | Tail heaviness |

Skewness: Zero = symmetric. Positive = right tail (large positive outliers). Negative = left tail (crash risk in equity returns — large negative outliers dominate).

Kurtosis: Normal distribution has kurtosis = 3. Excess kurtosis = kurtosis − 3. Positive excess kurtosis means fat tails — extreme events are far more common than a normal distribution predicts.

Financial returns typically show negative skew and positive excess kurtosis. This is one reason the Gaussian, constant-volatility assumption in Black-Scholes tends to underprice out-of-the-money options (especially puts) relative to the market.

Fat Tails vs Normal

[Figure: density of a normal distribution (excess kurtosis = 0) against a fat-tailed distribution (excess kurtosis > 0), x-axis from −3σ to +3σ, with the tail-risk regions marked in both tails.]

The Jarque-Bera test tests jointly for zero skewness and zero excess kurtosis. With $\gamma_1$ the sample skewness and $\gamma_2$ the sample (raw) kurtosis:

$$JB = \frac{n}{6}\left(\gamma_1^2 + \frac{(\gamma_2 - 3)^2}{4}\right) \sim \chi^2_2 \quad \text{asymptotically, under normality}$$
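A small sketch of the statistic follows, using excess kurtosis (raw kurtosis minus 3) in the second term. It relies on the fact that a chi-squared variable with 2 degrees of freedom has survival function $e^{-x/2}$, so no statistics library is needed. The sample is a toy list, and with so few observations the asymptotic p-value is only indicative.

```python
import math

def jarque_bera(x):
    """Return (skewness, raw kurtosis, JB statistic, asymptotic p-value)."""
    n = len(x)
    mean = sum(x) / n
    # Central moments with 1/n normalisation
    m2 = sum((v - mean) ** 2 for v in x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    jb = n / 6 * (skew ** 2 + (kurt - 3) ** 2 / 4)
    # Chi-squared(2) survival function: P(X > jb) = exp(-jb / 2)
    p_value = math.exp(-jb / 2)
    return skew, kurt, jb, p_value

sample = [-2.0, -1.0, 0.0, 1.0, 2.0]   # toy data, symmetric and thin-tailed
skew, kurt, jb, p_value = jarque_bera(sample)
print(f"skew={skew:.3f} kurt={kurt:.3f} JB={jb:.4f} p={p_value:.3f}")
```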

Part 12: Inequality and Concentration Measures

Gini Coefficient

The Gini coefficient measures inequality in a distribution — how concentrated values are among a subset of the population. It ranges from 0 (perfect equality) to 1 (perfect inequality).

Construction via the Lorenz Curve:

The Lorenz curve plots the cumulative share of total income (or wealth) held by the bottom $x\%$ of the population:

$$L(F) = \frac{\int_0^F Q(p)\, dp}{\int_0^1 Q(p)\, dp}$$

Where $Q(p)$ is the quantile function (inverse CDF). Perfect equality means $L(F) = F$: the bottom 50% holds 50% of income.

The Gini coefficient is twice the area between the perfect equality line and the Lorenz curve:

$$G = 1 - 2\int_0^1 L(F)\, dF = \frac{\text{Area between diagonal and Lorenz curve}}{\text{Total area under diagonal}}$$

[Figure: Lorenz curves, cumulative % of income against cumulative % of population, for perfect equality (G=0), moderate inequality (G≈0.35), and high inequality (G≈0.60). Gini = 2 × the shaded area between the diagonal and the curve.]
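One way to compute the Gini coefficient from raw values is to build the empirical Lorenz curve and integrate it with the trapezoid rule. The sketch below does that in plain Python; the function names are hypothetical, and the piecewise-linear Lorenz curve is an approximation choice for finite samples.

```python
def lorenz_curve(values):
    """Points (population share, cumulative value share), starting at (0, 0)."""
    xs = sorted(values)
    total = sum(xs)
    n = len(xs)
    cum, points = 0.0, [(0.0, 0.0)]
    for i, v in enumerate(xs, start=1):
        cum += v
        points.append((i / n, cum / total))
    return points

def gini(values):
    """G = 1 - 2 * (area under the Lorenz curve), trapezoid rule."""
    pts = lorenz_curve(values)
    area = sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
    return 1 - 2 * area

print(gini([1, 1, 1, 1]))   # equal shares
print(gini([0, 0, 0, 1]))   # one holder owns everything
```

With a finite sample of size $n$ the maximum attainable value under this construction is $(n-1)/n$ rather than exactly 1.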

Gini in Model Validation (Credit Scoring)

The Gini coefficient has a second life in quantitative model evaluation — particularly in credit risk. A credit model ranks borrowers by predicted default probability; the Gini measures how well it separates defaulters from non-defaulters.

The relationship to the AUC (Area Under the ROC Curve):

$$\text{Gini} = 2 \times \text{AUC} - 1$$

A random model has AUC = 0.5, Gini = 0. A perfect model has AUC = 1, Gini = 1. In practice, credit scorecards with Gini > 0.4 are considered good; > 0.6 is excellent.
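As an illustration of the Gini/AUC relationship, the sketch below computes AUC as the share of (defaulter, non-defaulter) pairs in which the defaulter receives the higher score, counting ties as half. The scores and labels are invented, and the pairwise loop is quadratic, so it is only meant for small examples.

```python
def auc(scores, labels):
    """Probability that a random defaulter scores above a random non-defaulter."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Count concordant pairs; ties contribute one half
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Illustrative predicted default probabilities and outcomes (1 = default)
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]

auc_value = auc(scores, labels)
gini_value = 2 * auc_value - 1
print(f"AUC={auc_value:.4f} Gini={gini_value:.4f}")
```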

Herfindahl-Hirschman Index (HHI)

HHI measures market concentration — how dominant the largest players are:

$$\text{HHI} = \sum_{i=1}^n s_i^2$$

Where $s_i$ is firm $i$'s market share (as a fraction). Ranges from $1/n$ (perfectly equal shares) to 1 (monopoly).

  • HHI < 0.15: unconcentrated market
  • 0.15–0.25: moderate concentration
  • HHI > 0.25: highly concentrated (these bands follow the 2010 US DOJ/FTC Horizontal Merger Guidelines, which state them as 1,500 and 2,500 with shares in percent)

Used in: antitrust analysis, portfolio concentration risk, factor concentration in quant portfolios.
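A minimal sketch of the calculation, assuming raw sizes (revenues or position values) are normalised to fractional shares first. The `hhi` helper and the inputs are illustrative.

```python
def hhi(sizes):
    """Herfindahl-Hirschman Index from raw sizes, normalised to fractional shares."""
    total = sum(sizes)
    return sum((s / total) ** 2 for s in sizes)

print(hhi([25, 25, 25, 25]))   # four equal firms
print(hhi([50, 30, 20]))       # three unequal firms
print(hhi([100]))              # monopoly
```

The same function applies to portfolio weights; its reciprocal is often read as an effective number of equally sized positions.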


Part 13: Non-parametric Methods

Non-parametric methods make no assumptions about the underlying distribution. Essential when data is ordinal, heavily skewed, or has fat tails.

Rank Correlations

Pearson correlation measures linear dependence between two variables. It can be misleading when relationships are monotonic but non-linear, or when outliers distort the picture.

Spearman's $\rho$ replaces values with their ranks, then computes Pearson correlation on the ranks. When there are no ties this reduces to:

$$\rho_s = 1 - \frac{6 \sum_i d_i^2}{n(n^2-1)}$$

Where $d_i = \text{rank}(x_i) - \text{rank}(y_i)$. Captures any monotonic relationship, not just linear.

Kendall's $\tau$ counts concordant vs discordant pairs:

$$\tau = \frac{C - D}{\binom{n}{2}}$$

Where $C$ = concordant pairs (the pair is ordered the same way by $x$ and by $y$) and $D$ = discordant pairs. This is the tau-a form; the tau-b variant adjusts the denominator for ties. Kendall's $\tau$ is generally preferred to Spearman for small samples and for data with many ties.

| Method | Measures | Sensitive to outliers? | Use when |
|---|---|---|---|
| Pearson | Linear dependence | Yes | Normal data, linear relationship |
| Spearman | Monotonic dependence | No | Ordinal data, non-linear monotone |
| Kendall | Ordinal association | No | Small samples, many ties |
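The difference between the three measures shows up on a monotonic but non-linear relationship. The sketch below implements them in plain Python under the simplifying assumption of no tied values (the Spearman shortcut formula and the tau-a denominator both rely on it); the data are illustrative.

```python
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks 1..n; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        r[idx] = rank
    return r

def spearman(x, y):
    n = len(x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

def kendall_tau(x, y):
    """Tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    c = d = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                c += 1
            elif s < 0:
                d += 1
    return (c - d) / (n * (n - 1) / 2)

x = [1, 2, 3, 4, 5]
y = [v ** 2 for v in x]    # monotonic but non-linear
print(pearson(x, y), spearman(x, y), kendall_tau(x, y))
```

Pearson comes out below 1 here because the relationship is curved, while both rank measures equal 1 because the ordering is preserved exactly.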

Bootstrap

The bootstrap estimates the sampling distribution of any statistic by resampling with replacement from the observed data. No distributional assumptions required.

Algorithm:

  1. Draw $B$ bootstrap samples of size $n$ from the data (with replacement)
  2. Compute the statistic $\hat{\theta}^*_b$ on each sample
  3. The distribution of $\{\hat{\theta}^*_1, \ldots, \hat{\theta}^*_B\}$ approximates the sampling distribution of $\hat{\theta}$

Bootstrap confidence interval (percentile method):

$$\text{CI}_{95\%} = [\hat{\theta}^*_{(0.025)},\; \hat{\theta}^*_{(0.975)}]$$

The bootstrap is invaluable when:

  • The statistic has no closed-form sampling distribution (e.g., Sharpe ratio, Gini)
  • The data is clearly non-normal
  • You want robust standard errors for complex estimators

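In finance: the bootstrap is used to test whether a backtest's Sharpe ratio is statistically significant without assuming normal returns. Serially dependent returns call for a block bootstrap, and resampling does not remove look-ahead bias, which has to be prevented when the backtest data are constructed.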
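The percentile method can be sketched for a per-period Sharpe ratio as follows. The returns are simulated i.i.d. Gaussian draws purely for illustration, the helper names are hypothetical, and the plain resampling used here assumes independent observations.

```python
import math
import random

def sharpe(returns):
    """Per-period Sharpe ratio (mean over sample standard deviation, zero risk-free rate)."""
    n = len(returns)
    mean = sum(returns) / n
    var = sum((r - mean) ** 2 for r in returns) / (n - 1)
    return mean / math.sqrt(var)

def bootstrap_ci(data, stat, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for stat(data)."""
    rng = random.Random(seed)
    n = len(data)
    # Resample with replacement and recompute the statistic each time
    stats = sorted(stat(rng.choices(data, k=n)) for _ in range(n_boot))
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return stats[lo_idx], stats[hi_idx]

# Simulated daily returns (assumption: i.i.d. normal, for illustration only)
sim = random.Random(42)
returns = [sim.gauss(0.0005, 0.01) for _ in range(250)]

point = sharpe(returns)
lo, hi = bootstrap_ci(returns, sharpe, seed=1)
print(f"Sharpe={point:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```

If the interval contains zero, the backtest's Sharpe ratio is not distinguishable from zero at the chosen level.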

Kernel Density Estimation (KDE)

KDE estimates the probability density function of a dataset without assuming a parametric form:

$$\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^n K\!\left(\frac{x - x_i}{h}\right)$$

Where $K$ is a kernel function (usually Gaussian) and $h$ is the bandwidth, the smoothing parameter.

  • Small $h$: wiggly, overfits to noise
  • Large $h$: over-smoothed, loses shape detail
  • Rule-of-thumb $h$ (the normal reference rule, commonly quoted as Silverman's rule of thumb): $h = 1.06 \hat{\sigma} n^{-1/5}$. It is optimal only when the data are close to Gaussian and tends to over-smooth multimodal or fat-tailed data

KDE is used to visualise return distributions, compare empirical vs theoretical densities, and detect multimodality (e.g., bimodal return distributions suggesting regime changes).
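A dependency-free sketch of a Gaussian KDE with the $1.06\,\hat{\sigma}\,n^{-1/5}$ bandwidth is shown below. The data points are arbitrary, and the final Riemann sum is only a sanity check that the estimated density integrates to roughly one.

```python
import math
import statistics

def rule_of_thumb_bandwidth(data):
    """Normal reference bandwidth: 1.06 * sigma_hat * n^(-1/5)."""
    return 1.06 * statistics.stdev(data) * len(data) ** (-1 / 5)

def gaussian_kde(data, h):
    """Return a function x -> estimated density, using a Gaussian kernel."""
    n = len(data)
    norm = 1 / (n * h * math.sqrt(2 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data)
    return density

data = [-1.2, -0.4, 0.0, 0.3, 0.9, 1.5, 2.1]   # illustrative observations
h = rule_of_thumb_bandwidth(data)
density = gaussian_kde(data, h)

# Sanity check: Riemann sum of the density over a wide grid
step = 0.01
total = sum(density(-10 + step * i) for i in range(2201)) * step
print(f"h={h:.3f}, density at 0 = {density(0.0):.4f}, integral ~ {total:.4f}")
```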


Part 14: Logistic Regression and Classification

Linear regression predicts a continuous outcome. When the outcome is binary — default or no-default, fraud or not, churn or not — logistic regression is the standard tool.

The Model

Instead of modelling $y$ directly, logistic regression models the log-odds of the event:

$$\log\frac{P(y=1|\mathbf{x})}{1 - P(y=1|\mathbf{x})} = \boldsymbol{\beta}^\top \mathbf{x}$$

Solving for the probability:

$$P(y=1|\mathbf{x}) = \frac{1}{1+e^{-\boldsymbol{\beta}^\top \mathbf{x}}} = \sigma(\boldsymbol{\beta}^\top \mathbf{x})$$

The sigmoid function $\sigma$ maps any real number to $(0,1)$, giving a valid probability.

[Figure: sigmoid curve of P(default), from 0.0 to 1.0, against the linear predictor β₀ + β₁x₁ + …, with the 0.5 threshold separating non-default (y=0) from default (y=1).]

Estimation: Maximum Likelihood

Logistic regression has no closed-form solution. Parameters are found by maximising the log-likelihood:

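$$\ell(\boldsymbol{\beta}) = \sum_{i=1}^n \left[ y_i \log \hat{p}_i + (1-y_i) \log(1-\hat{p}_i) \right]$$

Where $\hat{p}_i = \sigma(\boldsymbol{\beta}^\top \mathbf{x}_i)$ is the fitted probability for observation $i$. Solved iteratively with Newton-Raphson (equivalently, iteratively reweighted least squares) or gradient descent.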
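To make the iteration concrete, here is a sketch of Newton-Raphson for a model with an intercept and a single predictor, so the 2×2 system can be solved by hand. The data are invented and deliberately not perfectly separable (otherwise the maximum likelihood estimate does not exist); `fit_logistic` is a hypothetical helper, not a library routine.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def fit_logistic(x, y, n_iter=25):
    """Newton-Raphson for P(y=1|x) = sigmoid(b0 + b1 * x)."""
    b0 = b1 = 0.0
    for _ in range(n_iter):
        p = [sigmoid(b0 + b1 * xi) for xi in x]
        # Gradient of the log-likelihood: X'(y - p)
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Negative Hessian: X' W X with W = diag(p * (1 - p))
        w = [pi * (1 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 ** 2
        # Newton step: beta += (X'WX)^{-1} X'(y - p), via the 2x2 inverse
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Illustrative data: one risk driver and a default flag
x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
y = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic(x, y)
p_hat = [sigmoid(b0 + b1 * xi) for xi in x]
print(f"b0={b0:.3f} b1={b1:.3f} odds ratio per unit={math.exp(b1):.3f}")
```

At the optimum the score equations hold, so with an intercept in the model the fitted probabilities sum to the observed number of events.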

Interpreting Coefficients

A one-unit increase in $x_j$ multiplies the odds by $e^{\beta_j}$:

$$\text{OR}_j = e^{\beta_j}$$

  • $\beta_j > 0$ → higher $x_j$ increases default probability
  • $\beta_j < 0$ → higher $x_j$ decreases default probability
  • $|\beta_j|$ → effect size (on the log-odds scale)

Unlike OLS, marginal effects on the probability scale depend on the values of all other variables — they are not constant.

Model Performance Metrics

| Metric | Formula | Interpretation |
|---|---|---|
| AUC | Area under ROC curve | 0.5 = random; 1.0 = perfect |
| Gini | $2 \times \text{AUC} - 1$ | 0 = random; 1 = perfect |
| KS Statistic | $\max_x \lvert F_1(x) - F_0(x) \rvert$ | Max separation between default/non-default CDFs |
| Log-loss | $-\ell(\hat{\boldsymbol{\beta}})/n$ | Lower is better; measures calibration |
| Brier Score | $\frac{1}{n}\sum(y_i - \hat{p}_i)^2$ | Mean squared error of probability forecasts |

In credit risk, Gini > 0.4 is typically the minimum acceptable threshold; Gini > 0.6 is strong.
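The KS statistic, Brier score, and log-loss from the table can be computed directly from predicted probabilities and outcomes. The sketch below uses a small invented sample; the function names are illustrative, and the log-loss assumes no predicted probability is exactly 0 or 1.

```python
import math

def ks_statistic(scores, labels):
    """Max gap between the empirical score CDFs of defaulters and non-defaulters."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    ks = 0.0
    for t in sorted(set(scores)):
        f1 = sum(1 for s in pos if s <= t) / len(pos)
        f0 = sum(1 for s in neg if s <= t) / len(neg)
        ks = max(ks, abs(f1 - f0))
    return ks

def brier_score(probs, labels):
    return sum((y - p) ** 2 for p, y in zip(probs, labels)) / len(probs)

def log_loss(probs, labels):
    """Negative mean log-likelihood; requires 0 < p < 1."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for p, y in zip(probs, labels)) / len(probs)

probs = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]   # illustrative predicted PDs
labels = [1, 1, 1, 0, 0, 0]               # 1 = default

ks = ks_statistic(probs, labels)
brier = brier_score(probs, labels)
ll = log_loss(probs, labels)
print(f"KS={ks:.4f} Brier={brier:.4f} log-loss={ll:.4f}")
```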


Part 15: Scorecard Development — WoE and Information Value

Credit scorecards translate continuous and categorical predictors into integer points. The standard preprocessing pipeline uses Weight of Evidence (WoE) encoding.

Weight of Evidence (WoE)

For a predictor binned into groups, WoE for bin $i$ is:

$$\text{WoE}_i = \ln\left(\frac{\text{Distribution of Events}_i}{\text{Distribution of Non-Events}_i}\right) = \ln\left(\frac{P(\text{bin } i \mid \text{event})}{P(\text{bin } i \mid \text{non-event})}\right)$$

Here the distribution of events in bin $i$ is the share of all events (defaults) that fall in that bin, and likewise for non-events. Many scorecard texts use the inverse ratio (non-events over events), which flips the sign.

  • Positive WoE → bin has a higher proportion of defaults than the overall population (risky)
  • Negative WoE → bin has lower proportion of defaults (safe)
  • WoE = 0 → bin default rate equals the population average

WoE transforms all variables to a common, interpretable scale and handles non-linearity and missing values naturally.

Information Value (IV)

IV summarises a variable’s predictive power across all its bins:

$$\text{IV} = \sum_i (\text{Events}_i\% - \text{Non-Events}_i\%) \times \text{WoE}_i$$

| IV | Predictive Power |
|---|---|
| < 0.02 | Useless |
| 0.02–0.1 | Weak |
| 0.1–0.3 | Medium |
| 0.3–0.5 | Strong |
| > 0.5 | Suspicious (check for data leakage) |

IV is the primary variable selection criterion in scorecard development. Variables with IV < 0.02 are typically dropped; IV > 0.5 triggers a data quality review.
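A minimal sketch of the WoE and IV calculation from per-bin counts follows, using the events-over-non-events convention from this section. The counts are invented, `woe_iv` is a hypothetical helper, and the code assumes every bin contains at least one event and one non-event (empty bins need merging or smoothing first).

```python
import math

def woe_iv(events, non_events):
    """Per-bin WoE (events over non-events) and total Information Value."""
    tot_e, tot_n = sum(events), sum(non_events)
    woes, iv = [], 0.0
    for e, n in zip(events, non_events):
        pe, pn = e / tot_e, n / tot_n      # share of events / non-events in this bin
        woe = math.log(pe / pn)
        iv += (pe - pn) * woe
        woes.append(woe)
    return woes, iv

# Illustrative counts for a three-bin characteristic
events = [40, 35, 25]          # defaults per bin
non_events = [250, 350, 400]   # non-defaults per bin

woes, iv = woe_iv(events, non_events)
print([round(w, 4) for w in woes], round(iv, 4))
```

Each term of the IV sum is non-negative because the share difference and the WoE always have the same sign, so IV can only grow as bins differ more.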

From WoE to Scorecard Points

Once logistic regression is fit on WoE-transformed variables, scorecard points are assigned by scaling coefficients to an integer range (e.g., 300–850 for consumer credit):

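$$\text{Points}_j = -\left(\beta_j \times \text{WoE}_{ij} + \frac{\beta_0}{k}\right) \times \text{Factor} + \frac{\text{Offset}}{k}$$

Where $\text{WoE}_{ij}$ is the WoE of the bin $i$ that the applicant falls into for characteristic $j$, and $k$ is the number of characteristics, so that the intercept and the Offset are spread evenly and the points sum to the total score. Factor and Offset are chosen to anchor the score to a target odds at a target score (e.g., odds of 50:1 at score 600) together with a chosen number of points to double the odds (PDO): $\text{Factor} = \text{PDO} / \ln 2$ and $\text{Offset} = \text{Score} - \text{Factor} \times \ln(\text{Odds})$.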

The final score is additive across characteristics — easy to explain to regulators and customers.
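The scaling can be sketched as below. The anchor of 50:1 odds at a score of 600 comes from the example in the text; the 20 points to double the odds (PDO), the coefficients, and the WoE values are assumptions chosen for illustration. The sketch spreads both the intercept and the Offset evenly over the $k$ characteristics so that the points add up to the total score, and it treats the model as predicting the log-odds of default.

```python
import math

def scaling(target_score=600, target_odds=50, pdo=20):
    """Factor and Offset from an anchor (good:bad odds at a score) and a PDO."""
    factor = pdo / math.log(2)
    offset = target_score - factor * math.log(target_odds)
    return factor, offset

def points(beta_j, woe_ij, beta_0, k, factor, offset):
    """Points for one characteristic; summing over k characteristics gives the score."""
    return -(beta_j * woe_ij + beta_0 / k) * factor + offset / k

factor, offset = scaling()

# Hypothetical fitted model on WoE-encoded inputs (log-odds of default)
beta_0 = -3.0
betas = [0.9, 1.1]
applicant_woe = [-0.4, 0.2]    # WoE of the bins this applicant falls into
k = len(betas)

score = sum(points(b, w, beta_0, k, factor, offset)
            for b, w in zip(betas, applicant_woe))
log_odds_default = beta_0 + sum(b * w for b, w in zip(betas, applicant_woe))
print(f"factor={factor:.3f} offset={offset:.3f} score={score:.1f}")
```

Because the score is linear in the log-odds, the total equals Offset minus Factor times the predicted log-odds of default, which is what makes the per-characteristic points additive.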


Part 16: Survival Analysis

Survival analysis models the time until an event occurs — time to default, time to prepayment, time to customer churn. Unlike logistic regression (which asks “will it happen?”), survival analysis asks “when will it happen?”

Core Functions

Survival function $S(t)$: probability the event has not occurred by time $t$:

$$S(t) = P(T > t), \quad S(0) = 1, \quad S(\infty) = 0$$

Hazard function $h(t)$: instantaneous rate of the event at time $t$, given survival to $t$:

$$h(t) = \lim_{\Delta t \to 0} \frac{P(t \leq T < t+\Delta t \mid T \geq t)}{\Delta t} = -\frac{S'(t)}{S(t)}$$

Cumulative hazard: $H(t) = \int_0^t h(s)\, ds = -\ln S(t)$

Kaplan-Meier Estimator

The non-parametric estimate of $S(t)$ from censored data:

$$\hat{S}(t) = \prod_{t_i \leq t} \left(1 - \frac{d_i}{n_i}\right)$$

Where $d_i$ is the number of events and $n_i$ is the number at risk at time $t_i$. Censored observations (e.g., loans that were paid off before defaulting) are handled naturally: they contribute to the risk set up to their exit time, then drop out.
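The product-limit formula can be sketched in a few lines. The durations and censoring flags below are invented (think months on book, with 1 = default observed and 0 = censored), and `kaplan_meier` is an illustrative helper rather than a library function.

```python
def kaplan_meier(durations, observed):
    """Return [(t_i, S_hat(t_i))] at each distinct event time."""
    event_times = sorted({t for t, e in zip(durations, observed) if e})
    surv, curve = 1.0, []
    for t in event_times:
        # Risk set: everyone whose duration (event or censoring) is at least t
        n_at_risk = sum(1 for d in durations if d >= t)
        d_events = sum(1 for d, e in zip(durations, observed) if d == t and e)
        surv *= 1 - d_events / n_at_risk
        curve.append((t, surv))
    return curve

durations = [3, 5, 5, 8, 10, 12]   # illustrative times on book
observed = [1, 1, 0, 1, 0, 1]      # 1 = default observed, 0 = censored

curve = kaplan_meier(durations, observed)
for t, s in curve:
    print(t, round(s, 4))
```

The censored loan at time 5 still counts in the risk set at time 5 but contributes no event, which is how censoring enters the estimate.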

Cox Proportional Hazards Model

The Cox model is the standard regression approach for survival data. It relates covariates to the hazard without specifying the baseline hazard shape (semi-parametric):

$$h(t|\mathbf{x}) = h_0(t) \cdot \exp(\boldsymbol{\beta}^\top \mathbf{x})$$

Where $h_0(t)$ is an unspecified baseline hazard. The proportional hazards assumption: the hazard ratio between two individuals with different covariates is constant over time.

Hazard ratio (HR): $\text{HR}_j = e^{\beta_j}$. A one-unit increase in $x_j$ multiplies the hazard by $e^{\beta_j}$. HR > 1 means higher risk; HR < 1 means lower risk.

Estimated with partial likelihood — the baseline hazard cancels out, making estimation tractable without specifying it.

Applications in credit:

  • Probability of default over a 12-month horizon (IFRS 9 Stage migration)
  • Lifetime probability of default (IFRS 9 ECL)
  • Time to repayment / prepayment modelling

Part 17: Model Monitoring and Validation

Models degrade over time as the population they’re applied to drifts away from the development sample. Model monitoring is a regulatory requirement (SR 11-7, PRA SS1/23) and a risk management necessity.

Population Stability Index (PSI)

PSI measures how much a variable’s distribution has shifted between the development (reference) period and a monitoring period:

$$\text{PSI} = \sum_i \left(A_i - E_i\right) \times \ln\left(\frac{A_i}{E_i}\right)$$

Where $A_i$ = actual proportion in bin $i$ (monitoring), $E_i$ = expected proportion in bin $i$ (development).

| PSI | Interpretation |
|---|---|
| < 0.10 | No significant shift: model still valid |
| 0.10–0.25 | Moderate shift: investigate |
| > 0.25 | Major shift: model may need redevelopment |

PSI is computed on the score distribution (overall stability) and on each input characteristic (Characteristic Stability Index, CSI). A high PSI on one characteristic identifies which variable is driving the drift.
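A minimal sketch of the PSI calculation on pre-binned proportions is given below. The two distributions are invented, and the code assumes no bin has a zero proportion in either period (in practice empty bins are merged or floored at a small value). The same function serves as a CSI when applied to an input characteristic's bins.

```python
import math

def psi(expected, actual):
    """Population Stability Index from bin proportions (each list sums to 1)."""
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))

# Illustrative score-band proportions
expected = [0.10, 0.20, 0.40, 0.20, 0.10]   # development sample
actual = [0.08, 0.18, 0.38, 0.22, 0.14]     # monitoring window

value = psi(expected, actual)
print(f"PSI = {value:.4f}")
```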

Characteristic Stability Index (CSI)

CSI applies the same formula as PSI but to individual input variables. Workflow:

flowchart LR
Score[Score monitoring window] --> PSI{PSI above 0.10?}
PSI -->|No| OK[Model stable - continue]
PSI -->|Yes| CSI[Compute CSI for each variable]
CSI --> Driver[Identify driver variable]
Driver --> Root[Root cause: data issue or population shift]
Root --> Fix[Recalibrate or redevelop model]

Performance Monitoring

Track discrimination and calibration separately — a model can remain discriminatory (Gini stable) while becoming poorly calibrated (predicted rates diverge from actuals):

| Metric | Monitors | Alert threshold |
|---|---|---|
| Gini / AUC | Discrimination (rank ordering) | Drop > 5 pp from development Gini |
| KS Statistic | Separation between default/non-default | Drop > 5 pp |
| Predicted vs Actual Default Rate | Calibration | Predicted/Actual ratio outside 0.8–1.2 |
| Hosmer-Lemeshow test | Calibration (formal) | p-value < 0.05 across score bands |
| PSI | Population drift | > 0.25 on score or key characteristic |

Backtesting

For through-the-cycle models (PD, LGD), backtesting compares predicted values against realised outcomes:

Binomial test for PD: Under $H_0$ that predicted PD is correct, the number of defaults in a cohort follows a Binomial distribution. Test whether actual defaults are consistent with predicted.
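A one-sided version of this test can be sketched with the exact binomial tail. The cohort size, predicted PD, and observed defaults are invented, and the test assumes defaults are independent, which tends to understate the variance when defaults are correlated across obligors.

```python
import math

def binomial_upper_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p), via the complement of the lower tail."""
    lower = sum(math.comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(k))
    return 1 - lower

# Illustrative cohort: 1,000 obligors, predicted PD of 2%, 30 observed defaults
n_obligors, predicted_pd, observed_defaults = 1000, 0.02, 30

p_value = binomial_upper_tail(n_obligors, observed_defaults, predicted_pd)
print(f"P(defaults >= {observed_defaults} | PD = {predicted_pd}) = {p_value:.4f}")
```

A small p-value here suggests the model underpredicts default risk for this cohort; the mirror-image lower tail checks for overprediction.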

Traffic light framework (Basel):

  • Green zone: actual defaults within expected range
  • Amber zone: borderline — increase monitoring
  • Red zone: model materially over/underpredicts — regulatory notification required

Part 18: Risk Metrics — VaR and Expected Shortfall

Value at Risk (VaR)

VaR at tail probability $\alpha$ is the loss that is not exceeded with probability $1-\alpha$ over a given horizon:

$$P(L > \text{VaR}_\alpha) = \alpha$$

Equivalently, $\text{VaR}_\alpha$ is the $(1-\alpha)$-quantile of the loss distribution (e.g., the 99th percentile of losses for $\alpha = 1\%$).

Three estimation approaches:

| Method | How | Assumptions |
|---|---|---|
| Historical simulation | Sort past P&L; read off percentile | Distribution-free; captures fat tails and correlations |
| Parametric (variance-covariance) | Assume normal returns; $\text{VaR}_\alpha = \mu_L + z_{1-\alpha} \sigma_L$, with $\mu_L$ and $\sigma_L$ the mean and standard deviation of losses | Fast; underestimates tail risk for non-normal returns |
| Monte Carlo | Simulate thousands of scenarios from a model | Flexible; computationally expensive |

Limitations of VaR:

  • Not subadditive — a portfolio of two positions can have higher VaR than the sum of their individual VaRs (violates diversification intuition)
  • Tells you nothing about the magnitude of losses beyond the threshold

Expected Shortfall (CVaR / ES)

Expected Shortfall is the expected loss conditional on exceeding VaR. For a continuous loss distribution:

$$\text{ES}_\alpha = \mathbb{E}[L \mid L > \text{VaR}_\alpha] = \frac{1}{\alpha} \int_0^{\alpha} \text{VaR}_u\, du$$

ES is the average of all losses in the tail beyond VaR. It is:

  • Subadditive — always rewards diversification
  • More sensitive to tail shape — captures the severity, not just the threshold
  • The regulatory standard for market risk under Basel IV (FRTB): ES at the 97.5% confidence level replaced VaR at the 99% level
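Historical-simulation VaR and ES, plus the parametric normal VaR, can be sketched as follows. Losses are positive numbers, `alpha` is the tail probability, and the empirical quantile convention used here (VaR is the largest loss outside the worst `alpha` share of observations) is one of several in use. The loss series is a toy sequence chosen so the result is easy to check by hand.

```python
from statistics import NormalDist

def historical_var_es(losses, alpha=0.05):
    """Empirical VaR and ES at tail probability alpha (losses are positive)."""
    xs = sorted(losses, reverse=True)           # worst losses first
    n_tail = max(1, round(alpha * len(xs)))     # number of observations in the tail
    var = xs[n_tail]                            # share of losses above this is alpha
    es = sum(xs[:n_tail]) / n_tail              # average of the tail losses beyond VaR
    return var, es

def parametric_var(mu_loss, sigma_loss, alpha=0.05):
    """Normal VaR: mu_L + z_{1-alpha} * sigma_L."""
    return mu_loss + NormalDist().inv_cdf(1 - alpha) * sigma_loss

losses = [float(i) for i in range(1, 101)]      # toy losses 1, 2, ..., 100
var_95, es_95 = historical_var_es(losses, alpha=0.05)
print(var_95, es_95)
print(round(parametric_var(0.0, 1.0, alpha=0.01), 4))
```

ES is never smaller than VaR at the same level, since it averages only the losses beyond the VaR threshold.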

Duration and DV01 (Fixed Income Risk)

For fixed income portfolios, interest rate sensitivity is measured by:

Modified Duration:

$$D_\text{mod} = -\frac{1}{P}\frac{dP}{dy} \approx -\frac{\Delta P / P}{\Delta y}$$

A bond with modified duration of 5 loses approximately 5% in value for a 1% (100bp) rise in yield.

DV01 (Dollar Value of a Basis Point):

$$\text{DV01} = -\frac{dP}{dy} \times 0.0001 \approx D_\text{mod} \times P \times 0.0001$$

DV01 is the P&L change for a 1 basis point (0.01%) move in yield. The standard unit for expressing interest rate risk on a trading desk.

Convexity measures the curvature of the price-yield relationship (duration is the first-order approximation; convexity is the second-order correction):

$$\Delta P \approx -D_\text{mod} \cdot P \cdot \Delta y + \frac{1}{2} \cdot C \cdot P \cdot (\Delta y)^2$$

Positive convexity (standard bonds) means the bond gains more when yields fall than it loses when yields rise by the same amount.
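These sensitivities can be checked numerically by repricing a bond under small yield shifts. The sketch below assumes an annual-pay bullet bond and central finite differences; the bond terms are illustrative and the function names are hypothetical.

```python
def bond_price(face, coupon_rate, ytm, years):
    """Price of an annual-pay bullet bond."""
    coupon = face * coupon_rate
    pv_coupons = sum(coupon / (1 + ytm) ** t for t in range(1, years + 1))
    return pv_coupons + face / (1 + ytm) ** years

def rate_sensitivities(face, coupon_rate, ytm, years, dy=1e-4):
    """Modified duration, convexity, and DV01 via central finite differences."""
    p0 = bond_price(face, coupon_rate, ytm, years)
    p_up = bond_price(face, coupon_rate, ytm + dy, years)
    p_dn = bond_price(face, coupon_rate, ytm - dy, years)
    d_mod = -(p_up - p_dn) / (2 * dy * p0)
    convexity = (p_up + p_dn - 2 * p0) / (dy ** 2 * p0)
    dv01 = d_mod * p0 * 1e-4
    return p0, d_mod, convexity, dv01

# Illustrative 5-year, 5% annual coupon bond priced at a 5% yield (par)
p0, d_mod, convexity, dv01 = rate_sensitivities(100, 0.05, 0.05, 5)

shock = 0.01   # +100bp
actual = bond_price(100, 0.05, 0.05 + shock, 5) - p0
first_order = -d_mod * p0 * shock
second_order = first_order + 0.5 * convexity * p0 * shock ** 2
print(f"D_mod={d_mod:.4f} C={convexity:.2f} DV01={dv01:.4f}")
print(f"actual={actual:.4f} duration only={first_order:.4f} with convexity={second_order:.4f}")
```

For the 100bp shock the duration-only estimate overstates the loss, and adding the convexity term moves the estimate closer to the full repricing.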

Expected Credit Loss (ECL — IFRS 9)

Under IFRS 9, banks must recognise expected credit losses on in-scope financial assets, measured over a 12-month or lifetime horizon depending on the instrument's stage. For a single period:

$$\text{ECL} = \text{PD} \times \text{LGD} \times \text{EAD} \times \text{DF}$$

Where:

  • PD — Probability of Default (from logistic/survival model)
  • LGD — Loss Given Default (fraction of exposure lost; modelled via beta regression or OLS on logit-transformed LGD)
  • EAD — Exposure at Default (outstanding balance at time of default)
  • DF — Discount factor (to present value)

Staging under IFRS 9:

  • Stage 1 — 12-month ECL (no significant credit deterioration since origination)
  • Stage 2 — Lifetime ECL (significant increase in credit risk)
  • Stage 3 — Lifetime ECL, credit-impaired

The transition between stages is the critical modelling decision — typically driven by PD relative to origination PD, delinquency triggers, or watchlist flags.
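As a simplified sketch of how the horizon changes the number, the example below sums the per-period product PD × LGD × EAD × DF over one year (Stage 1) and over the full remaining life (Stages 2 and 3). The marginal PDs, exposure profile, flat LGD, and discount rate are all invented, and real implementations add scenario weighting and more careful timing assumptions.

```python
def ecl(marginal_pds, eads, lgd, rate, horizon):
    """Sum of PD_t * LGD * EAD_t * DF_t over the first `horizon` annual periods."""
    total = 0.0
    for t, (pd_t, ead_t) in enumerate(zip(marginal_pds, eads), start=1):
        if t > horizon:
            break
        discount_factor = 1 / (1 + rate) ** t
        total += pd_t * lgd * ead_t * discount_factor
    return total

# Illustrative 3-year amortising loan
marginal_pds = [0.02, 0.025, 0.03]    # probability of default in each year
eads = [100_000, 80_000, 60_000]      # exposure at default in each year
lgd, rate = 0.45, 0.05

ecl_12m = ecl(marginal_pds, eads, lgd, rate, horizon=1)                      # Stage 1
ecl_lifetime = ecl(marginal_pds, eads, lgd, rate, horizon=len(eads))         # Stage 2 / 3
print(f"12-month ECL = {ecl_12m:.2f}, lifetime ECL = {ecl_lifetime:.2f}")
```

The jump from the 12-month to the lifetime figure is the provisioning impact of a move from Stage 1 to Stage 2, which is why the staging trigger matters so much.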


Common Pitfalls

| Pitfall | What happens | Fix |
|---|---|---|
| Omitted variable bias | $\hat{\boldsymbol{\beta}}$ is biased and inconsistent | Add the variable; use instrumental variables or fixed effects |
| Spurious regression | Fake significance between unrelated non-stationary series | Test stationarity; difference, or use an error-correction model (ECM) if the series are cointegrated |
| Look-ahead bias | Future data leaks into predictors | Align data carefully; use lagged values |
| P-hacking | Testing many models, reporting the best | Pre-register hypothesis; correct for multiple testing |
| Overfitting | Model fits in-sample noise | Cross-validate; use regularisation; hold-out test set |
| Ignoring autocorrelation | Standard errors too small; over-rejection | Use HAC standard errors or model the residual autocorrelation explicitly |
| Reverse causality | Causal direction is ambiguous | Instrumental variables; Granger causality tests (which check temporal precedence, not true causation) |