Quantitative Methods


A comprehensive reference for quantitative analytics — regression, credit risk models, scorecard development, survival analysis, model monitoring, time series, factor models, and risk metrics.

Key Takeaways

OLS finds the coefficient vector that minimises the sum of squared residuals; the closed-form solution is β = (XᵀX)⁻¹Xᵀy.
Logistic regression models binary outcomes (default/no-default); coefficients are log-odds ratios and the output is a probability.
Weight of Evidence (WoE) and Information Value (IV) are the standard feature engineering and selection tools for credit scorecards.
Survival analysis models time-to-event (time to default); the Cox proportional hazards model estimates relative default risk.
PSI detects population drift post-deployment — it is the first check in any model monitoring framework.
VaR and Expected Shortfall quantify market and credit risk; ES replaces VaR as the market-risk capital measure under the Basel FRTB rules (part of what is often called Basel IV).

The Core Idea

Linear regression models the relationship between a response variable $y$ and one or more predictors $\mathbf{x}$ as a linear function, then estimates that function from data by minimising prediction error. It is the workhorse of quantitative analytics — directly useful for modelling returns, risk factors, and economic relationships, and the conceptual foundation for nearly every more complex method.

Part 1: Ordinary Least Squares (OLS)

The Model

The population regression model assumes:

y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \varepsilon_i

In matrix form with $n$ observations and $k$ predictors:

\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}

Where:

$\mathbf{y} \in \mathbb{R}^n$ — response vector
$X \in \mathbb{R}^{n \times (k+1)}$ — design matrix (first column is a vector of ones for the intercept)
$\boldsymbol{\beta} \in \mathbb{R}^{k+1}$ — coefficient vector (what we estimate)
$\boldsymbol{\varepsilon} \in \mathbb{R}^n$ — error vector (unobservable)

OLS Derivation

OLS minimises the residual sum of squares:

\text{RSS}(\boldsymbol{\beta}) = \sum_{i=1}^n (y_i - \mathbf{x}_i^\top \boldsymbol{\beta})^2 = (\mathbf{y} - X\boldsymbol{\beta})^\top(\mathbf{y} - X\boldsymbol{\beta})

Taking the derivative with respect to $\boldsymbol{\beta}$ and setting it to zero:

\frac{\partial \text{RSS}}{\partial \boldsymbol{\beta}} = -2X^\top(\mathbf{y} - X\boldsymbol{\beta}) = 0

This gives the normal equations: $X^\top X \boldsymbol{\beta} = X^\top \mathbf{y}$

Solving (when $X^\top X$ is invertible):

\boxed{\hat{\boldsymbol{\beta}} = (X^\top X)^{-1} X^\top \mathbf{y}}

This is the OLS estimator. It has a closed-form solution — no iteration required.

Intuition: OLS projects $\mathbf{y}$ orthogonally onto the column space of $X$. The fitted values $\hat{\mathbf{y}} = X\hat{\boldsymbol{\beta}}$ are the point in that column space closest to $\mathbf{y}$ under the Euclidean norm.
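A minimal NumPy sketch of the closed-form estimator and the projection intuition. The data, seed, and coefficient values are made up for illustration; in practice a least-squares solver is usually preferred over forming $X^\top X$ explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed so the sketch is reproducible
n = 200
x = rng.normal(size=(n, 2))              # two hypothetical predictors
X = np.column_stack([np.ones(n), x])     # design matrix with an intercept column
beta_true = np.array([1.0, 2.0, -0.5])   # illustrative coefficients (assumption)
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Normal equations: solve (X'X) beta = X'y instead of inverting X'X.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# SVD-based least squares, numerically safer when X'X is ill-conditioned.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Orthogonal projection: residuals are orthogonal to every column of X.
residuals = y - X @ beta_hat
orthogonality = X.T @ residuals
```

Both routes give the same coefficients here, and `orthogonality` is zero up to floating-point error, which is the normal equations restated.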

The Gauss-Markov Theorem

Under the classical assumptions below (normality excepted, as noted in the table), OLS is the Best Linear Unbiased Estimator (BLUE): it has the smallest variance among all linear unbiased estimators.

Assumption	Statement	What breaks it
L Linearity	$y = X\beta + \varepsilon$ is correctly specified	Omitted variables, wrong functional form
I Independence	Observations are independent	Time series autocorrelation, clustered data
H Homoscedasticity	$\text{Var}(\varepsilon_i) = \sigma^2$ (constant)	Volatility clustering in financial returns
N Normality	$\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$	Fat tails, outliers (needed for exact inference, not BLUE)
E Exogeneity	$\mathbb{E}[\varepsilon \mid X] = 0$	Omitted variables correlated with $X$, simultaneity, measurement error

When these hold, $\hat{\boldsymbol{\beta}}$ is unbiased ($\mathbb{E}[\hat{\boldsymbol{\beta}}] = \boldsymbol{\beta}$) and efficient.

Part 2: Inference and Model Evaluation

Coefficient Standard Errors

The variance-covariance matrix of $\hat{\boldsymbol{\beta}}$ under homoscedasticity:

\text{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2 (X^\top X)^{-1}

Since $\sigma^2$ is unknown, replace it with the unbiased estimator:

\hat{\sigma}^2 = \frac{\text{RSS}}{n - k - 1} = \frac{\sum_i \hat{\varepsilon}_i^2}{n-k-1}

The standard error of $\hat{\beta}_j$ is $\text{SE}(\hat{\beta}_j) = \hat{\sigma} \sqrt{[(X^\top X)^{-1}]_{jj}}$
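The formulas above can be sketched in a few lines of NumPy. The helper name `ols_inference` and the four-point dataset are illustrative assumptions; the errors were hand-picked so that the fit recovers $y = 1 + 2x$ exactly.

```python
import numpy as np

def ols_inference(X, y):
    """Return OLS coefficients, standard errors, and t-statistics."""
    n, p = X.shape                        # p = k + 1 (intercept column included in X)
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)      # RSS / (n - k - 1)
    se = np.sqrt(sigma2 * np.diag(XtX_inv))
    return beta, se, beta / se

# Tiny illustrative dataset: y = 1 + 2x plus small hand-picked errors.
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
y = np.array([1.1, 2.9, 4.9, 7.1])
beta, se, t_stats = ols_inference(X, y)
```

Here RSS = 0.04 on 2 degrees of freedom, so $\hat{\sigma}^2 = 0.02$ and the squared standard errors are $0.02 \times (0.7, 0.2)$.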

Hypothesis Testing

t-test for individual coefficients:

t_j = \frac{\hat{\beta}_j}{\text{SE}(\hat{\beta}_j)} \sim t_{n-k-1} \quad \text{under } H_0: \beta_j = 0

F-test for joint significance:

F = \frac{(\text{RSS}_\text{restricted} - \text{RSS}_\text{unrestricted})/q}{\text{RSS}_\text{unrestricted}/(n-k-1)} \sim F_{q,\, n-k-1}

Where $q$ is the number of restrictions. Tests whether a group of coefficients are jointly zero.

Goodness of Fit

R² (coefficient of determination):

R^2 = 1 - \frac{\text{RSS}}{\text{TSS}} = 1 - \frac{\sum_i \hat{\varepsilon}_i^2}{\sum_i (y_i - \bar{y})^2}

R² measures the fraction of total variance in $y$ explained by the model. It never decreases when you add predictors — regardless of whether they’re useful.

Adjusted R² penalises for adding irrelevant predictors:

\bar{R}^2 = 1 - \frac{(1-R^2)(n-1)}{n-k-1}
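As a quick numerical check of the two formulas, here is a small sketch. The function name `r_squared` and the toy values are assumptions for illustration only.

```python
import numpy as np

def r_squared(y, y_hat, k):
    """R-squared and adjusted R-squared for a model with k predictors plus intercept."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    r2 = 1 - rss / tss
    adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    return r2, adj

# Toy values: observed y and fitted values from a one-predictor model.
y = np.array([1.1, 2.9, 4.9, 7.1])
y_hat = np.array([1.0, 3.0, 5.0, 7.0])
r2, adj_r2 = r_squared(y, y_hat, k=1)
```

Adjusted R² is always at or below R² when $k \geq 1$, and the gap widens as predictors are added without reducing RSS.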

AIC and BIC — information criteria for model selection:

\text{AIC} = 2k - 2\ln(\hat{L}), \quad \text{BIC} = k\ln(n) - 2\ln(\hat{L})

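Lower is better. Here $k$ counts the estimated parameters and $\hat{L}$ is the maximised likelihood. BIC penalises complexity more heavily than AIC whenever $\ln(n) > 2$ (that is, $n \geq 8$) and tends to select simpler models.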

Confidence Intervals

A 95% confidence interval for $\beta_j$ :

\hat{\beta}_j \pm t_{n-k-1,\, 0.025} \cdot \text{SE}(\hat{\beta}_j)

Interpretation: If you repeated the experiment many times and computed this interval each time, 95% of intervals would contain the true $\beta_j$. It is NOT “95% probability that the true parameter is in this interval” (that’s the Bayesian credible interval).

Part 3: Regression Diagnostics

Diagnostics test whether the Gauss-Markov assumptions hold. Violating them doesn’t always invalidate the regression — but it changes what you can conclude.

Residual Analysis

Always start with residual plots:

Residuals vs. fitted values — should be random scatter. Patterns indicate heteroscedasticity or non-linearity.
Q-Q plot — residuals vs. theoretical normal quantiles. Deviations at tails indicate non-normality (common in financial data).
Scale-location plot — $\sqrt{|\hat{\varepsilon}_i|}$ vs. fitted values. Increasing spread = heteroscedasticity.
Residuals vs. time — for time-ordered data. Patterns indicate autocorrelation.

Each pattern tells you something different and has a different fix.

Heteroscedasticity

When $\text{Var}(\varepsilon_i) = \sigma_i^2$ varies across observations, OLS is still unbiased but no longer efficient. Standard errors are wrong — t-tests and confidence intervals are invalid.

Detection:

Breusch-Pagan test — regress squared residuals on the predictors. Significant F-stat = heteroscedasticity.
White test — more general, includes squared terms and cross-products.

Fixes:

Heteroscedasticity-consistent (HC) standard errors (White standard errors) — correct the standard errors without changing $\hat{\boldsymbol{\beta}}$ .
Weighted Least Squares (WLS) — weight observations by the inverse of their error variance when the variance structure is known.
Generalised Least Squares (GLS) — the general fix when the error covariance structure $\Sigma$ is known: $\hat{\boldsymbol{\beta}}_\text{GLS} = (X^\top \Sigma^{-1} X)^{-1} X^\top \Sigma^{-1} \mathbf{y}$
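To make the first fix concrete, here is a sketch of the basic White (HC0) sandwich estimator in NumPy. The function name is hypothetical, and this is the simplest variant; statistical packages typically offer small-sample refinements (HC1 to HC3) that are not shown here.

```python
import numpy as np

def hc0_standard_errors(X, y):
    """OLS coefficients with White (HC0) heteroscedasticity-consistent standard errors."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    # "Meat" of the sandwich: X' diag(e_i^2) X, built without forming the diagonal matrix.
    meat = X.T @ (X * resid[:, None] ** 2)
    cov = XtX_inv @ meat @ XtX_inv
    return beta, np.sqrt(np.diag(cov))

# Toy data (assumption): residuals all have the same magnitude, 0.1.
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
y = np.array([1.1, 2.9, 4.9, 7.1])
beta, se_hc0 = hc0_standard_errors(X, y)
```

The coefficients are identical to plain OLS; only the covariance matrix changes.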

Autocorrelation

When errors are correlated across time ($\text{Cov}(\varepsilon_t, \varepsilon_{t-s}) \neq 0$), the usual OLS standard errors are invalid. Under positive autocorrelation, the common case, they are too small, so you over-reject the null.

Detection:

Durbin-Watson statistic — tests for first-order autocorrelation. DW ≈ 2 means no autocorrelation; DW < 2 means positive autocorrelation (very common in financial time series).
Ljung-Box Q-test — tests for autocorrelation at multiple lags simultaneously.
ACF/PACF plots of residuals — visual inspection of autocorrelation structure.
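The Durbin-Watson statistic is simple enough to compute by hand. A minimal sketch, using the standard definition $\text{DW} = \sum_t (\hat{\varepsilon}_t - \hat{\varepsilon}_{t-1})^2 / \sum_t \hat{\varepsilon}_t^2$, which is approximately $2(1 - \hat{\rho}_1)$; the residual vectors are toy inputs.

```python
import numpy as np

def durbin_watson(resid):
    """Durbin-Watson statistic: sum of squared first differences over sum of squares."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

dw_alternating = durbin_watson([1, -1, 1, -1])   # strong negative autocorrelation
dw_persistent = durbin_watson([1, 1, 1, 1])      # extreme positive autocorrelation
```

Values well above 2 point to negative autocorrelation, values near 0 to strong positive autocorrelation.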

Fixes:

Newey-West standard errors (HAC — heteroscedasticity and autocorrelation consistent)
Explicitly model the autocorrelation structure (ARIMA residuals)
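Estimate the AR(1) coefficient of the errors and quasi-difference the data (Cochrane-Orcutt, a feasible GLS procedure), or include a lagged dependent variable as a predictor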

Multicollinearity

When predictors are highly correlated, $(X^\top X)^{-1}$ becomes unstable. Coefficients have large standard errors and can take unexpected signs, so individual coefficients can’t be trusted even when the overall fit is good.

Detection:

Variance Inflation Factor (VIF): $\text{VIF}_j = \frac{1}{1 - R_j^2}$ where $R_j^2$ is the R² from regressing $x_j$ on all other predictors. VIF > 10 (or > 5 conservatively) indicates a problem.
Condition number of $X^\top X$ — above 30 indicates moderate, above 100 indicates severe multicollinearity.
Correlation matrix — pairwise correlations above 0.8 are a warning sign.
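The VIF definition above translates directly into a loop of auxiliary regressions. This is a sketch with a hypothetical helper `vif`; it expects the predictor columns only and adds the intercept itself.

```python
import numpy as np

def vif(X):
    """Variance inflation factor per column of X (predictors only, no intercept column)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        # Regress x_j on an intercept plus all other predictors.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ coef
        tss = np.sum((X[:, j] - X[:, j].mean()) ** 2)
        r2 = 1 - resid @ resid / tss
        out[j] = 1 / (1 - r2)
    return out

# Orthogonal predictors (toy data): VIF should be 1.
vif_orthogonal = vif(np.column_stack([[1, -1, 1, -1], [1, 1, -1, -1]]))
# Nearly collinear predictors (toy data): VIF far above 10.
vif_collinear = vif(np.column_stack([[1, 2, 3, 4, 5], [1.1, 1.9, 3.2, 3.8, 5.0]]))
```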

Fixes: Ridge regression (shrinks coefficients), PCA regression (transforms to orthogonal predictors), removing one of the collinear variables.

Influential Observations

Leverage measures how far an observation’s $x$ -values are from the mean. High-leverage points have outsized influence on the regression line regardless of their $y$ -value.

h_{ii} = [X(X^\top X)^{-1}X^\top]_{ii}

Cook’s Distance combines leverage and residual size into a single influence measure:

D_i = \frac{\hat{\varepsilon}_i^2}{(k+1)\hat{\sigma}^2} \cdot \frac{h_{ii}}{(1-h_{ii})^2}

$D_i > 1$ is a common threshold for “influential.” Examine these observations: data errors, legitimate outliers, or regime changes.
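Leverage and Cook's distance can both be read off the hat matrix. A sketch under the assumption that `X` already contains the intercept column (so it has $k + 1$ columns); the five-point dataset is made up, with one point far out in $x$ and off the line.

```python
import numpy as np

def cooks_distance(X, y):
    """Return leverages h_ii and Cook's distances D_i for an OLS fit of y on X."""
    n, p = X.shape                              # p = k + 1
    H = X @ np.linalg.inv(X.T @ X) @ X.T        # hat matrix
    h = np.diag(H)
    resid = y - H @ y
    sigma2 = resid @ resid / (n - p)
    D = resid ** 2 / (p * sigma2) * h / (1 - h) ** 2
    return h, D

# Toy data: the last observation has extreme x and breaks the linear pattern.
x = np.array([0.0, 1.0, 2.0, 3.0, 10.0])
X = np.column_stack([np.ones_like(x), x])
y = np.array([0.0, 1.0, 2.0, 3.0, 0.0])
h, D = cooks_distance(X, y)
```

The leverages sum to the number of coefficients (the trace of the hat matrix), and the last observation dominates both `h` and `D`.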

Diagnostic Decision Tree

flowchart TD
    Fit[Fit OLS] --> RvF[Plot residuals vs fitted]
    RvF --> Q1{Fan / funnel shape?}
    Q1 -->|Yes| A1[Heteroscedasticity\nFix: HC errors or WLS]
    Q1 -->|No| Q2{Curved or U-shape?}
    Q2 -->|Yes| A2[Non-linearity\nFix: polynomial or log transform]
    Q2 -->|No| Q3{Wave or drift over time?}
    Q3 -->|Yes| A3[Autocorrelation\nFix: HAC errors or ARIMA residuals]
    Q3 -->|No| VIF[Check VIF for each predictor]
    VIF --> Q4{Any VIF above 10?}
    Q4 -->|Yes| A4[Multicollinearity\nFix: Ridge regression or drop variable]
    Q4 -->|No| Cook[Compute Cook's D]
    Cook --> Q5{Any D above 1?}
    Q5 -->|Yes| A5[Influential observation\nFix: investigate or robust regression]
    Q5 -->|No| Done[OLS assumptions satisfied]

Part 4: Variable Transformations

Transformations serve two purposes: fixing violated OLS assumptions (non-normality, heteroscedasticity, non-linearity) and changing how coefficients are interpreted. Applying the wrong transformation — or not applying one when needed — is a common source of misleading models.

Dependent Variable Transformations

Log transformation: $\ln(y)$

Use when $y$ is strictly positive, right-skewed, or when variance grows with the mean (common in income, prices, exposure).

\ln(y_i) = \beta_0 + \beta_1 x_i + \varepsilon_i

Interpretation: A one-unit increase in $x$ multiplies $y$ by $e^{\beta_1}$, or equivalently, changes $y$ by approximately $100 \cdot \beta_1\%$ (accurate for small $\beta_1$; the exact change is $100(e^{\beta_1} - 1)\%$).

When to use: residuals fan outward (heteroscedasticity), $y$ is a monetary amount or count that can’t be negative.

Watch out: $\ln(0)$ is undefined — you need $y > 0$ . A common fix is $\ln(y + 1)$ or $\ln(y + c)$ for small $c$ .

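Retransforming to the original scale: the naive prediction $\hat{y} = e^{\hat{\mu}}$, where $\hat{\mu}$ is the fitted value of $\ln y$, is biased downward because $\mathbb{E}[e^{\varepsilon}] > 1$ for a non-degenerate zero-mean error. The smearing estimator (Duan, 1983) corrects this using the log-scale residuals $\hat{\varepsilon}_i$: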

\hat{y} = e^{\hat{\mu}} \cdot \frac{1}{n} \sum_{i=1}^n e^{\hat{\varepsilon}_i}
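A short sketch of the smearing correction. The function name and inputs are illustrative: `log_fitted` holds fitted values on the log scale and `log_resid` the residuals from the log-scale regression.

```python
import numpy as np

def smearing_predict(log_fitted, log_resid):
    """Duan's smearing estimator: naive back-transform times mean exponentiated residual."""
    smear = np.mean(np.exp(log_resid))
    return np.exp(log_fitted) * smear

# Toy residuals ln(2) and -ln(2): they average to zero on the log scale,
# but their exponentials (2 and 0.5) average to 1.25.
log_resid = np.array([np.log(2.0), -np.log(2.0)])
prediction = smearing_predict(np.array([0.0]), log_resid)
```

The naive back-transform would give 1.0 here; the smearing factor of 1.25 is the upward correction.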

Square-root transformation: $\sqrt{y}$

Gentler than log — useful for count data or moderate right skew. Variance-stabilising for Poisson-distributed outcomes (where $\text{Var}(y) \propto \mu$ ).

Box-Cox transformation

A family of power transformations parameterised by $\lambda$ :

y^{(\lambda)} = \begin{cases} \dfrac{y^\lambda - 1}{\lambda} & \lambda \neq 0 \\ \ln(y) & \lambda = 0 \end{cases}

Special cases: $\lambda = 1$ is no transformation, $\lambda = 0$ is log, $\lambda = 0.5$ is square root, $\lambda = -1$ is reciprocal.

Estimate $\lambda$ by maximum likelihood — the optimal $\lambda$ maximises the log-likelihood of the transformed residuals being normal. In Python: scipy.stats.boxcox(y).

Limitation: Requires $y > 0$ . Yeo-Johnson extends this to allow $y \leq 0$ .

Logit transformation: $\ln\left(\frac{y}{1-y}\right)$

For bounded outcomes $y \in (0, 1)$ such as rates or proportions. Expands the bounded range to $(-\infty, +\infty)$ , making OLS applicable. The fractional logit model is an alternative that avoids retransformation.

Independent Variable Transformations

Log transformation: $\ln(x)$

Use when the relationship between $x$ and $y$ is concave (diminishing returns) or when $x$ spans several orders of magnitude (income, asset size, market cap).

Interpretation depends on the model form:

Model	Equation	Interpretation of $\beta_1$
Linear-linear	$y = \beta_0 + \beta_1 x$	$\Delta y = \beta_1$ per unit increase in $x$
Log-linear	$\ln y = \beta_0 + \beta_1 x$	$\approx 100\beta_1\%$ change in $y$ per unit increase in $x$
Linear-log	$y = \beta_0 + \beta_1 \ln x$	$\Delta y = \beta_1 / 100$ per 1% increase in $x$
Log-log	$\ln y = \beta_0 + \beta_1 \ln x$	$\beta_1$ is the elasticity: 1% increase in $x$ → $\beta_1\%$ change in $y$

The log-log form is widely used in economics and finance because elasticities are unit-free and directly comparable across variables.

Polynomial features

Include $x^2$ , $x^3$ to capture non-linear relationships while staying within the OLS framework:

y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon

The marginal effect is $\partial y / \partial x = \beta_1 + 2\beta_2 x$ — it varies with $x$ . The turning point is at $x^* = -\beta_1 / (2\beta_2)$ .

Caution: High-degree polynomials overfit at the extremes. Splines or piecewise linear functions are more robust.

Standardisation vs normalisation

Standardisation (z-score): $x' = (x - \mu) / \sigma$ . After standardising, $\beta_j$ is interpreted as the effect of a one-standard-deviation increase in $x_j$ . Makes coefficients directly comparable across predictors with different units. Required before applying Ridge or Lasso (otherwise the penalty treats variables unequally).

Min-max normalisation: $x' = (x - x_{\min}) / (x_{\max} - x_{\min})$ . Scales to $[0, 1]$ . Sensitive to outliers. Rarely used for regression; more common in ML preprocessing.

Standardisation does not change $R^2$ , $t$ -statistics, or $p$ -values — only the scale of $\hat{\beta}$ .

Categorical Variables: Dummy Encoding

A categorical variable with $k$ levels is encoded as $k - 1$ binary (0/1) dummy variables. The omitted level is the reference category — all coefficients are interpreted relative to it.

Example: Credit rating with levels AAA, AA, A, BBB → create dummies for AA, A, BBB; AAA is the reference.

y = \beta_0 + \beta_1 D_{\text{AA}} + \beta_2 D_{\text{A}} + \beta_3 D_{\text{BBB}} + \gamma x + \varepsilon

$\beta_1$ is the average difference in $y$ between AA and AAA, holding $x$ constant.

Dummy variable trap: Including all $k$ dummies creates perfect multicollinearity with the intercept (they sum to 1). Always use $k - 1$ dummies. Formula-based interfaces usually drop one level automatically, but check this when constructing features manually.

Ordered categories: For ordinal variables (e.g., credit grades with a natural ranking), a single integer encoding can be appropriate if the spacing is roughly equal. Dummy encoding is safer when spacing is unequal.

Interaction Terms

An interaction term $x_1 \cdot x_2$ allows the effect of $x_1$ to depend on the level of $x_2$ :

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 (x_1 \cdot x_2) + \varepsilon

The marginal effect of $x_1$ is $\partial y / \partial x_1 = \beta_1 + \beta_3 x_2$ — it varies with $x_2$ .

When to include interactions:

Theory suggests the effect of one variable depends on another (e.g., income × age in credit scoring)
Residual plots show systematic patterns that disappear after adding the interaction
You want to test whether a relationship differs across groups (equivalent to separate slopes)

Hierarchy principle: If you include an interaction $x_1 x_2$ , include the main effects $x_1$ and $x_2$ too, even if their main-effect coefficients are insignificant. Omitting them changes the interpretation of the interaction.

Choosing the Right Transformation

Symptom	Likely fix
Right-skewed $y$ , variance grows with mean	$\ln(y)$
$y$ is a count (Poisson)	$\sqrt{y}$ or Poisson regression
$y$ is a proportion in $(0,1)$	Logit $(y)$ or fractional logit
$y$ is continuous, optimal $\lambda$ unknown	Box-Cox
Residuals show a curve (concave/convex)	$\ln(x)$ or add $x^2$
$x$ spans orders of magnitude	$\ln(x)$
Predictors on different scales (for regularisation)	Standardise all $x$
Non-linear group differences	Interaction terms

Part 5: Regularised Regression

When predictors are numerous or collinear, OLS over-fits. Regularisation adds a penalty term to the loss function, shrinking coefficients toward zero.

Ridge Regression (L2)

\hat{\boldsymbol{\beta}}_\text{ridge} = \arg\min_{\boldsymbol{\beta}} \left\{ \sum_i (y_i - \mathbf{x}_i^\top \boldsymbol{\beta})^2 + \lambda \sum_j \beta_j^2 \right\}

Closed-form solution:

\hat{\boldsymbol{\beta}}_\text{ridge} = (X^\top X + \lambda I)^{-1} X^\top \mathbf{y}

Adding $\lambda I$ makes the matrix invertible even under perfect multicollinearity. Ridge shrinks all coefficients toward zero but never exactly to zero — it does not perform variable selection. Choose $\lambda$ via cross-validation.
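A sketch of the closed form on a deliberately degenerate example: two identical columns, so $X^\top X$ is singular and OLS has no unique solution. The sketch assumes centred data with no intercept column, so every coefficient is penalised; the helper name `ridge` is hypothetical.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge estimate (X'X + lam*I)^-1 X'y; penalises every column."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Perfectly collinear toy design: both columns are identical.
X = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
rank_xtx = np.linalg.matrix_rank(X.T @ X)   # 1, so X'X is not invertible
beta_ridge = ridge(X, y, lam=1.0)
```

Ridge splits the effect evenly between the two duplicated columns, giving $28/29$ for each coefficient.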

Lasso Regression (L1)

\hat{\boldsymbol{\beta}}_\text{lasso} = \arg\min_{\boldsymbol{\beta}} \left\{ \sum_i (y_i - \mathbf{x}_i^\top \boldsymbol{\beta})^2 + \lambda \sum_j |\beta_j| \right\}

The L1 penalty produces sparse solutions — it drives some coefficients exactly to zero. Lasso does automatic variable selection. No closed-form solution (solved with coordinate descent or LARS algorithm).
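Coordinate descent for the objective above reduces to repeated soft-thresholding. This is a bare-bones sketch with no intercept, no standardisation, and a fixed iteration count, all of which are simplifying assumptions. Because the objective here is $\text{RSS} + \lambda \sum_j |\beta_j|$ with no factor of one half, the threshold is $\lambda / 2$.

```python
import numpy as np

def soft_threshold(rho, t):
    """Shrink rho toward zero by t; return exactly zero if |rho| <= t."""
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Lasso via cyclic coordinate descent for RSS + lam * sum(|beta_j|)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_ss = np.sum(X ** 2, axis=0)             # x_j' x_j for each column
    for _ in range(n_iter):
        for j in range(p):
            # Residual with feature j's current contribution added back.
            partial_resid = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ partial_resid
            beta[j] = soft_threshold(rho, lam / 2) / col_ss[j]
    return beta

# Orthonormal toy design: each coefficient is just the soft-thresholded y_j.
beta_lasso = lasso_cd(np.eye(2), np.array([3.0, 0.4]), lam=2.0)
```

The large signal is shrunk from 3 to 2 and the small one is set exactly to zero, which is the sparsity property described above.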

Elastic Net

Combines L1 and L2:

\hat{\boldsymbol{\beta}}_\text{enet} = \arg\min_{\boldsymbol{\beta}} \left\{ \text{RSS} + \lambda_1 \sum_j |\beta_j| + \lambda_2 \sum_j \beta_j^2 \right\}

Useful when predictors number in the thousands (genomics, factor zoo in finance) — Lasso tends to select only one variable from a correlated group; Elastic Net can include all of them with reduced coefficients.

Method	Penalty	Selects variables?	Handles multicollinearity?
OLS	None	No	No
Ridge	$\lambda \|\boldsymbol{\beta}\|_2^2$	No	Yes
Lasso	$\lambda \|\boldsymbol{\beta}\|_1$	Yes	Partially
Elastic Net	Both	Yes	Yes

As $\lambda$ increases from 0, Ridge shrinks all coefficients smoothly toward (but never to) zero, while Lasso drives coefficients to exactly zero at different thresholds, which is what gives it automatic variable selection.

Quantile Regression

Standard OLS estimates the conditional mean of $y$ given $X$ . Quantile regression estimates any conditional quantile — the median, the 5th percentile, the 95th percentile.

Minimises the asymmetric loss function (pinball loss):

\hat{\boldsymbol{\beta}}_\tau = \arg\min_{\boldsymbol{\beta}} \sum_i \rho_\tau(y_i - \mathbf{x}_i^\top \boldsymbol{\beta})

Where $\rho_\tau(u) = u(\tau - \mathbf{1}[u < 0])$ and $\tau \in (0,1)$ is the quantile.
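To see why the pinball loss targets a quantile, consider the simplest model, a constant. The sketch below searches over the observed values for the constant that minimises the loss; the helper names and the data (the integers 1 to 99) are illustrative.

```python
import numpy as np

def pinball_loss(u, tau):
    """Total pinball loss: sum of u * (tau - 1[u < 0])."""
    u = np.asarray(u, dtype=float)
    return np.sum(u * (tau - (u < 0).astype(float)))

def best_constant(y, tau):
    """Observed value c that minimises the pinball loss of y - c."""
    losses = [pinball_loss(y - c, tau) for c in y]
    return y[int(np.argmin(losses))]

y = np.arange(1, 100)            # toy sample: 1, 2, ..., 99
median = best_constant(y, 0.5)
q05 = best_constant(y, 0.05)
```

With $\tau = 0.5$ the minimiser is the sample median (50); with $\tau = 0.05$ it is the 5th-percentile observation (5). Quantile regression replaces the constant with $\mathbf{x}_i^\top \boldsymbol{\beta}$.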

Why it matters in finance: Asset returns have fat tails. OLS ignores tail behaviour. Quantile regression at $\tau = 0.05$ directly models Value-at-Risk; at $\tau = 0.95$ models upside potential. No normality assumption required.

Part 6: Panel Data and Fixed Effects

Panel data has both a cross-sectional dimension ( $i$ , e.g., stocks) and a time dimension ( $t$ ). Standard OLS ignores the panel structure.

The panel model:

y_{it} = \mathbf{x}_{it}^\top \boldsymbol{\beta} + \alpha_i + \varepsilon_{it}

Where $\alpha_i$ is an individual fixed effect — a time-invariant, unit-specific unobservable (e.g., a company’s management quality).

Fixed Effects (Within) Estimator

Demean each variable within its unit:

\tilde{y}_{it} = y_{it} - \bar{y}_i, \quad \tilde{\mathbf{x}}_{it} = \mathbf{x}_{it} - \bar{\mathbf{x}}_i

Then regress $\tilde{y}$ on $\tilde{\mathbf{x}}$ . This eliminates $\alpha_i$ entirely — fixed effects are controlled for regardless of whether they’re correlated with $\mathbf{x}_{it}$ (no endogeneity from time-invariant confounders).
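A single-regressor sketch of the within estimator. The function name and the two-unit panel are made up; the unit effects are deliberately correlated with $x$ so that pooled OLS is badly biased while the within estimator recovers the true slope of 2.

```python
import numpy as np

def within_estimator(y, x, unit):
    """Fixed-effects slope for one regressor: demean within each unit, then OLS."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    unit = np.asarray(unit)
    y_t, x_t = y.copy(), x.copy()
    for u in np.unique(unit):
        mask = unit == u
        y_t[mask] -= y[mask].mean()
        x_t[mask] -= x[mask].mean()
    return (x_t @ y_t) / (x_t @ x_t)

# Toy panel: unit 1 has both larger x and a larger fixed effect (alpha = 100).
unit = np.array([0, 0, 0, 1, 1, 1])
x = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])
alpha = np.array([0.0, 0.0, 0.0, 100.0, 100.0, 100.0])
y = 2.0 * x + alpha

beta_fe = within_estimator(y, x, unit)
beta_pooled = np.polyfit(x, y, 1)[0]     # pooled OLS slope, ignoring the panel structure
```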

Random Effects

Assumes $\alpha_i \sim \mathcal{N}(0, \sigma_\alpha^2)$ and $\text{Cov}(\alpha_i, \mathbf{x}_{it}) = 0$ . More efficient than fixed effects when the assumption holds, but biased when it doesn’t.

Hausman test — tests whether random effects is consistent (i.e., whether $\alpha_i$ is uncorrelated with $\mathbf{x}_{it}$ ). Significant → use fixed effects. Not significant → random effects is valid and more efficient.

Part 7: Time Series

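OLS assumes independent observations. Financial time series violate this: prices are strongly autocorrelated, and returns, while only weakly autocorrelated in levels, show persistent dependence in their volatility. Time series methods model the temporal dependence explicitly.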

Stationarity

A time series $\{y_t\}$ is weakly stationary if:

$\mathbb{E}[y_t] = \mu$ (constant mean)
$\text{Var}(y_t) = \sigma^2$ (constant variance)
$\text{Cov}(y_t, y_{t-s})$ depends only on $s$ , not on $t$

Non-stationary series (trending prices, unit root processes) produce spurious regressions — high R² and significant t-stats between unrelated variables.

Testing for stationarity:

ADF (Augmented Dickey-Fuller) test — $H_0$ : unit root (non-stationary). Reject = stationary.
KPSS test — $H_0$ : stationary. Reject = non-stationary.
Run both: if ADF rejects and KPSS doesn’t reject, strong evidence of stationarity.

Transformations to achieve stationarity:

First-difference: $\Delta y_t = y_t - y_{t-1}$ (removes trend)
Log transformation: stabilises variance
Log-difference: $\ln(y_t/y_{t-1})$ — log returns in finance, typically stationary

ACF and PACF

Autocorrelation function (ACF): $\rho(s) = \text{Corr}(y_t, y_{t-s})$ — correlation between the series and its $s$ -period lag. Decays slowly for AR processes, cuts off sharply for MA processes.

Partial autocorrelation function (PACF): correlation between $y_t$ and $y_{t-s}$ after removing the effects of $y_{t-1}, \ldots, y_{t-s+1}$ . Cuts off sharply for AR processes, decays slowly for MA.

Use ACF/PACF plots to identify model order before fitting ARIMA.

For an AR(1) process with $\phi = 0.7$, the ACF decays geometrically (it never cuts off), while the PACF has a single spike at lag 1 and then drops to zero. This is the diagnostic signature of a pure AR(1). The general patterns are:

Model	ACF	PACF
AR(p)	Geometric decay	Cuts off at lag p
MA(q)	Cuts off at lag q	Geometric decay
ARMA(p,q)	Decays gradually	Decays gradually
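The geometric decay of the AR(1) autocorrelation function can be checked by simulation. This sketch assumes a zero-mean AR(1) with $\phi = 0.7$ and standard normal shocks; the seed, sample size, and helper name `sample_acf` are arbitrary choices.

```python
import numpy as np

def sample_acf(y, max_lag):
    """Sample autocorrelations at lags 0..max_lag."""
    y = np.asarray(y, dtype=float) - np.mean(y)
    denom = y @ y
    return np.array([1.0] + [(y[s:] @ y[:-s]) / denom for s in range(1, max_lag + 1)])

rng = np.random.default_rng(42)
phi, n = 0.7, 20000
eps = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = phi * y[t - 1] + eps[t]       # AR(1) recursion

acf = sample_acf(y, 3)
theoretical = phi ** np.arange(4)        # 1, 0.7, 0.49, 0.343
```

The sample values should sit close to $\phi^s$, with sampling noise shrinking as $n$ grows.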

ARIMA

AR(p) — Autoregressive:

y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \varepsilon_t

Current value is a linear function of $p$ past values. ACF decays geometrically; PACF cuts off at lag $p$ .

MA(q) — Moving Average:

y_t = \mu + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q}

Current value is a linear combination of $q$ past shocks. ACF cuts off at lag $q$ ; PACF decays geometrically.

ARIMA(p, d, q): Apply AR( $p$ ) and MA( $q$ ) to the $d$ -times differenced series. $d=1$ handles a linear trend; $d=2$ handles a quadratic trend.

Model selection:

flowchart TD
    Raw[Raw time series] --> ADF{ADF + KPSS test\nStationary?}
    ADF -->|No| Diff[First-difference the series\nRepeat until stationary]
    Diff --> ADF
    ADF -->|Yes| Plots[Plot ACF and PACF\nof stationary series]
    Plots --> AR{PACF cuts off\nat lag p?}
    Plots --> MA{ACF cuts off\nat lag q?}
    AR -->|Yes| ARm[Include AR terms]
    MA -->|Yes| MAm[Include MA terms]
    ARm --> Fit[Fit ARIMA candidates]
    MAm --> Fit
    Fit --> IC[Compare AIC and BIC]
    IC --> Diag[Ljung-Box Q-test\non residuals]
    Diag -->|Autocorrelation remains| Fit
    Diag -->|White noise| Done[Final model selected]

GARCH (Volatility Modelling)

Financial return series exhibit volatility clustering — large moves follow large moves. ARIMA models the conditional mean; GARCH models the conditional variance.

GARCH(1,1):

\sigma_t^2 = \omega + \alpha \varepsilon_{t-1}^2 + \beta \sigma_{t-1}^2

Where $\sigma_t^2$ is today’s variance, $\varepsilon_{t-1}^2$ is yesterday’s squared shock (ARCH term), and $\sigma_{t-1}^2$ is yesterday’s variance (GARCH term). Stationarity requires $\alpha + \beta < 1$ .
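The recursion is easy to trace by hand. The sketch below filters a given shock series through GARCH(1,1), starting at the unconditional variance $\omega / (1 - \alpha - \beta)$; the parameter values are illustrative, not estimates, and fitting them would require maximum likelihood, which is not shown.

```python
import numpy as np

def garch11_variance(eps, omega, alpha, beta):
    """Conditional variance path for GARCH(1,1) given a series of shocks eps."""
    eps = np.asarray(eps, dtype=float)
    sigma2 = np.empty(len(eps))
    sigma2[0] = omega / (1 - alpha - beta)      # start at the unconditional variance
    for t in range(1, len(eps)):
        sigma2[t] = omega + alpha * eps[t - 1] ** 2 + beta * sigma2[t - 1]
    return sigma2

# Illustrative parameters: unconditional variance = 0.1 / (1 - 0.1 - 0.8) = 1.
path = garch11_variance([0.0, 3.0, 0.0, 0.0], omega=0.1, alpha=0.1, beta=0.8)
```

The shock of 3 at the second step lifts the next period's variance to 1.72, after which it decays back toward 1 at rate $\alpha + \beta$ in expectation. That persistence is the volatility clustering described above.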

GARCH-family models are widely used for VaR forecasting, for volatility dynamics in option pricing, and in risk management more broadly.

VAR (Vector Autoregression)

Extends AR to multiple time series, each equation regressing on lags of all variables:

\mathbf{y}_t = \mathbf{c} + A_1 \mathbf{y}_{t-1} + A_2 \mathbf{y}_{t-2} + \cdots + A_p \mathbf{y}_{t-p} + \boldsymbol{\varepsilon}_t

Useful for modelling interdependencies between variables (e.g., macro factors). Key tools:

Granger causality: does $x$ help predict $y$ beyond $y$'s own history?
Impulse response functions (IRF): trace the effect of a shock in one variable through the system over time
Forecast error variance decomposition (FEVD): what fraction of variable $i$'s forecast error variance is attributable to shocks from variable $j$?

Cointegration

Two non-stationary series $y_t$ and $x_t$ are cointegrated if there exists a linear combination $y_t - \gamma x_t$ that is stationary — they share a common stochastic trend and move together in the long run.

Engle-Granger test: Regress $y_t$ on $x_t$ ; test residuals for stationarity. If stationary, the series are cointegrated with cointegrating vector $(1, -\hat{\gamma})$ .

Error Correction Model (ECM): When series are cointegrated, model short-run dynamics and long-run equilibrium together:

\Delta y_t = \alpha_0 + \alpha_1 (y_{t-1} - \gamma x_{t-1}) + \beta \Delta x_{t-1} + \varepsilon_t

The term $(y_{t-1} - \gamma x_{t-1})$ is the error correction term: it measures how far the system deviated from long-run equilibrium last period. The coefficient $\alpha_1$ sets the speed of mean reversion and should be negative, so that a positive deviation is followed by a downward correction in $y$.

Applications: pairs trading (equity or fixed income), purchasing power parity, yield curve dynamics.

Part 8: Principal Component Analysis (PCA)

PCA finds directions of maximum variance in high-dimensional data. It’s used for dimensionality reduction, factor construction, and dealing with multicollinearity.

The Math

Given a centred data matrix $X \in \mathbb{R}^{n \times p}$ (zero mean columns), compute the sample covariance matrix:

S = \frac{1}{n-1} X^\top X

Decompose via eigendecomposition:

S = V \Lambda V^\top

Where $V$ contains eigenvectors (principal components) and $\Lambda = \text{diag}(\lambda_1, \ldots, \lambda_p)$ contains eigenvalues in decreasing order. Equivalently, via SVD of $X$: $X = U D V^\top$, with $\lambda_i = d_i^2 / (n-1)$ for singular values $d_i$.

Project onto the first $k$ components:

Z = X V_k \in \mathbb{R}^{n \times k}

The fraction of variance explained by the first $k$ components is $\sum_{i=1}^k \lambda_i / \sum_{i=1}^p \lambda_i$ .
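The steps above can be sketched in NumPy, including the equivalence between the eigendecomposition and the SVD route. The simulated data and mixing matrix are arbitrary assumptions used only to produce correlated columns.

```python
import numpy as np

rng = np.random.default_rng(1)
mixing = np.array([[3.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 0.2]])
X = rng.normal(size=(500, 3)) @ mixing       # correlated toy data
X = X - X.mean(axis=0)                       # centre the columns
n = X.shape[0]

S = X.T @ X / (n - 1)                        # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)         # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]            # reorder to decreasing
eigvals, V = eigvals[order], eigvecs[:, order]

# SVD route: squared singular values over (n - 1) equal the eigenvalues.
U, d, Vt = np.linalg.svd(X, full_matrices=False)
svd_eigvals = d ** 2 / (n - 1)

k = 2
Z = X @ V[:, :k]                             # scores on the first k components
explained = eigvals[:k].sum() / eigvals.sum()
```

The columns of `Z` are uncorrelated, with variances equal to the leading eigenvalues.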

Interpretation in Finance

The first principal component of a set of stock returns often approximates the market factor. The second and third components often capture sector or style effects. PCA on the yield curve typically extracts:

PC1 — level (parallel shift, ~90% of variance)
PC2 — slope (short vs. long rates)
PC3 — curvature (butterfly)

PCA Regression

When predictors are collinear, regress on the first $k$ principal components instead of the original variables. Eliminates multicollinearity by construction (PCs are orthogonal). Trade-off: PCs may lack intuitive interpretation.

Part 9: Factor Models

Factor models decompose returns into systematic and idiosyncratic components:

r_i = \alpha_i + \beta_{i1} F_1 + \beta_{i2} F_2 + \cdots + \beta_{ik} F_k + \varepsilon_i

Where $F_j$ are common factors, $\beta_{ij}$ are factor loadings, and $\varepsilon_i$ is idiosyncratic risk.

CAPM (Single Factor)

r_i - r_f = \alpha_i + \beta_i (r_m - r_f) + \varepsilon_i

$\beta_i$ measures systematic (market) risk. $\alpha_i$ is Jensen’s alpha — excess return above what CAPM predicts. Estimated by OLS regression of excess returns on excess market returns.

Fama-French Three-Factor Model

r_i - r_f = \alpha_i + \beta_1 \text{MKT} + \beta_2 \text{SMB} + \beta_3 \text{HML} + \varepsilon_i

Where SMB (Small Minus Big) captures the size premium and HML (High Minus Low) captures the value premium. The Carhart four-factor model adds MOM (momentum). Fama-French five-factor adds RMW (profitability) and CMA (investment).

Barra-Style Risk Models

Multi-factor risk models used by risk management:

Style factors: value, momentum, quality, size, low volatility
Industry factors: GICS sector exposures
Country/currency factors: for global portfolios

The asset return covariance matrix $\Sigma$ splits into a factor part and an idiosyncratic part, which decomposes portfolio risk:

\text{Var}(\mathbf{r}_p) = \mathbf{w}^\top \Sigma \mathbf{w} = \mathbf{w}^\top (B \Sigma_F B^\top + D) \mathbf{w}

Where $B$ is the factor exposure matrix, $\Sigma_F$ is the factor covariance matrix, and $D$ is the diagonal idiosyncratic variance matrix.
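A small numerical sketch of the decomposition for three assets and two factors. All exposures, covariances, and weights are invented for illustration.

```python
import numpy as np

B = np.array([[1.0, 0.5],
              [0.8, -0.2],
              [1.2, 0.0]])                   # factor exposures: 3 assets x 2 factors
Sigma_F = np.array([[0.04, 0.01],
                    [0.01, 0.02]])           # factor covariance matrix
D = np.diag([0.01, 0.02, 0.015])             # idiosyncratic variances
w = np.array([0.5, 0.3, 0.2])                # portfolio weights

Sigma = B @ Sigma_F @ B.T + D                # implied asset covariance matrix
total_var = w @ Sigma @ w

b_p = B.T @ w                                # portfolio-level factor exposures
factor_var = b_p @ Sigma_F @ b_p             # systematic part
idio_var = w @ D @ w                         # idiosyncratic part
```

Total variance splits exactly into `factor_var + idio_var`, which is what makes risk attribution by factor possible.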

Part 10: Statistical Testing Framework

Hypothesis Testing

State $H_0$ (null) and $H_1$ (alternative)
Choose a test statistic and its null distribution
Compute the p-value: probability of observing a test statistic at least as extreme as the one computed, given $H_0$ is true
Compare p-value to significance level $\alpha$ (typically 0.05)

Type I error (false positive): Rejecting $H_0$ when it is true. Probability = $\alpha$.
Type II error (false negative): Failing to reject $H_0$ when it is false. Probability = $\beta$.
Power = $1 - \beta$: the probability of correctly rejecting a false null.

The p-value is not the probability that $H_0$ is true. It is the probability of the data (or more extreme) given $H_0$ .

Multiple Testing

When testing $m$ independent hypotheses simultaneously, each at level $\alpha$, the probability of at least one false positive grows rapidly with $m$:

P(\text{at least one false positive}) = 1 - (1-\alpha)^m

For $m=20$ tests at $\alpha=0.05$ : $1 - 0.95^{20} \approx 64\%$ chance of a false positive.

Bonferroni correction — divide $\alpha$ by $m$ : test each hypothesis at $\alpha/m$ . Conservative (controls family-wise error rate).

Benjamini-Hochberg (FDR) — controls the false discovery rate (expected proportion of false positives among rejections). Less conservative than Bonferroni; preferred when testing many hypotheses:

Order p-values: $p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(m)}$
Find the largest $k$ such that $p_{(k)} \leq \frac{k}{m} \alpha$
Reject all hypotheses with $p \leq p_{(k)}$
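The two corrections can be compared side by side in plain Python. The function names and the ten p-values are made up for illustration.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0_i when p_i <= alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up BH: find the largest rank k with p_(k) <= k/m * alpha, reject the k smallest."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected

# Family-wise error rate for 20 independent tests at alpha = 0.05.
fwer_20 = 1 - (1 - 0.05) ** 20

p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
n_bonf = sum(bonferroni(p_vals))
n_bh = sum(benjamini_hochberg(p_vals))
```

On these p-values Bonferroni rejects one hypothesis and Benjamini-Hochberg rejects two, showing the extra power bought by controlling the FDR rather than the family-wise error rate.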

In finance this matters enormously: Harvey, Liu & Zhu (2016) argued that most published factor discoveries are likely to fail multiple testing corrections.

Key Tests Reference

Test	Null hypothesis	Use when
t-test (one sample)	$\mu = \mu_0$	Testing if mean return differs from zero
t-test (two sample)	$\mu_1 = \mu_2$	Comparing means of two groups
F-test	$\beta_{j} = \cdots = \beta_k = 0$	Joint significance of predictors
Jarque-Bera	Normality ( $\text{skew}=0$ , $\text{kurt}=3$ )	Testing normality of returns
Breusch-Pagan	Homoscedasticity	Testing for heteroscedasticity
Durbin-Watson	No first-order autocorrelation	Time series residual checking
Ljung-Box	No autocorrelation up to lag $h$	Residual diagnostics
ADF	Unit root (non-stationary)	Pre-testing time series
KPSS	Stationarity	Pre-testing time series
Hausman	RE consistent ( $\alpha_i \perp X$ )	Fixed vs. random effects choice
Granger causality	$x$ does not Granger-cause $y$	Testing predictive (not structural) causality in a VAR
Chow test	No structural break	Testing regime changes

Part 11: Distribution Statistics

Before running regressions or tests, understanding the shape of your data’s distribution matters — especially in finance where returns are decidedly non-normal.

Moments

The first four moments summarise a distribution's location, spread, asymmetry, and tail weight (they do not, in general, determine it completely):

Moment	Formula	What it measures
Mean	$\mu = \mathbb{E}[X]$	Central tendency
Variance	$\sigma^2 = \mathbb{E}[(X-\mu)^2]$	Spread
Skewness	$\gamma_1 = \mathbb{E}\!\left[\left(\frac{X-\mu}{\sigma}\right)^3\right]$	Asymmetry
Kurtosis	$\gamma_2 = \mathbb{E}\!\left[\left(\frac{X-\mu}{\sigma}\right)^4\right]$	Tail heaviness

Skewness: Zero = symmetric. Positive = right tail (large positive outliers). Negative = left tail (crash risk in equity returns — large negative outliers dominate).

Kurtosis: Normal distribution has kurtosis = 3. Excess kurtosis = kurtosis − 3. Positive excess kurtosis means fat tails — extreme events are far more common than a normal distribution predicts.

Financial returns typically show: negative skew + excess kurtosis > 0. This is why the Gaussian assumption in Black-Scholes systematically underprices out-of-the-money options.

Fat Tails vs Normal

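The Jarque-Bera test tests jointly for zero skewness and zero excess kurtosis. With $\gamma_2$ the raw kurtosis defined above, the excess kurtosis is $\gamma_2 - 3$:

JB = \frac{n}{6}\left(\gamma_1^2 + \frac{(\gamma_2 - 3)^2}{4}\right) \sim \chi^2_2 \quad \text{asymptotically, under normality}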
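
As a rough sketch, the sample moments and the Jarque-Bera statistic can be computed with the standard library alone. The statistic below is built from the excess kurtosis (raw kurtosis minus 3), and the p-value uses the fact that a $\chi^2_2$ variable has survival function $e^{-x/2}$. The function name and the return series are hypothetical.

```python
import math


def jarque_bera(xs):
    """Return (skewness, kurtosis, JB statistic, asymptotic p-value)."""
    n = len(xs)
    mean = sum(xs) / n
    # Central moments, dividing by n (the convention the JB statistic uses).
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2  # raw kurtosis; a normal distribution gives 3
    jb = n / 6 * (skew ** 2 + (kurt - 3) ** 2 / 4)
    # A chi-square variable with 2 degrees of freedom has P(X > x) = exp(-x / 2).
    p_value = math.exp(-jb / 2)
    return skew, kurt, jb, p_value


# Hypothetical daily returns with one crash day, purely for illustration.
returns = [0.004, -0.002, 0.001, 0.003, -0.001, 0.002, -0.045, 0.000, 0.002, -0.003]
skew, kurt, jb, p_value = jarque_bera(returns)
```

The single crash day is enough to produce negative skew and kurtosis well above 3. The $\chi^2_2$ approximation is asymptotic, so the p-value should be treated with caution in samples this small.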

Part 12: Inequality and Concentration Measures

Gini Coefficient

The Gini coefficient measures inequality in a distribution — how concentrated values are among a subset of the population. It ranges from 0 (perfect equality) to 1 (perfect inequality).

Construction via the Lorenz Curve:

The Lorenz curve plots the cumulative share of total income (or wealth) held by the bottom $x\%$ of the population:

L(F) = \frac{\int_0^F Q(p)\, dp}{\int_0^1 Q(p)\, dp}

Where $Q(p)$ is the quantile function (inverse CDF). Perfect equality means $L(F) = F$ — the bottom 50% holds 50% of income.

The Gini coefficient is twice the area between the perfect equality line and the Lorenz curve:

G = 1 - 2\int_0^1 L(F)\, dF = \frac{\text{Area between diagonal and Lorenz curve}}{\text{Total area under diagonal}}
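
A minimal sketch of the construction, assuming non-negative values with a positive total: build the empirical Lorenz curve from sorted data, integrate it with the trapezoid rule, and apply $G = 1 - 2\int_0^1 L(F)\,dF$. The function name is illustrative.

```python
def gini_coefficient(values):
    """Gini via trapezoidal integration of the empirical Lorenz curve."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    # Lorenz curve points L(i / n): cumulative share held by the bottom i units.
    lorenz = [0.0]
    running = 0.0
    for x in xs:
        running += x
        lorenz.append(running / total)
    # Trapezoid rule for the area under the Lorenz curve on [0, 1].
    area = sum((lorenz[i] + lorenz[i + 1]) / 2 for i in range(n)) / n
    return 1 - 2 * area


equal = gini_coefficient([5, 5, 5, 5])          # everyone holds the same amount
concentrated = gini_coefficient([0, 0, 0, 10])  # one unit holds everything
```

With a finite sample of size $n$, the most unequal case gives $(n-1)/n$ rather than exactly 1, which is why `concentrated` comes out at 0.75 for four observations.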

Gini in Model Validation (Credit Scoring)

The Gini coefficient has a second life in quantitative model evaluation — particularly in credit risk. A credit model ranks borrowers by predicted default probability; the Gini measures how well it separates defaulters from non-defaulters.

The relationship to the AUC (Area Under the ROC Curve):

\text{Gini} = 2 \times \text{AUC} - 1

A random model has AUC = 0.5, Gini = 0. A perfect model has AUC = 1, Gini = 1. In practice, credit scorecards with Gini > 0.4 are considered good; > 0.6 is excellent.
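
To make the AUC link concrete, here is a small sketch that computes AUC as the fraction of (defaulter, non-defaulter) pairs in which the defaulter has the higher predicted risk, counting ties as one half, and then converts it to Gini. The scores and labels are invented for illustration, and the pairwise loop is $O(n^2)$, so this suits small samples only.

```python
def auc(scores, labels):
    """AUC as the share of (event, non-event) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a correctly ranked pair
    return wins / (len(pos) * len(neg))


# Hypothetical predicted default probabilities and outcomes (1 = default).
pd_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
defaulted = [1, 1, 0, 1, 0, 0, 1, 0]

model_auc = auc(pd_scores, defaulted)
model_gini = 2 * model_auc - 1
```

Here 12 of the 16 pairs are ranked correctly, so the AUC is 0.75 and the Gini is 0.5.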

Herfindahl-Hirschman Index (HHI)

HHI measures market concentration — how dominant the largest players are:

\text{HHI} = \sum_{i=1}^n s_i^2

Where $s_i$ is firm $i$’s market share (as a fraction, with shares summing to 1). Ranges from $1/n$ (perfectly equal shares) to 1 (monopoly).

HHI < 0.15: unconcentrated market
0.15–0.25: moderate concentration
HHI > 0.25: highly concentrated (the threshold in the 2010 US DOJ/FTC Horizontal Merger Guidelines; the 2023 Merger Guidelines lowered it to 0.18)

Used in: antitrust analysis, portfolio concentration risk, factor concentration in quant portfolios.
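
A short worked example, with made-up inputs: the helper below normalises raw sizes (revenues, portfolio weights, and so on) into shares before summing their squares.

```python
def hhi(sizes):
    """Herfindahl-Hirschman Index from raw sizes; shares are normalised to sum to 1."""
    total = sum(sizes)
    return sum((s / total) ** 2 for s in sizes)


equal_market = hhi([1, 1, 1, 1])    # four equal firms: 1 / n = 0.25
skewed_market = hhi([50, 30, 20])   # 0.25 + 0.09 + 0.04 = 0.38
monopoly = hhi([100])               # a single firm: 1.0
```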

Part 13: Non-parametric Methods

Non-parametric methods make no assumptions about the underlying distribution. Essential when data is ordinal, heavily skewed, or has fat tails.

Rank Correlations

Pearson correlation measures linear dependence between two variables. It can be misleading when relationships are monotonic but non-linear, or when outliers distort the picture.

Spearman’s $\rho$ replaces values with their ranks, then computes Pearson correlation on the ranks:

\rho_s = 1 - \frac{6 \sum_i d_i^2}{n(n^2-1)}

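Where $d_i = \text{rank}(x_i) - \text{rank}(y_i)$. This shortcut formula is exact only when there are no ties; with ties, compute the Pearson correlation on the (average) ranks directly. Spearman’s $\rho$ captures any monotonic relationship, not just linear.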

Kendall’s $\tau$ counts concordant vs discordant pairs:

\tau = \frac{C - D}{\binom{n}{2}}

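Where $C$ = number of concordant pairs (the two observations are ordered the same way in $x$ and in $y$) and $D$ = number of discordant pairs. This form is tau-a; when ties are present, the tau-b variant adjusts the denominator. Kendall’s $\tau$ is often preferred to Spearman’s $\rho$ in small samples.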

Method	Measures	Sensitive to outliers?	Use when
Pearson	Linear dependence	Yes	Normal data, linear relationship
Spearman	Monotonic dependence	No	Ordinal data, non-linear monotone
Kendall	Ordinal association	No	Small samples, many ties
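
Both rank measures are easy to compute by hand on small data. The sketch below assumes no ties (so the Spearman shortcut formula and tau-a both apply); the helper names and data are illustrative.

```python
from itertools import combinations


def ranks(values):
    """Ranks 1..n in ascending order; assumes there are no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        r[idx] = rank
    return r


def spearman_rho(x, y):
    n = len(x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


def kendall_tau(x, y):
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


x = [1, 2, 3, 4, 5]
cubic = [v ** 3 for v in x]   # monotonic but non-linear in x
shuffled = [2, 1, 4, 3, 5]    # two adjacent swaps

rho_cubic = spearman_rho(x, cubic)
rho_shuffled = spearman_rho(x, shuffled)
tau_shuffled = kendall_tau(x, shuffled)
```

The cubic series has a Spearman correlation of exactly 1 with `x` even though the relationship is far from linear. For the shuffled series, $\rho_s = 0.8$ and $\tau = 0.6$: the two measures agree on direction but live on different scales.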

Bootstrap

The bootstrap estimates the sampling distribution of any statistic by resampling with replacement from the observed data. No distributional assumptions required.

Algorithm:

Draw $B$ bootstrap samples of size $n$ from the data (with replacement)
Compute the statistic $\hat{\theta}^*_b$ on each sample
The distribution of $\{\hat{\theta}^*_1, \ldots, \hat{\theta}^*_B\}$ approximates the sampling distribution of $\hat{\theta}$

Bootstrap confidence interval (percentile method):

\text{CI}_{95\%} = [\hat{\theta}^*_{(0.025)},\; \hat{\theta}^*_{(0.975)}]

The bootstrap is invaluable when:

The statistic has no closed-form sampling distribution (e.g., Sharpe ratio, Gini)
The data is clearly non-normal
You want robust standard errors for complex estimators

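In finance: Bootstrap is used to test whether a backtest’s Sharpe ratio is statistically significant without assuming normal returns. Serially dependent returns call for a block bootstrap, and resampling does nothing to remove look-ahead bias already present in the backtest.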
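
A minimal sketch of the percentile bootstrap for a per-period Sharpe ratio follows. It resamples observations independently, which assumes i.i.d. returns (a block bootstrap would be needed for serially dependent data). The return series is simulated and every name here is illustrative.

```python
import math
import random


def sharpe(returns):
    """Per-period Sharpe ratio (mean over sample standard deviation)."""
    n = len(returns)
    mean = sum(returns) / n
    var = sum((r - mean) ** 2 for r in returns) / (n - 1)
    return mean / math.sqrt(var)


def bootstrap_ci(data, stat, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for stat(data)."""
    rng = random.Random(seed)
    n = len(data)
    # Resample n observations with replacement, n_boot times, and sort the estimates.
    estimates = sorted(
        stat([data[rng.randrange(n)] for _ in range(n)]) for _ in range(n_boot)
    )
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return estimates[lo_idx], estimates[hi_idx]


# Simulated daily returns, purely for illustration.
sim = random.Random(42)
daily_returns = [sim.gauss(0.0005, 0.01) for _ in range(250)]

point_estimate = sharpe(daily_returns)
ci_low, ci_high = bootstrap_ci(daily_returns, sharpe, seed=1)
```

If the interval `(ci_low, ci_high)` contains zero, the backtest's Sharpe ratio is not distinguishable from zero at the 5% level under these assumptions.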

Kernel Density Estimation (KDE)

KDE estimates the probability density function of a dataset without assuming a parametric form:

\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^n K\!\left(\frac{x - x_i}{h}\right)

Where $K$ is a kernel function (usually Gaussian) and $h$ is the bandwidth — the smoothing parameter.

Small $h$ : wiggly, overfits to noise
Large $h$ : over-smoothed, loses shape detail
Rule-of-thumb $h$ for a Gaussian kernel and roughly normal data (normal reference rule, often credited to Silverman): $h = 1.06 \hat{\sigma} n^{-1/5}$

KDE is used to visualise return distributions, compare empirical vs theoretical densities, and detect multimodality (e.g., bimodal return distributions suggesting regime changes).
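
The estimator is short enough to write out directly. This sketch uses a Gaussian kernel and the $1.06\,\hat{\sigma}\, n^{-1/5}$ bandwidth rule; the sample values and function names are hypothetical.

```python
import math


def rule_of_thumb_bandwidth(xs):
    """h = 1.06 * sample standard deviation * n^(-1/5)."""
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    return 1.06 * sd * n ** (-1 / 5)


def gaussian_kde(xs, h):
    """Return a function x -> estimated density, using a Gaussian kernel."""
    n = len(xs)
    norm = 1 / (n * h * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in xs)

    return density


sample = [-1.2, -0.4, 0.0, 0.3, 0.5, 1.1, 2.0]  # illustrative observations
h = rule_of_thumb_bandwidth(sample)
density = gaussian_kde(sample, h)
```

Evaluating `density` on a grid gives the curve to plot; halving or doubling `h` shows the wiggly and over-smoothed regimes described above.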

Part 14: Logistic Regression and Classification

Linear regression predicts a continuous outcome. When the outcome is binary — default or no-default, fraud or not, churn or not — logistic regression is the standard tool.

The Model

Instead of modelling $y$ directly, logistic regression models the log-odds of the event:

\log\frac{P(y=1|\mathbf{x})}{1 - P(y=1|\mathbf{x})} = \boldsymbol{\beta}^\top \mathbf{x}

Solving for the probability:

P(y=1|\mathbf{x}) = \frac{1}{1+e^{-\boldsymbol{\beta}^\top \mathbf{x}}} = \sigma(\boldsymbol{\beta}^\top \mathbf{x})

The sigmoid function $\sigma$ maps any real number to $(0,1)$ , giving a valid probability.

Estimation: Maximum Likelihood

Logistic regression has no closed-form solution. Parameters are found by maximising the log-likelihood:

\ell(\boldsymbol{\beta}) = \sum_{i=1}^n \left[ y_i \log \hat{p}_i + (1-y_i) \log(1-\hat{p}_i) \right]

Solved iteratively with Newton-Raphson or gradient descent.
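
To show what the iteration looks like, here is a sketch of Newton-Raphson for the simplest case: an intercept plus one predictor, so the $2 \times 2$ system can be solved by hand. It assumes the classes are not perfectly separable (otherwise the MLE does not exist). The data, a made-up debt-to-income ratio against a default flag, and all names are illustrative.

```python
import math


def sigmoid(z):
    return 1 / (1 + math.exp(-z))


def fit_logistic_1d(x, y, n_iter=25):
    """Newton-Raphson for P(y=1|x) = sigmoid(b0 + b1 * x)."""
    b0 = b1 = 0.0
    for _ in range(n_iter):
        p = [sigmoid(b0 + b1 * xi) for xi in x]
        # Gradient of the log-likelihood: X^T (y - p).
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Negative Hessian X^T W X, with W = diag(p * (1 - p)).
        w = [pi * (1 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: beta += (X^T W X)^{-1} X^T (y - p), via the 2x2 inverse.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


dti = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]  # hypothetical debt-to-income ratios
default = [0, 0, 1, 0, 1, 0, 1, 1]              # 1 = default

b0_hat, b1_hat = fit_logistic_1d(dti, default)
```

At convergence the gradient is zero, which for a model with an intercept means the fitted probabilities sum to the observed number of defaults.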

Interpreting Coefficients

A one-unit increase in $x_j$ multiplies the odds by $e^{\beta_j}$ :

\text{OR}_j = e^{\beta_j}

$\beta_j > 0$ → higher $x_j$ increases default probability
$\beta_j < 0$ → higher $x_j$ decreases default probability
$|\beta_j|$ → effect size (on the log-odds scale)

Unlike OLS, marginal effects on the probability scale depend on the values of all other variables — they are not constant.

Model Performance Metrics

Metric	Formula	Interpretation
AUC	Area under ROC curve	0.5 = random; 1.0 = perfect
Gini	$2 \times \text{AUC} - 1$	0 = random; 1 = perfect
KS Statistic	$\max_x \lvert F_1(x) - F_0(x) \rvert$	Max separation between default/non-default CDFs
Log-loss	$-\ell(\hat{\boldsymbol{\beta}})/n$	Lower is better; measures calibration
Brier Score	$\frac{1}{n}\sum(y_i - \hat{p}_i)^2$	Mean squared error of probability forecasts

In credit risk, Gini > 0.4 is typically the minimum acceptable threshold; Gini > 0.6 is strong.
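
The KS statistic, Brier score, and log-loss from the table can each be computed in a few lines. This is a sketch on invented predictions, with $F_1$ and $F_0$ taken as the empirical CDFs of the predicted probability among defaulters and non-defaulters.

```python
import math


def ks_statistic(probs, labels):
    """Max gap between the empirical score CDFs of events and non-events."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    best = 0.0
    for t in sorted(set(probs)):
        f1 = sum(p <= t for p in pos) / len(pos)
        f0 = sum(p <= t for p in neg) / len(neg)
        best = max(best, abs(f1 - f0))
    return best


def brier_score(probs, labels):
    return sum((y - p) ** 2 for p, y in zip(probs, labels)) / len(labels)


def log_loss(probs, labels):
    # Requires probabilities strictly between 0 and 1.
    total = sum(y * math.log(p) + (1 - y) * math.log(1 - p) for p, y in zip(probs, labels))
    return -total / len(labels)


# Hypothetical predicted default probabilities and outcomes (1 = default).
pd_hat = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
outcome = [1, 1, 0, 1, 0, 0, 1, 0]

ks = ks_statistic(pd_hat, outcome)
brier = brier_score(pd_hat, outcome)
ll = log_loss(pd_hat, outcome)
```

On this toy sample the KS statistic is 0.5 and the Brier score is 0.2.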

Part 15: Scorecard Development — WoE and Information Value

Credit scorecards translate continuous and categorical predictors into integer points. The standard preprocessing pipeline uses Weight of Evidence (WoE) encoding.

Weight of Evidence (WoE)

For a predictor binned into groups, WoE for bin $i$ is:

\text{WoE}_i = \ln\left(\frac{\text{Distribution of Events}_i}{\text{Distribution of Non-Events}_i}\right) = \ln\left(\frac{P(\text{bin } i \mid \text{event})}{P(\text{bin } i \mid \text{non-event})}\right)

Positive WoE → bin has a higher proportion of defaults than the overall population (risky)
Negative WoE → bin has lower proportion of defaults (safe)
WoE = 0 → bin default rate equals the population average

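WoE puts all variables on a common, interpretable log-odds scale, captures non-linear effects through binning, and handles missing values by giving them their own bin. Many scorecard texts put non-events (goods) in the numerator instead, which flips the sign of every WoE value.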

Information Value (IV)

IV summarises a variable’s predictive power across all its bins:

\text{IV} = \sum_i (\text{Events}_i\% - \text{Non-Events}_i\%) \times \text{WoE}_i

IV	Predictive Power
< 0.02	Useless
0.02–0.1	Weak
0.1–0.3	Medium
0.3–0.5	Strong
> 0.5	Suspicious (check for data leakage)

IV is the primary variable selection criterion in scorecard development. Variables with IV < 0.02 are typically dropped; IV > 0.5 triggers a data quality review.
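
A small worked example of both quantities, using the events-over-non-events convention from the WoE definition above. The bin counts are invented, and a production version would need to smooth or merge bins with zero events or zero non-events, which this sketch does not handle.

```python
import math


def woe_iv(bins):
    """bins: list of (n_events, n_non_events) per bin. Returns (WoE list, IV)."""
    total_events = sum(e for e, _ in bins)
    total_non_events = sum(n for _, n in bins)
    woe = []
    iv = 0.0
    for e, n in bins:
        share_events = e / total_events          # share of all events in this bin
        share_non_events = n / total_non_events  # share of all non-events in this bin
        w = math.log(share_events / share_non_events)
        woe.append(w)
        iv += (share_events - share_non_events) * w
    return woe, iv


# Hypothetical binned characteristic: (defaults, non-defaults) per bin.
counts = [(40, 160), (30, 270), (30, 470)]
woe, iv = woe_iv(counts)
```

The first bin holds 40% of defaults but under 18% of non-defaults, so its WoE is positive (about 0.81). The second bin matches the population default rate and has WoE of zero. The IV comes out at roughly 0.30, on the boundary between medium and strong in the table above.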

From WoE to Scorecard Points

Once logistic regression is fit on WoE-transformed variables, scorecard points are assigned by scaling coefficients to an integer range (e.g., 300–850 for consumer credit):

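\text{Points}_{ij} = -\left(\beta_j \times \text{WoE}_{ij} + \frac{\beta_0}{k}\right) \times \text{Factor} + \frac{\text{Offset}}{k}

Where $k$ is the number of characteristics in the scorecard, and the intercept $\beta_0$ and the Offset are spread evenly across them so that the points sum to the total score. Factor and Offset are chosen to anchor the score to a target odds at a target score (e.g., odds of 50:1 at score 600) together with a chosen number of points to double the odds (PDO): $\text{Factor} = \text{PDO} / \ln 2$ and $\text{Offset} = \text{Target Score} - \text{Factor} \times \ln(\text{Target Odds})$.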

The final score is additive across characteristics — easy to explain to regulators and customers.
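
The scaling can be sketched as follows. Two assumptions are mine rather than the text's: a PDO of 20 points, and the offset being split evenly across the $k$ characteristics so that the points sum to $\text{Offset} + \text{Factor} \times \ln(\text{good:bad odds})$. The coefficients and WoE values are hypothetical, and WoE follows the events-over-non-events convention, so the logistic model predicts default.

```python
import math


def scaling(target_score, target_odds, pdo):
    """Factor and Offset so that score = Offset + Factor * ln(good:bad odds)."""
    factor = pdo / math.log(2)
    offset = target_score - factor * math.log(target_odds)
    return factor, offset


def bin_points(beta_j, woe_ij, beta_0, k, factor, offset):
    """Points for one bin of one characteristic (k characteristics in total)."""
    return -(beta_j * woe_ij + beta_0 / k) * factor + offset / k


factor, offset = scaling(target_score=600, target_odds=50, pdo=20)

# Hypothetical fitted model: intercept plus two WoE-encoded characteristics.
beta_0 = -3.2
betas = [0.9, 1.1]
applicant_woe = [0.35, -0.60]  # WoE of the bins this applicant falls into

score = sum(
    bin_points(b, w, beta_0, len(betas), factor, offset)
    for b, w in zip(betas, applicant_woe)
)
```

Because the model's linear predictor is the log-odds of default, the minus sign turns it into log-odds of being good, so higher scores mean lower risk.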

Part 16: Survival Analysis

Survival analysis models the time until an event occurs — time to default, time to prepayment, time to customer churn. Unlike logistic regression (which asks “will it happen?”), survival analysis asks “when will it happen?”

Core Functions

Survival function $S(t)$ — probability the event has not occurred by time $t$ :

S(t) = P(T > t), \quad S(0) = 1, \quad S(\infty) = 0

Hazard function $h(t)$ — instantaneous rate of the event at time $t$ , given survival to $t$ :

h(t) = \lim_{\Delta t \to 0} \frac{P(t \leq T < t+\Delta t \mid T \geq t)}{\Delta t} = -\frac{S'(t)}{S(t)}

Cumulative hazard $H(t) = \int_0^t h(s)\, ds = -\ln S(t)$

Kaplan-Meier Estimator

The non-parametric estimate of $S(t)$ from censored data:

\hat{S}(t) = \prod_{t_i \leq t} \left(1 - \frac{d_i}{n_i}\right)

Where $d_i$ is the number of events and $n_i$ is the number at risk at time $t_i$ . Censored observations (e.g., loans that were paid off before defaulting) are handled naturally — they contribute to the risk set up to their exit time, then drop out.
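
A compact sketch of the estimator on a toy loan cohort: `observed` is 1 for a default and 0 for a censored exit (for example, early repayment). Observations censored at the same time as an event are treated as still at risk at that time, which is the usual convention. The data are invented.

```python
def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each distinct event time."""
    data = sorted(zip(durations, observed))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        events = removed = 0
        # Gather everything that exits (event or censoring) at time t.
        while i < len(data) and data[i][0] == t:
            events += data[i][1]
            removed += 1
            i += 1
        if events:
            surv *= 1 - events / n_at_risk
            curve.append((t, surv))
        # Both events and censored observations leave the risk set after t.
        n_at_risk -= removed
    return curve


months = [3, 5, 5, 8, 10, 12]   # hypothetical months on book
observed = [1, 0, 1, 1, 0, 1]   # 1 = default, 0 = censored

km_curve = kaplan_meier(months, observed)
```

The survival estimates are $5/6$, $2/3$, $4/9$, and $0$ at months 3, 5, 8, and 12. The loan censored at month 10 never triggers a step, but it keeps the risk set at 2 for the month-8 event... and its exit shrinks the risk set to 1 before month 12.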

Cox Proportional Hazards Model

The Cox model is the standard regression approach for survival data. It relates covariates to the hazard without specifying the baseline hazard shape (semi-parametric):

h(t|\mathbf{x}) = h_0(t) \cdot \exp(\boldsymbol{\beta}^\top \mathbf{x})

Where $h_0(t)$ is an unspecified baseline hazard. The proportional hazards assumption: the hazard ratio between two individuals with different covariates is constant over time.

Hazard ratio (HR): $\text{HR}_j = e^{\beta_j}$ — a one-unit increase in $x_j$ multiplies the hazard by $e^{\beta_j}$ . HR > 1 means higher risk; HR < 1 means lower risk.

Estimated with partial likelihood — the baseline hazard cancels out, making estimation tractable without specifying it.

Applications in credit:

Probability of default over a 12-month horizon (IFRS 9 Stage migration)
Lifetime probability of default (IFRS 9 ECL)
Time to repayment / prepayment modelling

Part 17: Model Monitoring and Validation

Models degrade over time as the population they’re applied to drifts away from the development sample. Model monitoring is a regulatory requirement (SR 11-7, PRA SS1/23) and a risk management necessity.

Population Stability Index (PSI)

PSI measures how much a variable’s distribution has shifted between the development (reference) period and a monitoring period:

\text{PSI} = \sum_i \left(A_i - E_i\right) \times \ln\left(\frac{A_i}{E_i}\right)

Where $A_i$ = actual proportion in bin $i$ (monitoring), $E_i$ = expected proportion in bin $i$ (development).

PSI	Interpretation
< 0.10	No significant shift — model still valid
0.10–0.25	Moderate shift — investigate
> 0.25	Major shift — model may need redevelopment

PSI is computed on the score distribution (overall stability) and on each input characteristic (Characteristic Stability Index, CSI). A high PSI on one characteristic identifies which variable is driving the drift.
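
A worked example of the formula on made-up bin proportions. The sketch assumes every bin has a non-zero proportion in both periods; empty bins need merging or a small floor before the logarithm is taken.

```python
import math


def psi(expected, actual):
    """Population Stability Index from per-bin proportions (each list sums to 1)."""
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))


development = [0.10, 0.20, 0.40, 0.20, 0.10]  # hypothetical score-band shares
monitoring = [0.05, 0.15, 0.40, 0.25, 0.15]

psi_value = psi(development, monitoring)
```

Here the PSI is about 0.08, below the 0.10 threshold, so the shift towards the upper score bands would not yet trigger an investigation. The same function gives the CSI when fed the bin proportions of a single input characteristic.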

Characteristic Stability Index (CSI)

CSI applies the same formula as PSI but to individual input variables. Workflow:

flowchart LR
    Score[Score monitoring window] --> PSI{PSI above 0.10?}
    PSI -->|No| OK[Model stable - continue]
    PSI -->|Yes| CSI[Compute CSI for each variable]
    CSI --> Driver[Identify driver variable]
    Driver --> Root[Root cause: data issue or population shift]
    Root --> Fix[Recalibrate or redevelop model]

Performance Monitoring

Track discrimination and calibration separately — a model can remain discriminatory (Gini stable) while becoming poorly calibrated (predicted rates diverge from actuals):

Metric	Monitors	Alert threshold
Gini / AUC	Discrimination (rank ordering)	Drop > 5 pp from development Gini
KS Statistic	Separation between default/non-default	Drop > 5 pp
Predicted vs Actual Default Rate	Calibration	Predicted/Actual ratio outside 0.8–1.2
Hosmer-Lemeshow test	Calibration (formal)	p-value < 0.05 across score bands
PSI	Population drift	> 0.25 on score or key characteristic

Backtesting

For through-the-cycle models (PD, LGD), backtesting compares predicted values against realised outcomes:

Binomial test for PD: Under $H_0$ that predicted PD is correct, the number of defaults in a cohort follows a Binomial distribution. Test whether actual defaults are consistent with predicted.
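
As a sketch of the one-sided version of this test, the function below computes the probability of seeing at least the observed number of defaults if the predicted PD were correct. It assumes independent defaults, which ignores default correlation and therefore tends to reject too often in stressed years. The cohort numbers are hypothetical.

```python
import math


def binom_pmf(k, n, p):
    """Binomial probability mass, computed in log space to avoid overflow."""
    log_pmf = (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log(1 - p)
    )
    return math.exp(log_pmf)


def binomial_upper_tail(n_obligors, n_defaults, pd):
    """P(D >= n_defaults) when D ~ Binomial(n_obligors, pd)."""
    return 1.0 - sum(binom_pmf(k, n_obligors, pd) for k in range(n_defaults))


# Hypothetical cohort: 1,000 obligors, predicted PD of 2%, 30 observed defaults.
p_value = binomial_upper_tail(n_obligors=1000, n_defaults=30, pd=0.02)
```

With 20 defaults expected, observing 30 gives a p-value below 5%, so under the independence assumption the model would be flagged as underpredicting.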

Traffic light framework (Basel):

Green zone: actual defaults within expected range
Amber zone: borderline — increase monitoring
Red zone: model materially over/underpredicts — regulatory notification required

Part 18: Risk Metrics — VaR and Expected Shortfall

Value at Risk (VaR)

VaR is the loss not exceeded with probability $1-\alpha$ over a given horizon:

P(L > \text{VaR}_\alpha) = \alpha

Equivalently, $\text{VaR}_\alpha$ is the $(1-\alpha)$-quantile of the loss distribution (e.g., the 99th percentile of losses for $\alpha = 1\%$).

Three estimation approaches:

Method	How	Assumptions
Historical simulation	Sort past P&L; read off percentile	Distribution-free; captures fat tails and correlations
Parametric (variance-covariance)	Assume normal losses with mean $\mu$ and standard deviation $\sigma$; $\text{VaR}_\alpha = \mu + z_{1-\alpha}\, \sigma$	Fast; underestimates tail risk for non-normal returns
Monte Carlo	Simulate thousands of scenarios from a model	Flexible; computationally expensive

Limitations of VaR:

Not subadditive — a portfolio of two positions can have higher VaR than the sum of their individual VaRs (violates diversification intuition)
Tells you nothing about the magnitude of losses beyond the threshold

Expected Shortfall (CVaR / ES)

Expected Shortfall is the expected loss conditional on exceeding VaR:

\text{ES}_\alpha = \mathbb{E}[L \mid L > \text{VaR}_\alpha] = \frac{1}{\alpha} \int_0^\alpha \text{VaR}_u\, du

ES is the average of all losses in the tail beyond VaR. It is:

Subadditive — always rewards diversification
More sensitive to tail shape — captures the severity, not just the threshold
The regulatory standard for market risk under Basel IV (FRTB): ES at the 97.5% confidence level replaced VaR at the 99% level
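
The sketch below computes VaR and ES two ways: from a sample of losses (historical simulation) and under a normal assumption. Here `alpha` is the tail probability, so $\text{VaR}_\alpha$ is the $(1-\alpha)$-quantile of losses. The empirical quantile convention is one of several in use, and the normal ES uses the standard closed form $\mu + \sigma\,\varphi(z_{1-\alpha})/\alpha$. Function names and inputs are illustrative.

```python
from statistics import NormalDist


def historical_var_es(losses, alpha):
    """Empirical VaR and ES; assumes alpha * n is well below n."""
    ordered = sorted(losses)
    n = len(ordered)
    n_tail = max(1, round(alpha * n))   # number of observations in the alpha tail
    var = ordered[n - n_tail - 1]       # in-sample P(L > var) = n_tail / n
    es = sum(ordered[n - n_tail:]) / n_tail  # mean of the losses beyond var
    return var, es


def normal_var_es(mu, sigma, alpha):
    """VaR and ES for normally distributed losses."""
    z = NormalDist().inv_cdf(1 - alpha)
    var = mu + z * sigma
    es = mu + sigma * NormalDist().pdf(z) / alpha
    return var, es


toy_losses = list(range(1, 101))  # losses 1, 2, ..., 100, for illustration
hist_var, hist_es = historical_var_es(toy_losses, alpha=0.05)

var_99, _ = normal_var_es(mu=0.0, sigma=1.0, alpha=0.01)
_, es_975 = normal_var_es(mu=0.0, sigma=1.0, alpha=0.025)
```

On the toy losses, VaR is 95 and ES is 98 (the mean of 96 to 100). Under normality, 97.5% ES is about $2.34\sigma$ against $2.33\sigma$ for 99% VaR, so the two are nearly equal for normal losses and diverge only when the tails are fatter.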

Duration and DV01 (Fixed Income Risk)

For fixed income portfolios, interest rate sensitivity is measured by:

Modified Duration:

D_\text{mod} = -\frac{1}{P}\frac{dP}{dy} \approx -\frac{\Delta P / P}{\Delta y}

A bond with modified duration of 5 loses approximately 5% in value for a 1% (100bp) rise in yield.

DV01 (Dollar Value of a Basis Point):

\text{DV01} = -\frac{dP}{dy} \times 0.0001 \approx D_\text{mod} \times P \times 0.0001

DV01 is the P&L change for a 1 basis point (0.01%) move in yield. The standard unit for expressing interest rate risk on a trading desk.

Convexity measures the curvature of the price-yield relationship (duration is the first-order approximation; convexity is the second-order correction):

\Delta P \approx -D_\text{mod} \cdot P \cdot \Delta y + \frac{1}{2} \cdot C \cdot P \cdot (\Delta y)^2

Positive convexity (standard bonds) means the bond gains more when yields fall than it loses when yields rise by the same amount.
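
A numerical sketch ties the three measures together. It prices a hypothetical 5-year annual-pay bullet bond with a 5% coupon, estimates modified duration and convexity by bumping the yield one basis point each way, and compares the duration-only and duration-plus-convexity approximations against a full reprice for a 100bp rise.

```python
def bond_price(face, coupon_rate, y, years):
    """Annual-pay bullet bond priced at yield y (annual compounding)."""
    coupon = face * coupon_rate
    pv_coupons = sum(coupon / (1 + y) ** t for t in range(1, years + 1))
    return pv_coupons + face / (1 + y) ** years


def risk_measures(face, coupon_rate, y, years, bump=1e-4):
    """Price, modified duration, convexity and DV01 by central finite differences."""
    p0 = bond_price(face, coupon_rate, y, years)
    p_up = bond_price(face, coupon_rate, y + bump, years)
    p_dn = bond_price(face, coupon_rate, y - bump, years)
    d_mod = -(p_up - p_dn) / (2 * bump * p0)
    convexity = (p_up + p_dn - 2 * p0) / (bump ** 2 * p0)
    dv01 = d_mod * p0 * 1e-4
    return p0, d_mod, convexity, dv01


p0, d_mod, convexity, dv01 = risk_measures(100, 0.05, 0.05, 5)

dy = 0.01  # a 100bp rise in yield
actual = bond_price(100, 0.05, 0.05 + dy, 5)
duration_only = p0 - d_mod * p0 * dy
with_convexity = duration_only + 0.5 * convexity * p0 * dy ** 2
```

The bond prices at par with a modified duration of about 4.33. Duration alone overstates the loss for a 100bp rise, and adding the convexity term brings the approximation to within a fraction of a cent of the repriced value.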

Expected Credit Loss (ECL — IFRS 9)

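Under IFRS 9, banks must recognise expected credit losses on in-scope financial instruments, measured over the next 12 months or over the instrument’s lifetime depending on its stage. For a single period: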

\text{ECL} = \text{PD} \times \text{LGD} \times \text{EAD} \times \text{DF}

Where:

PD — Probability of Default (from logistic/survival model)
LGD — Loss Given Default (fraction of exposure lost; modelled via beta regression or OLS on logit-transformed LGD)
EAD — Exposure at Default (outstanding balance at time of default)
DF — Discount factor (to present value)

Staging under IFRS 9:

Stage 1 — 12-month ECL (no significant credit deterioration since origination)
Stage 2 — Lifetime ECL (significant increase in credit risk)
Stage 3 — Lifetime ECL, credit-impaired

The transition between stages is the critical modelling decision — typically driven by PD relative to origination PD, delinquency triggers, or watchlist flags.
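
To illustrate how much staging matters, the sketch below extends the single-period formula to several periods by summing $\text{PD}_t \times \text{LGD} \times \text{EAD}_t \times \text{DF}_t$. It assumes annual periods, marginal (unconditional) PDs per year, a constant LGD, and end-of-period discounting at a flat rate. All inputs are hypothetical.

```python
def expected_credit_loss(marginal_pds, lgd, eads, discount_rate):
    """Sum of PD_t * LGD * EAD_t * DF_t over the periods supplied."""
    ecl = 0.0
    for t, (pd_t, ead_t) in enumerate(zip(marginal_pds, eads), start=1):
        df_t = 1 / (1 + discount_rate) ** t  # end-of-period discount factor
        ecl += pd_t * lgd * ead_t * df_t
    return ecl


# Hypothetical amortising three-year loan.
pds = [0.020, 0.025, 0.030]   # marginal PD in years 1, 2 and 3
eads = [1000.0, 800.0, 600.0]
lgd = 0.45
rate = 0.05

ecl_12m = expected_credit_loss(pds[:1], lgd, eads[:1], rate)   # Stage 1
ecl_lifetime = expected_credit_loss(pds, lgd, eads, rate)      # Stages 2 and 3
```

For this loan the 12-month ECL is about 8.57 and the lifetime ECL about 23.73, so a move from Stage 1 to Stage 2 nearly triples the provision without any change in the underlying risk parameters.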

Common Pitfalls

Pitfall	What happens	Fix
Omitted variable bias	$\hat{\boldsymbol{\beta}}$ is biased and inconsistent	Add the variable; use instrumental variables or fixed effects
Spurious regression	Fake significance between unrelated non-stationary series	Test stationarity; difference or use ECM
Look-ahead bias	Future data leaks into predictors	Align data carefully; use lagged values
P-hacking	Testing many models, reporting the best	Pre-register hypothesis; correct for multiple testing
Overfitting	Model fits in-sample noise	Cross-validate; use regularisation; hold-out test set
Ignoring autocorrelation	Standard errors too small; over-rejection	Use HAC standard errors or model residuals
Reverse causality	Causal direction is ambiguous	Instrumental variables; Granger tests (these show predictive precedence, not causation)
