1. Linear regression seeks \(\hat\beta\) minimizing \(\|y - X\beta\|^2\), yielding the OLS estimator:
\[\hat\beta = (X^TX)^{-1}X^Ty\]
The feature matrix \(X\) encodes \(n\) sample points in \(\mathbb{R}^p\). The matrix \(X^TX\) captures the geometric spread of these points.
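As a quick numerical check of the formula above, here is a minimal NumPy sketch. The data, dimensions, and `beta_true` are made up for illustration; the point is only that solving the normal equations agrees with a generic least-squares solver.

```python
import numpy as np

# Synthetic data (all values here are illustrative assumptions).
rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.5, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# OLS via the normal equations: solve (X^T X) beta = X^T y.
# Solving the linear system avoids forming the inverse explicitly.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Same estimate from NumPy's least-squares routine, for comparison.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_normal)
print(beta_lstsq)
```

Both routes should give the same coefficients up to floating-point error when \(X^TX\) is well conditioned.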
2. PCA as a Description of Sample Spread
Centering \(\tilde{X} = X - \mathbf{1}\bar{x}^T\), the sample covariance \(S = \frac{1}{n}\tilde{X}^T\tilde{X}\) has eigendecomposition \(S = V\Lambda V^T\). The eigenvectors \(v_k\) are the principal directions, ordered by how much the sample points spread in each direction, with variances \(\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p\).
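The construction above can be sketched in a few lines of NumPy. The data and column scales below are assumptions chosen so that the spread differs clearly between directions; the check at the end confirms that the variance of the data projected onto \(v_k\) equals \(\lambda_k\).

```python
import numpy as np

# Illustrative data: independent columns with very different scales.
rng = np.random.default_rng(1)
n, p = 500, 3
X = rng.normal(size=(n, p)) * np.array([3.0, 1.0, 0.2])

# Center the columns and form the sample covariance with the 1/n convention.
X_tilde = X - X.mean(axis=0)
S = X_tilde.T @ X_tilde / n

# eigh returns eigenvalues in ascending order, so reverse to get
# lambda_1 >= lambda_2 >= ... >= lambda_p.
eigvals, eigvecs = np.linalg.eigh(S)
lam = eigvals[::-1]
V = eigvecs[:, ::-1]

# Variance of the projections onto each principal direction
# (np.var uses ddof=0 by default, matching the 1/n convention).
proj_var = (X_tilde @ V).var(axis=0)

print(lam)
print(proj_var)
```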
3. Connecting PCA to the Estimator
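To make the connection precise, center the model by separating the intercept, reducing to regression of \(\tilde{y} = y - \bar{y}\) on \(\tilde{X}\). Now \(\tilde{X}^T\tilde{X} = nS\) shares its eigenvectors with \(S\). Using the thin SVD \(\tilde{X} = U\Sigma V^T\), so that \(\sigma_k^2 = n\lambda_k\), and assuming \(\tilde{X}\) has full column rank so that \(\Sigma\) is invertible, the OLS formula becomes: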
\[\hat\beta = V\Sigma^{-1}U^T\tilde{y}.\]
Express \(\hat\beta\) in the orthonormal basis \(\{v_k\}\):
\[\hat\beta = \tilde\beta_1 v_1 + \cdots + \tilde\beta_p v_p = V\tilde\beta.\]
Collecting terms, we have
\[\tilde\beta = \Sigma^{-1}U^T\tilde{y} \qquad \Longrightarrow \qquad \tilde\beta_k = \frac{u_k^T\tilde{y}}{\sigma_k}, \quad 1 \le k \le p.\]
The estimator decomposes along principal directions, with each component determined by two factors:
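\(u_k^T \tilde{y}\): the projection of the centered response \(\tilde{y}\) onto \(u_k = \tilde{X}v_k/\sigma_k\), the unit-norm scores of the data along the \(k\)-th principal direction. It measures how strongly the response aligns with that direction.
\(1/\sigma_k\): the inverse spread of the data in that direction, since \(\sigma_k^2 = n\lambda_k\).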
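The decomposition can be checked numerically. The sketch below uses made-up data and coefficients; it computes \(\tilde\beta_k = u_k^T\tilde{y}/\sigma_k\) from the thin SVD, rebuilds \(\hat\beta = V\tilde\beta\), and compares the result with a standard least-squares fit on the centered data.

```python
import numpy as np

# Illustrative data with an intercept (values are assumptions).
rng = np.random.default_rng(2)
n, p = 300, 4
X = rng.normal(size=(n, p)) * np.array([2.0, 1.0, 0.5, 0.25])
y = X @ np.array([1.0, -1.0, 2.0, 0.5]) + 3.0 + 0.2 * rng.normal(size=n)

# Center to separate out the intercept.
X_tilde = X - X.mean(axis=0)
y_tilde = y - y.mean()

# Thin SVD: X_tilde = U diag(sigma) V^T, singular values in descending order.
U, sigma, Vt = np.linalg.svd(X_tilde, full_matrices=False)
V = Vt.T

# Coordinates of the estimator in the principal-direction basis.
beta_pc = (U.T @ y_tilde) / sigma

# Back to the original coordinates: beta_hat = V @ beta_pc.
beta_hat = V @ beta_pc

# Reference: ordinary least squares on the centered data.
beta_ref, *_ = np.linalg.lstsq(X_tilde, y_tilde, rcond=None)

# Singular values versus covariance eigenvalues: sigma_k^2 = n * lambda_k.
lam = np.linalg.eigvalsh(X_tilde.T @ X_tilde / n)[::-1]

print(beta_hat)
print(beta_ref)
```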
4. Stability Analysis and Ridge Regression
This decomposition makes the instability explicit. When \(\sigma_k \approx 0\), the data has little spread in direction \(v_k\) and carries almost no information about \(\beta\) in that direction. Yet OLS multiplies the noisy signal \(u_k^T\tilde{y}\) by \(1/\sigma_k\), which blows up as \(\sigma_k \to 0\) and destabilizes \(\hat\beta\).
This is the geometric meaning of multicollinearity: the sample points nearly lie in a subspace, leaving some directions of \(\beta\) unidentifiable.
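To see the effect in numbers, here is a small simulation sketch. Everything in it is an assumption made for illustration: two nearly collinear features, a fixed design, and a noise standard deviation of 0.5. The response is redrawn many times and the standard deviation of each \(\tilde\beta_k\) is recorded; under this setup it should be close to the noise level divided by \(\sigma_k\).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
noise_sd = 0.5  # assumed noise level

# Two nearly collinear features: the points almost lie on a line.
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)
X = np.column_stack([x1, x2])
X_tilde = X - X.mean(axis=0)

U, sigma, Vt = np.linalg.svd(X_tilde, full_matrices=False)

# Noise-free part of the response for an assumed true coefficient vector.
signal = X_tilde @ np.array([1.0, 1.0])

# Redraw the noise many times and record the estimator in the PC basis.
coefs = []
for _ in range(500):
    y = signal + noise_sd * rng.normal(size=n)
    y_tilde = y - y.mean()
    coefs.append((U.T @ y_tilde) / sigma)
coefs = np.array(coefs)

pc_std = coefs.std(axis=0)

print("singular values:        ", sigma)
print("std of PC coefficients: ", pc_std)
print("noise_sd / sigma:       ", noise_sd / sigma)
```

The coefficient along the well-spread direction barely moves between draws, while the one along the near-degenerate direction fluctuates by orders of magnitude more.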
Ridge regression cures this by shrinking the small-\(\sigma_k\) components. With penalty \(\lambda > 0\) (not to be confused with the eigenvalues \(\lambda_k\)), the coefficient along \(v_k\) becomes
\[\tilde\beta_k^{\text{ridge}} = \frac{\sigma_k}{\sigma_k^2 + \lambda} \cdot u_k^T\tilde{y},\]
trading a controlled amount of bias for reduced variance: the bias-variance tradeoff made explicit in the PC basis.
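A short sketch of the ridge formula in the PC basis follows. The data and the penalty value `lam = 1.0` are assumptions for illustration. It checks the per-direction formula against the closed form \((\tilde{X}^T\tilde{X} + \lambda I)^{-1}\tilde{X}^T\tilde{y}\) and computes the shrinkage factor \(\sigma_k^2/(\sigma_k^2 + \lambda)\) applied to each OLS component.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200

# Nearly collinear features again (illustrative data).
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = X @ np.array([1.0, 1.0]) + 0.5 * rng.normal(size=n)

X_tilde = X - X.mean(axis=0)
y_tilde = y - y.mean()

U, sigma, Vt = np.linalg.svd(X_tilde, full_matrices=False)
uty = U.T @ y_tilde

lam = 1.0  # ridge penalty, an arbitrary illustrative choice

# OLS and ridge coefficients in the principal-direction basis.
ols_pc = uty / sigma
ridge_pc = sigma / (sigma ** 2 + lam) * uty

# Ridge estimate in the original coordinates.
beta_ridge = Vt.T @ ridge_pc

# Closed form for comparison: (X^T X + lam I)^{-1} X^T y on centered data.
beta_direct = np.linalg.solve(
    X_tilde.T @ X_tilde + lam * np.eye(X.shape[1]), X_tilde.T @ y_tilde
)

# Per-direction shrinkage relative to OLS.
shrink = sigma ** 2 / (sigma ** 2 + lam)

print("shrinkage factors:", shrink)
print("ridge (SVD form): ", beta_ridge)
print("ridge (direct):   ", beta_direct)
```

The well-sampled direction is left almost untouched (factor near 1), while the poorly sampled one is shrunk heavily toward zero.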
5. Takeaway
The OLS estimator, viewed in the PC basis, is a direction-by-direction regression, weighted by the inverse spread of the data. PCA reveals which directions are well sampled (large \(\sigma_k\), stable estimation) and which are poorly sampled (small \(\sigma_k\), unstable estimation). Ridge regression is the natural remedy, regularizing precisely those directions where the data provides little information.