1. Linear regression seeks \(\hat\beta\) minimizing \(\|y - X\beta\|^2\), yielding the OLS estimator:

\[\hat\beta = (X^TX)^{-1}X^Ty\]

The feature matrix \(X\) encodes \(n\) sample points in \(\mathbb{R}^p\). The matrix \(X^TX\) captures the geometric spread of these points.
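As a quick numerical check of the closed form, here is a minimal NumPy sketch. The synthetic data, seed, and coefficient values are assumptions made for illustration only. It solves the normal equations \(X^TX\beta = X^Ty\) directly instead of forming the inverse, and compares the result with NumPy's least-squares solver.

```python
import numpy as np

# Synthetic data (illustrative assumption, not from the text).
rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Normal equations: solve (X^T X) beta = X^T y without an explicit inverse.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from NumPy's least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_normal)
print(beta_lstsq)
```

Both routes should agree up to floating-point error as long as \(X\) has full column rank.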

2. PCA as a Description of Sample Spread

After centering the data, \(\tilde{X} = X - \mathbf{1}\bar{x}^T\), the sample covariance \(S = \frac{1}{n}\tilde{X}^T\tilde{X}\) has eigendecomposition \(S = V\Lambda V^T\). The eigenvectors \(v_k\) are the principal directions, ordered by how much the sample points spread along each of them, with variances \(\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p\).
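The construction above can be sketched in a few lines of NumPy. The data here are synthetic, with per-feature scales chosen arbitrarily for illustration. The sketch centers the data, forms \(S\) with the \(1/n\) convention used here, and checks that the variance of the data projected onto each \(v_k\) equals the corresponding eigenvalue.

```python
import numpy as np

# Synthetic data with different spread per feature (illustrative assumption).
rng = np.random.default_rng(1)
n, p = 200, 3
X = rng.normal(size=(n, p)) * np.array([3.0, 1.0, 0.2]) + 5.0

# Center the data and form the sample covariance with the 1/n convention.
X_c = X - X.mean(axis=0)
S = X_c.T @ X_c / n

# eigh returns eigenvalues in ascending order; reorder to descending.
eigvals, V = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], V[:, order]

# Variance of the data projected onto each principal direction (ddof=0 matches 1/n).
proj_var = (X_c @ V).var(axis=0)

print(eigvals)
print(proj_var)
```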

3. Connecting PCA to the Estimator

To make the connection precise, center the model by separating the intercept, which reduces the problem to regressing \(\tilde{y} = y - \bar{y}\) on \(\tilde{X}\). Since \(\tilde{X}^T\tilde{X} = nS\), it shares its eigenvectors with \(S\). Using the thin SVD \(\tilde{X} = U\Sigma V^T\), with singular values \(\sigma_k = \sqrt{n\lambda_k}\) assumed nonzero:

\[\hat\beta = V\Sigma^{-1}U^T\tilde{y}.\]

Express \(\hat \beta\) in the orthonormal basis \(\{v_k\}\):
\(\hat \beta = \tilde\beta_1 v_1 + \cdots + \tilde\beta_p v_p = V \tilde\beta\).
Comparing with the SVD expression for \(\hat\beta\), we have

\[\tilde\beta = \Sigma^{-1}U^T\tilde{y} \qquad \Longrightarrow \qquad \tilde\beta_k = \frac{u_k^T\tilde{y}}{\sigma_k}, \quad 1 \le k \le p.\]

The estimator decomposes along principal directions, with each component determined by two factors:

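\(u_k^T \tilde{y}\): the projection of the centered response \(\tilde{y}\) onto the normalized scores \(u_k = \tilde{X}v_k/\sigma_k\) of the \(k\)-th principal component, i.e. how strongly the response aligns with that direction.
\(1/\sigma_k\): the inverse spread of the data in that direction.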
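The decomposition can be checked numerically. The following is a minimal sketch on synthetic data (sizes, scales, and coefficients are illustrative assumptions): it computes \(\tilde\beta_k = u_k^T\tilde{y}/\sigma_k\) from the thin SVD, maps back with \(V\), and compares against a standard least-squares solve on the centered data.

```python
import numpy as np

# Synthetic data (illustrative assumption).
rng = np.random.default_rng(2)
n, p = 100, 3
X = rng.normal(size=(n, p)) * np.array([2.0, 1.0, 0.5])
y = X @ np.array([1.0, -1.0, 2.0]) + 0.1 * rng.normal(size=n)

# Center features and response (this removes the intercept).
X_c = X - X.mean(axis=0)
y_c = y - y.mean()

# Thin SVD: X_c = U diag(s) Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(X_c, full_matrices=False)

# Coordinates of the OLS estimator in the principal-direction basis.
beta_tilde = (U.T @ y_c) / s

# Back to the original feature coordinates: beta_hat = V beta_tilde.
beta_hat = Vt.T @ beta_tilde

# Reference: ordinary least squares on the centered data.
beta_ref, *_ = np.linalg.lstsq(X_c, y_c, rcond=None)

# Squared singular values over n recover the eigenvalues of S.
eig_S = np.linalg.eigvalsh(X_c.T @ X_c / n)[::-1]

print(beta_hat, beta_ref)
```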

4. Stability Analysis and Ridge Regression

This decomposition makes instability explicit. When \(\sigma_k \approx 0\), the data has little spread in direction \(v_k\) and carries almost no information about \(\beta\) in that direction. Yet OLS amplifies the noisy signal by \(1/\sigma_k \to \infty\), destabilizing \(\hat\beta\).

This is the geometric meaning of multicollinearity: the sample points nearly lie in a subspace, leaving some directions of \(\beta\) unidentifiable.
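To see the amplification concretely, consider a small sketch with two nearly collinear synthetic features (the construction, noise levels, and perturbation size `eps` are all illustrative assumptions). Perturbing the response by the same amount along \(u_1\) and along \(u_2\) moves the estimate by \(\epsilon/\sigma_1\) and \(\epsilon/\sigma_2\) respectively, so the poorly sampled direction reacts far more strongly.

```python
import numpy as np

# Two nearly collinear features (illustrative assumption).
rng = np.random.default_rng(3)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 1e-3 * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.1 * rng.normal(size=n)

X_c = X - X.mean(axis=0)
y_c = y - y.mean()
U, s, Vt = np.linalg.svd(X_c, full_matrices=False)

def ols(y_vec):
    """OLS estimate computed through the SVD of the centered features."""
    return Vt.T @ ((U.T @ y_vec) / s)

# Same-size perturbation of the response along the well- and poorly-sampled directions.
eps = 1e-2
shift_large_sigma = ols(y_c + eps * U[:, 0]) - ols(y_c)
shift_small_sigma = ols(y_c + eps * U[:, 1]) - ols(y_c)

print("singular values:", s)
print("shift along v_1:", np.linalg.norm(shift_large_sigma))
print("shift along v_2:", np.linalg.norm(shift_small_sigma))
```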

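Ridge regression cures this by shrinking the small-\(\sigma_k\) components. With a penalty \(\lambda > 0\) (the ridge parameter, not to be confused with the eigenvalues \(\lambda_k\) of \(S\)), the coefficient along \(v_k\) becomes

\[\tilde\beta_k^{\text{ridge}} = \frac{\sigma_k}{\sigma_k^2 + \lambda} \, u_k^T\tilde{y} = \frac{\sigma_k^2}{\sigma_k^2 + \lambda} \, \tilde\beta_k\]

so directions with \(\sigma_k^2 \gg \lambda\) are left nearly untouched, while those with \(\sigma_k^2 \ll \lambda\) are shrunk toward zero. This trades a controlled amount of bias for reduced variance: the bias-variance tradeoff made explicit in the PC basis.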
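The ridge formula in the PC basis can be verified against the usual closed form \((\tilde{X}^T\tilde{X} + \lambda I)^{-1}\tilde{X}^T\tilde{y}\). The sketch below uses synthetic, nearly collinear data and an arbitrary penalty `lam = 1.0`, both assumptions for illustration.

```python
import numpy as np

# Synthetic data with one nearly redundant feature (illustrative assumption).
rng = np.random.default_rng(4)
n, p = 100, 3
X = rng.normal(size=(n, p))
X[:, 2] = X[:, 0] + 1e-2 * rng.normal(size=n)
y = X @ np.array([1.0, 0.5, 1.0]) + 0.1 * rng.normal(size=n)

X_c = X - X.mean(axis=0)
y_c = y - y.mean()
U, s, Vt = np.linalg.svd(X_c, full_matrices=False)

lam = 1.0  # ridge penalty, chosen arbitrarily for this sketch
proj = U.T @ y_c

# OLS and ridge coordinates in the principal-direction basis.
beta_tilde_ols = proj / s
beta_tilde_ridge = s / (s**2 + lam) * proj

# Ridge estimate in the original feature coordinates.
beta_ridge = Vt.T @ beta_tilde_ridge

# Reference: the standard closed-form ridge solution.
beta_ref = np.linalg.solve(X_c.T @ X_c + lam * np.eye(p), X_c.T @ y_c)

# Per-direction shrinkage factor applied to the OLS coordinates.
shrink = s**2 / (s**2 + lam)

print("shrinkage factors:", shrink)
print(beta_ridge, beta_ref)
```

The shrinkage factors are close to 1 for well-sampled directions and close to 0 for the nearly degenerate one, which is where OLS is least reliable.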

5. Takeaway

The OLS estimator, viewed in the PC basis, is a direction-by-direction regression, weighted by the inverse spread of the data. PCA reveals which directions are well-sampled (large \(\sigma_k\), stable estimation) and which are poorly sampled (small \(\sigma_k\), unstable estimation). Ridge regression is the natural remedy, regularizing precisely those directions where the data provides little information.

