Weighted Least Squares Method

To understand the weighted least squares method, let us consider the model in matrix format with the basic assumption:

$$Y= X \beta + \varepsilon$$

where $\varepsilon \sim (0, \sigma^2 I)$; that is, the errors have mean zero, constant variance $\sigma^2$, and the observations are independent of each other.

The variance function $Var(Y|X)$ is the same for all values of $X$. This assumption can be relaxed in a number of ways.


Consider the simplest multiple regression case

$$E(Y|X = x_i) = \beta' x_i$$

Assume the error variance has the form

$$Var(Y|X = x_i) = Var(e_i) = \frac{\sigma^2}{w_i}$$

where $w_1, w_2, \cdots, w_n$ are known positive numbers.

The variance function is still characterized by only one unknown positive number $\sigma^2$, but variances can be different for each case. This will lead to Weighted Least Squares instead of Ordinary Least Squares.

In a standard OLS model, we assume homoscedasticity: the idea that the variance of the “noise” or error term is constant across all observations. In the real world, this is often false.

Example: If you are modeling household spending vs. income, wealthier families tend to have much higher variability in their spending than lower-income families. Weighted least squares allows you to “down-weight” the high-variance observations so they do not disproportionately pull the regression line.

Formally,

$$Y=X\beta + e, \qquad \qquad X: n\times p', \qquad \text{rank}\,\, p'$$

$$Var(e) = \sigma^2 \Sigma$$

where $\Sigma$ is known and $\sigma^2>0$ is not necessarily known.

$$\Sigma = \begin{bmatrix} \sigma_1^2 & \sigma_{12} & \sigma_{13} & \cdots & \sigma_{1n} \\ \sigma_{21} & \sigma_{2}^2 & \sigma_{23} & \cdots & \sigma_{2n} \\ \sigma_{31} & \sigma_{32} & \sigma_3^2 & \cdots & \sigma_{3n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \sigma_{n1} & \sigma_{n2} & \sigma_{n3} &\cdots & \sigma_n^2\end{bmatrix}$$
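In the weighted least squares setting introduced above, the errors are uncorrelated, so the off-diagonal elements of $\Sigma$ are zero, and from $Var(e_i) = \sigma^2/w_i$ the matrix reduces to the diagonal form

$$\Sigma = \begin{bmatrix} \frac{1}{w_1} & 0 & \cdots & 0\\ 0 & \frac{1}{w_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{1}{w_n}\end{bmatrix}, \qquad \Sigma^{-1} = W = diag(w_1, w_2, \cdots, w_n)$$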

Once $\hat{\beta}$ is determined, the residuals $\hat{e}$ are given by $$\hat{e} = Y - \hat{Y} = Y - X\hat{\beta}$$

The estimator $\hat{\beta}$ is chosen to minimize the generalized residual sum of squares

\begin{align*}
RSS(\beta) &= (Y-X\beta)'\Sigma^{-1}(Y-X\beta) = e'\Sigma^{-1}e
\end{align*}

which, in the weighted case with $\Sigma^{-1} = W = diag(w_1, \cdots, w_n)$, reduces to

\begin{align*}
RSS(\beta) &= (Y-X\beta)'W(Y-X\beta) = \sum_i w_i (y_i - x'_i\beta)^2
\end{align*}

The generalized least squares estimator is

$$\hat{\beta} = (X^t \Sigma ^{-1} X)^{-1} X^t \Sigma^{-1}Y$$
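As an illustration, the estimator can be computed directly from this formula. The following is a minimal R sketch on simulated data (the object names n, x, w, Sigma_inv, and beta_hat are illustrative, not from the text); it also checks the result against R's built-in weighted fit, which should return essentially the same coefficients.

```r
# Minimal sketch: computing the WLS/GLS estimator by hand on simulated data.
# Object names (n, x, w, Sigma_inv, beta_hat) are illustrative, not from the text.
set.seed(1)
n <- 100
x <- runif(n, 1, 10)
w <- 1 / x                        # known positive weights, so Var(e_i) = sigma^2 / w_i
e <- rnorm(n, sd = sqrt(1 / w))   # heteroscedastic errors (sigma^2 = 1)
y <- 2 + 3 * x + e

X         <- cbind(1, x)          # design matrix with intercept
Sigma_inv <- diag(w)              # Sigma^{-1} = W when the errors are uncorrelated

# beta_hat = (X' Sigma^{-1} X)^{-1} X' Sigma^{-1} Y
beta_hat <- solve(t(X) %*% Sigma_inv %*% X, t(X) %*% Sigma_inv %*% y)

beta_hat
coef(lm(y ~ x, weights = w))      # R's built-in WLS gives the same estimates
```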

Now suppose that we can find an $n\times n$ matrix $C$ such that $C$ is symmetric and $C'C = CC' = \Sigma^{-1}$ (and hence $C^{-1}C'^{-1}=\Sigma$). Such a matrix $C$ is called the square root of $\Sigma^{-1}$.

\begin{align*}
Var(Ce) &= C\, Var(e)\, C' \tag*{as $Var(e) = \sigma^2 \Sigma$}\\
&= \sigma^2 C \Sigma C'\\
& = \sigma^2 [C C^{-1}C'^{-1}C'] = \sigma^2 I_n
\end{align*}

Multiplying both sides of $Y=X\beta+e$ by $C$ gives

\begin{align*}
CY &= CX\beta + Ce\\
Z &= W\beta + d
\end{align*}

where $Z=CY, W=CX, d=Ce$ with $Var(d) = \sigma^2I_n$
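For the diagonal (WLS) case, $C = diag(\sqrt{w_1}, \cdots, \sqrt{w_n})$, so the transformed model can simply be fitted by ordinary least squares. A brief R sketch, reusing the simulated x, y, and w from the earlier snippet (names are illustrative):

```r
# Sketch of the transformation for the diagonal (WLS) case, reusing the
# simulated x, y, w from the previous snippet (illustrative names).
C     <- diag(sqrt(w))                # square root of Sigma^{-1} = diag(w)
Z     <- as.vector(C %*% y)           # Z = C Y
W_mat <- C %*% cbind(1, x)            # W = C X (renamed to avoid clashing with the weights)

coef(lm(Z ~ W_mat - 1))               # OLS on the transformed data ...
coef(lm(y ~ x, weights = w))          # ... matches the WLS fit on the original data
```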

Comparison of OLS vs WLS

The following table summarizes the differences between OLS and WLS:

| Feature | Ordinary Least Squares | Weighted Least Squares |
|---|---|---|
| Variance assumption | Constant ($\sigma^2 I$) | Not constant ($\sigma^2 \Sigma$) |
| Efficiency | Best linear unbiased estimator (BLUE) if homoscedastic | More efficient than OLS when heteroscedasticity is present |
| Weights | All observations have equal weight ($w_i=1$) | Observations weighted by $\frac{1}{\sigma_i^2}$ |

WLS Practical Implementation Steps

One can estimate the weights, as they are rarely “known” in practice, by following these steps (a short R sketch after the list illustrates the workflow):

  1. Residual analysis: Run an OLS regression first and plot the residuals.
  2. Model the variance: Regress the absolute residuals (or squared residuals) against the predictors to estimate the variance function.
  3. Calculate the weights: Set $w_i = \frac{1}{\hat{\sigma}_i^2}$.
  4. Re-run the regression: Perform WLS using the weights computed in step 3.
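A hedged R sketch of this workflow is given below. The data frame df and the single predictor x are placeholders, and regressing the absolute residuals on the predictor is only one of several ways to model the variance function (in practice one should also guard against non-positive fitted values before inverting them).

```r
# A sketch of the four steps on a generic data frame `df` with response y and
# a single predictor x (both are placeholders, not objects from the text).

# Step 1: fit OLS and inspect the residuals
ols <- lm(y ~ x, data = df)
plot(fitted(ols), resid(ols))            # look for a fan or funnel shape

# Step 2: model the variance, e.g. regress absolute residuals on the predictor
var_fit   <- lm(abs(resid(ols)) ~ x, data = df)
sigma_hat <- fitted(var_fit)             # estimated standard deviation per case

# Step 3: weights are the inverse of the estimated variances
w <- 1 / sigma_hat^2

# Step 4: re-run the regression as WLS
wls <- lm(y ~ x, data = df, weights = w)
summary(wls)
```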

Note that WLS is a special case of Generalized Least Squares (GLS). It is a powerful tool because it restores the “Best” in BLUE (Best Linear Unbiased Estimator) when the standard OLS assumptions fail.


Weighted Least Squares FAQs

What is the main difference between OLS and WLS?

The primary difference lies in the assumption of variance. Ordinary Least Squares (OLS) assumes that every observation has the same variance (homoscedasticity). Weighted Least Squares (WLS) is used when observations have different variances (heteroscedasticity). WLS assigns a “weight” to each data point, typically $w_i = \frac{1}{\sigma_i^2}$, giving more influence to more precise observations.

When to use the Weighted Least Squares Method instead of the Ordinary Least Squares Method?

One should use WLS when a residual analysis of your OLS model reveals a non-constant variance pattern (e.g., a “fan” or “funnel” shape in a residual plot). If the variance of your errors increases or decreases with the independent variable, OLS is no longer the most efficient estimator, and WLS should be applied.

How do I determine the weights ($w_i$) for my model?

In theoretical exercises, weights are often provided. In practice, one must estimate the weights. The common methods to compute weights include:

  • Prior Knowledge: Using known measurement errors.
  • Residual Modeling: Regressing the absolute or squared residuals from an initial OLS model against the predictor variables to find a variance function.
  • Subgrouping: If data is grouped, using the inverse of the variance within each group.

Is the Weighted Least Squares Method a type of Generalized Least Squares Method?

Yes. The Weighted Least Squares Method (WLS) is a special case of Generalized Least Squares (GLS). While GLS handles cases where errors are both heteroscedastic and correlated (the $\Sigma$ matrix has non-zero off-diagonal elements), WLS specifically deals with cases where errors are uncorrelated but have unequal variances (the $\Sigma$ matrix is diagonal).

Does the Weighted Least Squares Method change the coefficients compared to the Ordinary Least Squares Method?

Yes, the estimated coefficients $\hat{\beta}$ will likely change. Because WLS prioritizes data points with lower variance, the resulting regression line will “tilt” to better fit the most reliable data points. This typically results in smaller standard errors for your coefficients, making your t-tests and p-values more reliable.

Can the Weighted Least Squares Method Handle Outliers?

While WLS can down-weight observations with high variance, it is not inherently a “robust regression” technique for outliers in the way that M-estimation is. If an outlier has a small variance (and thus a high weight), it can actually pull the WLS line even more aggressively than OLS.


Weighted Least Squares

Weighted Least Squares is primarily a technique within Regression Analysis, which serves as a major part of Statistics and Econometrics. It is an enhancement of the fundamental Ordinary Least Squares (OLS) method. OLS assumes that all data points are equally reliable, but WLS is designed for situations where this is not true. It is also considered a special case of a more general method called Generalized Least Squares (GLS).

Importance of Weighted Least Squares Techniques

The core assumption of OLS is homoscedasticity, meaning the variance of the errors is constant across all levels (or fixed values) of the independent variables. In reality, data often exhibit heteroscedasticity, where the error variance changes. This is where WLS becomes invaluable.

Its primary importance lies in its ability to handle data of varying quality. Instead of treating a precise, low-variance measurement the same as an imprecise, high-variance one, WLS gives each data point a weight that reflects its reliability. This approach offers several key advantages:

  • Increased Efficiency: By giving proper influence to more precise data, WLS produces parameter estimates with the smallest possible variance, making it the Best Linear Unbiased Estimator (BLUE) under heteroscedasticity.
  • More Accurate Estimates: It prevents less reliable data points from skewing results, leading to a model that better represents the true relationship, especially in critical areas such as low-concentration measurements.
  • Valid Inferences: Correctly modeling the error structure allows for more reliable confidence intervals and hypothesis tests.

Real-Life Applications and Examples

WLS is applied across a vast range of fields. Here are some compelling examples:

1. Analytical Chemistry (HPLC Calibration)
In pharmaceutical analysis, accurately measuring low concentrations of impurities is critical. Data from instruments like HPLC-UV often have heteroscedastic noise, where the variability increases with concentration. A blog post from LCGC International demonstrates that using WLS for the calibration curve of the drug carbamazepine reduced the average back-calculation error from 27.7% (with OLS) to just 4.02% (with WLS). This ensures that drug impurities are quantified accurately, which is vital for patient safety.

2. Economics (Consumer Expenditure Surveys)
When studying household consumption, economists often find that spending patterns are more variable for high-income households than for low-income ones. For instance, a high-income family’s spending might fluctuate wildly, while a low-income family’s spending is more consistent. An OLS regression of consumption on income would be inefficient. WLS can be used to give less weight to the highly variable, high-income data points, resulting in a more stable and accurate model of the overall consumption trend.

3. Engineering (Satellite Positioning)
A cutting-edge application from an arXiv research paper involves positioning user terminals using signals from Low Earth Orbit (LEO) satellites. The quality of the signal from each satellite can vary due to interference or changing geometry. The researchers propose a hybrid system where a Deep Reinforcement Learning (DRL) model learns to assign optimal weights to each satellite’s measurement before feeding them into a Weighted Least Squares estimator. This approach achieves sub-meter accuracy (as low as 0.395m RMSE) while keeping computational demands low for the satellite’s onboard systems.

4. Political Science (Analyzing Policy Outcomes)
Political scientists use WLS when the uncertainty of data varies across observations. A classic example is analyzing the proportion of felons incarcerated across different states. The variance of this proportion might be smaller in states with higher average education levels. WLS can account for this, giving more weight to data from states where the measurement (or the underlying process) is more precise.

The Basic Formula and Logic of Weighted Least Squares Techniques

Let us formalize the idea of “Listening to some data more than others.”

  1. The Weights ($w_i$)
    In Weighted Least Squares, every single observation (data point) is given a weight (say $w_i$).
    The weight is inversely related to the variance of an observation (and so directly related to its reliability):
    • If a data point has low variance (it is precise and reliable), we give it a high weight
    • If a data point has high variance (it is noisy and unpredictable), we give it a low weight

      Mathematically, the weight $w_i$ is often calculated as
      $$w_i = \frac{1}{\sigma_i^2}$$
      where $\sigma_i^2$ is the variance of the error for that observation. A low $\sigma_i^2$ leads to a high $w_i$.
  2. The Core Weighted Least Squares Goal
    The standard “Least Squares” method minimizes the sum of the squared errors (the distance between the actual data point and the prediction line).
    OLS minimizes $\sum e_i^2$ while WLS minimizes $\sum w_i e_i^2$
  3. The Weighted Least Squares method multiplies each squared error by its specific weight before adding them up. This means that the model works extra hard to make the errors small for the data points with large weights; the tiny numeric sketch after this list makes the contrast concrete.
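A tiny numeric illustration of the two criteria, using made-up residuals and weights:

```r
# Tiny numeric illustration of the two criteria (made-up residuals and weights)
e <- c(0.5, -2.0, 1.0)     # residuals for three observations
w <- c(4, 0.25, 1)         # precise point gets a high weight, noisy point a low one

sum(e^2)                   # OLS criterion: 0.25 + 4 + 1 = 5.25
sum(w * e^2)               # WLS criterion: 1 + 1 + 1 = 3, the noisy point is down-weighted
```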


Understanding Ridge Regression

Discover the fundamentals of Ridge Regression, a powerful biased regression technique for handling multicollinearity and overfitting. Learn its canonical form, key differences from Lasso Regression (L1 vs L2 regularization), and why it’s essential for robust predictive modeling. Perfect for ML beginners and data scientists!

Introduction

In cases of near multicollinearity, the Ordinary Least Squares (OLS) estimator may perform worse than non-linear or biased estimators. Under near multicollinearity, the variance of the estimated regression coefficients ($\hat{\beta}=(X'X)^{-1}X'Y$), given by $\sigma^2(X'X)^{-1}$, can be very large. In terms of the Mean Squared Error (MSE) criterion, however, a biased estimator with less dispersion may be more efficient.

Ridge Regression, Bias Variance Trade off

Understanding Ridge Regression

Ridge regression (RR) is a popular biased regression technique used to address multicollinearity and overfitting in linear regression models. Unlike ordinary least squares (OLS), RR introduces a regularization term (L2 penalty) to shrink coefficients, improving model stability and generalization.

Adding the matrix $KI_p$ (where $K$ is a scalar) to $X'X$ yields a more stable matrix $(X'X+KI_p)$. The ridge estimator of $\beta$, $\hat{\beta}_R=(X'X+KI_p)^{-1}X'Y$, should have a smaller dispersion than the OLS estimator.
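To make the formula concrete, the following is a minimal R sketch that computes the OLS and ridge estimators on simulated, nearly collinear data. The names X, yc, and K are illustrative, and the choice K = 0.1 is arbitrary.

```r
# Minimal sketch of the ridge estimator (X'X + K I_p)^{-1} X'Y on simulated,
# nearly collinear data; names (X, yc, K) are illustrative and K = 0.1 is arbitrary.
set.seed(2)
n  <- 50
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.01)             # nearly collinear with x1
y  <- 1 + 2 * x1 - 1 * x2 + rnorm(n)

X  <- scale(cbind(x1, x2), scale = FALSE)  # centered predictors (intercept handled by centering)
yc <- y - mean(y)
K  <- 0.1                                  # ridge constant

beta_ols   <- solve(t(X) %*% X,                     t(X) %*% yc)
beta_ridge <- solve(t(X) %*% X + K * diag(ncol(X)), t(X) %*% yc)

cbind(OLS = as.vector(beta_ols), Ridge = as.vector(beta_ridge))  # ridge estimates are shrunk and more stable
```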

Why Use Ridge Regression

OLS regression can produce high variance when predictors are highly correlated (multicollinearity). Ridge regression helps by:

  • Reducing overfitting by penalizing large coefficients
  • Improving model stability in the presence of multicollinearity
  • Providing better predictions when data has many predictors

Canonical Form

Let $P$ denote the orthogonal matrix whose columns are the eigenvectors of $X'X$ and let $\Lambda$ be the (diagonal) matrix containing the eigenvalues. Consider the spectral decomposition:

\begin{align*}
X'X &= P\Lambda P'\\
\alpha &= P'\beta\\
X^* &= XP\\
C &= X^{*\prime}Y
\end{align*}

The model $Y=X\beta + \varepsilon$ can be written as

$$Y = X^*\alpha + \varepsilon$$

The OLS estimator of $\alpha$ is

\begin{align*}
\hat{\alpha} &= (X^{*\prime}X^*)^{-1}X^{*\prime} Y\\
&=(P'X'XP)^{-1}C = \Lambda^{-1}C
\end{align*}

In scalar notation $$\hat{\alpha}_i=\frac{C_i}{\lambda_i},\quad i=1,2,\cdots,p\tag{A}$$

From $\hat{\beta}_R = (X'X+KI_p)^{-1}X'Y$, it follows that the principle of RR is to add a constant $K$ to the denominator of (A), to obtain:

$$\hat{\alpha}_i^R = \frac{C_i}{\lambda_i + K}$$

Grob criticized this approach because it treats all eigenvalues of $X'X$ as if they were equal, whereas for the purpose of stabilization it would be more reasonable to add rather large values to small eigenvalues and only small values to large eigenvalues. This leads to the general ridge (GR) estimator:

$$\hat{\alpha}_i^R = \frac{C_i}{\lambda_i+K_i}$$
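The canonical-form computation can be sketched in R as follows, reusing X, yc, and K from the previous snippet; transforming back with $\beta = P\alpha$ reproduces the same ridge coefficients obtained earlier.

```r
# Sketch of the canonical form, reusing X, yc, and K from the previous snippet.
eig    <- eigen(t(X) %*% X)
P      <- eig$vectors                 # eigenvectors of X'X
lambda <- eig$values                  # eigenvalues (the diagonal of Lambda)

Xstar <- X %*% P                      # X* = X P
Cvec  <- as.vector(t(Xstar) %*% yc)   # C = X*' Y

alpha_ols   <- Cvec / lambda          # alpha_hat_i = C_i / lambda_i, equation (A)
alpha_ridge <- Cvec / (lambda + K)    # ordinary ridge: the same K added for every i

P %*% alpha_ridge                     # beta = P alpha reproduces the ridge coefficients above
```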

Ridge Regression vs Lasso Regression

Both are regularized regression techniques, but they differ in important ways (a short glmnet sketch follows the table):

| Feature | Ridge (L2) | Lasso (L1) |
|---|---|---|
| Shrinkage | Shrinks coefficients evenly | Can shrink coefficients to zero |
| Use case | Multicollinearity, many predictors | Feature selection, sparse models |
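As a quick practical comparison, both penalties are available through the glmnet package in R. This sketch assumes glmnet is installed, reuses the simulated X and yc from the ridge example above, and uses an arbitrary penalty value s = 0.1.

```r
# Quick comparison of the two penalties with the glmnet package
# (assumes glmnet is installed; reuses X and yc from the ridge sketch above;
# the penalty value s = 0.1 is arbitrary).
library(glmnet)

ridge_fit <- glmnet(X, yc, alpha = 0)   # alpha = 0 -> ridge (L2 penalty)
lasso_fit <- glmnet(X, yc, alpha = 1)   # alpha = 1 -> lasso (L1 penalty)

coef(ridge_fit, s = 0.1)                # ridge shrinks both coefficients, none exactly zero
coef(lasso_fit, s = 0.1)                # lasso may set one of them exactly to zero
```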

Ridge regression is a powerful biased regression method that improves prediction accuracy by adding L2 regularization. It’s especially useful when dealing with multicollinearity and high-dimensional data.
