Implementation of Lasso Regression From Scratch Using Python

A few months ago I had to ship a salary prediction model into a hiring analytics pipeline. The dataset was wide, messy, and full of features that looked useful on paper but behaved like noise in practice. I could have thrown a black‑box model at it, but stakeholders wanted transparency and a short list of factors that truly mattered. That’s where Lasso regression earned its spot. It’s linear, interpretable, and it actively zeroes out weak signals instead of merely shrinking them. When you’re working with lots of correlated or high‑dimensional inputs, that sparsity becomes a real design advantage.

In this guide, I’ll show you how I implement Lasso regression from scratch in Python, not just how to call a library. You’ll see the objective function, the subgradient details for the L1 penalty, and a full training loop with feature scaling, convergence checks, and sanity tests. I’ll also show where people stumble, how I tune the regularization strength, and how I validate correctness against a reference implementation. By the end, you’ll have a runnable Lasso model, the intuition to trust it, and a clear sense of when to use it—and when you shouldn’t.

Why Lasso Is Worth Building Yourself

If you only ever call a library implementation, you might miss what makes Lasso special: the L1 penalty’s ability to drive some coefficients to exactly zero. That is not just “smaller weights.” It is feature selection baked into the math. I’ve seen it reduce a feature set of 300 down to 18 while retaining predictive power, which made downstream monitoring and governance far simpler.

From a systems angle, a sparse model is faster and cheaper to evaluate. In a 2026 production stack, that matters. Model inference often sits in a latency‑sensitive path. Fewer non‑zero weights means fewer multiplications and easier model audits. When I’m shipping something that goes into a decision loop, I would rather explain 12 features clearly than handwave 300.

Lasso is not the right hammer for every nail. If you have extremely non‑linear relationships or a known causal graph, you may be better served with other methods. But if your data is high‑dimensional, you need interpretability, and you’re willing to accept a linear relationship, Lasso is still a top‑tier option.

The Math That Drives Sparsity

Lasso regression starts with the same hypothesis as linear regression:

  • Prediction: \( \hat{y} = Xw + b \)
  • Loss (mean squared error): \( \frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2 \)

Lasso adds an L1 penalty on the weights:

\[
J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \left(y_i - (Xw + b)_i\right)^2 + \lambda \sum_{j=1}^{n} |w_j|
\]

A few important points I keep in mind when I implement this:

  • The L1 term is not differentiable at zero. You can still train with subgradients or with a soft‑thresholding step (which I’ll show).
  • The penalty does not apply to the bias term. This is standard practice and keeps the intercept from being unfairly shrunk.
  • Feature scaling is essential. Without scaling, the penalty hits large‑scale features too hard and small‑scale features too lightly.

A simple analogy I use with teams: L2 is like pulling all weights toward zero with a rubber band, while L1 is like snapping some cords off entirely. That snapping is what makes the model sparse.

From Objective to Update Rule

I usually start with a gradient descent version because it’s easy to read and reason about. The subgradient of \( |w_j| \) is:

  • \(+1\) if \(w_j > 0\)
  • \(-1\) if \(w_j < 0\)
  • Any value in \([-1, 1]\) if \(w_j = 0\)

In practice, I use \(\text{sign}(w_j)\) and treat 0 as 0, which works fine for SGD‑style methods. For stability and faster convergence, I often use coordinate descent with soft‑thresholding. But for “from scratch” clarity, I’ll show the gradient approach first and then offer the coordinate descent variant.

Here’s the core update for weight \(w_j\):

\[
\frac{\partial J}{\partial w_j} = -\frac{2}{m} \sum_{i=1}^{m} x_{ij}(y_i - \hat{y}_i) + \lambda \cdot \text{sign}(w_j)
\]

Then:

\[
w_j \leftarrow w_j - \alpha \cdot \frac{\partial J}{\partial w_j}
\]

The bias update ignores the L1 term:

\[
b \leftarrow b - \alpha \cdot \left(-\frac{2}{m} \sum_{i=1}^{m} (y_i - \hat{y}_i)\right)
\]

That’s all you need for a working implementation.

A Full, Runnable Implementation in Python

Below is a clean, minimal implementation I use for teaching and prototyping. It includes scaling, training, prediction, and a small evaluation routine. I also include comments only where the logic is not obvious, so the code stays readable.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

class LassoFromScratch:
    def __init__(self, learning_rate=0.01, iterations=2000, l1_penalty=0.1):
        self.learning_rate = learning_rate
        self.iterations = iterations
        self.l1_penalty = l1_penalty
        self.w = None
        self.b = 0.0
        self.cost_history = []

    def compute_cost(self, X, y):
        m = X.shape[0]
        y_pred = X @ self.w + self.b
        mse = (1 / m) * np.sum((y - y_pred) ** 2)
        l1 = self.l1_penalty * np.sum(np.abs(self.w))
        return mse + l1

    def fit(self, X, y):
        m, n = X.shape
        self.w = np.zeros(n)
        self.b = 0.0
        for i in range(self.iterations):
            y_pred = X @ self.w + self.b
            # Gradients for weights and bias
            dw = (-2 / m) * (X.T @ (y - y_pred))
            db = (-2 / m) * np.sum(y - y_pred)
            # Subgradient for L1 penalty
            dw += self.l1_penalty * np.sign(self.w)
            # Update parameters
            self.w -= self.learning_rate * dw
            self.b -= self.learning_rate * db
            # Track cost occasionally for diagnostics
            if i % 50 == 0:
                self.cost_history.append(self.compute_cost(X, y))
        return self

    def predict(self, X):
        return X @ self.w + self.b

# Load dataset; expecting columns: "YearsExperience", "Salary"
data = pd.read_csv("Experience-Salary.csv")
X = data[["YearsExperience"]].values
y = data["Salary"].values

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train model
lasso = LassoFromScratch(learning_rate=0.05, iterations=2000, l1_penalty=0.1)
lasso.fit(X_train_scaled, y_train)

# Predict
y_pred = lasso.predict(X_test_scaled)

# Evaluate
mse = np.mean((y_test - y_pred) ** 2)
print(f"Test MSE: {mse:.2f}")
print(f"Weight: {lasso.w}")
print(f"Bias: {lasso.b:.2f}")

# Plot predictions
plt.scatter(X_test, y_test, color="steelblue", label="Actual")
plt.scatter(X_test, y_pred, color="darkorange", label="Predicted")
plt.xlabel("Years of Experience")
plt.ylabel("Salary")
plt.title("Lasso Regression from Scratch")
plt.legend()
plt.show()

This is intentionally readable rather than clever. For large datasets I would vectorize more aggressively and consider coordinate descent. But for a first‑principles implementation, this is a solid baseline.

Coordinate Descent and Soft‑Thresholding (Optional but Powerful)

Gradient descent works, but coordinate descent is often faster and more stable for Lasso because you can update one coefficient at a time with a closed‑form step. The key idea is the soft‑thresholding operator:

\[
S(z, \lambda) = \text{sign}(z) \cdot \max(|z| - \lambda, 0)
\]

When you isolate a single weight \(w_j\) (assuming standardized features, so \(\frac{1}{m}\sum_i x_{ij}^2 = 1\)), the update becomes:

\[
w_j \leftarrow S\left(\frac{1}{m}\sum_{i} x_{ij}\left(y_i - \hat{y}_i^{(-j)}\right), \frac{\lambda}{2}\right)
\]

Where \(\hat{y}_i^{(-j)}\) is the prediction without the contribution of \(w_j\). This tends to converge quickly and produces true zeros in finite steps.

Here’s a compact coordinate descent implementation you can swap in. I use it when I care about speed and clean sparsity:

class LassoCoordinateDescent:
    def __init__(self, iterations=200, l1_penalty=0.1):
        self.iterations = iterations
        self.l1_penalty = l1_penalty
        self.w = None
        self.b = 0.0

    @staticmethod
    def soft_threshold(z, gamma):
        if z > 0 and gamma < abs(z):
            return z - gamma
        if z < 0 and gamma < abs(z):
            return z + gamma
        return 0.0

    def fit(self, X, y):
        m, n = X.shape
        self.w = np.zeros(n)
        self.b = np.mean(y)
        for _ in range(self.iterations):
            # Update bias (never penalized)
            y_pred = X @ self.w + self.b
            self.b += np.mean(y - y_pred)
            # Update each weight with its closed-form soft-threshold step
            for j in range(n):
                y_pred = X @ self.w + self.b
                residual = y - (y_pred - X[:, j] * self.w[j])
                rho = np.mean(X[:, j] * residual)
                self.w[j] = self.soft_threshold(rho, self.l1_penalty / 2)
        return self

    def predict(self, X):
        return X @ self.w + self.b

This version is still clear, and in practice it often converges faster than plain gradient descent. The trade‑off is that it’s a bit harder to derive, so I prefer to teach gradient descent first and then introduce this as a performance upgrade.

Tuning λ and Learning Rate in Practice

Choosing \(\lambda\) is the real craft. Too small and you get a model close to linear regression. Too large and everything collapses to zero. In my projects, I do a small grid search over \(\lambda\) values on a log scale, then pick based on validation error and sparsity.

I usually start with a range like:

  • \(\lambda\) in [0.001, 0.01, 0.1, 1.0]

Then I plot validation MSE versus number of non‑zero weights. The sweet spot is where error stops improving but sparsity increases.

Here’s a short snippet to explore that trade‑off:

lambdas = [0.001, 0.01, 0.1, 1.0]
results = []

for lam in lambdas:
    model = LassoFromScratch(learning_rate=0.05, iterations=2000, l1_penalty=lam)
    model.fit(X_train_scaled, y_train)
    y_val_pred = model.predict(X_test_scaled)
    mse = np.mean((y_test - y_val_pred) ** 2)
    nonzero = np.sum(np.abs(model.w) > 1e-6)
    results.append((lam, mse, nonzero))

for lam, mse, nonzero in results:
    print(f"lambda={lam}, MSE={mse:.2f}, nonzero={nonzero}")

If you scale features, the learning rate can be fairly stable. I often start around 0.01–0.05 and adjust if the cost oscillates. If cost increases, you should lower the learning rate. If it decreases too slowly, raise it slightly.

Common Mistakes I See (and How to Avoid Them)

I’ve reviewed a lot of Lasso implementations over the years. Here are the failures that keep popping up:

  • Not scaling features. L1 penalty assumes features are comparable. If they’re not, you’ll penalize the wrong ones.
  • Penalizing the bias term. Don’t do it. It shifts the whole line downward without improving sparsity.
  • Ignoring convergence diagnostics. If the cost stays flat, your learning rate may be too small or your iterations too few.
  • Using too large a learning rate. L1 can cause jitter around zero; too large and weights bounce instead of settling.
  • Confusing Lasso with Ridge. Ridge uses L2 and does not create true zeros.

A quick sanity check I run: set \(\lambda = 0\) and verify the model behaves like plain linear regression. Then set \(\lambda\) very large and verify that weights shrink toward zero. If those tests fail, something is wrong.
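That two-ended check can be scripted. Here is a self-contained sketch with synthetic data; the throwaway `lasso_gd` fitter is illustrative, not the class from earlier:

```python
import numpy as np

def lasso_gd(X, y, lam, lr=0.005, iters=8000):
    """Minimal gradient-descent Lasso used only for this sanity check."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(iters):
        r = y - (X @ w + b)
        w -= lr * ((-2 / m) * (X.T @ r) + lam * np.sign(w))
        b -= lr * ((-2 / m) * np.sum(r))
    return w, b

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 4))
true_w = np.array([1.5, -2.0, 0.0, 0.0])
y = X @ true_w + 0.1 * rng.standard_normal(200)

# lambda = 0 should recover ordinary least squares
w0, _ = lasso_gd(X, y, lam=0.0)
w_ols, *_ = np.linalg.lstsq(np.c_[X, np.ones(len(y))], y, rcond=None)

# a large lambda should crush every weight toward zero
w_big, _ = lasso_gd(X, y, lam=10.0)

print("lambda=0 close to OLS:", np.allclose(w0, w_ols[:4], atol=1e-2))
print("max |w| at lambda=10:", np.abs(w_big).max())
```

The subgradient keeps a little jitter around zero at large \(\lambda\), so check that weights are small rather than exactly zero.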

When You Should and Shouldn’t Use Lasso

I’ll be direct about this because it saves time.

Use Lasso when:

  • You need interpretability and feature selection in one model.
  • You suspect many features are irrelevant or redundant.
  • You want a linear baseline that is robust to overfitting.

Avoid Lasso when:

  • You expect strong non‑linear relationships.
  • All features are known to be important and correlated (Lasso may drop the wrong ones).
  • You need the highest possible accuracy and can afford more complex models.

If you’re unsure, I recommend training both Lasso and Ridge and comparing. In my experience, Lasso tends to shine when you have too many features and limited training data.

A Quick Comparison: Traditional vs Modern Workflow

I still train Lasso from scratch in code reviews and prototypes, but I also integrate with modern ML tooling. Here’s how I frame it for teams.

Approach | Traditional Workflow | Modern Workflow (2026‑ready)
Training loop | Hand‑written gradient descent | Auto‑generated with AI helpers, then audited
Hyperparameter search | Manual trial and error | Lightweight sweeps with structured tracking
Diagnostics | Print statements and plots | Auto‑logged metrics and reproducible runs
Deployment | Copy model coefficients by hand | Exported as artifacts with versioned metadata

I still recommend building the core loop by hand at least once. It keeps your mental model sharp, and it makes it much easier to debug odd behavior when you later use automated tools.

Validating Against a Reference Model

Even when I write from scratch, I validate. That keeps me honest and makes debugging faster. Here’s how I compare my model to a trusted library implementation. It’s not about matching exactly—L1 can be tricky—but the direction should align.

from sklearn.linear_model import Lasso

sk_lasso = Lasso(alpha=0.1, max_iter=5000)
sk_lasso.fit(X_train_scaled, y_train)

print("Scratch weights:", lasso.w)
print("Sklearn weights:", sk_lasso.coef_)
print("Scratch bias:", lasso.b)
print("Sklearn bias:", sk_lasso.intercept_)

If the values are wildly different, I recheck scaling, learning rate, and gradient implementation. I also check whether my loss uses mean squared error and whether I’m scaling the penalty the same way the library does. Different conventions for \(\lambda\) vs \(\alpha\) can lead to confusion, so I always log them clearly.

Performance Considerations You’ll Actually Feel

For a single feature like years of experience, performance is trivial. But Lasso often appears when you have hundreds or thousands of features. Here are the performance patterns I see most often:

  • Gradient descent typically runs in tens to hundreds of milliseconds for a few thousand samples and a few hundred features.
  • Coordinate descent can be faster and more stable for sparse solutions, often finishing in tens of milliseconds for mid‑sized data.
  • Scaling with a standardizer is cheap but should be included in the pipeline so training and inference use the same transform.

If you’re building a production service, I recommend exporting the scaler mean/variance along with weights. I store them in the same artifact so the inference path is consistent.

Edge Cases and Real‑World Scenarios

Here are situations where Lasso behaves differently than people expect:

  • Highly correlated features: Lasso tends to pick one and drop the rest, even if multiple are useful. That can be fine for interpretability but risky if you want stable feature importance across re‑trains.
  • Rare binary flags: If a feature is rarely active, its coefficient can become unstable and drop to zero unless it has a strong effect. I sometimes combine rare flags or set a minimum frequency threshold before fitting.
  • Large differences in scale: If one feature is measured in millions and another in fractions, the L1 penalty will effectively ignore the small one and crush the big one. Standardization avoids this.
  • Tiny datasets: With small m and large n, Lasso can over‑prune. Use cross‑validation and consider elastic net in these cases.
  • All features correlated with the target: If every feature carries signal, Lasso might throw away useful information. That’s when I compare it to Ridge or Elastic Net.

I’ll unpack each of those more deeply in the next sections and show how I handle them.

A More Complete “From Scratch” Class With Diagnostics

In real use, I want more than just fit and predict. I want convergence diagnostics, early stopping, and a simple way to inspect sparsity. Here’s a fuller version of the model with a few practical utilities. I still keep it readable, but I add the features I actually use in a notebook or a small project.

import numpy as np

class LassoScratchWithDiagnostics:
    def __init__(self, learning_rate=0.01, iterations=3000, l1_penalty=0.1,
                 tol=1e-5, patience=20, verbose=False):
        self.learning_rate = learning_rate
        self.iterations = iterations
        self.l1_penalty = l1_penalty
        self.tol = tol
        self.patience = patience
        self.verbose = verbose
        self.w = None
        self.b = 0.0
        self.cost_history = []
        self.nonzero_history = []

    def compute_cost(self, X, y):
        m = X.shape[0]
        y_pred = X @ self.w + self.b
        mse = (1 / m) * np.sum((y - y_pred) ** 2)
        l1 = self.l1_penalty * np.sum(np.abs(self.w))
        return mse + l1

    def fit(self, X, y):
        m, n = X.shape
        self.w = np.zeros(n)
        self.b = np.mean(y)
        best_cost = float("inf")
        no_improve = 0
        for i in range(self.iterations):
            y_pred = X @ self.w + self.b
            error = y - y_pred
            dw = (-2 / m) * (X.T @ error)
            db = (-2 / m) * np.sum(error)
            dw += self.l1_penalty * np.sign(self.w)
            self.w -= self.learning_rate * dw
            self.b -= self.learning_rate * db
            if i % 50 == 0:
                cost = self.compute_cost(X, y)
                self.cost_history.append(cost)
                self.nonzero_history.append(np.sum(np.abs(self.w) > 1e-8))
                if self.verbose:
                    print(f"Iter {i}: cost={cost:.4f}, nonzero={self.nonzero_history[-1]}")
                if best_cost - cost > self.tol:
                    best_cost = cost
                    no_improve = 0
                else:
                    no_improve += 1
                if no_improve >= self.patience:
                    if self.verbose:
                        print("Early stopping: no improvement")
                    break
        return self

    def predict(self, X):
        return X @ self.w + self.b

    def sparsity(self, eps=1e-8):
        return np.mean(np.abs(self.w) <= eps)

This is the version I hand to junior teammates. It reinforces good habits like early stopping and basic introspection without overwhelming them. The sparsity method is useful when you’re picking \(\lambda\) values and want a quick measure of how aggressively the model is pruning.

A Simple Synthetic Data Test (My Favorite Sanity Check)

I’m a big fan of synthetic tests because they give you known ground truth. I use this one to verify Lasso zeroes out irrelevant features while keeping the true ones.

import numpy as np

np.random.seed(7)
m = 200
n = 20
X = np.random.randn(m, n)
true_w = np.zeros(n)
true_w[:3] = np.array([2.5, -1.7, 3.2])
noise = np.random.randn(m) * 0.5
y = X @ true_w + noise

model = LassoScratchWithDiagnostics(
    learning_rate=0.05,
    iterations=3000,
    l1_penalty=0.1,
    verbose=True
)
model.fit(X, y)

print("True nonzero indices:", np.where(true_w != 0)[0])
print("Learned nonzero indices:", np.where(np.abs(model.w) > 0.1)[0])

When this works, the learned indices align with the first three features, and most of the others stay close to zero. It’s the cleanest possible proof that your implementation is behaving like Lasso, not just “a slightly off linear regression.”

Edge Case Deep Dive: Correlated Features

One of the most common surprises with Lasso is what it does to correlated predictors. If two features are basically telling the same story, Lasso often keeps one and drops the other. That’s good for sparsity, but it can be unstable across different splits of the data.

Here’s how I handle this in practice:

  • I run multiple train/validation splits and track how often each feature is selected.
  • I group correlated features and consider combining them (e.g., averaging or taking the first principal component).
  • If stability matters, I compare Lasso to Elastic Net, which combines L1 and L2 and tends to keep correlated groups together.

This is also a good place to educate stakeholders: Lasso is performing feature selection, not truth discovery. If two features are correlated, either could appear “important” depending on the split.
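Measuring that instability is straightforward: refit across many splits and count how often each feature survives. Here is a synthetic sketch using scikit-learn’s Lasso as the fitter; the data, `alpha`, and run count are all illustrative:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
m = 300
base = rng.standard_normal(m)
# features 0 and 1 are near-duplicates; feature 2 is an independent signal
X = np.column_stack([
    base,
    base + 0.05 * rng.standard_normal(m),
    rng.standard_normal(m),
])
y = 2.0 * base + 1.0 * X[:, 2] + 0.3 * rng.standard_normal(m)

n_runs = 20
selected = np.zeros(X.shape[1])
for seed in range(n_runs):
    X_tr, _, y_tr, _ = train_test_split(X, y, test_size=0.3, random_state=seed)
    coef = Lasso(alpha=0.1).fit(X_tr, y_tr).coef_
    selected += np.abs(coef) > 1e-6

print("selection rate per feature:", selected / n_runs)
```

The independent feature should be selected on every split, while the weight on the two near-duplicates can shift between them from split to split.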

Edge Case Deep Dive: Highly Skewed Features

Another real‑world problem is features with extreme skew, like incomes or transaction volumes. Even after standardization, a heavy‑tailed distribution can cause large residuals and unstable gradient updates.

My approach:

  • Apply a log transform or a robust scaler before standardization.
  • Clip extreme outliers when it’s reasonable and documented.
  • Use a smaller learning rate for stability when outliers remain.

If I’m working in a regulated domain, I make sure any transformation is explicitly documented, because transformations can affect interpretability.
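The log-transform step is easy to demonstrate. This sketch uses a simulated heavy-tailed feature; the distribution parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
# simulated heavy-tailed feature, e.g. transaction volume
raw = rng.lognormal(mean=10.0, sigma=2.0, size=1000)

# z-scoring the raw values leaves the extreme outliers extreme
z_raw = (raw - raw.mean()) / raw.std()

# log1p first, then standardize: the tail compresses dramatically
logged = np.log1p(raw)
z_log = (logged - logged.mean()) / logged.std()

print("max |z| on raw values:", np.abs(z_raw).max())
print("max |z| after log1p:", np.abs(z_log).max())
```

After the transform, the largest standardized value drops from double digits to something in normal-looking territory, which keeps gradient updates stable.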

A Practical Feature Scaling Workflow

I can’t emphasize this enough: consistent scaling is the difference between a working model and a misleading one. Here’s the small “pipeline” I use, even for scratch implementations:

  • Fit a scaler only on training data.
  • Transform training and test data using that scaler.
  • Store the scaler mean and variance alongside the model.

For deployment, I serialize the scaler parameters and apply them before computing \(Xw + b\). If you skip this step, your production model will behave differently than your training model—often dramatically.
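As a sketch, the serialize-and-reapply step might look like this; the artifact layout and the `w`/`b` values here are placeholders, not a real trained model:

```python
import json
import numpy as np

rng = np.random.default_rng(3)
X_train = rng.normal(loc=50.0, scale=12.0, size=(100, 2))
X_new = rng.normal(loc=50.0, scale=12.0, size=(5, 2))

# fit the scaler statistics on training data only
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

# store scaler stats next to the coefficients in one artifact
artifact = json.dumps({
    "mu": mu.tolist(),
    "sigma": sigma.tolist(),
    "w": [0.8, -0.3],
    "b": 42.0,
})

# inference path: reload, apply the SAME transform, then compute Xw + b
loaded = json.loads(artifact)
X_scaled = (X_new - np.array(loaded["mu"])) / np.array(loaded["sigma"])
preds = X_scaled @ np.array(loaded["w"]) + loaded["b"]
print("predictions:", preds)
```

Keeping the scaler stats and the coefficients in one blob makes it impossible to deploy one without the other.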

Interpreting Coefficients Without Lying to Yourself

When Lasso produces sparse coefficients, it’s tempting to treat them as definitive proof of feature importance. That’s not always fair. Here’s how I interpret them responsibly:

  • Magnitude matters only after scaling. A large coefficient on an unscaled feature is not the same as a large coefficient on a standardized feature.
  • Zero is meaningful, but small non‑zero values can still be noise. I typically use a threshold like \(10^{-6}\) to define “effectively zero.”
  • Interpretability depends on how features were engineered. If the features are highly abstract (like embeddings), sparsity is less meaningful.

When I present results, I often show three things: coefficient magnitude, stability across folds, and the direction (positive/negative). That paints a much more honest picture.

A Deeper Look at Convergence

Lasso’s optimization can be tricky because of the kink at zero. If your learning rate is too high, weights can bounce around zero and never settle. If it’s too low, you’ll spend forever approaching the optimum. Here’s how I monitor convergence:

  • Cost curve: It should drop quickly early on, then taper.
  • Weight change: I track the norm of the weight update every few iterations.
  • Nonzero count: The number of active features should stabilize after a while.

If the cost oscillates, I reduce the learning rate. If it’s flat and high, I increase it slightly or increase iterations. If non‑zero counts keep changing late in training, I consider switching to coordinate descent.

A More Robust Learning Rate Schedule

Sometimes I prefer a schedule instead of a fixed learning rate. It reduces the “jitter near zero” problem. Here’s a simple schedule you can plug in:

# inside the fit loop, replacing the fixed-rate updates
lr = self.learning_rate / (1 + 0.01 * i)
self.w -= lr * dw
self.b -= lr * db

This decays the learning rate over time and often makes training smoother. It’s not mandatory, but it’s a nice upgrade if your cost curve is noisy.

Alternative Approaches to Lasso From Scratch

There are a few variations worth knowing, even if you stick to the basic implementation:

1) Proximal Gradient Descent

This combines a standard gradient step for the MSE with a soft‑thresholding step for the L1 penalty. It’s clean and often converges faster than raw subgradients.

Outline:

  • Take a gradient step on the MSE part.
  • Apply soft‑thresholding to the weights.

It looks like this conceptually:

\[
w \leftarrow S\left(w - \alpha \nabla_{w} \text{MSE}, \alpha \lambda\right)
\]

If I’m building a production‑grade custom solver, this is usually my go‑to.
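A minimal proximal gradient sketch, under the same objective as earlier (mean squared error plus \(\lambda \|w\|_1\)); the data and hyperparameters are illustrative:

```python
import numpy as np

def soft_threshold(z, t):
    # elementwise S(z, t) = sign(z) * max(|z| - t, 0)
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_proximal(X, y, lam=0.5, lr=0.05, iters=2000):
    """Gradient step on the MSE term, then soft-threshold the weights."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(iters):
        r = y - (X @ w + b)
        w = soft_threshold(w + lr * (2 / m) * (X.T @ r), lr * lam)
        b += lr * (2 / m) * np.sum(r)  # the bias is never thresholded
    return w, b

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
true_w = np.zeros(10)
true_w[:2] = [3.0, -2.0]
y = X @ true_w + 0.2 * rng.standard_normal(200)

w, b = lasso_proximal(X, y)
print("nonzero indices:", np.flatnonzero(np.abs(w) > 1e-8))
```

Unlike the raw subgradient loop, the thresholding step produces exact zeros on the irrelevant features rather than small oscillating values.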

2) Elastic Net (Hybrid L1 + L2)

Elastic Net is not Lasso, but it’s the most common alternative when Lasso is too aggressive. It adds a small L2 term so correlated features are less likely to be dropped.

Use it when:

  • You have many correlated predictors.
  • You want sparsity but don’t want “winner‑takes‑all.”

3) Orthogonal Matching Pursuit (OMP)

OMP is a greedy selection algorithm that picks one feature at a time. It can be faster for very sparse solutions, but it’s not the same objective as Lasso.

I mention it because teams often confuse “sparse regression” methods. Lasso is convex and has strong theoretical guarantees; OMP is not, but it can be a practical alternative for certain problems.

Practical Scenario: Salary Prediction With Many Features

In the earlier example, I used only years of experience, which is too simple. Real salary datasets include education level, role, location, company size, skills, and often interaction terms. That easily becomes a wide dataset with hundreds of columns once you one‑hot encode everything.

Here’s how I handle it:

  • I one‑hot encode categorical variables.
  • I standardize all numeric columns.
  • I set \(\lambda\) via validation, targeting both MSE and sparsity.
  • I inspect which groups of features survive.

The result is usually a model that keeps 5–15 strong predictors and trims the long tail of weak indicators. That’s exactly what I want when I need explainability and auditability.

Practical Scenario: Marketing Mix Modeling

Lasso is also useful when you have a lot of channel variables (search, social, email, affiliate, etc.) and you suspect many are redundant or noisy. The goal is often interpretability rather than perfect accuracy.

With Lasso, I can quickly identify which channels are consistently contributing. But I also cross‑validate heavily because marketing data is seasonal and unstable. I prefer to report the features that survive across folds, not just a single split.

Practical Scenario: Sensor or IoT Data

In sensor systems, you might have hundreds or thousands of signals. Some are redundant, some are broken, some are irrelevant. Lasso can be a first‑pass filter that prunes the sensor set before a more complex model is applied.

The workflow I use:

  • Fit Lasso on a standardized dataset.
  • Keep the non‑zero features.
  • Retrain a more flexible model (if needed) on the reduced set.

This saves time and reduces model complexity downstream.
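The prune-then-refit workflow can be sketched with scikit-learn pieces; the synthetic “sensor” data and `alpha` are illustrative:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(9)
m, n = 400, 50
X = rng.standard_normal((m, n))
true_w = np.zeros(n)
true_w[:5] = [3.0, -2.0, 1.5, 2.5, -1.0]   # only 5 of 50 "sensors" carry signal
y = X @ true_w + 0.5 * rng.standard_normal(m)

X_scaled = StandardScaler().fit_transform(X)

# pass 1: Lasso as a filter over the full sensor set
lasso = Lasso(alpha=0.1).fit(X_scaled, y)
keep = np.flatnonzero(np.abs(lasso.coef_) > 1e-6)
print(f"kept {len(keep)} of {n} sensors")

# pass 2: refit an unpenalized model on the surviving sensors only
final = LinearRegression().fit(X_scaled[:, keep], y)
```

The second pass removes the L1 shrinkage bias from the surviving coefficients, which matters if you report them downstream.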

A Note on Bias and Intercept Handling

People often forget that the intercept should not be penalized. It’s easy to accidentally include it if you’re stacking a column of ones into the feature matrix. In my own code, I keep the bias separate and update it without any penalty. That avoids subtle distortions in the predicted baseline.

If you choose to include a column of ones, be explicit about excluding it from the L1 penalty. Otherwise, your model can shift the intercept to compensate and the coefficients become misleading.
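One way to keep the column-of-ones variant honest is a penalty mask that zeroes the L1 subgradient for the intercept column. A sketch on synthetic data (the mask layout and hyperparameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
m = 150
X = rng.standard_normal((m, 3))
y = X @ np.array([1.0, 0.0, -2.0]) + 5.0 + 0.1 * rng.standard_normal(m)

# stack a column of ones so the intercept rides along with the weights
Xb = np.c_[np.ones(m), X]
lam, lr = 0.2, 0.05
w = np.zeros(4)

# the mask removes the L1 subgradient for the intercept column only
penalty_mask = np.array([0.0, 1.0, 1.0, 1.0])
for _ in range(4000):
    r = y - Xb @ w
    grad = (-2 / m) * (Xb.T @ r) + lam * penalty_mask * np.sign(w)
    w -= lr * grad

print("intercept:", w[0])
print("weights:", w[1:])
```

With the mask in place, the learned intercept sits near the true offset while the irrelevant feature is still driven toward zero.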

Regularization Strength: Practical Heuristics

Beyond grid search, I use a couple of quick heuristics:

  • If the number of features is much larger than samples, I start with a higher \(\lambda\).
  • If I care more about interpretability than accuracy, I move \(\lambda\) upward.
  • If validation error jumps quickly with small increases in \(\lambda\), I back off and consider a smaller learning rate or elastic net.

When I’m under time pressure, I’ll do a simple log‑scale sweep, pick the best two, and refine with a smaller search around those values.

Debugging Checklist I Actually Use

When Lasso behaves oddly, I go through this checklist:

  • Are features scaled using training data only?
  • Is the bias excluded from the penalty?
  • Is the loss function MSE + L1, not MSE + L1 on bias?
  • Is the learning rate too high (oscillations) or too low (flat cost)?
  • Do I see expected behavior at \(\lambda=0\) and large \(\lambda\)?

Most errors show up on this list, and it’s faster than guessing.

Production Considerations: From Notebook to Pipeline

In a production pipeline, I treat Lasso like any other model artifact:

  • Save weights and bias in a structured file (JSON or a simple binary).
  • Save the scaler mean/variance and apply it in inference.
  • Version the model and include training metadata: data window, \(\lambda\), learning rate, and validation metrics.
  • Monitor the distribution of input features over time to catch drift.

Lasso is lightweight, so it’s easy to deploy. The bigger risk is data drift: if feature distributions change, coefficients can become meaningless even if the model still runs.

Monitoring and Model Drift

Because Lasso is sparse, monitoring becomes simpler. I track:

  • Mean and variance of each non‑zero feature over time.
  • Prediction error on holdout or delayed feedback.
  • Stability of coefficients across retraining cycles.

If the set of non‑zero features changes drastically between versions, that’s a red flag. It usually means data shift or instability, and I want to investigate before I trust the new model.

A Small Comparison: Lasso vs Ridge vs Elastic Net

Here’s the short version I give teammates:

Model | Penalty | Feature Selection | Best Use Case
Ridge | L2 | No | Many correlated features, want stable coefficients
Lasso | L1 | Yes | High‑dimensional data, need interpretability
Elastic Net | L1 + L2 | Sometimes | Correlated features but still want sparsity

When I’m unsure, I train all three and compare validation error and sparsity. That takes an extra few minutes and often saves a bad decision.

A More Realistic End‑to‑End Example

Here’s a slightly expanded workflow that includes feature engineering, scaling, training, and evaluation. It’s still readable, but it reflects the steps I use in actual projects:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load and prepare data
data = pd.read_csv("salary_data.csv")

# Example preprocessing
categorical_cols = ["Role", "Location", "Education"]
numeric_cols = ["YearsExperience", "CompanySize", "Projects"]

# One-hot encode categoricals
X_cat = pd.get_dummies(data[categorical_cols], drop_first=True)
X_num = data[numeric_cols]
X = pd.concat([X_num, X_cat], axis=1).values
y = data["Salary"].values

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale numeric + one-hot (safe since one-hot are 0/1)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train Lasso
model = LassoScratchWithDiagnostics(
    learning_rate=0.03,
    iterations=3000,
    l1_penalty=0.05,
    verbose=True
)
model.fit(X_train_scaled, y_train)

# Evaluate
preds = model.predict(X_test_scaled)
mse = np.mean((y_test - preds) ** 2)
nonzero = np.sum(np.abs(model.w) > 1e-6)
print(f"Test MSE: {mse:.2f}")
print(f"Non-zero coefficients: {nonzero}")

This is closer to what a real production feature set looks like. The Lasso model is still linear, but it gives you a strong baseline and a clear view into which variables matter.

Practical Advice on Choosing Between Gradient and Coordinate Descent

If you’re just learning or teaching, gradient descent is easier to explain. If you’re building something for repeated use, coordinate descent is usually the better choice for Lasso. It converges faster and gives you more stable sparsity.

A rule of thumb I use:

  • Use gradient descent for clarity and small datasets.
  • Use coordinate descent for speed, stability, and large feature sets.

If you want the best of both, proximal gradient is a nice compromise.

A Brief Note on Numerical Stability

When \(\lambda\) is large, weights can collapse to zero quickly. If you have very small learning rates or very large features, you can end up with numerical underflow or “stuck” weights. That’s another reason scaling is essential.

I also avoid extremely tiny thresholds for non‑zero checks. I usually use \(10^{-6}\) or \(10^{-8}\) depending on the scale of the data. The goal is practical sparsity, not theoretical purity.

A Practical Checklist Before You Ship

Here’s the checklist I keep near my desk when I ship a Lasso model:

  • Features are standardized using training data only.
  • Bias is not penalized.
  • Hyperparameter \(\lambda\) was tuned on validation data.
  • Sparsity and accuracy trade‑off was reviewed and documented.
  • Reference model comparison performed to validate correctness.
  • Model artifact includes scaler stats and training metadata.
  • Monitoring set up for drift on non‑zero features.

If I can check all of these, I’m comfortable shipping.

Final Thoughts

Building Lasso from scratch is more than a math exercise. It’s a way to internalize why sparsity works, how regularization changes behavior, and what “interpretability” really means in a production context. Once you understand the mechanics, you can use Lasso confidently as a baseline, a feature selector, or a transparent model in its own right.

I keep Lasso in my toolbox because it scales well, it’s easy to explain, and it solves real problems in messy, high‑dimensional data. If you’re working with a wide dataset and you need a model that’s both honest and efficient, Lasso is still one of the smartest starting points.

If you want to go further, try these next steps:

  • Implement proximal gradient descent and compare convergence.
  • Add cross‑validation to automate \(\lambda\) selection.
  • Extend to elastic net and test stability on correlated features.
  • Package the model with a simple prediction API for deployment.

That’s the full arc: from scratch implementation to a model you can trust in the real world.
