Regression Analysis and the Best Fitting Line using Python

In this tutorial, we will implement regression analysis and find the best-fitting line using Python. We'll explore linear regression concepts and demonstrate practical implementation with scikit-learn.

What is Regression Analysis?

Regression analysis is a statistical method for modeling relationships between variables. Linear regression specifically models the relationship between a dependent variable (target) and one or more independent variables using a linear equation.

In machine learning, linear regression is a supervised algorithm that predicts continuous target values like salary, temperature, or stock prices based on input features.

Linear Regression Equation

The linear regression equation follows the form:

Y = c + mx

Where:

  • Y = target variable (dependent)
  • x = independent variable (feature)
  • m = slope of the line
  • c = y-intercept

The algorithm finds the best-fitting line by minimizing the residual errors, that is, the vertical distances between the actual data points and the predicted values.
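To make this concrete, here is a minimal sketch of that minimization done by hand, using the closed-form least-squares formulas m = cov(x, y) / var(x) and c = mean(y) − m·mean(x). The small dataset below is illustrative only (it is not the tutorial's data):

```python
import numpy as np

# Illustrative data, roughly following y = 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Closed-form least-squares estimates of slope and intercept
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
c = y.mean() - m * x.mean()

print(f"slope m = {m:.3f}, intercept c = {c:.3f}")  # slope m = 1.960, intercept c = 0.140
```

scikit-learn's LinearRegression computes the same quantities internally; the formulas above are just the one-feature special case.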

Understanding Residuals and RMSE

A residual is the difference between actual and predicted values:

Residual = actual y value − predicted y value

We measure model performance using Root Mean Squared Error (RMSE):

RMSE = √( Σ(yᵢ − ŷᵢ)² / n )

Lower RMSE indicates better model performance.
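The definition above can be computed directly from the residuals and cross-checked against scikit-learn. The values below are illustrative only:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Illustrative actual and predicted values
y_actual = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.5, 9.5])

# RMSE from the definition: square the residuals, average, take the root
residuals = y_actual - y_pred
rmse = np.sqrt(np.mean(residuals ** 2))
print(f"RMSE = {rmse:.4f}")  # RMSE = 0.5000, since every residual here is ±0.5

# Cross-check against scikit-learn's mean_squared_error
assert np.isclose(rmse, np.sqrt(mean_squared_error(y_actual, y_pred)))
```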

Implementation using Python

Complete Linear Regression Example

# Import the libraries
import numpy as np
import math
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Generate random data with numpy
ranstate = np.random.RandomState(1)
x = 10 * ranstate.rand(100)
y = 2 * x - 5 + ranstate.randn(100)

# Display the scatter plot
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.scatter(x, y, alpha=0.6)
plt.xlabel('X values')
plt.ylabel('Y values')
plt.title('Original Data Points')

# Create and train the linear regression model
lr_model = LinearRegression(fit_intercept=True)
lr_model.fit(x[:70, np.newaxis], y[:70])

# Make predictions on test data
y_fit = lr_model.predict(x[70:, np.newaxis])

# Calculate performance metrics
mse = mean_squared_error(y[70:], y_fit)
rmse = math.sqrt(mse)

print("Model Performance:")
print(f"Mean Square Error: {mse:.4f}")
print(f"Root Mean Square Error: {rmse:.4f}")
print(f"Model Slope: {lr_model.coef_[0]:.4f}")
print(f"Model Intercept: {lr_model.intercept_:.4f}")

# Plot the regression line (sort the test x values first so the line draws
# cleanly; plotting over unsorted x would produce a jagged zigzag)
plt.subplot(1, 2, 2)
plt.scatter(x, y, alpha=0.6, label='Data points')
x_test_sorted = np.sort(x[70:])
plt.plot(x_test_sorted, lr_model.predict(x_test_sorted[:, np.newaxis]),
         color='red', linewidth=2, label='Best fit line')
plt.xlabel('X values')
plt.ylabel('Y values')
plt.title('Linear Regression Results')
plt.legend()
plt.tight_layout()
plt.show()
Output:

Model Performance:
Mean Square Error: 1.0860
Root Mean Square Error: 1.0421
Model Slope: 1.9684
Model Intercept: -4.9836

Key Components of the Implementation

The implementation demonstrates several important concepts:

Component                Purpose                      Code
Data Generation          Create synthetic dataset     ranstate.rand()
Train/Test Split         Split data for validation    x[:70] vs x[70:]
Model Training           Fit line to training data    lr_model.fit()
Performance Evaluation   Calculate RMSE               mean_squared_error()
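The tutorial splits the data by slicing (the first 70 points for training, the last 30 for testing). As a sketch of an alternative, scikit-learn's train_test_split does the same job and shuffles the data for you; the parameter choices below (test_size=0.3, random_state=1) are assumptions matching the tutorial's 70/30 split:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Same synthetic data as in the tutorial
rng = np.random.RandomState(1)
x = 10 * rng.rand(100)
y = 2 * x - 5 + rng.randn(100)

# Shuffled 70/30 split instead of positional slicing
x_train, x_test, y_train, y_test = train_test_split(
    x[:, np.newaxis], y, test_size=0.3, random_state=1)

model = LinearRegression().fit(x_train, y_train)
print(f"slope: {model.coef_[0]:.3f}, intercept: {model.intercept_:.3f}")
```

Shuffling matters when the data has any ordering (by time, by magnitude, etc.); here the data is already random, so both approaches give similar results.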

Interpreting the Results

The model successfully learns the underlying relationship in our synthetic data. The original equation was y = 2x - 5, and the fitted parameters (slope ≈ 1.97, intercept ≈ -4.98) are close to these true values despite the added noise.

The RMSE of approximately 1.04 is close to the best achievable here: the noise we added came from a standard normal distribution (standard deviation 1), so even a perfect model would show an RMSE near 1 on this data.

Conclusion

Linear regression provides a foundation for understanding relationships between variables. With Python's scikit-learn, implementing regression analysis becomes straightforward and powerful for predictive modeling tasks.

Updated on: 2026-03-26T22:52:20+05:30
