Regression Analysis and the Best Fitting Line using Python
In this tutorial, we will implement regression analysis and find the best-fitting line using Python. We'll explore linear regression concepts and demonstrate practical implementation with scikit-learn.
What is Regression Analysis?
Regression analysis is a statistical method for modeling relationships between variables. Linear regression specifically models the relationship between a dependent variable (target) and one or more independent variables using a linear equation.
In machine learning, linear regression is a supervised algorithm that predicts continuous target values like salary, temperature, or stock prices based on input features.
Linear Regression Equation
The linear regression equation follows the form:
Y = c + mx
Where:
- Y = target variable (dependent)
- x = independent variable (feature)
- m = slope of the line
- c = y-intercept
The algorithm finds the best-fitting line by minimizing the residual errors, that is, the vertical distances between the actual data points and the predicted values.
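As a quick illustration of the equation above, here is a minimal sketch; the function name and the example slope and intercept are hypothetical, not taken from the article:

```python
# Hypothetical line with slope m = 2 and intercept c = -5,
# matching the form Y = c + m*x.
def predict(x, m=2.0, c=-5.0):
    """Return the predicted y for input x on the line y = c + m*x."""
    return c + m * x

print(predict(3.0))  # 2*3 - 5 = 1.0
```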
Understanding Residuals and RMSE
A residual is the difference between actual and predicted values:
Residual = actual y value − predicted y value
We measure model performance using Root Mean Squared Error (RMSE):
RMSE = √(Σ(yᵢ − yᵢ′)² / n)
Lower RMSE indicates better model performance.
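The RMSE formula can be computed directly with NumPy; a small sketch with hypothetical actual and predicted values (not from the article's dataset):

```python
import numpy as np

# Hypothetical actual and predicted values to illustrate the RMSE formula.
y_actual = np.array([3.0, -1.0, 2.0, 7.0])
y_pred = np.array([2.5, -0.5, 2.0, 8.0])

residuals = y_actual - y_pred            # actual minus predicted
rmse = np.sqrt(np.mean(residuals ** 2))  # square root of the mean squared residual
print(f"RMSE: {rmse:.4f}")
```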
Implementation using Python
Complete Linear Regression Example
# Import the libraries
import numpy as np
import math
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Generate random data with numpy
ranstate = np.random.RandomState(1)
x = 10 * ranstate.rand(100)
y = 2 * x - 5 + ranstate.randn(100)
# Display the scatter plot
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.scatter(x, y, alpha=0.6)
plt.xlabel('X values')
plt.ylabel('Y values')
plt.title('Original Data Points')
# Create and train the linear regression model
lr_model = LinearRegression(fit_intercept=True)
lr_model.fit(x[:70, np.newaxis], y[:70])
# Make predictions on test data
y_fit = lr_model.predict(x[70:, np.newaxis])
# Calculate performance metrics
mse = mean_squared_error(y[70:], y_fit)
rmse = math.sqrt(mse)
print("Model Performance:")
print(f"Mean Square Error: {mse:.4f}")
print(f"Root Mean Square Error: {rmse:.4f}")
print(f"Model Slope: {lr_model.coef_[0]:.4f}")
print(f"Model Intercept: {lr_model.intercept_:.4f}")
# Plot the regression line over an evenly spaced range of x values
# (plotting against the unsorted test x values would draw a jagged line)
plt.subplot(1, 2, 2)
plt.scatter(x, y, alpha=0.6, label='Data points')
x_line = np.linspace(x.min(), x.max(), 100)
plt.plot(x_line, lr_model.predict(x_line[:, np.newaxis]),
         color='red', linewidth=2, label='Best fit line')
plt.xlabel('X values')
plt.ylabel('Y values')
plt.title('Linear Regression Results')
plt.legend()
plt.tight_layout()
plt.show()
Model Performance:
Mean Square Error: 1.0860
Root Mean Square Error: 1.0421
Model Slope: 1.9684
Model Intercept: -4.9836
Key Components of the Implementation
The implementation demonstrates several important concepts:
| Component | Purpose | Code |
|---|---|---|
| Data Generation | Create synthetic dataset | ranstate.rand() |
| Train/Test Split | Split data for validation | x[:70] vs x[70:] |
| Model Training | Fit line to training data | lr_model.fit() |
| Performance Evaluation | Calculate RMSE | mean_squared_error() |
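The manual slice above works for this ordered synthetic data, but scikit-learn also provides train_test_split, which shuffles before splitting. A minimal sketch of the same workflow with it; the test_size and random_state values are illustrative choices:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Same synthetic data as in the tutorial: y = 2x - 5 plus noise.
rng = np.random.RandomState(1)
x = 10 * rng.rand(100)
y = 2 * x - 5 + rng.randn(100)

# Shuffle and split: 70% training, 30% testing.
x_train, x_test, y_train, y_test = train_test_split(
    x[:, np.newaxis], y, test_size=0.3, random_state=1)

model = LinearRegression()
model.fit(x_train, y_train)
print(f"Slope: {model.coef_[0]:.2f}, Intercept: {model.intercept_:.2f}")
```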
Interpreting the Results
The model successfully learns the underlying relationship in our synthetic data. The original equation was y = 2x - 5, and our model discovered parameters close to these true values despite the added noise.
The RMSE of approximately 1.04 indicates good model performance, considering the random noise we added to the data.
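Once fitted, the model can also score new, unseen inputs. A self-contained sketch that regenerates the same synthetic data and predicts at arbitrary x values (the chosen inputs are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Regenerate the tutorial's synthetic data: y = 2x - 5 plus noise.
rng = np.random.RandomState(1)
x = 10 * rng.rand(100)
y = 2 * x - 5 + rng.randn(100)

model = LinearRegression().fit(x[:, np.newaxis], y)

# Predict y for new x values; true values would be about -1, 5, and 11.
x_new = np.array([[2.0], [5.0], [8.0]])
preds = model.predict(x_new)
print(preds)
```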
Conclusion
Linear regression provides a foundation for understanding relationships between variables. With Python's scikit-learn, implementing regression analysis becomes straightforward and powerful for predictive modeling tasks.
