Multiple linear regression analysis | Towards Data Science

The complete code for this example is available at the bottom of this article.
Use multiple linear regression when your response variable y is continuous and you have k covariates (independent variables) that are linearly related to it. The data take the form:
(y₁, x₁),…, (yᵢ, xᵢ),…, (yₙ, xₙ)
where xᵢ = (xᵢ₁,…, xᵢₖ) is the vector of the k covariate values for the i-th observation and n is the number of observations.
Understand the data
To make this concrete, imagine the following:
You like to run and you track your performance by recording the distance you run every day. For 100 consecutive days, you collected four pieces of information:
- The distance you ran,
- The number of hours you ran,
- The number of hours you slept the night before,
- And the number of hours you worked that day.
Now, on day 101, you recorded everything except the distance you ran. You want to use the information you do have (the number of hours you ran, the number of hours you slept the night before, and the number of hours you worked that day) to estimate the missing value.
To do this, you can rely on the data from the first 100 days, which form the table:
(y₁, x₁),…, (yᵢ, xᵢ),…, (y₁₀₀, x₁₀₀)
Here, each yᵢ is the distance you ran on day i, and each covariate vector xᵢ=(xᵢ₁, xᵢ₂, xᵢ₃) corresponds to:
- xᵢ₁: the number of hours you ran,
- xᵢ₂: the number of hours you slept the night before,
- xᵢ₃: the number of hours you worked that day.
The index i = 1,…,100 refers to the 100 days with complete data. With this dataset, you can now fit a multiple linear regression model to estimate the missing response variable on day 101.
Model specifications
If we assume a linear relationship between the response variable and the covariates (which can be assessed, for example, with the Pearson correlation), we can specify the model as:
yᵢ = β₁xᵢ₁ + … + βₖxᵢₖ + ϵᵢ
for i = 1,…,n, where E(ϵᵢ | xᵢ₁,…,xᵢₖ) = 0. To take the intercept into account, the first covariate is set to xᵢ₁ = 1 for i = 1,…,n. To estimate the coefficients, the model is written in matrix notation.

The response values, coefficients, and error terms are stacked into the vectors y = (y₁,…,yₙ)ᵀ, β = (β₁,…,βₖ)ᵀ, and ϵ = (ϵ₁,…,ϵₙ)ᵀ, and the covariates are collected in the n × k matrix X whose i-th row is xᵢ = (xᵢ₁,…,xᵢₖ).
We can then rewrite the model as:
y = Xβ + ϵ
Estimation of coefficients
Assuming that the k × k matrix XᵀX is invertible, the least squares estimator takes the form:
β̂ = (XᵀX)⁻¹Xᵀy
We can then obtain the estimated regression function, an unbiased estimate of σ², and an approximate 1 − α confidence interval for each βⱼ (a short numerical sketch of these quantities follows the list below):
- Estimated regression function: r̂(x) = ∑ⱼ₌₁ᵏ β̂ⱼxⱼ
- σ̂² = (1 / (n − k)) × ∑ᵢ₌₁ⁿ ε̂ᵢ², where ε̂ = y − Xβ̂ is the vector of residuals.
- β̂ⱼ ± tₙ₋ₖ,₁₋α⁄₂ × ŝe(β̂ⱼ) is an approximate 1 − α confidence interval for βⱼ, where ŝe(β̂ⱼ)² is the j-th diagonal element of the matrix σ̂²(XᵀX)⁻¹.
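To make these formulas concrete, here is a minimal NumPy/SciPy sketch of the estimators above on simulated data; the array sizes and coefficient values are illustrative assumptions, not the article's dataset.
import numpy as np
from scipy import stats
rng = np.random.default_rng(0)
n, k = 100, 4  # k counts the intercept column as well
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])  # first column is the intercept
beta_true = np.array([2.0, 1.5, -0.5, 0.8])
y = X @ beta_true + rng.normal(scale=1.0, size=n)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # least squares estimate (XᵀX)⁻¹Xᵀy
resid = y - X @ beta_hat  # residual vector ε̂ = y - Xβ̂
sigma2_hat = resid @ resid / (n - k)  # unbiased estimate of σ²
se = np.sqrt(sigma2_hat * np.diag(np.linalg.inv(X.T @ X)))  # ŝe(β̂ⱼ)
t_crit = stats.t.ppf(0.975, df=n - k)  # t quantile for a 95% interval (α = 0.05)
print(np.column_stack([beta_hat, beta_hat - t_crit * se, beta_hat + t_crit * se]).round(3))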
Example of application
Since we did not actually record running-performance data, we will use a crime dataset covering 47 U.S. states in 1960, which can be obtained from here. There are several steps to follow before fitting the linear regression.
Understand the different variables of the data.
The first 9 observations of the data are given by:
R Age S Ed Ex0 Ex1 LF M N NW U1 U2 W X
79.1 151 1 91 58 56 510 950 33 301 108 41 394 261
163.5 143 0 113 103 95 583 1012 13 102 96 36 557 194
57.8 142 1 89 45 44 533 969 18 219 94 33 318 250
196.9 136 0 121 149 141 577 994 157 80 102 39 673 167
123.4 141 0 121 109 101 591 985 18 30 91 20 578 174
68.2 121 0 110 118 115 547 964 25 44 84 29 689 126
96.3 127 1 111 82 79 519 982 4 139 97 38 620 168
155.5 131 1 109 115 109 542 969 50 179 79 35 472 206
85.6 157 1 90 65 62 553 955 39 286 81 28 421 239
The data contain 14 variables: the continuous response variable R, 12 continuous predictors, and one categorical predictor S:
- R: Crime rate: number of offenses reported to the police
- Age: Number of males aged 14-24 per 1000 population
- S: Indicator variable for southern states (0 = No, 1 = Yes)
- Ed: Mean years of schooling (× 10) of the population aged 25 or older
- Ex0: Per capita expenditure on police protection by state and local governments in 1960
- Ex1: Per capita expenditure on police protection by state and local governments in 1959
- LF: Labor force participation rate per 1000 civilian urban males aged 14-24
- M: Number of males per 1000 females
- N: State population size in hundreds of thousands
- NW: Number of non-whites per 1000 population
- U1: Unemployment rate of urban males per 1000, aged 14-24
- U2: Unemployment rate of urban males per 1000, aged 35-39
- W: Median value of transferable goods and assets or family income
- X: Number of families per 1000 earning below half of the median income
There are no missing values in the data.
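A quick way to verify this, using the dataset path from the appendix code:
import pandas as pd
df = pd.read_csv('data/Multiple_Regression_Dataset.csv')  # path used in the appendix code
print(df.isnull().sum())  # number of missing values per column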
Graphical analysis of the relationship between the covariates X and the response variable y
Graphical analysis of the relationship between each covariate and the response variable is an important step before performing linear regression.
It helps to visualize linear trends, detect anomalies, and assess the correlation between variables before building any model.

Some variables are positively correlated with the crime rate, while others are negatively correlated.
For example, we observe a strong positive relationship between R (crime rate) and Ex1.
By contrast, Age appears to be negatively associated with crime.
Finally, the boxplot of the binary variable S (indicating whether or not the state is in the South) shows that crime rates in the two groups of states are relatively similar. We can then analyze the correlation matrix.
Heat map of the Spearman correlation matrix
The correlation matrix allows us to study the strength of the relationships between variables. While the Pearson correlation is often used to measure linear relationships, the Spearman correlation is more appropriate when we want to capture monotonic, potentially nonlinear relationships between variables.
In this analysis, we use the Spearman correlation to better capture such potentially nonlinear associations.

The first row of the correlation matrix shows the strength of the relationship between each covariate and the response variable.
For example, both Ex0 and Ex1 show correlations with R greater than 60%, indicating strong relationships. These variables appear to be good predictors of the crime rate.
However, since the correlation between Ex0 and Ex1 is almost perfect, they likely convey the same information. To avoid redundancy, we keep only one of them, preferably the one with the strongest correlation with R.
More generally, when several variables are closely related to each other (e.g., correlation above 60%), they tend to carry redundant information. In that case, we keep only one of them, the one most closely related to the response variable. This helps reduce multicollinearity.
This exercise leads us to select the following variables: ['Ex1', 'LF', 'M', 'N', 'NW', 'U2'].
Studying multicollinearity using the VIF (Variance Inflation Factor)
It is important to study multicollinearity before fitting a linear regression.
When predictors are correlated, the standard errors of the coefficient estimates increase, which means their variances are inflated. The Variance Inflation Factor (VIF) is a diagnostic tool that measures how much the variance of a predictor's coefficient is inflated due to multicollinearity, and it is usually reported in regression output under a “VIF” column.

The VIF is calculated for each predictor in the model: the i-th predictor is regressed on all the other predictors, which gives Rᵢ², from which the VIF is computed with the formula:
VIFᵢ = 1 / (1 − Rᵢ²)
The following table lists the VIF values for the remaining six variables, all of which are below 5. This suggests that multicollinearity is not a problem and we can continue to fit the linear regression model.
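As an illustration of the formula above, here is a minimal sketch that recomputes the VIFs by hand; the dataset path and the list of retained variables are assumptions taken from the appendix code.
import pandas as pd
import statsmodels.api as sm
df = pd.read_csv('data/Multiple_Regression_Dataset.csv')  # path used in the appendix code
selected = ['Ex1', 'LF', 'M', 'N', 'NW', 'U2']  # variables retained earlier
for var in selected:
    # Regress the i-th predictor on all the other predictors to obtain R²ᵢ
    X_others = sm.add_constant(df[[c for c in selected if c != var]])
    r2 = sm.OLS(df[var], X_others).fit().rsquared
    print(f"VIF({var}) = {1.0 / (1.0 - r2):.2f}")  # VIFᵢ = 1 / (1 - R²ᵢ)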

Fit linear regression on six variables
If we fit a linear regression of the crime rate on the six selected variables, we obtain the following output:

Residual diagnostics
Before interpreting the regression results, we must first evaluate the quality of the residuals, in particular checking for autocorrelation, homoscedasticity (constant variance), and normality. The table below summarizes the residual diagnostics; a short code sketch of these checks follows the list:

- Durbin-Watson ≈ 2 means there is no autocorrelation in the residuals.
- From skewness to kurtosis, all values indicate that the residuals are symmetric and approximately normally distributed.
- The low condition number (3.06) suggests that there is no problematic multicollinearity among the predictors.
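Here is a minimal sketch of these checks, assuming model is the fitted statsmodels OLS result from the appendix code (the same statistics also appear directly in model.summary()):
from statsmodels.stats.stattools import durbin_watson, jarque_bera
resid = model.resid
dw = durbin_watson(resid)  # close to 2 suggests no autocorrelation
jb_stat, jb_pvalue, skew, kurtosis = jarque_bera(resid)  # normality check via skewness and kurtosis
print(f"Durbin-Watson: {dw:.2f}")
print(f"Jarque-Bera p-value: {jb_pvalue:.3f}, skew: {skew:.2f}, kurtosis: {kurtosis:.2f}")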
Points to remember
We can also evaluate the overall quality of the model through metrics such as R² and the F-statistic; in this case, both indicate satisfactory results. (See the appendix for more details.)
Now, we can interpret the regression coefficients from a statistical perspective.
We deliberately exclude any business-specific interpretation of the results.
The purpose of this analysis is to illustrate some simple but important steps in modeling a problem with multiple linear regression.
At the 5% significance level, two coefficients are statistically significant: Ex1 and NW.
This is not surprising, because both variables have a correlation with the response variable R greater than 40%. Depending on the context and objectives of the study, statistically insignificant variables can be removed, re-evaluated, or retained.
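A short way to list the coefficients that are significant at the 5% level, again assuming model is the fitted OLS result from the appendix code:
significant = model.pvalues[model.pvalues < 0.05].drop('const', errors='ignore')
print(significant)  # p-values of the statistically significant coefficients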
This article provides you with some guidelines for performing linear regression:
- Checking linearity through graphical analysis and studying the correlation between the response variable and the predictors is important.
- Checking the correlation between predictors helps reduce multicollinearity and supports variable selection.
- When two predictors are highly correlated, they may convey redundant information. In this case, you can keep the one more closely related to the response or, based on domain expertise, the one with greater business relevance or practical interpretability.
- The Variance Inflation Factor (VIF) is a useful tool for quantifying and assessing multicollinearity.
- Before interpreting the model coefficients statistically, the autocorrelation, normality, and homoscedasticity of the residuals must be verified to ensure that the model assumptions are met.
Although this analysis provides valuable insights, it also has some limitations.
The absence of missing values in the dataset simplifies the work, but in practice this is rare.
If you are building a predictive model, it is important to split the data into training and test sets (and possibly a validation set) to ensure a reliable assessment.
For variable selection, other feature selection methods, such as stepwise selection, can be applied.
When comparing multiple models, appropriate performance metrics must be defined.
In the case of linear regression, commonly used metrics include the Mean Absolute Error (MAE) and the Mean Squared Error (MSE).
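As an illustration, here is a minimal sketch of such an evaluation with scikit-learn; the dataset path and the list of retained variables are assumptions taken from the appendix code, and the 80/20 split is arbitrary.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
df = pd.read_csv('data/Multiple_Regression_Dataset.csv')  # path used in the appendix code
selected = ['Ex1', 'LF', 'M', 'N', 'NW', 'U2']  # variables retained earlier
X_train, X_test, y_train, y_test = train_test_split(df[selected], df['R'], test_size=0.2, random_state=42)
reg = LinearRegression().fit(X_train, y_train)  # fit on the training set only
y_pred = reg.predict(X_test)
print(f"MAE: {mean_absolute_error(y_test, y_pred):.2f}")
print(f"MSE: {mean_squared_error(y_test, y_pred):.2f}")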
Image Credits
Unless otherwise stated, all images and visualizations in this article were created by the authors using Python (Pandas, Matplotlib, Seaborn and Plotly) and Excel.
References
Wasserman, L. (2013). All of Statistics: A Concise Course in Statistical Inference. Springer Science & Business Media.
Data and License
The dataset used in this article contains crime-related and demographic data for 47 U.S. states in 1960.
It comes from the FBI's Uniform Crime Reporting (UCR) program and other U.S. government sources.
As a work of the U.S. government, the data is in the public domain under 17 U.S.C. § 105 and may be freely used, shared, and reproduced without restriction.
Source:
Code
Import data
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load the dataset
df = pd.read_csv('data/Multiple_Regression_Dataset.csv')
df.head()
Visual analysis of variables
# Extract response variable and covariates
response = 'R'
covariates = [col for col in df.columns if col != response]
# Create a new figure with a 4x4 grid of subplots
fig, axes = plt.subplots(nrows=4, ncols=4, figsize=(20, 18))
axes = axes.flatten()
# Plot boxplot for binary variable 'S'
sns.boxplot(data=df, x='S', y='R', ax=axes[0])
axes[0].set_title('Boxplot of R by S')
axes[0].set_xlabel('S')
axes[0].set_ylabel('R')
# Plot regression lines for all other covariates
plot_index = 1
for cov in covariates:
    if cov != 'S':
        sns.regplot(data=df, x=cov, y='R', ax=axes[plot_index], scatter=True, line_kws={"color": "red"})
        axes[plot_index].set_title(f'{cov} vs R')
        axes[plot_index].set_xlabel(cov)
        axes[plot_index].set_ylabel('R')
        plot_index += 1
# Hide unused subplots
for i in range(plot_index, len(axes)):
    fig.delaxes(axes[i])
fig.tight_layout()
plt.show()
Analyze the correlation between variables
spearman_corr = df.corr(method='spearman')
plt.figure(figsize=(12, 10))
sns.heatmap(spearman_corr, annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.title("Correlation Matrix Heatmap")
plt.show()
Filter highly correlated predictors (ρ > 0.6)
# Step 2: Correlation of each variable with response R
spearman_corr_with_R = spearman_corr['R'].drop('R') # exclude R-R
# Step 3: Identify pairs of covariates with strong inter-correlation (> 0.6)
strong_pairs = []
threshold = 0.6
covariates = spearman_corr_with_R.index
for i, var1 in enumerate(covariates):
    for var2 in covariates[i+1:]:
        if abs(spearman_corr.loc[var1, var2]) > threshold:
            strong_pairs.append((var1, var2))
# Step 4: From each correlated pair, keep only the variable most correlated with R
to_keep = set()
to_discard = set()
for var1, var2 in strong_pairs:
    if abs(spearman_corr_with_R[var1]) >= abs(spearman_corr_with_R[var2]):
        to_keep.add(var1)
        to_discard.add(var2)
    else:
        to_keep.add(var2)
        to_discard.add(var1)
# Final selection: all covariates excluding the ones to discard due to redundancy
final_selected_variables = [var for var in covariates if var not in to_discard]
final_selected_variables
Analysis of multicollinearity using VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
from sklearn.preprocessing import StandardScaler
X = df[final_selected_variables]
X_with_const = add_constant(X)
vif_data = pd.DataFrame()
vif_data["variable"] = X_with_const.columns
vif_data["VIF"] = [variance_inflation_factor(X_with_const.values, i)
for i in range(X_with_const.shape[1])]
vif_data = vif_data[vif_data["variable"] != "const"]
print(vif_data)
After standardization, fit the linear regression model on the six selected variables (without splitting the data into training and test sets)
from sklearn.preprocessing import StandardScaler
from statsmodels.api import OLS, add_constant
import pandas as pd
# Variables
X = df[final_selected_variables]
y = df['R']
scaler = StandardScaler()
X_scaled_vars = scaler.fit_transform(X)
X_scaled_df = pd.DataFrame(X_scaled_vars, columns=final_selected_variables)
X_scaled_df = add_constant(X_scaled_df)
model = OLS(y, X_scaled_df).fit()
print(model.summary())
