
Multiple Linear Regression Analysis

The complete code for this example is provided at the bottom of this article.

Use multiple linear regression when your response variable y is continuous and you have K covariates (independent variables) that are linearly related to it. The data take the form:

(y₁, x₁),…, (yᵢ, xᵢ),…, (yₙ, xₙ)

where xᵢ = (xᵢ₁, …, xᵢₖ) is the vector of the K covariate values for the i-th observation and n is the number of observations.

Understand the data

To make this concrete, imagine the following:

You like to run and track your performance by recording the distance you run every day. For 100 consecutive days, you collected four pieces of information:

  • The distance you ran,
  • The number of hours you ran,
  • The number of hours you slept the night before,
  • And the number of hours you worked.

Now, on day 101, you recorded everything except the distance you ran. You want to use the information you do have (the number of hours you ran, the number of hours you slept the night before, and the number of hours you worked that day) to estimate the missing value.

To do this, you can rely on the data from the first 100 days, which forms a dataset:

(y₁, x₁),…, (yᵢ, xᵢ),…, (y₁₀₀, x₁₀₀)

Here, each yᵢ is the distance you ran on day i, and each covariate vector xᵢ = (xᵢ₁, xᵢ₂, xᵢ₃) corresponds to:

  • xᵢ₁: the number of hours you ran,
  • xᵢ₂: the number of hours you slept the night before,
  • xᵢ₃: the number of hours you worked that day.

The index i = 1, …, 100 refers to the 100 days with complete data. With this dataset, you can now fit a multiple linear regression model to estimate the missing response variable on day 101.
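As a quick illustration of this workflow, here is a minimal sketch with made-up numbers; the column names (hours_run, hours_sleep, hours_work, distance_km) and the synthetic data are hypothetical and only show how the first 100 days could be used to predict day 101.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 100
days = pd.DataFrame({
    "hours_run": rng.uniform(0.5, 2.5, n),
    "hours_sleep": rng.uniform(5, 9, n),
    "hours_work": rng.uniform(4, 10, n),
})
# Assume an approximately linear relationship, plus noise (purely synthetic)
days["distance_km"] = (
    9 * days["hours_run"] + 0.4 * days["hours_sleep"]
    - 0.2 * days["hours_work"] + rng.normal(0, 1, n)
)

# Fit the model on the 100 complete days
model = LinearRegression()
model.fit(days[["hours_run", "hours_sleep", "hours_work"]], days["distance_km"])

# Predict the missing distance for day 101 from its recorded covariates
day_101 = pd.DataFrame({"hours_run": [1.8], "hours_sleep": [7.0], "hours_work": [6.0]})
print(model.predict(day_101))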

Model specification

Assuming a linear relationship between the response variable and the covariates (which can be assessed, for example, with the Pearson correlation), we can specify the model as:

yᵢ = ∑ⱼ₌₁ᵏ βⱼxᵢⱼ + εᵢ

for i = 1, …, n, where E(εᵢ | xᵢ₁, …, xᵢₖ) = 0. To account for the intercept, the first variable is set to xᵢ₁ = 1 for i = 1, …, n. To estimate the coefficients, the model is written in matrix notation.

y = (y₁, …, yₙ)ᵀ is the column vector of response values.

The covariates are represented by the design matrix X, whose i-th row is (xᵢ₁, …, xᵢₖ); its first column is a column of ones that accounts for the intercept.

β is the column vector of regression coefficients, and ε is the column vector of random error terms, one per observation.

We can then rewrite the model as:

y = Xβ + ε

Estimation of coefficients

Assuming that the (K+1) × (K+1) matrix XᵀX is invertible, the least squares estimator has the form:

β̂ = (XᵀX)⁻¹Xᵀy

We can then obtain the estimated regression function, an unbiased estimate of σ², and an approximate 1 - α confidence interval for βⱼ (a small code sketch of these formulas follows the list):

  • Estimated regression function: r̂(x) = ∑ⱼ₌₁ᵏ β̂ⱼxⱼ
  • σ̂² = (1 / (n - k)) × ∑ᵢ₌₁ⁿ ε̂ᵢ², where ε̂ = y - Xβ̂ is the vector of residuals.
  • β̂ⱼ ± tₙ₋ₖ,₁₋α⁄₂ × se(β̂ⱼ) is an approximate (1 - α) confidence interval, where se(β̂ⱼ) is the square root of the j-th diagonal element of the matrix σ̂²(XᵀX)⁻¹.
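To make these formulas concrete, here is a minimal NumPy sketch, assuming X is the design matrix with the intercept column already included and y is the response vector; the function name ols_estimates is hypothetical.

import numpy as np
from scipy import stats

def ols_estimates(X, y, alpha=0.05):
    n, p = X.shape                                   # p = K (intercept column included)
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y                     # least squares estimate
    residuals = y - X @ beta_hat
    sigma2_hat = residuals @ residuals / (n - p)     # unbiased estimate of sigma^2
    se = np.sqrt(sigma2_hat * np.diag(XtX_inv))      # standard errors of beta_hat
    t_val = stats.t.ppf(1 - alpha / 2, df=n - p)     # t quantile for the CI
    ci = np.column_stack([beta_hat - t_val * se, beta_hat + t_val * se])
    return beta_hat, sigma2_hat, se, ci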

Example of application

Since we did not actually record running-performance data, we will use a crime dataset covering 47 U.S. states in 1960, which can be obtained from the source referenced at the end of this article. There are several steps to follow before fitting the linear regression.

Understanding the different variables in the data

The first 9 observations of the data are given by:

    R    Age  S   Ed  Ex0  Ex1   LF     M    N   NW   U1  U2    W    X
 79.1    151  1   91   58   56  510   950   33  301  108  41  394  261
163.5    143  0  113  103   95  583  1012   13  102   96  36  557  194
 57.8    142  1   89   45   44  533   969   18  219   94  33  318  250
196.9    136  0  121  149  141  577   994  157   80  102  39  673  167
123.4    141  0  121  109  101  591   985   18   30   91  20  578  174
 68.2    121  0  110  118  115  547   964   25   44   84  29  689  126
 96.3    127  1  111   82   79  519   982    4  139   97  38  620  168
155.5    131  1  109  115  109  542   969   50  179   79  35  472  206
 85.6    157  1   90   65   62  553   955   39  286   81  28  421  239

The data contains 14 variables: the response variable R, 12 continuous predictors, and one categorical variable S:

  1. R: Crime rate: number of offenses reported to police
  2. Age: Number of males aged 14–24 per 1000 population
  3. S: Indicator variable for southern states (0 = No, 1 = Yes)
  4. Ed: Mean years of schooling (× 10) for people aged 25 or older
  5. Ex0: Per capita police expenditure by state and local governments in 1960
  6. Ex1: Per capita police expenditure by state and local governments in 1959
  7. LF: Labor force participation rate per 1000 civilian urban males aged 14–24
  8. M: Number of males per 1000 females
  9. N: State population size in hundreds of thousands
  10. NW: Number of non-whites per 1000 population
  11. U1: Unemployment rate of urban males per 1000, aged 14–24
  12. U2: Unemployment rate of urban males per 1000, aged 35–39
  13. W: Median value of transferable goods and assets or family income
  14. X: Number of families per 1000 earning below half of the median income

There are no missing values in the data.

Graphical analysis of the relationship between the covariates X and the response variable y

Graphical analysis of the relationship between the covariates and the response variable is an important first step in performing linear regression.

It helps to visualize linear trends, detect anomalies, and evaluate the correlation of variables before building any model.

Figure: boxplots and scatter plots with fitted regression lines showing the relationship between each variable and R.

Some variables were positively correlated with crime rates, while others were negatively correlated.

For example, we observe a strong positive relationship between R (crime rate) and Ex1.

By contrast, age appears to be negatively associated with crime.

Finally, the boxplot of the binary variable S (indicating whether the state is in the South) shows that crime rates in the two groups are relatively similar. We can then analyze the correlation matrix.

Heat map of the correlation matrix

The correlation matrix allows us to study the strength of the relationship between variables. While Pearson correlations are often used to measure linear relationships, Spearman correlations are more appropriate when we want to capture monotonic, potentially nonlinear relationships between variables.

In this analysis, we use the Spearman correlation to better capture such potentially nonlinear associations.

Figure: heat map of the Spearman correlation matrix, generated in Python.

The first row of the correlation matrix shows the strength of the relationship between each covariate and the response variable.

For example, both Ex0 and Ex1 show correlations with R greater than 60%, indicating a strong relationship. These variables appear to be good predictors of crime rates.

However, since the correlation between Ex0 and Ex1 is almost perfect, they convey essentially the same information. To avoid redundancy, we keep only one of them, preferably the one with the strongest correlation with R.

When several variables are strongly correlated with each other (e.g., correlation above 60%), they tend to carry redundant information. In that case, we keep only one of them, the one most strongly correlated with the response variable. This helps reduce multicollinearity.

This exercise allows us to select the following variables: ['Ex1', 'LF', 'M', 'N', 'NW', 'U2'].

Studying multicollinearity using the VIF (Variance Inflation Factor)

It is important to study multicollinearity before fitting a linear regression.

When predictors are correlated, the standard errors of the coefficient estimates increase, which inflates their variances. The variance inflation factor (VIF) is a diagnostic tool that measures how much the variance of a predictor's coefficient is inflated due to multicollinearity, and it is often reported in regression software output under a "VIF" column.

VIF Explanation

The VIF is calculated for each predictor in the model: the i-th predictor is regressed on all the other predictors, giving Rᵢ², which is then used to compute the VIF with the formula:

VIFᵢ = 1 / (1 - Rᵢ²)
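The appendix uses statsmodels' variance_inflation_factor, but the formula above can also be applied directly. Here is a small sketch, assuming X is a DataFrame of the selected predictors (without the intercept column); the helper name vif_from_formula is hypothetical.

import pandas as pd
from sklearn.linear_model import LinearRegression

def vif_from_formula(X: pd.DataFrame) -> pd.Series:
    vifs = {}
    for col in X.columns:
        # Regress the i-th predictor on all the others and get R_i^2
        others = X.drop(columns=col)
        r2 = LinearRegression().fit(others, X[col]).score(others, X[col])
        # Apply VIF_i = 1 / (1 - R_i^2)
        vifs[col] = 1.0 / (1.0 - r2)
    return pd.Series(vifs, name="VIF")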

The following table lists the VIF values for the remaining six variables, all of which are below 5. This suggests that multicollinearity is not a problem, and we can continue to fit the linear regression model.

Table: the VIF of each variable is below 5.

Fitting the linear regression on the six variables

If we fit a linear regression of the crime rate on the six selected variables, we obtain the following:

Outputs of multiple linear regression analysis. The corresponding code is provided in the appendix.

Residual diagnostics

Before interpreting the regression results, we must first evaluate the quality of the residuals, in particular checking for autocorrelation, homoscedasticity (constant variance), and normality. The following table shows the residual diagnostics:

Residual diagnostics from the regression summary (a short code sketch for reproducing these checks follows the list):
  • Durbin-Watson ≈ 2 indicates no autocorrelation in the residuals.
  • The skewness and kurtosis values indicate that the residuals are symmetric and approximately normally distributed.
  • The low condition number (3.06) confirms that there is no multicollinearity between the predictors.
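Here is a short sketch of how these checks could be reproduced in code, assuming model is the fitted statsmodels OLS result from the appendix.

from statsmodels.stats.stattools import durbin_watson, jarque_bera
from statsmodels.stats.diagnostic import het_breuschpagan

residuals = model.resid

# Autocorrelation: values close to 2 suggest no autocorrelation
print("Durbin-Watson:", durbin_watson(residuals))

# Normality: Jarque-Bera test, plus skewness and kurtosis of the residuals
jb_stat, jb_pvalue, skew, kurtosis = jarque_bera(residuals)
print("Jarque-Bera p-value:", jb_pvalue, "skew:", skew, "kurtosis:", kurtosis)

# Homoscedasticity: Breusch-Pagan test (a large p-value is consistent with constant variance)
bp_stat, bp_pvalue, _, _ = het_breuschpagan(residuals, model.model.exog)
print("Breusch-Pagan p-value:", bp_pvalue)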

Points to remember

We can also evaluate the overall quality of the model through metrics such as R-squared and the F-statistic; in this case, both show satisfactory results. (See the appendix for more details.)
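These quantities can be read directly from the fitted result; a short sketch, assuming model is the fitted statsmodels OLS result from the appendix:

# Overall fit metrics from the fitted OLS result
print("R-squared:", model.rsquared)
print("Adjusted R-squared:", model.rsquared_adj)
print("F-statistic:", model.fvalue, "p-value:", model.f_pvalue)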

Now, we can interpret the regression coefficients from a statistical perspective.
We deliberately exclude any domain-specific interpretation of the results.
The purpose of this analysis is to illustrate some simple but important steps in modeling a problem with multiple linear regression.

At the 5% significance level, two coefficients are statistically significant: Ex1 and NW.

This is not surprising, since the correlation between each of these two variables and the response variable R is greater than 40%. Depending on the context and objectives of the study, variables that are not statistically significant can be removed, re-evaluated, or retained.

This article provides you with some guidelines for performing linear regression:

  • It is important to check linearity through graphical analysis and to study the correlation between the response variable and the predictors.
  • Checking the correlation between variables helps reduce multicollinearity and supports variable selection.
  • When two predictors are highly correlated, they may convey redundant information. In that case, keep the one more closely related to the response, or, based on domain expertise, the one with greater business relevance or practical interpretability.
  • The variance inflation factor (VIF) is a useful tool for quantifying and evaluating multicollinearity.
  • Before interpreting the model coefficients statistically, the autocorrelation, normality, and homoscedasticity of the residuals must be verified to ensure that the model assumptions are met.

Although this analysis provides valuable insights, it also has some limitations.

The absence of missing values in the dataset simplifies the analysis, but in practice this is rare.

If you are building a predictive model, it is important to split the data into training and test sets, and possibly a validation set, to ensure a reliable assessment.

For variable selection, other feature selection methods, such as stepwise selection, can be applied.

When comparing multiple models, appropriate performance metrics must be defined.

In the case of linear regression, commonly used metrics include the mean absolute error (MAE) and the mean squared error (MSE).
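As a minimal sketch of this evaluation workflow (not part of the original analysis, which fits the model on all 47 observations), the data could be split and scored as follows, assuming df and final_selected_variables from the appendix code.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

X = df[final_selected_variables]   # assumes df and final_selected_variables exist
y = df['R']

# Hold out 20% of the observations as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

reg = LinearRegression().fit(X_train, y_train)
y_pred = reg.predict(X_test)

print("MAE:", mean_absolute_error(y_test, y_pred))
print("MSE:", mean_squared_error(y_test, y_pred))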

Image Credits

Unless otherwise stated, all images and visualizations in this article were created by the authors using Python (Pandas, Matplotlib, Seaborn and Plotly) and Excel.

Reference

Wasserman, L. (2013). All of Statistics: A Concise Course in Statistical Inference. Springer Science & Business Media.

Data and License

The data set used in this article contains crime-related and demographic data in 47 U.S. states in 1960.
It comes from the FBI's Uniform Crime Reporting (UCR) program and other U.S. government sources.

As a work of the U.S. government, the data is in the public domain under 17 U.S.C. § 105 and may be freely used, shared, and reproduced without restriction.

source:

Code

Import data

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
df = pd.read_csv('data/Multiple_Regression_Dataset.csv')
df.head()

Visual analysis of variables

Create a new figure

# Extract response variable and covariates
response = 'R'
covariates = [col for col in df.columns if col != response]

fig, axes = plt.subplots(nrows=4, ncols=4, figsize=(20, 18))
axes = axes.flatten()

# Plot boxplot for binary variable 'S'
sns.boxplot(data=df, x='S', y='R', ax=axes[0])
axes[0].set_title('Boxplot of R by S')
axes[0].set_xlabel('S')
axes[0].set_ylabel('R')

# Plot regression lines for all other covariates
plot_index = 1
for cov in covariates:
    if cov != 'S':
        sns.regplot(data=df, x=cov, y='R', ax=axes[plot_index], scatter=True, line_kws={"color": "red"})
        axes[plot_index].set_title(f'{cov} vs R')
        axes[plot_index].set_xlabel(cov)
        axes[plot_index].set_ylabel('R')
        plot_index += 1

# Hide unused subplots
for i in range(plot_index, len(axes)):
    fig.delaxes(axes[i])

fig.tight_layout()
plt.show()

Analyze the correlation between variables

spearman_corr = df.corr(method='spearman')
plt.figure(figsize=(12, 10))
sns.heatmap(spearman_corr, annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.title("Correlation Matrix Heatmap")
plt.show()

Filter highly correlated predictors (ρ > 0.6)

# Step 2: Correlation of each variable with response R
spearman_corr_with_R = spearman_corr['R'].drop('R')  # exclude R-R

# Step 3: Identify pairs of covariates with strong inter-correlation (e.g., > 0.6)
strong_pairs = []
threshold = 0.6
covariates = spearman_corr_with_R.index

for i, var1 in enumerate(covariates):
    for var2 in covariates[i+1:]:
        if abs(spearman_corr.loc[var1, var2]) > threshold:
            strong_pairs.append((var1, var2))

# Step 4: From each correlated pair, keep only the variable most correlated with R
to_keep = set()
to_discard = set()

for var1, var2 in strong_pairs:
    if abs(spearman_corr_with_R[var1]) >= abs(spearman_corr_with_R[var2]):
        to_keep.add(var1)
        to_discard.add(var2)
    else:
        to_keep.add(var2)
        to_discard.add(var1)

# Final selection: all covariates excluding the ones to discard due to redundancy
final_selected_variables = [var for var in covariates if var not in to_discard]

final_selected_variables

Analysis of multicollinearity using VIF

from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

X = df[final_selected_variables]  

X_with_const = add_constant(X)  

vif_data = pd.DataFrame()
vif_data["variable"] = X_with_const.columns
vif_data["VIF"] = [variance_inflation_factor(X_with_const.values, i)
                   for i in range(X_with_const.shape[1])]

vif_data = vif_data[vif_data["variable"] != "const"]

print(vif_data)

After standardization, fit the linear regression model on the six variables, without splitting the data into training and test sets

from sklearn.preprocessing import StandardScaler
from statsmodels.api import OLS, add_constant
import pandas as pd

# Variables
X = df[final_selected_variables]
y = df['R']

scaler = StandardScaler()
X_scaled_vars = scaler.fit_transform(X)

X_scaled_df = pd.DataFrame(X_scaled_vars, columns=final_selected_variables)

X_scaled_df = add_constant(X_scaled_df)

model = OLS(y, X_scaled_df).fit()
print(model.summary())
Image from the author: OLS regression results
