Explanation: How does L1 regularization perform feature selection?

Feature selection is the process of choosing the best subset of features from a given set; the best subset is the one that maximizes the model's performance on a given task.
Feature selection can be an explicit process, as in filter or wrapper methods. In these methods, features are added or removed iteratively based on the value of a fixed metric that quantifies a feature's relevance to predicting the output. Such metrics include information gain, variance, and the chi-square statistic, and the algorithm applies a fixed threshold on the metric to accept or reject a feature. Note that these methods are not part of the model's training phase and are performed before it.
Embedded methods, in contrast, perform feature selection implicitly, without any predefined selection criterion; the criterion is derived from the training data itself. This intrinsic feature selection is part of the model's training phase: the model learns to select features and to make relevant predictions at the same time. In the later sections, we describe the role of regularization in performing this intrinsic feature selection.
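As a minimal sketch of an explicit filter method (the dataset, the correlation-based score, and the 0.5 threshold are all illustrative assumptions), the snippet below scores each candidate feature by its absolute Pearson correlation with the target and keeps only those that clear a fixed threshold:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))                          # three candidate features
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=n)    # only feature 0 matters

# Filter method: score each feature by |Pearson correlation| with the target,
# then accept/reject against a fixed threshold chosen before training.
scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
selected = np.flatnonzero(scores > 0.5)
print(selected)   # only the relevant feature survives
```

Note that this selection happens entirely before any model is trained, which is exactly what distinguishes filter methods from embedded methods.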
Regularization and model complexity
Regularization is the process of penalizing model complexity to avoid overfitting and improve generalization.
Here, the complexity of the model is synonymous with its ability to adapt to patterns in the training data. For a polynomial model in x with degree d, increasing the degree d gives the model more flexibility to capture patterns in the observed data.
Overfitting and underfitting
If we fit a polynomial model with d = 2 to a set of training samples drawn from a cubic polynomial with some noise, the model will be unable to capture the distribution of the samples. It simply lacks the flexibility, or complexity, to model data generated from a degree-3 (or higher-order) polynomial. Such a model is said to underfit the training data.
Working with the same example, suppose we now have a model with d = 6. With the increased complexity, the model should easily estimate the original cubic polynomial used to generate the data (e.g. by setting the coefficients of all terms with an exponent > 3 to zero). But if the training process is not terminated at the right time, the model will keep using its extra flexibility to reduce the error further and start fitting the noisy samples. This greatly reduces the training error, but now the model overfits the training data. The noise will differ in a real setting (or during the test phase), so any prediction that relies on it breaks down, resulting in high test error.
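The under/overfitting behavior is easy to reproduce numerically. In this sketch (the synthetic cubic data and the use of numpy's polyfit are my own illustrative choices), a degree-2 fit cannot track the cubic signal, while raising the degree can only push the training error lower:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, size=40)
y = x**3 - x + rng.normal(scale=0.3, size=40)   # cubic ground truth + noise

def train_mse(degree):
    """Least-squares polynomial fit, evaluated on the training set itself."""
    coeffs = np.polyfit(x, y, degree)
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

# d=2 underfits; d=6 has enough flexibility to start absorbing the noise.
print(train_mse(2), train_mse(3), train_mse(6))
```

The training error alone keeps shrinking as the degree grows; only a held-out test set would reveal that the degree-6 model has begun fitting noise.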
How do we determine the optimal model complexity?
In practical settings, we have little knowledge of the data-generating process or the true distribution of the data. Finding the best model, with the right complexity so that neither underfitting nor overfitting occurs, is a challenge.
One technique is to start with a model that is sufficiently powerful and then reduce its complexity through feature selection. The fewer the features, the lower the complexity of the model.
As mentioned in the previous section, feature selection can be explicit (filter and wrapper methods) or implicit. Features that are redundant for determining the value of the response variable should be eliminated, to keep the model from learning irrelevant patterns. Regularization performs a similar task. So how do we connect regularization and feature selection toward the common goal of optimal model complexity?
L1 regularization as a feature selector
Continuing with our polynomial model, we represent it as a function f with input x, parameters θ, and degree d:

$$f(x; \theta) = \theta_0 + \theta_1 x + \theta_2 x^2 + \dots + \theta_d x^d$$

For the polynomial model, each power of the input, x^i, can be regarded as a feature, forming a vector of the form

$$[1, x, x^2, \dots, x^d]$$
We also define an objective function whose minimization yields the optimal parameters θ*, and we include a regularization term that penalizes model complexity:

$$\theta^* = \arg\min_{\theta} \left[ \frac{1}{2n} \sum_{k=1}^{n} \big(f(x_k; \theta) - y_k\big)^2 + \lambda \sum_{j=0}^{d} |\theta_j| \right]$$
To determine the minima of this function, we need to analyze all of its critical points, i.e. points where the derivative is zero or undefined.
The partial derivative w.r.t. a parameter θj can be written as

$$\frac{\partial L}{\partial \theta_j} = \frac{1}{n} \sum_{k=1}^{n} \big(f(x_k; \theta) - y_k\big)\, x_k^j + \lambda\, \mathrm{sgn}(\theta_j)$$
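A quick way to validate such a derivative is a finite-difference check. The sketch below (random data; the 1/(2n) MSE scaling matches the objective above, evaluated at a point where θj is safely nonzero) compares the analytic expression with a central difference:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=50)
y = rng.normal(size=50)
d, lam = 3, 0.1
theta = rng.normal(size=d + 1)
X = np.stack([x**p for p in range(d + 1)], axis=1)   # features [1, x, ..., x^d]

def objective(t):
    return 0.5 * np.mean((X @ t - y) ** 2) + lam * np.sum(np.abs(t))

# Analytic partial derivative w.r.t. theta_j (valid away from theta_j = 0)
j = 1
analytic = np.mean((X @ theta - y) * X[:, j]) + lam * np.sign(theta[j])

# Central finite difference as a numerical cross-check
h = 1e-6
tp, tm = theta.copy(), theta.copy()
tp[j] += h
tm[j] -= h
numeric = (objective(tp) - objective(tm)) / (2 * h)
print(abs(analytic - numeric))   # should be tiny
```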
where the function sgn is defined as

$$\mathrm{sgn}(x) = \begin{cases} 1 & \text{if } x > \epsilon \\ 0 & \text{if } -\epsilon \le x \le \epsilon \\ -1 & \text{if } x < -\epsilon \end{cases}$$
Note: the true derivative of the absolute-value function differs from the sgn function defined above; it is undefined at x = 0. We add a definition there to remove the kink at x = 0 and make the function differentiable over its entire domain. ML frameworks use similar workarounds when the underlying computation involves the absolute-value function; see this thread on the PyTorch forums.
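The patched sign function takes only a few lines of code (the tolerance value EPS below is an arbitrary illustrative choice):

```python
EPS = 1e-8  # tolerance below which a value is treated as exactly zero

def sgn(x, eps=EPS):
    """Sign function with a dead zone around zero, as defined above."""
    if x > eps:
        return 1
    if x < -eps:
        return -1
    return 0

print(sgn(0.5), sgn(-0.5), sgn(0.0))  # -> 1 -1 0
```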
By computing the partial derivative of the objective w.r.t. a single parameter θj and setting it to zero, we can derive a condition relating θj* to the predictions, targets, and features:

$$\frac{1}{n} \sum_{k=1}^{n} \big(y_k - f(x_k; \theta^*)\big)\, x_k^j = \lambda\, \mathrm{sgn}(\theta_j^*)$$
Let's examine the equation above. If we assume that the inputs and targets are centered around the mean (i.e. the data has been normalized in a preprocessing step), the term on the LHS effectively represents the covariance between the j-th feature and the difference between the predicted and target values.
The statistical covariance between two variables quantifies how much the two variables vary together, i.e. how much the value of one variable tells us about the other.
The sign function on the RHS forces the covariance on the LHS to assume only three values (since the sign function returns only -1, 0, and 1). If the j-th feature is redundant and does not affect the predictions, its covariance with the residual will be almost zero, bringing the corresponding parameter θj* to zero. The feature is thereby eliminated from the model.
Imagine the sign function as a canyon carved by a river: you can walk along the canyon floor (the riverbed), but to get out of it you must climb steep walls. L1 regularization induces a similar "thresholding" effect on the gradient of the loss function: the gradient must be strong enough to climb over the barrier; otherwise the parameter is eventually pulled down to zero.
For a more grounded example, consider a dataset containing samples drawn from a straight line (described by two coefficients) with some noise. The optimal model should have no more than two parameters; otherwise it will adapt to the noise present in the data (using the added degrees of freedom/powers of the polynomial). Changing the parameter of a higher power in the polynomial model does not meaningfully affect the difference between the targets and the model's predictions, so the covariance of the residual with that feature stays near zero.
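This straight-line example can be reproduced end to end. The sketch below builds polynomial features of degree 3 from noisy linear data and minimizes the L1-penalized MSE with proximal gradient descent (soft-thresholding), a standard optimizer for this objective that produces exact zeros; the data, λ, and learning rate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=100)
y = 2.0 * x + 1.0 + rng.normal(scale=0.05, size=100)  # line + noise

# Polynomial features [1, x, x^2, x^3]; only the first two should survive.
X = np.stack([x**p for p in range(4)], axis=1)

lam, alpha = 0.05, 0.1        # L1 strength and learning rate (illustrative)
theta = np.zeros(4)
for _ in range(20000):
    grad = X.T @ (X @ theta - y) / len(y)   # gradient of the MSE term
    theta = theta - alpha * grad
    # Soft-thresholding step for the L1 term: any parameter whose magnitude
    # falls below lam * alpha is clamped to exactly zero.
    theta = np.sign(theta) * np.maximum(np.abs(theta) - lam * alpha, 0.0)

print(np.round(theta, 3))   # higher-power coefficients end up exactly zero
```

The x² and x³ coefficients barely move the residual, so their covariance term never outweighs the λ barrier and soft-thresholding pins them at exactly zero, while θ0 and θ1 settle near (slightly shrunken versions of) the true intercept and slope.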
During training, gradient descent effectively adds or subtracts a near-constant step on top of the gradient of the loss function. If the gradient of the loss function (MSE) is smaller than this constant step, the parameter will eventually reach zero. Observe the following equation, which depicts how a parameter is updated by gradient descent:

$$\theta_j \leftarrow \theta_j \underbrace{- \alpha \cdot \frac{1}{n} \sum_{k=1}^{n} \big(f(x_k; \theta) - y_k\big)\, x_k^j}_{\text{loss gradient (blue part)}} \underbrace{- \alpha \lambda\, \mathrm{sgn}(\theta_j)}_{\text{constant step (red part)}}$$
If the blue part above is smaller than λα (which is itself quite small), Δθj is an almost constant step of magnitude λα. The sign of this step (the red part) depends on sgn(θj), whose output depends on θj. If θj is positive, i.e. greater than ε, sgn(θj) equals 1, so Δθj is approximately equal to -λα, pushing θj toward zero.
To overcome the constant step (red part) that drives the parameter to zero, the gradient of the loss function (blue part) must be larger than the step size. And for the loss-function gradient to be large, the feature's value must significantly affect the output of the model.
This is how features are eliminated: their corresponding parameters, which bear little relation to the model's output, are driven to zero by L1 regularization during training.
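The barrier effect can be seen with a hypothetical one-dimensional parameter whose loss gradient (the blue part) is far smaller than λα; every number below is made up for illustration:

```python
# One parameter attached to an irrelevant feature: its loss gradient is tiny,
# so each update is dominated by the constant L1 step of size lam * alpha.
lam, alpha = 0.1, 0.5
theta = 1.0
loss_grad = 0.01   # "blue part": near zero because the feature barely matters
steps = 0
while theta > 0:
    theta -= alpha * loss_grad + alpha * lam * (1 if theta > 0 else -1)
    theta = max(theta, 0.0)   # once the step overshoots zero, clamp there
    steps += 1
print(steps, theta)   # the parameter walks down to exactly zero
```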
Further reading and conclusions
- To get more insights on the topic, I posted a question on the r/MachineLearning subreddit, and the resulting thread contains different perspectives you might want to read.
- Madiyar Aitbayev also has an interesting blog post covering the same question, but with a geometric explanation.
- Brian Keng's blog explains regularization from a probabilistic perspective.
- This thread on Cross Validated explains why the L1 norm encourages sparse models. Mukul Ranjan's detailed blog post explains why the L1 norm encourages parameters to become zero, unlike the L2 norm.
"L1 regularization performs feature selection" is a simple statement that most ML learners agree with, without delving into how it works internally. This blog post aims to share my understanding and mental models with readers, so as to answer the question intuitively. For suggestions and questions, you can find my email on my website. Keep learning, and have a wonderful day!