
From a Point to L∞

You should read this

During my Bachelor of Mathematics degree I was first introduced to the L¹ and L² norms as measures of distance. Back then the difference hardly seemed to matter, and that is exactly the misunderstanding I keep running into: the idea that L¹ and L² do the same job. While that may look true at times, each norm shapes its model in a very different way.

In this article, we will go from a simple point all the way to the L∞ norm, stopping to see why these norms matter, how they differ, and where the L∞ norm shows up in AI.

Our agenda:

  • When to use L¹ versus L² loss
  • How L¹ and L² regularization shrink a model towards sparsity or smoothness
  • Why the slightest algebraic differences leave GAN images blurry or razor-sharp
  • How to generalize distance to Lᵖ spaces and what the L∞ norm represents

A brief description of mathematical abstraction

You may have had a conversation (perhaps a confusing one) in which mathematical abstraction popped up, and come away even more confused about what mathematicians actually do. Abstraction means extracting the underlying patterns and properties of a concept so that it applies more widely. That sounds complicated, but look at this trivial example:

A point in 1-D is x = x₁; in 2-D: x = (x₁, x₂); in 3-D: x = (x₁, x₂, x₃). I cannot picture 42 dimensions any better than you can, yet nothing stops us from writing x = (x₁, …, x₄₂).

This seems trivial, but that little act of abstraction is the key to reaching L∞ – only there it is distance itself that we abstract. From now on, let's work with x = (x₁, x₂, x₃, …, xₙ), otherwise known by its official title x ∈ ℝⁿ. The difference between two points is then the vector v = x − y = (x₁ − y₁, x₂ − y₂, …, xₙ − yₙ).

The "normal" norms: L¹ and L²

The key point is simple but powerful: because the L¹ and L² norms behave differently in a few critical ways, you can combine them in a single objective to serve competing goals. In regularization, L¹ and L² penalty terms inside the loss function help you land at the right spot on the bias–variance spectrum, yielding a model that is both accurate and generalizable. In GANs, an L¹ pixel loss is paired with an adversarial loss so that the generator produces (i) realistic images that (ii) match the expected output. The tiny algebraic difference between the two losses explains why Lasso performs feature selection and why switching to L² in a GAN often produces blurry images.

Code on GitHub

L¹ vs. L² loss – similarities and differences

  • If your data may contain many outliers or heavy-tailed noise, you usually reach for L¹ (MAE).
  • If you care most about overall squared error and have reasonably clean data, L² (MSE) works very well – and it is easier to optimize because it is smooth.

Since MAE penalizes each error proportionally, an L¹-trained model gravitates toward the median of the observations. This is exactly why the L¹ loss preserves texture detail in GANs, while the quadratic penalty of MSE pushes the model toward a mean value that looks smeared out.
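A minimal sketch makes this concrete (toy data with a single outlier, chosen purely for illustration): fitting one constant prediction under each loss recovers the median for L¹ and the mean for L².

import numpy as np

# Toy data: mostly small values plus one large outlier (made-up example)
y = np.array([1.0, 2.0, 2.0, 3.0, 100.0])

# Candidate constant predictions to search over
candidates = np.linspace(y.min(), y.max(), 10_000)

# Mean absolute error (L1) and mean squared error (L2) for each constant c
mae = np.array([np.mean(np.abs(y - c)) for c in candidates])
mse = np.array([np.mean((y - c) ** 2) for c in candidates])

print("L1-optimal constant:", candidates[mae.argmin()])  # close to median(y) = 2
print("L2-optimal constant:", candidates[mse.argmin()])  # close to mean(y) = 21.6
print("median:", np.median(y), " mean:", np.mean(y))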

L¹ regularization (Lasso)

Optimization and regularization pull in opposite directions: optimization tries to fit the training set perfectly, while regularization deliberately sacrifices some training accuracy to gain generalization. Adding an L¹ penalty α∥w∥₁ promotes sparsity – many coefficients collapse exactly to zero. A larger α means stricter pruning, simpler models, and less sensitivity to noise from irrelevant inputs. With Lasso you get built-in feature selection, because the ∥w∥₁ term actually zeroes out small weights, whereas L² merely shrinks them.
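For reference – assuming scikit-learn's parameterization, which the snippet below uses – Lasso minimizes (1 / (2n)) · ∥y − Xw∥₂² + α · ∥w∥₁, so α directly scales the sparsity-inducing ∥w∥₁ term.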

L² regularization (Ridge)

Swap the L¹ penalty for an L² term α∥w∥₂² and you have Ridge regression. Ridge shrinks the weights towards zero but usually never sets them exactly to zero. This discourages any single feature from dominating while still keeping all of them – handy when you believe every input matters but you want to curb overfitting.

Both Lasso and Ridge improve generalization; they just do it differently. With Lasso, once a weight reaches zero the optimizer feels no pull to move away – like standing on flat ground – so zeros naturally "stick". In more technical terms, the shape of the constraint set in coefficient space differs: Lasso's diamond-shaped set has corners that zero out coordinates, while Ridge's spherical set merely squeezes them. Don't worry if this doesn't fully click – much of the theory is beyond the scope of this article, but if you are interested you can read up on Lₚ spaces.

But back to the point. Notice that when we train the two models on the same data, Lasso removes some input features by setting their coefficients exactly to zero, while Ridge keeps them all.

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 30 features, but only 5 of them actually carry signal
X, y = make_regression(n_samples=100, n_features=30, n_informative=5, noise=10)

# Lasso (L1 penalty): drives many coefficients exactly to zero
model = Lasso(alpha=0.1).fit(X, y)
print("Lasso nonzero coeffs:", (model.coef_ != 0).sum())

# Ridge (L2 penalty): shrinks coefficients but keeps them all nonzero
model = Ridge(alpha=0.1).fit(X, y)
print("Ridge nonzero coeffs:", (model.coef_ != 0).sum())

Note that if we increase α to 10, even more features are removed. This can be dangerous, because we may throw away informative features along with the noise.

# The same models with a much stronger penalty, alpha = 10
model = Lasso(alpha=10).fit(X, y)
print("Lasso nonzero coeffs:", (model.coef_ != 0).sum())

model = Ridge(alpha=10).fit(X, y)
print("Ridge nonzero coeffs:", (model.coef_ != 0).sum())

L¹ loss in Generative Adversarial Networks (GANs)

GANs pit two networks against each other: a generator G (the "forger") versus a discriminator D (the "detective"). To make G produce images that are both convincing and faithful, many image-to-image GANs use a mixed loss.
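Written out, a typical pix2pix-style version of such an objective is (a sketch – the exact form of the adversarial term is an assumption on my part):

Loss(G, D) = L_adv(G, D) + λ · 𝔼[ ∥y − G(x)∥₁ ]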

Where

  • x – the input image (for example, a sketch)
  • y – the real target image (for example, a photo)
  • λ – the balancing knob between realism and fidelity

Swap the L¹ pixel loss for L² and you square each pixel error; the large residuals dominate the objective, so G plays it safe by predicting the mean of all plausible textures – the result is smoother, blurrier output. With L¹ every pixel error counts proportionally, so G is drawn toward the median texture patch and keeps sharp edges.
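Here is a tiny sketch of that effect with made-up one-dimensional "patches": the pixel-wise mean minimizes the squared error and smears the edge, while the pixel-wise median minimizes the absolute error and keeps a hard edge.

import numpy as np

# Three equally plausible "texture patches" for the same input:
# sharp 0/1 edges at slightly different positions (hypothetical values).
patches = np.array([
    [0, 0, 0, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 1, 1, 1],
], dtype=float)

# The L2-optimal prediction is the pixel-wise mean: a soft, smeared edge.
l2_prediction = patches.mean(axis=0)

# The L1-optimal prediction is the pixel-wise median: still a hard 0/1 edge.
l1_prediction = np.median(patches, axis=0)

print("L2 (mean)  :", l2_prediction)
print("L1 (median):", l1_prediction)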

Why small differences matter

  • In regression, Lasso zeroes out weak predictors, while Ridge only nudges them toward zero.
  • Visually, the linear L¹ penalty keeps high-frequency detail, while the quadratic L² penalty blurs it.
  • In both cases, you can blend the two to trade off robustness, sparsity, and smooth optimization – the balancing act at the core of modern machine-learning objectives.

Generalizing to Lᵖ

Before we arrive at L∞, we need to talk about the four rules every norm must satisfy (a quick numeric check follows the list):

  • Non-negativity – a distance cannot be negative; nobody says, "I am −10 m from the pool."
  • Positive definiteness – the distance is zero only for the zero vector, i.e. when there is no displacement.
  • Absolute homogeneity (scalability) – scaling a vector by α scales its length by |α|: double the vector and the distance doubles.
  • Triangle inequality – a detour through y is never shorter than going straight from start to finish: ∥x + y∥ ≤ ∥x∥ + ∥y∥.
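Here is a quick numeric check of the last two properties for the familiar L² norm (a small sketch using NumPy's built-in norm and random vectors):

import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=5), rng.normal(size=5)
alpha = -3.7

# Absolute homogeneity: ||alpha * x|| == |alpha| * ||x||
print(np.isclose(np.linalg.norm(alpha * x), abs(alpha) * np.linalg.norm(x)))

# Triangle inequality: ||x + y|| <= ||x|| + ||y||
print(np.linalg.norm(x + y) <= np.linalg.norm(x) + np.linalg.norm(y))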

The mathematical abstraction we performed at the beginning of this article was very simple. Now, looking at the norms written out below, we do something similar at a deeper level. There is a clear pattern: the exponent inside the sum increases by one, and so does the root taken outside it. We can also check whether this more abstract notion of distance still satisfies the core properties above – and it does. So what we have done is successfully abstract the concept of distance into Lᵖ space.

Written as a family: ∥x∥₁ = |x₁| + … + |xₙ|, ∥x∥₂ = (|x₁|² + … + |xₙ|²)^(1/2), and in general the Lᵖ norm ∥x∥ₚ = (|x₁|ᵖ + |x₂|ᵖ + … + |xₙ|ᵖ)^(1/p). These distances form the Lᵖ spaces. Squeezing the limit as p → ∞ takes that family all the way to the L∞ norm.

The L∞ norm

The L∞ norm goes by many names – the supremum norm, the max norm, the uniform norm, the Chebyshev norm – but it is characterized by a limit: as p → ∞, the Lᵖ norm ∥x∥ₚ converges to maxᵢ |xᵢ|, the largest absolute coordinate of the vector.

By generalizing our norm to an arbitrary p, we can write a function in two lines of code that computes the distance under any conceivable Lᵖ norm. Very useful.

def Lp_norm(v, p):
    # (sum_i |v_i|^p)^(1/p) – the Lp norm of vector v
    return sum(abs(x) ** p for x in v) ** (1 / p)

We can now look at how this distance measure changes as p increases. From the chart we see that it decreases monotonically and approaches a very specific value: the largest absolute coordinate of the vector, marked by the black dotted line.

Figure: convergence of the Lᵖ norm to the largest absolute coordinate.
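You can reproduce the effect numerically with the Lp_norm function above (the example vector is an arbitrary choice):

v = [3, -1, 2, -7, 4]

for p in [1, 2, 4, 10, 50, 100]:
    print(f"p = {p:>3}: {Lp_norm(v, p):.4f}")

print("max |x_i|:", max(abs(x) for x in v))  # 7 – the value the norms approach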

In fact, it does not merely get close to the largest absolute coordinate of our vector – in the limit it equals it exactly: ∥x∥∞ = maxᵢ |xᵢ|.

The maximum shows up any time you need a uniform guarantee or worst-case control. Put less technically: if no individual coordinate may exceed a certain threshold, the L∞ norm is the one to use. If you want a hard upper bound on every coordinate of a vector, this is also your norm.

This is not just a theoretical quirk; it is genuinely useful and shows up in many different situations (two of them are sketched in code after the list):

  • Maximum absolute error – bounds every prediction, so none drifts too far.
  • Max-abs feature scaling – squeezes every feature into [−1, 1] without destroying sparsity.
  • Max-norm weight constraints – keep all parameters inside an axis-aligned box.
  • Adversarial robustness – limits each pixel perturbation to an ε-cube (an L∞ ball).
  • Chebyshev distance in k-NN and grid search – the fastest way to count "king's-move" steps.
  • Robust regression / Chebyshev-center and portfolio problems – linear programs that minimize the worst-case residual.
  • Fairness caps – limit the maximum violation per group, not just the average.
  • Bounding-box collision tests – wrap objects in an axis-aligned box for quick overlap checks.
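Two of these are easy to sketch in a few lines (NumPy, with invented data purely for illustration):

import numpy as np

# Chebyshev ("king's move") distance between two points on a grid
a, b = np.array([2, 3]), np.array([7, 5])
print("Chebyshev distance:", np.max(np.abs(a - b)))  # 5 king moves

# Max-abs feature scaling: divide each column by its largest absolute value,
# squeezing every feature into [-1, 1] while keeping zeros at zero.
X = np.array([[ 1.0, -200.0],
              [ 4.0,   50.0],
              [-2.0,  100.0]])
X_scaled = X / np.max(np.abs(X), axis=0)
print(X_scaled)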

With this more abstract concept of distance, all sorts of interesting questions arise. We can let p take non-integer values, say p = π (as shown in the chart above). We can also consider p ∈ (0, 1), for example p = 0.3 – does that still satisfy the four rules we said every norm must obey?
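A quick experiment with the Lp_norm function above hints at the answer for p = 0.3 (the vectors are a toy choice of mine):

x = [1, 0]
y = [0, 1]
p = 0.3

lhs = Lp_norm([xi + yi for xi, yi in zip(x, y)], p)  # "length" of x + y
rhs = Lp_norm(x, p) + Lp_norm(y, p)                  # sum of the "lengths"

print(lhs, "<=", rhs, "?", lhs <= rhs)  # 2**(1/0.3) ≈ 10.08 > 2 – the triangle inequality does not survive here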

Conclusion

Abstracting the concept of distance can feel clumsy, even like unnecessary theory, yet distilling it to its core properties lets us pose questions we otherwise could not even formulate, and it reveals new norms with concrete, real-world uses. It is tempting to treat all distance measures as interchangeable, but their small algebraic differences give each norm distinct properties that the models built on them inherit. From the bias–variance trade-off in regression to crisp versus blurry images in GANs, a model is shaped by how it measures distance.


Let’s connect on LinkedIn!

Follow me on X (Twitter)

Code on GitHub
