Why are convolutional neural networks so well suited for images?

The universal approximation theorem states that a neural network with a single hidden layer and a nonlinear activation function can approximate any continuous function.
Setting aside the practical problem that the number of neurons in this hidden layer can grow enormously, this suggests we would not need any other network architectures: a simple feedforward neural network could solve the problem.
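To make this concrete, here is a minimal sketch of such a single-hidden-layer network in PyTorch. The hidden width of 64, the Tanh activation, and the target function sin(x) are illustrative assumptions, not part of the theorem itself:

```python
import torch
import torch.nn as nn

# One hidden layer with a nonlinear activation: in principle a universal
# approximator, given enough hidden neurons.
model = nn.Sequential(
    nn.Linear(1, 64),  # 64 hidden neurons, chosen for illustration
    nn.Tanh(),         # the nonlinearity is essential for the theorem
    nn.Linear(64, 1),
)

# Fit the network to a continuous target function, here sin(x).
x = torch.linspace(-3.14, 3.14, 200).unsqueeze(1)
y = torch.sin(x)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(2000):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()

print(f"final MSE: {loss.item():.5f}")  # approaches zero as training proceeds
```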
It is hard to estimate how many different network architectures have been developed.
If you open Hugging Face, the popular AI model platform, today, you will find over a million models. Depending on the task, you would use different architectures, such as transformers for natural language processing and convolutional neural networks for image classification.
So why do we need so many neural network architectures?
In this post, I want to answer this question from a physics perspective: structures in the data motivate novel neural network architectures.
Symmetry and invariance
Physicists love symmetry. The fundamental laws of physics exhibit symmetry: for example, the motion of a particle can be described by the same equation regardless of where the particle is located in space and time.
Symmetry always implies invariance under some transformation. Ice crystals are an example of translational invariance: the smaller structures look the same regardless of where they appear in the larger context.
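To make this concrete in the image setting: a convolution commutes with translation, which is exactly the property convolutional networks build on. Here is a small NumPy/SciPy sketch, using circular shifts and wrap-around boundaries so the equality holds exactly:

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
image = rng.random((8, 8))
kernel = rng.random((3, 3))

def conv(x):
    # Wrap-around boundaries make the convolution exactly circular.
    return convolve2d(x, kernel, mode="same", boundary="wrap")

shift_then_conv = conv(np.roll(image, shift=2, axis=1))
conv_then_shift = np.roll(conv(image), shift=2, axis=1)

# Convolution commutes with translation:
print(np.allclose(shift_then_conv, conv_then_shift))  # True
```

Strictly speaking, this property is translation *equivariance*; the full network becomes translation *invariant* once pooling or global aggregation is added on top.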
Utilizing Symmetry: Convolutional Neural Networks
If you already know that there is some kind of symmetry in your data, you can take advantage of this fact to simplify your neural network architecture.
Let’s illustrate this with an image classification example. The panel shows three scenes containing a goldfish. The goldfish can appear anywhere in the image, but the image should always be classified as a goldfish.

Given enough training data, a feedforward neural network can certainly achieve this.
This network architecture requires the input images to be flattened. A weight is then assigned between each input layer neuron (representing one pixel of the image) and each hidden layer neuron. Likewise, weights are assigned between each neuron in the hidden layer and the output layer.
Alongside this architecture, the panel also displays a “flattened” version of the three goldfish images above. Do they still look alike to you?
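To quantify this, consider a toy example (hypothetical 16x16 images, with a bright square standing in for the goldfish):

```python
import numpy as np

def toy_image(row, col, size=16):
    """A black image with a 4x4 bright square at (row, col)."""
    img = np.zeros((size, size))
    img[row:row + 4, col:col + 4] = 1.0
    return img

a = toy_image(2, 2)    # the "goldfish" in the top-left corner
b = toy_image(10, 10)  # the same "goldfish" in the bottom-right corner

flat_a, flat_b = a.flatten(), b.flatten()

# Cosine similarity of the flattened vectors is 0: to a fully connected
# network, the two images look completely unrelated.
cos = flat_a @ flat_b / (np.linalg.norm(flat_a) * np.linalg.norm(flat_b))
print(cos)  # 0.0
```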

By flattening the images, we run into two problems:
- Images containing similar objects no longer look alike once flattened
- For high-resolution images, we need to train an enormous number of weights connecting the input and hidden layers (see the sketch after this list)
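A rough back-of-the-envelope comparison of the second problem (the image size and layer widths are illustrative assumptions):

```python
# Fully connected: a 224x224 RGB image, flattened, feeding 1000 hidden neurons
n_inputs = 224 * 224 * 3           # 150,528 input neurons
n_hidden = 1000
fc_weights = n_inputs * n_hidden   # ~150 million weights in the first layer alone
print(f"fully connected: {fc_weights:,} weights")

# Convolutional: 32 kernels of size 5x5 applied across the 3 color channels
conv_weights = 5 * 5 * 3 * 32      # 2,400 weights, independent of image size
print(f"convolutional:   {conv_weights:,} weights")
```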
Convolutional neural networks, in contrast, operate with kernels. Kernel sizes typically range from 3 to 7 pixels, and the kernel parameters are learned during training.
The kernel is applied to the image in a raster-like fashion. A convolutional layer has multiple kernels, allowing each kernel to focus on a different aspect of the image.
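A minimal sketch of this raster-like application, for a single kernel with no padding and stride 1 (the averaging kernel is chosen only for illustration):

```python
import numpy as np

def apply_kernel(image, kernel):
    """Slide a kernel across an image (no padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Weighted sum of the image patch under the kernel.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.default_rng(1).random((8, 8))
kernel = np.ones((3, 3)) / 9  # a simple averaging kernel
print(apply_kernel(image, kernel).shape)  # (6, 6)
```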

For example, one kernel might pick up on horizontal lines in the image, while another might pick up on convex curves.
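For instance, a hand-crafted Sobel-style kernel responds strongly to horizontal edges. In a trained network, such kernels emerge from the data rather than being specified by hand:

```python
import numpy as np
from scipy.signal import correlate2d

# A Sobel-style kernel that responds to horizontal edges.
horizontal_edge = np.array([
    [-1, -2, -1],
    [ 0,  0,  0],
    [ 1,  2,  1],
])

# A toy image whose top half is dark and bottom half is bright.
image = np.zeros((8, 8))
image[4:, :] = 1.0

# correlate2d matches the "convolution" computed by deep learning
# frameworks (no kernel flip). The response is nonzero only near the edge.
response = correlate2d(image, horizontal_edge, mode="valid")
print(response[:, 0])  # [0. 0. 4. 4. 0. 0.]
```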
Convolutional neural networks retain the spatial order of pixels, which makes them well suited to learning local structures. Convolutional layers can be stacked to build deep networks. Combined with pooling layers, they can learn increasingly high-level features.
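A minimal sketch of such a stack in PyTorch (the layer widths, the 64x64 input size, and the 10 output classes are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Two convolution + pooling blocks followed by a small classification head.
cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # 16 kernels over the RGB input
    nn.ReLU(),
    nn.MaxPool2d(2),                              # pooling: 64x64 -> 32x32
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # deeper layer, higher-level features
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 32x32 -> 16x16
    nn.Flatten(),
    nn.Linear(32 * 16 * 16, 10),                  # e.g. 10 image classes
)

x = torch.randn(1, 3, 64, 64)  # one 64x64 RGB image
print(cnn(x).shape)            # torch.Size([1, 10])
```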
The resulting network is much smaller than a fully connected neural network: a convolutional layer only requires kernel_size x kernel_size x n_kernels trainable parameters (per input channel).
You save memory and compute budget by exploiting the fact that your objects can appear anywhere in your image!
More advanced deep learning architectures that exploit symmetry include graph neural networks and physics-informed neural networks.
Summary
Convolutional neural networks work well on images because they retain local information. Instead of flattening all pixels, which renders the image meaningless, they use kernels with learnable parameters that pick up on local features.