Hands-On Attention Mechanism for Time Series Classification, with Python

The attention mechanism is a game changer in machine learning. In fact, in the recent history of deep learning, the idea of letting models focus on the most relevant parts of the input sequence when making a prediction has completely changed the way we look at neural networks.
That being said, I have a controversial view on the attention mechanism:
The best way to study the attention mechanism is not through Natural Language Processing (NLP)
Technically, this is a controversial statement for two reasons.
- People naturally reach for NLP use cases (e.g., translation or next-sentence prediction) because NLP is the reason attention mechanisms were developed in the first place. The original goal was to overcome the limitations of RNNs and CNNs when dealing with long-range dependencies in language (if you haven’t already, you should really read the Attention Is All You Need paper).
- Secondly, I must also say that the general idea of putting “attention” on a specific word is very intuitive to grasp in a translation task.
That being said, if we want to understand how attention really works, hands-on, I believe time series is the best framework to use. I say this for several reasons.
- Computers are not really “made” to work with strings; they work with ones and zeros. All the embedding steps required to turn text into vectors add extra complexity that is not strictly related to the idea of attention.
- The attention mechanism, while initially developed for text, has many other applications (for example, in computer vision), so I like the idea of exploring attention from another perspective.
- With time series, specifically, we can create very small datasets and run our attention model in minutes (training included) without any fancy GPUs.
In this blog post, we will see how to build an attention mechanism for time series, specifically in a classification setup. We will work with sine waves and try to classify normal sine waves versus “modified” sine waves. A sine wave is “modified” by flattening a part of the original signal: somewhere in the wave, we simply remove the oscillation and replace it with a flat line, as if the signal had temporarily stopped or been corrupted.
To make things spicier, we will assume that the sine wave can have any frequency and amplitude, and that the position and extension (we call it length) of the “flattened” part are parameters as well. In other words, the sine can be any sinusoid, and we can put the “flat line” anywhere on it.
OK, OK, but why should we even bother with the attention mechanism? Why not use something simpler, such as a feedforward neural network (FFNN) or a convolutional neural network (CNN)?
Well, because, once again, we are assuming that the “modified” signal can be flattened anywhere in time, and that the flattened part can have any length. This means that standard neural networks are not as effective, because the anomalous “part” of the signal is not always in the same place. In other words, if you just process the series with a linear weight matrix plus a nonlinearity, you will get suboptimal results, because index 300 of time series 1 may represent something completely different from index 300 of time series 14. What we need is a dynamic approach that focuses on the anomalous part of the series, wherever it is. That is why (and where) the attention mechanism shines.
This blog post will be divided into the following 4 steps:
- Code setup. Before getting into the code, I will show the settings and all the libraries we need.
- Data generation. I will provide the code required for the data generation part.
- Model implementation. I will provide the implementation of the attention model.
- Exploring the results. We will evaluate the performance of our approach through the attention scores and the classification metrics.
It seems we have a lot of ground to cover. Let’s get started! 🚀
1. Code setup
Before digging into the code, let’s call in some friends: the libraries we need for the rest of the implementation.
These are standard libraries that we will use throughout the project. What you see below is the short and sweet requirements.txt file.
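A minimal requirements.txt for a project like this might look as follows (the exact packages and the absence of version pins are my assumption; adapt it to your environment):

```
numpy
matplotlib
torch
scikit-learn
```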
I love it when things are easy to change and modular, so I created a .json file where we can change everything about the setup. Some of these parameters are:
- The number of normal and abnormal time series (and the ratio between them)
- The number of time steps (how long your time series is)
- The size of the generated dataset
- The minimum and maximum position and length of the flattened (linear) part
- And more.
The .json file looks like this.
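Here is a sketch of such a configuration file; the key names and values are hypothetical, so adapt them to your own setup:

```json
{
    "num_normal": 5000,
    "num_anomalous": 5000,
    "num_timesteps": 1000,
    "min_flat_start": 100,
    "max_flat_start": 800,
    "min_flat_length": 50,
    "max_flat_length": 200,
    "test_size": 0.2,
    "val_size": 0.1,
    "random_seed": 42
}
```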
So, before moving to the next step, make sure you have:
- the constants file in your working folder
- the .json file in your working folder (or in a path you will remember)
- the libraries listed in the requirements.txt file installed
2. Data generation
Two simple functions build the normal and the modified (flattened) sine waves. The code can be found in data_utils.py:
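As a reference, here is a minimal sketch of what the two functions in data_utils.py could look like; the function names and defaults are assumptions, not the original implementation:

```python
import numpy as np

def generate_normal_sine(num_timesteps, amplitude, frequency, noise_std=0.05):
    """Build a plain sine wave with the given amplitude and frequency, plus a little noise."""
    t = np.linspace(0.0, 1.0, num_timesteps)
    signal = amplitude * np.sin(2 * np.pi * frequency * t)
    return signal + np.random.normal(0.0, noise_std, num_timesteps)

def generate_anomalous_sine(num_timesteps, amplitude, frequency, flat_start, flat_length, noise_std=0.05):
    """Build a sine wave, then flatten a chunk of it by holding one value constant."""
    signal = generate_normal_sine(num_timesteps, amplitude, frequency, noise_std)
    flat_end = min(flat_start + flat_length, num_timesteps)
    signal[flat_start:flat_end] = signal[flat_start]  # replace the oscillation with a flat line
    return signal
```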
Now that we have the basics, we can do all the backend work in data.py. This script is meant to be the home of all the functions that:
- Receive the setup information from the .json file (that’s why we need it!)
- Build the normal and modified sine waves
- Perform the train/test split, or the train/val/test split if a validation set is used for model selection
The data.py script is as follows:
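The original script is not reproduced here, but a minimal sketch of data.py, assuming the config keys from the .json example above and the helpers from data_utils.py, could be:

```python
import json
import numpy as np
from sklearn.model_selection import train_test_split
from data_utils import generate_normal_sine, generate_anomalous_sine

def load_config(path="config.json"):
    """Read the experiment settings from the .json file."""
    with open(path) as f:
        return json.load(f)

def build_dataset(config):
    """Generate normal (label 0) and anomalous (label 1) sine waves with random parameters."""
    X, y = [], []
    n_steps = config["num_timesteps"]
    for _ in range(config["num_normal"]):
        amp, freq = np.random.uniform(0.5, 2.0), np.random.uniform(1, 10)
        X.append(generate_normal_sine(n_steps, amp, freq))
        y.append(0)
    for _ in range(config["num_anomalous"]):
        amp, freq = np.random.uniform(0.5, 2.0), np.random.uniform(1, 10)
        start = np.random.randint(config["min_flat_start"], config["max_flat_start"])
        length = np.random.randint(config["min_flat_length"], config["max_flat_length"])
        X.append(generate_anomalous_sine(n_steps, amp, freq, start, length))
        y.append(1)
    return np.array(X, dtype=np.float32), np.array(y, dtype=np.int64)

def train_val_test_split(X, y, val_size=0.1, test_size=0.2, seed=42):
    """Split the data so that a validation set is available for early stopping."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y)
    X_train, X_val, y_train, y_val = train_test_split(
        X_train, y_train, test_size=val_size / (1 - test_size), random_state=seed, stratify=y_train)
    return X_train, X_val, X_test, y_train, y_val, y_test
```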
The companion script prepares the data for PyTorch (SineWaveTorchDataset) and looks like this:
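Again as a sketch, a PyTorch Dataset wrapper along these lines would do the job; the class name follows the post, the rest is an assumption:

```python
import torch
from torch.utils.data import Dataset

class SineWaveTorchDataset(Dataset):
    """Wraps the numpy arrays so a DataLoader can serve (sequence, label) pairs."""
    def __init__(self, X, y):
        # add a feature dimension: (num_series, num_timesteps, 1)
        self.X = torch.tensor(X, dtype=torch.float32).unsqueeze(-1)
        self.y = torch.tensor(y, dtype=torch.long)

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]
```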
If you are curious, this is a random anomalous time series:
And this is a non-anomalous time series:

Now that we have the dataset, we can worry about the model implementation.
3. Model implementation
The model, the training loop, and the loader implementations can be found in the model code:
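To fix ideas, here is a minimal sketch of a bidirectional LSTM with a learned attention layer, in the spirit of the architecture described below; the class and parameter names are assumptions, not the original code:

```python
import torch
import torch.nn as nn

class AttentionClassifier(nn.Module):
    """Bidirectional LSTM encoder + learned attention pooling + linear classifier."""
    def __init__(self, input_dim=1, hidden_dim=64, num_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.attention = nn.Linear(2 * hidden_dim, 1)   # one score per time step
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, x):
        # x: (batch, seq_len, input_dim)
        outputs, _ = self.lstm(x)                     # (batch, seq_len, 2*hidden_dim)
        scores = self.attention(outputs).squeeze(-1)  # (batch, seq_len)
        alpha = torch.softmax(scores, dim=1)          # attention weights sum to 1 over time
        context = torch.bmm(alpha.unsqueeze(1), outputs).squeeze(1)  # weighted sum of LSTM states
        logits = self.classifier(context)
        return logits, alpha                          # alpha is returned so we can plot it later
```

Returning alpha alongside the logits is what lets us visualize the attention scores in the next section.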
Now, let me take a moment to explain why the attention mechanism is a game changer here. Unlike an FFNN or a CNN, which treats every time step the same way, attention dynamically highlights the parts of the sequence that matter most for the classification. This allows the model to “zoom in” on the anomalous part (regardless of where it appears), making it particularly powerful for irregular or unpredictable time series patterns.
Let me be more precise here and talk about the actual neural network.
In our model, we use a bidirectional LSTM to process the time series, capturing past and future context at each time step. Then, instead of feeding the LSTM output directly into the classifier, we compute attention scores over the entire sequence. These scores determine how much weight each time step gets when forming the final context vector used for classification. This means the model can focus on the meaningful parts of the signal (i.e., the flat anomaly), no matter where they occur.
Now, let’s connect the model and the data and see how our approach performs.
4. A hands-on example
4.1 Training the model
Given that most of our development lives in the backend, we can train the model with this super simple code block.
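As an illustration, a training block in this spirit could look like the sketch below. It assumes the helpers from the earlier sketches are importable (the module names data, dataset, and model are hypothetical) and adds a simple early-stopping loop on the validation loss:

```python
import torch
from torch.utils.data import DataLoader
# hypothetical module layout: the earlier sketches saved as data.py, dataset.py, model.py
from data import load_config, build_dataset, train_val_test_split
from dataset import SineWaveTorchDataset
from model import AttentionClassifier

config = load_config("config.json")
X, y = build_dataset(config)
X_train, X_val, X_test, y_train, y_val, y_test = train_val_test_split(X, y)

train_loader = DataLoader(SineWaveTorchDataset(X_train, y_train), batch_size=32, shuffle=True)
val_loader = DataLoader(SineWaveTorchDataset(X_val, y_val), batch_size=32)

model = AttentionClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = torch.nn.CrossEntropyLoss()

best_val_loss, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(50):
    model.train()
    for xb, yb in train_loader:
        optimizer.zero_grad()
        logits, _ = model(xb)
        loss = criterion(logits, yb)
        loss.backward()
        optimizer.step()
    # simple early stopping on the validation loss
    model.eval()
    with torch.no_grad():
        val_loss = sum(criterion(model(xb)[0], yb).item() for xb, yb in val_loader) / len(val_loader)
    if val_loss < best_val_loss:
        best_val_loss, bad_epochs = val_loss, 0
        torch.save(model.state_dict(), "best_model.pt")
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break
```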
This took about 5 minutes to complete on a CPU.
Note that we implemented early stopping and a train/val/test split (in the backend) to avoid overfitting. We are responsible kids.
4.2 Visualizing the attention mechanism
Let’s use the following function to display the attention scores alongside the sine wave.
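Here is a sketch of such a plotting helper; the function name and the figure layout are my choices, not necessarily the ones used for the images below:

```python
import matplotlib.pyplot as plt
import torch

def plot_attention(model, series):
    """Plot a single time series together with its attention weights."""
    model.eval()
    with torch.no_grad():
        x = torch.tensor(series, dtype=torch.float32).view(1, -1, 1)
        _, alpha = model(x)
    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 5), sharex=True)
    ax1.plot(series)
    ax1.set_ylabel("Signal")
    ax2.plot(alpha.squeeze(0).numpy(), color="red")
    ax2.set_ylabel("Attention score")
    ax2.set_xlabel("Time step")
    plt.tight_layout()
    plt.show()
```

Calling something like plot_attention(model, X_test[i]) on a test series produces the kind of plot shown below.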
Let’s show the attention scores for a normal time series.

As we can see, the attention scores concentrate on the areas where the signal is locally flat (where the wave turns around over time). Nevertheless, these are just local spikes.
Now let’s look at an anomalous time series.

As we can see here, the model again identifies the area where the function flattens out. This time, though, it is not a local spike: the whole flattened part of the signal gets high scores. Bingo.
4.3 Classification performance
OK, this is nice, but does it actually work? Let’s use the following function to generate a classification report.
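A sketch of such a function, using scikit-learn metrics (the function name and the 0.5 decision threshold are assumptions), could be:

```python
import torch
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

def generate_classification_report(model, X_test, y_test):
    """Compute the main classification metrics on the test set."""
    model.eval()
    with torch.no_grad():
        logits, _ = model(torch.tensor(X_test, dtype=torch.float32).unsqueeze(-1))
        probs = torch.softmax(logits, dim=1)[:, 1].numpy()
        preds = (probs > 0.5).astype(int)
    print("Accuracy:", accuracy_score(y_test, preds))
    print("Precision:", precision_score(y_test, preds))
    print("Recall:", recall_score(y_test, preds))
    print("F1 score:", f1_score(y_test, preds))
    print("ROC AUC score:", roc_auc_score(y_test, probs))
    print("Confusion matrix:\n", confusion_matrix(y_test, preds))
```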
The results are as follows:
Accuracy: 0.9775
Precision: 0.9855
Recall: 0.9685
F1 score: 0.9769
ROC AUC score: 0.9774
Confusion matrix:
[[1002   14]
 [  31  953]]
All the metrics show very high performance. Works like a charm. 🙃
5. Conclusions
Thank you very much for reading this article ❤️. It means a lot. Let’s summarize what we found on this journey and why it is useful. In this blog post, we applied the attention mechanism to a time series classification task. The task is to tell normal time series apart from “modified” ones, where “modified” means that a part of the signal (with random location and random length) has been flattened and replaced with a straight line. We found that:
- The attention mechanism was originally developed for NLP, but it is also very good at identifying anomalies in time series data, especially when the location of the anomaly changes from sample to sample. Traditional CNNs or FFNNs struggle to achieve this flexibility.
- By combining a bidirectional LSTM with an attention layer, our model learns which parts of the signal matter most. We can see this a posteriori through the attention scores (alpha), which reveal which time steps are most relevant for the classification. The framework is transparent and explainable: we can visualize the attention weights to understand why the model makes certain predictions.
- In just a few minutes, we trained a highly accurate model (F1 score ≈ 0.98), showing that attention-based models are accessible and powerful even for small projects.
6. About me!
Thank you again for your precious time. This means a lot ❤️
My name is Piero Paialunga, and I’m this guy here:

I am a Ph.D. candidate in the Aerospace Engineering Department at the University of Cincinnati. I talk about AI and machine learning in my blog posts, on LinkedIn, and on Towards Data Science. If you liked this article and want to learn more about machine learning and follow my studies, you can:
A. Follow me on LinkedIn, where I publish all my stories
B. Follow me on GitHub, where you can see all my code
C. For questions, you can send me an email at [email protected]
Ciao!