Audio spectrogram transformer outside the laboratory

Want to know what attracted me to soundscape analysis?
This is a field that combines science, creativity, and exploration, and few people do it. First, your lab is wherever you bring it: forest trails, city parks, or remote mountain paths can all become spaces for scientific discovery and acoustic investigation. Second, monitoring a chosen geographical area is all about creativity. Innovation is at the heart of environmental audio research, whether it's building custom devices, hiding sensors in the canopy, or running off-grid setups on solar power. Finally, the sheer amount of data is incredible. As in spatial analysis, all methods are fair game: from hours of animal calls to the subtle hum of urban machinery, the acoustic data you collect can be huge and complex, opening the door to everything from deep learning to geographic information systems (GIS).
After an earlier soundscape-analysis adventure on one of Poland's rivers, I decided to raise the bar and design and implement a solution that can analyze soundscapes in real time. In this blog post, you will find a description of the proposed method, along with the code that powers the entire process, which mainly relies on audio spectrogram transformers (ASTs) for sound classification.
Method
Setup
There are many reasons why I chose the Raspberry Pi 4 and the AudioMoth for this particular setup. Trust me, I tested a wide range of devices: from several boards in the Raspberry Pi family to various Arduino models, including the Portenta, all the way to the Jetson Nano. And that is just the computing side; choosing the right microphone turned out to be even more complicated.
In the end, I settled on the Pi 4 B (4 GB RAM) due to its stable performance and relatively low power consumption (roughly 700 mA while running my code). Pairing it with the AudioMoth in USB microphone mode also gives me a lot of flexibility during prototyping. The AudioMoth is a powerful device with a wealth of configuration options, such as sampling rates ranging from 8 kHz up to an impressive 384 kHz. In the long run, I have a strong feeling it will prove ideal for my soundscape research.

Capture sound
Capturing audio from a USB microphone with pure Python turned out to be quite cumbersome. After struggling with various libraries for a while, I decided to go back to good old Linux arecord. The entire sound capture mechanism boils down to the following command:
arecord -d 1 -D plughw:0,7 -f S16_LE -r 16000 -c 1 -q /tmp/audio.wav
I deliberately used a plughw device to enable automatic conversion in case the requested parameters don't exactly match the USB microphone's native configuration. ASTs operate on 16 kHz audio, so I set both the recording and the rest of the audio pipeline to this sampling rate.
Pay attention to the generator in the code below. Importantly, the device continuously captures audio for the interval I specify. My goal is to keep only the latest audio sample on the device and discard it right after classification. This approach will be especially useful later, during large-scale research in urban areas, as it helps ensure privacy and GDPR compliance.
import asyncio
import re
import subprocess
from tempfile import TemporaryDirectory
from typing import Any, AsyncGenerator

import librosa
import numpy as np


class AudioDevice:
    def __init__(
        self,
        name: str,
        channels: int,
        sampling_rate: int,
        format: str,
    ):
        self.name = self._match_device(name)
        self.channels = channels
        self.sampling_rate = sampling_rate
        self.format = format

    @staticmethod
    def _match_device(name: str):
        # Find the ALSA card/device pair whose description contains `name`.
        lines = subprocess.check_output(['arecord', '-l'], text=True).splitlines()
        devices = [
            f'plughw:{m.group(1)},{m.group(2)}'
            for line in lines
            if name.lower() in line.lower()
            if (m := re.search(r'card (\d+):.*device (\d+):', line))
        ]
        if len(devices) == 0:
            raise ValueError(f'No devices found matching `{name}`')
        if len(devices) > 1:
            raise ValueError(f'Multiple devices found matching `{name}` -> {devices}')
        return devices[0]

    async def continuous_capture(
        self,
        sample_duration: int = 1,
        capture_delay: int = 0,
    ) -> AsyncGenerator[np.ndarray, Any]:
        # Record into a temporary WAV file, load it, yield the samples, repeat.
        with TemporaryDirectory() as temp_dir:
            temp_file = f'{temp_dir}/audio.wav'
            command = (
                f'arecord '
                f'-d {sample_duration} '
                f'-D {self.name} '
                f'-f {self.format} '
                f'-r {self.sampling_rate} '
                f'-c {self.channels} '
                f'-q '
                f'{temp_file}'
            )
            while True:
                subprocess.check_call(command, shell=True)
                data, sr = librosa.load(
                    temp_file,
                    sr=self.sampling_rate,
                )
                await asyncio.sleep(capture_delay)
                yield data
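As a minimal sketch of how the generator can be consumed, assuming the microphone shows up as 'AudioMoth' in the arecord -l output and that one-second samples are enough:

async def capture_demo():
    device = AudioDevice(
        name='AudioMoth',   # matched against the output of `arecord -l`
        channels=1,
        sampling_rate=16000,
        format='S16_LE',
    )
    async for sample in device.continuous_capture(sample_duration=1):
        # Each iteration yields one second of mono audio at 16 kHz.
        print(f'captured {len(sample)} samples')

asyncio.run(capture_demo())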
Classification
Now for the most exciting part.
Using audio spectrogram transformers (ASTs) and the excellent Hugging Face ecosystem, we can efficiently analyze audio and classify the detected sounds into over 500 categories.
Note that the system is ready to support various pre-trained models. By default, I use MIT/ast-finetuned-audioset-10-10-0.4593, because it gives the best results and runs well on the Raspberry Pi 4. However, onnx-community/ast-finetuned-audioset-10-10-0.4593-onnx is also worth exploring, especially its quantized version, which needs less memory and returns inference results faster.
You may notice that I don't limit the model to a single classification label, which is intentional. I don't assume there is only one sound source at any given time; instead, I apply a sigmoid function to the model's logits to get an independent probability for each class. This lets the model express confidence in multiple labels at the same time, which matters for real-world soundscapes, where overlapping sources such as birds, wind, and distant traffic are common. Taking the top five ensures the system captures the most plausible events in the sample without forcing a single winner.
from pathlib import Path
from typing import Optional

import numpy as np
import pandas as pd
import torch
from optimum.onnxruntime import ORTModelForAudioClassification
from transformers import AutoFeatureExtractor, ASTForAudioClassification


class AudioClassifier:
    def __init__(self, pretrained_ast: str, pretrained_ast_file_name: Optional[str] = None):
        # Load either the ONNX export (via optimum) or the regular PyTorch checkpoint.
        if pretrained_ast_file_name and Path(pretrained_ast_file_name).suffix == '.onnx':
            self.model = ORTModelForAudioClassification.from_pretrained(
                pretrained_ast,
                subfolder='onnx',
                file_name=pretrained_ast_file_name,
            )
            self.feature_extractor = AutoFeatureExtractor.from_pretrained(
                pretrained_ast,
                file_name=pretrained_ast_file_name,
            )
        else:
            self.model = ASTForAudioClassification.from_pretrained(pretrained_ast)
            self.feature_extractor = AutoFeatureExtractor.from_pretrained(pretrained_ast)

        self.sampling_rate = self.feature_extractor.sampling_rate

    async def predict(
        self,
        audio: np.ndarray,
        top_k: int = 5,
    ) -> pd.DataFrame:
        with torch.no_grad():
            inputs = self.feature_extractor(
                audio,
                sampling_rate=self.sampling_rate,
                return_tensors='pt',
            )
            logits = self.model(**inputs).logits[0]
            # Sigmoid instead of softmax: each class gets an independent probability.
            proba = torch.sigmoid(logits)
            top_k_indices = torch.argsort(proba)[-top_k:].flip(dims=(0,)).tolist()
            return pd.DataFrame(
                {
                    'label': [self.model.config.id2label[i] for i in top_k_indices],
                    'score': proba[top_k_indices],
                }
            )
To run the ONNX version of the model, you need to add optimum to the dependencies.
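As a quick sketch of both ways to instantiate the classifier; the exact repository id and the ONNX file name used below are assumptions, so check the model cards on the Hugging Face Hub before relying on them:

# Default PyTorch checkpoint.
classifier = AudioClassifier(
    pretrained_ast='MIT/ast-finetuned-audioset-10-10-0.4593',
)

# ONNX variant (requires `optimum[onnxruntime]`); the file name
# 'model_quantized.onnx' is an assumption, verify it in the repository.
onnx_classifier = AudioClassifier(
    pretrained_ast='onnx-community/ast-finetuned-audioset-10-10-0.4593-onnx',
    pretrained_ast_file_name='model_quantized.onnx',
)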
Sound pressure level
In addition to audio classification, I also capture information about sound pressure levels. This way the system not only identifies what is making a sound but also gains insight into how strong each sound is. The result is a richer, more realistic representation of the acoustic scene, which can ultimately be used to extract finer-grained information about noise pollution.
import numpy as np
from maad.spl import wav2dBSPL
from maad.util import mean_dB


async def calculate_sound_pressure_level(audio: np.ndarray, gain=10 + 15, sensitivity=-18) -> np.ndarray:
    # Convert the waveform to dB SPL per sample, then average over time.
    x = wav2dBSPL(audio, gain=gain, sensitivity=sensitivity, Vadc=1.25)
    return mean_dB(x, axis=0)
The gain (preamp + amp), sensitivity (dB/V), and Vadc (V) values above are specific to the AudioMoth and were confirmed experimentally. If you are using another device, you will have to look these values up in its technical specification.
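Putting the pieces together, a rough sketch of the capture, classify, and measure loop might look like this, assuming the classes above are importable from the same module and reusing the same device name and model id as before:

async def monitor():
    device = AudioDevice(name='AudioMoth', channels=1, sampling_rate=16000, format='S16_LE')
    classifier = AudioClassifier(pretrained_ast='MIT/ast-finetuned-audioset-10-10-0.4593')
    async for sample in device.continuous_capture(sample_duration=1):
        predictions = await classifier.predict(sample, top_k=5)
        spl = await calculate_sound_pressure_level(sample)
        # Print instead of persisting; the storage step is described below.
        print(predictions)
        print(f'mean SPL: {spl:.1f} dB')

asyncio.run(monitor())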
Storage
Data from each sensor is synchronized to a PostgreSQL database every 30 seconds. The current urban soundscape monitor prototype uses an Ethernet connection, so I am not limited by network load. Devices in more remote areas will use GSM connections and synchronize data every hour.
label | score | device | sync_id | sync_time
Hum | 0.43894055 | yor | 9531b89a-4b38-4a43-946b-43ae2f704961 | 2025-05-26 14:57:49.104271
Mains hum | 0.3894045 | yor | 9531b89a-4b38-4a43-946b-43ae2f704961 | 2025-05-26 14:57:49.104271
Static | 0.06389702 | yor | 9531b89a-4b38-4a43-946b-43ae2f704961 | 2025-05-26 14:57:49.104271
Buzz | 0.047603738 | yor | 9531b89a-4b38-4a43-946b-43ae2f704961 | 2025-05-26 14:57:49.104271
White noise | 0.03204195 | yor | 9531b89a-4b38-4a43-946b-43ae2f704961 | 2025-05-26 14:57:49.104271
Bee, wasp, etc. | 0.40881288 | yor | 8477e05c-0b52-41b2-b5e9-727a01b9ec87 | 2025-05-26 14:58:40.641071
Fly, housefly | 0.38868183 | yor | 8477e05c-0b52-41b2-b5e9-727a01b9ec87 | 2025-05-26 14:58:40.641071
Insect | 0.35616025 | yor | 8477e05c-0b52-41b2-b5e9-727a01b9ec87 | 2025-05-26 14:58:40.641071
Speech | 0.23579548 | yor | 8477e05c-0b52-41b2-b5e9-727a01b9ec87 | 2025-05-26 14:58:40.641071
Buzz | 0.105577625 | yor | 8477e05c-0b52-41b2-b5e9-727a01b9ec87 | 2025-05-26 14:58:40.641071
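The exact schema isn't shown in this post, but a minimal sketch of a matching table and insert logic could look like the following; the table name, column types, and connection string are assumptions:

import psycopg2

# Hypothetical table matching the columns shown above.
CREATE_SQL = """
CREATE TABLE IF NOT EXISTS soundscape_observations (
    label     TEXT,
    score     DOUBLE PRECISION,
    device    TEXT,
    sync_id   UUID,
    sync_time TIMESTAMP
);
"""

def store_predictions(predictions, device_name, sync_id, sync_time, dsn='postgresql://...'):
    # `predictions` is the DataFrame returned by AudioClassifier.predict().
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(CREATE_SQL)
        for row in predictions.itertuples(index=False):
            cur.execute(
                'INSERT INTO soundscape_observations VALUES (%s, %s, %s, %s, %s)',
                (row.label, float(row.score), device_name, str(sync_id), sync_time),
            )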
Results
I access this data through a separate application built with Streamlit and Plotly. Currently, it displays information about the location of each device, the SPL (sound pressure level) over time, the detected sound classes, and a range of acoustic indices.
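The real application is richer than this, but a bare-bones Streamlit sketch that reads the synced rows could look like the following, using the same hypothetical table name and connection string as in the storage sketch:

import pandas as pd
import streamlit as st
from sqlalchemy import create_engine

# Hypothetical connection string; point it at the PostgreSQL instance used for syncing.
engine = create_engine('postgresql://...')
df = pd.read_sql('SELECT * FROM soundscape_observations', engine)

st.title('Urban soundscape monitor')
# Most recent classifications first.
st.dataframe(df.sort_values('sync_time', ascending=False).head(50))
# How often each sound class has been detected so far.
st.bar_chart(df['label'].value_counts())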

That's where things stand for now. The plan is to expand the sensor network to around 20 devices scattered across multiple locations in my city. More information on sensor deployment over a larger area will be available soon.
Additionally, I'm collecting data from the deployed sensors and plan to share datasets, dashboards, and analyses in upcoming blog posts. I'll use an interesting approach to dig deeper into the audio classifications: the main idea is to match sound pressure levels with the detected audio classes, which I hope will lead to a better way of describing noise pollution. So stay tuned for a more detailed breakdown in the future.
Meanwhile, you can read a preliminary paper about my soundscape research (headphones are a must).
This article has been proofread and edited using Grammarly to improve grammar and clarity.