Introduction
Audio and video are widely considered separate phenomena. After all, you listen to music, but you watch a show. Yet both are the result of information streaming from one location to another and being translated by a piece of technology. Whether the data are interpreted by a television or a radio, the input format is similar from an information point of view.
So how could knowledge of one benefit the other? From an image processing standpoint, TV shows are a hypercube of data: pixels are arranged in a two-dimensional grid, assigned a color value (or, more precisely, three color values, one each for the red, green, and blue channels), and updated over time. One slice of such a data cube is a two-dimensional colored image. Similarly, a snippet of audio (a song, for example) can be mapped out as a range of frequencies over time, again forming a two-dimensional image. From an image processing point of view, there is very little difference between an array of audio (below, left) and the image of a kitten (below, right). Both images contain noise, both contain signal, and both can be processed and interpreted.
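The mapping from a one-dimensional waveform to a two-dimensional image can be made concrete with a spectrogram. The sketch below uses SciPy with illustrative parameters (the tone, sample rate, and window length are assumptions, not values from the article):

```python
import numpy as np
from scipy import signal

# Hypothetical example: one second of a 440 Hz tone buried in noise,
# sampled at 8 kHz.
fs = 8000
t = np.arange(fs) / fs
rng = np.random.default_rng(0)
audio = np.sin(2 * np.pi * 440 * t) + 0.5 * rng.standard_normal(fs)

# The spectrogram turns the 1-D waveform into a 2-D array:
# frequency bins on one axis, time frames on the other --
# structurally the same as a grayscale image.
freqs, times, Sxx = signal.spectrogram(audio, fs=fs, nperseg=256)

print(Sxx.shape)  # (frequency bins, time frames): a 2-D "image"
```

From here, any array-based image tool (filters, thresholds, morphology) can be applied to `Sxx` exactly as it would be to a photograph.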
Using the same tools
One set of tools immediately applicable to both images is noise removal: eliminating unwanted signals from each array. In the case of the audio array, a variety of noise profiles are clearly present. The array, with time on the horizontal axis and frequency on the vertical, contains noise sources in the form of vertical streaks, as well as horizontal interferers that are not visible at this resolution. With such a large variety of noise, it can be a confusing space in which to identify, characterize, and classify the noise, and to separate it from the strongest signals.
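Because the two noise shapes are geometrically distinct, directional filters can target each one. The following sketch, using assumed array sizes and a median filter rather than any specific technique from the article, shows the idea: a median taken along time suppresses narrow vertical streaks, while a median taken along frequency suppresses thin horizontal lines.

```python
import numpy as np
from scipy.ndimage import median_filter

# Hypothetical spectrogram: frequency on the vertical axis (rows),
# time on the horizontal axis (columns).
rng = np.random.default_rng(1)
spec = rng.random((128, 200))
spec[:, 50] += 5.0   # vertical streak: broadband burst at one instant
spec[40, :] += 5.0   # horizontal line: narrowband interferer

# Median across 9 time frames removes the one-column streak;
# median across 9 frequency bins removes the one-row line.
no_streaks = median_filter(spec, size=(1, 9))
no_lines = median_filter(spec, size=(9, 1))
```

The choice of window size trades off how wide an interferer can be removed against how much legitimate signal is smeared.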
In the below set of images, an attempt is made to remove one specific type of noise. The image on the left (a) illustrates the raw audio data processed via a Fourier transform. The magnitude and phase of the data are then (b) shifted and separated before (c) undergoing an image-based noise-removal process. The cleaned data are then (d) re-shifted to recover the initial signal format.
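The (a) through (d) steps can be sketched end to end as follows. This is a minimal illustration, not the article's actual pipeline: the STFT parameters and the 3x3 median denoiser are assumptions standing in for whatever image-based cleaning was used.

```python
import numpy as np
from scipy import signal
from scipy.ndimage import median_filter

fs = 8000
rng = np.random.default_rng(2)
audio = (np.sin(2 * np.pi * 440 * np.arange(fs) / fs)
         + 0.3 * rng.standard_normal(fs))

# (a) Fourier transform (short-time, so a time axis is kept)
f, t, Z = signal.stft(audio, fs=fs, nperseg=256)

# (b) separate magnitude and phase
mag, phase = np.abs(Z), np.angle(Z)

# (c) image-based noise removal on the magnitude "image"
clean_mag = median_filter(mag, size=(3, 3))

# (d) recombine with the original phase and invert
#     to recover the initial signal format
_, cleaned = signal.istft(clean_mag * np.exp(1j * phase),
                          fs=fs, nperseg=256)
```

Keeping the original phase and cleaning only the magnitude is a common simplification, since the magnitude is where the image-like structure lives.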
Making a difference
If we ‘zoom out’ of the data, we can gain more appreciation for the impact image processing algorithms can have on audio datasets. Below, the top images (a, b) illustrate the “before” and “after” images of the audio stream in Python. The cleaned data can then be fed back into the audio analysis tools to create the same comparison (c, d). Immediately, the improvement in audio clarity from background-noise removal becomes apparent. For the most part, the data cleaned during the image-processing step translate into audio fuzz, similar to the static on an older television set. But the image-cleaning algorithms can be further used to target specific signal types for elimination or enhancement.
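One rough way to check that the removed component really is broadband "fuzz" rather than structured signal is spectral flatness: the ratio of the geometric to the arithmetic mean of the power spectrum, which is near 1 for white noise and near 0 for tones. The function name and data below are illustrative, not from the article.

```python
import numpy as np

def spectral_flatness(x):
    # Geometric mean / arithmetic mean of the power spectrum;
    # the small epsilon guards the log against zero bins.
    p = np.abs(np.fft.rfft(x)) ** 2 + 1e-12
    return np.exp(np.mean(np.log(p))) / np.mean(p)

rng = np.random.default_rng(3)
noise_like = rng.standard_normal(4096)                   # removed residual
tone_like = np.sin(2 * np.pi * 0.05 * np.arange(4096))   # a real signal

print(spectral_flatness(noise_like) > spectral_flatness(tone_like))  # True
```

If the residual's flatness drops well below that of white noise, the cleaning step is likely eating structured signal, not just static.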
Down the rabbit hole
In all fairness, it is not always obvious which data are signal and which are noise. Even in circumstances where the noise is well characterized and removed, the audio stream may still have secrets to reveal. Consider the below dataset as an example. The first image (a) is the uncleaned data. The uncleaned dataset was then processed to remove the signals, leaving only (b) the noise profile. In turn, the noise array can be used to identify and characterize different sources of noise in the audio stream.
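One way to perform this signal/noise split, assuming the signals are brief and the noise floor varies slowly, is a long median along the time axis: the median keeps the floor and ignores short bursts, and subtracting it leaves the sparse signals. The array sizes and window length below are assumptions for illustration.

```python
import numpy as np
from scipy.ndimage import median_filter

# Hypothetical spectrogram with a noise floor near 1.0
# and one brief, strong signal.
rng = np.random.default_rng(4)
spec = 1.0 + 0.1 * rng.standard_normal((64, 500))
spec[20, 100:120] += 10.0   # short burst: 20 frames in one frequency bin

# A 51-frame median along time rejects the 20-frame burst,
# yielding (b): the noise profile with the signals removed.
noise_profile = median_filter(spec, size=(1, 51))
signals = spec - noise_profile
```

The `noise_profile` array can then be examined on its own, exactly as described above, to classify the remaining noise sources.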
But surprises begin as the noise source is picked apart. Zooming in on the first 1/50th of the datastream yields image (c): an interestingly shaped audio signal pulsing in time, with a broad frequency profile. Zooming further into the solid line above the top pulsed noise profile (about 1/5 of the way down the image) yields image (d): the signal becomes stronger, but an obvious secondary pulse appears on top of the zoomed-in image. Zooming in yet again on the profile of the same solid line, we see that even this signal has a repetitive pulse to it, and varies in the frequency domain as well. Was this hidden “signal” noise, or was it a relevant signal we failed to remove?
From this analysis, it is apparent that, while helpful, image-based audio analysis becomes a problem of scale. Even with the proper high-frequency filters in place, seeing the high-frequency signals in the dataset may depend on how closely the dataset is examined.
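A toy illustration of this scale problem: a fast on/off pulse that is obvious at full resolution can average away entirely when the same row of a spectrogram is viewed at coarser time resolution. The data here are synthetic and purely illustrative.

```python
import numpy as np

t = np.arange(10000)
row = ((t // 5) % 2).astype(float)   # pulse toggling every 5 frames

# "Zoomed out" view: average each block of 100 frames into one pixel.
coarse = row.reshape(-1, 100).mean(axis=1)

# At full resolution the pulse has large variance; zoomed out,
# every coarse pixel averages to the same value and the pulse vanishes.
print(row.std() > 0.4, coarse.std() < 0.01)  # True True
```

Any fixed display or analysis resolution implicitly applies such an average, which is why examining the data at several zoom levels matters.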
Conclusion
Although light on technical detail, this article attempts to quickly illustrate the power of merging image processing techniques with audio analysis. As an image processing subject matter expert, I find this problem an ongoing topic of interest.