Introduction
Think back to the last time you watched a movie and found yourself more intrigued by a scene’s background than by its subjects. Maybe it was a walking scene from Lord of the Rings, or maybe you were counting the number of passes the white team makes with a basketball. In both cases, your brain has been handed an intriguing image processing task: separating the foreground and background of a series of individual frames and stitching them back together in a coherent manner. Consider the much simpler case of an individual walking through a leaf-covered field, illustrated below.
Although only three frames are shown, you can already begin to interpolate the frames that exist between them: the individual starts closer to the camera at the top of the hill and only their legs are visible. As time progresses, the individual turns away from the camera and begins walking down the hill, bringing their entire body into view. Since the camera remains stationary, the background stays fixed from frame to frame. But what if we wanted to see the background in a single image without waiting for the individual to walk out of the scene entirely? The digital signal processing (DSP) answer to this is called “background estimation”.
Using each frame
Unsurprisingly, the algorithm best suited to the situation at hand depends on the dataset. In the case of removing the moving person from the environment, the solution may be as simple as time-averaging the frames and then applying additional logic to separate the foreground pixels from the background pixels. The idea behind this technique is that the undesirable object moves from frame to frame: averaging a frame in which the person occupies pixel [math] (X_0,Y_0) [/math] with frames in which they do not greatly dilutes the person’s contribution to that pixel over time. As a consequence, the more frames in which the person is absent from pixel [math] (X_0,Y_0) [/math], the more that pixel’s average will resemble the background.
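As a rough sketch of this idea (assuming the frames have already been loaded as equally sized grayscale NumPy arrays; the function names are hypothetical), time-averaging reduces to a single reduction along the time axis. A per-pixel median is often even more robust than a mean, since the subject is ignored entirely wherever it occupies a pixel in fewer than half of the frames.

```python
import numpy as np

def estimate_background_mean(frames):
    """Average a stack of grayscale frames along the time axis.

    Pixels the moving subject occupies in only a few frames end up
    dominated by the background values from all the other frames.
    """
    stack = np.stack(frames, axis=0).astype(np.float64)  # (time, rows, cols)
    return stack.mean(axis=0)

def estimate_background_median(frames):
    """Per-pixel median over time; robust to brief foreground occlusions."""
    stack = np.stack(frames, axis=0).astype(np.float64)
    return np.median(stack, axis=0)
```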
By the same token, knowing what the background looks like helps you understand what the moving subject looks like. The image below helps illustrate this point. With a well-understood background, much of the structure of the hillside, leaves, twigs, etc. is removed from the frame, and we are left with the rough outline of the individual.
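To make that concrete, a minimal sketch (continuing with grayscale NumPy arrays, a hypothetical function name, and an illustrative threshold) simply compares each frame against the estimated background and keeps the pixels that disagree:

```python
import numpy as np

def foreground_mask(frame, background, threshold=25.0):
    """Mark pixels that differ from the background by more than `threshold`
    grayscale levels; those pixels are treated as the moving subject."""
    diff = np.abs(frame.astype(np.float64) - background)
    return diff > threshold  # boolean mask: True where the subject is
```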
As a consequence of isolating the mobile subject, we could apply an edge filter to better define the individual’s outline, or we could take the non-zero pixels of the background-subtracted frame and use them to compute a center of mass. In turn, the center of mass would let us calculate the speed and trajectory of the subject across multiple frames.
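A sketch of that follow-on step, under the same assumptions (a boolean foreground mask per frame, coordinates measured in pixels, hypothetical names), might look like this:

```python
import numpy as np

def center_of_mass(mask):
    """Mean (row, column) of the foreground pixels in a boolean mask."""
    rows, cols = np.nonzero(mask)
    if rows.size == 0:
        return None  # no foreground detected in this frame
    return rows.mean(), cols.mean()

def pixel_speed(com_prev, com_next):
    """Subject speed in pixels per frame between two consecutive frames."""
    return float(np.hypot(com_next[0] - com_prev[0],
                          com_next[1] - com_prev[1]))
```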
Color coded
The approach we’ve discussed so far uses frames from a black and white video. When color information is added, we gain the ability to isolate specific colors within the frame. For instance, consider the same sequence of frames, but imagine the individual’s sweater is red. Given the (dead) nature of the vegetation within the scenes, we can assume that the sky is probably blue (clear) or gray (overcast). We can further assume that the leaves and twigs are some shade of dull brown, and the hill may also be a light green or gray. Given these conditions, a bright red sweater would be easy to isolate by simply filtering on values in the red color channel, and we could do the same with the deep blue or black of the individual’s pants.
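A sketch of that color filter (with illustrative 8-bit thresholds; the exact cutoffs would need tuning against real footage) keeps pixels that are strong in the red channel while rejecting the browns and grays, which have comparable values in all three channels:

```python
import numpy as np

def red_sweater_mask(rgb_frame, red_min=150, other_max=100):
    """Keep pixels that are bright in the red channel but dim in green and
    blue, so brown leaves and gray sky are rejected along with the hill."""
    r = rgb_frame[..., 0].astype(np.int32)
    g = rgb_frame[..., 1].astype(np.int32)
    b = rgb_frame[..., 2].astype(np.int32)
    return (r > red_min) & (g < other_max) & (b < other_max)
```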
So what happens when the background itself is a color gradient? In this case, color information still turns out to be incredibly helpful. Consider the case below, where a pink smudge has been artificially added to a gradient blue background. Interestingly, our eyes (and computer monitors) have a difficult time discriminating between the pink and bluish-purple pixels. But we can still use color information from the image to generate an artificial gradient similar to the background.
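One way to build such an artificial gradient (a sketch only; it assumes the background varies smoothly from top to bottom and that the smudge covers a minority of each row) is to fit a low-order polynomial to the row-wise median of each color channel and subtract the result:

```python
import numpy as np

def estimate_gradient_background(channel, order=2):
    """Fit a low-order polynomial to the row-wise median of one color
    channel, giving a smooth top-to-bottom model of the background."""
    rows = np.arange(channel.shape[0])
    row_medians = np.median(channel, axis=1)      # robust to the smudge
    coeffs = np.polyfit(rows, row_medians, order)
    gradient = np.polyval(coeffs, rows)
    return np.tile(gradient[:, None], (1, channel.shape[1]))

def isolate_smudge(rgb_image):
    """Subtract the fitted gradient from each channel; what remains is the
    part of the image the smooth background model cannot explain."""
    residual = np.zeros(rgb_image.shape, dtype=np.float64)
    for c in range(3):
        channel = rgb_image[..., c].astype(np.float64)
        residual[..., c] = channel - estimate_gradient_background(channel)
    return residual
```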
Beautifully, the background gradient estimation and subsequent removal work very, very well. They work so well, in fact, that the isolated smudge (right) appears taller and wider than it did in the original image! In other words, the smudge extended further than we could perceive against the gradient; by removing the background and isolating the signal, we’ve gained information. Additionally, the pixels defining the artificial smudge can be collected for further analysis. As with the individual in the first example, metrics like center of mass, isoline topologies, and movement can be computed in an automated manner.
Conclusion
Many times, we look at an image and assume the background is an integral, inseparable part of what we’re seeing. But in the case of movies (or even single images), the background can be estimated and removed. Removing the background lets us gain information about the scene, particularly when we are performing a time-series analysis. The method becomes even more powerful when it lets us reduce the amount of information in each frame of a video (thereby saving memory) over very large, long, or complex data campaigns.