Which signal processing technique is suitable for your device?
Acoustic signal processing techniques such as beamforming and blind source separation improve the intelligibility of recorded speech, but which technique is best for which application?
In an increasingly noisy world, it can be difficult to hear clearly. And that’s as true for electronic devices as it is for humans, which is a problem if they’re designed to pick up or respond to our voices. The signals reaching their microphones are a mixture of voices, background noise and other disturbances such as room reverberation. This means that the quality and intelligibility of recorded speech can be badly affected – leading to poor performance.
Intelligible speech is critical to technology ranging from telephones, computers and conference systems to transcription services, in-car infotainment, home assistants and hearing aids. Signal processing techniques such as beamforming and blind source separation (BSS) can help – but they have different advantages and disadvantages. So which technique is best for which application?
Audio beamforming is one of the most versatile multi-microphone methods for emphasizing a particular source in a soundstage. Beamformers can be divided into two types, depending on how they work – data-independent or adaptive. One of the simplest forms of data-independent beamformer is the delay-sum beamformer, where the microphone signals are delayed to compensate for the different path lengths between the target source and the different microphones. When the delayed signals are summed, a target source coming from the chosen direction combines coherently, while signals arriving from other directions combine at least partly destructively.
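The delay-compensate-and-sum idea above can be sketched in a few lines. This is a minimal illustration, not a production implementation: it assumes a far-field plane wave, uses integer-sample delays, and the function name and sign convention for the steering delays are our own.

```python
import numpy as np

def delay_sum_beamform(mic_signals, mic_positions, look_direction, fs, c=343.0):
    """Minimal delay-and-sum beamformer sketch (integer-sample delays).

    mic_signals:    (n_mics, n_samples) time-domain microphone signals
    mic_positions:  (n_mics, 3) microphone coordinates in metres
    look_direction: unit vector towards the target source (far field assumed)
    fs: sample rate in Hz; c: speed of sound in m/s
    """
    # Relative path-length differences along the look direction
    delays_sec = mic_positions @ look_direction / c
    delays_sec -= delays_sec.min()               # make all delays non-negative
    delays = np.round(delays_sec * fs).astype(int)

    n_mics, n_samples = mic_signals.shape
    out = np.zeros(n_samples)
    for m in range(n_mics):
        # Advance each channel so target arrivals line up, then average
        out += np.roll(mic_signals[m], -delays[m])
    return out / n_mics
```

Signals aligned with the look direction add in phase at the output; off-axis arrivals are partially cancelled by the averaging.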
However, in many consumer audio applications, these types of beamformers are of little use because they require the signal wavelength to be small compared with the size of the microphone array. They work well in high-end conference systems with 1 m diameter microphone arrays containing hundreds of microphones to cover a wide range of wavelengths. But these systems are expensive to manufacture and are therefore only suitable for the business conferencing market.
Consumer devices, on the other hand, typically have only a few microphones in a small array, so delay-sum beamformers struggle: the long wavelengths of speech dwarf the array aperture. For example, a delay-sum beamformer the size of a typical hearing aid cannot provide any directional discrimination at low frequencies – and at high frequencies its directivity is limited to front/back level discrimination.
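A quick back-of-the-envelope calculation shows the scale of the mismatch. The 1 cm aperture below is an illustrative assumption for a hearing-aid-sized array, not a figure from any particular product:

```python
# Compare speech wavelengths with a small consumer-device array aperture
c = 343.0          # speed of sound in air, m/s
aperture = 0.01    # assumed ~1 cm hearing-aid microphone spacing

for f_hz in (100, 500, 1000, 8000):
    wavelength = c / f_hz                     # lambda = c / f
    ratio = wavelength / aperture             # how many apertures per wavelength
    print(f"{f_hz:5d} Hz: wavelength {wavelength*100:6.1f} cm, "
          f"{ratio:7.1f}x the array aperture")
```

At 500 Hz the wavelength is roughly 69 cm – nearly seventy times the assumed aperture – so the phase difference across the array is tiny and delay-sum processing has almost nothing to work with.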
Another problem is the fact that sound does not travel in a straight line – a given source has many different paths to the microphone, each with different amounts of reflection and diffraction. This means that simple delay-sum beams are not very effective at extracting the source of interest from the acoustic scene. But they are very easy to implement and provide a small amount of benefit, so they were often used in older devices.
One adaptive beamformer is the minimum variance distortionless response (MVDR) beamformer. It passes the signal arriving from the target direction without distortion while minimizing the total power at the beamformer output. The effect is to preserve the target source while reducing noise and interference.
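The standard closed-form solution for the MVDR weights at a single frequency bin is w = R⁻¹d / (dᴴR⁻¹d), where R is the spatial covariance of the noise and interference and d is the steering vector for the target direction. A minimal sketch, with our own function name:

```python
import numpy as np

def mvdr_weights(R, d):
    """MVDR beamformer weights for one frequency bin.

    R: (n_mics, n_mics) noise-plus-interference spatial covariance matrix
    d: (n_mics,) steering vector towards the target direction

    Returns weights w satisfying w^H d = 1 (distortionless constraint)
    while minimizing the expected output power w^H R w.
    """
    Rinv_d = np.linalg.solve(R, d)            # R^{-1} d without explicit inverse
    return Rinv_d / (d.conj() @ Rinv_d)       # normalize for unity target gain
```

Note that both R and d must be estimated from the scene; as the article goes on to explain, errors in either estimate are exactly what makes MVDR fragile in practice.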
This technique may work well in ideal laboratory conditions, but in the real world, microphone misalignment and reverberation lead to inaccuracies in modeling the source location relative to the array. The result is that these beamformers often perform poorly in practice, because they start cancelling parts of the target source itself. A voice activity detector can be added to mitigate this target-cancellation problem, freezing the beamformer's adaptation while the target source is active. This works well when there is only one target source, but it is of limited use when there are multiple competing speakers.
In addition, MVDR beamforming—just like delay-sum beamforming and most other types of beamforming—requires calibrated microphones, as well as knowledge of the geometry of the microphone array and the direction of the target source. Some beamformers are very sensitive to the accuracy of this information and may reject a target source because it is not coming from the indicated direction.
Many modern devices use another adaptive beamforming technique, the adaptive sidelobe canceller, which attempts to cancel sources that are not in the direction of interest. These are the state of the art in modern hearing aids and allow the user to concentrate on sources directly in front of them. But a significant drawback is that you have to look at what you're listening to, which can be inconvenient if your visual attention is needed elsewhere – for example, when you're looking at a computer screen and trying to discuss what you're seeing with colleagues.
An alternative approach to improving speech intelligibility in noisy environments is to use BSS. Time-Frequency Masking BSS estimates the time-frequency envelope of each source and then attenuates the time-frequency points dominated by interference and noise. Another type of BSS uses linear multichannel filters. The acoustic scene is broken down into its component parts using statistical models of how sources generally behave. BSS then calculates a multichannel filter whose output best fits these statistical models. By doing so, it essentially isolates all the sources in the scene, not just one.
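Time-frequency masking can be illustrated with a toy sketch: given a magnitude-envelope estimate per source (from whatever statistical model is in use), keep each mixture bin for the source that dominates it. The function name and the binary-mask choice are our own simplifications; real systems typically use soft masks and far more sophisticated envelope models.

```python
import numpy as np

def tf_mask_separate(mix_stft, src_mag_estimates):
    """Toy time-frequency masking separation.

    mix_stft:          (n_freq, n_frames) complex STFT of the mixture
    src_mag_estimates: list of (n_freq, n_frames) magnitude-envelope
                       estimates, one per source
    Returns one masked STFT per source: each time-frequency point is
    assigned wholly to whichever source dominates it.
    """
    mags = np.stack(src_mag_estimates)       # (n_src, n_freq, n_frames)
    dominant = np.argmax(mags, axis=0)       # index of strongest source per bin
    outputs = []
    for k in range(mags.shape[0]):
        mask = (dominant == k).astype(float)  # binary mask for source k
        outputs.append(mask * mix_stft)
    return outputs
```

Bins dominated by interference or noise are simply zeroed for a given source, which is the "attenuate the time-frequency points dominated by interference" step described above.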
The multi-channel filter method can handle microphone mismatch and will handle echo and multiple competing speakers well. No prior knowledge of the source, microphone array, or acoustic scene is required, as all these variables are absorbed into the multichannel filter design. A change in microphone or a calibration error simply changes the optimal multichannel filter.
As BSS works from audio data rather than microphone geometry, it is a very robust approach that is insensitive to calibration problems and can generally achieve much greater source separation in real-world situations than any beamformer. And because it separates all sources regardless of direction, it can be used to automatically monitor a multi-way conversation. This is particularly useful for hearing aid applications where the user wants to follow the conversation without having to manually interact with the device. BSS can also be very effective when used in VoIP calls, smart home devices and in-car infotainment applications.
But BSS is not without problems. For most BSS algorithms, the number of sources that can be separated is limited by the number of microphones in the array. And because it works on the data, BSS needs a consistent frame of reference, which currently limits the technique to devices with a stationary microphone array – for example, a desktop hearing aid, a microphone array for a fixed conferencing system, or video calls from a phone or tablet held steady in your hands or resting on a table.
When background chatter occurs, BSS will generally separate the most dominant sources in the mix, which may include the annoyingly loud person at the next table. Thus, to work effectively, BSS needs to be combined with an auxiliary algorithm to determine which of the sources are the sources of interest.
BSS itself separates the sources very well, but it does not reduce background noise by more than about 9 dB. To achieve really good performance, it must be paired with noise reduction techniques. Many noise reduction solutions use artificial intelligence (AI) – Zoom and other conferencing systems take this approach, for example – to analyze the signal in the time-frequency domain and then try to identify which components are due to the signal and which are due to noise. This can work well with just one microphone. But the big problem with this technique is that it extracts the signal by dynamically gating the time-frequency content, which can produce unpleasant artifacts at poor signal-to-noise ratios (SNRs) and can introduce significant delay.
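The time-frequency gating described here can be sketched as a simple Wiener-style gain applied per bin. This is a generic textbook illustration of the idea, not the algorithm of any particular product; the gain floor shown is an assumed parameter that trades noise suppression against the musical-noise artifacts mentioned above.

```python
import numpy as np

def spectral_gate(noisy_stft, noise_psd, gain_floor=0.1):
    """Wiener-style time-frequency gain: attenuate noise-dominated bins.

    noisy_stft: (n_freq, n_frames) complex STFT of the noisy signal
    noise_psd:  (n_freq,) estimated noise power per frequency bin
    gain_floor: minimum gain; a higher floor suppresses less noise but
                reduces the artifacts that appear at poor SNRs
    """
    power = np.abs(noisy_stft) ** 2
    # Estimated speech fraction of each bin's power, clamped to the floor
    gain = np.maximum(1.0 - noise_psd[:, None] / np.maximum(power, 1e-12),
                      gain_floor)
    return gain * noisy_stft
```

At good SNRs the gain stays near one and the signal passes unchanged; at poor SNRs the gain is driven towards the floor, which is where the aggressive, artifact-prone behaviour comes from.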
A low-latency noise suppression algorithm combined with BSS, on the other hand, provides up to 26 dB of noise suppression and makes products suitable for real-time use – with a delay of only 5 ms and a more natural sound with less distortion than AI solutions. Hearing aids in particular need ultra-low latency to maintain lip-sync, because it's extremely off-putting for users if the sound they hear lags behind the mouth movements of the person they're talking to.
With an increasing number of signal processing techniques to choose from, it’s more important than ever to choose the right one for your application. The choice requires consideration of not only the performance you need, but also the situation in which you need the application to work and the physical limitations of the product you have in mind.
Dave Betts is chief scientific officer at the audio software specialist AudioTelligence. He has been solving complex audio problems for over 30 years, with experience ranging from sound restoration and audio forensics to designing innovative audio algorithms used in blockbuster movies. At AudioTelligence, Dave leads a team of researchers who deliver innovative commercial audio solutions for the consumer electronics, assistive hearing and automotive markets.