Cisco's new AI codec: Improved voice quality in Webex

Until now, Webex call quality has suffered badly on poor connections. With a new AI codec, Cisco is now catching up with Microsoft and Google.


By Benjamin Pfister
This article was originally published in German and has been automatically translated.

Real-time applications such as telephony and web meetings with audio and video place special demands on data transmission. In increasingly hybrid working environments, with participants joining from different locations or on the move, difficult network conditions often lead to poor audio quality. Cisco has now released an AI audio codec designed to deliver good call quality even at extremely low bandwidths and high packet loss rates. It is now officially available in Webex Meetings and Webex Calling.

Choppy sound and unintelligible, distorted speech are the result of high packet loss, high jitter or excessive delay in data transmission. Current audio techniques such as packet loss concealment do not adequately handle scenarios with high packet loss. According to Cisco, AI algorithms are now meant to improve this.

The term codec combines the words encoder and decoder: the encoder compresses the audio captured by a microphone to fit a given bit rate budget, and the decoder at the receiving end reconstructs a waveform that is as close as possible to the original.

In real-time communication systems, the recorded audio is typically divided into frames. The codec compresses these frames, which are then packetized for transmission over the network. With typical audio codecs such as G.711 A-law, this happens every 20 milliseconds. Transmission takes place over UDP, which offers no delivery guarantee, so the successful arrival of these audio packets depends on the stability and reliability of the network connection, which cannot always be guaranteed.
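The packet sizes that result from such a frame interval are simple arithmetic. The following is a minimal sketch, assuming the standard G.711 parameters (8 kHz sampling, 8 bits per sample, i.e. 64 kbit/s) and the 20-millisecond interval mentioned above; the function names are illustrative only and not taken from any Cisco software.

```python
def frame_payload_bytes(bitrate_bps: int, frame_ms: int) -> float:
    """Payload bytes the codec produces per frame at a constant bit rate."""
    return bitrate_bps * frame_ms / 1000 / 8


def packets_per_second(frame_ms: int) -> float:
    """Number of packets sent per second when each packet carries one frame."""
    return 1000 / frame_ms


# G.711: 8000 samples per second, 8 bits per sample
G711_BITRATE_BPS = 8000 * 8

print(frame_payload_bytes(G711_BITRATE_BPS, 20))  # 160.0
print(packets_per_second(20))                     # 50.0
```

Each lost packet therefore removes 20 milliseconds of speech, and at 50 packets per second even moderate loss rates quickly become audible.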

When packets are lost, packet loss concealment (PLC) has been used in some cases to mask the gaps: by repeating the audio data of the last successfully received packets, by replacing lost packets with silence, or by reconstructing plausible filler audio based on typical speech patterns. Retransmitting lost packets, by contrast, is not useful for real-time transmission, because the added latency would be too high.
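The two simplest concealment strategies, repetition and silence, can be sketched in a few lines. This is a generic textbook illustration, not the PLC used in Webex; the `conceal` function, the attenuation factor and the representation of a lost frame as `None` are assumptions made for the example.

```python
from typing import List, Optional

Frame = List[int]  # one frame of PCM samples


def conceal(frames: List[Optional[Frame]], frame_len: int,
            attenuation: float = 0.5) -> List[Frame]:
    """Fill lost frames (None) with a faded repeat of the last good frame,
    or with silence if nothing has been received yet."""
    out: List[Frame] = []
    last: Optional[Frame] = None
    for frame in frames:
        if frame is not None:
            last = frame
            out.append(frame)
        elif last is not None:
            # Repeat the previous audio, attenuated so that long bursts fade out
            last = [int(sample * attenuation) for sample in last]
            out.append(last)
        else:
            out.append([0] * frame_len)
    return out


received = [[100, -100], None, None, [40, 40]]
print(conceal(received, frame_len=2))
# [[100, -100], [50, -50], [25, -25], [40, 40]]
```

The sketch also shows the limit of the approach: with long loss bursts, the output degenerates into a fading echo of old audio, which is why such techniques struggle at high loss rates.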

According to Cisco, when network quality is poor, with packet loss above 30 percent, the Webex app on desktop or mobile is meant to switch to the Webex AI Codec automatically to improve quality. The basic idea is to encode voice data at a very low bit rate (6 kbit/s) and additionally transmit copies of previous frames at an even lower bit rate (1 kbit/s), so that the receiver can fall back on these copies when packets are lost.
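How redundant copies help against packet loss can be illustrated with a small simulation. The following sketch rests on assumptions: it carries exactly one low-bit-rate copy of the preceding frame per packet, uses strings as stand-ins for encoded audio, and the names `Packet`, `packetize` and `receive` are hypothetical. The actual packet layout of the Webex AI Codec is not described in this article.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Packet:
    seq: int
    primary: str              # stand-in for the 6 kbit/s encoding of frame seq
    redundant: Optional[str]  # stand-in for a 1 kbit/s copy of frame seq - 1


def packetize(frames: List[str]) -> List[Packet]:
    """Attach a low-rate copy of the previous frame to every packet."""
    packets = []
    for seq, frame in enumerate(frames):
        copy = f"lo({frames[seq - 1]})" if seq > 0 else None
        packets.append(Packet(seq, frame, copy))
    return packets


def receive(packets: List[Packet], total: int) -> List[Optional[str]]:
    """Rebuild the frame sequence, using redundant copies for lost packets."""
    by_seq: Dict[int, Packet] = {p.seq: p for p in packets}
    out: List[Optional[str]] = []
    for seq in range(total):
        if seq in by_seq:
            out.append(by_seq[seq].primary)
        elif seq + 1 in by_seq and by_seq[seq + 1].redundant is not None:
            out.append(by_seq[seq + 1].redundant)  # lower quality, but not a gap
        else:
            out.append(None)  # unrecoverable in this sketch
    return out


sent = packetize(["f0", "f1", "f2", "f3"])
arrived = [p for p in sent if p.seq not in (1, 2)]  # packets 1 and 2 are lost
print(receive(arrived, len(sent)))
# ['f0', None, 'lo(f2)', 'f3']
```

In this simplified model a single copy per packet only bridges isolated losses; frame 1 stays lost because its copy travelled in packet 2, which also went missing. Carrying copies of several previous frames would cover longer bursts at the cost of additional bit rate and delay.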

Vector quantization (VQ) is a method in which feature vectors are mapped to the nearest entry of a codebook, so that only the index of that entry has to be transmitted. Learning vector quantization is a related method from the field of artificial neural networks. A multi-stage VQ (residual VQ) uses several VQ layers, each of which quantizes the residual signal left over by the previous layer. In the Webex AI Codec, Cisco says it uses multi-stage VQ to further compress the voice data before transmission and thus transmit content at minimal bit rates.
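The principle of multi-stage quantization can be shown with a toy example. This is a minimal sketch of generic residual VQ, not Cisco's implementation: the two hand-picked codebooks `COARSE` and `FINE` are hypothetical, whereas a neural codec learns its codebooks during training and works on much larger feature vectors.

```python
from typing import List, Sequence

Codebook = Sequence[Sequence[float]]


def nearest(codebook: Codebook, v: Sequence[float]) -> int:
    """Index of the codebook entry with the smallest squared distance to v."""
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], v)))


def rvq_encode(v: Sequence[float], codebooks: Sequence[Codebook]) -> List[int]:
    """Quantize v stage by stage; each stage encodes the previous residual."""
    residual = list(v)
    indices = []
    for codebook in codebooks:
        i = nearest(codebook, residual)
        indices.append(i)
        residual = [r - c for r, c in zip(residual, codebook[i])]
    return indices


def rvq_decode(indices: Sequence[int], codebooks: Sequence[Codebook]) -> List[float]:
    """Sum the selected entries of all stages to approximate the input."""
    out = [0.0] * len(codebooks[0][0])
    for i, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[i])]
    return out


# Hypothetical toy codebooks with four entries each (2 bits per stage)
COARSE = [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0), (1.0, -1.0)]
FINE = [(0.0, 0.0), (0.25, 0.0), (0.0, 0.25), (-0.25, -0.25)]

codes = rvq_encode([1.2, 1.0], [COARSE, FINE])
print(codes, rvq_decode(codes, [COARSE, FINE]))
# [1, 1] [1.25, 1.0]
```

Only the two indices, 4 bits in total, would have to be transmitted here. The first stage alone would reconstruct (1.0, 1.0); the second stage reduces the error on the first component from 0.2 to 0.05.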

To train the neural codec system, Cisco says it injected various artifacts into clean voice signals, including background noise, reverberation, bandwidth limitation, packet loss and other impairments. More than 10,000 hours of clean speech and noise samples were used for training, which is meant to give the model a broad basis. The audio encoder uses a deep neural network to extract a comprehensive set of features, covering speech and background noise either jointly or separately. The extracted speech features include attributes such as volume, pitch modulation and accent. The neural encoder learns and refines its feature extraction on extensive and diverse data sets, which is meant to improve the representation. The codec is intended to complement the background noise suppression feature already familiar from Webex Meetings.
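Injecting artifacts into clean signals can be pictured as follows. The sketch covers only two of the impairments named above, additive noise and packet loss, and all details are assumptions for illustration: white Gaussian noise, a 16 kHz sampling rate, 20-millisecond frames and the function names `add_noise` and `drop_frames`. Cisco's actual training pipeline is not described at this level of detail.

```python
import math
import random
from typing import List


def add_noise(clean: List[float], snr_db: float, rng: random.Random) -> List[float]:
    """Mix in white Gaussian noise scaled to the requested signal-to-noise ratio."""
    signal_power = sum(s * s for s in clean) / len(clean)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in clean]


def drop_frames(signal: List[float], frame_len: int, loss_rate: float,
                rng: random.Random) -> List[float]:
    """Zero out whole frames at random to imitate lost packets."""
    out = list(signal)
    for start in range(0, len(out), frame_len):
        if rng.random() < loss_rate:
            for i in range(start, min(start + frame_len, len(out))):
                out[i] = 0.0
    return out


rng = random.Random(0)  # fixed seed so the example is reproducible
# 100 ms of a 440 Hz tone at an assumed 16 kHz sampling rate as a "clean" signal
clean = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
noisy = add_noise(clean, snr_db=10, rng=rng)
degraded = drop_frames(noisy, frame_len=320, loss_rate=0.3, rng=rng)
# (degraded, clean) would form one input/target pair for training
```

Training on such pairs teaches a model to produce clean output from impaired input, which is the general idea behind the procedure described above.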

More technical details on the Webex AI Codec can be found in a Cisco white paper.

(mma)