jmvalin | Entries tagged with speech

Note: This is a first-person account of my involvement in Opus. Since I was not part of the early SILK efforts mentioned below, I cannot speak about its early development. That part is omitted from this account but by no means is that intended to diminish its importance to Opus.

Opus is an open-source, royalty-free, highly versatile audio codec standard. It is now deployed in billions of devices. This is how it came to be. Even before Opus, I had a strong interest in open standards, which led me to start the Speex project in 2002, with help from David Rowe. Speex was one of the first modern royalty-free speech codecs. It was shipped in many applications, especially games, but because it was slightly inferior to the standard codecs of the time, it never achieved a critical mass of deployment.

In 2007, when working on a high-quality videoconferencing project as part of my post-doc, I realized the need for a high-fidelity audio codec that also had very low delay suitable for interactive, real-time applications. At the time, audio codecs were mostly divided into two categories: there were high-delay, high-fidelity transform codecs (like MP3, AAC, and Vorbis) that were unsuitable for real-time operation, and there were low-delay speech codecs (like AMR, Speex, and G.729) with limited audio quality.

That is why I started the Opus ancestor called CELT, an effort to create a high-fidelity transform codec with an ultra low delay around 4-8 ms — even lower than the 20 ms typical delay for VoIP and videoconferencing. My first step was to discuss with Christopher "Monty" Montgomery, who had previously designed Ogg Vorbis, a high-delay, high-fidelity codec, and was then looking at designing a successor. Even though our sets of goals proved too different for us to merge the two efforts, the discussion was very helpful in that I was able to gain some of the experience Monty got while designing Vorbis. The most important advice I got was "always make sure the shape of the energy spectrum is preserved". In Vorbis (and other codecs), that energy constraint was only partially achieved, through very careful tuning of the encoder, and sometimes at great cost in bitrate. For CELT, I attacked the problem from a different perspective: What if CELT could be designed so that the constraint was built into the format, and thus mathematically impossible to violate?

This is where the CELT name originated: Constrained Energy Lapped Transform. The format itself would constrain the energy so that no effort or bits would be wasted. Although simple in principle, that idea required completely new compression and math techniques that had never previously been used in transform codecs. One of them was algebraic vector quantization, which had been used for a long time in speech codecs, but never in transform codecs, which still used scalar quantization. Overall, it took about 2 years to figure out the core of the CELT technology, with the help of Tim Terriberry, Greg Maxwell, and other Xiph contributors.

Because of the ultra low delay constraint, CELT was not trying to match or exceed the bitrate efficiency of MP3 and AAC, since these codecs benefited from a long delay (100-200 ms). It was thus a complete surprise when — only 6 months after the first commits — a listening test showed CELT already out-performing MP3 despite the difference in delay. That was attributed to the ancient technology behind MP3. CELT was still behind the more recent AAC, with no plan to compete on efficiency alone.

Despite still being in early development, some people started using CELT for their projects, mostly because it was the only codec that would suit their needs. These early users greatly helped CELT to improve by providing real-life use cases and raising issues that could not have been foreseen with just "lab" testing. For example, a developer who was using CELT for network music performances (musicians playing live together in different cities) once complained that "CELT works very well for everyone, except for me with my bass guitar". By getting an actual sample, it was easy to find the problem and address it. There were many similar stories and over a few years, many parts of CELT were changed or completely rewritten.

There has been no mention of the name Opus so far because there was still a missing piece. Around the same time CELT was getting started, another codec effort was quietly started by Koen Vos, Søren Skak Jensen, and Karsten Vandborg Sørensen at Skype under the name SILK. SILK was a more traditional speech codec, but with state-of-the-art efficiency, competing with or exceeding other speech codecs. We became aware of SILK in 2009 when Skype proposed it as a royalty-free codec to the Internet Engineering Task Force (IETF), the main standards body governing the Internet. We immediately joined the effort, proposing CELT to the emerging working group. It was a highly political effort, given the presence of organizations heavily invested in royalty-bearing codecs. There was thus strong pressure into restricting the working group’s effort to standardizing a single codec. That drove us to investigate ways to combine SILK and CELT. The two codecs were surprisingly complementary, SILK being more efficient at coding speech up to 8 kHz, and CELT being more efficient at coding music and achieving delays below 10 ms. The only thing none of the codecs did very efficiently was coding high-quality speech covering the full audio bandwidth (up to 20 kHz). This is where both SILK and CELT could be used simultaneously and achieve high-quality, fullband speech codec at just 32 kb/s, something no other codecs could achieve. Opus was born and, thanks to the IETF collaboration, the result would be better than the sum of its SILK and CELT parts.

Integrating SILK and CELT required changes to both technologies. On the CELT side, it meant supporting and optimizing for frame sizes up to 20 ms — no longer ultra-low delay but low enough for videoconferencing. Through collaboration in the working group, CELT also gained a perceptual pitch post-filter contributed by Raymond Chen at Broadcom. The post-filter and the 20-ms frames also increased the efficiency to the point where some audio enthusiasts started comparing Opus to HE-AAC on music compression. Unsurprisingly, they found the higher-delay HE-AAC to have higher quality at the same bitrate, but they also started providing specific feedback that helped improve Opus. This went on for several months, until a listening test eventually showed Opus having higher quality than HE-AAC, despite HE-AAC being designed for much higher delays. At that point, Opus really became a universal audio codec. It was either on par or better than all other audio codecs, regardless of the application, be it speech, music, real-time, storage, or streaming.

Opus officially became an IETF standard in 2012. At the time, the IETF was also defining the WebRTC standard for videoconferencing on the web. Thanks to its efficiency and its royalty-free nature, Opus became the mandatory-to-implement standard for WebRTC. In part thanks to WebRTC, Opus is now included in all major browsers and in both the Android and iOS mobile operating systems. It is also used alongside AV1 in YouTube. Most large technology companies now ship products using Opus. This ensures inter-operability across different applications that can communicate with a common codec. Because there are no royalties, it also enables some products that would not otherwise be viable (e.g. because you can't afford to pay a 0.50$ royalty for each freely-downloaded copy of a client software).

As for many other codecs, only the Opus decoder is standardized, which means that the encoder can keep improving without breaking compatibility. This is how Opus keeps improving to this day, with the latest version, Opus 1.3, released in October 2018.