jmvalin | Entries tagged with speex

Note: This is a first-person account of my involvement in Opus. Since I was not part of the early SILK efforts mentioned below, I cannot speak about its early development. That part is omitted from this account but by no means is that intended to diminish its importance to Opus.

Opus is an open-source, royalty-free, highly versatile audio codec standard. It is now deployed in billions of devices. This is how it came to be. Even before Opus, I had a strong interest in open standards, which led me to start the Speex project in 2002, with help from David Rowe. Speex was one of the first modern royalty-free speech codecs. It was shipped in many applications, especially games, but because it was slightly inferior to the standard codecs of the time, it never achieved a critical mass of deployment.

In 2007, when working on a high-quality videoconferencing project as part of my post-doc, I realized the need for a high-fidelity audio codec that also had very low delay suitable for interactive, real-time applications. At the time, audio codecs were mostly divided into two categories: there were high-delay, high-fidelity transform codecs (like MP3, AAC, and Vorbis) that were unsuitable for real-time operation, and there were low-delay speech codecs (like AMR, Speex, and G.729) with limited audio quality.

That is why I started the Opus ancestor called CELT, an effort to create a high-fidelity transform codec with an ultra low delay around 4-8 ms — even lower than the 20 ms typical delay for VoIP and videoconferencing. My first step was to discuss with Christopher "Monty" Montgomery, who had previously designed Ogg Vorbis, a high-delay, high-fidelity codec, and was then looking at designing a successor. Even though our sets of goals proved too different for us to merge the two efforts, the discussion was very helpful in that I was able to gain some of the experience Monty got while designing Vorbis. The most important advice I got was "always make sure the shape of the energy spectrum is preserved". In Vorbis (and other codecs), that energy constraint was only partially achieved, through very careful tuning of the encoder, and sometimes at great cost in bitrate. For CELT, I attacked the problem from a different perspective: What if CELT could be designed so that the constraint was built into the format, and thus mathematically impossible to violate?

This is where the CELT name originated: Constrained Energy Lapped Transform. The format itself would constrain the energy so that no effort or bits would be wasted. Although simple in principle, that idea required completely new compression and math techniques that had never previously been used in transform codecs. One of them was algebraic vector quantization, which had been used for a long time in speech codecs, but never in transform codecs, which still used scalar quantization. Overall, it took about 2 years to figure out the core of the CELT technology, with the help of Tim Terriberry, Greg Maxwell, and other Xiph contributors.

Because of the ultra low delay constraint, CELT was not trying to match or exceed the bitrate efficiency of MP3 and AAC, since these codecs benefited from a long delay (100-200 ms). It was thus a complete surprise when — only 6 months after the first commits — a listening test showed CELT already out-performing MP3 despite the difference in delay. That was attributed to the ancient technology behind MP3. CELT was still behind the more recent AAC, with no plan to compete on efficiency alone.

Despite still being in early development, some people started using CELT for their projects, mostly because it was the only codec that would suit their needs. These early users greatly helped CELT to improve by providing real-life use cases and raising issues that could not have been foreseen with just "lab" testing. For example, a developer who was using CELT for network music performances (musicians playing live together in different cities) once complained that "CELT works very well for everyone, except for me with my bass guitar". By getting an actual sample, it was easy to find the problem and address it. There were many similar stories and over a few years, many parts of CELT were changed or completely rewritten.

There has been no mention of the name Opus so far because there was still a missing piece. Around the same time CELT was getting started, another codec effort was quietly started by Koen Vos, Søren Skak Jensen, and Karsten Vandborg Sørensen at Skype under the name SILK. SILK was a more traditional speech codec, but with state-of-the-art efficiency, competing with or exceeding other speech codecs. We became aware of SILK in 2009 when Skype proposed it as a royalty-free codec to the Internet Engineering Task Force (IETF), the main standards body governing the Internet. We immediately joined the effort, proposing CELT to the emerging working group. It was a highly political effort, given the presence of organizations heavily invested in royalty-bearing codecs. There was thus strong pressure into restricting the working group’s effort to standardizing a single codec. That drove us to investigate ways to combine SILK and CELT. The two codecs were surprisingly complementary, SILK being more efficient at coding speech up to 8 kHz, and CELT being more efficient at coding music and achieving delays below 10 ms. The only thing none of the codecs did very efficiently was coding high-quality speech covering the full audio bandwidth (up to 20 kHz). This is where both SILK and CELT could be used simultaneously and achieve high-quality, fullband speech codec at just 32 kb/s, something no other codecs could achieve. Opus was born and, thanks to the IETF collaboration, the result would be better than the sum of its SILK and CELT parts.

Integrating SILK and CELT required changes to both technologies. On the CELT side, it meant supporting and optimizing for frame sizes up to 20 ms — no longer ultra-low delay but low enough for videoconferencing. Through collaboration in the working group, CELT also gained a perceptual pitch post-filter contributed by Raymond Chen at Broadcom. The post-filter and the 20-ms frames also increased the efficiency to the point where some audio enthusiasts started comparing Opus to HE-AAC on music compression. Unsurprisingly, they found the higher-delay HE-AAC to have higher quality at the same bitrate, but they also started providing specific feedback that helped improve Opus. This went on for several months, until a listening test eventually showed Opus having higher quality than HE-AAC, despite HE-AAC being designed for much higher delays. At that point, Opus really became a universal audio codec. It was either on par or better than all other audio codecs, regardless of the application, be it speech, music, real-time, storage, or streaming.

Opus officially became an IETF standard in 2012. At the time, the IETF was also defining the WebRTC standard for videoconferencing on the web. Thanks to its efficiency and its royalty-free nature, Opus became the mandatory-to-implement standard for WebRTC. In part thanks to WebRTC, Opus is now included in all major browsers and in both the Android and iOS mobile operating systems. It is also used alongside AV1 in YouTube. Most large technology companies now ship products using Opus. This ensures inter-operability across different applications that can communicate with a common codec. Because there are no royalties, it also enables some products that would not otherwise be viable (e.g. because you can't afford to pay a 0.50$ royalty for each freely-downloaded copy of a client software).

As for many other codecs, only the Opus decoder is standardized, which means that the encoder can keep improving without breaking compatibility. This is how Opus keeps improving to this day, with the latest version, Opus 1.3, released in October 2018.

Ever since we started working on Opus at the IETF, it's been a recurring theme. "You guys don't know how to test codecs", "You can't be serious unless you spend $100,000 testing your codec with several independent labs", or even "designing codecs is easy, it's testing that's hard". OK, subjective testing is indeed important. After all, that's the main thing that differentiates serious signal processing work from idiots using $1000 directional, oxygen-free speaker cable. However, just like speaker cables, more expensive listening tests do not necessarily mean more useful results. In this post I'm going to explain why this kind of thinking is wrong. I will avoid naming anyone here because I want to attack the myth of the $100,000 listening test, not the people who believe in it.

In the Beginning

Back in the 70s and 80s, digital audio equipment was very expensive, complicated to deploy, and difficult to test at all. Not everyone could afford analog-to-digital converters (ADC) or digital-to-analog converters (DAC), so any testing required using expensive, specialized labs. When someone came up with a new piece of equipment or a codec, it could end up being deployed for several decades, so it made sense to give it to one of these labs to test the hell out of it. At the same time, it wasn't too hard to do a good job in testing because algorithms were generally simple and codecs only supported one or two modes of operation. For example, a codec like G.711 only has a single bit-rate and can be implemented in less than 10 lines of code. With something that simple, it's generally not too hard to have 100% code coverage and make sure all corner cases are handled correctly. Considering the investments involved, it just made sense to pay tens or hundreds of thousands of dollars to make sure nothing blows up. This was paid by large telcos and their suppliers, so they could afford it anyway.

Things remained pretty much the same through the 90s. When G.729 was standardized in 1995, it still only had a single bit-rate, and the computational complexity was still beyond what a PC could do in real-time. A few years later, we finally got codecs like AMR-NB that supported several bit-rates, though the number was still small enough that you could test each of them.

Enter Opus

When we first attempted to create a codec working group (WG) at the IETF, some folks were less than thrilled to have their "codec monopoly" challenged. The first objection we heard was "you're not competent enough to write a codec". After pointing out that we already had three candidate codecs on the table (SILK, CELT, BroadVoice), created by the authors of 3 already-deployed codecs (iSAC, Speex, G.728), the objection quickly switched to testing. After all, how was the IETF going to review this work and make sure it was any good?

The best answer came from an old-time ("gray beard") IETF participant and was along the lines of: "we at the IETF are used to reviewing things that are a lot harder to evaluate, like crypto standards. When it comes to audio, at least all of us have two ears". And it makes sense. Among all the things the IETF does (transport protocols, security, signalling, ...), codecs are among the easiest to test because at least you know the criteria and they're directly measurable. Audio quality is a hell of a lot easier to measure than "is this cipher breakable?", "is this signalling extensible enough?", or "Will this BGP update break the Internet?"

Of course, that was not the end of the testing story. For many months in 2011 we were again faced with never-ending complaints that Opus "had not been tested". There was this implicit assumption that testing the final codec improves the codec. Yeah right! Apparently, the Big-Test-At-The-End is meant to ensure that the codec is good and if it's not then you have to go back to the drawing board. Interestingly, I'm not aware of a single ITU-T codec for which that happened. On the other hand, I am aware of at least one case where the Big-Test-At-The-End revealed someting wrong. Let's look at the listening test results from the AMR-WB (a.k.a. G.722.2) codec. AMR-WB has 9 bitrates, ranging from 6.6 kb/s to 23.85 kb/s. The interesting thing with the results is that when looking at the two highest rates (23.05 and 23.85) one notices that the 23.85 kb/s mode actually has lower quality than the lower 23.05 bitrate. That's a sign that something's gone wrong somewhere. I'm not aware of why that was the case or what exactly happened from there, but apparently it didn't bother people enough to actually fix the problem. That's the problem with final tests, they're final.

A Better Approach

What I've learned from Opus is that it's possible to have tests that are far more useful and much cheaper. First, final tests aren't that useful. Although we did conduct some of those, ultimately their main use ends up being for marketing and bragging rights. After all, if you still need these tests to convince yourself that your codec is any good, something's very wrong with your development process. Besides, when you look at a codec like Opus, you have about 1200 possible bitrates, using three different coding modes, four different frame sizes, and either mono or stereo input. That's far more than one can reliably test with traditional subjective listening tests. Even if you could, modern codecs are complex enough that some problems may only occur with very specific audio signals.

The single testing approach that gave us the most useful results was also the simplest: just put the code out there so people can use it. That's how we got reports like "it works well overall, but not on this rare piece of post-neo-modern folk metal" or "it worked for all our instruments except my bass". This is not something you can catch with ITU-style testing. It's one of the most fundamental principles of open-source development: "given enough eyeballs, all bugs are shallow". Another approach was simply to throw tons of audio at it and evaluate the quality using PEAQ-style objective measurement tools. While these tools are generally unreliable for precise evaluation of a codec quality, they're pretty good at flagging files the codec does badly on for further analysis.

We ended up using more than a dozen different approaches to testing, including various flavours of fuzzing. In the end, when it comes to the final testing, nothing beats having the thing out there. After all, as our Skype friends would put it:

Which codec do you trust more? The codec that's been tested by dozens of listeners in a highly controlled lab, or the codec that's been tested by hundreds of millions of listeners in just about all conditions imaginable?

It's not like we actually invented anything here either. Software testing has evolved quite a bit since the 80s and we've mainly attempted to follow the best practices rather than use antiquated methods "because that's what we've always done".

Speex 1.2beta3 has been tagged and will be up on the website shortly. There should even be Windows builds this time thanks to Alexander Chemeris. I'm expecting the next release to be named 1.2rc1. There's still a few things to address before 1.2, but I'm hoping the libspeexdsp API will be complete for rc1. Stay tuned.

There's another releasing coming up: a new Code-Excited Lapped Transform (CELT) codec prototype. This codec is (for now at least) meant to compete with neither Vorbis, nor Speex. Instead, the primary idea is to reduce latency to a minimum -- currently around 8 ms (compared to ~25 ms for Speex and ~100 ms for Vorbis). Of course, this comes with a price in terms of efficiency, but I'm already surprised the price isn't bigger. I've been mainly focusing on speech, but unlike Speex, I'm hoping this one will handle music as well. For the curious, I've put a 56 kbps CBR music file (original). This is still very experimental and everything is still likely to change, including the exact goals. I'm still trying to figure out how to put psychoacoustics into that. Stay tuned for the release of version 0.0.1 (or should I use a negative version number to make it clearer it's experimental?).

CELT is based on a paper I submitted to ICASSP and which I'm hoping will be accepted so I can make it available to everyone. The only difference is that the ICASSP paper was based on the FFT (non critically sampled), whereas this version is based on the MDCT. One part that is already published though is Timothy's explanation of the pulse codebook encoding along with some source code. Now, here's a challenge. Who can beat the algorithm on Timothy's page? Simply stated, the idea is to enumerate all combinations of M pulses in a vector of dimension N, knowing that pulses have unit magnitude and a sign, but all pulses at the same position need to have the same sign.

Updated: The full source for CELT is available at: http://downloads.us.xiph.org/releases/celt/celt-0.0.1.tar.gz or through git at http://git.xiph.org/celt.git

Updated again:: Speex 1.2beta3 is out.

I realised recently that Speex tends to have a lot of array copies done using for loops instead of memmove()/memcpy(). However, I wasn't quite happy with just replacing everything with memmove()/memcpy() because of the very poor type safety they provide. For example, it's all too easy to change the type of an array and have memmove() doing the wrong thing without any warning. So, after a bit of thinking, I came up with the following macro, which I think should work:

#define SPEEX_MOVE(dst, src, n) (((dst)-(src)), memmove((dst), (src), (n)*sizeof(*(dst))))

Compared to memmove(), this macro does two things. First, it removes the need for using sizeof() in the length argument, removing a source of error. Second, the discarded (dst)-(src) value before the comma basically ensures that the compiler will generate an error if src and dst point to different types. Expect to see that macro appearing in Speex soon. I'd be curious to hear if anyone knows of any unwanted side effects of this, except of course the usual "don't use arguments with side effects" limitation.

Update: Actually, the following expression also has the advantage not producing a warning with gcc:

#define SPEEX_MOVE(dst, src, n) (memmove((dst), (src), (n)*sizeof(*(dst)) + 0*((dst)-(src)) ))

So a while ago, I wasn't careful with type lengths and wrote some code in the speex encoder (speexenc, not libspeex) that wouldn't work very well on 64-bit machines. More precisely, it would make speexenc crash on startup 100% of the time, so you can't really miss it if you have a 64-bit machine. Fortunately, someone noticed and the bug was promptly fixed. This should normally have been the end of the story... except that Ubuntu was going to ship Dapper with an older version (current Debian unstable).

Turns out that the bug was reported against Dapper very early on. A patch was even posted more than a month before the release of Dapper. From there, it took 11 months for the 2-line fix to be applied and released. And if it wasn't for me harassing some of the developers (thanks crimsun, tritium for pushing the fix in), I don't think the fix would never have made it.

Sometimes one wonders why it is that Ubuntu has a bug tracker. Another example is bug #52600. You can't see it because it's marked as a security bug, but considering I filed it more than 8 months ago, I don't think making it private makes sense anymore. That one comes down to the fact that any local user with no privilege can crash a Dapper machine very easily. You just compile the following program:

#include <sched.h>
int main() {
struct sched_param param;
param.sched_priority = sched_get_priority_max(SCHED_FIFO);
sched_setscheduler(0,SCHED_FIFO,¶m);
while(1);
}

and then execute it. What this does is simply ask the maximum real-time priority and then spin doing nothing, starving every single other process on the machine and forcing a reboot. While allowing SCHED_FIFO to some users in some circumstances makes sense, I can't understand why it's enabled for everyone on the system. It's a bit like making the shutdown command setuid root. Yeah for the Ubuntu LTS (Long Time to get Support) process!

After a couple days fighting with this annoying overflow bug, I think I've managed to solve the problem. As you can see, some of the fixes are not very nice. It basically comes down to

Adding explicit saturation (SATURATE) before 32-bit to 16-bit conversions.
Scaling the signal up/down for some operations to avoid having to add saturation all over the place, especially in critical loops
PSHR* is evil. Well, not quite but can you shot the danger in the PSHR32 definition?

OK, so I thought the fixed-point code in 1.2-beta1 was getting pretty good. But that was until a user ((wouldn't things be simple without them!) was able to make it fail horribly by feeding it totally clipped speech. It turns out that the file manages to trigger at least a half-dozen overflows all around the code, some of them easily fixed, some not.

So here's the deal with fixed-point. Some CPUs/DSPs support saturating versions of add/sub/mul/... and some don't. Most G.72x codecs are usually implemented assuming that they exist, so they don't need to worry about overflows. For Speex, I decided to do it without assuming hardware saturation, so it can run on ARM and other chips (including x86) that don't support saturation. And that's how everything suddenly becomes more complicated. If once in a while 0.5 + 0.6= 1.0, you usually don't care too much. On the other hand, if 0.5 + 0.6 = -0.9, then suddenly you do care.

So the fundamental question here is how much overflows on corrupted input can be tolerated (based on the "garbage in, garbage out" principle) and how much needs to be avoided regardless of the input? Answer when I get to the bottom of this. To be continued...

Profile

jmvalin

March 2023

S	M	T	W	T	F	S
			1	2	3	4
5	6	7	8	9	10	11
12	13	14	15	16	17	18
19	20	21	22	23	24	25
26	27	28	29	30	31

Syndicate

Style Credit

Style: Dreamer for Dusty Foot by timeasmymeasure

Expand Cut Tags

No cut tags

Page generated Jun. 13th, 2025 07:33 am

Jean-Marc Valin

Entries tagged with speex

How Opus Came To Be

The Myth of the $100,000 Listening Test

In the Beginning

Enter Opus

A Better Approach

Preparing for releases

Safe version of memmove()/memcpy()

History of an Ubuntu bug

Think I got it now

Fixed-point fun