This new demo presents LPCNet, an architecture that combines signal processing and deep learning to improve the efficiency of neural speech synthesis. Neural speech synthesis models like WaveNet have recently demonstrated impressive speech synthesis quality. Unfortunately, their computational complexity has made them hard to use in real-time, especially on phones. As was the case in the RNNoise project, one solution is to use a combination of deep learning and digital signal processing (DSP) techniques. This demo explains the motivations for LPCNet, shows what it can achieve, and explores its possible applications.
Opus gets another major upgrade with the release of version 1.2. This release brings quality improvements to both speech and music, while remaining fully compatible with RFC 6716. There are also optimizations, new options, as well as many bug fixes. This Opus 1.2 demo describes a few of the upgrades that users and implementers will care about the most. You can download the code from the Opus website.
Over the last three years, we have published a number of Daala technology demos. With pieces of Daala being contributed to the Alliance for Open Media's AV1 video codec, now seems like a good time to go back over the demos and see what worked, what didn't, and what changed compared to what we originally described.
Here's my new contribution to the Daala demo effort. Perceptual Vector Quantization has been one of the core ideas in Daala, so it was time for me to explain how it works. The details involve lots of maths, but hopefully this demo will make the general idea clear enough. I promise that the equations in the top banner are the only ones you will see!
After more than two years of development, we have released Opus 1.1. This includes:
- new analysis code and tuning that significantly improves encoding quality, especially in variable-bitrate (VBR) mode,
- automatic detection of speech or music to decide which encoding mode to use,
- surround sound with good quality at 128 kb/s for 5.1 and usable down to 48 kb/s, and
- speed improvements on all architectures, especially ARM, where decoding uses around 40% less CPU and encoding uses around 30% less CPU.
With these changes, stereo encoding now produces usable audio (of course, not high fidelity) down to about 40 kb/s, with 5.1 surround sounding usable down to 48-64 kb/s. Please give this release a try and report any issues on the mailing list or by joining the #opus channel on irc.freenode.net. The more testing we get, the faster we'll be able to fix any remaining issues.
As usual, the code can be downloaded from: http://opus-codec.org/downloads/
We just released Opus 1.1-beta, which includes many improvements over the 1.0.x branch. For this release, Monty made a nice demo page showing off most of the new features. In other news, the AES has accepted my paper on the CELT part of Opus, as well as a paper proposal from Koen Vos on the SILK part.
Ever since we started working on Opus at the IETF, it's been a recurring theme. "You guys don't know how to test codecs", "You can't be serious unless you spend $100,000 testing your codec with several independent labs", or even "designing codecs is easy, it's testing that's hard". OK, subjective testing is indeed important. After all, that's the main thing that differentiates people doing serious signal processing from idiots using $1000 directional, oxygen-free speaker cables. However, just like speaker cables, more expensive listening tests do not necessarily mean more useful results. In this post I'm going to explain why this kind of thinking is wrong. I will avoid naming anyone here because I want to attack the myth of the $100,000 listening test, not the people who believe in it.
In the Beginning
Back in the 70s and 80s, digital audio equipment was very expensive, complicated to deploy, and difficult to test at all. Not everyone could afford analog-to-digital converters (ADC) or digital-to-analog converters (DAC), so any testing required using expensive, specialized labs. When someone came up with a new piece of equipment or a codec, it could end up being deployed for several decades, so it made sense to give it to one of these labs to test the hell out of it. At the same time, it wasn't too hard to do a good job of testing, because algorithms were generally simple and codecs only supported one or two modes of operation. For example, a codec like G.711 only has a single bit-rate and can be implemented in less than 10 lines of code. With something that simple, it's generally not too hard to have 100% code coverage and make sure all corner cases are handled correctly. Considering the investments involved, it just made sense to pay tens or hundreds of thousands of dollars to make sure nothing blew up. This was paid for by large telcos and their suppliers, so they could afford it anyway.
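To give an idea of just how simple, here's a sketch of the textbook µ-law companding that G.711 is built on. This is an illustration of the algorithm, not the reference implementation:

```c
#include <stdint.h>

/* Sketch of G.711 mu-law encoding (textbook algorithm, not the
 * reference code). Maps a 16-bit linear sample to 8 bits. */
static uint8_t linear_to_mulaw(int16_t sample)
{
    const int BIAS = 0x84;   /* 132: shifts the segment boundaries */
    const int CLIP = 32635;  /* avoid overflow after adding the bias */
    int sign = (sample < 0) ? 0x80 : 0x00;
    int mag = (sample < 0) ? -(int)sample : (int)sample;
    if (mag > CLIP) mag = CLIP;
    mag += BIAS;
    /* Find the segment (exponent): position of the highest set bit */
    int exponent = 7;
    for (int mask = 0x4000; (mag & mask) == 0 && exponent > 0; mask >>= 1)
        exponent--;
    int mantissa = (mag >> (exponent + 3)) & 0x0F;
    /* mu-law transmits the bit-complement of sign|exponent|mantissa */
    return (uint8_t)~(sign | (exponent << 4) | mantissa);
}
```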
Things remained pretty much the same through the 90s. When G.729 was standardized in 1995, it still only had a single bit-rate, and the computational complexity was still beyond what a PC could do in real-time. A few years later, we finally got codecs like AMR-NB that supported several bit-rates, though the number was still small enough that you could test each of them.
When we first attempted to create a codec working group (WG) at the IETF, some folks were less than thrilled to have their "codec monopoly" challenged. The first objection we heard was "you're not competent enough to write a codec". After pointing out that we already had three candidate codecs on the table (SILK, CELT, BroadVoice), created by the authors of three already-deployed codecs (iSAC, Speex, G.728), the objection quickly switched to testing. After all, how was the IETF going to review this work and make sure it was any good?
The best answer came from an old-time ("gray beard") IETF participant and was along the lines of: "we at the IETF are used to reviewing things that are a lot harder to evaluate, like crypto standards. When it comes to audio, at least all of us have two ears". And it makes sense. Among all the things the IETF does (transport protocols, security, signalling, ...), codecs are among the easiest to test because at least you know the criteria and they're directly measurable. Audio quality is a hell of a lot easier to measure than "is this cipher breakable?", "is this signalling extensible enough?", or "will this BGP update break the Internet?"
Of course, that was not the end of the testing story. For many months in 2011, we were again faced with never-ending complaints that Opus "had not been tested". There was this implicit assumption that testing the final codec improves the codec. Yeah right! Apparently, the Big-Test-At-The-End is meant to ensure that the codec is good, and if it's not then you have to go back to the drawing board. Interestingly, I'm not aware of a single ITU-T codec for which that happened. On the other hand, I am aware of at least one case where the Big-Test-At-The-End revealed something wrong. Let's look at the listening test results for the AMR-WB (a.k.a. G.722.2) codec. AMR-WB has 9 bitrates, ranging from 6.6 kb/s to 23.85 kb/s. The interesting thing with the results is that when looking at the two highest rates (23.05 and 23.85 kb/s), one notices that the 23.85 kb/s mode actually has lower quality than the 23.05 kb/s mode. That's a sign that something went wrong somewhere. I'm not aware of why that was the case or what exactly happened from there, but apparently it didn't bother people enough to actually fix the problem. That's the problem with final tests: they're final.
A Better Approach
What I've learned from Opus is that it's possible to have tests that are far more useful and much cheaper. First, final tests aren't that useful. Although we did conduct some of those, ultimately their main use ends up being for marketing and bragging rights. After all, if you still need these tests to convince yourself that your codec is any good, something's very wrong with your development process. Besides, when you look at a codec like Opus, you have about 1200 possible bitrates, using three different coding modes, four different frame sizes, and either mono or stereo input. That's far more than one can reliably test with traditional subjective listening tests. Even if you could, modern codecs are complex enough that some problems may only occur with very specific audio signals.
The single testing approach that gave us the most useful results was also the simplest: just put the code out there so people can use it. That's how we got reports like "it works well overall, but not on this rare piece of post-neo-modern folk metal" or "it worked for all our instruments except my bass". This is not something you can catch with ITU-style testing. It's one of the most fundamental principles of open-source development: "given enough eyeballs, all bugs are shallow". Another approach was simply to throw tons of audio at it and evaluate the quality using PEAQ-style objective measurement tools. While these tools are generally unreliable for precisely evaluating codec quality, they're pretty good at flagging files the codec does badly on for further analysis.
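As a sketch of what that looks like in practice, here's the kind of loop we'd run over a large test corpus. Note that objective_score() is a hypothetical stand-in for whatever PEAQ-style measurement tool is actually used:

```c
#include <stdio.h>

/* Hypothetical stand-in for a PEAQ-style objective measurement: compare
 * the decoded file against the original and return a quality score. */
static float objective_score(const char *ref_file, const char *coded_file)
{
    (void)ref_file;
    (void)coded_file;
    return 0.0f; /* stub: plug a real measurement tool in here */
}

/* Score every file in a test corpus and flag the worst ones so a human
 * can actually listen to them. */
static void flag_bad_files(const char **ref, const char **coded,
                           int n_files, float threshold)
{
    for (int i = 0; i < n_files; i++) {
        float score = objective_score(ref[i], coded[i]);
        if (score < threshold)
            printf("FLAG %s (score %.2f): worth a careful listen\n",
                   ref[i], score);
    }
}
```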
We ended up using more than a dozen different approaches to testing, including various flavours of fuzzing. In the end, when it comes to the final testing, nothing beats having the thing out there. After all, as our Skype friends would put it:
Which codec do you trust more? The codec that's been tested by dozens of listeners in a highly controlled lab, or the codec that's been tested by hundreds of millions of listeners in just about all conditions imaginable?

It's not like we actually invented anything here either. Software testing has evolved quite a bit since the 80s, and we've mainly attempted to follow the best practices rather than use antiquated methods "because that's what we've always done".
We just released Opus 1.1-alpha, which includes more than one year of development compared to the 1.0.x branch. There are quality improvements, optimizations, bug fixes, as well as an experimental speech/music detector for mode decisions. That being said, it's still an alpha release, which means it can also do stupid things sometimes. If you come across any of those, please let us know so we can fix it. You can send an email to the mailing list, or join us on IRC in #opus on irc.freenode.net. The main reason for releasing this alpha is to get feedback about what works and what does not.
Most of the quality improvements come from the unconstrained variable bitrate (VBR). In the 1.0.x encoder, VBR always attempts to meet its target bitrate. The new VBR code is free to deviate from its target depending on how difficult the file is to encode. In addition to boosting the rate of transients as 1.0.x already does, the new encoder also boosts the rate of tonal signals, which are harder for Opus to code. On the other hand, for signals with a narrow stereo image, Opus can reduce the bitrate. What this means in the end is that some files may deviate significantly from the target. For example, someone encoding their music collection at 64 kb/s (nominal) may find that some files end up using as little as 48 kb/s, while others may use up to about 96 kb/s. However, for a large enough collection, the average should be fairly close to the target.
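For those who want to try it, this is roughly how an application requests unconstrained VBR through the libopus API. A minimal sketch, with error handling mostly omitted:

```c
#include <stddef.h>
#include <opus.h>

/* Minimal sketch: create an encoder and request unconstrained VBR with
 * a 64 kb/s nominal target (48 kHz stereo input assumed). */
OpusEncoder *make_unconstrained_vbr_encoder(void)
{
    int err;
    OpusEncoder *enc = opus_encoder_create(48000, 2,
                                           OPUS_APPLICATION_AUDIO, &err);
    if (err != OPUS_OK)
        return NULL;
    opus_encoder_ctl(enc, OPUS_SET_BITRATE(64000));     /* long-term target */
    opus_encoder_ctl(enc, OPUS_SET_VBR(1));             /* enable VBR */
    opus_encoder_ctl(enc, OPUS_SET_VBR_CONSTRAINT(0));  /* let it deviate */
    return enc;
}
```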
There are a few more ways in which the alpha improves quality. The dynamic allocation code was improved and made more aggressive, the transient detector was once again rewritten, and so was the time-frequency (tf) analysis code. A simple change that improves the quality of some files is the new DC rejection (3-Hz high-pass) filter. DC is not supposed to be present in audio signals, but sometimes it is, and it harms quality. Finally, there are many minor improvements for speech quality (both on the SILK side and on the CELT side), including changes to the pitch estimator.
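For illustration, DC rejection can be as simple as the classic one-pole DC blocker below. This is a sketch of the idea, not the actual Opus code:

```c
/* Sketch of a first-order DC rejection (high-pass) filter:
 *   y[n] = x[n] - x[n-1] + c*y[n-1]
 * with c chosen close to 1 for a cutoff of about 3 Hz. Not the actual
 * Opus implementation, just the classic filter it resembles. */
void dc_reject(const float *in, float *out, int N, float *mem_x, float *mem_y)
{
    /* c ~= 1 - 2*pi*f0/Fs ~= 0.9996 for f0 = 3 Hz at Fs = 48 kHz */
    const float c = 0.9996f;
    float x1 = *mem_x, y1 = *mem_y;  /* filter memory across calls */
    for (int n = 0; n < N; n++) {
        float y = in[n] - x1 + c * y1;
        x1 = in[n];
        y1 = y;
        out[n] = y;
    }
    *mem_x = x1;
    *mem_y = y1;
}
```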
Another big feature is automatic detection of speech and music. This is useful for selecting the optimal encoding mode between SILK-only/hybrid and CELT-only. Contrary to what some people think, it's not as simple as encoding all music with CELT and all speech with SILK. It also depends on the bitrate (at very low rates, we'll use SILK for music, and at high rates, we'll use CELT for speech). Automatic detection isn't easy, but doing it in real-time (with no look-ahead) is even harder. Because of that, the detector tends to take 1-2 seconds before reacting to transitions and will sometimes make bad decisions. We'd be interested in knowing about any cases where the algorithm screws up.
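Applications that already know what kind of signal they're sending don't have to rely on the detector: the signal hint that already exists in the libopus API can override it. A minimal sketch:

```c
#include <opus.h>

/* Sketch: let the detector decide by default, but override it when the
 * application knows the content type (these are real libopus CTLs). */
void set_content_hint(OpusEncoder *enc, int is_type_known, int is_music)
{
    if (!is_type_known)
        opus_encoder_ctl(enc, OPUS_SET_SIGNAL(OPUS_AUTO));
    else if (is_music)
        opus_encoder_ctl(enc, OPUS_SET_SIGNAL(OPUS_SIGNAL_MUSIC));
    else
        opus_encoder_ctl(enc, OPUS_SET_SIGNAL(OPUS_SIGNAL_VOICE));
}
```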
The new encoder can also detect the bandwidth of the input signal. This is useful to avoid wasting bits encoding frequencies that aren't present in the signal. While easier than speech/music detection, bandwidth detection isn't as easy as it sounds, because of aliasing, quantization and dithering. The current algorithm should do a reasonable job, but again, we'd be interested in knowing about any failures.
We're also releasing both version 1.0.0, which is the same code as the RFC, and version 1.0.1, which is a minor update on that code (mainly to the build system). As usual, you can get those from http://opus-codec.org/
Thanks to everyone who contributed by fixing bugs, reporting issues, implementing Opus support, testing, advocating, ... It was a lot of work, but it was worth it.
I just got back from the 84th IETF meeting in Vancouver. The most interesting part (as far as I was concerned anyway) was the rtcweb working group meeting. One of the topics was selecting the mandatory-to-implement (MTI) codecs. For audio, we proposed having both Opus and G.711 as MTI codecs. Much to our surprise, most of the following discussion was over whether G.711 was a good idea. In the end, there was strong consensus (the IETF believes in "rough consensus and running code") in favor of Opus+G.711, so that's what's going to be in rtcweb. Of course, implementers will probably ship with a bunch of other codecs for legacy compatibility purposes.
The video codec discussion was far less successful. Not only is there still no consensus over which codec to use (VP8 vs H.264), but there has also been no significant progress toward a consensus. Personally, I can't see how anyone could possibly consider H.264 a viable option. Not only is it incompatible with open source, but it's like signing a blank check: nobody knows how much MPEG-LA will decide to charge for it in the coming years, especially for the encoder, which is currently not an issue for HTML5 (which only requires a decoder). The main argument I have heard against VP8 is "we don't know if there are patents". While this is true in some sense, the problem is much worse for H.264: not only are there tons of known patents for which we only know the licensing fees in the short term, but there's still at least as much risk when it comes to unlicensed patents (see the current Motorola v. Microsoft case).
The first assumption I make here is that David has already checked that both pitch and energy are encoded at the "optimal" resolution that balances bitrate and coding artefacts. To reduce the rate, we need a smarter quantizer. Below is the distribution of the pitch and energy for my training database.
So what if we were to use vector quantization to reduce the bit-rate? In theory, we could reduce the rate (for equal error) by having more codevectors in areas where the figure above shows more data. Same error, lower rate, but still a bad idea. It would be bad because it would mean that codec2 wouldn't work well for some people, those whose pitch happens to fall in a less likely range. It would also mean that just changing the audio gain could make codec2 do worse. That is clearly not acceptable. We need to care not just about the mean square error (MSE), but also about the outliers. We need to be able to encode any amplitude with increments of 1-2 dB and any pitch with an increment around 0.04-0.08 (between half a semitone and a semitone). So it looks like we're stuck, and the best we could do is uniform VQ, which wouldn't save much compared to scalar quantization.
The key here is to relax the resolution constraint above. In practice, we only need such good resolution when the signal is stationary. For example, when the pitch in unvoiced frames jumps around randomly, it's not really important to encode it accurately. Similarly, energy errors are much more perceptible when the energy is stable than when it's fluctuating. This is where prediction becomes very useful, because stationary signals are exactly the ones that are easily predicted. By using a simple first-order recursive predictor (prediction = alpha*previous_value), we can reduce the range for which we need good resolution by a factor of (1-alpha). For example, if we have a signal that ranges from 0 to 100 and we want a resolution of 1, then using alpha=0.9, the prediction error (current_value-prediction) will have a range of 0 to 10 when the signal is stationary. We still need quantizer values outside that range to encode variations, but we don't need good resolution there.
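Here's a minimal sketch of such a first-order predictive quantizer, just to make the idea concrete (an illustration, not the actual codec2 code):

```c
#include <math.h>

/* Sketch of a first-order predictive scalar quantizer: quantize the
 * prediction residual instead of the value itself. With alpha = 0.9, a
 * stationary signal in [0, 100] yields residuals in roughly [0, 10], so
 * the same number of quantizer levels buys 10x finer resolution there. */
typedef struct {
    float alpha;  /* prediction coefficient */
    float step;   /* quantizer step size */
    float prev;   /* previous decoded value (shared by encoder/decoder) */
} pred_quant;

/* Returns the index to transmit; updates the predictor state. */
int pred_quant_encode(pred_quant *q, float value)
{
    float prediction = q->alpha * q->prev;
    float residual = value - prediction;
    int index = (int)floorf(residual / q->step + 0.5f);
    /* Decode locally so encoder and decoder predictors stay in sync. */
    q->prev = prediction + index * q->step;
    return index;
}

float pred_quant_decode(pred_quant *q, int index)
{
    float value = q->alpha * q->prev + index * q->step;
    q->prev = value;
    return value;
}
```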
Now that we have reduced the domain for which we need good resolution, we can actually start using vector quantization too. By combining prediction and vector quantization, it's possible to have a good enough quantizer using only 8 bits for both the energy and the pitch, saving 4 bits per frame, or 200 b/s at 50 frames per second. The figure below illustrates how the quantizer is trained, with the distribution of the prediction residual (actual value minus prediction) in blue, and the distribution of the code vectors in red. The prediction coefficients are 0.8 for pitch and 0.9 for energy.
The first thing we notice from the residual distribution is that it's much less uniform: two higher-density areas stand out. The first is around (0.3, 0), which corresponds to the case where the pitch and energy are stationary; it spans about one fifth of the range for pitch (which has a prediction coefficient of 4/5) and one tenth of the range for energy (which has a prediction coefficient of 9/10). The second higher-density area is a line around a residual energy of -2.5, and it corresponds to silence. Now looking at the codebook in red, we can see a very high density of vectors in the area of stationary speech, enough for a resolution of 1-2 dB in energy and 1/2 to 1 semitone in pitch. The difference is that this time the high resolution is only needed for a much smaller range. The reason we see such a high density of code vectors around stationary speech and not so much around the "silence line" comes down to the last detail of this quantizer: weighting. The whole codebook training procedure uses weighting based on how important the quantization error is. The weight given to pitch and energy errors on stationary voiced speech is much higher than it is for non-stationary speech or silence. This is why this quantizer is able to give good enough quality with 8 bits instead of 12.
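For the curious, this kind of weighted codebook training is essentially a weighted k-means (LBG-style) iteration. Here's a rough sketch under that assumption, not the actual training code:

```c
#include <float.h>

#define DIM 2  /* (pitch residual, energy residual) */

/* Sketch of weighted codebook training: each training vector carries a
 * perceptual weight, so stationary voiced frames (high weight) pull more
 * codevectors toward them than silence does. Assumes n_code <= 256,
 * i.e. an 8-bit codebook. Illustration only. */
static float dist2(const float *a, const float *b)
{
    float d = 0;
    for (int i = 0; i < DIM; i++)
        d += (a[i] - b[i]) * (a[i] - b[i]);
    return d;
}

void train_codebook(float data[][DIM], const float *weight, int n_data,
                    float codebook[][DIM], int n_code, int iterations)
{
    for (int it = 0; it < iterations; it++) {
        float sum[256][DIM] = {{0}};
        float wsum[256] = {0};
        /* Assignment step: nearest codevector for each training vector. */
        for (int i = 0; i < n_data; i++) {
            int best = 0;
            float best_d = FLT_MAX;
            for (int j = 0; j < n_code; j++) {
                float d = dist2(data[i], codebook[j]);
                if (d < best_d) { best_d = d; best = j; }
            }
            for (int k = 0; k < DIM; k++)
                sum[best][k] += weight[i] * data[i][k];
            wsum[best] += weight[i];
        }
        /* Update step: move each codevector to the weighted centroid
         * of the training vectors assigned to it. */
        for (int j = 0; j < n_code; j++)
            if (wsum[j] > 0)
                for (int k = 0; k < DIM; k++)
                    codebook[j][k] = sum[j][k] / wsum[j];
    }
}
```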
I just got back from linux.conf.au 2012 in Ballarat. The video for the talk I gave, Opus, the Swiss Army Knife of Audio Codecs, is now available on the Opus presentations page. For the Ogg-impaired, a lower-quality version is also available on YouTube.
For those who are into speech codecs, I also recommend watching David Rowe's presentation: Codec 2 - Open Source Speech Coding at 2400 bit/s and Below. His presentation was selected as one of the four best talks at LCA this year -- well worth watching.