jmvalin: (opus)
Opus gets another major update with the release of version 1.3. This release brings quality improvements to both speech and music, while remaining fully compatible with RFC 6716. This is also the first release with Ambisonics support. This Opus 1.3 demo describes a few of the upgrades that users and implementers will care about the most. You can download the new version from the Opus website.
jmvalin: (opus)

Opus gets another major upgrade with the release of version 1.2. This release brings quality improvements to both speech and music, while remaining fully compatible with RFC 6716. There are also optimizations, new options, as well as many bug fixes. This Opus 1.2 demo describes a few of the upgrades that users and implementers will care about the most. You can download the code from the Opus website.
jmvalin: (opus)
We just released Opus 1.2-alpha. It's an alpha release for the upcoming Opus 1.2. It includes many quality improvements for low-bitrate speech and music. It also includes new features, as well as a large number of bug fixes. See the announcement for more details.
jmvalin: (opus)
We just released Opus 1.1-rc, which should be the last step before the final 1.1 release. Compared to 1.1-beta, this new release further improves surround encoding quality. It also includes better tuning of surround and stereo for lower bitrates. The complexity has been reduced on all CPUs, but especially ARM, which now has Neon assembly for the encoder.

With the changes, stereo encoding now produces usable audio (of course, not high fidelity) down to about 40 kb/s, with surround 5.1 sounding usable down to 48-64 kb/s. Please give this release a try and report any issues on the mailing list or by joining the #opus channel. The more testing we get, the faster we'll be able to release 1.1-final.

As usual, the code can be downloaded from:
jmvalin: (opus)

We just released Opus 1.1-alpha, which includes more than one year of development compared to the 1.0.x branch. There are quality improvements, optimizations, bug fixes, as well as an experimental speech/music detector for mode decisions. That being said, it's still an alpha release, which means it can also do stupid things sometimes. If you come across any of those, please let us know so we can fix it. You can send an email to the mailing list, or join us on IRC in #opus. The main reason for releasing this alpha is to get feedback about what works and what does not.

Quality improvements

Most of the quality improvements come from the unconstrained variable bitrate (VBR). In the 1.0.x encoder, VBR always attempts to meet its target bitrate. The new VBR code is free to deviate from its target depending on how difficult the file is to encode. In addition to boosting the rate of transients like 1.0.x does, the new encoder also boosts the rate of tonal signals, which are harder for Opus to code. On the other hand, for signals with a narrow stereo image, Opus can reduce the bitrate. What this means in the end is that some files may significantly deviate from the target. For example, someone encoding their music collection at 64 kb/s (nominal) may find that some files end up using as low as 48 kb/s, while others may use up to about 96 kb/s. However, for a large enough collection, the average should be fairly close to the target.

There are a few more ways in which the alpha improves quality. The dynamic allocation code was improved and made more aggressive, the transient detector was once again rewritten, and so was the tf analysis code. A simple thing that improves the quality of some files is the new DC rejection (3-Hz high-pass) filter. DC is not supposed to be present in audio signals, but it sometimes is, and it harms quality. Finally, there are many minor improvements for speech quality (both on the SILK side and on the CELT side), including changes to the pitch estimator.

Speech/music detector

Another big feature is automatic detection of speech and music. This is useful for selecting the optimal encoding mode between SILK-only/hybrid and CELT-only. Unlike what some people think, it's not as simple as encoding all music with CELT and all speech with SILK. It also depends on the bitrate (at very low rate, we'll use SILK for music and at high rate, we'll use CELT for speech). Automatic detection isn't easy, but doing so in real-time (with no look-ahead) is even harder. Because of that the detector tends to take 1-2 seconds before reacting to transitions and will sometimes make bad decisions. We'd be interested in knowing about any screw ups of the algorithm.

Bandwidth detection

The new encoder can also detect the bandwidth of the input signal. This is useful to avoid wasting bits encoding frequencies that aren't present in the signal. While easier than speech/music detection, bandwidth detection isn't as easy as it sounds because of aliasing, quantization and dithering. The current algorithm should do a reasonable job, but again we'd be interested in knowing about any failure.

jmvalin: (opus)
We finally made it! Opus is now standardized by the IETF as RFC 6716. See the Mozilla hacks post and the Xiph.Org press release for more details. Of course, feel free to help spread the word around.

We're also releasing both version 1.0.0, which is the same code as the RFC, and version 1.0.1, which is a minor update on that code (mainly with the build system). As usual, you can get those from

Thanks to everyone who contributed by fixing bugs, reporting issues, implementing Opus support, testing, advocating, ... It was a lot of work, but it was worth it.
jmvalin: (opus)
Three years after we first tried convincing the IETF to standardize an audio codec, Opus has finally been approved by the IETF. The only remaining step until it's officially an RFC is the RFC editor (fixing last minor issues, typos, ...). That should take on the order of 6-8 weeks (variable), at which point we'll have the RFC and the 1.0 release. Thanks to everyone who helped with developing, testing, supporting or advocating Opus.
jmvalin: (Default)

I just got back from LCA 2012 in Ballarat. The video for the talk I gave, Opus, the Swiss Army Knife of Audio Codecs, is now available on the Opus presentations page. For the Ogg-impaired, a lower-quality version is also available on YouTube.

For those who are into speech codecs, I also recommend watching David Rowe's presentation: Codec 2 - Open Source Speech Coding at 2400 bit/s and Below. His presentation was selected as one of the four best talks at LCA this year -- well worth watching.

jmvalin: (Default)

Those who have been following the Opus git repository in the past few weeks probably haven't noticed much work going on. The reason is pretty simple: most of the work has been going on elsewhere, in an experimental branch (named exp_wip3 for now) of my private repository. The reason it's in an experimental branch is that it's not fully converted to fixed-point and hasn't been tested on any frame size other than 20 ms. Here's an (incomplete) list of changes for now:

  • Really unconstrained VBR (not trying to keep the same average rate)
  • Tonality detection to give highly tonal audio a boost in bit-rate
  • (yet another) rewrite of the transient detection code
  • New dynamic allocation code that boosts the rate of bands that have significant spectral leakage caused by short blocks

Thanks to these changes, the quality has (as far as we can tell) gone up compared to the current master branch. I invite you to judge for yourself by comparing the audio coded with the current master branch with the audio coded with the new exp_wip3 experimental branch. This is 64 kb/s, so fairly low rate for stereo music. The original is here. Let me know what you think.

jmvalin: (Default)
Since yesterday, the IETF audio codec requirements are now published as RFC 6366. While the requirements aren't by themselves interesting (why discuss abstract requirements when you can discuss actual running code?), it's an important milestone in that it's the first document published by the Working Group. It also means one less source of pointless arguments. The guidelines document is now next in line and should go to IETF last call soon.

Now for the interesting part: the Opus codec draft itself. That's the only document that really matters. That one should go to Working Group Last Call (WGLC) pretty soon (possibly in the next week or two). In the meantime, we're working on improving the clarity of the draft, cleaning up the code and fixing all the last few issues that have been reported since the first WGLC. Stay tuned.
jmvalin: (Default)
I spent my last week in Quebec City at the 81st IETF meeting. The most important meeting there for me was the codec WG. The good news is that there's been a lot of progress in that meeting. A few issues with the Opus bit-stream (e.g. padding, frame packing) were resolved and the chairs are planning a second working group last call in four weeks. After that, if all goes well, the codec can go to IETF last call and then RFC.

My week at the IETF meeting was also my first week at my new job working for Mozilla. I've been hired specifically to work on Opus and other codec/multimedia development, so I should have a lot more time for that than I used to. First thing on my list: finishing the Ogg mapping for Opus and releasing an Ogg encoder and decoder.
jmvalin: (Default)
Monty has just finished a very interesting CELT demo that covers most of the techniques used in CELT and their history. It also includes a large number of audio samples, including comparisons with Vorbis and various flavours of AAC. CELT has come a long long way in the past three years and even in the past three months, quality has gone up significantly, to the point where it sounds better than Vorbis on many (most?) samples and even comparable to HE-AAC at 64 kb/s. The target is to freeze the bit-stream early January for integration within the Opus codec, but there may still be a few quality improvements we can make before that -- not to mention all the encoder-side improvements we can make even after the bit-stream is frozen.
jmvalin: (Default)
Recently, I was curious about how CELT and Vorbis differ in the way they allocate bits. Now, CELT's bit allocation is really explicit, with a fixed number of bits per band. This is not quite the case for Vorbis, so a comparison isn't straightforward. What I've done is run some audio (a mono version of the audio I used in my previous post) through Vorbis and measure the SNR as a function of frequency. By dividing the SNR by 6 dB/bit, I can get the (approximate) bit allocation. The result (smoothed a bit) is shown below for encoding quality -1 to 10.

Now, these are the curves currently used by CELT for its bit allocation:

Among the differences are:
1) The Vorbis allocation lines for different rates are nearly parallel, meaning that starting from a certain allocation, bits are added/removed nearly uniformly when changing the bit-rate
2) Vorbis allocates a lot of bits to very low frequencies, and then there is a sharp drop-off around 400 Hz.
3) In the mid-high range, the Vorbis allocation is much flatter than CELT

Now I tend to trust that the Vorbis allocation has been decently tuned, so the question is whether the differences in allocation are due to fundamental differences between Vorbis and CELT or just to bad tuning of CELT so far. I suspect there's a bit of both. I've actually created an exp_vorbis_tuning branch to find out. I just took the Vorbis data and turned that into CELT bit allocation data just to see what it would do. I expected something terrible, but it actually sounds quite decent. In some circumstances, it sounds a bit worse than the original CELT tuning, but I think in other cases it actually sounds better. More investigation needed...
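
For readers who want to try the same kind of measurement, the SNR-to-bits conversion described above can be sketched roughly as follows. This is only an illustration: the function names are hypothetical, the clamping of negative SNR to zero bits is an assumption, and the only thing taken from the post is the 6 dB/bit rule of thumb.

```python
import math

def band_snr_db(original, decoded):
    """SNR in dB for one frequency band, given original and decoded coefficients."""
    signal = sum(x * x for x in original)
    noise = sum((x - y) ** 2 for x, y in zip(original, decoded))
    if signal == 0.0:
        return -math.inf      # nothing to code in this band
    if noise == 0.0:
        return math.inf       # band reproduced exactly
    return 10.0 * math.log10(signal / noise)

def snr_to_bits(snr_db_per_band, db_per_bit=6.0):
    """Approximate bits per coefficient implied by each band's SNR.

    Negative SNR is clamped to zero bits (an assumption made for this sketch).
    """
    return [max(0.0, snr / db_per_bit) for snr in snr_db_per_band]

# Illustrative values only, not measurements from Vorbis or CELT.
example_allocation = snr_to_bits([30.0, 12.0, -3.0])
```

Averaging the per-band SNR over many frames before converting would give smoother curves, which is presumably close to what "smoothed a bit" refers to.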
jmvalin: (Default)
I've been doing some tuning of CELT over the past few days and thought it would be interesting to compare how the quality of CELT has evolved over the course of its development. It's easy to lose track when each change you make provides only a tiny improvement. Using this stereo reference file, I've tried encoding with a few different versions. Even though I don't normally recommend using that bit-rate for stereo, I've used 40 kb/s for the comparison because it makes the artefacts (and thus the differences) more obvious. A bit more than two years ago, this is what CELT 0.3.2 sounded like at 40 kb/s. Then there was version 0.5.2, which improved on it, followed by the latest release, 0.8.1. And now, here's what is in the current git, to be released as 0.9.

OK, I know the quality isn't that good at such a low rate, so here's a slightly higher bit-rate. This is current git at 64 kb/s, compared to G.719 at the same rate. I'm curious to hear comments about how CELT does compared to G.719 because we haven't done any formal comparison yet on music.

Even at 64 kb/s, the artefacts are generally audible, even though they're usually no longer annoying. They start being less audible at 80 kb/s, as you can see, and then the quality continues going up all the way to 256 kb/s or even higher.
jmvalin: (Default)
I mentioned in my previous post that much technical work was done while at the IETF meeting. First, it's always good to have other people looking at your code, and meeting face to face is the best way to actually explain your code to others. The first thing that happened while Tim was looking at my code was he found much simpler ways (closed-form) to compute probability distributions I was computing in an iterative manner. The next thing that happened was that while I was trying to explain to him some bit allocation detail, I just couldn't figure out why there was a division by two in the bit allocation of the band split. The explanation was simple: we just shouldn't be dividing by two. That resulted in an easy (though small) increase in quality.

Another CELT related topic that we were finally able to investigate more is allocation of the bits between the fine energy (gain) and the PVQ codebook (shape). There was a mismatch between the code and the theoretical analysis we had. After actual calculations based on (Laplacian) random data, Tim found that it matched the theory almost perfectly. The only problem is that PQevalAudio (objective quality measurement) disagrees with the theory as to what the optimal allocation is. The problem is that it's very hard to tell which one is really optimal just by listening, so this is still not fully resolved.

The last thing we've worked on (with Tim) that's still ongoing is optimising the pdfs used by the range coder for coarse energy encoding. There may be a few bits we can save there, so it's worth trying.
jmvalin: (Default)
Here's good news from the codec Working group meeting that was held on Monday. Koen Vos and I presented the prototype codec draft, including the results of an informal MUSHRA test (see slide 8). The bottom line is that the hybrid codec running with full audio bandwidth (48 kHz) at 32 kb/s significantly out-performed all other codecs under test, including BV32, SILK-WB, CELT alone and G.719. For the first three, this is hardly surprising: BV32 and SILK were using "wideband" (i.e. bandlimited at 7 kHz) audio, which just cannot match the bandwidth of the hybrid codec, and CELT was just never designed for 32 kb/s and has annoying artifacts at that rate. As for G.719, it was the closest contender in that test, but still had annoying coding noise that was easily noticeable. On the other hand, several of the listeners had a very hard time telling the hybrid codec from the original.

Following the presentation, the chairs decided to take a hum and there was "rough consensus" in the room for adopting the proposed codec as the baseline codec and thus adopting the draft as a working group document. This still has to be confirmed on the mailing list, but at least things are looking good. This doesn't mean the codec will be accepted as is, but it's a good starting point from which we can keep improving. The rest of the meeting was a lot of discussions on the requirements and the testing, which I'm sure will be better summarized in the minutes.

Other than that, the most useful part of this IETF meeting was having Koen Vos, Timothy Terriberry and me in the same place. We managed to get a lot of technical stuff done -- both conceptual and actual code. More on that later.
jmvalin: (Default)
After trying to publish the technical details of CELT, things are finally working. First the journal paper was accepted, and now the paper describing the low-complexity version of CELT has been accepted for the EUSIPCO 2009 conference. That one is based on a more recent version of CELT (0.5.1) and has comparisons with the ULD codec, which is pretty much the only high-quality codec that supports delays as small as CELT does (don't worry, CELT still comes out on top for quality!). Here are the paper details:

J.-M. Valin, T. B. Terriberry, G. Maxwell, A Full-Bandwidth Audio Codec with Low Complexity and Very Low Delay, Accepted for EUSIPCO 2009.
jmvalin: (Default)
It's been a while since the last time I discussed CELT, so at last, here's an update. A while ago, I was working on a low-complexity "profile" of CELT. The idea is to disable the use of the pitch predictor, which is quite costly in terms of complexity. To help speed things up, I also changed the allocator to do the conversion from bits to pulses one band at a time instead of doing it jointly for all bands at once. This decreases the complexity, while making the allocation a bit less optimal -- in theory. In practice, it means that for higher rates where bands require a large number of bits, the encoding can actually be more efficient because no bits are wasted. Because of that, I was able to replace all 64-bit arithmetic from CELT by 32-bit splits. On top of that, Timothy (derf) managed to -- again -- save some computation in the pulse encoding. The result is that in low-complexity mode, it takes about 1% CPU to encode and decode a 44.1 kHz mono stream at 64 kbit/s (on my 2 GHz box).

Here's what lies ahead now. I'd like to slowly work towards freezing the bit-stream. But there's a few things I want to do before even thinking about a freeze:

- Dynamic bit allocation
Right now, the bit allocation in each band remains about the same for every frame. I'd like to change that and allow more bits in the regions of the spectrum that are hard to encode at any given time. It's not as easy as it looks because: 1) you need to figure out the best allocation based on psychoacoustics and 2) you need to *encode* the allocation information compactly enough that it doesn't waste all that you saved from the dynamic allocation. So far, my attempts at 1) haven't been very successful.

- Folding decision
To prevent "birdie" artefacts, we use a certain amount of spectral folding that acts as a noise floor. In most cases, this improves quality, but for very tonal signals (e.g. glockenspiel), it transforms a pure tone into noise, which is annoying. So I'd like to be able to turn that feature on or off based on the data, but again, it's not simple.

- Stereo coupling
CELT already does stereo. It does it by encoding the energy independently for each channel and doing (sort of) M-S encoding of the "residual". This works, but probably doesn't save much compared to using two mono streams. So I want to see how it can be improved. There's already some (disabled) code to do intensity stereo, but maybe there's more that can be done.

Of course, I only have a vague idea of how to do the three things I listed above, so suggestions are welcome.
jmvalin: (Default)
I've been conducting a listening test for a paper on the CELT codec. I've been comparing it to AAC-LD, G.722.1C (aka Siren14) and MP3. Here are the results for the 48 kbit/s MUSHRA test (95% confidence intervals):

And here are the results for the 64 kbit/s MUSHRA test (95% confidence intervals):

Considering that I was just hoping CELT wouldn't be too much worse than these codecs, it's a pleasant surprise. That's because the version of CELT I tested had a latency of 8.7 ms, while the latency of AAC-LD was 34.8 ms (I know it's possible to get down to 20 ms, but the Apple implementation doesn't do it), G.722.1C was 40 ms and MP3 (LAME) was probably way above 100 ms.

In the graphs above, the error bars don't consider the fact that the MUSHRA test is paired, so there are more statistically significant results than is apparent. Basically, CELT and AAC-LD come out ahead of both G.722.1C and MP3 in both tests. CELT comes out ahead of AAC-LD at 48 kbit/s and the two are tied (i.e. no statistically significant difference could be observed) at 64 kbit/s.

Despite those results, I still think CELT can do better. Among the things I'd like to try once I'm done with the paper:
  • Add a psycho-acoustic mode and start changing the bit allocation based on the frequency content
  • Do lots of tuning
  • Do something to prevent time smearing of impulses (not TNS)
  • Encoding (or guessing) the spectral tilt in each band
  • Better stereo support
jmvalin: (Default)
Before reading this, I recommend reading part 1 and part 2. As I explained in part 1, CELT achieves really low latency by using very short MDCT windows. In the current setup, we have two 256-sample overlapping (input) MDCT windows per frame. The reason for not using a single 512-sample MDCT instead is latency (the look-ahead of the MDCT is shorter). With that setup, we get 256 output samples per frame to encode (128 per MDCT window). Now, at 44.1 kHz, it means a resolution of 172 Hz, not to mention the leakage. That's far from enough to separate female pitch harmonics, much less male ones. To the MDCT, a periodic voice signal thus looks pretty much like noise, with no clear structure that can be used to our advantage.
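
The resolution figure above follows directly from the window size. Here is a small worked calculation as a sanity check; the variable names are just for illustration, and the 100 Hz and 200 Hz pitch values in the usage note are rough, assumed typical values rather than anything measured.

```python
SAMPLE_RATE_HZ = 44100
WINDOW_LEN = 256          # input samples per MDCT window (from the text)
WINDOWS_PER_FRAME = 2     # two overlapping windows per frame (from the text)

# An MDCT produces half as many coefficients as it takes input samples.
coeffs_per_window = WINDOW_LEN // 2
coeffs_per_frame = coeffs_per_window * WINDOWS_PER_FRAME

# The coefficients of one window span 0 Hz up to the Nyquist frequency.
bin_spacing_hz = (SAMPLE_RATE_HZ / 2) / coeffs_per_window  # roughly 172 Hz

def bins_between_harmonics(pitch_hz):
    """Number of MDCT bins separating two adjacent pitch harmonics."""
    return pitch_hz / bin_spacing_hz
```

With an assumed pitch of 200 Hz, adjacent harmonics are only about one bin apart, and at 100 Hz they are less than one bin apart, which is why the harmonics cannot be resolved individually.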

To work around the poor MDCT resolution, we introduce a pitch predictor. Instead of trying to extract the structure from a single (small) frame, the pitch predictor looks outside the current frame (in the past of course) for similar patterns. Pitch prediction itself is not new. Most speech codecs (and all CELP codecs, including Speex) use a pitch predictor. It usually works in the excitation domain, where we find a time offset in the past (we use the decoded signal because the original isn't available to the decoder) that looks similar to the current frame. The time offset (pitch period) is encoded, along with a gain (the prediction gain). When the signal is highly periodic (as is often the case with voice), the gain is close to 1 and the error after the prediction is small.
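
To make the classic time-domain version concrete, here is a minimal sketch of a single-tap pitch predictor of the kind described above. It is not taken from Speex or any other codec: the function name is hypothetical, the gain is the plain least-squares gain, and the lag is assumed to be at least one frame long so that the prediction comes entirely from already-decoded samples.

```python
def pitch_predict(current, past, lag):
    """Predict `current` from the past decoded signal delayed by `lag` samples.

    Returns (gain, residual). Assumes len(current) <= lag <= len(past).
    """
    start = len(past) - lag
    prediction = past[start:start + len(current)]
    energy = sum(p * p for p in prediction)
    if energy == 0.0:
        return 0.0, list(current)
    # Least-squares gain: minimises the energy of current - gain * prediction.
    gain = sum(x * p for x, p in zip(current, prediction)) / energy
    residual = [x - gain * p for x, p in zip(current, prediction)]
    return gain, residual
```

For a strongly periodic signal and the right lag, the gain comes out close to 1 and the residual is small, so only the lag, the gain and a low-energy residual need to be encoded.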

Unlike CELP, CELT doesn't operate in the time domain, so doing pitch prediction is a bit trickier. What we need to do is find the offset in the time domain, and then apply the MDCTs (remember we have two MDCT windows per frame) and do the rest in the frequency domain. Another complication is the fact that periodicity is generally only present at lower frequencies. For speech, the pitch harmonics tend to go down (compared to the noisy part) after about 3 kHz, with very little present past 8 kHz. Most CELP codecs only have a single gain that is applied throughout the entire frame (across all frequencies). While Speex has a 3-tap predictor that allows a small amount of control on the amount of gain as a function of frequency, it's still very basic. Working in the frequency domain on the other hand, allows a great deal of flexibility. What we do is apply the pitch prediction only up to a certain frequency (e.g. 6 kHz) and divide the rest in several (e.g. 5) bands. For the example from part 2 (corresponding to mode1 of the 0.0.1 release), we use the following bands for the pitch (different from the bands on which we normalise energy):

{0, 4, 8, 12, 20, 36}

Another particularity of the pitch predictor in CELT (unlike any other algorithm I know of) is that the pitch prediction is computed on the normalised bands. That is, we apply the energy normalisation on both the current signal (X) and the delayed (pitch prediction from the past) signal (P). Because of that, the pitch gain can never exceed unity, which is a nice property when it comes to making things stable despite transmission losses. Despite a maximum value of one in the normalised domain, the "effective value" (not normalised) can be greater than one when the energy is increasing, which is the desired effect. The pitch gain for band i is computed simply as g_i = <X_i, P_i>, where <,> is the inner product and X_i is the sub-vector of X that corresponds to band i (same for P_i).
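
As a rough illustration of the per-band gain, here is a sketch that uses the band edges listed above. It simplifies things: each pitch band is normalised directly to unit norm, whereas in CELT the energy normalisation is done on a different set of bands. The band edges are assumed to be MDCT bin indices, and the function names are hypothetical.

```python
import math

PITCH_BANDS = [0, 4, 8, 12, 20, 36]  # band edges from the text (assumed to be bin indices)

def normalise(vector):
    """Scale a vector to unit norm (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(c * c for c in vector))
    return [c / norm for c in vector] if norm > 0.0 else list(vector)

def band_pitch_gains(x, p, edges=PITCH_BANDS):
    """g_i = <X_i, P_i> for each band, computed on unit-norm sub-vectors."""
    gains = []
    for lo, hi in zip(edges, edges[1:]):
        x_band = normalise(x[lo:hi])
        p_band = normalise(p[lo:hi])
        gains.append(sum(a * b for a, b in zip(x_band, p_band)))
    return gains
```

Because both sub-vectors have unit norm, the Cauchy-Schwarz inequality guarantees that each gain lies between -1 and 1, regardless of how loud the current frame is compared to the past signal.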

Here's what the distribution of the gains looks like for each band:

It's clear from the figure above that the lower bands (lower frequencies) tend to have a much higher pitch value. Because of that, a single gain for all the bands wouldn't work very well. Once the gains are computed, they need to be encoded efficiently. Again, using naive scalar quantisation and encoding each gain separately (using 3 or 4 bits each) would be a bit wasteful. So far, I've been using a trained (non-algebraic) vector quantiser (VQ) with 32 entries, which means a total of 5 bits for all gains. The advantage of VQ for that kind of data is that it eliminates all redundancy so it tends to be more efficient. There are a few disadvantages as well. Trained VQ codebooks are not as flexible and can end up taking too much space when there are many entries (I don't think 32 entries is enough for 5 gains).
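
The encoding side of such a vector quantiser is just a nearest-neighbour search. Here is a minimal sketch: the codebook below is a made-up placeholder (a real one would be trained on measured gain vectors), and the function names are hypothetical.

```python
def vq_encode(gains, codebook):
    """Return the index of the codebook entry closest to `gains` (squared error)."""
    best_index, best_error = 0, float("inf")
    for index, entry in enumerate(codebook):
        error = sum((g - e) ** 2 for g, e in zip(gains, entry))
        if error < best_error:
            best_index, best_error = index, error
    return best_index

def vq_decode(index, codebook):
    """The decoder simply looks the entry up."""
    return codebook[index]

# Placeholder codebook: 32 entries of 5 gains each, with gains that decrease
# towards the higher bands. Purely illustrative, NOT a trained codebook.
PLACEHOLDER_CODEBOOK = [
    [(k / 31.0) * (1.0 - 0.15 * band) for band in range(5)]
    for k in range(32)
]

# Number of bits needed to transmit one index into a 32-entry codebook.
INDEX_BITS = (len(PLACEHOLDER_CODEBOOK) - 1).bit_length()
```

With 32 entries, the index costs 5 bits for the whole gain vector, compared with 15 to 20 bits for five gains quantised separately at 3 or 4 bits each.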

The last point to address about the pitch predictor is calculating the pitch period. We could try all delays, apply the MDCTs and compute the gains for each and at the end decide which is best. Unfortunately, the computational cost would be huge. Instead, it's easier to do it in "open loop" just like in Speex (and many other CELP codecs). We compute the generalised cross-correlation (GCC) in the frequency domain (cheaper than computing in the time domain). The cross-spectrum (before computing the IFFT) is weighted by an approximation of the psychoacoustic masking curve just so each band contributes to the result (instead of having the lower frequencies dominate everything else).
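
The open-loop search can be sketched along these lines. Several things here are assumptions made for illustration: a naive O(N^2) DFT stands in for the FFT so the example stays self-contained, the weighting defaults to flat because the masking-curve approximation isn't specified here, and the function names are hypothetical.

```python
import cmath
import math

def dft(x, inverse=False):
    """Naive DFT; a real implementation would use an FFT instead."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    twiddle = [cmath.exp(sign * 2j * math.pi * m / n) for m in range(n)]
    out = [sum(x[t] * twiddle[(k * t) % n] for t in range(n)) for k in range(n)]
    return [v / n for v in out] if inverse else out

def open_loop_pitch(current, past, weights=None):
    """Return the lag (in samples) whose past segment best matches `current`.

    `past` holds decoded samples, most recent last; `weights` is an optional
    per-bin weighting of the cross-spectrum (flat if omitted).
    """
    n, frame_len = len(past), len(current)
    padded = list(current) + [0.0] * (n - frame_len)
    cur_spec, past_spec = dft(padded), dft(past)
    if weights is None:
        weights = [1.0] * n  # placeholder for the masking-based weighting
    cross = [w * c.conjugate() * p for w, c, p in zip(weights, cur_spec, past_spec)]
    corr = dft(cross, inverse=True)
    # corr[s] is the correlation of `current` with past[s:s + frame_len];
    # only offsets that keep the whole segment inside `past` are considered.
    best_start = max(range(n - frame_len + 1), key=lambda s: corr[s].real)
    return n - best_start
```

For a periodic input, the returned lag is a multiple of the period (the shortest lag this sketch can return is one frame length). A non-flat `weights` list would be the place to de-emphasise the low frequencies, as described above.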

Now the results: how much benefit does pitch prediction give? Quite a bit actually; hear for yourself. Here's the same speech sample encoded with or without pitch prediction. Even on music, which is not always periodic, pitch prediction can help a bit, though not as much. I think there's potential to do better on music. There are a few leads I'd like to investigate (and again, I'm open to ideas):
  • Using two pitch periods
  • Frequency-domain prediction
Feel free to ask questions below in the (likely) case something's not clear.

