A Pitch-Energy Quantizer for Codec2
Mar. 4th, 2012 12:20 amThe first assumption I make here is that David already checked that both gain and energy are encoded at the "optimal" resolution that balances bitrate and coding artefacts. To reduce the rate, we need a smarter quantizer. Below is the distribution of the pitch and energy for my training database.
So what if we were to use vector quantization to reduce the bit-rate. In theory, we could reduce the rate (for equal error) by having more codevectors in areas where the figure above shows more data. Same error, lower rate, but still a bad idea. It would be bad because it would mean that for some people, whose pitch falls into the range that is less likely, codec2 wouldn't work well. It would also mean that just changing the audio gain could make codec2 do worse. That is clearly not acceptable. We need to not just care about the mean square error (MSE), but also about the outliers. We need to be able to encode any amplitude with increments of 1-2 dB and any pitch with an increment around 0.04-0.08 (between half a semitone and a semitone). So it looks like we're stuck and the best we could do is to have uniform VQ, which wouldn't save much compared to scalar quantization.
The key here is to relax our resolution constraint above. In practice, we only need such good resolution when the signal is stationnary. For example, when the pitch in unvoiced frames jumps around randomly, it's not really important to encode it accurately. Similarly, energy error are much more perceivable when the energy is stable than when it's fluctuating. So this is where prediction becomes very useful, because stationary signals are exactly the ones that are easily predicted. By using a simple first-order recursive predictor (prediction = alpha*previous_value), we can reduce the range for which we need good resolution by a factor (1-alpha). For example, if we have a signal that ranges from 0 to 100 and we want a resolution of 1, then using alpha=0.1, the prediction error (current_value-prediction) will have a range of 0 to 10 when the signal is stationary. We still need to have quantizer values outside that range to encode variations, but we don't need a good resolution.
Now that we have reduced the domain for which we need good resolution, we can actually start using vector quantization too. By combining prediction and vector quantization, it's possible to have a good enough quantizer using only 8 bits for both the energy and the pitch, saving 4 bits, so 200 b/s. The figure below illustrates how the quantizer is trained, with the distribution of the prediction residual (actual value minus prediction) in blue, and the distribution of the code vectors in red. The prediction coefficients are 0.8 for pitch and 0.9 for energy.
First thing we notice from the residual distribution is that it's much less uniform and there's two higher-density areas that stand out. The first is around (0.3,0), which corresponds to the case where the pitch and energy are stationary and is about one fifth of the range for pitch (which has a prediction coefficient of 4/5) and one tenth of the range for energy (which has a prediction coefficient of 9/10). The second higher-density area is a line around residual energy of -2.5, and it corresponds to silence. Now looking at the codebook in red, we can see a very high density of vectors in the area of stationary speech, enough for a resolution of 1-2 dB energy and 1/2 to 1 semitone for pitch. The difference is that this time the high resolution is only needed for much smaller range. Now, the reason we see such a high density of code vectors around stationary speech and not so much around the "silence line" is that the last detail of this quantizer: weighting. The whole codebook training procedure uses weighting based on how important the quantization error is. The weight given to pitch and energy error on stationary voiced speech is much higher than it is for non-stationary speech or silence. This is why this quantizer is able to give good enough quality with 8 bits instead of 12.
Back From LCA, Video Available
Jan. 22nd, 2012 05:25 pmI just got back from linux.conf.au 2012 in Ballarat. The video for the talk I gave, Opus, the Swiss Army Knife of Audio Codecs, is now available on the Opus presentations page. For the Ogg-impaired, a lower-quality version is also available on YouTube.
For those who are into speech codecs, I also recommend watching David Rowe's presentation: Codec 2 - Open Source Speech Coding at 2400 bit/s and Below. His presentation was selected as one of the four best talks at LCA this year -- well worth watching.
Opus quality update
Jan. 10th, 2012 04:40 amThose who have been following the Opus git repository in the past few weeks probably haven't noticed much work going on. The reason is pretty simple, most of the work has been going on elsewhere in an experimental branch (exp_wip3 names for now) of my private repository. The reason it's in an experimental branch is that its not fully converted to fixed-point and hasn't been tested on any frame size other than 20 ms. Here's an (incomplete) list of changes for now:
- Really unconstrained VBR (not trying to keep the same average rate)
- Tonality detection to give highly tonal audio a boost in bit-rate
- (yet another) rewrite of the transient detection code
- New dynamic allocation code that boosts the rate of bands that have significant spectral leakage caused by short blocks
Thanks to these changes, the quality has (as far as we can tell) gone up compared to the current master branch. I invite you to judge for yourself by comparing the audio coded with the current master branch with the audio coded with the new exp_wip3 experimental branch. This is 64 kb/s, so fairly low rate for stereo music. The original is here. Let me know what you think.
Opus, the Swiss Army Knife of Audio Codecs
Sep. 7th, 2011 09:45 pmCodec Requirements Go RFC, IETF Update
Sep. 1st, 2011 12:35 pmNow the interesting part of the Opus codec itself. That's the only document that really matters. That one should go to Working Group Last Call (WGLC) pretty soon (possibly next week or two). In the mean time, we're working on improving the clarity of the draft, cleaning up the code and fixing all the last few issues that have been reported since the first WGLC. Stay tuned.
IETF update, Mozilla
Aug. 1st, 2011 09:41 amMy week at the IETF meeting was also my first week at my new job working for Mozilla. I've been hired specifically to work on Opus and other codec/multimedia development, so I should have a lot more time for that than I used to. First thing on my list: finishing the Ogg mapping for Opus and releasing an Ogg encoder and decoder.
A complete demo of CELT
Dec. 23rd, 2010 06:01 pmEvolution of CELT
Sep. 3rd, 2010 09:28 pmOK, I know the quality isn't that good at such a low rate, so here's a slightly higher bit-rate. This is current git at 64 kb/s, compared to G.719 at the same rate. I'm curious to hear comments about how CELT does compared to G.719 because we haven't done any formal comparison yet on music.
Even at 64 kb/s, the artefacts are generally audible, even though they're usually no longer annoying. They start being less audible at 80 kb/s, as you can see, and then the quality continues going up all the way to 256 kb/s or even higher.
On the CELT front
Jul. 30th, 2010 07:11 pmAnother CELT related topic that we were finally able to investigate more is allocation of the bits between the fine energy (gain) and the PVQ codebook (shape). There was a mismatch between the code and the theoretical analysis we had. After actual calculations based on (Laplacian) random data, Tim found that it matched the theory almost perfectly. The only problem is that PQevalAudio (objective quality measurement) disagrees with the theory as to what the optimal allocation is. The problem is that it's very hard to tell which one is really optimal just by listening, so this is still not fully resolved.
The last thing we've worked on (with Tim) that's still ongoing is optimising the pdfs used by the range coder for coarse energy encoding. There may be a few
bits there we can save so, it's worth trying.
Following the presentation, the chairs decided to take a hum and there was "rough consensus" in the room for adopting the proposed codec as the baseline codec and thus adopting the draft as a working group document. This still has to be confirmed on the mailing list, but at least things are looking good. This doesn't mean the codec will be accepted as is, but it's a good starting point from which we can keep improving. The rest of the meeting was a lot of discussions on the requirements and the testing, which I'm sure will be better summarized in the minutes.
Other than that, the most useful part of this IETF meeting was having Koen Vos, Timothy Terriberry and I in the same place. We managed to get a lot of technical stuff done -- both conceptual and actual code. More on that later.
CELT development update
Nov. 11th, 2008 08:04 pmHere's what lies ahead now. I'd like to slowly work towards freezing the bit-stream. But there's a few things I want to do before even thinking about a freeze:
- Dynamic bit allocation
Right now, the bit allocation in each band remains about the same for every frame. I'd like to change that and allow more bits in the regions of the spectrum that are hard to encode at any given time. It's not as easy as it looks because: 1) you need to figure out the best allocation based on psychoacoustics and 2) You need to *encode* the allocation information compactly enough that it doesn't waste all you saved from the dynamic allocation. So far, my attempts at 1) haven't been very successful.
- Folding decision
To prevent "birdie" artefacts, we use a certain amount of spectral folding that acts as a noise floor. In most cases, this improves quality, but for very tonal signals (e.g. glockenspiel), it transforms a pure tone into noise, which is annoying. So I'd like to be able to turn that feature on or off based on the data, but again, it's not simple.
- Stereo coupling
CELT already does stereo. It does it by encoding the energy independently for each channel and doing (sort of) M-S encoding of the "residual". This works, but probably doesn't save much compared to using two mono streams. So I want to see how it can be improved. There's already some (disabled) code to do intensity stereo, but maybe there's more that can be done.
Of course, I only have a vague idea of how to do the three things I listed above, so suggestions are welcome.
Testing CELT
May. 14th, 2008 08:32 pmAnd here are the results for the 64 kbit/s MUSHRA test (95% confidence intervals):
Considering that I was just hoping wouldn't be too much worse than these codecs, it's a pleasant surprise. That's because the version of CELT I tested had a latency of 8.7 ms, while the latency of AAC-LD was 34.8 ms (I know it's possible to get down to 20 ms, but the Apple implementation doesn't do it), G.722.1C was 40 ms and MP3 (LAME) was probably way above 100 ms.
In the graphs above, the error bars don't consider the fact that the MUSRA test is paired, so there's more statistically significant results than what is apparent. Basically, CELT and AAC-LD come out ahead of both G.722.1C and MP3 in both tests. CELT comes out ahead of AAC-LD at 48 kbit/s and the two are tied (i.e. no statistically significant difference could be observed) at 64 kbit/s.
Despite those results, I still think CELT can do better. Among the things I'd like to try once I'm done with the paper:
- Add a psycho-acoustic mode and start changing the bit allocation based on the frequency content
- Do lots of tuning
- Do something to prevent time smearing of impulses (not TNS)
Encoding (or guessing) the spectral tilt in each band- Better stereo support
CELT part 3: Pitch prediction
Dec. 24th, 2007 08:40 pmTo work around the poor MDCT resolution, we introduce a pitch predictor. Instead of trying to extract the structure from a single (small) frame, the pitch predictor looks outside the current frame (in the past of course) for similar patterns. Pitch prediction itself is not new. Most speech codecs (and all CELP codecs, including Speex) use a pitch predictor. It usually works in the excitation domain, where we find a time offset in the past (we use the decoded signal because the original isn't available to the decoder) that looks similar to the current frame. The time offset (pitch period) is encoded, along with a gain (the prediction gain). When the signal is highly periodic (as is often the case with voice), the gain is close to 1 and the error after the prediction is small.
Unlike CELP, CELT doesn't operate in the time domain, so doing pitch prediction is a bit trickier. What we need to do is find the offset in the time domain, and then apply the MDCTs (remember we have two MDCT windows per frame) and do the rest in the frequency domain. Another complication is the fact that periodicity is generally only present at lower frequencies. For speech, the pitch harmonics tend to go down (compared to the noisy part) after about 3 kHz, with very little present past 8 kHz. Most CELP codecs only have a single gain that is applied throughout the entire frame (across all frequencies). While Speex has a 3-tap predictor that allows a small amount of control on the amount of gain as a function of frequency, it's still very basic. Working in the frequency domain on the other hand, allows a great deal of flexibility. What we do is apply the pitch prediction only up to a certain frequency (e.g. 6 kHz) and divide the rest in several (e.g. 5) bands. For the example from part 2 (corresponding to mode1 of the 0.0.1 release), we use the following bands for the pitch (different from the bands on which we normalise energy):
{0, 4, 8, 12, 20, 36}
Another particulatity of the pitch predictor in CELT (unlike any other algorithm I know of) is that the pitch prediction is computed on the normalised bands. That is we apply the energy normalisation on both the current signal (X) and the delayed (pitch prediction from the past) signal (P). Because of that, the pitch gain can never exceed unity, which is a nice property when it comes to making things stable despite transmission losses. Despite a maximum value of one in the normalised domain, the "effective value" (not normalised) can be greater than one when the energy is increasing, which is the desired effect. The pitch gain for band i is computed simply g_i = <X_i, P_i>, where <,> is the inner product and X_i is the sub-vector of X that corresponds to band i (same for P_i).
Here's what the distribution of the gains look like for each band:
It's clear from the figure above that the lower bands (lower frequencies) tend to have a much higher pitch value. Because of that, a single gain for all the bands wouldn't work very well. Once the gains are computed, they need to be encoded efficiently. Again, using naive scalar quantisation and encoding each gain separately (using 3 or 4 bits each) would be a bit wasteful. So far, I've been using a trained (non-algebraic) vector quantiser (VQ) with 32 entries, which means a total of 5 bits for all gains. The advantage of VQ for that kind of data is that it eliminates all redundancy so it tends to be more efficient. The are a few disadvantages as well. Trained VQ codebooks are not as flexible and can end up taking too much space when there are many entries (I don't think 32 entries is enough for 5 gains).
The last point to address about the pitch predictor is calculating the pitch period. We could try all delays, apply the MDCTs and compute the gains for each and at the end decide which is beat. Unfortunately, the computational cost would be huge. Instead, it's easier to do it in "open loop" just like in Speex (and many other CELP codecs). We compute the generalised cross-correlation (GCC) in the frequency domain (cheaper than computing in the time domain). The cross-spectrum (before computing the IFFT) is weighted by an approximation of the psychoacoustic masking curve just so each band contributes to the result (instead of having the lower frequencies dominate everything else).
Now the results: how much benefit does pitch prediction give? Quite a bit actually, hear for yourself. Here's the same speech sample encoded with or without pitch prediction. Even on music, which is not always periodic, pitch prediction can a bit, though not as much. I think there's potential to do better on music. There's a few leads I'd like to investigate (and again, I'm open to ideas):
- Using two pitch periods
- Frequency-domain prediction
CELT part 2: Coding into bands
Dec. 20th, 2007 08:35 pmIdeally, the width of each band should be roughly one critical band. In practice, there isn't much much point in having a single frequency bin per critical band, so although the ear has roughly 25 critical bands, we only use about 15-20 in CELT. Here's an example using 256-sample MDCTs (128 output samples) and 15 bands. The band boundaries are:
{0, 2, 4, 6, 8, 12, 16, 20, 24, 28, 36, 44, 52, 68, 84, 116, 128}
Using this, band number 0 includes samples 0 and 1, while band number 14 includes samples 84 to 115. The remaining samples (116-127) are just discarded because they are outside the ear's range (a 44.1 kHz sampling rate is assumed here).
Now, the first thing we need to do is actually encode the energy of each band in an efficient way. The ear is more sensitive to lower frequencies, so these will need to be encoded with better resolution. Of course, we use the log (dB) domain and add a small value (equivalent to -10 dB) just to prevent overflows when taking the log. In this example, we use a quantisation interval of 0.75 dB for the lowest band, increasing linearly to 4.25 dB for the highest band. Doing naive quantisation/encoding over a fixed range would require a prohobitive number of bits (>100 bits per frame) and is thus not an option. Measuring the ideal entropy (assuming a perfect probability model for the data) for same speech and music samples gives us an average of 71 bits per frame. That's still expensive, considering we're going to encode around 200 frames per second.
The only way to further reduce the number of bits used for energy quantisation is to eliminate redundancy. Energy usually doesn't vary that much from one frame to the next, so we can use a time-domain predictor of the form P(z) = 1 - alpha*z^-1. That means we remove from the current energy alpha times the previous energy (we're already in log domain). Here's what the entropy per frame looks like (as a function of alpha) if we use that predictor:
That's already much better. As we increase the prediction coefficient (alpha) from 0 to 1, we can reduce the entropy from 71 bits down to around 45 bits, a 26 bits improvement. Unfortunately, using alpha=1 for prediction isn't practical because it would mean that any transmission error (e.g. lost packet) would propagate through time with no attenuation (even 0.95 would take too long). A value of alpha around 0.7 would be a nice tradeoff between redundancy reduction and limited error propagation. That's 52 bits per frame. However, we're not done yet eliminating redundancy. There's still a correlation across the bands in the same frame. This time, we can use any predictor we like because a frame either arrives completely or it doesn't. So we use a second predictor Q(z) = (1 - z^-1)/(1 - beta*z^-1). With that second predictor, the entropy goes down again:
With alpha=0.7 and beta=0.5, we have just under 44 bits of entropy. Much better than the 71 bits we started from and even better than only the first predictor with alpha=1. Of course, that entropy value is optimistic because it assumes a perfect probability model and because it assumes that prediction isn't degraded by quantisation.
For encoding, it's not very practical to use the actual probability model because would require storing the probability for each value of each band (and for each bit-rate if we change the resolution). However, it turns out that the distributions are somewhere between a Gaussian distribution and a Laplacian distribution. Although actually closer to being Gaussian, we use a Laplacian model because it reduces the spikes in bit-rate (a Gaussian would significantly underestimate the probability of extreme values). Despite the rough approximation, the average actual encoding rate for all 15 bands is 46 bits per frame. That's just 2 bits worse than the theoretical best case using that predictor. Not bad at all.
I've also played with the DCT (for intra-frame redundancy) without getting better results, mainly because it's harder to control the error in each band. Still, there may be better ways that what I've done so far to reduce the bit-rate for the energy. I'm open to ideas/suggestions on that.
Updated: Fixed the definition of the example bands.
Below is an overview of how CELT works:
It may look a bit hairy, but it's actually a relatively simple idea. The four main ideas are:
- We use a lapped transform (here an MDCT) on very short windows (128-256 samples)
- The spectrum is divided in bands and the energy in each band is encoded and kept constant
- We use a time-domain pitch predictor, with frequency-domain gains
- The residual is encoded using a pulse codebook
I'll address each of these (and more) in later posts.
Preparing for releases
Dec. 8th, 2007 11:52 pmThere's another releasing coming up: a new Code-Excited Lapped Transform (CELT) codec prototype. This codec is (for now at least) meant to compete with neither Vorbis, nor Speex. Instead, the primary idea is to reduce latency to a minimum -- currently around 8 ms (compared to ~25 ms for Speex and ~100 ms for Vorbis). Of course, this comes with a price in terms of efficiency, but I'm already surprised the price isn't bigger. I've been mainly focusing on speech, but unlike Speex, I'm hoping this one will handle music as well. For the curious, I've put a 56 kbps CBR music file (original). This is still very experimental and everything is still likely to change, including the exact goals. I'm still trying to figure out how to put psychoacoustics into that. Stay tuned for the release of version 0.0.1 (or should I use a negative version number to make it clearer it's experimental?).
CELT is based on a paper I submitted to ICASSP and which I'm hoping will be accepted so I can make it available to everyone. The only difference is that the ICASSP paper was based on the FFT (non critically sampled), whereas this version is based on the MDCT. One part that is already published though is Timothy's explanation of the pulse codebook encoding along with some source code. Now, here's a challenge. Who can beat the algorithm on Timothy's page? Simply stated, the idea is to enumerate all combinations of M pulses in a vector of dimension N, knowing that pulses have unit magnitude and a sign, but all pulses at the same position need to have the same sign.
Updated: The full source for CELT is available at: http://downloads.us.xiph.org/releases/celt/celt-0.0.1.tar.gz or through git at http://git.xiph.org/celt.git
Updated again:: Speex 1.2beta3 is out.