jmvalin ([personal profile] jmvalin) wrote 2012-03-04 12:20 am

A Pitch-Energy Quantizer for Codec2

During LCA 2012, I got to meet face-to-face (for only the second time) with David Rowe and discuss Codec2. This led to a hacking session where we figured out how to save about 10 bits on LSP quantization by using vector quantization (VQ). This may not sound like a lot, but for a 2 kb/s codec, 10 bits every 20 ms is 500 b/s, so one quarter of the bit-rate. That new code is now in David's hands and he's been doing a good job of tweaking it to get optimal quality/bitrate. This led me to look at the rest of the bits, which are taken mostly by the pitch frequency (between 50 Hz and 400 Hz) and the excitation energy (between -10 dB and 40 dB). The pitch is currently coded linearly (constant spacing in Hz) with 7 bits, while the energy is coded linearly in dB using 5 bits. That's a total of 12 bits for pitch and energy. Now, how can we improve that?
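As a quick check of the numbers above, here is a minimal sketch of the arithmetic. The helper names are mine, and the step sizes assume a uniform quantizer whose first and last levels sit exactly on the range limits, which may not match how Codec2 actually lays out its levels.

```python
FRAME_MS = 20  # frame duration quoted above


def bits_to_bps(bits_per_frame, frame_ms=FRAME_MS):
    """Convert a per-frame bit budget into bits per second."""
    return bits_per_frame * 1000 / frame_ms


def uniform_step(lo, hi, bits):
    """Step of a uniform scalar quantizer with 2**bits levels spanning [lo, hi].

    Assumption: the end levels coincide with lo and hi.
    """
    return (hi - lo) / (2 ** bits - 1)


lsp_saving_bps = bits_to_bps(10)           # 10 bits per frame saved on the LSPs
pitch_energy_bps = bits_to_bps(7 + 5)      # current pitch + energy budget
pitch_step_hz = uniform_step(50, 400, 7)   # linear spacing in Hz
energy_step_db = uniform_step(-10, 40, 5)  # linear spacing in dB

print(lsp_saving_bps, pitch_energy_bps)
print(round(pitch_step_hz, 2), round(energy_step_db, 2))
```

Under these assumptions the 12 bits of pitch and energy cost 600 b/s, and the 5-bit energy quantizer has a step of about 1.6 dB.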

The first assumption I make here is that David already checked that both pitch and energy are encoded at the "optimal" resolution that balances bitrate and coding artefacts. To reduce the rate, we need a smarter quantizer. Below is the distribution of the pitch and energy for my training database.



So what if we were to use vector quantization to reduce the bit-rate? In theory, we could reduce the rate (for equal error) by having more codevectors in areas where the figure above shows more data. Same error, lower rate, but still a bad idea. It would be bad because it would mean that for some people, whose pitch falls into the range that is less likely, codec2 wouldn't work well. It would also mean that just changing the audio gain could make codec2 do worse. That is clearly not acceptable. We need to care not just about the mean square error (MSE), but also about the outliers. We need to be able to encode any amplitude with increments of 1-2 dB and any pitch with an increment of around 0.04-0.08 octave (between half a semitone and a semitone). So it looks like we're stuck and the best we could do is to have uniform VQ, which wouldn't save much compared to scalar quantization.
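To see why a uniform quantizer cannot save much, here is a rough count of how many cells a uniform rectangular grid needs to cover the whole pitch/energy plane at the resolutions quoted above. It assumes the 0.04-0.08 figure is in octaves (1/24 to 1/12), so that the 50-400 Hz range spans 3 octaves, or 36 semitones. The function name is hypothetical.

```python
import math

PITCH_SEMITONES = 12 * math.log2(400 / 50)  # 50-400 Hz is 3 octaves = 36 semitones
ENERGY_RANGE_DB = 40 - (-10)                # 50 dB


def grid_bits(pitch_step_semitones, energy_step_db):
    """Bits needed to index a uniform rectangular grid with the given steps."""
    pitch_levels = math.ceil(PITCH_SEMITONES / pitch_step_semitones)
    energy_levels = math.ceil(ENERGY_RANGE_DB / energy_step_db)
    return math.log2(pitch_levels * energy_levels)


coarse = grid_bits(1.0, 2.0)  # one semitone, 2 dB
fine = grid_bits(0.5, 1.0)    # half a semitone, 1 dB

print(round(coarse, 1), round(fine, 1))
```

With these assumptions the grid needs somewhere between roughly 10 and 12 bits, which is about what the two scalar quantizers already use.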

The key here is to relax our resolution constraint above. In practice, we only need such good resolution when the signal is stationary. For example, when the pitch in unvoiced frames jumps around randomly, it's not really important to encode it accurately. Similarly, energy errors are much more perceptible when the energy is stable than when it's fluctuating. So this is where prediction becomes very useful, because stationary signals are exactly the ones that are easily predicted. By using a simple first-order recursive predictor (prediction = alpha*previous_value), we can shrink the range for which we need good resolution to a fraction (1-alpha) of the original range. For example, if we have a signal that ranges from 0 to 100 and we want a resolution of 1, then using alpha=0.9, the prediction error (current_value-prediction) will have a range of 0 to 10 when the signal is stationary. We still need to have quantizer values outside that range to encode variations, but we don't need a good resolution there.
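The effect of the predictor on a stationary signal can be sketched in a few lines. When the value does not change, the residual is value - alpha*value = (1-alpha)*value, so a coefficient close to 1 (0.9 here) compresses a 0 to 100 range into roughly 0 to 10. The function names are mine, and for simplicity this sketch predicts from the previous unquantized value.

```python
def residual_range(lo, hi, alpha):
    """Residual range when the signal holds a constant value anywhere in [lo, hi]."""
    return (1 - alpha) * lo, (1 - alpha) * hi


def residuals(signal, alpha, initial=0.0):
    """Prediction residuals x[n] - alpha * x[n-1] for a sequence of values."""
    prev = initial
    out = []
    for x in signal:
        out.append(x - alpha * prev)
        prev = x
    return out


lo, hi = residual_range(0, 100, alpha=0.9)
steady = residuals([80, 80, 80], alpha=0.9, initial=80)   # stationary signal
jumpy = residuals([80, 20, 95], alpha=0.9, initial=80)    # non-stationary signal

print(lo, hi)
print(steady)
print(jumpy)
```

The stationary sequence gives residuals near 8 every frame, while the jumpy one lands far outside the 0 to 10 band, which is where coarse resolution is acceptable.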

Now that we have reduced the domain for which we need good resolution, we can actually start using vector quantization too. By combining prediction and vector quantization, it's possible to have a good enough quantizer using only 8 bits for both the energy and the pitch, saving 4 bits, so 200 b/s. The figure below illustrates how the quantizer is trained, with the distribution of the prediction residual (actual value minus prediction) in blue, and the distribution of the code vectors in red. The prediction coefficients are 0.8 for pitch and 0.9 for energy.
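A minimal sketch of how prediction and VQ fit together is shown below. It is not the Codec2 code: the function names are hypothetical, the codebook is supplied by the caller, and it assumes the encoder predicts from the previously decoded value (so encoder and decoder stay in sync), which the text above does not spell out. Only the coefficients 0.8 and 0.9 come from the post.

```python
ALPHA = (0.8, 0.9)  # prediction coefficients for (pitch, energy) quoted above


def nearest(codebook, target):
    """Index of the code vector closest to target (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - t) ** 2 for c, t in zip(codebook[i], target)),
    )


def encode(frames, codebook, alpha=ALPHA):
    """Return one codebook index per (pitch, energy) frame."""
    prev = (0.0, 0.0)  # previous decoded value, tracked exactly as the decoder does
    indices = []
    for frame in frames:
        pred = tuple(a * p for a, p in zip(alpha, prev))
        resid = tuple(x - q for x, q in zip(frame, pred))
        i = nearest(codebook, resid)
        indices.append(i)
        prev = tuple(q + c for q, c in zip(pred, codebook[i]))
    return indices


def decode(indices, codebook, alpha=ALPHA):
    """Rebuild (pitch, energy) values from codebook indices."""
    prev = (0.0, 0.0)
    out = []
    for i in indices:
        prev = tuple(a * p + c for a, p, c in zip(alpha, prev, codebook[i]))
        out.append(prev)
    return out


# Toy 9-entry codebook for illustration; an 8-bit quantizer would hold 256 vectors.
toy_codebook = [(dp, de) for dp in (-1.0, 0.0, 1.0) for de in (-1.0, 0.0, 1.0)]
toy_indices = encode([(1.0, 1.0), (1.8, 1.9)], toy_codebook)
toy_decoded = decode(toy_indices, toy_codebook)
print(toy_indices, toy_decoded)
```

With one 8-bit index per 20 ms frame, the pitch and energy together cost 400 b/s instead of 600 b/s.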



The first thing we notice from the residual distribution is that it's much less uniform, and there are two higher-density areas that stand out. The first is around (0.3,0), which corresponds to the case where the pitch and energy are stationary and is about one fifth of the range for pitch (which has a prediction coefficient of 4/5) and one tenth of the range for energy (which has a prediction coefficient of 9/10). The second higher-density area is a line around a residual energy of -2.5, and it corresponds to silence. Now looking at the codebook in red, we can see a very high density of vectors in the area of stationary speech, enough for a resolution of 1-2 dB for energy and 1/2 to 1 semitone for pitch. The difference is that this time the high resolution is only needed over a much smaller range. Now, the reason we see such a high density of code vectors around stationary speech and not so much around the "silence line" is the last detail of this quantizer: weighting. The whole codebook training procedure uses weighting based on how important the quantization error is. The weight given to pitch and energy error on stationary voiced speech is much higher than it is for non-stationary speech or silence. This is why this quantizer is able to give good enough quality with 8 bits instead of 12.
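One common way to get this kind of weighting is a weighted k-means (Lloyd) iteration, where each training vector pulls its centroid in proportion to its weight. The sketch below shows that idea only. It is an assumption that the actual training works this way, and the function name, the random initialisation and the weights passed in are all hypothetical.

```python
import random


def train_weighted_codebook(points, weights, size, iterations=20, seed=0):
    """Weighted k-means: centroids are weighted means of their assigned points."""
    rng = random.Random(seed)
    codebook = [list(p) for p in rng.sample(points, size)]
    dims = len(points[0])
    for _ in range(iterations):
        sums = [[0.0] * dims for _ in range(size)]
        totals = [0.0] * size
        for p, w in zip(points, weights):
            # Assign the point to its nearest code vector.
            i = min(
                range(size),
                key=lambda k: sum((c - x) ** 2 for c, x in zip(codebook[k], p)),
            )
            totals[i] += w
            for d, x in enumerate(p):
                sums[i][d] += w * x
        for i in range(size):
            if totals[i] > 0:  # leave a code vector in place if nothing was assigned
                codebook[i] = [s / totals[i] for s in sums[i]]
    return codebook


# Two 1-D training points; the first carries three times the weight of the second.
demo = train_weighted_codebook([(0.0,), (10.0,)], [3.0, 1.0], size=1)
print(demo)
```

In the toy run the single code vector settles at 2.5 rather than at the unweighted mean of 5, which is the same mechanism that would draw code vectors toward heavily weighted stationary voiced frames.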

neat stuff

[identity profile] prodicus.myopenid.com (from livejournal.com) 2012-03-08 08:54 pm (UTC)
Amazing what you and David are managing to accomplish down at 1% of the raw 8kHz audio bitrate. Any word on plans to release another alpha? In the LCA video and some discussion on his blog it kinda seemed like a release was waiting on Customs etc, is that still an issue?

Just out of curiosity I checked out the latest codec2-dev svn to see whether anything would build on cygwin. Needed to comment out valgrind-related lines in vq_train_jvm.c and then it seemed to build OK. Trying to use c2enc at 1200 resulted in
assertion "nbit == codec2_bits_per_frame(c2)" failed: file "codec2.c", line 848, function: codec2_encode_1200
Aborted (core dumped)
With a svn version I don't know whether this is simply expected breakage or something that would be useful to debug. Successfully encoded and decoded a sample of my own at 1500 and 2500; as expected there's very little quality difference. The clip has both female and male speakers; while the female voice is still mostly comprehensible it has a _lot_ more/more obvious distortion and artifacts than the male speaker's.

Re: neat stuff

[identity profile] prodicus.myopenid.com (from livejournal.com) 2012-03-08 08:57 pm (UTC)
Just found an email on the mailing list which explains the crash: David said "I was experimenting with 30 bit VQ codebooks recently, which is why 1200 is broken." So it's just expected breakage.

Re: neat stuff

[identity profile] https://www.google.com/accounts/o8/id?id=AItOawl32fMtu4nSLABF_Uae-OmkSo2y-9J9xAo (from livejournal.com) 2012-07-11 01:09 pm (UTC)
Yes, works in cygwin very well.

Codec with lower than 1200bps?

[identity profile] https://www.google.com/accounts/o8/id?id=AItOawl32fMtu4nSLABF_Uae-OmkSo2y-9J9xAo (from livejournal.com) 2012-07-11 01:07 pm (UTC)
"This led to a hacking session where we figured out how to save about 10 bits on LSP quantization by using vector quantization (VQ). This may not sound like a lot, but for a 2 kb/s codec, 10 bits every 20 ms is 500 b/s, so one quarter of the bit-rate."
Does this apply to the 1200bps as well? Is it possible to go under the FS-1015's 800bps, without creating a synthetic voice?
Can't wait to test such a codec in my simulations.