Ever since we started working on Opus at the IETF, it's been a recurring theme.
"You guys don't know how to test codecs", "You can't be serious unless you spend
$100,000 testing your codec with several independent labs", or even "designing
codecs is easy, it's testing that's hard". OK, subjective testing is indeed
important. After all, that's the main thing that differentiates serious signal
processing work from idiots using $1000 directional, oxygen-free
speaker cables. However, just like speaker cables, more expensive listening tests
do not necessarily mean more useful results. In this post I'm going to explain
why this kind of thinking is wrong. I will avoid naming anyone here because I
want to attack the myth of the $100,000 listening test, not the people who
believe in it.
In the Beginning
Back in the 70s and 80s, digital audio equipment was very expensive,
complicated to deploy, and difficult to test at all.
Not everyone could afford analog-to-digital converters (ADCs) or digital-to-analog
converters (DACs), so any testing required using expensive, specialized labs.
When someone came up with a new piece of equipment or a codec,
it could end up being deployed for several decades, so
it made sense to give it to one of these labs to test the hell
out of it.
At the same time, it wasn't too hard to do
a good job of testing because algorithms were generally simple and codecs only
supported one or two modes of operation. For example, a codec
like G.711 only has a single bitrate and can be implemented in less than 10 lines
of code. With something that simple, it's generally not too hard to have 100% code
coverage and make sure all corner cases are handled correctly. Considering
the investments involved, it just made sense to pay tens or hundreds of thousands
of dollars
to make sure nothing blows up. This was paid for by large telcos and their
suppliers, who could afford it anyway.
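
To back up the "less than 10 lines" claim about G.711, here's a minimal sketch of a
mu-law encoder in Python. The function name and layout are mine, and the actual
standard is specified as a fixed-point, table-driven procedure, but the core logic
genuinely is this small:

    def ulaw_encode(pcm: int) -> int:
        """Encode one 16-bit signed PCM sample into an 8-bit mu-law byte (G.711-style sketch)."""
        BIAS, CLIP = 0x84, 32635
        sign = 0x80 if pcm < 0 else 0x00               # sign bit of the output byte
        mag = min(abs(pcm), CLIP) + BIAS               # clip, then bias the magnitude
        exponent = mag.bit_length() - 8                # segment number, 0..7
        mantissa = (mag >> (exponent + 3)) & 0x0F      # 4-bit step within the segment
        return ~(sign | (exponent << 4) | mantissa) & 0xFF   # G.711 transmits the bits inverted

With a single mode and only 65,536 possible input values per sample, you can
literally run every possible input through the encoder and compare against
reference test vectors, which is why exhaustive testing was realistic back then.
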
Things remained pretty much the same through the 90s.
When G.729 was standardized in 1995, it still only had a single bitrate, and
the computational complexity was still beyond what a PC could do in real-time.
A few years later, we finally got codecs like AMR-NB that supported several
bitrates, though the number was still small enough that you could test each of
them.
Enter Opus
When we first attempted to create a codec working group (WG) at the IETF, some folks were
less than thrilled to have their "codec monopoly" challenged. The first objection we heard
was "you're not competent enough to write a codec". After pointing out that we
already had three candidate codecs on the table (SILK, CELT, BroadVoice), created by the
authors of three already-deployed codecs (iSAC, Speex, G.728), the objection quickly
switched to testing. After all, how was the IETF going to review this work and
make sure it was any good?
The best answer came from an old-time ("gray beard")
IETF participant and was along the lines of: "we at the IETF are used to
reviewing things that are a lot harder to evaluate, like crypto standards. When
it comes to audio, at least all of us have two ears". And it makes sense.
Of all the things the IETF does (transport protocols, security, signalling,
...), codecs are among the easiest to test because at least you know the criteria and
they're directly measurable. Audio quality is a hell of a lot easier to measure
than "is this cipher breakable?", "is this signalling extensible enough?", or "Will
this BGP update break the Internet?"
Of course, that was not the end of the testing story. For many months in
2011 we were again faced with never-ending complaints that Opus "had not been tested".
There was this implicit assumption that testing the final codec improves the
codec. Yeah right!
Apparently, the Big-Test-At-The-End is meant to ensure that the codec is good
and, if it's not, you have to go back to the drawing board. Interestingly,
I'm not aware of a single ITU-T codec for which that happened.
On the other hand, I am aware of at least one case where the Big-Test-At-The-End
revealed something wrong.
Let's look at the
listening test results from the AMR-WB (a.k.a. G.722.2) codec. AMR-WB
has 9 bitrates, ranging from 6.6 kb/s to 23.85 kb/s. The interesting thing
about the results is that the highest mode, 23.85 kb/s, actually scores lower
than the 23.05 kb/s mode just below it. That's a sign that something's gone
wrong somewhere. I'm not aware of why that was the case or what happened from
there, but apparently it didn't bother people enough to actually fix the problem.
That's the problem with final tests: they're final.
A Better Approach
What I've learned from Opus is that it's possible to have tests that are
far more useful and much cheaper. First, final tests aren't that useful.
Although we did conduct some of those, ultimately their main use ends up
being for marketing and bragging rights.
After all, if you still need these
tests to convince yourself that your codec is any good, something's very wrong
with your development process. Besides, when you look at a codec like Opus,
you have about 1200 possible bitrates, using three different coding modes,
four different frame sizes, and either mono or stereo input. That's far more than one can
reliably test with traditional subjective listening tests. Even if you
could, modern codecs are complex enough that some problems may only occur
with very specific audio signals.
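
To put a rough number on that test matrix, here's a back-of-the-envelope count
using the figures above (the dimensions aren't fully independent in practice, so
treat this as an order-of-magnitude estimate):

    # Rough size of the Opus configuration space, from the numbers quoted above.
    bitrates     = 1200   # approximate count of distinct bitrates
    coding_modes = 3      # SILK-only, hybrid, and CELT-only
    frame_sizes  = 4
    channels     = 2      # mono or stereo

    print(bitrates * coding_modes * frame_sizes * channels)   # 28800 operating points

Even if each of those roughly 29,000 operating points needed only a single test
condition, that's orders of magnitude beyond what a formal listening test, with
its handful of conditions, multiple listeners, and carefully chosen samples, can
cover.
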
The single testing approach that gave us the
most useful results was also the simplest: just put the code out there so
people can use it. That's how we got reports like "it works well overall, but
not on this rare piece of post-neo-modern folk metal" or "it worked
for all our instruments except my bass". This is not something you can catch
with ITU-style testing. It's one of the most fundamental principles
of open-source development: "given enough eyeballs, all bugs are shallow".
Another approach was simply to throw tons of audio at
it and evaluate the quality using PEAQ-style objective measurement tools.
While these tools are generally unreliable for precise evaluation of a
codec's quality, they're pretty good at flagging files the codec does badly
on for further analysis.
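
In spirit (though not our actual scripts), that screening loop looks something
like the sketch below: encode and decode every file in a corpus with the
opusenc/opusdec command-line tools from opus-tools, score the decoded output
against the original with whatever objective metric is at hand, and set aside
anything that scores suspiciously low. The objective_score() helper, the
threshold, and the single 64 kb/s rate are placeholders; in practice you'd plug
in your own measurement tool and sweep many bitrates.

    import subprocess
    from pathlib import Path

    THRESHOLD = 3.5   # arbitrary "worth a closer listen" cutoff on a 1-5 quality scale

    def objective_score(reference: Path, decoded: Path) -> float:
        """Placeholder for a PEAQ-style objective metric; plug in whatever tool you have."""
        raise NotImplementedError

    outdir = Path("decoded")
    outdir.mkdir(exist_ok=True)
    suspects = []
    for wav in sorted(Path("corpus").glob("**/*.wav")):
        opus = outdir / (wav.stem + ".opus")
        dec  = outdir / (wav.stem + ".dec.wav")
        # opusenc/opusdec are the opus-tools command-line encoder and decoder.
        subprocess.run(["opusenc", "--bitrate", "64", str(wav), str(opus)], check=True)
        subprocess.run(["opusdec", str(opus), str(dec)], check=True)
        score = objective_score(wav, dec)
        if score < THRESHOLD:             # flag outliers for manual listening
            suspects.append((score, wav))

    for score, wav in sorted(suspects):
        print(f"{score:.2f}  {wav}")

The useful output isn't an average score; it's the short list of worst-scoring
files that then get a careful manual listen.
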
We ended up using more than a
dozen different approaches to testing, including
various flavours of fuzzing. In the end, when it comes to the final testing, nothing
beats having the thing out there. After all, as our Skype friends would put it:
Which codec do you trust more? The codec that's been tested by dozens of listeners
in a highly controlled lab, or the codec that's been tested by hundreds of millions
of listeners in just about all conditions imaginable?
It's not like we actually
invented anything here either. Software testing has evolved quite a bit since the
80s and we've mainly attempted to follow the best practices rather than use antiquated
methods "because that's what we've always done".