Zimtohrli: A New Psychoacoustic Perceptual Metric for Audio Compression

12 days ago (github.com)

27 comments

judiisis

A few comments:

- My understanding is that a gamma chirp is the established filter to use for an auditory filter bank--any reason you choose an elliptical filter instead?

- I didn't look too closely, but it seems like you are analyzing the output of the filter bank as real numbers. I highly recommend you convolve with a complex representation of the filter and keep all of the math in the complex domain until you collapse to loudness.

- I'd not bucket to discrete 100hz time slices, instead just convolve the temporal masking function with the full time resolution of the filter bank output.

- You want to think about some volume normalization step that would give the final minimized Zimtohrli distance metric between A and B*x, where x is a free variable for volume. Otherwise, a perceptual codec that just tends to make things a bit quieter might get a bad score.

- For fletcher munson, I assume you are just using a curve at a high-ish volume? If so, good :)

- Not sure how you are spacing filter bank center frequencies relative to ERB size, but I'd recommend oversampling by a factor of 2-3. (That is, a few filters per ERB).

Apologies if any of these are off base--I just took a quick look.

qingcharles 11 days ago

This man codecs.

givinguflac 11 days ago

I looked through the deeper explanation and found this interesting:

“Performing a simple experiment where we have 5 separate components

1000 Hz sine probe 57 dB SPL 750 Hz sine masker A at 71dB SPL 800 Hz sine masker B at 71 dB SPL 850 Hz sine masker C at 67 dB SPL 900 Hz sine masker D at 65 dB SPL I record the following data

When playing probe + masker A through D individually I experience the probe approximately as intensely as a 1000Hz tone at 53dB SPL. When playing probe + all maskers I experience the probe approximately as intensely as a 1000Hz tone at 48dB SPL.”

I would be very interested in understanding more about their testing methodology and hardware setup especially.

Is the perceiver a trained listener? Are they using headphones or speakers or some other transducer method?

It's awfully difficult to say that there is equivalent perceived SPL for different frequency domains, even as a trained listener. Especially given the different frequency response for different listening setups.

The average user has no chance; hence my curiosity of their specific credentials considering they’re building an entirely new perceptual model based on that.

DoctorOetker 11 days ago
>It's awfully difficult to say that there is equivalent perceived SPL for different frequency domains, even as a trained listener.
The snippet you quote doesn't claim comparing intensities at different frequencies.
He is comparing only perceived 1kHz intensities, (in the presence or absence of maskers at other frequencies, whose intensity is not subjectively being scored)
- givinguflac 11 days ago
  
  Ah, thank you for clarifying, I misunderstood but still have the same curiosity about their methods .

Thoreandan 11 days ago

Interesting, if hard-to-understand.

It would be nice to see ELi5 explanations for items like this akin to Monty's 'A Digital Media Primer for Geeks' ( https://people.xiph.org/~xiphmont/demo/#:~:text=Xiph )

formerly_proven 12 days ago

I'm guessing the name is meant to allude to cinnamon pig ears (https://en.wikipedia.org/wiki/Palmier).

atoav 12 days ago
Probably, this Zimt is cinnamon, Ohrli is swiss German dialect for ear.
- jo-m 12 days ago
  
  Öhrli actually. It maddens me to no end that they pick words which contain umlauts and then leave them out.
  
  1 reply →
- b3orn 11 days ago
  
  Öhrli is actually the swiss german diminutive of Ohr (ear). Swiss german uses -li a lot for diminutives, whereas standard german uses -chen or -lein, the vowel of them stem is turned into an umlaut Ohr -> Öhrli/Öhrchen/Öhrlein.
w-m 12 days ago
Likely inspired by https://github.com/google/butteraugli
- kopadudl 11 days ago
  
  Also https://github.com/google/zopfli and https://github.com/google/brotli

DoctorOetker 11 days ago

Are there any associated scientific articles and/or datasets that back up the experimental claim/insinuation of matching JNDs or perceptual differences?

Is this a proposal without experimental verification?

Lerc 11 days ago

This seems to be targeted at signals that are already quite close. Is there anything similar for broad ballpark similarity?

Whenever I save searched for such things I have more often encountered techniques designed to detect re-use for copyright reasons.

I have played around with generating instrument sounds from a blend of very few basic waveforms with attack,decay,sustain,release, pitch sliding and bell modulation.

While it is quite fun just trying to make things by tweaking parameters, your ear/perception drifts as you hear the same thing over and over.

It would be really nice to have an automated "how close is this abomination?". I'd even give evolution a go to try and make some more difficult matches.

tux3 11 days ago
How close is broad ballpark, have you tried chromaprint?
It's probably far from state of the art today, but you can get a percentage similarity out of it. I've successfully used it to find similar (or outright duplicate) songs in a big library
- Lerc 11 days ago
  
  Things like chromaprint are why I have found it difficult to search for what I want.
  These tools are geared towards identifying matches (one to many) Chromaprint specifically bins things into notes assuming it it trying to match music.
  I'm after something that will tell me, in human perception, how much a dogs bark sounds like a quack.
  One to one comparison of short-ish samples with no assumption of content style.

yalok 12 days ago

It’d be very interesting to see the results for this metric for the existing audio and voice codecs (like AAC, AAC-LD, mp3, opus), and how it compares to the existing metrics for them?

Couldn’t find it in their paper.

ant6n 11 days ago

This says it works on just-noticeable-differences. Would this work well if the quality of the compressed audio is very poor? Could one for example compare two speech codecs at 8Khz, 4bit against the original source to find out which one sounds better?

Or should one just... I dunno, calculate the mean squared error in some sort of continuous frequency domain, perhaps weighted by some hearing curve.

mrob 11 days ago

Audibility of error (and sound in general) depends on what other audio is playing at the same time, with both frequency domain and time domain effects:
https://en.wikipedia.org/wiki/Auditory_masking
Here's a two-part lecture with audio demonstrations by Bernhard Seeber of the Audio Information Processing Group at the Technical University of Munich:
https://www.youtube.com/watch?v=R9UZnMsm9o8
https://www.youtube.com/watch?v=bU0_Kaj7cPk
A simple weighed frequency domain error calculation is not very useful for comparing lossy audio codecs, because effectively exploiting auditory masking to hide the errors is a major factor in codec quality.
jononor 11 days ago

PEAQ/PESQ and visqol is worth trying for that. In principle they operate as you suggest. I keep a short overview of audio quality methods/tools here: https://github.com/jonnor/machinehearing/blob/master/audio-q...

marcodiego 11 days ago

Can it be used to make LAME even better? I mean, I'm still fond of mp3, specially now that it is patent/royalty free and there are literary billions of compatible devices.

iamnotsure 11 days ago

Lossy compression may be a bad idea, brains may not support it very well.

bbstats 11 days ago

very useful - I find a lot of audio SR (compression) algos to sound really bad - likely just because of the loss functions and/or eval metrics are 'inhuman'.

p0nce 11 days ago

How does it compare to visqol v3?

rurban 7 days ago

It is tested against visqol. See the code and esp. COMPARISON.md
Visqol is still the overall winner