Zimtohrli: A New Psychoacoustic Perceptual Metric for Audio Compression

12 days ago (github.com)

A few comments:

- My understanding is that a gammachirp is the established filter to use for an auditory filter bank--any reason you chose an elliptical filter instead?

- I didn't look too closely, but it seems like you are analyzing the output of the filter bank as real numbers. I highly recommend you convolve with a complex representation of the filter and keep all of the math in the complex domain until you collapse to loudness.

- I'd not bucket into discrete 100 Hz time slices; instead, just convolve the temporal masking function with the full time resolution of the filter bank output.

- You want to think about some volume normalization step, so the final metric is the minimized Zimtohrli distance between A and B*x, where x is a free variable for volume (rough sketch at the end of this comment). Otherwise, a perceptual codec that just tends to make things a bit quieter might get a bad score.

- For Fletcher-Munson, I assume you are just using a curve at a high-ish volume? If so, good :)

- Not sure how you are spacing filter bank center frequencies relative to ERB size, but I'd recommend oversampling by a factor of 2-3. (That is, a few filters per ERB).
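
For illustration, roughly what I mean by a few filters per ERB, using the Glasberg & Moore ERB approximation (the oversampling factor and frequency range here are just example values, not anything from the repo):

```python
import numpy as np

def erb_hz(f_hz):
    """Equivalent rectangular bandwidth (Glasberg & Moore) in Hz at frequency f_hz."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

def erb_spaced_centers(f_min=50.0, f_max=16000.0, filters_per_erb=2.5):
    """Center frequencies stepped by a fraction of the local ERB, i.e. a few filters per ERB."""
    centers = [f_min]
    while centers[-1] < f_max:
        centers.append(centers[-1] + erb_hz(centers[-1]) / filters_per_erb)
    return np.array(centers)

centers = erb_spaced_centers()
print(f"{len(centers)} filters, {centers[0]:.0f} Hz to {centers[-1]:.0f} Hz")
```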

Apologies if any of these are off base--I just took a quick look.
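
On the volume normalization point, here is the kind of search I mean. zimtohrli_distance below is only a placeholder (plain L2) so the sketch runs; the real metric would be dropped in instead:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def zimtohrli_distance(a, b):
    # Placeholder for the actual metric; a simple RMS difference so the sketch is runnable.
    return np.sqrt(np.mean((a - b) ** 2))

def gain_normalized_distance(a, b, gain_range=(0.25, 4.0)):
    """Distance at the gain x that makes B*x closest to A, so a codec
    that is merely a bit quieter isn't penalized for level alone."""
    result = minimize_scalar(lambda x: zimtohrli_distance(a, x * b),
                             bounds=gain_range, method="bounded")
    return result.fun, result.x

a = np.random.randn(48000)
b = 0.7 * a  # same signal, just quieter
dist, best_gain = gain_normalized_distance(a, b)
print(f"distance={dist:.4f} at gain={best_gain:.2f}")
```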

I looked through the deeper explanation and found this interesting:

“Performing a simple experiment where we have 5 separate components:

- 1000 Hz sine probe at 57 dB SPL
- 750 Hz sine masker A at 71 dB SPL
- 800 Hz sine masker B at 71 dB SPL
- 850 Hz sine masker C at 67 dB SPL
- 900 Hz sine masker D at 65 dB SPL

I record the following data:

When playing probe + masker A through D individually I experience the probe approximately as intensely as a 1000 Hz tone at 53 dB SPL. When playing probe + all maskers I experience the probe approximately as intensely as a 1000 Hz tone at 48 dB SPL.”
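
Synthesizing those stimuli is the easy part; the sketch below maps dB SPL to digital amplitude against an assumed calibration reference (the 94 dB SPL full-scale figure is made up), which is exactly the kind of playback-chain detail that matters:

```python
import numpy as np

SR = 48000
REF_DB_SPL = 94.0  # assumed calibration: a full-scale sine corresponds to 94 dB SPL

def sine_at_spl(freq_hz, db_spl, seconds=1.0):
    """Sine tone whose digital amplitude corresponds to db_spl under the assumed calibration."""
    amp = 10 ** ((db_spl - REF_DB_SPL) / 20.0)
    t = np.arange(int(SR * seconds)) / SR
    return amp * np.sin(2 * np.pi * freq_hz * t)

probe = sine_at_spl(1000, 57)
maskers = [sine_at_spl(f, spl) for f, spl in
           [(750, 71), (800, 71), (850, 67), (900, 65)]]

probe_plus_one = probe + maskers[0]    # probe + masker A only
probe_plus_all = probe + sum(maskers)  # probe + all maskers
```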

I would be very interested in understanding more about their testing methodology and hardware setup especially.

Is the perceiver a trained listener? Are they using headphones or speakers or some other transducer method?

It's awfully difficult to say that there is equivalent perceived SPL for different frequency domains, even as a trained listener. Especially given the different frequency responses of different listening setups.

The average user has no chance; hence my curiosity about their specific credentials, considering they’re building an entirely new perceptual model based on that.

  • >It's awfully difficult to say that there is equivalent perceived SPL for different frequency domains, even as a trained listener.

    The snippet you quote doesn't claim to compare intensities at different frequencies.

    He is comparing only perceived 1 kHz intensities (in the presence or absence of maskers at other frequencies, whose intensities are not being subjectively scored).

    • Ah, thank you for clarifying. I misunderstood, but I still have the same curiosity about their methods.

I'm guessing the name is meant to allude to cinnamon pig ears (https://en.wikipedia.org/wiki/Palmier).

Are there any associated scientific articles and/or datasets that back up the experimental claim/insinuation of matching JNDs or perceptual differences?

Is this a proposal without experimental verification?

This seems to be targeted at signals that are already quite close. Is there anything similar for broad ballpark similarity?

Whenever I have searched for such things, I have more often encountered techniques designed to detect re-use for copyright reasons.

I have played around with generating instrument sounds from a blend of very few basic waveforms with attack, decay, sustain, release, pitch sliding and bell modulation.
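
Roughly along these lines; the oscillator blend and envelope parameters are just illustrative, not my actual setup:

```python
import numpy as np

SR = 44100

def adsr(n, attack=0.02, decay=0.1, sustain=0.6, release=0.2):
    """Linear attack/decay/sustain/release envelope; times in seconds, sustain as a level."""
    a, d, r = int(attack * SR), int(decay * SR), int(release * SR)
    s = max(n - a - d - r, 0)
    env = np.concatenate([
        np.linspace(0, 1, a, endpoint=False),
        np.linspace(1, sustain, d, endpoint=False),
        np.full(s, sustain),
        np.linspace(sustain, 0, r),
    ])
    return env[:n]

def tone(freq, seconds=1.0, slide=0.0, blend=0.3):
    """Blend of sine and square with an optional linear pitch slide (Hz over the note)."""
    t = np.arange(int(SR * seconds)) / SR
    phase = 2 * np.pi * (freq * t + 0.5 * slide * t ** 2)
    wave = (1 - blend) * np.sin(phase) + blend * np.sign(np.sin(phase))
    return wave * adsr(len(wave))

note = tone(220, seconds=1.5, slide=-20, blend=0.25)
```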

While it is quite fun just trying to make things by tweaking parameters, your ear/perception drifts as you hear the same thing over and over.

It would be really nice to have an automated "how close is this abomination?". I'd even give evolution a go to try and make some more difficult matches.
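
The crudest content-agnostic thing I can picture is comparing MFCC trajectories aligned with dynamic time warping, something like the sketch below (assuming librosa is available; the file names are placeholders). It's nowhere near perceptual, which is exactly the gap I'd like something like this to fill:

```python
import librosa
import numpy as np

def rough_distance(path_a, path_b, sr=22050, n_mfcc=20):
    """Crude one-to-one distance between two short clips: MFCCs aligned with DTW.
    Lower means more alike; this is a ballpark, not a perceptual metric."""
    y_a, _ = librosa.load(path_a, sr=sr, mono=True)
    y_b, _ = librosa.load(path_b, sr=sr, mono=True)
    mfcc_a = librosa.feature.mfcc(y=y_a, sr=sr, n_mfcc=n_mfcc)
    mfcc_b = librosa.feature.mfcc(y=y_b, sr=sr, n_mfcc=n_mfcc)
    cost, path = librosa.sequence.dtw(X=mfcc_a, Y=mfcc_b, metric="cosine")
    return cost[-1, -1] / len(path)  # average cost along the warping path

# Placeholder file names for an attempt and the target sound it tries to match.
print(rough_distance("attempt.wav", "target.wav"))
```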

  • How close is broad ballpark? Have you tried chromaprint?

    It's probably far from state of the art today, but you can get a percentage similarity out of it. I've successfully used it to find similar (or outright duplicate) songs in a big library.
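
    If it helps, raw Chromaprint fingerprints can be turned into a rough percentage like this (needs the fpcalc tool installed; this assumes its -raw output format, and the frame-by-frame comparison is naive, with no alignment):

    ```python
    import subprocess

    def raw_fingerprint(path):
        """Raw (uncompressed) Chromaprint fingerprint via the fpcalc CLI, as 32-bit ints."""
        out = subprocess.run(["fpcalc", "-raw", path],
                             capture_output=True, text=True, check=True)
        for line in out.stdout.splitlines():
            if line.startswith("FINGERPRINT="):
                return [int(x) for x in line.split("=", 1)[1].split(",")]
        raise RuntimeError("no fingerprint in fpcalc output")

    def similarity_percent(path_a, path_b):
        """Naive percentage similarity: bit agreement between aligned fingerprint frames."""
        fa, fb = raw_fingerprint(path_a), raw_fingerprint(path_b)
        n = min(len(fa), len(fb))
        diff_bits = sum(bin(a ^ b).count("1") for a, b in zip(fa[:n], fb[:n]))
        return 100.0 * (1 - diff_bits / (32.0 * n))

    print(similarity_percent("song1.mp3", "song2.mp3"))
    ```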

    • Things like chromaprint are why I have found it difficult to search for what I want.

      These tools are geared towards identifying matches (one to many). Chromaprint specifically bins things into notes, assuming it is trying to match music.

      I'm after something that will tell me, in human perception, how much a dog's bark sounds like a quack.

      One-to-one comparison of short-ish samples with no assumption of content style.

It’d be very interesting to see the results of this metric for the existing audio and voice codecs (like AAC, AAC-LD, mp3, opus), and how it compares to the existing metrics for them.

Couldn’t find it in their paper.

This says it works on just-noticeable differences. Would this work well if the quality of the compressed audio is very poor? Could one, for example, compare two speech codecs at 8 kHz, 4-bit against the original source to find out which one sounds better?

Or should one just... I dunno, calculate the mean squared error in some sort of continuous frequency domain, perhaps weighted by some hearing curve.
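
Something along those lines is easy to hack up. Here's a crude sketch using A-weighting as a stand-in for "some hearing curve" (nothing like a real perceptual model, and the quantization example is just illustrative):

```python
import numpy as np
from scipy.signal import welch

def a_weight_db(f):
    """IEC 61672 A-weighting curve in dB (0 dB at 1 kHz), as a stand-in hearing curve."""
    f = np.maximum(f, 1e-6)
    ra = (12194**2 * f**4) / ((f**2 + 20.6**2) *
         np.sqrt((f**2 + 107.7**2) * (f**2 + 737.9**2)) * (f**2 + 12194**2))
    return 20 * np.log10(ra) + 2.0

def weighted_spectral_mse(ref, test, sr):
    """MSE between log power spectra, weighted by the A-curve (as a power ratio)."""
    f, p_ref = welch(ref, fs=sr, nperseg=2048)
    _, p_test = welch(test, fs=sr, nperseg=2048)
    w = 10 ** (a_weight_db(f) / 10.0)
    diff = 10 * np.log10(p_ref + 1e-12) - 10 * np.log10(p_test + 1e-12)
    return np.sum(w * diff**2) / np.sum(w)

sr = 8000
t = np.arange(sr) / sr
ref = np.sin(2 * np.pi * 440 * t)
test = np.round(ref * 8) / 8  # coarse quantization, roughly 4-bit
print(weighted_spectral_mse(ref, test, sr))
```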

Can it be used to make LAME even better? I mean, I'm still fond of mp3, especially now that it is patent/royalty free and there are literally billions of compatible devices.

Very useful - I find a lot of audio SR (compression) algos sound really bad, likely just because the loss functions and/or eval metrics are 'inhuman'.