GPSYCHO - An LGPL'd Psycho-Acoustic Model
GPSYCHO is an open source psycho-acoustic and noise shaping model for ISO-based MP3 encoders. GPSYCHO fixes some substantial bugs in the ISO demonstration source psycho-acoustic model (ISO psy-model). In addition, GPSYCHO adds mid/side stereo, real bit reservoir control, much improved critical band bit allocation routines, optional variable bit rate and very good pre-echo control. At 128 kbps, the quality is significantly better than that produced by the ISO psy-model (as found in almost all other free encoders). An example of these improvements is shown in Screenshots. GPSYCHO is close to the quality of the FhG encoder, but there is still room for improvement. Read on if you want to help!
As this code is released under the LGPL, it can be used in any project, including commercial ones. I would also encourage others to help improve GPSYCHO. Some things that would help:
- Find (and send me) samples where your favorite encoder does a better job than GPSYCHO.
- Run your own listening tests and try tuning some of the algorithms below. Most have parameters that are set via trial and error.
- Try out new algorithms!
The GPSYCHO algorithms are rigorously tested through listening tests. You can find more information here.
GPSYCHO operates in a number of modes. Notes are available to provide information about each of them:
- GPSYCHO variable bit rate (VBR).
- GPSYCHO average bit rate (ABR).
- GPSYCHO Mid/Side Stereo.
- GPSYCHO outer_loop quantization algorithm (CBR noise shaping).
New Features (which may need some tuning):
- Bit allocation outer loop improved based on ideas in an MPEG2 J. Audio Eng. Soc. 1997 paper. The ISO demonstration source outer loop can produce some very poor quality frames in certain situations.
- VBR (variable bit rate) is now working! See the VBR link above for details.
- MS_STEREO switch. The ISO formula is primitive. I use a switch described here.
- MS_STEREO sparsing. The ISO sparsing formula does not work: it removes 95% of the side channel coefficients. GPSYCHO does not sparse the side channel at all, but allocates fewer bits for encoding it. Martin Weghofer has a coder which does use side channel sparsing effectively, but the algorithm does not work well with the LAME quantization procedure. This is an area that needs further work.
- MS_STEREO now uses ideas from a Johnston ICASSP 1992 paper to compute true Mid and Side thresholds which compensate for stereo de-masking. This is similar to the approach used in PAC and AAC.
- Bit reservoir use. Again the ISO formula performs poorly. At 128 kbps, it always thinks it needs to drain the reservoir, and thus the reservoir can never build up. It will also use up all the bits for the left channel before even looking at the right channel.
- Mid/Side bit allocation. GPSYCHO allocates bits based on the differences between the left and right masking thresholds. Anyone have a better idea?
- Lowpass filtering based on the compression ratio. For high compression ratios, lowpass filtering will improve the results. The exact amount of filtering needed depends on the music and personal preferences - the formula to decide how much lowpass filtering to use may need some tuning. At 256 kbps, no filtering is done. At 128 kbps, the lowpass filter is around 15.5 kHz.
- Improved short block switching. It is now based on surges in perceptual entropy (PE) or large fluctuations in energy within a single granule. These improvements trigger some critical window switching that LAME used to miss.
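The energy-fluctuation test in the last item can be sketched roughly as follows. This is a minimal illustration and not the GPSYCHO code: the function names, the split of a 576-sample granule into three equal sub-blocks, and the ratio threshold of 10 are all assumptions made for the example, and the PE surge test is left out entirely.

```python
def subblock_energies(granule, n_sub=3):
    """Split a granule into n_sub equal parts and return the energy of each."""
    size = len(granule) // n_sub
    return [sum(x * x for x in granule[i * size:(i + 1) * size])
            for i in range(n_sub)]


def needs_short_blocks(granule, ratio_threshold=10.0, floor=1e-12):
    """Flag a granule whose energy fluctuates strongly between sub-blocks.

    ratio_threshold is a hypothetical tuning parameter, not a GPSYCHO value.
    floor avoids dividing by zero when a sub-block is silent.
    """
    energies = subblock_energies(granule)
    return max(energies) / max(min(energies), floor) > ratio_threshold


# A steady signal versus silence followed by a sudden attack.
steady = [0.5] * 576
attack = [0.0] * 384 + [0.5] * 192
print(needs_short_blocks(steady))  # False
print(needs_short_blocks(attack))  # True
```

A real detector would need its threshold tuned through listening tests, like most of the parameters mentioned on this page.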
Features to try out:
- Add a high-pass filter. 20Hz?
- Shorter FFT for the long block noise threshold calculation. A 768 FFT centered over the 576 sample granule would be more accurate for the high frequency energies than the 1024 FFT. This should also improve the perceptual entropy (pe) calculation since there will be less interference from data outside the granule. Another advantage might be for the applaud.wav test - see the Quality section for details. It will of course make the low frequency energy estimates less accurate.
- Subblock_gain. This seems to be important. FhG uses it for most short blocks. LAME and other dist10 based codes do not make any use of this. VBR modes will use subblock gain.
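The trade-off behind the shorter FFT idea can be checked with a little arithmetic. The sketch below assumes a 44.1 kHz sample rate and that both FFT windows are centered over the granule (neither is stated above); `fft_window_stats` is a hypothetical helper. It reports how many samples outside the 576-sample granule each window takes in per side, and the width of one FFT bin.

```python
def fft_window_stats(fft_size, granule=576, sample_rate=44100):
    """Return (samples outside the granule per side, bin width in Hz).

    Assumes the FFT window is centered over the granule.
    """
    overhang = (fft_size - granule) // 2
    bin_width_hz = sample_rate / fft_size
    return overhang, bin_width_hz


for n in (1024, 768):
    overhang, bin_hz = fft_window_stats(n)
    print(f"{n}-point FFT: {overhang} samples outside the granule per side, "
          f"{bin_hz:.1f} Hz per bin")
# 1024-point FFT: 224 samples outside the granule per side, 43.1 Hz per bin
# 768-point FFT: 96 samples outside the granule per side, 57.4 Hz per bin
```

Under these assumptions the 768-point window takes in 96 rather than 224 outside samples on each side, at the cost of bins that are one third wider, which is the loss of low frequency accuracy mentioned above.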
Things we've learned from analyzing FhG (mp3enc3.1) produced .mp3 frames:
- FhG uses mixed_blocks only if specified as an option.
- FhG uses intensity stereo only at lower bitrates.
- FhG does not seem to use scfsi <> 0.
- Removes data in scalefactor band 21 at 128 kbps.
- Almost always uses ms_stereo. Does not use ISO formula for ms_stereo switching.
- More sophisticated mid/side bit allocation.
- Excellent short block detection.
- Good bit reservoir use. Not totally based on pe, since they often allocate extra bits to long blocks.