16th October 2020
Here is a blog post on some papers from ISMIR 2020. Hopefully it saves you some time; it may end up being mostly for my own reference.
Here are my picks of key papers - I am mainly interested in:
- generative modeling
- unsupervised and self-supervised learning
- neural audio synthesis
- metric learning
- disentanglement
etc. etc. etc.
Music FaderNets: Controllable Music Generation Based on High-level Features via Low-level Feature Modelling | Hao Hao Tan, Dorien Herremans
Ultra-light Deep MIR by Trimming Lottery Tickets | Philippe Esling et al.
DrumGAN: Synthesis of Drum Sounds with Timbral Feature Conditioning Using Generative Adversarial Networks | Javier Nistal, Stefan Lattner, Gaël Richard
Unsupervised Disentanglement of Pitch and Timbre for Isolated Musical Instrument Sounds | Yin-Jyun Luo et al.
Hierarchical Timbre-painting and Articulation Generation | Michael Michelashvili, Lior Wolf
Music FaderNets: Controllable Music Generation Based on High-level Features via Low-level Feature Modelling | Hao Hao Tan, Dorien Herremans
- This work deals with music generation and representation learning.
- Low-level compositional feature descriptors, e.g. rhythmic complexity and note density, are enforced in a latent code via a classifier.
- This latent code is then mapped to a second latent vector representing higher-level compositional attributes, i.e. arousal/valence, through a semi-supervised Gaussian-mixture variational autoencoder.
- Check it out - the conceptual design is quite cool.
- This tackles singing voice timbre transfer.
- Input features are taken from the WORLD vocoder.
- I really like the use of a pretrained speaker embedding network to condition the model and enable adaptation to unseen voices.
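The FaderNets-style idea of tying a low-level attribute to the latent code can be sketched with a simple linear probe loss added to the training objective. Everything here (shapes, the `note_density` attribute, the linear read-out) is illustrative, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mini-batch: latent codes z (batch x dim) and one low-level
# attribute per example (here a made-up "note density" in [0, 1]).
z = rng.normal(size=(8, 4))
note_density = rng.uniform(size=8)

# Linear probe that ties a read-out of z to the attribute.
w = rng.normal(size=4)

def probe_loss(z, attr, w):
    # Mean squared error between a linear read-out of z and the attribute.
    pred = z @ w
    return float(np.mean((pred - attr) ** 2))

# In training, this term would be added to the VAE objective so the latent
# code is forced to encode the attribute ("fader" behaviour).
loss = probe_loss(z, note_density, w)
```

Minimising this alongside the reconstruction/KL terms is what lets a user later slide the corresponding latent dimension like a fader.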
Ultra-light Deep MIR by Trimming Lottery Tickets | Philippe Esling et al.
- Follows from Esling's prior work on using the lottery ticket hypothesis for generative audio models.
- Lottery ticket trimming is applied to several MIR tasks, such that the resulting models are compatible with the memory constraints of embedded hardware - think mixing consoles, Bela, Raspberry Pi, etc.
- Provides a theoretical basis for pruning various NN layers.
- It would be interesting to explore a 'perceptual' form of pruning.
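As a rough illustration of the pruning step behind the lottery ticket hypothesis, here is a minimal magnitude-pruning sketch. The layer shape and the 90% sparsity level are arbitrary choices of mine, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def magnitude_prune(weights, sparsity):
    """Return a binary mask zeroing out the smallest-magnitude weights.

    This is the basic pruning step used in lottery-ticket experiments:
    weights below the sparsity-quantile threshold are masked out, and the
    survivors are rewound to their initial values before retraining.
    """
    threshold = np.quantile(np.abs(weights), sparsity)
    return (np.abs(weights) >= threshold).astype(weights.dtype)

w_init = rng.normal(size=(64, 64))                 # layer at initialisation
w_trained = w_init + 0.1 * rng.normal(size=(64, 64))  # layer after training

mask = magnitude_prune(w_trained, sparsity=0.9)    # keep ~10% of weights
w_ticket = w_init * mask  # "winning ticket": rewound weights + mask
```

In the iterative variant, this prune-rewind-retrain loop is repeated several times, raising the sparsity gradually rather than in one shot.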
DrumGAN: Synthesis of Drum Sounds with Timbral Feature Conditioning Using Generative Adversarial Networks | Javier Nistal, Stefan Lattner, Gaël Richard
- Think descriptor-based drum sample sound design.
- Progressively growing GANs; trained on kick, snare and cymbal one-shots; conditioned on continuous timbre semantics descriptors from the Audio Commons timbral models.
- Does well to capture the relationship between timbre descriptor distributions and output audio. For example, making a kick drum less 'boomy' and more 'sharp' might push it towards the snare-drum region of the space.
- Good set of quantitative evaluation metrics for neural audio synthesis.
- The audio quality from this model is pretty good - the saturation gives it a nice aesthetic.
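A minimal sketch of this style of descriptor conditioning, assuming the continuous descriptors are simply concatenated with the generator's noise input. The dimensions and the 'sharpness' index are hypothetical, chosen only to illustrate the knob-like control:

```python
import numpy as np

def conditioned_input(z, descriptors):
    """Concatenate noise with timbre descriptors clipped to [0, 1]."""
    d = np.clip(np.asarray(descriptors, dtype=float), 0.0, 1.0)
    return np.concatenate([z, d])

rng = np.random.default_rng(0)
z = rng.normal(size=128)  # hypothetical latent size

# Same z, but nudging a hypothetical 'sharpness' knob (index 1)
# from 0.1 to 0.9 while the other six descriptors stay fixed:
dull = conditioned_input(z, [0.8, 0.1, 0.5, 0.4, 0.2, 0.6, 0.3])
sharp = conditioned_input(z, [0.8, 0.9, 0.5, 0.4, 0.2, 0.6, 0.3])
```

Because z is held constant, any difference between the two generated sounds would be attributable to the descriptor change - the descriptor-based sound-design workflow described above.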
Unsupervised Disentanglement of Pitch and Timbre for Isolated Musical Instrument Sounds | Yin-Jyun Luo et al.
- This work attempts to learn disentangled representations for timbre and pitch without supervision, i.e. without conditioning the model on labels.
- Cool assumption: a timbre representation should be invariant to small pitch shifts of a sound. A contrastive loss component enforces that the timbre representations of a sound and a slightly pitch-shifted version of it should be close in Euclidean space.
- What about timbre variation under pitch shifts larger than an octave?
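The pitch-invariance assumption can be written down as a small contrastive loss. This is my own minimal sketch of the idea, not the paper's exact formulation:

```python
import numpy as np

def contrastive_timbre_loss(anchor, shifted, negative, margin=1.0):
    """Sketch of a pitch-invariance contrastive loss on timbre embeddings.

    anchor:   timbre embedding of a sound
    shifted:  embedding of the same sound, pitch-shifted by a small
              interval (positive pair -> pulled together)
    negative: embedding of a different sound (pushed at least `margin`
              apart in Euclidean space)
    """
    d_pos = np.linalg.norm(anchor - shifted)
    d_neg = np.linalg.norm(anchor - negative)
    return float(d_pos ** 2 + max(0.0, margin - d_neg) ** 2)
```

The loss is zero exactly when the positive pair coincides and the negative sits outside the margin, which is the invariance the bullet above describes.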
Hierarchical Timbre-painting and Articulation Generation | Michael Michelashvili, Lior Wolf
- Iterates on some conceptual ideas from the Neural-Source-Filter model and DDSP for generative modeling and timbre representation of a single instrument.
- Reconstructs the timbre at increasingly large resolutions, inspired by super-resolution work in computer vision.
- Idea: first extract a sine-excitation from the signal, to describe its pitch and 'articulation' over time. Starting from the 'articulation' representation, use several upsampling blocks to build the timbre at various sampling resolutions. Each block has an adversarial loss.
- Using the feature activations at each layer of the CREPE pitch classifier, a 'perceptual' loss is used to align the real and synthetic features. I personally would not call this perceptual, but computer vision sometimes uses this terminology for losses that align features between real and generated data.
- Good audio quality and timbre transfer, but no quantitative evaluation. Can this model represent multiple timbres simultaneously?
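The sine-excitation step can be sketched as phase accumulation over an f0 contour. This is a simplified illustration (audio-rate f0, hard zeroing of unvoiced samples), not the paper's implementation:

```python
import numpy as np

def sine_excitation(f0, sr=16000):
    """Render a sine excitation from a per-sample f0 contour.

    Accumulate instantaneous phase from f0 and take its sine, giving the
    pitch/'articulation' signal the upsampling blocks start from. f0 is
    assumed already upsampled to audio rate; unvoiced samples (f0 == 0)
    are zeroed out here as a simplification.
    """
    phase = 2.0 * np.pi * np.cumsum(f0) / sr
    excitation = np.sin(phase)
    excitation[f0 == 0] = 0.0
    return excitation
```

Each subsequent generator block would then upsample and 'paint' timbre onto this excitation, with an adversarial loss at every resolution.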
Highlights from ISMIR 2020 Proceedings
Model Architecture | Unsupervised Disentanglement of Pitch and Timbre | YJ Luo
Model Architecture | Hierarchical Timbre-painting and Articulation Generation | Michael Michelashvili