Jeng-Yue Liu1,2,*, Ting-Chao Hsu1,*, Yen-Tung Yeh1, Li Su2, Yi-Hsuan Yang1
1 National Taiwan University 2 Academia Sinica
* Equal contribution
Electronic synthesizer sounds are controlled by presets: parameter settings that yield complex timbral characteristics and ADSR envelopes, making preset conversion particularly challenging. Recent approaches to timbre transfer often rely on spectral objectives or implicit style matching, offering limited control over envelope shaping. Moreover, public synthesizer datasets rarely provide diverse coverage of timbres and ADSR envelopes. To address these gaps, we present SynthCloner, a factorized codec model that disentangles audio into three attributes: ADSR envelope, timbre, and content. This separation enables expressive synthesizer preset conversion with independent control over all three attributes. Additionally, we introduce SynthCAT, a new synthesizer dataset with a task-specific rendering pipeline covering 250 timbres, 120 ADSR envelopes, and 100 MIDI sequences. Experiments show that SynthCloner outperforms baselines on both objective and subjective metrics while enabling independent attribute control. The implementation code is available here, and the evaluation dataset can be accessed here.
Figure 1: SynthCloner model framework
ADSR (Attack, Decay, Sustain, Release) is a fundamental envelope model in sound synthesis that shapes how a note evolves over time. As illustrated in Figure 2, this four-stage process controls the dynamic characteristics of synthesized sounds:

- Attack: the time the amplitude takes to rise from zero to its peak after note-on.
- Decay: the time the amplitude takes to fall from the peak to the sustain level.
- Sustain: the amplitude level held for as long as the note is held.
- Release: the time the amplitude takes to fade to zero after note-off.
Together, these stages give electronic instruments their dynamic and expressive qualities, mimicking the natural behavior of acoustic instruments.
Figure 2: Visualization of the ADSR envelope
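To make the four stages concrete, here is a minimal NumPy sketch of a standard piecewise-linear ADSR envelope generator. The function and parameter names are our own illustration, not code from the paper:

```python
import numpy as np

def adsr_envelope(attack, decay, sustain_level, release,
                  note_length, sr=44100):
    """Piecewise-linear ADSR amplitude envelope.

    attack, decay, release: segment durations in seconds.
    sustain_level: amplitude in [0, 1] held until note-off.
    note_length: time from note-on to note-off in seconds.
    """
    a = np.linspace(0.0, 1.0, int(attack * sr), endpoint=False)          # rise to peak
    d = np.linspace(1.0, sustain_level, int(decay * sr), endpoint=False) # fall to sustain
    sustain_time = max(note_length - attack - decay, 0.0)
    s = np.full(int(sustain_time * sr), sustain_level)                   # hold while note is on
    r = np.linspace(sustain_level, 0.0, int(release * sr))               # fade after note-off
    return np.concatenate([a, d, s, r])

# Example: 10 ms attack, 200 ms decay, sustain at 0.6, 300 ms release.
env = adsr_envelope(0.01, 0.2, 0.6, 0.3, note_length=1.0)
```

Multiplying a raw oscillator signal by such an envelope sample-wise produces the amplitude contour shown in Figure 2.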
This section presents audio pairs from our preset conversion experiments, organized into three groups based on their ADSR characteristics. Each pair includes: the original audio, the reference audio, the ground-truth reconstruction, our proposed model's output, an ablation without the ADSR extractor, CTD [1], and SS-VQVAE [2].
As shown in Figure 1, our model disentangles audio into three latent factors: ADSR, Content, and Timbre. In these experiments, we preserve the Content features of the original audio while replacing the ADSR and Timbre features with those of the reference audio. Specifically, if the original audio is represented as (e₁, c₁, t₁) and the reference as (e₂, c₂, t₂), the converted output corresponds to (e₂, c₁, t₂). The ablation without the ADSR extractor demonstrates how critical this component is to the model architecture.
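The swap described above can be sketched as follows. The `encode`/`decode` names, signatures, and placeholder bodies are hypothetical stand-ins, not SynthCloner's actual API:

```python
import numpy as np

# Hypothetical stand-ins for SynthCloner's factorized codec; the real
# encoder/decoder are neural networks and their interface may differ.
def encode(audio: np.ndarray):
    """Return (adsr, content, timbre) latent sequences for `audio`."""
    adsr = content = timbre = np.zeros((16, 128))  # placeholder latents
    return adsr, content, timbre

def decode(adsr: np.ndarray, content: np.ndarray, timbre: np.ndarray) -> np.ndarray:
    """Render a waveform from the three latent factors."""
    return np.zeros(44100)  # placeholder waveform

original = np.zeros(44100)   # placeholder for the source waveform
reference = np.zeros(44100)  # placeholder for the reference waveform

# Preset conversion as described above: keep the content of the original,
# take the ADSR envelope and timbre from the reference.
e1, c1, t1 = encode(original)
e2, c2, t2 = encode(reference)
converted = decode(e2, c1, t2)  # corresponds to (e2, c1, t2)
```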
Pair ID | Original | Reference | Ground Truth | Proposed | w/o ADSR Extractor | CTD | SS-VQVAE |
---|---|---|---|---|---|---|---|
Pair 1-1 | | | | | | | |
Pair 1-2 | | | | | | | |
Pair 1-3 | | | | | | | |
Pair 1-4 | | | | | | | |
Pair 1-5 | | | | | | | |
Pair 1-6 | | | | | | | |
Pair 1-7 | | | | | | | |
Pair 1-8 | | | | | | | |
Pair 1-9 | | | | | | | |
Pair 1-10 | | | | | | | |
Pair ID | Original | Reference | Ground Truth | Proposed | w/o ADSR Extractor | CTD | SS-VQVAE |
---|---|---|---|---|---|---|---|
Pair 2-1 | | | | | | | |
Pair 2-2 | | | | | | | |
Pair 2-3 | | | | | | | |
Pair 2-4 | | | | | | | |
Pair ID | Original | Reference | Ground Truth | Proposed | w/o ADSR Extractor | CTD | SS-VQVAE |
---|---|---|---|---|---|---|---|
Pair 3-1 | | | | | | | |
Pair 3-2 | | | | | | | |
This section demonstrates our model's disentanglement control: ADSR and timbre characteristics can be manipulated independently while all other audio properties are preserved. Each example presents the original audio, the reference audio, and the converted audio with their corresponding visual representations.
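Independent control amounts to swapping a single latent factor while holding the other two fixed. Reusing the hypothetical `encode`/`decode` placeholders from the earlier sketch:

```python
# Reusing the placeholder encode/decode stubs defined in the earlier sketch:
# transfer one attribute by swapping a single latent and keeping the rest.
e1, c1, t1 = encode(original)
e2, c2, t2 = encode(reference)

adsr_only = decode(e2, c1, t1)    # transfer only the ADSR envelope
timbre_only = decode(e1, c1, t2)  # transfer only the timbre
```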
[1] N. Demerlé, P. Esling, G. Doras, and D. Genova, "Combining audio control and style transfer using latent diffusion," in Proc. International Society for Music Information Retrieval Conference (ISMIR), 2024.
[2] O. Cífka, A. Ozerov, U. Şimşekli, and G. Richard, "Self-supervised VQ-VAE for one-shot music style transfer," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 96–100.