SynthCloner: Synthesizer Preset Conversion via Factorized Codec with Disentangled Timbre and ADSR Control

Jeng-Yue Liu1,2,*, Ting-Chao Hsu1,*, Yen-Tung Yeh1, Li Su2, Yi-Hsuan Yang1
1 National Taiwan University   2 Academia Sinica
* Equal contribution

arXiv | Code | Dataset

Abstract

Electronic synthesizer sounds are controlled by presets: parameter settings that yield complex timbral characteristics and ADSR envelopes, making preset conversion particularly challenging. Recent approaches to timbre transfer often rely on spectral objectives or implicit style matching, offering limited control over envelope shaping. Moreover, public synthesizer datasets rarely provide diverse coverage of timbres and ADSR envelopes. To address these gaps, we present SynthCloner, a factorized codec model that disentangles audio into three attributes: ADSR envelope, timbre, and content. This separation enables expressive synthesizer preset conversion with independent control over these three attributes. Additionally, we introduce SynthCAT, a new synthesizer dataset with a task-specific rendering pipeline covering 250 timbres, 120 ADSR envelopes, and 100 MIDI sequences. Experiments show that SynthCloner outperforms baselines on both objective and subjective metrics, while enabling independent attribute control. The implementation code is available here, and the evaluation dataset can be accessed here.

Model Framework
Figure 1: SynthCloner model framework

ADSR Definition

ADSR (Attack, Decay, Sustain, Release) is a fundamental envelope model in sound synthesis that shapes how a note evolves over time. As illustrated in Figure 2, this four-stage process controls the dynamic characteristics of synthesized sounds:

Attack: the time the sound takes to rise from silence to its peak level after the note is triggered.
Decay: the time it takes to fall from the peak to the sustain level.
Sustain: the level held for as long as the note is sustained.
Release: the time it takes to fade from the sustain level back to silence after the note is released.

Together, these stages give electronic instruments their dynamic and expressive qualities, mimicking the natural behavior of acoustic instruments.

ADSR
Figure 2: Visualization of the ADSR envelope

Preset Conversion Experiments with Audio Pairs

This section presents audio pairs from our preset conversion experiments, organized into three groups based on ADSR characteristics. Each pair includes: original audio, reference audio, ground-truth reconstruction, our proposed model output, an ablation without the ADSR extractor, CTD [1], and SS-VQVAE [2].

As shown in Figure 1, our model disentangles audio into three latent factors: ADSR, Content, and Timbre. In these experiments, we preserve the Content features from the original audio while replacing the ADSR and Timbre features with those from the reference audio. Specifically, if the original audio is represented as (e₁, c₁, t₁) and the reference as (e₂, c₂, t₂), the reconstructed output yields (e₂, c₁, t₂). The ablation study without the ADSR extractor demonstrates the critical importance of this component in our model architecture.
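The attribute swap described above can be expressed compactly. The sketch below uses placeholder encode/decode functions, not the released SynthCloner API; it only illustrates which latent factors are kept and which are replaced.

```python
def convert_preset(encode, decode, x_orig, x_ref):
    """Keep the original's content; take ADSR and timbre from the reference."""
    e1, c1, t1 = encode(x_orig)   # (ADSR, content, timbre) of the original
    e2, c2, t2 = encode(x_ref)    # (ADSR, content, timbre) of the reference
    return decode(e2, c1, t2)     # (e2, c1, t2): converted output

# Toy stand-ins: treat "audio" as its (e, c, t) triple directly.
encode = lambda x: x
decode = lambda e, c, t: (e, c, t)
out = convert_preset(encode, decode, ("e1", "c1", "t1"), ("e2", "c2", "t2"))
# out == ("e2", "c1", "t2")
```

In the real model, encode and decode are the factorized codec's encoder branches and decoder, and each factor is a latent feature sequence rather than a symbol.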

Normal Cases

Pair ID Original Reference Ground Truth Proposed w/o ADSR Extractor CTD SSVQVAE
Pair 1-1
Pair 1-2
Pair 1-3
Pair 1-4
Pair 1-5
Pair 1-6
Pair 1-7
Pair 1-8
Pair 1-9
Pair 1-10

Short2Long

Pair ID Original Reference Ground Truth Proposed w/o ADSR Extractor CTD SSVQVAE
Pair 2-1
Pair 2-2
Pair 2-3
Pair 2-4

Long2Short

Pair ID Original Reference Ground Truth Proposed w/o ADSR Extractor CTD SSVQVAE
Pair 3-1
Pair 3-2

Timbre/ADSR Disentanglement Control

This section demonstrates our model’s disentanglement control capabilities, showcasing how ADSR and timbre characteristics can be independently manipulated while preserving other audio properties. Each example presents the original audio, reference audio, and converted audio with their corresponding visual representations.

ADSR Control Example 1

Original: 02_orig (audio + image)
Reference: 02_ref (audio + image)
Converted: 02_conv_adsr (audio + image)

ADSR Control Example 2

Original: 04_orig (audio + image)
Reference: 04_ref (audio + image)
Converted: 04_conv_adsr (audio + image)

ADSR Control Example 3

Original: 07_orig (audio + image)
Reference: 07_ref (audio + image)
Converted: 07_conv_adsr (audio + image)

Timbre Control Example 1

Original: 01_orig (audio + image)
Reference: 01_ref (audio + image)
Converted: 01_conv_timbre (audio + image)

Timbre Control Example 2

Original: 02_orig (audio + image)
Reference: 02_ref (audio + image)
Converted: 02_conv_timbre (audio + image)

References

[1] N. Demerlé, P. Esling, G. Doras, and D. Genova, “Combining audio control and style transfer using latent diffusion,” in Proc. International Society for Music Information Retrieval Conference (ISMIR), 2024.

[2] O. Cífka, A. Ozerov, U. Şimşekli, and G. Richard, “Self-supervised VQ-VAE for one-shot music style transfer,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 96–100.