Jeng-Yue Liu1,2,*, Ting-Chao Hsu1,*, Yen-Tung Yeh1, Li Su2, Yi-Hsuan Yang1
1 National Taiwan University 2 Academia Sinica
* Equal contribution
Electronic synthesizer sounds are controlled by presets: parameter settings that yield complex timbral characteristics and ADSR envelopes, making preset conversion particularly challenging. Recent approaches to timbre transfer often rely on spectral objectives or implicit style matching, offering limited control over envelope shaping. Moreover, public synthesizer datasets rarely provide diverse coverage of timbres and ADSR envelopes. To address these gaps, we present SynthCloner, a factorized codec model that disentangles audio into three attributes: ADSR envelope, timbre, and content. This separation enables expressive synthesizer preset conversion with independent control over all three attributes. Additionally, we introduce SynthCAT, a new synthesizer dataset with a task-specific rendering pipeline covering 250 timbres, 120 ADSR envelopes, and 100 MIDI sequences. Experiments show that SynthCloner outperforms baselines on both objective and subjective metrics while enabling independent attribute control. The implementation code is available here, and the evaluation dataset can be accessed here.
Figure 1: SynthCloner model framework
ADSR (Attack, Decay, Sustain, Release) is a fundamental envelope model in sound synthesis that shapes how a note evolves over time. As illustrated in Figure 2, this four-stage process controls the dynamic characteristics of synthesized sounds:

- Attack: the time the amplitude takes to rise from zero to its peak after note-on.
- Decay: the time the amplitude takes to fall from the peak to the sustain level.
- Sustain: the amplitude level held for as long as the note is held.
- Release: the time the amplitude takes to fade to zero after note-off.
Together, these stages give electronic instruments their dynamic and expressive qualities, mimicking the natural behavior of acoustic instruments.
Figure 2: Visualization of the ADSR envelope
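To make the four stages concrete, here is a minimal NumPy sketch of a standard piecewise-linear ADSR envelope generator. The function and parameter names are our own illustration, not code from the paper:

```python
import numpy as np

def adsr_envelope(attack, decay, sustain_level, release,
                  note_length, sr=44100):
    """Piecewise-linear ADSR amplitude envelope.

    attack, decay, release: segment durations in seconds.
    sustain_level: amplitude in [0, 1] held until note-off.
    note_length: time from note-on to note-off in seconds.
    """
    a = np.linspace(0.0, 1.0, int(attack * sr), endpoint=False)          # rise to peak
    d = np.linspace(1.0, sustain_level, int(decay * sr), endpoint=False) # fall to sustain
    sustain_time = max(note_length - attack - decay, 0.0)
    s = np.full(int(sustain_time * sr), sustain_level)                   # hold while note is on
    r = np.linspace(sustain_level, 0.0, int(release * sr))               # fade after note-off
    return np.concatenate([a, d, s, r])

# Example: 10 ms attack, 200 ms decay, sustain at 0.6, 300 ms release.
env = adsr_envelope(0.01, 0.2, 0.6, 0.3, note_length=1.0)
```

Multiplying a raw oscillator signal by such an envelope sample-wise produces the amplitude contour shown in Figure 2.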
This section presents audio pairs from our preset conversion experiments, organized into three groups based on their ADSR characteristics. Each pair includes: the original audio, the reference audio, the ground-truth reconstruction, our proposed model's output, an ablation without the ADSR extractor, CTD [1], and SS-VQVAE [2].
As shown in Figure 1, our model disentangles audio into three latent factors: ADSR, Content, and Timbre. In these experiments, we preserve the Content features of the original audio while replacing the ADSR and Timbre features with those of the reference audio. Specifically, if the original audio is represented as (e₁, c₁, t₁) and the reference as (e₂, c₂, t₂), the converted output corresponds to (e₂, c₁, t₂). The ablation without the ADSR extractor demonstrates how critical this component is to the model architecture.
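The swap described above can be sketched as follows. The `encode`/`decode` names, signatures, and placeholder bodies are hypothetical stand-ins, not SynthCloner's actual API:

```python
import numpy as np

# Hypothetical stand-ins for SynthCloner's factorized codec; the real
# encoder/decoder are neural networks and their interface may differ.
def encode(audio: np.ndarray):
    """Return (adsr, content, timbre) latent sequences for `audio`."""
    adsr = content = timbre = np.zeros((16, 128))  # placeholder latents
    return adsr, content, timbre

def decode(adsr: np.ndarray, content: np.ndarray, timbre: np.ndarray) -> np.ndarray:
    """Render a waveform from the three latent factors."""
    return np.zeros(44100)  # placeholder waveform

original = np.zeros(44100)   # placeholder for the source waveform
reference = np.zeros(44100)  # placeholder for the reference waveform

# Preset conversion as described above: keep the content of the original,
# take the ADSR envelope and timbre from the reference.
e1, c1, t1 = encode(original)
e2, c2, t2 = encode(reference)
converted = decode(e2, c1, t2)  # corresponds to (e2, c1, t2)
```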
Pair ID | Original | Reference | Ground Truth | Proposed | w/o ADSR Extractor | CTD | SS-VQVAE |
---|---|---|---|---|---|---|---|
Pair 1-1 | | | | | | | |
Pair 1-2 | | | | | | | |
Pair 1-3 | | | | | | | |
Pair 1-4 | | | | | | | |
Pair 1-5 | | | | | | | |
Pair 1-6 | | | | | | | |
Pair 1-7 | | | | | | | |
Pair 1-8 | | | | | | | |
Pair 1-9 | | | | | | | |
Pair 1-10 | | | | | | | |
Pair ID | Original | Reference | Ground Truth | Proposed | w/o ADSR Extractor | CTD | SS-VQVAE |
---|---|---|---|---|---|---|---|
Pair 2-1 | | | | | | | |
Pair 2-2 | | | | | | | |
Pair 2-3 | | | | | | | |
Pair 2-4 | | | | | | | |
Pair ID | Original | Reference | Ground Truth | Proposed | w/o ADSR Extractor | CTD | SS-VQVAE |
---|---|---|---|---|---|---|---|
Pair 3-1 | | | | | | | |
Pair 3-2 | | | | | | | |
This section demonstrates our model's disentanglement control: ADSR and timbre characteristics can be manipulated independently while all other audio properties are preserved. Each example presents the original audio, the reference audio, and the converted audio with their corresponding visual representations.
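Independent control amounts to swapping a single latent factor while holding the other two fixed. Reusing the hypothetical `encode`/`decode` placeholders from the earlier sketch:

```python
# Reusing the placeholder encode/decode stubs defined in the earlier sketch:
# transfer one attribute by swapping a single latent and keeping the rest.
e1, c1, t1 = encode(original)
e2, c2, t2 = encode(reference)

adsr_only = decode(e2, c1, t1)    # transfer only the ADSR envelope
timbre_only = decode(e1, c1, t2)  # transfer only the timbre
```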
[1] N. Demerlé, P. Esling, G. Doras, and D. Genova, "Combining audio control and style transfer using latent diffusion," in Proc. International Society for Music Information Retrieval Conference (ISMIR), 2024.
[2] O. Cífka, A. Ozerov, U. Şimşekli, and G. Richard, "Self-supervised VQ-VAE for one-shot music style transfer," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 96–100.