• Supertonic is an on-device multilingual TTS system built on ONNX Runtime
  • A local speech synthesis engine that turns text into speech without any cloud or API calls
  • Supertonic 3 is the latest version: a lightweight model with ~99M parameters and support for 31 languages
  • A Flow Matching-based architecture that delivers studio-grade (44.1kHz) audio despite its small size
  • Designed for edge deployment, running on CPU alone without a GPU

Why this is needed

  • Existing open TTS models are heavy, in the 0.7B to 2B parameter range, and demand a lot of GPU and VRAM
  • Cloud TTS APIs come with cost, network latency, and privacy exposure
  • Running on-device requires cutting model size, memory use, and inference time all at once

AS-IS (existing large TTS / cloud TTS)

sequenceDiagram
    autonumber
    participant App as App
    participant Cloud as Cloud TTS API
    participant GPU as GPU server (0.7B~2B)
    App->>Cloud: Send text (network required)
    Cloud->>GPU: Large-model inference (heavy VRAM use)
    GPU-->>Cloud: Synthesized speech
    Cloud-->>App: Return audio (latency, cost, privacy exposure)

TO-BE (Supertonic 3 on-device)

sequenceDiagram
    autonumber
    participant App as App
    participant TTS as Supertonic 3 (~99M, ONNX)
    participant CPU as Local CPU
    App->>TTS: Text input
    TTS->>CPU: Lightweight model inference (no GPU needed)
    CPU-->>TTS: latent → 44.1kHz WAV
    TTS-->>App: Return audio (offline, low latency, privacy preserved)

Three-stage architecture

Supertonic is built from three Flow Matching-based modules that turn text into speech. In practice it is not a single file: several .onnx modules run stage by stage on ONNX Runtime, with text preprocessing and a voice style embedding feeding in at the front.

  1. Speech Autoencoder: encodes audio into a latent representation and decodes it back
  2. Text-to-Latent module: uses Flow Matching to convert text into an acoustic latent representation
  3. Vocoder: reconstructs 44.1kHz audio from the latent codes
flowchart LR
    T["Text input<br/>(+ expression tags)"] --> P[Text preprocessing]
    V["voice style<br/>embedding (M1, etc.)"] --> M2
    P --> M2["text-to-latent.onnx<br/>(flow matching, total_steps)"]
    M2 --> M3["vocoder.onnx"]
    M3 --> W["44.1kHz WAV audio"]

    subgraph RT["ONNX Runtime (CPU/browser)"]
        M2
        M3
    end

The TTS object from `from supertonic import TTS` is a wrapper that drives these ONNX modules through ONNX Runtime. Users never handle the ONNX files directly; a single call to tts.synthesize(text, ...) runs the whole pipeline.
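
The role of total_steps in the text-to-latent stage can be illustrated with a minimal sketch of flow matching sampling: start from noise and integrate a velocity field from t=0 to t=1 with a fixed number of Euler steps. This is not Supertonic's code. `euler_flow_sample`, `toy_velocity`, and `TARGET` are hypothetical names, and the toy velocity field stands in for the learned network, which in a real model would be conditioned on the text and the voice style.

```python
import random


def euler_flow_sample(velocity, x0, total_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    x = list(x0)
    dt = 1.0 / total_steps
    for k in range(total_steps):
        t = k * dt  # t stays below 1.0, so the toy field never divides by zero
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical stand-in for the learned network: the velocity of a
# straight-line path from the current point toward a fixed target latent.
TARGET = [0.5, -1.0, 2.0]


def toy_velocity(x, t):
    return [(ti - xi) / (1.0 - t) for ti, xi in zip(TARGET, x)]


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in TARGET]  # starting point: Gaussian noise
latent = euler_flow_sample(toy_velocity, noise, total_steps=8)
print(latent)  # lands on TARGET up to floating-point rounding
```

For this straight-line toy field the Euler steps reach the target for any step count. With a learned, curved velocity field the step count matters: more steps cost more network evaluations and generally track the trajectory more closely, which is the speed and quality trade-off that total_steps exposes.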

Is it CPU-only? "No GPU required (CPU is enough)"

The key design point of Supertonic 3 is that it runs on CPU alone, without a GPU. In fact, its CPU inference achieves lower latency than larger models measured on an A100 GPU, and it uses far less memory. It is not "CPU-only", however. Because the underlying engine is ONNX Runtime, GPU acceleration can be used optionally when a GPU is available, through an Execution Provider such as CUDA or through WebGPU in the browser. The accurate way to put it is "a GPU is not required".
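
As a rough illustration of "GPU optional, CPU always available", the sketch below builds an ONNX Runtime provider list that prefers CUDA when it is present and always falls back to the CPU. `pick_providers` is a hypothetical helper and "model.onnx" is a placeholder path; whether the supertonic TTS wrapper exposes provider selection is not stated here, so this only shows the ONNX Runtime side.

```python
def pick_providers(available, prefer_gpu=True):
    """Return an ordered provider list that always ends with the CPU fallback."""
    order = []
    if prefer_gpu and "CUDAExecutionProvider" in available:
        order.append("CUDAExecutionProvider")
    order.append("CPUExecutionProvider")
    return order


try:
    import onnxruntime as ort

    available = ort.get_available_providers()
except ImportError:
    # onnxruntime is not installed; assume a CPU-only environment for the demo
    available = ["CPUExecutionProvider"]

providers = pick_providers(available)
print(providers)

# With a model file on disk, a session could then be created like this
# ("model.onnx" is a placeholder, not an actual Supertonic file name):
# session = ort.InferenceSession("model.onnx", providers=providers)
```

ONNX Runtime tries the providers in the order given, so keeping CPUExecutionProvider last means the same code runs on machines with and without a GPU.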

The three papers Supertonic builds on

| Paper | Role |
| --- | --- |
| SupertonicTTS | Overall architecture and efficient design |
| Length-Aware RoPE | Improves text-speech alignment in cross-attention |
| Self-Purifying Flow Matching | Stabilizes training on noisy labels |

Version comparison

| Version | Status | Parameters | Languages | Key features |
| --- | --- | --- | --- | --- |
| Supertonic 3 | Latest | ~99M | 31 | Expression tags, multilingual |
| Supertonic 2 | Stable | ~66M | 5 | Base release |
| Supertonic 1 | Legacy | ~66M | 1 (English) | First version |

Key features

  • Expression Tags: 10 inline tags such as <laugh>, <breath>, and <sigh> that add naturalness to the output
  • Zero-shot custom voices: create a new voice with Voice Builder, without any training
  • Studio-quality audio: direct output of 44.1kHz 16-bit WAV
  • Broad platform SDK coverage: Python, Node.js, Browser (WebGPU/WASM), Java, C++, C#, Go, Swift, iOS, Rust, Flutter
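
Since expression tags are written inline in the input text, it can help to check which tags a string contains before synthesis. The sketch below is a hypothetical pre-check, not part of the Supertonic SDK. `KNOWN_TAGS` holds only the three tags named above (the other seven are not listed here), and the lowercase `<name>` pattern is an assumption based on those examples.

```python
import re

# Only the three tag names mentioned in the feature list; the full set of
# 10 tags is not enumerated here, so anything else is reported as unknown.
KNOWN_TAGS = {"laugh", "breath", "sigh"}
TAG_PATTERN = re.compile(r"<([a-z_]+)>")


def find_expression_tags(text):
    """Return (known, unknown) tag names in order of appearance."""
    known, unknown = [], []
    for name in TAG_PATTERN.findall(text):
        (known if name in KNOWN_TAGS else unknown).append(name)
    return known, unknown


text = "That was close <sigh> but we made it <laugh>"
print(find_expression_tags(text))  # (['sigh', 'laugh'], [])
```

A tag reported as unknown is not necessarily invalid; it only means it is not one of the three examples given above.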

Usage

from supertonic import TTS

tts = TTS(auto_download=True)
# Look up the voice style embedding named "M1"
style = tts.get_voice_style(voice_name="M1")
# synthesize returns the waveform together with its duration
wav, duration = tts.synthesize(
    text="Supertonic is lightning fast TTS.",
    lang="en",
    voice_style=style,
    total_steps=8,  # quality: 5-12 (number of flow matching ODE steps)
    speed=1.05      # speed: 0.7-2.0
)
tts.save_audio(wav, "output.wav")
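
To confirm that a saved file really is 44.1kHz 16-bit WAV, the header can be read back with Python's standard wave module. The sketch below is self-contained: it writes a one-second silent stand-in file instead of calling Supertonic. `wav_info` and `demo_path` are hypothetical names, and mono output is an assumption.

```python
import os
import tempfile
import wave


def wav_info(path):
    """Return (sample_rate_hz, bits_per_sample, channels, duration_seconds)."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        bits = wf.getsampwidth() * 8
        channels = wf.getnchannels()
        duration = wf.getnframes() / rate
    return rate, bits, channels, duration


# Stand-in file: one second of silence, 16-bit mono at 44.1 kHz.
# Mono is an assumption; the channel count of real output is not stated here.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)       # 2 bytes per sample = 16-bit
    wf.setframerate(44100)
    wf.writeframes(b"\x00\x00" * 44100)

print(wav_info(demo_path))  # (44100, 16, 1, 1.0)
```

Pointing `wav_info` at "output.wav" after running the usage example would show whether the sample rate and bit depth match the stated 44.1kHz 16-bit format.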

References