• Whisper is a general-purpose automatic speech recognition (ASR) model released by OpenAI
  • A Transformer-based encoder-decoder model trained on 680,000 hours of multilingual web audio
  • A multitask architecture in which a single model handles transcription in 99 languages, translation into English, and language detection
  • Can be deployed in lightweight form through whisper.cpp for on-device inference on mobile and edge devices
  • Speech-to-text (STT) only; text-to-speech (TTS) is not supported

Why this concept is needed

  • Earlier speech recognition models were specialized for a particular language or environment, so applying them to a new domain required fine-tuning
  • Whisper is trained with large-scale weak supervision and transfers well zero-shot, so it can be used in a wide range of settings without fine-tuning
  • Released under the MIT license, so it can be used freely, including commercially
  • Once the model weights have been downloaded locally, it runs fully offline with no API calls or internet connection

AS-IS

sequenceDiagram
    autonumber
    participant User as User
    participant STT as Legacy ASR system
    participant Pipeline as Post-processing pipeline

    User->>STT: Speech input (specific language)
    STT->>STT: Speech recognition (single-language model)
    STT->>Pipeline: Text output
    Pipeline->>Pipeline: Separate language detection module
    Pipeline->>Pipeline: Separate translation module
    Pipeline->>User: Final result
    Note over STT,Pipeline: Separate model needed at each stage<br/>Fine-tuning required for each new language/domain

TO-BE

sequenceDiagram
    autonumber
    participant User as User
    participant Whisper as Whisper model

    User->>Whisper: Speech input (any language)
    Whisper->>Whisper: Mel Spectrogram conversion
    Whisper->>Whisper: Encoder (speech feature extraction)
    Whisper->>Whisper: Decoder (transcription/translation/language detection by task token)
    Whisper->>User: Text result
    Note over Whisper: A single model performs every task<br/>Zero-shot handling of varied environments

Whisper architecture

Whisper uses a Transformer-based encoder-decoder structure.

In the encoder stage, the input audio is converted into an 80-channel Mel spectrogram and processed in 30-second sliding windows. The encoder extracts speech features from this spectrogram.
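
To make the 30-second framing concrete, here is a minimal sketch of the arithmetic in plain Python. The 16 kHz sample rate and the 160-sample hop are assumptions taken from the reference implementation's constants, not from the text above, and the fixed non-overlapping windows are a simplification.

```python
SAMPLE_RATE = 16_000   # Hz (assumed, from the reference implementation)
HOP_LENGTH = 160       # samples between spectrogram frames, i.e. 10 ms (assumed)
CHUNK_SECONDS = 30     # window length described in the text
N_MELS = 80            # Mel channels described in the text


def split_into_windows(num_samples):
    """Return (start, end) sample ranges of consecutive 30-second windows."""
    window = SAMPLE_RATE * CHUNK_SECONDS
    return [(start, min(start + window, num_samples))
            for start in range(0, num_samples, window)]


def spectrogram_shape():
    """Shape (channels, frames) of the Mel spectrogram for one full window."""
    frames = SAMPLE_RATE * CHUNK_SECONDS // HOP_LENGTH
    return (N_MELS, frames)
```

Under these assumptions each full window becomes an 80 x 3000 input to the encoder, and a 75-second clip splits into three windows, the last of which is shorter and would be padded to 30 seconds.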

In the decoder stage, special tokens (<|transcribe|>, <|translate|>, a language token such as <|en|>, and so on) are given as input, and a single model performs several tasks autoregressively. Thanks to this multitask training format, transcription, translation, and language detection are handled without separate models.

Audio → [Mel Spectrogram (80ch, 30s window)]
      → [Encoder: Transformer blocks]
      → [Decoder: Task token + Autoregressive generation]
      → Text output
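
The decoder step in this flow can be illustrated with a small sketch that assembles the special-token prefix the decoder is conditioned on. `build_decoder_prompt` is a hypothetical helper written for illustration, not part of the Whisper package; the token spellings follow Whisper's tokenizer, and the exact ordering shown is an assumption based on the multitask format described in the paper.

```python
def build_decoder_prompt(language, task, timestamps=False):
    """Assemble the special-token prefix for one decoding request.

    language: a language code such as "en" or "ja"
    task: "transcribe" (same-language text) or "translate" (English text)
    """
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unsupported task: {task}")

    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        # Ask the decoder to emit plain text without timestamp tokens.
        tokens.append("<|notimestamps|>")
    return tokens
```

Switching from transcription to translation changes only one token in the prefix; the model weights and the audio input stay the same.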

Every task Whisper performs goes in the STT direction: speech in, text out. Specifically, it supports multilingual speech transcription, speech-to-English translation, language detection, and voice activity detection (VAD). Generating speech from text (TTS) is outside Whisper's scope, and OpenAI provides separate models for TTS.

Comparison by model size

| Model | Parameters | VRAM | Relative speed | Notes |
|---|---|---|---|---|
| Tiny | 39M | ~1GB | 10x | English-only variant available |
| Base | 74M | ~1GB | 7x | English-only variant available |
| Small | 244M | ~2GB | 4x | English-only variant available |
| Medium | 769M | ~5GB | 2x | English-only variant available |
| Large | 1,550M | ~10GB | 1x (baseline) | Multilingual only |
| Turbo | 809M | ~6GB | 8x | Optimized variant of Large, translation not supported |

The English-only models (tiny.en, base.en, and so on) outperform the multilingual models at the smaller sizes. From Large upward, only multilingual models are provided.
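
As a rough illustration of how the size comparison might drive a choice, the sketch below picks a model for a given VRAM budget. The figures are copied from the comparison above; `pick_model` is a hypothetical helper, and using parameter count as a stand-in for accuracy is an assumption rather than a measured ranking.

```python
# (name, parameters in millions, approximate VRAM in GB, supports translation)
MODELS = [
    ("tiny", 39, 1, True),
    ("base", 74, 1, True),
    ("small", 244, 2, True),
    ("medium", 769, 5, True),
    ("large", 1550, 10, True),
    ("turbo", 809, 6, False),  # listed above as not supporting translation
]


def pick_model(vram_gb, need_translation=False):
    """Return the largest model that fits the budget, or None if none fits."""
    candidates = [
        (params, name)
        for name, params, vram, translates in MODELS
        if vram <= vram_gb and (translates or not need_translation)
    ]
    if not candidates:
        return None
    # Assumption: more parameters is treated as a proxy for higher accuracy.
    return max(candidates)[1]
```

With a 6 GB budget this returns turbo for transcription but falls back to medium when translation is required.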

Weak supervision and zero-shot transfer

Whisper's key innovation lies in its training data strategy. It was trained on 680,000 hours of audio-text pairs collected from the internet. These pairs are not carefully hand-labeled; they rely on **weak supervision** from noisy transcripts that accompany web audio, such as subtitles.

Advantages of this approach:

  • Data diversity: real-world data covering varied accents, background noise, and technical vocabulary
  • Scale: far larger than earlier supervised datasets
  • Generalization: not overfitted to any particular benchmark, so zero-shot transfer is strong
  • According to the paper, the models approach human accuracy and robustness

On-device deployment: whisper.cpp

After pip install openai-whisper, the model weights are downloaded to local storage the first time a model is loaded, and Whisper then runs fully offline. On top of that, the whisper.cpp project reimplements Whisper in C/C++, which makes on-device inference possible on mobile and edge devices as well. It is written in plain C/C++ with no external dependencies, so it runs on a wide range of platforms.

Supported platforms:

  • Mobile: iOS, Android (optimized for Apple Silicon)
  • Desktop: macOS (Intel/ARM), Linux, Windows
  • Edge/embedded: Raspberry Pi, WebAssembly, Docker

On-device optimization techniques:

  • ARM NEON / Accelerate framework (Apple Silicon)
  • AVX intrinsics (x86)
  • Metal GPU acceleration / Core ML (Apple Neural Engine)
  • NVIDIA CUDA, Vulkan (cross-vendor GPU)
  • GGML format + quantization: Q5_0 and similar schemes shrink the model substantially
  • Zero memory allocations at runtime
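
To give a feel for why quantization shrinks the model, here is a back-of-the-envelope estimate. It assumes the Q5_0 block layout used by ggml (32 weights stored in 22 bytes: a 2-byte scale, 4 bytes of fifth bits, and 16 bytes of 4-bit values) and pretends every tensor is quantized, which real model files do not do, so treat the result as a rough lower bound rather than an actual file size.

```python
Q5_0_BLOCK_WEIGHTS = 32          # weights per block (assumed ggml layout)
Q5_0_BLOCK_BYTES = 2 + 4 + 16    # fp16 scale + fifth bits + 4-bit values


def q5_0_bits_per_weight():
    """Effective storage cost of one weight under the assumed Q5_0 layout."""
    return Q5_0_BLOCK_BYTES * 8 / Q5_0_BLOCK_WEIGHTS


def estimate_size_mib(num_params, bits_per_weight):
    """Approximate weight storage in MiB, ignoring headers and metadata."""
    return num_params * bits_per_weight / 8 / 2**20


TINY_PARAMS = 39_000_000  # parameter count of the Tiny model
fp16_mib = estimate_size_mib(TINY_PARAMS, 16)
q5_0_mib = estimate_size_mib(TINY_PARAMS, q5_0_bits_per_weight())
```

Under these assumptions each weight costs 5.5 bits instead of 16, so the quantized weights take roughly a third of the 16-bit size.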

On-device memory usage:

| Model | Disk | Runtime memory |
|---|---|---|
| Tiny | 75 MiB | ~273 MB |
| Base | 142 MiB | ~388 MB |
| Small | 466 MiB | ~852 MB |
| Medium | 1.5 GiB | ~2.1 GB |
| Large | 2.9 GiB | ~3.9 GB |

For real-time inference, a stream tool is also provided that samples microphone input at 500 ms intervals.
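
The 500 ms sampling step can be pictured with a short sketch that slices an audio buffer into fixed steps. This is only an illustration of the interval arithmetic in Python, not how the stream tool itself is implemented; the 16 kHz sample rate is an assumption.

```python
def step_chunks(samples, sample_rate=16_000, step_ms=500):
    """Yield consecutive slices of `samples`, each covering `step_ms` of audio.

    The final slice may be shorter if the buffer does not divide evenly.
    """
    step = sample_rate * step_ms // 1000  # samples per step (8000 by default)
    for start in range(0, len(samples), step):
        yield samples[start:start + step]
```

A real-time loop would feed each new slice, together with some preceding context, to the model as it arrives.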

Comparison with competing models

| Model | Developer | Characteristics | Compared with Whisper | On-device |
|---|---|---|---|---|
| Faster Whisper | SYSTRAN | Whisper reimplementation built on CTranslate2 | 4x faster, same accuracy, lower memory use | Yes (local GPU/CPU) |
| NVIDIA Canary-1B | NVIDIA | FastConformer encoder, 1B parameters | Better WER (1.48% on LibriSpeech clean), 4 languages, CC-BY-NC | Yes (runs locally, NeMo framework) |
| whisper.cpp | ggerganov | C/C++ port of Whisper | On-device focus, quantization support, same model | Yes (mobile/edge/embedded) |
| Google Speech-to-Text | Google | Cloud API based | High accuracy | No (cloud API required) |
| Apple Speech Framework | Apple | Built into iOS/macOS | Limited to the Apple ecosystem, closed model | Yes (Apple devices only) |
| Meta SeamlessM4T | Meta | Multilingual speech + text translation | Strong at translation, 100 languages, restrictive license | Yes (runs locally, published on HuggingFace) |

Only Google Speech-to-Text cannot run on-device; all the others can run locally. The degree of on-device support differs, however:

  • Down to mobile/edge: whisper.cpp, Apple Speech Framework
  • Local on desktop/server: Faster Whisper, Canary-1B, SeamlessM4T

Model selection criteria:

  • Accuracy first → NVIDIA Canary or Whisper Large
  • Speed first → Faster Whisper (GPU) or whisper.cpp (edge)
  • On-device + open source → the Whisper + whisper.cpp combination is the most practical
  • Commercial freedom → Whisper (MIT) > most competing models

Usage

Transcribe from the CLI:

whisper audio.wav --model turbo

Specify a language:

whisper japanese.wav --language Japanese

Translate into English:

whisper japanese.wav --model medium --language Japanese --task translate

Python API:

import whisper
 
model = whisper.load_model("turbo")
result = model.transcribe("audio.mp3")
print(result["text"])
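
The CLI options shown earlier have Python counterparts. The sketch below shows one way the --language and --task flags might be passed to transcribe() as keyword arguments; `cli_flags_to_kwargs` and `run_whisper` are hypothetical helper names, and the file path is a placeholder.

```python
def cli_flags_to_kwargs(language=None, task="transcribe"):
    """Build transcribe() keyword arguments mirroring --language and --task."""
    kwargs = {"task": task}
    if language is not None:
        kwargs["language"] = language  # a language code such as "ja"
    return kwargs


def run_whisper(path, model_name="medium", **kwargs):
    """Load a model and return the recognized (or translated) text."""
    import whisper  # imported lazily; requires the openai-whisper package

    model = whisper.load_model(model_name)
    return model.transcribe(path, **kwargs)["text"]
```

For example, `run_whisper("japanese.wav", **cli_flags_to_kwargs("ja", "translate"))` would correspond to the translation command above.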

The lower-level API (detect_language(), decode(), load_audio(), log_mel_spectrogram()) also lets you control language detection and decoding individually.
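
A minimal sketch of how these lower-level calls might fit together is shown below. It also uses pad_or_trim() and DecodingOptions from the same package, which the text above does not mention; the function names `most_likely_language` and `detect_then_decode` are hypothetical, and only the first 30 seconds of the clip are handled.

```python
def most_likely_language(probs):
    """Return the language code with the highest probability."""
    return max(probs, key=probs.get)


def detect_then_decode(path, model_name="turbo"):
    """Detect the spoken language of a clip, then decode one 30-second window."""
    import whisper  # imported lazily; requires the openai-whisper package

    model = whisper.load_model(model_name)

    # Load the audio and pad or trim it to a single 30-second window.
    audio = whisper.pad_or_trim(whisper.load_audio(path))

    # Compute the log-Mel spectrogram on the same device as the model.
    mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)

    # detect_language returns language tokens and a dict of probabilities.
    _, probs = model.detect_language(mel)

    # Half precision is only assumed to be safe on CUDA devices.
    options = whisper.DecodingOptions(fp16=(model.device.type == "cuda"))
    result = whisper.decode(model, mel, options)
    return most_likely_language(probs), result.text
```

Calling `detect_then_decode("audio.mp3")` would return a language code and the decoded text for the first window; longer files need the windowing that transcribe() performs internally.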

References