• Speculative decoding is an inference-acceleration technique in which a small draft model predicts several tokens ahead and a large target model verifies them in a single forward pass
  • A parallelization strategy that cuts latency by producing multiple tokens per target-model pass without degrading output quality
  • A lossless optimization with reported speedups of up to about 3x over conventional sequential decoding

Why this concept is needed

  • LLM inference is memory-bound: GPU compute sits partly idle while memory bandwidth is the bottleneck
  • Conventional autoregressive decoding needs one forward pass per token → strictly sequential, hence slow
  • Running the huge model for every token, even easy ones ("the", "is", etc.), is inefficient

AS-IS (conventional autoregressive decoding)

sequenceDiagram
    autonumber
    participant LLM as Target model (70B)
    participant Out as Output

    %% Example sentence: "오늘 날씨가 좋습니다" ("The weather is nice today")
    LLM->>Out: Forward Pass 1 → "오늘"
    LLM->>Out: Forward Pass 2 → "날씨"
    LLM->>Out: Forward Pass 3 → "가"
    LLM->>Out: Forward Pass 4 → "좋"
    LLM->>Out: Forward Pass 5 → "습니다"
    Note over LLM: 5 tokens = 5 forward passes<br/>(the full large model runs every time)

TO-BE (Speculative Decoding)

sequenceDiagram
    autonumber
    participant Draft as Draft model (1B)
    participant Target as Target model (70B)
    participant Out as Output

    %% The draft continues the prompt "오늘" ("today") with "날씨가 좋습니다" ("the weather is nice")
    Draft->>Draft: Quickly predict 4 tokens
    Note over Draft: "날씨", "가", "좋", "습니다"

    Draft->>Target: Pass the 4 draft tokens
    Target->>Target: Verify all 4 tokens at once<br/>in a single forward pass

    Note over Target: "날씨" ✅ accepted<br/>"가" ✅ accepted<br/>"좋" ✅ accepted<br/>"습니다" ✅ accepted

    Target->>Out: Emit 4 tokens at once
    Note over Out: 4 tokens generated<br/>with 1 Target forward pass

How it works: 4 steps

Step 1: Draft generation

The small model (e.g., 1B) quickly predicts the next K tokens, one after another.

Step 2: Parallel verification

The large model (e.g., 70B) verifies all K tokens simultaneously in a single forward pass, comparing its own probability distribution with the draft's at each token position.

Step 3: Accept or reject

  • Accept the longest prefix of draft tokens that is consistent with the target model's probabilities
  • At the first mismatching token, the target model supplies the token itself, using the distribution it already computed during the verification pass

Step 4: Repeat

Continuing from the last accepted token, the draft model predicts another K tokens → verification repeats.
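
The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production implementation: it uses greedy decoding (a draft token is accepted only when it equals the target's top choice), and `speculative_decode`, `greedy_decode`, `toy_target`, and `toy_draft` are hypothetical names, with the two toy functions standing in for real models.

```python
from typing import Callable, List

# A "model" here is any function mapping a token sequence to its greedy next
# token. The toy models further down are hypothetical stand-ins, not real LLMs.
NextToken = Callable[[List[str]], str]


def greedy_decode(target: NextToken, prompt: List[str], max_new: int) -> List[str]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[str], k: int, max_new: int) -> List[str]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # Step 1: the draft model proposes k tokens autoregressively.
        proposal: List[str] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))

        # Step 2: the target scores every position. A real implementation gets
        # all k + 1 predictions from ONE batched forward pass; this toy calls
        # the target once per position only to keep the sketch simple.
        checks = [target(out + proposal[:i]) for i in range(k + 1)]

        # Step 3: accept the longest prefix on which draft and target agree.
        n_ok = 0
        while n_ok < k and proposal[n_ok] == checks[n_ok]:
            n_ok += 1
        # Append the target's own token at the first mismatch (or a bonus
        # token when all k were accepted), so every round adds >= 1 token.
        out += proposal[:n_ok] + [checks[n_ok]]
        # Step 4: the loop repeats from the extended sequence.
    return out[:len(prompt) + max_new]


TEXT = "the weather is nice today and the sky is clear".split()


def toy_target(seq: List[str]) -> str:
    return TEXT[len(seq)] if len(seq) < len(TEXT) else "<eos>"


def toy_draft(seq: List[str]) -> str:
    # Deliberately wrong at one position to trigger a rejection.
    return "bad" if len(seq) == 3 else toy_target(seq)


if __name__ == "__main__":
    print(speculative_decode(toy_target, toy_draft, ["the"], k=4, max_new=6))
```

Because the target has the last word at every position, the result matches plain greedy decoding with the target alone. In this toy run the draft is wrong at one position, so the first round accepts two draft tokens and takes the third from the target.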

Why is there no quality loss? (lossless guarantee)

  • Draft tokens are only proposals; the final decision is always made by the target model
  • Rejected tokens are replaced by tokens from the target model → with greedy decoding the output is identical to running the target model alone
  • With sampling, the output is mathematically guaranteed to follow the same probability distribution as the target model
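
One way to see why the distribution is preserved, assuming the standard speculative-sampling acceptance rule (the text above does not spell out its rule): a draft token x is accepted with probability min(1, p(x)/q(x)), where p and q are the target and draft probabilities, and on rejection a replacement is drawn from the normalized residual max(0, p - q). The sketch below computes the resulting per-position output distribution exactly; `output_distribution` is a hypothetical helper name.

```python
import math
from typing import List


def output_distribution(p: List[float], q: List[float]) -> List[float]:
    """Distribution of the token emitted at one position by speculative sampling.

    p: target-model probabilities, q: draft-model probabilities
    (both over the same vocabulary, each summing to 1).
    """
    # Probability that token x is proposed by the draft AND accepted:
    # q(x) * min(1, p(x) / q(x)) = min(p(x), q(x)).
    accepted = [min(pi, qi) for pi, qi in zip(p, q)]
    reject_prob = 1.0 - sum(accepted)

    # On rejection, resample from the normalized residual max(0, p - q).
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    z = sum(residual)
    if z == 0.0:  # p == q everywhere: nothing is ever rejected
        return accepted
    return [a + reject_prob * r / z for a, r in zip(accepted, residual)]


if __name__ == "__main__":
    p = [0.6, 0.3, 0.1]  # illustrative target distribution
    q = [0.2, 0.5, 0.3]  # illustrative (poorly matched) draft distribution
    out = output_distribution(p, q)
    print(all(math.isclose(o, pi, abs_tol=1e-12) for o, pi in zip(out, p)))
```

For any p and q over the same vocabulary, the two terms sum to min(p, q) + max(0, p - q) = p, which is the target distribution itself. A poor draft model lowers the acceptance rate but does not change what is generated.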

Key metric: acceptance rate (α)

| α value | Meaning | Effect |
|---|---|---|
| High (0.8+) | Most draft tokens are accepted | Large speedup |
| Medium (0.5) | About half accepted, half rejected | Moderate speedup |
| Low (0.2) | Most draft tokens are rejected | Overhead can outweigh the gain |

α depends on the capability gap between the draft and target models and on the difficulty of the input.
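
To make the effect of α concrete, here is a rough back-of-the-envelope model. It assumes each draft token is accepted independently with probability α, which is a simplification; under that assumption the expected number of tokens produced per verification round with K draft tokens is (1 - α^(K+1)) / (1 - α). The function names and the `cost_ratio` parameter (cost of one draft pass relative to one target pass) are hypothetical, and the value 0.1 is purely illustrative.

```python
def expected_tokens_per_round(alpha: float, k: int) -> float:
    """Expected tokens per target forward pass, assuming i.i.d. acceptance."""
    if alpha >= 1.0:
        return float(k + 1)  # every draft token accepted, plus one bonus token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, cost_ratio: float) -> float:
    """Tokens per round divided by the round's cost in target-pass units.

    One round costs k draft passes (k * cost_ratio) plus one target pass.
    """
    return expected_tokens_per_round(alpha, k) / (k * cost_ratio + 1.0)


if __name__ == "__main__":
    for alpha in (0.8, 0.5, 0.2):
        tokens = expected_tokens_per_round(alpha, k=4)
        speedup = estimated_speedup(alpha, k=4, cost_ratio=0.1)
        print(f"alpha={alpha}: {tokens:.2f} tokens/round, ~{speedup:.2f}x")
```

With these illustrative settings the estimate falls below 1x at α = 0.2, which matches the table's warning that a low acceptance rate can make speculative decoding slower than plain decoding.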

References