• Continuous Batching is a scheduling technique that manages the batch dynamically at the level of token generation (iteration-level) rather than at the level of whole requests
  • An inference optimization that inserts a new request into the slot of a completed request immediately, minimizing GPU idle time
  • A real-time batching strategy credited with up to 23x the throughput of Static Batching and with raising GPU utilization from 30-60% to 80-95%

Why this concept is needed

  • Output length differs from one LLM request to the next: one request may produce 50 tokens, another 500
  • Static Batching makes every other request in the batch wait until the longest request finishes
  • In real-time services (chatbots, APIs), response latency translates directly into a worse user experience
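
To make the cost of that waiting concrete, the Python sketch below estimates how many batch slots do useful work under static batching. It is a simplified model, not a measurement: the function name `static_batch_utilization` is hypothetical, and it counts slot occupancy per decode step rather than real GPU utilization.

```python
def static_batch_utilization(output_lengths):
    """Fraction of (slot, step) pairs that produce a real token when the
    whole batch keeps running until its longest request finishes."""
    steps = max(output_lengths)      # the batch runs this many decode steps
    useful = sum(output_lengths)     # tokens actually generated
    return useful / (steps * len(output_lengths))


# Requests of 50, 100, and 500 output tokens in one static batch.
print(round(static_batch_utilization([50, 100, 500]), 3))  # 0.433
```

In this model only 650 of the 1,500 slot-steps generate a token; the rest are spent waiting for the 500-token request.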

AS-IS (Static Batching)

sequenceDiagram
    autonumber
    participant R1 as Request 1 (50 tokens)
    participant R2 as Request 2 (100 tokens)
    participant R3 as Request 3 (500 tokens)
    participant GPU as GPU

    Note over GPU: Batch starts (3 requests)
    R1->>GPU: 50 tokens generated
    Note over R1: ⬜ Waiting (GPU wasted)
    R2->>GPU: 100 tokens generated
    Note over R2: ⬜ Waiting (GPU wasted)
    R3->>GPU: 500 tokens generated
    Note over GPU: Entire batch complete<br/>→ only now are new requests accepted

TO-BE (Continuous Batching)

sequenceDiagram
    autonumber
    participant R1 as Request 1 (50 tokens)
    participant R4 as Request 4 (new request)
    participant R2 as Request 2 (100 tokens)
    participant R5 as Request 5 (new request)
    participant GPU as GPU

    Note over GPU: Iteration-level scheduling
    R1->>GPU: 50 tokens done → leaves immediately
    R4->>GPU: Inserted into the free slot immediately
    R2->>GPU: 100 tokens done → leaves immediately
    R5->>GPU: Inserted into the free slot immediately
    Note over GPU: GPU kept fully utilized

Static vs Continuous Batching comparison

| Aspect | Static Batching | Continuous Batching |
| --- | --- | --- |
| Scheduling unit | Per batch | Per iteration (token) |
| Free-slot handling | Waits until the batch ends | Inserts a new request as soon as one completes |
| Padding waste | Pads to the longest request | Processes only each request's actual tokens |
| GPU utilization | 30-60% | 80-95% |
| Throughput | Baseline | Up to 23x higher |
| Suitable use | Offline bulk processing | Real-time services (chatbots, APIs) |

λ™μž‘ 원리 β€” Iteration-Level Scheduling

  1. Group several requests into one batch and submit it to the GPU
  2. Re-evaluate the batch composition at every token-generation iteration
  3. Remove completed requests from the batch immediately
  4. Insert waiting requests into the freed slots immediately
  5. Repeat, so that the GPU keeps running at full capacity
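
The steps above can be sketched as a small Python simulation. This is an illustrative toy, not the scheduler of any particular serving framework: the function names and the `max_batch_size` parameter are hypothetical, each request is reduced to the number of tokens it still has to generate, and prefill cost and memory limits are ignored.

```python
from collections import deque


def continuous_batching_steps(output_lengths, max_batch_size):
    """Simulate iteration-level scheduling; return total decode steps.

    output_lengths holds the number of tokens each request will generate.
    It is known up front only because this is a simulation.
    """
    waiting = deque(output_lengths)  # requests not yet admitted
    running = []                     # remaining tokens per in-flight request
    steps = 0
    while waiting or running:
        # Step 4: fill free slots from the waiting queue.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        # Step 2: one decode iteration, one token per running request.
        running = [remaining - 1 for remaining in running]
        steps += 1
        # Step 3: drop requests that have just finished.
        running = [remaining for remaining in running if remaining > 0]
    return steps


def static_batching_steps(output_lengths, max_batch_size):
    """Baseline: each fixed batch runs until its longest request finishes."""
    return sum(
        max(output_lengths[i:i + max_batch_size])
        for i in range(0, len(output_lengths), max_batch_size)
    )


lengths = [50, 100, 500, 80, 120, 60]
print(static_batching_steps(lengths, 3))      # 620
print(continuous_batching_steps(lengths, 3))  # 500
```

With three slots, the static baseline needs 500 + 120 steps because the second batch cannot start until the 500-token request ends. The continuous version finishes the three later requests in slots freed by the 50- and 100-token requests, so the total is bounded by the longest request alone.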

References