• vLLM์€ LLM ์ถ”๋ก  ๋ฐ ์„œ๋น™์„ ์œ„ํ•œ ๊ณ ์ฒ˜๋ฆฌ๋Ÿ‰ยท๊ณ ํšจ์œจ ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ ์˜คํ”ˆ์†Œ์Šค ์—”์ง„
  • UC Berkeley Sky Computing Lab์—์„œ ๊ฐœ๋ฐœ๋œ PagedAttention ๊ธฐ๋ฐ˜ ์ถ”๋ก  ํ”„๋ ˆ์ž„์›Œํฌ
  • ๊ธฐ์กด ์‹œ์Šคํ…œ ๋Œ€๋น„ KV Cache ๋ฉ”๋ชจ๋ฆฌ ๋‚ญ๋น„๋ฅผ 60~80%์—์„œ 4% ๋ฏธ๋งŒ์œผ๋กœ ์ค„์ธ ๋ฉ”๋ชจ๋ฆฌ ์ตœ์ ํ™” ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ

ํ•ด๋‹น ๊ฐœ๋…์ด ํ•„์š”ํ•œ ์ด์œ 

  • LLM์„ ํ”„๋กœ๋•์…˜์— ๋ฐฐํฌํ•  ๋•Œ, GPU ๋ฉ”๋ชจ๋ฆฌ ๋น„์šฉ์ด ๊ฐ€์žฅ ํฐ ๋ณ‘๋ชฉ
  • ๊ธฐ์กด ์ถ”๋ก  ์‹œ์Šคํ…œ์€ KV Cache์˜ 60~80%๋ฅผ ๋‚ญ๋น„ํ•˜์—ฌ ๋™์‹œ ์ฒ˜๋ฆฌ ๊ฐ€๋Šฅํ•œ ์š”์ฒญ ์ˆ˜๊ฐ€ ์ œํ•œ๋จ
  • vLLM์€ ๋™์ผํ•œ GPU ๋ฉ”๋ชจ๋ฆฌ๋กœ ๋” ๋งŽ์€ ์š”์ฒญ์„ ์ฒ˜๋ฆฌํ•˜์—ฌ ๋น„์šฉ ๋Œ€๋น„ ์ฒ˜๋ฆฌ๋Ÿ‰์„ ๊ทน๋Œ€ํ™”

AS-IS

sequenceDiagram
    autonumber
    participant Client
    participant Server as Legacy inference server
    participant GPU as GPU Memory

    Client->>Server: Request 1 (max_tokens=2048)
    Server->>GPU: Pre-allocate 2048 KV cache slots
    Note over GPU: Actually used: 512 tokens<br/>Wasted: 1536 slots (75%)

    Client->>Server: Request 2 (max_tokens=2048)
    Server->>GPU: Pre-allocate 2048 KV cache slots
    Note over GPU: Actually used: 300 tokens<br/>Wasted: 1748 slots (85%)

    Client->>Server: Request 3
    Server--xClient: Out of GPU memory → rejected

TO-BE

sequenceDiagram
    autonumber
    participant Client
    participant vLLM as vLLM (PagedAttention)
    participant GPU as GPU Memory (block-granular)

    Client->>vLLM: Request 1
    vLLM->>GPU: Dynamically allocate blocks 1–32 (only as needed)
    Note over GPU: Used: 512 tokens → 32 blocks<br/>Waste: under 4%

    Client->>vLLM: Request 2
    vLLM->>GPU: Dynamically allocate blocks 33–51
    Note over GPU: Used: 300 tokens → 19 blocks<br/>Remaining blocks stay reusable

    Client->>vLLM: Request 3
    vLLM->>GPU: Dynamically allocate from free blocks
    Note over GPU: Memory used efficiently → request admitted

PagedAttention โ€” ํ•ต์‹ฌ ๋ฉ”์ปค๋‹ˆ์ฆ˜

OS์˜ ๊ฐ€์ƒ ๋ฉ”๋ชจ๋ฆฌ ํŽ˜์ด์ง• ๊ฐœ๋…์„ GPU์˜ KV Cache ๊ด€๋ฆฌ์— ์ ์šฉํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜.

๊ฐœ๋…OS ๊ฐ€์ƒ ๋ฉ”๋ชจ๋ฆฌPagedAttention
๊ด€๋ฆฌ ๋‹จ์œ„Page FrameKV Block (๋ณดํ†ต 16ํ† ํฐ)
๋งคํ•‘ ํ…Œ์ด๋ธ”Page TableBlock Table
์ฃผ์†Œ ๊ณต๊ฐ„Virtual โ†’ PhysicalLogical Block โ†’ Physical GPU Block
๊ณต์œ  ๋ฉ”์ปค๋‹ˆ์ฆ˜Copy-on-WriteCopy-on-Write

Block Table: ๋…ผ๋ฆฌ์  KV Cache ์ฃผ์†Œ๋ฅผ ๋ฌผ๋ฆฌ์  GPU ๋ฉ”๋ชจ๋ฆฌ ์œ„์น˜๋กœ ๋ณ€ํ™˜ํ•˜๋Š” ๋งคํ•‘ ๊ตฌ์กฐ. ์—ฐ์† ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น ์—†์ด ์ž„์˜์˜ ์œ„์น˜์— ๋ธ”๋ก์„ ๋ฐฐ์น˜ ๊ฐ€๋Šฅ.

๊ธฐ์กด ์‹œ์Šคํ…œ์˜ 3๊ฐ€์ง€ ๋ฉ”๋ชจ๋ฆฌ ๋‚ญ๋น„

  1. Internal fragmentation: output length is unpredictable, so pre-allocated slots sit unused
  2. Reservation: the full memory block stays locked for the lifetime of a request → even partially used memory cannot be recycled
  3. External fragmentation: gaps between variable-length sequences cannot be utilized

PagedAttention์€ ์˜จ๋””๋งจ๋“œ ๋™์  ํ• ๋‹น์œผ๋กœ ์„ธ ๊ฐ€์ง€๋ฅผ ๋ชจ๋‘ ํ•ด๊ฒฐ.

์„ฑ๋Šฅ ๋น„๊ต

๋น„๊ต ๋Œ€์ƒ์ฒ˜๋ฆฌ๋Ÿ‰ ๊ฐœ์„ 
HuggingFace Transformers24x
HuggingFace TGI3.5x
์ผ๋ฐ˜์  ์ถ”๋ก  ์‹œ์Šคํ…œ2~4x (๋™์ผ latency ๊ธฐ์ค€)

์ฃผ์š” ์ตœ์ ํ™” ๊ธฐ๋ฒ•

Continuous Batching

์š”์ฒญ์ด ์™„๋ฃŒ๋  ๋•Œ๊นŒ์ง€ ๊ธฐ๋‹ค๋ฆฌ์ง€ ์•Š๊ณ , ์ƒˆ ์š”์ฒญ์„ ์ฆ‰์‹œ ๋ฐฐ์น˜์— ์ถ”๊ฐ€. Static batching ๋Œ€๋น„ GPU ํ™œ์šฉ๋ฅ  ๊ทน๋Œ€ํ™”.

Speculative Decoding

์ž‘์€ draft ๋ชจ๋ธ์ด ํฐ ๋ชจ๋ธ์˜ ์ถœ๋ ฅ์„ ๋ฏธ๋ฆฌ ์˜ˆ์ธก โ†’ ๊ฒ€์ฆ ํ›„ ์ฑ„ํƒ. ์†๋„ ์ตœ๋Œ€ 2๋ฐฐ ํ–ฅ์ƒ.

Prefix Caching

์‹œ์Šคํ…œ ํ”„๋กฌํ”„ํŠธ ๋“ฑ ๊ณตํ†ต ์ ‘๋‘์‚ฌ์˜ KV Cache๋ฅผ ์ €์žฅยท์žฌ์‚ฌ์šฉ. ๋ฐ˜๋ณต ํ”„๋กฌํ”„ํŠธ ์‹œ 400%+ ์„ฑ๋Šฅ ํ–ฅ์ƒ.

Quantization

๋ชจ๋ธ ๊ฐ€์ค‘์น˜๋ฅผ ๋‚ฎ์€ ์ •๋ฐ€๋„(FP8, INT8, AWQ ๋“ฑ)๋กœ ๋ณ€ํ™˜ํ•˜์—ฌ ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰ ๊ฐ์†Œ.

  • ์˜ˆ: Llama-13B โ†’ FP32 ๊ธฐ์ค€ 52GB โ†’ INT8๋กœ 13GB

Memory Sharing

  • Parallel Sampling: ํ•˜๋‚˜์˜ ํ”„๋กฌํ”„ํŠธ์—์„œ ์—ฌ๋Ÿฌ ์ถœ๋ ฅ ์ƒ์„ฑ ์‹œ KV Cache ๊ณต์œ 
  • Beam Search: ๊ณตํ†ต prefix์˜ KV Cache ๊ณต์œ 
  • Copy-on-Write: ๋ถ„๊ธฐ ์‹œ์ ๊นŒ์ง€ ๊ณต์œ , ๋ณ€๊ฒฝ ์‹œ์—๋งŒ ๋ณต์‚ฌ

์ง€์› ํ™˜๊ฒฝ

์นดํ…Œ๊ณ ๋ฆฌ์ง€์› ํ•ญ๋ชฉ
GPUNVIDIA, AMD
CPUIntel, ARM, PowerPC
๊ฐ€์†๊ธฐTPU, Intel Gaudi, Huawei Ascend
๋ชจ๋ธLlama, Qwen, Gemma, DeepSeek, Mixtral, LLaVA ๋“ฑ
๋ณ‘๋ ฌํ™”Tensor, Pipeline, Data, Expert Parallelism
APIOpenAI-compatible REST API

์ฐธ๊ณ  ๋ฌธ์„œ