• Quantization์€ ๋ชจ๋ธ ๊ฐ€์ค‘์น˜์˜ ์ˆ˜์น˜ ์ •๋ฐ€๋„๋ฅผ ๋‚ฎ์ถฐ(FP32 โ†’ INT8/INT4) ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰๊ณผ ์—ฐ์‚ฐ๋Ÿ‰์„ ์ค„์ด๋Š” ๋ชจ๋ธ ์••์ถ• ๊ธฐ๋ฒ•
  • ์ •ํ™•๋„ ์†์‹ค์„ ์ตœ์†Œํ™”ํ•˜๋ฉด์„œ ๋ชจ๋ธ ํฌ๊ธฐ๋ฅผ ์ตœ๋Œ€ 8๋ฐฐ ์ถ•์†Œํ•˜๋Š” ๊ฒฝ๋Ÿ‰ํ™” ์ „๋žต
  • ๋™์ผ GPU์—์„œ ๋” ํฐ ๋ชจ๋ธ ์‹คํ–‰ ๋˜๋Š” ๋” ๋งŽ์€ ๋™์‹œ ์š”์ฒญ ์ฒ˜๋ฆฌ๋ฅผ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๋Š” ํ”„๋กœ๋•์…˜ ์ตœ์ ํ™” ๋ฐฉ์‹

ํ•ด๋‹น ๊ฐœ๋…์ด ํ•„์š”ํ•œ ์ด์œ 

  • LLM์€ ์ˆ˜์‹ญ~์ˆ˜๋ฐฑ์–ต ๊ฐœ์˜ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ๊ฐ€์ง€๋ฉฐ, ๊ฐ ํŒŒ๋ผ๋ฏธํ„ฐ๊ฐ€ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์ฐจ์ง€
  • FP32 ๊ธฐ์ค€ 70B ๋ชจ๋ธ โ†’ ์•ฝ 280GB ๋ฉ”๋ชจ๋ฆฌ ํ•„์š” โ†’ ๊ณ ๊ฐ€์˜ GPU ์—ฌ๋Ÿฌ ์žฅ ํ•„์š”
  • Quantization์œผ๋กœ INT4 ๋ณ€ํ™˜ ์‹œ ์•ฝ 35GB โ†’ ๋‹จ์ผ GPU์—์„œ๋„ ์‹คํ–‰ ๊ฐ€๋Šฅ
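These figures are just parameter count × bytes per parameter. A minimal sketch of the arithmetic (the helper below is illustrative, not from any library; real deployments also need memory for the KV cache, activations, and framework overhead):

```python
# Back-of-the-envelope model memory calculator (weights only).
BYTES_PER_PARAM = {
    "FP32": 4.0,
    "FP16": 2.0,
    "INT8": 1.0,
    "INT4": 0.5,
}

def weight_memory_gb(num_params: float, fmt: str) -> float:
    """Approximate weight memory in GB for a given precision."""
    return num_params * BYTES_PER_PARAM[fmt] / 1e9

params_70b = 70e9
print(weight_memory_gb(params_70b, "FP32"))  # 280.0
print(weight_memory_gb(params_70b, "INT4"))  # 35.0
```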

AS-IS

sequenceDiagram
    autonumber
    participant Model as LLaMA-13B (FP32)
    participant GPU as GPU Memory

    Model->>GPU: Load weights (52GB)
    Note over GPU: 52GB of 80GB GPU in use<br/>Space left for KV Cache: 28GB
    Note over GPU: Concurrent requests: ~7<br/>(at 4K context)

TO-BE

sequenceDiagram
    autonumber
    participant Model as LLaMA-13B (INT8)
    participant GPU as GPU Memory

    Model->>GPU: Load weights (13GB)
    Note over GPU: 13GB of 80GB GPU in use<br/>Space left for KV Cache: 67GB
    Note over GPU: Concurrent requests: ~47<br/>(at 4K context)

์ •๋ฐ€๋„ ๋‹จ๊ณ„๋ณ„ ๋น„๊ต

ํฌ๋งท๋น„ํŠธ ์ˆ˜๋ชจ๋ธ ํฌ๊ธฐ (70B ๊ธฐ์ค€)์ •ํ™•๋„ ์†์‹ค์ ํ•ฉํ•œ ์šฉ๋„
FP3232bit~280GB๊ธฐ์ค€ํ•™์Šต(Training)
FP16/BF1616bit~140GB๋ฌด์‹œ ๊ฐ€๋Šฅ์ถ”๋ก  ๊ธฐ๋ณธ๊ฐ’
INT88bit~70GB~0.04%ํ”„๋กœ๋•์…˜ ๊ถŒ์žฅ
INT44bit~35GB~1.9%๋ฆฌ์†Œ์Šค ์ œ์•ฝ ํ™˜๊ฒฝ

์–ด๋–ป๊ฒŒ ์ž‘๋™ํ•˜๋Š”๊ฐ€?

ํ•ต์‹ฌ ์•„์ด๋””์–ด: ์—ฐ์†์ ์ธ ์‹ค์ˆ˜(float) ๊ฐ’์„ ์ด์‚ฐ์ ์ธ ์ •์ˆ˜(integer) ๊ฐ’์œผ๋กœ ๋งคํ•‘

FP32 weights:  [0.0312, -0.1875, 0.5625, -0.8750, ...]
              ↓ Quantization (INT8)
INT8 weights:  [4, -24, 72, -112, ...]
              + Scale Factor: 0.0078125
              + Zero Point: 0

Dequantization: INT8 value × Scale Factor ≈ original FP32 value

Major Quantization techniques

๊ธฐ๋ฒ•๋ฐฉ์‹ํŠน์ง•
PTQ (Post-Training Quantization)ํ•™์Šต ์™„๋ฃŒ ํ›„ ๋ณ€ํ™˜์žฌํ•™์Šต ๋ถˆํ•„์š”, ๋น ๋ฅธ ์ ์šฉ
GPTQ๋ ˆ์ด์–ด๋ณ„ ์˜ค์ฐจ ์ตœ์†Œํ™”INT4 + FP16 ํ˜ผํ•ฉ, ์ •ํ™•๋„ ์œ ์ง€ ์šฐ์ˆ˜
AWQ (Activation-aware)์ค‘์š” ๊ฐ€์ค‘์น˜ ๋ณด์กดํ™œ์„ฑํ™” ํŒจํ„ด ๊ธฐ๋ฐ˜์œผ๋กœ ํ•ต์‹ฌ ์ฑ„๋„ ๋ณดํ˜ธ
QAT (Quantization-Aware Training)ํ•™์Šต ์ค‘ ์–‘์žํ™” ์‹œ๋ฎฌ๋ ˆ์ด์…˜์ตœ๊ณ  ์ •ํ™•๋„, ์žฌํ•™์Šต ๋น„์šฉ ๋ฐœ์ƒ
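The PTQ row above can be illustrated with a toy per-channel pass: scales are computed from the finished weights, so no retraining is involved. This is a hypothetical sketch of the idea, not GPTQ or AWQ themselves:

```python
def ptq_per_channel(weight_rows):
    """Toy post-training quantization: each output channel (row) gets
    its own absmax scale, computed after training completes."""
    quantized = []
    for row in weight_rows:
        scale = (max(abs(w) for w in row) / 127) or 1.0  # guard all-zero rows
        quantized.append(([round(w / scale) for w in row], scale))
    return quantized

# Two channels with very different magnitudes: per-channel scales keep
# the small-magnitude channel from collapsing to all zeros.
channels = [[0.9, -0.45, 0.1], [0.01, -0.005, 0.002]]
for q_row, scale in ptq_per_channel(channels):
    print(q_row, scale)
```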

์‹ค์ƒํ™œ ๋น„์œ  - ์ˆ˜์—… ๋…ธํŠธ ํ•„๊ธฐ

์–‘์žํ™” ์ˆ˜์ค€์‹ค์ƒํ™œ ๋น„์œ ์„ค๋ช… ์˜ˆ์‹œ์ •๋ณด ๋ณด์กด์šฉ๋Ÿ‰
FP32๊ฐ•์˜ ์ „์ฒด ๋…น์Œโ€ํ–‰๋™์ด๋ž€ ์‚ฌ์šฉ์ž๊ฐ€ ๋ฌผ๋ฆฌ์ ์ด๋‚˜ ์ •์‹ ์ ์œผ๋กœ ๋‚˜ํƒ€๋‚ด๋Š” ๋ชจ๋“  ๊ฒƒ์„ ์˜๋ฏธํ•˜๋ฉฐโ€ฆโ€œ100%500MB+
INT8์ƒ์„ธ ์ •๋ฆฌ ๋…ธํŠธโ€ํ–‰๋™: ๋ฌผ๋ฆฌ์ /์ •์‹ ์  ์›€์ง์ž„๊ณผ ๋ฐ˜์‘, ํ”ผ๋“œ๋ฐฑ ํฌํ•จโ€~90%150MB
INT4ํ‚ค์›Œ๋“œ ๋ฉ”๋ชจโ€ํ–‰๋™: ์›€์ง์ž„, ๋ฐ˜์‘, ํ”ผ๋“œ๋ฐฑโ€~80%50MB

โ†’ ํ•ต์‹ฌ: ์ธ๊ฐ„์˜ ๋‡Œ๋„ ์„ธ๋ถ€์‚ฌํ•ญ ์ผ๋ถ€ ์ƒ๋žตํ•˜๋ฉด์„œ ํ•ต์‹ฌ ๊ฐœ๋…๋งŒ ์ถ”์ถœํ•ด ํ•™์Šตํ•˜๋“ฏ, Quantization ๋„ ์ •๋ณด ์†์‹ค์„ ์ตœ์†Œํ™”ํ•˜๋ฉด์„œ ํšจ์œจ์„ฑ์„ ๊ทน๋Œ€ํ™”ํ•ฉ๋‹ˆ๋‹ค.

QK ํŒŒ์ผ๋ช… ํฌ๋งท ํ•ด๋…

llama.cpp ์—์„œ ์‚ฌ์šฉํ•˜๋Š” ์–‘์žํ™” ํŒŒ์ผ๋ช…์œผ๋กœ, ์–‘์žํ™” ์ˆ˜์ค€๊ณผ ๊ธฐ๋ฒ•์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค:

ํŒŒ์ผ๋ช…์˜๋ฏธ์„ค๋ช…
Q4_K_M4-bit K-Medium๊ท ํ˜•ํ˜• ์–‘์žํ™” (์ •ํ™•๋„/์••์ถ•๋ฅ  ์ตœ์ ๋ฐธ๋Ÿฐ์Šค)
Q4_K_S4-bit K-Small๋” ์ž‘์€ ํŒŒ์ผ, ์•ฝ๊ฐ„ ๋‚ฎ์€ ์ •ํ™•๋„
Q5_K_M5-bit K-Medium4-bit ๋ณด๋‹ค ์ •ํ™•๋„ โ†‘, ํŒŒ์ผ ํฌ๊ธฐ โ†‘
Q5_K_S5-bit K-Small5-bit ์ค‘ ์ปดํŒฉํŠธ ๋ฒ„์ „
Q8_08-bit ์ •ํ™•ํ•œ ์–‘์žํ™”๊ฑฐ์˜ FP16 ์ˆ˜์ค€ ์ •ํ™•๋„, ํฐ ํŒŒ์ผ
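The naming scheme is regular enough to parse mechanically. A sketch, with parsing rules inferred from the table above rather than from any official llama.cpp specification:

```python
import re

def parse_quant_name(name: str):
    """Parse llama.cpp quantization tags such as 'Q4_K_M' or 'Q8_0'."""
    m = re.fullmatch(r"Q(?P<bits>\d+)_(?P<variant>K|0)(?:_(?P<size>[SML]))?", name)
    if m is None:
        raise ValueError(f"unrecognized quantization tag: {name}")
    return {
        "bits": int(m.group("bits")),
        "k_quant": m.group("variant") == "K",  # K-quantization series
        "size": m.group("size"),               # S / M / L, or None for Q8_0
    }

print(parse_quant_name("Q4_K_M"))  # {'bits': 4, 'k_quant': True, 'size': 'M'}
print(parse_quant_name("Q8_0"))    # {'bits': 8, 'k_quant': False, 'size': None}
```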

What Medium / Small / Large mean

llama.cpp ์˜ โ€œK-quantizationโ€ ์‹œ๋ฆฌ์ฆˆ์—์„œ ์ ‘๋ฏธ์‚ฌ๋Š” ์–‘์žํ™” ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ์ •๊ตํ•จ๊ณผ ์šฉ๋Ÿ‰์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค:

| ์ ‘๋ฏธ์‚ฌ | ์˜๋ฏธ | ํŠน์ง• | ์‹ค์ƒํ™œ ๋น„์œ  | ||---::--------||โ€”:โ€”:-------------------||----:-:------|-||---------|| | S (Small) | ์ž‘์€ ๋ฒ„์ „ | ์ตœ์†Œํ•œ์˜ ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ, ์•ฝ๊ฐ„์˜ ์ •ํ™•๋„ ํƒ€ํ˜‘ | โ€œํ•ต์‹ฌ ํ‚ค์›Œ๋“œ๋งŒโ€ ๋…ธํŠธ | | M (Medium) | ์ค‘๊ฐ„ ๋ฒ„์ „ | ์ •ํ™•๋„์™€ ํฌ๊ธฐ์˜ ์ตœ์  ๊ท ํ˜• | โ€œ์ ์ ˆํ•œ ์š”์•ฝโ€ ๋…ธํŠธ โญ | | L (Large) | ํฐ ๋ฒ„์ „ | ๋” ๋†’์€ ์ •ํ™•๋„, ๋” ๋งŽ์€ ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ | โ€œ์„ธ๋ถ€์‚ฌํ•ญ ํฌํ•จโ€ ์ƒ์„ธ ๋…ธํŠธ |

์ถ”์ฒœ ํŒŒ์ผ ํฌ๊ธฐ ์˜ˆ์‹œ (7B ๋ชจ๋ธ ๊ธฐ์ค€)

| ํŒŒ์ผ๋ช… | ํฌ๊ธฐ | ํŠน์ง• | ||โ€”::::---|-:---:| | Q3_K_S | ~2.5GB | ๊ฐ€์žฅ ์ž‘์€ ํŒŒ์ผ (์ตœ์†Œ ์šฉ๋Ÿ‰) | | Q4_K_M | ~3.5GB | ์ตœ์  ๊ท ํ˜• (์ถ”์ฒœ) โญ | | Q5_K_L | ~4.0GB | ๋†’์€ ์ •ํ™•๋„ | | Q8_0 | ~7.0GB | ๊ฑฐ์˜ ์›๋ณธ ํ’ˆ์งˆ |

Example: llama-2-7b.Q4_K_M.gguf

  • The 4-bit K-Medium quantized version of the LLaMA-2 7B model
  • Shrunk from 14GB (the FP16 original) to 3.5GB with minimal accuracy loss

Selection guide

์šฉ๋„์ถ”์ฒœ ํŒŒ์ผ
์ตœ๊ณ  ํ’ˆ์งˆ ํ•„์š”Q8_0 ๋˜๋Š” Q5_K_M
๊ท ํ˜• (์ถ”์ฒœ)Q4_K_M
์ €์‚ฌ์–‘ ์žฅ์น˜Q4_K_S ๋˜๋Š” Q3_K_M

References