12 May 2026 64 min read

How ML Systems Are Changing in the Era of Long-Context, Sparse, Agentic LLMs

HBM requirements, attention evolution, the inference/training data plane, and an anatomy of RLHF, DPO, and GRPO systems, from an OS / systems engineer's perspective

Date of writing: 2026-05-11
Primary inputs: the attached DeepSeek_V4.pdf, Deepseek v4.pptx, and 2601.07372v1.pdf (Engram), plus public technical reports, model cards, and top-tier conference papers.
Document format: standalone Markdown. When converted to PDF, it targets roughly 50–80 pages in a typical A4, 10–11pt, single-column layout.

Note: For closed frontier models such as GPT and Gemini, the architecture and kernel-level data movement are mostly undisclosed. This report combines public technical reports, model cards, official blogs and API docs, open-weight analogs, and DeepSeek-style open technical reports to build structural intuition from an OS/systems viewpoint. Anything not publicly disclosed is explicitly marked as “estimate” or “system-level implication”.

Executive Summary
Three Axes of Change for OS/Systems Engineers
Data Plane Lens: Viewing LLM Workloads as Memory Movement
Inference Anatomy: prefill, decode, reasoning, agent loop
A Quantitative Model of HBM Requirements
Evolution of Attention Mechanisms: From MHA to CSA/HCA
DeepSeek-V4 Case Study: The Design That Enables Million-Token Context
Gemini, GPT, Llama, Qwen, Kimi, MiniMax: A Comparison Based on Public Information
A Taxonomy of HBM Minimization Techniques
KV Cache Management: Where OS Virtual Memory Meets LLM Serving
MoE and Conditional Memory: The New Data Plane Created by Sparse Parameters
Data Movement Order-of-Scale: bandwidth / latency / capacity calculations
Training Step: The Pretraining Data Plane and the Memory Wall
DPO, RLHF, GRPO, OPD: Interpreting Post-training as a System Workload
Agentic AI Training Infrastructure: sandbox, rollout, WAL, environment
Top-tier Conference Approaches: Trends at ICML/ICLR/OSDI/SOSP/MLSys
Research Ideas for Bandwidth-aware System Design
Appendix A: Formula Sheet
Appendix B: Practical Profiling Checklist
References

\newpage

1. Executive Summary

An LLM system is no longer just “the problem of running a large dense Transformer fast”. From an OS/systems engineer's standpoint, the direction of frontier models in 2024–2026 compresses into the following points.

First, context has become a resource as important as model parameters. Models such as GPT-5.4/5.5, Gemini 3.x, DeepSeek-V4, Llama 4 Scout, and MiniMax-M1 put 1M–10M token context or multi-window compaction front and center. With vanilla attention, however, FLOPs and the KV cache blow up at long context. ML systems have therefore started to treat attention not as a “compute graph” but as a data plane that passes through the memory hierarchy. DeepSeek-V4's CSA/HCA, the MLA of DeepSeek-V2/V3, GPT-OSS's alternating sliding/full attention, MiniMax's lightning attention, StreamingLLM's attention sink, YaRN/LongRoPE-style context extension, and PagedAttention/vAttention-style KV memory managers all sit on this axis.

Second, HBM is now exhausted more often by the KV cache and runtime state than by model weights. Loading a dense 70B model in FP16 takes 140GB for the weights alone, which is obviously hard, but once MoE and FP4/MXFP4 shrink the weight footprint, the next bottlenecks are the KV cache, activations, optimizer state, all-to-all buffers, the prefix cache, and rollout traces. For example, a BF16 MHA model with 80 layers, 64 heads, and head_dim 128 needs about 2.5MiB of KV cache per token. At 1M (10^6) tokens, the KV cache for a single sequence is about 2.4TiB. Even with GQA-8 it is still about 305GiB at 1M tokens. This is why DeepSeek-V4 claims “about 2% of the KV cache of a BF16 GQA8 baseline at 1M context”. The attention architecture has become a “semantic compression policy” that sits above the HBM allocator.

Third, the meaning of sparsity is expanding from MoE to conditional memory. MoE is “conditional computation”: only some experts are activated per token. Engram is “conditional memory”: it derives a deterministic ID from a token n-gram and does an O(1) lookup in a huge embedding table. The key difference is that MoE routing depends on the runtime hidden state, whereas Engram can predict the memory address from the input tokens alone. That makes it a system primitive that can be prefetched from a host memory / SSD tier and overlapped with compute.

Fourth, in training, the rollout service is becoming a bigger system than the optimizer. DPO is close to a pairwise supervised objective and is comparatively simple. PPO/GRPO/agent RL, by contrast, is a distributed system that combines policy rollout, reference logprobs, reward/verifier, sandboxed tool execution, trajectory logging, replay, preemption, and fault tolerance. DeepSeek-V4 replaces mixed RL with OPD (On-Policy Distillation) and, for full-vocabulary KL, caches only the teacher's last-layer hidden state and reconstructs the logits from it, so HBM/DRAM/I/O is designed explicitly in post-training as well.

Fifth, the most interesting part from an OS/systems research viewpoint is that “model structure increasingly demands OS abstractions”. The KV cache has a page table and an eviction policy. Prefixes are candidates for copy-on-write and interning. MoE expert dispatch is a network scheduling problem. The Engram table is a multi-tier cache hierarchy. The agent sandbox is a container/microVM/full-VM scheduler, and rollouts require a WAL and checkpoints. In short, modern ML systems are moving beyond GPU kernel optimization toward memory managers, schedulers, storage hierarchies, preemption/fault tolerance, and compiler/runtime co-design.

\newpage

2. Three Axes of Change for OS/Systems Engineers

2.1 The axes of model capability scaling have changed

Traditional scaling was “parameter count × pretraining tokens × FLOPs”. Recent models, however, actively exploit three additional axes.

Test-time scaling: generate more reasoning tokens, run longer tool loops, or use multi-sample/self-verification.
Context-time scaling: feed long documents, codebases, conversation history, tool results, and memory traces in at once.
Sparsity-time scaling: activate only a subset of MoE experts, sparse attention blocks, external memory rows, retrieved documents, or tool results.

For an OS/systems engineer, this means looking at active data volume rather than model size. Even with 1T total parameters, if 30–50B parameters are active per token, compute is in the 30–50B class, but the system requirements change completely depending on whether all expert weights must stay resident in HBM, whether they can be prefetched from the host, and when routing is decided.

2.2 HBM is a “per-token bandwidth” problem, not a “capacity” problem

In an inference decode step with a small batch, the GPU tensor cores sit idle and HBM bandwidth becomes the bottleneck. Producing each output token reads the model weight shard, KV cache, routing metadata, dequant scales, and activation buffers. Even with sufficient HBM capacity, reading too many bytes per token increases latency.

From an OS viewpoint, the following questions matter.

Is this tensor read on every token, or written once in prefill and read repeatedly in decode?
Is the locality sequence-local, a request-shared prefix, expert-local, or a global table?
Is the address predictable in advance?
Is the compression lossless, or a model-trained approximation?
Can kernel fusion reduce HBM round-trips?
Can communication be hidden under compute?

DeepSeek-V4's CSA/HCA is not just an attention variant; it is a system design that changes the “lifetime, granularity, addressability, and reuse policy” of the KV cache. Engram creates address determinism so that the embedding table can be offloaded to host memory. GPT-OSS enables single 80GB GPU deployment with MXFP4-quantized MoE weights and a sliding/full attention mix.

2.3 Training is a memory state machine

In a training step the data plane is far more complex than in inference. Forward activations, backward gradients, optimizer state, parameter shards, communication buckets, activation checkpoints, and the recomputation graph are all entangled. Long-context training must shard the sequence dimension, and compressed attention must handle token groups that cross rank boundaries. The reason DeepSeek-V4's contextual parallelism “sends the last m tokens of rank i to rank i+1, all-gathers the compressed KV, then does select-and-pad” is that compression breaks the sequence-local invariant.

Post-training is more complex still. DPO needs only chosen/rejected pairs, but RLHF/PPO/GRPO must generate online rollouts. A rollout is an inference workload and at the same time training data generation. For long-context reasoning models, rollout length reaches tens of thousands of tokens, and for tool-use agents, environment state and sandbox I/O become coupled with GPU scheduling.

\newpage

3. Data Plane Lens: Viewing LLM Workloads as Memory Movement

The fastest way to understand an LLM system is to classify every tensor and object into one of the following four classes.

| Data class | Examples | Primary location | Read/write pattern | System bottleneck |
| --- | --- | --- | --- | --- |
| Static parameter | dense weight, expert weight, embedding table, LM head | HBM, host DRAM, NVMe | repeated read on every decode token; gradient update in training | HBM capacity/bandwidth, expert residency |
| Per-sequence state | KV cache, sliding window state, compressed KV, uncompressed tail | HBM, host DRAM, SSD | prefill write, decode repeated read, prefix reuse | KV memory allocator, page fragmentation, cache hit |
| Per-step activation | hidden states, attention logits, MLP intermediate, router logits | HBM/SRAM/register | forward write, backward read/recompute | activation memory, recomputation cost |
| Control/runtime metadata | router decision, top-k index, page table, prefix tree, tool trace, rollout log | CPU DRAM/HBM | small but latency-sensitive | scheduler, RPC, metadata cache |

In general, GPU kernel tuning reduces the HBM traffic of per-step activations; FlashAttention is the canonical example. Serving systems reduce the footprint and waste of per-sequence state: PagedAttention, vAttention, prefix caching, and on-disk KV cache belong here. Model architecture changes static parameters and per-sequence state at the same time: MoE shrinks the active set of static parameters, and GQA/MLA/CSA/HCA shrink KV state. Post-training systems turn control/runtime metadata and rollout traces into a production-grade distributed system.

3.1 Prefill data plane

Prefill builds the KV cache for all prompt tokens and computes the distribution of the first output token. With dense attention the attention score matrix is O(S²), where S is the context length. FlashAttention does not materialize the score matrix in HBM; it performs online softmax on SRAM tiles. The memory complexity therefore drops from O(S²) to O(S), but the O(S²) compute complexity of dense attention remains. For long-context prefill, FlashAttention is a necessary condition, not a sufficient one.
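The online-softmax idea can be sketched in NumPy for a single query vector. This is only a minimal illustration of the running-max / running-sum recurrence, not the FlashAttention kernel itself: the function names and the `tile` parameter are hypothetical, and a real kernel also tiles over queries and keeps its tiles in on-chip SRAM.

```python
import numpy as np

def attention_reference(q, k, v):
    """Naive attention for one query: materializes all S scores at once."""
    scores = k @ q / np.sqrt(q.shape[0])
    w = np.exp(scores - scores.max())
    return (w / w.sum()) @ v

def attention_online(q, k, v, tile=4):
    """Same result, but K/V are consumed tile by tile with O(tile) score storage."""
    d = q.shape[0]
    m = -np.inf                  # running max of the scores seen so far
    l = 0.0                      # running softmax denominator
    acc = np.zeros(v.shape[1])   # running unnormalized output
    for start in range(0, k.shape[0], tile):
        s = k[start:start + tile] @ q / np.sqrt(d)
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)    # rescales the partial results from earlier tiles
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ v[start:start + tile]
        m = m_new
    return acc / l
```

The tiled version never holds more than `tile` scores at a time, which is the memory saving described above; the number of multiply-adds is unchanged.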

The data movement in prefill is as follows.

input tokens -> embedding read
for each layer:
  Q/K/V projection weight read
  attention: K/V tile read, Q tile read, O write
  MLP/MoE weight read, activation write
  KV cache write
final logits / sampling

For long prompts, KV cache write traffic also matters. For example, when a GQA-8 model with 320KiB of KV per token prefills a 128K prompt, the KV writes alone total about 40GiB. This is of course spread across layers and sharded differently per GPU, but the allocator and page placement directly affect performance.

3.2 Decode data plane

Decode produces output tokens one at a time. At each step the new token's hidden state passes through every layer, and attention reads the past KV cache.

new token -> embedding
for each layer:
  read Q projection weight
  read K/V cache for visible context
  write new K/V cache
  read MLP/MoE weight
  maybe all-to-all dispatch/combine
logits -> sample next token

With a small batch, decode becomes memory-bandwidth-bound. Producing one token requires reading nearly all active weights once and reading the visible KV cache, yet the GEMM dimensions are smaller than in prefill, so tensor core utilization is low. This is exactly why methods such as speculative decoding, Medusa, EAGLE, and MTP try to produce “several tokens at once”. The number of output tokens does not shrink; the number of full model passes does.

3.3 Agent loop data plane

In agentic AI, decode is interrupted by tool calls and the tool results are appended to the context. At that point the data plane extends beyond the GPU.

LLM decode -> tool call JSON/XML
CPU scheduler -> tool sandbox / browser / Python / bash
tool output -> tokenizer -> prefill or append KV
conversation state -> prefix cache / compaction / eviction

The reason the post-training section of DeepSeek-V4 describes the sandbox platform DSec, a preemptible rollout service, and a token-granular WAL is that agent training is not a simple ML loop but an OS-level lifecycle management problem. Tool calls can be nondeterministic, sandbox state must be reproducible after preemption, and if a GPU task is preempted mid-rollout, the KV cache and token log must be restored.

\newpage

4. Inference Anatomy: prefill, decode, reasoning, agent loop

4.1 How prefill and decode differ in character

In LLM serving, prefill and decode have different hardware affinities.

Prefill: large matrix multiplications and attention run over a large batch/sequence. It is close to compute-bound. For long prompts, attention compute and KV writes are large.
Decode: a sequential token-by-token step. Unless the batch is large enough, weight/KV reads dominate. It is close to memory-bandwidth-bound.

Because of this difference, systems such as DistServe, Splitwise, and Sarathi-Serve either disaggregate prefill and decode or use chunked prefill to piggyback decodes onto prefill chunks. In long-horizon agent models such as DeepSeek-V4, GPT-5.4, and Gemini 3, prefill can reach hundreds of thousands to a million tokens, and decode can reach tens of thousands of tokens because of reasoning tokens. “TTFT (Time To First Token)” and “TPOT (Time Per Output Token)” must therefore be optimized together.

4.2 The new cost of reasoning models: token budget

Reasoning models generate a hidden chain-of-thought or visible thinking tokens before answering. Test-time scaling, emphasized since OpenAI o1, follows the direction “thinking longer solves more”. DeepSeek-V4 splits reasoning effort into modes such as Non-think, High, and Max, with a different context window and length penalty per mode. MiniMax-M1 publicly mentions 40K/80K thinking budgets. GPT-5.4/5.5 emphasize token-efficient reasoning.

From a systems viewpoint, reasoning tokens ride the same decode data plane as output tokens. Reasoning capability therefore incurs the following costs.

More decode steps
Longer KV cache lifetime
More intermediate tool state
Longer post-training rollouts
Higher scheduler occupancy in evaluation/inference

Modern models therefore cannot simply enlarge the context window; attention compression, KV cache reuse, speculative decoding, compaction, and tool-use memory have to be designed together.

4.3 Long-context model의 core tension

Long context serves two different purposes.

Retrieval purpose: find the needed needle within 1M tokens.
Computation purpose: maintain a long chain-of-thought, multi-step plan, codebase state, and tool trace.

The retrieval purpose is relatively easy to address with sparse attention and memory lookup. The computation purpose must maintain local reasoning state and global consistency at the same time. Recent architectures therefore usually combine the following.

Sliding window / local attention: keeps local coherence
Sparse global selection: distant relevant block retrieval
Compressed dense attention: keeps a rough global summary
Attention sink / special token: keeps streaming stable
RoPE scaling / YaRN / LongRoPE: position extrapolation
Compaction / summarization: multi-window task continuation
External tool/file memory: moves context outside attention

DeepSeek-V4's CSA/HCA hybrid is an example of putting this combination inside the model. Compaction in GPT-5.1-Codex-Max is an example of training task continuation across multiple context windows at the model/agent workflow level. The Gemini 3 page emphasizes improvements in long-horizon coding and tool use but does not disclose the low-level architecture.

\newpage

5. A Quantitative Model of HBM Requirements

This chapter provides formulas that can be used directly for bandwidth-aware system design.

5.1 KV cache size formula

In an autoregressive decoder, each layer stores the keys and values of previous tokens. Simplified, the KV cache per token is as follows.

KV_bytes_per_token = 2 * L * H_kv * D_head * bytes_per_element

2: key and value
L: number of transformer layers
H_kv: number of key/value heads
D_head: head dimension
bytes_per_element: 2 for BF16/FP16, 1 for FP8, 0.5 plus scale overhead for FP4

Example 1: MHA 70B-like

Assumptions:

L = 80
H_kv = 64
D_head = 128
BF16 = 2 bytes

Calculation:

2 * 80 * 64 * 128 * 2 = 2,621,440 bytes ≈ 2.5 MiB / token

With a 1M (10^6) token context, the KV cache for a single sequence is about 2.4TiB (2.62TB). That cannot fit in a single H200 141GB or B200 192GB, and even with multi-GPU sharding the KV read bandwidth in decode becomes very large.

Example 2: GQA-8

With H_kv=8 on the same model:

2 * 80 * 8 * 128 * 2 = 327,680 bytes ≈ 320 KiB / token

At 1M tokens this is about 305GiB. That is far smaller, but if every decode step reads the full context, a single token needs hundreds of GiB of KV reads.

Example 3: MQA

With H_kv=1:

2 * 80 * 1 * 128 * 2 = 40,960 bytes ≈ 40 KiB / token

At 1M tokens this is about 38GiB. MQA is powerful for the KV cache but can degrade quality. GQA is the trade-off between MHA and MQA.
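The three examples can be reproduced with a few lines of Python. This is a minimal sketch of the formula in section 5.1; the function name is hypothetical, and "1M tokens" is taken here as 10^6 tokens (with 2^20 tokens every total would be about 5% larger).

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem=2):
    """KV cache bytes per token: 2 (K and V) * L * H_kv * D_head * bytes."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

GIB = 2**30
CONTEXT = 1_000_000  # assumption: 1M means 10^6 tokens

for name, h_kv in [("MHA", 64), ("GQA-8", 8), ("MQA", 1)]:
    per_token = kv_bytes_per_token(layers=80, kv_heads=h_kv, head_dim=128)
    total_gib = per_token * CONTEXT / GIB
    print(f"{name}: {per_token / 1024:.0f} KiB/token, {total_gib:.1f} GiB per sequence")
```

Lowering `bytes_per_elem` to 1 models an FP8 cache; scale metadata for FP4-style formats is not included.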

5.2 KV read bandwidth lower bound

Assuming attention reads all visible KV for one decode token:

KV_read_per_decode_token ≈ context_length * KV_bytes_per_token_visible

In the GQA-8 example, a 1M context means about 305GiB (327.68GB) of KV reads. Even if the 3.35TB/s HBM bandwidth of an H100 SXM were fully used, the lower bound is about 98ms.

327.68 GB / 3.35 TB/s ≈ 0.098 s

This lower bound excludes weight reads, MLP compute, communication, and kernel overhead. Dense full-attention decode at 1M context is therefore at a systemic disadvantage. This bandwidth lower bound is exactly why DeepSeek-V4 reduces the number of visible KV entries with compressed sparse attention, GPT-OSS mixes sliding/full attention, and MiniMax uses lightning attention.
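As a sanity check, the bound can be computed with bytes and bandwidth kept in the same decimal unit system. A minimal sketch with a hypothetical helper name; the result is a lower bound only, under the same idealized assumptions as the text (full HBM bandwidth, nothing else on the bus).

```python
def kv_read_lower_bound_s(context_len, kv_bytes_per_token, hbm_bytes_per_s):
    """Time to stream every visible KV byte once at the given HBM bandwidth."""
    return context_len * kv_bytes_per_token / hbm_bytes_per_s

H100_SXM_HBM = 3.35e12  # bytes/s, decimal TB/s

dense = kv_read_lower_bound_s(1_000_000, 327_680, H100_SXM_HBM)
# Same model if only 2% of the KV bytes had to be read per step (illustrative ratio).
reduced = kv_read_lower_bound_s(1_000_000, 327_680 * 0.02, H100_SXM_HBM)
print(f"dense GQA-8: {dense * 1e3:.1f} ms, 2% of KV: {reduced * 1e3:.2f} ms")
```

Note that 305 GiB equals 327.68 GB, so dividing by 3.35 TB/s gives roughly 98 ms per decode step for the dense case.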

5.3 Weight read bandwidth lower bound

Assuming each active parameter is read once in decode:

weight_read_per_token ≈ active_params * bytes_per_weight / parallel_shards

For example, reading 49B active parameters in FP8 at 1 byte each, sharded with 8-way tensor/model parallelism, is about 6.1GB/token per GPU. At H100's 3.35TB/s the ideal lower bound is about 1.8ms.

49GB / 8 / 3.35TB/s ≈ 1.8ms

In practice, GEMM shape, dequant scale reads, MoE dispatch, cache misses, pipeline bubbles, and batch size all have an effect. Moreover, shrinking weights with FP4/MXFP4 reduces weight bandwidth, but dequant scales and kernel support become new bottlenecks.
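The weight-read bound is equally easy to parameterize. The sketch below uses a hypothetical function name and deliberately ignores dequant scale metadata, so the FP4 figure is optimistic.

```python
def weight_read_lower_bound_s(active_params, bytes_per_weight, shards, hbm_bytes_per_s):
    """Per-GPU time to read every active weight once per decode step."""
    per_gpu_bytes = active_params * bytes_per_weight / shards
    return per_gpu_bytes / hbm_bytes_per_s

fp8 = weight_read_lower_bound_s(49e9, 1.0, 8, 3.35e12)
fp4 = weight_read_lower_bound_s(49e9, 0.5, 8, 3.35e12)  # assumption: no scale overhead
print(f"FP8: {fp8 * 1e3:.2f} ms/token, FP4: {fp4 * 1e3:.2f} ms/token")
```

With batching, one weight read is amortized over every sequence in the batch, which is why this bound matters most at small batch sizes.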

5.4 GPT-OSS 120B as an open-weight worked example

The public Hugging Face config of OpenAI gpt-oss-120b has the following characteristics.

117B parameters, 5.1B active parameters per token
36 layers
hidden size 2880
64 attention heads
8 KV heads
head_dim 64
alternating sliding_attention / full_attention
sliding_window 128
max_position_embeddings 131072
MXFP4 quantization of MoE weights

Simplifying by treating every layer as full attention, the BF16 KV cache is:

2 * 36 * 8 * 64 * 2 = 73,728 bytes ≈ 72 KiB/token

At 128K context that is about 9GiB. However, per the config half of the layers use sliding attention with a sliding window of 128, so long-range KV does not need to be kept in every layer. This shows that when an open-weight model targets 80GB GPU deployment, attention layout matters as much as weight quantization.
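The effect of the alternating layout can be estimated as follows. This sketch assumes an even 18/18 split between full and sliding layers and assumes sliding layers retain only the last `window` tokens; both are simplifications of the config described above, and the function name is hypothetical.

```python
def kv_cache_bytes(context_len, full_layers, sliding_layers, window,
                   kv_heads, head_dim, bytes_per_elem=2):
    """Total KV bytes when sliding layers retain only the last `window` tokens."""
    per_layer_token = 2 * kv_heads * head_dim * bytes_per_elem
    full = full_layers * context_len * per_layer_token
    sliding = sliding_layers * min(context_len, window) * per_layer_token
    return full + sliding

GIB = 2**30
all_full = kv_cache_bytes(131072, 36, 0, 128, kv_heads=8, head_dim=64)
mixed = kv_cache_bytes(131072, 18, 18, 128, kv_heads=8, head_dim=64)
print(f"all full: {all_full / GIB:.2f} GiB, alternating: {mixed / GIB:.2f} GiB")
```

Under these assumptions the sliding layers contribute only a few MiB, so the per-sequence KV cache at 128K is roughly halved.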

5.5 The scale DeepSeek-V4 claims

According to the DeepSeek-V4 report, V4-Pro has 1.6T total parameters and 49B activated parameters and supports 1M context. V4-Flash has 284B total and 13B activated. V4 combines CSA/HCA hybrid attention, mHC, the Muon optimizer, FP4/QAT, and a heterogeneous KV cache. The report states that at 1M context V4-Pro needs only 27% of the single-token inference FLOPs and 10% of the KV cache of DeepSeek-V3.2, and that V4-Flash is at about 10% of the FLOPs and 7% of the KV cache. It also states that, relative to a BF16 GQA8 head_dim 128 baseline, the KV cache of the V4 series drops to about 2% in the 1M setting.

Interpreted from an OS viewpoint, the core of V4 is the following.

full KV cache residency
  -> compressed block residency
  -> sparse selected block read
  -> sliding local state
  -> on-disk prefix reuse

That is, the KV cache is no longer a single contiguous array; it becomes a heterogeneous object graph in which policy and compression ratio differ per layer.

\newpage

6. Evolution of Attention Mechanisms: From MHA to CSA/HCA

6.1 MHA: good quality, but a large KV cache

Multi-Head Attention (MHA) has a key/value head for every query head. Quality is strong, but the KV cache scales with H_kv = H_q. Because memory bandwidth dominates at long context, MHA is expensive to serve.

The OS/systems problems of MHA are clear.

The KV cache grows per token and per layer.
Decode must read the K/V of every past token.
Without prefix sharing, the same system prompt is prefilled repeatedly.
Fragmentation reduces the batch size.

FlashAttention solved the problem of materializing MHA's score matrix in HBM, but it does not shrink the KV cache itself.

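6.2 MQA / GQA: sharing KV heads

Multi-Query Attention (MQA) keeps only one key/value head. There are multiple query heads, but they share K/V. The KV cache shrinks substantially and decode gets faster. The downside is possible quality degradation.

Grouped-Query Attention (GQA) is the middle ground between MHA and MQA. Several query heads share one KV head per group. The GQA paper reports that an existing MHA checkpoint can be uptrained into MQA/GQA with about 5% of the original pretraining compute, and that GQA offers speed close to MQA with quality close to MHA.

From a systems viewpoint GQA is very powerful, because only H_kv in the KV cache formula has to shrink. At 1M context, however, even GQA-8 can reach hundreds of GiB. GQA can therefore be enough for 128K to a few hundred K tokens of context, but million-token context needs additional compression/sparsity.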

6.3 MLA: latent KV compression

The Multi-head Latent Attention (MLA) of DeepSeek-V2/V3 does not cache K/V directly; it caches a low-rank latent vector. According to the report, DeepSeek-V2 introduced MLA and DeepSeekMoE, supports 128K context, and reduced the KV cache by 93.3% relative to DeepSeek 67B.

The significance of MLA is that it goes beyond “head sharing” and factorizes K/V state into a latent space. From an OS viewpoint the payload of a KV cache page shrinks. However, the attention kernel has to reconstruct keys/values from the latent or perform a low-rank projection, so there is a compute-versus-memory trade-off.

6.4 Sliding Window Attention: making locality a system invariant

Sliding Window Attention (SWA) attends only to the most recent n_win tokens. The memory footprint scales with the window size rather than the sequence length.

Advantages:

KV cache bounded by O(n_win)
Predictable decode latency
Local coherence is preserved

Disadvantages:

Distant retrieval is impossible
Weak at long-context QA and codebase reasoning

Recent models therefore do not use SWA alone; they mix it with a global mechanism. The GPT-OSS config alternates sliding and full attention. DeepSeek-V4 attaches additional sliding window KV entries to the CSA/HCA branches. StreamingLLM keeps attention sink tokens so that a finite-window model behaves stably on streaming input.
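The O(n_win) bound follows directly from storing the window as a ring buffer. The class below is a minimal single-head sketch with hypothetical names; it is not how any particular runtime lays out its sliding-window state.

```python
import numpy as np

class SlidingWindowKV:
    """Fixed-size ring buffer holding K/V for the most recent `window` tokens."""

    def __init__(self, window, head_dim):
        self.window = window
        self.k = np.zeros((window, head_dim))
        self.v = np.zeros((window, head_dim))
        self.count = 0  # total number of tokens appended so far

    def append(self, k, v):
        slot = self.count % self.window  # overwrite the oldest entry once full
        self.k[slot] = k
        self.v[slot] = v
        self.count += 1

    def visible(self):
        """Return K/V in chronological order (oldest to newest)."""
        n = min(self.count, self.window)
        if self.count <= self.window:
            return self.k[:n], self.v[:n]
        start = self.count % self.window  # slot of the oldest surviving token
        idx = (np.arange(n) + start) % self.window
        return self.k[idx], self.v[idx]
```

Memory use is constant regardless of how many tokens are appended, which is what makes decode latency predictable for these layers.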

6.5 Sparse Attention: not attending to every past token

Sparse attention selects only some keys/values per query. In the past, fixed patterns such as Longformer/BigBird were common. In recent long-context models, content-based selection, learned indexers, and compressed block selection have become important.

The DSA of DeepSeek-V3.2 and the CSA of DeepSeek-V4 select the top-k entries with a “lightning indexer” (in CSA, the top-k among compressed blocks). This resembles an MoE router. The attention data plane changes as follows.

dense attention:
  read all KV blocks

sparse attention:
  compute cheap index score
  select top-k block ids
  gather selected KV blocks
  run core attention

The system challenge here is that the top-k indices and the gather pattern are irregular. Block layout, alignment, cache line padding, and kernel co-design matter.
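The four sparse-attention steps can be sketched in NumPy for one query. This is a generic block-sparse illustration, not DeepSeek's lightning indexer: the function name is hypothetical, and the caller supplies `index_keys`, for which a simple stand-in (an assumption) is the mean key of each block.

```python
import numpy as np

def sparse_block_attention(q, k_blocks, v_blocks, index_keys, top_k):
    """q: [d]; k_blocks, v_blocks: [n_blocks, block, d]; index_keys: [n_blocks, d]."""
    d = q.shape[0]
    scores = index_keys @ q                  # 1) cheap index score per block
    top_k = min(top_k, scores.shape[0])
    ids = np.argpartition(-scores, top_k - 1)[:top_k]  # 2) top-k block ids
    ids.sort()                               # keep blocks in sequence order
    k = k_blocks[ids].reshape(-1, d)         # 3) irregular gather of selected blocks
    v = v_blocks[ids].reshape(-1, v_blocks.shape[-1])
    logits = k @ q / np.sqrt(d)              # 4) core attention over the gathered KV
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ v, ids
```

The gather in step 3 is where the irregular memory access appears: the selected block ids are data-dependent, so reads cannot be planned before the index scores are known.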

6.6 Compressed Attention: shrinking the token dimension

Compressed attention merges the KV of several tokens into one compressed entry. With compression rate m, the sequence length shrinks to 1/m. DeepSeek-V4 uses m=4 for CSA and m'=128 for HCA. CSA does sparse top-k selection after compression, while HCA applies much stronger compression followed by dense attention.

This combination separates local detail from global coverage.

CSA: relatively light compression + sparse precise retrieval
HCA: heavy compression + dense, rough global context
SWA: uncompressed local detail

From an OS viewpoint, page size and compression group size become coupled. The DeepSeek-V4 inference KV layout sets the block token size to N * lcm(m, m'). This is the same as the alignment problem in a classic memory allocator.
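The alignment arithmetic can be made concrete with a short sketch. The function names and dictionary keys are hypothetical; only m=4, m'=128, and the N * lcm(m, m') block size come from the text.

```python
import math

def block_layout(m, m_prime, n_groups=1):
    """Tokens covered by one KV block and how many compressed entries it holds."""
    tokens = n_groups * math.lcm(m, m_prime)
    return {
        "tokens_per_block": tokens,
        "csa_entries": tokens // m,
        "hca_entries": tokens // m_prime,
    }

def compressed_lengths(n_tokens, m, m_prime):
    """Compressed sequence lengths plus the not-yet-compressible tail per branch."""
    return {
        "csa": n_tokens // m,
        "hca": n_tokens // m_prime,
        "csa_tail": n_tokens % m,
        "hca_tail": n_tokens % m_prime,
    }

print(block_layout(4, 128))
print(compressed_lengths(1_000_000, 4, 128))
```

Because a block always covers a whole multiple of both m and m', no compressed entry straddles a block boundary, which is the same reason allocators align objects to a common multiple of their sizes.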

6.7 Linear Attention / State Space / Lightning Attention

Linear attention reduces the O(S²) of softmax attention through a kernel trick or recurrence. State Space Models (SSMs) and hybrid attention share the same goal. MiniMax-M1 combines a hybrid MoE architecture with Lightning Attention and emphasizes 1M context and test-time compute scaling. Models such as Kimi Linear also put an efficient attention architecture front and center.

From a systems viewpoint, linear attention holds a recurrent state instead of a KV cache. This reduces the memory footprint, but exact retrieval and arbitrary token addressing can become harder. Frontier models therefore lean toward hybrids that combine dense/sparse attention with linear/state components.

6.8 Position Encoding: RoPE scaling, YaRN, LongRoPE

Supporting long context requires not only the attention pattern but also the position encoding to extrapolate. RoPE encodes relative position well, but quality degrades beyond the training length. YaRN reports that it extends the context window of RoPE-based models compute-efficiently, requiring fewer tokens and training steps than prior methods. LongRoPE/LongRoPE2 likewise improve position interpolation/extrapolation.

The important point is that position extension does not reduce the KV cache or attention FLOPs. YaRN is therefore a technique that unlocks long-context “capability”, while CSA/HCA/MLA/SWA are techniques that reduce long-context “cost”.

\newpage

7. DeepSeek-V4 Case Study: The Design That Enables Million-Token Context

The attached DeepSeek-V4 report and slides are a good case study of how modern ML systems are changing. This chapter reinterprets only the essentials from an OS/systems viewpoint.

7.1 Model scale

The DeepSeek-V4 series presents two models as a preview version.

| Model | Total params | Activated params/token | Context |
| --- | --- | --- | --- |
| DeepSeek-V4-Pro | 1.6T | 49B | 1M tokens |
| DeepSeek-V4-Flash | 284B | 13B | 1M tokens |

Both keep DeepSeekMoE, replace attention with the CSA/HCA hybrid, introduce mHC on the residual path, and use Muon as the optimizer. The pretraining corpus is on the order of 32T+ tokens.

7.2 CSA: compressed sparse attention

The CSA data plane is as follows.

hidden states H
  -> token-level compressor
  -> compressed KV entries C_comp, length ≈ n/m
  -> lightning indexer keys, length ≈ n/m
query h_t
  -> indexer query
  -> index scores over compressed blocks
  -> top-k selected compressed KV entries
  -> core MQA attention over selected blocks + sliding window KV

In DeepSeek-V4-Pro, the CSA compression rate is m=4, the attention top-k is 1024, there are 64 indexer query heads, and the indexer head dimension is 128. Flash uses a top-k of 512. This structure turns attention into “router → selected blocks → expert-like compute”.

System implication

CSA not only shrinks the KV cache; it also changes decode-time KV reads from all blocks to top-k blocks + local window. It does introduce irregular gather and top-k selection, though. The sparse attention kernel and the KV cache layout must therefore be co-designed. Page/block alignment, compressed block IDs, padding, and sequence boundary handling are important.

7.3 HCA: heavily compressed attention

HCA uses compression rate m'=128. It compresses far more strongly than CSA but does dense attention with no sparse top-k selection.

hidden states H
  -> token-level compressor over 128-token chunks
  -> heavily compressed KV entries, length ≈ n/128
  -> MQA over all compressed entries + sliding window KV

System implication

HCA keeps the global context as a coarse summary. Even 1M tokens become only about 7,812 compressed entries. Dense attention over about 7,812 entries is tiny compared to full dense attention over 1M tokens. HCA provides a “background memory” of global context, and CSA provides precise sparse retrieval.

7.4 SWA branch and attention sink

DeepSeek-V4 adds an uncompressed sliding window branch to both CSA and HCA, to compensate for the causality problem inside a compression block and for the importance of recent tokens. The window size is 128. It also uses the attention sink trick. The sink logit means the attention probability mass does not have to be fully distributed over actual tokens. Like StreamingLLM, this contributes to long-sequence stability.

7.5 Mixed KV precision and the FP4 indexer

DeepSeek-V4 describes KV storage as using BF16 for the RoPE dimensions and FP8 for the rest. The lightning indexer's attention computation uses FP4 precision. From a systems viewpoint this design reduces the KV cache payload and the indexer bandwidth at the same time. However, low precision needs scale metadata and dequant kernels, so the memory layout becomes more complex.

7.6 KV cache layout

Figure 6 of the DeepSeek-V4 report shows the KV cache split into two parts.

Classical KV cache: CSA/HCA compressed KV entries
State cache: SWA KV and the state of uncompressed tail tokens that are not yet compression-ready

SWA and tail tokens are sequence-specific state and are managed in a fixed-size state cache. CSA/HCA compressed entries go into the block-based KV cache. A block is designed to cover lcm(m, m') original tokens.

OS analogy

CSA/HCA block: page cache / compressed object
SWA state: per-process register file / recurrent state
Tail token: write buffer
Prefix hit: shared page mapping
On-disk KV: swap / persistent cache

7.7 On-disk KV cache

DeepSeek-V4 uses an on-disk KV cache to cut repeated prefill for shared-prefix requests. All CSA/HCA compressed entries are stored on disk. SWA KV is much larger, so three policies are proposed.

Full SWA caching: no redundant compute, but write-heavy
Periodic checkpointing: a trade-off between storage and recomputation
Zero SWA caching: store nothing and recompute the tail

This is almost identical to an OS page cache's write-back policy, checkpoint interval, and recompute-versus-I/O trade-off. In long-context serving, the SSD is not just storage but a prefix reuse tier.

7.8 mHC: the residual stream is also a memory layout problem

mHC expands the residual stream into multiple lanes and learns pre/post/residual mixing. Standard Hyper-Connections can be unstable, so mHC constrains the residual mapping matrix to the manifold of doubly stochastic matrices. From a systems viewpoint, mHC increases activation memory and pipeline communication. DeepSeek-V4 reduces the mHC overhead with fused kernels, recomputation, and DualPipe overlap.

The key point is that an architecture improvement typically comes with a system cost. mHC raises modeling quality but enlarges activation state. The mHC paper and the DeepSeek-V4 report therefore present recomputation and kernel fusion alongside it.

7.9 Expert parallelism: hiding all-to-all under compute

A DeepSeek-V4 MoE layer can be split into dispatch all-to-all, Linear1, activation, Linear2, and combine all-to-all. The report explains that if communication time is smaller than compute time, fine-grained wave scheduling can hide communication under compute. In DeepSeek-V4-Pro, compute per token-expert pair is approximated as 6 h d FLOPs and communication as 3 h bytes, and the full overlap condition is summarized as follows.

C / B <= V_comp / V_comm = 2d = 6144 FLOPs/Byte

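That is, each 1GB/s of interconnect bandwidth is enough for the communication to stay hidden under about 6.1TFLOP/s of compute. This formula is very directly applicable to bandwidth-aware system design. What matters is not simply “a faster network” but a compute/comm balance point matched to the model's hidden dimension and the compute intensity of the experts.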
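The balance point can be turned into a small calculator. This sketch takes d = 3072, which is implied by 2d = 6144 in the condition above; the function names and the 1 PFLOP/s example figure are hypothetical.

```python
def min_interconnect_bytes_per_s(compute_flops_per_s, d):
    """Smallest bandwidth B that still satisfies C / B <= 2d (full overlap)."""
    return compute_flops_per_s / (2 * d)

def can_overlap(compute_flops_per_s, interconnect_bytes_per_s, d):
    """True when communication fits entirely under expert compute."""
    return compute_flops_per_s / interconnect_bytes_per_s <= 2 * d

D = 3072  # implied by 2d = 6144 FLOPs/Byte

needed = min_interconnect_bytes_per_s(1e15, D)  # hypothetical 1 PFLOP/s of expert compute
print(f"1 PFLOP/s needs at least {needed / 1e9:.1f} GB/s of all-to-all bandwidth")
```

Doubling d halves the bandwidth needed for the same compute rate, which is why the ratio rather than the peak link speed is the quantity to design against.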

7.10 Post-training: OPD and the rollout service

DeepSeek-V4 says it replaced the previous mixed RL stage with OPD (On-Policy Distillation). Several domain-specific expert teachers are built, and a unified student matches the teacher distributions with reverse KL on its own trajectories. For full-vocabulary KL the logits are too large to store, so only the teacher's last-layer hidden state is cached, and the logits are reconstructed at training time by passing it through the prediction head.

This is memory system design.

naive:
  store teacher logits [tokens × vocab × teachers] -> infeasible

DeepSeek-V4 style:
  store teacher hidden [tokens × hidden]
  load the teacher head when needed
  compute logits on the fly with a TileLang kernel

The post-training infrastructure includes a preemptible rollout service, a token-granular WAL, a data format split for million-token rollouts, and a sandbox platform. In other words, model alignment is a system problem that combines distributed storage, logging, preemption, and sandbox orchestration.
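The size gap between cached logits and cached hidden states is easy to estimate. All numbers below (vocabulary size, hidden size, teacher count, token count) are hypothetical placeholders chosen for illustration, not values from the report, and the sketch assumes one cached hidden state per token per teacher.

```python
def teacher_cache_bytes(tokens, hidden, vocab, teachers, bytes_per_elem=2):
    """Bytes needed to cache full logits vs. last-layer hidden states."""
    logits = tokens * vocab * teachers * bytes_per_elem
    hidden_states = tokens * hidden * teachers * bytes_per_elem
    return logits, hidden_states

# Hypothetical sizes for illustration only.
logits_b, hidden_b = teacher_cache_bytes(
    tokens=1_000_000, hidden=7_168, vocab=128_000, teachers=4
)
print(f"logits: {logits_b / 1e12:.2f} TB, hidden: {hidden_b / 1e9:.1f} GB, "
      f"ratio: {logits_b / hidden_b:.1f}x")
```

The saving is always vocab / hidden, so it grows with vocabulary size; the price is an extra prediction-head matmul at training time.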

\newpage

8. Gemini, GPT, Llama, Qwen, Kimi, MiniMax: A Comparison Based on Public Information

This chapter is based on public information. Closed models do not disclose their architecture, so their low-level attention mechanisms cannot be determined.

8.1 OpenAI GPT line: context, compaction, tool/agent emphasis

Per OpenAI's public materials, GPT-5.4 is described as supporting 1M token context in ChatGPT/API/Codex, and the API docs state that GPT-5.4/5.4 Pro have a 1.05M context window. GPT-5.5 was likewise announced as getting a 1M context window in the API. GPT-5.1-Codex-Max is described as trained to work coherently on tasks spanning millions of tokens by moving across multiple context windows through “compaction”.

System implication:

The GPT frontier line emphasizes agent workflow and context lifecycle rather than a public architecture.
A 1M single window and multi-window compaction are different things. A single window is a KV cache and attention cost problem; compaction is a summarized state / memory transform / task continuity problem.
The “token-efficient reasoning” of GPT-5.4/5.5 shows that reasoning tokens have become a major axis of serving cost.

8.2 OpenAI gpt-oss: open-weight architecture signal

gpt-oss-120b is a publicly released open-weight reasoning model. The model card mentions an efficient MoE Transformer, large-scale distillation, and reinforcement learning. The Hugging Face model card describes 117B parameters, 5.1B active parameters, MXFP4-quantized MoE weights, and single 80GB GPU deployment. The config shows alternating sliding/full attention, 36 layers, 8 KV heads, and a sliding window of 128.

The system intuition this model offers is strong.

Weight: MXFP4 shrinks the active/total weight footprint.
Attention: sliding layers cut the long-range KV cache roughly in half.
MoE: 5.1B active params bound the per-token weight read.
Reasoning effort: low/medium/high controls the output token budget.

In other words, even for an open-weight GPT-style model, the HBM requirement is determined jointly by quantization + MoE + attention layout + reasoning budget.

8.3 Google Gemini line: multimodal long context and agentic workflows

The Gemini 2.5 technical report describes the Gemini 2.X family as combining advanced reasoning, multimodality, long context, and agentic capability. For Gemini 2.5 Pro it mentions processing 3 hours of video content. The Gemini 3 public page introduces Gemini 3.1 Pro, Gemini 3.1 Deep Think, Gemini 3 Flash/Flash-Lite, and others, and emphasizes reasoning, multimodal understanding, agentic coding/tool use, and a massive context window. Its benchmark table includes MRCR v2 1M pointwise long-context performance.

System implication:

For Gemini, the multimodal token stream is central. Video/audio/image tokenization and compression are a bigger system issue than text-only 1M context.
Long video makes the token count explode, so adaptive selection/compression is needed at the visual encoder stage.
Public information on the low-level attention architecture is limited, so from an OS viewpoint it has to be treated as “a black box whose long-context serving architecture is undisclosed”.

8.4 Meta Llama 4: MoE and the 10M context claim

Meta's Llama 4 page presents Llama 4 Scout as offering single-H100 efficiency and a 10M context window. The Meta blog describes Llama 4 as Meta's first MoE architecture. The Hugging Face release note summarizes Scout as about 109B total / 17B active and Maverick as about 400B total / 17B active.

System implication:

10M context is infeasible with a dense KV cache, so it is reasonable to assume that aggressive compression/sparsity/position scaling/eviction is needed in the architecture or the serving layer.
17B active params lower the weight bandwidth and make single-GPU or small-node serving more feasible.
For multimodal long context, the image token cache and encoder output cache matter in addition to the KV cache.

8.5 Qwen3: dense + MoE family, 256K/1M extendability

The Qwen3 technical report covers both dense and MoE architectures and presents models from 0.6B up to 235B. The Qwen3 repository mentions 256K long-context understanding and extendability up to 1M. Key traits of Qwen3 are multilingual support, reasoning, tool use, and an open-weight ecosystem.

System implication:

In the open-weight ecosystem, long-context serving has to be paired with a runtime such as vLLM/SGLang/TensorRT-LLM.
For 256K native versus 1M extendable, the “trained effective context” and the “API maximum context” must be distinguished.

8.6 Kimi K2: trillion-parameter MoE and MuonClip

The Kimi K2 technical report emphasizes a 1T total / 32B active MoE model, the MuonClip optimizer, pretraining on 15.5T tokens, and agentic intelligence. MuonClip combines Muon's token efficiency with QK-clip-based stability.

System implication:

The bottlenecks of a trillion-parameter MoE are expert residency and all-to-all.
Optimizer stability techniques tie directly to hardware utilization. A loss spike causes rollback, wasted FLOPs, and cluster scheduling inefficiency.
Agentic data synthesis and RL enlarge the post-training infrastructure.

8.7 MiniMax-M1 / MiniMax-01: Lightning Attention and 1M–4M context

MiniMax-M1 is an open-weight reasoning model that combines a hybrid MoE architecture with Lightning Attention and emphasizes 1M context and test-time compute scaling. MiniMax-01 mentions 1M training context and 4M inference extrapolation.

System implication:

Lightning/linear attention variants reduce long-context cost at the architecture level.
Disclosing the RL training cost, on the order of 512 H800 GPUs for 3 weeks, matters for the transparency of post-training system cost.
A long thinking budget greatly enlarges decoding bandwidth and the rollout service.

\newpage

9. HBM 최소화 기법 Taxonomy

9.1 Weight footprint 줄이기

Quantization

FP16/BF16 → FP8 → FP4/MXFP4
Weight-only quantization: inference HBM 줄임
QAT: training/post-training에서 quantization noise에 적응
Scale metadata: group/tile 단위 scale이 추가 HBM read를 만든다

GPT-OSS는 MXFP4 quantized MoE weights로 120B model을 single 80GB GPU에서 실행 가능하다고 설명한다. DeepSeek-V4는 post-training stage에서 MoE expert weight와 CSA indexer QK path에 FP4 QAT를 적용한다. FP4는 raw byte를 줄이지만, scale format과 dequant kernel의 efficiency가 중요하다.

MoE

MoE는 total parameter를 키우면서 active parameter를 제한한다. HBM 관점에서 문제는 두 가지다.

모든 expert weight를 HBM에 resident시킬 것인가?
expert가 remote GPU에 있을 때 activation all-to-all 비용을 compute로 숨길 수 있는가?

DeepSeek-V4는 fine-grained EP wave scheduling으로 dispatch/combine communication을 Linear1/Linear2 compute 아래 숨긴다. Hardware design은 peak bandwidth 자체보다 compute/comm ratio에 맞춰야 한다.

Conditional memory / Engram

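Engram uses a deterministic n-gram hash lookup, which lets a huge embedding table live in host memory. An MoE expert is hard to prefetch because routing is known only after the hidden state has been computed, whereas Engram's addresses are known from the input tokens alone. For an OS, that is the difference between a demand fetch on the critical path and a prefetch that can be issued ahead of time and overlapped with compute.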

9.2 KV cache footprint 줄이기

MQA/GQA

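Reduces the number of KV heads, shrinking the KV cache by a factor of H_q / H_kv relative to MHA. It is easy to implement and effective, but at 1M context it can still be insufficient.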

MLA

K/V를 latent vector로 joint compression한다. DeepSeek-V2/V3에서 검증되었다. K/V cache payload를 줄이지만 projection compute가 필요하다.

CSA/HCA

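Compresses along the sequence dimension. CSA applies sparse top-k selection after compression; HCA applies dense attention after heavy compression. The KV cache still grows linearly with context length, but in units of compressed blocks rather than tokens, so the slope drops by roughly the compression ratio.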

Sliding window / local state

Recent token만 보존한다. Long-range info는 다른 branch가 담당해야 한다.

KV precision

RoPE dimension은 BF16, 나머지는 FP8처럼 mixed format을 쓸 수 있다. RoPE/position-sensitive dimension은 precision-sensitive할 수 있으므로 homogeneous quantization보다 mixed format이 유리하다.

9.3 KV cache fragmentation 줄이기

PagedAttention

PagedAttention은 OS paging에서 영감을 받아 KV cache를 block 단위로 관리한다. vLLM은 near-zero waste와 prefix sharing을 목표로 한다. Fragmentation 때문에 batch size가 줄어드는 문제를 해결한다.

vAttention

vAttention은 CUDA virtual memory management API를 사용해 virtual address는 contiguous하게 유지하고 physical allocation을 demand paging한다. Attention kernel이 paging을 직접 알 필요를 줄이는 방향이다.

Prefix cache / RadixAttention

SGLang의 RadixAttention은 multi-turn, few-shot, agent program에서 shared prefix KV를 재사용한다. Prefix tree/radix tree와 KV page reference counting이 결합된다.

On-disk KV cache

DeepSeek-V4는 compressed KV를 SSD에 저장해 shared-prefix repeated prefill을 제거한다. 이 방식은 KV cache를 persistent object로 만든다.

9.4 Attention FLOPs 줄이기

Sparse attention: top-k blocks only
Linear attention: recurrence/kernel trick
Local/global hybrid: local dense + global sparse/compressed
Position extension only: FLOPs는 줄이지 않음
FlashAttention: FLOPs는 같지만 HBM traffic을 줄여 speedup

9.5 Decode step 수 줄이기

Speculative decoding

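A small draft model proposes several tokens and the large model verifies them in a single pass. With a suitable acceptance rule the large model's output distribution is preserved while the number of large-model passes drops. The cost is managing an additional draft model.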
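The benefit of speculative decoding depends on how often drafted tokens are accepted. The sketch below is a back-of-envelope model, assuming (this is an assumption, not something stated in this report) that each of the `k` drafted tokens is accepted independently with probability `alpha`. Under that assumption the expected number of tokens emitted per large-model pass is (1 - alpha^(k+1)) / (1 - alpha). The function name is illustrative.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model pass.

    Assumes each of the k drafted tokens is accepted independently with
    probability alpha; every pass yields at least one token.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    # Higher acceptance rates amortize each HBM-bound pass over more tokens.
    for alpha in (0.5, 0.7, 0.9):
        print(alpha, round(expected_tokens_per_pass(alpha, k=4), 2))
```

Real acceptance is neither independent nor constant across positions, so this is only a planning estimate for how many full passes a given draft quality might save.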

Medusa

Backbone에 multiple decoding heads를 붙여 다음 여러 token을 예측한다. Draft model이 필요 없고 integration이 쉽다. Extra head가 HBM/compute를 조금 추가하지만 full pass 수를 줄인다.

EAGLE

Feature-level speculative sampling이다. Second-to-top-layer feature를 예측해 efficient speculative decoding을 수행한다.

Multi-Token Prediction(MTP)

DeepSeek-V3/V4가 사용하는 training objective다. Inference에서 speculative decoding과 결합될 수 있고, training에서 representation을 미래 token 예측에 맞춘다.

\newpage

10. KV Cache Management: OS virtual memory와 LLM serving의 만남

10.1 KV cache는 process address space와 닮았다

Serving runtime에서 request는 process와 비슷하다. 각 request는 own context, KV pages, state cache, output buffer를 가진다. Shared system prompt는 shared library page와 비슷하다. Prefix cache는 copy-on-write mapping과 비슷하다.

OS analogy:

OS concept	LLM serving analog
Virtual address space	logical token positions / KV block table
Physical page	HBM KV block
Page table	request block table
Copy-on-write	shared prefix reuse 후 divergence
Swap	host/SSD KV offload
TLB/cache	block metadata cache
Page replacement	KV eviction policy
NUMA placement	GPU shard / NVLink domain placement

PagedAttention은 이 analogy를 명시적으로 사용한다. DeepSeek-V4의 heterogeneous KV cache는 이 analogy를 더 복잡하게 만든다. 왜냐하면 layer마다 KV type, compression ratio, visibility policy가 다르기 때문이다.

10.2 Fragmentation과 internal waste

Naive KV allocation은 maximum sequence length만큼 contiguous memory를 잡거나, request마다 variable-sized buffer를 할당한다. Generation length가 예측 불가능하므로 waste가 크다. PagedAttention은 fixed-size block으로 나눠 waste를 줄인다.

하지만 block size는 trade-off다.

큰 block: metadata overhead 작음, coalesced read 유리, internal fragmentation 큼
작은 block: memory efficiency 좋음, metadata/cache overhead 큼, kernel gather 복잡

DeepSeek-V4에서는 block size가 compression ratio m, m', kernel alignment, lcm(m, m')와도 결합된다. 이는 allocator가 architecture-specific이 될 수 있음을 의미한다.

10.3 Prefix sharing and copy-on-write

Production LLM workloads repeat the same system prompts, documents, and tool instructions. Prefix sharing reduces both prefill compute and KV memory.

Prefix sharing의 핵심은 다음이다.

prefix tokens -> KV blocks
KV blocks refcount++
request diverges -> append new blocks
if mutation needed -> copy-on-write
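The four steps above can be sketched as a toy refcounted block pool. This is a hypothetical illustration, not the vLLM or SGLang API: `BlockPool`, `share`, and `writable` are names invented here, and block payloads are omitted so that only the bookkeeping is visible.

```python
from dataclasses import dataclass, field


@dataclass
class BlockPool:
    """Toy refcounted KV block pool (hypothetical; payload copies omitted)."""
    refcount: dict = field(default_factory=dict)
    next_id: int = 0

    def alloc(self) -> int:
        block_id = self.next_id
        self.next_id += 1
        self.refcount[block_id] = 1
        return block_id

    def share(self, block_id: int) -> int:
        # Prefix hit: map the same physical block into another request.
        self.refcount[block_id] += 1
        return block_id

    def release(self, block_id: int) -> None:
        self.refcount[block_id] -= 1
        if self.refcount[block_id] == 0:
            del self.refcount[block_id]

    def writable(self, block_id: int) -> int:
        """Copy-on-write: return a private block id before mutation."""
        if self.refcount[block_id] == 1:
            return block_id
        self.release(block_id)
        return self.alloc()  # a real runtime would also copy the KV payload


pool = BlockPool()
prefix = [pool.alloc() for _ in range(3)]   # request A prefills the prefix
req_b = [pool.share(b) for b in prefix]     # request B reuses those blocks
req_b.append(pool.alloc())                  # B diverges: append a new block
req_b[2] = pool.writable(req_b[2])          # B mutates a shared block
```

After these steps request A still owns blocks 0 to 2 unchanged, while request B holds two shared blocks, one private copy, and one appended block.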

SGLang RadixAttention은 prefix tree를 사용해 structured generation program의 repeated prefix를 재사용한다. GPT/Codex류 agent workflow에서는 tool schema, repository context, instructions가 반복되므로 prefix cache hit가 중요하다.

10.4 SSD tier와 recomputation trade-off

Long-context prefix는 HBM에 계속 둘 수 없다. Host DRAM이나 SSD tier가 필요하다. 하지만 SSD random read latency와 bandwidth는 HBM에 비해 orders of magnitude 느리다. 따라서 SSD tier는 decode critical path에 직접 들어가면 안 된다. DeepSeek-V4의 on-disk KV cache는 shared-prefix prefill elimination에 초점을 둔다. Prefix hit 시 compressed KV를 읽어 prefill을 건너뛰고, SWA tail은 checkpoint/recompute policy로 복원한다.

Design principle:

Decode critical path: HBM resident 또는 compute-overlapped prefetch
Prefill shortcut: SSD compressed KV read 가능
Rare memory lookup: host/SSD tier + prefetch 가능하면 acceptable
Random small read: batching/coalescing 필요

10.5 Heterogeneous KV policy가 필요한 이유

DeepSeek-V4의 CSA/HCA/SWA는 서로 다른 cache policy를 갖는다.

CSA compressed main KV: sparse selected, long-lived
CSA indexer KV: top-k selection에 필요, low precision 가능
HCA KV: heavily compressed, dense over compressed entries
SWA KV: recent window only, high churn
Uncompressed tail: compression block이 완성될 때까지 buffer

단일 allocator policy로는 비효율적이다. OS에서도 anonymous page, file page, huge page, device memory, page cache policy가 다르듯, LLM serving도 KV object type별 allocator와 eviction policy가 필요하다.

\newpage

11. MoE와 Conditional Memory: sparse parameter가 만든 새 data plane

11.1 MoE: conditional computation

MoE layer는 token hidden state를 router에 넣고 top-k expert를 선택한다. 선택된 expert만 compute한다.

Data plane:

hidden states on each GPU
  -> router logits
  -> top-k expert ids
  -> dispatch activation to expert owner GPUs
  -> expert MLP compute
  -> combine output back to token owner
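A minimal sketch of the routing and dispatch-planning steps in this data plane, in plain Python. It assumes experts are placed contiguously across GPUs (`expert_id // experts_per_gpu` gives the owner), which is an illustrative placement rather than any specific framework's layout; the function names are hypothetical.

```python
from collections import defaultdict


def route_top_k(router_logits, k):
    """Return the k highest-scoring expert ids for each token."""
    return [
        sorted(range(len(row)), key=lambda e: row[e], reverse=True)[:k]
        for row in router_logits
    ]


def build_dispatch(expert_ids, experts_per_gpu):
    """Group (token, expert) pairs by the GPU that owns the expert."""
    send = defaultdict(list)
    for token, experts in enumerate(expert_ids):
        for e in experts:
            send[e // experts_per_gpu].append((token, e))
    return dict(send)


logits = [
    [0.1, 2.0, 0.3, 1.5],   # token 0
    [3.0, 0.2, 0.1, 0.4],   # token 1
]
ids = route_top_k(logits, k=2)
plan = build_dispatch(ids, experts_per_gpu=2)
print(plan)
```

The per-GPU lists in `plan` are what an all-to-all would carry. Their sizes are unknown until the router has run, which is why the dispatch cannot be scheduled before the hidden state exists.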

이 구조는 all-to-all communication을 만든다. Expert parallelism은 parameter capacity를 키우지만 network scheduling이 중요해진다.

MoE HBM issue

MoE가 active compute를 줄여도, expert weight를 어디에 둘지가 문제다.

모든 expert를 HBM에 resident: latency 좋음, HBM capacity 많이 필요
Expert offload: HBM 절약, routing 후 weight fetch latency 큼
Predictive prefetch: router decision이 hidden state 이후에 나오므로 어려움

따라서 대부분의 MoE serving은 expert weights를 GPU memory에 올리고 activation을 이동시킨다. DeepSeek-V4 slide도 “MoE depends on runtime hidden state; you have to run to certain stage to know which expert you need; so all expert weights should stay in HBM”라는 intuition을 제시한다.

11.2 Engram: conditional memory

Engram은 input token n-gram을 hash해서 embedding table row를 lookup한다. Retrieval index가 input token에만 의존하므로 forward pass 전에 알 수 있다.

Data plane:

input token ids
  -> tokenizer compression
  -> n-gram hash ids
  -> async prefetch embedding rows from host/SSD/HBM cache
  -> layer 2 or 15에서 hidden state와 context-aware gating
  -> residual add
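The key property of this data plane is that row addresses depend only on token ids. The sketch below illustrates that property with a placeholder polynomial hash; it is not the hash function used by Engram, and `ngram_row_ids`, `plan_prefetch`, and the table size are assumptions made for the example.

```python
def ngram_row_ids(token_ids, n, table_rows):
    """Map each n-gram ending at position i to a table row, using token ids only.

    The polynomial hash is a placeholder, not Engram's actual hash.
    """
    rows = []
    for i in range(n - 1, len(token_ids)):
        h = 0
        for t in token_ids[i - n + 1 : i + 1]:
            h = (h * 1_000_003 + t) % table_rows
        rows.append(h)
    return rows


def plan_prefetch(rows, hbm_cache):
    """Split rows into HBM-cache hits and rows to fetch asynchronously."""
    hits = [r for r in rows if r in hbm_cache]
    misses = sorted({r for r in rows if r not in hbm_cache})
    return hits, misses


tokens = [11, 42, 7, 42, 7]
rows = ngram_row_ids(tokens, n=2, table_rows=1 << 20)
# Pretend the first row is already cached in HBM.
hits, misses = plan_prefetch(rows, hbm_cache={rows[0]})
```

Because `rows` is available before the forward pass starts, the `misses` list can be issued as a host-to-device copy that overlaps with earlier layers. Repeated n-grams map to the same row, so deduplication reduces the fetch volume.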

이 구조는 MoE와 다르다.

Aspect	MoE	Engram
sparsity type	conditional computation	conditional memory
routing input	runtime hidden state	input token IDs
prefetch 가능성	낮음	높음
HBM residency	expert weights 대부분 필요	hot rows만 HBM 가능
communication	all-to-all activation	host/GPU row fetch
semantic role	dynamic reasoning capacity	static local pattern/knowledge lookup

11.3 Engram의 modeling insight

Engram paper는 language modeling을 “compositional reasoning + knowledge retrieval”로 본다. Transformer는 retrieval primitive가 없어서 static n-gram/entity pattern도 attention+FFN computation으로 재구성한다. Engram은 n-gram lookup을 explicit memory primitive로 넣는다.

Ablation과 analysis는 다음 intuition을 준다.

Engram-27B는 iso-parameter/iso-FLOPs MoE-27B보다 MMLU, BBH, HumanEval, MATH 등에서 개선된다.
Sparse budget allocation에서 pure MoE보다 MoE+Engram hybrid가 좋고, optimum은 MoE inactive parameter budget 약 75–80% / Engram 20–25% 근처로 관찰된다.
LogitLens/CKA 분석은 Engram shallow layer representation이 MoE deeper layer와 비슷해져 effective depth를 늘리는 것으로 해석한다.
Long-context RULER에서 Engram은 local dependency를 lookup으로 처리해 attention capacity를 global context에 쓰도록 만든다.

11.4 Engram의 system efficiency

Engram은 100B-parameter table을 host memory에 두고 H800 single GPU에서 Dense-4B/8B backbone에 붙였을 때 throughput penalty가 3% 미만이라고 보고한다. 실험은 모든 retrieval을 PCIe로 보내는 conservative baseline이며, Zipfian locality를 이용해 hot n-gram을 HBM에 cache하면 더 좋아질 수 있다.

이 결과가 bandwidth-aware design에 주는 메시지는 다음이다.

Parameter count가 HBM capacity requirement와 반드시 같지 않다.
Address determinism이 있으면 host memory도 model parameter tier가 된다.
Memory hierarchy-aware architecture는 OS prefetcher와 cache policy를 model semantics에 맞춰 설계할 수 있다.

11.5 Sparse parameter의 future: MoE + Memory + Retrieval + Tool

다음 generation의 sparse model은 네 가지 sparse primitive가 결합될 가능성이 크다.

MoE expert: dynamic computation
Engram/memory layer: static pattern lookup
External retrieval/RAG: editable non-parametric memory
Tool execution: computation을 model 밖으로 offload

OS/system engineer에게 이는 “which state lives where?”라는 memory placement 문제다. Expert weight, memory rows, retrieved documents, tool outputs가 서로 다른 latency/bandwidth/correctness property를 갖기 때문이다.

\newpage

12. Data Movement Order-of-Scale: bandwidth / latency / capacity 계산

12.1 Hardware bandwidth reference

대표적인 public spec 기준:

Hardware	HBM capacity	HBM bandwidth	Notes
NVIDIA H100 SXM	80GB	3.35TB/s	FP8 Tensor Core ~3.958PFLOP/s with structured sparsity (about half that for dense)
NVIDIA H200	141GB	4.8TB/s	HBM3e; more capacity and bandwidth than H100
NVIDIA DGX B200 (8 GPUs)	1.44TB total	64TB/s aggregate	8x B200, FP4 support
NVIDIA GB200 NVL72	(72 Blackwell GPUs)	(rack-scale)	Single NVLink domain; the quoted 130TB/s is aggregate NVLink bandwidth, not HBM bandwidth

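As rough orders of magnitude: HBM delivers TB/s per GPU; NVLink delivers hundreds of GB/s up to TB/s aggregate per GPU; PCIe Gen5 x16 delivers tens of GB/s; host DRAM delivers hundreds of GB/s per CPU socket; NVMe SSDs deliver a few GB/s to tens of GB/s sequentially, with random-read latency in the tens to hundreds of µs. Any read that leaves HBM on the decode critical path is therefore risky unless it is prefetched or overlapped with compute.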

12.2 Decode token의 lower-bound budgeting

A lower bound on decode token latency can be written as the maximum of the following terms.

T_decode >= max(
  weight_read_bytes / HBM_BW,
  KV_read_bytes / HBM_BW,
  GEMM_FLOPs / compute_peak,
  communication_bytes / interconnect_BW - overlapped_compute_time,
  scheduler/kernel_overhead
)

실제는 max가 아니라 pipeline overlap과 dependency가 섞이지만, bottleneck intuition에는 충분하다.
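The bound above can be turned into a small calculator that also reports which term dominates. This is a sketch under the same simplification as the formula (a pure max over independent terms); the function name, argument names, and the example numbers are illustrative assumptions.

```python
def decode_lower_bound_s(
    weight_read_bytes,
    kv_read_bytes,
    gemm_flops,
    comm_bytes,
    hbm_bw,
    compute_peak,
    interconnect_bw,
    overlapped_compute_s=0.0,
    overhead_s=0.0,
):
    """Return each term of the decode lower bound (seconds) and their max."""
    terms = {
        "weight_read": weight_read_bytes / hbm_bw,
        "kv_read": kv_read_bytes / hbm_bw,
        "gemm": gemm_flops / compute_peak,
        # Only communication that is not hidden behind compute counts.
        "exposed_comm": max(0.0, comm_bytes / interconnect_bw - overlapped_compute_s),
        "overhead": overhead_s,
    }
    terms["lower_bound"] = max(terms.values())
    return terms


# Illustrative inputs only: bytes, FLOPs, and bandwidths are assumed values.
result = decode_lower_bound_s(
    weight_read_bytes=6.1e9,
    kv_read_bytes=6.55e9,
    gemm_flops=1e11,
    comm_bytes=8e6,
    hbm_bw=3.35e12,
    compute_peak=1e15,
    interconnect_bw=4.5e11,
)
bottleneck = max((k for k in result if k != "lower_bound"), key=result.get)
```

Weight reads and KV reads share the same HBM channel, so summing those two terms gives a tighter bound than taking their max. The max form is kept here only to mirror the formula in the text.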

Example: 1M context GQA-8

KV read/token ≈ 305GiB (≈ 327.5GB)
H100 HBM 3.35TB/s
Lower bound ≈ 98ms/token

This is unacceptable for interactive decoding. At 1M context, dense GQA-8 full attention is hard on bandwidth alone.

Example: DeepSeek-V4 2% GQA8 baseline

2% of the 305GiB above is about 6.1GiB/token (≈ 6.5GB), which gives an H100 lower bound of about 2.0ms/token. In practice selected-block gather, indexer compute, MLP/MoE, and communication are added on top. Even so, the order of scale is completely different.

Example: active weight read

49B active FP8 parameter를 8 GPU에 shard하면 GPU당 약 6.1GB/token이다. H100 lower bound 약 1.8ms/token이다. 따라서 1M context에서 KV compression이 충분히 되면 attention KV read와 active weight read가 비슷한 scale로 내려온다. 이것이 long-context efficient architecture의 목표다.
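The three examples in this subsection can be reproduced with a few lines of arithmetic. The model shape below (80 layers, 8 KV heads, head dimension 128, BF16) is an assumption chosen because it yields roughly 305GiB at one million tokens; the report does not state which shape it used. Note the GiB-to-bytes conversion before dividing by a bandwidth quoted in decimal TB/s.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem):
    """K and V bytes for one token across all layers: 2 * L * H_kv * D_head * bytes."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


# Assumed GQA-8 shape (hypothetical): 80 layers, 8 KV heads, head_dim 128, BF16.
per_token = kv_bytes_per_token(80, 8, 128, 2)
context = 1_000_000
kv_read = per_token * context          # bytes read per decode step, dense
hbm_bw = 3.35e12                       # H100 SXM, bytes/s

dense_ms = kv_read / hbm_bw * 1e3
sparse_ms = 0.02 * kv_read / hbm_bw * 1e3      # 2% of the dense KV read
weight_ms = (49e9 / 8) / hbm_bw * 1e3          # 49B FP8 params over 8 GPUs

print(f"dense KV read: {kv_read / GIB:.1f} GiB -> {dense_ms:.1f} ms/token")
print(f"2% KV read:    {0.02 * kv_read / GIB:.1f} GiB -> {sparse_ms:.2f} ms/token")
print(f"active weights per GPU -> {weight_ms:.2f} ms/token")
```

With these assumptions the compressed KV read and the sharded weight read land within a few percent of each other, which is the balance point the text describes.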

12.3 MoE dispatch bandwidth

DeepSeek-V4-Pro의 hidden dimension은 7168이다. Report의 근사에 따르면 token-expert pair communication은 FP8 dispatch와 BF16 combine으로 3h bytes이다.

3 * 7168 = 21,504 bytes ≈ 21KB / token-expert-pair

Top-6 expert를 모두 remote로 보낸다고 worst-case 근사하면 layer당 token당 약 129KB communication이다. 61 layers면 약 7.9MB/token이다. 실제는 local expert hit, batching, parallelism, overlap 때문에 달라지지만, MoE communication은 KV cache와 weight read에 비해 작아 보일 수 있다. 그러나 interconnect latency와 all-to-all synchronization이 있기 때문에 kernel fusion과 wave scheduling이 필요하다.
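The worst-case arithmetic above, written out so that hidden size, top-k, and layer count can be varied. The defaults of 1 byte per element for FP8 dispatch and 2 bytes for BF16 combine follow the 3h-bytes approximation in the text; the function name is illustrative.

```python
def moe_dispatch_bytes(hidden, top_k, layers, dispatch_bytes=1, combine_bytes=2):
    """Worst-case MoE communication per token, assuming every expert is remote."""
    per_pair = hidden * (dispatch_bytes + combine_bytes)   # 3h bytes
    per_layer = per_pair * top_k
    return {
        "per_pair": per_pair,
        "per_layer": per_layer,
        "per_token": per_layer * layers,
    }


est = moe_dispatch_bytes(hidden=7168, top_k=6, layers=61)
print(est)
```

The per-token total is small next to the GB-scale weight and KV reads, which is why latency and synchronization, not raw volume, tend to dominate the MoE communication cost.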

12.4 Prefill order-of-scale

Prefill은 attention FLOPs가 dominant일 수 있다.

Dense attention layer의 rough FLOPs:

QK + AV ≈ 4 * S^2 * H_q * D_head

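At S = 1M, S² = 10¹², so each layer costs on the order of 4 × 10¹² × H_q × D_head FLOPs for attention alone. Dense full prefill at this length is close to infeasible. FlashAttention reduces score-matrix memory traffic but does not remove the S² compute. Million-token prefill therefore needs sparse, compressed, or linear attention, or chunked retrieval/compaction.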
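To put a number on the quadratic term, the sketch below evaluates the rough formula for an assumed model shape. The 64 query heads, head dimension 128, and 60 layers are illustrative assumptions, not figures from any model discussed here, and 1 PFLOP/s is an idealized sustained rate.

```python
def dense_attention_flops(seq_len, q_heads, head_dim, layers=1):
    """QK^T plus AV FLOPs for full attention: 4 * S^2 * H_q * D_head per layer."""
    return 4.0 * seq_len ** 2 * q_heads * head_dim * layers


# Assumed shape (hypothetical): 64 query heads, head_dim 128, 60 layers.
flops = dense_attention_flops(1_000_000, 64, 128, layers=60)
seconds_at_1pflops = flops / 1e15

print(f"{flops:.3e} FLOPs, about {seconds_at_1pflops / 60:.0f} min at 1 PFLOP/s")
```

Causal masking roughly halves this count, but the result is still tens of minutes of perfectly utilized compute for a single prompt under these assumptions.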

DeepSeek-V4는 training sequence length를 4K → 16K → 64K → 1M로 늘리고, sparse attention은 64K부터 도입한다. 이는 model을 dense short context로 안정화한 뒤 sparse long context로 curriculum을 바꾸는 방식이다.

12.5 Training memory order-of-scale

Mixed-precision Adam training에서 parameter당 state는 보통 다음 정도다.

BF16 param:       2 bytes
BF16 grad:        2 bytes
FP32 master:      4 bytes
FP32 Adam m/v:    8 bytes
-------------------------
Total:           16 bytes / parameter

1T parameter를 naive Adam으로 train하면 model state만 16TB다. ZeRO/FSDP가 필수다. MoE에서는 total parameter가 커도 active compute는 작지만, optimizer state와 checkpoint는 total parameter에 비례한다. 따라서 training system은 sharding, offload, checkpointing, optimizer choice를 함께 설계해야 한다.
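The 16 bytes/parameter budget and the effect of sharding can be checked with a short calculation. The per-rank figure assumes idealized full sharding of parameters, gradients, and optimizer state (ZeRO-3 / FSDP style) and ignores activations, communication buffers, and fragmentation; the rank count is an arbitrary example.

```python
BYTES_PER_PARAM = {
    "bf16_param": 2,
    "bf16_grad": 2,
    "fp32_master": 4,
    "fp32_adam_m": 4,
    "fp32_adam_v": 4,
}


def model_state_bytes(num_params):
    """Total model-state bytes for mixed-precision Adam."""
    return num_params * sum(BYTES_PER_PARAM.values())


def per_rank_bytes(num_params, ranks):
    """Idealized fully sharded model state per rank (no activations or buffers)."""
    return model_state_bytes(num_params) / ranks


total_tb = model_state_bytes(1e12) / 1e12        # 16.0 TB for 1T parameters
per_gpu_gb = per_rank_bytes(1e12, 1024) / 1e9    # 15.625 GB on 1024 ranks
print(total_tb, per_gpu_gb)
```

For an MoE model `num_params` is the total parameter count, not the active count, since optimizer state exists for every expert.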

Muon은 Adam과 다른 optimizer state를 가지지만, matrix-wise orthogonalized update가 필요해 full gradient matrix access와 ZeRO bucket assignment가 문제가 된다. DeepSeek-V4는 Muon을 위해 hybrid ZeRO strategy, BF16 gradient communication, batched Newton-Schulz를 사용한다.

12.6 Long-context training activation

Activation memory rough formula:

Activation ≈ batch * seq * hidden * layers * bytes * factor

여기에 attention intermediate, MLP intermediate, residual/mHC lanes, MoE router state가 붙는다. FlashAttention과 activation checkpointing이 없으면 long context training은 불가능하다. DeepSeek-V4는 tensor-level activation checkpointing을 구현해 필요한 tensor만 checkpoint/recompute한다. 이는 module-level checkpointing보다 fine-grained trade-off를 가능하게 한다.

\newpage

13. Training Step: pretraining data plane와 memory wall

13.1 Training step의 기본 data plane

Pretraining step은 다음 pipeline이다.

CPU/data loader:
  document packing, tokenization, sample mask
GPU forward:
  embedding -> attention -> MLP/MoE -> logits -> loss
GPU backward:
  dlogits -> layer gradients -> activation gradients
Distributed:
  all-reduce/reduce-scatter/all-gather/all-to-all
Optimizer:
  update parameters, optimizer states
Checkpoint:
  periodic snapshot, RNG/data position state

Unlike inference, training must store activations or recompute them for the backward pass. Optimizer state is also huge. In distributed training, gradient communication occurs at every step.

13.2 Parallelism taxonomy

Parallelism	Split dimension	Main communication	Memory effect	Bottleneck
Data Parallel	batch	gradient all-reduce/reduce-scatter	model replica 중복	optimizer/grad comm
Tensor Parallel	hidden/head/intermediate	all-reduce/all-gather per layer	weight shard	latency-sensitive layer comm
Pipeline Parallel	layers	activation send between stages	stage별 layer shard	bubble, activation comm
Expert Parallel	experts	MoE dispatch/combine all-to-all	expert shard	all-to-all, load balance
Context/Sequence Parallel	sequence	attention KV exchange/all-gather	activation/KV shard	long context attention comm
ZeRO/FSDP	parameter state	param all-gather, grad reduce-scatter	optimizer/grad/param shard	bandwidth, prefetch

Modern training은 이들을 4D/5D로 조합한다. DeepSeek-V3는 DualPipe, EP, ZeRO-1, FP8 training을 조합했고, DeepSeek-V4는 Muon/mHC/CSA-HCA 때문에 더 복잡한 framework를 만든다.

13.3 Activation checkpointing과 recomputation

Activation checkpointing은 forward activation을 저장하지 않고 backward 때 재계산한다. Memory를 줄이는 대신 compute가 증가한다. Long context에서는 activation memory가 너무 커서 checkpointing이 필수다.

DeepSeek-V4의 tensor-level checkpointing은 developer가 개별 tensor에 annotation을 달면 TorchFX로 minimal recomputation graph를 찾아 backward에 삽입한다. 이 방식은 OS의 demand paging과 비슷한 철학이다.

memory pressure high:
  don't store tensor
  store enough graph metadata
  recompute just before gradient needs it

mHC처럼 activation state가 늘어나는 architecture에서는 module-level checkpointing이 너무 coarse하다. Tensor-level policy가 필요하다.

13.4 Context parallelism for compressed attention

When the sequence dimension is split across ranks for long-context training, a compression group can straddle a rank boundary. DeepSeek-V4 handles this with two communication steps followed by a fused local operator.

Rank i sends its last m uncompressed KV entries to rank i+1.
Each rank builds fixed-length compressed entries and all-gathers them.
A fused select-and-pad operator moves padding to the tail and assembles the full compressed KV set.

이는 compression이 local partition invariant를 깨기 때문에 필요한 절차다. System적으로는 “operator semantic이 sharding boundary에 영향을 준다”는 중요한 예다. 기존 context parallelism은 dense token sequence를 나누면 됐지만, compressed attention은 group boundary와 padding policy를 고려해야 한다.

13.5 MoE training stability와 load balancing

MoE training은 router가 특정 expert로 몰리면 load imbalance와 outlier가 생긴다. DeepSeek-V3/V4는 auxiliary-loss-free load balancing, sequence-wise balance loss, anticipatory routing, SwiGLU clamping 등을 사용한다. DeepSeek-V4는 loss spike가 MoE outlier와 routing mechanism에 묶여 있다고 보고, backbone parameter와 routing index update를 decouple하는 anticipatory routing을 제안한다.

System 관점에서 loss spike는 단순 model quality 문제가 아니다.

Rollback이 필요하면 cluster time이 낭비된다.
Checkpoint frequency가 증가한다.
Debugging을 위해 deterministic kernel이 필요하다.
Monitoring metric과 anomaly trigger가 필요하다.

DeepSeek-V4가 batch-invariant, deterministic kernel libraries를 강조하는 이유도 여기에 있다. Determinism은 large-scale training에서 reproducible debugging을 가능하게 하는 system property다.

13.6 FP8/FP4 training과 QAT

FP8 training은 HBM traffic과 memory footprint를 줄인다. 그러나 scale management, accumulation precision, deterministic reduction이 중요하다. DeepSeek-V3는 FP8 mixed precision training을 매우 큰 scale에서 검증했다고 설명한다. DeepSeek-V4는 post-training QAT로 MoE expert weights와 CSA indexer QK path를 FP4에 적응시킨다.

FP4 QAT의 data plane:

optimizer master FP32 weight
  -> quantize FP4 for forward
  -> dequantize to FP8 for compute
  -> backward gradients through STE
  -> update FP32 master

Inference/rollout phase에서는 native FP4 quantized weight를 사용해 online deployment behavior와 sampling behavior를 맞춘다. 이 일관성은 RL training에서 중요하다. Training-time simulated quantization과 serving-time native quantization이 다르면 rollout distribution이 달라질 수 있다.

\newpage

14. DPO, RLHF, GRPO, OPD: post-training을 system workload로 해석하기

14.1 Classic RLHF pipeline

Classic RLHF는 보통 다음 단계다.

SFT: high-quality instruction data로 supervised fine-tuning
Reward model training: human preference pair로 scalar reward model 학습
PPO-style RL: policy가 rollout을 생성하고 reward model과 KL penalty로 update

System적으로 PPO RLHF는 heavy하다. 최소한 다음 model copies가 필요하다.

Policy model: trainable
Reference model: KL 계산용 frozen
Reward model: scalar reward
Value model/critic: advantage estimation

Rollout generation은 inference workload이고, policy update는 training workload다. GPU cluster는 generation과 training을 번갈아 또는 pipeline으로 실행해야 한다.

14.2 DPO: online RL을 supervised pair loss로 치환

DPO는 preference pair (prompt, chosen, rejected)를 사용해 reward model과 PPO loop 없이 policy를 직접 최적화한다. DPO의 핵심은 KL-constrained reward maximization 문제의 closed-form policy를 이용해 pairwise classification loss로 바꾸는 것이다.

DPO data plane:

batch of prompt/chosen/rejected
  -> policy forward for chosen/rejected logprobs
  -> reference logprobs (precompute 가능)
  -> DPO loss
  -> backward/update policy
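The "DPO loss" step in this data plane is the standard pairwise objective, -log σ(β(Δ_policy - Δ_ref)). A minimal scalar sketch follows; it operates on already-summed sequence log-probabilities, and the default `beta=0.1` is an assumed hyperparameter rather than a value from this report.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair from summed sequence log-probabilities."""
    delta_policy = policy_chosen - policy_rejected
    delta_ref = ref_chosen - ref_rejected
    margin = beta * (delta_policy - delta_ref)
    # -log(sigmoid(margin)) computed as softplus(-margin) for numerical stability.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy equal to reference: margin is zero, loss is log(2).
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
# Policy prefers the chosen response more than the reference does: loss drops.
print(dpo_loss(-5.0, -9.0, -7.0, -7.0))
```

Only `policy_chosen` and `policy_rejected` need a forward pass with gradients. The two reference terms are constants per pair, which is why they can be precomputed offline.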

장점:

Online rollout 없음
Reward/value model 없음
Hyperparameter와 stability가 PPO보다 쉬움
Training stack은 SFT와 유사

단점:

On-policy exploration이 없음
Preference dataset coverage에 의존
Reasoning token budget, tool-use behavior를 직접 optimize하기 어렵다

From an OS/system perspective, DPO is "alignment as supervised training". GPU memory pressure comes mainly from policy activations and the reference logprob computation. If reference logprobs are precomputed offline, the training step becomes very simple.

14.3 PPO/RLHF: rollout service가 bottleneck

PPO-style RL은 policy가 현재 상태에서 sample을 생성하고, reward를 받아 update한다. Data plane은 다음과 같다.

rollout worker:
  prompt prefill
  autoregressive decode
  store tokens, logprobs, values, KL refs
  run reward model/verifier/tool environment
trainer:
  load trajectories
  compute advantages
  minibatch PPO loss
  update policy

Long-context reasoning model에서는 prompt + reasoning + answer가 길다. Rollout의 KV cache가 커지고, reward/verifier가 tool execution을 포함하면 CPU/sandbox latency가 생긴다. Rollout data는 token-level logprob와 reward metadata를 포함하므로 CPU DRAM과 storage traffic도 커진다.

14.4 GRPO: value model을 제거하는 reasoning RL

GRPO(Group Relative Policy Optimization)는 PPO variant로, group 내 여러 sample의 reward를 상대적으로 normalize해 advantage를 만든다. DeepSeekMath와 DeepSeek-R1에서 사용되었다. 핵심 system benefit은 critic/value model을 제거해 memory와 compute를 줄인다는 점이다.

GRPO data plane:

for each prompt:
  sample G completions from policy
  score each completion with rule/verifier/reward
  group-normalize rewards -> advantages
  policy update with KL/reference term
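The "group-normalize rewards -> advantages" step needs no value model, only the rewards of the G samples for one prompt. A minimal sketch, assuming mean/standard-deviation normalization with a small epsilon; the choice of population standard deviation and the epsilon value are assumptions made here.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's sample group.

    Every token of completion i then shares advantage[i]; no critic is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt scored by a rule-based verifier (1 = correct).
adv = group_advantages([1.0, 0.0, 0.0, 1.0])
print(adv)
```

If every sample in a group gets the same reward, all advantages are zero and the prompt contributes no gradient signal, which is one reason group size and prompt difficulty matter for rollout efficiency.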

Trade-off:

Value model 제거 → model copies 감소
Group sampling G개 → rollout token 수 증가
Rule-based verifier가 가능하면 reward model inference 감소
Hard-to-verify task는 generative reward model 또는 human/AI judge 필요

System적으로 GRPO는 “critic memory를 rollout bandwidth로 바꾸는” 방식이다. Math/code처럼 verifier가 싸고 정확한 domain에서는 매우 강하다. Agentic task에서는 tool environment가 bottleneck이 될 수 있다.

14.5 DeepSeek-V4 OPD: teacher ensemble을 logits-level로 merge

DeepSeek-V4는 specialist expert model을 domain별로 훈련한 뒤, unified student에 OPD로 distill한다. Student가 자기 trajectory를 생성하고, 여러 teacher distribution과 reverse KL을 맞춘다.

Naive full-vocabulary OPD는 불가능에 가깝다.

tokens × vocab(>100k) × teachers(>10) × bytes

DeepSeek-V4는 teacher logits를 저장하지 않고 last-layer hidden state만 cache한다. Training time에 teacher prediction head를 load해 full logits를 재구성한다. Teacher head는 sample을 teacher index로 정렬해 mini-batch당 하나만 HBM에 올린다. Hidden state loading/offloading은 async background로 overlap한다.

이 design은 system적으로 매우 중요하다.

Full logits materialization을 hidden-state cache로 대체
Teacher weight를 centralized distributed storage에 offload
Teacher head residency를 scheduling으로 제한
KL computation을 specialized TileLang kernel로 수행

14.6 Generative Reward Model

DeepSeek-V4는 hard-to-verify task에서 scalar reward model 대신 Generative Reward Model(GRM)을 사용한다고 설명한다. Actor network가 judging capability도 학습한다. 이는 reward evaluation이 단순 scalar inference가 아니라 reasoning generation이 될 수 있음을 뜻한다.

System impact:

Reward scoring도 decode workload가 된다.
Judge reasoning token이 추가된다.
Reward model과 policy model의 role이 섞이면서 caching/reuse 기회가 생길 수 있다.
Rubric-guided data와 model-as-judge pipeline의 reproducibility가 중요해진다.

14.7 DPO vs GRPO vs OPD: system cost 비교

Method	Online sampling	Extra models	Main memory pressure	Main bottleneck	적합 domain
SFT	no	none	activations	training throughput	instruction imitation
DPO	no	reference	chosen/rejected forward	policy/ref logprob	preference alignment
PPO RLHF	yes	ref, reward, value	rollout KV, value/reward copies	generation + reward	broad alignment
GRPO	yes	ref, verifier/reward	group rollout tokens	sampling volume	math/code/verifiable reasoning
OPD	yes	many teachers	teacher hidden/logit reconstruction	teacher scheduling/KL	specialist merge

Bandwidth-aware research에서는 DPO를 baseline으로 삼고, RL/OPD에서 추가되는 data movement를 분해하는 것이 좋다. 특히 rollout KV cache, trajectory serialization, reference logprob computation, reward/verifier execution을 별도 counters로 측정해야 한다.

\newpage

15. Agentic AI Training Infrastructure: sandbox, rollout, WAL, environment

15.1 Agent training은 GPU job이 아니라 distributed OS workload다

Agentic AI는 model이 tool을 호출하고 environment와 상호작용한다. Coding agent라면 repository checkout, file edit, test run, compiler error, shell command, package install이 포함된다. Search agent라면 browser/search/fetch/Python이 포함된다. 이는 GPU만으로 끝나지 않는다.

System components:

GPU rollout worker
Tool gateway
Sandbox runtime(container/microVM/fullVM)
File system / object store
Trajectory logger
Reward/verifier service
Scheduler/preemption manager
Dataset builder / replay buffer

DeepSeek-V4 DSec sandbox는 Rust components(API gateway, per-host agent, watcher), 3FS distributed filesystem, function call/container/microVM/fullVM substrate, trajectory logging, preemption-safe resumption을 설명한다. 이는 agent RL이 ML training과 cloud OS가 결합된 workload임을 보여준다.

15.2 Token-granular WAL의 필요성

Rollout generation 중 task가 preempt되거나 GPU error가 나면 어떻게 해야 할까? Naively 처음부터 regenerate하면 sample length bias가 생긴다. 짧은 response는 interruption 전에 완료될 확률이 높고, 긴 response는 더 자주 regenerate되어 distribution이 왜곡된다.

DeepSeek-V4는 token-granular WAL을 사용한다.

for each generated token:
  append token to WAL
on preemption:
  save unfinished KV cache if possible
on resume:
  replay persisted tokens / restore KV
on fatal error:
  rerun prefill from WAL tokens to reconstruct KV
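A minimal sketch of the append and replay paths above, using a local JSON-lines file. This is an illustration of the idea, not the DeepSeek-V4 log format: `TokenWAL`, the record layout, and the per-token `fsync` are all assumptions, and KV save/restore is left out.

```python
import json
import os
import tempfile


class TokenWAL:
    """Append-only per-rollout token log (illustrative format)."""

    def __init__(self, path):
        self.path = path
        self.file = open(path, "a", encoding="utf-8")

    def append(self, token_id):
        self.file.write(json.dumps({"t": token_id}) + "\n")
        self.file.flush()
        os.fsync(self.file.fileno())   # make the token durable before moving on

    def close(self):
        self.file.close()

    @staticmethod
    def replay(path):
        """Return persisted tokens, stopping at a torn tail record if present."""
        tokens = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                try:
                    tokens.append(json.loads(line)["t"])
                except (json.JSONDecodeError, KeyError, TypeError):
                    break
        return tokens


path = os.path.join(tempfile.mkdtemp(), "rollout-0.wal")
wal = TokenWAL(path)
for tok in (101, 7, 2048):
    wal.append(tok)
wal.close()                            # simulate preemption
resumed = TokenWAL.replay(path)        # tokens to re-prefill and continue from
```

Syncing on every token is the simplest way to show the durability point but would be costly at scale; a production runtime would more plausibly batch records or rely on a replicated log.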

OS 관점에서 이는 database logging과 같다. Rollout correctness는 sampling distribution의 correctness이고, WAL은 이를 보존한다.

15.3 Sandbox determinism과 replay

Agent trajectory는 command와 output의 sequence다. Non-idempotent command가 있을 수 있으므로 retry가 위험하다. DSec는 globally ordered trajectory log를 유지하고, preemption 후 이미 완료된 command result를 replay해 recovery를 빠르게 한다.

Research opportunity:

Sandbox syscall trace와 LLM token trace를 하나의 causal log로 관리
Tool output compression과 KV prefix reuse 결합
Environment snapshot delta와 model rollout WAL 통합
Deterministic replay 가능한 agent benchmark runtime

15.4 GPU/CPU/storage coupling in agent workloads

Agent loop에서는 GPU가 tool result를 기다리며 idle할 수 있다. 반대로 CPU sandbox가 GPU rollout을 기다릴 수 있다. Tool result가 길면 다시 prefill이 발생하고 KV cache가 커진다. 따라서 scheduler는 다음을 함께 봐야 한다.

GPU decode occupancy
CPU sandbox queue
file system I/O
network search latency
context length growth
prefix cache hit probability
tool result truncation/compaction policy

This resembles the classic OS multi-resource scheduling problem. The difference is that the scheduling policy affects the reward signal and model quality.

\newpage

16. Top-tier Conference Approaches: ICML/ICLR/OSDI/SOSP/MLSys에서 뜬 흐름

16.1 FlashAttention: IO-aware algorithm이 기준이 되다

FlashAttention은 attention을 HBM/SRAM memory hierarchy에 맞춰 tile로 재배열한다. 핵심은 attention score matrix를 HBM에 저장하지 않고 SRAM에서 online softmax로 처리하는 것이다. 이는 exact attention이지만 HBM read/write를 줄인다. 이후 FlashAttention-2/3는 work partitioning, parallelism, Hopper features를 더 활용한다.

System lesson:

Algorithm이 IO complexity를 명시적으로 최소화해야 한다.
FLOPs만 줄여서는 wall-clock speedup이 보장되지 않는다.
HBM traffic model이 paper contribution의 중심이 될 수 있다.

16.2 PagedAttention / vLLM: KV cache allocator가 SOTA를 만들다

PagedAttention, the core of vLLM, was published at SOSP 2023 and is a representative LLM serving system. Inspired by OS paging, it manages the KV cache in blocks and provides near-zero waste and prefix sharing.

System lesson:

Model serving bottleneck은 kernel이 아니라 allocator일 수 있다.
Request-level memory lifetime을 정확히 모델링하면 throughput이 크게 오른다.
GPU memory manager는 application-specific semantic을 알아야 한다.

16.3 vAttention: virtual memory API를 활용한 KV management

vAttention은 PagedAttention의 non-contiguous physical layout이 kernel complexity를 만든다고 보고, CUDA virtual memory를 사용해 virtual contiguity와 physical demand allocation을 분리한다.

System lesson:

GPU virtual memory API가 ML serving abstraction의 일부가 될 수 있다.
Kernel portability와 allocator efficiency의 trade-off가 존재한다.
OS-level mechanism을 GPU runtime에 어떻게 expose할지가 중요하다.

16.4 SGLang / RadixAttention: program-level cache reuse

SGLang은 structured language model program을 위한 frontend와 runtime을 제공한다. RadixAttention은 KV cache reuse를 prefix tree 수준에서 최적화한다. JSON constrained decoding, parallel generation, few-shot examples, multi-turn chat에서 repeated prefix가 많기 때문에 효과가 크다.

System lesson:

LLM workload는 단일 prompt가 아니라 program이다.
Control flow와 prefix structure를 runtime이 알면 caching이 좋아진다.
Agent framework와 serving runtime의 boundary가 흐려진다.

16.5 YaRN / LongLoRA / RingAttention: long-context training의 세 축

YaRN: RoPE context window extension을 적은 token/step으로 수행
LongLoRA: long-context fine-tuning을 shifted sparse attention과 parameter-efficient tuning으로 줄임
RingAttention: sequence를 device들에 나누고 KV block communication을 blockwise attention compute와 overlap

System lesson:

Position extrapolation, attention compute, distributed sequence sharding은 서로 다른 문제다.
Long context training은 context parallelism과 communication overlap이 필수다.
Sequence dimension은 이제 first-class parallel dimension이다.

16.6 StreamingLLM / Attention Sink

StreamingLLM은 attention sink token을 보존하면 finite window model이 매우 긴 stream에서도 안정적으로 동작할 수 있음을 보였다. DeepSeek-V4도 CSA/HCA core attention에서 attention sink trick을 사용한다.

System lesson:

모든 old token을 보존할 필요는 없지만, 특정 sink/state token은 long-term stability에 중요할 수 있다.
KV eviction policy는 recency만으로 충분하지 않다.
Learned sink/state는 OS cache policy의 “pinned page”와 비슷하다.

16.7 UltraMem / Memory Layers / Engram: memory as sparsity axis

ICLR/ICML 2025 전후로 memory layer, ultra-sparse memory, over-tokenized transformer, BLT, SCONE, Engram 같은 접근이 주목받았다. 공통점은 parameter를 compute가 아니라 lookup capacity로 확장하려는 것이다.

System lesson:

HBM에 모든 parameter를 resident시키는 방식은 한계가 있다.
Deterministic or predictable lookup은 host/SSD tier를 활용할 수 있다.
Embedding/tokenization/memory table은 architecture와 system co-design 대상이다.

16.8 EAGLE / Medusa / Speculative decoding: latency는 token step 수 문제다

EAGLE, Medusa, SpecInfer, Lookahead decoding은 autoregressive sequential dependency를 완화하려는 시도다. Decode step은 HBM-bound이므로 full model pass 수를 줄이면 latency가 줄어든다.

System lesson:

Output token throughput은 memory bandwidth뿐 아니라 acceptance rate에 의존한다.
Draft model/head가 만드는 extra compute가 HBM-bound main pass를 줄일 수 있다.
Scheduler는 speculative tree verification batch를 효율적으로 구성해야 한다.

16.9 DistServe / Sarathi-Serve: prefill과 decode를 분리하거나 섞기

DistServe는 prefill과 decode phase를 disaggregate해 resource를 독립 scaling한다. Sarathi-Serve는 chunked prefill과 stall-free scheduling으로 latency-throughput trade-off를 완화한다.

System lesson:

Prefill is compute-bound; decode is bound by memory bandwidth.
Mixing both phases in the same GPU pool causes interference.
For long-context workloads, chunked prefill is a key scheduling primitive for reducing tail latency.

\newpage

17. Bandwidth-aware System Design을 위한 연구 아이디어

17.1 Tensor data plane profiler

현재 ML profiler는 kernel time과 FLOPs를 잘 보여주지만, OS/system engineer가 원하는 “semantic data movement”는 부족하다. 다음 단위로 traffic을 attribution하는 profiler가 필요하다.

Traffic type	Counter
Weight read	layer/expert/precision별 bytes
KV write	prefill/decode, layer, cache type별 bytes
KV read	full/sparse/SWA/HCA/CSA별 bytes
Activation checkpoint	saved/recomputed bytes/FLOPs
MoE communication	dispatch/combine bytes, local hit ratio
Prefix cache	hit/miss, shared pages, COW events
Offload	host/SSD read/write, prefetch hit
Rollout	tokens, WAL bytes, reward/tool latency

이 profiler는 CUDA kernel counter만으로는 부족하다. Runtime metadata, model graph, request scheduler와 결합되어야 한다.

17.2 Compression-aware KV allocator

DeepSeek-V4처럼 layer마다 cache policy가 다르면 generic PagedAttention allocator는 최적이 아닐 수 있다. Research direction:

KV object type별 page class
lcm(m, m') alignment-aware block allocator
SWA state cache와 compressed KV cache 분리
prefix cache refcount와 compression block boundary 통합
SSD checkpoint interval optimizer
sparse gather locality를 고려한 block placement

17.3 Address-predictable memory prefetcher

Engram은 address가 input token으로 결정된다. CSA indexer는 query hidden state가 필요하지만, some indexer metadata는 precompute 가능할 수 있다. MoE router는 hidden state 이후 결정된다.

Research question:

Address predictability spectrum을 정의할 수 있는가?
Predictable memory row는 host/SSD tier에서 async prefetch할 수 있는가?
GPU kernel launch 전에 CPU가 next-layer memory address를 얼마나 빨리 준비할 수 있는가?
Prefix cache와 Engram hot n-gram cache를 통합할 수 있는가?

17.4 Decode memory QoS scheduler

Serving은 heterogeneous request를 처리한다.

Short chat: small prefill, short decode
Long document QA: huge prefill, short decode
Reasoning: medium/long prefill, huge decode
Agent: intermittent tool stalls, long context growth

Bandwidth-aware scheduler는 request를 token count가 아니라 expected HBM bytes per next step으로 scheduling해야 한다.

cost(request) = active_weight_bytes
              + visible_KV_bytes
              + expected_comm_bytes
              + dequant_scale_bytes
              + metadata_overhead

이를 사용해 batch를 구성하면 HBM bandwidth saturation과 tail latency를 더 잘 제어할 수 있다.
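A sketch of how such a cost model could drive batch formation. Two modeling choices here are assumptions rather than part of the formula above: active weight bytes are charged once per batch, since a decode step reads the weights once for all requests, and requests are admitted greedily from cheapest to most expensive. All names and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    visible_kv_bytes: float
    expected_comm_bytes: float = 0.0
    dequant_scale_bytes: float = 0.0
    metadata_overhead: float = 0.0


def step_cost(req):
    """Per-request bytes for the next decode step (weights counted per batch)."""
    return (req.visible_kv_bytes + req.expected_comm_bytes
            + req.dequant_scale_bytes + req.metadata_overhead)


def build_batch(queue, active_weight_bytes, byte_budget):
    """Greedily admit requests until the per-step HBM byte budget is reached."""
    batch, used = [], active_weight_bytes
    for req in sorted(queue, key=step_cost):
        cost = step_cost(req)
        if used + cost <= byte_budget:
            batch.append(req)
            used += cost
    return batch, used


queue = [
    Request("doc_qa", visible_kv_bytes=6.0e9),
    Request("chat", visible_kv_bytes=0.2e9),
    Request("reasoning", visible_kv_bytes=1.5e9),
]
batch, used = build_batch(queue, active_weight_bytes=6.0e9, byte_budget=8.0e9)
print([r.name for r in batch], used)
```

Dividing `byte_budget` by HBM bandwidth gives the target step time, so the budget is effectively a latency SLO expressed in bytes. Cheapest-first admission can starve long-context requests, so a real policy would need aging or separate queues.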

17.5 MoE all-to-all overlap optimizer

DeepSeek-V4의 formula C/B <= 2d는 hardware/software co-design의 좋은 출발점이다. Research direction:

Automatic tuning of expert wave size
Network topology-aware expert placement
Token routing locality regularization
Training the router to treat the bandwidth budget as a cost
Pull vs push scheduling for the all-to-all primitive
Co-scheduling the communication progress engine with the GEMM pipeline

17.6 RL rollout runtime

Post-training RL is a workload that combines GPU serving and training. Research direction:

Token-granular WAL + KV checkpoint compression
Reward/verifier scheduling with GPU rollout
Sandbox state snapshot delta compression
Trajectory replay determinism
Long-context rollout compaction policy
Trade-off between on-policy/off-policy freshness and system queueing delay

17.7 “Attention outside GPU” architecture

A long-context agent does not have to process all of its context through attention. File systems, databases, vector stores, and symbolic tools can act as context processors.

Possible design:

LLM context window = active working set
external files/db = backing store
retrieval/search/tool = page fault handler
compaction = garbage collector
KV prefix cache = page cache

GPT-5.1-Codex-Max's compaction, coding agents that manipulate file systems, and Gemini/GPT agent workflows all point in this direction.

\newpage

18. Appendix A: Formula Sheet

A.1 KV cache

KV_bytes_per_token = 2 * L * H_kv * D_head * bytes
KV_cache_sequence = S * KV_bytes_per_token
KV_read_decode = visible_S * KV_bytes_per_token_visible
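
The formulas above can be checked with a few lines of arithmetic. The configuration below (80 layers, 8 KV heads, head dimension 128, BF16) is a hypothetical GQA-style example, not a specific model's published config.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_elem: int) -> int:
    """2 (K and V) * L * H_kv * D_head * bytes."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

def kv_cache_bytes(seq_len: int, per_token: int) -> int:
    """KV cache footprint of one sequence of seq_len tokens."""
    return seq_len * per_token

per_token = kv_bytes_per_token(80, 8, 128, 2)   # hypothetical config
total = kv_cache_bytes(128 * 1024, per_token)   # one 128K-token sequence
print(per_token / 1024, "KiB per token")        # 320.0 KiB per token
print(total / 2**30, "GiB per sequence")        # 40.0 GiB per sequence
```

With dense attention every cached token is visible, so the same 40 GiB is also the KV traffic of a single decode step at that context length.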

A.2 Dense attention FLOPs

QK FLOPs ≈ 2 * S_q * S_k * H_q * D_head
AV FLOPs ≈ 2 * S_q * S_k * H_q * D_head
Total ≈ 4 * S_q * S_k * H_q * D_head

For decode, S_q=1, S_k=context_length.
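
As a worked sketch, the functions below evaluate the formula for one layer and compare decode FLOPs with the KV bytes that must be read. The head counts and context length are hypothetical, and the KV byte count reuses the per-layer form of the A.1 formula.

```python
def attention_flops(s_q: int, s_k: int, num_q_heads: int, head_dim: int) -> int:
    """Per-layer dense attention FLOPs: QK^T plus AV."""
    qk = 2 * s_q * s_k * num_q_heads * head_dim
    av = 2 * s_q * s_k * num_q_heads * head_dim
    return qk + av

def decode_flops_per_kv_byte(s_k: int, num_q_heads: int, num_kv_heads: int,
                             head_dim: int, bytes_per_elem: int) -> float:
    """Arithmetic intensity of one decode step (S_q = 1) in one layer."""
    flops = attention_flops(1, s_k, num_q_heads, head_dim)
    kv_bytes = 2 * s_k * num_kv_heads * head_dim * bytes_per_elem
    return flops / kv_bytes

# Hypothetical layer: 64 query heads, head_dim 128, 1M-token context, BF16 KV.
print(attention_flops(1, 1_000_000, 64, 128))                # 32768000000
print(decode_flops_per_kv_byte(1_000_000, 64, 64, 128, 2))   # MHA: 1.0
print(decode_flops_per_kv_byte(1_000_000, 64, 8, 128, 2))    # GQA-8: 8.0
```

The ratio simplifies to 2 * H_q / (H_kv * bytes) and does not depend on context length, which is why dense decode attention stays bandwidth-bound however long the context grows.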

A.3 MLP/MoE FLOPs

SwiGLU expert rough per token-expert:

FLOPs ≈ 6 * h * d_ff_expert

The DeepSeek-V4 report models each token-expert pair as 6hd FLOPs of compute (d = d_ff_expert) and 3h bytes of communication, so communication can hide under compute when:

C/B <= 2d FLOPs/Byte
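
One way to read this bound, assuming C is sustained compute throughput in FLOP/s and B is interconnect bandwidth in bytes/s: communication hides under compute when 3h/B <= 6hd/C, which rearranges to C/B <= 2d. The sketch below turns that into a minimum link bandwidth. The function names and the hardware numbers are hypothetical.

```python
def min_link_bandwidth(compute_flops_per_s: float, d: int) -> float:
    """Smallest B (bytes/s) that satisfies C/B <= 2d."""
    return compute_flops_per_s / (2 * d)

def comm_hidden(compute_flops_per_s: float, link_bytes_per_s: float, d: int) -> bool:
    """True when per-pair communication time fits under per-pair compute time."""
    return compute_flops_per_s / link_bytes_per_s <= 2 * d

# Hypothetical: 1 PFLOP/s sustained expert GEMM throughput, d = 2048.
C, d = 1e15, 2048
print(min_link_bandwidth(C, d) / 1e9, "GB/s needed")
print(comm_hidden(C, 300e9, d), comm_hidden(C, 100e9, d))
```

Under these assumed numbers the link needs roughly 244 GB/s. A larger expert intermediate dimension d relaxes the requirement, while faster compute tightens it.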

A.4 Adam training state

param BF16       2B
grad BF16        2B
master FP32      4B
Adam m/v FP32    8B
--------------------
~16B / parameter

ZeRO/FSDP partition parts of this state across data-parallel ranks.
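
A small sketch of how partitioning changes the per-rank footprint, following the stage definitions of the ZeRO paper listed in the references (stage 1 shards optimizer state, stage 2 adds gradients, stage 3 adds parameters). The function name and the rank count are illustrative, and activations and temporary buffers are ignored.

```python
def train_state_bytes_per_param(zero_stage: int, dp_ranks: int) -> float:
    """Per-rank bytes per parameter for mixed-precision Adam training state."""
    param, grad = 2.0, 2.0       # BF16 parameters and gradients
    optim = 4.0 + 8.0            # FP32 master copy + Adam m and v
    if zero_stage >= 1:
        optim /= dp_ranks        # shard optimizer state
    if zero_stage >= 2:
        grad /= dp_ranks         # shard gradients
    if zero_stage >= 3:
        param /= dp_ranks        # shard parameters
    return param + grad + optim

for stage in range(4):
    print(stage, train_state_bytes_per_param(stage, dp_ranks=64))
```

With 64 data-parallel ranks the per-parameter state drops from 16 B unsharded to 0.25 B at stage 3, at the price of the all-gather and reduce-scatter traffic listed in the profiling checklist.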

A.5 DPO loss intuition

DPO optimizes pairwise preference using policy/reference logprob difference. For chosen y+, rejected y-:

Δ_policy = log πθ(y+|x) - log πθ(y-|x)
Δ_ref    = log πref(y+|x) - log πref(y-|x)
loss     = -log sigmoid(β * (Δ_policy - Δ_ref))

System implication: no online rollout, reference logprobs can often be precomputed.
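
The loss above for a single preference pair, as a dependency-free sketch. The inputs are sequence-level log-probabilities. The default `beta=0.1` and the numerically stable softplus form are implementation choices made here, not taken from the source.

```python
import math

def softplus(x: float) -> float:
    """log(1 + exp(x)), computed without overflow."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    """-log sigmoid(beta * (delta_policy - delta_ref)) for one pair."""
    delta_policy = policy_chosen - policy_rejected
    delta_ref = ref_chosen - ref_rejected
    z = beta * (delta_policy - delta_ref)
    return softplus(-z)          # -log(sigmoid(z)) == softplus(-z)

# The reference log-probs are constants per pair, so they can be cached.
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))   # policy == reference: log(2)
print(dpo_loss(-10.0, -18.0, -12.0, -15.0))   # policy prefers chosen more: lower
```

Because `ref_chosen` and `ref_rejected` do not depend on the trainable policy, they can be computed once per dataset, which is the precomputation the system implication refers to.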

A.6 GRPO advantage intuition

For prompt x, sample group {y_i} and rewards {r_i}:

A_i = (r_i - mean(r_group)) / std(r_group)
policy gradient with KL/reference regularization

System implication: no value model, but group sampling multiplies rollout tokens.
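
The group-normalized advantage can be sketched in plain Python. Using the population standard deviation and adding a small `eps` to avoid division by zero when every sample gets the same reward are assumptions of this sketch. `rollout_tokens` is a hypothetical helper showing how group size multiplies generation work.

```python
import statistics

def grpo_advantages(rewards, eps: float = 1e-6):
    """A_i = (r_i - mean(r_group)) / (std(r_group) + eps) for one prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def rollout_tokens(num_prompts: int, group_size: int, mean_len: int) -> int:
    """Tokens generated per RL step: every prompt is sampled group_size times."""
    return num_prompts * group_size * mean_len

print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))   # roughly [1, -1, -1, 1]
print(grpo_advantages([1.0, 1.0, 1.0, 1.0]))   # all zeros: no learning signal
print(rollout_tokens(256, 16, 8192))           # hypothetical batch, group, length
```

A group in which all samples receive the same reward contributes zero advantage, so its rollout tokens are spent without producing gradient signal. This is one reason group size and GPU utilization are profiled together.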

\newpage

19. Appendix B: Practical Profiling Checklist

B.1 Inference microbenchmarks

Decode latency vs context length: 4K, 32K, 128K, 1M
KV read bytes/token by layer
Weight read bytes/token by layer/expert
HBM bandwidth utilization vs tensor core utilization
Prefix cache hit ratio and saved prefill FLOPs
KV allocator fragmentation and page waste
SWA window size vs latency/quality
Sparse top-k gather locality
FP8/FP4 dequant overhead
MoE all-to-all latency hidden ratio

B.2 Training microbenchmarks

Activation memory peak by layer
Recomputation FLOPs vs saved bytes
Optimizer state memory per parameter
ZeRO/FSDP all-gather/reduce-scatter bandwidth
EP all-to-all dispatch/combine throughput
Context parallel boundary communication
Data loader/token packing throughput
Loss spike rollback frequency
Deterministic kernel overhead
Checkpoint write/read time

B.3 Post-training microbenchmarks

Rollout tokens/sec per GPU
Reward/verifier latency distribution
Tool sandbox queueing delay
WAL bytes/token and recovery time
Reference logprob recomputation cost
DPO chosen/rejected forward ratio
GRPO group size vs GPU utilization
OPD teacher hidden cache I/O
Full-vocab KL kernel time
Preemption survival and length bias check

\newpage

20. References

The references below are the core sources used in this report. The attached materials are local files provided by the user; public sources are drawn mainly from official pages, arXiv, OpenReview, and conference proceedings.

20.1 User-provided materials

DeepSeek-AI, DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence, uploaded PDF DeepSeek_V4.pdf.
Peichen Guo, Deepseek v4 presentation slides, uploaded PPTX Deepseek v4 (_).pptx.
Xin Cheng et al., Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models, uploaded PDF 2601.07372v1.pdf.

20.2 Frontier model / technical reports

DeepSeek-AI, DeepSeek-V3 Technical Report, arXiv:2412.19437, https://arxiv.org/abs/2412.19437
DeepSeek-AI, DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model, arXiv:2405.04434, https://arxiv.org/abs/2405.04434
DeepSeek-AI, DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models, arXiv:2512.02556, https://arxiv.org/abs/2512.02556
OpenAI, gpt-oss-120b & gpt-oss-20b Model Card, arXiv:2508.10925, https://arxiv.org/abs/2508.10925
OpenAI, gpt-oss-120b Hugging Face model card/config, https://huggingface.co/openai/gpt-oss-120b
OpenAI, Introducing GPT-5.4, https://openai.com/index/introducing-gpt-5-4/
OpenAI, GPT-5.4 API model docs, https://developers.openai.com/api/docs/models/gpt-5.4
OpenAI, Introducing GPT-5.5, https://openai.com/index/introducing-gpt-5-5/
OpenAI, Building more with GPT-5.1-Codex-Max, https://openai.com/index/gpt-5-1-codex-max/
Google DeepMind, Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities, arXiv:2507.06261, https://arxiv.org/abs/2507.06261
Google DeepMind, Gemini 3 model page, https://deepmind.google/models/gemini/
Meta, The Llama 4 herd, https://ai.meta.com/blog/llama-4-multimodal-intelligence/
Meta, Llama 4 model page, https://www.llama.com/models/llama-4/
Qwen Team, Qwen3 Technical Report, arXiv:2505.09388, https://arxiv.org/abs/2505.09388
Kimi Team, Kimi K2: Open Agentic Intelligence, arXiv:2507.20534, https://arxiv.org/abs/2507.20534
MiniMax, MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention, arXiv:2506.13585, https://arxiv.org/abs/2506.13585
MiniMax, MiniMax-01: Scaling Foundation Models with Lightning Attention, arXiv:2501.08313, https://arxiv.org/abs/2501.08313

20.3 Attention / memory / serving systems

Tri Dao et al., FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness, NeurIPS 2022 / arXiv:2205.14135, https://arxiv.org/abs/2205.14135
Joshua Ainslie et al., GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints, EMNLP 2023 / arXiv:2305.13245, https://arxiv.org/abs/2305.13245
Woosuk Kwon et al., Efficient Memory Management for Large Language Model Serving with PagedAttention, SOSP 2023 / arXiv:2309.06180, https://arxiv.org/abs/2309.06180
Ramya Prabhu et al., vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention, arXiv:2405.04437, https://arxiv.org/abs/2405.04437
Lianmin Zheng et al., SGLang: Efficient Execution of Structured Language Model Programs, arXiv:2312.07104, https://arxiv.org/abs/2312.07104
Amey Agrawal et al., Sarathi-Serve: Taming Throughput-Latency Tradeoff in LLM Inference, arXiv:2403.02310, https://arxiv.org/abs/2403.02310
Yinmin Zhong et al., DistServe: Disaggregating Prefill and Decoding for Goodput-Optimized LLM Serving, OSDI 2024, https://www.usenix.org/conference/osdi24/presentation/zhong-yinmin
Guangxuan Xiao et al., Efficient Streaming Language Models with Attention Sinks, ICLR 2024 / arXiv:2309.17453, https://arxiv.org/abs/2309.17453
Bowen Peng et al., YaRN: Efficient Context Window Extension of Large Language Models, ICLR 2024 / arXiv:2309.00071, https://arxiv.org/abs/2309.00071
Yukang Chen et al., LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models, ICLR 2024 / arXiv:2309.12307, https://arxiv.org/abs/2309.12307
Hao Liu et al., Ring Attention with Blockwise Transformers for Near-Infinite Context, arXiv:2310.01889, https://arxiv.org/abs/2310.01889

20.4 Sparsity / memory layers / decoding acceleration

Xu Owen He, Mixture of a Million Experts, arXiv:2407.04153, https://arxiv.org/abs/2407.04153
Zihao Huang et al., Ultra-Sparse Memory Network, ICLR 2025 / arXiv:2411.12364, https://arxiv.org/abs/2411.12364
Hongzhi Huang et al., Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling, ICML 2025, https://proceedings.mlr.press/v267/huang25bb.html
Yuhui Li et al., EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty, ICML 2024 / arXiv:2401.15077, https://arxiv.org/abs/2401.15077
Tianle Cai et al., Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads, arXiv:2401.10774, https://arxiv.org/abs/2401.10774
Xupeng Miao et al., SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification, arXiv:2305.09781, https://arxiv.org/abs/2305.09781
Yichao Fu et al., Lookahead Decoding, arXiv:2402.02057, https://arxiv.org/abs/2402.02057

20.5 Training / RLHF / DPO / GRPO

Samyam Rajbhandari et al., ZeRO: Memory Optimizations Toward Training Trillion Parameter Models, SC 2020 / arXiv:1910.02054, https://arxiv.org/abs/1910.02054
Mohammad Shoeybi et al., Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism, arXiv:1909.08053, https://arxiv.org/abs/1909.08053
Sam Ade Jacobs et al., DeepSpeed Ulysses, arXiv:2309.14509, https://arxiv.org/abs/2309.14509
Rafael Rafailov et al., Direct Preference Optimization: Your Language Model is Secretly a Reward Model, NeurIPS 2023 / arXiv:2305.18290, https://arxiv.org/abs/2305.18290
Zhihong Shao et al., DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models, arXiv:2402.03300, https://arxiv.org/abs/2402.03300
DeepSeek-AI, DeepSeek-R1 Incentivizes Reasoning in LLMs through Reinforcement Learning, Nature 2025, https://www.nature.com/articles/s41586-025-09422-z
OpenAI, Learning to Reason with LLMs, https://openai.com/index/learning-to-reason-with-llms/
Anthropic, Constitutional AI: Harmlessness from AI Feedback, arXiv:2212.08073, https://arxiv.org/abs/2212.08073

20.6 Hardware references

NVIDIA, H100 Tensor Core GPU, https://www.nvidia.com/en-us/data-center/h100/
NVIDIA, H200 Tensor Core GPU, https://www.nvidia.com/en-us/data-center/h200/
NVIDIA, DGX B200, https://www.nvidia.com/en-us/data-center/dgx-b200/
NVIDIA, GB200 NVL72, https://www.nvidia.com/en-us/data-center/gb200-nvl72/
NVIDIA, Multi-node NVLink systems tuning guide, https://docs.nvidia.com/multi-node-nvlink-systems/multi-node-tuning-guide/overview.html

Closing note

The central view of this report is simple. SOTA ML systems no longer optimize FLOPs alone. Model architectures are designed with HBM, NVLink, PCIe, host memory, SSD, the sandbox filesystem, and the rollout log all in mind. For OS/system engineers, the biggest opportunity in this shift is to build a “memory manager and scheduler that understand model semantics.”
