Fire-Flyer AI-HPC cluster
- 1,250 compute nodes (8 GPUs each) -> 10,000 GPUs
- 1x 200 Gbps NIC per node
- ~80% of DGX-A100 performance at ~60% of the cost
- 180 storage nodes w/ RDMA (IB)
- 2x 200Gbps NIC
- 16x 16TB NVMe SSDs per node (PCIe 4.0 x4), ~2.8 GB/s read each
- aggregate network bandwidth ~9 TB/s, peak read throughput ~8 TB/s, i.e. nearly saturated (back-of-envelope check below)
- Topology
- Two Zones
- GPU nodes and storage nodes each connect to one leaf switch
- each zone: two-layer fat-tree of 40-port switches, 20 spine + 40 leaf, ~800 nodes
- 2 switches link the two zones -> 122 switches in total (counted below); NVIDIA's reference design would be ~11x more expensive
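The figures above hang together; a quick back-of-envelope check using only the numbers quoted in this section (the even 20-down/20-up port split per leaf switch is my assumption):

```python
# Back-of-envelope check of the cluster figures above.

# Storage: aggregate network vs. aggregate SSD read bandwidth.
nodes, nics, nic_gbps = 180, 2, 200
ssds, ssd_read_gbs = 16, 2.8
net_tbs = nodes * nics * nic_gbps / 8 / 1000      # Gbps -> GB/s -> TB/s
ssd_tbs = nodes * ssds * ssd_read_gbs / 1000
print(f"network {net_tbs:.1f} TB/s, SSD read {ssd_tbs:.1f} TB/s")   # 9.0 / 8.1

# Topology: switch count for two zones of a two-layer fat-tree.
ports, leaves, spines, zones, inter_zone = 40, 40, 20, 2, 2
nodes_per_zone = leaves * (ports // 2)            # 20 downlinks per leaf -> 800 nodes
total_switches = zones * (leaves + spines) + inter_zone
print(nodes_per_zone, total_switches)             # 800, 122
```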
Fire-Flyer File System (3FS)
- Their workload characteristics
- massive numbers of small random reads (a few KB to a few MB each)
- cache-unfriendly (little data reuse)
- latency may not matter much; throughput does
- Why aren't they using Hadoop?
- consistency?
- random read perf?
- Why a file system rather than a key-value/object store?
- atomic directory ops
- operations that are not well supported by flat-namespace stores
- ex: create a/b/temp, write all the checkpoint files into it, then atomically rename it to latest-ckpt (sketch below)
- ex: removing a recursive data structure: with directories we can just remove the parent folder
- symbolic and hard links
- useful to access 'latest'
- familiar interface
- Metadata performance -> open question; the richer FS semantics probably add some overhead
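A minimal sketch of the checkpoint pattern from the example above, in plain Python against any POSIX-style namespace (paths, names, and layout are invented for illustration): rename(2) publishes the new checkpoint atomically, and a symlink swap keeps a stable 'latest' name.

```python
import os

def publish_checkpoint(ckpt_root: str, step: int) -> None:
    """Write a checkpoint under a temp dir, then publish it atomically.

    'ckpt_root', 'step' and the layout are invented for illustration.
    """
    tmp_dir = os.path.join(ckpt_root, f"step-{step}.tmp")
    final_dir = os.path.join(ckpt_root, f"step-{step}")
    os.makedirs(tmp_dir, exist_ok=True)

    # ... write all checkpoint shards into tmp_dir (and fsync them) ...

    # Atomic directory rename: readers never see a half-written checkpoint.
    os.rename(tmp_dir, final_dir)

    # Repoint 'latest' via a symlink swap: symlink to a temp name, then an
    # atomic rename over the old link.
    latest_tmp = os.path.join(ckpt_root, "latest.tmp")
    if os.path.lexists(latest_tmp):
        os.remove(latest_tmp)
    os.symlink(f"step-{step}", latest_tmp)
    os.replace(latest_tmp, os.path.join(ckpt_root, "latest"))
```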
- Architecture
- Cluster manager
- Clients - FUSE-based client plus an async IO interface (USRBIO, below)
- Metadata - FoundationDB (transactional kv store)
- Data - files are split into chunks, striped across storage nodes, and replicated via chain replication (CRAQ)
- The earlier 3FS client used Linux AIO and no cache
- Problems w/ FUSE
- Memory copy overhead
- multiple copies (user -> kernel -> user ...)
- multithreading (MT) scalability
- no concurrent writes to the same file
- spinlock contention on the request queue -> tops out around 400K reqs/s for 4 KB reads
- Requirements
- Userspace
- Zero-copy
- Async -> submit a request and move on; completion is checked later
- User Space Ring Based IO (USRBIO)
- App -> ring -> FS
- IOV: a buffer region memory-mapped into both the app and the FS client
- data moves via RDMA directly between that buffer and the FS, no extra copies
- completion is polling-based: the app must explicitly poll/wait to see finished IOs (toy ring model below)
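A toy, single-process model of the ring/IOV/polling flow. Every name here (IoRing, IoRequest, submit_read, poll, service) is invented and the real USRBIO API differs, but the shape is the same: the app enqueues requests into a shared ring and polls for completions, while the FS side moves data straight into the shared IOV buffer.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class IoRequest:
    iov_offset: int    # where the data lands inside the shared IOV buffer
    length: int
    file_offset: int
    done: bool = False
    result: int = 0

class IoRing:
    """Toy submission/completion ring shared by the app and the FS client.

    Invented names; the real USRBIO API differs. In 3FS the IOV is an
    RDMA-registered shared-memory region, modeled here as a bytearray.
    """
    def __init__(self, iov_size: int):
        self.iov = bytearray(iov_size)
        self.sq = deque()   # submission queue (app -> FS)
        self.cq = deque()   # completion queue (FS -> app)

    # --- app side ---
    def submit_read(self, file_offset: int, length: int, iov_offset: int) -> IoRequest:
        req = IoRequest(iov_offset, length, file_offset)
        self.sq.append(req)        # enqueue and return immediately (async)
        return req

    def poll(self):
        done, self.cq = list(self.cq), deque()
        return done                # polling based: caller must keep checking

    # --- FS-client side ---
    def service(self, read_fn):
        # Drain the ring; in the real system the data is RDMA'd straight
        # into the IOV with no intermediate copy.
        while self.sq:
            req = self.sq.popleft()
            data = read_fn(req.file_offset, req.length)
            self.iov[req.iov_offset:req.iov_offset + len(data)] = data
            req.result, req.done = len(data), True
            self.cq.append(req)
```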
- Metadata management
- consistent, transactional KV store (FoundationDB), covered by very sophisticated (simulation-based) testing -> reliable
- inodes and directory entries are mapped onto key-value pairs in that store (one plausible layout sketched below)
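One plausible way to map inodes and directory entries onto a transactional KV store (a guessed layout for illustration, not necessarily 3FS's real schema; the transaction interface and MemTxn stub are hypothetical). The payoff is that directory operations such as rename become single transactions:

```python
class MemTxn:
    """Stand-in for a FoundationDB-style transaction (dict-backed, illustrative)."""
    def __init__(self, store):
        self.store = store
    def get(self, key):
        return self.store[key]
    def set(self, key, value):
        self.store[key] = value
    def delete(self, key):
        del self.store[key]

# Hypothetical key layout (not 3FS's real schema):
#   ("dent", parent_ino, name) -> child_ino
#   ("inode", ino)             -> attrs (type, length, chunk layout, ...)

def rename(txn, src_parent, src_name, dst_parent, dst_name):
    """Atomic rename as one transaction: move the dirent, leave the inode alone."""
    ino = txn.get(("dent", src_parent, src_name))
    txn.delete(("dent", src_parent, src_name))
    txn.set(("dent", dst_parent, dst_name), ino)
    # With a real transactional store, either both key changes commit or neither does.

store = {("dent", 1, "temp"): 42, ("inode", 42): {"type": "dir"}}
rename(MemTxn(store), 1, "temp", 1, "latest-ckpt")
print(("dent", 1, "latest-ckpt") in store)   # True
```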
- Interesting design choices
- Open-fd ref_cnt tracking for deletion (sketch after this list)
- a reference count tracks how many fds are open on a file
- read-only fds are not counted; only writable fds are tracked
- File length is updated lazily
- Large stripe size
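A rough sketch of the open-fd reference-counting idea from the list above; all class and method names are invented, and, per the note, only writable opens are counted:

```python
class InodeGc:
    """Delay chunk reclamation until no writable fds remain (illustrative only)."""
    def __init__(self):
        self.write_refs = {}   # ino -> number of open writable fds
        self.unlinked = set()  # inodes whose last directory entry was removed

    def open_for_write(self, ino):
        self.write_refs[ino] = self.write_refs.get(ino, 0) + 1

    def close_write(self, ino):
        self.write_refs[ino] -= 1
        self._maybe_reclaim(ino)

    def unlink(self, ino):
        self.unlinked.add(ino)          # the directory entry disappears immediately
        self._maybe_reclaim(ino)

    def _maybe_reclaim(self, ino):
        if ino in self.unlinked and self.write_refs.get(ino, 0) == 0:
            self.unlinked.discard(ino)
            self.write_refs.pop(ino, None)
            # ... now safe to delete the inode record and its chunks ...
```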
- Chunk storage
- each storage node stores chunks as local files and uses RocksDB to manage the chunk metadata
- variable-size chunks, 64 KB to 64 MB, powers of 2 (sketch below)
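A tiny sketch of choosing a chunk size within the stated 64 KB to 64 MB power-of-two range; the round-up-to-the-next-power-of-two policy is my assumption for illustration:

```python
def pick_chunk_size(requested: int,
                    min_chunk: int = 64 * 1024,
                    max_chunk: int = 64 * 1024 * 1024) -> int:
    """Round the requested size up to a power of two, clamped to [64 KB, 64 MB]."""
    size = min_chunk
    while size < requested and size < max_chunk:
        size *= 2
    return size

assert pick_chunk_size(100 * 1024) == 128 * 1024
assert pick_chunk_size(1) == 64 * 1024
assert pick_chunk_size(10**9) == 64 * 1024 * 1024
```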
- CRAQ
- vanilla chain replication [OSDI'04]
- writes go to the head and are pushed node by node along the chain
- bad latency (writes traverse the whole chain), and only the tail serves reads
- CRAQ [ATC'09]
- reads can be served by any node on the chain
- a version held at a node is either 'clean' (known committed) or 'dirty' (a newer write may still be propagating)
- (not fully sure about this part; need to look into it more)
- if the newest local version is dirty, the node asks the tail which version is committed and serves that one; clean versions are answered locally (toy model below)
- the paper also describes an optional multicast optimization for propagation; open question how that fits in
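A toy model of the CRAQ read path to pin down the clean/dirty behavior above (single object, no failures, propagation and commit as explicit calls; a simplification of the ATC'09 protocol, not 3FS code):

```python
class CraqChain:
    """Toy CRAQ chain for one object: write at the head, read at any replica.

    Propagation and commit are separate steps so the dirty state is observable.
    """
    def __init__(self, length: int):
        self.nodes = [{"versions": {0: None}, "clean": 0, "latest": 0}
                      for _ in range(length)]

    def propagate(self, value) -> int:
        """Head accepts a write; every node down the chain stores it as dirty."""
        version = self.nodes[0]["latest"] + 1
        for node in self.nodes:
            node["versions"][version] = value
            node["latest"] = version
        return version

    def commit(self, version: int) -> None:
        """Tail commits; the ack flows tail -> head, marking the version clean."""
        for node in reversed(self.nodes):
            node["clean"] = max(node["clean"], version)

    def read(self, index: int):
        node = self.nodes[index]
        if node["latest"] == node["clean"]:
            return node["versions"][node["clean"]]   # clean: answer locally
        committed = self.nodes[-1]["clean"]          # dirty: ask the tail which
        return node["versions"][committed]           #   version is committed


chain = CraqChain(3)
v = chain.propagate("A")
print(chain.read(1))   # None: version v is still dirty, tail says 0 is committed
chain.commit(v)
print(chain.read(1))   # 'A': now clean everywhere, served locally
```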