Fire-Flyer AI-HPC cluster
- 1,250 compute nodes (8 GPUs each) -> 10,000 GPUs
- 1x 200 Gbps NIC per node
- ~80% of DGX-A100 performance at ~60% of the cost
- 180 storage nodes w/ RDMA (IB)
- 2x 200Gbps NIC
- 16x 16TB NVMe SSDs per node (PCIe 4.0 x4), ~2.8 GB/s read each
- aggregate network bandwidth ~9 TB/s, peak read throughput ~8 TB/s, i.e. nearly saturated (back-of-envelope check below)
- Topology
- Two Zones
- GPU nodes and storage nodes each connect to one leaf switch
- each zone: two-layer fat-tree of 40-port switches, 20 spine + 40 leaf, ~800 nodes
- 2 switches link the two zones -> 122 switches in total (counted below); NVIDIA's reference design would be ~11x more expensive
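The figures above hang together; a quick back-of-envelope check using only the numbers quoted in this section (the even 20-down/20-up port split per leaf switch is my assumption):

```python
# Back-of-envelope check of the cluster figures above.

# Storage: aggregate network vs. aggregate SSD read bandwidth.
nodes, nics, nic_gbps = 180, 2, 200
ssds, ssd_read_gbs = 16, 2.8
net_tbs = nodes * nics * nic_gbps / 8 / 1000      # Gbps -> GB/s -> TB/s
ssd_tbs = nodes * ssds * ssd_read_gbs / 1000
print(f"network {net_tbs:.1f} TB/s, SSD read {ssd_tbs:.1f} TB/s")   # 9.0 / 8.1

# Topology: switch count for two zones of a two-layer fat-tree.
ports, leaves, spines, zones, inter_zone = 40, 40, 20, 2, 2
nodes_per_zone = leaves * (ports // 2)            # 20 downlinks per leaf -> 800 nodes
total_switches = zones * (leaves + spines) + inter_zone
print(nodes_per_zone, total_switches)             # 800, 122
```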
Fire-Flyer File System (3FS)
- Their workload characteristics
- massive numbers of small random reads (a few KB to a few MB each)
- cache-unfriendly (little data reuse)
- latency may not matter much; throughput does
- Why aren't they using Hadoop?
- consistency?
- random read perf?
- Why a file system rather than a key-value/object store?
- atomic directory ops
- operations that are not well supported by flat-namespace stores
- ex: create a/b/temp, write all the checkpoint files into it, then atomically rename it to latest-ckpt (sketch below)
- ex: removing a recursive data structure: with directories we can just remove the parent folder
- symbolic and hard links
- useful to access 'latest'
- familiar interface
- Metadata performance -> open question; the richer FS semantics probably add some overhead
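A minimal sketch of the checkpoint pattern from the example above, in plain Python against any POSIX-style namespace (paths, names, and layout are invented for illustration): rename(2) publishes the new checkpoint atomically, and a symlink swap keeps a stable 'latest' name.

```python
import os

def publish_checkpoint(ckpt_root: str, step: int) -> None:
    """Write a checkpoint under a temp dir, then publish it atomically.

    'ckpt_root', 'step' and the layout are invented for illustration.
    """
    tmp_dir = os.path.join(ckpt_root, f"step-{step}.tmp")
    final_dir = os.path.join(ckpt_root, f"step-{step}")
    os.makedirs(tmp_dir, exist_ok=True)

    # ... write all checkpoint shards into tmp_dir (and fsync them) ...

    # Atomic directory rename: readers never see a half-written checkpoint.
    os.rename(tmp_dir, final_dir)

    # Repoint 'latest' via a symlink swap: symlink to a temp name, then an
    # atomic rename over the old link.
    latest_tmp = os.path.join(ckpt_root, "latest.tmp")
    if os.path.lexists(latest_tmp):
        os.remove(latest_tmp)
    os.symlink(f"step-{step}", latest_tmp)
    os.replace(latest_tmp, os.path.join(ckpt_root, "latest"))
```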
- Architecture
- Cluster manager
- Clients - FUSE-based client plus an async IO interface (USRBIO, below)
- Metadata - FoundationDB (transactional kv store)
- Data - files are split into chunks, striped across storage nodes, and replicated via chain replication (CRAQ)
- The earlier 3FS client used Linux AIO and no cache
- Problems w/ FUSE
- Memory copy overhead
- multiple copies (user -> kernel -> user ...)
- multithreading (MT) scalability
- no concurrent writes to the same file
- spinlock contention on the request queue -> tops out around 400K reqs/s for 4 KB reads
- Requirements
- Userspace
- Zero-copy
- Async -> submit a request and move on; completion is checked later
- User Space Ring Based IO (USRBIO)
- App -> ring -> FS
- IOV: a buffer region memory-mapped into both the app and the FS client
- data moves via RDMA directly between that buffer and the FS, no extra copies
- completion is polling-based: the app must explicitly poll/wait to see finished IOs (toy ring model below)
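A toy, single-process model of the ring/IOV/polling flow. Every name here (IoRing, IoRequest, submit_read, poll, service) is invented and the real USRBIO API differs, but the shape is the same: the app enqueues requests into a shared ring and polls for completions, while the FS side moves data straight into the shared IOV buffer.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class IoRequest:
    iov_offset: int    # where the data lands inside the shared IOV buffer
    length: int
    file_offset: int
    done: bool = False
    result: int = 0

class IoRing:
    """Toy submission/completion ring shared by the app and the FS client.

    Invented names; the real USRBIO API differs. In 3FS the IOV is an
    RDMA-registered shared-memory region, modeled here as a bytearray.
    """
    def __init__(self, iov_size: int):
        self.iov = bytearray(iov_size)
        self.sq = deque()   # submission queue (app -> FS)
        self.cq = deque()   # completion queue (FS -> app)

    # --- app side ---
    def submit_read(self, file_offset: int, length: int, iov_offset: int) -> IoRequest:
        req = IoRequest(iov_offset, length, file_offset)
        self.sq.append(req)        # enqueue and return immediately (async)
        return req

    def poll(self):
        done, self.cq = list(self.cq), deque()
        return done                # polling based: caller must keep checking

    # --- FS-client side ---
    def service(self, read_fn):
        # Drain the ring; in the real system the data is RDMA'd straight
        # into the IOV with no intermediate copy.
        while self.sq:
            req = self.sq.popleft()
            data = read_fn(req.file_offset, req.length)
            self.iov[req.iov_offset:req.iov_offset + len(data)] = data
            req.result, req.done = len(data), True
            self.cq.append(req)
```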
- Metadata management
- consistent, transactional KV store (FoundationDB), covered by very sophisticated (simulation-based) testing -> reliable
- inodes and directory entries are mapped onto key-value pairs in that store (one plausible layout sketched below)
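One plausible way to map inodes and directory entries onto a transactional KV store (a guessed layout for illustration, not necessarily 3FS's real schema; the transaction interface and MemTxn stub are hypothetical). The payoff is that directory operations such as rename become single transactions:

```python
class MemTxn:
    """Stand-in for a FoundationDB-style transaction (dict-backed, illustrative)."""
    def __init__(self, store):
        self.store = store
    def get(self, key):
        return self.store[key]
    def set(self, key, value):
        self.store[key] = value
    def delete(self, key):
        del self.store[key]

# Hypothetical key layout (not 3FS's real schema):
#   ("dent", parent_ino, name) -> child_ino
#   ("inode", ino)             -> attrs (type, length, chunk layout, ...)

def rename(txn, src_parent, src_name, dst_parent, dst_name):
    """Atomic rename as one transaction: move the dirent, leave the inode alone."""
    ino = txn.get(("dent", src_parent, src_name))
    txn.delete(("dent", src_parent, src_name))
    txn.set(("dent", dst_parent, dst_name), ino)
    # With a real transactional store, either both key changes commit or neither does.

store = {("dent", 1, "temp"): 42, ("inode", 42): {"type": "dir"}}
rename(MemTxn(store), 1, "temp", 1, "latest-ckpt")
print(("dent", 1, "latest-ckpt") in store)   # True
```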
- Interesting design choices
- Open-fd ref_cnt tracking for deletion (sketch after this list)
- a reference count tracks how many fds are open on a file
- read-only fds are not counted; only writable fds are tracked
- File length is updated lazily
- Large stripe size
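A rough sketch of the open-fd reference-counting idea from the list above; all class and method names are invented, and, per the note, only writable opens are counted:

```python
class InodeGc:
    """Delay chunk reclamation until no writable fds remain (illustrative only)."""
    def __init__(self):
        self.write_refs = {}   # ino -> number of open writable fds
        self.unlinked = set()  # inodes whose last directory entry was removed

    def open_for_write(self, ino):
        self.write_refs[ino] = self.write_refs.get(ino, 0) + 1

    def close_write(self, ino):
        self.write_refs[ino] -= 1
        self._maybe_reclaim(ino)

    def unlink(self, ino):
        self.unlinked.add(ino)          # the directory entry disappears immediately
        self._maybe_reclaim(ino)

    def _maybe_reclaim(self, ino):
        if ino in self.unlinked and self.write_refs.get(ino, 0) == 0:
            self.unlinked.discard(ino)
            self.write_refs.pop(ino, None)
            # ... now safe to delete the inode record and its chunks ...
```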
- Chunk storage
- each storage node stores chunks as local files and uses RocksDB to manage the chunk metadata
- variable-size chunks, 64 KB to 64 MB, powers of 2 (sketch below)
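A tiny sketch of choosing a chunk size within the stated 64 KB to 64 MB power-of-two range; the round-up-to-the-next-power-of-two policy is my assumption for illustration:

```python
def pick_chunk_size(requested: int,
                    min_chunk: int = 64 * 1024,
                    max_chunk: int = 64 * 1024 * 1024) -> int:
    """Round the requested size up to a power of two, clamped to [64 KB, 64 MB]."""
    size = min_chunk
    while size < requested and size < max_chunk:
        size *= 2
    return size

assert pick_chunk_size(100 * 1024) == 128 * 1024
assert pick_chunk_size(1) == 64 * 1024
assert pick_chunk_size(10**9) == 64 * 1024 * 1024
```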
- CRAQ
- vanilla chain replication [OSDI'04]
- writes go to the head and are pushed node by node along the chain
- bad latency (writes traverse the whole chain), and only the tail serves reads
- CRAQ [ATC'09]
- reads can be served by any node on the chain
- a version held at a node is either 'clean' (known committed) or 'dirty' (a newer write may still be propagating)
- (not fully sure about this part; need to look into it more)
- if the newest local version is dirty, the node asks the tail which version is committed and serves that one; clean versions are answered locally (toy model below)
- the paper also describes an optional multicast optimization for propagation; open question how that fits in
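A toy model of the CRAQ read path to pin down the clean/dirty behavior above (single object, no failures, propagation and commit as explicit calls; a simplification of the ATC'09 protocol, not 3FS code):

```python
class CraqChain:
    """Toy CRAQ chain for one object: write at the head, read at any replica.

    Propagation and commit are separate steps so the dirty state is observable.
    """
    def __init__(self, length: int):
        self.nodes = [{"versions": {0: None}, "clean": 0, "latest": 0}
                      for _ in range(length)]

    def propagate(self, value) -> int:
        """Head accepts a write; every node down the chain stores it as dirty."""
        version = self.nodes[0]["latest"] + 1
        for node in self.nodes:
            node["versions"][version] = value
            node["latest"] = version
        return version

    def commit(self, version: int) -> None:
        """Tail commits; the ack flows tail -> head, marking the version clean."""
        for node in reversed(self.nodes):
            node["clean"] = max(node["clean"], version)

    def read(self, index: int):
        node = self.nodes[index]
        if node["latest"] == node["clean"]:
            return node["versions"][node["clean"]]   # clean: answer locally
        committed = self.nodes[-1]["clean"]          # dirty: ask the tail which
        return node["versions"][committed]           #   version is committed


chain = CraqChain(3)
v = chain.propagate("A")
print(chain.read(1))   # None: version v is still dirty, tail says 0 is committed
chain.commit(v)
print(chain.read(1))   # 'A': now clean everywhere, served locally
```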