3FS

Fire-Flyer AI-HPC cluster

  • 1,250 compute nodes (8 GPUs each) -> 10k GPUs
    • 1x 200Gbps NIC per node
    • ~80% of DGX-A100 performance at ~60% of the cost
  • 180 storage nodes w/ RDMA (IB)
    • 2x 200Gbps NIC
    • 16x 16TB NVMe SSDs per node (PCIe 4.0 x4), ~2.8GB/s read each
    • aggregate network bandwidth ~9TB/s, peak read throughput ~8TB/s (nearly saturated; quick arithmetic check below)
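
A quick arithmetic check (Python; the only inputs are the node counts and per-device numbers from the bullets above) of where the ~9TB/s and ~8TB/s figures come from:

```python
# 180 storage nodes, 2x 200Gbps NICs and 16x ~2.8GB/s SSDs per node.
nodes = 180
nic_gbps_per_node = 2 * 200                 # network bandwidth per node, in Gbps
ssd_gbs_per_node = 16 * 2.8                 # SSD read bandwidth per node, in GB/s

network_tbs = nodes * nic_gbps_per_node / 8 / 1000   # Gbps -> GB/s -> TB/s
ssd_read_tbs = nodes * ssd_gbs_per_node / 1000

print(round(network_tbs, 3))    # 9.0   -> ~9TB/s aggregate network bandwidth
print(round(ssd_read_tbs, 3))   # 8.064 -> ~8TB/s peak SSD read, nearly saturating the network
```
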
  • Topology
    • Two zones
      • GPU nodes and storage nodes are connected to the leaf switches
      • each zone: two-layer fat-tree of 40-port switches, 20 spine + 40 leaf, ~800 nodes
      • 2 switches interconnect the two zones -> 122 switches total; NVIDIA's recommended topology is ~11x more expensive (port/switch math sanity-checked below)
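
A tiny sanity check of the fat-tree figures above (40-port switches, 20 spine + 40 leaf per zone, two zones plus 2 inter-zone switches); the only logic is counting ports:

```python
# Each leaf switch spends one port per spine switch on uplinks;
# the remaining ports face the compute/storage nodes.
ports_per_switch = 40
spine_per_zone, leaf_per_zone = 20, 40

node_ports_per_leaf = ports_per_switch - spine_per_zone    # 40 - 20 = 20 downlinks per leaf
nodes_per_zone = leaf_per_zone * node_ports_per_leaf       # 40 * 20 = 800 nodes per zone

total_switches = 2 * (spine_per_zone + leaf_per_zone) + 2  # two zones + 2 inter-zone switches
print(nodes_per_zone, total_switches)                      # -> 800 122
```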

Fire-Flyer Filesystem

  • Their workload characteristics
    • massive numbers of small random reads (a few KB~MB each)
    • cache unfriendly
    • they may not care much about latency
  • Why aren't they using Hadoop?
    • consistency?
    • random read perf?
  • Why a FS rather than a key-value/object store?
    • atomic directory ops
      • ops that are not well supported by flat-namespace stores
      • ex: write all checkpoint files into a/b/temp, then atomically rename it to latest-ckpt (see the sketch after this list)
      • ex: removing a recursive data structure: with directories we can just remove the parent folder
    • symbolic and hard links
      • useful for accessing 'latest'
    • familiar interface
    • metadata performance -> ? maybe an overhead
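
A minimal sketch of the checkpoint-publish pattern from the example above, written against a plain POSIX filesystem; the paths and the helper name are made up for illustration, this is not a 3FS API:

```python
import os

def publish_checkpoint(ckpt_root: str, files: dict) -> str:
    """Stage every checkpoint shard in a temp directory, then expose the whole
    set with a single atomic rename (assumes 'latest-ckpt' does not exist yet)."""
    tmp_dir = os.path.join(ckpt_root, "temp")
    final_dir = os.path.join(ckpt_root, "latest-ckpt")
    os.makedirs(tmp_dir, exist_ok=True)

    for name, blob in files.items():
        with open(os.path.join(tmp_dir, name), "wb") as f:
            f.write(blob)
            f.flush()
            os.fsync(f.fileno())          # make the data durable before publishing

    # rename(2) is a single atomic metadata op: readers see either no
    # 'latest-ckpt' at all or a complete checkpoint, never a partial one.
    os.rename(tmp_dir, final_dir)
    return final_dir

publish_checkpoint("/tmp/ckpts", {"model.bin": b"\x00" * 16, "optim.bin": b"\x01" * 16})
```

This is exactly the multi-object atomicity that a flat-namespace object store can't give across many keys, which is the point of the bullet above.
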
  • Architecture
    • Cluster manager
    • Clients - FUSE-based async IO
    • Metadata - FoundationDB (a transactional KV store)
    • Data - files divided into chunks and striped across chains with chain replication (CRAQ)
    • earlier 3FS used Linux AIO, no cache
  • Problems w/ FUSE
    • Memory copy overhead
      • multiple copies (user -> kernel -> user -> ...)
    • multithreading (MT) scalability
      • no concurrent writes to the same file
      • spinlock contention on the request queue -> caps at ~400k reqs/s for 4KB reads
  • Requirements
    • Userspace
    • Zero-copy
    • Async -> submit a request and don't block waiting on it
  • User Space Ring Based IO (USRBIO)
    • App -> ring -> FS
    • IOV: a buffer memory-mapped into both the App and the FS, used for the data
    • data moves via RDMA directly between the IOV and the FS
    • polling based: the app has to poll/wait on the ring to check for completions (rough sketch below)
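
A rough, single-process sketch of the ring idea (not the real 3FS USRBIO client API; Sqe/Cqe/IoRing and their methods are invented for illustration). In the real system the ring and the IOV buffer are shared between the app and the FS client via mmap, and the data lands in the IOV over RDMA:

```python
from dataclasses import dataclass

@dataclass
class Sqe:              # submission queue entry
    offset: int         # file offset to read
    length: int         # number of bytes
    buf_off: int        # destination offset inside the shared IOV buffer

@dataclass
class Cqe:              # completion queue entry
    buf_off: int
    result: int         # bytes read (or a negative error code)

class IoRing:
    """Toy submission/completion ring: everything happens in one process here
    to stay runnable, whereas 3FS shares these structures with its FS client."""

    def __init__(self, entries: int, iov_size: int):
        self.sqes = []                      # pending submissions
        self.cqes = []                      # completions waiting to be polled
        self.entries = entries
        self.iov = bytearray(iov_size)      # stand-in for the RDMA-registered buffer

    def prep_read(self, offset: int, length: int, buf_off: int) -> None:
        assert len(self.sqes) < self.entries, "ring is full"
        self.sqes.append(Sqe(offset, length, buf_off))   # just a slot write, no syscall

    def submit(self, backing: bytes) -> None:
        # The real FS client consumes entries asynchronously and RDMAs the data
        # straight into the IOV; we complete them inline for the sketch.
        while self.sqes:
            sqe = self.sqes.pop(0)
            data = backing[sqe.offset : sqe.offset + sqe.length]
            self.iov[sqe.buf_off : sqe.buf_off + len(data)] = data
            self.cqes.append(Cqe(sqe.buf_off, len(data)))

    def poll(self) -> list:
        done, self.cqes = self.cqes, []     # the app polls instead of sleeping in a syscall
        return done

ring = IoRing(entries=64, iov_size=1 << 20)
ring.prep_read(offset=4096, length=4096, buf_off=0)
ring.submit(backing=bytes(64 * 1024))       # pretend this is a file on 3FS
print(ring.poll()[0].result)                # -> 4096; data already sits in ring.iov
```

The key property is that submitting is just writing a slot in shared memory and completion is discovered by polling, so there is no per-request syscall, no kernel copy, and no FUSE request queue in the path.
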
  • Metadata management
    • consistent, transactional KV store (FoundationDB), covered by very sophisticated testing (deterministic simulation) -> reliable
    • inodes / dir entries are mapped to keys in the metadata KV store (key-encoding sketch below)
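
A sketch of the general idea of flattening a file tree into a transactional KV store; the 'INOD'/'DENT' key prefixes and value layouts below are illustrative assumptions, not the exact 3FS/FoundationDB schema:

```python
import json
import struct

def inode_key(ino: int) -> bytes:
    # one key per inode: attributes, size, chunk layout, ...
    return b"INOD" + struct.pack(">Q", ino)

def dentry_key(parent_ino: int, name: str) -> bytes:
    # one key per directory entry: (parent inode, name) -> child inode
    return b"DENT" + struct.pack(">Q", parent_ino) + name.encode()

store = {}                                            # stand-in for the KV store

# create /ckpt (ino 2) under the root (ino 1), then a file inside it
store[dentry_key(1, "ckpt")] = struct.pack(">Q", 2)
store[inode_key(2)] = json.dumps({"type": "dir"}).encode()
store[dentry_key(2, "model.bin")] = struct.pack(">Q", 3)
store[inode_key(3)] = json.dumps({"type": "file", "length": 0}).encode()

def lookup(path: str) -> int:
    """Resolve a path by walking dentry keys, the way a metadata service would."""
    ino = 1                                           # root inode
    for comp in filter(None, path.split("/")):
        ino = struct.unpack(">Q", store[dentry_key(ino, comp)])[0]
    return ino

print(lookup("/ckpt/model.bin"))                      # -> 3
```

Because keys are ordered, a readdir is just a range scan over the parent's 'DENT' prefix, and creates/renames become ordinary KV transactions, which is presumably what backs the atomic directory ops above.
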
  • Interesting design choices
    • Opened-fd ref_cnt tracking for deletion
      • a ref count of how many fds currently have the file open (deletion is deferred until it drops to zero)
      • read-only fds don't bump the ref cnt, only writable ones do (sketch below)
    • file length is updated lazily
    • large stripe size
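
A minimal sketch (class and field names are mine, not 3FS's) of what "track a ref count only for writable opens and defer deletion" could look like:

```python
class FileState:
    def __init__(self):
        self.write_handles = 0      # only writable opens are ref-counted
        self.unlinked = False       # has the directory entry been removed?

    def open(self, writable: bool) -> None:
        if writable:
            self.write_handles += 1 # read-only opens are deliberately not tracked

    def close(self, writable: bool) -> None:
        if writable:
            self.write_handles -= 1
            self._maybe_reclaim()

    def unlink(self) -> None:
        self.unlinked = True        # drop the dentry immediately...
        self._maybe_reclaim()       # ...but keep the chunks while writers remain

    def _maybe_reclaim(self) -> None:
        if self.unlinked and self.write_handles == 0:
            print("reclaiming chunks")   # hand the chunk IDs to a garbage collector

f = FileState()
f.open(writable=True)
f.unlink()                  # nothing reclaimed yet: a writer is still active
f.close(writable=True)      # -> "reclaiming chunks"
```

Not tracking read-only fds keeps the dominant read path cheap; presumably the tradeoff is that a reader of a just-deleted file can lose access.
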
  • Chunk storage
    • each storage node stores chunks as files and uses RocksDB to manage the chunk metadata
    • variable-size chunks, 64KB~64MB, powers of 2 (offset -> chunk mapping sketched below)
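
A sketch of mapping a file offset to a chunk and a replication chain; the power-of-2 size bounds come from the bullet above, while the round-robin striping rule is my assumption:

```python
def locate(offset: int, chunk_size: int, stripe: int):
    """Return (chunk index, offset within chunk, chain index) for a file offset."""
    assert chunk_size & (chunk_size - 1) == 0, "chunk size must be a power of 2"
    assert 64 * 1024 <= chunk_size <= 64 * 1024 * 1024
    chunk_idx = offset // chunk_size          # which chunk of the file
    chunk_off = offset % chunk_size           # where inside that chunk
    chain_idx = chunk_idx % stripe            # which chain in the file's stripe (assumed round-robin)
    return chunk_idx, chunk_off, chain_idx

# a read at offset 1 GiB in a file with 1 MiB chunks striped over 16 chains
print(locate(offset=1 << 30, chunk_size=1 << 20, stripe=16))  # -> (1024, 0, 0)
```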
  • CRAQ
    • vanilla chain replication [OSDI'04]
      • writes go to the head and are pushed along the chain; reads are served only by the tail
      • bad latency: a write must traverse the entire chain before it is acknowledged
    • CRAQ [ATC'09]
      • reads can happen at any node on the chain (read path sketched below)
      • each node's copy of the data may be in a 'dirty' or 'clean' state
      • still fuzzy on this, need to dig into it more
      • dirty -> the node asks the tail for the last committed version and serves that (per the CRAQ paper)
      • multicast optimization? potential problems?
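
A toy version of the CRAQ read path as the ATC'09 paper describes it (clean reads served locally, dirty reads resolved via a version query to the tail); this is not the 3FS implementation, and the class/method names are invented:

```python
class Replica:
    """Toy CRAQ replica for a single object: keeps multiple versions and
    tracks the highest committed version it knows about."""

    def __init__(self, tail=None):
        self.tail = tail or self      # the tail replica answers version queries
        self.versions = {}            # version number -> value
        self.committed = 0            # highest version known to be committed
        self.latest = 0               # highest version stored (possibly dirty)

    def apply_write(self, version: int, value: bytes) -> None:
        # Called as a write propagates down the chain from the head.
        self.versions[version] = value
        self.latest = version

    def ack_commit(self, version: int) -> None:
        # Called as the tail's ack propagates back up: the version is now clean.
        self.committed = max(self.committed, version)

    def read(self) -> bytes:
        if self.latest == self.committed:     # clean -> serve the local copy
            return self.versions[self.latest]
        # dirty -> ask the tail for the last committed version and serve that one
        return self.versions[self.tail.committed]

tail = Replica()
mid = Replica(tail=tail)

mid.apply_write(1, b"ckpt-v1")   # write passing through this replica (dirty here)
tail.apply_write(1, b"ckpt-v1")  # write reaches the tail...
tail.ack_commit(1)               # ...which commits it and starts acking up the chain

print(mid.read())                # dirty at mid -> consults tail -> b'ckpt-v1'
mid.ack_commit(1)                # the ack arrives; mid's copy is clean now
print(mid.read())                # served locally without contacting the tail
```

Spreading reads across the whole chain instead of only the tail is what makes CRAQ a good fit for the read-heavy workload described above.
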