NanoSLG
v0.5 · Minimal Multi-Mode Parallel LLM Inference Server
NanoSLG is a from-scratch inference server built to run large language models efficiently across multiple GPUs. It supports three parallelism strategies (pipeline, tensor, and hybrid TP+PP), automatic GPU detection for KV cache backend selection, radix prefix caching, native GQA, and an OpenAI-compatible API. It is designed as both a production-viable server and an educational reference for understanding multi-GPU inference internals.
Features
Dual KV cache backend
Auto-detects GPU SM version and selects FlashInfer paged attention (SM80+: L4, A100, H100) or contiguous SDPA (SM75: T4) with zero configuration.
Pipeline parallelism
Splits model layers across GPUs sequentially. Lower memory per GPU, good for memory-constrained setups.
Tensor parallelism
Shards weights across GPUs for parallel computation via NCCL all-reduce. Lowest latency mode.
Hybrid TP+PP
Combines tensor and pipeline parallelism for maximum scalability across 4+ GPU configurations.
Radix prefix caching
Reuses KV cache across requests with shared prefixes on FlashInfer backend. Copy-on-write sharing.
Native GQA
Uses enable_gqa=True on PyTorch 2.5+, eliminating repeat_interleave overhead for grouped query attention.
OpenAI-compatible API
Drop-in replacement for /v1/chat/completions with SSE streaming support.
Built-in benchmarking
TTFT, tokens/sec, memory tracking, and concurrent load testing out of the box.
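The native GQA feature above can be illustrated with a short sketch. The shapes below are illustrative assumptions (8 query heads sharing 2 KV heads); the `enable_gqa` flag is a real parameter of `torch.nn.functional.scaled_dot_product_attention` on PyTorch 2.5+, and the fallback shows the `repeat_interleave` it eliminates:

```python
import torch
import torch.nn.functional as F

# Assumed shapes: 8 query heads sharing 2 KV heads (GQA group size 4).
B, Hq, Hkv, T, D = 1, 8, 2, 16, 64
q = torch.randn(B, Hq, T, D)
k = torch.randn(B, Hkv, T, D)
v = torch.randn(B, Hkv, T, D)

try:
    # PyTorch 2.5+: SDPA broadcasts the KV heads across each query group
    # internally, so k/v never have to be materialized at Hq heads.
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True, enable_gqa=True)
except TypeError:
    # Pre-2.5 fallback: repeat each KV head up to the query head count.
    k = k.repeat_interleave(Hq // Hkv, dim=1)
    v = v.repeat_interleave(Hq // Hkv, dim=1)
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print(out.shape)  # torch.Size([1, 8, 16, 64])
```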
Performance
Architecture
GPU Detection (get_sm_version) routes to either FlashInfer Paged KV Cache (SM80+) or Contiguous SDPA KV Cache (SM75/fallback). Both backends implement a shared CacheContext interface exposing attend(), get_position(), and get_start_pos() — making the model layer code backend-agnostic.
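A minimal sketch of that interface and the SM-based routing. The method names come from the README; the signatures and the `pick_backend` helper are assumptions for illustration:

```python
from typing import Any, Protocol

class CacheContext(Protocol):
    """Interface shared by both KV cache backends (method names from the
    README; the signatures here are illustrative assumptions)."""

    def attend(self, layer_idx: int, q: Any, k: Any, v: Any) -> Any:
        """Write k/v into this layer's cache, then attend q over it."""
        ...

    def get_position(self) -> int:
        """Current decode position (number of cached tokens)."""
        ...

    def get_start_pos(self) -> int:
        """Start offset of the uncached suffix (e.g. after a prefix-cache hit)."""
        ...

def sm_version(major: int, minor: int) -> int:
    # torch.cuda.get_device_capability() returns (major, minor): A100 -> (8, 0)
    return major * 10 + minor

def pick_backend(sm: int) -> str:
    # SM80+ (A100, H100, L4) -> FlashInfer paged KV; SM75 (T4) -> contiguous SDPA
    return "flashinfer" if sm >= 80 else "sdpa"

print(pick_backend(sm_version(8, 0)))   # flashinfer
print(pick_backend(sm_version(7, 5)))   # sdpa
```

Because the model layers only ever call the `CacheContext` methods, swapping backends never touches layer code.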
The server runs FastAPI with async I/O, communicating with GPU worker processes via mp.Queue. Each GPU worker holds its shard of model weights and a local KV cache instance. For tensor parallelism, NCCL handles all-reduce across shards. For pipeline parallelism, activations flow sequentially across GPU stages.
Stack
Quick start
git clone https://github.com/Guney-olu/nanoslg
cd nanoslg
pip install "torch>=2.0" transformers safetensors fastapi uvicorn requests
# Tensor Parallel - 2 GPUs
python -m nanoslg --model /path/to/model --mode tensor --tp-size 2
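Once the server is up, a request can be built like any OpenAI chat completion. The host/port below are assumptions (the README does not state a default; adjust to your server flags), and `"model"` is a placeholder name:

```python
import json
from urllib import request

# Assumed endpoint; change host/port to match your server configuration.
payload = {
    "model": "local",
    "messages": [{"role": "user", "content": "Say hi in one word."}],
    "stream": False,
}
req = request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Uncomment once the server is running:
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Setting `"stream": true` instead returns SSE chunks, matching the OpenAI streaming format.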