
NanoSLG

v0.5

Minimal Multi-Mode Parallel LLM Inference Server

research-preview · inference

NanoSLG is a from-scratch inference server built to run large language models efficiently across multiple GPUs. It supports three parallelism strategies (pipeline, tensor, and hybrid TP+PP), automatic GPU detection for KV cache backend selection, radix prefix caching, native GQA, and an OpenAI-compatible API. Designed as both a production-viable server and an educational reference for understanding multi-GPU inference internals.


Features

Dual KV cache backend

Auto-detects GPU SM version and selects FlashInfer paged attention (SM80+: L4, A100, H100) or contiguous SDPA (SM75: T4) with zero configuration.
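The routing decision reduces to a comparison on the compute capability. A minimal sketch of that logic, with a hypothetical helper name (the source names the actual function `get_sm_version`; in practice the major/minor pair would come from `torch.cuda.get_device_capability()`):

```python
def select_backend(major: int, minor: int) -> str:
    """Pick a KV cache backend from the CUDA compute capability.

    SM80+ parts (L4, A100, H100) take the FlashInfer paged-attention
    path; older parts such as SM75 (T4) fall back to contiguous SDPA.
    """
    sm = major * 10 + minor
    return "flashinfer_paged" if sm >= 80 else "contiguous_sdpa"
```

This is why the selection needs zero configuration: the capability is queryable at startup, before any weights are loaded.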

Pipeline parallelism

Splits model layers across GPUs sequentially. Lower memory per GPU, good for memory-constrained setups.
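The layer split itself is a small piece of arithmetic: assign each GPU a contiguous run of transformer blocks. A sketch with a hypothetical helper (not NanoSLG's actual partitioner, which may balance stages differently):

```python
def partition_layers(n_layers: int, n_stages: int) -> list[range]:
    """Split n_layers transformer blocks into contiguous per-GPU stages.

    When the split is uneven, earlier stages absorb the extra layers.
    """
    base, rem = divmod(n_layers, n_stages)
    ranges, start = [], 0
    for stage in range(n_stages):
        size = base + (1 if stage < rem else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges
```

For a 32-layer model on 2 GPUs this yields layers 0-15 on stage 0 and 16-31 on stage 1, so each GPU holds roughly half the weights.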

Tensor parallelism

Shards weights across GPUs for parallel computation via NCCL all-reduce. Lowest latency mode.
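The all-reduce step can be illustrated without GPUs: in a row-parallel linear layer each rank multiplies its slice of the activations by its weight shard, and summing the per-rank partial outputs is exactly what NCCL's all-reduce does. A pure-Python sketch (function names illustrative, nested lists standing in for tensors):

```python
def matmul(x, w):
    """Naive (m,k) x (k,n) matrix multiply over nested lists."""
    return [[sum(x[i][t] * w[t][j] for t in range(len(w)))
             for j in range(len(w[0]))] for i in range(len(x))]

def row_parallel(x, w, tp):
    """Row-parallel linear: shard w along its input (row) dimension.

    Each "rank" computes a partial output; the final summation across
    partials models the NCCL all-reduce.
    """
    step = len(w) // tp
    partials = []
    for r in range(tp):                       # one iteration per rank
        x_shard = [row[r * step:(r + 1) * step] for row in x]
        w_shard = w[r * step:(r + 1) * step]
        partials.append(matmul(x_shard, w_shard))
    # all-reduce(sum) across ranks
    return [[sum(p[i][j] for p in partials) for j in range(len(partials[0][0]))]
            for i in range(len(partials[0]))]
```

Because the reduction happens once per sharded layer rather than once per layer stage, tensor parallelism keeps every GPU busy on every token, which is why it is the lowest-latency mode.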

Hybrid TP+PP

Combines tensor and pipeline parallelism for maximum scalability across 4+ GPU configurations.
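A common way to organize hybrid runs is a (pipeline stage, tensor shard) grid over the global GPU ranks. A minimal sketch assuming stage-major ordering (this layout is an illustration, not necessarily NanoSLG's actual rank assignment):

```python
def rank_coords(global_rank: int, tp_size: int) -> tuple[int, int]:
    """Map a global GPU rank to (pipeline_stage, tensor_shard).

    Ranks sharing a stage form a TP group and all-reduce together;
    matching shards in adjacent stages exchange activations.
    """
    return divmod(global_rank, tp_size)
```

With 4 GPUs and tp_size=2 this gives two pipeline stages of two tensor shards each, which is the smallest configuration where the hybrid mode pays off.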

Radix prefix caching

Reuses KV cache across requests with shared prefixes on FlashInfer backend. Copy-on-write sharing.
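The core data structure is a trie keyed on token IDs: a new request walks the trie to find the longest already-cached prefix and only computes KV entries for the remainder. A toy sketch with one token per edge (a real radix tree compresses runs of tokens into single edges, and the class names here are illustrative):

```python
class RadixNode:
    def __init__(self):
        self.children = {}   # token id -> RadixNode
        self.block = None    # handle to a shared KV block (copy-on-write)

class RadixPrefixCache:
    def __init__(self):
        self.root = RadixNode()

    def insert(self, tokens):
        """Record a request's token sequence in the trie."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, RadixNode())

    def match_prefix(self, tokens):
        """Length of the longest cached prefix of `tokens`."""
        node, n = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            n += 1
        return n
```

Copy-on-write means the matched KV blocks are shared between requests until one of them needs to extend past the shared prefix, at which point only the diverging tail is copied.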

Native GQA

Uses enable_gqa=True on PyTorch 2.5+, eliminating repeat_interleave overhead for grouped query attention.
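The grouping that `enable_gqa=True` performs inside `scaled_dot_product_attention` is simple index arithmetic: consecutive query heads share one KV head. A sketch of that mapping (helper name illustrative):

```python
def kv_head_for(q_head: int, n_q_heads: int, n_kv_heads: int) -> int:
    """Which KV head serves a given query head under GQA.

    PyTorch 2.5+ performs this grouping inside SDPA when
    enable_gqa=True; older versions had to materialize it by
    repeat_interleave-ing K/V up to n_q_heads, costing memory
    bandwidth on every attention call.
    """
    assert n_q_heads % n_kv_heads == 0
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size
```

For Llama-3.1-8B (32 query heads, 8 KV heads) each KV head serves a group of 4 query heads, so skipping the materialized repeat shrinks the K/V tensors passed to attention by 4x.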

OpenAI-compatible API

Drop-in replacement for /v1/chat/completions with SSE streaming support.
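A client talks to the server with the standard OpenAI request shape. A minimal sketch of building that body (the model name and port below are placeholders, not values the server mandates):

```python
import json

def chat_request(model: str, user_msg: str, stream: bool = True) -> dict:
    """Build an OpenAI-style /v1/chat/completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
        "stream": stream,  # stream=True yields SSE "data: ..." chunks
    }

# POST this as JSON to e.g. http://localhost:8000/v1/chat/completions
body = json.dumps(chat_request("llama-3.1-8b", "Hello!"))
```

Because the shape matches the OpenAI API, existing SDKs only need their base URL pointed at the server.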

Built-in benchmarking

TTFT, tokens/sec, memory tracking, and concurrent load testing out of the box.
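The two headline metrics reduce to timestamp arithmetic over the token stream; a minimal sketch (function names illustrative, timestamps in seconds):

```python
def ttft_ms(request_start: float, first_token_time: float) -> float:
    """Time to first token, in milliseconds (prefill latency)."""
    return (first_token_time - request_start) * 1000.0

def decode_tok_per_s(n_tokens: int, first_token_time: float,
                     last_token_time: float) -> float:
    """Steady-state decode throughput, excluding the first token."""
    return (n_tokens - 1) / (last_token_time - first_token_time)
```

Separating TTFT from decode throughput matters because the two are bounded by different things: prefill is compute-bound, while decode is typically memory-bandwidth-bound.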

Performance

Single request throughput: 21.8 tok/s (Llama-3.1-8B FP16 on 2x L4, tensor parallel)
Batch x4 throughput: 76.0 tok/s (FlashInfer backend, +528% vs v0.4)
TTFT, single request: 52 ms (FlashInfer backend, -51% vs v0.4)
Sustained throughput: 37.4 tok/s (under continuous load)
Burst x16 throughput: 48.6 tok/s (16 concurrent requests)
v0.4 to v0.5 improvement: 3-5x (across all benchmarked scenarios)

Architecture

GPU Detection (get_sm_version) routes to either FlashInfer Paged KV Cache (SM80+) or Contiguous SDPA KV Cache (SM75/fallback). Both backends implement a shared CacheContext interface exposing attend(), get_position(), and get_start_pos() — making the model layer code backend-agnostic.
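The shared interface can be sketched as an abstract base class. The method names come from the description above; the signatures and docstrings are illustrative, not NanoSLG's exact API:

```python
from abc import ABC, abstractmethod

class CacheContext(ABC):
    """Backend-agnostic KV cache interface.

    Both the FlashInfer paged backend and the contiguous SDPA backend
    implement these three methods, so model layer code never branches
    on which backend is active.
    """

    @abstractmethod
    def attend(self, layer_idx, q, k, v):
        """Append k/v to this layer's cache and return attention output."""

    @abstractmethod
    def get_position(self):
        """Current absolute token position (e.g. for rotary embeddings)."""

    @abstractmethod
    def get_start_pos(self):
        """Offset where this step's new tokens begin in the sequence."""
```

Keeping paging details behind `attend()` is what lets one model implementation serve both cache layouts.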

The server runs FastAPI with async I/O, communicating with GPU worker processes via mp.Queue. Each GPU worker holds its shard of model weights and a local KV cache instance. For tensor parallelism, NCCL handles all-reduce across shards. For pipeline parallelism, activations flow sequentially across GPU stages.
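The worker side of that queue protocol is a simple receive/compute/reply loop. A sketch under assumed message shapes (the tuple format and sentinel convention here are illustrative, not NanoSLG's actual wire format):

```python
def worker_loop(in_q, out_q):
    """Per-GPU worker: pull requests, run this shard, push results.

    A real worker would hold its weight shard and a local KV cache and
    run the forward pass where the placeholder comment sits.
    """
    while True:
        msg = in_q.get()
        if msg is None:              # shutdown sentinel from the server
            break
        req_id, prompt_tokens = msg
        # ... forward pass on this GPU's shard would happen here ...
        out_q.put((req_id, f"echo:{len(prompt_tokens)} tokens"))
```

On the server side, each GPU gets an `mp.Process(target=worker_loop, args=(in_q, out_q))` with its own `mp.Queue` pair, and the async FastAPI handlers bridge request IDs back to waiting HTTP connections.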

Stack

Python 3.10+ · PyTorch 2.0+ · CUDA / NCCL · FlashInfer · FastAPI · Safetensors

Quick start

git clone https://github.com/Guney-olu/nanoslg
cd nanoslg
pip install "torch>=2.0" transformers safetensors fastapi uvicorn requests

# Tensor Parallel - 2 GPUs
python -m nanoslg --model /path/to/model --mode tensor --tp-size 2