Projects
Open source infrastructure and tooling built at tiny compute research.
Minimal Multi-Mode Parallel LLM Inference Server
Lightweight LLM inference server with pipeline parallelism, tensor parallelism, hybrid TP+PP, and a dual-backend KV cache that auto-selects FlashInfer or contiguous SDPA based on GPU architecture.
Categoryinference
StackPython 3.10+, PyTorch 2.0+, CUDA / NCCL, FlashInfer
Single request throughput21.8 tok/s
Batch x4 throughput76.0 tok/s
TTFT (single)52ms
AgentGuard
research-previewLocal Sidecar for Real-Time AI Agent Security
Local sidecar for real-time AI agent security. Monitors agent tool calls via Mamba-2 SSM, detects prompt injection, and blocks attacks before they execute.
Categorysecurity
StackPython, PyTorch, Transformers, Mamba-2
Classify latency~187ms
Memory complexityO(1)
Model size2.8B