AgentGuard 2.8B — Local AI agent security via Mamba-2. Detects prompt injection in real time.

Training competitive AI models
on hardware I can actually afford.

Independent research lab focused on efficient model architectures, knowledge distillation, and multilingual AI. Every model trained on our own GPU cluster. Every result reproducible. Everything open source.

5
models released
240
GPU hours
10
GPUs in cluster

Models

AgentGuard-2.8B
Mamba-2 SSM fine-tuned to detect prompt injection, exfiltration, and tool-call hijacking in AI agent sessions. Runs as a local sidecar with O(1) memory — monitors arbitrarily long agent trajectories without truncation.
2.8B params
CBD-LLM-PoC-V1
Hybrid diffusion architecture enabling block-parallel text generation while retaining standard causal attention and KV caching.
1.2B params · 2x A100 · 72h training · 120ms latency
Qwen3-0.6B-Tool-Router
Lightweight tool-call router for agentic systems. 29.2% overall accuracy on BFCL, vs an industry baseline of 23.93% for 600M-param models.
600M params · 2x T4 · 48h training · 45ms latency
IND-QWENTTS-V1
Multilingual TTS covering 2 Indian languages. MOS 3.8/5.0. Cross-lingual transfer from high-resource anchors. Edge-deployable.
500M params · 1x A100 · 72h training · 200ms latency
STRM-4B-v1
LoRA fine-tune of Qwen3-4B for parsing unstructured spoken-language input into structured JSON. Maintains running state to handle corrections, cancellations, and quantity changes in a single forward pass. ~94% exact-match accuracy.
4B params · 1x A100 · 48h training · 80ms latency
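The running-state behavior described for STRM-4B can be illustrated with a toy state machine. This is a sketch of the idea, not the model: the op names (`add`, `set_qty`, `cancel`) are hypothetical stand-ins for whatever JSON schema the fine-tune actually emits.

```python
# Toy illustration of a running order state (hypothetical schema, not the
# actual STRM-4B output format): each parsed utterance becomes a structured
# op that mutates persistent state, so corrections and cancellations modify
# earlier items instead of appending new ones.

class OrderState:
    def __init__(self):
        self.items = {}  # item name -> quantity

    def apply(self, op: dict) -> dict:
        """Apply one structured op and return the current order."""
        kind, item = op["op"], op["item"]
        if kind == "add":
            self.items[item] = self.items.get(item, 0) + op["qty"]
        elif kind == "set_qty":   # "actually, make that three"
            self.items[item] = op["qty"]
        elif kind == "cancel":    # "cancel the fries"
            self.items.pop(item, None)
        return dict(self.items)

state = OrderState()
state.apply({"op": "add", "item": "burger", "qty": 2})
state.apply({"op": "set_qty", "item": "burger", "qty": 3})   # correction
state.apply({"op": "add", "item": "fries", "qty": 1})
print(state.apply({"op": "cancel", "item": "fries"}))  # {'burger': 3}
```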

View all models with full benchmarks →

Research focus

The question driving this lab: how far can you push a small model with the right training recipe? Each project below started as a hypothesis about efficiency and ended with a shipped, open-source model.

Parallel decoding without bidirectional attention

CBD-LLM (Causal Block Diffusion) is a hybrid architecture that enables block-parallel text generation using standard causal attention. It preserves KV-caching and pretrained AR weights while combining block-wise variable noise with topological token reordering. Result: parallel decoding speed without retraining from scratch. 91.3% performance retention at 10:1 compression, trained on 2x A100.

Sub-50ms tool routing at 600M parameters

The industry standard for agentic tool routing is a 7B+ model. I built a 600M-parameter router that achieves 90.89% relevance detection at 45ms latency by decoupling tool selection from tool execution. 12x smaller. 8x faster. Deployable in latency-sensitive production systems.
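A minimal sketch of the decoupling idea: the router only decides which tool (if any) is relevant, while argument filling and execution happen in a separate, heavier step. The keyword scorer below is a stand-in for the actual 600M model, and all names are illustrative, not the Qwen3-0.6B-Tool-Router API.

```python
# Sketch: decouple tool *selection* (cheap, latency-critical) from tool
# *execution* (heavier, runs only after a tool is chosen). A real router
# scores (query, tool description) pairs with a small model; here a toy
# keyword overlap stands in for that scorer.

from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Tool:
    name: str
    description: str
    run: Callable[[str], str] = field(default=lambda arg: "")

def route(query: str, tools: list[Tool]) -> Optional[Tool]:
    """Return the most relevant tool, or None to abstain (relevance detection)."""
    q = query.lower()
    def score(t: Tool) -> int:
        return sum(w in q for w in t.description.lower().split())
    best = max(tools, key=score)
    return best if score(best) > 0 else None

tools = [
    Tool("weather", "current weather forecast temperature"),
    Tool("calc", "arithmetic math calculator"),
]

print(route("what is the weather in Pune", tools).name)  # weather
print(route("tell me a joke", tools))                    # None (abstain)
```

Because the router never executes anything itself, it can run in-line on the request path while execution is dispatched asynchronously.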

Cross-lingual TTS for underserved languages

Multilingual text-to-speech for Indian languages using cross-lingual transfer from English/Hindi anchors. MOS 3.8/5.0. Real-time synthesis factor 0.8x, optimized for edge and mobile deployment. The insight: a small amount of high-quality anchor data can bootstrap synthesis quality across related languages.

Open source tooling

Infrastructure we build and use internally, released as standalone projects.

NanoSLG
v0.5
Lightweight LLM inference server with pipeline parallelism, tensor parallelism, hybrid TP+PP, and a dual-backend KV cache that auto-selects FlashInfer or contiguous SDPA based on GPU architecture.
inference · research-preview · 76.0 tok/s batch throughput
AgentGuard
Local sidecar for real-time AI agent security. Monitors agent tool calls via Mamba-2 SSM, detects prompt injection, and blocks attacks before they execute.
security · research-preview · O(1) memory footprint
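The backend auto-selection in NanoSLG's dual-backend KV cache can be sketched as a simple dispatch on GPU compute capability. The SM80 threshold and function name here are assumptions for illustration, not the project's actual logic.

```python
# Sketch of dual-backend KV cache selection (illustrative, not NanoSLG's
# real rule): prefer FlashInfer kernels on GPUs with a sufficiently new
# compute capability, otherwise fall back to contiguous SDPA.

def pick_kv_backend(compute_capability: tuple[int, int]) -> str:
    major, minor = compute_capability
    # Assumption: FlashInfer kernels target SM80+ (Ampere and newer).
    if (major, minor) >= (8, 0):
        return "flashinfer"
    return "contiguous_sdpa"

print(pick_kv_backend((8, 0)))  # A100 -> flashinfer
print(pick_kv_backend((7, 5)))  # T4   -> contiguous_sdpa
```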

View all projects →

Infrastructure

Dedicated GPU cluster for training and inference.

5x
NVIDIA A100
80GB
5x
NVIDIA T4
16GB

Total: 480GB VRAM across 10 GPUs. Training stack: PyTorch, DeepSpeed ZeRO Stage 2, mixed precision, gradient checkpointing. Every training run logged with exact GPU hours and hyperparameters.
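For reference, a DeepSpeed ZeRO Stage 2 config of the kind described above might look like the following. The batch sizes, precision choice, and bucket size are illustrative placeholders, not this lab's actual run settings.

```json
{
  "train_micro_batch_size_per_gpu": 8,
  "gradient_accumulation_steps": 4,
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": 5e8
  },
  "gradient_clipping": 1.0
}
```

Gradient checkpointing is typically enabled model-side (e.g. per-layer activation checkpointing in the training code) rather than in this JSON.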

Approach

Reproducibility first

Every model ships with full training configs and hardware specs. If you have access to similar hardware, you can verify any result on this site.

Efficiency as methodology

Constraints force better architecture decisions. Data curation over data scale. Surgical fine-tuning over brute-force pretraining. Every GPU hour earns its place.

Open by default

Weights, code, training logs. The field benefits more from transparent research than from closed papers citing massive clusters.

Writing

Notes on training, architecture decisions, and things learned the hard way.

Coming soon.

About

I'm Aryan. I run this lab independently because the most interesting problems in ML right now are efficiency problems, and those are best studied with focused compute rather than unlimited budgets. My background spans model architecture design, distributed training systems, and production ML deployment.

The work here is driven by a specific conviction: that the next meaningful advances in AI will come from researchers who understand their hardware intimately and optimize at every layer of the stack. Scaling laws describe one axis of progress. This lab explores the other axis.

Open to research collaborations, consulting on efficient ML systems, and technical discussions.

aryanpurohit1995@gmail.com