Training competitive AI models
on hardware I can actually afford.
Independent research lab focused on efficient model architectures, knowledge distillation, and multilingual AI. Every model trained on our own GPU cluster. Every result reproducible. Everything open source.
Models
Research focus
The question driving this lab: how far can you push a small model with the right training recipe? Each project below started as a hypothesis about efficiency and ended with a shipped, open-source model.
Parallel decoding without bidirectional attention
CBD-LLM (Causal Block Diffusion) is a hybrid architecture that enables block-parallel text generation using standard causal attention. It preserves KV-caching and pretrained autoregressive (AR) weights while combining block-wise variable noise with topological token reordering. Result: parallel decoding speed without retraining from scratch. 91.3% performance retention at 10:1 compression, trained on 2x A100 GPUs.
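As a rough illustration of what "block-wise variable noise" can mean, the sketch below splits a token sequence into fixed-size blocks and corrupts each block at its own noise level. This is one possible reading, not the CBD-LLM training recipe: the uniform noise schedule, the `<mask>` placeholder, and the `blockwise_noise` name are all assumptions made for the example, and topological token reordering is not covered.

```python
import random

MASK = "<mask>"  # hypothetical placeholder for a noised token


def blockwise_noise(tokens, block_size, rng):
    """Split tokens into fixed-size blocks and corrupt each block at its own noise level."""
    noised, levels = [], []
    for start in range(0, len(tokens), block_size):
        block = tokens[start:start + block_size]
        # Assumption: one noise level per block, drawn uniformly from [0, 1).
        level = rng.random()
        levels.append(level)
        # Each token in the block is masked independently with probability `level`.
        noised.extend(MASK if rng.random() < level else tok for tok in block)
    return noised, levels


tokens = "the quick brown fox jumps over the lazy dog today".split()
noised, levels = blockwise_noise(tokens, block_size=4, rng=random.Random(0))
print(noised)
print([round(level, 2) for level in levels])
```

Because the noise level varies per block rather than per sequence, a single training example exposes the model to lightly and heavily corrupted regions side by side.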
Sub-50ms tool routing at 600M parameters
The industry standard for agentic tool routing is a 7B+ parameter model. I built a 600M-parameter router that achieves 90.89% relevance detection at 45ms latency by decoupling tool selection from tool execution. 12x smaller. 8x faster. Deployable in latency-sensitive production systems.
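The decoupling idea can be sketched in a few lines: one function only decides which tool (if any) is relevant, and a separate function runs it. The keyword-overlap scorer here is a deliberately trivial stand-in for the learned router, and every name (`Tool`, `route`, `execute`, `TOOLS`, the threshold) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Tool:
    name: str
    keywords: frozenset          # stand-in features for a learned scorer
    run: Callable[[str], str]    # execution lives here, never inside the router


def route(query: str, tools: Dict[str, Tool], threshold: float = 0.5) -> Optional[str]:
    """Selection only: return the best tool name, or None if no tool is relevant."""
    words = set(query.lower().split())
    best_name, best_score = None, 0.0
    for tool in tools.values():
        # Fraction of the tool's keywords that appear in the query.
        score = len(words & tool.keywords) / len(tool.keywords)
        if score > best_score:
            best_name, best_score = tool.name, score
    return best_name if best_score >= threshold else None


def execute(name: Optional[str], query: str, tools: Dict[str, Tool]) -> str:
    """Execution only: runs after, and independently of, the routing decision."""
    if name is None:
        return "no tool needed"
    return tools[name].run(query)


TOOLS = {
    "weather": Tool("weather", frozenset({"weather", "forecast"}), lambda q: "weather tool called"),
    "calculator": Tool("calculator", frozenset({"add", "sum"}), lambda q: "calculator tool called"),
}

query = "what is the weather forecast for pune"
choice = route(query, TOOLS)
print(choice, "->", execute(choice, query, TOOLS))
print(route("tell me a joke", TOOLS))
```

Keeping `route` free of side effects means the selection step can be served by a small, fast model while execution stays wherever the tools already live.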
Cross-lingual TTS for underserved languages
Multilingual text-to-speech for Indian languages using cross-lingual transfer from English/Hindi anchors. Mean opinion score (MOS) 3.8/5.0. Real-time factor (RTF) 0.8x, optimized for edge and mobile deployment. The insight: a small amount of high-quality anchor data can bootstrap synthesis quality across related languages.
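Assuming the real-time figure quoted above follows the usual definition (time spent synthesizing divided by the duration of the audio produced), a value below 1.0 means speech is generated faster than it plays back. The helper below shows the arithmetic; the 4-second and 5-second inputs are illustrative, not measurements from this project.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return synthesis time divided by audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Illustrative numbers: 4 s of compute to produce 5 s of speech.
rtf = real_time_factor(synthesis_seconds=4.0, audio_seconds=5.0)
print(rtf)
```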
Open source tooling
Infrastructure we build and use internally, released as standalone projects.
Infrastructure
Dedicated GPU cluster for training and inference.
Total: 480 GB VRAM across 10 GPUs. Training stack: PyTorch, DeepSpeed ZeRO Stage 2, mixed precision, gradient checkpointing. Every training run is logged with exact GPU hours and hyperparameters.
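For readers unfamiliar with the stack, a DeepSpeed configuration for ZeRO Stage 2 with mixed precision might look roughly like the dictionary below. The batch sizes, clipping value, and the choice of bf16 over fp16 are placeholders and assumptions, not the lab's actual settings. Gradient checkpointing is normally switched on in the model code rather than in this config, so it is not shown.

```python
import json

# Hypothetical values throughout; only "stage": 2 comes from the page itself.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,   # placeholder
    "gradient_accumulation_steps": 8,      # placeholder
    "gradient_clipping": 1.0,              # placeholder
    "bf16": {"enabled": True},             # assumption: the page only says "mixed precision"
    "zero_optimization": {
        "stage": 2,                        # shard optimizer state and gradients across GPUs
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}


def effective_batch_size(config: dict, num_gpus: int) -> int:
    """Global batch size = micro batch per GPU x accumulation steps x GPU count."""
    return (
        config["train_micro_batch_size_per_gpu"]
        * config["gradient_accumulation_steps"]
        * num_gpus
    )


print(json.dumps(ds_config, indent=2))
print(effective_batch_size(ds_config, num_gpus=10))
```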
Approach
Reproducibility first
Every model ships with full training configs and hardware specs. If you have access to similar hardware, you can verify any result on this site.
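A minimal sketch of what a per-run record could contain is shown below, assuming GPU hours are counted as GPU count times wall-clock hours. The `RunRecord` class, its field names, and the example values are hypothetical and do not describe the lab's actual logging format.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class RunRecord:
    run_name: str
    num_gpus: int
    wall_clock_hours: float
    hyperparameters: dict = field(default_factory=dict)

    @property
    def gpu_hours(self) -> float:
        # Assumption: GPU hours = number of GPUs x wall-clock hours.
        return self.num_gpus * self.wall_clock_hours

    def to_json(self) -> str:
        """Serialize the record, including the derived GPU-hour total."""
        payload = asdict(self)
        payload["gpu_hours"] = self.gpu_hours
        return json.dumps(payload, indent=2, sort_keys=True)


# Illustrative values only.
record = RunRecord(
    run_name="example-run",
    num_gpus=2,
    wall_clock_hours=12.5,
    hyperparameters={"learning_rate": 3e-4, "seed": 0},
)
print(record.to_json())
```

Storing the seed and hardware count next to the hyperparameters is what lets someone with similar hardware attempt the same run.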
Efficiency as methodology
Constraints force better architecture decisions. Data curation over data scale. Surgical fine-tuning over brute-force pretraining. Every GPU hour earns its place.
Open by default
Weights, code, training logs. The field benefits more from transparent research than from closed papers citing massive clusters.
Writing
Notes on training, architecture decisions, and things learned the hard way.
Coming soon.
About
I'm Aryan. I run this lab independently because the most interesting problems in ML right now are efficiency problems, and those are best studied with focused compute rather than unlimited budgets. My background spans model architecture design, distributed training systems, and production ML deployment.
The work here is driven by a specific conviction: that the next meaningful advances in AI will come from researchers who understand their hardware intimately and optimize at every layer of the stack. Scaling laws describe one axis of progress. This lab explores the other: efficiency.
Open to research collaborations, consulting on efficient ML systems, and technical discussions.