NanoVLA-Flow
v1.0
Parameter-Efficient Continuous Vision-Language-Action Model
NanoVLA-Flow addresses catastrophic forgetting when fine-tuning Vision-Language Models into Vision-Language-Action architectures. It demonstrates that a 2.6B parameter backbone can retain general intelligence while functioning as a continuous trajectory prediction engine through LoRA-based parameter-efficient fine-tuning. An 80M ActionExpert head denoises continuous 3D velocity vectors via Flow Matching, integrated at inference with Heun's 2nd-order ODE solver. The result is a small, single-GPU-trainable VLA that retains 55.11% A-OKVQA multiple-choice accuracy after robotic fine-tuning.
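Flow Matching inference amounts to integrating an ODE dx/dt = v(x, t) from noise at t=0 to an action at t=1. This document mentions both Heun's 2nd-order method and a 50-step Euler scheme, so the sketch below shows the two side by side. It is a minimal illustration, not the project's code: `toy_velocity` and `TARGET` are hypothetical stand-ins for the learned ActionExpert velocity field.

```python
from typing import Callable, List

Vector = List[float]
VelocityFn = Callable[[Vector, float], Vector]


def euler_integrate(v: VelocityFn, x0: Vector, steps: int) -> Vector:
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with first-order Euler steps."""
    x, dt = list(x0), 1.0 / steps
    for i in range(steps):
        k1 = v(x, i * dt)
        x = [xi + dt * ki for xi, ki in zip(x, k1)]
    return x


def heun_integrate(v: VelocityFn, x0: Vector, steps: int) -> Vector:
    """Same ODE with Heun's method: average the slopes at both ends of each step."""
    x, dt = list(x0), 1.0 / steps
    for i in range(steps):
        t = i * dt
        k1 = v(x, t)
        x_pred = [xi + dt * ki for xi, ki in zip(x, k1)]  # Euler predictor
        k2 = v(x_pred, t + dt)                            # slope at the predicted end point
        x = [xi + 0.5 * dt * (a + b) for xi, a, b in zip(x, k1, k2)]
    return x


# Hypothetical velocity field for illustration only: pulls x toward TARGET.
TARGET = [0.3, -0.1, 0.2]


def toy_velocity(x: Vector, t: float) -> Vector:
    return [g - xi for g, xi in zip(TARGET, x)]


if __name__ == "__main__":
    start = [0.0, 0.0, 0.0]
    print("euler:", euler_integrate(toy_velocity, start, 10))
    print("heun: ", heun_integrate(toy_velocity, start, 10))
```

Heun's method costs two velocity evaluations per step instead of one, but its error shrinks quadratically with the step size, so it can reach a given accuracy with fewer steps than Euler.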
Features
Continuous Flow Matching action head
80M parameter ActionExpert denoises continuous 3D velocity trajectories via Flow Matching, avoiding the discretization loss of action-token VLAs.
LoRA on Gemma-4-E2B backbone
LoRA (r=16, alpha=32) on attention projections of a 2.6B Gemma-4-E2B vision-language backbone. Adapters merged into base weights for zero inference overhead.
No catastrophic forgetting
Retains 55.11% A-OKVQA multiple-choice accuracy after robotic fine-tuning — VLM general intelligence preserved alongside action capability.
Hierarchical feature extraction
Intermediate transformer layer hooks expose multi-scale visual-language features to the action head rather than only the final hidden state.
Think-Then-Act inference pipeline
The backbone reasons over the scene and instruction before the ActionExpert integrates the flow ODE with 50 Euler steps to produce the trajectory.
Temporal trajectory ensembling
Smooths predicted action chunks across overlapping inference steps for stable closed-loop control.
Single-GPU training
The entire fine-tuning pipeline runs on a single 16GB T4; no multi-GPU or A100-class hardware is required.
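Temporal trajectory ensembling, listed above, can be sketched as follows. The exact weighting NanoVLA-Flow uses is not stated here, so this example assumes an exponential scheme in which older predictions for a timestep get the largest weight. The `TemporalEnsembler` class and its `decay` parameter are hypothetical names introduced for illustration.

```python
import math
from collections import defaultdict
from typing import Dict, List

Action = List[float]


class TemporalEnsembler:
    """Blend overlapping action-chunk predictions made for the same timestep."""

    def __init__(self, decay: float = 0.1) -> None:
        self.decay = decay  # assumed smoothing constant, not taken from the project
        self._pending: Dict[int, List[Action]] = defaultdict(list)

    def add_chunk(self, start_step: int, chunk: List[Action]) -> None:
        """Register a chunk whose k-th action targets timestep start_step + k."""
        for k, action in enumerate(chunk):
            self._pending[start_step + k].append(action)

    def action_for(self, step: int) -> Action:
        """Weighted average of every prediction for `step` (oldest has weight 1)."""
        preds = self._pending.pop(step)  # raises KeyError if nothing was predicted
        weights = [math.exp(-self.decay * i) for i in range(len(preds))]
        total = sum(weights)
        dim = len(preds[0])
        return [
            sum(w * p[d] for w, p in zip(weights, preds)) / total
            for d in range(dim)
        ]


if __name__ == "__main__":
    ens = TemporalEnsembler(decay=0.1)
    ens.add_chunk(0, [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])  # predicted at step 0
    ens.add_chunk(1, [[3.0, 3.0, 3.0], [4.0, 4.0, 4.0]])  # predicted at step 1
    print(ens.action_for(1))  # blend of the two predictions for step 1
```

In a closed control loop, `add_chunk` would be called after every inference and `action_for` once per control step, so each executed action averages all chunks that cover it.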
Performance
Architecture
The system has three components: (1) Vision-Language Backbone: google/gemma-4-E2B-it (2.6B) processes images and text instructions, with hierarchical feature extraction via intermediate layer hooks. (2) Proprioceptive State Projector: sinusoidal positional encoding lifts robot kinematics into the same embedding space. (3) Action Expert: an 80M parameter transformer that denoises continuous 3D velocity trajectories via Flow Matching, integrating the flow ODE at inference with 50 Euler steps.
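To make the Proprioceptive State Projector more concrete, here is a minimal sketch of a sinusoidal encoding applied to a robot state vector. The number of frequencies, the `max_period` constant, and the function name `sinusoidal_encode` are assumptions for illustration. In a real model a learned linear layer would presumably map these features to the backbone's embedding width, and that step is omitted here.

```python
import math
from typing import List


def sinusoidal_encode(state: List[float], num_freqs: int = 4,
                      max_period: float = 10000.0) -> List[float]:
    """Encode each scalar x in `state` as [sin(w_k x), cos(w_k x)] for k < num_freqs."""
    features: List[float] = []
    for x in state:
        for k in range(num_freqs):
            freq = max_period ** (-k / num_freqs)  # geometric frequency ladder
            features.append(math.sin(freq * x))
            features.append(math.cos(freq * x))
    return features


if __name__ == "__main__":
    # Hypothetical 3-value kinematic state (for example, an end-effector position).
    encoded = sinusoidal_encode([0.0, 0.5, -0.2])
    print(len(encoded))  # 3 values * 4 frequencies * 2 (sin, cos)
```

Each scalar becomes a bounded, multi-frequency feature vector, which gives a downstream projection both coarse and fine resolution over the state range.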
Training uses LoRA (r=16, alpha=32) on the backbone's attention projections, with MSE loss on the predicted trajectory plus BCEWithLogitsLoss on a discrete gripper head. After training, LoRA adapters are merged back into the base weights so inference has no PEFT overhead.
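The combined objective can be sketched in plain Python. Two assumptions are made here that the text does not confirm: the regression target is the standard conditional Flow Matching velocity on a straight noise-to-action path, and the gripper term is added with a hypothetical weight `gripper_weight`. The `bce_with_logits` helper uses the same numerically stable form as PyTorch's BCEWithLogitsLoss for a single element.

```python
import math
from typing import List, Tuple


def interpolate(noise: List[float], action: List[float],
                t: float) -> Tuple[List[float], List[float]]:
    """Return x_t on the straight noise-to-action path and its velocity target."""
    x_t = [(1.0 - t) * n + t * a for n, a in zip(noise, action)]
    target_velocity = [a - n for n, a in zip(noise, action)]
    return x_t, target_velocity


def mse(pred: List[float], target: List[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


def bce_with_logits(logit: float, label: float) -> float:
    """Numerically stable binary cross-entropy on a raw logit (label is 0 or 1)."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def total_loss(pred_velocity: List[float], target_velocity: List[float],
               gripper_logit: float, gripper_label: float,
               gripper_weight: float = 1.0) -> float:
    """Velocity regression plus weighted gripper classification (weight is assumed)."""
    return (mse(pred_velocity, target_velocity)
            + gripper_weight * bce_with_logits(gripper_logit, gripper_label))


if __name__ == "__main__":
    x_t, u = interpolate([0.0, 0.0, 0.0], [1.0, 2.0, 3.0], t=0.5)
    # A perfect velocity prediction leaves only the gripper term.
    print(total_loss(u, u, gripper_logit=0.0, gripper_label=1.0))
```

During training the model would receive `x_t`, `t`, and the conditioning features, and its output would take the place of `pred_velocity`.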
Stack
Quick start
git clone https://github.com/Guney-olu/NanoVLA-Flow
cd NanoVLA-Flow
pip install "torch>=2.0" transformers safetensors fastapi uvicorn requests
# Tensor Parallel - 2 GPUs
python -m nanovla-flow --model /path/to/model --mode tensor --tp-size 2