NanoVLA-Flow

v1.0

Parameter-Efficient Continuous Vision-Language-Action Model

experimental, robotics

NanoVLA-Flow addresses catastrophic forgetting when fine-tuning Vision-Language Models into Vision-Language-Action architectures. It demonstrates that a 2.6B parameter backbone can retain general intelligence while functioning as a continuous trajectory prediction engine through LoRA-based parameter-efficient fine-tuning. An 80M ActionExpert head denoises continuous 3D velocity vectors via Flow Matching, integrated at inference with a 50-step Euler ODE solver. The result is a small VLA that trains on a single GPU and retains over 55% A-OKVQA multiple-choice accuracy after robotic fine-tuning.

GitHub: https://github.com/Guney-olu/NanoVLA-Flow

Features

Continuous Flow Matching action head

80M parameter ActionExpert denoises continuous 3D velocity trajectories via Flow Matching, avoiding the discretization loss of action-token VLAs.
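How a Flow Matching training pair is built can be illustrated with a minimal sketch. It assumes the common linear (rectified-flow style) path between Gaussian noise and the ground-truth action; the project's actual path, noise schedule, and tensor shapes are not stated here, and the function name and sample values are hypothetical. Note that the "flow velocity" regressed by the head is a separate quantity from the robot velocity command being generated.

```python
import random

def flow_matching_pair(action, t, noise=None):
    """Build one (x_t, target_velocity) training pair for a single action vector.

    Assumes the linear path x_t = (1 - t) * noise + t * action,
    whose time derivative (the regression target) is action - noise.
    """
    if noise is None:
        noise = [random.gauss(0.0, 1.0) for _ in action]
    x_t = [(1.0 - t) * n + t * a for n, a in zip(noise, action)]
    target_velocity = [a - n for n, a in zip(noise, action)]
    return x_t, target_velocity

# Hypothetical 3D velocity command and a fixed noise sample for reproducibility.
action = [0.10, -0.05, 0.02]
x_t, v_target = flow_matching_pair(action, t=0.25, noise=[1.0, 0.0, -1.0])
```

During training, a head of this kind would be asked to predict `v_target` from `x_t`, `t`, and the conditioning features, so no action binning is involved at any point.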

LoRA on Gemma-4-E2B backbone

LoRA (r=16, alpha=32) on attention projections of a 2.6B Gemma-4-E2B vision-language backbone. Adapters merged into base weights for zero inference overhead.
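For reference, the standard LoRA merge can be written as follows. This is the usual formulation, assuming the default alpha/r scaling; the symbols W_0, A, and B are notation introduced here, not names from the project.

```latex
W_{\text{merged}} = W_0 + \frac{\alpha}{r}\, B A,
\qquad B \in \mathbb{R}^{d_{\text{out}} \times r},\quad
A \in \mathbb{R}^{r \times d_{\text{in}}},\qquad
\frac{\alpha}{r} = \frac{32}{16} = 2
```

Because the merged matrix has the same shape as the original projection, the forward pass after merging costs exactly what the base model's does.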

No catastrophic forgetting

Retains 55.11% A-OKVQA multiple-choice accuracy after robotic fine-tuning — VLM general intelligence preserved alongside action capability.

Hierarchical feature extraction

Intermediate transformer layer hooks expose multi-scale visual-language features to the action head rather than only the final hidden state.

Think-Then-Act inference pipeline

Backbone reasons over the scene and instruction before the ActionExpert integrates the flow ODE with 50 Euler steps to produce the trajectory.
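The Euler integration step can be sketched in a few lines. This is an illustration only: `toy_velocity` stands in for the learned ActionExpert and is simply the exact velocity field of a straight path toward a hypothetical target action, and the integration interval t in [0, 1] is an assumption.

```python
def euler_integrate(velocity_fn, x0, num_steps=50):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed-step Euler."""
    dt = 1.0 / num_steps
    x = list(x0)
    for step in range(num_steps):
        t = step * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Hypothetical target action; in the real model the velocity comes from the network.
target = [0.10, -0.05, 0.02]

def toy_velocity(x, t):
    # Velocity that moves x along a straight line so it reaches `target` at t=1.
    return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]

# Start from a fixed "noise" sample and integrate to an action.
action = euler_integrate(toy_velocity, x0=[1.0, 0.0, -1.0], num_steps=50)
```

Each Euler step costs one forward pass of the action head, so the step count trades latency against integration error.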

Temporal trajectory ensembling

Smooths predicted action chunks across overlapping inference steps for stable closed-loop control.
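One way to ensemble overlapping chunks is sketched below. The exponential weighting follows the scheme popularized by ACT and is an assumption here, since the project's exact weighting is not described; `ensemble_action`, `decay`, and the sample chunks are hypothetical.

```python
import math

def ensemble_action(chunks, current_step, decay=0.1):
    """Blend every chunk's prediction for `current_step` into one action.

    `chunks` maps the step at which a chunk was predicted to a list of actions,
    where chunks[start][k] is the action intended for step start + k.
    Weights are exp(-decay * i) with i = 0 for the oldest prediction.
    """
    candidates = []
    for start in sorted(chunks):
        offset = current_step - start
        if 0 <= offset < len(chunks[start]):
            candidates.append(chunks[start][offset])
    if not candidates:
        raise ValueError("no chunk covers the requested step")
    weights = [math.exp(-decay * i) for i in range(len(candidates))]
    total = sum(weights)
    dim = len(candidates[0])
    return [sum(w * c[d] for w, c in zip(weights, candidates)) / total
            for d in range(dim)]

# Two overlapping 3-step chunks, predicted at steps 0 and 1.
chunks = {
    0: [[0.0] * 3, [1.0] * 3, [2.0] * 3],
    1: [[1.2] * 3, [2.2] * 3, [3.2] * 3],
}
smoothed = ensemble_action(chunks, current_step=1)
```

With `decay=0` this reduces to a plain average of all predictions that cover the current step.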

Single-GPU training

The entire fine-tuning pipeline runs on a single 16GB NVIDIA T4, with no multi-GPU or A100-class hardware required.

Performance

A-OKVQA (MC Accuracy): 55.11%. 1,145-sample validation set; VLM capabilities preserved after fine-tuning.

A-OKVQA (Direct Answer): 19.33%. Direct answer scoring on the same validation set.

Flow Trajectory MSE: 0.156. On LIBERO spatial held-out episodes.

Trajectory Cosine Similarity: 0.666. Predicted vs ground-truth velocity vector alignment.
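For readers unfamiliar with the two trajectory metrics, here is a minimal sketch of how they can be computed. The aggregation (a mean over all coordinates for MSE, a mean over timesteps for cosine similarity) is an assumption; the project's evaluation code may aggregate differently, and the function names and sample data are hypothetical.

```python
import math

def trajectory_mse(pred, target):
    """Mean squared error over every coordinate of every timestep."""
    errors = [(p - q) ** 2
              for pv, tv in zip(pred, target)
              for p, q in zip(pv, tv)]
    return sum(errors) / len(errors)

def mean_cosine_similarity(pred, target, eps=1e-8):
    """Average per-timestep cosine similarity between velocity vectors."""
    sims = []
    for pv, tv in zip(pred, target):
        dot = sum(p * q for p, q in zip(pv, tv))
        norm = math.sqrt(sum(p * p for p in pv)) * math.sqrt(sum(q * q for q in tv))
        sims.append(dot / max(norm, eps))  # eps guards against zero vectors
    return sum(sims) / len(sims)

# Toy two-step trajectory: the first step matches, the second is orthogonal.
pred = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
target = [[1.0, 0.0, 0.0], [0.0, 0.0, 2.0]]
mse_value = trajectory_mse(pred, target)
cos_value = mean_cosine_similarity(pred, target)
```

MSE is sensitive to magnitude errors, while cosine similarity measures only direction, which is why the two are reported together.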

Architecture

Three-component system:

(1) Vision-Language Backbone: google/gemma-4-E2B-it (2.6B) processes images and text instructions, with hierarchical feature extraction via intermediate layer hooks.

(2) Proprioceptive State Projector: sinusoidal positional encoding lifts robot kinematics into the same embedding space.

(3) Action Expert: 80M parameter transformer that denoises continuous 3D velocity trajectories via Flow Matching, integrated at inference with a 50-step Euler ODE solver.
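The sinusoidal encoding used by the state projector can be illustrated with a small sketch. The frequency spacing (powers of two), the number of frequencies, and the 7-dimensional sample state are assumptions for illustration; in a real model a learned linear layer would presumably map these features to the embedding width.

```python
import math

def sinusoidal_encode(state, num_frequencies=4):
    """Encode each scalar in `state` as sin/cos pairs at geometrically spaced frequencies.

    Returns a flat list of length len(state) * 2 * num_frequencies.
    """
    features = []
    for value in state:
        for k in range(num_frequencies):
            freq = 2.0 ** k
            features.append(math.sin(freq * value))
            features.append(math.cos(freq * value))
    return features

# Hypothetical 7-dimensional proprioceptive state (e.g. joint angles in radians).
state = [0.0, 0.1, -0.3, 1.2, 0.5, -0.7, 0.04]
encoded = sinusoidal_encode(state)
```

Encoding each scalar at several frequencies lets a downstream linear layer distinguish small differences in joint values that a raw scalar input would compress.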

Training uses LoRA (r=16, alpha=32) on the backbone's attention projections, with MSE loss on the predicted trajectory plus BCEWithLogitsLoss on a discrete gripper head. After training, LoRA adapters are merged back into the base weights so inference has no PEFT overhead.
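The combined objective described above can be sketched in plain Python. The `gripper_weight` factor is hypothetical (the text only says the two losses are added), and the BCE term uses the standard numerically stable formulation, which gives the same values as PyTorch's `BCEWithLogitsLoss` for a single element.

```python
import math

def mse_loss(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def bce_with_logits(logit, label):
    """Numerically stable binary cross-entropy computed from a raw logit."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

def total_loss(pred, target, gripper_logit, gripper_label, gripper_weight=1.0):
    """Trajectory MSE plus a weighted gripper open/close classification loss."""
    return mse_loss(pred, target) + gripper_weight * bce_with_logits(
        gripper_logit, gripper_label
    )

# Toy values: a 2-element prediction and an undecided gripper logit.
loss = total_loss([1.0, 2.0], [1.0, 4.0], gripper_logit=0.0, gripper_label=1.0)
```

Keeping the gripper on a separate binary head avoids regressing an inherently discrete open/close command with MSE.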

Stack

Python, PyTorch, Transformers, Gemma-4-E2B, LoRA / PEFT, Flow Matching, LeRobot

Quick start

git clone https://github.com/Guney-olu/NanoVLA-Flow
cd NanoVLA-Flow
pip install "torch>=2.0" transformers safetensors fastapi uvicorn requests

# Tensor Parallel - 2 GPUs
python -m nanovla-flow --model /path/to/model --mode tensor --tp-size 2
