Retrieve, Don't Retrain:
Extending Vision-Language-Action Models to New Tasks at Test Time

ReCAP — Retrieval-Conditioned Action Policy

1 NAVER AI Lab   2 Korea University

Abstract

Extending a vision-language-action (VLA) policy to a new task typically requires task-specific teleoperated demonstrations and per-task fine-tuning, making adaptation costly in both data collection and compute. We show that this target-side per-task adaptation cost can be replaced by retrieval. Our retrieval-augmented policy is trained once on paired demonstrations from the target embodiment (query) and a cheaper embodiment (pool, e.g., human-hand video), then frozen. New tasks are added at deployment by simply appending pool-side demonstrations to a retrieval pool. The frozen policy conditions on retrieved trajectories at every control step, so new tasks are absorbed by indexing data rather than updating parameters.

We find retrieval especially effective on Cosmos Policy, a video-generation-based world action model (WAM): retrieval supplies coarse task progression, while the WAM's future-image objective adds a visual-consistency signal that strengthens the retrieval-conditioned actions. On PushT, retrieval provides a reusable high-level motion prior for cross-embodiment generalization to unseen goal angles; on RoboTwin 2.0 our method outperforms cross-embodiment baselines on unseen tasks; and we further demonstrate the method on a real robot.

34.9% PushT unseen angles (vs. 6.0% no retrieval)
31.5% RoboTwin 2.0 unseen tasks (vs. 26.0% baseline)
18× cheaper data collection, no additional training

How ReCAP Works

ReCAP adapts a policy to new tasks entirely at test time. Instead of teleoperating each new task and fine-tuning (~24 GPU-hours/task for Cosmos Policy), ReCAP appends cheap human-hand demonstrations to a retrieval pool while keeping the policy frozen. Key ideas:

  • Retrieval, not retraining. At every control step the frozen policy retrieves a matching state-action chunk from a growing pool database; new tasks are absorbed by extending the pool.
  • Residual action parameterization. The retrieved pool chunk already encodes coarse motion, so the policy learns only the embodiment-specific correction on top.
  • World-action backbone. ReCAP is built on Cosmos Policy, whose future-image prediction objective enforces consistency between the retrieved trajectory and the predicted evolution of the scene.
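
The first two ideas above can be sketched as a single control step. This is a minimal illustration, not the authors' implementation: the names `PoolEntry`, `RetrievalPool`, `frozen_residual`, and `control_step` are hypothetical, Euclidean nearest-neighbour lookup is an assumed similarity measure, and the constant residual stands in for whatever the frozen network would predict.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class PoolEntry:
    task: str
    state: List[float]          # pool-side state embedding (hypothetical representation)
    actions: List[List[float]]  # pool-side action chunk, one row per time step


@dataclass
class RetrievalPool:
    entries: List[PoolEntry] = field(default_factory=list)

    def add(self, task, state, actions):
        # Adding a task only extends the index; no parameters are involved.
        self.entries.append(PoolEntry(task, state, actions))

    def retrieve(self, query_state):
        # Assumed similarity: smallest Euclidean distance between state embeddings.
        return min(self.entries, key=lambda e: math.dist(e.state, query_state))


def frozen_residual(query_state, retrieved):
    # Stand-in for the frozen policy: a fixed embodiment-specific correction.
    return [[0.25 for _ in step] for step in retrieved.actions]


def control_step(pool, query_state):
    retrieved = pool.retrieve(query_state)
    residual = frozen_residual(query_state, retrieved)
    # Residual parameterization: executed action = retrieved coarse motion + correction.
    actions = [
        [a + r for a, r in zip(step, res)]
        for step, res in zip(retrieved.actions, residual)
    ]
    return retrieved.task, actions


pool = RetrievalPool()
pool.add("open cabinet", [0.0, 0.0], [[1.0, 0.0], [1.0, 0.5]])
pool.add("close cabinet", [5.0, 5.0], [[-1.0, 0.0], [-1.0, -0.5]])
task, actions = control_step(pool, [4.5, 5.2])
print(task, actions)
```

In this toy setup the query state lies closest to the "close cabinet" entry, so that chunk is retrieved and shifted by the fixed correction.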

Experiment 1

Real Robot — Human-Hand Pool to Teleoperated Target

The pool embodiment is a human hand (video with wrist pose tracked in VR); the target is a teleoperated robot. We fine-tune on a single task, open cabinet, then freeze the policy and add two held-out tasks, close cabinet and place bottle in plastic box, using only 10 human-hand demonstrations per task at test time — no retraining.
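
The workflow described above, fine-tune once and then extend only the pool, can be illustrated with a small sketch. Everything here is a placeholder: `policy_params` stands in for the frozen weights, demonstrations are represented by string identifiers, and the number of pool entries for the training task is an arbitrary choice. Only the two held-out task names and the 10 demonstrations per task come from the text.

```python
import copy

# Placeholder for the fine-tuned, then frozen, policy weights (hypothetical values).
policy_params = {"w": [0.3, -0.2], "b": 0.1}
frozen_snapshot = copy.deepcopy(policy_params)

# Pool after training: only the training task is indexed.
# The number of entries for the training task is a placeholder.
pool = {"open cabinet": ["open_cabinet_hand_00"]}


def add_task(pool, task, demos):
    """Index human-hand demonstrations for a task; parameters are never touched."""
    pool.setdefault(task, []).extend(demos)


# Held-out tasks are added at test time with 10 human-hand demonstrations each.
for task in ("close cabinet", "place bottle in plastic box"):
    prefix = task.replace(" ", "_")
    add_task(pool, task, [f"{prefix}_hand_{i:02d}" for i in range(10)])

print(sorted(pool))
print(policy_params == frozen_snapshot)
```

The point of the sketch is that task coverage grows through `add_task` alone, while the parameter snapshot taken before deployment stays identical.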

Data Collection Setup

Training — Open Cabinet (paired, fine-tuning)

Robot (query / target)
Human hand (pool)

Test-Time Pool — Held-Out (human-hand, no retraining)

Close cabinet (human hand)
Place bottle in plastic box (human hand)

Seen Task — Open Cabinet (fine-tuned)

Held-Out Task — Close Cabinet (added via retrieval, no retraining)

Videos: Baseline; ReCAP (Ours), shown with its retrieved input

Held-Out Task — Place Bottle in Plastic Box (added via retrieval, no retraining)

Videos: Baseline; ReCAP (Ours), shown with its retrieved input

Experiment 2

PushT — Cross-Embodiment Generalization

A 2D agent pushes a T-shaped block to a goal pose. Training pairs a triangle pusher (query embodiment) and a disc pusher (pool embodiment) at ±45°. At test time, the frozen query-side policy retrieves from a disc-pusher pool spanning goal angles from −60° to +60°, generalizing to seven unseen angles without retraining.

Train-Time Database

At training time, the query embodiment (triangle pusher) is paired with the pool embodiment (disc pusher) at goal angles of ±45°.

Query (triangle)
Pool (disc)
Query (triangle)
Pool (disc)

Test-Time Pool

At test time, the frozen query policy retrieves from a pool of disc-pusher demonstrations spanning goal angles from −60° to +60° (the seven shown are unseen) — added without retraining.

−60°
−30°
−15°
0°
+15°
+30°
+60°
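
As a quick check on the split, the sketch below compares the training goal angles with the test-time goal angles. It assumes the seven unseen angles are the ones that appear in this section's pool and results panels (−60°, −30°, −15°, 0°, +15°, +30°, +60°); the variable names are illustrative only.

```python
# Goal angles in degrees. Training pairs exist only at +/-45 degrees.
train_angles = {-45, 45}

# Assumed test-time goal angles, taken from the panels in this section.
test_angles = [-60, -30, -15, 0, 15, 30, 60]

# None of the test angles was seen during training.
unseen = [a for a in test_angles if a not in train_angles]

# Angles between the two training angles versus angles outside that range.
inside = [a for a in unseen if min(train_angles) < a < max(train_angles)]
outside = [a for a in unseen if a not in inside]

# Distance from each unseen angle to the nearest training angle.
gap = {a: min(abs(a - t) for t in train_angles) for a in unseen}

print(len(unseen), inside, outside, gap)
```

Under this assumption, five of the unseen angles fall between the two training angles and two (±60°) fall outside them, so the evaluation covers both interpolation and mild extrapolation in goal angle.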

Results (query rollout + retrieved pool, side by side)

Goal angle: −60°
Goal angle: −15°
Goal angle: 0°
Goal angle: +30°

Effect of Next-State Prediction (future-image objective)

The world-action backbone jointly predicts the next observation. Removing this future-image objective weakens the retrieval-conditioned actions; restoring it recovers accurate rollouts.
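
One way to picture the ablation is as a joint objective with an action term and a future-image term, where the ablated variant drops the second term. The sketch below is an assumption-laden illustration: the actual training objective of the backbone is not specified here, plain mean-squared error is used as a stand-in, and `image_weight` is a hypothetical weighting parameter.

```python
def mse(pred, target):
    """Mean-squared error over two equal-length flat sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def joint_loss(pred_actions, true_actions, pred_image, true_image, image_weight=1.0):
    """Action loss plus a weighted future-image loss (stand-in formulation)."""
    action_term = mse(pred_actions, true_actions)
    image_term = mse(pred_image, true_image)
    return action_term + image_weight * image_term


# Toy values: a 2-D action and a 4-pixel "next observation".
pred_actions, true_actions = [0.5, 0.0], [0.0, 0.0]
pred_image, true_image = [1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 0.0, 1.0]

with_next_state = joint_loss(pred_actions, true_actions, pred_image, true_image)
without_next_state = joint_loss(
    pred_actions, true_actions, pred_image, true_image, image_weight=0.0
)
print(with_next_state, without_next_state)
```

Setting `image_weight=0.0` corresponds to the "without next-state prediction" variant in this toy formulation: the model is no longer penalized when its predicted scene disagrees with what actually happens.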

Goal angle +15°

Without next-state prediction
With next-state prediction (Ours)

Goal angle −15°

Without next-state prediction
With next-state prediction (Ours)

Retrieval Attention Visualization

Where the policy attends across layers — the primary (query) stream vs. the retrieved (pool) stream.
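
A simple scalar summary of such a visualization is the share of attention that falls on each stream. The sketch below assumes a hypothetical token layout in which the primary (query) tokens come first and the retrieved (pool) tokens follow; `stream_attention` and the example scores are illustrative and not taken from the model.

```python
def stream_attention(scores, n_primary):
    """Split non-negative attention scores into primary and retrieved shares.

    Assumes the first `n_primary` entries belong to the primary (query) stream
    and the remaining entries to the retrieved (pool) stream.
    """
    total = sum(scores)
    primary = sum(scores[:n_primary]) / total
    return primary, 1.0 - primary


# Toy scores for one attention row: three primary tokens, one retrieved token.
scores = [1.0, 1.0, 2.0, 12.0]
primary_share, retrieved_share = stream_attention(scores, n_primary=3)
print(primary_share, retrieved_share)
```

Computing this share per layer would give one number per panel, which makes it easier to compare how strongly different layers rely on the retrieved stream.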

Goal angle 0°

Rollout
Primary L10
Primary L15
Retrieval L05
Retrieval L10

Goal angle −30°

Rollout
Primary L10
Primary L15
Retrieval L05
Retrieval L10

Experiment 3

RoboTwin 2.0 — Multi-Task Dual-Arm Manipulation

We take Aloha-Agilex as the target and UR5 as the pool, training on five paired tasks and evaluating on five unseen ones. New tasks are added at test time via UR5 pool demonstrations.

Pool-Embodiment Demonstrations (cheap, used for retrieval)

Click Bell
Pick Dual Bottles
Grab Roller
Handover Mic
Lift Pot
Move Can to Pot
Move Pillbottle to Pad
Open Microwave
Place Bread on Skillet
Place Cans in Plasticbox

Query-Embodiment Training Tasks (paired, teleoperated)

Click Bell
Pick Dual Bottles
Grab Roller
Move Can to Pot
Open Microwave
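
The unseen tasks are not listed separately here. Assuming they are the pool tasks that have no paired query-embodiment training data, the split can be recovered from the two lists above:

```python
# Tasks with pool-embodiment demonstrations (from the list above).
pool_tasks = [
    "Click Bell",
    "Pick Dual Bottles",
    "Grab Roller",
    "Handover Mic",
    "Lift Pot",
    "Move Can to Pot",
    "Move Pillbottle to Pad",
    "Open Microwave",
    "Place Bread on Skillet",
    "Place Cans in Plasticbox",
]

# Tasks with paired query-embodiment training data (from the list above).
train_tasks = [
    "Click Bell",
    "Pick Dual Bottles",
    "Grab Roller",
    "Move Can to Pot",
    "Open Microwave",
]

# Assumption: unseen tasks = pool tasks without paired training data.
unseen_tasks = [t for t in pool_tasks if t not in train_tasks]
print(unseen_tasks)
```

Under that assumption the split is five seen and five unseen tasks, which matches the counts stated for this experiment.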

ReCAP (Ours) — Target-Robot Rollouts

Click Bell
Grab Roller
Handover Mic
Lift Pot
Move Can to Pot
Move Pillbottle to Pad
Open Microwave
Place Bread on Skillet
Place Cans in Plasticbox
Pick Diverse

Method Comparison

We compare against cross-embodiment baselines. The retrieved trajectory (dashed) is not a competing method — it is the real-time reference that conditions ReCAP.

Click Bell

Videos: Baselines (Baseline, Retrieval-only, Co-train); ReCAP (Ours), shown with its retrieved input

Handover Mic

Videos: Baselines (Baseline, Retrieval-only, Co-train); ReCAP (Ours), shown with its retrieved input

Acknowledgements

We thank our colleagues at NAVER AI Lab for their valuable feedback and support throughout this work. We also thank Jihwan Yoon at Korea University for help with data collection.

BibTeX