ReCAP: Retrieve, Don't Retrain

Abstract

Extending a vision-language-action (VLA) policy to a new task typically requires task-specific teleoperated demonstrations and per-task fine-tuning, making adaptation costly in both data collection and compute. We show that this target-side per-task adaptation cost can be replaced by retrieval. Our retrieval-augmented policy is trained once on paired demonstrations from the target embodiment (query) and a cheaper embodiment (pool, e.g., human-hand video), then frozen. New tasks are added at deployment by simply appending pool-side demonstrations to a retrieval pool. The frozen policy conditions on retrieved trajectories at every control step, so new tasks are absorbed by indexing data rather than updating parameters.

We find retrieval especially effective on Cosmos Policy, a video-generation-based world action model (WAM): retrieval supplies coarse task progression, while the WAM's future-image objective adds a visual-consistency signal that strengthens the retrieval-conditioned actions. On PushT, retrieval provides a reusable high-level motion prior for cross-embodiment generalization to unseen goal angles; on RoboTwin 2.0 our method outperforms cross-embodiment baselines on unseen tasks; and we further demonstrate the method on a real robot.

34.9% PushT unseen angles (vs. 6.0% no retrieval)

31.5% RoboTwin 2.0 unseen tasks (vs. 26.0% baseline)

×18 cheaper data collection, no additional training

How ReCAP Works

ReCAP adapts a policy to new tasks entirely at test time. Instead of teleoperating each new task and fine-tuning (~24 GPU-hours/task for Cosmos Policy), ReCAP appends cheap human-hand demonstrations to a retrieval pool while keeping the policy frozen. Key ideas:

Retrieval, not retraining. At every control step the frozen policy retrieves a matching state-action chunk from a growing pool database; new tasks are absorbed by extending the pool.
Residual action parameterization. The retrieved pool chunk already encodes coarse motion, so the policy learns only the embodiment-specific correction on top.
World-action backbone. ReCAP is built on Cosmos Policy, whose future-image prediction objective enforces consistency between the retrieved trajectory and the predicted evolution of the scene.
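
The first two ideas can be sketched as a single control step. What follows is a minimal NumPy illustration, not the authors' implementation: the cosine-similarity lookup, the feature and chunk shapes, and the names retrieve_chunk, control_step, and toy_residual are all assumptions introduced for this example. The toy residual stands in for the frozen policy's learned correction.

```python
import numpy as np

def retrieve_chunk(query_feat, pool_feats, pool_chunks):
    """Return the pool action chunk whose key is most similar to the query."""
    # Assumption: cosine similarity over fixed feature vectors.
    q = query_feat / np.linalg.norm(query_feat)
    k = pool_feats / np.linalg.norm(pool_feats, axis=1, keepdims=True)
    idx = int(np.argmax(k @ q))
    return pool_chunks[idx], idx

def control_step(query_feat, pool_feats, pool_chunks, residual_fn):
    """One control step: retrieved coarse motion plus a correction on top."""
    coarse, idx = retrieve_chunk(query_feat, pool_feats, pool_chunks)
    correction = residual_fn(query_feat, coarse)  # stands in for the frozen policy
    return coarse + correction, idx

def toy_residual(feat, coarse):
    # Hypothetical correction: a constant offset, for illustration only.
    return 0.1 * np.ones_like(coarse)

# Toy pool: 5 entries with one-hot keys, each holding a 4-step, 2-D action chunk.
pool_feats = np.eye(5, 8)
pool_chunks = np.arange(5 * 4 * 2, dtype=float).reshape(5, 4, 2)

# A query that lies closest to pool entry 3.
query_feat = pool_feats[3] + 0.05

action, idx = control_step(query_feat, pool_feats, pool_chunks, toy_residual)
print(idx, action.shape)
```

Because the correction is added to the retrieved chunk rather than replacing it, extending pool_feats and pool_chunks changes the coarse motion the policy sees without touching residual_fn.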

Experiment 1

Real Robot — Human-Hand Pool to Teleoperated Target

The pool embodiment is a human hand (video with wrist pose tracked in VR); the target is a teleoperated robot. We fine-tune on a single task, open cabinet, then freeze the policy and add two held-out tasks, close cabinet and place bottle in plastic box, using only 10 human-hand demonstrations per task at test time — no retraining.

Data Collection Setup

Training — Open Cabinet (paired, fine-tuning)

[Videos: Robot (query / target); Human hand (pool)]

Test-Time Pool — Held-Out (human-hand, no retraining)

[Videos: Close cabinet (human hand); Place bottle in plastic box (human hand)]

Seen Task — Open Cabinet (fine-tuned)

Held-Out Task — Close Cabinet (added via retrieval, no retraining)

[Videos: Baseline; ReCAP (Ours); Retrieved input → ReCAP (Ours)]

Held-Out Task — Place Bottle in Plastic Box (added via retrieval, no retraining)

[Videos: Baseline; ReCAP (Ours); Retrieved input → ReCAP (Ours)]

Experiment 2

PushT — Cross-Embodiment Generalization

A 2D agent pushes a T-shaped block to a goal pose. Training pairs a triangle pusher (query embodiment) and a disc pusher (pool embodiment) at ±45°. At test time, the frozen query-side policy retrieves from a disc-pusher pool spanning goal angles from −60° to +60°, generalizing to seven unseen angles without retraining.

Train-Time Database

At training time, the query embodiment (triangle pusher) is paired with the pool embodiment (disc pusher) at goal angles of ±45°.

[Videos: two Query (triangle) / Pool (disc) pairs]

Test-Time Pool

At test time, the frozen query policy retrieves from a pool of disc-pusher demonstrations spanning goal angles from −60° to +60° (the seven shown are unseen) — added without retraining.

[Videos: −60°, −30°, −15°, 0°, +15°, +30°, +60°]
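
Adding tasks by growing the pool can be illustrated with an append-only store. This is a hedged sketch only: the RetrievalPool class, the angle_key encoding, and the string placeholders for trajectories are hypothetical, and keying directly on the goal angle is a simplification chosen for the example rather than the retrieval key the method uses.

```python
import numpy as np

class RetrievalPool:
    """Append-only store of (key, trajectory) pairs with no learned parameters."""

    def __init__(self, dim):
        self.keys = np.empty((0, dim))
        self.trajs = []
        self.labels = []

    def add(self, key, traj, label):
        key = np.asarray(key, dtype=float)[None, :]
        self.keys = np.vstack([self.keys, key])
        self.trajs.append(traj)
        self.labels.append(label)

    def query(self, key):
        # Nearest neighbour by Euclidean distance.
        dists = np.linalg.norm(self.keys - np.asarray(key, dtype=float), axis=1)
        i = int(np.argmin(dists))
        return self.labels[i], self.trajs[i]

def angle_key(deg):
    # Hypothetical key: the goal angle embedded on the unit circle.
    rad = np.deg2rad(deg)
    return np.array([np.cos(rad), np.sin(rad)])

pool = RetrievalPool(dim=2)

# Train-time pool: disc-pusher demonstrations at the two paired goal angles.
for deg in (-45, 45):
    pool.add(angle_key(deg), traj=f"disc_demo_{deg:+d}", label=deg)

before, _ = pool.query(angle_key(15))  # nearest available entry is +45

# Test time: append demonstrations for the unseen angles; nothing is retrained.
for deg in (-60, -30, -15, 0, 15, 30, 60):
    pool.add(angle_key(deg), traj=f"disc_demo_{deg:+d}", label=deg)

after, _ = pool.query(angle_key(15))   # now an exact match exists
print(before, after)
```

The same query returns a different trajectory before and after the append, which is the sense in which new goal angles are absorbed by indexing data rather than updating parameters.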

Results (query rollout + retrieved pool, side by side)

[Videos: goal angles −60°, −15°, 0°, +30°]

Effect of Next-State Prediction (future-image objective)

The world-action backbone jointly predicts the next observation. Removing this future-image objective weakens the retrieval-conditioned actions; restoring it recovers accurate rollouts.

[Videos, goal angle +15°: Without next-state prediction; With next-state prediction (Ours)]

[Videos, goal angle −15°: Without next-state prediction; With next-state prediction (Ours)]
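
One way to picture this ablation is as a weighted sum of an action term and a future-image term, where setting the image weight to zero removes next-state prediction. The sketch below rests on that assumption: the mean-squared-error terms, the image_weight parameter, and the name joint_loss are illustrative stand-ins, and the actual Cosmos Policy training objective may take a different form.

```python
import numpy as np

def joint_loss(pred_action, true_action, pred_image, true_image, image_weight=1.0):
    """Action loss plus a weighted future-image loss (both MSE in this sketch)."""
    action_term = np.mean((pred_action - true_action) ** 2)
    image_term = np.mean((pred_image - true_image) ** 2)
    return action_term + image_weight * image_term

# Toy tensors: a 4-step 2-D action chunk and an 8x8 RGB next frame.
pred_a, true_a = np.zeros((4, 2)), np.ones((4, 2))            # action MSE = 1.0
pred_i, true_i = np.zeros((8, 8, 3)), np.full((8, 8, 3), 2.0)  # image MSE = 4.0

full = joint_loss(pred_a, true_a, pred_i, true_i, image_weight=0.5)
ablated = joint_loss(pred_a, true_a, pred_i, true_i, image_weight=0.0)
print(full, ablated)
```

With image_weight at zero the gradient carries no information about how the scene should evolve, which is one reading of why the ablated variant produces weaker retrieval-conditioned actions.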

Retrieval Attention Visualization

Where the policy attends across layers — the primary (query) stream vs. the retrieved (pool) stream.

[Videos, goal angle 0°: Rollout; Primary L10; Primary L15; Retrieval L05; Retrieval L10]

[Videos, goal angle −30°: Rollout; Primary L10; Primary L15; Retrieval L05; Retrieval L10]
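
A simple numeric companion to such visualizations is the share of attention mass that falls on each stream. The following sketch assumes a single attention tensor per layer whose key positions are split between primary and retrieved tokens; that layout, the mask, and the name stream_attention_share are assumptions for illustration and are not taken from the ReCAP architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def stream_attention_share(attn, is_retrieved_key):
    """attn: (heads, queries, keys), rows summing to 1.

    Returns the mean attention mass on primary keys and on retrieved keys.
    """
    retrieved = float(attn[..., is_retrieved_key].sum(axis=-1).mean())
    return 1.0 - retrieved, retrieved

# Toy layer: 2 heads, 3 query tokens, 6 key tokens with uniform attention.
attn = softmax(np.zeros((2, 3, 6)))

# Assumption: the last 2 of the 6 key tokens belong to the retrieved stream.
mask = np.array([False, False, False, False, True, True])

primary_share, retrieved_share = stream_attention_share(attn, mask)
print(primary_share, retrieved_share)
```

Computing this share per layer would give one scalar per panel, making it easier to compare how strongly different layers rely on the retrieved trajectory.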

Experiment 3

RoboTwin 2.0 — Multi-Task Dual-Arm Manipulation

We take Aloha-Agilex as the target embodiment and UR5 as the pool embodiment, training on five paired tasks and evaluating on five unseen ones. New tasks are added at test time via pool-embodiment (UR5) demonstrations.

Pool-Embodiment Demonstrations (cheap, used for retrieval)

[Videos: Click Bell; Pick Dual Bottles; Grab Roller; Handover Mic; Lift Pot; Move Can to Pot; Move Pillbottle to Pad; Open Microwave; Place Bread on Skillet; Place Cans in Plasticbox]

Query-Embodiment Training Tasks (paired, teleoperated)

[Videos: Click Bell; Pick Dual Bottles; Grab Roller; Move Can to Pot; Open Microwave]

ReCAP (Ours) — Target-Robot Rollouts

[Videos: Click Bell; Grab Roller; Handover Mic; Lift Pot; Move Can to Pot; Move Pillbottle to Pad; Open Microwave; Place Bread on Skillet; Place Cans in Plasticbox; Pick Diverse]

Method Comparison

We compare against cross-embodiment baselines. The retrieved trajectory (dashed) is not a competing method — it is the real-time reference that conditions ReCAP.

[Videos, Click Bell. Baselines: Baseline; Retrieval-only; Co-train. ReCAP (Ours): Retrieved input → ReCAP (Ours)]

[Videos, Handover Mic. Baselines: Baseline; Retrieval-only; Co-train. ReCAP (Ours): Retrieved input → ReCAP (Ours)]

Acknowledgements

We thank our colleagues at NAVER AI Lab for their valuable feedback and support throughout this work. We also thank Jihwan Yoon at Korea University for help with data collection.

BibTeX

@article{park2026retrieve,
  title={Retrieve, Don't Retrain: Extending Vision Language Action Models to New Tasks at Test Time},
  author={Park, Jeongeun and Park, Juhan and Kim, Taekyung and Choi, Sungjoon and Han, Dongyoon and Yun, Sangdoo},
  journal={arXiv preprint arXiv:2606.15631},
  year={2026}
}
