GC-TTT: Test-time Offline Reinforcement Learning on Goal-related Experience

$\ast$ Equal contribution.
1ETH Zürich    2Max Planck Institute for Intelligent Systems    3University of Tübingen

We introduce test-time training in the context of offline goal-conditioned reinforcement learning. The same data used for pre-training is filtered and leveraged to improve the policy locally during evaluation. This results in significant performance gains on standard benchmarks (left) when combined with common offline RL backbones (GC-BC, GC-IQL, and SAW).

Abstract

Foundation models compress a large amount of information into a single, large neural network, which can then be queried for individual tasks. There are strong parallels between this widespread framework and offline goal-conditioned reinforcement learning algorithms: a universal value function is trained on a large number of goals, and the policy is evaluated on a single goal in each test episode. Extensive research in foundation models has shown that performance can be substantially improved through test-time training, specializing the model to the current goal. We similarly find that test-time offline reinforcement learning on experience related to the test goal can lead to substantially better policies at minimal compute costs. We propose a novel self-supervised data selection criterion, which selects transitions from an offline dataset according to their relevance to the current state and quality with respect to the evaluation goal. We demonstrate across a wide range of high-dimensional loco-navigation and manipulation tasks that fine-tuning a policy on the selected data for a few gradient steps leads to significant performance gains over standard offline pre-training. Our goal-conditioned test-time training (GC-TTT) algorithm applies this routine in a receding-horizon fashion during evaluation, adapting the policy to the current trajectory as it is being rolled out. Finally, we study compute allocation at inference, demonstrating that, at comparable costs, GC-TTT induces performance gains that are not achievable by scaling model size.


Method: Goal-conditioned Test-time Training (GC-TTT)

We propose to dynamically fine-tune the policy during evaluation, using data from the pre-training dataset $\mathcal{D}$. Our method, Goal-conditioned Test-time Training (GC-TTT), carefully selects a subset of data $\mathcal{D}(s, g^*)$ that is "close" to the agent's current state $s$ and "optimal" for reaching its current goal $g^*$. The policy is then fine-tuned on this small, specialized dataset for a few gradient steps.
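
To make the procedure concrete, the sketch below illustrates the evaluation-time routine under a few assumptions of ours: a PyTorch goal-conditioned policy exposing act and log_prob, a gym-style environment, a pre-trained goal-conditioned critic available for data selection (Table 1 also reports a critic-free variant), and a goal-conditioned behavioral-cloning loss as the fine-tuning objective. All names and hyperparameters are illustrative rather than the released implementation; select_data is sketched after the component list below.

import copy
import random

import torch


def fine_tune(pretrained_policy, selected, goal, n_grad_steps=20, lr=3e-4):
    """Specialize a copy of the pre-trained policy with a few gradient steps
    on the selected sub-trajectories.

    This sketch uses a goal-conditioned behavioral-cloning loss; in practice
    the objective matches the pre-training backbone (GC-BC, GC-IQL, or SAW).
    """
    policy = copy.deepcopy(pretrained_policy)  # the pre-trained weights stay untouched
    if not selected:                           # nothing relevant found: keep the pre-trained policy
        return policy
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(n_grad_steps):
        tau = random.choice(selected)                   # one selected sub-trajectory per step
        states, actions = tau.states[:-1], tau.actions  # align each state with its action
        loss = -policy.log_prob(states, goal, actions).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return policy


def evaluate_with_gc_ttt(env, pretrained_policy, critic, dataset, goal,
                         K=100, max_steps=1000):
    """Receding-horizon test-time training: every K environment steps, reset to
    the pre-trained weights, re-select data for the current state, and fine-tune.
    """
    state = env.reset(goal=goal)
    policy = pretrained_policy
    for t in range(max_steps):
        if t % K == 0:
            selected = select_data(dataset, state, goal, critic)  # sketched after the list below
            policy = fine_tune(pretrained_policy, selected, goal)
        action = policy.act(state, goal)
        state, reward, done, info = env.step(action)  # gym-style API; adapt as needed
        if done:
            break
    return info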

This process has two key components:

  1. What to train on? (Data Selection): We first filter the dataset for sub-trajectories that are relevant to the agent's current state, i.e., sub-trajectories that start nearby (according to a distance function, $d(s, s_1) < \epsilon$). Among these relevant sub-trajectories, we then keep the ones that are most optimal for the current test goal $g^*$. Optimality is measured with an H-step return estimate, $\hat{V}$, which combines the rewards along the sub-trajectory with the critic's value estimate of its final state; we select the sub-trajectories in the top $q$-th percentile of this estimate (see the selection sketch after this list).
  2. When to train? (Receding-Horizon Training): We apply this fine-tuning process in a receding-horizon fashion. Every $K$ steps, we reset the policy weights to the original pre-trained ones, perform data selection based on the new current state, and fine-tune the policy again. This allows the agent to dynamically adapt its policy as it moves through the environment and to correct for deviations along the way.
GC-TTT specializes the agent to the next steps for achieving its target goal by selecting relevant and optimal data from the pre-training dataset.
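
Below is a sketch of the selection step used above; it is our illustration of the criterion rather than the released code. It assumes the dataset is already chunked into length-H sub-trajectories exposing states, actions, and rewards (with rewards relabeled for the evaluation goal), that relevance is measured with a simple distance in state space, and that the pre-trained critic value_fn supplies the bootstrap term of the H-step return estimate; epsilon, q, gamma, and the exact discounting and percentile conventions are illustrative.

import numpy as np


def h_step_return(rewards, final_state, goal, value_fn, gamma=0.99):
    """H-step return estimate: discounted rewards along the sub-trajectory
    plus the critic's value estimate of its final state."""
    H = len(rewards)
    ret = sum(gamma ** t * r for t, r in enumerate(rewards))
    return ret + gamma ** H * value_fn(final_state, goal)


def select_data(dataset, state, goal, value_fn,
                epsilon=1.0, q=80, gamma=0.99, distance=None):
    """Select sub-trajectories that are (i) relevant to the current state and
    (ii) near-optimal for the evaluation goal.

    `dataset` is assumed to be an iterable of sub-trajectories with `states`,
    `actions`, and `rewards` attributes; the returned subset is used for
    fine-tuning.
    """
    if distance is None:
        # Placeholder relevance metric: Euclidean distance between states.
        distance = lambda a, b: np.linalg.norm(np.asarray(a) - np.asarray(b))

    # (i) Relevance: keep sub-trajectories starting close to the current state.
    relevant = [tau for tau in dataset if distance(state, tau.states[0]) < epsilon]
    if not relevant:
        return []

    # (ii) Optimality: rank by the H-step return estimate and keep the
    # sub-trajectories above the q-th percentile.
    values = np.array([
        h_step_return(tau.rewards, tau.states[-1], goal, value_fn, gamma)
        for tau in relevant
    ])
    cutoff = np.percentile(values, q)
    return [tau for tau, v in zip(relevant, values) if v >= cutoff]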

Results

We evaluated GC-TTT on a range of loco-navigation (pointmaze, antmaze, humanoidmaze) and manipulation (cubesingle) tasks from OGBench. As shown in Table 1, GC-TTT provides substantial improvements when applied on top of various offline RL backbones (GC-BC, GC-IQL, and SAW). On average, GC-TTT improved the success rate of GC-BC from 0.23 to 0.58 (+152.2%) and that of GC-IQL from 0.39 to 0.63 (+61.5%).

Table 1: Success rates of GC-TTT (and its critic-free variant) on top of GC-BC, GC-IQL, and SAW backbones. GC-TTT consistently and significantly improves performance across tasks and algorithms.

Ablation Studies

We conducted several ablations to understand why GC-TTT works. We found:

  • Data selection is crucial (Fig. 6, left): Both the relevance and optimality filters are necessary. Training on random data, or on data that is only relevant but not optimal (or vice versa), fails to produce significant gains.
  • TTT frequency matters (Fig. 6, middle): Performance scales with the frequency of test-time updates. More complex environments like Antmaze benefit from more frequent updates (e.g., every 100 steps).
  • GC-TTT scales better than model size (Fig. 6, right): We compared allocating more compute at test time by (a) increasing the TTT frequency or (b) scaling the policy network size. GC-TTT (blue line) consistently outperforms simple model scaling (grey line) at matched inference FLOPs.
Figure 6: Ablation studies on data selection (left), TTT frequency (middle), and compute scaling (right).

BibTeX

@misc{bagatella2025testtime,
      title={Test-time Offline Reinforcement Learning on Goal-related Experience}, 
      author={Marco Bagatella* and Mert Albaba* and Jonas Hübotter and Georg Martius and Andreas Krause},
      year={2025},
      eprint={2507.18809},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2507.18809}, 
    }