GPLA - Grounded Preference-based Language-Action Alignment

Abstract

Achieving robot transparency is a critical step toward effective human-robot collaboration. To be transparent, a robot's natural language communication must be consistent with its actions and explicitly grounded in the task and environment.

Existing hierarchical Vision-Language-Action (VLA) models can generate language (e.g., through chain-of-thought) and low-level actions. However, current work does not consider explicit alignment between these modalities during training.

To address this crucial gap, we propose a novel training framework that explicitly grounds hierarchical VLA sub-task descriptions with respect to the visual observation and action space. Our framework uses a contrastive model to assess the alignment between generated language and corresponding action trajectories. This contrastive model enables direct ranking of different language-trajectory pairs based on their alignment, allowing us to refine the grounding of our hierarchical VLA through offline preference learning.

We apply our framework to the LanguageTable dataset, a benchmark of human language-annotated trajectories, and provide insights into multimodal grounding representations. The resulting baseline achieves performance comparable to fully supervised fine-tuning while minimising the need for costly data annotations.

Method Overview

GPLA extends a standard VLA into a hierarchical VLA by integrating a high-level VLM (Gemma 3) that decomposes high-level instructions into executable low-level sub-tasks, paired with a low-level VLA (SmolVLA) that generates action trajectories.
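The two-level structure described above can be sketched as a simple control loop. This is a minimal illustration rather than the authors' implementation: `high_level`, `low_level`, and `step_env` are hypothetical callables standing in for the high-level VLM, the low-level VLA, and the environment, and the fixed `max_subtasks` budget is an assumed stopping rule.

```python
from typing import Any, Callable, List, Tuple


def hierarchical_rollout(
    instruction: str,
    observation: Any,
    high_level: Callable[[str, Any], str],
    low_level: Callable[[str, Any], List[Any]],
    step_env: Callable[[Any, Any], Any],
    max_subtasks: int = 3,
) -> Tuple[Any, List[Tuple[str, List[Any]]]]:
    """Run a two-level policy and log each (sub-task, action chunk) pair.

    All three callables are placeholders (assumptions), not real model APIs.
    """
    log = []
    for _ in range(max_subtasks):
        # High level: turn the long-horizon instruction into the next sub-task.
        subtask = high_level(instruction, observation)
        # Low level: turn the sub-task into a short chunk of actions.
        actions = low_level(subtask, observation)
        log.append((subtask, actions))
        # Execute the chunk before asking the high level again.
        for action in actions:
            observation = step_env(observation, action)
    return observation, log
```

The logged (sub-task, action chunk) pairs are the language-action pairs whose alignment the rest of the method is concerned with.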

A separately trained Action-Conditioned Grounding Model extends SigLIP 2 by conditioning visual features on encoded action trajectories via FiLM layers. Trained with a symmetric InfoNCE loss, this model scores the alignment of each language-action pair, enabling preference pairs to be constructed automatically without additional human annotation.
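The two ingredients named here, FiLM conditioning and a symmetric InfoNCE objective, can be sketched in NumPy as follows. This is an illustrative sketch, not the authors' code: the function names, the linear projections `w_gamma` and `w_beta`, and the temperature of 0.07 are assumptions. Where the FiLM layers sit inside the model is not stated here, so the sketch applies a single modulation to a pooled feature vector.

```python
import numpy as np


def film(visual, action_emb, w_gamma, w_beta):
    """FiLM: scale and shift visual features with action-derived parameters."""
    gamma = action_emb @ w_gamma  # per-feature scale, shape (batch, dim)
    beta = action_emb @ w_beta    # per-feature shift, shape (batch, dim)
    return gamma * visual + beta


def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits, axis):
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(av_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched (action-vision, text) pairs.

    Row i of av_emb is assumed to match row i of text_emb; every other row
    in the batch acts as a negative. Returns the loss and the similarity
    matrix, whose entries can serve as alignment scores.
    """
    av = l2_normalize(av_emb)
    tx = l2_normalize(text_emb)
    logits = av @ tx.T / temperature  # (batch, batch) scaled cosine similarity
    idx = np.arange(logits.shape[0])
    # Direction 1: for each action-vision embedding, pick the right text.
    loss_av_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Direction 2: for each text, pick the right action-vision embedding.
    loss_text_to_av = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_av_to_text + loss_text_to_av), logits
```

Averaging the two directions is what makes the loss symmetric: swapping the two inputs leaves its value unchanged.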

The high-level VLM is then updated using SimPO preference optimisation on these pairs, iteratively steering the model toward more grounded and semantically accurate sub-task descriptions.
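One way to read this step: the grounding model scores several candidate sub-task descriptions for the same trajectory, the best and worst form a chosen/rejected pair, and the SimPO objective is applied to that pair. SimPO compares length-normalised log-likelihoods against a target margin. The sketch below assumes this best-versus-worst pairing rule and illustrative values for `beta` and `gamma`. Neither is taken from the paper, and the function names are hypothetical.

```python
import math


def build_preference_pair(candidates, scores):
    """Return (chosen, rejected) as the highest- and lowest-scoring candidates.

    Pairing best against worst is an assumption made for illustration.
    """
    ranked = sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)
    return ranked[0][1], ranked[-1][1]


def simpo_loss(logp_chosen, len_chosen, logp_rejected, len_rejected,
               beta=2.0, gamma=1.0):
    """SimPO loss for one preference pair.

    logp_* are summed token log-probabilities under the policy and len_* are
    the token counts, so logp / len is the average log-probability per token.
    beta and gamma are illustrative values, not the paper's settings.
    """
    reward_chosen = beta * logp_chosen / len_chosen
    reward_rejected = beta * logp_rejected / len_rejected
    margin = reward_chosen - reward_rejected - gamma
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

Because the reward is the policy's own length-normalised log-likelihood, SimPO needs no separate reference model, which keeps each update of the high-level VLM comparatively light.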

Figure 1. We extend a regular VLA into a hierarchical VLA, then iteratively align the intermediate language and action outputs using a learned grounding model and preference-based optimisation.

Contributions

01 We propose GPLA, a novel preference-learning framework that explicitly grounds the intermediate language outputs of hierarchical VLAs in visual observations and actions, potentially eliminating the need for expensive annotation of sub-task labels.
02 GPLA achieves performance comparable to fully supervised fine-tuning on the LanguageTable manipulation benchmark, while uniquely supporting low-data regimes.
03 Visual analysis of the embedding space shows that our action-conditioned grounding model improves cross-modal alignment, mapping action-vision and text inputs into overlapping embedding regions, unlike standard CLIP or SigLIP 2.

Results

Table 2 · Qualitative Examples on the LanguageTable Dataset


High-level Instructions	make a "parallelogram" shape out of all the blocks	put all the blocks in the bottom left corner	put all the blocks in the center left	put all the blocks in a horizontal line on the bottom of the board
Low-level Instructions (GT)	move the green star diagonal to the hexagon	move the blue blocks towards the bottom left	keep the yellow heart at the bottom right side of the green star	move the blue blocks towards the bottom left
Supervised	move your arm towards the left below the yellow heart	push the yellow hexagon into your hand	set down the heart	slide the blue cube slightly towards the left and the right of the blue triangle
GPLA (Action-Conditioned)	place hexagon above square	move your arm towards yellow star	move your arm towards front of the board	place your arm towards towards left side
Supervised + GPLA (Action-Conditioned)	push the red circle diagonally to the triangle	move yellow hexagon into red star	drag red circle to the yellow hexagon	push blue cube diagonally above green circle

BibTeX

@InProceedings{Wulff_2026_CVPR,
    author    = {Wulff, Theodor and Tavella, Federico and Maharjan, Rahul Singh and Adikari, Manith and Cangelosi, Angelo},
    title     = {Grounding Hierarchical Vision-Language-Action Models Through Explicit Language-Action Alignment},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Findings},
    month     = {June},
    year      = {2026},
    pages     = {9269-9281}
}

Acknowledgements

This work was funded by the EU and UKRI under Horizon Europe, MSCA grant agreement No 101072488 (TRAIL). The authors thank the Computational Shared Facility at the University of Manchester for providing the compute resources used to train all models.