Assistax: A Multi-Agent Hardware-Accelerated Reinforcement Learning Benchmark for Assistive Robotics

Hinckeldey, Leonard; Fosong, Elliot; Miller, Elle; Rubavicius, Rimvydas; McInroe, Trevor; Zhang, Fan; Wollstadt, Patricia; Albrecht, Stefano V.; Ramamoorthy, Subramanian

Leonard Hinckeldey¹,*, Elliot Fosong¹,*, Elle Miller¹, Rimvydas Rubavicius¹, Trevor McInroe¹, Fan Zhang², Patricia Wollstadt², Stefano V. Albrecht³, Subramanian Ramamoorthy¹

¹University of Edinburgh · ²Honda Research Institute EU · ³DeepFlow
Reinforcement Learning Conference 2026
*Equal contribution

Scratch — the robot reaches a target itch location on the humanoid.

Tooth Brushing — coordinated contact while the humanoid holds still and opens the mouth.

Feeding — the robot delivers food to the humanoid's mouth.

Bed Bath — the robot wipes along the humanoid's arm while it lies in bed.

Arm Assist — the robot helps the humanoid lift its arm back onto the bed.

Abstract

Real-world assistive tasks — home assistance, caretaking, daily-living support — are inherently multi-agent, yet most existing reinforcement learning (RL) and robot learning simulators consider single-agent problems. This gap is compounded by two further limitations: common RL environments are too simple to capture the complexity of real robotics domains, while most robotics simulations have throughput too low for RL training. As a result, environments that combine robotics fidelity, training efficiency, and multi-agent support remain rare. Assistax closes this gap with a high-throughput, GPU-accelerated suite of assistive robotics tasks built on JAX and MuJoCo MJX, and pairs each robot with an active humanoid partner that can be co-trained via multi-agent RL (MARL).

On top of the MARL benchmark we formulate assistive tasks as an Ad-Hoc Teamwork (AHT) problem: the robot must generalise to unseen humans with varying disabilities and preferences. We provide tuned MARL baselines, an AHT pipeline with a diverse population of pre-trained humanoid partners (released on Hugging Face), and an evaluation protocol that reveals a clear coordination gap when current RL algorithms meet unseen partner agents.

Key Results

412×

Faster open-loop simulation

Versus CPU-based assistive RL environments, on a single GPU. A typical 40M-timestep training run finishes in ~20 minutes instead of ~8.3 hours — a 25× wall-clock reduction.
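
As a sanity check, the durations quoted above can be converted into an implied speedup and throughput. The sketch below uses only the numbers stated here (40M timesteps, ~20 minutes, ~8.3 hours); the variable names are illustrative, and the results inherit the rounding of those approximate durations.

```python
# Figures quoted above; both durations are approximate.
TIMESTEPS = 40_000_000
GPU_MINUTES = 20.0
CPU_HOURS = 8.3

gpu_seconds = GPU_MINUTES * 60.0
cpu_seconds = CPU_HOURS * 3600.0

# Wall-clock reduction for a full training run.
speedup = cpu_seconds / gpu_seconds

# Implied end-to-end training throughput (simulation plus learning updates).
gpu_steps_per_second = TIMESTEPS / gpu_seconds
cpu_steps_per_second = TIMESTEPS / cpu_seconds

print(f"wall-clock speedup: {speedup:.1f}x")
print(f"GPU: {gpu_steps_per_second:,.0f} steps/s, CPU: {cpu_steps_per_second:,.0f} steps/s")
```

The 25× figure covers a full training run, which includes learning updates as well as simulation, so it is not directly comparable to the 412× open-loop simulation figure.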

Steps-per-second scaling across vectorised environments

The Coordination Gap

RL agents fail on unseen partners

Train a robot with 5 humanoid partners, test it against 625 unseen partners with novel preference combinations, and performance drops.

PPO AHT coordination gap between training and held-out partners

Explore a Trained Policy

Interactive rollout viewer: IPPO co-policy on the Bed Bath task.

Tuned MARL Baselines

Learning curves for the Scratch task, with legend: mean test returns across 16 seeds and 64 evaluation episodes ± 95% stratified-bootstrap CIs.

We provide well-tuned multi-agent RL baselines (IPPO, MAPPO, and MASAC), each in feed-forward and recurrent variants and co-trained on every task. All baselines were tuned across 168 hyperparameter combinations per algorithm–task pair (16 seeds, 64 evaluation episodes). The feed-forward IPPO and MAPPO variants consistently come out on top.
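
The learning curves report 95% stratified-bootstrap confidence intervals. As a rough illustration of that kind of interval, the sketch below resamples per-seed scores with replacement inside each stratum and takes percentiles of the resampled means. It is an assumption-laden sketch, not the benchmark's evaluation code: the function name, the choice of one stratum per task, the number of resamples, and the synthetic returns are all hypothetical.

```python
import random
import statistics

def stratified_bootstrap_ci(scores_by_stratum, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile CI for the mean, resampling with replacement inside each stratum."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        resampled = []
        for scores in scores_by_stratum.values():
            # Resample within the stratum so each stratum keeps its sample size.
            resampled.extend(rng.choices(scores, k=len(scores)))
        estimates.append(statistics.fmean(resampled))
    estimates.sort()
    lo = estimates[int((alpha / 2) * n_resamples)]
    hi = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Synthetic per-seed returns (16 seeds); each value stands in for a mean
# over 64 evaluation episodes. These are NOT results from the benchmark.
data_rng = random.Random(1)
per_seed_returns = [data_rng.gauss(100.0, 10.0) for _ in range(16)]

point = statistics.fmean(per_seed_returns)
low, high = stratified_bootstrap_ci({"scratch": per_seed_returns})
print(f"mean return {point:.1f}, 95% CI [{low:.1f}, {high:.1f}]")
```

With a single stratum this reduces to an ordinary percentile bootstrap over seeds; passing several strata (for example one per task) keeps each stratum's sample size fixed in every resample.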

Assistax at a Glance

Assistax training loop: MuJoCo MJX simulation, JAX-vectorised rollouts, MARL update, preference-conditioned rewards

Assistax runs simulation, environment logic, and the RL loop entirely on GPU. Tasks are built in MuJoCo MJX and vectorised across thousands of parallel environments via JAX vmap. We also provide further vectorisation, allowing practitioners to scale their experiments across different seeds or across multiple partner agents for ad-hoc teamwork. Each task pairs a Franka arm with an actively controlled humanoid; rewards combine task success with preference rewards encoding how a partner wants the task performed (e.g. contact force, speed, frequency of contact).
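
The two levels of vectorisation described above can be sketched with nested `jax.vmap` calls. The example below is a toy illustration under stated assumptions, not the Assistax API: `toy_step` is a hypothetical stand-in for an MJX physics step, and the reward (a task term minus a preference-weighted action penalty) is an invented placeholder for the task and preference rewards.

```python
import jax
import jax.numpy as jnp

def toy_step(state, action, pref_weight):
    """Hypothetical stand-in for one environment step (not the MJX physics)."""
    next_state = state + 0.1 * action
    # Placeholder task term: stay close to a target at the origin.
    task_reward = -jnp.sum(next_state ** 2)
    # Placeholder preference term: this partner dislikes large actions.
    pref_penalty = pref_weight * jnp.sum(action ** 2)
    return next_state, task_reward - pref_penalty

# Inner vmap: many parallel environments sharing one partner preference.
step_envs = jax.vmap(toy_step, in_axes=(0, 0, None))
# Outer vmap: repeat the whole batch for several partners, one preference each.
step_partners_envs = jax.jit(jax.vmap(step_envs, in_axes=(0, 0, 0)))

num_partners, num_envs, dim = 3, 1024, 4
key_s, key_a = jax.random.split(jax.random.PRNGKey(0))
states = jax.random.normal(key_s, (num_partners, num_envs, dim))
actions = jax.random.normal(key_a, (num_partners, num_envs, dim))
pref_weights = jnp.array([0.0, 0.5, 1.0])

next_states, rewards = step_partners_envs(states, actions, pref_weights)
print(next_states.shape, rewards.shape)
```

The same pattern extends to a further axis for seeds: wrapping the step once more in `jax.vmap` adds a leading batch dimension without changing `toy_step` itself.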

Ad-Hoc Teamwork Pipeline

The same loop is then used to study a harder question: can a robot trained with a handful of partners cooperate with humans it has never seen? We pre-train a diverse population of 630 humanoid partners per task with MARL, varying them along two axes: 610 preference combinations and 7 disability combinations (which joints are usable). The population is then split into a small training set of 5 partners and a held-out test set of 625 partners with novel preference combinations.

A robot policy is trained against only the training partners and evaluated against the held-out test partners. All reactive, pre-trained humanoid policies are released on Hugging Face, so AHT researchers can drop them into their own pipelines without paying the cost of pre-training a partner population from scratch.
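
The evaluation protocol can be sketched as a split of the partner population followed by a comparison of mean returns. This is a minimal sketch under assumptions: the random split is a placeholder (the actual held-out set is chosen so that its preference combinations are novel), and `split_partners`, `coordination_gap`, and `fake_evaluate` are hypothetical names with invented scores, not part of the released pipeline.

```python
import random
import statistics

POPULATION_SIZE, NUM_TRAIN = 630, 5  # sizes stated in the text

def split_partners(seed=0):
    """Randomly split partner indices into train and held-out sets (placeholder rule)."""
    ids = list(range(POPULATION_SIZE))
    random.Random(seed).shuffle(ids)
    return ids[:NUM_TRAIN], ids[NUM_TRAIN:]

def coordination_gap(evaluate, train_ids, test_ids):
    """Mean return with training partners minus mean return with held-out partners."""
    train_score = statistics.fmean(evaluate(p) for p in train_ids)
    test_score = statistics.fmean(evaluate(p) for p in test_ids)
    return train_score - test_score

train_ids, test_ids = split_partners()

def fake_evaluate(partner_id):
    # Invented scores for illustration: higher with partners seen in training.
    return 100.0 if partner_id in train_ids else 60.0

gap = coordination_gap(fake_evaluate, train_ids, test_ids)
print(len(train_ids), len(test_ids), gap)
```

In practice `evaluate` would roll out the trained robot policy with the given pre-trained humanoid partner and return its mean episode return; a positive gap indicates that performance drops on unseen partners.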

BibTeX

@article{hinckeldey2026assistax,
  title   = {Assistax: A Multi-Agent Hardware-Accelerated Reinforcement Learning Benchmark for Assistive Robotics},
  author  = {Hinckeldey, Leonard and Fosong, Elliot and Miller, Elle and Rubavicius, Rimvydas
             and McInroe, Trevor and Zhang, Fan and Wollstadt, Patricia
             and Albrecht, Stefano V. and Ramamoorthy, Subramanian},
  journal = {Reinforcement Learning Conference},
  year    = {2026}
}