Diffusion Reward: Learning Rewards via Conditional Video Diffusion

Tao Huang*¹,², Guangqi Jiang*¹,³, Yanjie Ze¹, Huazhe Xu⁴,¹,⁵

¹Shanghai Qi Zhi Institute ²The Chinese University of Hong Kong

³Sichuan University ⁴Tsinghua University, IIIS ⁵Shanghai AI Lab

💐 ECCV 2024 🎉

*Equal contribution


Abstract

Learning rewards from expert videos offers an affordable and effective way to specify intended behaviors for reinforcement learning (RL) tasks. In this work, we propose Diffusion Reward, a novel framework that learns rewards from expert videos via conditional video diffusion models for solving complex visual RL problems. Our key insight is that lower generative diversity is observed when the model is conditioned on expert trajectories. Diffusion Reward is accordingly formalized as the negative conditional entropy, which encourages productive exploration of expert-like behaviors. We show the efficacy of our method on 10 robotic manipulation tasks from MetaWorld and Adroit with visual input and sparse reward. Moreover, Diffusion Reward can even solve unseen tasks successfully, largely surpassing baseline methods.

Method Overview

We present a framework for reward learning in RL using conditional video diffusion models. Our key insight is that lower generative diversity is observed when the model is conditioned on expert trajectories. We run the reverse diffusion process conditioned on historical frames to estimate the conditional entropy, and use its negative as the reward to encourage RL exploration of expert-like behaviors. Success rates on 10 visual robotic manipulation tasks from two environments demonstrate the effectiveness of Diffusion Reward.
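As a rough illustration of the "lower diversity means higher reward" idea, the sketch below scores a batch of generated samples by the negative entropy of a diagonal Gaussian fit to them. This Gaussian fit is a simple stand-in chosen for illustration, not the estimator used in the paper, and the function name `negative_entropy_reward` and the synthetic samples are assumptions introduced here.

```python
import numpy as np


def negative_entropy_reward(samples: np.ndarray, eps: float = 1e-6) -> float:
    """Return minus the entropy of a diagonal Gaussian fit to `samples`.

    samples: array of shape (M, D), i.e. M generated next-frame latents of
    size D. The Gaussian fit is an illustrative proxy for generative
    diversity, not the paper's estimator.
    """
    var = samples.var(axis=0) + eps  # per-dimension sample variance
    entropy = 0.5 * np.sum(np.log(2.0 * np.pi * np.e * var))
    return float(-entropy)


rng = np.random.default_rng(0)

# Stand-ins for samples drawn from a conditional video model (assumption):
# tightly clustered samples mimic conditioning on expert-like history,
# spread-out samples mimic conditioning on random behavior.
expert_like = rng.normal(0.0, 0.1, size=(16, 8))
random_like = rng.normal(0.0, 1.0, size=(16, 8))

r_expert = negative_entropy_reward(expert_like)
r_random = negative_entropy_reward(random_like)
print("expert-like reward:", r_expert)
print("random-like reward:", r_random)
```

In this toy setting the tightly clustered batch receives the larger reward, which mirrors the intuition that expert-like conditioning produces less diverse generations.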

Diffusion Reward Visualization (Real)

We evaluate Diffusion Reward on a real Franka arm equipped with an Allegro hand, on the task of picking up a bowl from a table. Videos are recorded with a RealSense D435i camera.

Diffusion Reward Visualization (Sim)

We show the learned reward curves of expert and random trajectories for each task with Diffusion Reward.

MetaWorld Assembly

MetaWorld Coffee Push

Adroit Hammer

Adroit Pen

MetaWorld Dial Turn

MetaWorld Door Close

MetaWorld Lever Pull

MetaWorld Peg Unplug Side

MetaWorld Reach

Adroit Door
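To make the notion of a per-step reward curve concrete, here is a minimal sketch of how such a curve might be computed by sliding a window of historical frames along a trajectory. Everything in it is hypothetical: `reward_curve`, `toy_sampler`, the window length, and the Gaussian-fit diversity proxy are assumptions for illustration, and the toy sampler merely imitates a model that is more certain after smooth, expert-like history.

```python
import numpy as np


def reward_curve(frames, sampler, history_len=2, num_samples=8, eps=1e-6):
    """Per-step reward: minus the entropy of a diagonal Gaussian fit to
    `num_samples` predictions conditioned on the last `history_len` frames."""
    rewards = []
    for t in range(history_len, len(frames)):
        history = frames[t - history_len:t]
        samples = np.stack([sampler(history) for _ in range(num_samples)])
        var = samples.var(axis=0) + eps
        rewards.append(-0.5 * np.sum(np.log(2.0 * np.pi * np.e * var)))
    return np.array(rewards)


rng = np.random.default_rng(0)


def toy_sampler(history):
    """Hypothetical stand-in model: predicts the last frame plus noise whose
    scale grows with how erratic the two most recent frames are."""
    scale = 0.05 + np.abs(history[-1] - history[-2]).mean()
    return history[-1] + rng.normal(0.0, scale, size=history[-1].shape)


steps, dim = 20, 4
# Smooth ramp as a stand-in for an expert trajectory.
expert_traj = np.linspace(0.0, 1.0, steps)[:, None] * np.ones(dim)
# Independent noise as a stand-in for a random trajectory.
random_traj = rng.normal(0.0, 1.0, size=(steps, dim))

expert_curve = reward_curve(expert_traj, toy_sampler)
random_curve = reward_curve(random_traj, toy_sampler)
print("mean expert reward:", expert_curve.mean())
print("mean random reward:", random_curve.mean())
```

Each curve has one value per step after the initial history window, so the two trajectories can be compared step by step in the same way as the expert and random curves described above.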

Main Results

We report the learning curves for our method and the baselines on 7 gripper manipulation tasks from MetaWorld and 3 dexterous manipulation tasks from Adroit with image observations. Our method achieves strong performance on all tasks and significantly outperforms the baselines on the complex door and hammer tasks.

Zero-shot Generalization on Unseen Tasks

Diffusion Reward can generalize directly to unseen tasks and produce reasonable rewards, largely surpassing the baselines.

Citation

If you find this project helpful, please cite us:

@article{Huang2023DiffusionReward,
  title={Diffusion Reward: Learning Rewards via Conditional Video Diffusion},
  author={Tao Huang and Guangqi Jiang and Yanjie Ze and Huazhe Xu},
  journal={European Conference on Computer Vision (ECCV)},
  year={2024},
}