Diffusion Reward: Learning Rewards via Conditional Video Diffusion

Tao Huang*,1,2  Guangqi Jiang*,1,3  Yanjie Ze1  Huazhe Xu4,1,5

1Shanghai Qi Zhi Institute  2The Chinese University of Hong Kong 

3Sichuan University  4Tsinghua University, IIIS  5Shanghai AI Lab 

*Equal contribution

Abstract

Learning rewards from expert videos offers an affordable and effective way to specify the intended behaviors for reinforcement learning (RL) tasks. In this work, we propose Diffusion Reward, a novel framework that learns rewards from expert videos via conditional video diffusion models for solving complex visual RL problems. Our key insight is that lower generative diversity is observed when the model is conditioned on expert trajectories. Diffusion Reward is accordingly formalized as the negative of conditional entropy, which encourages productive exploration of expert-like behaviors. We show the efficacy of our method on 10 robotic manipulation tasks from MetaWorld and Adroit with visual input and sparse reward. Moreover, Diffusion Reward can even solve unseen tasks, largely surpassing baseline methods.

Method Overview

We present a framework for reward learning in RL using conditional video diffusion models. Our key insight is that lower generative diversity is observed when the model is conditioned on expert trajectories. We run reverse diffusion processes conditioned on historical frames to estimate the conditional entropy, and use its negative as a reward that encourages RL exploration of expert-like behaviors. Success rates on 10 visual robotic manipulation tasks from two environments demonstrate the effectiveness of Diffusion Reward.
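To make the "negative conditional entropy as reward" idea concrete, here is a minimal toy sketch, not the authors' implementation. It assumes the conditional generative distribution can be sampled and scored, and it swaps the video diffusion model for an isotropic Gaussian whose standard deviation plays the role of generative diversity. The names `negative_entropy_reward`, `gaussian_log_prob`, and `toy_reward` are hypothetical. It relies only on the identity -H(p) = E_{x~p}[log p(x)], estimated by averaging log-densities over samples.

```python
import numpy as np

def negative_entropy_reward(log_probs):
    """Monte Carlo estimate of -H(p) = E_{x~p}[log p(x)] from per-sample log-densities."""
    return float(np.mean(log_probs))

def gaussian_log_prob(x, mean, std):
    """Log-density of an isotropic Gaussian, summed over the last axis."""
    var = std ** 2
    return np.sum(-0.5 * np.log(2 * np.pi * var) - 0.5 * (x - mean) ** 2 / var, axis=-1)

def toy_reward(std, num_samples=4096, dim=4, seed=0):
    """Reward for a stand-in 'generator' (assumption: a Gaussian, not a diffusion model).

    A small std mimics low generative diversity (expert-like conditioning);
    a large std mimics high diversity (non-expert conditioning).
    """
    rng = np.random.default_rng(seed)
    mean = np.zeros(dim)
    samples = rng.normal(mean, std, size=(num_samples, dim))
    return negative_entropy_reward(gaussian_log_prob(samples, mean, std))

expert_like = toy_reward(std=0.1)   # low diversity
random_like = toy_reward(std=1.0)   # high diversity
print(expert_like > random_like)
```

In this toy setting the low-diversity case receives the higher reward, which mirrors the qualitative behavior described above. How the entropy is actually estimated from the diffusion model's reverse process is not specified on this page and is left out of the sketch.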

Diffusion Reward Visualization (Real)

We evaluate Diffusion Reward on a real Franka arm equipped with an Allegro hand, on the task of picking up a bowl from a table. Videos are recorded with a RealSense D435i camera.


Diffusion Reward Visualization (Sim)

For each task, we show the reward curves that Diffusion Reward assigns to expert and random trajectories.


MetaWorld Assembly
MetaWorld Coffee Push
Adroit Hammer
Adroit Pen
MetaWorld Dial Turn
MetaWorld Door Close
MetaWorld Lever Pull
MetaWorld Peg Unplug Side
MetaWorld Reach
Adroit Door

Main Results

We report learning curves for our method and the baselines on 7 gripper manipulation tasks from MetaWorld and 3 dexterous manipulation tasks from Adroit, all with image observations. Our method performs strongly on all tasks and significantly outperforms the baselines on the complex Door and Hammer tasks.

Zero-shot Generalization on Unseen Tasks

Diffusion Reward generalizes directly to unseen tasks and produces reasonable rewards, largely surpassing the baselines.

Citation

If you find this project helpful, please cite us:

@article{Huang2023DiffusionReward,
  title={Diffusion Reward: Learning Rewards via Conditional Video Diffusion},
  author={Tao Huang and Guangqi Jiang and Yanjie Ze and Huazhe Xu},
  journal={arxiv},
  year={2023},
}