# Large-Scale Actionless Video Pre-Training via Discrete Diffusion for Efficient Policy Learning

This is the official code for the paper "Large-Scale Actionless Video Pre-Training via Discrete Diffusion for Efficient Policy Learning".
We develop a video-based policy learning framework via discrete diffusion, facilitating efficient policy learning by incorporating foresight from predicted videos.

## Pre-training
We first train a VQ-VAE to learn a unified discrete latent codebook:

`torchrun --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT --nproc_per_node=4 --nnodes=$WORLD_SIZE --node_rank=$RANK scripts/train_vqvae.py --gpus=4 --max_epoch=10 --resolution 96 --sequence_length 8 --batch_size 32`
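The command above reads four environment variables that are not set anywhere in this README. As a minimal sketch, assuming a single machine with 4 GPUs, they could be set as follows before launching. The values are illustrative assumptions, not settings taken from the paper. Note that the command passes `$WORLD_SIZE` to `--nnodes`, so here it means the number of nodes rather than the total number of processes.

```shell
# Assumed single-node setup; adjust all four values for a multi-node cluster.
export MASTER_ADDR=127.0.0.1   # address of the rank-0 node (localhost for one machine)
export MASTER_PORT=29500       # any free TCP port on the rank-0 node
export WORLD_SIZE=1            # number of nodes, passed to --nnodes
export RANK=0                  # index of this node, passed to --node_rank
```

For a multi-node run, `MASTER_ADDR` and `MASTER_PORT` would be identical on every node, while `RANK` would differ per node (0 to `WORLD_SIZE - 1`).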

We then pre-train VPDD on Meta-World:

`python scripts/pretrain_meta.py --seed 1 --model models.VideoDiffuserModel --diffusion models.GaussianVideoDiffusion --loss_type video --device cuda:0 --batch_size 10 --loader datasets.MetaDataset --act_classes 48`

or on RLBench, which requires multi-view video prediction:

`python scripts/pretrain_video_diff.py --seed 1 --model models.VideoDiffuserModel --diffusion models.MultiviewVideoDiffusion --loss_type video --device cuda:0 --batch_size 3 --loader datasets.MultiViewDataset --act_classes 360 --n_diffusion_steps 100`

## Fine-Tuning
After pre-training, we fine-tune VPDD with a limited set of robot data. The first command below fine-tunes on Meta-World and the second on RLBench:

`python scripts/pretrain_meta.py --seed 1 --model models.VideoDiffuserModel --diffusion models.GaussianVideoDiffusion --loss_type video --device cuda:0 --batch_size 1 --loader datasets.MetaFinetuneDataset --pretrain False`

`python scripts/pretrain_video_diff.py --seed 1 --model models.VideoDiffuserModel --diffusion models.MultiviewVideoDiffusion --loss_type video --device cuda:0 --batch_size 10 --loader datasets.MultiviewFinetuneDataset --pretrain False --act_classes 360`

## Dataset

Our datasets are derived from Ego4D videos and from data collected in the Meta-World and RLBench environments. The processed dataset will be open-sourced as soon as possible.

## Acknowledgment
Our code for VPDD is partly based on the following awesome projects:
- MTDiff from https://github.com/tinnerhrhe/MTDiff/
- UniD3 from https://github.com/mhh0318/UniD3