# A2PO: Towards Effective Offline Reinforcement Learning from an Advantage-aware Perspective

This directory contains the implementation of our proposed Advantage-Aware Policy Optimization (A2PO).

## Requirements

- [PyTorch 2.0.0](https://github.com/pytorch/pytorch)
- [Python 3.8.10](https://www.python.org/downloads/release/python-3810/)
- [OpenAI gym 0.21.0](https://github.com/openai/gym)
- [mujoco-py 2.0.2.0](https://github.com/openai/mujoco-py)

## Datasets

[D4RL datasets](https://github.com/rail-berkeley/d4rl) are used for evaluation in this paper. We evaluate on the Gym, Maze, Kitchen, and Adroit tasks.

#### Gym tasks

Three Gym tasks are used in this paper:

- _halfcheetah-v2_
- _hopper-v2_
- _walker2d-v2_

Each task has datasets collected by different policies:

- *medium*
- *medium-replay*
- *medium-expert*
- *random-medium*
- *random-expert*
- *random-medium-expert*

The first three datasets are provided by [D4RL datasets](https://github.com/rail-berkeley/d4rl), while the last three are manually constructed. Because the supplementary material is limited to 100 MB and the manually constructed mixed-quality datasets are large, we do not include them in this directory.
However, they can be reproduced by directly combining the D4RL *random*, *medium*, and *expert* datasets.
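As a rough illustration of "directly combining" datasets, the sketch below concatenates several D4RL-style datasets (dictionaries of NumPy arrays) along the transition axis. The helper name `combine_datasets` is hypothetical and not part of this repository, and the mixing proportions are an assumption: the sketch keeps every transition from every input dataset, which may differ from the exact construction used in the paper.

```python
import numpy as np


def combine_datasets(datasets):
    """Concatenate D4RL-style datasets (dicts of NumPy arrays) along axis 0.

    Hypothetical helper: it keeps all transitions from each dataset, which is
    an assumption and not necessarily the paper's mixing scheme.
    """
    # Number of transitions in each input dataset.
    sizes = [len(d["observations"]) for d in datasets]
    # Only keys present in every dataset can be combined.
    shared_keys = set(datasets[0]).intersection(*datasets[1:])

    combined = {}
    for key in sorted(shared_keys):
        arrays = [np.asarray(d[key]) for d in datasets]
        # Skip metadata entries that are not per-transition arrays.
        per_transition = all(
            a.ndim >= 1 and a.shape[0] == n for a, n in zip(arrays, sizes)
        )
        if per_transition:
            combined[key] = np.concatenate(arrays, axis=0)
    return combined


# Possible usage with D4RL installed (not executed here):
#   import gym, d4rl
#   parts = [gym.make(f"hopper-{q}-v2").get_dataset() for q in ("random", "medium")]
#   random_medium = combine_datasets(parts)
```

If a dataset should be subsampled before mixing, slice its arrays before passing them to the helper.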

#### Maze, Kitchen and Adroit tasks

For these tasks, introduced in [D4RL datasets](https://github.com/rail-berkeley/d4rl), we directly use the D4RL datasets generated by the **expert** behavior policy.

## Usage

The paper results can be reproduced with:

```
python main.py --env=<env_name> --seed=<seed>
```

To examine the influence of individual components, extend the command as follows:

```
python main.py --env=<env_name> --use_discrete=<bool> --epsilon=<epsilon> --adv_step=<adv_step> --seed=<seed>
```
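To run the commands above over several environments and seeds, a small script can generate the command lines. The sketch below is only an illustration: the helper `build_commands` is hypothetical, the environment names assume the D4RL convention `<task>-<dataset>-v2` (check what `main.py` actually expects), and the seed values and the example `epsilon` value are placeholders rather than the settings used in the paper.

```python
import itertools
import shlex

TASKS = ("halfcheetah", "hopper", "walker2d")
DATASETS = ("medium", "medium-replay", "medium-expert")


def build_commands(tasks=TASKS, datasets=DATASETS, seeds=(0, 1, 2), **overrides):
    """Return one argument list per (task, dataset, seed) combination.

    Extra keyword arguments (e.g. epsilon=..., adv_step=...) are passed
    through as --flag=value options.
    """
    commands = []
    for task, dataset, seed in itertools.product(tasks, datasets, seeds):
        # Assumed D4RL-style environment name, e.g. "hopper-medium-v2".
        env_name = f"{task}-{dataset}-v2"
        cmd = ["python", "main.py", f"--env={env_name}"]
        cmd += [f"--{flag}={value}" for flag, value in overrides.items()]
        cmd.append(f"--seed={seed}")
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    # Print the commands instead of running them; the epsilon value is a placeholder.
    for cmd in build_commands(seeds=(0,), epsilon=0.1):
        print(shlex.join(cmd))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` to launch the corresponding run.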



