# Overview

This repo runs the experiments on nonlinear systems, such as classic control and MuJoCo environments, for our paper.

It mainly does the following:

1. Generate a stable policy for continuous-time systems (we use dt=0.001 as a proxy) using the [CDAU](http://proceedings.mlr.press/v97/tallec19a.html) algorithm.
2. Generate a dataset of reward observations of the system at dt=0.001 for 300k/600k episodes (the minimum number of episodes that works for all choices of $h$, $B$ and trials).
3. Compute the MSE for various data budgets and values of $h$ from the dataset of step 2, and plot the results.
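
As a rough illustration of step 3, the sketch below subsamples a finely sampled reward sequence at a coarser step and measures the squared error of the resulting return estimates. It is only a sketch under stated assumptions, not the code used for the paper: it assumes $h$ is a sampling step in physical time, that the data budget counts reward observations, and that returns are undiscounted Riemann sums. The function names (`estimate_return`, `episodes_for_budget`, `mse`) are hypothetical and do not exist in this repo.

```python
def estimate_return(rewards, base_dt, h):
    """Riemann-sum estimate of the integrated reward, using one sample every h.

    `rewards` is assumed to be sampled every `base_dt` (e.g. 0.001).
    """
    stride = round(h / base_dt)  # number of fine steps per coarse step
    return sum(rewards[::stride]) * h


def episodes_for_budget(budget, horizon, h):
    """Number of episodes affordable when each episode costs horizon / h samples."""
    samples_per_episode = round(horizon / h)
    return budget // samples_per_episode


def mse(estimates, true_value):
    """Mean squared error of a list of estimates against a reference value."""
    return sum((e - true_value) ** 2 for e in estimates) / len(estimates)


# Toy data: a constant reward of 1.0 over a horizon of 1.0 at dt=0.001,
# so the true integrated reward is 1.0.
toy_rewards = [1.0] * 1000
toy_estimate = estimate_return(toy_rewards, base_dt=0.001, h=0.01)
toy_episodes = episodes_for_budget(budget=1000, horizon=1.0, h=0.01)
toy_mse = mse([toy_estimate] * toy_episodes, true_value=1.0)
```

Under these assumptions, a larger $h$ makes each episode cheaper, so a fixed budget covers more episodes, at the price of a coarser approximation of each return.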

Steps 1 and 2 take most of the computation time.


## Usage

Typically, the scripts in the `code` subdirectory are run in the form
```
python code/main.py --logdir existing_dir_where_you_want_to_store_your_logs [options]
```
The most critical options are
- `--algo`: which algo to use. Currently, the relevant algorithm is **cdau** (for continuous-action deep advantage updating).
- `--dt`: inverse of the framerate, or discretization timestep. Expect learning epochs & steps to scale as 1/dt.
- `--noise_type`: whether you want to use temporally coherent noise or independent noise.
- `--gamma`: the discount factor **IN PHYSICAL TIME**. The actual discount factor will be **gamma^dt**.
- `--env_id`: environment to train upon.
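
To make the interaction between `--gamma` and `--dt` concrete, here is a small worked example using the values from the run example in this README (`--gamma 0.8`, `--dt 0.02`, `--time_limit 10`). The helper names are hypothetical and are not part of the codebase, and the step count assumes that `--time_limit` is expressed in physical time.

```python
def per_step_discount(gamma: float, dt: float) -> float:
    """Discount applied at each simulation step: gamma ** dt."""
    return gamma ** dt


def steps_per_episode(time_limit: float, dt: float) -> int:
    """Number of simulation steps in an episode; scales as 1 / dt."""
    return round(time_limit / dt)


gamma, dt, time_limit = 0.8, 0.02, 10.0
step_gamma = per_step_discount(gamma, dt)
n_steps = steps_per_episode(time_limit, dt)

# One unit of physical time spans 1 / dt = 50 steps, so compounding the
# per-step discount over 50 steps recovers gamma itself.
discount_over_unit_time = step_gamma ** round(1 / dt)

print(f"per-step discount: {step_gamma:.6f}")
print(f"steps per episode: {n_steps}")
print(f"discount over one unit of time: {discount_over_unit_time:.6f}")
```

Halving `dt` doubles the number of steps per episode while leaving the discount per unit of physical time unchanged, which is why `--gamma` can stay fixed when the discretization changes.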

Other parameters are algorithm-dependent. Run `python code/main.py --help` for more details, or check the code for the precise behavior.


### Run examples

Bipedal walker with CDAU:
```bash
mkdir -p log/logdir

python code/main.py --algo cdau --steps_btw_train 10 --noise_type coherent --batch_size 256 --hidden_size 256 --nb_layers 1 --gamma 0.8 --nb_steps 100 --sigma 1.5 --theta 7.5 --nb_train_env 256 --nb_eval_env 64 --memory_size 1000000 --learn_per_step 50 --eval_gap 0.05   --weight_decay 0.0   --tau 0.0 --optimizer rmsprop --env_id bipedal_walker --time_limit 10 --dt 0.02 --normalize_state  --lr 0.1 --policy_lr 0.02   --nb_true_epochs 20 --redirect_stdout --logdir log/logdir
```

The default arguments are defined in `code/parse.py`.

Note that this is just an example.
Since we ran most training and inference on a cluster, we could not include that code for anonymity reasons.
