# Fully Transparent Self-Alignment for Code Generation

## Data generation pipeline

> Run `pip install -e .` first to install the package locally. Check [seed_gathering](seed_gathering/) for details on how we collected the seeds.

By default, we use an in-memory vLLM engine for data generation. You can instead use vLLM's [OpenAI compatible server](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html).

Set `CUDA_VISIBLE_DEVICES=...` to specify the GPU devices to use for the vLLM engine.

To maximize data generation efficiency, we recommend invoking the script multiple times with different `seed_code_start_index` and `max_new_data` values, each with a vLLM engine running on a separate set of GPUs. For example, for a 100k seed dataset on a 2-GPU machine, you can run 2 processes that each generate 50k samples by setting `CUDA_VISIBLE_DEVICES=0 --seed_code_start_index 0 --max_new_data 50000` and `CUDA_VISIBLE_DEVICES=1 --seed_code_start_index 50000 --max_new_data 50000`.
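The two-process split above generalizes to other GPU counts. The following is a minimal sketch that only prints the per-process settings and launches nothing. It assumes one process per GPU, and the helper name `print_shard_settings` is hypothetical (it is not part of this repository).

```shell
# Hypothetical helper, not part of the repository: print one settings line per GPU.
# Assumes one process per GPU. The body runs in a subshell, so variables stay local.
print_shard_settings() (
    total_seeds=$1
    num_gpus=$2
    # Ceiling division, so the last shard picks up any remainder.
    per_shard=$(( (total_seeds + num_gpus - 1) / num_gpus ))
    gpu=0
    while [ "$gpu" -lt "$num_gpus" ]; do
        start=$(( gpu * per_shard ))
        count=$(( total_seeds - start ))
        if [ "$count" -gt "$per_shard" ]; then
            count=$per_shard
        fi
        if [ "$count" -gt 0 ]; then
            echo "CUDA_VISIBLE_DEVICES=$gpu --seed_code_start_index $start --max_new_data $count"
        fi
        gpu=$(( gpu + 1 ))
    done
)

# Reproduces the 100k-seed, 2-GPU example from the paragraph above.
print_shard_settings 100000 2
```

Each printed line gives the environment variable and the two flags for one process; the remaining flags are the same for every shard.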

<details>

<summary>Click to see how to run with vLLM's OpenAI compatible API</summary>

To do so, make sure the vLLM server is running, and the associated `openai` environment variables are set.

For example, you can start a vLLM server with `docker`:

```shell
docker run --gpus '"device=0"' \
    -v $HF_HOME:/root/.cache/huggingface \
    -p 10000:8000 \
    --ipc=host \
    vllm/vllm-openai:v0.3.3 \
    --model Qwen/CodeQwen1.5-7B \
    --tensor-parallel-size 1 --dtype bfloat16
```

And then set the environment variables as follows:

```shell
export OPENAI_API_KEY="EMPTY"
export OPENAI_BASE_URL="http://localhost:10000/v1/"
```

You will also need to set `--use_vllm_server True` in the following commands.

</details>

<details>

<summary>Snippet to concepts generation</summary>

```shell
MODEL=Qwen/CodeQwen1.5-7B
MAX_NEW_DATA=1000000
python src/star_align/self_ossinstruct.py \
    --use_vllm_server False \
    --instruct_mode "S->C" \
    --seed_data_files /path/to/seeds.jsonl \
    --max_new_data $MAX_NEW_DATA \
    --tag concept_gen \
    --temperature 0.7 \
    --seed_code_start_index 0 \
    --model $MODEL \
    --num_fewshots 8 \
    --num_batched_requests 2000 \
    --num_sample_per_request 1
```

</details>

<details>

<summary>Concepts to instruction generation</summary>

```shell
MODEL=Qwen/CodeQwen1.5-7B
MAX_NEW_DATA=1000000
python src/star_align/self_ossinstruct.py \
    --instruct_mode "C->I" \
    --seed_data_files /path/to/concepts.jsonl \
    --max_new_data $MAX_NEW_DATA \
    --tag instruction_gen \
    --temperature 0.7 \
    --seed_code_start_index 0 \
    --model $MODEL \
    --num_fewshots 8 \
    --num_sample_per_request 1 \
    --num_batched_requests 2000
```

</details>

<details>

<summary>Instruction to response (with self-validation code) generation</summary>

```shell
MODEL=Qwen/CodeQwen1.5-7B
MAX_NEW_DATA=1000000
python src/star_align/self_ossinstruct.py \
    --instruct_mode "I->R" \
    --seed_data_files /path/to/instructions.jsonl \
    --max_new_data $MAX_NEW_DATA \
    --tag response_gen \
    --seed_code_start_index 0 \
    --model $MODEL \
    --num_fewshots 1 \
    --num_batched_requests 500 \
    --num_sample_per_request 10 \
    --temperature 0.7
```

</details>

<details>

<summary>Execution filter</summary>

> **Warning:** Although we implemented reliability guards, we highly recommend running execution in the sandbox environment we provide.

To use the Docker container for executing code, first run `git submodule update --init --recursive` to clone the server, then run:

```shell
pushd ./src/star_align/code_exec_server
./pull_and_run.sh
popd
python src/star_align/execution_filter.py \
    --response_paths /path/to/response.jsonl \
    --result_path /path/to/filtered.jsonl \
    --max_batched_tasks 10000 \
    --container_server http://127.0.0.1:8000
```

The execution filter produces a flattened list of JSONL entries, each with a `pass` field indicating whether the execution passed. **It also dumps results incrementally and can load cached partial data files.** You can resume an interrupted run with:

```shell
python src/star_align/execution_filter.py \
    --response_paths /path/to/response.jsonl* \
    --cache_paths /path/to/filtered.jsonl* \
    --result_path /path/to/filtered-1.jsonl \
    --max_batched_tasks 10000 \
    --container_server http://127.0.0.1:8000
```

Note that execution can sometimes slow down significantly due to excessive resource consumption. To alleviate this, you can limit the Docker container's CPU usage (e.g., `docker run --cpuset-cpus="0-31"`). You can also set a cleanup command that runs after each batch:

```shell
# For example, you can set the command to be `sudo pkill -f '/tmp/codeexec'`
export CLEANUP_COMMAND="the command to execute after each batch"
python src/star_align/execution_filter.py...
```

The container connection may also be lost during execution. In this case, re-run the script using the caching mechanism described above.

</details>

<details>

<summary>Data sanitization and selection</summary>

```shell
# Uncomment to do decontamination
# export MBPP_PATH="/path/to/mbpp.jsonl"
# export DS1000_PATH="/path/to/ds1000_data"
# export DECONTAMINATION=1
./sanitize.sh /path/to/exec-filtered.jsonl /path/to/sanitized.jsonl
```

</details>

## Training Details

> Run `pip install -e .` first to install the package locally. Also install [Flash Attention](https://github.com/Dao-AILab/flash-attention) to speed up training.

### Hyperparameters

- **Optimizer:** Adafactor
- **Learning rate:** 1e-5
- **Epochs:** 2
- **Batch size:** 64
- **Warmup ratio:** 0.05
- **Scheduler:** Linear
- **Sequence length:** 1280
- **Dropout:** Not applied
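To get a feel for what the warmup ratio means in steps, the sketch below estimates the total number of optimizer steps and the warmup steps from the hyperparameters above. It is a rough estimate only: the dataset size of 50,000 examples is a made-up illustration, the helper name `estimate_steps` is hypothetical, and the trainer's exact rounding may differ slightly.

```shell
# Hypothetical helper, not part of the repository. Prints "<total_steps> <warmup_steps>".
# The warmup ratio is passed as an integer percentage because POSIX shell
# arithmetic has no floating point (5 means 0.05).
estimate_steps() (
    num_examples=$1
    epochs=$2
    batch_size=$3
    warmup_percent=$4
    # Ceiling division: a partial final batch still counts as one step.
    steps_per_epoch=$(( (num_examples + batch_size - 1) / batch_size ))
    total_steps=$(( steps_per_epoch * epochs ))
    # Ceiling of total_steps * warmup_percent / 100.
    warmup_steps=$(( (total_steps * warmup_percent + 99) / 100 ))
    echo "$total_steps $warmup_steps"
)

# 50,000 examples (illustrative), 2 epochs, batch size 64, warmup ratio 0.05.
estimate_steps 50000 2 64 5
```

With these illustrative inputs, the learning rate would ramp up over roughly the first 5% of optimizer steps and then follow the linear schedule.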

### Script

<details>

<summary>Click to see the training script</summary>

```shell
MODEL_KEY=Qwen/CodeQwen1.5-7B
LR=1e-5
EPOCH=2
SEQ_LEN=1280
WARMUP_RATIO=0.05
OUTPUT_DIR=/path/to/output_model
DATASET_FILE=/path/to/dataset.jsonl
# Make sure BS * ACC * num_gpu = 64
BS=2
ACC=8
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch -m star_align.train \
    --model_key $MODEL_KEY \
    --model_name_or_path $MODEL_KEY \
    --use_flash_attention True \
    --datafile_paths $DATASET_FILE \
    --output_dir $OUTPUT_DIR \
    --bf16 True \
    --num_train_epochs $EPOCH \
    --max_training_seq_length $SEQ_LEN \
    --pad_to_max_length False \
    --per_device_train_batch_size $BS \
    --gradient_accumulation_steps $ACC \
    --group_by_length False \
    --ddp_find_unused_parameters False \
    --logging_steps 1 \
    --log_level info \
    --optim adafactor \
    --max_grad_norm -1 \
    --warmup_ratio $WARMUP_RATIO \
    --learning_rate $LR \
    --lr_scheduler_type linear
```

</details>
