### NeurIPS Submission Code - Watermarking Makes Language Models Radioactive

This repository contains the code for our NeurIPS submission. Because the fine-tuned models are large, we cannot provide them in this submission. However, we offer some usage examples and commit to open-sourcing the data, models, and code in the future. We will also include more details on the use of filters, which are not used in the experiments shown here.

The code is adapted from the repository: https://github.com/facebookresearch/three_bricks which is licensed under the Creative Commons Attribution-NonCommercial 4.0 International Public License.


### Requirements

```cmd
conda install pytorch pytorch-cuda=11.7 -c pytorch -c nvidia
pip install -r requirements.txt
```
IMPORTANT: Before running any command, set the path to your Llama-2-chat-hf model in the `wm/paths.py` file.

### Data and models

A subset of 10k watermarked instruction/answer pairs, generated with the watermark of Kirchenbauer et al. (watermark window size 2, delta = 3, gamma = 0.25) and used in section 4, can be found in `data/used_split_maryland.jsonl`. A similar dataset, in a format compatible with the "reading mode" but with a watermark window size of 4, is available in `data/maryland_ngram4.jsonl`.
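
To make the parameters above concrete, here is a toy sketch of a green-list watermark in the style of Kirchenbauer et al.: the preceding tokens (the watermark window) and a secret key seed a pseudo-random split of the vocabulary, a fraction gamma of tokens is marked green, and delta is added to their logits. The function names, the SHA-256 based seeding, and the vocabulary size are assumptions made for illustration; this is not the hashing used in this repository.

```python
import hashlib
import random


def green_list(context, vocab_size, gamma=0.25, secret_key=0):
    """Return the set of green token ids for a given context window."""
    # Hypothetical seeding: hash the secret key together with the window.
    payload = f"{secret_key}:{','.join(map(str, context))}".encode()
    seed = int.from_bytes(hashlib.sha256(payload).digest()[:8], "big")
    rng = random.Random(seed)
    # A fraction gamma of the vocabulary is selected as green.
    return set(rng.sample(range(vocab_size), int(gamma * vocab_size)))


def bias_logits(logits, context, gamma=0.25, delta=3.0, secret_key=0):
    """Add delta to the logits of green tokens; other logits are unchanged."""
    greens = green_list(context, len(logits), gamma, secret_key)
    return [x + delta if i in greens else x for i, x in enumerate(logits)]


# Window size 2: the green list depends on the two preceding token ids.
biased = bias_logits([0.0] * 1000, context=(17, 42))
print(sum(1 for x in biased if x > 0), "tokens boosted out of", len(biased))
```

Detection reverses the process: it recomputes the green list for each window with the same secret key and counts how many observed tokens fall in it.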

As the models fine-tuned on watermarked data are too large for the submission, we store the radioactive outputs of a model trained on 5% of watermarked data in `output_closed_supervised_0p05/results.jsonl`.


## Usage - Closed model setting

(To reproduce the result, first rename the existing `results_chunked_0.jsonl`.)


For example, the following command analyzes the radioactive outputs in `output_closed_supervised_0p05/results.jsonl` by concatenating the results and applying the deduplication proposed in section 4. It corresponds to the closed-model/supervised setting with 5% of watermarked data in Figure 5.

Note that "Llama-2-7b-chat-hf" is not used here: the script notices the outputs already present in `output_closed_supervised_0p05/`, so it does not generate any answers and only scores the existing ones.

```cmd
python main_watermark.py \
    --model_name "Llama-2-7b-chat-hf" \
    --prompt_type "none" \
    --prompt_path "data/used_split_maryland.jsonl" \
    --method none --method_detect maryland \
    --ngram 2 --scoring_method v2 \
    --nsamples 10000 --batch_size 16 \
    --output_dir output_closed_supervised_0p05/ \
    --eval_chunk 1 \
    --eval_chunk_methods "closed_model_params.jsonl" \
    --prop 0.05 
```
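
The deduplication mentioned above can be pictured as follows. This is a minimal sketch under the assumption that each (watermark window, token) tuple is scored at most once and that tuples already seen elsewhere (for example in the prompt) can be excluded; the function `dedup_score` and the `is_green` callback are hypothetical names, not the repository's API.

```python
def dedup_score(token_ids, is_green, ngram=2, already_seen=None):
    """Count green tokens, scoring each (window, token) tuple at most once.

    is_green(window, token) -> bool is a caller-supplied (hypothetical) test.
    already_seen may hold tuples to exclude, e.g. those found in the prompt.
    """
    seen = set(already_seen or ())
    num_scored = num_green = 0
    for i in range(ngram, len(token_ids)):
        window = tuple(token_ids[i - ngram:i])
        key = window + (token_ids[i],)
        if key in seen:
            continue  # repeated tuple: skip it so it is not counted twice
        seen.add(key)
        num_scored += 1
        num_green += int(is_green(window, token_ids[i]))
    return num_green, num_scored


# Toy rule (illustration only): a token is "green" when its id is even.
tokens = [1, 2, 4, 1, 2, 4, 6]
print(dedup_score(tokens, lambda window, token: token % 2 == 0))
```

Scoring repeated tuples several times would break the independence assumed by the statistical test, which is why they are skipped here.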


#### Output

The previous script generates `results_chunked_0.jsonl`, which contains the following (important) fields:

| Field | Description |
| --- | --- |
| `mean_r` | Proportion of green list tokens scored |
| `num_token` | Number of analyzed tokens in the text |
| `num_scored` | Number of scored tokens in the text |
| `pvalue` | p-value of the detection test |

The resulting p-value should be close to 1e-30.
The `mean_r` field is computed at the end and reported on every line.
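
As an illustration of how these fields can be inspected, the sketch below reads the last record of a results file and shows one common way to turn green-token counts into a p-value: a one-sided binomial tail. Both the location of the file inside the output directory and the assumption that the detection test is this binomial test are ours; `read_last_record` and `binomial_pvalue` are hypothetical helpers, not functions from this repository.

```python
import json
import math
import os


def read_last_record(path):
    """Return the last JSON record of a .jsonl file."""
    with open(path) as f:
        lines = [line for line in f if line.strip()]
    return json.loads(lines[-1])


def binomial_pvalue(num_green, num_scored, gamma=0.25):
    """One-sided tail P(X >= num_green) for X ~ Binomial(num_scored, gamma)."""
    if num_green <= 0:
        return 1.0
    if num_green > num_scored:
        return 0.0
    # Sum the tail in log space to stay stable for large token counts.
    log_terms = [
        math.lgamma(num_scored + 1) - math.lgamma(k + 1)
        - math.lgamma(num_scored - k + 1)
        + k * math.log(gamma) + (num_scored - k) * math.log(1 - gamma)
        for k in range(num_green, num_scored + 1)
    ]
    top = max(log_terms)
    return min(1.0, math.exp(top) * sum(math.exp(t - top) for t in log_terms))


# Assumed location of the results file; adjust to your output directory.
path = "output_closed_supervised_0p05/results_chunked_0.jsonl"
if os.path.exists(path):
    record = read_last_record(path)
    for field in ("mean_r", "num_token", "num_scored", "pvalue"):
        print(field, record.get(field))

# Toy numbers (not from the paper): 300 green tokens out of 1000 scored.
print(binomial_pvalue(300, 1000))
```

With gamma = 0.25, a proportion of green tokens only slightly above 0.25 can already give a very small p-value once enough tokens are scored.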


The following command instead generates outputs from Llama-2-7b-chat-hf for the watermarked prompts and runs the radioactivity detection test on them. Since this model was not trained on watermarked data, the resulting p-value should be random rather than close to zero. Without deduplication (`no_dedup = 1`), the model would falsely appear radioactive because the prompts are watermarked.

```cmd
python main_watermark.py \
    --model_name "Llama-2-7b-chat-hf" \
    --prompt_type "none" \
    --prompt_path "data/used_split_maryland.jsonl" \
    --method none --method_detect maryland \
    --ngram 2 --scoring_method v2 \
    --nsamples 100 --batch_size 16 \
    --output_dir output_closed/ \
    --eval_chunk 1 \
    --eval_chunk_methods "closed_model_params.jsonl" \
    --prop 0.05 
```

(To reproduce the result, first rename the existing `results_chunked_0.jsonl`.)


## Usage - Open model setting

The following command runs the reading mode with deduplication on `data/maryland_ngram4.jsonl`.
(To reproduce the result, first rename the existing `results_grams.jsonl`.)

```cmd
python main_reed_wm.py \
    --model_name "Llama-2-7b-chat-hf" \
    --dataset_path2 "data/maryland_ngram4.jsonl" \
    --method_detect maryland \
    --degree 0 \
    --prop 1 \
    --nsamples 1000 \
    --batch_size 16 \
    --output_dir output_open/ \
    --ngram 4
```

Since "Llama-2-7b-chat-hf" is not radioactive, this should lead to a random p-value. Note that the text was watermarked with the same secret key and watermark window size as the ones used for detection.