# ConStat: Performance-Based Contamination Detection in Large Language Models

This repository contains the code needed to reproduce the results from the paper *ConStat: Performance-Based Contamination Detection in Large Language Models*. This README covers installation, how to use ConStat on your own problem, and how to reproduce our results.

## Installation
You can install the code in this repo by installing [Conda](https://docs.conda.io/projects/miniconda/en/latest/) and running the following commands:

```bash
conda create -n constat python=3.10
conda activate constat
python -m pip install -e .
```
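
As a quick sanity check after installation, the sketch below tests whether the package can be found in the active environment. It assumes the package is importable as `constat`, as in the usage example in this README; the helper name `is_importable` is our own and not part of the library.

```python
import importlib.util


def is_importable(module_name):
    """Return True if a top-level module can be found in the current environment."""
    return importlib.util.find_spec(module_name) is not None


if is_importable("constat"):
    print("constat is installed")
else:
    print("constat was not found; activate the conda environment and re-run the install")
```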

If you want to reproduce our results, also run the following commands:
```bash
python -m pip install -r requirements.txt
cd lm-evaluation-harness
python -m pip install -e .
cd ..
```

## Using ConStat

When using this library, you only need to interact with a single class, `ConStat`. You create an instance and pass its `test` method the per-sample performances of your model and of your reference models. The code below gives a simple example where we generate the performances randomly using NumPy:

```python
from constat import ConStat
import numpy as np

constat = ConStat(
    n_bootstrap=10000, # Number of bootstraps to use. We suggest a high number, even though it might take a while
    p_value_delta=0, # the \delta value from our paper for which to report the p-value in a test
    random_performance=(0,0) # ConStat will add a random model at this point. If you have a multiple-choice benchmark, this should become something like (0.25,0.25)
)

accuracies_model = np.random.randint(0, 2, 1000) # For each sample in the actual benchmark, the accuracy of your model on that sample. This can also be any other metric you want to use.
accuracies_model_reference = np.random.randint(0, 2, 1000) # For each sample in the reference benchmark, the accuracy of your model on that sample without contamination.

accuracies = np.random.randint(0, 2, (10, 1000)) # the accuracies of your reference models on the benchmark
accuracies_reference = np.random.randint(0, 2, (10, 1000)) # the accuracies of your reference models on the reference benchmark

result = constat.test(accuracies_model, accuracies_model_reference, accuracies, accuracies_reference) # perform the test
result
# {'p_value': 0.4528, # p-value for the test
#  'estimated_contamination': 0.49376129894065124, # estimated performance of the model on the benchmark if it weren't contaminated
#  'estimated_contamination_025': 0.4616516952656119, # 2.5% lower bound of the estimated performance
#  'estimated_contamination_975': 0.5247609019013911, # 97.5% upper bound of the estimated performance
#  'estimated_contamination_std': 0.015812938662854926, # std of the estimated performance
#  'delta': 0.0023555010593487333, # Estimated \delta
#  'delta_std': 0.0223297155214044, # Estimated std of \delta
#  'min_delta_095': -0.034313983164635666} # Estimated 95% lower bound of \delta
```
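
The returned dictionary can be reduced to a short verdict. The following is a minimal sketch, not part of the library: the helper `summarize_constat_result` and the significance level `alpha=0.05` are our own assumptions, and it presumes that a small p-value is read as evidence of contamination. It only uses keys shown in the example output above.

```python
def summarize_constat_result(result, alpha=0.05):
    """Turn a ConStat result dict into a flag and a one-line message.

    `alpha` is an assumed significance level, not a value prescribed by ConStat.
    """
    p_value = result["p_value"]
    delta = result["delta"]
    lower = result["min_delta_095"]
    flagged = p_value < alpha
    verdict = (
        "contamination suspected"
        if flagged
        else "no significant evidence of contamination"
    )
    message = (
        f"{verdict}: p={p_value:.4f}, delta={delta:.4f} "
        f"(95% lower bound {lower:.4f})"
    )
    return {"flagged": flagged, "message": message}


# Values copied from the example output above (random data, so nothing is flagged).
example_result = {
    "p_value": 0.4528,
    "delta": 0.0023555010593487333,
    "delta_std": 0.0223297155214044,
    "min_delta_095": -0.034313983164635666,
}
summary = summarize_constat_result(example_result)
print(summary["message"])
```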


## Reproduction
We include the raw results for our paper in the [tables/](tables/) folder. You can run the notebooks in the [notebooks/](notebooks/) folder to extract the results as they were presented in the paper. To produce the raw results yourself, follow the steps below. We assume you have an OpenAI API key and a Together API key and are running on a machine with a GPU that has at least 80GB of VRAM. Add your API keys to your environment variables by running the following commands in a terminal:

```bash
export OPENAI_API_KEY=[YOUR OPENAI KEY]
export TOGETHER_API_KEY=[YOUR Together KEY]
```
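
Before starting the longer-running scripts, it can help to confirm that both keys are actually visible to Python. This is a small sketch under our own naming (`REQUIRED_KEYS` and `missing_api_keys` are hypothetical helpers, not part of this repository); it only checks that the variables are set and non-empty, not that the keys are valid.

```python
import os

# Environment variables named in this README.
REQUIRED_KEYS = ("OPENAI_API_KEY", "TOGETHER_API_KEY")


def missing_api_keys(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name)]


missing = missing_api_keys()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("All API keys are set.")
```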

To run some models (e.g., Llama 2), you need to be logged into your Hugging Face account with a token that has read and write access:
```bash
huggingface-cli login # follow the instructions after this
```
If you rerun the finetuning process, the code will automatically upload the finetuned models to private repos in your account. They will not be stored locally. In all the following commands, replace `USERNAME` with your Hugging Face username.

Then, ensure you are in the main folder of this repository and run the following commands:

```bash
python scripts/preprocessing.py # Downloads the 4 benchmarks we use
python scripts/generate_rephrase.py # Uses OpenAI to generate rephrases of the various benchmarks
python scripts/generate_synthetic.py # Uses OpenAI to generate synthetic samples of the various benchmarks
bash scripts/finetune.sh # Finetunes the models on the benchmark data
```

After this, you need to run the evaluation for all models:
```bash
cd lm-evaluation-harness # goes to the LM Eval Harness
bash main_small.sh # run smaller models that can be run locally
bash main_controlled.sh USERNAME # run the trained models
bash main_large.sh # run the larger models through the Together API
```
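
If you prefer not to edit the `USERNAME` placeholder by hand, the evaluation commands can be assembled programmatically. The sketch below is an illustration only: `EVAL_STEPS`, `build_eval_commands`, and `run_eval` are hypothetical helpers of ours, the script names are copied from the commands above, and the default is a dry run that only prints the commands.

```python
import subprocess

# Evaluation scripts as listed in this README; "{username}" marks the placeholder.
EVAL_STEPS = [
    ["bash", "main_small.sh"],
    ["bash", "main_controlled.sh", "{username}"],
    ["bash", "main_large.sh"],
]


def build_eval_commands(username):
    """Substitute the Hugging Face username into each evaluation command."""
    if not username or username == "USERNAME":
        raise ValueError("pass your actual Hugging Face username")
    return [[part.format(username=username) for part in step] for step in EVAL_STEPS]


def run_eval(username, workdir="lm-evaluation-harness", dry_run=True):
    """Print each command and, unless dry_run is set, execute it in workdir."""
    for command in build_eval_commands(username):
        print("$ " + " ".join(command))
        if not dry_run:
            # check=True stops the loop as soon as one script fails.
            subprocess.run(command, cwd=workdir, check=True)


run_eval("your-hf-username")  # dry run: prints the three commands without executing them
```

Calling `run_eval("your-hf-username", dry_run=False)` from the main folder of the repository would execute the scripts in order.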

Now, we just need to run the detection and the baselines:

```bash
cd .. ## goes back to the main folder
bash scripts/detection.sh USERNAME # runs ConStat
bash scripts/evaluate_baselines.sh USERNAME # runs most baselines
cd code-contamination-detection
bash bash.sh USERNAME # run the Shi baseline
```

You can also run the simulation experiments presented in the appendix:

```bash
cd .. ## goes back to the main folder
bash scripts/simulation.sh
```

Now, you can run the notebooks from the [notebooks/](notebooks/) folder to extract the results yourself.