# Large Language Models Are Implicitly Topic Models: Explaining and Finding Good Demonstrations for In-Context Learning


## Preparation

The code is tested with Python 3.8.

A large portion of the code is based on the [MetaICL codebase](https://github.com/facebookresearch/MetaICL).

Install the data dependencies and download the data.
```bash
conda create --name metaicl-data python=3.8
conda activate metaicl-data
pip install datasets==1.4.0 wget
cd preprocess
python run.py
```

After preprocessing is done, return to the main directory.
```bash
cd ../
conda deactivate
```

Now, install the model dependencies. Note that the Transformers version pinned below is incompatible with the version of the datasets library used to download the data, so make sure to use a separate environment.
```bash
conda create --name metaicl python=3.8
conda activate metaicl
pip install torch==1.9.0
pip install git+https://github.com/huggingface/transformers.git@c37573806ab3526dd805c49cbe2489ad4d68a9d7
pip install sentence-transformers
```

(Optional) Install the OpenAI Python library to run GPT-3:
```bash
pip install openai
```

Before running the scripts, you need to set a few environment variables in `.bashrc` under your home directory:
```bash
# MetaICL variables
export ICL_DATA_DIR="./data/"
export ICL_OUT_DIR="./"
```
You can also point them to any other directories you prefer.
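
As a quick sanity check before launching any script, the following sketch reads the two variables from Python and prints the paths they resolve to. The helper name `resolve_icl_dirs` and the fallback defaults are assumptions made for illustration (the defaults simply mirror the exports shown above); the repository's own scripts may read these variables differently.

```python
import os


def resolve_icl_dirs(env=None):
    """Return (data_dir, out_dir) taken from the environment.

    Hypothetical helper, not part of the repository. The fallback
    values mirror the example exports in this README.
    """
    env = os.environ if env is None else env
    data_dir = env.get("ICL_DATA_DIR", "./data/")
    out_dir = env.get("ICL_OUT_DIR", "./")
    return data_dir, out_dir


if __name__ == "__main__":
    data_dir, out_dir = resolve_icl_dirs()
    # Show absolute paths so a wrong working directory is easy to spot.
    print("ICL_DATA_DIR =", os.path.abspath(data_dir))
    print("ICL_OUT_DIR  =", os.path.abspath(out_dir))
```

If the printed paths are not what you expect, open a new shell (so that `.bashrc` is re-read) or export the variables manually in the current session.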

## Latent concept learning

To obtain the "concept" tokens, first run the script `tensorize.sh`, then run the script `train.sh`. We train on a set of datasets simultaneously and obtain separate concept tokens for each dataset. Such a set of datasets is called a `$TASK`. Currently, we define three tasks: glue, diverse, and tune. You can see which datasets each task contains by inspecting the corresponding `.json` file in `./config`. The concept token embeddings of these three tasks will be stored in `./checkpoints`.
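
To inspect a task definition without opening the file by hand, a small sketch like the one below can load and pretty-print it. It assumes each task's file is named `<task>.json` inside `./config`, which is a guess based on the description above; the helper `load_task_config` is hypothetical, and no particular schema inside the JSON file is assumed.

```python
import json
import os


def load_task_config(task, config_dir="./config"):
    """Load the JSON config for a task.

    Hypothetical helper: assumes the file lives at <config_dir>/<task>.json.
    """
    path = os.path.join(config_dir, task + ".json")
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


if __name__ == "__main__":
    for task in ("glue", "diverse", "tune"):
        try:
            config = load_task_config(task)
        except FileNotFoundError:
            print(task + ": no config file found at the assumed path")
            continue
        # Print the raw contents; the internal layout is not assumed here.
        print(task)
        print(json.dumps(config, indent=2))
```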

## Demonstration selection

To evaluate test-set performance under different ways of choosing the in-context examples, use `prior.sh`. `--prior easiest` selects the examples using our proposed method. `--prior most_similar` selects the examples that are most similar to the test input. If the `--prior` line is commented out, examples are chosen uniformly at random. More details on each argument can be found in `test.py`.
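
To make the idea behind the `most_similar` baseline concrete, here is a minimal sketch of similarity-based selection: rank the candidate demonstrations by the cosine similarity between their embedding and the test input's embedding, then keep the top `k`. This is not the code in `test.py`. The function names, the toy vectors, and the choice of cosine similarity are all assumptions for illustration; in practice the embeddings would likely come from a sentence encoder rather than being written by hand.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_most_similar(test_vec, candidate_vecs, k):
    """Return indices of the k candidates most similar to test_vec.

    Hypothetical helper. Ties are broken by the lower candidate index.
    """
    scored = [(cosine(test_vec, vec), idx) for idx, vec in enumerate(candidate_vecs)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]


if __name__ == "__main__":
    # Toy 2-d "embeddings" of four candidate demonstrations.
    candidates = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
    print(select_most_similar([1.0, 0.0], candidates, k=2))  # prints [0, 2]
```

Uniform selection corresponds to replacing the ranking step with a random sample of `k` indices.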