# A Principle for the Equivalence of Language Encoders

Code for the 2024 NeurIPS submission "A Principle for the Equivalence of Language Encoders".

## Installation
We run on Python 3.10.12. Install the dependencies from `requirements.txt`, e.g., in a virtual environment with pip.

```bash
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

# How to run

## Training Scripts
The paper uses nine training scripts:
- `get-embeddings.py`: generates all (25 x 12) training embeddings for sst2 and mrpc and saves them to `data/representations-large`.
- `embedding-symmetry.py` and `embedding-symmetry-simply-lr-parallel.py`: two alternative implementations that compute the distances used to evaluate asymmetry in experiment 1.
- `compute-intrinsic-parallel.py` and `compute-extrinsic-parallel.py`: compute intrinsic and extrinsic similarity for experiments 2 and 3.
- `train-classifier.py`: trains a task-specific linear probe $\phi$.
- `compute-extrinsic-equivalence-parallel.py`: computes the extrinsic Hausdorff-Hoare approximation.
- `compute-cca-parallel.py`: computes the Orthogonal Procrustes, CCA, PWCCA, and linear CKA similarity measures between final-layer embeddings.
- `embedding-symmetry-lora-simple-lr-parallel.py`: computes low-rank estimates and the corresponding intrinsic distances (experiment in the appendix).
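
For orientation, linear CKA, one of the measures listed for `compute-cca-parallel.py`, can be sketched in a few lines of NumPy. This is a minimal illustration of the standard formula and not the code from that script; the function name `linear_cka` and the random matrices are hypothetical stand-ins for final-layer embeddings of shape `(n_samples, dim)`.

```python
import numpy as np


def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between two representation matrices with the same number of rows.

    Hypothetical helper for illustration; not taken from compute-cca-parallel.py.
    """
    # Center every feature column so the comparison ignores constant offsets.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return float(cross / (norm_x * norm_y))


# Random stand-ins for two encoders' embeddings of the same 100 inputs.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))
Q, _ = np.linalg.qr(rng.normal(size=(16, 16)))  # random orthogonal matrix
Y_rotated = X @ Q                        # same representation, rotated
Y_unrelated = rng.normal(size=(100, 8))  # independent representation

print(linear_cka(X, Y_rotated))    # close to 1: invariant to orthogonal maps
print(linear_cka(X, Y_unrelated))  # lies between 0 and 1
```

Because the two matrices only need to share the number of rows, the same sketch applies to encoders with different embedding dimensions.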

## Evaluation Notebooks
The notebooks used to generate the plots for the paper are in `evaluation_notebooks/`:
- `evaluate_symmetry.ipynb`: generates Fig. 1.
- `eval-intrinsic-extrinsic.ipynb`: computes the correlations in Table 2 and generates Fig. 2.
- `eval-svd-rank.ipynb`: computes ranks to precision epsilon (see "The influence of encoder rank deficiency").
- `eval-lora.ipynb`: generates Fig. 3 in the appendix.
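
As a rough illustration of what "rank to precision epsilon" can mean for `eval-svd-rank.ipynb`, the sketch below counts singular values above a relative threshold. The relative-threshold definition, the function name `numerical_rank`, and the default `eps` are assumptions made for this example; the notebook may use a different criterion.

```python
import numpy as np


def numerical_rank(E: np.ndarray, eps: float = 1e-6) -> int:
    """Count singular values of E larger than eps times the largest one.

    Hypothetical helper; the thresholding rule is an assumption, not
    necessarily the one used in eval-svd-rank.ipynb.
    """
    singular_values = np.linalg.svd(E, compute_uv=False)
    if singular_values.size == 0 or singular_values[0] == 0.0:
        return 0
    return int(np.sum(singular_values > eps * singular_values[0]))


# A 50 x 10 matrix built as a product of 50 x 3 and 3 x 10 factors has rank 3.
rng = np.random.default_rng(0)
E_low_rank = rng.normal(size=(50, 3)) @ rng.normal(size=(3, 10))
print(numerical_rank(E_low_rank))
```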


## License

[MIT](https://choosealicense.com/licenses/mit/)
