<div align="center">

<h1 style="text-align: center;">🌊 SEA: Spectral Editing of Activations for Large Language Model Alignment</h1>


\
![SEA pipeline](figures/sea-pipeline.png)
<div align="left">

## Abstract
Large language models (LLMs) often exhibit undesirable behaviours, e.g., generating untruthful or biased content. Editing their internal representations has been shown to be effective in mitigating such behaviours. We propose a novel _inference-only_ editing method, namely spectral editing of activations (SEA), to project the input representations into directions with _maximal_ covariance with the positive demonstrations (e.g., truthful) while _minimising_ covariance with the negative demonstrations (e.g., hallucinated). We also extend our method to non-linear editing using feature functions. Extensive experiments on benchmarks concerning truthfulness and bias with six popular open-sourced LLMs of different sizes and model families demonstrate the superiority in inference efficiency and effectiveness of our method compared to several strong baselines.

## Setup Environment

All required packages and their versions are listed in `sea.yml`. Before creating the environment, change the prefix `/PATH-TO-CONDA/` in that file to your own conda path. Then set up the environment with:

```
conda env create --name sea --file=sea.yml
```

Please also put your own Hugging Face token into `src/models/huggingface.py` to access required models such as LLaMA-2-Chat-7B.

## How to Run
### Run SEA from Scratch

Running SEA takes three steps:

1. First, trace the activations for the positive, negative, and base demonstrations. Once these activations are computed and saved, this step does not need to be run again:

```
sh prepare-activations.sh
```

2. Once we have all activations, we use spectral decomposition to compute positive and negative editing projections, $\overline{\mathbf{U}^+} \cdot {\overline{\mathbf{U}^+}^{\top}}$ and $\overline{\mathbf{U}^-} \cdot {\overline{\mathbf{U}^-}^{\top}}$.

3. At inference time, we simply apply these editing projections as an additional layer in the forward computation when evaluating LLMs on benchmarks.
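
Steps 2 and 3 can be illustrated with a minimal NumPy sketch. This is not the code in this repository: the function names (`cross_covariance`, `editing_projection`), the random stand-in activations, and the value of `k` are all hypothetical. The sketch assumes that the positive projection keeps the top-`k` singular directions of the cross-covariance between base and positive activations, and that the negative projection keeps the bottom-`k` directions for the negative activations. How the two edited activations are combined afterwards is not covered here.

```python
import numpy as np


def cross_covariance(base, demo):
    """Empirical cross-covariance between two activation matrices.

    base, demo: arrays of shape (n_samples, hidden_dim).
    """
    base_c = base - base.mean(axis=0, keepdims=True)
    demo_c = demo - demo.mean(axis=0, keepdims=True)
    return base_c.T @ demo_c / base.shape[0]


def editing_projection(base, demo, k, keep="top"):
    """Build a projection matrix U_bar @ U_bar.T from k singular directions.

    keep="top" selects the directions with the largest singular values,
    keep="bottom" the directions with the smallest ones (an assumption
    made for this sketch).
    """
    omega = cross_covariance(base, demo)
    u, _, _ = np.linalg.svd(omega)  # singular values come sorted descending
    u_bar = u[:, :k] if keep == "top" else u[:, -k:]
    return u_bar @ u_bar.T


# Random stand-ins for traced activations (hypothetical data).
rng = np.random.default_rng(0)
n, d, k = 50, 8, 3
base_acts = rng.normal(size=(n, d))
pos_acts = base_acts + 0.1 * rng.normal(size=(n, d))
neg_acts = base_acts + 0.1 * rng.normal(size=(n, d))

proj_pos = editing_projection(base_acts, pos_acts, k, keep="top")
proj_neg = editing_projection(base_acts, neg_acts, k, keep="bottom")

# "Inference": project one hidden state with the positive projection.
h = rng.normal(size=d)
h_edit = h @ proj_pos  # proj_pos is symmetric, so this equals proj_pos @ h
```

Because each projection is built from orthonormal singular vectors, it is symmetric and idempotent, so applying it twice gives the same result as applying it once.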

**To run the second and third steps, simply run the following scripts for TruthfulQA and BBQ:**

```
sh run-truthfulqa.sh
sh run-bbq.sh
```