# PENGI

Pengi is an Audio Language Model (ALM) that takes audio and text as input and produces natural language as output. It achieves state-of-the-art (SoTA) performance on multiple close-ended and open-ended audio tasks.

## Setup

To get started, install the dependencies using Python 3: `pip install -r requirements.txt`.

If you have [conda](https://www.anaconda.com) installed, you can run the following: 

```shell
cd PENGI_API && \
conda create -n pengi python=3.8 && \
conda activate pengi && \
pip install -r requirements.txt
```

## Downloads
- Download Pengi weights: [here](https://drive.google.com/file/d/10H8V4nJodcWpIrAECOaS9pmU8sX9dLFM/view?usp=sharing)
- Download 2000 sample audio files for testing: [here](https://drive.google.com/file/d/16NGpSdr3dUfEmYeS_1BqwqifEs0vMEI7/view?usp=sharing)

## Usage
The wrapper provides an easy way to get Pengi output given an audio and text input. To use the wrapper, three things are required:
- model_weights: path to the model weights in .pth format
- audio_file_path: file path of the audio to be used for inference. Use audio files shorter than 7 seconds. For audio longer than 7 seconds, chunk the file into multiple files
- input text: the recommended prompt for general use is "generate metadata". Refer to the Pengi manuscript for the performance of different prompts

```python
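from wrapper import PENGIWrapper

# Placeholder paths: point these at the downloaded weights and an audio clip.
model_weights = "<model_weights>"      # model weights in .pth format
audio_file_path = "<audio_file_path>"  # audio shorter than 7 seconds

pengi = PENGIWrapper(model_path=model_weights)

# The arguments after the audio path are passed positionally.
generated_captions, scores = pengi.predict(
    audio_file_path,
    "generate metadata",  # input text (recommended general-purpose prompt)
    "",
    30,
    3,
    1.0,
    " <|endoftext|>",
)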
```
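
For audio longer than 7 seconds, the chunking step can be done in several ways. The following is a minimal sketch, not part of the Pengi API, that assumes the input is an uncompressed PCM WAV file and uses only Python's standard `wave` module. The function name `chunk_wav`, the output file naming, and the 6-second default chunk length are illustrative assumptions chosen to stay under the 7-second limit.

```python
import wave
from pathlib import Path


def chunk_wav(src_path, out_dir, max_seconds=6.0):
    """Split a PCM WAV file into consecutive chunks of at most max_seconds.

    Hypothetical helper: max_seconds defaults to 6.0 so that every chunk
    stays under the 7-second limit mentioned above.
    Returns the list of chunk file paths in playback order.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(src_path).stem
    chunk_paths = []

    with wave.open(str(src_path), "rb") as src:
        params = src.getparams()
        frames_per_chunk = int(params.framerate * max_seconds)
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break  # reached the end of the source file
            chunk_path = out_dir / f"{stem}_chunk{index:03d}.wav"
            with wave.open(str(chunk_path), "wb") as dst:
                # Reuse the source format; the frame count in the header
                # is corrected by the wave module when frames are written.
                dst.setparams(params)
                dst.writeframes(frames)
            chunk_paths.append(str(chunk_path))
            index += 1

    return chunk_paths
```

Each returned path can then be passed to `pengi.predict` in place of `audio_file_path`, one chunk at a time. Other formats (for example MP3 or FLAC) would need a different audio library.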
