## Introduction

CoVoMix is a text-to-speech model that can generate human-like monologue and dialogue.

The model has three parts: a text-to-semantic model (CoSingle and CoMix), an acoustic model (VoSingle and VoMix), and a HiFiGAN vocoder.



## Specification of dependencies

```
conda create -n covomix python=3.8
conda activate covomix
pip install voicebox_pytorch==0.0.34 jiwer speechbrain textgrid matplotlib soundfile librosa
cd fairseq 
pip install --editable ./
cd ../
pip install -r requirements.txt
```
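
After installation, a quick import check can confirm that the environment is complete. The following is a minimal sketch, not part of the repository: the helper name `missing_packages` is hypothetical, and the import names are assumed to match the package names in the `pip install` line above (plus `fairseq` from the editable install).

```python
import importlib.util

# Assumption: each package is importable under the name used in the pip command.
REQUIRED = [
    "voicebox_pytorch", "jiwer", "speechbrain", "textgrid",
    "matplotlib", "soundfile", "librosa", "fairseq",
]

def missing_packages(names):
    """Return the names that cannot be located in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    print("missing:", ", ".join(missing) if missing else "none")
```

Run it inside the `covomix` environment; any name it reports needs to be installed before continuing.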


## Data preparation

Make sure you have downloaded the Fisher English dataset. If not, download it first from https://catalog.ldc.upenn.edu/LDC2004T19.

#### 1. Monologue Dataset Preparation
```
python data_preparation/process_fisher_data.py \
   --audio_root=<audio (.wav) directory> \
   --transcript_root=<LDC Fisher dataset directory> \
   --dest_root=<destination directory> \
   --data_sets=LDC2004S13-Part1,LDC2005S13-Part2 \
   --remove_noises \
   --min_slice_duration 10
```

#### 2. Dialogue Dataset Preparation

**Dialogue for training the Text2semantic model:**
```
python data_preparation/process_fisher_data_conversation_overlap_text2semantic.py \
   --audio_root=<audio (.wav) directory> \
   --transcript_root=<LDC Fisher dataset directory> \
   --dest_root=<destination directory> \
   --data_sets=LDC2004S13-Part1,LDC2005S13-Part2 \
   --remove_noises
```

**Dialogue for training the Acoustic Model:**
```
python data_preparation/process_fisher_data_conversation.py \
   --audio_root=<audio (.wav) directory> \
   --transcript_root=<LDC Fisher dataset directory> \
   --dest_root=<destination directory> \
   --data_sets=LDC2004S13-Part1,LDC2005S13-Part2 \
   --remove_noises
```


#### 3. Feature Extraction

**Extract Mel-spectrogram**
```
python data_preparation/prepare_8k_mel_20ms.py \
    --processed_path <audio (.wav) directory> \
    --target_path <destination directory>
```
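
The script name `prepare_8k_mel_20ms.py` suggests 8 kHz audio and a 20 ms frame shift. Assuming that reading is correct (the script itself is authoritative), the arithmetic below shows the hop length in samples and a rough frame count for a clip. The helper names are hypothetical, and the frame count ignores any padding convention the script may apply.

```python
import math

SAMPLE_RATE_HZ = 8000  # assumption: "8k" in the script name
FRAME_SHIFT_MS = 20    # assumption: "20ms" in the script name

def hop_length(sample_rate_hz=SAMPLE_RATE_HZ, frame_shift_ms=FRAME_SHIFT_MS):
    """Number of audio samples between consecutive mel frames."""
    return sample_rate_hz * frame_shift_ms // 1000

def approx_num_frames(num_samples, hop):
    """Rough frame count for a clip, ignoring padding conventions."""
    return math.ceil(num_samples / hop)

if __name__ == "__main__":
    hop = hop_length()
    ten_seconds = 10 * SAMPLE_RATE_HZ
    print(hop, approx_num_frames(ten_seconds, hop))
```

Under these assumptions the hop is 160 samples, which gives 50 frames per second of audio.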

**Save text from the JSON file**
```
bash data_preparation/save_txt.sh $json_file $target_dir
```

**Extract HuBERT Semantic Token Sequence**
```
cd fairseq-hubert
python get_fisher_semantic_tokens.py \
    --process_dir <audio (.wav) directory> \
    --target_dir <destination directory>
```


#### 4. Inference data preparation

Before running inference:

1. Prepare a text directory. It contains the .txt files with the text you want to synthesize. To generate dialogue, use '[spkchange]' to separate each speaker's utterances.

2. Prepare a prompt directory. For monologue generation, it contains one .wav file per text file, with the same base name. Dialogue generation needs two acoustic prompts per text file, named "textfilename_1.wav" and "textfilename_2.wav".

3. Extract HuBERT semantic tokens for the prompts.
```
python get_fisher_semantic_tokens.py \
    --process_dir CoVoMix/exp/monologue/prompt_dir \
    --target_dir CoVoMix/exp/monologue/prompt_dir
```
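
The naming rules above can be checked before running inference. The following is a minimal sketch, not part of the repository: the helper names are hypothetical, and `split_turns` only illustrates the '[spkchange]' convention (how the model code itself parses the marker is defined by the repository).

```python
from pathlib import Path

SPEAKER_CHANGE = "[spkchange]"

def split_turns(text):
    """Split dialogue text into per-speaker utterances at the marker."""
    return [turn.strip() for turn in text.split(SPEAKER_CHANGE) if turn.strip()]

def expected_prompts(text_file, dialogue):
    """Prompt file names required for one text file."""
    stem = Path(text_file).stem
    if dialogue:
        return [f"{stem}_1.wav", f"{stem}_2.wav"]
    return [f"{stem}.wav"]

def missing_prompts(text_dir, prompt_dir, dialogue):
    """List prompt files that the text directory requires but that are absent."""
    missing = []
    for txt in sorted(Path(text_dir).glob("*.txt")):
        for name in expected_prompts(txt, dialogue):
            if not (Path(prompt_dir) / name).is_file():
                missing.append(name)
    return missing
```

For example, `missing_prompts("texts", "prompts", dialogue=True)` returns an empty list when every text file has both of its prompt files.
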
## Training code


#### 1. Train Text2semantic Model
**Training CoSingle Model:** `bash running_command/T2S_CoSingle.sh`

**Training CoMix Model:** `bash running_command/T2S_CoMix.sh`


#### 2. Train Acoustic Model

**Training VoSingle Model:** `bash running_command/Acous_VoSingle.sh`

**Training VoMix Model:** `bash running_command/Acous_VoMix.sh`

#### 3. Train HiFiGAN Model

```
cd hifi-gan
python train.py \
    --input_wavs_dir=<training audio (.wav) directory> \
    --input_val_wavs_dir <audio (.wav) directory> \
    --config config_covomix.json \
    --checkpoint_path <destination directory> \
    --stdout_interval 10
```

#### 4. Train speaker verification model


## Inference code

#### 1. Monologue Generation

You can choose among three modes: covosingle, covosinx, and covomix. The command below runs inference with the CoVoSingle model. To use another mode, change `--mode` and the corresponding checkpoints.

```
python monologue_generation.py \
    --t2s_ckpt CoSingle.ckpt \
    --acous_ckpt VoSingle.ckpt \
    --hifigan_ckpt vocoder.ckpt \
    --text_dir path_to_text_dir \
    --prompt_dir path_to_prompt_dir \
    --saved_dir path_to_covosingle_saved_dir \
    --mode covosingle
```
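
Switching modes means changing `--mode` together with the two checkpoints. The following is a minimal sketch of that pairing, not part of the repository. The `covosingle` and `covomix` pairs are taken from the commands in this README; the `covosinx` pair (CoSingle with VoMix) is an assumption inferred from the mode name, so verify it against the scripts before relying on it. `build_command` is a hypothetical helper.

```python
import shlex

MODE_CHECKPOINTS = {
    "covosingle": ("CoSingle.ckpt", "VoSingle.ckpt"),  # from the monologue command
    "covomix": ("CoMix.ckpt", "VoMix.ckpt"),           # from the dialogue command
    "covosinx": ("CoSingle.ckpt", "VoMix.ckpt"),       # ASSUMPTION, not stated in this README
}

def build_command(script, mode, text_dir, prompt_dir, saved_dir,
                  hifigan_ckpt="vocoder.ckpt"):
    """Assemble the argument list for an inference script in the given mode."""
    if mode not in MODE_CHECKPOINTS:
        raise ValueError(f"unknown mode: {mode}")
    t2s_ckpt, acous_ckpt = MODE_CHECKPOINTS[mode]
    return [
        "python", script,
        "--t2s_ckpt", t2s_ckpt,
        "--acous_ckpt", acous_ckpt,
        "--hifigan_ckpt", hifigan_ckpt,
        "--text_dir", text_dir,
        "--prompt_dir", prompt_dir,
        "--saved_dir", saved_dir,
        "--mode", mode,
    ]

if __name__ == "__main__":
    cmd = build_command("monologue_generation.py", "covosingle",
                        "path_to_text_dir", "path_to_prompt_dir",
                        "path_to_covosingle_saved_dir")
    print(shlex.join(cmd))  # shell-quoted command line, ready to paste
```

Returning a list rather than a string keeps the arguments safe to pass to `subprocess.run` without extra quoting.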


#### 2. Dialogue Generation 

You can choose among three modes: covosingle, covosinx, and covomix. The command below runs inference with the CoVoMix model. To use another mode, change `--mode` and the corresponding checkpoints.

```
python dialogue_generation.py \
    --t2s_ckpt CoMix.ckpt \
    --acous_ckpt VoMix.ckpt \
    --hifigan_ckpt vocoder.ckpt \
    --text_dir path_to_text_dir \
    --prompt_dir path_to_prompt_dir \
    --saved_dir path_to_covomix \
    --mode covomix
```

