# Implementation Details for Diffusion-KTO

This codebase trains and evaluates Stable Diffusion v1-5 (CreativeML Open RAIL-M license) using the Diffusion-KTO objective.

The assets used in this work (SD v1-5, datasets, preference models) are publicly available and are used according to their respective licenses. This code is released privately for the purposes of our submission, and will eventually be made public under the Apache 2.0 License (LICENSE). 

## Training Details

### Training Data

We train on a processed version of the Pick-a-Pic v2 dataset (MIT license), in which the paired preferences are converted into per-image binary feedback.

### Training Procedure

#### Preprocessing

We convert paired preferences (winning image, losing image, prompt) from Pick-a-Pic v2 into per-sample binary preferences. If an image has been labelled as losing in every comparison it appears in, we consider it undesirable. Otherwise (it has won at least one comparison), we consider it desirable.
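The conversion rule can be sketched as follows. This is a minimal illustration, not the repository's preprocessing script: the `Comparison` record, its field names, and the `to_binary_feedback` helper are hypothetical, and the sketch assumes that each image belongs to a single prompt and that every comparison has a clear winner.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Comparison(NamedTuple):
    # Hypothetical record layout; the actual Pick-a-Pic v2 schema may differ.
    prompt: str
    winner_id: str
    loser_id: str


def to_binary_feedback(
    comparisons: Iterable[Comparison],
) -> List[Tuple[str, str, bool]]:
    """Return (prompt, image_id, desirable) triples, one per distinct image."""
    desirable = {}  # image_id -> True if the image has won at least once
    prompt_of = {}  # image_id -> prompt (assumes one prompt per image)
    for c in comparisons:
        prompt_of[c.winner_id] = c.prompt
        prompt_of[c.loser_id] = c.prompt
        # A single win is enough to mark an image as desirable.
        desirable[c.winner_id] = True
        # A loss only registers the image; it never overwrites an earlier win.
        desirable.setdefault(c.loser_id, False)
    return [(prompt_of[i], i, d) for i, d in desirable.items()]


pairs = [
    Comparison("a cat", "img1", "img2"),
    Comparison("a cat", "img2", "img3"),
    Comparison("a cat", "img1", "img3"),
]
feedback = to_binary_feedback(pairs)
```

Here `img2` loses once and wins once, so it is labelled desirable, while `img3` loses every comparison and is labelled undesirable.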

#### Training Hyperparameters

We train with the following hyperparameters:

- Learning rate: 1e-7
- Batch size: 8
- Steps: 10,000

## Evaluation

We evaluate using the HPS v2, ImageReward, CLIP, LAION Aesthetics, and PickScore models. We generate images from prompts in the Pick-a-Pic v2 test set, HPS v2, and PartiPrompts, and report the win rate.

### Testing Data, Factors & Metrics

#### Testing Data

We test using prompts from the Pick-a-Pic v2 test set, HPS v2, and PartiPrompts.

#### Metrics

We report the win rate of Diffusion-KTO against another method, comparing the generations of the two methods with one of the preference models listed above.
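As a rough sketch of the metric, assume each prompt yields one generation per method and that a preference model has already assigned a score to each generation. The `win_rate` helper below is hypothetical, and counting ties as losses is an assumption of this sketch rather than a documented choice of the evaluation code.

```python
from typing import Sequence


def win_rate(scores_ours: Sequence[float], scores_baseline: Sequence[float]) -> float:
    """Fraction of prompts on which our generation receives the higher score.

    Both sequences must be aligned by prompt. Ties count as losses here,
    which is an assumption of this sketch.
    """
    if len(scores_ours) != len(scores_baseline):
        raise ValueError("score lists must be aligned by prompt")
    if not scores_ours:
        raise ValueError("need at least one prompt")
    wins = sum(a > b for a, b in zip(scores_ours, scores_baseline))
    return wins / len(scores_ours)


# Illustrative scores only, not measured results.
rate = win_rate([0.9, 0.4, 0.7, 0.6], [0.5, 0.6, 0.1, 0.6])
```

In this toy input the first and third prompts are wins, the second is a loss, and the fourth is a tie, so `rate` is 0.5.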

## Technical Specifications

### Model Architecture and Objective

We use the SD v1-5 architecture (U-Net, VAE, CLIP text encoder) and only fine-tune the U-Net with our objective. 

### Compute

We train on 4 NVIDIA A6000 GPUs for less than 1 day.