Image-to-image local Stable Diffusion pipelines with ControlNet in Swift

Wednesday, January 17, 2024

Converting ControlNet models to Core ML

Let’s start by converting DionTimmer’s ControlNet model, along with Runway’s Stable Diffusion v1.5 model, to Core ML using the Apple script I showed in my previous article:

Terminal

python -m python_coreml_stable_diffusion.torch2coreml \
    --bundle-resources-for-swift-cli \
    --convert-unet \
    --convert-text-encoder \
    --convert-vae-decoder \
    --convert-vae-encoder \
    --model-version runwayml/stable-diffusion-v1-5 \
    --unet-support-controlnet \
    --convert-controlnet DionTimmer/controlnet_qrcode-control_v1p_sd15 \
    -o generated

The command above looks very similar to the one I used in my previous article, but this time I’m passing a couple of extra important parameters:

--unet-support-controlnet: This tells the script to generate a UNet model that supports ControlNet.
--convert-controlnet DionTimmer/controlnet_qrcode-control_v1p_sd15: This tells the script to convert a ControlNet model from Hugging Face to Core ML.

The script might take some time to run but, once it finishes, it will generate a Resources folder inside the output directory (generated in this case), which you can bring directly into your project.

If you’re unsure about what these files are or how to import them, or want to learn about best practices when dealing with large Core ML models, make sure you check out my previous article on the topic.
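To check what the script produced before wiring anything up, a small helper can list the compiled models in a folder. This is only a minimal sketch: `compiledModelNames(in:)` is a hypothetical helper of my own and not part of Apple’s package, and it simply looks for `.mlmodelc` bundles directly inside the directory you give it.

```swift
import Foundation

/// Hypothetical helper (not part of ml-stable-diffusion): returns the names,
/// without extension, of the compiled Core ML models (`.mlmodelc` bundles)
/// found directly inside `directory`, sorted alphabetically.
func compiledModelNames(in directory: URL) throws -> [String] {
    let contents = try FileManager.default.contentsOfDirectory(
        at: directory,
        includingPropertiesForKeys: nil
    )
    return contents
        .filter { $0.pathExtension == "mlmodelc" }
        .map { $0.deletingPathExtension().lastPathComponent }
        .sorted()
}
```

Assuming the converted ControlNet model ends up in a `controlnet` subfolder of Resources (that is my understanding of the layout, so double-check your own output), calling `try compiledModelNames(in: resourcesURL.appendingPathComponent("controlnet"))` should show the name you need to pass to the pipeline later.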

Loading the models in Swift

Let’s now load these models into a StableDiffusionPipeline using Apple’s ml-stable-diffusion Swift Package:

StableDiffusionPipeline.swift

import AppKit // NSImage lives in AppKit
import Foundation
import StableDiffusion
import CoreML

func generate(prompt: String, startingImageURL: URL, numberOfImages: Int) async throws -> [CGImage] {
    // Use the bundle URL directly: round-tripping through a percent-encoded
    // path string can break paths that contain spaces.
    guard let url = Bundle.module.url(forResource: "Resources", withExtension: nil) else {
        return []
    }
    let configuration = MLModelConfiguration()
    configuration.computeUnits = .all
    let pipeline = try StableDiffusionPipeline(resourcesAt: url,
                                               // 1
                                               controlNet: ["DiontimmerControlnetQrcodeControlV1PSd15"],
                                               configuration: configuration,
                                               disableSafety: false,
                                               reduceMemory: false)
    try pipeline.loadResources()

    // 2
    // Note: `resized(to:)` is a custom NSImage extension, not an AppKit API.
    let startingNSImage = NSImage(contentsOf: startingImageURL)?.resized(to: .init(width: 512, height: 512))
    guard let startingImage = startingNSImage?.cgImage(forProposedRect: nil, context: nil, hints: nil) else {
        return []
    }

    var pipelineConfig = StableDiffusionPipeline.Configuration(prompt: prompt)
    // 3
    pipelineConfig.negativePrompt = "ugly, disfigured, low quality, blurry, nsfw"
    // 4
    pipelineConfig.controlNetInputs = [startingImage]
    pipelineConfig.startingImage = startingImage
    pipelineConfig.useDenoisedIntermediates = true
    // 5
    pipelineConfig.strength = 0.9
    pipelineConfig.seed = UInt32.random(in: (0..<UInt32.max))
    // 6
    pipelineConfig.guidanceScale = 7.5
    pipelineConfig.stepCount = 50
    pipelineConfig.originalSize = 512
    pipelineConfig.targetSize = 512
    pipelineConfig.imageCount = numberOfImages

    // The progress handler returns a Bool: true keeps generating, false cancels.
    return try pipeline.generateImages(configuration: pipelineConfig, progressHandler: { _ in true })
        .compactMap { $0 }
}
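The `resized(to:)` call in the listing is not an AppKit method, so it has to come from a helper that is not shown here. As a rough sketch of one way to do the same job, the function below resizes at the `CGImage` level, which guarantees exact pixel dimensions regardless of the display’s scale factor. The function name and signature are my own assumptions, and it stretches the image rather than cropping it.

```swift
import CoreGraphics

/// Hypothetical helper: redraws `image` into a bitmap of exactly
/// `width` x `height` pixels, stretching it if the aspect ratio differs.
func resized(_ image: CGImage, width: Int, height: Int) -> CGImage? {
    guard let context = CGContext(
        data: nil,
        width: width,
        height: height,
        bitsPerComponent: 8,
        bytesPerRow: 0, // let Core Graphics pick the row stride
        space: CGColorSpaceCreateDeviceRGB(),
        bitmapInfo: CGImageAlphaInfo.premultipliedLast.rawValue
    ) else {
        return nil
    }
    context.interpolationQuality = .high
    context.draw(image, in: CGRect(x: 0, y: 0, width: width, height: height))
    return context.makeImage()
}
```

With something like this in place, you could load the image, convert it to a `CGImage` first, and then call `resized(cgImage, width: 512, height: 512)` before handing it to the pipeline.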

I covered most of this code in my previous article, so I will only go through the new parts here, which match the numbered comments in the code:

1. Pass the name of the ControlNet model to use in the pipeline. The string passed here should match the name of the model in the Resources folder.
2. Load the starting image from disk, resize it to the size that the model expects (512x512 here) and convert it to a CGImage. If you pass an image that is too big or too small, the pipeline will throw an error.
3. Pass a negative prompt to the pipeline. In this case, I used the same one that this model uses on the Replicate API, as it seems to work very well.
4. Set the starting image in the pipeline configuration and also pass it as a ControlNet input.
5. Set the strength of the pipeline to a number below 1. This is important as Apple’s Stable Diffusion SDK will only set the pipeline mode to image-to-image if both the strength is below 1 and there is a starting image.
6. Set the guidanceScale to the value that works for your use case. In this case, I used the same value that this model uses on the Replicate API, and it seems to do the job well.
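The rule behind the strength setting can be sketched in a few lines. This is an illustration of the behaviour described above, not Apple’s actual code: `GenerationMode` and `generationMode(hasStartingImage:strength:)` are hypothetical names I made up for the example.

```swift
/// Illustrative stand-in for the pipeline's mode; the name is hypothetical.
enum GenerationMode: Equatable {
    case textToImage
    case imageToImage
}

/// Mirrors the rule described above: image-to-image is only selected
/// when there is a starting image AND the strength is below 1.
func generationMode(hasStartingImage: Bool, strength: Float) -> GenerationMode {
    guard hasStartingImage, strength < 1.0 else {
        return .textToImage
    }
    return .imageToImage
}
```

So with a starting image and a strength of 0.9, as in the configuration above, the mode would be image-to-image, while leaving the strength at 1 would quietly fall back to text-to-image even though a starting image was provided.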

I want to give a massive shout-out to Dion Timmer for putting together both the model and some amazing documentation, and to Nathan Raw for coming up with the Replicate API endpoint and open-sourcing the code. Both were very helpful in finding the right values to work with.

Results

Now that the pipeline is set up, let’s run a couple of prompts and check out the results.

Prompt: Darth Vader from Star Wars

May the force be with you!

Prompt: A dark-eyed Junco

The prompt is a bit longer than that and I got it from this Reddit post. 🐦😅