Reduce Your Serverless Automatic1111 Start Time

I've found that many users are using the Automatic1111 Stable Diffusion repo not only as a GUI, but as an API layer. If you're trying to scale a service on top of A1111, shaving a few seconds off your start time can be really important. If you need to make your Automatic1111 install start faster, this is the article for you!

Throughout this blog post, we will be referencing the files found in this repository: https://github.com/runpod/containers/tree/main/serverless-automatic

There are two major performance optimizations that we will cover in this blog post:

1) Make sure that needed huggingface files are cached

2) Pre-calculate the model hash

Both of these optimizations are taken care of by the line in the Dockerfile below that runs the cache.py script:

FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04 AS runtime

SHELL ["/bin/bash", "-o", "pipefail", "-c"]
ENV DEBIAN_FRONTEND=noninteractive \
    SHELL=/bin/bash

ENV LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/x86_64-linux-gnu
ENV PATH="/workspace/venv/bin:$PATH"
WORKDIR /workspace

ADD cache.py .
ADD install.py .
COPY --from=models SDv1-5.ckpt /sd-models/SDv1-5.ckpt

RUN apt update --yes && \
apt upgrade --yes && \
apt install --yes --no-install-recommends \
wget \
curl \
psmisc \
vim \
git \
libgl1 \
libgoogle-perftools4 \
libtcmalloc-minimal4 \
software-properties-common \
ca-certificates && \
update-ca-certificates && \
add-apt-repository ppa:deadsnakes/ppa && \
apt install python3.10-dev python3.10-venv -y --no-install-recommends && \
ln -s /usr/bin/python3.10 /usr/bin/python && \
rm /usr/bin/python3 && \
ln -s /usr/bin/python3.10 /usr/bin/python3 && \
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py && \
python get-pip.py && \
pip install -U --no-cache-dir pip && \
python -m venv /workspace/venv && \
export PATH="/workspace/venv/bin:$PATH" && \
pip install -U --no-cache-dir jupyterlab jupyterlab_widgets ipywidgets jupyter-archive && \
git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui.git && \
cd stable-diffusion-webui && \
git checkout tags/v1.3.1 && \
mv /workspace/install.py /workspace/stable-diffusion-webui/ && \
python -m install --skip-torch-cuda-test && \
pip install -U --no-cache-dir torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 && \
cd /workspace/stable-diffusion-webui/ && \
mv /workspace/cache.py /workspace/stable-diffusion-webui/ && \
python cache.py --use-cpu=all --ckpt /sd-models/SDv1-5.ckpt && \
pip cache purge && \
apt clean

COPY --from=scripts start.sh /start.sh
RUN chmod a+x /start.sh

SHELL ["/bin/bash", "--login", "-c"]
CMD [ "/start.sh" ]

The cache.py script simply imports and runs a few functions from Automatic1111's webui and modules packages:

# Running Automatic1111's normal startup path loads the SD checkpoint, which
# calculates the model hash (written to cache.json) and pulls the tokenizer and
# text-encoder config files into the Hugging Face cache.
from webui import initialize
import modules.interrogate

initialize()

# Loading the interrogator fetches its model files so they are cached as well.
interrogator = modules.interrogate.InterrogateModels("interrogate")
interrogator.load()
interrogator.categories()

If you run this against an Automatic1111 installation from the command line, you will find that it does two major things:

1) It will download some files and store them in the Hugging Face cache (/root/.cache/huggingface).

If you don't do this before launching your serverless template, the webui will have to download these files on every cold start. Yikes!
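
For the stock SD v1.x checkpoint, the files shown in the "Before" log below are the CLIP tokenizer files and text-encoder config pulled from openai/clip-vit-large-patch14. If you ever want to warm that part of the cache without importing the webui at all, a sketch like the following should fetch the same files into the Hugging Face cache (it uses the transformers package that Automatic1111 already depends on, and is an illustration rather than part of the linked repo):

# warm_hf_cache.py -- illustrative alternative to warming the cache via initialize().
# Downloads the CLIP tokenizer and text-encoder config that SD v1.x checkpoints use,
# so they land in the Hugging Face cache (/root/.cache/huggingface) at build time.
from transformers import CLIPTextConfig, CLIPTokenizer

CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")   # vocab.json, merges.txt, tokenizer_config.json, ...
CLIPTextConfig.from_pretrained("openai/clip-vit-large-patch14")  # config.json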

2) It will calculate the model hash and store it in /workspace/stable-diffusion-webui/cache.json. Automatic1111 calculates this hash by default on launch, so pre-computing it at build time removes that work from every cold start. (Alternatively, you can disable hashing entirely with the --no-hashing command-line argument.)
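
For reference, the hash being pre-computed is a plain SHA-256 over the checkpoint file, and the short hash that appears in the logs (e.g. [9aba26abdf]) is just its first ten hex characters. The sketch below only illustrates that calculation; the exact cache.json layout varies between Automatic1111 versions, so writing the file is left to the webui itself:

# hash_model.py -- illustrates the hash that gets pre-computed; storing it in
# cache.json is handled by Automatic1111 when cache.py runs.
import hashlib
import sys

def sha256_of_file(path, chunk_size=1 << 20):
    """Stream the file in 1 MiB chunks so large checkpoints don't need to fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

full_hash = sha256_of_file(sys.argv[1])  # e.g. 9aba26abdf...6586c9
print(full_hash)
print(full_hash[:10])                    # short hash shown as [9aba26abdf]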

Here's the comparison before and after:

Before

Calculating sha256 for /model.safetensors: 9aba26abdfcd46073e0a1d42027a3a3bcc969f562d58a03637bf0a0ded6586c9
Loading weights [9aba26abdf] from /model.safetensors
Creating model from config: /workspace/stable-diffusion-webui/configs/v1-inference.yaml
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
Downloading (…)olve/main/vocab.json:   0%|          | 0.00/961k [00:00<?, ?B/s]
Downloading (…)olve/main/vocab.json: 100%|██████████| 961k/961k [00:00<00:00, 6.01MB/s]
Downloading (…)olve/main/vocab.json: 100%|██████████| 961k/961k [00:00<00:00, 5.99MB/s]
Downloading (…)olve/main/merges.txt:   0%|          | 0.00/525k [00:00<?, ?B/s]
Downloading (…)olve/main/merges.txt: 100%|██████████| 525k/525k [00:00<00:00, 39.4MB/s] 
Downloading (…)cial_tokens_map.json:   0%|          | 0.00/389 [00:00<?, ?B/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 389/389 [00:00<00:00, 2.63MB/s]
Downloading (…)okenizer_config.json:   0%|          | 0.00/905 [00:00<?, ?B/s]
Downloading (…)okenizer_config.json: 100%|██████████| 905/905 [00:00<00:00, 8.45MB/s]
Downloading (…)lve/main/config.json:   0%|          | 0.00/4.52k [00:00<?, ?B/s]
Downloading (…)lve/main/config.json: 100%|██████████| 4.52k/4.52k [00:00<00:00, 23.7MB/s]
Applying xformers cross attention optimization.
Textual inversion embeddings loaded(0): 
Model loaded in 4.2s (calculate hash: 1.2s, load weights from disk: 0.6s, create model: 0.3s, apply weights to model: 1.1s, apply half(): 0.7s, move model to device: 0.3s).
Startup time: 7.9s (import torch: 0.9s, import gradio: 1.1s, import ldm: 0.6s, other imports: 0.7s, load scripts: 0.3s, load SD checkpoint: 4.2s).

After

Loading weights [none] from /model.safetensors
Creating model from config: /workspace/stable-diffusion-webui/configs/v1-inference.yaml
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
Textual inversion embeddings loaded(0): 
Model loaded in 3.0s (load weights from disk: 0.6s, create model: 0.4s, apply weights to model: 1.1s, apply half(): 0.7s, move model to device: 0.3s).
Startup time: 6.6s (import torch: 0.9s, import gradio: 1.1s, import ldm: 0.6s, other imports: 0.7s, load scripts: 0.3s, load SD checkpoint: 3.0s).
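
If you want to confirm during the image build that both caches were actually produced (rather than finding out on the first cold start), a small check like the one below could be added as an extra RUN step. This is a hypothetical addition, not part of the linked repository; it assumes the default Hugging Face cache location and the cache.json path used in the Dockerfile above, and it deliberately doesn't depend on any particular cache.json schema:

# check_cache.py -- hypothetical build-time sanity check, not part of the RunPod repo.
# Exits non-zero if either cache that cache.py should have produced is missing.
import json
import os
import sys

hf_cache = os.path.expanduser("~/.cache/huggingface")
a1111_cache = "/workspace/stable-diffusion-webui/cache.json"

# The Hugging Face cache directory should exist and contain at least one entry.
hf_ok = os.path.isdir(hf_cache) and any(os.scandir(hf_cache))

# cache.json should exist and contain some persisted data; the schema differs
# between Automatic1111 versions, so we only check that it is non-empty.
hash_ok = False
if os.path.isfile(a1111_cache):
    with open(a1111_cache) as f:
        hash_ok = bool(json.load(f))

sys.exit(0 if (hf_ok and hash_ok) else 1)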

We have found that Automatic1111's startup time is heavily CPU-bound, which means a faster CPU yields a faster startup; in our testing, the relationship is roughly linear with single-core CPU performance.

If you look closely, you will see that a relatively long time is still spent importing the torch and gradio modules. The next blog post will look at whether those import times can be optimized. Stay tuned!