Overcoming the 77-token prompt limitation, custom model loading, LoRA support, textual inversion support, and more
Stable Diffusion WebUI from AUTOMATIC1111 has proven to be a powerful tool for generating high-quality images with diffusion models. However, while the WebUI is easy to use, data scientists, machine learning engineers, and researchers often need more control over the image generation process. That is where the diffusers package from Hugging Face comes in: it provides a way to run diffusion models in Python and allows users to customize their models and prompts to generate images tailored to their specific needs.
Despite its potential, the diffusers package has several limitations that prevent it from producing images as good as those from the Stable Diffusion WebUI. The most significant of these limitations include:
- The inability to use custom models in the .safetensors file format;
- The 77 prompt token limitation;
- A lack of LoRA support;
- The absence of image upscaling functionality (known as HighRes in Stable Diffusion WebUI);
- Low performance and high VRAM usage by default.
This article aims to address these limitations and enable the diffusers package to generate high-quality images comparable to those produced by the Stable Diffusion WebUI. With the enhancements described here, data scientists, machine learning engineers, and researchers can enjoy greater control and flexibility in their image generation processes while also achieving exceptional results. In the following sections, we will explore the various techniques that can be used to overcome these limitations and unlock the full potential of the diffusers package.
Note: if it is your first time running Stable Diffusion, please follow this link to install all required CUDA and Python packages.
1. Load Up Local Model Files in .safetensors Format
Users can easily spin up diffusers to generate an image like this:
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipeline.to("cuda")
image = pipeline("A cute cat playing piano").images[0]
image.save("image_of_cat_playing_piano.png")
You may not be satisfied with either the output image or the performance. Let's deal with the problems one by one. First, let's load a custom model in .safetensors format located anywhere on your machine. Note that you cannot simply load the model file like this:
pipeline = DiffusionPipeline.from_pretrained("/mannequin/custom_model.safetensors")
Here are the detailed steps to convert a .safetensors file to the diffusers format:
Step 1. Pull the diffusers code from GitHub:
git clone https://github.com/huggingface/diffusers.git
Step 2. Under the scripts folder, locate the file convert_original_stable_diffusion_to_diffusers.py.
In your terminal, run this command to convert the .safetensors file to the diffusers format. Remember to change the --checkpoint_path and --dump_path values to match your case.
python convert_original_stable_diffusion_to_diffusers.py --from_safetensors --checkpoint_path="D:\stable-diffusion-webui\models\Stable-diffusion\deliberate_v2.safetensors" --dump_path='D:\sd_models\deliberate_v2' --device='cuda:0'
Step 3. Now you can load the pipeline using the newly converted model files. Here is the complete code:
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained(
    r"D:\sd_models\deliberate_v2"
)
pipeline.to("cuda")
image = pipeline("A cute cat playing piano").images[0]
image.save("image_of_cat_playing_piano.png")
You should be able to convert and use any model you download from Hugging Face or civitai.com.
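By the way, if you are on a recent diffusers release (the from_ckpt helper appeared around v0.15, if I am reading the release notes correctly), you may be able to load a .safetensors checkpoint directly and skip the conversion script. Treat this as an assumption to verify against your installed version, not something benchmarked for this article:

from diffusers import StableDiffusionPipeline

# Assumes diffusers >= 0.15: from_ckpt converts the checkpoint in memory
# instead of writing a diffusers-format folder to disk first.
pipeline = StableDiffusionPipeline.from_ckpt(
    r"D:\stable-diffusion-webui\models\Stable-diffusion\deliberate_v2.safetensors"
)
pipeline.to("cuda")

The pre-converted folder produced by the script is still handy if you load the same model often, since the conversion then only happens once.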
2. Improve the Performance of Diffusers
Generating high-quality images can be a time-consuming process even for the latest 3xxx and 4xxx Nvidia RTX GPUs. By default, the diffusers package comes with non-optimized settings. Two solutions can be applied to greatly boost performance.
Here is the iteration speed before applying the following solutions: only about 2.x iterations per second on an RTX 3070 Ti (8 GB VRAM) when generating a 512×512 image.
- Use Half Precision Weights
The first solution is to use half precision weights. Half precision weights use 16-bit floating-point numbers instead of the traditional 32-bit numbers. This reduces the memory required for storing weights and speeds up computation, which can significantly improve the performance of the diffusers package.
According to this video, reducing float precision from FP32 to FP16 will also enable the Tensor Cores.
I wrote another article testing how much GPU Tensor Cores can boost computation speed.
Here is how to enable FP16 in diffusers. Adding just two lines of code boosts performance by 500%, with almost no impact on image quality.
from diffusers import DiffusionPipeline
import torch                                # <----- Line 1 added

pipeline = DiffusionPipeline.from_pretrained(
    r"D:\sd_models\deliberate_v2"
    ,torch_dtype = torch.float16            # <----- Line 2 added
)
pipeline.to("cuda")
image = pipeline("A cute cat playing piano").images[0]
image.save("image_of_cat_playing_piano.png")
Now the iteration speed jumps to 10.x iterations per second, about 5x faster.
- Use Xformers
Xformers is an open-source library that provides a set of high-performance transformer building blocks, originally for natural language processing (NLP) tasks. It is built on top of PyTorch and aims to provide efficient and scalable transformer implementations that can be easily integrated into existing pipelines. (These days, are there any models that don't use a Transformer? :P)
Install Xformers with pip install xformers, then switch diffusers over to xformers with one line of code.
...
pipeline.to("cuda")
pipeline.enable_xformers_memory_efficient_attention()  # <--- one line added
...
This one line of code boosts performance by another 20%.
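If you want to verify these numbers on your own hardware, here is a minimal timing sketch. My 2.x and 10.x figures came from an RTX 3070 Ti, so your results will differ; note also that this measures whole-call throughput, which includes text encoding and VAE decoding overhead:

import time
import torch
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained(
    r"D:\sd_models\deliberate_v2"
    ,torch_dtype = torch.float16
)
pipeline.to("cuda")
pipeline.enable_xformers_memory_efficient_attention()

steps = 25
start = time.perf_counter()
pipeline("A cute cat playing piano", num_inference_steps=steps)
elapsed = time.perf_counter() - start
print(f"{steps / elapsed:.2f} iterations per second (rough estimate)")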
3. Remove the 77 Prompt Tokens Limitation
In the current version of diffusers, there is a limitation of 77 prompt tokens that can be used in the generation of images.
Fortunately, there is a solution to this problem. By using the "lpw_stable_diffusion" pipeline provided by the community, you can unlock the 77 prompt token limitation and generate high-quality images with longer prompts.
To use the "lpw_stable_diffusion" pipeline, you can use the following code:
pipeline = DiffusionPipeline.from_pretrained(
    model_path,
    custom_pipeline="lpw_stable_diffusion",  #<--- code added
    torch_dtype=torch.float16
)
In this code, we initialize a new DiffusionPipeline object using the from_pretrained method, specifying the path to the pre-trained model and setting the custom_pipeline argument to "lpw_stable_diffusion". This tells diffusers to use the "lpw_stable_diffusion" community pipeline, which unlocks the 77 prompt token limitation.
Now, let's try it out with a longer prompt string. Here is the complete code:
from diffusers import DiffusionPipeline
import torch

pipeline = DiffusionPipeline.from_pretrained(
    r"D:\sd_models\deliberate_v2"
    ,custom_pipeline = "lpw_stable_diffusion"  #<--- code added
    ,torch_dtype = torch.float16
)
pipeline.to("cuda")
pipeline.enable_xformers_memory_efficient_attention()

prompt = """
Babel tower falling down, walking on the starlight, dreamy ultra wide shot
, atmospheric, hyper realistic, epic composition, cinematic, octane render
, artstation landscape vista photography by Carr Clifton & Galen Rowell, 16K resolution
, Landscape veduta photo by Dustin Lefevre & tdraw, detailed landscape painting by Ivan Shishkin
, DeviantArt, Flickr, rendered in Enscape, Miyazaki, Nausicaa Ghibli, Breath of The Wild
, 4k detailed post processing, artstation, rendering by octane, unreal engine
"""
image = pipeline(prompt).images[0]
image.save("goodbye_babel_tower.png")
And you will get an image like this:
If you still see a warning message like: Token indices sequence length is longer than the specified maximum sequence length for this model ( *** > 77 ). Running this sequence through the model will result in indexing errors.
It's normal, and you can safely ignore it.
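If you are curious how many tokens your prompt actually uses (and why the warning fires), you can count them with the pipeline's own tokenizer; a quick sketch, reusing the pipeline and prompt from the example above:

# Count the CLIP tokens in a prompt using the pipeline's tokenizer.
# CLIP's native context window is 77 tokens, including begin/end markers.
token_ids = pipeline.tokenizer(prompt).input_ids
print(f"prompt uses {len(token_ids)} tokens")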
4. Use Custom LoRA with Diffusers
Despite the claims of LoRA support in diffusers, users still face limitations when it comes to loading local LoRA files in the .safetensors file format. This can be a significant obstacle for users who want to use LoRAs from the community.
To overcome this limitation, I created a function that allows users to load a LoRA file with a weight number in real time. This function can be used to apply a LoRA file and its corresponding weight to a diffusers model, enabling the generation of high-quality images with LoRA data.
Here is the function body:
import torch
from safetensors.torch import load_file

def __load_lora(
    pipeline
    ,lora_path
    ,lora_weight=0.5
):
    state_dict = load_file(lora_path)
    LORA_PREFIX_UNET = 'lora_unet'
    LORA_PREFIX_TEXT_ENCODER = 'lora_te'
    alpha = lora_weight

    visited = []

    # directly update the weights in the diffusers model
    for key in state_dict:
        # as we have set the alpha beforehand, just skip these keys
        if '.alpha' in key or key in visited:
            continue

        if 'text' in key:
            layer_infos = key.split('.')[0].split(LORA_PREFIX_TEXT_ENCODER + '_')[-1].split('_')
            curr_layer = pipeline.text_encoder
        else:
            layer_infos = key.split('.')[0].split(LORA_PREFIX_UNET + '_')[-1].split('_')
            curr_layer = pipeline.unet

        # find the target layer
        temp_name = layer_infos.pop(0)
        while len(layer_infos) > -1:
            try:
                curr_layer = curr_layer.__getattr__(temp_name)
                if len(layer_infos) > 0:
                    temp_name = layer_infos.pop(0)
                elif len(layer_infos) == 0:
                    break
            except Exception:
                if len(temp_name) > 0:
                    temp_name += '_' + layer_infos.pop(0)
                else:
                    temp_name = layer_infos.pop(0)

        # org_forward(x) + lora_up(lora_down(x)) * multiplier
        pair_keys = []
        if 'lora_down' in key:
            pair_keys.append(key.replace('lora_down', 'lora_up'))
            pair_keys.append(key)
        else:
            pair_keys.append(key)
            pair_keys.append(key.replace('lora_up', 'lora_down'))

        # update weight; cast back to the layer's dtype so this also
        # works when the pipeline is loaded in fp16
        if len(state_dict[pair_keys[0]].shape) == 4:
            weight_up = state_dict[pair_keys[0]].squeeze(3).squeeze(2).to(torch.float32)
            weight_down = state_dict[pair_keys[1]].squeeze(3).squeeze(2).to(torch.float32)
            update = alpha * torch.mm(weight_up, weight_down).unsqueeze(2).unsqueeze(3)
        else:
            weight_up = state_dict[pair_keys[0]].to(torch.float32)
            weight_down = state_dict[pair_keys[1]].to(torch.float32)
            update = alpha * torch.mm(weight_up, weight_down)
        curr_layer.weight.data += update.to(curr_layer.weight.data.dtype)

        # update visited list
        for item in pair_keys:
            visited.append(item)

    return pipeline
The logic is extracted from the convert_lora_safetensor_to_diffusers.py script in the diffusers GitHub repo.
Take one of the well-known LoRAs, MoXin, for example. You can use the __load_lora function like this:
from diffusers import DiffusionPipeline
import torch

pipeline = DiffusionPipeline.from_pretrained(
    r"D:\sd_models\deliberate_v2"
    ,custom_pipeline = "lpw_stable_diffusion"
    ,torch_dtype = torch.float16
)
lora = (r"D:\sd_models\Lora\Moxin_10.safetensors", 0.8)
pipeline = __load_lora(pipeline=pipeline, lora_path=lora[0], lora_weight=lora[1])
pipeline.to("cuda")
pipeline.enable_xformers_memory_efficient_attention()

prompt = """
shukezouma, negative space, shuimobysim
a branch of flower, traditional chinese ink painting
"""
image = pipeline(prompt).images[0]
image.save("a branch of flower.png")
The prompt will generate an image like this:
You can call __load_lora() multiple times to load several LoRAs for one generation, as in the sketch below.
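For example, a two-LoRA setup could look like the following. Note that the second file name is a placeholder for illustration, not a real model:

# Hypothetical example: stack two LoRAs with different weights.
loras = [
    (r"D:\sd_models\Lora\Moxin_10.safetensors", 0.8)
    ,(r"D:\sd_models\Lora\another_lora.safetensors", 0.5)  # placeholder name
]
for lora_path, lora_weight in loras:
    pipeline = __load_lora(pipeline=pipeline, lora_path=lora_path, lora_weight=lora_weight)

Since the function adds the weighted LoRA deltas directly into the model weights, each call stacks on top of the previous ones.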
With this function, you can now load LoRA files with weight numbers in real time and use them to generate high-quality images with diffusers. The LoRA loading is pretty fast, usually taking only 1–2 seconds, which is way better than converting and merging (which generates another GB-sized model file).
5. Use Custom Textual Inversions with Diffusers
Using custom Textual Inversions with the diffusers package can be a powerful way to generate high-quality images. However, the official diffusers documentation suggests that users need to train their own Textual Inversions, which can take up to an hour on a V100 GPU. This may not be practical for many users who want to generate images quickly.
So I investigated it and found a solution that enables diffusers to use a textual inversion just like Stable Diffusion WebUI does. Below is the function I created to load a custom Textual Inversion.
import torch

def load_textual_inversion(
    learned_embeds_path
    , text_encoder
    , tokenizer
    , token = None
    , weight = 0.5
):
    '''
    Use this function to load a textual inversion embedding at model
    initialization time or at image generation time.
    '''
    loaded_learned_embeds = torch.load(learned_embeds_path, map_location="cpu")
    string_to_token = loaded_learned_embeds['string_to_token']
    string_to_param = loaded_learned_embeds['string_to_param']

    # separate token and the embeds
    trained_token = list(string_to_token.keys())[0]
    embeds = string_to_param[trained_token]
    embeds = embeds[0] * weight

    # cast to dtype of text_encoder
    dtype = text_encoder.get_input_embeddings().weight.dtype
    embeds = embeds.to(dtype)

    # add the token to the tokenizer
    token = token if token is not None else trained_token
    num_added_tokens = tokenizer.add_tokens(token)
    if num_added_tokens == 0:
        raise ValueError(f"The tokenizer already contains the token {token}. "
                         "Please pass a different `token` that is not already in the tokenizer.")

    # resize the token embeddings
    text_encoder.resize_token_embeddings(len(tokenizer))

    # get the id for the token and assign the embeds
    token_id = tokenizer.convert_tokens_to_ids(token)
    text_encoder.get_input_embeddings().weight.data[token_id] = embeds
    return (tokenizer, text_encoder)
In the load_textual_inversion() function, you need to provide the following arguments:
- learned_embeds_path: Path to the pre-trained textual inversion model file in .pt or .bin format.
- text_encoder: Text encoder object obtained from the diffusion pipeline.
- tokenizer: Tokenizer object obtained from the diffusion pipeline.
- token: Optional argument specifying the prompt token. By default it is set to None. It is the keyword that will trigger the textual inversion in your prompt.
- weight: Optional argument specifying the weight of the textual inversion. By default I set it to 0.5. You can change it to other values as needed.
You can now use the function with a diffusers pipeline like this:
from diffusers import DiffusionPipeline
import torch

pipeline = DiffusionPipeline.from_pretrained(
    r"D:\sd_models\deliberate_v2"
    ,custom_pipeline = "lpw_stable_diffusion"
    ,torch_dtype = torch.float16
    ,safety_checker = None
)

textual_inversion_path = r"D:\sd_models\embeddings\style-empire.pt"
tokenizer = pipeline.tokenizer
text_encoder = pipeline.text_encoder
load_textual_inversion(
    learned_embeds_path = textual_inversion_path
    , tokenizer = tokenizer
    , text_encoder = text_encoder
    , token = 'styleempire'
)
pipeline.to("cuda")
pipeline.enable_xformers_memory_efficient_attention()
immediate = """
styleempire,award successful stunning road, storm,((darkish storm clouds))
, fluffy clouds within the sky, shaded flat illustration, digital artwork
, trending on artstation, extremely detailed, positive element, intricate
, ((lens flare)), (backlighting), (bloom)
"""
neg_prompt = """
cartoon, 3d, ((disfigured)), ((dangerous artwork)), ((deformed)), ((poorly drawn))
, ((additional limbs)), ((shut up)), ((b&w)), bizarre colours, blurry
, hat, cap, glasses, sun shades, lightning, face
"""
generator = torch.Generator("cuda").manual_seed(1)
image = pipeline(
    prompt
    ,negative_prompt = neg_prompt
    ,generator = generator
).images[0]
image.save("tv_test.png")
Here is the result of applying an Empire Style Textual Inversion.
The modern street on the left turns into an old London style.
6. Upscale Images
The diffusers package is great for generating high-quality images, but image upscaling is not its primary function. However, Stable-Diffusion-WebUI offers a feature called HighRes, which allows users to upscale their generated images by 2x or 4x. It would be great if diffusers users could enjoy the same feature. After some research and testing, I found that the SwinIR model is an excellent option for image upscaling, and it can easily upscale images by 2x or 4x after they are generated.
To use the SwinIR model for image upscaling, we can use the code from the GitHub repository of JingyunLiang/SwinIR. If you just want the code, downloading models/network_swinir.py, utils/util_calculate_psnr_ssim.py, and main_test_swinir.py is enough. Following the readme guidelines, you can upscale images like magic; a sample invocation is sketched below.
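As a rough sketch, a real-world 4x upscale run looks something like this. The flags come from the SwinIR readme, so double-check them against the version you download; the input folder here is a placeholder for wherever your generated images live:

python main_test_swinir.py --task real_sr --scale 4 --model_path 003_realSR_BSRGAN_DFO_s64w8_SwinIR-M_x4_GAN.pth --folder_lq path_to_your_generated_images

The upscaled images are written to a results folder next to the script.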
Here is a sample of how well SwinIR can scale up an image.
Many other open-source solutions can be used to improve image quality. Here are three other models I tried that return wonderful results.
RealSR can scale up an image 4 times almost as well as SwinIR, and its execution is the fastest: instead of invoking PyTorch and CUDA at runtime, the author compiles the code and CUDA usage directly into a binary. My observations show that RealSR can upscale an image in just about 2–4 seconds.
CodeFormer is good at restoring blurred or broken faces; it can also remove noise and enhance background details. This solution and algorithm are widely used in other applications, including Stable-Diffusion-WebUI.
GFPGAN is another powerful open-source solution that achieves amazing face restoration results, and it is fast too. GFPGAN is also integrated into Stable-Diffusion-WebUI.
7. Optimize Diffusers CUDA Memory Usage
When using diffusers to generate images, it is important to consider CUDA memory usage, especially when you want to load other models to further process the generated images. If you try to load another model like SwinIR to upscale images, you might encounter a RuntimeError: CUDA out of memory because the diffusers model still occupies the CUDA memory.
To mitigate this issue, there are several options for optimizing CUDA memory usage. The following two worked best for me:
- Sliced Attention for Additional Memory Savings
Sliced attention is a technique that reduces the memory usage of the self-attention mechanism in transformers. By partitioning the attention matrix into smaller blocks, the memory requirement is reduced. This technique can be used with the diffusers package to shrink the memory footprint of the diffusers model.
To use it in diffusers, it takes just one line of code:
pipeline.enable_attention_slicing()
- Offload Model Data to CPU Memory
Usually, you won't have two models running at the same time. The idea is to offload the model data to CPU memory temporarily, freeing up CUDA memory for other models, and only load it back to VRAM when you start using the model.
To dynamically offload model data to CPU memory in diffusers, use this line of code:
pipeline.enable_model_cpu_offload()
After applying this, whenever diffusers finishes an image generation task, the model data will be offloaded to CPU memory automatically until the next call.
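Putting the two options together, here is a minimal sketch of a memory-friendly setup, followed by explicitly releasing the pipeline before loading an upscaler such as SwinIR (the explicit cleanup is my own habit, not something diffusers requires):

import gc
import torch
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained(
    r"D:\sd_models\deliberate_v2"
    ,torch_dtype = torch.float16
)
pipeline.enable_attention_slicing()    # slice self-attention to save VRAM
pipeline.enable_model_cpu_offload()    # no pipeline.to("cuda") needed; offload manages devices

image = pipeline("A cute cat playing piano").images[0]

# When done with the pipeline entirely, free its VRAM for the next model.
del pipeline
gc.collect()
torch.cuda.empty_cache()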
Summary
This article discusses how to improve the performance and capabilities of the diffusers package. It covers several solutions to common issues faced by diffusers users, including loading local .safetensors models, boosting performance, removing the 77 prompt token limitation, using custom LoRAs and Textual Inversions, upscaling images, and optimizing CUDA memory usage.
By applying these solutions, diffusers users can generate high-quality images with better performance and more control over the process. The article also includes code snippets and detailed explanations for each solution.
If you can successfully apply these solutions and code to your own case, there is a further benefit, one that I have profited from a lot: you may implement your own solutions by reading the diffusers source code and gain a better understanding of how Stable Diffusion works. To me, reading, discovering, and implementing these solutions has been a fun journey. I hope these solutions help you too, and I wish you joy with Stable Diffusion and the diffusers package.
Here is the prompt that generated the heading image:
Babel tower falling down, walking on the starlight, dreamy ultra wide shot
, atmospheric, hyper realistic, epic composition, cinematic, octane render
, artstation landscape vista photography by Carr Clifton & Galen Rowell, 16K resolution
, Landscape veduta photo by Dustin Lefevre & tdraw, detailed landscape painting by Ivan Shishkin
, DeviantArt, Flickr, rendered in Enscape, Miyazaki, Nausicaa Ghibli, Breath of The Wild
, 4k detailed post processing, artstation, rendering by octane, unreal engine
Size: 600 × 800
Seed: 3977059881
Scheduler (or sampling method): DPMSolverMultistepScheduler
Sampling steps: 25
CFG Scale (or Guidance Scale): 7.5
SwinIR model: 003_realSR_BSRGAN_DFO_s64w8_SwinIR-M_x4_GAN.pth
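If you want to reproduce the heading image in diffusers, here is a sketch that maps those settings onto pipeline arguments, assuming the same deliberate_v2 model used throughout this article (exact reproduction across machines and library versions is not guaranteed):

import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler

pipeline = DiffusionPipeline.from_pretrained(
    r"D:\sd_models\deliberate_v2"
    ,custom_pipeline = "lpw_stable_diffusion"
    ,torch_dtype = torch.float16
)
# Swap in the scheduler listed above.
pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config)
pipeline.to("cuda")
pipeline.enable_xformers_memory_efficient_attention()

prompt = "Babel tower falling down, walking on the starlight, dreamy ultra wide shot"  # paste the full prompt from above

generator = torch.Generator("cuda").manual_seed(3977059881)
image = pipeline(
    prompt
    ,width = 600
    ,height = 800
    ,num_inference_steps = 25
    ,guidance_scale = 7.5
    ,generator = generator
).images[0]
image.save("babel_tower.png")

The SwinIR 4x upscale then runs as a separate step, as described in section 6.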
License and Code Reuse
The solutions presented in this article were achieved through extensive source reading, late-night testing, and logical design. It is important to note that, at the time of writing (April 2023), the LoRA and Textual Inversion loading solutions and code included in this article are the only working versions across the internet.
If you find the code presented in this article useful and want to reuse it in your project, paper, or article, please reference back to this Medium article. The code presented here is licensed under the MIT license, which permits you to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the software, subject to the conditions of the license.
Please note that the solutions presented in this article may not be the optimal or most efficient way to achieve the desired results, and they are subject to change as new developments and improvements are made. It is always recommended to thoroughly test and validate any code before implementing it in a production environment.