Region-Adaptive Sampling for Diffusion Transformers
Ziming Liu1*, Yifan Yang2, Chengruidong Zhang2, Yiqi Zhang1, Lili Qiu2, Yang You1†, Yuqing Yang2†
1National University of Singapore, 2Microsoft Research
{liuziming, youy}@comp.nus.edu.sg {yifanyang, yuqyang}@microsoft.com
https://aka.ms/ras-dit
Abstract
Diffusion models (DMs) have become the leading choice for generative tasks across diverse domains. However, their reliance on multiple sequential forward passes significantly limits real-time performance. Previous acceleration methods have primarily focused on reducing the number of sampling steps or reusing intermediate results, failing to leverage variations across spatial regions within the image due to the constraints of convolutional U-Net structures. By harnessing the flexibility of Diffusion Transformers (DiTs) in handling a variable number of tokens, we introduce RAS, a novel, training-free sampling strategy that dynamically assigns different sampling ratios to regions within an image based on the focus of the DiT model. Our key observation is that during each sampling step, the model concentrates on semantically meaningful regions, and these areas of focus exhibit strong continuity across consecutive steps. Leveraging this insight, RAS updates only the regions currently in focus, while other regions are updated using cached noise from the previous step. The model's focus is determined based on the output from the preceding step, capitalizing on the temporal consistency we observed. We evaluate RAS on Stable Diffusion 3 and Lumina-Next-T2I, achieving speedups of up to 2.36x and 2.51x, respectively, with minimal degradation in generation quality. Additionally, a user study reveals that RAS delivers comparable quality under human evaluation while achieving a 1.6x speedup. Our approach makes a significant step towards more efficient diffusion transformers, enhancing their potential for real-time applications. Our code is available at https://github.com/microsoft/RAS.

Figure 1. The main subject and the regions with more details are brushed for more steps than other regions in RAS. Each block represents a patchified latent token.

* This work was done during Ziming Liu's internship at Microsoft Research.
† Corresponding authors.

1. Introduction

Diffusion models (DMs) [8, 18, 41, 42] have proven to be highly effective probabilistic generative models, producing high-quality data across various domains. Applications of DMs include image synthesis [7, 35], image super-resolution [13, 24, 51], image-to-image translation [23, 38, 47], image editing [22, 52], inpainting [29], video synthesis [3, 9], text-to-3D generation [33], and even planning tasks [20]. However, generating samples with DMs involves solving a generative Stochastic or Ordinary Differential Equation (SDE/ODE) [15, 34] in reverse time, which requires multiple sequential forward passes through a large neural network. This sequential processing limits their real-time applicability.

Considerable work has been dedicated to accelerating the sampling process in DMs by reducing the number of sampling steps. Approaches include training-based methods such as progressive distillation [39], consistency models [43], and rectified flow [1, 26, 27], and training-free methods such as DPM-Solver [28], AYS [37], DeepCache [49], and Delta-DiT [5]. These methods, however, uniformly process all regions of an image during sampling, irrespective of the specific needs of different regions. Each sampling step in these methods treats every area of the image equally, predicting the noise for each region at the current time step before proceeding to the next. Intuitively, however, the complexity of different regions within an image varies: intricate foreground elements may require more sampling steps for clarity, while repetitive backgrounds could benefit from more aggressive compression of sampling steps without significant loss of quality. This suggests a potential for a more flexible sampling approach that can dynamically adjust the sampling ratio across different regions, enabling a faster, yet high-quality diffusion process.

Figure 2. (a)(b) Accelerating Lumina-Next-T2I and Stable Diffusion 3, with 30 and 28 steps respectively. (c)(d) Multiple configurations of RAS outperform rectified flow in both image quality and text-following. RAS-X stands for RAS with X sampling steps in total. (e) RAS achieves comparable human-evaluation results to the default model configuration while achieving around a 1.6x speedup.
This concept is a natural progression in the evolution of DMs. From DDPM [18] to Stable Diffusion XL [32], diffusion models have predominantly relied on U-Nets, whose convolutional structures [36] necessitate uniform treatment of all image regions due to fixed square inputs. However, with the advent of DiTs [31] and the increasing exploration of fully transformer-based architectures [45], the research focus has shifted towards architectures that can accommodate flexible token inputs. DiTs allow any number of tokens to be processed flexibly, opening up new possibilities. This shift has inspired us to design a new sampling approach capable of assigning different sampling steps to different regions within an image.

To further explore the feasibility of this idea, we visualized several diffusion process outputs at varying sampling steps (Figure 3). We observed two key phenomena: (1) the regions of focus in adjacent steps exhibit considerable continuity during the later stages of diffusion, and (2) in each sampling step, the model tends to focus on specific semantically meaningful areas within the image. This pattern is akin to the process of an artist completing a painting in 1000 steps on a blank canvas, where each step involves a different brush stroke that selectively refines certain areas. This observation suggests that in each adjacent step, the "brushes" used by the diffusion model are similar, and hence, the areas they refine remain consistent. Thus, areas that the model momentarily ignores could potentially be excluded from the computationally intensive DiT processing, allowing the model to focus more on regions of immediate interest.

To validate this hypothesis, we conducted an experiment where we identified the ranking of tokens at each diffusion step using the output-noise metric introduced in Section 3, representing regions of primary focus for the model. By calculating the similarity of token rankings using NDCG (Figure 4), we found a high degree of continuity in the areas of focus between adjacent steps. This continuity motivated us to design a sampling approach that assigns different sampling ratios to different regions based on their attention continuity.
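This continuity measurement can be reproduced in a few lines of code. The sketch below is illustrative rather than the paper's evaluation script: `scores_per_step` is a hypothetical array holding the per-token importance scores (the output-noise metric of Section 3) extracted at every sampling step.

import numpy as np
from sklearn.metrics import ndcg_score

def adjacent_step_ndcg(scores_per_step: np.ndarray) -> np.ndarray:
    """NDCG between the token rankings of each pair of adjacent steps.

    scores_per_step: shape (num_steps, num_tokens), non-negative
    per-token importance scores at every sampling step.
    """
    ndcgs = []
    for t in range(len(scores_per_step) - 1):
        relevance = scores_per_step[t][None, :]       # step t as ground truth
        prediction = scores_per_step[t + 1][None, :]  # step t+1 as the ranking
        ndcgs.append(ndcg_score(relevance, prediction))
    return np.array(ndcgs)  # values near 1 indicate consistent focus

Values close to 1 along the trajectory correspond to the high adjacent-step similarity reported in Figure 4.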
As shown in Figure 5, our method leverages the output noise from the previous step to identify the model's primary focus for the current step (fast-update regions), allowing only these regions to proceed through DiT for denoising. Conversely, for regions of less interest (slow-update regions), we reuse the previous step's noise output directly. This approach enables regional variability in sampling steps: areas of interest are updated at a higher ratio, while others retain the previous noise output, thus reducing computation. After each local update, the updated regions' predicted noise values change, forming the noise map for sampling in the current step, which also serves as the selection criterion for the next fast-update regions. This operation restricts the tokens processed by DiT to those in the model's current area of focus, enhancing image refinement efficiency.

Figure 3. Visualization of the predicted noise at each step. The DiT model focuses on certain regions during each step, and the change in focus is continuous across steps.

Figure 4. NDCG [21, 48] (ranging from 0 to 1) for each pair of adjacent sampling steps is high throughout the diffusion process, marking the similarity in the ranking of focused tokens.

Figure 5. Overview of the RAS design. Only the current fast-update regions of each step are passed to the model.

For each input X_t, we select a fast-update rate to determine the regions needing updates in each step, while regions in the slow-update regime retain the previous noise output, which, combined with the updated fast-region noise, forms X_{t-1} for the next step. To maintain global consistency, we keep features from slow-update regions as reference keys and values for subsequent steps. Although the fast-region selection is dynamic and recalculated after each update to prioritize significant areas, we periodically reset the inference for all regions to mitigate cumulative errors.

In summary, we propose RAS, the first diffusion sampling strategy that allows for regional variability in sampling ratios. Compared to spatially uniform samplers, this flexibility enables our approach to allocate DiT's processing power to the model's current areas of interest, significantly improving generation quality within the same inference budget. As shown in Figure 2 (c)(d), our method achieves substantial reductions in inference cost with minimal FID increase, while outperforming the uniform sampling baseline in terms of FID within equivalent inference times. Figure 2 (a)(b) also demonstrates that with models like Lumina-Next-T2I [2] and Stable Diffusion 3 [11], our method's fast-region noise updating yields over twice the acceleration with negligible image quality loss. A user study comparing our method to uniform sampling across various generated cases further shows that our method maintains comparable generation quality at a 1.6x acceleration rate.

2. Related Work

2.1. Diffusion Models: From U-Net to Transformer

Diffusion models [8, 18, 41, 42] have demonstrated significant capabilities across a range of generative tasks, often outperforming previous methods such as generative adversarial networks (GANs) [14] in many downstream applications. Historically, denoising diffusion probabilistic models (DDPMs) [18] and more recent models like Stable Diffusion XL [32] have predominantly utilized convolutional U-Nets [36] as the backbone. Due to the structure of convolutional networks, it is necessary to maintain the original spatial resolution of the input samples to support operations like pooling, which inherently limits the ability to exploit spatial redundancy during the diffusion process, particularly when attempting to prune the latent samples used as model inputs.
Fortunately, this dilemma has been broken by the emergence of the Diffusion Transformer (DiT) [31], which has been used as the backbone of SOTA diffusion models like Stable Diffusion 3 [10], Lumina-T2X [2], and PixArt-Sigma [4]. The biggest characteristic of DiT is that it completely eliminates the need for a convolutional U-Net. DiT uses a pure Transformer [45] architecture and adds conditional information such as prompts with adaptive layer norm. In this way, the positional information is no longer provided by convolutional operations, and the latent tokens are positionally independent after position embedding. This enables us to utilize the redundancy we discovered in Section 1 and compute only the tokens that are likely to be in focus in the current sampling step, while caching the predicted noise of the other tokens from the previous step.

To address the problem of high inference cost in diffusion models, various acceleration techniques have been proposed from different perspectives. A commonly used approach is to reduce the number of sampling steps. Some of these techniques require additional training, such as progressive distillation [39], consistency models [43], and rectified flow [1, 26, 27]. Among these methods, rectified flow has been widely used in models like Stable Diffusion 3 [10]. It learns the ODE to follow straight paths between the standard normal distribution and the distribution of the training dataset. These straight paths significantly reduce the distance between the two distributions, which in turn lowers the number of sampling steps needed.

There are also training-free methods to either reduce the number of sampling steps or decrease the computational burden within each step. DPM-Solver [28], for instance, introduces a formulation that enhances the solution process of diffusion ODEs. DeepCache [49], specifically designed for U-Net-based models, leverages the U-Net architecture to cache and retrieve features across adjacent stages, allowing certain downsampling and upsampling operations to be skipped during the diffusion process. However, these approaches treat all image regions uniformly, ignoring the varying complexity across different parts of the image. This uniform treatment can lead to significant computational inefficiency, as not all regions require the same level of processing.

As introduced in Section 1, the complexity of different regions within an image can vary substantially. To exploit the characteristics of both the diffusion process and the structure of Diffusion Transformers (DiTs), we introduce RAS, a novel approach designed to optimize computation by focusing on the distinctive properties of different image regions. RAS is also orthogonal to the methods mentioned above, such as DiTFastAttn [50] and ∆-DiT [5].

3. Methodology

t: the current timestep
N: the noise output of the DiT model
Ñ: the cached noise output from the previous timestep
N̂: the estimated full-length noise calculated from N and Ñ
S: the unpatchified image sample
x: the patchified input of the DiT model
M: the mask generated to drop certain tokens in the input
D: the number of times the tokens in a patch have been dropped

Table 1. Meanings of the symbols used in this paper.

In this section, we present the RAS design and the techniques that exploit the inter-timestep token correlations and the regional token attention mechanism introduced in Section 1. (1) Based on the regional characteristics we observed in the DiT inference process, we propose an end-to-end pipeline that dynamically eliminates the computation through DiT of certain tokens at each timestep. (2) To leverage the continuity across consecutive timesteps, we propose a straightforward method to identify the fast-update regions that require refinement in upcoming timesteps, while ensuring that slow-update regions are not neglected due to insufficient diffusion steps. This approach effectively balances token focus without leading to starvation. (3) Building on our observations of continuous distribution patterns, we introduce several scheduling optimization techniques to further enhance the quality of the generated content.
def step(self, dit_output_fast, sample, t):
    # merge the fresh fast-region output with the cached slow-region noise
    dit_output = self.merge(dit_output_fast, self.cached_output_slow)

    # original sampling step
    diff = (self.sigma[t - 1] - self.sigma[t]) * dit_output
    prev_sample = sample - diff

    # identify and cache the slow-update region
    slow_indices, fast_indices, self.cached_output_slow = \
        self.region_identify(dit_output)

    # return only the fast-update region to the model
    return prev_sample[fast_indices]

Figure 6. Sample step with RAS in Python. Only two extra functions are needed to switch from the original scheduler to RAS.
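For reference, a minimal sketch of what the two helper functions could look like is given below. This is our illustrative reading rather than the released implementation: the constant `sample_ratio`, the ascending-std ranking, and the omission of the dense warm-up, the error-reset steps, and the starvation term of Eq. (1) are all simplifying assumptions (see Sections 3.2 and 3.3 for the full design).

import torch

class RASSchedulerSketch:
    # Assumes the very first step runs densely, so that fast_indices and
    # the cached noise exist before merge() is first called.
    sample_ratio = 0.25  # hypothetical constant fast-update fraction

    def merge(self, dit_output_fast, cached_output_slow):
        # Rebuild the full-length noise: fresh output for the fast-update
        # tokens, cached previous-step noise for the slow-update tokens.
        merged = cached_output_slow.clone()
        merged[self.fast_indices] = dit_output_fast
        return merged

    def region_identify(self, dit_output):
        # Tokens with lower noise std mark the main subject (Section 3.2),
        # so they form the fast-update set for the next step.
        score = dit_output.float().std(dim=-1)            # (num_tokens,)
        num_fast = max(1, int(self.sample_ratio * score.numel()))
        order = torch.argsort(score)                      # ascending std
        self.fast_indices = order[:num_fast]
        slow_indices = order[num_fast:]
        # Cache the full noise map; its slow part is reused next step.
        return slow_indices, self.fast_indices, dit_output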
3.2. Region-Adaptive Sampling

Region-Aware DiT Inference with RAS. Building on the insight that only certain regions are important at each timestep, we introduce the RAS pipeline for DiT inference. In U-Net-based models such as SDXL [32], tokens must remain in fixed positions to preserve positional information. However, given the structure of DiT, we can now mask and reorder elements within latent samples, as positional information is already embedded using techniques like RoPE [44]. This flexibility allows us to selectively determine which regions are processed by the model. To achieve this, some additional operations are required starting from the final step, as described in Figure 6. At the end of each timestep, the current sample is updated by combining the fresh model output for the active tokens and the cached noise for the inactive tokens. Specifically, the noise for the entire sequence is restored by integrating both the model output and the cached noise from the previous step. This mechanism enables active, important tokens to move in the new direction determined at the current timestep, while the inactive tokens retain the trajectory from the previous timestep. The next step involves updating the unpatchified sample with the scaled noise. We then compute the metric R, which is used to identify the fast-update regions based on the noise, update the drop count D to track the frequency with which each token has been excluded, and generate the mask M accordingly.

With the mask M, the noise for the slow-update regions is cached, while the sample for the current fast-update regions is patchified and passed through the DiT model. Since modules like LayerNorm and the MLP do not involve cross-token operations, their computation remains unaffected even when the sequence is incomplete. For the attention [45] module, the computation can still proceed with the query, key, and value tensors pruned. Additionally, we introduce a caching mechanism to further enhance performance, which will be detailed later. In summary, RAS dynamically detects regions of focus and reduces the overall computational load of DiT by at least the same proportion as the user-defined sampling ratio.

Region Identification. The DiT model processes the current timestep embedding, latent sample, and prompt embedding to predict the noise that guides the current sample closer to the original image at each timestep. To quantify the refinement of tokens at each timestep, we use the model's output as a metric. Through observation, we found that the standard deviation of the noise strongly marks the regions in the images, with the main subject (fast-update regions) showing an obviously lower standard deviation than the background (slow-update regions). This could be caused by the difference in the amount of information between the regions after mixing with the Gaussian noise. Utilizing the deviation as a metric achieves reasonable results in image quality and notable differences between regions, as shown in Figure 8. Also, considering the similarities between latent samples across adjacent timesteps, we hypothesize that tokens deemed important in the current timestep are likely to remain important in the next, while the less-focused tokens can be dropped with minimal impact. Before we reach the final formulation of the metric, we need to introduce another technique to prevent starvation.

Figure 8. [Qualitative results and regional sample ratios for Lumina-Next-T2I and Stable Diffusion 3 with RAS (50% sampling). Prompts: "A photorealistic image of a tree in the desert." and "A red heart in the clouds over water, in the style of zbrush, light pink and sky-blue, I can't believe how beautiful this is, hyperbolic expression, nyc explosion coverage, unreal engine 5, robert bissell."]

Starvation Prevention. During the diffusion process, the main subject regions typically require more refinement compared to the background. However, consistently dropping computations for background tokens can lead to excessive blurring or noise in the final generated image. To address this, we track how often a token is dropped and incorporate this count as a scaling factor in our metric for selecting tokens to cache or drop, ensuring less important tokens are still adequately processed.

Additionally, since DiT patchifies the latent tokens before feeding them into the model, we compute our metric at the patch level by averaging the scores of the tokens within each patch. Combining all the factors mentioned above, our metric can be written as:

R_t = mean_patch(std(N̂_t)) · exp(k · D_patch)   (1)

where N̂_t is the current estimated noise, D_patch is the count of how many times the tokens in a patch have been dropped, and k is a scale factor that controls the difference in sample ratios between fast-update regions and slow-update regions.
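In code, Eq. (1) amounts to a per-patch reduction followed by a top-k selection. The sketch below is a minimal PyTorch rendering under our own conventions: `noise` is the estimated full-length noise N̂_t laid out as (num_patches, tokens_per_patch, channels), `drop_count` is a float tensor holding D per patch, and both the sign of k and the counter bookkeeping (reset on selection) are assumptions the paper leaves to the implementation.

import torch

def region_metric(noise, drop_count, k=-0.1):
    # R_t = mean_patch(std(N̂_t)) * exp(k * D_patch), Eq. (1)
    token_std = noise.float().std(dim=-1)    # per-token std over channels
    patch_score = token_std.mean(dim=-1)     # mean over tokens in a patch
    return patch_score * torch.exp(k * drop_count)

def select_fast_patches(noise, drop_count, ratio, k=-0.1):
    r = region_metric(noise, drop_count, k)
    num_fast = max(1, int(ratio * r.numel()))
    # Low std marks the main subject, so the lowest-metric patches are
    # refreshed; with k < 0, frequently dropped patches decay into this
    # set over time, preventing starvation.
    fast = torch.topk(-r, num_fast).indices
    drop_count += 1          # every patch ages by one step ...
    drop_count[fast] = 0     # ... except the ones refreshed this step
    return fast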
Key and Value Caching. The attention mechanism [45] works by using the query of each token to compute its attention score with every other token, querying the keys and values of the whole sequence and thus giving the relation between each pair of tokens. The attention of the active tokens in RAS can be given as:

O_a = softmax(Q_a K_a^T / √d) V_a   (2)

where a stands for the currently active tokens. However, the metric R we introduce to identify the current fast and slow regions does not take each token's contribution to the attention scores into consideration; dropping these tokens from attention can therefore cause a large change in the final output. Our solution here is also caching. During each step, the full keys and values are cached and then partially updated with the current active tokens. As described in Figure 7, this solution is also based on the similarity between adjacent sampling steps, and we can now estimate the original attention output by:

O_a = softmax(Q_a [K_a, K̃_i]^T / √d) [V_a, Ṽ_i]   (3)

where i stands for the inactive tokens, and K̃_i and Ṽ_i are their cached keys and values.

Figure 7. A RAS self-attention module using Attention Recovery to enhance generation quality. X_a^{t,l}, Q_a^{t,l}, K_a^{t,l}, V_a^{t,l} and O_a^{t,l} represent the input hidden states, query, key, value and attention output of the active tokens on layer l during step t, respectively. K^{t,l} and V^{t,l} denote the key and value caches. The scatter operation that partially updates the key and value caches is fused into the previous projection using a PIT GeMM kernel. The keys and values of the not-focused area (K_i^{t,l} and V_i^{t,l}) are estimated with the cache from the last sampling step (K^{t-1,l} and V^{t-1,l}).
3.3. Scheduling Optimization

Dynamic Sampling Ratio. As illustrated in Figure 4, the correlation between the initial timesteps is lower compared to the latter stages, where the diffusion process exhibits greater stability. This trend is also evident in Figure 3. Consequently, the strategy introduced previously is not suitable for the early stages of the diffusion process, as it could negatively impact the foundational structure of the generated image. Furthermore, we have observed that the similarity gradually increases during the stable phase of diffusion. To address these observations, we propose a dynamic sampling ratio that maintains a 100% ratio for the initial timesteps (e.g., the first 4 out of 28 steps) to mitigate any adverse effects on the outline of the generated image. Thereafter, the sampling ratio is progressively reduced during the stable phase. This approach ensures a balance between computational efficiency and image quality, enabling effective sampling ratios while minimizing adverse impacts on the generated output.

Accumulated Error Resetting. RAS focuses on the model's regions of interest, which tend to be similar across adjacent sampling steps. However, regions that are not prioritized for multiple steps may accumulate stale denoising directions, resulting in significant error between the original latent sample and the one generated with RAS. To mitigate this issue, we introduce dense steps into the RAS diffusion process to periodically reset accumulated errors. For instance, in a 30-step diffusion process where RAS is applied starting from step 4, we designate steps 12 and 20 as dense steps. During these dense steps, the entire image is processed by the model, allowing it to correct any drift that may have developed in unfocused areas. This approach ensures that the accumulated errors are reset, keeping the denoising process aligned with the correct direction.
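The two techniques combine into a per-step ratio schedule. The sketch below is one possible instantiation under our own parameter choices: the linear decay is an assumption, while the warm-up length and the dense-step positions follow the 30-step example above.

def sampling_ratio_schedule(num_steps=30, warmup=4, dense_steps=(12, 20),
                            start_ratio=1.0, final_ratio=0.25):
    ratios = []
    for t in range(num_steps):
        if t < warmup or t in dense_steps:
            ratios.append(1.0)  # dense step: every token goes through DiT
        else:
            # progressively reduce the ratio over the stable phase
            progress = (t - warmup) / max(1, num_steps - warmup - 1)
            ratios.append(start_ratio - (start_ratio - final_ratio) * progress)
    return ratios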
Method Sample Steps Sampling Ratio Throughput (iter/s)↑ FID ↓ sFID ↓ CLIP score ↑
Stable Diffusion 3
RFlow 5 100% 1.43 39.70 22.34 29.84
RAS 7 25.0% 1.45 31.99 21.70 30.64
RAS 7 12.5% 1.48 32.86 22.10 30.55
RAS 6 25.0% 1.52 33.24 21.51 30.38
RAS 6 12.5% 1.57 33.81 21.62 30.33
RFlow 4 100% 1.79 61.92 27.42 28.45
RAS 5 25.0% 1.94 51.92 25.67 29.06
RAS 5 12.5% 1.99 53.24 26.04 28.94
Lumina-Next-T2I
RFlow 7 100% 0.49 48.19 38.60 28.65
RAS 10 25.0% 0.59 45.67 32.36 29.82
RAS 10 12.5% 0.65 47.34 32.69 29.75
RFlow 5 100.0% 0.69 96.53 59.26 26.03
RAS 7 25.0% 0.70 53.93 39.80 28.85
RAS 7 12.5% 0.74 54.62 40.23 28.83
RAS 6 25.0% 0.75 67.16 46.46 27.85
RAS 6 12.5% 0.78 67.88 45.88 27.83
Table 2. Pareto improvements of RAS over rectified flow on COCO Val2014 1024×1024.
3.4. Implementation

Kernel Fusing. As previously mentioned, we introduce key and value caching in the self-attention mechanism. In each attention block of the selective sampling steps, these caches are partially updated by the active tokens and then used as the key and value inputs of the attention functions. This partial update is equivalent to a scatter operation with the active token indices.

In our scenario, the source data of the scatter operation comprises the active keys and values output by the previous general matrix multiplication (GeMM) kernel in the linear projection module. The extra GPU memory reads and stores on the active keys and values can be avoided by fusing the scatter operation into the GeMM kernel, rather than launching a separate scatter kernel. Fortunately, PIT [53] demonstrates that all permutation invariant transformations, including one-dimensional scattering, can be performed in the I/O stage of GPU-efficient computation kernels (e.g., GeMM kernels) with minimal overhead. Using this method, we fuse the scatter operation into the epilogue of the previous GeMM kernel.
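For illustration, the unfused form of this partial update is a plain one-dimensional scatter; the shapes below are ours, not the models'. PIT [53] folds exactly this `index_copy_` into the epilogue of the preceding projection GeMM, avoiding the extra read and write of the active keys and values.

import torch

num_tokens, d = 4096, 1536
k_cache = torch.zeros(num_tokens, d, device="cuda")
active_idx = torch.randint(0, num_tokens, (1024,), device="cuda").unique()
k_active = torch.randn(len(active_idx), d, device="cuda")  # projection output

# the scatter that the fused kernel performs during the GeMM epilogue
k_cache.index_copy_(0, active_idx, k_active)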
4. Experiments

4.1. Experiment Setup

Models, Datasets, Metrics and Baselines. We evaluate RAS on Stable Diffusion 3 [10] and Lumina-Next-T2I [2] for text-to-image generation tasks, using 10,000 randomly selected caption-image pairs from the MS-COCO 2017 dataset [25]. To assess the quality of the generated images and their compatibility with prompts, we use the Fréchet Inception Distance (FID) [17], the Sliding Fréchet Inception Distance (sFID) [17], and the CLIP score [16] as evaluation metrics. For baseline comparison, we evaluate RAS against widely used rectified-flow-based flow-matching methods [1, 6, 10, 12, 26, 27], which uniformly reduce the number of timesteps in the generation process for the whole image. We implement RAS with varying numbers of total timesteps to assess its performance, and compare these configurations to the original implementation under similar throughput conditions.

Code Implementation. We implement RAS using PyTorch [30], leveraging the diffusers library [46] and its FlowMatchEulerDiscreteScheduler. The evaluation metrics are computed using public repositories available on GitHub [19, 40, 54]. Experiments are conducted on four servers, each equipped with eight NVIDIA A100 40GB GPUs, while speed tests are performed on an NVIDIA A100 80GB GPU.

4.2. Generation Benchmarks

We conducted a comparative evaluation of RAS and rectified flow, which uniformly reduces the number of timesteps for every token during inference. To assess the performance of RAS, we performed experiments using various configurations of inference timesteps. The findings can be interpreted in two principal ways.

Pushing the Efficiency Frontier. First, RAS offers a chance to further reduce the inference cost at each number of timesteps rectified flow offers. As illustrated in Figure 2 (c)(d), we generated 10,000 images using dense inference across different timesteps, ranging from 3 to 30. Subsequently, we applied RAS at varying average sampling ratios (75%, 50%, 25%, and 12.5%) over the selective sampling timesteps, with the total number of timesteps set at 5, 6, 7, 10, 15, and 30. The results indicate that RAS can significantly reduce inference time while exerting only a minor effect on key evaluation metrics. For instance, employing RAS with 25% sampling over 30 timesteps improved
throughput by a factor of 2.25, with only a 22.12% increase in FID, a 26.22% increase in sFID, and a 0.065% decrease in CLIP score. Furthermore, the efficiency improvements achieved with RAS are attained at a lower cost compared to merely reducing the number of timesteps. Specifically, the rate of quality degradation observed when decreasing the sampling ratio of RAS is considerably lower than that observed when reducing the number of timesteps in dense inference, particularly when the number of timesteps is fewer than 10. This demonstrates that RAS constitutes a promising approach to enhancing efficiency while maintaining output quality and ensuring compatibility with prompts.

Pareto Improvements of Uniform Sampling. Through observation, we found that RAS can offer a Pareto improvement over rectified flow in many cases. We sorted some of the experimental results of Stable Diffusion 3 and Lumina-Next-T2I by throughput, and listed different configurations of RAS alongside the closest baseline in terms of throughput in Table 2 to provide a comprehensive comparison. The results clearly demonstrate that, for each instance of dense inference with rectified-flow-based flow matching in the table, there is almost consistently an option within RAS that offers higher throughput while delivering superior performance in terms of FID, sFID, and CLIP score. This highlights that, for achieving a given throughput level during DiT inference, RAS not only provides multiple configurations with both enhanced throughput and improved image quality, but also offers a broader parameter space for optimizing trade-offs between throughput, image quality, and compatibility with prompts.

4.3. Human Evaluation

To evaluate whether RAS can enhance throughput while maintaining generation quality in real-world scenarios, we conducted a human evaluation. We randomly selected 14 prompts from the official research papers and blogs of Stable Diffusion 3 and Lumina, generating two images for each prompt: one using dense inference and the other using RAS, both with the same random seed and default number of timesteps. RAS was configured with a 50% average sampling ratio during the selective sampling period. We invited 100 participants, comprising students and faculty members from 18 different universities and companies, to compare the generated images. Each participant was asked to determine whether one image was clearly better, slightly better, or of similar quality compared to the other. The order of the images was randomized, and participants were unaware of which image was generated with RAS. As shown in Figure 2 (e), 633 out of 1400 votes (45.21%) indicated that the two images were of similar quality. Additionally, 28.29% of the votes favored the dense image over the RAS result, while 26.50% preferred RAS over the dense result. These results demonstrate that RAS achieves a significant improvement in throughput (1.625× for Stable Diffusion 3 and 1.561× for Lumina-Next-T2I) without noticeably affecting human preference.

4.4. Ablation Study

Token Drop Scheduling. As shown in Table 3 (a), we evaluate the scheduling configurations introduced in Section 3, including sampling ratio scheduling, the selection of cached tokens, and the insertion of dense steps during the selective sampling period to reset accumulated errors, using 10 timesteps with an average sampling ratio of 12.5% on Stable Diffusion 3. The results indicate that each of these techniques contributes to the overall quality of RAS.

Key and Value Caching. As shown in Table 3 (b), caching keys and values from the previous step is crucial, especially when generating high-quality images with more timesteps. While dropping the keys and values of non-activated tokens during attention can improve throughput, it significantly affects the attention scores of the activated tokens. A token's low ranking in the model output does not necessarily mean it has no contribution to the attention scores of other tokens.

(a) Drop Scheduling
Method FID ↓ sFID ↓ CLIP score ↑
Default 35.81 18.41 30.13
Static Sampling Freq. 37.92 19.11 29.98
Random Dropping 43.19 22.23 29.65
W/O Error Reset 46.10 24.85 30.41

(b) Key and Value Caching
Method Timesteps FID ↓ sFID ↓ CLIP score ↑
Default 28 24.30 26.26 31.34
W/O 28 31.36 20.19 31.29
Default 10 35.81 18.41 30.13
W/O 10 32.33 20.21 30.27

Table 3. Ablation study on Stable Diffusion 3. All techniques, including the dynamic sampling ratio, region identification, error resetting, and key & value recovery, are necessary for high-quality generation.

5. Conclusions and Limitations

In this paper, we observed that different regions within an image require varying levels of refinement during the diffusion process, and that adjacent sampling steps exhibit significant continuity in the distribution of focused areas. Based on these observations, we proposed RAS, a novel diffusion sampling strategy that dynamically adjusts sampling rates according to regional attention, thereby allocating computational resources more efficiently to areas of greater importance while reusing noise predictions for less critical regions. Our approach effectively reduces computational costs while preserving high image quality. Extensive experiments and user studies demonstrate that RAS achieves substantial speed-ups with minimal degradation in quality, outperforming uniform sampling baselines and paving the way for more efficient and adaptive diffusion models.
References

[1] Michael S. Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. arXiv preprint arXiv:2209.15571, 2022.
[2] Peng Gao, Le Zhuo, Dongyang Liu, Ruoyi Du, Xu Luo, Longtian Qiu, Yuhang Zhang, Chen Lin, Rongjie Huang, Shijie Geng, Renrui Zhang, Junlin Xi, Wenqi Shao, Zhengkai Jiang, Tianshuo Yang, Weicai Ye, He Tong, Jingwen He, Yu Qiao, and Hongsheng Li. Lumina-T2X: Transforming text into any modality, resolution, and duration via flow-based large diffusion transformers, 2024.
[3] Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22563–22575, 2023.
[4] Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. PixArt-σ: Weak-to-strong training of diffusion transformer for 4K text-to-image generation, 2024.
[5] Pengtao Chen, Mingzhu Shen, Peng Ye, Jianjian Cao, Chongjun Tu, Christos-Savvas Bouganis, Yiren Zhao, and Tao Chen. ∆-DiT: A training-free acceleration method tailored for diffusion transformers. arXiv preprint arXiv:2406.01125, 2024.
[6] Quan Dao, Hao Phung, Binh Nguyen, and Anh Tran. Flow matching in latent space, 2023.
[7] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.
[8] Prafulla Dhariwal and Alex Nichol. Diffusion models beat GANs on image synthesis. In Proceedings of the 35th International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 2024. Curran Associates Inc.
[9] Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7346–7356, 2023.
[10] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis, 2024.
[11] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis, 2024.
[12] Johannes S. Fischer, Ming Gui, Pingchuan Ma, Nick Stracke, Stefan A. Baumann, and Björn Ommer. Boosting latent diffusion with flow matching. arXiv preprint arXiv:2312.07360, 2023.
[13] Sicheng Gao, Xuhui Liu, Bohan Zeng, Sheng Xu, Yanjing Li, Xiaoyan Luo, Jianzhuang Liu, Xiantong Zhen, and Baochang Zhang. Implicit diffusion models for continuous super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10021–10030, 2023.
[14] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, 27, 2014.
[15] Philip Hartman. Ordinary Differential Equations. SIAM, 2002.
[16] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: A reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718, 2021.
[17] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.
[18] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 2020. Curran Associates Inc.
[19] Tao Hu. pytorch-fid-with-sfid. https://github.com/dongzhuoyao/pytorch-fid-with-sfid, 2022.
[20] Michael Janner, Yilun Du, Joshua B. Tenenbaum, and Sergey Levine. Planning with diffusion for flexible behavior synthesis. arXiv preprint arXiv:2205.09991, 2022.
[21] Kalervo Järvelin and Jaana Kekäläinen. IR evaluation methods for retrieving highly relevant documents. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 41–48, New York, NY, USA, 2000. Association for Computing Machinery.
[22] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6007–6017, 2023.
[23] Bo Li, Kaitao Xue, Bin Liu, and Yu-Kun Lai. BBDM: Image-to-image translation with Brownian bridge diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1952–1961, 2023.
[24] Haoying Li, Yifan Yang, Meng Chang, Shiqi Chen, Huajun Feng, Zhihai Xu, Qi Li, and Yueting Chen. SRDiff: Single image super-resolution with diffusion probabilistic models. Neurocomputing, 479:47–59, 2022.
[25] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
[26] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.
[27] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022.
[28] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems, 35:5775–5787, 2022.
[29] Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. RePaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11461–11471, 2022.
[30] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. Curran Associates Inc., Red Hook, NY, USA, 2019.
[31] William Peebles and Saining Xie. Scalable diffusion models with transformers. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 4172–4182, 2023.
[32] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis, 2023.
[33] Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. DreamFusion: Text-to-3D using 2D diffusion. arXiv preprint arXiv:2209.14988, 2022.
[34] Philip E. Protter. Stochastic Differential Equations. Springer, 2005.
[35] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
[36] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234–241. Springer, 2015.
[37] Amirmojtaba Sabour, Sanja Fidler, and Karsten Kreis. Align your steps: Optimizing sampling schedules in diffusion models. In Forty-first International Conference on Machine Learning, 2023.
[38] Chitwan Saharia, William Chan, Huiwen Chang, Chris Lee, Jonathan Ho, Tim Salimans, David Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 Conference Proceedings, pages 1–10, 2022.
[39] Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In International Conference on Learning Representations.
[40] Maximilian Seitzer. pytorch-fid: FID score for PyTorch. https://github.com/mseitzer/pytorch-fid, 2020. Version 0.3.0.
[41] Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, pages 2256–2265. JMLR.org, 2015.
[42] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Curran Associates Inc., Red Hook, NY, USA, 2019.
[43] Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. arXiv preprint arXiv:2303.01469, 2023.
[44] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
[45] A. Vaswani. Attention is all you need. Advances in Neural Information Processing Systems, 2017.
[46] Patrick von Platen, Suraj Patil, Anton Lozhkov, Pedro Cuenca, Nathan Lambert, Kashif Rasul, Mishig Davaadorj, Dhruv Nair, Sayak Paul, William Berman, Yiyi Xu, Steven Liu, and Thomas Wolf. Diffusers: State-of-the-art diffusion models. https://github.com/huggingface/diffusers, 2022.
[47] Tengfei Wang, Ting Zhang, Bo Zhang, Hao Ouyang, Dong Chen, Qifeng Chen, and Fang Wen. Pretraining is all you need for image-to-image translation. arXiv preprint arXiv:2205.12952, 2022.
[48] Yining Wang, Liwei Wang, Yuanzhi Li, Di He, Tie-Yan Liu, and Wei Chen. A theoretical analysis of NDCG type ranking measures, 2013.
[49] Mengwei Xu, Mengze Zhu, Yunxin Liu, Felix Xiaozhu Lin, and Xuanzhe Liu. DeepCache: Principled cache for mobile deep vision. In Proceedings of the 24th Annual International Conference on Mobile Computing and Networking, pages 129–144, 2018.
[50] Zhihang Yuan, Hanling Zhang, Pu Lu, Xuefei Ning, Linfeng Zhang, Tianchen Zhao, Shengen Yan, Guohao Dai, and Yu Wang. DiTFastAttn: Attention compression for diffusion transformer models, 2024.
[51] Zongsheng Yue, Jianyi Wang, and Chen Change Loy. ResShift: Efficient diffusion model for image super-resolution by residual shifting. Advances in Neural Information Processing Systems, 36, 2024.
[52] Zhixing Zhang, Ligong Han, Arnab Ghosh, Dimitris N. Metaxas, and Jian Ren. SINE: Single image editing with text-to-image diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6027–6037, 2023.
[53] Ningxin Zheng, Huiqiang Jiang, Quanlu Zhang, Zhenhua Han, Lingxiao Ma, Yuqing Yang, Fan Yang, Chengruidong Zhang, Lili Qiu, Mao Yang, et al. PIT: Optimization of dynamic sparse deep learning models via permutation invariant transformation. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 331–347, 2023.
[54] SUN Zhengwentai. clip-score: CLIP score for PyTorch. https://github.com/taited/clip-score, 2023. Version 0.1.1.
Region-Adaptive Sampling for Diffusion Transformers
Supplementary Material
6. More Visualization of RAS

This section presents RAS accelerating Lumina-Next-T2I and Stable Diffusion 3 with a 50% sampling ratio. As illustrated in Figure 10, the main object receives more sampling steps than the background, demonstrating the significance of our region-adaptive sampling strategy. This approach ensures that the primary subject in the generated image consistently undergoes more sampling, while relatively smooth regions receive fewer sampling steps. For instance, in the example shown in Figure 10 with the prompt "hare in snow," the weeds in the snow are sampled more frequently, while the smooth snow receives fewer sampling steps.

In Figure 11, we visualize the standard deviation of the noise across dimensions, as well as the decoded images derived from the noise. This stems from our observation that the noise's standard deviation is consistently smaller in the main subject areas. A preliminary hypothesis is that this occurs because the main subject contains more information. When mixed with a certain proportion of noise at each diffusion step, the foreground tends to retain more deterministic information compared to the background. This allows the model to predict more consistent denoising directions. We acknowledge that further study is needed to fully understand this phenomenon.

The primary contribution of this work is to highlight that employing different sampling steps for different regions can significantly enhance the efficiency of diffusion model sampling. The method for selecting these regions is not limited to the aforementioned approach based on the noise standard deviation across dimensions. For example, we also experimented with using the L2 norm of the noise output by the network as the selection criterion. By targeting regions with larger noise norms, which indicate areas the network deems in need of more refinement, we observed a preference for regions that are more complex in the frequency domain, as shown in Figure 9. This approach also achieves high-quality imaging results, as shown in Table 4. It can be seen that the methods using the L2 norm and the standard deviation (std) yield relatively similar results, and both significantly outperform random selection, particularly when the cache ratio is higher.
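The norm-based variant changes only the reduction used in the selection metric. A minimal sketch, assuming the same (num_patches, tokens_per_patch, channels) noise layout as in Section 3.2:

import torch

def norm_metric(noise):
    token_norm = noise.float().norm(dim=-1)  # L2 norm over channels
    return token_norm.mean(dim=-1)           # patch-level score

def select_fast_patches_by_norm(noise, ratio):
    # Larger norms mark regions the network still wants to refine,
    # so the highest-norm patches form the fast-update set.
    score = norm_metric(noise)
    num_fast = max(1, int(ratio * score.numel()))
    return torch.topk(score, num_fast).indices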
Figure 9. RAS using the L2 norm as the metric, accelerating Lumina-Next-T2I with a 50% sample ratio and 30 total steps. The noise, masks and samples are from the 20th step.
Figure 10. RAS vs. default sampling, with the active sampling step count for each latent token.
Method Sample Steps Sampling Ratio Throughput (iter/s)↑ FID ↓ sFID ↓ CLIP score ↑
RFlow 7 100.0% 1.01 27.23 17.76 30.87
RAS-Std 7 25.0% 1.45 31.99 21.7 30.64
RAS-Norm 7 25.0% 1.45 31.65 21.24 30.59
Random 7 25.0% 1.45 33.26 22.10 30.67
Table 4. Experiments using the L2 norm as the metric for RAS on Stable Diffusion 3. The sample ratio of the first 4 steps is 100% to guarantee generation quality.
Figure 11. The 20th sampling step (out of 30) of Lumina-Next-T2I using RAS.
7. Full Experiment Results of RAS
In this section, we present the full experiment results of RAS against rectified flow, with the same settings as described in the experiment section. Both Table 5 and Table 6 are ordered by throughput.
Method Sample Steps Sampling Ratio Throughput (iter/s)↑ FID ↓ sFID ↓ CLIP score ↑
RFlow 30 100.0% 0.11 22.46 16.59 30.47
RAS 30 75.0% 0.14 23.31 17.73 30.49
RFlow 23 100.0% 0.15 23.10 17.91 30.42
RAS 30 50.0% 0.18 24.10 18.83 30.51
RFlow 15 100.0% 0.23 24.88 21.02 30.25
RAS 30 25.0% 0.26 27.44 20.95 30.45
RAS 15 75.0% 0.27 26.82 23.33 30.26
RAS 30 12.5% 0.31 33.64 23.44 30.36
RAS 15 50.0% 0.33 28.48 25.17 30.29
RFlow 10 100.0% 0.34 31.35 27.84 29.74
RAS 10 75.0% 0.40 34.19 30.57 29.79
RAS 15 25.0% 0.43 33.28 27.41 30.24
RAS 15 12.5% 0.48 39.75 28.88 30.14
RAS 10 50.0% 0.48 36.18 32.36 29.86
RFlow 7 100.0% 0.49 48.19 38.60 28.65
RAS 7 75.0% 0.54 50.45 40.19 28.78
RAS 10 25.0% 0.59 42.96 33.51 29.91
RAS 7 50.0% 0.61 51.78 40.51 28.82
RAS 6 75.0% 0.62 66.12 46.58 27.80
RAS 10 12.5% 0.65 47.34 32.70 29.75
RAS 6 50.0% 0.67 66.54 46.71 27.83
RFlow 5 100.0% 0.69 96.53 59.26 26.03
RAS 7 25.0% 0.70 53.93 39.80 28.85
RAS 7 12.5% 0.74 54.62 40.23 28.83
RAS 6 25.0% 0.74 67.16 46.46 27.85
RAS 5 75.0% 0.75 99.01 56.26 26.02
RAS 6 12.5% 0.78 67.88 45.89 27.83
RAS 5 50.0% 0.83 99.81 56.57 26.01
RAS 5 25.0% 0.95 101.50 56.40 25.93
RAS 5 12.5% 1.00 102.90 55.25 25.84
RFlow 3 100.0% 1.15 256.90 94.80 19.67
Table 5. Full experiment results of RAS and rectified flow on Lumina-Next-T2I and COCO Val2014 1024×1024.
Method Sample Steps Sampling Ratio Throughput (iter/s)↑ FID ↓ sFID ↓ CLIP score ↑
RFlow 28 100% 0.26 25.8 15.32 31.4
RAS 28 75.0% 0.33 24.43 15.94 31.39
RAS 28 50.0% 0.42 24.86 16.88 31.36
RFlow 14 100% 0.51 24.49 14.78 31.34
RAS 28 25.0% 0.55 25.16 17.11 31.29
RFlow 12 100% 0.59 24.36 14.89 31.3
RAS 14 75.0% 0.62 23.61 15.92 31.35
RAS 28 12.5% 0.63 25.72 17.3 31.22
RFlow 10 100% 0.71 24.17 15.39 31.22
RAS 14 50.0% 0.74 24.6 17.24 31.32
RAS 14 25.0% 0.91 25.88 17.97 31.24
RAS 10 75.0% 0.91 24.39 16.29 31.12
RAS 14 12.5% 0.98 26.48 18.14 31.18
RAS 10 50.0% 1.0 27.1 17.5 30.93
RFlow 7 100% 1.01 27.23 17.76 30.87
RAS 7 75.0% 1.16 27.57 18.76 30.81
RAS 10 25.0% 1.2 30.97 18.36 30.67
RAS 10 12.5% 1.3 35.81 18.41 30.13
RAS 7 50.0% 1.3 30.04 20.34 30.73
RAS 6 75.0% 1.3 31.23 19.98 30.48
RAS 6 50.0% 1.41 32.21 20.86 30.43
RFlow 5 100% 1.43 39.7 22.34 29.84
RAS 7 25.0% 1.45 31.99 21.7 30.64
RAS 7 12.5% 1.48 32.86 22.1 30.55
RAS 6 25.0% 1.52 33.24 21.51 30.36
RAS 6 12.5% 1.57 33.81 21.62 30.33
RAS 5 75.0% 1.59 44.02 23.14 29.53
RAS 5 50.0% 1.75 48.65 24.51 29.29
RFlow 4 100% 1.79 61.92 27.42 28.45
RAS 5 25.0% 1.94 51.92 25.67 29.06
RAS 5 12.5% 1.99 53.24 26.04 28.94
RFlow 3 100% 2.38 121.61 36.92 25.32
Table 6. Full experiment results of RAS and rectified flow on Stable Diffusion 3 and COCO Val2014 1024×1024.