From Reusing to Forecasting: Accelerating Diffusion Models with TaylorSeers

Anonymous

Abstract

Diffusion Transformers (DiT) have revolutionized high-fidelity image and video synthesis, yet their computational demands remain prohibitive for real-time applications. To address this, feature caching accelerates diffusion models by caching features at previous timesteps and reusing them at subsequent timesteps. However, at timesteps separated by large intervals, feature similarity decreases substantially, so the errors introduced by caching grow sharply and significantly harm generation quality. To solve this problem, we propose TaylorSeer, which first shows that features of diffusion models at future timesteps can be predicted from their values at previous timesteps. Building on the observation that features change slowly and continuously across timesteps, TaylorSeer approximates the higher-order derivatives of features with finite differences and predicts features at future timesteps via a Taylor-series expansion. Extensive experiments demonstrate its significant effectiveness in both image and video synthesis, especially at high acceleration ratios. For instance, it achieves an almost lossless acceleration of 4.99× on FLUX and 5.00× on HunyuanVideo without additional training. On DiT, it achieves a 3.41 lower FID than the previous SOTA at 4.53× acceleration. Our code is provided in the supplementary materials and will be made publicly available on GitHub.
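The forecasting step described above can be sketched as follows. This is a minimal illustration rather than the released implementation: the function name `taylor_forecast` and its calling convention are our own. It treats features cached every `interval` timesteps as samples of a smooth trajectory, estimates derivatives with backward finite differences, and extrapolates `k` steps ahead with a truncated Taylor expansion.

```python
import numpy as np

def taylor_forecast(cached, interval, k):
    """Predict a feature `k` timesteps after the most recent cached one.

    cached:   list of feature arrays sampled every `interval` timesteps,
              oldest first; len(cached) - 1 is the expansion order.
    interval: spacing (in timesteps) between cached features.
    k:        how far past the last cached feature to extrapolate.
    """
    order = len(cached) - 1
    level = [np.asarray(c, dtype=float) for c in cached]
    # diffs[i] is the i-th backward finite difference at the latest
    # timestep, which approximates f^(i)(t) * interval**i.
    diffs = [level[-1]]
    for _ in range(order):
        level = [b - a for a, b in zip(level[:-1], level[1:])]
        diffs.append(level[-1])
    # Taylor expansion: f(t + k) ~ sum_i diffs[i] * (k/interval)^i / i!
    pred = np.zeros_like(diffs[0])
    fact = 1.0
    for i, d in enumerate(diffs):
        if i > 0:
            fact *= i
        pred += d * (k / interval) ** i / fact
    return pred
```

For a feature trajectory that is locally linear, a first-order forecast is already exact; higher orders capture curvature at the cost of noisier difference estimates.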

HunyuanVideo Gallery (480p)

FramePack Video Comparison

Side-by-side videos: WAN2.1, TeaCache, TaylorSeer.

HiDream Model Performance Comparison

Comparison of TaylorSeer with other acceleration methods

| Method | TFLOPs | Speedup | ImageReward | PSNR | SSIM | LPIPS |
|---|---|---|---|---|---|---|
| HiDream-Full | 7780.00 | 1.0$\times$ | 1.1285 | - | - | - |
| TeaCache ($l1=1$) | 2047.37 | 3.8$\times$ | 0.9849 | 28.139 | 0.6036 | 0.565 |
| TaylorSeer ($N=4, O=2$) | 1945.00 | 4.0$\times$ | 1.0833 | 28.248 | 0.6084 | 0.532 |

Radar Chart Comparison (Scaled)

Visual Quality Comparison

Comparison of generation results across different methods

Super-Resolution Experimental Results

TaylorSeer application to super-resolution tasks

Super-Resolution Comparison

Experimental Configuration

  • Base Model: Inf-DiT with standard configuration
  • Dataset: DIV8K, 100 test images
  • Resolution: 4× upscaling (512→2048)
  • Sampler: ConcatDDIMSampler
  • Sampling Steps: 40 steps

Analysis

Our super-resolution experiments show that TaylorSeer transfers well beyond generation: it achieves a 3× acceleration in FLOPs while remaining virtually lossless in PSNR relative to the full Inf-DiT model. We also observe that lower-order expansions perform better when the caching interval is large, whereas higher-order expansions achieve better results when the interval is small.
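The interval-versus-order tradeoff can be reproduced on a toy trajectory. This is an illustrative experiment under assumed conditions (a cosine "feature" trajectory and helper names `predict` and `mean_error` of our own invention), not the paper's benchmark: when the sampling interval is small the second-order extrapolation tracks curvature and wins, but at large intervals its difference estimates overshoot and the first-order forecast is more robust.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict(samples, interval, k):
    """Taylor-style extrapolation `k` steps past the last sample, using
    backward differences of `samples` (spaced `interval` apart) as
    derivative estimates. len(samples) - 1 is the expansion order."""
    diffs, level = [samples[-1]], list(samples)
    for _ in range(len(samples) - 1):
        level = [b - a for a, b in zip(level[:-1], level[1:])]
        diffs.append(level[-1])
    pred, fact = 0.0, 1.0
    for i, d in enumerate(diffs):
        fact *= max(i, 1)
        pred += d * (k / interval) ** i / fact
    return pred

def mean_error(order, interval, trials=2000, sigma=0.0):
    """Average |prediction - truth| when extrapolating cos(t) one
    interval ahead from `order` + 1 samples, optionally noisy."""
    errs = []
    for _ in range(trials):
        t0 = rng.uniform(0.0, 2.0 * np.pi)
        ts = [t0 - i * interval for i in range(order, -1, -1)]
        obs = [np.cos(t) + rng.normal(0.0, sigma) for t in ts]
        errs.append(abs(predict(obs, interval, interval) - np.cos(t0 + interval)))
    return float(np.mean(errs))

# On this toy trajectory, second order wins at small intervals while
# first order is more robust at large ones.
```

The `sigma` knob lets one additionally check that observation noise penalizes higher orders, since higher-order differences amplify noise.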

Visual Comparison

TaylorSeer vs Baseline Methods

4$\times$ super-resolution results from 128×128 to 512×512 pixels