Qingyang Shi, Zhicheng Du, Jiasheng Lu, Yingshan Liang, Xinyu Zhang, Yiran Wang, Jing Peng, Kehong Yuan
Abstract. Diffusion models have become the dominant choice for audio generation, but their slow sampling speed calls for acceleration techniques. Existing acceleration methods primarily target U-Net-based models, while the Diffusion Transformer (DiT) is emerging as the trend in audio generation. Because DiT demands substantial computational resources, we propose AudioCache: a training-free caching strategy that, to the best of our knowledge, is the first to accelerate DiT-based audio generation models by reusing the outputs of the attention and feedforward layers of DiT during sampling. We define a statistic that characterizes the degree of internal structural variation and use it to drive a self-adaptive caching strategy. We achieve a 2.35x acceleration while both objective and subjective metrics remain practically unchanged. Furthermore, our method extends to different models and input modalities. With appropriate indicators and corresponding rules, it provides a plug-and-play, training-free solution for diffusion models built on attention architectures.
Caching strategy for the DiT-based audio generation model.
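To make the caching idea concrete, below is a minimal PyTorch sketch of step-wise reuse for a DiT block. It is not the authors' released implementation: the `CachedDiTBlock` wrapper, the relative-L1 change statistic, and the threshold `tau` are illustrative assumptions standing in for the paper's indicator and self-adaptive rule.

```python
# Minimal sketch of step-wise caching for one DiT block during sampling.
# Assumption: the block maps x -> same-shape tensor (attention + feedforward).
import torch
import torch.nn as nn

class CachedDiTBlock(nn.Module):
    """Wraps a DiT block; reuses its residual update across sampling steps."""

    def __init__(self, block: nn.Module, tau: float = 0.05):
        super().__init__()
        self.block = block        # wrapped block (self-attention + FFN)
        self.tau = tau            # recompute when relative change exceeds this
        self.cached_delta = None  # residual update from the last full pass
        self.prev_x = None        # block input at the last full pass

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.cached_delta is not None:
            # Illustrative statistic: relative L1 change of the block input
            # between sampling steps, a proxy for internal structure variation.
            change = (x - self.prev_x).abs().mean() / (self.prev_x.abs().mean() + 1e-8)
            if change.item() < self.tau:
                # Features barely moved: skip attention/FFN, reuse the cache.
                return x + self.cached_delta
        out = self.block(x)             # full attention + FFN computation
        self.cached_delta = out - x     # cache the residual update
        self.prev_x = x
        return out

# Usage sketch: wrap every transformer block before sampling (names assumed):
# dit.blocks = nn.ModuleList(CachedDiTBlock(b) for b in dit.blocks)
```

In this sketch, caching the residual update (`out - x`) rather than the raw output keeps the reused term meaningful even as the block input drifts slightly between steps; the actual indicator and caching rule in AudioCache may differ.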
| Text Prompts | Stable Audio Open | AudioCache (2.35x) |
|---|---|---|
| Racing vehicle accelerating from a distance, growing louder, then driving by at a high rate | (audio sample) | (audio sample) |
| An alarm that is going off and beeping | (audio sample) | (audio sample) |
| A person snoring | (audio sample) | (audio sample) |
| A baby is crying loudly | (audio sample) | (audio sample) |
| Typing on a computer keyboard | (audio sample) | (audio sample) |
| Flushing toilets | (audio sample) | (audio sample) |
| Text Prompts | Stable Audio Open | AudioCache (2.35x) |
|---|---|---|
| Ethereal piano notes, Ambient style, Slow tempo | (audio sample) | (audio sample) |
| Punchy drum beats, Hip Hop style, Steady groove | (audio sample) | (audio sample) |
| Electric guitar solo in rock style, Powerful drum beats, Energetic and passionate | (audio sample) | (audio sample) |
| Strumming acoustic guitar, Folk style, Easy-going rhythm | (audio sample) | (audio sample) |
| African hand drumming | (audio sample) | (audio sample) |