Qingyang Shi, Zhicheng Du, Jiasheng Lu, Yingshan Liang, Xinyu Zhang, Yiran Wang, Jing Peng, Kehong Yuan
Abstract. Diffusion models have become the dominant choice for audio generation, but their slow sampling speed calls for acceleration techniques. Existing acceleration methods primarily target U-Net-based models, while the Diffusion Transformer (DiT) is emerging as the trend in audio generation. Because DiT demands substantial computational resources, we propose AudioCache: a training-free caching strategy that, to the best of our knowledge, is the first to accelerate DiT-based audio generation models by reusing the outputs of the attention and feedforward layers of DiT during sampling. We define a statistic that characterizes the degree of internal structural variation and use it to drive a self-adaptive caching strategy. We achieve a 2.35x acceleration while both objective and subjective metrics remain practically unchanged. Furthermore, our method extends to different models and input modalities. With appropriate indicators and corresponding rules, it provides a plug-and-play, training-free solution for diffusion models built on attention architectures.
Caching strategy for the DiT-based audio generation model.
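To make the caching idea concrete, below is a minimal PyTorch sketch of step-wise reuse for a DiT block. It is not the authors' released implementation: the `CachedDiTBlock` wrapper, the relative-L1 change statistic, and the threshold `tau` are illustrative assumptions standing in for the paper's indicator and self-adaptive rule.

```python
# Minimal sketch of step-wise caching for one DiT block during sampling.
# Assumption: the block maps x -> same-shape tensor (attention + feedforward).
import torch
import torch.nn as nn

class CachedDiTBlock(nn.Module):
    """Wraps a DiT block; reuses its residual update across sampling steps."""

    def __init__(self, block: nn.Module, tau: float = 0.05):
        super().__init__()
        self.block = block        # wrapped block (self-attention + FFN)
        self.tau = tau            # recompute when relative change exceeds this
        self.cached_delta = None  # residual update from the last full pass
        self.prev_x = None        # block input at the last full pass

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.cached_delta is not None:
            # Illustrative statistic: relative L1 change of the block input
            # between sampling steps, a proxy for internal structure variation.
            change = (x - self.prev_x).abs().mean() / (self.prev_x.abs().mean() + 1e-8)
            if change.item() < self.tau:
                # Features barely moved: skip attention/FFN, reuse the cache.
                return x + self.cached_delta
        out = self.block(x)             # full attention + FFN computation
        self.cached_delta = out - x     # cache the residual update
        self.prev_x = x
        return out

# Usage sketch: wrap every transformer block before sampling (names assumed):
# dit.blocks = nn.ModuleList(CachedDiTBlock(b) for b in dit.blocks)
```

In this sketch, caching the residual update (`out - x`) rather than the raw output keeps the reused term meaningful even as the block input drifts slightly between steps; the actual indicator and caching rule in AudioCache may differ.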
| Text Prompts | Stable Audio Open | AudioCache (2.35x) |
|---|---|---|
| Racing vehicle accelerating from a distance, growing louder, then driving by at a high rate | (audio sample) | (audio sample) |
| An alarm that is going off and beeping | (audio sample) | (audio sample) |
| A person snoring | (audio sample) | (audio sample) |
| A baby is crying loudly | (audio sample) | (audio sample) |
| Typing on a computer keyboard | (audio sample) | (audio sample) |
| Flushing toilets | (audio sample) | (audio sample) |
| Text Prompts | Stable Audio Open | AudioCache (2.35x) |
|---|---|---|
| Ethereal piano notes, Ambient style, Slow tempo | (audio sample) | (audio sample) |
| Punchy drum beats, Hip Hop style, Steady groove | (audio sample) | (audio sample) |
| Electric guitar solo in rock style, Powerful drum beats, Energetic and passionate | (audio sample) | (audio sample) |
| Strumming acoustic guitar, Folk style, Easy-going rhythm | (audio sample) | (audio sample) |
| African hand drumming | (audio sample) | (audio sample) |