Abstract
Videos are created to express emotion, exchange information, and share experiences. Video synthesis has intrigued researchers for a long time. Despite the rapid progress driven by advances in visual synthesis, most existing studies focus on improving the frames’ quality and the transitions between them, while little progress has been made in generating longer videos. In this paper, we present a method that builds on 3D-VQGAN and transformers to generate videos with thousands of frames. Our evaluation shows that our model trained on 16-frame video clips from standard benchmarks such as UCF-101, Sky Time-lapse, and Taichi-HD datasets can generate diverse, coherent, and high-quality long videos. We also showcase conditional extensions of our approach for generating meaningful long videos by incorporating temporal information with text and audio. Videos and code can be found at https://songweige.github.io/projects/tats.
S. Ge—Work done primarily during an internship at Meta AI.
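For readers who want a concrete picture of the recipe the abstract outlines (a 3D-VQGAN that compresses short clips into discrete token cubes, plus an autoregressive transformer that samples those tokens with a sliding window to reach lengths far beyond the 16-frame training clips), the sketch below illustrates the sampling loop with toy PyTorch modules. Every name, shape, and hyperparameter here is a hypothetical placeholder, not the authors' released TATS implementation; see the project page above for the actual code.

# Minimal, illustrative sketch: a causal transformer over flattened video tokens,
# sampled with a sliding context window so the sequence can grow well beyond the
# window used at training time. All values below are toy placeholders.
import torch
import torch.nn as nn

VOCAB, T_LAT, H_LAT, W_LAT = 1024, 4, 8, 8        # toy codebook size and latent grid
TOKENS_PER_LATENT_FRAME = H_LAT * W_LAT           # tokens emitted per latent time step
WINDOW = T_LAT * TOKENS_PER_LATENT_FRAME          # transformer context (~one training clip)

class ToyTokenTransformer(nn.Module):
    """Causal transformer over flattened video tokens (a stand-in for the real prior)."""
    def __init__(self, vocab=VOCAB, dim=256, layers=2, heads=4):
        super().__init__()
        self.tok = nn.Embedding(vocab, dim)
        self.pos = nn.Embedding(WINDOW, dim)
        block = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(dim, vocab)

    def forward(self, idx):                         # idx: (B, L) token ids, L <= WINDOW
        L = idx.shape[1]
        x = self.tok(idx) + self.pos(torch.arange(L, device=idx.device))
        mask = nn.Transformer.generate_square_subsequent_mask(L).to(idx.device)
        return self.head(self.blocks(x, mask=mask))  # (B, L, vocab) next-token logits

@torch.no_grad()
def sample_long_token_sequence(model, n_new_tokens, temperature=1.0):
    """Autoregressively extend the token sequence using only a sliding context window."""
    seq = torch.randint(0, VOCAB, (1, 1))           # hypothetical start token
    for _ in range(n_new_tokens):
        ctx = seq[:, -WINDOW:]                      # drop tokens older than the window
        logits = model(ctx)[:, -1] / temperature
        nxt = torch.multinomial(torch.softmax(logits, dim=-1), 1)
        seq = torch.cat([seq, nxt], dim=1)
    return seq                                      # a 3D-VQGAN decoder would map this back to frames

model = ToyTokenTransformer().eval()
tokens = sample_long_token_sequence(model, n_new_tokens=2 * WINDOW)  # ~2x the training length
print(tokens.shape)                                  # torch.Size([1, 513])

The design point the sketch is meant to convey is only that generation length is decoupled from the fixed context seen during training: the window slides forward while frames keep being emitted.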
Acknowledgements
We thank Oran Gafni, Sasha Sheng, and Isabelle Hu for helpful discussions and feedback; Patrick Esser and Robin Rombach for sharing additional insights on training VQGAN models; and Anoop Cherian and Moitreya Chatterjee for sharing the pre-processing code for the AudioSet dataset.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Ge, S. et al. (2022). Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13677. Springer, Cham. https://doi.org/10.1007/978-3-031-19790-1_7
DOI: https://doi.org/10.1007/978-3-031-19790-1_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19789-5
Online ISBN: 978-3-031-19790-1
eBook Packages: Computer Science, Computer Science (R0)