Abstract
Entropy modeling is a key component of high-performance image compression algorithms. Recent developments in autoregressive context modeling have helped learning-based methods surpass their classical counterparts. However, the performance of those models can be improved further, since they underexploit the spatio-channel dependencies in the latent space and implement context adaptivity suboptimally. Inspired by the adaptive characteristics of transformers, we propose a transformer-based context model, named Contextformer, which generalizes the de facto standard attention mechanism to spatio-channel attention. We replace the context model of a modern compression framework with the Contextformer and test it on the widely used Kodak, CLIC2020, and Tecnick image datasets. Our experimental results show that the proposed model yields rate savings of up to 11% over the Versatile Video Coding (VVC) Test Model (VTM) 16.2 and outperforms various learning-based models in terms of PSNR and MS-SSIM.
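To make the core idea concrete, the sketch below shows one way masked, autoregressive spatio-channel attention could be realized in PyTorch: the latent tensor is split into channel segments, each (spatial position, channel segment) pair becomes a token, and a causally masked transformer predicts the mean and scale of a Gaussian entropy model for every latent element. This is a minimal illustration under assumed shapes, segment count, and coding order; all class and parameter names are ours, and the actual Contextformer architecture differs in detail.

```python
# Minimal sketch of a spatio-channel attention context model.
# Assumptions (not the authors' implementation): 4 channel segments,
# a raster spatio-channel coding order, and a plain TransformerEncoder.
import torch
import torch.nn as nn


class SpatioChannelContextModel(nn.Module):
    def __init__(self, latent_channels=192, num_segments=4, d_model=256,
                 num_heads=8, num_layers=4):
        super().__init__()
        assert latent_channels % num_segments == 0
        self.num_segments = num_segments              # channel segments per position
        self.seg_dim = latent_channels // num_segments
        self.embed = nn.Linear(self.seg_dim, d_model)
        self.start = nn.Parameter(torch.zeros(1, 1, d_model))  # learned start token
        layer = nn.TransformerEncoderLayer(d_model, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        # Predict mean and scale of a Gaussian for each latent element.
        self.head = nn.Linear(d_model, 2 * self.seg_dim)

    def forward(self, y):
        # y: quantized latent, shape (B, C, H, W).
        b, c, h, w = y.shape
        # Tokenize: one token per (spatial position, channel segment), ordered
        # so all segments of a position precede the next position (assumed order).
        tokens = (y.reshape(b, self.num_segments, self.seg_dim, h * w)
                   .permute(0, 3, 1, 2)               # (B, HW, S, seg_dim)
                   .reshape(b, h * w * self.num_segments, self.seg_dim))
        x = self.embed(tokens)
        # Shift right with the start token so the parameters for token i
        # depend only on tokens 0..i-1 (strict causality for entropy coding).
        x = torch.cat([self.start.expand(b, -1, -1), x[:, :-1]], dim=1)
        n = x.shape[1]
        # Boolean causal mask: True entries are positions a token may NOT attend to.
        causal = torch.triu(torch.ones(n, n, dtype=torch.bool, device=x.device),
                            diagonal=1)
        x = self.encoder(x, mask=causal)
        mean, scale = self.head(x).chunk(2, dim=-1)
        return mean, nn.functional.softplus(scale)    # scale must be positive


# Usage on a toy latent tensor:
model = SpatioChannelContextModel()
y = torch.randn(1, 192, 4, 4)
mean, scale = model(y)
print(mean.shape, scale.shape)  # torch.Size([1, 64, 48]) twice
```

The right-shift with a learned start token keeps the model strictly causal, so a decoder can reproduce the same distribution parameters one token at a time during arithmetic decoding.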
References
Versatile Video Coding. Standard, Rec. ITU-T H.266 and ISO/IEC 23090–3 (2020)
Asuni, N., Giachetti, A.: TESTIMAGES: a large-scale archive for testing visual devices and basic image processing algorithms. In: STAG, pp. 63–70 (2014)
Ballé, J., Chou, P.A., Minnen, D., Singh, S., Johnston, N., Agustsson, E., Hwang, S.J., Toderici, G.: Nonlinear transform coding. IEEE Journal of Selected Topics in Signal Processing 15(2), 339–353 (2020)
Ballé, J., Laparra, V., Simoncelli, E.P.: Density modeling of images using a generalized normalization transformation. In: 4th International Conference on Learning Representations, ICLR 2016 (2016)
Ballé, J., Laparra, V., Simoncelli, E.P.: End-to-end optimized image compression. In: 5th International Conference on Learning Representations, ICLR 2017 (2017)
Ballé, J., Minnen, D., Singh, S., Hwang, S.J., Johnston, N.: Variational image compression with a scale hyperprior. In: International Conference on Learning Representations (2018)
Bégaint, J., Racapé, F., Feltman, S., Pushparaja, A.: CompressAI: a PyTorch library and evaluation platform for end-to-end compression research. arXiv preprint arXiv:2011.03029 (2020)
Bellard, F.: BPG image format (2015). Accessed 01 Jun 2022. https://bellard.org/bpg
Bjontegaard, G.: Calculation of average PSNR differences between RD-curves. VCEG-M33 (2001)
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 213–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_13
Chen, T., Liu, H., Ma, Z., Shen, Q., Cao, X., Wang, Y.: End-to-end learnt image compression via non-local attention optimization and improved context modeling. IEEE Transactions on Image Processing 30, 3179–3191 (2021). https://doi.org/10.1109/TIP.2021.3058615
Cheng, Z., Sun, H., Takeuchi, M., Katto, J.: Learned image compression with discretized Gaussian mixture likelihoods and attention modules. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7939–7948 (2020)
Cui, Z., Wang, J., Bai, B., Guo, T., Feng, Y.: G-VAE: A continuously variable rate deep image compression framework. arXiv preprint arXiv:2003.02012 (2020)
Cui, Z., Wang, J., Gao, S., Guo, T., Feng, Y., Bai, B.: Asymmetric gained deep image compression with continuous rate adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10532–10541 (2021)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol.1 (Long and Short Papers), pp. 4171–4186. Association for Computational Linguistics, Minneapolis, Minnesota (2019). https://doi.org/10.18653/v1/N19-1423
Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. In: International Conference on Learning Representations (2020)
Esser, P., Rombach, R., Ommer, B.: Taming transformers for high-resolution image synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12873–12883 (2021)
Franzen, R.: Kodak lossless true color image suite (1999)
Goyal, V.K.: Theoretical foundations of transform coding. IEEE Signal Processing Magazine 18(5), 9–21 (2001)
Guo, Z., Zhang, Z., Feng, R., Chen, Z.: Causal contextual prediction for learned image compression. IEEE Transactions on Circuits and Systems for Video Technology (2021)
He, D., Zheng, Y., Sun, B., Wang, Y., Qin, H.: Checkerboard context model for efficient learned image compression. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14771–14780 (2021)
Jiang, Y., Chang, S., Wang, Z.: TransGAN: two transformers can make one strong GAN. arXiv preprint arXiv:2102.07074 (2021)
Katharopoulos, A., Vyas, A., Pappas, N., Fleuret, F.: Transformers are RNNs: fast autoregressive transformers with linear attention. In: Daumé III, H., Singh, A. (eds.) Proceedings of the 37th International Conference on Machine Learning. Proceedings of Machine Learning Research, 13–18 Jul 2020, vol. 119, pp. 5156–5165. PMLR (2020). https://proceedings.mlr.press/v119/katharopoulos20a.html
Kim, D.W., Chung, J.R., Jung, S.W.: GRDN: grouped residual dense network for real image denoising and GAN-based real-world noise modeling. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (2019)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Koyuncu, A.B., Cui, K., Boev, A., Steinbach, E.: Parallelized context modeling for faster image coding. In: 2021 International Conference on Visual Communications and Image Processing (VCIP), pp. 1–5. IEEE (2021)
Lee, J., Cho, S., Beack, S.K.: Context-adaptive entropy model for end-to-end optimized image compression. In: 6th International Conference on Learning Representations, ICLR 2018 (2018)
Lee-Thorp, J., Ainslie, J., Eckstein, I., Ontanon, S.: FNet: mixing tokens with Fourier transforms. arXiv preprint arXiv:2105.03824 (2021)
Li, D., et al.: Involution: inverting the inherence of convolution for visual recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12321–12330 (2021)
Li, M., Ma, K., You, J., Zhang, D., Zuo, W.: Efficient and effective context-based convolutional entropy modeling for image compression. IEEE Transactions on Image Processing 29, 5900–5911 (2020). https://doi.org/10.1109/TIP.2020.2985225
Liu, H., Chen, T., Shen, Q., Ma, Z.: Practical stacked non-local attention modules for image compression. In: CVPR Workshops (2019)
Mentzer, F., Agustsson, E., Tschannen, M., Timofte, R., Van Gool, L.: Conditional probability models for deep image compression. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4394–4402 (2018)
Minnen, D., Ballé, J., Toderici, G.: Joint autoregressive and hierarchical priors for learned image compression. In: NeurIPS (2018)
Minnen, D., Singh, S.: Channel-wise autoregressive entropy models for learned image compression. In: 2020 IEEE International Conference on Image Processing (ICIP), pp. 3339–3343. IEEE (2020)
Naseer, M., Ranasinghe, K., Khan, S., Hayat, M., Khan, F.S., Yang, M.H.: Intriguing properties of vision transformers. arXiv preprint arXiv:2105.10497 (2021)
Niu, Z., Zhong, G., Yu, H.: A review on the attention mechanism of deep learning. Neurocomputing 452, 48–62 (2021). https://doi.org/10.1016/j.neucom.2021.03.091. https://www.sciencedirect.com/science/article/pii/S092523122100477X
Parmar, N., et al.: Image transformer. In: International Conference on Machine Learning, pp. 4055–4064. PMLR (2018)
Qian, Y., Sun, X., Lin, M., Tan, Z., Jin, R.: Entroformer: a transformer-based entropy model for learned image compression. In: International Conference on Learning Representations (2021)
Qian, Y., et al.: Learning accurate entropy model with global reference for image compression. In: International Conference on Learning Representations (2020)
Rissanen, J., Langdon, G.G.: Arithmetic coding. IBM Journal of Research and Development 23(2), 149–162 (1979)
Roy, A., Saffar, M., Vaswani, A., Grangier, D.: Efficient content-based sparse attention with routing transformers. Transactions of the Association for Computational Linguistics 9, 53–68 (2021)
Sullivan, G.J., Ohm, J.R., Han, W.J., Wiegand, T.: Overview of the high efficiency video coding (HEVC) standard. IEEE Transactions on Circuits and Systems for Video Technology 22(12), 1649–1668 (2012)
Joint Video Experts Team (JVET): Versatile Video Coding (VVC) reference software: VVC Test Model (VTM) (2022). Accessed 01 Jun 2022. https://vcgit.hhi.fraunhofer.de/jvet/VVCSoftware_VTM
Toderici, G., et al.: Workshop and challenge on learned image compression (CLIC2020) (2020)
Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)
Wallace, G.K.: The JPEG still picture compression standard. IEEE Transactions on Consumer Electronics 38(1), xviii–xxxiv (1992)
Wang, Z., Simoncelli, E.P., Bovik, A.C.: Multiscale structural similarity for image quality assessment. In: The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, vol. 2, pp. 1398–1402. IEEE (2003)
Xue, T., Chen, B., Wu, J., Wei, D., Freeman, W.T.: Video enhancement with task-oriented flow. International Journal of Computer Vision (IJCV) 127(8), 1106–1125 (2019)
Zhang, Y., Li, K., Li, K., Zhong, B., Fu, Y.: Residual non-local attention networks for image restoration. In: International Conference on Learning Representations (2019). https://openreview.net/forum?id=HkeGhoA5FX
Zhao, J., Li, B., Li, J., Xiong, R., Lu, Y.: A universal encoder rate distortion optimization framework for learned compression. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1880–1884 (2021)
Zhou, J., Wen, S., Nakagawa, A., Kazui, K., Tan, Z.: Multi-scale and context-adaptive entropy model for image compression. arXiv preprint arXiv:1910.07844 (2019)
Ziv, J.: On universal quantization. IEEE Transactions on Information Theory 31(3), 344–347 (1985)
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Koyuncu, A.B., Gao, H., Boev, A., Gaikov, G., Alshina, E., Steinbach, E. (2022). Contextformer: A Transformer with Spatio-Channel Attention for Context Modeling in Learned Image Compression. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13679. Springer, Cham. https://doi.org/10.1007/978-3-031-19800-7_26
DOI: https://doi.org/10.1007/978-3-031-19800-7_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19799-4
Online ISBN: 978-3-031-19800-7