
GVLE: a highly optimized GPU-based implementation of variable-length encoding

Published in The Journal of Supercomputing

Abstract

The massive use of multimedia data makes data compression fundamental to reducing storage requirements and communication bandwidth. Variable-length encoding (VLE) is a widely used compression method that reduces input data size by assigning shorter codewords to frequent symbols and longer codewords to rare ones. Because it is a common building block of many compression algorithms, such as the popular Huffman coding, speeding up VLE is essential to accelerating them. For this reason, over the last decade and a half, efficient VLE implementations have been proposed for General-Purpose Graphics Processing Units (GPGPU). The state-of-the-art GPU-based implementations of VLE suffer from the following performance issues. First, the codeword look-up table is stored in shared memory in a way that is not optimized to reduce bank conflicts. Second, input/output data are read/written through inefficient strided global memory accesses. Third, the thread-codes are built in a way that is not optimized to reduce the number of executed instructions. Our goal in this work is to significantly speed up the state-of-the-art implementations of VLE by solving these performance issues. To this end, we propose GVLE, a highly optimized GPU implementation of VLE that applies the following optimization strategies. First, the codeword look-up table is cached in a way that minimizes bank conflicts. Second, input data are read with vectorized loads to fully exploit the available global memory bandwidth. Third, each thread's encoding is performed efficiently in the register space, with high instruction-level parallelism and a lower number of executed instructions. Fourth, a novel inter-block scan method, which outperforms those of state-of-the-art solutions, computes the bit-positions of the thread-block encodings in the output bit-stream. This mechanism is based on a regular segmented scan performed efficiently, via global atomic additions, on the sequences of bit-lengths of 32 consecutive thread-block encodings. Fifth, output data are written efficiently through coalesced global memory stores. An exhaustive experimental evaluation shows that our solution is on average 2.6\(\times\) faster than the best state-of-the-art implementation. It also shows that the scan algorithm is on average 1.62\(\times\) faster when it uses our inter-block scan method instead of that of the best state-of-the-art VLE solution. Hence, our inter-block scan method offers promising possibilities for accelerating algorithms that rely on it, such as the scan itself or stream compaction.
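The pipeline summarized above can be illustrated with a serial sketch. This is an assumed toy model, not the authors' CUDA code: each symbol is looked up in a codeword table to get a (codeword, bit-length) pair, an exclusive prefix sum (scan) over the bit-lengths yields each codeword's starting bit-position in the output stream (the quantity GVLE's inter-block scan computes in parallel), and the codewords are then packed into one contiguous bit-stream. The codebook below is a hypothetical Huffman-style table.

```python
# Serial sketch (assumed, not the authors' CUDA implementation) of the
# VLE pipeline: look up (codeword, bit-length) per symbol, compute each
# codeword's starting bit-position with an exclusive prefix sum over the
# bit-lengths, then pack the codewords into a contiguous bit-stream.
from itertools import accumulate

# Hypothetical Huffman-style codebook: frequent symbols get short codes.
CODEBOOK = {"a": (0b0, 1), "b": (0b10, 2), "c": (0b110, 3), "d": (0b111, 3)}

def vle_encode(symbols):
    codes = [CODEBOOK[s] for s in symbols]
    lengths = [n for _, n in codes]
    # Exclusive scan: bit-position where each codeword starts.
    positions = [0] + list(accumulate(lengths))[:-1]
    total_bits = positions[-1] + lengths[-1] if codes else 0
    out = bytearray((total_bits + 7) // 8)
    # Pack each codeword MSB-first at its scanned bit-position.
    for (code, nbits), pos in zip(codes, positions):
        for i in range(nbits):
            if (code >> (nbits - 1 - i)) & 1:
                out[(pos + i) // 8] |= 0x80 >> ((pos + i) % 8)
    return bytes(out), total_bits

packed, nbits = vle_encode("aabcd")  # codewords 0, 0, 10, 110, 111 -> 10 bits
```

On the GPU, each thread block encodes its chunk independently, and the atomic-addition-based segmented scan described in the abstract supplies the cross-block bit-positions; the serial `accumulate` here merely stands in for that step.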

[Figures 1–8: available in the full article]


Data availability

Not applicable.

Notes

  1. The source code is available at https://github.com/z12fuala/GVLE.


Acknowledgements

Not applicable.

Funding

None.

Author information

Authors and Affiliations

Authors

Contributions

All authors contributed to the study conception. Design, data collection and analysis were performed by AF-A. The first draft of the manuscript was written by AF-A and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Antonio Fuentes-Alventosa.

Ethics declarations

Conflict of interest

The authors declare that they have no competing interests as defined by Springer, or other interests that might be perceived to influence the results and/or discussion reported in this paper.

Ethical approval and consent to participate

The corresponding author has read the Springer journal policies on author responsibilities and submits this manuscript in accordance with those policies.

Consent for publication

I have read and understood the publishing policy, and submit this manuscript in accordance with this policy.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Fuentes-Alventosa, A., Gómez-Luna, J. & Medina-Carnicer, R. GVLE: a highly optimized GPU-based implementation of variable-length encoding. J Supercomput 79, 8447–8474 (2023). https://doi.org/10.1007/s11227-022-04994-3

