Abstract
The massive use of multimedia data gives data compression a fundamental role in reducing storage requirements and communication bandwidth. Variable-length encoding (VLE) is a relevant data compression method that reduces the size of the input data by assigning shorter codewords to the most frequently used symbols and longer codewords to rarely used symbols. Since VLE is a common building block of many compression algorithms, such as the popular Huffman coding, speeding it up is essential to accelerating them. For this reason, during the last decade and a half, efficient VLE implementations have been presented in the area of general-purpose computing on graphics processing units (GPGPU). The main performance issues of the state-of-the-art GPU-based implementations of VLE are the following. First, the codeword look-up table is not stored in shared memory in a way that reduces bank conflicts. Second, input/output data are read/written through inefficient strided global memory accesses. Third, the way in which the thread codes are built is not optimized to reduce the number of executed instructions. Our goal in this work is to significantly speed up the state-of-the-art implementations of VLE by solving these performance issues. To this end, we propose GVLE, a highly optimized GPU implementation of VLE that uses the following optimization strategies. First, the codeword look-up table is cached in a way that minimizes bank conflicts. Second, input data are read using vectorized loads to fully exploit the available global memory bandwidth. Third, each thread's encoding is performed efficiently in the register space, with high instruction-level parallelism and a lower number of executed instructions. Fourth, a novel inter-block scan method, which outperforms those of state-of-the-art solutions, is used to calculate the bit positions of the thread-block encodings in the output bit-stream. Our mechanism is based on a regular segmented scan, performed efficiently on sequences of bit-lengths of 32 consecutive thread-block encodings by using global atomic additions. Fifth, output data are written efficiently by executing coalesced global memory stores. An exhaustive experimental evaluation shows that our solution is on average 2.6\(\times\) faster than the best state-of-the-art implementation. Additionally, it shows that the scan algorithm is on average 1.62\(\times\) faster when it uses our inter-block scan method instead of that of the best state-of-the-art VLE solution. Hence, our inter-block scan method offers promising possibilities for accelerating algorithms that require it, such as scan itself or stream compaction.
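To make the encoding scheme concrete, the following CUDA sketch illustrates the kind of per-thread work described above: each thread reads four input symbols with a single vectorized load, looks up their codewords in a shared-memory table, concatenates them in a register buffer, and reports the resulting bit-length. This is a minimal illustration under assumed names (vle_bitlen_kernel, d_in, d_codes, d_lengths, d_bitlen) and an assumed 16-bit limit on codeword lengths; it is not the GVLE kernel itself, whose source is linked in the Notes below.

#include <cstdint>
#include <cuda_runtime.h>

// Minimal sketch, not the GVLE kernel: per-thread variable-length encoding of
// four symbols with a shared-memory codeword table and a vectorized load.
// All identifiers are illustrative; codeword lengths are assumed to be <= 16
// bits so that four concatenated codes fit in a 64-bit register buffer.
__global__ void vle_bitlen_kernel(const uchar4   *d_in,      // input symbols, four per thread
                                  const uint32_t *d_codes,   // codeword of each of the 256 symbols
                                  const uint8_t  *d_lengths, // bit-length of each codeword
                                  uint32_t       *d_bitlen,  // per-thread encoded bit-length
                                  int             n4)        // number of uchar4 elements
{
    // Cache the 256-entry look-up table in shared memory; GVLE additionally
    // organizes it so that look-ups cause as few bank conflicts as possible.
    __shared__ uint32_t s_codes[256];
    __shared__ uint8_t  s_lengths[256];
    for (int i = threadIdx.x; i < 256; i += blockDim.x) {
        s_codes[i]   = d_codes[i];
        s_lengths[i] = d_lengths[i];
    }
    __syncthreads();

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n4) return;

    // Vectorized load: one 32-bit transaction brings in four symbols.
    uchar4 v = d_in[tid];
    const uint8_t sym[4] = { v.x, v.y, v.z, v.w };

    // Concatenate the four codewords in a register buffer and accumulate the
    // total bit-length; a subsequent prefix scan of d_bitlen would give the
    // bit offset at which this thread writes into the output bit-stream.
    uint64_t buffer = 0;
    uint32_t bits   = 0;
    #pragma unroll
    for (int i = 0; i < 4; ++i) {
        uint32_t len = s_lengths[sym[i]];
        buffer = (buffer << len) | s_codes[sym[i]];
        bits  += len;
    }
    d_bitlen[tid] = bits;
    (void)buffer; // a full encoder would emit 'buffer' at the scanned offset
}

A complete encoder would follow this with an exclusive prefix sum of d_bitlen to obtain each thread's bit offset in the output stream (the step addressed by the inter-block scan method summarized above) and then emit the buffered codewords with coalesced global memory stores.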








Data availability
Not applicable.
Notes
The source code is available at https://github.com/z12fuala/GVLE.
Acknowledgements
Not applicable.
Funding
None.
Author information
Contributions
All authors contributed to the study conception. Design, data collection and analysis were performed by AF-A. The first draft of the manuscript was written by AF-A and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Conflict of interest
The authors declare that they have no competing interests as defined by Springer, or other interests that might be perceived to influence the results and/or discussion reported in this paper.
Ethical approval and consent to participate
The corresponding author has read the Springer journal policies on author responsibilities and submits this manuscript in accordance with those policies.
Consent for publication
I have read and understood the publishing policy, and submit this manuscript in accordance with this policy.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Fuentes-Alventosa, A., Gómez-Luna, J. & Medina-Carnicer, R. GVLE: a highly optimized GPU-based implementation of variable-length encoding. J Supercomput 79, 8447–8474 (2023). https://doi.org/10.1007/s11227-022-04994-3