Abstract
Machine translation quality has improved dramatically with the advent of neural machine translation. However, translation models are memory intensive, require expensive hardware, and train slowly. To reduce memory requirements and speed up translation, we propose the Transformer Discrete Fourier method with Skipping Sub-Layer (TF-SSL), which applies a discrete Fourier transform and a Skipping Sub-Layer algorithm after relative positional embedding of Chinese and English source sentences. The input sequence is processed by a Transformer with a relative positional embedding layer, and the text is mapped through the embedding matrix into word vectors that encode positional information, so that the word vectors effectively capture interdependencies within the text. After the 2D Fourier transform, the coefficient matrix is concentrated near the center of the encoder layer, yielding a compact matrix of transform coefficients that accelerates translation on a GPU. Accuracy and speed are further improved by the Skipping Sub-Layer method: sub-layers are randomly omitted during training to introduce perturbation, which imposes a stronger regularizing constraint on the sub-layers. We conduct an ablation study and comparative analyses. The results show that our approach improves both BLEU scores and GFLOPS compared with the baseline Transformer and other deep learning models.
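The abstract names two mechanisms: a Fourier-transform mixing sub-layer in the encoder and stochastic skipping of sub-layers during training. The PyTorch sketch below is not the authors' implementation; it is a minimal illustration of one plausible reading of those two ideas (an FNet-style 2D DFT in place of self-attention, and LayerDrop-style sub-layer skipping). The module names, the skip probability, and the dimensions are illustrative assumptions, and the relative positional embedding step is omitted.

```python
# Minimal sketch (assumed names and hyperparameters), not the TF-SSL code itself.
import torch
import torch.nn as nn


class FourierMixing(nn.Module):
    """Mix tokens with a 2D DFT over the (sequence, hidden) dimensions."""
    def forward(self, x):                    # x: (batch, seq_len, d_model)
        # Keep only the real part of the transform, as in FNet.
        return torch.fft.fft2(x, dim=(-2, -1)).real


class SkippableSubLayer(nn.Module):
    """Wrap a sub-layer and randomly omit it during training."""
    def __init__(self, sublayer, d_model, p_skip=0.2):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)
        self.p_skip = p_skip                 # skip probability is an assumed value

    def forward(self, x):
        # With probability p_skip, skip the sub-layer; the identity path keeps the
        # representation intact and acts as a training-time perturbation/regularizer.
        if self.training and torch.rand(1).item() < self.p_skip:
            return x
        return self.norm(x + self.sublayer(x))


# Usage: one encoder block with a Fourier mixing sub-layer and a skippable
# feed-forward sub-layer.
d_model = 512
block = nn.Sequential(
    SkippableSubLayer(FourierMixing(), d_model, p_skip=0.1),
    SkippableSubLayer(nn.Sequential(nn.Linear(d_model, 2048),
                                    nn.GELU(),
                                    nn.Linear(2048, d_model)), d_model, p_skip=0.1),
)
x = torch.randn(8, 64, d_model)              # (batch, seq_len, d_model)
y = block(x)                                 # same shape as x
```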








Data availability
The data that support the findings of this study are openly available: the WMT18 News Commentary v13 dataset at https://www.aclweb.org/anthology/volumes/W18-64/, the OpenSubtitles2016 dataset at https://aclanthology.org/L16-1147/, and the WMT2017 dataset at https://aclanthology.org/volumes/W17-47/.