Abstract
\(L_0\) regularization is an ideal pruning method for neural networks, as it yields the sparsest results among all \(L_p\) regularization methods. However, solving the \(L_0\)-regularized problem is NP-hard, and existing training algorithms with \(L_0\) regularization can only prune the network weights, not the neurons. To this end, in this paper we propose a batch gradient training method with smoothing Group \(L_0\) regularization (\(\hbox {BGSGL}_0\)). \(\hbox {BGSGL}_0\) not only overcomes the NP-hard nature of the \(L_0\) regularizer, but also prunes the network at the neuron level. The working mechanism by which \(\hbox {BGSGL}_0\) prunes hidden neurons is analysed, and convergence is theoretically established under mild conditions. Simulation results are provided to validate the theoretical findings and the superiority of the proposed algorithm.
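As an illustration only (the exact smoothing function \(h_{\sigma }\) and error term of the paper are not reproduced in this extract), the training objective behind such a method can be sketched as a batch error term plus a smoothed group penalty acting on the weight group \({{\textbf {w}}}_l\) of each hidden neuron, minimized by a plain batch gradient update:
\[ E({{\textbf {w}}})=\tilde{E}({{\textbf {w}}})+\lambda \sum _{l=1}^{L}h_{\sigma }\big (\parallel {{\textbf {w}}}_l\parallel ^2\big ),\qquad {{\textbf {w}}}^{m+1}={{\textbf {w}}}^{m}-\eta E_{{{\textbf {w}}}}({{\textbf {w}}}^{m}), \]
where \(h_{\sigma }\) is a smooth surrogate of the \(L_0\) indicator function, so that driving \(\parallel {{\textbf {w}}}_l\parallel \) to zero removes the whole group and hence the corresponding neuron. This form is an assumption consistent with the notation of the appendix, not a restatement of the paper's definition.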


References
Zhang H, Wang J, Sun Z, Zurada JM, Pal NR (2020) Feature selection for neural networks using Group Lasso regularization. IEEE Trans Knowl Data Eng 32(4):659–673
Liu T, Xiao J, Huang Z, Kong E, Liang Y (2019) BP neural network feature selection based on Group Lasso regularization. Proc. Chin. Autom. Congr., pp 2786–2790
Alemu HZ, Zhao J, Li F, Wu W (2019) Group \(L_{1/2}\) regularization for pruning hidden layer nodes of feedforward neural networks. IEEE Access 7:9540–9557
Augasta MG, Kathirvalavakumar T (2011) A novel pruning algorithm for optimizing feedforward neural network of classification problems. Neural Process Lett 34:241–258
Zeng XQ, Yeung DS (2006) Hidden neuron pruning of multilayer perceptrons using a quantified sensitivity measure. Neurocomputing 69:825–837
Wang J, Chang Q, Chang Q, Liu Y, Pal NR (2019) Weight noise injection-based MLPs with Group Lasso penalty: asymptotic convergence and application to node pruning. IEEE Trans Cybern 49:4346–4364
Wang J, Xu C, Yang X, Zurada JM (2018) A novel pruning algorithm for smoothing feedforward neural networks based on Group Lasso method. IEEE Trans Neural Netw Learn Syst 29(5):2012–2024
Dheeru D, Taniskidou EK (2017) UCI machine learning repository. School Inf. Comput. Sci., Univ. California, Irvine, CA, USA, Tech. Rep.
Moody JO, Antsaklis PJ (1996) The dependence identification neural network construction algorithm. IEEE Trans Neural Netw 7:3–15
Augasta MG, Kathirvalavakumar T (2013) Pruning algorithms of neural networks: a comparative study. Central Eur J Comput Sci 3:105–115
Reed R (1993) Pruning algorithms: a survey. IEEE Trans Neural Netw 4:740–747
Wang XY, Wang J, Zhang K, Lin F, Chang Q (2021) Convergence and objective functions of noise-injected multilayer perceptrons with hidden multipliers. Neurocomputing 452:796–812
Xu ZB, Chang XY, Xu FM, Zhang H (2012) \(L_{1/2}\) regularization: a thresholding representation theory and a fast solver. IEEE Trans Neural Netw Learn Syst 23(7):1013–1027
Miao C, Yu H (2016) Alternating iteration for \(L_p(0 < p)\) regularized CT reconstruction. IEEE Access 4:4355–4363
Treadgold NK, Gedeon TD (1998) Simulated annealing and weight decay in adaptive learning: the SARPROP algorithm. IEEE Trans Neural Netw 9(4):662–668
Wu W, Fan Q, Zurada JM, Wang J, Yang D, Liu Y (2014) Batch gradient method with smoothing \(L_{1/2}\) regularization for training of feedforward neural networks. Neural Netw 50:72–78
Zhang HS, Zhang Y, Zhu S, Xu DP (2020) Deterministic convergence of complex mini-batch gradient learning algorithm for fully complex-valued neural networks. Neurocomputing 407:185–193
Wu W, Wang J, Cheng MS, Li ZX (2011) Convergence analysis of online gradient method for BP neural networks. Neural Netw 24(1):91–98
Zhang HS, Tang YL (2017) Online gradient method with smoothing \(L_0\) regularization for feedforward neural networks. Neurocomputing 224:1–8
Wang J, Wu W, Zurada JM (2011) Deterministic convergence of conjugate gradient method for feedforward neural networks. Neurocomputing 74:2368–2376
Zhang HS, Wu W, Yao MC (2012) Boundedness and convergence of batch back-propagation algorithm with penalty for feedforward neural networks. Neurocomputing 89:141–146
Yang DK, Liu Y (2018) \(L_{1/2}\) regularization learning for smoothing interval neural networks: Algorithms and convergence analysis. Neurocomputing 272:122–129
Kurkova V, Sanguineti M (2001) Bounds on rates of variable-basis and neural-network approximation. IEEE Trans Inf Theory 47(6):2659–2665
Gnecco G, Sanguineti M (2011) On a variational norm tailored to variable-basis approximation schemes. IEEE Trans Inf Theory 57(1):549–558
Xu ZB, Zhang H, Wang Y, Chang XY, Liang Y (2010) \(L_{1/2}\) regularization. Sci China-Inf Sci 6:1159–1169
Hornik K, Stinchcombe M, White H (1989) Multilayer feedforward networks are universal approximators. Neural Netw 2:359–366
Li F, Zurada JM, Wu W (2018) Smooth Group \(L_{1/2}\) regularization for input layer of feedforward neural networks. Neurocomputing 314:109–119
Li F, Zurada JM, Liu Y, Wu W (2017) Input layer regularization of multilayer feedforward neural networks. IEEE Access 5:10979–10985
Wang J, Zhang H, Wang J, Pu YF, Pal NR (2021) Feature selection using a neural network with Group Lasso regularization and controlled redundancy. IEEE Trans Neural Netw Learn Syst 32(3):1110–1123
Scardapane S, Comminiello D, Hussain A, Uncini A (2017) Group sparse regularization for deep neural networks. Neurocomputing 241:81–89
Xie XT, Zhang HQ, Wang JZ, Chang Q, Wang J, Pal NR (2020) Learning optimized structure of neural networks by hidden node pruning with \(L_1\) regularization. IEEE Trans Cybern 50:1333–1346
Formanek A, Hadházi D (2019) Compressing convolutional neural networks by \(L_0\) regularization. Proc. Int. Conf. Control, Artificial Intelligence, Robotics & Optimization, pp 155–162
Zhang HS, Tang YL, Liu XD (2015) Batch gradient training method with smoothing regularization for \(L_0\) feedforward neural networks. Neural Comput & Applic 26(2):383–390
Xie Q, Li C, Diao B, An Z, Xu Y (2019) \(L_0\) regularization based fine-grained neural network pruning method. Proc. Int. Conf. Electron. Comput. Artif. Intell. p 11:1-4
Wang J, Cai Q, Chang Q, Zurada JM (2017) Convergence analyses on sparse feedforward neural networks via Group Lasso regularization. Inf Sci 381:250–269
Fan Q, Peng J, Li H, Lin S (2021) Convergence of a gradient-based learning algorithm with penalty for ridge polynomial neural networks. IEEE Access 9:28742–28752
Kang Q, Fan Q, Zurada JM (2021) Deterministic convergence analysis via smoothing Group Lasso regularization and adaptive momentum for Sigma-Pi-Sigma neural network. Inf Sci 553:66–82
Acknowledgements
This work is supported by the National Natural Science Foundation of China (Nos. 61671099, 62176051).
Appendix
In this appendix, we present the proof of Theorem 1. For brevity, we list the following constants which will be used in the sequel.
For the sake of simplicity, we also define the following notation.
The following two lemmas are crucial to our convergence analysis.
Lemma 1
Suppose that Assumptions (A1) and (A2) hold. Then we have
where the constant \(C_5\) is defined in (20), \(m=1,2,\dots ,M\), \(l=1,2,\dots ,L\), and \(j=1,2,\dots ,J\).
Proof
Using (2) we have
Using the Lagrange mean value theorem, Assumption (A1), and Eq. (23), we have
where \(t^{m,j}_l\) is between \({{\textbf {v}}}^{m+1}_l{{\textbf {x}}}^j\) and \({{\textbf {v}}}^{m}_l{{\textbf {x}}}^j\). \(\square \)
Lemma 2
(See Lemma 3 in [21]) Let \(F:\varPhi \subset R^k \rightarrow R^q\ (k,q\ge 1)\) be continuous on a bounded closed region \(\varPhi \), and let \(\varOmega = \{{{\textbf {z}}}\in \varPhi : F({{\textbf {z}}})=0\}\). Suppose that the projection of \(\varOmega \) onto each coordinate axis does not contain any interior point, and that the sequence \(\{{{\textbf {z}}}^m\}\) satisfies
Then, there exists a unique \({{\textbf {z}}}^*\in \varOmega \) such that \(\lim \limits _{m \rightarrow \infty }{{\textbf {z}}}^m={{\textbf {z}}}^*\).
\({\textbf {Proof of (1) of Theorem 1}}\). Applying the Taylor formula to the cost function defined in (11), we have
where \(t_0^{j,q}\) is between \({{\textbf {u}}}^{m+1}_{r_q}{{\textbf {F}}}^{m+1,j}\) and \({{\textbf {u}}}^{m}_{r_q}{{\textbf {F}}}^{m,j}\).
By (21) and the Taylor formula, we have
Substituting (27) into (26), we have
where
Combining (13)-(15), (27)-(29), we have
Applying the Taylor formula, we have
where \(\xi ^{m}_l\) is between \(\parallel {{\textbf {w}}}^{m+1}_l\parallel ^2\) and \(\parallel {{\textbf {w}}}^{m}_l\parallel ^2\).
To further estimate Eq. (29), we apply the triangle inequality and obtain
and
Combining the above two equations, we have
where \(C_1=\max \{\frac{JQC^2_5}{2}(C_3+ C_4+2C_3C^2_4),\frac{JC_3}{2}(1+2C^2_5)\}\).
Substituting (31)-(34) into (30), we have
where \(C_7=C_6+\sup _{t\in R}h'_{\sigma }(t)\).
Thus, if the learning rate \(\eta \) satisfies \(0<\eta < \frac{1}{C_7\lambda +C_1}\), then \(E({{\textbf {w}}}^{m+1})\le E({{\textbf {w}}}^{m})\). This completes the proof of the monotonicity of the error function.
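For readers reconstructing the argument from this extract, the inequality used here is of the standard descent type. Assuming the batch update \({{\textbf {w}}}^{m+1}={{\textbf {w}}}^{m}-\eta E_{{{\textbf {w}}}}({{\textbf {w}}}^{m})\) (the explicit form of (35) is not reproduced above), it can be summarized as
\[ E({{\textbf {w}}}^{m+1})\le E({{\textbf {w}}}^{m})-\eta \big (1-\eta (C_7\lambda +C_1)\big )\parallel E_{{{\textbf {w}}}}({{\textbf {w}}}^{m})\parallel ^2=E({{\textbf {w}}}^{m})-\beta \eta ^2\parallel E_{{{\textbf {w}}}}({{\textbf {w}}}^{m})\parallel ^2, \]
with \(\beta =\frac{1}{\eta }-C_7\lambda -C_1\) as introduced in the proof of (3) below; the bracketed factor is positive precisely when \(0<\eta <\frac{1}{C_7\lambda +C_1}\).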
\({\textbf {Proof of (2) of Theorem 1}}\). Since \(E({{\textbf {w}}}^m)\) is monotonically decreasing and \(E({{\textbf {w}}}^m)\ge 0\), there exists a constant \(E^*\ge 0\) such that
\({\textbf {Proof of (3) of Theorem 1}}\). Let \(\beta = \frac{1}{\eta }- C_7\lambda - C_1>0\). By (35), we have
Since \(E({{\textbf {w}}}^{m}) \ge 0\) holds for any \(m\ge 1\), we have
Letting \(m\rightarrow \infty \) and noting that \(E({{\textbf {w}}}^{m}) \ge 0\), we have
Then we have
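As a sketch of this step under the same reconstructed inequality, summing over \(m=0,1,\dots ,M\) and telescoping gives
\[ \beta \eta ^2\sum _{m=0}^{M}\parallel E_{{{\textbf {w}}}}({{\textbf {w}}}^{m})\parallel ^2\le E({{\textbf {w}}}^{0})-E({{\textbf {w}}}^{M+1})\le E({{\textbf {w}}}^{0})<\infty , \]
so the series \(\sum _{m=0}^{\infty }\parallel E_{{{\textbf {w}}}}({{\textbf {w}}}^{m})\parallel ^2\) converges and therefore \(\lim \limits _{m\rightarrow \infty }\parallel E_{{{\textbf {w}}}}({{\textbf {w}}}^{m})\parallel =0\), which is the weak convergence stated in (3).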
\({\textbf {Proof of (4) of Theorem 1}}\). Clearly, \(\parallel E_{{{\textbf {w}}}}({{\textbf {w}}}) \parallel \) is a continuous function under Assumptions (A1)-(A4). Using (12), we have
By virtue of Lemma 2, there exists a unique \({{\textbf {w}}}^*\in \varOmega \) such that \(\lim \limits _{m\rightarrow \infty }{{\textbf {w}}}^m={{\textbf {w}}}^*\). This completes the proof of (4) of Theorem 1.
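To indicate how Lemma 2 is invoked (a sketch under the same assumed update rule, since Assumptions (A1)-(A4) are not reproduced in this extract), take \(F({{\textbf {z}}})=E_{{{\textbf {w}}}}({{\textbf {z}}})\), \({{\textbf {z}}}^m={{\textbf {w}}}^m\), and \(\varOmega =\{{{\textbf {z}}}:E_{{{\textbf {w}}}}({{\textbf {z}}})=0\}\). Continuity of \(F\) on a bounded closed region and the interior-point condition on \(\varOmega \) are supplied by the assumptions, while
\[ \parallel {{\textbf {z}}}^{m+1}-{{\textbf {z}}}^{m}\parallel =\eta \parallel E_{{{\textbf {w}}}}({{\textbf {w}}}^{m})\parallel \rightarrow 0 \]
follows from the result of (3), so Lemma 2 yields the convergence of \(\{{{\textbf {w}}}^m\}\) to a unique \({{\textbf {w}}}^*\in \varOmega \).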
This completes the proof of Theorem 1.