Abstract
This paper introduces a new dictionary learning strategy based on atoms obtained by translating the composition of \(K\) convolutions with \(S\)-sparse kernels of known support. The dictionary update step associated with this strategy is a non-convex optimization problem. We propose a practical formulation of this problem and introduce a Gauss–Seidel type algorithm, referred to as the alternate least squares algorithm, to solve it. The search space of the proposed algorithm is of dimension \(KS\), which is typically smaller than the size of the target atom and much smaller than the size of the image. Moreover, the complexity of this algorithm is linear with respect to the image size, allowing larger atoms to be learned (as opposed to small patches). Our experiments show that the proposed approach accurately approximates atoms such as wavelets, curvelets, sinc functions, and cosines for large values of \(K\). They also indicate that the algorithm generally converges to a global minimum for large values of \(K\) and \(S\).
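To fix ideas, here is a minimal NumPy sketch (not the authors' code; the function names and the random kernel supports are ours) of how such an atom is built: the composition of \(K\) circular convolutions with \(S\)-sparse kernels of known support, so that the atom is parameterized by only \(KS\) coefficients.

import numpy as np

def compose_atom(kernels, supports, size):
    # Build an atom on a 1-D periodic grid of length `size` as the composition
    # of K circular convolutions with S-sparse kernels (periodic boundary
    # handling, consistent with the periodization convention in the notes below).
    atom = np.zeros(size)
    atom[0] = 1.0                                   # start from a Dirac delta
    for coeffs, supp in zip(kernels, supports):
        kernel = np.zeros(size)
        kernel[np.asarray(supp) % size] = coeffs    # embed the S-sparse kernel
        atom = np.real(np.fft.ifft(np.fft.fft(atom) * np.fft.fft(kernel)))
    return atom

# Example: K = 4 kernels with S = 3 coefficients each, on a grid of 256 samples,
# i.e. only K*S = 12 free parameters for an atom of size 256.
rng = np.random.default_rng(0)
K, S, N = 4, 3, 256
supports = [rng.choice(N, size=S, replace=False) for _ in range(K)]
kernels = [rng.standard_normal(S) for _ in range(K)]
atom = compose_atom(kernels, supports, N)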
Notes
1. All the signals in \(\mathbb R^{\mathcal P}\) are extended by periodization to be defined at any point in \(\mathbb Z^d\).
2. \(\mathbb {R}^{\mathcal {P}}\) and \(\mathbb R^S\) are endowed with the usual scalar product, denoted \(\langle \cdot , \cdot \rangle \), and the usual Euclidean norm, denoted \(\Vert \cdot \Vert _{2}\). We use the same notation for both vector spaces; the context should prevent any ambiguity.
3. Usually, DL is applied to small images, such as patches extracted from large images.
4. In the practical situations we are interested in, \(\#{\mathcal P}\gg S\), so \(S^3\) can be neglected when compared with \((K+S)S\#{\mathcal P}\).
5. For simplicity, in the formula below, we do not mention the mapping of \(\mathbb R^S\) into \(\mathbb R^{\mathcal P}\) necessary to build \(h^k\).
6. In this case the comparison is relevant, because \(\alpha \) is a Dirac delta function.
7. A sum of cosines of the same frequency and different phases yields a cosine of the same frequency (see the identity written out after these notes).
8. We further assume that \(\Vert f^k \Vert _{2}\ne 0\) for all \(k\in \{1,\ldots ,K\}\), since the inequality is otherwise trivial.
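For completeness, the phasor identity behind note 7 can be written explicitly (a standard trigonometric fact, added here for the reader's convenience):
\[
\sum _i a_i \cos (\omega t+\phi _i)\;=\;A\cos (\omega t+\Phi ),
\qquad \text{where}\qquad
A\,e^{\mathrm i\Phi }=\sum _i a_i\, e^{\mathrm i\phi _i},
\]
so only the amplitude and the phase are affected; the frequency \(\omega \) is unchanged.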
Acknowledgments
The authors would like to thank Jose Bioucas-Dias, Jalal Fadili, Rémi Gribonval and Julien Mairal for their fruitful remarks on this work. Olivier Chabiron is supported by ANR-11-LABX-0040-CIMI within the program ANR-11-IDEX-0002-02.
Additional information
Communicated by Julien Mairal, Francis Bach, and Michael Elad.
This work was performed during the Thematic Trimester on image processing of the CIMI Excellence Laboratory, held in Toulouse, France, from May to July 2013.
Appendix
1.1 Proof of Proposition 1
First, notice that \({\mathcal D}\) is a compact set. Moreover, when (7) holds, the objective function of \((P_1)\) is coercive in \(\lambda \). Thus, for any threshold \(\mu \), it is possible to build a compact set such that the objective function evaluated at any \((\lambda ,\mathbf h)\) outside this compact set is larger than \(\mu \). As a consequence, any minimizing sequence eventually remains in such a compact set, and we can extract a convergent subsequence from it. Since the objective function of \((P_1)\) is continuous on a closed domain, any limit point of this subsequence is a minimizer of \((P_1)\).
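For the reader's convenience, the compactness argument can be spelled out as follows, writing \(F\) for the objective function of \((P_1)\) (a standard argument; only the notation is ours): the sublevel set
\[
\mathcal L_\mu \;=\; \bigl\{(\lambda ,\mathbf h)\in \mathbb R_{\ge 0}\times {\mathcal D} \;:\; F(\lambda ,\mathbf h)\le \mu \bigr\}
\]
is closed, since \(F\) is continuous and \({\mathcal D}\) is closed, and bounded, since \({\mathcal D}\) is bounded and \(F\) is coercive in \(\lambda \); it is therefore compact. Taking \(\mu \) slightly larger than the infimum of \(F\), any minimizing sequence eventually remains in \(\mathcal L_\mu \), hence admits a subsequence converging in \(\mathcal L_\mu \), and the continuity of \(F\) shows that its limit attains the infimum.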
1.2 Proof of Proposition 2
The proof of item 1 hinges on writing the conditions satisfied by a stationary point of \((P_1)\), then showing that the Lagrange multipliers associated with the norm-to-one constraints on the \((h^k)_{1 \le k \le K}\) are all equal to \(0\). First, considering the partial differential of the objective function of \((P_1)\) with respect to \(\lambda \) and a Lagrange multiplier \(\gamma _\lambda \ge 0\) for the constraint \(\lambda \ge 0\), we obtain
and
Then, considering Lagrange multipliers \(\gamma _k\in \mathbb R\) associated with each constraint \(\Vert h^k \Vert _{2}=1\), we have for all \(k \in \{1,\dots ,K\}\)
where \(H^k\) is defined by (5). Taking the scalar product of (21) with \(h^k\) and using both \(\Vert h^k\Vert _2=1\) and (19), we obtain
Hence, (21) takes the form, for all \(k \in \{1,\dots ,K\}\)
When \(\lambda >0\), this immediately implies that the kernels \(\mathbf g\) defined by (8) satisfy
i.e., the kernels \(\mathbf g \in (\mathbb {R}^{\mathcal {P}})^K\) form a stationary point of \((P_0)\).
The proof of item 2 is straightforward: for any \((f^k)_{1 \le k \le K} \in (\mathbb {R}^{\mathcal {P}})^K\) satisfying the constraints of \((P_0)\) (see footnote 8), we have
As a consequence, the kernels \((g^k)_{1\le k\le K}\) defined by (8) form a solution of \((P_0)\).
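The inequality invoked above is, in essence, a rescaling argument. Assuming that the constraints of \((P_0)\) only bear on the supports of the kernels and that (8) rescales \((\lambda ^*,\mathbf h^*)\) so that \(g^1*\cdots *g^K=\lambda ^*\, h^{*,1}*\cdots *h^{*,K}\) (this reconstruction is ours, not the authors' exact derivation), it can be sketched as follows: setting
\[
h^k=\frac{f^k}{\Vert f^k \Vert _{2}},
\qquad
\lambda =\prod _{k=1}^K \Vert f^k \Vert _{2}\;\ge \;0 ,
\]
the pair \((\lambda ,\mathbf h)\) satisfies the constraints of \((P_1)\) and \(f^1*\cdots *f^K=\lambda \, h^1*\cdots *h^K\), so the objective of \((P_0)\) at \((f^k)_{1\le k\le K}\) equals the objective of \((P_1)\) at \((\lambda ,\mathbf h)\), which is bounded below by its minimum, attained at \((\lambda ^*,\mathbf h^*)\), that is, at \(\mathbf g\).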
1.3 Proof of Proposition 3
The first item of Proposition 3 can be obtained directly since 1) the sequence of kernels generated by the algorithm belongs to \({\mathcal D}\), which is compact, 2) the objective function of \((P_1)\) is coercive with respect to \(\lambda \) when (13) holds, and 3) the objective function is continuous and decreases during the iterative process.
To prove the second item of Proposition 3, we consider a limit point \((\lambda ^*,\mathbf h^*) \in \mathbb R\times {\mathcal D}\). We denote by \(F\) the objective function of \((P_1)\) and by \((\lambda ^o,\mathbf h^o)_{o\in \mathbb N}\) a subsequence of \((\lambda ^n,\mathbf h^n)_{n\in \mathbb N}\) which converges to \((\lambda ^*,\mathbf h^*)\). The following statements hold trivially, since \(F\) is continuous and \(\left( F(\lambda ^n,\mathbf h^n)\right) _{n\in \mathbb N}\) decreases:
However, if, for every \(k\in \{1,\ldots , K\}\), we have \(C_k^Tu\ne 0\) and the matrix \(C_k\) generated using \(T_k(\mathbf h^*)\) has full column rank, then there exists an open neighborhood of \(T_k(\mathbf h^*)\) in which these conditions remain true for the matrices \(C_k\) generated from kernels \(\mathbf h\). As a consequence, the \(k\)th iteration of the for loop is a continuous mapping on this neighborhood. We deduce that there is a neighborhood of \(\mathbf h^*\) in which \(T\) is continuous.
Since \(T\) is continuous in the vicinity of \(\mathbf h^*\) and \((\mathbf h^o)_{o\in \mathbb N}\) converges to \(\mathbf h^*\), the sequence \((T(\mathbf h^o))_{o\in \mathbb N}\) converges to \(T(\mathbf h^*)\), and (23) guarantees that
As a consequence, denoting \(\mathbf h^*=(h^{*,k})_{1\le k\le K}\), for every \(k\in \{1,\ldots ,K\}\), \(F(\lambda ^*, h^{*,k})\) is equal to the minimal value of \((P_k)\). Since \(C_k\) has full column rank, the minimizer of \((P_k)\) is unique (see the end of Sect. 3.2), and therefore \((\lambda ^*, h^{*,k})\) is this unique minimizer. We then deduce that \((\lambda ^*,\mathbf h^*)=T(\mathbf h^*)\).
Finally, we also know that \((\lambda ^*, h^{*,k})\) is a stationary point of \((P_k)\). Combining the equations stating that, for every \(k\), \((\lambda ^*, h^{*,k})\) is a stationary point of \((P_k)\), we conclude that \((\lambda ^*,\mathbf h^*)\) is a stationary point of \((P_1)\).
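To make the mapping \(T\) discussed above concrete, here is a minimal NumPy sketch (our reconstruction, not the authors' code) of one Gauss–Seidel pass: for each \(k\), the kernels \((h^j)_{j\ne k}\) are frozen, the matrix \(C_k\) is formed column by column, an unconstrained least squares problem of size \(S\) is solved, and \((\lambda ,h^k)\) is recovered by normalization. The exact formulation of \((P_k)\) in the paper may differ in details.

import numpy as np

def circ_conv(a, b):
    # circular convolution, consistent with the periodization convention
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

def embed(coeffs, support, size):
    # place the S coefficients on their known support in R^P
    out = np.zeros(size)
    out[np.asarray(support) % size] = coeffs
    return out

def als_pass(u, alpha, kernels, supports):
    # One pass over k = 1..K; `kernels` is a list of S-vectors, updated in place.
    N, K = u.size, len(kernels)
    lam = 1.0
    for k in range(K):
        # composition of all kernels except the k-th, applied to the code alpha
        rest = alpha.copy()
        for j in range(K):
            if j != k:
                rest = circ_conv(rest, embed(kernels[j], supports[j], N))
        # columns of C_k: response of each coefficient of h^k (brute force)
        Ck = np.stack([circ_conv(rest, embed(e, supports[k], N))
                       for e in np.eye(len(supports[k]))], axis=1)
        v = np.linalg.lstsq(Ck, u, rcond=None)[0]   # least squares in R^S
        lam = np.linalg.norm(v)
        if lam > 0:
            kernels[k] = v / lam                    # enforce ||h^k||_2 = 1
    return lam, kernels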