MoM Compression GPU
Abstract
In this work, we propose a GPU parallel implementation of the random-
ized CUR (or Pseudo Skeleton) Approximation to compress the H-matrices
of linear systems that arise in the discretization of integral equations model-
ing electromagnetic scattering problems. This compression method is highly
parallelizable, in contrast with other similar methods such as the Adaptive
Cross Approximation. It involves dense linear algebra computations that
can be efficiently implemented on a GPU device. Besides, a stochastic con-
vergence criterion is introduced to minimize the communication between the
host and the device. Testing the code with standard cases shows the efficiency
and accuracy of the method.
Keywords: Electric Field Integral Equation; graphics processing unit;
low-rank approximation; Method of Moments; H-matrices; randomized
methods.
1. Introduction
The Method of Moments (MoM) is one of the most widely used techniques
to solve the integral equation formulation of the scattering problem in elec-
tromagnetics. Its main drawback is the cost of the construction, storage and
solution of a dense linear system. This has led to the development of several
fast algorithms such as the Multilevel Fast Multipole Algorithm (MLFMA)
per each ACA iteration) and the small size of the operation performed by
each of them would still entail a large overhead.
In this work, we propose to use a randomized CUR compression method
to extract the maximum performance from GPU devices. The CUR method,
also known as the pseudoskeleton approximation, was introduced by Goreinov
et al. in 1997 [10, 11] and was already applied to compress the impedance matrix
representing the interaction between two distant subdomains in an electro-
magnetic scattering problem in one of the early papers about the method
[10]. It has since been used in a multilevel fashion to compress the MoM
matrix [12]. Like the ACA, this method is purely algebraic and approximates a
rank-deficient matrix by a subset of its rows and columns. However, whereas
the ACA iteratively adds new rows and columns to the approximation ac-
cording to a certain optimality criterion, the randomized CUR stochastically
picks a set of rows and columns at once. Although the ACA achieves a better
compression rate, the randomized CUR offers a higher degree of parallelism
that can be efficiently exploited on GPU devices.
2. Theory
2.1. CUR approximations
Let us consider a matrix A ∈ Cm×n . The pseudoskeleton approximation
aims to find a good representation of A based on r of its rows and columns,
with r small compared to m and n. Let I ≡ {i1 , . . . , ir } be a subset of the
row indices of A, and J ≡ {j1 , . . . , jr } a subset of its column indices. Then,
C ≡ A(:, J) ∈ Cm×r and R ≡ A(I, :) ∈ Cr×n denote submatrices containing
subsets of columns and rows of A, respectively. Let G ≡ A(I, J) be the r × r
submatrix of A determined by the indices I, J. We will refer to G as the
intersection matrix. The CUR (or skeleton) approximation Â is given by

$$\hat{A} \equiv C\,U\,R, \qquad (1)$$

where U ≡ G† (the Moore-Penrose inverse of G). Note that in the particular
case when rank(G) = r, we have G† = G⁻¹.
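For illustration, the following sketch (in C, with real arithmetic and an explicitly stored row-major A purely for readability; in our setting the entries are complex and evaluated on the fly) shows how C, R and the intersection matrix G are assembled from the index sets I and J:

/* Gather the CUR building blocks from a dense row-major m x n matrix A.
   I holds r row indices, J holds r column indices.
   C is m x r, R is r x n and G is r x r (the intersection matrix).       */
void cur_blocks(const double *A, int m, int n,
                const int *I, const int *J, int r,
                double *C, double *R, double *G)
{
    for (int i = 0; i < m; i++)              /* C = A(:, J) */
        for (int k = 0; k < r; k++)
            C[i * r + k] = A[i * n + J[k]];

    for (int k = 0; k < r; k++)              /* R = A(I, :) */
        for (int j = 0; j < n; j++)
            R[k * n + j] = A[I[k] * n + j];

    for (int k = 0; k < r; k++)              /* G = A(I, J) */
        for (int l = 0; l < r; l++)
            G[k * r + l] = A[I[k] * n + J[l]];
}
/* The approximation is then A_hat = C * pinv(G) * R, i.e. U = pinv(G).   */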
Next, we should address the existence of good CUR approximations of A
(in the sense of the norm) and how to select the proper index subsets I, J.
First, let us consider the case when A has exactly rank r. In this case, it is
possible to select I, J so that G is nonsingular and Â = A [10].
However, in many practical cases A is low-rank only from the numerical
point of view, in the sense that there are only r singular values significantly
larger than the numerical noise, and A can be approximated by a low-rank
matrix, Â ≈ A. We say that A is well approximated as a rank-r matrix
with accuracy ε > 0 if there exists a matrix F such that rank(A − F) ≤ r and
‖F‖₂ < ε. Then, according to [10], there exists an approximation of the form
CXR, where C and R are given as in (1) and X ∈ Cr×r, such that

$$\|A - CXR\|_2 = O\!\left(\varepsilon\,\sqrt{r}\,(\sqrt{m} + \sqrt{n})\right). \qquad (2)$$
Note that here the matrix X that links the row and column subset is
not built as the Moore-Penrose inverse of the intersection matrix A(I, J).
This sort of generalized CUR approximation is known as pseudoskeleton
component [10]. In this work, we will exclusively focus on the CUR approach
(1), where U is computed as A(I, J)† . This turns out to be the most practical
and widely used strategy for computing CUR approximations.
There exist different methods for selecting the indices I, J. The problem
of selecting I, J is closely linked with the maximization of the volume of the
intersection matrix G. We define the volume of a matrix as the modulus of
its determinant. It is well known that maximizing the volume of A(I, J) yields
a quasi-optimal CUR approximation [13] (Chapter 7.4). If I, J are such that
A(I, J) is the maximal-volume submatrix, then

$$\|A - CUR\|_{\max} \le (r+1)\,\sigma_{r+1}(A), \qquad (3)$$

where σr+1(A) denotes the (r + 1)-th largest singular value of A and the
‖ · ‖max norm returns the maximum absolute value of the elements of the
matrix. However, finding I, J that maximize the volume of A(I, J) is an
NP-hard problem [14] that requires processing the whole A matrix. Thus,
finding the intersection matrix with maximal volume is often not feasible in
practice.
Instead of finding the maximal volume submatrix, we can employ ap-
proximate algorithms that find suboptimal solutions for this maximization
problem but are much more efficient in computational terms.
The most widely known among them is the Adaptive Cross Approxima-
tion (ACA), which iteratively selects the rows and columns of A that are
expected to contribute most to maximizing the volume of the intersection
matrix by employing a pivoting strategy. Usually, the output of the ACA
algorithm is two matrices U ∈ Cm×r and V ∈ Cr×n such that A ≈ UV. How-
ever, this decomposition can alternatively be represented in the form (1). It
is also worth noting that the ACA ensures that the intersection matrix is full-
rank. This algorithm starts by selecting an arbitrary row (or column) of the
matrix to be compressed. Then, it finds its element with maximum absolute
value, and adds the corresponding column (or row) to the compressed rep-
resentation. This is iteratively repeated until a certain error criterion is met
or the intersection matrix becomes singular.
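To make the sequential nature of this procedure explicit, the sketch below outlines a common variant, ACA with partial pivoting, in real arithmetic and with an explicitly stored A for brevity; the pivot-row bookkeeping and the stopping test are simplified with respect to production implementations, and the name aca_partial_pivot is ours.

#include <math.h>

/* Partially pivoted ACA sketch: builds A ~ U*V with U (m x max_rank,
   row-major) and V (max_rank x n, row-major). Every new row and column
   depends on all previously computed ones, which is what makes the
   method inherently sequential. Returns the achieved rank.               */
int aca_partial_pivot(const double *A, int m, int n,
                      double *U, double *V, int max_rank, double tol)
{
    int i = 0, k = 0;                  /* current pivot row, current rank  */
    while (k < max_rank) {
        /* residual of row i: A(i,:) minus the contribution of previous terms */
        for (int j = 0; j < n; j++) {
            double s = A[i * n + j];
            for (int l = 0; l < k; l++) s -= U[i * max_rank + l] * V[l * n + j];
            V[k * n + j] = s;
        }
        int jp = 0;                    /* column pivot: largest |residual| */
        for (int j = 1; j < n; j++)
            if (fabs(V[k * n + j]) > fabs(V[k * n + jp])) jp = j;
        double piv = V[k * n + jp];
        if (fabs(piv) < tol) break;    /* residual negligible: stop        */
        for (int j = 0; j < n; j++) V[k * n + j] /= piv;

        /* residual of column jp becomes the new column of U */
        for (int p = 0; p < m; p++) {
            double s = A[p * n + jp];
            for (int l = 0; l < k; l++) s -= U[p * max_rank + l] * V[l * n + jp];
            U[p * max_rank + k] = s;
        }
        /* next pivot row: largest |U(:,k)| (a full implementation also
           excludes previously used rows and uses a norm-based stopping test) */
        int ip = (i == 0 && m > 1) ? 1 : 0;
        for (int p = 0; p < m; p++)
            if (p != i && fabs(U[p * max_rank + k]) > fabs(U[ip * max_rank + k])) ip = p;
        i = ip;
        k++;
    }
    return k;
}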
It is also possible to select I, J in a random way [15]. Randomized meth-
ods for compressing low-rank matrices have received great attention in
recent years. Although in some cases they require a larger number of op-
erations than deterministic methods, they naturally adapt to parallel com-
putational environments, both at the higher level (multicore environments)
and at the lower one (GPUs and SIMD machines). For instance, techniques
based on using random embeddings (projections onto randomly chosen sub-
spaces) have been widely used in applications involving huge amounts of data
[16, 17]. However, these techniques are only efficient when all the elements
of the matrix A are known beforehand, or at least can be obtained in a very
easy way. In our case, the cost of computing the elements of the matrix that
we intend to compress is high (Section 2.2), so we opt for the CUR approxi-
mation, which requires computing only a small number of the elements of the
matrix.
The randomly chosen indices I, J can be obtained by uniform random
sampling or by employing a probability distribution that takes into account
some properties of the physical problem. In our case, we choose a uniform
sampling strategy. There exist theoretical results about the expected error
for this kind of selection of I, J based on uniform random sampling [18].
However, since stating these results requires a significant mathematical
apparatus and they would not provide substantial insight into our particular
case, we omit them and refer the interested reader to the original
work [18]. Nevertheless, we should highlight two features of this strategy: 1) the
randomly selected I, J lead to worse CUR approximations (in the
sense of the norm) than methods that make a significant effort to maximize
the volume of the intersection matrix, and 2) the intersection matrix A(I, J)
can easily become ill-conditioned or singular, but this is not a fundamental
limitation since U is computed as its Moore-Penrose inverse.
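A minimal way to draw such an index set, assuming plain uniform sampling without replacement via a partial Fisher-Yates shuffle (the helper name sample_indices is ours; any unbiased sampler would do):

#include <stdlib.h>

/* Pick r distinct indices uniformly at random from {0, ..., n-1} using a
   partial Fisher-Yates shuffle. The caller seeds rand() beforehand.      */
void sample_indices(int n, int r, int *out)
{
    int *pool = (int *)malloc(n * sizeof(int));
    for (int i = 0; i < n; i++) pool[i] = i;
    for (int k = 0; k < r; k++) {
        int j = k + rand() % (n - k);       /* uniform in [k, n-1]         */
        int tmp = pool[k]; pool[k] = pool[j]; pool[j] = tmp;
        out[k] = pool[k];
    }
    free(pool);
}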
Thus, choosing a compression algorithm for low-rank matrices entails
finding a compromise between the accuracy of the compression and the com-
putational effort required to obtain it. Among the CUR strategies, finding
the maximal-volume intersection is probably optimal in terms of compression
quality, but it is prohibitively costly. Greedy adaptive algorithms such as the ACA
offer a very good compression rate at a much smaller computational cost. In
the case of randomized methods, the cost of obtaining I, J is very small, of-
ten negligible. However, their main advantage does not lie in the reduced
cost of computing the indices, but in the potential parallelism when
computing the C, U and R matrices. The Adaptive Cross Approximation,
for instance, is sequential in nature, since every row or column is computed
only after the previous one has been processed. In contrast, the randomized
selection allows all the rows and columns of C and R to be computed
independently.
$$L(J) = \int_S \left( -j\omega\mu\, G(r, r')\, J(r') - \frac{\nabla' G(r, r')}{j\omega\varepsilon}\, \nabla'\!\cdot J(r') \right) dS' \qquad (4)$$

and

$$G(r, r') = \frac{\exp(-jk\,|r - r'|)}{4\pi\,|r - r'|}. \qquad (5)$$
k = 2π/λ is the free-space wavenumber associated with wavelength λ and
ε, µ are the electric permittivity and the magnetic permeability, respectively.
After imposing the boundary condition at the perfectly conducting object
surface,

n̂ × E(r) = n̂ × ( Einc(r) + Es(r) ) = 0,

the EFIE is obtained as an operator equation (6) for the unknown induced surface current Js.
In order to solve (6), we discretize it into a system of linear algebraic equa-
tions with the Method of Weighted Residuals, also known as the Method of
Moments (MoM) [20] [21]. This method consists of (i) expanding the un-
known Js as a linear combination of N basis functions fj(r) and (ii) project-
ing both sides of the equation onto a set of N weighting functions wi(r).
Here, the choice is wi(r) = fi(r), which is commonly known as the Galerkin
method [20].
The result is a linear system
ZJ = −Einc . (7)
and fj (r) functions [28] [29]. Therefore, the rank of the sub-matrix given by
the set of (i, j) indices in the field and source domains is much smaller than
the order of the sub-matrix. The smaller the subdomains are or the further
apart they are, the more compressible is the corresponding sub-matrix. Sub-
matrices corresponding to neighbor (and self) domains are full-rank, so they
are recursively subdivided as detailed in [2]: non-neighbor domains at the
next lower level are compressed, and neighbor and self ones are subdivided
again until a certain size is reached. The compression is carried out with the
techniques described in Section 2.1.
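Schematically, this subdivision can be organized as the following recursion (a sketch under simplifying assumptions: the admissibility test, the bisection-based splitting rule and the minimum block size stand in for the geometric criteria detailed in [2], and all names are illustrative):

/* Schematic recursion over the block structure of Z. 'admissible' stands
   for the geometric criterion (non-neighbor, well-separated subdomains),
   'compress' applies a low-rank (e.g. R-CUR) approximation and
   'store_dense' keeps a full-rank leaf block.                             */
typedef struct { int row0, row1, col0, col1; } Block;   /* index ranges    */

void build_blocks(Block b, int leaf_size,
                  int  (*admissible)(Block),
                  void (*compress)(Block),
                  void (*store_dense)(Block))
{
    int rows = b.row1 - b.row0, cols = b.col1 - b.col0;
    if (admissible(b)) { compress(b); return; }          /* distant: compress */
    if (rows <= leaf_size || cols <= leaf_size) {        /* small: keep dense */
        store_dense(b);
        return;
    }
    int rm = b.row0 + rows / 2, cm = b.col0 + cols / 2;  /* split into 2 x 2  */
    Block children[4] = {
        { b.row0, rm, b.col0, cm }, { b.row0, rm, cm, b.col1 },
        { rm, b.row1, b.col0, cm }, { rm, b.row1, cm, b.col1 } };
    for (int k = 0; k < 4; k++)
        build_blocks(children[k], leaf_size, admissible, compress, store_dense);
}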
3. Methods
We employ a GPU-accelerated randomized CUR algorithm (R-CUR) to
efficiently compress the low-rank sub-blocks of the impedance matrix Z in
(7). The low-rank approximation is based on the CUR method described in
Section 2.1, with index selection based on uniform random sampling. In this
section we describe the basic R-CUR algorithm and its GPU implementation.
The basic R-CUR algorithm randomly selects the indices (I, J) and then
computes C = A(:, J), R = A(I, :) and U = G† = A(I, J)†.
An obvious limitation of this method is that it requires a priori knowledge
of the rank r of the CUR approximation. In order to overcome this limita-
tion, we implement an adaptive version of the method. We start by setting
an arbitrary (but typically small) guess for the rank r and compute the cor-
responding CUR approximation. Then, we iteratively double the size of the
rank r and repeat the same operation. In order to check for convergence,
we compute at each iteration step the product of the approximate matrix
Â and a random Gaussian vector v, and compare the result with the one of
the previous iteration. When the relative error between the approximation
of Av computed for the current iteration and for the previous one falls below
a certain threshold we consider that the quality of the CUR approximation
is good enough and stop the iterative procedure (Algorithm 1).
One may argue that this adaptive procedure goes against the guiding
principle of the method, which consists in exploiting the hardware capabilities
through highly parallel operations. Although this is true, Algorithm 1 requires
a small number of iterations with a relatively large amount of parallel work
per iteration. To obtain a compressed approximation of rank r, the random-
ized CUR method requires log₂ r iterations with O(2^k (m + n)) operations on
iteration k, whereas classical adaptive methods such as the ACA would require r iter-
ations with O(m + n) operations per iteration. Thus, this adaptive strategy
of iteratively doubling r preserves a large amount of parallelism
within each iteration.
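A quick check of the total cost confirms this trade-off: summing the per-iteration work over the doubling steps gives a geometric series,

$$\sum_{k=0}^{\log_2 r} O\!\left(2^{k}(m+n)\right) = O\!\left(2r\,(m+n)\right) = O\!\left(r\,(m+n)\right),$$

so the adaptive variant evaluates at most roughly twice as many matrix entries as a fixed-rank randomized CUR that knew the final rank r in advance, while keeping each iteration fully parallel.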
Regarding the initial rank r0 , we heuristically choose r0 = max(1, min(m, n)/100).
In this way, for a large square matrix, the initial compression rank corre-
sponds to 1% of the matrix order. Indeed, it is possible to estimate be-
forehand the rank of the sub-matrix to be compressed, since it corresponds
to the number of degrees of freedom of E(r) and it is theoretically described
in [29]. However, we have preferred to omit this possibility in the current
version of our implementation in order to draw conclusions that are also valid
for more general cases, where no information about the rank is available.
Note that when r grows to min(m, n), we are no longer applying any
compression to the original matrix A. In that case, we directly compute
the whole matrix A.
In order to make Algorithm 1 more efficient, we could reuse the rows and
columns computed at every iteration. However, we avoid this in our current
implementation since it would imply keeping the kernel calls in Algorithm 1
Algorithm 1 Randomized CUR
Input: Mesh data, initial rank r0, error threshold ε
Output: C, U and R (approximation of A ∈ Cm×n)
r ← r0
err ← ∞
Initialize random test vector v ∈ Cn
Initialize product vector p ∈ Cm
while err > ε and r < min(m, n) do
    Randomly select I = (i1, . . . , ir) ⊂ (1, . . . , m)
    Randomly select J = (j1, . . . , jr) ⊂ (1, . . . , n)
    Compute C ← A(:, J), G ← A(I, J), R ← A(I, :)
    Compute U ← G†
    pnew ← C U R v
    err ← ‖pnew − p‖/‖pnew‖
    p ← pnew
    r ← 2r
end while
in the same scope as the main loop, which entails some code limitations, or
resorting to complex pointer manipulations. An optimal implementation
would need half the number of Z-matrix element computations, which does
not affect the asymptotic complexity of the method.
Regarding the cost of computing the C U R v product at each iteration, we
should notice that it is small compared with the computation of the C, U
and R matrices, which require evaluating the integral (8) for each matrix en-
try. Also, computing the Moore-Penrose inverse of G is not computationally
expensive since G has size r × r and r ≪ m, n.
The compressed blocks of Z are computed on the GPU and remain in the device memory with no
need to transfer them to the CPU. Indeed, the system (7) can be efficiently
solved by an iterative solver that computes the products between the com-
pressed matrix and the iterate vectors in the GPU itself. We will only need
to transfer the final result of (7) to the CPU. The fact that Algorithm 1 only
requires log2 r iterations also helps to reduce the communication between the
host and the device.
Figure 2: Schematic representation of the procedure for obtaining the CUR approximation
of a matrix or matrix block (Algorithm 1). Data transfer between the host and the device
is minimal.
In the first step, the host (CPU) computes the mesh data, which comprises the
location of the vertices, information about the edges connecting them as well
as other physical parameters. This information is transferred to the GPU.
This constitutes the largest data transfer between the host and the device,
assuming the C, U and R matrices are not sent to the CPU.
Then, the host launches the kernels that compute the C, G and R ma-
trices. This is the most costly operation of the whole procedure. Computing
these matrices implies evaluating the integral (8) for every entry of the sub-
matrix. We achieve this with an in-house developed CUDA kernel that assigns
a CUDA grid to the submatrix, so that each thread is responsible for evalu-
ating (8) to fill the corresponding element of the submatrix.
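As an illustration of this thread-to-entry mapping, the following sketch shows what such a fill kernel could look like. Here eval_impedance_entry is a hypothetical device function standing in for the numerical evaluation of the integral (8) from the mesh data (its body is a placeholder), and column-major storage is chosen so that the result can be passed directly to cuBLAS; the actual in-house kernel may differ in its details.

#include <cuComplex.h>

/* Hypothetical device routine standing in for the evaluation of the MoM
   integral (8) between testing function 'test_idx' and basis function
   'basis_idx', reading the geometry from the mesh data resident in
   device memory. The body is a placeholder.                               */
__device__ cuDoubleComplex eval_impedance_entry(const double *mesh,
                                                int test_idx, int basis_idx)
{
    return make_cuDoubleComplex(0.0, 0.0);   /* evaluation of (8) goes here */
}

/* One thread per (row, col) entry of the sub-matrix being filled. 'rows'
   and 'cols' hold the global indices selected for this block (e.g. the
   index sets I or J, or full ranges when filling C and R).                */
__global__ void fill_submatrix(cuDoubleComplex *block, int nrows, int ncols,
                               const int *rows, const int *cols,
                               const double *mesh)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;   /* local row index    */
    int j = blockIdx.x * blockDim.x + threadIdx.x;   /* local column index */
    if (i < nrows && j < ncols)
        block[j * nrows + i] =                       /* column-major       */
            eval_impedance_entry(mesh, rows[i], cols[j]);
}

/* Launch with square T x T thread blocks (the "threads per block side"
   parameter studied in Section 4), e.g.:
     dim3 tb(T, T);
     dim3 grid((ncols + T - 1) / T, (nrows + T - 1) / T);
     fill_submatrix<<<grid, tb>>>(d_block, nrows, ncols, d_rows, d_cols, d_mesh);  */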
Although filling the matrices is a highly parallel operation, the threads
should access the data structures containing the mesh information in a con-
current way. Besides, the memory access pattern is rather unpredictable,
since computing contiguous elements of Z does not imply accessing con-
tiguous memory addresses in the data structures that hold the mesh data.
This aspect is left for future study and optimization; however, it does not
seem to be an efficiency bottleneck. The fact that the accesses to the mesh
data structure are read-only operations implies that no coherence
mechanisms are necessary, which improves the overall efficiency of the pro-
cess.
Also note that C, G and R have some elements in common (indeed, G
contains the elements that appear in both C and R). However, since G is
typically small compared with C and R, it is more efficient to recompute the
common elements rather than copying them into a different device memory
location.
The next step is computing U = G† , the Moore-Penrose inverse of the
intersection matrix. We also perform this operation in the GPU device.
Although G is relatively small and the computation of its pseudoinverse
would not require the GPU's capabilities, transferring G to the CPU
entails a significant cost in terms of time, so it is overall cheaper to compute it on
the device.
In order to obtain the Moore-Penrose inverse of G, we first compute its
singular value decomposition (SVD) as G = WL Σ WR∗, where WL and WR are
unitary matrices and Σ is a diagonal matrix containing the singular values
σi(G) of G. Then, G† = WR Σ† WL∗. The Moore-Penrose inverse of a diagonal
matrix like Σ is a diagonal matrix whose (i, i) element is 1/σi if σi ≠ 0
and 0 otherwise. In order to avoid numerical instabilities, we replace the
diagonal elements of Σ by 0 if they fall below a certain threshold value, in
our case 10⁻¹⁰.
The singular value decomposition is computed with the cusolverDnZgesvd
function from the cuSolver library. The inversion of the diagonal elements of
Σ that lie above the threshold, yielding Σ†, is performed
by an in-house CUDA kernel. Then, we compute the products in
WR Σ† WL∗ with the cublasZgemm function from the cuBLAS library.
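The thresholded inversion of the singular values is a trivially parallel operation; a kernel of the following form (a sketch of what such an in-house kernel might look like, with illustrative names) suffices, after which the two cublasZgemm calls assemble G† = WR Σ† WL∗.

#include <cuComplex.h>

/* sigma: the r singular values of G (real, non-negative, as returned by
   cusolverDnZgesvd). Writes the diagonal pseudo-inverse Sigma^+ as a
   complex r x r column-major matrix, ready to be fed to cublasZgemm.
   Values not larger than 'thresh' (1e-10 in our runs) are treated as zero. */
__global__ void invert_singular_values(const double *sigma, int r,
                                       double thresh,
                                       cuDoubleComplex *SigmaPinv)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= r * r) return;
    int row = idx % r, col = idx / r;                /* column-major (row, col) */
    double val = (row == col && sigma[row] > thresh) ? 1.0 / sigma[row] : 0.0;
    SigmaPinv[idx] = make_cuDoubleComplex(val, 0.0);
}

/* Launch example:
     int threads = 256, blocks = (r * r + threads - 1) / threads;
     invert_singular_values<<<blocks, threads>>>(d_sigma, r, 1e-10, d_SigmaPinv); */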
Once we have computed the CUR approximation, we perform the product
pnew = C U R v to assess the compression error. Note that all the operands
C, U, R, pnew, v reside in the device memory, so no expensive transfer
operations are required. The multiplication is computed, again, with the
cublasZgemm function. This operation is performed in several
steps following the order indicated by the parentheses in C(U(Rv)), so that
we execute the minimum number of operations. Then, we compare the pnew
vector with the approximation p from the previous iteration by computing
‖p − pnew‖/‖pnew‖. The device returns the value of the relative error (a single
scalar) to the host. If it is greater than the threshold value, we double
the size of r and start the procedure again.
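A sketch of this convergence check is given below, assuming column-major C (m × r), U (r × r), R (r × n) and the vectors already resident in device memory, a valid cuBLAS handle, and no error handling; the helper name curv_error and the scratch vectors t1, t2 are ours.

#include <cublas_v2.h>
#include <cuComplex.h>

/* Computes pnew = C*(U*(R*v)) and err = ||pnew - p|| / ||pnew||, then
   overwrites p with pnew for the next iteration. t1 and t2 are scratch
   vectors of length r; all pointers are device memory.                    */
double curv_error(cublasHandle_t h, int m, int n, int r,
                  const cuDoubleComplex *C, const cuDoubleComplex *U,
                  const cuDoubleComplex *R, const cuDoubleComplex *v,
                  cuDoubleComplex *p, cuDoubleComplex *pnew,
                  cuDoubleComplex *t1, cuDoubleComplex *t2)
{
    const cuDoubleComplex one  = make_cuDoubleComplex( 1.0, 0.0);
    const cuDoubleComplex zero = make_cuDoubleComplex( 0.0, 0.0);
    const cuDoubleComplex neg1 = make_cuDoubleComplex(-1.0, 0.0);

    /* t1 = R*v, t2 = U*t1, pnew = C*t2 (each as a gemm with one column)   */
    cublasZgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, r, 1, n, &one, R, r, v,  n, &zero, t1,   r);
    cublasZgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, r, 1, r, &one, U, r, t1, r, &zero, t2,   r);
    cublasZgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, m, 1, r, &one, C, m, t2, r, &zero, pnew, m);

    double num, den;
    cublasDznrm2(h, m, pnew, 1, &den);           /* ||pnew||                */
    cublasZaxpy(h, m, &neg1, pnew, 1, p, 1);     /* p <- p - pnew           */
    cublasDznrm2(h, m, p, 1, &num);              /* ||p - pnew||            */
    cublasZcopy(h, m, pnew, 1, p, 1);            /* p <- pnew               */
    return (den > 0.0) ? num / den : 0.0;
}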
All the vectors and matrices involved in the process are based on the
cuDoubleComplex type, which represents double precision complex numbers,
except for the vector containing the singular values σi(G) ∈ R, which is
based on double precision real numbers.
This procedure is intended to compute the CUR approximation of a single
sub-block of the impedance matrix Z. We could also exploit block-level
parallelism by compressing several blocks at the same time through con-
current CUDA calls, or by using multiple GPUs communicating via MPI, if
the problem is computationally large. However, since the objective of this
work is to test the GPU-accelerated implementation of the randomized CUR
method, these possible improvements are left for future work.
4. Numerical Results
In this section we provide some numerical results for the randomized CUR
technique. The GPU device used in these experiments is an NVIDIA Quadro
RTX 5000 with 3072 CUDA cores and 16 GB of RAM, and the CPU is an Intel
Xeon Silver 4214R with 24 cores at 2.4 GHz and 384 GB of RAM. The
code is written in Julia and MATLAB except the routines that are executed
in the GPU, which are written in CUDA C. The routine that computes the
elements of Z in the CPU (in order to compare its performance with the
GPU) is written in pure C. The codes are freely available at [30].
Figure 3: Mesh discretization of the two perfectly conducting spheres of radius 1.0 m and
separated by a distance of 12.0 m. Since the spheres are at a relatively large distance, the
matrix block representing the interaction between them is rank-deficient.
insight about its advantages and limitations by comparing it with ACA,
which is the most common kernel-independent compression method.
We consider the compression of submatrix A representing the interaction
between the two spheres. We analyze different cases with a fixed mesh con-
figuration of 12288 basis functions per sphere and let the wavelength take
different values, so that we can study the behavior of the algorithm under two
mesh resolutions: the wavelength is varied so that the average edge length
becomes λ/10 and λ/20. A mesh resolution of λ/10 is considered suitable
for most applications, and λ/20 corresponds to an overdiscretized case. Ta-
bles 1 and 2 summarize the compression error for both algorithms. The
computation time is similar for both cases.
A postcompression of the obtained CUR and ACA approximations can
be computed by applying QR and SVD decompositions to the approximant
matrices [2]. Since these matrices are already small, this procedure implies a
small computational effort and its cost can be considered negligible. We will
apply this strategy to both the CUR and ACA approximations.
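For reference, a standard way to carry out such a recompression of a factorization Â = C U R (stated here in general terms; the exact procedure followed in [2] may differ in its details) is

$$C U = Q_C T_C, \qquad R^{H} = Q_R T_R \quad (\text{thin QR factorizations}),$$
$$T_C T_R^{H} = W \Sigma V^{H} \quad (\text{SVD of an } r \times r \text{ matrix}),$$
$$\hat{A} = (Q_C W)\,\Sigma\,(Q_R V)^{H} \approx (Q_C W_{\tilde{r}})\,\Sigma_{\tilde{r}}\,(Q_R V_{\tilde{r}})^{H},$$

where the truncation rank r̃ ≤ r is chosen according to the decay of the singular values in Σ. All the factorizations involve matrices with at most r columns, which is why the cost of this step is negligible compared with building C, U and R.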
Figure 4: Comparison of randomized CUR and ACA compression for two different mesh
resolutions. The horizontal axis represents the number of rows of C and columns of R
involved on the approximation, that is, the numerical rank r of A. The vertical axis
shows the relative error of the approximation. ACA needs a smaller r than CUR to reach
the same error. After postcompression, the compression ratios of both methods become
comparable.
Figure 4 shows the relative error for the randomized CUR and ACA ap-
proximations depending on the numerical rank r of the approximation. We
observe that cases with a finer discretization (such as λ/20) require a smaller num-
ber of elements to reach a certain compression error, since overdiscretization
CUR λ/10 ACA λ/10
r err r err
13 (13) 7.99E-2 (0.10) 8 (8) 0.13 (0.13)
62 (48) 3.24E-4 (3.37E-4) 42 (40) 1.06E-3 (1.06E-3)
123 (88) 2.56E-6 (2.58E-6) 84 (79) 8.39E-6 (8.24E-6)
615 (132) 8.47E-9 (8.60E-9) 149 (137) 2.66E-8 (2.67E-8)
Table 1: Comparison of CUR and ACA algorithms for the λ/10 discretization. The table
shows the compression rank r and the relative error ‖Â − A‖/‖A‖. The values in parentheses
refer to the case where postcompression has been applied.
Table 2: Comparison of CUR and ACA algorithms for the λ/20 discretization.
tion that can be efficiently postcompressed later.
Figure 5: Execution time for the GPU-accelerated compression of a matrix block rep-
resenting the interaction between the two spheres for wavelength values of 2 m and 1 m, which
correspond to discretizations with 3072 and 12288 edges, respectively. The horizontal axis
represents the lateral size T of a square thread block. In total, the thread block has T²
threads. Note that the total number of threads involved in the execution is constant and
does not depend on the block size.
Figure 5 shows the time required for compressing the matrix blocks for
the two different cases and for several thread block sizes, and Figure 6 shows
Figure 6: Speedup of the GPU implementation with respect to the CPU for the example in Figure 5.
the speedup with respect to the same operation performed on a CPU, which
takes 12.19 s for λ = 2 m and 287 s for λ = 1 m. For the sake of comparison,
obtaining an equivalent compressed representation with ACA on CPU takes
14.49 s for the λ = 2 m case and 314.3 s for the λ = 1 m case.
The best speedup is achieved with the largest sub-matrix, with λ = 1 m
and a size of 12288 × 12288 elements. The performance improves when we
increase the number of threads per block, although it quickly saturates and
reaches a speedup of 30 with respect to the CPU. The smallest size case (λ =
2 m and 3072 mesh elements per sphere) exhibits a smaller speedup, due to
the larger impact that the overhead of certain operations may have on smaller
computations. Electrically large boxes often appear in large computational
problems in which it is beneficial to use matrix compression. Larger blocks
are expected to show even better speedups.
The multiplication between the compressed block and a vector can
also be efficiently carried out on the GPU with standard libraries such as cuBLAS. It
consists of the successive matrix products in C U R v. In our experiments, we
also obtain speedup values of around 30.
Figure 7: Surface current of the NASA almond normalized with the incident magnetic field
amplitude, in logarithmic scale (dB). The incident field propagates towards the −x̂ direc-
tion and is linearly polarized with the E-field parallel to the y-axis. The operating frequency
is 7.895 GHz (λ ≈ 0.038 m).
Figure 8: NASA almond bistatic radar cross section in the E-plane vs observation angle θ
for a 7.895 GHz y-polarized incident plane wave propagating in the −x direction. The
GMRES iterative solutions of the R-CUR compressed matrix (Rand. CUR) and of the ACA
compressed matrix (ACA) are compared with the direct solution of the uncompressed linear
system matrix (Exact MoM). Direction θ = 0◦ corresponds to the back scattering case
and θ = 180◦ corresponds to forward scattering.
5. Conclusions
This work shows the advantages of the randomized CUR approximation
for compressing the H-matrices of linear systems that arise in the discretiza-
tion of integral equations modeling electromagnetic scattering problems. The
main advantage of this method is its inherent parallelism, which allows it to
be efficiently implemented in massively parallel computing environments. In
particular, the fine-grained parallelism of the algorithm makes it very suit-
able for graphics processing units (GPU). Besides, since the method is purely
algebraic it can also be applied to other problems of physics and engineering
that lead to linear systems with compressible H-matrices.
Numerical examples are presented to show the performance of the ran-
domized method. Unlike other compression methods such as the ACA, the parallel
nature of the randomized CUR enables great computational effi-
ciency (a 30x speedup factor after parallelization on the GPU). Besides, it remains
competitive with ACA in serial CPU implementations. Also, we show that
applying postcompression to the R-CUR approximation leads to an excellent
compression ratio.
A final example shows the excellent accuracy of the randomized CUR when
solving the whole linear system for a standard electromagnetic scattering
benchmark.
6. Acknowledgements
This work was partly funded by the Ministerio de Ciencia e Innovacion
(MICINN) under projects PID2019-107885GB-C31 / AEI / 10.13039/501100011033,
PID2020-113832RB-C21 / AEI / 10.13039/501100011033 and PID2020-118410RB-
C21 / AEI / 10.13039/501100011033, and Catalan Research Group 2017 SGR
219 and grant 2021 FI B2 00096.
7. References
[1] J. Song, C. Lu, W. C. Chew, Multilevel fast multipole algorithm for
electromagnetic scattering by large complex objects, IEEE Transactions
on Antennas and Propagation 45 (1997) 1488–1493.
[9] W. C. Gibson, Efficient solution of electromagnetic scattering problems
using multilevel adaptive cross approximation and LU factorization, IEEE
Transactions on Antennas and Propagation 68 (5) (2020) 3815–3823.
doi:10.1109/TAP.2019.2963619.
[17] M. W. Mahoney, P. Drineas, CUR matrix decompositions for improved
data analysis, Proceedings of the National Academy of Sciences 106 (3)
(2009) 697–702. doi:10.1073/pnas.0803205106.
URL https://pnas.org/doi/full/10.1073/pnas.0803205106
[27] L. Grasedyck, W. Hackbusch, Construction and arithmetics of H-
matrices, Computing 70 (4) (2003) 295–334. doi:10.1007/s00607-003-
0019-1.
URL https://doi.org/10.1007/s00607-003-0019-1
[28] W. C. Chew, J.-M. Jin, C.-C. Lu, E. Michielssen, J. Song, Fast solu-
tion methods in electromagnetics, IEEE Transactions on Antennas and
Propagation 45 (3) (1997) 533–543. doi:10.1109/8.558669.