MoM Compression GPU
Abstract
In this work, we propose a GPU parallel implementation of the random-
ized CUR (or Pseudo Skeleton) Approximation to compress the H-matrices
of linear systems that arise in the discretization of integral equations model-
ing electromagnetic scattering problems. This compression method is highly
parallelizable, in contrast with other similar methods such as the Adaptive
Cross Approximation. It involves dense linear algebra computations that
can be efficiently implemented on a GPU device. Besides, a stochastic con-
vergence criterion is introduced to minimize the communication between the
host and the device. Testing the code with standard cases shows the efficiency
and accuracy of the method.
Keywords: Electric Field Integral Equation; graphics processing unit;
low-rank approximation; Method of Moments; H-matrices; randomized
methods.
1. Introduction
The Method of Moments (MoM) is one of the most widely used techniques
to solve the integral equation formulation of the scattering problem in elec-
tromagnetics. Its main drawback is the cost of the construction, storage and
solution of a dense linear system. This has led to the development of several
fast algorithms such as the Multilevel Fast Multipole Algorithm (MLFMA)
per each ACA iteration) and the small size of the operation performed by
each of them would still entail a large overhead.
In this work, we propose to use a randomized CUR compression method
to extract the maximum performance from GPU devices. The CUR method,
also known as the pseudoskeleton approximation, was introduced by Goreinov
et al. in 1997 [10, 11] and was already applied to compress the impedance matrix
representing the interaction between two distant subdomains in an electro-
magnetic scattering problem in one of the early papers about the method
[10]. It has since been used in a multilevel fashion to compress the MoM
matrix [12]. Like the ACA, this method is purely algebraic and approximates a
rank-deficient matrix by a subset of its rows and columns. However, whereas
the ACA iteratively adds new rows and columns to the approximation ac-
cording to a certain optimality criterion, the randomized CUR stochastically
picks a set of rows and columns at once. Although the ACA achieves a better
compression rate, the randomized CUR offers a higher degree of parallelism
that can be efficiently exploited on GPU devices.
2. Theory
2.1. CUR approximations
Let us consider a matrix A ∈ Cm×n . The pseudoskeleton approximation
aims to find a good representation of A based on r of its rows and columns,
with r small compared to m and n. Let I ≡ {i1 , . . . , ir } be a subset of the
row indices of A, and J ≡ {j1 , . . . , jr } a subset of its column indices. Then,
C ≡ A(:, J) ∈ Cm×r and R ≡ A(I, :) ∈ Cr×n denote submatrices containing
subsets of columns and rows of A, respectively. Let G ≡ A(I, J) be the r × r
submatrix of A determined by the indices I, J. We will refer to G as the
intersection matrix. The CUR (or skeleton) approximation Â is given by

$$\hat{A} \equiv C\,U\,R, \qquad (1)$$

where U ≡ G† (the Moore-Penrose inverse of G). Note that in the particular
case when rank(G) = r, we have G† = G⁻¹.
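For illustration, the following sketch (in C, with real arithmetic and an explicitly stored row-major A purely for readability; in our setting the entries are complex and evaluated on the fly) shows how C, R and the intersection matrix G are assembled from the index sets I and J:

/* Gather the CUR building blocks from a dense row-major m x n matrix A.
   I holds r row indices, J holds r column indices.
   C is m x r, R is r x n and G is r x r (the intersection matrix).       */
void cur_blocks(const double *A, int m, int n,
                const int *I, const int *J, int r,
                double *C, double *R, double *G)
{
    for (int i = 0; i < m; i++)              /* C = A(:, J) */
        for (int k = 0; k < r; k++)
            C[i * r + k] = A[i * n + J[k]];

    for (int k = 0; k < r; k++)              /* R = A(I, :) */
        for (int j = 0; j < n; j++)
            R[k * n + j] = A[I[k] * n + j];

    for (int k = 0; k < r; k++)              /* G = A(I, J) */
        for (int l = 0; l < r; l++)
            G[k * r + l] = A[I[k] * n + J[l]];
}
/* The approximation is then A_hat = C * pinv(G) * R, i.e. U = pinv(G).   */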
Next, we should address the existence of good CUR approximations of A
(in the sense of the norm) and how to select the proper index subsets I, J.
First, let us consider the case when A has exactly rank r. In this case, it is
possible to select I, J so that G is nonsingular and Â = A [10].
However, in many practical cases A is low-rank only from the numerical
point of view, in the sense that there are only r singular values significantly
larger than the numerical noise, and A can be approximated by a low-rank
matrix, Â ≈ A. We say that A is well approximated as a rank-r matrix
with accuracy ε > 0 if there exists a matrix F such that rank(A − F) ≤ r and
‖F‖₂ < ε. Then, according to [10], there exists an approximation of the form
CXR, where C and R are given as in (1) and X ∈ Cr×r, such that

$$\|A - CXR\|_2 = O\!\left(\varepsilon\,\sqrt{r}\,(\sqrt{m} + \sqrt{n})\right). \qquad (2)$$
Note that here the matrix X that links the row and column subset is
not built as the Moore-Penrose inverse of the intersection matrix A(I, J).
This sort of generalized CUR approximation is known as pseudoskeleton
component [10]. In this work, we will exclusively focus on the CUR approach
(1), where U is computed as A(I, J)† . This turns out to be the most practical
and widely used strategy for computing CUR approximations.
There exist different methods for selecting the indices I, J. The problem
of selecting I, J is closely linked with the maximization of the volume of the
intersection matrix G. We define the volume of a matrix as the modulus of
its determinant. It is well known that maximizing the volume of A(I, J) yields
a quasi-optimal CUR approximation [13] (Chapter 7.4). If I, J are such that
A(I, J) is the maximal-volume submatrix, then

$$\|A - CUR\|_{\max} \le (r+1)\,\sigma_{r+1}(A), \qquad (3)$$

where σr+1(A) denotes the (r + 1)-th largest singular value of A and the
‖ · ‖max norm returns the maximum absolute value of the elements of the
matrix. However, finding I, J that maximize the volume of A(I, J) is an
NP-hard problem [14] that requires processing the whole A matrix. Thus,
finding the intersection matrix with maximal volume is often not feasible in
practice.
Instead of finding the maximal volume submatrix, we can employ ap-
proximate algorithms that find suboptimal solutions for this maximization
problem but are much more efficient in computational terms.
The most widely known among them is the Adaptive Cross Approxima-
tion (ACA), which iteratively selects the rows and columns of A that are
expected to contribute most to maximizing the volume of the intersection
matrix by employing a pivoting strategy. Usually, the output of the ACA
algorithm is two matrices U ∈ Cm×r and V ∈ Cr×n such that A ≈ UV. How-
ever, this decomposition can alternatively be represented in the form (1). It
is also worth noting that the ACA ensures that the intersection matrix is full-
rank. This algorithm starts by selecting an arbitrary row (or column) of the
matrix to be compressed. Then, it finds its element with maximum absolute
value, and adds the corresponding column (or row) to the compressed rep-
resentation. This is iteratively repeated until a certain error criterion is met
or the intersection matrix becomes singular.
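To make the sequential nature of this procedure explicit, the sketch below outlines a common variant, ACA with partial pivoting, in real arithmetic and with an explicitly stored A for brevity; the pivot-row bookkeeping and the stopping test are simplified with respect to production implementations, and the name aca_partial_pivot is ours.

#include <math.h>

/* Partially pivoted ACA sketch: builds A ~ U*V with U (m x max_rank,
   row-major) and V (max_rank x n, row-major). Every new row and column
   depends on all previously computed ones, which is what makes the
   method inherently sequential. Returns the achieved rank.               */
int aca_partial_pivot(const double *A, int m, int n,
                      double *U, double *V, int max_rank, double tol)
{
    int i = 0, k = 0;                  /* current pivot row, current rank  */
    while (k < max_rank) {
        /* residual of row i: A(i,:) minus the contribution of previous terms */
        for (int j = 0; j < n; j++) {
            double s = A[i * n + j];
            for (int l = 0; l < k; l++) s -= U[i * max_rank + l] * V[l * n + j];
            V[k * n + j] = s;
        }
        int jp = 0;                    /* column pivot: largest |residual| */
        for (int j = 1; j < n; j++)
            if (fabs(V[k * n + j]) > fabs(V[k * n + jp])) jp = j;
        double piv = V[k * n + jp];
        if (fabs(piv) < tol) break;    /* residual negligible: stop        */
        for (int j = 0; j < n; j++) V[k * n + j] /= piv;

        /* residual of column jp becomes the new column of U */
        for (int p = 0; p < m; p++) {
            double s = A[p * n + jp];
            for (int l = 0; l < k; l++) s -= U[p * max_rank + l] * V[l * n + jp];
            U[p * max_rank + k] = s;
        }
        /* next pivot row: largest |U(:,k)| (a full implementation also
           excludes previously used rows and uses a norm-based stopping test) */
        int ip = (i == 0 && m > 1) ? 1 : 0;
        for (int p = 0; p < m; p++)
            if (p != i && fabs(U[p * max_rank + k]) > fabs(U[ip * max_rank + k])) ip = p;
        i = ip;
        k++;
    }
    return k;
}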
It is also possible to select I, J in a random way [15]. Randomized meth-
ods for compressing low-rank matrices have received great attention in
recent years. Although in some cases they require a larger number of op-
erations than deterministic methods, they naturally adapt to parallel com-
putational environments, both at the higher level (multicore environments)
and at the lower one (GPUs and SIMD machines). For instance, techniques
based on using random embeddings (projections onto randomly chosen sub-
spaces) have been widely used in applications involving huge amounts of data
[16, 17]. However, these techniques are only efficient when all the elements
of the matrix A are known beforehand, or at least can be obtained in a very
easy way. In our case, the cost of computing the elements of the matrix that
we intend to compress is high (Section 2.2), so we opt for the CUR approxi-
mation, which requires computing only a small number of the elements of the
matrix.
The randomly chosen indices I, J can be obtained by uniform random
sampling or by employing a probability distribution that takes into account
some properties of the physical problem. In our case, we choose a uniform
sampling strategy. There exist theoretical results about the expected error
for this kind of selection of I, J based on uniform random sampling [18].
However, since stating these results requires a significant mathematical
apparatus and they would not provide substantial insight into our particular
case, we omit them and refer the interested reader to the original
work [18]. Nevertheless, we should highlight two features of this strategy: 1) the
randomly selected I, J lead to worse CUR approximations (in the
sense of the norm) than methods that make a significant effort to maximize
the volume of the intersection matrix, and 2) the intersection matrix A(I, J)
can easily become ill-conditioned or singular, but this is not a fundamental
limitation since U is computed as its Moore-Penrose inverse.
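A minimal way to draw such an index set, assuming plain uniform sampling without replacement via a partial Fisher-Yates shuffle (the helper name sample_indices is ours; any unbiased sampler would do):

#include <stdlib.h>

/* Pick r distinct indices uniformly at random from {0, ..., n-1} using a
   partial Fisher-Yates shuffle. The caller seeds rand() beforehand.      */
void sample_indices(int n, int r, int *out)
{
    int *pool = (int *)malloc(n * sizeof(int));
    for (int i = 0; i < n; i++) pool[i] = i;
    for (int k = 0; k < r; k++) {
        int j = k + rand() % (n - k);       /* uniform in [k, n-1]         */
        int tmp = pool[k]; pool[k] = pool[j]; pool[j] = tmp;
        out[k] = pool[k];
    }
    free(pool);
}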
Thus, choosing a compression algorithm for low-rank matrices entails
finding a compromise between the accuracy of the compression and the com-
putational effort required to obtain it. Among the CUR strategies, finding
the maximal-volume intersection is probably optimal in terms of compression
quality, but it is prohibitively costly. Greedy adaptive algorithms such as the ACA
offer a very good compression rate at a much smaller computational cost. In
the case of randomized methods, the cost of obtaining I, J is very small, of-
ten negligible. However, their main advantage does not lie in the reduced
cost of computing the indices, but in the potential parallelism when
computing the C, U and R matrices. The Adaptive Cross Approximation,
for instance, is sequential in nature, since every row or column is computed
only after the previous one has been processed. In contrast, the randomized
selection allows all the rows and columns of C and R to be computed
independently.
$$L(J) = \int_S \left( -j\omega\mu\, G(r, r')\, J(r') - \frac{\nabla' G(r, r')}{j\omega\varepsilon}\, \nabla'\!\cdot J(r') \right) dS' \qquad (4)$$

and

$$G(r, r') = \frac{\exp(-jk\,|r - r'|)}{4\pi\,|r - r'|}. \qquad (5)$$
k = 2π/λ is the free-space wavenumber associated with wavelength λ and
ε, µ are the electric permittivity and the magnetic permeability, respectively.
After imposing the boundary condition at the perfectly conducting object
surface,

n̂ × E(r) = n̂ × ( Einc(r) + Es(r) ) = 0,

the EFIE is obtained as an operator equation (6) for the unknown induced surface current Js.
In order to solve (6), we discretize it into a system of linear algebraic equa-
tions with the Method of Weighted Residuals, also known as the Method of
Moments (MoM) [20] [21]. This method consists of (i) expanding the un-
known Js as a linear combination of N basis functions fj(r) and (ii) project-
ing both sides of the equation onto a set of N weighting functions wi(r).
Here, the choice is wi(r) = fi(r), which is commonly known as the Galerkin
method [20].
The result is a linear system
ZJ = −Einc . (7)
and fj (r) functions [28] [29]. Therefore, the rank of the sub-matrix given by
the set of (i, j) indices in the field and source domains is much smaller than
the order of the sub-matrix. The smaller the subdomains are or the further
apart they are, the more compressible is the corresponding sub-matrix. Sub-
matrices corresponding to neighbor (and self) domains are full-rank, so they
are recursively subdivided as detailed in [2]: non-neighbor domains at the
next lower level are compressed, and neighbor and self ones are subdivided
again until a certain size is reached. The compression is carried out with the
techniques described in Section 2.1.
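Schematically, this subdivision can be organized as the following recursion (a sketch under simplifying assumptions: the admissibility test, the bisection-based splitting rule and the minimum block size stand in for the geometric criteria detailed in [2], and all names are illustrative):

/* Schematic recursion over the block structure of Z. 'admissible' stands
   for the geometric criterion (non-neighbor, well-separated subdomains),
   'compress' applies a low-rank (e.g. R-CUR) approximation and
   'store_dense' keeps a full-rank leaf block.                             */
typedef struct { int row0, row1, col0, col1; } Block;   /* index ranges    */

void build_blocks(Block b, int leaf_size,
                  int  (*admissible)(Block),
                  void (*compress)(Block),
                  void (*store_dense)(Block))
{
    int rows = b.row1 - b.row0, cols = b.col1 - b.col0;
    if (admissible(b)) { compress(b); return; }          /* distant: compress */
    if (rows <= leaf_size || cols <= leaf_size) {        /* small: keep dense */
        store_dense(b);
        return;
    }
    int rm = b.row0 + rows / 2, cm = b.col0 + cols / 2;  /* split into 2 x 2  */
    Block children[4] = {
        { b.row0, rm, b.col0, cm }, { b.row0, rm, cm, b.col1 },
        { rm, b.row1, b.col0, cm }, { rm, b.row1, cm, b.col1 } };
    for (int k = 0; k < 4; k++)
        build_blocks(children[k], leaf_size, admissible, compress, store_dense);
}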
3. Methods
We employ a GPU-accelerated randomized CUR algorithm (R-CUR) to
efficiently compress the low-rank sub-blocks of the impedance matrix Z in
(7). The low-rank approximation is based on the CUR method described in
Section 2.1, with index selection based on uniform random sampling. In this
section we describe the basic R-CUR algorithm and its GPU implementation.
The basic R-CUR algorithm randomly selects the indices (I, J) and then
computes C = A(:, J), R = A(I, :) and U = G† = A(I, J)†.
An obvious limitation of this method is that it requires a priori knowledge
of the rank r of the CUR approximation. In order to overcome this limita-
tion, we implement an adaptive version of the method. We start by setting
an arbitrary (but typically small) guess for the rank r and compute the cor-
responding CUR approximation. Then, we iteratively double the size of the
rank r and repeat the same operation. In order to check for convergence,
we compute at each iteration step the product of the approximate matrix
Â and a random Gaussian vector v, and compare the result with the one of
the previous iteration. When the relative error between the approximation
of Av computed for the current iteration and for the previous one falls below
a certain threshold we consider that the quality of the CUR approximation
is good enough and stop the iterative procedure (Algorithm 1).
One may argue that this adaptive procedure goes against the guiding
principle of the method, which consists in exploiting the hardware capabilities
through highly parallel operations. Although this is true, Algorithm 1 requires
a small number of iterations with a relatively large amount of parallel work
per iteration. To obtain a compressed approximation of rank r, the random-
ized CUR method requires log₂ r iterations with O(2^k (m + n)) operations on
iteration k, whereas classical adaptive methods such as the ACA would require r iter-
ations with O(m + n) operations per iteration. Thus, this adaptive strategy
of iteratively doubling r preserves a large amount of parallelism
within each iteration.
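A quick check of the total cost confirms this trade-off: summing the per-iteration work over the doubling steps gives a geometric series,

$$\sum_{k=0}^{\log_2 r} O\!\left(2^{k}(m+n)\right) = O\!\left(2r\,(m+n)\right) = O\!\left(r\,(m+n)\right),$$

so the adaptive variant evaluates at most roughly twice as many matrix entries as a fixed-rank randomized CUR that knew the final rank r in advance, while keeping each iteration fully parallel.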
Regarding the initial rank r0 , we heuristically choose r0 = max(1, min(m, n)/100).
In this way, for a large square matrix, the initial compression rank corre-
sponds to 1% of the matrix order. Indeed, it is possible to estimate be-
forehand the rank of the sub-matrix to be compressed, since it corresponds
to the number of degrees of freedom of E(r) and it is theoretically described
in [29]. However, we have preferred to omit this possibility in the current
version of our implementation in order to draw conclusions that are also valid
for more general cases, where no information about the rank is available.
Note that when r grows to min(m, n), we are no longer applying any
compression to the original matrix A. In that case, we directly compute
the whole matrix A.
In order to make Algorithm 1 more efficient, we could reuse the rows and
columns computed at every iteration. However, we avoid this in our current
implementation since it would imply keeping the kernel calls in Algorithm 1
Algorithm 1 Randomized CUR
Input: Mesh data, initial rank r0, error threshold ε
Output: C, U and R (approximation of A ∈ Cm×n)
r ← r0
err ← ∞
Initialize random test vector v ∈ Cn
Initialize product vector p ∈ Cm
while err > ε and r < min(m, n) do
    Randomly select I = (i1, . . . , ir) ⊂ (1, . . . , m)
    Randomly select J = (j1, . . . , jr) ⊂ (1, . . . , n)
    Compute C ← A(:, J), G ← A(I, J), R ← A(I, :)
    Compute U ← G†
    pnew ← C U R v
    err ← ‖pnew − p‖/‖pnew‖
    p ← pnew
    r ← 2r
end while
in the same scope as the main loop, which entails some code limitations, or
resorting to complex pointer manipulations. An optimal implementation
would need half the number of Z-matrix element computations, which does
not affect the asymptotic complexity of the method.
Regarding the cost of computing the C U R v product at each iteration, we
should notice that it is small compared with the computation of the C, U
and R matrices, which require evaluating the integral (8) for each matrix en-
try. Also, computing the Moore-Penrose inverse of G is not computationally
expensive since G has size r × r and r ≪ m, n.
The compressed blocks of Z are computed on the GPU and remain in the device memory with no
need to transfer them to the CPU. Indeed, the system (7) can be efficiently
solved by an iterative solver that computes the products between the com-
pressed matrix and the iterate vectors in the GPU itself. We will only need
to transfer the final result of (7) to the CPU. The fact that Algorithm 1 only
requires log2 r iterations also helps to reduce the communication between the
host and the device.
Figure 2: Schematic representation of the procedure for obtaining the CUR approximation
of a matrix or matrix block (Algorithm 1). Data transfer between the host and the device
is minimal.
In the first step, the host (CPU) computes the mesh data, which comprises the
location of the vertices, information about the edges connecting them as well
as other physical parameters. This information is transferred to the GPU.
This constitutes the largest data transfer between the host and the device,
assuming the C, U and R matrices are not sent to the CPU.
Then, the host launches the kernels that compute the C, G and R ma-
trices. This is the most costly operation of the whole procedure. Computing
these matrices implies evaluating the integral (8) for every entry of the sub-
matrix. We achieve this with an in-house developed CUDA kernel that assigns
a CUDA grid to the submatrix, so that each thread is responsible for evalu-
ating (8) to fill the corresponding element of the submatrix.
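As an illustration of this thread-to-entry mapping, the following sketch shows what such a fill kernel could look like. Here eval_impedance_entry is a hypothetical device function standing in for the numerical evaluation of the integral (8) from the mesh data (its body is a placeholder), and column-major storage is chosen so that the result can be passed directly to cuBLAS; the actual in-house kernel may differ in its details.

#include <cuComplex.h>

/* Hypothetical device routine standing in for the evaluation of the MoM
   integral (8) between testing function 'test_idx' and basis function
   'basis_idx', reading the geometry from the mesh data resident in
   device memory. The body is a placeholder.                               */
__device__ cuDoubleComplex eval_impedance_entry(const double *mesh,
                                                int test_idx, int basis_idx)
{
    return make_cuDoubleComplex(0.0, 0.0);   /* evaluation of (8) goes here */
}

/* One thread per (row, col) entry of the sub-matrix being filled. 'rows'
   and 'cols' hold the global indices selected for this block (e.g. the
   index sets I or J, or full ranges when filling C and R).                */
__global__ void fill_submatrix(cuDoubleComplex *block, int nrows, int ncols,
                               const int *rows, const int *cols,
                               const double *mesh)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;   /* local row index    */
    int j = blockIdx.x * blockDim.x + threadIdx.x;   /* local column index */
    if (i < nrows && j < ncols)
        block[j * nrows + i] =                       /* column-major       */
            eval_impedance_entry(mesh, rows[i], cols[j]);
}

/* Launch with square T x T thread blocks (the "threads per block side"
   parameter studied in Section 4), e.g.:
     dim3 tb(T, T);
     dim3 grid((ncols + T - 1) / T, (nrows + T - 1) / T);
     fill_submatrix<<<grid, tb>>>(d_block, nrows, ncols, d_rows, d_cols, d_mesh);  */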
Although filling the matrices is a highly parallel operation, the threads
should access the data structures containing the mesh information in a con-
current way. Besides, the memory access pattern is rather unpredictable,
since computing contiguous elements of Z does not imply accessing con-
tiguous memory addresses in the data structures that hold the mesh data.
This aspect is left for future study and optimization; however, it does not
seem to be an efficiency bottleneck. The fact that the accesses to the mesh
data structure are read-only operations implies that no coherence
mechanisms are necessary, which improves the overall efficiency of the pro-
cess.
Also note that C, G and R have some elements in common (indeed, G
contains the elements that appear in both C and R). However, since G is
typically small compared with C and R, it is more efficient to recompute the
common elements rather than copying them into a different device memory
location.
The next step is computing U = G† , the Moore-Penrose inverse of the
intersection matrix. We also perform this operation in the GPU device.
Although G is relatively small and the computation of its pseudoinverse
would not require the GPU's capabilities, transferring G to the CPU
entails a significant cost in terms of time, so it is overall cheaper to compute it on
the device.
In order to obtain the Moore-Penrose inverse of G, we first compute its
singular value decomposition (SVD) as G = WL Σ WR∗, where WL and WR are
unitary matrices and Σ is a diagonal matrix containing the singular values
σi(G) of G. Then, G† = WR Σ† WL∗. The Moore-Penrose inverse of a diagonal
matrix like Σ is a diagonal matrix whose (i, i) element is 1/σi if σi ≠ 0
and 0 otherwise. In order to avoid numerical instabilities, we replace the
diagonal elements of Σ by 0 if they fall below a certain threshold value, in
our case 10⁻¹⁰.
The singular value decomposition is computed with the cusolverDnZgesvd
function from the cuSolver library. The inversion of the diagonal elements of
Σ that lie above the threshold, yielding Σ†, is performed
by an in-house CUDA kernel. Then, we compute the products in
WR Σ† WL∗ with the cublasZgemm function from the cuBLAS library.
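The thresholded inversion of the singular values is a trivially parallel operation; a kernel of the following form (a sketch of what such an in-house kernel might look like, with illustrative names) suffices, after which the two cublasZgemm calls assemble G† = WR Σ† WL∗.

#include <cuComplex.h>

/* sigma: the r singular values of G (real, non-negative, as returned by
   cusolverDnZgesvd). Writes the diagonal pseudo-inverse Sigma^+ as a
   complex r x r column-major matrix, ready to be fed to cublasZgemm.
   Values not larger than 'thresh' (1e-10 in our runs) are treated as zero. */
__global__ void invert_singular_values(const double *sigma, int r,
                                       double thresh,
                                       cuDoubleComplex *SigmaPinv)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= r * r) return;
    int row = idx % r, col = idx / r;                /* column-major (row, col) */
    double val = (row == col && sigma[row] > thresh) ? 1.0 / sigma[row] : 0.0;
    SigmaPinv[idx] = make_cuDoubleComplex(val, 0.0);
}

/* Launch example:
     int threads = 256, blocks = (r * r + threads - 1) / threads;
     invert_singular_values<<<blocks, threads>>>(d_sigma, r, 1e-10, d_SigmaPinv); */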
Once we have computed the CUR approximation, we perform the product
pnew = C U R v to assess the compression error. Note that all the operands
C, U, R, pnew, v reside in the device memory, so no expensive transfer
operations are required. The multiplication is computed, again, with the
cublasZgemm function. This operation is performed in several
steps following the order indicated by the parentheses in C(U(Rv)), so that
we execute the minimum number of operations. Then, we compare the pnew
vector with the approximation p from the previous iteration by computing
‖p − pnew‖/‖pnew‖. The device returns the value of the relative error (a single
scalar) to the host. If it is greater than the threshold value, we double
the size of r and start the procedure again.
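A sketch of this convergence check is given below, assuming column-major C (m × r), U (r × r), R (r × n) and the vectors already resident in device memory, a valid cuBLAS handle, and no error handling; the helper name curv_error and the scratch vectors t1, t2 are ours.

#include <cublas_v2.h>
#include <cuComplex.h>

/* Computes pnew = C*(U*(R*v)) and err = ||pnew - p|| / ||pnew||, then
   overwrites p with pnew for the next iteration. t1 and t2 are scratch
   vectors of length r; all pointers are device memory.                    */
double curv_error(cublasHandle_t h, int m, int n, int r,
                  const cuDoubleComplex *C, const cuDoubleComplex *U,
                  const cuDoubleComplex *R, const cuDoubleComplex *v,
                  cuDoubleComplex *p, cuDoubleComplex *pnew,
                  cuDoubleComplex *t1, cuDoubleComplex *t2)
{
    const cuDoubleComplex one  = make_cuDoubleComplex( 1.0, 0.0);
    const cuDoubleComplex zero = make_cuDoubleComplex( 0.0, 0.0);
    const cuDoubleComplex neg1 = make_cuDoubleComplex(-1.0, 0.0);

    /* t1 = R*v, t2 = U*t1, pnew = C*t2 (each as a gemm with one column)   */
    cublasZgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, r, 1, n, &one, R, r, v,  n, &zero, t1,   r);
    cublasZgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, r, 1, r, &one, U, r, t1, r, &zero, t2,   r);
    cublasZgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, m, 1, r, &one, C, m, t2, r, &zero, pnew, m);

    double num, den;
    cublasDznrm2(h, m, pnew, 1, &den);           /* ||pnew||                */
    cublasZaxpy(h, m, &neg1, pnew, 1, p, 1);     /* p <- p - pnew           */
    cublasDznrm2(h, m, p, 1, &num);              /* ||p - pnew||            */
    cublasZcopy(h, m, pnew, 1, p, 1);            /* p <- pnew               */
    return (den > 0.0) ? num / den : 0.0;
}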
All the vectors and matrices involved in the process are based on the
cuDoubleComplex type, which represents double precision complex numbers,
except for the vector containing the singular values σi(G) ∈ R, which is
based on double precision real numbers.
This procedure is intended to compute the CUR approximation of a single
sub-block of the impedance matrix Z. We could also exploit block-level
parallelism by compressing several blocks at the same time through con-
current CUDA calls, or by using multiple GPUs communicating via MPI, if
the problem is computationally large. However, since the objective of this
work is to test the GPU-accelerated implementation of the randomized CUR
method, these possible improvements are left for future work.
4. Numerical Results
In this section we provide some numerical results for the randomized CUR
technique. The GPU device used in these experiments is an NVIDIA Quadro
RTX 5000 with 3072 CUDA cores and 16 GB of RAM, and the CPU is an Intel
Xeon Silver 4214R with 24 cores at 2.4 GHz and 384 GB of RAM. The
code is written in Julia and MATLAB except the routines that are executed
in the GPU, which are written in CUDA C. The routine that computes the
elements of Z in the CPU (in order to compare its performance with the
GPU) is written in pure C. The codes are freely available at [30].
Figure 3: Mesh discretization of the two perfectly conducting spheres of radius 1.0 m and
separated by a distance of 12.0 m. Since the spheres are at a relatively large distance, the
matrix block representing the interaction between them is rank-deficient.
insight about its advantages and limitations by comparing it with ACA,
which is the most common kernel-independent compression method.
We consider the compression of submatrix A representing the interaction
between the two spheres. We analyze different cases with a fixed mesh con-
figuration of 12288 basis functions per sphere and let the wavelength take
different values, so that we can study the behavior of the algorithm under two
mesh resolutions: the wavelength is varied so that the average edge length
becomes λ/10 and λ/20. A mesh resolution of λ/10 is considered suitable
for most applications, and λ/20 corresponds to an overdiscretized case. Ta-
bles 1 and 2 summarize the compression error for both algorithms. The
computation time is similar for both cases.
A postcompression of the obtained CUR and ACA approximations can
be computed by applying QR and SVD decompositions to the approximant
matrices [2]. Since these matrices are already small, this procedure implies a
small computational effort and its cost can be considered negligible. We will
apply this strategy to both the CUR and ACA approximations.
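For reference, a standard way to carry out such a recompression of a factorization Â = C U R (stated here in general terms; the exact procedure followed in [2] may differ in its details) is

$$C U = Q_C T_C, \qquad R^{H} = Q_R T_R \quad (\text{thin QR factorizations}),$$
$$T_C T_R^{H} = W \Sigma V^{H} \quad (\text{SVD of an } r \times r \text{ matrix}),$$
$$\hat{A} = (Q_C W)\,\Sigma\,(Q_R V)^{H} \approx (Q_C W_{\tilde{r}})\,\Sigma_{\tilde{r}}\,(Q_R V_{\tilde{r}})^{H},$$

where the truncation rank r̃ ≤ r is chosen according to the decay of the singular values in Σ. All the factorizations involve matrices with at most r columns, which is why the cost of this step is negligible compared with building C, U and R.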
Figure 4: Comparison of randomized CUR and ACA compression for two different mesh
resolutions. The horizontal axis represents the number of rows of C and columns of R
involved on the approximation, that is, the numerical rank r of A. The vertical axis
shows the relative error of the approximation. ACA needs a smaller r than CUR to reach
the same error. After postcompression, the compression ratios of both methods become
comparable.
Figure 4 shows the relative error for the randomized CUR and ACA ap-
proximations depending on the numerical rank r of the approximation. We
observe that cases with a finer discretization (such as λ/20) require a smaller num-
ber of elements to reach a certain compression error, since overdiscretization
CUR λ/10 ACA λ/10
r err r err
13 (13) 7.99E-2 (0.10) 8 (8) 0.13 (0.13)
62 (48) 3.24E-4 (3.37E-4) 42 (40) 1.06E-3 (1.06E-3)
123 (88) 2.56E-6 (2.58E-6) 84 (79) 8.39E-6 (8.24E-6)
615 (132) 8.47E-9 (8.60E-9) 149 (137) 2.66E-8 (2.67E-8)
Table 1: Comparison of CUR and ACA algorithms for the λ/10 discretization. The table
shows the compression rank r and the relative error ‖Â − A‖/‖A‖. The values in parentheses
refer to the case where postcompression has been applied.
Table 2: Comparison of CUR and ACA algorithms for the λ/20 discretization.
tion that can be efficiently postcompressed later.
Figure 5: Execution time for the GPU-accelerated compression of a matrix block rep-
resenting the interaction between the two spheres for wavelength values of 2 m and 1 m, which
correspond to discretizations with 3072 and 12288 edges, respectively. The horizontal axis
represents the lateral size T of a square thread block. In total, the thread block has T²
threads. Note that the total number of threads involved in the execution is constant and
does not depend on the block size.
Figure 5 shows the time required for compressing the matrix blocks for
the two different cases and for several thread block sizes, and Figure 6 shows
Figure 6: Speedup of the GPU implementation with respect to the CPU for the example in Figure 5.
the speedup with respect to the same operation performed on a CPU, which
takes 12.19 s for λ = 2 m and 287 s for λ = 1 m. For the sake of comparison,
obtaining an equivalent compressed representation with ACA on CPU takes
14.49 s for the λ = 2 m case and 314.3 s for the λ = 1 m case.
The best speedup is achieved with the largest sub-matrix, with λ = 1 m
and a size of 12288 × 12288 elements. The performance improves when we
increase the number of threads per block, although it quickly saturates and
reaches a speedup of 30 with respect to the CPU. The smallest size case (λ =
2 m and 3072 mesh elements per sphere) exhibits a smaller speedup, due to
the larger impact that the overhead of certain operations may have on smaller
computations. Electrically large boxes often appear in large computational
problems in which it is beneficial to use matrix compression. Larger blocks
are expected to show even better speedups.
The multiplication between the compressed block and a vector can
also be efficiently carried out on the GPU with standard libraries such as cuBLAS. It
consists of the successive matrix products in C U R v. In our experiments, we
also obtain speedup values of around 30.
Figure 7: Surface current of the NASA almond normalized with the incident magnetic field
amplitude, in logarithmic scale (dB). The incident field propagates towards the −x̂ direc-
tion and is linearly polarized with the E-field parallel to the y-axis. The operating frequency
is 7.895 GHz (λ ≈ 0.038 m).
Figure 8: NASA almond bistatic radar cross section in the E-plane vs observation angle θ
for a 7.895 GHz y-polarized incident plane wave propagating in the −x direction. The
GMRES iterative solutions of the R-CUR compressed matrix (Rand. CUR) and of the ACA
compressed matrix (ACA) are compared with the direct solution of the uncompressed linear
system matrix (Exact MoM). Direction θ = 0◦ corresponds to the back scattering case
and θ = 180◦ corresponds to forward scattering.
5. Conclusions
This work shows the advantages of the randomized CUR approximation
for compressing the H-matrices of linear systems that arise in the discretiza-
tion of integral equations modeling electromagnetic scattering problems. The
main advantage of this method is its inherent parallelism, which allows it to
be efficiently implemented in massively parallel computing environments. In
particular, the fine-grained parallelism of the algorithm makes it very suit-
able for graphics processing units (GPU). Besides, since the method is purely
algebraic it can also be applied to other problems of physics and engineering
that lead to linear systems with compressible H-matrices.
Numerical examples are presented to show the performance of the ran-
domized method. Unlike other compression methods such as the ACA, the parallel
nature of the randomized CUR enables great computational effi-
ciency (a 30x speedup factor after parallelization on the GPU). Besides, it remains
competitive with ACA in serial CPU implementations. Also, we show that
applying postcompression to the R-CUR approximation leads to an excellent
compression ratio.
A final example shows the excellent accuracy of the randomized CUR when
solving the whole linear system for a standard electromagnetic scattering
benchmark.
6. Acknowledgements
This work was partly funded by the Ministerio de Ciencia e Innovacion
(MICINN) under projects PID2019-107885GB-C31 / AEI / 10.13039/501100011033,
PID2020-113832RB-C21 / AEI / 10.13039/501100011033 and PID2020-118410RB-
C21 / AEI / 10.13039/501100011033, and Catalan Research Group 2017 SGR
219 and grant 2021 FI B2 00096.
7. References
[1] J. Song, C. Lu, W. C. Chew, Multilevel fast multipole algorithm for
electromagnetic scattering by large complex objects, IEEE Transactions
on Antennas and Propagation 45 (1997) 1488–1493.
[9] W. C. Gibson, Efficient solution of electromagnetic scattering problems
using multilevel adaptive cross approximation and LU factorization, IEEE
Transactions on Antennas and Propagation 68 (5) (2020) 3815–3823.
doi:10.1109/TAP.2019.2963619.
[17] M. W. Mahoney, P. Drineas, CUR matrix decompositions for improved
data analysis, Proceedings of the National Academy of Sciences 106 (3)
(2009) 697–702. doi:10.1073/pnas.0803205106.
URL https://pnas.org/doi/full/10.1073/pnas.0803205106
[27] L. Grasedyck, W. Hackbusch, Construction and arithmetics of H-
matrices, Computing 70 (4) (2003) 295–334. doi:10.1007/s00607-003-
0019-1.
URL https://doi.org/10.1007/s00607-003-0019-1
[28] W. C. Chew, J.-M. Jin, C.-C. Lu, E. Michielssen, J. Song, Fast solu-
tion methods in electromagnetics, IEEE Transactions on Antennas and
Propagation 45 (3) (1997) 533–543. doi:10.1109/8.558669.