FPGA ARM Processor Based Supercomputing

This document describes an FPGA and ARM processor based supercomputing system composed of five Zynq SoCs compute-nodes. It proposes using the Zynq System on Chip, which combines an ARM processor with FPGA fabric, to build a low-cost and low-power supercomputing system. An FIR filter application was used to test the performance of the system with and without FPGA accelerators. The results showed that the ARM supercomputer with FPGA accelerators was 8.56 times faster than a similar system without accelerators.


2018 International Conference on Computing, Mathematics and Engineering Technologies – iCoMET 2018

FPGA and ARM Processor based Supercomputing


Wasim Akram¹, Tassadaq Hussain², Eduard Ayguade³
¹,² Riphah International University Islamabad
² Unal Color of Education Research and Development Islamabad
³ Barcelona Supercomputing Center, Barcelona, Spain
¹ Wasimakram811@hotmail.com, ² tassadaq@ucerd.com

Abstract—Low-cost, low-power heterogeneous architecture platforms such as the Xilinx Zynq SoC combine a multi-core ARM processor with an FPGA accelerator for the acceleration of high performance computing applications. In this paper, we propose an FPGA and ARM processor based supercomputer system composed of five Zynq SoC compute-nodes. The designed system uses message passing interface libraries for communication between compute-nodes, while AXI4-stream interfaces connect the ARM processor and the FPGA inside a compute-node. An FIR filter application is used to test the performance of the system with and without FPGA accelerators. The results show that the performance of the ARM based supercomputer with FPGA accelerators is 8.56 times higher than that of a similar system without FPGA accelerators.

Keywords—Heterogeneous, Zynq SoC, Supercomputer

I. INTRODUCTION
With improvements in multi-core processor technology, the demand for application performance has also increased. Delivering high performance to an application requires more processing speed from multi-core processors, which increases power consumption. As power has become the main metric for modern high performance computing, researchers and system architects are proposing heterogeneous multi-core processing systems that combine a multi-core processor with hardware accelerators or co-processors. These accelerators improve the performance of a compute-intensive application by executing a certain task. Over the past few decades, FPGA-based accelerators have delivered considerable improvements in performance and power efficiency, which makes them attractive to the high performance computing world. Their higher performance per watt shows that FPGAs can compete with both superscalar processors and GPU accelerators, especially for high performance computing applications [1] [2].

ARM based servers built on embedded processors [3] have gained popularity in academia and industry due to their low cost and low power consumption compared to conventional processors. The computational capability of ARM embedded processors does not match that of x86 processors in server environments, but according to recent research [4], their market share is expected to rise 25% in 2020. An ARM based server is favorable for applications that need high throughput rather than computing power.

The Zynq SoC device [5], built by Xilinx, offers a heterogeneous computing platform. It combines a multi-core ARM CPU with an FPGA accelerator. The purpose of integrating the FPGA accelerator into the SoC is low-power acceleration. The ARM CPU and FPGA accelerator are connected by a high performance, high bandwidth set of AXI/ACP interfaces, which allows interfacing with main memory [6] [7].

In this work, we designed an FPGA and ARM processor based supercomputer system. The system is composed of low-cost, low-power Zybo boards [8] carrying Zynq SoCs. In the designed system, the FPGA handles the compute-intensive portion of a high performance application and thereby increases the computation power of the ARM CPU. A Finite Impulse Response (FIR) filter is used as a test application on the system to evaluate the computational capability of the ARM processor with and without the FPGA accelerator.

This paper is organized as follows: Section II gives a detailed description of related work in heterogeneous architecture computing, where FPGAs are deployed as hardware accelerators alongside conventional multi-core processors and embedded processors. Section III discusses the system architecture and design, including both hardware and software. Section IV presents the results and discussion. The conclusion and future work are presented in Section V, followed by the references.

II. RELATED WORK
Heterogeneous architecture platforms provide a foundation for integrating FPGAs with conventional processors to accelerate high performance applications [9] [10]. The following research works show the opportunity for FPGA-based accelerators alongside other computing units. The Cray XD1 supercomputer [11], the Berkeley Emulation Engine 2 (BEE2) [12] and the Maxwell project [13] used FPGAs as the only computing elements in their supercomputing clusters for application acceleration. In

978-1-5386-1370-2/18/$31.00 ©2018 IEEE


2008, the Convey Computer Corporation [14] designed a heterogeneous computing platform that combines one or more x86 processors with an FPGA-based application accelerator. The Convey Hybrid-core HC-1 was the first product, consisting of an Intel Xeon host processor and Xilinx Virtex FPGAs as a coprocessor. Tsoi K. H. [15] presented a heterogeneous computer cluster known as Axel, consisting of an AMD Phenom quad-core CPU, an NVidia GPU and a Xilinx Virtex-5 FPGA attached on a PCI bus as an accelerator for simulating the N-body algorithm. George A. et al. [16] implemented a machine called the Novo-G supercomputer, made of 24 compute-nodes, each with a quad-core Xeon processor and two PCI x8 PROCStar-III accelerator boards. Each board comprises four Stratix-III E260 FPGAs from Altera.

Moreover, the following research works propose FPGA-based accelerators with embedded processors. Lin Z. [17] demonstrated an FPGA-based Hadoop cluster, called ZCluster, made of 8 Xilinx Zynq SoC computing nodes. The aim of ZCluster is to build a Hadoop cluster that increases the computing capabilities of ARM processors through the use of reconfigurable hardware accelerators. Moorthy P. [18] built a cluster of 32 Xilinx Zynq SoC nodes. The main objective of the 32-node cluster is to assess the energy efficiency of hybrid SoCs for fast mapping of parallel graph algorithms such as neural network simulation. Bai X. et al. [19] designed a cluster of 48 compute-nodes, each composed of Xilinx Zynq SoC chips. The hybrid architecture provides a platform in which the ARM CPU is merged with FPGA reconfigurable hardware. Non-subtraction Montgomery and Chinese Remainder Theorem algorithms are implemented to test the performance of the hybrid architecture platform.

The research works described above show that researchers and system architects have made great contributions to the use of FPGA-based accelerators with conventional processors as well as embedded processors for high performance computing applications.

III. SUPERCOMPUTER ARCHITECTURE AND DESIGN
In this section, we describe the construction of the FPGA and ARM based supercomputer system and its operating mechanism. The physical layout of the system requires different hardware interfaces and related software configuration. The scalability of the system can be increased by deploying multiple switches and compute-nodes. Figure 1 shows a block diagram of the FPGA and ARM processor based supercomputer architecture built from Xilinx Zynq SoCs. This section is further divided into four sub-sections as follows:

Fig. 1: FPGA and ARM processor based Supercomputer Architecture using Zynq SoC

A. Processing System
The FPGA and ARM processor based supercomputer is composed of five Zybo board compute-nodes and an Intel Xeon server connected through an 8-port 10/100Mbps Ethernet switch. The physical design of the supercomputer is shown in Figure 2. The details of the processing system are given in the following two sub-sections.

1) Master-node: The master node used in the system is an Intel Core-i7 Xeon CPU running at 3.2GHz with 4GB DDR3 memory. A 1TB hard disk drive is used as storage for the master node. The master node is the main controlling component of the system; it establishes communication and divides tasks between compute-nodes.

2) Compute-nodes: Each Zybo board compute-node has a Xilinx Zynq-7000 SoC (system on chip), which incorporates a dual-core ARM embedded processor running at 650MHz and Xilinx 7-series FPGA logic fabric, with 512MB DDR3 main memory, 240KB of RAM, 4.4k logic slices and 80 DSP slices. The ARM CPU and FPGA hardware accelerator are connected by high performance, high bandwidth AXI4-stream interfaces as shown in Figure 1. The AXI4-stream interfaces are divided into the following two groups:
• AXI4-stream Master interfaces connect the ARM CPU to the slaves of the FPGA fabric for read/write operations. The two 32-bit master interfaces are GP0 and GP1.
• AXI4-stream Slave interfaces connect an FPGA master to the CPU slave to read/write the main memory of the processing system. The high performance port (HP0) and the accelerator coherency port (ACP) are examples of slave interfaces.
Two types of compute-nodes are used in the architecture of the supercomputer system.

a) Compute-node without FPGA: This type of compute-node uses only the ARM processor for computation and data processing, without the use of the FPGA accelerator. The FPGA accelerator is disabled and performs no computations.

b) Compute-node with FPGA: This type of compute-node uses the FPGA accelerator to increase the computational capability of the ARM processor by processing the data-intensive portion of an application. A customized FPGA
accelerator is designed using high-level synthesis and design tools.

Fig. 2: Physical Design of the Supercomputer (a) Five Compute-nodes of Zybo boards (b) Master-node

B. System Software
In this section, we describe the software and packages installed on the master node and all compute-nodes. This section is further subdivided into four sub-sections.

1) Operating Systems: A Linux based operating system, 64-bit Ubuntu 14.04 LTS, is installed on the master node to manage resource sharing across the supercomputer system. The Xillinux operating system [20] is used by all compute-nodes in our system design. This operating system is a Linux distribution based on Ubuntu LTS 12.04. We use several boot stages to boot the Zybo board. In stage 0, we load the Xillinux boot image file onto the SD card; it runs on the primary CPU and initializes the first stage bootloader (FSBL), which configures the clocks, DDR and I/Os of the processing system. In addition, we add the bitstream file of our hardware accelerator and invoke the second stage loader to load the test application program into main memory. We repeat this process for all compute-nodes of the system.

2) Dynamic Host Configuration Protocol: Every node (the compute-nodes, the master node, and the Ethernet switch) has an IP address. A DHCP server is used to assign static IP addresses to the network. The DHCP server generates a specific IP address from the MAC address of every node in the network. Every node in the network has its own hostname and IP address. A DNS server with dynamic DNS (DDNS) is also enabled, so that the master node can access compute-nodes by hostname without using the host IP address every time.

3) Secure Shell: SSH [21] is a secure protocol for logging onto a remote system over TCP. The standard TCP port 22 is designated for the SSH server. The SSH server is installed on the master and all five compute-nodes. The master node uses SSH commands to log onto every compute-node using their hostnames and IP addresses.

4) Network File System: The Network File System (NFS) uses the TCP/UDP internet protocols to distribute compiled applications, packages, libraries and data across the supercomputer. We installed the NFS server on the master node and the NFS client on all five compute-nodes. The machine files are accessible to all nodes at the same time.

C. Supercomputer Configuration
After installing all relevant software and packages, our system operates like a real production supercomputer. The DHCP server configuration provides IP addresses in the 192.168.10.0/24 subnet. The master node is authorized to access every compute-node on the network over SSH by hostname and IP address. SSH public keys are distributed to every node to grant access without password authentication. This configuration allows an application program to communicate across the supercomputer without specifying a username and password on every connection. After making all necessary configurations, a common machine file directory is created on the master node as the root user and exported to all five compute-nodes. The compute-nodes mount the same directory at a local location, so that a single folder is shared between the master and all compute-nodes.

D. Application Programming Software
This section covers the parallel programming models and design tools for our supercomputer system. Parallel programming models are used to bridge the complexity between the hardware architecture and the application software. Our system supports Open MPI [22], MPICH [23] and emerging models like OpenCL [24]. This section is further divided into the following two sub-sections.

1) Message Passing Interface: Multiple nodes make our supercomputer a distributed memory architecture. Parallel programming models are used to achieve the desired parallelism across the system. For our design, we installed OpenMPI and MPICH on all five compute-nodes and the master node. They provide fast node-to-node message passing and daemon-based process startup/control for the supercomputer. After installation, we execute an FIR filter C++ program in parallel on multiple nodes with the mpirun command mpirun -np 2 --user hosts ./exe. In the command, np specifies the number of cores per node, while user is a file containing the host node names and IP addresses. Figure 3 shows the inter-node and intra-node communication across the supercomputer.

2) Vivado HLS Tool: We used the Xilinx Vivado 2015.4 tool to generate the bitstream file of the customized hardware accelerator for FIR filter acceleration. The hardware accelerator includes one AXI4-stream master interface bus and one AXI4-stream high-performance slave interface bus, a customized sample source IP and an FIR
compiler. The sample source generates the required samples for the FIR compiler. The sample generator writes the data to the FIR compiler through a FIFO. The data from the FIR is written to main memory through a DMA engine using the high-performance slave AXI bus. The ARM CPU reads the data from the FPGA through the AXI master bus and reconfigures the DMA engine for the next packet of data.

Fig. 3: Communication across the Supercomputer

IV. RESULT AND DISCUSSION
In this section, we perform a series of tests to measure the performance of the ARM processor with and without the FPGA accelerator, for a single compute-node and for five compute-nodes of the supercomputer. We used a low-pass 32-tap FIR filter [25] as the test application. The FIR application uses a 1 GB data set. The clock frequency is 650MHz for the ARM processor and 200MHz for the FPGA. The section is further subdivided into two sub-sections: single compute-node performance and supercomputer system performance.

A. Performance of Single Compute Node
The first experimental test measures the performance of the ARM processor with and without the FPGA accelerator on a single compute-node of the supercomputer. The result is tabulated in Table 1. It shows that executing the FIR application on a single compute-node with the FPGA accelerator achieves a speedup of 7.55 times over the ARM processor without the FPGA accelerator.

Table 1. Single Node Performance

Data Set    Clock Cycles (ARM)    Clock Cycles (ARM + FPGA)    Speedup With FPGA
1GB         461,334,321           61,223,331                   7.55x

B. Supercomputer System Performance
This experimental test measures the performance of the ARM processor based supercomputer with and without FPGA accelerators. The result is tabulated in Table 2. It shows that executing FIR on the supercomputer of five ARM compute-nodes with FPGA accelerators achieves 8.56 times higher performance than the supercomputer of five ARM compute-nodes without FPGA accelerators. The improvement over the single-node results in Table 1 is smaller than the fivefold increase in compute-nodes because of MPI communication overheads across the system. The communication overhead between nodes increases with the number of compute-nodes.

TABLE 2. Supercomputer Five Nodes Performance

Data Set    Clock Cycles (ARM)    Clock Cycles (ARM + FPGA)    Speedup With FPGA
1GB         180,915,420           21,111,494                   8.56x

V. CONCLUSION AND FUTURE WORK
This paper proposed an implementation of FPGA and ARM processor based supercomputing using Zynq SoC devices. The system is able to take advantage of parallelism when executing high performance computing applications. The FIR filter application shows that the computational capability of the ARM processor is increased by integrating an FPGA accelerator to execute the compute-intensive portion of the application. The performance of the supercomputer of five ARM compute-nodes with FPGA accelerators is 8.56 times higher than that of the same number of nodes without FPGA accelerators. The FPGA and ARM based supercomputer system shows that advances in processor technology will decrease the gap between embedded processors and conventional processors in future high performance supercomputing.

In future, high-level synthesis tools that support the OpenCL computing language will be used for our supercomputer system to generate bitstream files for customized hardware accelerators from standard C code. OpenCL will be implemented on the supercomputer to fully analyze the parallelism of the heterogeneous architecture platform in order to achieve higher performance with low power consumption.

ACKNOWLEDGMENT
The research leading to these results has received funding from the Unal Color of Education Research and Development (UCERD) Private Limited Islamabad.
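
As a cross-check on the reported results, the speedups and the one-to-five-node scaling can be recomputed from the clock-cycle counts in Tables 1 and 2. The following is a minimal sketch, not part of the original experiments: the function names are illustrative, and it assumes that all four cycle counts were measured against the same clock.

```python
# Clock-cycle counts copied from Table 1 (single node) and Table 2 (five nodes).
CYCLES = {
    "single_node": {"arm": 461_334_321, "arm_fpga": 61_223_331},
    "five_nodes": {"arm": 180_915_420, "arm_fpga": 21_111_494},
}
NODES = 5  # compute-nodes in the five-node configuration


def speedup(baseline_cycles: int, improved_cycles: int) -> float:
    """Ratio of cycle counts; assumes both counts refer to the same clock."""
    return baseline_cycles / improved_cycles


def scaling_efficiency(single_cycles: int, multi_cycles: int, nodes: int) -> float:
    """Fraction of ideal linear scaling achieved when going from 1 to `nodes` nodes."""
    return speedup(single_cycles, multi_cycles) / nodes


if __name__ == "__main__":
    # FPGA speedup within each configuration (compare with Tables 1 and 2).
    for config, c in CYCLES.items():
        print(f"{config}: FPGA speedup = {speedup(c['arm'], c['arm_fpga']):.2f}x")
    # Scaling from one node to five nodes, with and without the FPGA.
    for variant in ("arm", "arm_fpga"):
        one = CYCLES["single_node"][variant]
        five = CYCLES["five_nodes"][variant]
        print(f"{variant}: 1 -> {NODES} nodes = {speedup(one, five):.2f}x, "
              f"efficiency = {scaling_efficiency(one, five, NODES):.0%}")
```

With the tabulated counts, the FPGA speedups come out close to the reported 7.55x and 8.56x, and the one-to-five-node scaling is well below fivefold for both variants, which is consistent with the MPI communication overhead discussed in Section IV.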
REFERENCES
[1] S. Amin, T. Hussain, and U. Zabit, “FPGA Based Processing of Speckle Affected Self-Mixing Interferometric Signals,” 2016 Int. Conf. Front. Inf. Technol., pp. 292–296, 2016.
[2] T. Hussain, M. Pericas, N. Navarro, and E. Ayguade, “Implementation of a reverse time migration kernel using the HCE high level synthesis tool,” 2011 Int. Conf. Field-Programmable Technol. FPT 2011, pp. 2–9, 2011.
[3] “MACOM Announces Sampling of X-Gene® 3 Server-on-a-Chip® Solution | AppliedMicro.” [Online]. Available: https://www.apm.com/news/macom-announces-sampling-of-x-gene-3-server-on-a-chip-solution/. [Accessed: 27-Nov-2017].
[4] “Worldwide x86 and ARM Server-Class Microprocessor Forecast, 2016–2020.”
[5] “Zynq-7000 All Programmable SoC.” [Online]. Available: https://www.xilinx.com/products/silicon-devices/soc/zynq-7000.html. [Accessed: 21-Nov-2017].
[6] T. Hussain, M. Shafiq, M. Pericàs, N. Navarro, and E. Ayguadé, “PPMC: A programmable pattern based memory controller,” Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), vol. 7199 LNCS, pp. 89–101, 2012.
[7] T. Hussain, O. Palomar, O. Unsal, A. Cristal, E. Ayguade, and M. Valero, “Advanced Pattern based Memory Controller for FPGA based HPC applications,” Proc. 2014 Int. Conf. High Perform. Comput. Simulation, HPCS 2014, pp. 287–294, 2014.
[8] “Zybo Zynq-7000 ARM/FPGA SoC Trainer Board (LIMITED TIME)>> see Zybo Z7-10 for replacement - Digilent.” [Online]. Available: http://store.digilentinc.com/zybo-zynq-7000-arm-fpga-soc-trainer-board/. [Accessed: 23-Nov-2017].
[9] T. Hussain, “Memory resources aware run-time automated scheduling policy for multi-core systems,” Microprocess. Microsyst., vol. 57, pp. 1–24, 2018.
[10] T. Hussain, “A novel hardware support for heterogeneous multi-core memory system,” J. Parallel Distrib. Comput., vol. 106, pp. 31–49, 2017.
[11] C. Xd, “Cray XD1 Supercomputer.”
[12] C. Chang, J. Wawrzynek, and R. W. Brodersen, “BEE2: A high-end reconfigurable computing system,” IEEE Des. Test Comput., vol. 22, no. 2, pp. 114–125, 2005.
[13] R. Baxter et al., “Maxwell - A 64 FPGA supercomputer,” Proc. - 2007 NASA/ESA Conf. Adapt. Hardw. Syst. AHS-2007, no. August, pp. 287–294, 2007.
[14] B. Klauer, “The Convey Hybrid-Core Architecture,” in High-Performance Computing Using FPGAs, vol. 375, 2010, pp. 431–451.
[15] Kuen Hung Tsoi and Wayne Luk, “Axel: A Heterogeneous Cluster with FPGAs and GPUs,” 18th Annu. ACM/SIGDA Int. Symp. F. Program. Gate Arrays, pp. 115–124, 2010.
[16] A. D. George and G. Stitt, “Novo-G: A View at the HPC Crossroads for Scientific Computing,” no. January, 2010.
[17] Z. Lin and P. Chow, “ZCluster: A Zynq-based Hadoop cluster,” FPT 2013 - Proc. 2013 Int. Conf. F. Program. Technol., pp. 450–453, 2013.
[18] P. Moorthy and N. Kapre, “Zedwulf: Power-performance tradeoffs of a 32-node Zynq SoC cluster,” Proc. - 2015 IEEE 23rd Annu. Int. Symp. Field-Programmable Cust. Comput. Mach. FCCM 2015, no. 3, pp. 68–75, 2015.
[19] X. Bai, L. Jiang, Q. Dai, J. Yang, and J. Tan, “Acceleration of RSA Processes based on Hybrid ARM-FPGA Cluster,” 2017.
[20] “Xillinux: A Linux distribution for Zedboard, ZyBo, MicroZed and SocKit | xillybus.com.” [Online]. Available: http://xillybus.com/xillinux. [Accessed: 21-Nov-2017].
[21] “SSH Server | SSH.COM.” [Online]. Available: https://www.ssh.com/ssh/server. [Accessed: 21-Nov-2017].
[22] “Open MPI: Open Source High Performance Computing.” [Online]. Available: https://www.open-mpi.org/. [Accessed: 21-Nov-2017].
[23] “MPICH | High-Performance Portable MPI.” [Online]. Available: https://www.mpich.org/. [Accessed: 21-Nov-2017].
[24] “OpenCL Overview - The Khronos Group Inc.” [Online]. Available: https://www.khronos.org/opencl/. [Accessed: 23-Nov-2017].
[25] “FIR Filter Design, Software and Examples.” [Online]. Available: http://www.iowahills.com/5FIRFiltersPage.html. [Accessed: 27-Nov-2017].
