Real-Time Canny Edge Detection Parallel Implementation For Fpgas
discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/221457790
All content following this page was uploaded by Calliope-Louisa Sotiropoulou on 23 February 2015.
Abstract-Edge detection is one of the most fundamental algorithms in digital image processing. The Canny edge detector is the most implemented edge detection algorithm because of its ability to detect edges even in images that are intensely contaminated by noise. However, it is a time-consuming algorithm, and its implementations therefore struggle to reach real-time response speeds. Especially nowadays, when the demand for high-resolution image processing is constantly increasing, the need for fast and efficient edge detector implementations is ever present. A new parallel Canny edge detector FPGA implementation is proposed in this paper to answer this demand. This design takes advantage of 4-pixel parallel computations to achieve high throughput without increasing the on-chip memory demands. Synthesis and simulation results are presented to demonstrate the design's efficiency and high frames-per-second rate.

Keywords-Canny edge detection, FPGA, parallel architecture, real time

I. INTRODUCTION

Modern image processing applications demonstrate an increasing demand for computational power and memory space. This stems from the fact that image and video resolutions have multiplied in the past few years, especially after the introduction of high definition video and high resolution digital cameras. Therefore there is a need for image processing implementations that can perform demanding computations on substantial amounts of data with high throughput, often under real-time requirements.

Edge detection is the first step in many computer vision algorithms. It is used to identify sharp discontinuities in an image, such as changes in luminosity or in intensity due to changes in scene structure. Edge detection has been researched extensively. Many edge detection algorithms have been proposed, such as the Roberts, Prewitt, Kirsch, Gauss-Laplace and Canny detectors. Among these, the Canny algorithm [1] is the most widely used due to its good performance and its ability to extract edges optimally even in images that are contaminated by Gaussian noise. The Canny algorithm achieves a low error rate by eliminating almost all non-edges and improving the localization of all identified edges.

Because of its algorithmic efficiency and applicability, many Canny implementations have been proposed. In [2] an implementation of a self-adaptive threshold Canny algorithm is proposed. This design is FPGA based and intended for a mobile robot system. The results presented are for an Altera Cyclone FPGA, and the highest frequency achieved is 27 MHz, which results in a 2.5 ms computation time for a 360x280 grayscale image. In [3] an industrial implementation for ceramic tile defect detection is presented, which defines the hysteresis thresholds with a histogram subtraction method. A Canny edge detector on NVIDIA CUDA is presented in [4], which takes advantage of the CUDA framework to implement the entire Canny algorithm on a GPU. It achieves a 10.92 ms computation time for a 1024x1024 image. In [5] there is an implementation of an adaptive edge-detection filter on an FPGA using a combination of hardware and software components proposed by Altera. In [6] a reconfigurable architecture and implementation of edge detection using Handel-C is presented. This is a pipelined design of a Canny-like edge detection algorithm. It achieves a computation time of 4.2 ms for a 256x256 grayscale image.

Ours is a novel implementation of a Canny edge detector that takes advantage of 4-pixel parallel computation. It is a pipelined architecture that uses on-chip BRAM memories to cache data between the different stages. The exploitation of both hardware parallelism and pipelining creates a very efficient design that has the same memory requirements as a design without parallelism in pixel computation. This results in increased throughput for high-resolution images and a computation time of 3.09 ms for a 1.2 Mpixel image on a Spartan-6 FPGA. We present synthesis and simulation results for low-end and high-end Xilinx FPGAs and have achieved higher speed, throughput and efficiency than the implementations presented above.

II. CANNY EDGE DETECTION ALGORITHM

The block diagram of the Canny algorithm is shown in Fig. 1. The Canny algorithm first smooths the image to eliminate noise by using a smoothing filter such as Gaussian convolution. The Gaussian smoothing is performed by using a mask (matrix) which is slid over the image, manipulating a square of pixels at a time. The bigger the
This work has been supported by National Funding and the European
Regional Development Fund in the frame of EIDA 2007-2013 under grant
No MIKP02-49-project LoC.
$$G_x = \begin{bmatrix} -1 & 0 & +1 \\ -2 & 0 & +2 \\ -1 & 0 & +1 \end{bmatrix}, \qquad G_y = \begin{bmatrix} +1 & +2 & +1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix}$$

Figure 2. 2-D Sobel operators

$$|G| = \sqrt{G_x^2 + G_y^2} \qquad (1)$$

$$\theta = \arctan(G_y / G_x) \qquad (2)$$

The following computational step is non-maximum suppression, which is used to reduce the edge thickness and improve localization. After non-maximum suppression the image may still contain some spurious responses. These responses, which are called 'streaking', can be eliminated by the use of hysteresis thresholding. In this procedure two threshold values are set, a high threshold $Th_h$ and a low threshold $Th_l$, with which the remaining edge gradient values are compared. Any pixel with a value greater than $Th_h$ is presumed to be a definite edge, and any pixel with a value greater than $Th_l$ is considered to be a possible edge. Hysteresis is the procedure where every possible edge is eliminated unless there is a path from this pixel to a pixel with a gradient above $Th_h$ which includes only pixels with values above $Th_l$.

III. HARDWARE IMPLEMENTATION

The final goal of our implementation is to use the edge detection stage as a precursor step to feature extraction. This design will be part of a demanding machine vision system, therefore there is a substantial need for efficiency and power in our implementation with as limited use of resources as possible. The input images are 8-bit grayscale. The nature of our implementation requires a successful edge detection for

reduced noise. We found that a 5x5 mask was sufficient for our implementation. Therefore, for the calculation of one pixel the contents of 25 pixels are required. We chose to implement a design that takes advantage of 4-pixel parallel calculation, for which, as demonstrated in Fig. 3, 40 pixels are required. So for an 8-bit pixel and a 32-bit word of pixels, the cache reads are minimized by following the pattern demonstrated in Fig. 4.

Figure 3. 4-pixel parallel Gaussian smoothing calculation

which in modern FPGAs is both abundant and fast. Therefore, each BRAM has a size of (image width / 4) x 32 bits to accommodate a line of data aligned in 4-pixel words. The calculation of the first pixels begins as soon as the first 2 lines (2 x image width) of the cache are filled, since for the first line of borderline pixels the non-existing lines necessary for the calculations are considered to be black. Therefore 2 lines of data are already loaded in the cache and the third line is simultaneously loaded in the cache and directed to the calculating core. Simultaneously, we adjust the matrix normalization factor appropriately to maintain uniformity. Thus, the same size of on-chip memory is used as if the calculation for only one pixel was executed at a time. The results are stored in another cache that serves as an input for the following stage.

B. Sobel Gradient Calculation

In the Sobel gradient calculation block the same pixel parallelism principles are applied as in the Gaussian smoothing block. For the calculation of one pixel gradient 9 pixels are required, while for the calculation of four, 18 are required (Fig. 5). The gradients for both directions are calculated in parallel as well. For caching, we allocate 3 BRAMs for the respective lines. Each BRAM has a size of (image width / 4) x 32 bits in 4-pixel words, as in the previous stage. To start the block calculation, only one full BRAM is required. The direction of the gradient is calculated by using fixed-point arithmetic and by implementing the multiplication with shifts and addition/subtraction to increase speed. The results are stored in a cache used as an input for the next stage.

Figure 5. 4-pixel parallel Sobel gradient calculation

C. Non Maximum Suppression

The non-maximum suppression also requires an 8-pixel neighborhood for the determination of each pixel's value. Therefore the parallelism implemented for the 4-pixel simultaneous calculation is the same as in Fig. 5. As in the Sobel gradient calculation, NMS starts as soon as one cache line is filled with data.

D. Double Thresholding and Hysteresis

Double thresholding is executed by a double comparator. No caching is required before thresholding, and the data go straight to the next stage, hysteresis. The data produced by the double thresholding stage has a size of 2 bits per pixel, as three different values need to be stored for each pixel: no edge, definite edge and possible edge.

from the first pass of hysteresis is still 2 bits for each pixel and it is stored in an external onboard memory (suitable for the size of the frame we use). For the second pass the image data is read in reverse, from the lower right corner to the top left. The hysteresis is executed in exactly the same manner, and finally all the remaining possible edges are suppressed. The final output of the process is only 1 bit for each pixel, with all the definite edges detected.

IV. EXPERIMENTAL RESULTS

TABLE I. CANNY SYNTHESIS RESULTS

Synthesis Results    Gauss  Sobel  NMS  Db_Thres  Hysteresis  Total  Total (%)  Frequency (MHz)
Spartan 3E Slices    2613   1054   649  37        84          4200   28%        120.4
Spartan 6 Slices     2418   1391   651  36        126         4560   2%         201.4
Virtex 5 Slices      2409   1389   648  40        124         4553   6%         292.8

Our Canny implementation will be used by a lab-on-chip system on a Spartan 6 FPGA. Therefore we use the results produced by the Spartan 6 synthesis for our simulation. We simulate the design by using 3 different image file sources, which are 8-bit grayscale files of varying sizes. Fig. 6 shows the input and output files of the Canny implementation. Fig. 6(c) and 6(d) show an example frame of a video from a lab-on-chip experiment. The timing results are presented in Table II.

TABLE II. CANNY TIME RESULTS
Figure 6. (a) lena input, (b) lena output, (c) HCLAchip input, (d) HCLAchip output, (e) disc-brake input, (f) disc-brake output
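The pixel counts quoted in Section III (25 and 40 pixels for the Gaussian stage, 9 and 18 for the Sobel stage) and the per-line BRAM size follow from simple window arithmetic. The Python sketch below only illustrates that arithmetic. The function names and the 1280-pixel line width are assumptions chosen for the example and are not taken from the design.

```python
def window_pixels(mask_size: int, parallel_pixels: int) -> int:
    """Pixels covered by the union of `parallel_pixels` horizontally
    adjacent mask_size x mask_size windows."""
    rows = mask_size
    # Each extra output pixel shifts the window one column to the right,
    # so it adds one new column to the covered area.
    cols = mask_size + parallel_pixels - 1
    return rows * cols


def line_buffer_bits(image_width: int,
                     pixels_per_word: int = 4,
                     bits_per_pixel: int = 8) -> int:
    """Bits needed to cache one image line as 4-pixel (32-bit) words.

    Assumes the image width is a multiple of the word size.
    """
    if image_width % pixels_per_word:
        raise ValueError("image width must be a multiple of pixels_per_word")
    words = image_width // pixels_per_word
    return words * pixels_per_word * bits_per_pixel


if __name__ == "__main__":
    print(window_pixels(5, 1), window_pixels(5, 4))  # Gaussian stage: 25 40
    print(window_pixels(3, 1), window_pixels(3, 4))  # Sobel stage: 9 18
    print(line_buffer_bits(1280))  # hypothetical 1280-pixel line: 10240 bits
```

Because the four windows overlap, the 4-pixel case needs 40 pixels rather than 4 x 25 = 100. This overlap is why the parallel design can reuse the same line caches as a 1-pixel design.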
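The double thresholding and two-pass hysteresis of Section III.D can be sketched in software as follows. This is a minimal behavioural model and not the hardware description. The numeric 2-bit codes, the 8-neighbour promotion rule and all function names are assumptions made for illustration. Two raster passes also only approximate full path tracing, since a possible edge connected through a sufficiently winding path may remain unpromoted.

```python
NO_EDGE, POSSIBLE, DEFINITE = 0, 1, 2  # 2-bit codes; the values are assumed


def double_threshold(grad, th_low, th_high):
    """Classify each gradient value with two comparisons."""
    return [[DEFINITE if g > th_high else POSSIBLE if g > th_low else NO_EDGE
             for g in row] for row in grad]


def _promote(codes, r, c):
    """Turn a possible edge into a definite edge if any of its
    8 neighbours is already a definite edge."""
    if codes[r][c] != POSSIBLE:
        return
    rows, cols = len(codes), len(codes[0])
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            rr, cc = r + dr, c + dc
            if ((dr or dc) and 0 <= rr < rows and 0 <= cc < cols
                    and codes[rr][cc] == DEFINITE):
                codes[r][c] = DEFINITE
                return


def hysteresis_two_pass(codes):
    """Forward raster pass, reverse raster pass, then suppression."""
    codes = [row[:] for row in codes]  # work on a copy
    rows, cols = len(codes), len(codes[0])
    # Pass 1: top-left to bottom-right.
    for r in range(rows):
        for c in range(cols):
            _promote(codes, r, c)
    # Pass 2: bottom-right to top-left.
    for r in reversed(range(rows)):
        for c in reversed(range(cols)):
            _promote(codes, r, c)
    # Remaining possible edges are suppressed; the output is 1 bit per pixel.
    return [[1 if v == DEFINITE else 0 for v in row] for row in codes]


if __name__ == "__main__":
    grad = [[5, 5, 9]]  # the strong pixel sits to the right of two weak ones
    codes = double_threshold(grad, th_low=3, th_high=8)
    print(hysteresis_two_pass(codes))  # [[1, 1, 1]]
```

In the usage example the leftmost pixel is only promoted during the reverse pass, which illustrates why a single forward scan is not sufficient.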