Real-Time Canny Edge Detection Parallel Implementation for FPGAs

Conference Paper · December 2010


DOI: 10.1109/ICECS.2010.5724558 · Source: DBLP





Christos Gentsos, Calliope-Louisa Sotiropoulou and Spiridon Nikolaidis
Department of Physics
Aristotle University of Thessaloniki
Thessaloniki, Greece
Email: {cgentsos;lsoti;snikolaid}@physics.auth.gr

Nikolaos Vassiliadis
Micro2gen Ltd.
New Technology Park NCSR Demokritos
Ag. Paraskevi, Athens, Greece
Email: nivas@micro2gen.com

Abstract: Edge detection is one of the most fundamental algorithms in digital image processing. The Canny edge detector is the most widely implemented edge detection algorithm because of its ability to detect edges even in images that are heavily contaminated by noise. However, it is a time-consuming algorithm, and its implementations therefore struggle to reach real-time response speeds. With the demand for high-resolution image processing constantly increasing, the need for fast and efficient edge detector implementations is greater than ever. A new parallel Canny edge detector FPGA implementation is proposed in this paper to answer this demand. The design takes advantage of 4-pixel parallel computations to achieve high throughput without increasing the on-chip memory demands. Synthesis and simulation results are presented to demonstrate the design's efficiency and high frame rate.

Keywords: Canny edge detection, FPGA, parallel architecture, real time

I. INTRODUCTION

Modern image processing applications demonstrate an increasing demand for computational power and memory space. This stems from the fact that image and video resolutions have multiplied in the past few years, especially after the introduction of high definition video and high resolution digital cameras. Therefore there is a need for image processing implementations that can perform demanding computations on substantial amounts of data with high throughput, and that can often meet real-time requirements.

Edge detection is the first step in many computer vision algorithms. It is used to identify sharp discontinuities in an image, such as changes in luminosity or in intensity due to changes in scene structure. Edge detection has been researched extensively. Many edge detection algorithms have been proposed, such as the Roberts, Prewitt, Kirsch, Gauss-Laplace and Canny detectors. Among these, the Canny algorithm [1] is the most widely used, due to its good performance and its ability to extract edges optimally even in images that are contaminated by Gaussian noise. The Canny algorithm achieves a low error rate by eliminating almost all non-edges and improving the localization of all identified edges.

Because of its algorithmic efficiency and applicability, many Canny implementations have been proposed. In [2] an implementation of a self-adapting threshold Canny algorithm is proposed. This design is FPGA based and intended for a mobile robot system. The results presented are for an Altera Cyclone FPGA and the highest frequency achieved is 27 MHz, which results in a 2.5 ms computation time for a 360x280 grayscale image. In [3] an industrial implementation for ceramic tile defect detection is presented, which defines the hysteresis thresholds with a histogram subtraction method. Canny edge detection on NVIDIA CUDA is presented in [4], which takes advantage of the CUDA framework to implement the entire Canny algorithm on a GPU. It achieves a 10.92 ms computation time for a 1024x1024 image. In [5] there is an implementation of an adaptive edge-detection filter on an FPGA using a combination of hardware and software components proposed by Altera. In [6] a reconfigurable architecture and implementation of edge detection using Handle-C is presented. This is a pipelined design of a Canny-like edge detection algorithm. It achieves a computation time of 4.2 ms for a 256x256 grayscale image.

Ours is a novel implementation of a Canny edge detector that takes advantage of 4-pixel parallel computation. It is a pipelined architecture that uses on-chip BRAM memories to cache data between the different stages. The exploitation of both hardware parallelism and pipelining creates a very efficient design that has the same memory requirements as a design without parallelism in pixel computation. This results in increased throughput for high resolution images and a computation time of 3.09 ms for a 1.2 Mpixel image on a Spartan-6 FPGA. We present synthesis and simulation results for low-end and high-end Xilinx FPGAs and have achieved higher speed, throughput and efficiency than the implementations presented above.

This work has been supported by National Funding and the European Regional Development Fund in the frame of EIDA 2007-2013 under grant No MIKP02-49-project LoC.

978-1-4244-8157-6/10/$26.00 ©2010 IEEE, ICECS 2010

II. CANNY EDGE DETECTION ALGORITHM

The block diagram of the Canny algorithm is shown in Fig. 1. The Canny algorithm first smooths the image to eliminate noise by using a smoothing filter such as Gaussian convolution. The Gaussian smoothing is performed by using a mask (matrix) which is slid over the image, manipulating a square of pixels at a time. The bigger the


dimensions of the mask, the lower the detector's sensitivity to noise. A 5x5 mask is a common choice for the size of a Gaussian filter.

Figure 1. Canny Algorithm block diagram (Gaussian Smoothing, Sobel Gradient Calculation, Non Maximum Suppression, Double Thresholding, Hysteresis)

After smoothing, the next step is the calculation of the gradient of the image. The gradient calculation yields the possible edge strength and direction. It is executed by another convolution with a gradient operator. The most commonly used gradient operators are the Prewitt and the Sobel operators. Both operators perform a 2-D spatial gradient measurement on an image. The Sobel edge detector uses a pair of 3x3 convolution masks, one estimating the gradient in the x-direction (columns) and the other estimating the gradient in the y-direction (rows). The Sobel operators are presented in Fig. 2.

$$ G_x = \begin{bmatrix} 1 & 0 & -1 \\ 2 & 0 & -2 \\ 1 & 0 & -1 \end{bmatrix}, \qquad G_y = \begin{bmatrix} 1 & 2 & 1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix} $$

Figure 2. 2-D Sobel operators

Gradient magnitude and orientation are defined by the following equations:

$$ |G| = \sqrt{G_x^2 + G_y^2} \tag{1} $$

$$ \theta = \arctan\left( G_y / G_x \right) \tag{2} $$

The following computational step is non maximum suppression, which is used to reduce the edge thickness and so improve localization. After non maximum suppression the image may still contain some spurious responses. These responses, which are called 'streaking', can be eliminated by the use of hysteresis thresholding. In this procedure two threshold values are set, a high threshold Th_h and a low threshold Th_l, with which the remaining edge gradient values are compared. Any pixel with a value greater than Th_h is presumed to be a definite edge and any pixel with a value greater than Th_l is considered to be a possible edge. Hysteresis is the procedure where every possible edge is eliminated, unless there is a path from this pixel to a pixel with a gradient above Th_h which includes only pixels with values above Th_l.

III. HARDWARE IMPLEMENTATION

The final goal of our implementation is to use the edge detection stage as a precursor step to feature extraction. This design will be part of a demanding machine vision system, therefore there is a substantial need for efficiency and power in our implementation with as limited use of resources as possible. The input images are 8-bit grayscale. The nature of our implementation requires successful edge detection for source images contaminated by noise, therefore the adoption of an implementation of the Canny algorithm was the most suitable option. To achieve high clock rates and throughput, a pipelined design was chosen that also exploits parallel pixel computation. Our block design follows the exact procedure of the Canny algorithm, therefore our implementation has 5 blocks:

• Gaussian Smoothing
• Sobel Gradient Calculation
• Non Maximum Suppression
• Double Thresholding
• Hysteresis

A. Gaussian Smoothing

The first computational block is the Gaussian smoothing. As previously described, in Gaussian smoothing a mask is convolved with the image pixels to produce an image with reduced noise. We found that a 5x5 mask was sufficient for our implementation. Therefore, for the calculation of one pixel the contents of 25 pixels are required. We chose to implement a design that takes advantage of 4-pixel parallel calculation, for which, as demonstrated in Fig. 3, 40 pixels are required. So for an 8-bit pixel and a 32-bit word of pixels, the cache reads are minimized by following the pattern demonstrated in Fig. 4.

For the pixel input a specialized cache is designed. Five cache memory elements are used, each one assigned to storing the data of one image line. We exploit the on-chip BRAMs, which in modern FPGAs are both abundant and fast. Each BRAM has a size of image width / 4 x 32 bits to accommodate a line of data aligned in 4-pixel words. The calculation of the first pixels begins as soon as the first 2 lines (2 x image width) of the cache are filled, since for the first line of borderline pixels the non-existing lines necessary for the calculations are considered to be black. Therefore 2 lines of data are already loaded in the cache while the third line is simultaneously loaded in the cache and fed to the calculating core. At the same time, we adjust the matrix normalization factor appropriately to maintain uniformity. Thus, the same size of on-chip memory is used as if the calculation were executed for only one pixel at a time. The results are stored in another cache that serves as an input for the following stage.

B. Sobel Gradient Calculation

In the Sobel gradient calculation block the same pixel parallelism principles are applied as in the Gaussian smoothing block. For the calculation of one pixel gradient 9 pixels are required, while for the calculation of four, 18 are required (Fig. 5). The gradients for both directions are calculated in parallel as well. For caching, we allocate 3 BRAMs, one for each line. Each BRAM has a size of image width / 4 x 32 bits in 4-pixel words, as in the previous stage. To start the block calculation only one full BRAM is required. The direction of the gradient is calculated by using fixed point arithmetic and by implementing the multiplication with shifts
and addition/subtraction to increase speed. The results are stored in a cache used as an input for the next stage.

Figure 3. 4-pixel parallel Gaussian smoothing calculation

Figure 4. Cache line read pattern

Figure 5. 4-pixel parallel Sobel gradient calculation

C. Non Maximum Suppression

Non maximum suppression also requires an 8-pixel neighborhood for the determination of each pixel's value. Therefore the parallelism implemented for the 4-pixel simultaneous calculation is the same as in Fig. 5. As in the Sobel gradient calculation, NMS starts as soon as one cache line is filled with data.

D. Double Thresholding and Hysteresis

Double thresholding is executed by a double comparator. No caching is required before thresholding and the data go straight through to the next stage of hysteresis. The data produced by the double thresholding stage has a size of 2 bits per pixel, as three different values need to be stored for each pixel: no edge, definite edge and possible edge.

The execution of the hysteresis stage also starts as soon as the first pixel data is received. At the same time the data is stored in a specialized cache for this stage with a size of 1 x image width x 2 bits. Hysteresis is basically a procedure of comparing the value of the pixel at hand with the values of the three neighboring pixels above it and the one directly on the left. If the pixel is a possible edge and one of the aforementioned neighboring pixels is a definite edge, then the pixel becomes a definite edge. Otherwise it is left as is. The original algorithm requires a comparison with all 8 neighboring pixels, but testing of our implementation has demonstrated that this reduced comparison is adequate as long as a second pass is executed with the pixels read in the opposite direction. The comparison is also executed for 4 pixels in parallel. The data produced from the first pass of hysteresis is still 2 bits for each pixel and it is stored in an external onboard memory (suitable for the size of the frame we use). For the second pass the image data is read in reverse, from the bottom right corner to the top left. The hysteresis is executed in exactly the same manner, and finally all the remaining possible edges are suppressed. The final output of the process is only 1 bit for each pixel, with all the definite edges detected.

IV. EXPERIMENTAL RESULTS

The above Canny implementation was synthesized for three different Xilinx FPGAs, ranging from the older and more economical Spartan-3E to a top-end Virtex 5 FPGA, by using the Xilinx ISE 12.1 toolchain [7]. The results are presented in Table I. As demonstrated, in the newer Spartan 6 (45 nm technology) and Virtex 5 (65 nm technology) FPGAs the final design implementation occupies only a small percentage of the FPGA's total area, and at the same time it achieves a high operating frequency of above 200 MHz. Even in the small Spartan-3E (90 nm technology) FPGA our design implementation achieves an operating frequency of 120 MHz and still leaves more than 70% of the FPGA free for use. The total on-chip memory used is 24 kbytes for all types of FPGA.

TABLE I. CANNY SYNTHESIS RESULTS

Synthesis Results    Gauss  Sobel  NMS  Db_Thres  Hysteresis  Total  Total (%)  Frequency (MHz)
Spartan 3E Slices    2613   1054   649  37        84          4200   28%        120.4
Spartan 6 Slices     2418   1391   651  36        126         4560   2%         201.4
Virtex 5 Slices      2409   1389   648  40        124         4553   6%         292.8

Our Canny implementation will be used by a lab on chip system on a Spartan 6 FPGA. Therefore we use the results produced by the Spartan 6 synthesis for our simulation. We simulate the design by using 3 different image file sources, which are 8-bit grayscale files of varying sizes. Fig. 6 shows the input and the output files of the Canny implementation. Fig. 6.c and 6.d are an example frame of a video for a lab on chip experiment. The timing results are presented in Table II.

TABLE II. CANNY TIME RESULTS

Image File   Size      Time (ms)
lena         512x512   0.66
HCLAChip     960x540   1.31
Disc-brake   1280x960  3.09

As can be seen, for an image of 1.2 Mpixel we have achieved a computation time of 3.09 ms. Thus, for an image of 1 Mpixel a rate of 396 frames per second has been achieved, which is far beyond our set specifications for a real-time design. On the Virtex-5 the throughput reaches 580 frames per second for 1 Mpixel images. Even in the low-end Spartan-3E
FPGA a throughput of 240 frames per second has been reached for the same image size.

Figure 6. (a) lena input, (b) lena output, (c) HCLAchip input, (d) HCLAchip output, (e) disc-brake input, (f) disc-brake output

V. CONCLUSIONS

In this paper a parallel design of a real-time Canny implementation is presented. In this Canny edge detector a parallel architecture of simultaneous 4-pixel calculation is proposed, which increases the throughput of the design without increasing the need for on-chip cache memories. The design has been synthesized for low-end and high-end Xilinx FPGAs, achieving rates that range from 240 frames per second for 1 Mpixel images on a Spartan-3E, occupying 28% of the area of the chip, to 580 frames per second on a Virtex-5, occupying 6% of the area of the chip. On the Spartan 6 a computation time of 3.09 ms was achieved for a 1.2 Mpixel 8-bit grayscale image and a rate of 396 frames per second for 1 Mpixel images, while only 2% of the total area of the chip is occupied.

REFERENCES

[1] J.F. Canny, "A computational approach to edge detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 8, no. 6, pp. 679-698, November 1986
[2] W. He and K. Yuan, "An improved Canny Edge Detector and its Realization on FPGA," Proc. of 7th World Conference on Intelligent Control and Automation, 2008
[3] H. Zeljko, V. Suzana and H. Verica, "Improved Canny Edge Detector in Ceramic Tiles Defect Detection," IEEE Industrial Electronics, IECON 2006 - 32nd Annual Conference, pp. 3328-3331, November 2006
[4] Y. Luo and R. Duraiswami, "Canny Edge Detection on NVIDIA CUDA," Proc. of IEEE Computer Vision and Pattern Recognition Workshops, 2008, pp. 1-8
[5] H.S. Neoh and A. Hazanchuk, "Adaptive Edge Detection for Real-Time Video Processing using FPGAs," Altera Corp.
[6] D. V. Rao and M. Venkatesan, "An Efficient Reconfigurable Architecture and Implementation of Edge Detection Algorithm Using Handle-C," Proc. of International Conference on Information Technology: Coding and Computing, ITCC 2004, Vol. 2, pp. 843-847
[7] "Xilinx ISE 12.1 - Synthesis and Simulation Design Guide", Xilinx Corp.
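
As an illustration of the Gaussian smoothing block (Section III.A), the following Python sketch applies a 5x5 mask, treats pixels outside the frame as black, and rescales the normalization factor to the weights that fall inside the frame. The mask coefficients and the name `gaussian_smooth` are assumptions made for this example: the paper does not list its coefficients, and dividing by the in-frame weights is only one possible reading of the normalization adjustment it describes. The sketch is sequential software, not a model of the 4-pixel parallel hardware.

```python
# Assumed 5x5 integer Gaussian mask (weights sum to 159); the paper does not
# give its coefficients, so these values are illustrative only.
MASK = [
    [2, 4, 5, 4, 2],
    [4, 9, 12, 9, 4],
    [5, 12, 15, 12, 5],
    [4, 9, 12, 9, 4],
    [2, 4, 5, 4, 2],
]


def gaussian_smooth(image):
    """Smooth a 2-D list of 8-bit pixels with the 5x5 mask.

    Pixels outside the frame contribute nothing (they count as black), and
    the normalization factor is the sum of the weights inside the frame, so
    a uniform image stays uniform at the borders (assumed interpretation).
    """
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0   # weighted sum of in-frame pixels
            norm = 0  # sum of the weights actually used
            for dy in range(-2, 3):
                for dx in range(-2, 3):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < height and 0 <= xx < width:
                        weight = MASK[dy + 2][dx + 2]
                        acc += weight * image[yy][xx]
                        norm += weight
            out[y][x] = (acc + norm // 2) // norm  # rounded integer division
    return out
```

With this normalization a flat test image is returned unchanged, including its border pixels, which is the uniformity property the text refers to.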

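
The Sobel step (Fig. 2, Eqs. 1 and 2) and the shift-and-add direction calculation mentioned in Section III.B can be sketched in Python as follows. The fixed-point constant 53/128 for tan(22.5°), the four-sector quantization, and the function names are assumptions chosen for this example; the paper states only that fixed point arithmetic with shifts and additions/subtractions is used, not which constants or sectors.

```python
import math

# Sobel masks as in Fig. 2, applied here as a correlation (no kernel flip).
SOBEL_X = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
SOBEL_Y = [[1, 2, 1], [0, 0, 0], [-1, -2, -1]]


def sobel_at(image, y, x):
    """Return (gx, gy) for the interior pixel at row y, column x."""
    gx = gy = 0
    for dy in range(3):
        for dx in range(3):
            pixel = image[y + dy - 1][x + dx - 1]
            gx += SOBEL_X[dy][dx] * pixel
            gy += SOBEL_Y[dy][dx] * pixel
    return gx, gy


def magnitude(gx, gy):
    """Gradient magnitude, Eq. (1)."""
    return math.sqrt(gx * gx + gy * gy)


def times_53(value):
    """Multiply by 53 using only shifts and additions (53 = 32 + 16 + 4 + 1)."""
    return (value << 5) + (value << 4) + (value << 2) + value


def direction_sector(gx, gy):
    """Quantize the gradient direction to 0, 45, 90 or 135 degrees.

    Instead of evaluating arctan (Eq. 2), |gy| is compared with
    |gx| * tan(22.5 deg) and |gx| * tan(67.5 deg), both scaled by 128.
    tan(22.5 deg) is approximated by 53/128 and tan(67.5 deg) by 2 + 53/128.
    """
    ax, ay = abs(gx), abs(gy)
    low = times_53(ax)           # about 128 * |gx| * tan(22.5 deg)
    high = (ax << 8) + low       # about 128 * |gx| * tan(67.5 deg)
    ay_scaled = ay << 7          # 128 * |gy|
    if ay_scaled <= low:
        return 0
    if ay_scaled >= high:
        return 90
    # Convention of this sketch: 45 when gx and gy share a sign, else 135.
    return 45 if (gx > 0) == (gy > 0) else 135
```

Comparing scaled integers avoids both the division and the arctangent of Eq. (2), which is the kind of saving that shift-based arithmetic offers in hardware.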

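
The pixel counts and cache sizes quoted in Section III follow from simple arithmetic, sketched below in Python. The formula in `window_pixels` assumes that the four output pixels are horizontally adjacent so that their mask windows overlap; this matches the 40-pixel and 18-pixel figures in the text but is an inference, and all function names are hypothetical.

```python
def window_pixels(mask_size, parallel_pixels):
    """Input pixels needed for `parallel_pixels` adjacent outputs of a
    mask_size x mask_size mask: mask_size rows by
    (mask_size + parallel_pixels - 1) columns (assumed window layout)."""
    return mask_size * (mask_size + parallel_pixels - 1)


def line_buffer_bits(image_width, pixels_per_word=4, bits_per_pixel=8):
    """Size in bits of one line cache: (image width / 4) words of 32 bits
    for the default 4-pixel words of 8-bit pixels."""
    words = image_width // pixels_per_word
    return words * pixels_per_word * bits_per_pixel


def hysteresis_cache_bits(image_width, bits_per_label=2):
    """Size in bits of the hysteresis cache: 1 x image width x 2 bits."""
    return image_width * bits_per_label
```

For example, `window_pixels(5, 4)` gives 40 and `window_pixels(3, 4)` gives 18, against 25 and 9 for a single pixel. `line_buffer_bits(1280)` is the same whether pixels are packed one or four to a word, which illustrates why the parallel design needs no more line memory than a one-pixel design.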

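
The double thresholding and two-pass hysteresis of Section III.D can be sketched in Python as below. This is a sequential software illustration under stated assumptions, not the 4-pixel parallel hardware: the label values, the function names, and the in-place update within a pass are choices made for this example.

```python
NO_EDGE, POSSIBLE, DEFINITE = 0, 1, 2  # 2-bit labels (encoding assumed)


def double_threshold(magnitudes, th_low, th_high):
    """Label each gradient magnitude as no edge, possible or definite edge."""
    return [
        [DEFINITE if m > th_high else POSSIBLE if m > th_low else NO_EDGE
         for m in row]
        for row in magnitudes
    ]


def _scan(labels, reverse):
    """One hysteresis pass, in place. Each possible edge is compared with the
    three pixels of the previously scanned row and the previously scanned
    pixel of its own row (above and left in the forward pass, below and
    right in the reverse pass)."""
    height, width = len(labels), len(labels[0])
    step = -1 if reverse else 1
    rows = range(height - 1, -1, -1) if reverse else range(height)
    cols = range(width - 1, -1, -1) if reverse else range(width)
    for y in rows:
        for x in cols:
            if labels[y][x] != POSSIBLE:
                continue
            prev_row = y - step
            neighbours = ((prev_row, x - 1), (prev_row, x),
                          (prev_row, x + 1), (y, x - step))
            for ny, nx in neighbours:
                inside = 0 <= ny < height and 0 <= nx < width
                if inside and labels[ny][nx] == DEFINITE:
                    labels[y][x] = DEFINITE
                    break


def hysteresis(labels):
    """Forward pass, reverse pass, then suppress remaining possible edges.
    Returns a 1-bit edge map."""
    work = [row[:] for row in labels]
    _scan(work, reverse=False)
    _scan(work, reverse=True)
    return [[1 if value == DEFINITE else 0 for value in row] for row in work]
```

A possible edge whose only definite neighbour lies below it or to its right is missed by the forward pass and picked up by the reverse pass, which is why the second pass is needed when only four neighbours are examined.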


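
The timing results of Section IV can be compared with a back-of-the-envelope model, sketched below in Python. The model is an assumption of this example and is not stated in the paper: it supposes that the design consumes 4 pixels per clock cycle and that each frame is scanned twice (once per hysteresis pass), and it ignores pipeline latency and memory overhead. The function names are hypothetical.

```python
def frame_time_ms(width, height, clock_mhz, pixels_per_clock=4, passes=2):
    """Estimated frame time in milliseconds under the assumed model:
    passes * width * height / pixels_per_clock clock cycles per frame."""
    cycles = passes * width * height / pixels_per_clock
    return cycles / (clock_mhz * 1_000.0)  # clock_mhz * 1000 cycles per ms


def frames_per_second(pixels, clock_mhz, pixels_per_clock=4, passes=2):
    """Estimated frame rate for a frame of `pixels` pixels."""
    cycles = passes * pixels / pixels_per_clock
    return clock_mhz * 1e6 / cycles


if __name__ == "__main__":
    # Spartan 6 clock from Table I, image sizes from Table II.
    for name, (w, h) in {"lena": (512, 512), "HCLAChip": (960, 540),
                         "Disc-brake": (1280, 960)}.items():
        print(name, round(frame_time_ms(w, h, 201.4), 2), "ms")
```

Under these assumptions the estimate for the 1280x960 image at 201.4 MHz is about 3.05 ms against the reported 3.09 ms, and about 403 frames per second for 1 Mpixel against the reported 396. The estimates fall slightly below the measured times for all three images, so the model should be read as a rough consistency check only.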