
DEEP LEARNING BASED CROP ROW DETECTION

by
Rashed Mohammad Doha

A Thesis
Submitted to the Faculty of Purdue University
In Partial Fulfillment of the Requirements for the degree of

Master of Science in Mechanical Engineering

Department of Mechanical and Energy Engineering


Indianapolis, Indiana
May 2022
THE PURDUE UNIVERSITY GRADUATE SCHOOL
STATEMENT OF COMMITTEE APPROVAL

Dr. Sohel Anwar, Chair


Department of Mechanical and Energy Engineering

Dr. Mohammad Al Hasan


Department of Computer and Information Science

Dr. Lingxi Li
Department of Electrical and Computer Engineering

Approved by:
Dr. Likhun Zhu

2
ACKNOWLEDGMENTS

This research was performed at Indiana University-Purdue University Indianapolis. Funding for this research came from a research grant by Equipment Technologies, Mooresville, Indiana. I would like to thank everybody associated with the project for the successful completion of my thesis.

3
PREFACE

Before you lies the thesis titled “Deep Learning Based Crop Row Detection”, the basis of which is an in-depth investigation into both supervised and unsupervised paradigms for detecting crop rows from images. It has been written to fulfill the graduation requirements of the Master’s in Mechanical Engineering program at Indiana University-Purdue University Indianapolis. I was engaged in researching and writing this thesis from January 2020 to December 2021.
This project was undertaken at the request of Equipment Technologies, located in Mooresville, Indiana. My research question was formulated together with my thesis supervisor, Dr. Sohel Anwar. The work was also co-supervised by Dr. Mohammad Al Hasan. The research was difficult, but conducting an extensive investigation has allowed me to answer the question that we identified. Fortunately, both Dr. Anwar and Dr. Hasan were always available and willing to answer my queries.
I would like to thank my supervisors for their excellent guidance and support during this
process. Furthermore, I would like to thank my colleague Dr. Nazmuzzaman Khan who
provided great insight into traditional computer vision based solutions for the detection of
crop rows from images.
To my other colleagues at Indiana University-Purdue University Indianapolis, I would
like to thank you for your wonderful cooperation as well. It was always helpful to test out
my ideas about my research with you. I also benefited from debating issues with my friends
and family. If I ever lost interest, you kept me motivated. My parents deserve a particular
note of thanks: your wise counsel and kind words have, as always, served me well.
I hope you enjoy your reading.
Rashed Doha
Indianapolis, March 31, 2022

4
TABLE OF CONTENTS

LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

LIST OF SYMBOLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

ABBREVIATIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2 RELATED WORK & LITERATURE REVIEW . . . . . . . . . . . . . . . . . . . 18


2.1 Traditional Computer Vision Methods . . . . . . . . . . . . . . . . . . . . . 19
2.1.1 Hough Transform based methods . . . . . . . . . . . . . . . . . . . . 19
2.1.1.1 Preprocessing Methods . . . . . . . . . . . . . . . . . . . . 19
2.1.1.2 Application of Hough Transform . . . . . . . . . . . . . . . 23
2.1.1.3 Randomized Hough Transform . . . . . . . . . . . . . . . . 23
2.1.2 Horizontal Strip Based methods . . . . . . . . . . . . . . . . . . . . . 24
2.1.3 Linear Regression based methods . . . . . . . . . . . . . . . . . . . . 26
2.1.4 Stereo Vision Based Approaches . . . . . . . . . . . . . . . . . . . . 27

3 METHODOLOGY, DATASET AND MODELING . . . . . . . . . . . . . . . . . 28


3.1 Crop Row Benchmark Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2 Crop Row Modeling and Assumptions . . . . . . . . . . . . . . . . . . . . . 31

4 BACKBONE: U-NET ARCHITECTURE . . . . . . . . . . . . . . . . . . . . . . 35

5 DATA PIPELINE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.1 Low Level processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.2 Train and test set split . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

6 ALGORITHMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

6.1 Tuning Network Predictions . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
6.1.1 Refining the output . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
6.2 Choosing central rows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
6.3 Pre-Calibration and density estimation . . . . . . . . . . . . . . . . . . . . . 44
6.4 Rows of interest and largest clusters . . . . . . . . . . . . . . . . . . . . . . 44
6.5 Density Estimation of row clusters . . . . . . . . . . . . . . . . . . . . . . . 46
6.6 Frame extrapolation and temporal corrections . . . . . . . . . . . . . . . . . 47

7 FULLY SUPERVISED MODEL . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49


7.1 Larger dataset and models tested . . . . . . . . . . . . . . . . . . . . . . . . 49
7.1.1 Deeper-UNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
7.1.2 MA-Net . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
7.1.3 LinkNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
7.1.4 PSPNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
7.1.5 Feature Pyramid Network . . . . . . . . . . . . . . . . . . . . . . . . 53
7.2 End-to-end supervised learning with feature pyramid network . . . . . . . . 55
7.2.1 Dataset and train/test split . . . . . . . . . . . . . . . . . . . . . . . 56
7.2.2 Region of Interest Selection . . . . . . . . . . . . . . . . . . . . . . . 57
7.2.3 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . 57
7.2.4 Loss Function and Training Scheme . . . . . . . . . . . . . . . . . . . 57
7.2.5 Testing and Validation . . . . . . . . . . . . . . . . . . . . . . . . . . 58

8 EXPERIMENTS AND RESULTS . . . . . . . . . . . . . . . . . . . . . . . . . . . 59


8.1 Semi-Supervised Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
8.1.1 Distribution of row predictions . . . . . . . . . . . . . . . . . . . . . 59
8.1.2 Ablation study with and without extrapolation . . . . . . . . . . . . 61
8.1.3 Comparison of U-Net with unsupervised methods . . . . . . . . . . . 62
8.2 Fully Supervised Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

9 CONCLUSION AND FUTURE WORK . . . . . . . . . . . . . . . . . . . . . . . 66


9.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

9.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

7
LIST OF TABLES

4.1 Architectural Details of the U-net. The number of features between encoder blocks
and decoder blocks is consistent to allow skip connections. . . . . . . . . . . 38
8.1 Changes in variance of parameters in subsequent frames, averaged across central
four crop rows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
8.2 Processing time of algorithm on Nvidia Jetson TX2 . . . . . . . . . . . . . . . . 61
8.3 Comparison of our model’s mIoU scores and inference time per single image on
the test set with unsupervised computer vision algorithms . . . . . . . . . . . . 63

8
LIST OF FIGURES

1.1 System diagram for cascading clustering based crop row detection. . . . . . . . . 15
2.1 Definition of the ROI by Rovira-Más et al. [13] . . . . . . . . . . . . . . . . . . . 20
2.2 Binarized ROI using dynamic thresholding. Taken from [13] . . . . . . . . . . . 21
2.3 Greyscale image divided into horizontal strips. Taken from [8] . . . . . . . . . . 25
2.4 Result of binarization of greyscale transformation by [17] . . . . . . . . . . . . . 26
3.1 Sample crop row image from CRBD . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2 Sample ground truth annotation from CRBD . . . . . . . . . . . . . . . . . . . 31
3.3 Modeling approaches: Object detection vs Semantic Segmentation . . . . . . . . 32
3.4 ROI for Crop row image . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.1 U-net Architecture for segmenting crop rows. The encoder consists of a down-
sampling path made up of four blocks (green) and the decoder consists of an
upsampling path made up of four blocks (blue). Skip connections are added be-
tween corresponding encoder-decoder blocks to pass on features learned by the
encoder to the decoder. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.1 Some samples from the augmented dataset . . . . . . . . . . . . . . . . . . . . . 41
6.1 Sample input/output of the U-Net . . . . . . . . . . . . . . . . . . . . . . . . . 43
6.2 Density based clustering of predicted row parameters (feature scaled) . . . . . . 45
6.3 Kmeans clustering of row representations . . . . . . . . . . . . . . . . . . . . . . 46
7.1 Availability of larger dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
7.2 Deeper U-Net Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
7.3 The total architecture of MA-Net . . . . . . . . . . . . . . . . . . . . . . . . . . 52
7.4 The Position-wise Attention Block (PAB). The input image is HxWx256 and
output is HxWx512. The attention feature map is obtained by Softmax function. 52
7.5 LinkNet Architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
7.6 Overview of PSPNet architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . 55

7.7 (a) Using an image pyramid to build a feature pyramid. Features are computed
on each of the image scales independently, which is slow. (b) Recent detection
systems have opted to use only single scale features for faster detection. (c) An
alternative is to reuse the pyramidal feature hierarchy computed by a ConvNet as
if it were a featurized image pyramid. (d) Our proposed Feature Pyramid Network
(FPN) is fast like (b) and (c), but more accurate. In this figure, feature maps
are indicated by blue outlines and thicker outlines denote semantically stronger
features.[42] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
8.1 Distribution of row predictions in parameter space . . . . . . . . . . . . . . . . 60
8.2 Example of frame extrapolation to fix errors in predicted rows. Here frames 1 and
2 are used to extrapolate the third frame since confidence of prediction for the
left and right row did not cross the threshold in the case of the original prediction. 62
8.3 Training and Validation dice losses. . . . . . . . . . . . . . . . . . . . . . . . . . 64
8.4 Training and Validation mIoU Scores. . . . . . . . . . . . . . . . . . . . . . . . 65

10
LIST OF SYMBOLS

α learning rate
m slope of straight line
c intercept of straight line
µ mean
σ variance
Σ sum over elements

11
ABBREVIATIONS

ML Machine Learning
NN Neural Network
HT Hough Transform
RHT Randomized Hough Transform
CNN Convolutional Neural Network
CRBD Crop Row Benchmark Dataset
ROI Region of Interest
ReLU Rectified Linear Unit
BN Batch Normalisation
IoU Intersection over Union
mIoU Mean Intersection over Union
SGD Stochastic Gradient Descent
MA-Net Multi-scale Attention Network
PSPNet Pyramid Scene Parsing Network
FPN Feature Pyramid Network
SMP Segmentation Models Pytorch

12
ABSTRACT

Detecting crop rows from video frames in real time is a fundamental challenge in the
field of precision agriculture. The deep learning based semantic segmentation method U-net, although successful in many tasks related to precision agriculture, performs poorly on this task. The reasons include the paucity of large scale labeled datasets in this domain, diversity in crops, and the diversity of appearance of the same crops at various stages of their growth. In this work, we discuss the development of a practical real-life crop row detection system in collaboration with an agricultural sprayer company. Our proposed method takes the output of semantic segmentation using U-net, and then applies a clustering based probabilistic temporal calibration which can adapt to different fields and crops without the need for retraining the network. Experimental results validate that our method can be used both for refining the results of the U-net to reduce errors and for frame interpolation
of the input video stream. Upon the availability of more labeled data, we switched our
approach from a semi-supervised model to a fully supervised end-to-end crop row detection
model using a Feature Pyramid Network or FPN. Central to the FPN is a pyramid pooling
module that extracts features from the input image at multiple resolutions. This results in
the network’s ability to use both local and global features in classifying pixels as crop
rows. After training the FPN on the labeled dataset, our method obtained a mean IoU or
Jaccard Index score of over 70% as reported on the test set. We trained our method on only a
subset of the corn dataset and tested its performance on multiple variations of weed pressure
and crop growth stages to verify that the performance does translate over the variations and
is consistent across the entire dataset.

13
1. INTRODUCTION

Site-specific treatments in agriculture, such as the application of herbicides and fertilizers in crop fields, are still largely performed manually in many parts of the world through tedious and time-consuming labor. The consequences are greater cost, inefficiency, and the inability to apply treatments on time, resulting in poor yield and a delayed final harvest. For this reason, manual labor is increasingly being replaced by precision and autonomous agriculture technologies in both developing and developed countries. A
vital component of precision agriculture is agricultural sprayers (also called crop sprayers),
which are agricultural vehicles that maneuver in between crop rows at various stages of the
crop’s growth to apply water, herbicides, and fertilizers. A much needed feature in such
agricultural vehicles is to detect the locations of crop rows in their view and provide a visual
projection of the crop rows on a dashboard for aiding the drivers in following a safe and
optimal route along the crop fields [1]. Such detection of crop rows is also a first step for
autonomous navigation of agricultural vehicles in the crop fields. To achieve this successfully,
the detection and identification of crop rows from image/video data is the first hurdle—a
machine learning and computer vision task, which is the focus of this work.
Crop row identification in real-life scenarios suffers from multiple challenges. For instance,
outdoor agricultural environments are susceptible to a diverse range of lighting conditions
which contribute to poor quality of the images taken. Furthermore, if the camera is on-board
a moving vehicle, the turbulence caused by the motion of the vehicle causes perturbations in
the perspective of the camera—another contributing factor to poor image quality. Besides
this, high weed density with spectral signature similar to the crop rows can confuse the
vision algorithms in their row detection task. Different crops have different row widths; in
addition, the growth stage of the crops in the rows adds another degree of freedom to the
shape and size of the crop rows; this causes crop rows to have a variable distance between
them resulting in inconsistent densities of crop row pixels over different crop row images.
Finally, imperfections and occlusions are very common in real world scenarios: ground patches in between crop rows, artifacts in the form of objects that occlude portions of the crop rows and, in general, a diverse range of pixel densities resulting from different soil conditions all make accurate crop row identification a very difficult task.

Figure 1.1. System diagram for cascading clustering based crop row detection.

In existing works, different computer vision based approaches have been adopted as possible solutions for dealing with the diverse range of crop row images. These solutions were designed to cope with specific subsets of the challenges discussed in the previous paragraph. For example, the Hough Transform [2] is a feature extraction algorithm that maps the image from the pixel space to the parameter space in order to determine geometric shapes, such as
lines. Fitting straight lines through crop rows could prove to be fruitful for cases where the
rows are perfectly straight and parallel but the solution fails in the case of curved crop rows.
Pixel accumulator based algorithms [3] help in determining centroids of regions of a certain
color thereby being robust to different curvatures a crop row may exhibit. But since the
color of rows varies and their spectral signatures can often be similar to those of inter-crop
row weed clusters, building a model robust and consistent enough for practical use becomes
a difficult task.
Some of the existing methods have demonstrated impressive results on smaller datasets consisting of only one type of crop and with little to no variation [4]–[6]; however, they lack the ability to adapt to new crop fields that differ significantly from the training set images. Existing methods also work on image data only, ignoring the temporal sequence of frames coming from the video feed, which could provide additional information to aid in improving continuous detection of rows. These are the key challenges for bringing a crop row detection based driver assistance tool on board agricultural vehicles.
In this work, we present our ongoing effort to build a practical crop row detection sys-
tem to be deployed in agricultural sprayers. We discuss the challenges that we have faced
while using off-the-shelf methodologies for solving this task and also discuss our solutions to
overcoming these challenges. Our overall solution is a novel crop row detection method that
augments the predictions from a U-net based convolutional neural network model trained
on a handful of crop row images. The critical component of our system is a clustering-based
probabilistic temporal calibration, which can adapt to different fields without the need for
retraining the network. This capability is vital for the industrial deployment of such equipment in order to reduce dependence on data collection throughout the year. In terms of methodology, our contribution is to augment a supervised machine learning based method with a clustering based online pattern detection mechanism, so that the overall system can adapt the detection method to new fields in an unsupervised manner.
We summarize our main contributions using this method as follows:

1. We design a robust system for fast detection of crop rows from video frames.

2. We design a mechanism for adapting the model to unseen crop fields through a brief
calibration phase.

3. We introduce a method to extrapolate the model’s predictions by taking into account


past predictions thereby taking advantage of the temporal aspect of the data.

The overall system diagram is illustrated in Figure 1.1. As can be seen, continuous video frames from an on-board camera are streamed into a prediction system, which is calibrated initially for a given crop field. The detection system detects the slope, intercept, and confidence value of the central four crop rows in the camera’s field of view. The confidence value for each row represents how confident the system is in its detection of that specific row. Using these data, the rows are displayed on a dashboard screen for assisting the driver.
Following this method, and upon availability of a larger dataset, we opted for an end-
to-end method based on a feature pyramid network [7] as the feature extractor for the CNN. One of the foremost challenges of our previous U-Net based detection system was its
inability to detect features at different scales thus resulting in good local predictions but
poor global predictions over the whole image. Feature pyramids are a basic component in
recognition systems for detecting objects at different scales. But because they are compute
and resource heavy, deep learning based segmentation approaches have avoided utilizing their
unique characteristics. We employ the architecture in [7], named a feature pyramid network
to tackle this issue. Feature pyramid networks exploit the inherent multi-scale, pyramidal
hierarchy of convolutional neural networks to construct feature pyramids with marginal extra
cost.

17
2. RELATED WORK & LITERATURE REVIEW

Recently, precision agriculture has grown at a remarkable rate, attracting a great number of researchers and practitioners. The dependence on manual labor for site-specific treatments in agriculture, such as the spraying of herbicides and fertilizers, naturally leads to low yield and late harvest. Precision agriculture aims to optimize the entire agricultural pipeline from the early stages of sowing seeds in the soil to the harvesting and transportation of crops. Therefore, it has become one of the most popular research directions and plays a significant role in the industry. Control and perception are two aspects of precision agriculture that must work together in the final solution. In terms of perception, crop row detection is one of the core components that ultimately impact how effective the control method is.
Different types of approaches have been proposed for detecting crop rows from images.
From the perspective of feature engineering to distinguish crop rows, crop row detection
methods can be subdivided into two categories: traditional computer vision based methods
and machine learning based models. For traditional computer vision based approaches,
the Hough transform is a representative algorithm that has been employed heavily for modeling the geometric shape of the crop rows as straight or curved lines. Besides the Hough Transform, these methods can be classified into a number of categories such as horizontal strips, linear regression, blob analysis and stereo based. Horizontal strips were employed by Søgaard and Olsen [8] to divide the greyscale transformed image of crop rows into 15 distinct horizontal strips. The vertical sum of each strip was used to detect peaks along the center line of the crop rows. This approach approximates the crop row shapes as linear within each horizontal strip, thereby allowing the summing approach to be effective.
In terms of machine learning based methods, most methods rely on deep neural networks
to learn a latent representation of the crop rows that is not necessarily a sole function of the
geometric shape/luminosity of the crop rows.

18
2.1 Traditional Computer Vision Methods

Several strategies have been proposed for crop row detection. As discussed previously,
we can divide these methods based on the core principle of each of them.

2.1.1 Hough Transform based methods

The Hough Transform was first introduced in 1962 for detecting lines, curves and circles
[9]. It is one of the most commonly used machine vision methods for identifying crop rows.
The basic principle of the hough transform is the accumulation of votes and detection of
peaks in the parameter space. The first use of the Hough transform algorithm to detect the
center line of crops was proposed by Marchant et al.[10].

2.1.1.1 Preprocessing Methods

Since Hough transform based methods rely on the edge information in the image pixel space, these edges have conventionally been detected by applying various preprocessing steps. Jiang et al. [11] applied a grayscale transformation on the RGB images by emphasizing the green value and restraining the red and blue values. The principle of the grayscale transformation used is shown in Equation (2.1).

pixel(x, y) = 0,            if 2G ≤ R + B
pixel(x, y) = 2G − R − B,   otherwise                      (2.1)
pixel(x, y) = 255,          if 2G ≥ R + B + 255

where G, R, and B are the green, red and blue values of pixel(x, y). In order to eliminate noise and as a precursor to the edge detection process, the images are binarized. The popular Otsu thresholding [12] method was used to binarize the image. The basic principle of the Otsu method is to look for an optimal threshold value that divides the grey-level histogram of an
image into two parts on the condition that between-cluster variance is maximal. By using
the green values of the image as the optimal threshold, Jiang et al. attempted to optimally
separate the crop rows from the background by using color as a defining feature of crop rows.
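As an illustration, the following minimal NumPy/OpenCV sketch applies the grayscale transformation of Equation (2.1) followed by Otsu binarization; the function names and the use of OpenCV are our own choices for exposition and are not taken from the implementation of [11].

import cv2
import numpy as np

def excess_green_gray(image_bgr):
    """Grayscale transform of Eq. (2.1): clip 2G - R - B into [0, 255]."""
    b = image_bgr[..., 0].astype(np.int16)
    g = image_bgr[..., 1].astype(np.int16)
    r = image_bgr[..., 2].astype(np.int16)
    gray = 2 * g - r - b
    return np.clip(gray, 0, 255).astype(np.uint8)

def binarize_otsu(gray):
    """Otsu's method picks the threshold that maximizes between-class variance."""
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary

# Hypothetical usage:
# mask = binarize_otsu(excess_green_gray(cv2.imread("crop_row.jpg")))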

19
Figure 2.1. Definition of the ROI by Rovira-Más et al. [13]

Although the Hough transform has been applied previously for detecting crop rows, the complexity of the algorithm makes real-time performance difficult to achieve. A horizontal division method was used in [11] to address this.
Rovira-Más et al. [13] defined a square region of interest (ROI) in the image to reduce
computational cost of processing the whole image. Only pixels inside the defined ROI were
to be processed. A sample ROI used is shown in Figure 2.1. Following the ROI definition,
a binarization process was employed. The binarization applied a threshold on every pixel
inside the ROI in such a way that those pixels with a grey value less than the threshold are mapped to black, whereas those with a grey level higher than the threshold are converted to white.

Figure 2.2. Binarized ROI using dynamic thresholding. Taken from [13]

A dynamic threshold was chosen in their method, owing to which every image captured and processed was binarized using a different threshold value. The dynamic threshold was calculated using the equation

Th = Gav + ∆(Gmax − Gav ) (2.2)

where
Th = threshold to be applied to the ROI
Gav = average grey level, sampled every five pixels
∆ = offset, to be found experimentally, normally ranging between -0.5 and 0.5
Gmax = maximum grey level of the pixels sampled inside the ROI
As an example, the ROI shown in Figure 2.1 was binarized using the following parameters:
Gav = 102, ∆ = 0.2, Gmax = 138 and Th = 94. The result of the dynamic binarization process
is shown in Figure 2.2.
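The dynamic thresholding of Equation (2.2) can be sketched as follows; the sampling stride and function signature are assumptions for illustration and do not reproduce the exact implementation of [13].

import numpy as np

def dynamic_threshold_binarize(roi_gray, delta=0.2, step=5):
    """Binarize an ROI with Th = Gav + delta * (Gmax - Gav) from Eq. (2.2).

    Gav and Gmax are taken from every `step`-th pixel of the ROI; the exact
    sampling pattern is an assumption.
    """
    sampled = roi_gray.ravel()[::step]
    g_av = sampled.mean()
    g_max = sampled.max()
    th = g_av + delta * (g_max - g_av)
    return np.where(roi_gray > th, 255, 0).astype(np.uint8)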

21
Similar to Jiang et al., [13] found that despite the ROI selection, the complexity of the Hough transform algorithm made efficient application a challenge. Therefore, the midpoint
encoder was introduced to further reduce the amount of data to be processed. The midpoint
encoder converts a binarized crop row into a line of 1 pixel width designating the skeleton of
the shape of the crop row. To apply the hough transform, they deemed it essential to have
at least two complete crop rows inside the ROI. The algorithm was designed to deal only with rows completely inside the ROI. Therefore, the midpoint encoder neglected incomplete
rows in order to avoid false lines while attempting to detect the crop rows.
Bakker et al. [4] applied an inverse perspective projection to the image to make the crop
rows parallel to one another. This undoes the perspective projection that the images go
through as a result of the camera looking downward at an angle from the vertical. Bakker et al. achieved the greyscale image transformation by maximizing the contrast between the green plants and the soil background. The pixel intensity value was calculated as

I = 2g − r − b (2.3)

The calculation of r, g and b was done in three different ways to find the optimum in terms of accuracy. This was first done without normalization by taking the RGB values of the current pixel and assigning them to r, g and b respectively.

r = Rc , g = Gc , b = Bc (2.4)

Then r,g and b were also calculated via image normalization in accordance with Woebbecke
et al. [14].

r = Rc / (Rc + Gc + Bc),  g = Gc / (Rc + Gc + Bc),  b = Bc / (Rc + Gc + Bc)    (2.5)

The third method for obtaining r, g and b was to simply normalize the pixel values by the maximum value of the color intensity in the image. Mathematically this can be represented as

r = Rc / Rm,  g = Gc / Gm,  b = Bc / Bm    (2.6)
where Rm, Gm and Bm are the maximum RGB values for the image.
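The three ways of obtaining r, g and b, together with the intensity of Equation (2.3), could be combined as in the following illustrative sketch; the function interface and mode names are assumptions and are not taken from [4].

import numpy as np

def excess_green_intensity(image_rgb, mode="none"):
    """Compute I = 2g - r - b (Eq. 2.3) under the normalizations of Eqs. (2.4)-(2.6)."""
    rgb = image_rgb.astype(np.float64)
    r_c, g_c, b_c = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    if mode == "none":           # Eq. (2.4): raw channel values
        r, g, b = r_c, g_c, b_c
    elif mode == "chromatic":    # Eq. (2.5): divide by R + G + B per pixel
        s = r_c + g_c + b_c + 1e-9
        r, g, b = r_c / s, g_c / s, b_c / s
    elif mode == "max":          # Eq. (2.6): divide by the per-channel maxima
        r, g, b = r_c / r_c.max(), g_c / g_c.max(), b_c / b_c.max()
    else:
        raise ValueError("unknown mode: " + mode)
    return 2 * g - r - b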

2.1.1.2 Application of Hough Transform

The Hough Transform is a standard tool in image analysis that allows the detection of straight lines, circles and polynomial curves. The basic idea of the Hough Transform is to map a set of points in the image space to a set of curves in the parameter space. A straight line can be assumed to be parameterized in the form

ρ = x cos θ + y sin θ (2.7)

where ρ is the perpendicular distance from the origin and θ is the angle with the normal.
All lines through a pixel location (x, y) in the image plane can be represented by a curve in
the (ρ, θ) plane or parameter space. Every line through all locations (x, y) in the image with
a certain quantized range of values for θ is mapped into the (ρ, θ) space and the greyvalue
I(x, y) of the points (x, y) that map into the locations (ρm , θm ) is accumulated in the two
dimensional histogram:

A(ρm , θm ) = A(ρm , θm ) + I(x, y) (2.8)
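For illustration, a minimal OpenCV sketch of straight line detection with the standard Hough transform is shown below. Note that cv2.HoughLines accumulates binary edge votes rather than the grey-value weighted accumulator of Equation (2.8); the Canny and vote thresholds are placeholder values.

import cv2
import numpy as np

def hough_line_candidates(binary_mask, votes=120):
    """Detect straight line candidates (rho, theta) in a binarized crop row mask."""
    edges = cv2.Canny(binary_mask, 50, 150)   # edge pixels feed the accumulator
    lines = cv2.HoughLines(edges, 1, np.pi / 180, votes)
    if lines is None:
        return []
    return [(float(rho), float(theta)) for rho, theta in lines[:, 0]]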

2.1.1.3 Randomized Hough Transform

One of the biggest issues with the traditional Hough Transform algorithm is computa-
tional complexity. The Hough Transform algorithm requires huge amounts of computation
with excessive redundancies. The requirement of quantizing the parameter space also affects the detection accuracy. To overcome these shortcomings of the Hough Transform, Xu and Oja [15] introduced the randomized Hough transform (RHT) algorithm.

2.1.2 Horizontal Strip Based methods

Several works found that subdividing the image into horizontal strips helps with localizing
crop rows and simplifying the overall detection process. The locations of crop rows, in the form of midpoints through the rows in each strip, are detected before combining the results from all the strips into the final image.
Søgaard and Olsen [8] used a linear combination of the color channels to convert
from RGB to greyscale. The greyscale intensity of a pixel is the same as that given by
equation (2.3).
Before estimating the row positions, multiple points denoting the center of the rows are
determined. This is done by dividing the greyscale image into a number of horizontal strips
as shown in Figure 2.3.
In order to estimate the points indicating the center line of each of the rows, the horizontal
strips were added along the vertical axis which resulted in a row vector v of length equal to
the horizontal width of the image. Following this, the vector v is split into sub-vectors, where
the lengths correspond to the nominal inter-row spacing in the middle pixel row of the image
strip. In principle, each of these sub-vectors is intersected by the center line of exactly one row. Following this approach, the center line of each of the rows was determined.
Sainz-Costa et al. [16] developed a strategy based on analysis of video frames for the pur-
pose of crop row detection. Crop rows persist along the directions defined by the perspective
projection with respect to the 3D scene in the field. By taking advantage of this, they apply a
greyscale transformation before thresholding is employed to binarize the image. Each image
is divided into four horizontal strips following which rectangular patches are drawn over the
binarized image to identify patches of crop rows. The gravity centers of these patches are
used as the points defining the crop rows, and a line is adjusted considering these points.
The first frame in the sequence is used as a lookup table which guides the full process for
determining positions where the next patches in subsequent frames are to be identified.

24
Figure 2.3. Greyscale image divided into horizontal strips. Taken from [8]

Zhang et al. [17] developed a crop row detection method based on the horizontal strip approach that can cope with complicated conditions, with a focus on the presence of gaps and
weeds. In order to distinguish between the green plants (crops and weeds) and the back-
ground soil they utilized vegetation indices. Various vegetation indices have been designed
such as the ExG, i.e. excess green index [14], [18], CIVE, i.e color index of vegetation extrac-
tion [19], VEG, i.e. Vegetative Index [20], ExGR, i.e., excess green minus excess red index
[21], [22] and COM, i.e., combined index [23]. Zhang et al. investigated several thresholding
methods [12], [24]–[26] for binarizing their greyscale transformed images.
They found that double thresholding based on Otsu’s method could adapt to highly
variable environmental conditions in agricultural tasks because of its ability to dynamically
self-adjust without learning. This, combined with the particle swarm optimization (PSO) method, was used to obtain a good segmentation result. The double thresholding produced
segmentation maps that had an issue with sparsity due to the presence of gaps in between the plants. Morphological operations were applied to fill in these gaps so that the plants appeared to comprise a greater number of green pixels. An area thresholding method was employed where white spots with areas smaller than the threshold were removed and those larger than the threshold were kept. This further helped refine the segmentation map.
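A rough sketch of the gap filling and area thresholding steps is given below; the kernel size and area threshold are illustrative values, not those reported by Zhang et al. [17].

import cv2
import numpy as np

def fill_gaps_and_filter(binary, min_area=80):
    """Close gaps between plants, then drop white spots smaller than min_area."""
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    closed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)
    n_labels, labels, stats, _ = cv2.connectedComponentsWithStats(closed, connectivity=8)
    out = np.zeros_like(closed)
    for i in range(1, n_labels):               # label 0 is the background
        if stats[i, cv2.CC_STAT_AREA] >= min_area:
            out[labels == i] = 255
    return out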

(a) Segmentation after double thresholding    (b) Segmentation after morphological operations

Figure 2.4. Result of binarization of greyscale transformation by [17]

Following this, the vertical projection method was applied to the horizontal strips in order
to extract the feature points that correspond to the crop row centers. The clustered point
sets were then obtained using the position clustering algorithm and shortest path method.
Finally, the crop rows were detected using least squares fitting.

2.1.3 Linear Regression based methods

Billingsley and Schoenfisch [27] used the linear regression method to detect lines fitted to outliers as a way of identifying crop-row guidance information. In addition,
Søgaard and Olsen (2003) [8] located barley crop rows using weighted linear regression. This
is a feasible approach that is applicable when pixels of crop rows are well separated from
those of the weeds. Moreover, Montalvo et al. [26] and Guerrero et al. [23] predicted the
expected position of the crop rows and then adjusted the position through the Theil-Sen estimator. However, its effectiveness is highly affected by the pixels of the weeds. Therefore,
linear regression is only feasible if the pixels of the weeds and crops have been separated.

2.1.4 Stereo Vision Based Approaches

Kise et al. [28] and Kise and Zhang [29] developed a stereovision-based agricultural
machinery crop row tracking navigation system. Stereoimage processing is used to determine
3D locations of the scene points of the objects of interest from the obtained stereoimage.
Those 3D positions, determined by means of stereoimage disparity computation, provide the
base information to create an elevation map which uses a 2D array with varying intensity to
indicate the height of the crop. This approach requires crops with significant heights with respect to the ground. Because the heights are not significant in maize fields during the treatment stage, the approach is ineffective for our application. Rovira-Más et al. [30] have applied and extended stereovision techniques to other areas inside precision agriculture. Stereo-based methods are only feasible if crops or weeds in the 3D scene display a relevant height and the heights differ between the two kinds of plants.

27
3. METHODOLOGY, DATASET AND MODELING

We investigate two different approaches to detecting crop rows from images. The first is a semi-supervised approach. At its base, our method for crop row detection uses a supervised learning model which, given an image, classifies a region of image pixels into either of two classes, crop-row or background; such a task is commonly
known as semantic segmentation. We use a simple U-net based neural network architecture
for solving this task. The architecture of the U-Net is kept light in terms of number of
layers and trainable parameters so as not to overfit to the small training set available. After
semantic segmentation of the images, we use density based clustering of crop-row pixels to
detect clusters resembling crop rows. These are clusters in the binary segmented image space
where the crop-row pixels are designated as pixels with a value of 1.
Then comes the unsupervised aspect of this approach. For each such cluster, we fit
a straight line to obtain slopes and intercept values that represent that crop row in the
parametric space. Using the slope values, the central four crop rows which we are interested
in the most for navigational purposes are identified and returned. By using the distribution
of slope and intercept values over a sequence of frames coming from the video feed, domain adaptation is performed. This domain adaptation helps in making the model invariant to the distribution shift caused by testing on data from a different domain than the small set on which the model is trained.
The second approach is a fully supervised method that learns the mapping between input
and ground truth images thus producing inference that is accurate in terms of location of
crop rows and their geometric shape throughout the region of interest. This method was
made robust owing to the availability of a larger dataset of around 1800 samples which were collected later in the project. With the increased dataset size, we were also able to
accommodate a more complex model in terms of trainable parameters. We tested and
compared multiple model architectures and came to the realization that a pyramid of features
at multiple scales was most useful for learning the scale invariant property of the crop rows.
This provided us with the ability to learn the global geometric shape of the crop rows instead of depending on local patches of green pixels in the image frame in order to classify a region
in the image as belonging to either a crop row or a ground patch.
Throughout the following chapters, we will be discussing the different modules that went
into developing our algorithm for crop row detection. They are organized in a sequence
starting from the semi-supervised method utilizing a cascading clustering algorithm to our
final model based on a feature pyramid network which was developed after a comparative
analysis between multiple neural network architectures.

3.1 Crop Row Benchmark Dataset

We use the open source Crop Row Benchmark Dataset [31], hereafter referred to as CRBD, to train our model. The Crop Row Benchmark Dataset contains images of crops of different
types which can be used for evaluation of crop row detection methods. This evaluation image
set includes images of maize, celery, potato, onion, sunflower and soya bean crops. Different
amount of weed and shadow occur in the captured images. On some images, grass, sky
or road appears. Furthermore, the images are taken at moderately varying yaw, pitch and
roll angles. The images were acquired with a Panasonic LUMIX DMC-F2 digital camera
during the spring of 2014 in the Croatian region of Slavonia. The complete evaluation image set
contains 281 images. The images are also diverse in terms of crop row thickness, inter crop
row distances, curvature of the rows and the crop growth stage.
Furthermore, CRBD contains ground truth data which are created manually for each
image. Besides raw images, the CRBD dataset also provides these ground truth segmentation
masks in the form of data files that provide coordinates of pixels that correspond to crop
rows in the original image. We can use the coordinates to generate binary segmentation
maps to annotate the location of crop row pixels. These annotations provide us with a way
to evaluate our performance during training time by the use of a loss function. For training
on the CRBD dataset, we would like to know how the model would perform during a test
setting. This evaluation is performed by splitting the dataset into training and validation
sets. Further information on how these sets are formed is provided in Section 5.2.

29
Figure 3.1. Sample crop row image from CRBD

30
CRBD also provides a Matlab script to translate the coordinates provided into a binary segmentation map. However, since our project was written in the Python programming language, this script was rewritten from scratch in Python. Upon running the script, all the
ground truth annotations were parsed from a coordinate representation to a binary image
representation where the black pixels represented ground and white pixels represented crop
rows.
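A minimal sketch of such a conversion is shown below; the assumed ground truth format (one "x y" pixel coordinate per line) and the file names are illustrative and may differ from the actual CRBD files and from our script.

import numpy as np
from PIL import Image

def coords_to_mask(coord_file, height=240, width=320):
    """Rasterize a ground truth coordinate file into a binary segmentation map."""
    mask = np.zeros((height, width), dtype=np.uint8)
    with open(coord_file) as f:
        for line in f:
            parts = line.split()
            if len(parts) < 2:
                continue
            x, y = int(float(parts[0])), int(float(parts[1]))
            if 0 <= y < height and 0 <= x < width:
                mask[y, x] = 255        # white = crop row, black = ground
    return mask

# Hypothetical usage:
# Image.fromarray(coords_to_mask("crop_row_001.txt")).save("crop_row_001_mask.png")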

Figure 3.2. Sample ground truth annotation from CRBD

3.2 Crop Row Modeling and Assumptions

Before attempting to solve the problem of crop row detection, we must model the problem
and define what we mean by detection. We model the process of detecting crop rows as
learning a specific representation that denotes the location of crop rows in an image. The
underlying task was to extract a meaningful representation of the location of crop rows out of RGB images. How we define this representation would not only determine the final results
of our row detection process but also how convenient it would be to achieve them.

(a) Object Detection using bounding boxes (b) Semantic segmentation using pixel labels

Figure 3.3. Modeling approaches: Object detection vs Semantic Segmentation

From Figure 3.3, we see two different representations for detecting rows. The bounding
box representation draws bounding boxes around a crop row and is generally referred to as
object detection in computer vision literature. This representation is useful for determining
the location of an object of interest in the frame. The bounding box provides four terminal
(x, y) points of the rectangular box. This essentially would represent a crop row in terms
of four pairs of numbers. While this does have the advantage of reducing the amount
of time needed to label the ground truth bounding box annotations, the predictions from
the base neural network would also reflect that representation. This means we would not have information on the shape of the crop rows within the bounding boxes but only their terminal points. Furthermore, as can be seen from Figure 3.3, since the crop rows are
approximately linear, we do not require four terminal points but only two to represent their
bounds.
The other form of detecting rows involves drawing lines over each pixel that corresponds to a crop row; in computer vision literature, this pixel wise annotation is referred to as semantic segmentation. By adopting this approach, we have more fine grained information on the geometric shape of the crop row in addition to the bounds, which are denoted by the terminal points of these annotation lines. Semantic segmentation provides a richer form of representation but has the downside of requiring laborious labeling of the training set images.

32
Both approaches have their inherent advantages and disadvantages. We finally decided
on semantic segmentation for two reasons:

• Availability of the CRBD dataset, which provided ground truth labels in the form
expected by semantic segmentation.

• Since we had not decided in the early stages how we would utilize the row information for navigation, semantic segmentation would grant us more freedom later down the line.

Based on the samples in the CRBD dataset and because of the fixed position of the
camera mounted on the vehicle, we assume the viewing perspective of the crop row images
to be fixed. Since the underlying application is used for navigating, similar to [6], [32],
we adapted a region of interest (ROI) to reduce the variability of crop rows in the images.
Fixing an ROI also has the benefit of reducing the complexity of the algorithm required to
be adopted as the smaller region is easier and less time consuming to process. Our ROI
selection process assumes that only the crop rows facing the vehicle on it’s path are relevant
for determining the right path to follow.
A central cropped window of size 128x256 pixels was chosen as the region of interest or
ROI for the crop fields as shown in Figure 3.4. This reduced the number of rows required to
be processed. It also reduced the density of rows as the central region consists of rows that
are more easily distinguishable from one another than the regions around the edges of the
image. Moreover, since inter crop row distances vary, so does the number of rows within the
ROI. The rows of interest were chosen to be the central 4 rows that pass through the ROI.
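As a sketch, the ROI extraction can be expressed as a simple central crop; the exact placement of the window within the frame is an assumption, since only the window size and its central position are stated above.

import numpy as np

def central_roi(image, roi_h=128, roi_w=256):
    """Crop a central roi_h x roi_w window (128 x 256 by default) from the image."""
    h, w = image.shape[:2]
    top = (h - roi_h) // 2
    left = (w - roi_w) // 2
    return image[top:top + roi_h, left:left + roi_w]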
In order to approximate the change in directions of the rows, instead of directly detecting
the exact pixel positions of the rows, we model our task to detect the first derivative of the
curvatures that run along the shape of each of the rows. This simplifies the curved shape of
the rows into a piece-wise linear form. This process abstracts away the low level semantic
segmentation process into a model for detecting straight lines denoting the change of direction
of the crop rows.

33
Figure 3.4. ROI for Crop row image

34
4. BACKBONE: U-NET ARCHITECTURE

Initially we train a deep neural network that follows a U-Net Architecture [33]. The U-
net architecture differs from a simple encoder-decoder system by its introduction of skip
connections between the encoder layers and corresponding decoder layers. If there are a total
of n layers in the network then the U-net adds a connection between the ith and (n − i)th
layer. These connections provide information of corresponding stages of encodings to the
decoding blocks thus helping to improve the decoder’s ability to output a final segmentation
map with finer features.
Since the size of our labeled dataset is extremely small, we took caution with the network’s
complexity. To reduce the effect of overfitting as much as possible, we restricted the size
of the network by using a total of 4 encoder blocks and 4 decoder blocks. The network
architecture along with the composition of layers within the encoder and decoder blocks is
shown in Figure 4.1. The generalization ability of the model is primarily attributed to the
sparsity of parameters followed by the downstream adaptation module.
Let the ith encoder block of the U-net take as input a tensor with dimensions f × h × w, where f denotes the number of feature maps in the input, and h and w are the spatial dimensions of the image data. The encoder extracts features from this input through convolution operations, followed by batch normalisation to reduce covariate shift and a ReLU layer for incorporating non-linearity. The input-output relation of
an encoder block can be shown by the following equation.

outputi = ReLU[BatchNorm2d{Conv2d(inputi)}] (4.1)

The resultant output expands the number of feature maps but retains the spatial di-
mensions h and w. The max pooling operation is used to increase the receptive field of a
Convolutional Neural Network and we pass the output of the ReLU activation through a
pooling layer which halves the spatial dimensions to h/2 and w/2 respectively.

encodingi = MaxPool(outputi) (4.2)

35
Figure 4.1. U-net Architecture for segmenting crop rows. The encoder con-
sists of a downsampling path made up of four blocks (green) and the decoder
consists of an upsampling path made up of four blocks (blue). Skip connec-
tions are added between corresponding encoder-decoder blocks to pass on
features learned by the encoder to the decoder.

36
The jth decoding block performs a similar operation as shown in (4.1) but is preceded by a transposed convolution. This reverts the spatial dimensions of the input to their original
size thus completing one stage of the decoding operation.

decodingj = TransposedConv(inputj) (4.3)

outputj = ReLU[BatchNorm2d{Conv2d(decodingj)}] (4.4)

For passing information about learned features between encoder and decoder blocks,
skip connections are used as can be seen in Figure 4.1. The skip connections denote a
concatenation operation where the output of encoder block and the input to the decoder
block are concatenated along the first axis before being passed through the decoder block.
If xi is the output of the ith encoder block, yj is the input to the jth decoder block and j = N − i where N is the total number of decoder blocks, then a skip connection between these two blocks exists to perform the following operation:

output = Concat(xi , yj ) (4.5)
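To make the block structure concrete, a minimal PyTorch sketch of an encoder block (Equations (4.1)–(4.2)) and a decoder block (Equations (4.3)–(4.5)) is given below; kernel sizes, padding and channel handling are illustrative assumptions rather than the exact configuration of Table 4.1.

import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Conv -> BatchNorm -> ReLU (Eq. 4.1) followed by 2x2 max pooling (Eq. 4.2)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        features = self.conv(x)          # kept for the skip connection
        return features, self.pool(features)

class DecoderBlock(nn.Module):
    """Transposed conv (Eq. 4.3), concatenation with skip features (Eq. 4.5), then Eq. 4.4."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.conv = nn.Sequential(
            nn.Conv2d(out_ch * 2, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.up(x)
        x = torch.cat([skip, x], dim=1)  # skip connection along the channel axis
        return self.conv(x)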

37
Table 4.1. Architectural Details of the U-net. The number of features between
encoder blocks and decoder blocks is consistent to allow skip connections.
Stage                 Layer        # of output Feature Maps
Encoder               Enc 1        64
                      Enc 2        128
                      Enc 3        256
                      Enc 4        512
Middle Convolution    Conv 3x3     1024
                      BatchNorm    1024
                      ReLU         1024
                      Conv 3x3     1024
                      BatchNorm    1024
                      ReLU 2       1024
Decoder               Dec 1        512
                      Dec 2        256
                      Dec 3        128
                      Dec 4        64
Final Convolution     Conv 3x3     2

38
5. DATA PIPELINE

In this chapter we discuss the data pipeline of our system. Since the dataset is small,
in order for our network to learn the underlying general geometric pattern of the rows as
well as possible, we chose a split prioritizing a larger portion for training. Moreover, to
reduce overfitting to the data, we used random augmentations to increase both the size and
variability of the dataset. The following are some of the key steps used in the pipeline before
training the network:

1. Low level processing

2. Training and test set split

3. Training set augmentation

5.1 Low Level processing

For low level processing, we normalize the raw images to have pixel intensities with a
fixed mean and standard deviation per color channel. This helps stabilize the training of the
network and convergence of the model. A Gaussian blur with kernel size of 3 and σ = 1 is
also used to reduce noise in the image.
The images from CRBD are each of size 320 × 240 pixels. As shown in Figure 3.2, there are more than 10 crop rows in this image; the majority of them are densely situated towards the edge and horizon of the canvas due to the vanishing point effect. Detection of these rows is not important in real life when using such a system in an agricultural vehicle navigation
system. So, we extract an ROI (region of interest) from each image of size 128 × 256 pixels
covering the central rows. Doing so helps discard irrelevant crop rows near the edges. Also,
after extracting the ROI we obtained images with fairly equalized inter crop row distances
as shown in Figure 3.4.
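The low level processing stage can be sketched as follows; the per-channel mean and standard deviation values shown are common placeholder statistics, as the exact values used are not reported here.

import cv2
import numpy as np

# Placeholder per-channel statistics; the exact values used are not reported here.
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(image_rgb):
    """3x3 Gaussian blur with sigma = 1, then per-channel normalization."""
    blurred = cv2.GaussianBlur(image_rgb, (3, 3), 1)
    scaled = blurred.astype(np.float32) / 255.0
    return (scaled - MEAN) / STD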

39
5.2 Train and test set split

In order to test the performance of the algorithm on unseen data, we randomly shuffle the dataset before splitting it, where 80% of the dataset is used to train the model and the remaining 20% is held out for testing different algorithms for quantitative comparison. During training, we use 20% of the training data instances as a validation set for parameter tuning.

5.3 Training set augmentation

Overfitting [34] to training data is one of the central challenges in deep learning. Espe-
cially for small datasets such as CRBD, it is difficult for the network to generalize to unseen
data and much easier to memorize the entire dataset instead. To mitigate this, we employed
data augmentation methods that result in a significant increase to the variability of data
seen by our model during training.
The augmentation methods were chosen so they would emulate real world conditions of
crop fields as closely as possible. Some examples of these include solar flares and occlusions
due to weed or other objects. This helped the network improve its capability to learn the
continuity of each of the rows. We also made random lighting changes to the overall image
to emulate the various lighting conditions our system could be exposed to during test time.
Finally, to improve the spatial variance of the underlying dataset, images were randomly
flipped both horizontally and vertically.
All of the augmentations were performed with the help of the albumentations library [35]
in Python. The augmented datasets produced samples similar to ones shown in Figure 5.1.
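An illustrative albumentations pipeline in the spirit of these augmentations is sketched below; the specific transforms and probabilities are assumptions that mirror the effects shown in Figure 5.1 rather than the exact configuration used.

import albumentations as A

augment = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
    A.RandomBrightnessContrast(p=0.5),        # lighting changes
    A.HueSaturationValue(p=0.3),              # hue/saturation shifts
    A.RandomSunFlare(p=0.1),                  # emulates solar flares
    A.CoarseDropout(p=0.3),                   # block occlusions from weed or objects
])

# augmented = augment(image=image, mask=mask)  # the same spatial ops apply to both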

40
(a) Green block occlusions (b) Vertical flip with hue shift

(c) Saturation and brightness shift (d) Solar Flares

Figure 5.1. Some samples from the augmented dataset

41
6. ALGORITHMS
6.1 Tuning Network Predictions

6.1.1 Refining the output

Although the network performed well on the test set for CRBD, it performed poorly
in the case of real world video frames. The network's output segmentation map included significant noise in its predictions, consisting of both false positives within the inter crop row spaces and false negatives within the crop rows themselves. A sample prediction
of the network on a video frame is shown in Figure 6.1.
We mitigate this issue to some extent by clustering the pixel predictions for crop rows
in the segmentation map using a Hierarchical Density Based clustering routine. We set a
threshold of 50 pixels to filter out the clusters of false positives in the binary image leaving
us with clusters that are large enough to qualify as crop rows. This process also has the
added benefit of removing the rows at the edges of the ROI which are smaller in size due to
the view perspective.

6.2 Choosing central rows

We choose the central four rows as the most relevant to our operation, as they help to keep the vehicle centered while also reducing the effect of the camera's viewpoint center shifting within a reasonably small range. To find these rows, we initially fit first order polynomials, or straight lines, through the clusters in the segmentation map that are obtained after the
refining stage.
Because of the consistency in length and slope of the rows from the left to the right of
the ROI, we assume that the central four rows should have two properties that distinguish
them from the others:

• The four longest along the vertical axis

• The four smallest slopes along the vertical axis

42
Figure 6.1. Sample input/output of the U-Net

As such, we find the central four rows by first sorting the lines based on the length of the line segments and picking the largest four. Finally, in order to know which line corresponds to which position among the four rows (left, left-center, right-center, right), we sort them by their slopes along the vertical axis. This results in the leftmost row having the smallest slope, with a gradual increase up to the rightmost row. Algorithm 1 demonstrates this process.

Algorithm 1: Rows-of-interest retrieval

Input: Binary segmentation mask y_pred predicted by the backbone U-Net architecture,
where pixel positions corresponding to crop rows have a value of 1 and 0 elsewhere;
s_m, the minimum size for valid row clusters
1  Use HDBSCAN [36] to identify N_m clusters of positive values in y_pred with a minimum
   cluster size s_m
2  for i ← 1 to N_m do
3      m, c ← get slope-intercept of best-fit line through cluster C_i
4  N_rows ← min(N_m, 4)
5  center_rows ← longest N_rows best-fit lines through clusters C_i s.t. i ∈ {1, ..., N_rows}
6  sorted_rows ← sort center_rows by the slopes m_i
7  end
8  return sorted_rows

This post-processing routine converts our output from a binary segmentation map to
four pairs (m_i, c_i), where m_i is the slope of the i-th row from the left and c_i is its intercept.
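A compact sketch of this selection step (following Algorithm 1, with hypothetical helper names) could look as follows, where each cluster is an array of (x, y) pixel coordinates produced by the filtering step above.

import numpy as np

def rows_of_interest(clusters, n_rows=4):
    fits = []
    for pts in clusters:
        xs, ys = pts[:, 0], pts[:, 1]
        # First-order polynomial x = m*y + c, i.e. slope measured along the vertical axis.
        m, c = np.polyfit(ys, xs, deg=1)
        length = ys.max() - ys.min()          # vertical extent of the cluster
        fits.append((length, m, c))
    # Keep the longest rows, then order them left to right by slope.
    fits.sort(key=lambda f: f[0], reverse=True)
    central = sorted(fits[:min(n_rows, len(fits))], key=lambda f: f[1])
    return [(m, c) for _, m, c in central]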

6.3 Pre-Calibration and density estimation

The pre-calibration stage is what helps the model to adapt to an unseen crop field. In
this phase, we use a clustering based algorithm to fit a distribution over each of the four
predicted rows. This process is only employed once for every crop field to adapt the model
to a new domain without the need for supervised training.

6.4 Rows of interest and largest clusters

We first run the model on the first 250 frames of the video feed. The number 250 is
a tunable hyperparameter: increasing or decreasing it results in higher or lower quality of
adaptation, respectively. After running on the first 250 frames, we record the model's output
as 1000 (m_i, c_i) pairs, where i ∈ {1, 2, 3, 4} (four rows per frame). These were then feature
scaled to eliminate the effect of differing magnitudes.
We make the assumption that, among the 250 frames, at least half of the predictions for
each of the four rows are fairly accurate. This does not imply that half of the frames are
good predictions overall, but that each individual row has more than half of its predictions
accurately representing its position in the frame.
Based on this, for each of the four rows the corresponding (m, c) values were then clustered
using density based clustering. This gives us four different scatter plots each consisting of
250 points in 2D space corresponding to the 250 calibration frames. The plots are shown
in Figure 6.2. After clustering the points, we pick the largest cluster as the one most likely
to correspond to the correct predictions for that specific row. We also chose density based
clustering here because for correct predictions, the variance within the (m, c) pairs would
be small compared to the incorrect predictions and density based clustering captures that
property. The parameters were chosen to be ε = 0.35 and min_samples = 5.
6.4.1 Validating correct row predictions
We extracted the points from the largest clusters for each row and then used K-means
to cluster and visualize them in the same 2D space. The resulting scatter plot is shown in Figure 6.3.

Figure 6.2. Density based clustering of predicted row parameters (feature scaled): (a) left row, (b) left-center row, (c) right-center row, (d) right row.

Algorithm 2: Largest cluster selection

Input: X, list of 2D points; ε, maximum distance to nearest neighbor for density-based
clustering; min_samples, minimum number of nearest neighbors for a core point
1  N_x ← population of list X
2  C ← density-based clustering of points x_i, ∀ x_i ∈ X, parameterized by ε and min_samples
3  N_c ← number of clusters obtained
4  largest_cluster ← ∅
5  for c_i ∈ C do
6      if Population(c_i) > Population(largest_cluster) then
7          largest_cluster ← c_i
8      end
9  end
10 return largest_cluster
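A minimal realization of Algorithm 2 with scikit-learn's DBSCAN is sketched below; the function name is hypothetical, and the input is the feature-scaled array of (m, c) pairs recorded for one row.

import numpy as np
from sklearn.cluster import DBSCAN

def get_largest_cluster(points, eps=0.35, min_samples=5):
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
    largest = np.empty((0, 2))
    for lbl in np.unique(labels):
        if lbl == -1:                        # noise points (inaccurate predictions)
            continue
        members = points[labels == lbl]
        if len(members) > len(largest):
            largest = members
    return largest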

Figure 6.3. K-means clustering of row representations

As shown in the plot, the four clusters are well separated and are very concentrated along
each of the four means. This helps validate our assumption that the largest clusters are good
representations for the correct row predictions.

6.5 Density Estimation of row clusters

To measure how close a new row prediction is to a good prediction, we need to know its
likelihood of belonging to the corresponding cluster. We do this by fitting an EM clustering
(a Gaussian mixture fitted via Expectation-Maximization) over the parameter space of the four
clusters. This results in a continuous density function that gives a high confidence if the
prediction for a row is likely to have come from the same distribution as the correct rows
obtained during the calibration stage. The entire calibration phase is shown in Algorithm 3,
and a minimal sketch of the density fit follows it.
By putting a threshold on the confidence value, we can control the trade-off between
temporal smoothness of a crop row's parameter values (slope and intercept) and the network's
prediction accuracy for that row in the current frame. A poor confidence value acts as a trigger
that causes the U-Net to disengage from the prediction system; the row of interest is then
predicted not by pixel-level classification but as an extrapolation of previous high-confidence

predictions of the parameter values of the same row. In a sense the calibration stage can
be defined as a type of self-supervision that provides our network with an estimate of its
performance for the current frame.

Algorithm 3: Calibration: Estimation of valid row distribution

Input: V_s, video stream; N_frames, number of frames to use for calibration
1  Samples ← dictionary mapping row indices to 2-D arrays of shape (N_frames × 2)
2  for i ← 1 to N_frames do
3      f_i ← retrieve current frame from V_s
4      pred_i ← get predicted segmentation map from U-Net
5      sorted_rows ← retrieve 4 central rows from pred_i
6      for j ← 1 to 4 do
7          append parameters (m, c) of the j-th row to Samples_j
8      end
9  end
10 largest_clusters ← empty list
11 for i ← 1 to 4 do
12     C_i ← get_largest_cluster(Samples_i)
13     append all elements (m_i, c_i) in C_i to largest_clusters
14 end
15 µ_em, Σ_em ← parameters from EM clustering of largest_clusters
16 return µ_em, Σ_em
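A minimal sketch of the density estimation step is given below, reading the EM clustering as a Gaussian mixture fitted with scikit-learn. How the resulting density is normalized into the runtime confidence score is not spelled out here, so the mapping below is only one plausible choice.

import numpy as np
from sklearn.mixture import GaussianMixture

# largest_clusters: list of (m, c) arrays returned by the calibration loop (Algorithm 3).
calib_points = np.vstack(largest_clusters)
gmm = GaussianMixture(n_components=4, covariance_type="full").fit(calib_points)

def row_confidence(m, c):
    # Likelihood of a new (m, c) prediction under the calibrated mixture density;
    # higher values mean the prediction resembles the valid rows seen during calibration.
    return float(np.exp(gmm.score_samples(np.array([[m, c]])))[0])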

6.6 Frame extrapolation and temporal corrections

The model was tested using a camera that recorded videos on-board a moving vehicle
at 25 fps. Since the slopes and intercepts of each of the rows change uniformly across subsequent
frames, the derivative of these parameters can be regarded as constant. This facilitates the ability to
extrapolate a prediction from previous predictions to a reasonable degree of accuracy.
Since the change in slopes and intercepts of each row varies consistently, the derivative
can be approximated as constant through a small number of subsequent frames. Therefore
the change in parameters can be assumed to be linear. As such, given frames fi−1 and frame

fi with parameters (mi−1 , ci−1 ) and (mi , ci ) respectively, the parameters for the next frame
fi+1 can be approximated as shown in equation 6.1.

x_{i+1} = 2x_i − x_{i−1};   x_j ∈ {m_j, c_j}, ∀ j ∈ {i − 1, ..., i + 1}        (6.1)

We can control the temporal smoothness of our model’s predictions at runtime by choos-
ing when to extrapolate a prediction and when to rely on the network’s current prediction
for a row. This choice is made by putting a threshold on the confidence value of a prediction
that can be obtained from previously estimated probability densities for each of the rows.
For our tests, we found that by rejecting predictions that produce confidence scores of under
30% and replacing them with a rolling average of the previous 3 predictions for that row, the
model can be made robust to difficult frames that have shadows, occlusions or other obstructions.
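A small sketch of this gating logic for a single row is shown below; the threshold and the three-sample history follow the values above, while the helper names and the exact fallback policy are illustrative assumptions.

from collections import deque

CONF_THRESHOLD = 0.30              # reject predictions with confidence under 30%
history = deque(maxlen=3)          # last accepted (m, c) values for this row

def extrapolate(prev, curr):
    # Linear extrapolation of (m, c) per Equation 6.1: x_{i+1} = 2*x_i - x_{i-1}.
    return tuple(2 * xi - xim1 for xi, xim1 in zip(curr, prev))

def next_row_estimate(pred, confidence):
    # Trust the network when it is confident; otherwise fall back to a rolling
    # average of the previously accepted parameters for this row.
    if confidence >= CONF_THRESHOLD or not history:
        history.append(pred)
        return pred
    m_avg = sum(p[0] for p in history) / len(history)
    c_avg = sum(p[1] for p in history) / len(history)
    return (m_avg, c_avg)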

7. FULLY SUPERVISED MODEL
7.1 Larger dataset and models tested

At the final stage of the project, we managed to acquire around 1800 labeled images of
various types of crops and crop fields. With this additional labeled data available, we discovered
that we could potentially build an end-to-end system that does not depend on the calibration
phase.

Figure 7.1. Availability of larger dataset

Not only were these data points more representative of our test samples than the CRBD
dataset, but the larger size also meant that we could afford to modify our U-Net architecture
without worrying about overfitting. This became the foundation of our second and
final approach.
Previously we had used successive VGG layers to increase the receptive field of the
network. With more data, we had the luxury of adopting more complex neural network
architectures. To this end, we compared the following five architectures for detecting crop
rows, using the popular segmentation_models_pytorch (SMP) library [37] for convenience of
implementing the different architectures; a minimal instantiation sketch follows the list below.

1. A deeper U-Net

2. MANet

3. Linknet

4. PSPNet

5. FPN
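Under the assumption of a common ResNet18 encoder and a single crop-row class, the five candidates can be instantiated through SMP roughly as follows; the encoder choice here is only for illustration of the comparison setup, with the final FPN configuration described in Section 7.2.3.

import segmentation_models_pytorch as smp

common = dict(encoder_name="resnet18", encoder_weights="imagenet",
              in_channels=3, classes=1)
candidates = {
    "unet":    smp.Unet(**common),     # deeper U-Net
    "manet":   smp.MAnet(**common),    # multi-scale attention network
    "linknet": smp.Linknet(**common),
    "pspnet":  smp.PSPNet(**common),
    "fpn":     smp.FPN(**common),
}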

7.1.1 Deeper-UNet

For the deeper U-Net, we followed the architecture proposed in [33] composed of 23
convolutional layers. A pictorial representation of the architecture is provided in Figure 7.2.

Figure 7.2. Deeper U-Net Architecture

Similar to our previous architecture, this deeper U-Net consists of a contracting path
(left side) and an expansive path (right side). The contracting path follows the typical
architecture of a convolutional neural network. It consists of the repeated application of two
3x3 convolutions (unpadded convolutions), each followed by a rectified linear unit (ReLU)
and a 2x2 max pooling operation with stride 2 for downsampling. At each downsampling

step the number of feature channels is doubled. Every step in the expansive path consists
of an upsampling of the feature map followed by a 2x2 convolution (“up-convolution”) that
halves the number of feature channels, a concatenation with the correspondingly cropped
feature map from the contracting path, and two 3x3 convolutions, each followed by a ReLU.
The cropping is necessary due to the loss of border pixels in every convolution. At the final
layer a 1x1 convolution is used to map each 64-component feature vector to the desired
number of classes. In total the network has 23 convolutional layers.
Our finding with the deeper U-Net was that it helped increase the receptive field of the
network considerably. This in turn helped solve some of the scale issues in detecting crop
rows. For instance, patches of crop rows that were misclassified as ground pixels by the shallow
U-Net were properly classified as crop rows with the help of the deeper U-Net. However, the
results were not invariant across crop growth stages. The network seemed to lean towards
better predicting crop rows with greater inter-crop-row distances. This is expected, as there
was no multi-scale mechanism adopted in the simple deeper U-Net model.

7.1.2 MA-Net

Next, we investigated the Multi-scale Attention Network, or MA-Net, as proposed in [38]
for liver and tumor segmentation. This differs from a traditional U-Net by its introduction
of the position wise attention block or PAB. An illustration of the architecture of MA-Net
is shown in Figure 7.3.
Previous work has suggested that relying only on local feature information captured by a
traditional convolutional network can lead to misclassification of objects. In order to capture
rich contextual relationships over local feature maps, [39] designed a position attention module.
Inspired by the position attention module, [38] used the PAB to capture the spatial depen-
dencies between any two position feature maps. The PAB can model a wider range of rich
spatial contextual information over local feature maps. An illustration of the PAB module
is shown in Figure 7.4.

Figure 7.3. The total architecture of MA-Net

Figure 7.4. The Position-wise Attention Block (PAB). The input image is
HxWx256 and output is HxWx512. The attention feature map is obtained by
Softmax function.

7.1.3 LinkNet

To improve the efficiency of our neural network's prediction, thus getting us closer to the
requirement of real-time inference, we investigated the LinkNet architecture as proposed by
[40]. The architecture of LinkNet is illustrated in Figure 7.5. Generally, spatial information
lost in the encoder due to pooling or strided convolution is recovered by using the pooling
indices or by full convolution. The authors of LinkNet hypothesize and prove that instead of

such techniques, bypassing spatial information directly from the encoder to the corresponding
decoder improves accuracy along with a significant decrease in processing time. In this
way, information that would otherwise be lost at each level of the encoder is preserved,
and no additional parameters and operations are wasted in relearning this lost information.

7.1.4 PSPNet

The core idea behind PSPNet [41] was to learn a global context prior. Learning global
features allows the network to exploit categories of neighboring pixels when deducing the
category of the pixel in question. To accomplish this, the authors of PSPNet introduced the
pyramid pooling module. The overall network architecture, along with the pyramid pooling
module that sits between the encoder and decoder, is illustrated in Figure 7.6.
As can be seen from the image, the pyramid pooling module fuses features under four
different pyramid scales. The coarsest level highlighted in red is global pooling to generate a
single bin output. The following pyramid level separates the feature map into different sub-
regions and forms pooled representation for different locations. The output of different levels
in the pyramid pooling module contains feature maps of varied sizes. To maintain the
weight of the global feature, a 1×1 convolution layer is used after each pyramid level to reduce
the dimension of the context representation to 1/N of the original one, where N is the number
of pyramid levels. The low-dimension feature maps are then directly upsampled to the same size
as the original feature map via bilinear interpolation. Finally, the features from the different
levels are concatenated as the final pyramid pooling global feature.

7.1.5 Feature Pyramid Network

Similar in theory to the Pyramid Scene Parsing Network (PSPNet), the final model we
investigated was the Feature Pyramid Network, or FPN [42]. Although the authors originally
designed the model for object detection, their method provides a generic pyramid
representation and can be used in applications other than object detection. Feature pyra-
mids are a basic component in recognition systems for detecting objects at different scales.

Figure 7.5. LinkNet Architecture.

Figure 7.6. Overview of PSPNet architecture.

Since our crop rows varied in density throughout the image, it made sense to exploit this
property by taking advantage of a feature pyramid’s ability to capture multi-scale features.
Another factor that played a role in choosing the feature pyramid network is its ability
to detect smaller objects. We noticed that the feature pyramid at multiple scales allowed
for improved detection accuracy for classes that were smaller in scale compared to other
classes. In our crop row images, the pixels that make up a crop row are much fewer in
number than the pixels that make up the ground patches. Therefore, detecting these small
structures was a challenge we had previously not been able to solve.

7.2 End-to-end supervised learning with feature pyramid network

In this section we will discuss the end-to-end supervised training and inference method
we adopted using Feature Pyramid Networks as our architecture of choice. We chose to go
with the Feature Pyramid Network due to two main reasons.

1. Greater accuracy and performance compared to other architectures

2. Faster processing speed for video frames

Figure 7.7. (a) Using an image pyramid to build a feature pyramid. Fea-
tures are computed on each of the image scales independently, which is slow.
(b) Recent detection systems have opted to use only single scale features for
faster detection. (c) An alternative is to reuse the pyramidal feature hierarchy
computed by a ConvNet as if it were a featurized image pyramid. (d) Our
proposed Feature Pyramid Network (FPN) is fast like (b) and (c), but more
accurate. In this figure, feature maps are indicated by blue outlines and thicker
outlines denote semantically stronger features.[42]

7.2.1 Dataset and train/test split

The dataset we used for this method had over 1800 samples. As before, we split the dataset
in an 80/20 ratio into training and testing sets, respectively. 20% of the training set was
used for cross validation during training.
The images we had in the dataset varied by crop type (Corn, Soy etc.) and growth stages
(early growth stage to canopy formation). This also impacted the inter crop row distances
as higher growth stages and larger crops contributed to a smaller inter crop row distance
and vice versa.

7.2.2 Region of Interest Selection

We selected a central region of size 320x320 as our region of interest. The larger region
included a greater portion of the image frame into our network’s prediction zone. The goal
of our approach was to train the model end-to-end without any dependency on detecting
central crop rows. Therefore, in contrast to our previous method, the feature pyramid
network approach is invariant to the number of crop rows in the region of interest. It does
not rely on detecting only the central four crop rows but provides pixel wise annotations for
every crop row that resides inside the region of interest.

7.2.3 Implementation Details

The network architecture for the feature pyramid network was implemented using the
PyTorch deep learning framework [43] in conjunction with the segmentation models pytorch
library [37]. Similar to the previous method, the albumentations [35] library was used for
augmenting the dataset during training and generating synthetic samples that increase the
variance within the dataset. OpenCV [44] was used for loading/processing images as BGR
tensors and SciPy [45] was used for scientific computation. The model was trained on a
V100 GPU. During training, the parameters of the model were optimized using the Adam
[46] optimizer with a learning rate α of 2e − 3.
For the encoder of the FPN model, we chose the ResNet18 architecture, as it is both
performant and consists of a comparatively low number of layers. The encoder used pretrained
weights trained on the ImageNet dataset [47].
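A minimal sketch of this setup, using the SMP FPN constructor and the optimizer settings listed above, is shown below.

import torch
import segmentation_models_pytorch as smp

# FPN decoder on a ResNet18 encoder pretrained on ImageNet, one crop-row class.
model = smp.FPN(encoder_name="resnet18", encoder_weights="imagenet",
                in_channels=3, classes=1).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=2e-3)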

7.2.4 Loss Function and Training Scheme

We trained the network for 100 epochs. The loss function chosen to be minimized was
the dice loss between the neural network’s predictions and the ground truth. The dice loss
is shown in Equation 7.1.

L_dice = (2 Σ p_true · p_pred) / (Σ p_true² + Σ p_pred² + ε)        (7.1)
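For reference, a soft dice loss built from the overlap term in Equation 7.1 (minimizing one minus the coefficient) can be written as the following sketch; the exact smoothing and reduction choices used in training may differ.

import torch

def dice_loss(p_pred, p_true, eps=1e-7):
    # p_pred: predicted probabilities, p_true: binary ground truth, both (B, 1, H, W).
    p_pred = p_pred.reshape(p_pred.size(0), -1)
    p_true = p_true.reshape(p_true.size(0), -1)
    intersection = (p_pred * p_true).sum(dim=1)
    denom = (p_pred ** 2).sum(dim=1) + (p_true ** 2).sum(dim=1) + eps
    return 1.0 - (2.0 * intersection / denom).mean()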

Due to the limitations of cross entropy loss in detecting crisp boundaries, we decided to go
with the dice loss. When using cross entropy loss, the statistical distributions of labels play
a big role in training accuracy. The more unbalanced the labels are in the input training
data, the more difficult training becomes. Although weighted cross entropy can mitigate
some of the issues, the improvement is not significant.
The training scheme involved a mini-batch gradient descent algorithm based on the
Adam optimizer. For the minibatch size, we found 16 to be large enough for convergence
while also fitting within the memory constraints of our GPU resources.
At the beginning of each training epoch, we cropped the ROI from the input image frame
and applied random augmentations to the ROI such as horizontal and vertical flip, random
hue/saturation/brightness and Gaussian blurs. The image was then normalized to have
a constant mean and standard deviation. This improves training stability and helps the
network learn better.
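Put together, one training epoch under this scheme looks roughly like the sketch below; it assumes the model and optimizer from Section 7.2.3 and a train_dataset yielding augmented, normalized ROI crops with their masks, and uses SMP's built-in dice loss for brevity.

import segmentation_models_pytorch as smp
from torch.utils.data import DataLoader

criterion = smp.losses.DiceLoss(mode="binary")       # expects raw logits by default
loader = DataLoader(train_dataset, batch_size=16, shuffle=True)

model.train()
for images, masks in loader:
    images, masks = images.cuda(), masks.cuda()
    optimizer.zero_grad()
    logits = model(images)               # (B, 1, H, W) raw predictions
    loss = criterion(logits, masks)
    loss.backward()
    optimizer.step()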

7.2.5 Testing and Validation

The FPN’s training performance was validated on a 20% hold-out cross validation set.
At the end of each epoch, the dice loss was computed on this validation set along with the
mIoU score. This validation helped us quantify improvement of the model’s performance on
unseen data. At the end of each validation step, we compared the loss obtained with the
loss from the previous epoch. If the loss was found to decrease on the validation set with
increasing epochs, we could conclude with reasonable certainty that our model was indeed
learning the generalized pattern of crop rows in images and not just memorizing the training
data to improve performance.
This was made feasible owing to the larger training set available to us. Previously,
with the small CRBD sample size of only 281, regardless of bottlenecking the number of
parameters in the model architecture, the model could always reduce the validation loss
due to the lack of variability in the training set. However, with the increased size of over
1800 samples, this issue was alleviated.

8. EXPERIMENTS AND RESULTS
8.1 Semi-Supervised Approach

First we shall discuss the experiments and results obtained for the semi-supervised
method. Here we conduct both accuracy and inference time tests on the full end-to-end
algorithm, starting from the U-Net backbone and ending with the frame extrapolation
method.

8.1.1 Distribution of row predictions

The U-Net backbone was trained for 100 iterations in a traditional semantic segmenta-
tion setting to minimize the cross-entropy loss between the output and the ground truth
segmentation map.
Experiments were performed on the Nvidia Jetson TX2. In Figure 8.1 we can see the
differences in the distribution of the best fit line parameters for each row upon the introduc-
tion of frame extrapolation based on parameters obtained from the calibration phase. As
shown, the calibration phase pushes the distribution of the points towards a concentrated
region (annotated in orange).
On the other hand, without going through the calibration phase and solely relying on
the raw predictions from the trained network, we can see a large variance in the predictions
for each of the rows. By controlling the threshold on the learned confidence metric, the
weight on frame extrapolation can be controlled to tune this difference in variance. This in
turn controls the trade-off between temporal smoothness and the U-Net’s current prediction
accuracy. A higher threshold on confidence would concentrate the predictions towards the
cluster means thereby enforcing a smoother inter-frame transition of the predictions. A
lower threshold would reduce the effects of the previous frames and lean more towards the
network’s predictions for the current frame only.
From Table 8.1 it is evident that our calibration phase had little effect on reducing
the variance in the slopes of the rows. However, the reduction in the case of the intercept is
significant. This is caused by the highly discontinuous segments predicted

Figure 8.1. Distribution of row predictions in parameter space: (a) left row, (b) left-center row, (c) right-center row, (d) right row.

Table 8.1. Changes in variance (σ_err) of parameters in subsequent frames, averaged across the central four crop rows

          Without calibration (σ_nc)    With calibration (σ_c)    Abs. difference |σ_nc − σ_c|
  m       0.011                         0.0026                    0.0083
  c       148.02                        13.82                     134.2

by the U-Net, combined with the fact that the selection of the rows of interest depends on the
size of the clusters in the predicted segmentation map; as a result, the prediction for certain
rows occasionally shifts to an adjacent row.
This also explains the small difference in the variance in slopes as adjacent crop rows
have slopes that are nearly identical. In order to ensure that the algorithm can be employed
for real-time applications, we tested its performance on an Nvidia Jetson TX2. We ran the
inference on each frame of a stream fed by a camera mounted on a vehicle. The neural
network’s inference was performed on the on-board GPU and the frame extrapolation based
on calibrated parameters were processed on the CPU. Table 8.2 shows the average processing
time per frame and the average FPS achieved for both the calibration and inference phases.

Table 8.2. Processing time of the algorithm on the Nvidia Jetson TX2

                       Calibration (250 frames)    Inference (10000 frames)
  Avg time per frame   32 ms                       54 ms
  Avg FPS              31                          17

8.1.2 Ablation study with and without extrapolation

Next we present an ablation study on test video frames passed through the base U-Net
architecture, comparing the results with and without the frame extrapolation method applied.
The results of this study are shown in Figure 8.2.
As can be seen from the example, three subsequent frames are tested under this scenario.
In all the cases, the predictions from the U-Net are not ideal in terms of accuracy. Without
frame extrapolation applied, the prediction becomes very poor on the third frame. But
owing to the temporal smoothness function provided by the frame extrapolation algorithm,
the issues are fixed and we obtain a fairly accurate third frame after post processing.
Here, the confidence scores for the first and fourth rows were very poor when compared
to the distributions of crop row parameters learned during the calibration phase. These
confidence scores act as guiding signals that tell us that those predictions are poor and need
to be replaced by higher confidence predictions provided by the network in the past. What

is essentially happening here is that we have developed an unsupervised metric for crop row
detection, and by relying on this metric we are able to fix many of the errors that arise during
test time.

Figure 8.2. Example of frame extrapolation to fix errors in predicted rows.


Here frames 1 and 2 are used to extrapolate the third frame since confidence
of prediction for the left and right row did not cross the threshold in the case
of the original prediction.

8.1.3 Comparison of U-Net with unsupervised methods

In this section we compare our backbone U-Net against four classical unsupervised
learning methods. The first one, the Hough transform [9], is a feature extraction technique in
digital image processing. Canny edge detection [48] is applied to the binary image to extract
the edges. Through the coordinate transformation, the collinear points along edges of the
binary image are converted to concurrent lines in parameter space by voting, and the Hough
transform detects lines by accumulating the votes.
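As a rough illustration only, such a baseline can be put together with OpenCV as follows; the thresholds are arbitrary placeholders rather than the values used in the cited works.

import cv2
import numpy as np

# vegetation_mask: HxW binary image separating crop pixels from soil.
binary = (vegetation_mask * 255).astype(np.uint8)
edges = cv2.Canny(binary, 50, 150)                      # edge extraction
lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180,
                        threshold=60, minLineLength=80, maxLineGap=20)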
The second method is named Sliding-window. A window of size 20 by 20 slides over the
binary image and calculates the center points of the white pixel blobs inside that window.
After sliding over the whole image, we essentially have the center points of the crop rows.
Then a least-squares straight line is fitted to the crop row center points. Variants of this
method are available in the literature [49].

The third method is Template Matching followed by Global Energy Minimization [50]. It
uses dynamic programming for efficient global energy minimization. This method can work
without any prior knowledge of the number of crop rows, is reasonably insensitive to weeds,
and works with different crop growth stages.
The fourth method uses RANSAC-based clustering [51]. Here, the HDBSCAN clustering
method is applied to the binary image to separate crop rows from weed patches. Then RANSAC
is used to fit lines through the crop rows.

Table 8.3. Comparison of our model's mIoU scores and inference time per single image on the test set with unsupervised computer vision algorithms

  Model              Device    Number of Parameters (M)    mIoU    Inference time (ms)
  Ours               GPU       31                          0.75    13.6
  Hough Transform    CPU       N/A                         0.24    2.0
  Cluster Ransac     CPU       N/A                         0.28    100.1
  TMG EM [13]        CPU       N/A                         0.32    1750
  Sliding Window     CPU       N/A                         0.27    2.8

8.2 Fully Supervised Approach

For the fully supervised approach, the training curves are shown in Figure 8.3. As we
can see, the Feature Pyramid Network converges fairly consistently between subsequent
iterations and the dice loss decreases monotonically. Figure 8.4 shows the improvement of
mIoU scores as training progresses and the dice loss between the prediction and the ground
truth decreases. From our experiments, the training curves for the validation loss and mIoU
scores seem to plateau at around 0.5 for the dice loss and 0.75 for the mIoU score, so we used
an early stopping mechanism to save the best model.

Figure 8.3. Training and Validation dice losses.

Figure 8.4. Training and Validation mIoU Scores.

9. CONCLUSION AND FUTURE WORK
9.1 Conclusion

In this work, we first introduced a method to augment a U-Net backbone's inference
capabilities for detecting crop rows by incorporating an unsupervised confidence score based
on a sequence of clustering methods. We showed that upon scarcity of labeled training data,
our method could adapt to a different image domain of the test set during runtime thus
demonstrating online domain adaptation capabilities.
However, the result fell short as the domain gap increased. In the presence of increased
weed pressure and varying inter-crop row distances, our method suffered due to a larger
shift in the distribution between the training and the test set. To mitigate this issue, we
required a larger labeled dataset to train on and a more complex neural network to learn
the underlying geometric pattern of crop rows.
We implemented a Feature Pyramid Network (FPN) for this purpose. The FPN not
only added more layers to both the encoder and decoder thus increasing its ability to learn
complex patterns, but it also featured a pyramid pooling module at the end of the encoder.
The pyramid pooling module (PPM) encoded features at different image resolutions, which
has been shown to exhibit better capacity for learning global image features as opposed to local
features that only take into account the immediate neighbors of a specific pixel of the image.
Our prior method showed good results when the scarcity of dataset samples was considered
an acceptable limiting factor. Upon availability of a larger training set, we managed to
train the FPN model by minimizing the dice loss (i.e., maximizing the overlap, or IoU, between
the network's prediction and the ground truth data). Due to the capabilities of the FPN
mentioned above, this yielded better results than the previous method.

9.2 Future Work

As for the future scope of this work, the pre-calibration method can be optimized by fitting
distributions other than a simple Gaussian over the clusters. Since a Student-t distribution
is more robust to outliers, it can serve as a suitable candidate for this purpose. Furthermore,

to determine the valid clusters, instead of relying on a density based clustering to find the
largest cluster, a machine learning classifier such as an SVM can be used to distinguish the
valid clusters from the outliers.
Going forward, the FPN method can be used as a generalized end-to-end method for
detecting crop rows. But it can be improved both in terms of accuracy of predictions and
processing speeds. To improve accuracy of predictions, an active learning or pseudo-labeling
scheme can be implemented where only the badly predicted samples are used to train the
model iteratively until performance meets satisfactory levels. To improve processing speed,
the FPN architecture can be modified by stripping away layers from the encoder/decoder
thus reducing the number of trainable parameters. Moreover, pruning can be performed on
the overall system by optimizing the code for the specific hardware it is to be run on.
In terms of camera hardware, infrared cameras can be used to detect vegetation. Near
infrared filters have previously been used in cameras for image acquisition [52]. Depth sensing
stereo cameras can also be tested for this purpose especially in cases of high growth stages
of crops. Finally, as used by [53], a grey scale camera with a near-infrared filter for vision
guidance can be employed on agricultural robots.

REFERENCES

[1] J. Guerrero, J. Ruz, and G. Pajares, “Crop rows and weeds detection in maize fields
applying a computer vision system based on geometry,” Computers and Electronics in
Agriculture, vol. 142, pp. 461–472, Nov. 2017. doi: 10.1016/j.compag.2017.09.028.

[2] A. Shehata, S. Mohammad, M. Abdallah, and M. Ragab, “A survey on hough transform,


theory, techniques and applications,” Feb. 2015.

[3] I. García-Santillán, J. Guerrero, M. Montalvo Martínez, and G. Pajares, “Curved and


straight crop row detection by accumulation of green pixels from images in maize fields,”
Precision Agriculture, vol. 19, Jan. 2017. doi: 10.1007/s11119-016-9494-1.

[4] T. Bakker, H. Wouters, K. Van Asselt, J. Bontsema, L. Tang, J. Müller, and G. van
Straten, “A vision based row detection system for sugar beet,” Computers and electronics
in agriculture, vol. 60, no. 1, pp. 87–95, 2008.

[5] J. Romeo, G. Pajares, M. Montalvo, J. Guerrero, M. Guijarro, and A. Ribeiro, “Crop row
detection in maize fields inspired on the human visual perception,” The Scientific World
Journal, vol. 2012, 2012.

[6] V. R. Ponnambalam, M. Bakken, R. J. Moore, J. Glenn Omholt Gjevestad, and P. Johan


From, “Autonomous crop row guidance using adaptive multi-roi in strawberry fields,”
Sensors, vol. 20, no. 18, p. 5249, 2020.

[7] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid
networks for object detection,” in Proceedings of the IEEE conference on computer vision
and pattern recognition, 2017, pp. 2117–2125.

[8] H. T. Søgaard and H. J. Olsen, “Determination of crop rows by image analysis without
segmentation,” Computers and electronics in agriculture, vol. 38, no. 2, pp. 141–158, 2003.

[9] P. V. Hough, Method and means for recognizing complex patterns, US Patent 3,069,654,
Dec. 1962.

[10] J. A. Marchant and R. Brivot, “Real-Time Tracking of Plant Rows Using a Hough Trans-
form,” en, Real-Time Imaging, vol. 1, no. 5, pp. 363–371, Nov. 1995, issn: 1077-2014. doi:
10.1006/rtim.1995.1036. [Online]. Available: https://www.sciencedirect.com/science/
article/pii/S1077201485710364.

[11] G.-Q. Jiang, C.-J. Zhao, and Y.-S. Si, “A machine vision based crop rows detection for
agricultural robots,” in 2010 International Conference on Wavelet Analysis and Pattern
Recognition, IEEE, 2010, pp. 114–118.

[12] N. Otsu, “A Threshold Selection Method from Gray-Level Histograms,” IEEE Transac-
tions on Systems, Man, and Cybernetics, vol. 9, no. 1, pp. 62–66, Jan. 1979, Confer-
ence Name: IEEE Transactions on Systems, Man, and Cybernetics, issn: 2168-2909. doi:
10.1109/TSMC.1979.4310076.

[13] F. Rovira-Más, Q. Zhang, J. Reid, and J. Will, “Hough-transform-based vision algorithm


for crop row detection of an automated agricultural vehicle,” Proceedings of the Institution
of Mechanical Engineers, Part D: Journal of Automobile Engineering, vol. 219, no. 8,
pp. 999–1010, 2005.

[14] D. M. Woebbecke, G. E. Meyer, K. Von Bargen, and D. A. Mortensen, “Color indices for
weed identification under various soil, residue, and lighting conditions,” Transactions of
the ASAE, vol. 38, no. 1, pp. 259–269, 1995, Publisher: American Society of Agricultural
and Biological Engineers.

[15] P. Kultanen, L. Xu, and E. Oja, “Randomized Hough transform (RHT),” in 10th Inter-
national Conference on Pattern Recognition [1990] Proceedings, vol. i, Jun. 1990, 631–635
vol.1. doi: 10.1109/ICPR.1990.118177.

[16] N. Sainz-Costa, A. Ribeiro, X. P. Burgos-Artizzu, M. Guijarro, and G. Pajares, “Mapping


wide row crops with video sequences acquired from a tractor moving at treatment speed,”
Sensors, vol. 11, no. 7, pp. 7095–7109, 2011.

[17] X. Zhang, X. Li, B. Zhang, J. Zhou, G. Tian, Y. Xiong, and B. Gu, “Automated robust
crop-row detection in maize fields based on position clustering algorithm and shortest
path method,” Computers and Electronics in Agriculture, Sep. 2018. doi: 10 . 1016 / j .
compag.2018.09.014.

[18] A. Ribeiro, C. Fernández-Quintanilla, J. Barroso, and M. García-Alegre, “Development


of an image analysis system for estimation of weed,” Precision agriculture, vol. 5, p. 69,
2005.

[19] T. Kataoka, T. Kaneko, H. Okamoto, and S. Hata, “Crop growth estimation system using
machine vision,” in Proceedings 2003 IEEE/ASME international conference on advanced
intelligent mechatronics (AIM 2003), tex.organization: IEEE, vol. 2, 2003, b1079–b1083.

[20] T. Hague, N. D. Tillett, and H. Wheeler, “Automated Crop and Weed Monitoring in
Widely Spaced Cereals,” en, Precision Agriculture, vol. 7, no. 1, pp. 21–32, Mar. 2006,
issn: 1573-1618. doi: 10.1007/s11119-005-6787-1. [Online]. Available: https://doi.org/10.
1007/s11119-005-6787-1.

[21] J. C. Neto, A combined statistical-soft computing approach for classification and mapping
weed species in minimum-tillage systems. The University of Nebraska-Lincoln, 2004.

[22] G. E. Meyer and J. C. Neto, “Verification of color vegetation indices for automated crop
imaging applications,” Computers and electronics in agriculture, vol. 63, no. 2, pp. 282–
293, 2008, Publisher: Elsevier.

[23] J. M. Guerrero, M. Guijarro, M. Montalvo, J. Romeo, L. Emmi, A. Ribeiro, and G. Pa-


jares, “Automatic expert system based on images for accuracy crop row detection in maize
fields,” Expert Systems with Applications, vol. 40, no. 2, pp. 656–664, 2013, Publisher: El-
sevier.

[24] C. Gée, J. Bossu, G. Jones, and F. Truchetet, “Crop/weed discrimination in perspective


agronomic images,” Computers and Electronics in Agriculture, vol. 60, no. 1, pp. 49–59,
2008, Publisher: Elsevier.

[25] J. Bossu, C. Gée, G. Jones, and F. Truchetet, “Wavelet transform to discriminate between
crop and weed in perspective agronomic images,” computers and electronics in agriculture,
vol. 65, no. 1, pp. 133–143, 2009, Publisher: Elsevier.

[26] M. Montalvo, G. Pajares, J. M. Guerrero, J. Romeo, M. Guijarro, A. Ribeiro, J. J. Ruz,


and J. Cruz, “Automatic detection of crop rows in maize fields with high weeds pressure,”
Expert Systems with Applications, vol. 39, no. 15, pp. 11 889–11 897, 2012, Publisher:
Elsevier.

[27] J. Billingsley and M. Schoenfisch, “The successful development of a vision guidance system
for agriculture,” en, Computers and Electronics in Agriculture, Robotics in Agriculture,
vol. 16, no. 2, pp. 147–163, Jan. 1997, issn: 0168-1699. doi: 10.1016/S0168-1699(96)
00034 - 8. [Online]. Available: https : / / www . sciencedirect . com / science / article / pii / S
0168169996000348.

[28] M. Kise, Q. Zhang, and F. Rovira Más, “A Stereovision-based Crop Row Detection
Method for Tractor-automated Guidance,” en, Biosystems Engineering, vol. 90, no. 4,
pp. 357–367, Apr. 2005, issn: 1537-5110. doi: 10.1016/j.biosystemseng.2004.12.008. [On-
line]. Available: https://www.sciencedirect.com/science/article/pii/S1537511004002260.

[29] M. Kise and Q. Zhang, “Development of a stereovision sensing system for 3D crop row
structure mapping and tractor guidance,” en, Biosystems Engineering, vol. 101, no. 2,
pp. 191–198, Oct. 2008, issn: 1537-5110. doi: 10.1016/j.biosystemseng.2008.08.001. [On-
line]. Available: https://www.sciencedirect.com/science/article/pii/S1537511008002419.

[30] F. Rovira-Más, Q. Zhang, and J. F. Reid, “Stereo vision three-dimensional terrain maps
for precision agriculture,” en, Computers and Electronics in Agriculture, vol. 60, no. 2,
pp. 133–143, Mar. 2008, issn: 0168-1699. doi: 10.1016/j.compag.2007.07.007. [Online].
Available: https://www.sciencedirect.com/science/article/pii/S016816990700172X.

[31] Crop row benchmark dataset. [Online]. Available: http://www.etfos.unios.hr/r3dvgroup/
index.php?id=crd_dataset.

[32] G. Jiang, Z. Wang, and H. Liu, “Automatic detection of crop rows based on multi-rois,”
Expert systems with applications, vol. 42, no. 5, pp. 2429–2441, 2015.

[33] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedi-
cal image segmentation,” in International Conference on Medical image computing and
computer-assisted intervention, Springer, 2015, pp. 234–241.

[34] Wikipedia contributors, Overfitting — Wikipedia, the free encyclopedia, [Online; accessed
1-May-2020 ], 2020. [Online]. Available: https://en.wikipedia.org/w/index.php?title=
Overfitting&oldid=953810115.

[35] A. Buslaev, V. I. Iglovikov, E. Khvedchenya, A. Parinov, M. Druzhinin, and A. A. Kalinin,


“Albumentations: Fast and flexible image augmentations,” Information, vol. 11, no. 2,
2020, issn: 2078-2489. doi: 10 . 3390 / info11020125. [Online]. Available: https : / / www .
mdpi.com/2078-2489/11/2/125.

[36] L. McInnes, J. Healy, and S. Astels, “Hdbscan: Hierarchical density based clustering,”
Journal of Open Source Software, vol. 2, no. 11, p. 205, 2017.

[37] P. Yakubovskiy, Segmentation models pytorch, https://github.com/qubvel/segmentation


_models.pytorch, 2020.

[38] T. Fan, G. Wang, Y. Li, and H. Wang, “Ma-net: A multi-scale attention network for liver
and tumor segmentation,” IEEE Access, vol. 8, pp. 179 656–179 665, 2020. doi: 10.1109/
ACCESS.2020.3025372.

[39] J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, and H. Lu, “Dual attention network for
scene segmentation,” in Proceedings of the IEEE/CVF conference on computer vision and
pattern recognition, 2019, pp. 3146–3154.

[40] A. Chaurasia and E. Culurciello, “Linknet: Exploiting encoder representations for efficient
semantic segmentation,” in 2017 IEEE Visual Communications and Image Processing
(VCIP), IEEE, 2017, pp. 1–4.

[41] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in Proceed-
ings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2881–
2890.

[42] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid
networks for object detection,” in Proceedings of the IEEE conference on computer vision
and pattern recognition, 2017, pp. 2117–2125.

[43] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N.
Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani,
S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “Pytorch: An imperative
style, high-performance deep learning library,” in Advances in Neural Information Pro-
cessing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox,
and R. Garnett, Eds., Curran Associates, Inc., 2019, pp. 8024–8035. [Online]. Available:
http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-
deep-learning-library.pdf.

[44] G. Bradski, “The OpenCV Library,” Dr. Dobb’s Journal of Software Tools, 2000.

[45] P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau,


E. Burovski, P. Peterson, W. Weckesser, J. Bright, S. J. van der Walt, M. Brett, J.
Wilson, K. J. Millman, N. Mayorov, A. R. J. Nelson, E. Jones, R. Kern, E. Larson,
C. J. Carey, İ. Polat, Y. Feng, E. W. Moore, J. VanderPlas, D. Laxalde, J. Perktold, R.
Cimrman, I. Henriksen, E. A. Quintero, C. R. Harris, A. M. Archibald, A. H. Ribeiro,
F. Pedregosa, P. van Mulbregt, and SciPy 1.0 Contributors, “SciPy 1.0: Fundamental
Algorithms for Scientific Computing in Python,” Nature Methods, vol. 17, pp. 261–272,
2020. doi: 10.1038/s41592-019-0686-2.

[46] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” International Con-
ference on Learning Representations, Dec. 2014.

[47] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale
hierarchical image database,” in 2009 IEEE conference on computer vision and pattern
recognition, Ieee, 2009, pp. 248–255.

[48] J. Canny, “A computational approach to edge detection,” IEEE Transactions on pattern


analysis and machine intelligence, no. 6, pp. 679–698, 1986.

[49] I. D. García-Santillán, M. Montalvo, J. M. Guerrero, and G. Pajares, “Automatic detection


of curved and straight crop rows from images in maize fields,” Biosystems Engineering,
vol. 156, pp. 61–79, 2017.

[50] I. Vidović, R. Cupec, and Ž. Hocenski, “Crop row detection by global energy minimiza-
tion,” Pattern Recognition, vol. 55, pp. 68–86, 2016.

[51] M. N. Khan, V. P. Rajendran, M. Al Hasan, and S. Anwar, “Clustering algorithm based


straight and curved crop rowdetection using color based segmentation,” in ASME Interna-
tional Mechanical Engineering Congress and Exposition, American Society of Mechanical
Engineers, vol. accepted, 2020.

[52] I. Philipp and T. Rath, “Improving plant discrimination in image processing by use of
different colour space transformations,” Computers and electronics in agriculture, vol. 35,
no. 1, pp. 1–15, 2002.

[53] B. Åstrand and A.-J. Baerveldt, “An agricultural mobile robot with vision-based percep-
tion for mechanical weed control,” Autonomous robots, vol. 13, no. 1, pp. 21–35, 2002.
