Master Thesis Report-Compressed
Author: Dhiraj Kumar PANDEY
Supervisor: Prof. Dr. Luis LEIVA
Reviewer:
Prof. Dr. Thomas ENGEL
Advisors:
Dr. Alexej SIMETH
Dr. Jeff MANGERS
August 2024
Non-Disclosure Notice (Date: August 2024)
Abstract
The increasing challenge of plastic waste management has necessitated the de-
velopment of advanced technologies for efficient recycling processes. Among
various types of plastics, Polyethylene Terephthalate (PET) is particularly sig-
nificant due to its widespread use and recyclability. However, the effectiveness
of recycling efforts is often hindered by the difficulty in accurately sorting PET
from other plastics. This thesis addresses this issue by leveraging state-of-the-art
CV techniques to quantify the sorting efficiency. The research presented in this
thesis focuses on the development and implementation of a deep learning-based
system capable of accurately identifying, classifying, and counting different PET
plastics in real time on recorded videos of conveyor belts. A dataset was com-
piled, encompassing clear and colored PET plastics, as well as other materials
commonly found in plastic streams. The data was processed and annotated
to ensure high-quality input for model training. The deep learning model em-
ployed in this study, YOLOv8, was selected for its superior real-time processing
capabilities and accuracy. The model was trained and validated using the com-
piled dataset. The results show that the trained model achieved an mAP@50 of 84.7%.
As an accuracy Key Performance Indicator (KPI), a Separation Efficiency (SE)
formula is proposed; on the test data, the clear PET SE is recorded at 94.6% and
the colored PET SE at 81.1%.
Keywords: Plastic waste management, Polyethylene Terephthalate (PET),
computer vision, deep learning, YOLOv8, object detection, recycling, sustain-
ability, automated sorting.
Table of Contents
List of Figures
List of Tables
1 Introduction
1.1 Problem Statement
1.2 Experiment Objective
2 State-of-the-Art
2.1 Advanced Technologies in Automatic Plastic Sorting
2.2 Image Processing
2.2.1 Convolutional Neural Networks
2.2.2 Object Detection
2.2.3 Image Segmentation
2.2.4 Image Enhancement and Super-Resolution
2.2.5 Recent Advancements and Applications
2.3 PET detection using Computer Vision (CV)
2.4 Benchmarking of models
2.4.1 Machine Learning (ML) and Deep Learning (DL) in Waste Segregation
2.4.2 RetinaNet
3 Data Processing and Model Training
4 Results
5 Discussion and Conclusion
List of Figures
1 Example of modern machine learning pipeline (Datatron, 2023)
2 YOLOv8 architecture diagram (Ultralytics, 2023a)
3 Faster R-CNN architecture diagram (Endachev et al., 2019)
4 RetinaNet architecture diagram (Lin et al., 2017)
5 Workflow and experiment pipeline
6 PET images found on online datasets (Smirnovs, 2023)
7 Initial set of objects captured in sorting facility
8 Examples of objects clicked in lab setup
9 Examples of objects captured on conveyor belt
10 Annotation pipeline from raw video feed
11 Examples of annotations of filtered segments
12 Number of images vs Number of instances per image
13 Object tracking and counting on different streams
14 Sorting facility setup
15 Inference results of V1 models
16 Inference results of V2 models
17 Inference results of V3 models
18 Inference results of V4 models
19 Training result of V4 model
20 Trained model metrics
21 Trained model confusion matrix
List of Tables
1 Performance comparison between YOLOv8 and Mask R-CNN in orchard environments (Sapkota et al., 2024)
2 Performance comparison between YOLOv8 and Faster R-CNN for license plate detection (Angelika Mulia et al., 2024)
3 Number of images and instances per class in the dataset before augmentation
4 Performance metrics obtained by the trained YOLO model at specific confidence levels
5 Inference results based on different model versions
6 Recognition rates for different classes
7 Comparison of manually counted and model counted objects in testing videos
List of Abbreviations
CNNs Convolutional Neural Networks
COCO Common Objects in Context
CSP Cross-Stage Partial
CV Computer Vision
DETR Detection Transformer
DL Deep Learning
ESRGAN Enhanced Super-Resolution Generative Adversarial Networks
FN False Negative
FP False Positive
FPN Feature Pyramid Networks
GPU Graphics Processing Units
HDPE High Density Polyethylene
KPI Key Performance Indicators
mAP Mean Average Precision
ML Machine Learning
NIR Near Infrared
PET Polyethylene Terephthalate
PP Polypropylene
PR Precision-Recall
PS Polystyrene
ROI Region of Interest
RPN Region Proposal Network
SAM Segment Anything Model
SE Separation Efficiency
SOTA State-of-the-Art
SPPF Spatial Pyramid Pooling Fast
SRCNN Super-Resolution Convolutional Neural Network
SSD Single Shot MultiBox Detector
TP True Positive
TPU Tensor Processing Units
YOLO You Only Look Once
1 Introduction
The escalating issue of plastic waste has necessitated the development of ad-
vanced recycling technologies (Motunrayo, 2024), particularly for sorting differ-
ent types of plastics. Polyethylene Terephthalate (PET), widely used in pack-
aging and known for its recyclability, is a primary focus (Hopewell et al., 2009).
The success of recycling efforts largely depends on the accurate separation of
PET from non-PET plastics, a longstanding challenge in the recycling industry
(Ragaert et al., 2017). This thesis addresses this challenge by implementing
Computer Vision (CV) technologies to identify PET and non-PET plastics,
aiming to enhance the accuracy and efficiency of the sorting process, further
differentiating clear PET and colored PET.
The motivation for this study comes from the urgent need to improve recy-
cling rates and reduce the environmental footprint of plastic waste. Accurately
sorted plastics allow recycling facilities to significantly enhance the quality of
recycled materials and reduce contamination in the recycling stream. This not
only contributes to environmental sustainability but also supports the circular
economy by making recycled plastics more valuable and usable for manufactur-
ers. (WRAP, 2019)
To achieve the objectives, a CV-based approach is proposed, involving the
capture of images of objects during various stages of the sorting process (Lu and
Chen, 2022). These images are analyzed using a Deep Learning (DL) algorithm
to classify them as clear PET, colored PET, and Others. The SE of the sorting
process will be evaluated at different streams to identify potential improvements.
Through this research, the aim is to contribute to the development of more
efficient and sustainable recycling processes, addressing both the technical and
design aspects of plastic recycling.
This thesis is structured as follows: Chapter 2 discusses the State-of-the-
Art (SOTA) approaches relevant to the tasks at hand. Chapter 3 outlines the
methodologies employed in the experiment, including data processing and model
training. Chapter 4 presents the experimental results across various stages.
Finally, Chapter 5 concludes with an interpretation of the findings, a discussion
of limitations, and suggestions for future improvements.
Furthermore, there is no current method to measure the accuracy of segre-
gating clear PET from colored PET. The proposed CV and DL models aim to
identify errors in the segregation of clear and colored PET streams, enhancing
the overall efficiency and reliability of the recycling process.
2 State-of-the-Art
The plastic sorting industry has evolved significantly with advancements in
technology, aiming to enhance the efficiency and accuracy of sorting processes
(Padalkar et al., 2021). This section provides an overview of the current SOTA
technologies and methodologies in plastic sorting, including the latest Machine
Learning (ML) models and automated systems. Additionally, it presents a com-
parative analysis of various object detection models, supported by benchmarking
studies.
Deep Learning (DL) models now address a wide range of image processing tasks, from object
detection and recognition to image segmentation and generation. These models
excel at extracting complex features from images, enabling highly accurate and
efficient solutions. Additionally, advancements in hardware, such as Graphics
Processing Units (GPU) and Tensor Processing Units (TPU), have accelerated
the training and deployment of DL models for image processing applications.
As a result, the SOTA in image processing continues to evolve rapidly, with new
techniques and applications emerging regularly. (Valente et al., 2023)
Key Architectures
AlexNet: AlexNet is a pioneering convolutional neural network (CNN) archi-
tecture that advanced the field of DL by achieving groundbreaking results in
image classification tasks. Developed by Alex Krizhevsky, Ilya Sutskever, and
Geoffrey Hinton, AlexNet won the 2012 ImageNet Large Scale Visual Recogni-
tion Challenge (ILSVRC), demonstrating the effectiveness of DL in CV (Rus-
sakovsky et al., 2015). The network consists of eight layers, including five con-
volutional layers and three fully connected layers, and introduced innovations
like ReLU activation functions and dropout to prevent overfitting (Krizhevsky
et al., 2012).
VGGNet, another influential deep convolutional neural network architecture,
was introduced in 2014 (Simonyan and Zisserman, 2015). VGGNet is
characterized by its use of multiple convolutional layers with small (3x3) filters,
stacked on top of each other. This design allows VGGNet to learn increasingly
complex features as the network deepens. The network also employs pooling
layers to reduce dimensionality and computational complexity. VGGNet’s sim-
ple yet effective architecture has made it a popular choice for various CV tasks,
and it has served as a benchmark for many subsequent models (Canziani et al.,
2016).
ResNet, introduced in 2015, made a significant breakthrough in training
extremely deep neural networks. ResNet addresses the vanishing gradient prob-
lem, which hinders the training of deeper networks, by introducing skip connec-
tions. These connections allow the network to learn residual functions, making
it easier to optimize and achieve higher accuracy. ResNet achieved remarkable
success in various CV tasks, including image classification and object detection,
demonstrating the effectiveness of its deep architecture and innovative design.
(He et al., 2015)
Key Models
YOLO: YOLO is a popular one-stage object detection framework that has
gained significant attention due to its speed and accuracy (Redmon et al., 2016).
Unlike two-stage methods, YOLO directly predicts object bounding boxes and
classes in a single pass, making it more efficient. YOLO divides the input image
into a grid and predicts objectness scores, bounding box coordinates, and class
probabilities for each grid cell. This approach allows YOLO to detect multiple
objects simultaneously, making it suitable for real-time applications. The origi-
nal YOLO architecture has been further improved in subsequent versions, such
as YOLOv5 and YOLOv8, which have introduced enhancements like feature
pyramids and better backbone networks to enhance performance (Ultralytics,
2023b).
Faster R-CNN combines a Region Proposal Network with Fast R-CNN, sig-
nificantly improving the speed of generating region proposals and refining their
classification and bounding boxes (Ren et al., 2016). This model is known for
its accuracy but is slower than single-stage detectors like YOLO.
RetinaNet addresses the challenge of class imbalance in object detection
using focal loss, making it a powerful single-stage detector that balances speed
and accuracy (Lin et al., 2018).
Enhanced Super-Resolution Generative Adversarial Networks (ES-
RGAN) is a state-of-the-art DL-based approach for image super-resolution,
introduced by Wang et al. (2018). ESRGAN combines a perceptual loss function with ad-
versarial training to generate high-quality, photorealistic super-resolved images.
The perceptual loss encourages the network to learn features that are per-
ceptually meaningful to humans, rather than focusing solely on pixel-wise dif-
ferences. The adversarial training component further improves the quality of
the generated images by using a discriminator to distinguish between real and
fake images. ESRGAN has achieved impressive results in various image super-
resolution tasks, producing visually pleasing and high-fidelity images. ESRGAN
builds on the GAN framework to produce high-fidelity images with finer details,
improving upon previous models like SRGAN by using enhanced loss functions
and network architectures. (Wang et al., 2018)
Denoising autoencoder models can learn to remove noise from images,
enhancing their quality by training on corrupted and clean image pairs. (Kumar
et al., 2014)
Accurate identification and sorting of PET plastics are crucial for effective recycling and waste man-
agement. CV techniques offer a promising approach to automate this process,
enabling efficient and reliable PET plastic identification (Choi et al., 2023a).
This section reviews the state-of-the-art methodologies and technologies em-
ployed in the detection of PET plastics using CV.
Case Studies and Applications
Several studies and projects have demonstrated the effectiveness of CV and DL
in PET plastic detection:
AMP Robotics is a leading provider of AI-powered robotic sorting solu-
tions for the recycling industry (Pransky, 2020). Their innovative technology
has been successfully applied to PET plastic recycling, significantly improving
efficiency and accuracy. By utilizing advanced CV algorithms and robotic arms,
AMP Robotics can accurately identify and sort PET plastics from mixed waste
streams, ensuring high-quality recycled materials. Their systems have been
deployed in various recycling facilities worldwide, contributing to the circular
economy and reducing plastic waste (Pact, 2024).
ZEN Robotics is another prominent company specializing in AI-powered
robotic sorting solutions, with a strong focus on PET plastic recycling (Lukka
et al., 2014). Their advanced robotic systems utilize CV and machine learning
algorithms to accurately identify and sort various materials, including PET
plastics, from mixed waste streams. Zen Robotics’ technology has been deployed
in recycling facilities worldwide, demonstrating its effectiveness in improving
sorting efficiency and reducing contamination rates. By automating the sorting
process, Zen Robotics helps to increase the value of recycled PET plastics,
contributing to a more sustainable circular economy. (Corporation, 2024)
Founded in 2019, Recycleye was created with the mission to apply advanced
AI to the traditionally manual and labor-intensive process of waste sorting. The
company aims to address the inefficiencies in current waste management sys-
tems, where human workers often perform monotonous and hazardous tasks of
sorting recyclable materials. By automating this process, Recycleye seeks to im-
prove the accuracy, efficiency, and safety of waste sorting operations, ultimately
contributing to a more sustainable environment. (Ltd., 2023)
Conclusion
The integration of CV and DL in PET plastic detection has advanced the ca-
pabilities of automated recycling systems. Models like YOLO, Faster R-CNN,
and U-Net, along with spectroscopy and sensor fusion techniques, provide ro-
bust solutions for accurately identifying and sorting PET plastics (Choi et al.,
2023b). As these technologies continue to evolve, they will play an increasingly
vital role in promoting efficient recycling processes and supporting sustainable
waste management practices.
ML and DL techniques address the challenges posed by the complex and heterogeneous nature of waste streams,
enabling more efficient and accurate recycling processes. Recent developments
and experiments in this area have shown promising results, indicating that ML
and DL can overcome the limitations of traditional waste sorting techniques,
thereby improving the overall effectiveness of waste management systems (Zhu
et al., 2021a). This section explores the current state-of-the-art methodologies
in machine learning (ML) and deep learning (DL) for waste segregation, em-
phasizing their capabilities, benefits, and the significant impact they have had
on enhancing recycling processes.
YOLOv8
As shown in Figure 2, the YOLOv8 architecture can be conceptualized into
three fundamental components:
1. Backbone: The backbone serves as the CNN that extracts essential fea-
tures from the input image. In YOLOv8, this role is fulfilled by a custom
CSPDarknet53 backbone, which integrates Cross-Stage Partial (CSP) con-
nections. These connections are designed to enhance the flow of information
across the network.

Figure 2: YOLOv8 architecture diagram (Ultralytics, 2023a)

2. Neck: The neck aggregates and fuses the multi-scale features produced by
the backbone and passes them to the head.

3. Head: The head generates the final predictions. It employs multiple
detection modules that are responsible for
predicting bounding boxes, objectness scores, and class probabilities for
each grid cell within the feature map. These predictions are subsequently
aggregated to produce the final detections, ensuring a robust and precise
identification of objects across the image.
This architecture allows YOLOv8 to deliver high-speed and accurate object
detection, making it well-suited for real-time applications in complex environ-
ments, such as automated waste segregation systems.
YOLOv8’s remarkable performance is driven by several key innovations.
YOLOv8 integrates a Spatial Attention Mechanism that enhances object
localization by concentrating on the most relevant parts of the image. This
selective focus allows the model to accurately identify and localize objects, even
in complex scenes (Agbehadji et al., 2022).
The C2f module in YOLOv8 plays a critical role in merging high-level seman-
tic features with low-level spatial details. This feature fusion is particularly
beneficial for improving the detection accuracy of small objects, which often
pose challenges in object detection tasks. (Joiya, 2022)
The CSPDarknet53 backbone in YOLOv8 incorporates bottleneck struc-
tures that reduce computational complexity without compromising accuracy.
Furthermore, the Spatial Pyramid Pooling Fast (SPPF) layer is designed
to capture features at multiple scales, which significantly enhances the model’s
overall detection performance by allowing it to detect objects of varying sizes
more effectively (Agbehadji et al., 2022).
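To make the detection step concrete, the following is a minimal inference sketch using the Ultralytics Python API; the weight file name and image path are placeholders rather than files from this work, and the confidence threshold is illustrative.

```python
# Minimal sketch: running a trained YOLOv8 detector on a single conveyor-belt frame.
# "best.pt" and "conveyor_frame.jpg" are placeholder paths, not files from this thesis.
from ultralytics import YOLO

model = YOLO("best.pt")                               # load trained weights
results = model("conveyor_frame.jpg", conf=0.5)       # single forward pass

for box in results[0].boxes:
    cls_id = int(box.cls[0])                          # predicted class index
    score = float(box.conf[0])                        # confidence score
    x1, y1, x2, y2 = box.xyxy[0].tolist()             # bounding box corners
    print(model.names[cls_id], round(score, 2), [round(v) for v in (x1, y1, x2, y2)])
```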
Faster R-CNN
Faster R-CNN is composed of three main components as shown in Figure 3.
• Convolutional Layers: The convolutional layers of the backbone CNN
function as a series of filters that learn to detect specific patterns asso-
ciated with the objects of interest during training. For example, if the
goal is to detect human faces, the filters will learn to recognize shapes,
textures, and colors unique to facial features. This process is analogous to
how a coffee filter works: just as a coffee filter allows only liquid to pass
through while trapping the grounds, the convolutional layers allow only
the most relevant features of the object to be passed forward, filtering out
irrelevant information. The output of these convolutional layers is a fea-
ture map that serves as a condensed representation of the original image.
CNNs, which form the backbone of Faster R-CNN, typically consist of
a sequence of convolutional layers, pooling layers, and a fully connected
layer. The convolutional layers perform the feature extraction, the pooling
layers reduce the dimensionality of the feature maps by retaining the most
significant information, and the fully connected layer usually handles the
final classification task.
• RPN: The RPN is a small, efficient neural network that slides over the
feature map generated by the convolutional layers. It scans the entire
feature map and predicts whether any region contains an object. Addi-
tionally, the RPN proposes bounding boxes around these detected objects.
The RPN’s primary function is to generate potential Region of Interest
(ROI) that are likely to contain objects, thus narrowing down the search
space for the subsequent steps.
• Classification and Bounding Box Prediction: After the RPN has proposed
potential object regions, these proposals are passed to a fully connected
network that performs two tasks: classification and bounding box regres-
sion. The network classifies each proposed region into a specific object
class and refines the bounding box coordinates to more accurately enclose
the object. This final step ensures that the detected objects are correctly
identified and precisely located within the image.
This architecture allows Faster R-CNN to efficiently and accurately detect
objects in images, making it a powerful tool for various CV tasks. This model
excels in accuracy, utilizing a two-stage process with a region proposal network
followed by a Fast R-CNN detector. While it provides high precision, its infer-
ence time is slower due to the more complex processing pipeline. Faster R-CNN
is suitable for applications where accuracy is paramount, but it may not meet
the real-time processing requirements of fast-paced sorting facilities. (Ren et al.,
2016)
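For comparison only (the system in this thesis is built on YOLOv8), a minimal sketch of inference with torchvision's Faster R-CNN implementation is shown below; the image path is a placeholder, and a real application would replace the COCO-pretrained box predictor and fine-tune on the annotated PET dataset.

```python
# Minimal sketch: inference with torchvision's Faster R-CNN (COCO-pretrained weights).
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = to_tensor(Image.open("conveyor_frame.jpg").convert("RGB"))  # placeholder path
with torch.no_grad():
    output = model([image])[0]          # dict with 'boxes', 'labels', 'scores'

keep = output["scores"] > 0.5           # keep confident detections only
print(output["boxes"][keep], output["labels"][keep])
```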
2.4.2 RetinaNet
Let’s dive a bit deeper into RetinaNet and its components. Figure 4 shows the
RetinaNet architecture diagram. RetinaNet is a widely used object detection
architecture known for its ability to achieve a balance between speed and ac-
curacy, particularly when detecting objects across a wide range of scales. The
architecture of RetinaNet is built on the principle of a single-stage detector, but
it incorporates a novel loss function called the Focal Loss, which addresses the
issue of class imbalance during training. Figure 4 shows the detailed explanation
of its architecture. (Lin et al., 2017)
• Box Regression Subnet: In parallel with the classification subnet, a
regression subnet predicts the offsets between each anchor box and a
nearby ground-truth box. Like the classification subnet, the regression
subnet is also shared across all pyramid levels.
• Focal Loss: One of the key innovations in RetinaNet is the introduction
of Focal Loss, which addresses the issue of class imbalance (Lin et al.,
2018). In typical object detection tasks, there are far more background
(negative) samples than foreground (positive) samples. This imbalance
can lead to the model focusing too much on easy negatives, resulting
in suboptimal performance. Focal Loss reduces the relative loss for well-
classified examples, allowing the model to focus more on hard, misclassified
examples during training. This leads to improved detection performance,
especially for small and difficult-to-detect objects.
• Anchor Boxes: RetinaNet uses anchor boxes of various sizes and aspect
ratios to detect objects. These anchor boxes are predefined bounding
boxes that serve as initial guesses for object locations. The network then
adjusts these boxes through the regression subnet to match the ground
truth more accurately.
• Inference: During inference, RetinaNet processes the image through the
backbone and FPN to generate feature maps at multiple scales. The clas-
sification and regression subnets then predict the class probabilities and
bounding box offsets for each anchor box. Non-Maximum Suppression is
applied to remove redundant boxes and produce the final set of detections.
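To make the focal loss described above concrete, a minimal PyTorch sketch for binary classification logits is given below; alpha and gamma follow the defaults of Lin et al. (2018), and the example tensors are illustrative only.

```python
# Minimal sketch of the focal loss (Lin et al., 2018) for binary logits.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Down-weights easy, well-classified examples so training focuses on hard ones."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)             # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

# An easy positive (logit 4.0) contributes far less than an uncertain one (logit -0.2).
logits = torch.tensor([4.0, -0.2])
targets = torch.tensor([1.0, 1.0])
print(focal_loss(logits, targets))
```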
Benchmarking Studies
The choice between YOLOv8, RetinaNet, and Faster R-CNN depends largely on
the specific requirements of the application. YOLOv8 offers unparalleled speed,
making it the go-to choice for real-time detection tasks (Joiya, 2022). RetinaNet
provides a balanced approach, offering both speed and accuracy, particularly in
scenarios with class imbalance. Faster R-CNN remains the gold standard for
tasks that demand the highest accuracy and precision, albeit at the cost of
slower processing times.
Precision and Recall are key metrics for evaluating classification models,
particularly in tasks like object detection. Precision measures the accuracy of
the positive predictions. It is the proportion of correctly predicted positive
instances out of all instances predicted as positive (Equation 1).
Recall (or Sensitivity) measures how well the model identifies all actual positive
instances. It is the proportion of correctly predicted positive instances out of
all actual positive instances (Equation 2).
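In terms of TP, FP, and FN (see the List of Abbreviations), the referenced formulas correspond to the standard definitions:

Precision = \frac{TP}{TP + FP} \qquad (1)

Recall = \frac{TP}{TP + FN} \qquad (2)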
For applications like real-time video analysis or embedded systems where
speed is crucial, YOLOv8 is likely the best choice based on the various studies
done on different datasets. An experiment by Sapkota et al. in a complex
orchard environment shows that YOLOv8 offers superior speed and performance,
making it more advantageous for real-time applications in agriculture. The findings
of the experiment are shown in Table 1.
Angelika Mulia et al. conducted an experiment on license plate recognition
and found that YOLOv8 performed better than the Faster R-CNN model.
Table 2 shows the experiment results.
Ezzeddini et al. also showed that YOLOv8 excels in real-time scenarios. For
highly specialized tasks where detection accuracy is non-negotiable, such as in
medical imaging or high-stakes security systems, Faster R-CNN is the preferred
model.
Benchmarking studies provide a comprehensive evaluation of these models,
highlighting their performance across various datasets and metrics. For exam-
ple, the COCO object detection challenge and other similar benchmarks have
consistently shown that YOLO models, including YOLOv8, achieve faster infer-
ence times while maintaining competitive accuracy compared to Faster R-CNN
and RetinaNet. These benchmarks underscore the trade-offs between speed and
accuracy and guide the selection of models based on specific application needs.
3 Data Processing and Model Training
This section outlines the methodological framework employed in the study. The
experimental methodology is described in subsection 3.1. Subsection 3.2 details
the data collection processes. In subsection 3.3, the data processing and cleaning
methods are explained. Subsection 3.4 discusses the rationale behind selecting
specific machine learning models for this task. Subsection 3.5 elaborates on the
training strategies and optimization techniques adopted. Subsections 3.6 and
3.7 focus on the methodologies employed for object tracking, object counting,
and the evaluation of model performance using relevant KPIs.
3.1 Methodology
This section presents the methods and the necessary steps of the experiment conducted
to solve the problem addressed in this thesis. The thesis follows the workflow
and pipeline shown in Figure 5.
3.2 Data Collection
Developing an effective model required the assembly of a comprehensive dataset
that included both clear PET and colored PET images. To meet this require-
ment, an extensive search was conducted across various internet platforms such
as Google, Roboflow, and Kaggle, to locate existing datasets. This search led
to the identification of numerous publicly accessible user-generated datasets.
However, these datasets were primarily composed of laboratory-generated im-
ages, featuring new and pristine plastic materials rather than actual waste. This
posed a significant concern, as it is crucial for the dataset to reflect real-world
conditions, where plastics are frequently exposed to environmental factors that
result in wear, fading, and damage. The lack of diversity and realism in these
datasets could hinder the model’s ability to generalize effectively in real-world
scenarios. Figure 6 shows some of these examples.
These datasets did not match the tough conditions in a sorting facility, where
plastics are often bent, crushed, or misshapen. To fix this, data was collected
directly from the sorting facility. Photos were taken to show the sorting process
clearly. The goal is to build a dataset that closely matches real-world conditions,
with plastics in various shapes, sizes, and conditions.
The focus is on capturing details like how the conveyor belt moved, how the
plastics interacted with the machines, and the lighting conditions in the facility.
These elements were included to create a thorough dataset that could teach the
model to handle different situations, making it more accurate and reliable. The
first images captured, shown in Figure 7, provide detailed views of the real environment,
highlighting the conditions the model needs to handle.
The first set of images captured had many issues, including blurriness and
overlapping objects, as shown in Figure 7. These problems made it very diffi-
cult to annotate the images and train the models effectively. The images had
poor quality due to the fast movement of plastics on the conveyor belt during
sorting, which made it hard to get clear, sharp pictures. The sorting facility’s
environment, with its complex machinery and different lighting conditions, also
added to the challenge of capturing high-quality images. These poor-quality
images made it tough to identify and label the different types of plastics, which
could negatively affect how well the model learned and generalized.
Figure 7: Initial set of objects captured in sorting facility
To address the issues with the initial dataset, individual pieces of trash were
collected and photographed separately. This method was intended to create a
higher-quality dataset to improve classification accuracy. Seven bags of various
plastics were gathered from the sorting facility. In a controlled lab setup, a
black mat was used to mimic the conveyor belt, enabling the capture of over
1,000 images. These images were labeled to ensure accuracy, providing a diverse
dataset for better model training. Some examples of these images are shown in
Figure 8.
To improve the dataset, a further experiment was carried out to capture live
feeds of conveyor belts carrying clear PET, colored PET, and mixed plastics.
Images were extracted from the captured video feeds. The goal was to gather
real-time images that accurately represent the variability and conditions found
in a sorting facility.
The live feed capture was planned to cover the full range of the sorting envi-
ronment on different conveyor belts. This included documenting the continuous
movement of plastics on conveyor belts under different lighting conditions and
interactions with machinery. The resulting dataset contained a wide range of
images showing plastics in various states, including clear and colored PET, both
separately and mixed together.
This video-to-frame method offered several key benefits. It allowed the col-
lection of images that closely mirror real-world scenarios, where plastics are
often mixed and affected by changing light and positions. Also, the live feeds
captured the dynamic nature of the sorting process, providing the model with
a realistic dataset for learning. The images were then annotated to accurately
identify each type of plastic, helping the model learn to distinguish between
different plastics in real-world conditions.
Examples of these live feed images are shown below (Figure 9), highlighting
the enhanced quality and diversity of the dataset. This improvement is expected
to greatly increase the model’s classification accuracy by exposing it to a wide
variety of scenarios and conditions similar to those in operational environments.
These video feeds include a range of plastic types such as clear PET, colored
PET, and mixed plastics. Frames are extracted at specific intervals to build
a diverse collection of static images that accurately reflect the dynamic envi-
ronment of the sorting facility. This video-to-frame method ensures that the
dataset captures the variability and complexity found in real-world conditions,
laying a foundation for the subsequent tasks of annotation and model training.
For this experiment, 10 frames per second were extracted. Extracting 10 frames
per second from the video feeds provided a comprehensive representation of the
conveyor belt. This frame rate also ensured that the dataset included a wide
range of plastic positions, orientations, and interactions.
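A minimal sketch of this video-to-frame step, assuming OpenCV, is shown below; the file names and output folder are placeholders rather than the paths used in this work.

```python
# Minimal sketch: save roughly 10 frames per second from a conveyor-belt video.
import cv2
from pathlib import Path

def extract_frames(video_path, out_dir, target_fps=10):
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    cap = cv2.VideoCapture(str(video_path))
    video_fps = cap.get(cv2.CAP_PROP_FPS) or 30          # fall back if FPS is unreadable
    step = max(1, round(video_fps / target_fps))         # keep every `step`-th frame
    idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            cv2.imwrite(str(out / f"frame_{saved:06d}.jpg"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved

print(extract_frames("conveyor_clear_pet.mp4", "frames/clear_pet"))  # placeholder paths
```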
Segments whose contour area falls below a minimum threshold are discarded to
eliminate minor details and noise that are unlikely to contribute significantly
to the model’s learning process. By focusing on larger segments, the dataset
retains only those portions of the image that are substantial enough to provide
valuable information during training.
After filtering based on contour area, the dataset undergoes an overlap fil-
tering process. In this step, segments that overlap more than 80% with larger
segments are excluded. This is crucial for removing redundant segments that
could introduce confusion and inefficiency during model training. By ensuring
that each segment is distinct and non-redundant, the model can better focus on
learning the unique characteristics of each type of plastic, thereby enhancing its
ability to generalize and perform accurately in real-world scenarios.
This filtering process greatly improves the dataset by retaining only mean-
ingful and distinct segments, resulting in more manageable and higher-quality
segments. Such refinement is vital for the subsequent stages of model training,
ultimately leading to enhanced accuracy and reliability in practical applications.
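A minimal sketch of this two-step filtering is given below; the minimum-area threshold and the (x, y, w, h) segment representation are assumptions, while the 80% overlap rule follows the text.

```python
# Minimal sketch: drop small segments, then drop segments mostly covered by a larger one.
def overlap_ratio(small, big):
    """Fraction of `small`'s area covered by `big`; both are (x, y, w, h) boxes."""
    x1, y1 = max(small[0], big[0]), max(small[1], big[1])
    x2 = min(small[0] + small[2], big[0] + big[2])
    y2 = min(small[1] + small[3], big[1] + big[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    return inter / (small[2] * small[3])

def filter_segments(segments, min_area=1500, max_overlap=0.8):
    # 1) remove tiny segments that mostly encode noise
    segments = [s for s in segments if s[2] * s[3] >= min_area]
    # 2) largest first; drop any segment that is more than 80% inside a kept larger one
    segments.sort(key=lambda s: s[2] * s[3], reverse=True)
    kept = []
    for seg in segments:
        if all(overlap_ratio(seg, big) <= max_overlap for big in kept):
            kept.append(seg)
    return kept

# The fully nested box and the tiny box are removed; only the large segment remains.
print(filter_segments([(0, 0, 100, 100), (10, 10, 40, 40), (500, 500, 10, 10)]))
```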
Figure 11: Examples of annotations of filtered segments
Before augmentation, the dataset is divided into three subsets to facilitate
effective training, validation, and testing of the model. 70% of the dataset is
allocated to the training set, which serves as the primary source of data for
the model to learn underlying patterns and features. 20% is designated for the
validation set, which is used during training to fine-tune hyperparameters and
iteratively improve the model, ensuring it generalizes well beyond the training
data. The remaining 10% is reserved as the test set, providing an unbiased eval-
uation of the final model’s performance, and allowing for a thorough assessment
of its accuracy and generalization capabilities.
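A minimal sketch of this 70/20/10 split is shown below; the folder layout and random seed are assumptions, not the exact procedure used in this work.

```python
# Minimal sketch: shuffle the image list and split it 70/20/10 into train/val/test.
import random
from pathlib import Path

images = sorted(Path("dataset/images").glob("*.jpg"))   # placeholder folder
random.seed(42)                                         # reproducible shuffle
random.shuffle(images)

n = len(images)
n_train, n_val = int(0.7 * n), int(0.2 * n)
splits = {
    "train": images[:n_train],
    "val": images[n_train:n_train + n_val],
    "test": images[n_train + n_val:],                   # remaining ~10%
}
for name, files in splits.items():
    print(name, len(files))
```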
Table 3 shows that the ”clear-pet” class appears in 337 images with a total
of 1,730 instances, making it the most represented class in the dataset. The
”colored-pet” class is found in 289 images with 779 instances, while the ”others”
class, which likely includes miscellaneous items, appears in 109 images with 162
instances. The images used in the dataset come from two primary sources:
lab-clicked images and frames extracted from video footage. All images have a
resolution of 640x640 pixels, ensuring consistency in the dataset.
Figure 12 illustrates the distribution of the number of images relative to the
number of instances per image in the dataset. On the x-axis, the ”Number of
Instances” refers to the count of distinct objects or segments present in each
image, while the y-axis represents the ”Number of Images” corresponding to
each instance count. The majority of images contain only one instance, as
indicated by the highest bar at the ”1” mark, with 450 images in this category.
This also denotes that the majority of dataset images are lab-clicked images with
a single instance per image. As the number of instances per image increases, the
number of images decreases significantly. Figure 12 also shows that, on average,
there are 3.83 instances in an image.
The augmentation process involves applying several transformations to each
image in the training dataset. Horizontal and vertical flips are performed to
ensure that the model can recognize plastics regardless of their orientation.
Rotations within a range of -15° to +15° introduce angular variations. Addi-
tionally, shear transformations applied both horizontally and vertically within
a range of ±10°, simulate perspective changes and enhance the model’s ability
to generalize across different viewpoints.
By dividing the dataset and integrating these augmentations, this stage en-
sures that the model is trained on a comprehensive and representative sample
of data.
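One way to express the described augmentations in code, assuming the albumentations library, is sketched below; the thesis pipeline may have applied equivalent transforms through its annotation tooling instead, and the probabilities are illustrative.

```python
# Minimal sketch: flips, ±15° rotation, and ±10° shear with bounding boxes kept in sync.
import albumentations as A

augment = A.Compose(
    [
        A.HorizontalFlip(p=0.5),
        A.VerticalFlip(p=0.5),
        A.Rotate(limit=15, p=0.5),                                 # -15° to +15°
        A.Affine(shear={"x": (-10, 10), "y": (-10, 10)}, p=0.5),   # ±10° shear
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

# Usage (image is an HxWx3 array, yolo_boxes are normalized [xc, yc, w, h] boxes):
# out = augment(image=image, bboxes=yolo_boxes, class_labels=labels)
```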
Table 3: Number of Images and Instances per Class in the dataset before aug-
mentation
Figure 12: Number of images vs Number of instances per image
YOLOv8, with its efficient design and support for various customizations, meets these requirements well.
Given the priority of fast inference in this project, YOLOv8 was selected
over other models like Faster R-CNN and RetinaNet due to its superior real-
time processing capabilities.
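A minimal training sketch using the Ultralytics API is given below; the dataset description file name, epoch count, and batch size are illustrative assumptions rather than the exact settings of this work.

```python
# Minimal sketch: fine-tune a pretrained YOLOv8 model on the PET dataset.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")            # start from COCO-pretrained weights
model.train(
    data="pet.yaml",                  # hypothetical YAML listing splits and the three classes
    epochs=100,
    imgsz=640,                        # matches the 640x640 dataset resolution
    batch=16,
)
metrics = model.val()                 # precision, recall, and mAP@50 on the validation split
print(metrics.box.map50)
```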
For object tracking and counting, videos were recorded of conveyor belts carrying different plastic
types, such as clear PET, colored PET, and other mixed materials. Given the
continuous movement of objects on the conveyor belts, ten frames per second
are extracted to create a sequence of static images.
This approach ensured a dataset that accurately represented the flow of
materials on the conveyor belts, providing a foundation for subsequent detection,
tracking, and counting tasks. The trained version 4 model is used for object
tracking. The tracking algorithm gives a unique track ID to each detected object
in the frame. An example of object tracking and counting is shown in Figure
13.
One of the critical challenges addressed in this project was the potential for
counting the same object multiple times if counting is based on the track
ID alone: the conveyor belt continuously moves, changing the position of objects
in consecutive frames, and an object may temporarily disappear from the frame
due to occlusion or leave and re-enter the frame. An algorithm is needed to
overcome this issue. A virtual line is drawn across the frame, serving
as a checkpoint.
For each detected object, the model outputs the bounding box coordinates,
the class of the object, the confidence score, and the track ID. These details
are used to update the tracking history and determine the object’s movement
relative to a predefined horizontal line placed at 80% of the frame height from
the bottom.
This line ensures that each object is counted only once. The algorithm
updates the count only when the object’s center crosses this line, partially pre-
venting multiple counts of the same object even if it reappears in the frame
later.
Additionally, the algorithm maintains a record of the highest confidence score
associated with each object’s classification. If an object is detected with a higher
confidence score in a subsequent frame, the classification is updated to reflect
this. This ensures that the classification accuracy is maximized throughout the
video. The total object count and the count for each class are calculated and
produced as output.
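A minimal sketch of this tracking-and-counting logic, using the Ultralytics tracking API, is shown below; the weight and video paths are placeholders, and it assumes objects move downward in the frame toward the counting line.

```python
# Minimal sketch: count each track ID once when its box center crosses a virtual line,
# keeping the highest-confidence class seen for that track.
from collections import defaultdict
from ultralytics import YOLO

model = YOLO("best.pt")                                  # trained V4 weights (placeholder path)
last_y, best_cls = {}, {}                                # per-track previous center y / best class
counted, counts = set(), defaultdict(int)

for result in model.track("output_stream.mp4", persist=True, stream=True):
    if result.boxes.id is None:
        continue
    line_y = 0.8 * result.orig_shape[0]                  # counting line at 80% of frame height
    for box in result.boxes:
        tid = int(box.id[0])
        conf, cls = float(box.conf[0]), int(box.cls[0])
        if conf > best_cls.get(tid, (0.0, cls))[0]:      # keep the best classification per track
            best_cls[tid] = (conf, cls)
        cy = float(box.xywh[0][1])                       # center y of the bounding box
        prev = last_y.get(tid, cy)
        if prev < line_y <= cy and tid not in counted:   # center crossed the line this frame
            counted.add(tid)
            counts[model.names[best_cls[tid][1]]] += 1
        last_y[tid] = cy

print(dict(counts), "total:", len(counted))
```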
The counted objects fall into three classes: "Clear PET", "Colored PET", and "Others". To calculate the SE of the colored
PET stream, the following formula can be applied:

SE_{colored} = \frac{TP_{colored}}{TP_{colored} + FN_{colored}} \times 100 \qquad (4)

Where:

SE_{colored}: SE of the colored output stream
TP_{colored}: TP for the colored output stream (number of correctly sorted
colored PET items in the colored output stream)

FN_{colored}: False Negative (FN) for the colored output stream (number of
colored PET items that should have been in the colored output stream
but ended up in the clear PET output stream)
Similarly, the SE of the clear PET stream can be calculated by the formula:

SE_{clear} = \frac{TP_{clear}}{TP_{clear} + FN_{clear}} \times 100 \qquad (5)

Where:

SE_{clear}: SE of the clear PET stream

TP_{clear}: TP for the clear output stream (number of correctly sorted clear
PET items in the clear output stream)

FN_{clear}: FN for the clear output stream (number of clear PET items
that should have been in the clear output stream but ended up in the colored PET
output stream)
The SE is influenced by the precision of the trained model. When the model’s
Mean Average Precision (mAP) for a particular class is less than 100%, it indi-
cates that the model will misclassify some items belonging to that class. This
misclassification increases the number of False Negatives (FN), which are
items that belong to a class but are incorrectly classified as belonging to an-
other class. Conversely, this will also lead to a decrease in True Positives
(TP), which are the items correctly identified as belonging to the class.
Since the SE calculation depends on both TP and FN values, the model’s
mAP directly impacts the calculated SE. Therefore, to accurately calculate SE,
the model’s performance for each class must be considered.
The SE for a specific output stream i can be calculated as:

SE_i = \frac{TP_i \times mAP_i}{TP_i \times mAP_i + FN_i} \times 100 \qquad (6)

Where:

SE_i: SE of output stream i

TP_i: TP for output stream i

FN_i: FN for output stream i

mAP_i: mAP of the model for the target class of output stream i
For example:
• If the mAP for clear PET is 81%, this means that 19% of clear PET items
will be misclassified. Consequently, this increases the FN count for clear
PET and decreases the TP count, leading to a lower SE for clear PET.
• Similarly, if the mAP for colored PET is 86%, the SE calculation for
colored PET will be similarly affected.
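A minimal sketch of Equation 6 is given below; the TP and FN counts in the example are hypothetical, while the mAP value matches the clear PET rate quoted above.

```python
# Minimal sketch of Equation 6: SE_i = (TP_i * mAP_i) / (TP_i * mAP_i + FN_i) * 100.
def separation_efficiency(tp, fn, map_class):
    weighted_tp = tp * map_class
    return weighted_tp / (weighted_tp + fn) * 100

# Hypothetical counts: 600 correctly routed clear PET items, 20 in the wrong stream,
# with a clear PET mAP of 0.81 -> SE of about 96.0%.
print(round(separation_efficiency(tp=600, fn=20, map_class=0.81), 1))
```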
4 Results
In this section, all results of the experiment are presented. The YOLOv8 optimizer
is set to auto. The auto optimizer setting for YOLOv8 uses the AdamW
optimizer with a learning rate of 0.001429 and a momentum value of 0.9. The
optimizer is configured with three parameter groups: one group of 57 weights
with no weight decay, another group of 64 weights with a decay rate of 0.0005
for regularization, and a final group of 63 biases also with no decay. Two images
are used for inference to compare the results of all the experiments. Both
images were taken on the conveyor belt.
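For reference, a sketch of how these auto-selected settings could instead be fixed explicitly in an Ultralytics training run is shown below; the dataset file and epoch count are placeholders.

```python
# Minimal sketch: pin the optimizer settings reported above instead of optimizer="auto".
from ultralytics import YOLO

model = YOLO("yolov8n.pt")
model.train(
    data="pet.yaml",                  # hypothetical dataset description file
    optimizer="AdamW",
    lr0=0.001429,                     # initial learning rate chosen by the auto setting
    momentum=0.9,
    weight_decay=0.0005,              # applied to the decayed weight parameter group
    epochs=100,
    imgsz=640,
)
```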
V1 result
The V1 model recorded a precision of 1.0 at a confidence threshold of 0.983
and a recall of 0.98 at a confidence threshold of 0. This model achieved an
mAP of 87.8% at an IoU threshold of 0.5 (mAP@50). It also recorded an F1-score of
0.83 at a confidence threshold of 0.663. Figure 15 shows the inference results of the
V1 model.
V2 result
The V2 model recorded a precision of 1.0 at a confidence threshold of 0.966
and a recall of 0.96 at a confidence threshold of 0. This model achieved an
mAP of 81.5% at an IoU threshold of 0.5 (mAP@50). It also recorded an F1-score of
0.75 at a confidence threshold of 0.371. Figure 16 shows the inference results of the
V2 model.
V3 result
The V3 model recorded a precision of 1.0 at a confidence threshold of 0.949
and a recall of 0.95 at a confidence threshold of 0. This model achieved
Figure 16: Inference results of V2 models
V4 result
Figure 18 shows the inference result of the V4 model. The final experimental
results show that the model achieved an overall accuracy of 84.7% mAP
(Figure 19b) across the three classes (clear PET, colored PET, and others). Figure
19a shows the testing of the trained algorithm. Figure 20 provides a graphical
representation of the performance metrics obtained by the trained YOLO model.
In the precision graph (Figure 20a), a precision of 1.0 is reached at a confidence
threshold of 0.954. In the recall graph (Figure 20b), a recall of 0.94 is recorded
at a confidence threshold of 0. The Precision-Recall (PR) curve (Figure 20c) shows
an mAP of 84.5% at an IoU threshold of 0.5. In the F1-score graph (Figure
20d), a value of 0.81 is recorded at a confidence threshold of 0.523. The performance
metrics are also shown in Table 4.
As for Figure 21, it presents the confusion matrix generated by the version 4
trained YOLOv8 model, which categorizes objects into clear PET, colored PET,
others, and background across the entire dataset. Table 5 shows the recognition
rates of different classes.
Class TP FN FP TN
Clear PET 81% 19% 18% 0%
Colored PET 86% 14% 13% 0%
Others 90% 10% 1% 0%
Background 76% 24% 0% 0%
Figure 19: Training result of the V4 model: (a) testing of the trained algorithm; (b) mAP graph
Capturing testing videos for both input and output streams presented sig-
nificant challenges due to the inability to stop the conveyor. This made it
difficult to record videos from different streams simultaneously, leading to po-
tential discrepancies. Despite these challenges, 2-minute videos were recorded
as simultaneously as possible. As shown in Table 6, the clear PET output video
had 713 manually counted objects, while the trained model counted only 622.
Similarly, for the colored PET test video, the manual count was 113 objects,
but the model counted only 101. These discrepancies highlight the challenges
faced both in capturing the videos and in the model's analysis of the data.
The SE of the trained model is calculated for the test videos: 94.6% for clear
PET and 81.1% for colored PET.
Figure 20: Trained model metrics: (a) precision; (b) recall; (c) Precision-Recall curve; (d) F1-score
Figure 21: Trained Model Confusion Matrix
5 Discussion and Conclusion
5.1 Interpretation of Experimental Results
The interpretation and comparison of the experimental results provide valuable
insights into the effectiveness of the proposed system. The overall classification
accuracy, reflected by an 84.7 mAP for clear PET, colored PET, and other
materials, underscores the system’s high precision in distinguishing between
various types of PET plastics. This level of accuracy is particularly important
for calculating the SE of the PET sorting machine, as it directly influences the
reliability of the sorting process.
However, while the model successfully counted approximately 90% of the
objects on the conveyor belt in the test videos, it is essential to evaluate whether
this performance is sufficient for the targeted purpose. For certain applications,
a 90% counting accuracy may be acceptable, especially when rapid decision-
making is prioritized. The model’s quick inference time plays a significant role
here, enabling timely classification that keeps pace with the conveyor belt’s
speed, thereby ensuring efficient and continuous operation.
Yet, it is also crucial to examine the model’s performance across different
classes. If the 90% accuracy is consistent across all classes, it may be deemed
satisfactory. However, if certain classes, such as colored PET or others, exhibit
lower accuracy, this discrepancy could be problematic. In such cases, the model
might struggle to accurately classify and count these specific classes, leading to
potential inefficiencies in the sorting process.
The variation in accuracy across classes could stem from several factors,
including the visual similarity between classes, differences in the quality or
quantity of training data for each class, or the inherent complexity of distin-
guishing certain materials. Therefore, while the overall 90% counting accuracy
is promising, further analysis is needed to ensure that the system meets the
desired performance standards for each class, ensuring a robust and reliable
quantification of the sorting process.
5.2 Limitations
A key limitation of the current system concerns objects that are heavily overlapped or partially occluded by other items. In
real-world sorting facilities, objects often move close to one another or may
be partially hidden, posing significant challenges for the model in distinguish-
ing between them. While the trained model performed adequately in various
scenarios, its performance could be enhanced by addressing these complexities.
To improve the model’s ability to handle such challenging visual scenes,
several strategies could be employed. First, improving the quality of image cap-
ture, such as using higher resolution cameras or optimizing lighting conditions,
could help in obtaining clearer and more detailed images, making it easier for
the model to distinguish between overlapping objects. Additionally, creating a
more balanced and diverse dataset that includes a higher proportion of images
with overlapping and occluded objects would better train the model to recognize
and classify such scenarios.
The performance of the trained model can be affected by variations in en-
vironmental conditions, such as changes in lighting, reflections, or shadows.
Inconsistent lighting conditions in the sorting facility, for example, can lead
to variations in detection accuracy. Although the model was trained on data
reflecting a range of conditions, variations may still impact its performance. Im-
plementing more sophisticated data augmentation techniques during training,
or deploying models specifically tuned to different environmental conditions, can
mitigate this issue.
While the object tracking and counting mechanisms are effective for the
specific task at hand, they were applied only for the sorting facility conveyor
position and may struggle with more complex scenarios. For example, the model
encounters difficulties in maintaining accurate counts if objects disappear and
reappear within the frame or if they move in unpredictable patterns. The current
approach does not incorporate advanced tracking algorithms, such as Kalman
filters or optical flow, which can enhance the accuracy and reliability of object
tracking in more challenging environments.
Although YOLOv8 is designed for real-time processing, it still requires signif-
icant computational resources, particularly when handling high-resolution im-
ages or processing multiple video streams simultaneously. This could limit its
deployment in environments with limited hardware capabilities. Optimizing
the model for lower-resource environments, possibly through model pruning or
quantization, can make it more accessible for broader use cases.
Although the model achieved high precision and recall, a thorough analysis
of False Positive (FP) and FN was not conducted within the time constraints
of this project. Such an analysis is crucial for identifying the underlying causes
of these errors, which can offer valuable insights for future improvements. In
subsequent research, focusing on detailed error analysis could enable targeted
adjustments to the model or training process, ultimately reducing the occurrence
of these errors and enhancing overall performance.
In summary, while the YOLOv8 trained model proved to be promising for the
task of real-time PET plastic detection and sorting, addressing these limitations
through additional research, data collection, and model refinement can further
enhance its performance and applicability across a wider range of real-world
scenarios.
5.3 Conclusion
This thesis explored the development and implementation of a YOLOv8-based
model for real-time PET plastic detection, tracking, and counting in a sorting
facility environment. The primary objective was to create a robust system
capable of accurately identifying different types of PET plastics under dynamic
and challenging conditions, thereby quantifying the efficiency of the PET sorting
process.
Through a custom-designed data processing pipeline, a training dataset was
assembled, encompassing a diverse range of clear PET, colored PET, and mixed
plastics. The dataset was annotated and augmented to ensure the model was
trained on a broad spectrum of scenarios reflective of real-world conditions. The
YOLOv8 model was selected for its superior balance between speed and accu-
racy, critical attributes for real-time applications where rapid decision-making
is essential.
The experimental results demonstrated the model’s strong performance,
with high precision, recall, and mAP scores across all classification tasks. The
model’s real-time processing capabilities were validated through its testing on
video feeds, where it effectively detected, tracked, and counted plastics as they
moved along conveyor belts in a sorting facility. These results confirm the
model’s potential as a valuable tool for quantifying the efficiency and accuracy
of automated PET plastic sorting systems.
Despite these successes, the project also identified several limitations, such
as the need for a more diverse dataset, improved handling of overlapping and
occluded objects, and better scalability to larger and more complex systems.
Addressing these limitations in future work can further enhance the model’s
robustness and applicability, making it suitable for a broader range of waste-
sorting scenarios.
In conclusion, this project demonstrates the viability of using advanced DL
models like YOLOv8 in industrial waste sorting applications. The model’s abil-
ity to deliver high-speed, accurate object detection and classification in real-time
underscores its potential to significantly improve the efficiency and effectiveness
of sorting operations. As the global emphasis on waste management and recy-
cling continues to grow, such technologies will play an increasingly critical role
in driving sustainable practices and reducing environmental impact. Future
research and development efforts should focus on expanding the model’s capa-
bilities, optimizing its performance in diverse environments, and integrating it
with larger, more complex waste management systems.
References
I. E. Agbehadji, A. Abayomi, K.-H. N. Bui, R. C. Millham, and E. Freeman.
Nature-inspired search method and custom waste object detection and classifi-
cation model for smart waste bin. Sensors, 22(16), 2022. ISSN 1424-8220. doi:
10.3390/s22166176. URL https://www.mdpi.com/1424-8220/22/16/6176.
A. A. A. Ahmed and A. B. M. Asadullah. Artificial intelligence and machine
learning in waste management and recycling. Engineering International, 8:
43–52, 05 2020. doi: 10.18034/ei.v8i1.498.
D. Angelika Mulia, S. Safitri, and I. Gede Putra Kusuma Negara. Yolov8
and faster r-cnn performance evaluation with super-resolution in license plate
recognition. International Journal of Computing and Digital Systems, 16(1):
365–375, 2024.
A. Canziani, A. Paszke, and E. Culurciello. An analysis of deep neural network
models for practical applications. arXiv preprint arXiv:1605.07678, 2016.
URL https://arxiv.org/abs/1605.07678.
S. Chatterjee, D. Hazra, and Y.-C. Byun. Incepx-ensemble: Performance
enhancement based on data augmentation and hybrid learning for recy-
cling transparent pet bottles. IEEE Access, 10:52280–52293, 2022a. doi:
10.1109/ACCESS.2022.3174076.
S. Chatterjee, D. Hazra, Y.-C. Byun, and Y.-W. Kim. Enhancement of im-
age classification using transfer learning and gan-based synthetic data aug-
mentation. Mathematics, 10(9), 2022b. ISSN 2227-7390. doi: 10.3390/
math10091541. URL https://www.mdpi.com/2227-7390/10/9/1541.
L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab:
Semantic image segmentation with deep convolutional nets, atrous convolu-
tion, and fully connected crfs. IEEE transactions on pattern analysis and
machine intelligence, 40(4):834–848, 2017.
J. Choi, B. Lim, and Y. Yoo. Advancing plastic waste classification and recycling
efficiency: Integrating image sensors and deep learning algorithms. Applied
Sciences, 13(18), 2023a. ISSN 2076-3417. doi: 10.3390/app131810224. URL
https://www.mdpi.com/2076-3417/13/18/10224.
J. Choi, B. Lim, and Y. Yoo. Advancing plastic waste classification and recycling
efficiency: Integrating image sensors and deep learning algorithms. Applied
Sciences, 13(18):10224, 2023b.
Cimbria. Leading in plastic recycling with optical sorting. Cimbria
News, 2024. URL https://www.cimbria.com/en/about/news/leading-
in-plastic-recycling-with-optical-sorting.html. Accessed: August
8, 2024.
T. Corporation. Zenrobotics: Smart robotics for waste sorting. https://www.
terex.com/zenrobotics, 2024. Accessed: August 8, 2024.
Datatron. What is a machine learning pipeline? https://datatron.com/what-
is-a-machine-learning-pipeline/, 2023. Accessed: 2024-06-27.
G. Jakovljevic, M. Govedarica, and F. Alvarez-Taboada. A deep learning model
for automatic plastic mapping using unmanned aerial vehicle (uav) data. Re-
mote Sensing, 12(9), 2020. ISSN 2072-4292. doi: 10.3390/rs12091515. URL
https://www.mdpi.com/2072-4292/12/9/1515.
F. Joiya. Object detection: Yolo vs faster r-cnn. International Research Journal
of Modernization in Engineering Technology and Science, pages 1911–1915,
2022. doi: 10.56726/irjmets30226.
A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao,
S. Whitehead, A. C. Berg, W.-Y. Lo, P. Dollar, and R. Girshick. Segment
anything. In Proceedings of the IEEE/CVF International Conference on Com-
puter Vision (ICCV), pages 4015–4026, October 2023a.
A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao,
S. Whitehead, A. C. Berg, W.-Y. Lo, P. Dollár, and R. Girshick. Segment
anything, 2023b. URL https://arxiv.org/abs/2304.02643.
A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification
with deep convolutional neural networks. In F. Pereira, C. Burges,
L. Bottou, and K. Weinberger, editors, Advances in Neural Infor-
mation Processing Systems, volume 25. Curran Associates, Inc., 2012.
URL https://proceedings.neurips.cc/paper_files/paper/2012/file/
c399862d3b9d6b76c8436e924a68c45b-Paper.pdf.
V. Kumar, G. Nandi, and R. Kala. Static hand gesture recognition using stacked
denoising sparse autoencoders. In 2014 Seventh International Conference on
Contemporary Computing (IC3), pages 99–104, 2014. doi: 10.1109/IC3.2014.
6897155.
D. Larentzakis, F. Raptopoulos, P. Timilsina, and M. Maniadakis. AI-powered robotic material recovery in a box, 2023.
T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense
object detection. arXiv preprint arXiv:1708.02002, 2017.
T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense
object detection, 2018. URL https://arxiv.org/abs/1708.02002.
Recycleye Ltd. Recycleye AI: Transforming waste sorting with artificial intelligence. https://recycleye.com/, 2023. Accessed: 2024-06-27.
W. Lu and J. Chen. Computer vision for solid waste sorting: A critical review
of academic research. Waste Management, 142:29–43, 2022.
T. J. Lukka, T. Tossavainen, J. V. Kujala, and T. Raiko. ZenRobotics Recycler: Robotic sorting using machine learning, 2014. URL https://api.semanticscholar.org/CorpusID:63618129.
M. Manley and V. Baeten. Chapter 3 - Spectroscopic technique: Near infrared (NIR) spectroscopy. In D.-W. Sun, editor, Modern Techniques for Food Authentication (Second Edition), pages 51–102. Academic Press, second edition, 2018. ISBN 978-0-12-814264-6. doi: https://doi.org/10.1016/B978-0-12-814264-6.00003-7. URL https://www.sciencedirect.com/science/article/pii/B9780128142646000037.
S. Minaee, Y. Boykov, F. Porikli, A. Plaza, N. Kehtarnavaz, and D. Terzopoulos. Image segmentation using deep learning: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(7):3523–3542, 2021.
O. Motunrayo. Innovations in recycling technologies for the circular economy, May 2024.
T. Muringayil Joseph, S. Azat, Z. Ahmadi, O. Moini Jazani, A. Esmaeili,
E. Kianfar, J. Haponiuk, and S. Thomas. Polyethylene terephthalate (PET)
recycling: A review. Case Studies in Chemical and Environmental Engineer-
ing, 9:100673, 2024. ISSN 2666-0164. doi: https://doi.org/10.1016/j.cscee.
2024.100673. URL https://www.sciencedirect.com/science/article/
pii/S2666016424000677.
US Plastics Pact. AMP Robotics: Case study. https://usplasticspact.org/case-study/amp-robotics-2/, 2024. Accessed: 2024-08-08.
A. Padalkar, P. Pathak, and P. Stynes. An object detection and scaling model for plastic waste sorting, December 2021.
J. Pransky. The Pransky interview: Dr. Matanya Horowitz, founder and CEO of AMP Robotics. Industrial Robot: The International Journal of Robotics Research and Application, ahead-of-print, April 2020. doi: 10.1108/IR-02-2020-0038.
K. Ragaert, L. Delva, and K. Van Geem. Mechanical and chemical recycling
of solid plastic waste. Waste Management, 69:24–58, 2017. URL https:
//doi.org/10.1016/j.wasman.2017.07.044.
E. Ramos, A. G. Lopes, and F. Mendonça. Application of machine learning in
plastic waste detection and classification: A systematic review. Processes, 12
(8), 2024. ISSN 2227-9717. doi: 10.3390/pr12081632. URL https://www.
mdpi.com/2227-9717/12/8/1632.
Recycleye. Computer vision evolved: The role of AI in waste sorting. https://
recycleye.com/computer-vision-evolved-role-waste-sorting/, 2023.
Accessed: 2024-06-27.
S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object
detection with region proposal networks, 2016. URL https://arxiv.org/
abs/1506.01497.
O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for
biomedical image segmentation, 2015. URL https://arxiv.org/abs/1505.
04597.
O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang,
A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet
Large Scale Visual Recognition Challenge. International Journal of Computer
Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y.
R. Sapkota, D. Ahmed, and M. Karkee. Comparing YOLOv8 and Mask R-CNN
for instance segmentation in complex orchard environments. Artificial In-
telligence in Agriculture, 13:84–99, 2024. ISSN 2589-7217. doi: https:
//doi.org/10.1016/j.aiia.2024.07.001. URL https://www.sciencedirect.
com/science/article/pii/S258972172400028X.
R. T. Schirrmeister, V. Cialdella, M. Stricker, et al. Deep learning for automated
image analysis in molecular sciences. Nature Reviews Chemistry, 5(12):842–
860, 2021. doi: 10.1038/s41570-021-00294-z. URL https://www.nature.
com/articles/s41570-021-00294-z.
F. Schmidt, N. Christiansen, and R. Lovrincic. The laboratory at hand: Plas-
tic sorting made easy. PhotonicsViews, 17(5):56–59, 2020. doi: https:
//doi.org/10.1002/phvs.202000036. URL https://onlinelibrary.wiley.
com/doi/abs/10.1002/phvs.202000036.
Z. Sharif, S. Khan, and A. Iqbal. Deep learning models for waste segregation: A
comparative study. Journal of Environmental Management, 307:114580, 2022.
doi: 10.1016/j.jenvman.2022.114580. URL https://www.sciencedirect.
com/science/article/pii/S0301479722002039.
K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale
image recognition, 2015. URL https://arxiv.org/abs/1409.1556.
D. Smirnovs. Plastics dataset. https://universe.roboflow.com/dmitrijs-smirnovs-afbdw/plastics-6aztq, February 2023. Accessed: 2024-08-27.
M. Tan and Q. V. Le. EfficientNet: Rethinking model scaling for convolutional
neural networks, 2020. URL https://arxiv.org/abs/1905.11946.
Ultralytics. Ultralytics GitHub repository. https://github.com/ultralytics/
ultralytics/issues/189, 2023a. Accessed: 2024-06-27.
Ultralytics. YOLOv8: State-of-the-art YOLO model. https://ultralytics.com/
yolov8, 2023b. Accessed: 2024-06-27.
J. Valente, J. António, C. Mora, and S. Jardim. Developments in image pro-
cessing using deep learning and reinforcement learning. Journal of Imag-
ing, 9(10), 2023. ISSN 2313-433X. doi: 10.3390/jimaging9100207. URL
https://www.mdpi.com/2313-433X/9/10/207.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez,
L. Kaiser, and I. Polosukhin. Attention is all you need. Advances in Neural
Information Processing Systems, 30, 2017. URL https://arxiv.org/abs/
1706.03762.
X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, C. C. Loy, Y. Qiao, and
X. Tang. ESRGAN: Enhanced super-resolution generative adversarial networks,
2018. URL https://arxiv.org/abs/1809.00219.
K. S. Woon, Z. X. Phuang, Z. Lin, and C. T. Lee. A novel food waste man-
agement framework combining optical sorting system and anaerobic diges-
tion: A case study in malaysia. Energy, 232:121094, 2021. ISSN 0360-
5442. doi: https://doi.org/10.1016/j.energy.2021.121094. URL https://www.
sciencedirect.com/science/article/pii/S0360544221013426.
WRAP. Plastic pollution: Understanding the global crisis. 2019.
URL https://www.wrap.org.uk/content/plastic-pollution-
understanding-global-crisis.
M. Wu and L. Chen. Image recognition based on deep learning. In 2015 Chinese
Automation Congress (CAC), pages 542–546, 2015. doi: 10.1109/CAC.2015.
7382560.
Xometry. What is polyethylene terephthalate? https://www.xometry.com/resources/materials/polyethylene-terephthalate/, 2024. Accessed: 2024-08-08.
L. Yue, H. Shen, J. Li, Q. Yuan, H. Zhang, and L. Zhang. Image super-
resolution: The techniques, applications, and future. Signal Processing,
128:389–408, 2016. ISSN 0165-1684. doi: https://doi.org/10.1016/j.sigpro.
2016.05.002. URL https://www.sciencedirect.com/science/article/
pii/S0165168416300536.
X. Zhang, Y. Li, and Q. Liu. Deep learning applications in solid waste manage-
ment: A review. International Journal of Advanced Computer Science and
Applications (IJACSA), 13(3):376–383, 2022. doi: 10.14569/IJACSA.2022.
0130347. URL https://thesai.org/Downloads/Volume13No3/Paper_47-
Deep_Learning_Applications_in_Solid_Waste_Management.pdf.
J. Zhu, J. Chen, and R. He. Machine learning in waste classification and re-
cycling: Recent developments, challenges, and prospects. IEEE Access, 9:
151123–151139, 2021a. doi: 10.1109/ACCESS.2021.3119779.
X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai. Deformable DETR:
Deformable transformers for end-to-end object detection. arXiv preprint
arXiv:2010.04159, 2021b. URL https://arxiv.org/abs/2010.04159.