
Faculty of Science, Technology, and Medicine

Tracking and predicting PET packaging in a sorting process
using Deep Learning-based Computer Vision
Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of
Master in Information and Computer Sciences

Author: Dhiraj Kumar PANDEY
Supervisor: Prof. Dr. Luis LEIVA

Reviewer:
Prof. Dr. Thomas ENGEL

Advisors:
Dr. Alexej SIMETH
Dr. Jeff MANGERS

August 2024
Non-Disclosure Notice

This master thesis report contains proprietary and confidential information of
Hein Déchets S.à r.l. The information is supplied on the understanding that it
will be held in strict confidence and will not be disclosed to third parties - except
for the thesis supervisor and authorized examining persons - without the prior
written consent of the author and the aforementioned company. Unauthorized
disclosure or use of this information is strictly prohibited.

Date:
August 2024

Note: This non-disclosure notice is applicable to the content and information
contained within this thesis, and it remains in effect unless explicitly stated
otherwise by the author and the university.
Acknowledgments

I extend my deepest gratitude to my supervisor, Prof. Dr. Luis Leiva, and
reviewer, Prof. Dr. Thomas Engel, for their unwavering support and guidance
throughout this journey. Their expertise and insights have been invaluable to
the success of this research.
I am profoundly thankful to Hein Déchets S.à r.l. for providing me with the
opportunity to conduct this research, and for the resources and assistance that
were instrumental in the completion of this thesis.
Special thanks go to my research advisors, Alexej Simeth and Jeff Mangers,
whose invaluable guidance, constructive feedback, and steadfast support have
been crucial throughout the course of my master’s thesis. Their combined ex-
pertise has greatly enriched the quality of this work.
Finally, I would like to express my sincere appreciation to my wife Charu for
her exceptional support during this process. Her encouragement and unwavering
belief in me have been essential to the completion of this work.
Abstract

The increasing challenge of plastic waste management has necessitated the de-
velopment of advanced technologies for efficient recycling processes. Among
various types of plastics, Polyethylene Terephthalate (PET) is particularly sig-
nificant due to its widespread use and recyclability. However, the effectiveness
of recycling efforts is often hindered by the difficulty in accurately sorting PET
from other plastics. This thesis addresses this issue by leveraging state-of-the-art
CV techniques to quantify the sorting efficiency. The research presented in this
thesis focuses on the development and implementation of a deep learning-based
system capable of accurately identifying, classifying, and counting different PET
plastics in real time on recorded videos of conveyor belts. A dataset was com-
piled, encompassing clear and colored PET plastics, as well as other materials
commonly found in plastic streams. The data was processed and annotated
to ensure high-quality input for model training. The deep learning model em-
ployed in this study, YOLOv8, was selected for its superior real-time processing
capabilities and accuracy. The model was trained and validated using the com-
piled dataset. The results show that the trained model achieved an mAP@50 of 84.7%.
As an accuracy Key Performance Indicator (KPI), a Separation Efficiency (SE)
formula is proposed; on the test data, the clear PET SE is recorded at 94.6% and the
colored PET SE at 81.1%.
Keywords: Plastic waste management, Polyethylene Terephthalate (PET),
computer vision, deep learning, YOLOv8, object detection, recycling, sustain-
ability, automated sorting.
Table of Contents

List of Figures   i

List of Tables   ii

List of Abbreviations   iii

1 Introduction   1
1.1 Problem Statement   1
1.2 Experiment Objective   2

2 State-of-the-Art   3
2.1 Advanced Technologies in Automatic Plastic Sorting   3
2.2 Image Processing   3
2.2.1 Convolutional Neural Networks   4
2.2.2 Object Detection   5
2.2.3 Image Segmentation   6
2.2.4 Image Enhancement and Super-Resolution   6
2.2.5 Recent Advancements and Applications   7
2.3 PET detection using Computer Vision (CV)   7
2.4 Benchmarking of models   9
2.4.1 Machine Learning (ML) and Deep Learning (DL) in Waste Segregation   10
2.4.2 RetinaNet   13

3 Data Processing and Model Training   17
3.1 Methodology   17
3.2 Data Collection   18
3.3 Data Processing   20
3.4 Model Selection   24
3.5 Model Training   25
3.6 Object Tracking and Object Counting   25
3.7 Performance Evaluation and Accuracy KPIs   27

4 Results   30

5 Discussion and Conclusion   36
5.1 Interpretation of Experimental Results   36
5.2 Limitations and Further Improvements   36
5.3 Conclusion   38

List of Figures
1  Example of modern machine learning pipeline (Datatron, 2023)   8
2  YOLOv8 architecture diagram (Ultralytics, 2023a)   11
3  Faster R-CNN architecture diagram (Endachev et al., 2019)   12
4  RetinaNet architecture diagram (Lin et al., 2017)   14
5  Workflow and experiment pipeline   17
6  PET images found on online datasets (Smirnovs, 2023)   18
7  Initial set of objects captured in sorting facility   19
8  Examples of objects photographed in lab setup   19
9  Examples of objects captured on conveyor belt   20
10 Annotation pipeline from raw video feed   20
11 Examples of annotations of filtered segments   22
12 Number of images vs Number of instances per image   24
13 Object tracking and counting on different streams   26
14 Sorting facility setup   27
15 Inference results of V1 models   30
16 Inference results of V2 models   31
17 Inference results of V3 models   31
18 Inference results of V4 models   32
19 Training result of v4 model   33
20 Trained model metrics   34
21 Trained Model Confusion Matrix   35

List of Tables
1 Performance comparison between YOLOv8 and Mask R-CNN in orchard environments. (Sapkota et al., 2024)   16
2 Performance Comparison between YOLOv8 and Faster R-CNN for License Plate Detection. (Angelika Mulia et al., 2024)   16
3 Number of Images and Instances per Class in the dataset before augmentation   23
4 Performance metrics obtained by the trained YOLO model at specific confidence levels.   32
5 Inference Results Based on Different Model Versions   32
6 Recognition Rates for Different Classes   32
7 Comparison of Manually Counted and Model Counted Objects in Testing Videos   33

List of Abbreviations
CNNs Convolutional Neural Networks
COCO Common Objects in Context
CSP Cross-Stage Partial
CV Computer Vision
DETR Detection Transformer
DL Deep Learning
ESRGAN Enhanced Super-Resolution Generative Adversarial Networks
FN False Negative
FP False Positive
FPN Feature Pyramid Networks
GPU Graphics Processing Units
HDPE High Density Polyethylene
KPI Key Performance Indicator
mAP Mean Average Precision
ML Machine Learning
NIR Near Infrared
PET Polyethylene Terephthalate
PP Polypropylene
PR Precision-Recall
PS Polystyrene
ROI Region of Interest
RPN Region Proposal Network
SAM Segment Anything Model
SE Separation Efficiency
SOTA State-of-the-Art
SPPF Spatial Pyramid Pooling Fast
SRCNN Super-Resolution Convolutional Neural Network

SSD Single Shot MultiBox Detector
TP True Positive
TPU Tensor Processing Units
YOLO You Only Look Once

1 Introduction
The escalating issue of plastic waste has necessitated the development of ad-
vanced recycling technologies (Motunrayo, 2024), particularly for sorting differ-
ent types of plastics. Polyethylene Terephthalate (PET), widely used in pack-
aging and known for its recyclability, is a primary focus (Hopewell et al., 2009).
The success of recycling efforts largely depends on the accurate separation of
PET from non-PET plastics, a longstanding challenge in the recycling industry
(Ragaert et al., 2017). This thesis addresses this challenge by implementing
Computer Vision (CV) technologies to identify PET and non-PET plastics,
aiming to enhance the accuracy and efficiency of the sorting process and to
further differentiate clear PET from colored PET.
The motivation for this study comes from the urgent need to improve recy-
cling rates and reduce the environmental footprint of plastic waste. Accurately
sorted plastics allow recycling facilities to significantly enhance the quality of
recycled materials and reduce contamination in the recycling stream. This not
only contributes to environmental sustainability but also supports the circular
economy by making recycled plastics more valuable and usable for manufactur-
ers. (WRAP, 2019)
To achieve the objectives, a CV-based approach is proposed, involving the
capture of images of objects during various stages of the sorting process (Lu and
Chen, 2022). These images are analyzed using a Deep Learning (DL) algorithm
to classify them as clear PET, colored PET, and Others. The SE of the sorting
process will be evaluated at different streams to identify potential improvements.
Through this research, the aim is to contribute to the development of more
efficient and sustainable recycling processes, addressing both the technical and
design aspects of plastic recycling.
This thesis is structured as follows: Chapter 2 discusses the State-of-the-
Art (SOTA) approaches relevant to the tasks at hand. Chapter 3 outlines the
methodologies employed in the experiment, including data processing and model
training. Chapter 4 presents the experimental results across various stages.
Finally, Chapter 5 concludes with an interpretation of the findings, a discussion
of limitations, and suggestions for future improvements.

1.1 Problem Statement


In recycling and sorting facilities, used packaging and trash are segregated into
various categories such as PET, Polypropylene (PP), Polystyrene (PS), High
Density Polyethylene (HDPE), Aluminium, and Tetra. Given the large vol-
ume of trash directed to the PET section, it is crucial to accurately sort PET
from non-PET materials. Effective recycling begins with precise PET and non-
PET classification. PET plastics are easily recyclable, although the process has
environmental impacts (Muringayil Joseph et al., 2024). The machines used
for sorting lack accuracy measurement, leaving facilities unaware of their seg-
regation effectiveness. To address this, the thesis proposes using CV and DL
to compare the input trash with the output stream, quantifying the sorting
accuracy.
Furthermore, there is no current method to measure the accuracy of segre-
gating clear PET from colored PET. The proposed CV and DL models aim to
identify errors in the segregation of clear and colored PET streams, enhancing
the overall efficiency and reliability of the recycling process.

1.2 Experiment Objective


The primary objective of this experiment is to develop and validate a CV-based
system capable of accurately detecting, classifying, and counting different types
of plastics, specifically focusing on PET. This system aims to address the chal-
lenge of improving the efficiency and accuracy of plastic sorting processes in
sorting facilities. By leveraging advanced DL models, such as YOLOv8, the
experiment seeks to enhance the real-time identification of clear PET, colored
PET, and other materials on conveyor belts within a sorting environment. The
ultimate goal is to create a robust and scalable solution that can quantify the
separation and thus reduce contamination in the recycling stream. The ex-
periment will involve the collection and processing of a diverse dataset, the
implementation of a sophisticated annotation and training pipeline, and the
evaluation of the model’s performance in real-world conditions.

2 State-of-the-Art
The plastic sorting industry has evolved significantly with advancements in
technology, aiming to enhance the efficiency and accuracy of sorting processes
(Padalkar et al., 2021). This section provides an overview of the current SOTA
technologies and methodologies in plastic sorting, including the latest Machine
Learning (ML) models and automated systems. Additionally, it presents a com-
parative analysis of various object detection models, supported by benchmarking
studies.

2.1 Advanced Technologies in Automatic Plastic Sorting


Modern plastic sorting facilities employ a variety of advanced technologies to
achieve high levels of precision and throughput in the circular economy (Howard
et al., 2024). Key technologies include optical sorting (Huang et al., 2010), Near
Infrared (NIR) spectroscopy (Manley and Baeten, 2018), and ML-based image
recognition systems (Wu and Chen, 2015).
Optical Sorting: Optical sorting systems use cameras and sensors to iden-
tify and sort plastics based on their color and transparency (Woon et al., 2021).
High-speed cameras capture images of the plastics on conveyor belts, and image
processing algorithms analyze these images to classify and sort the materials.
This technology enhances the efficiency and accuracy of the recycling process,
enabling the effective separation of PET, HDPE, and other types of plastics,
thus improving the purity of recycled materials and reducing contamination in
recycling streams. (Cimbria, 2024)
NIR Spectroscopy: NIR spectroscopy is widely used in plastic sorting to
identify different types of polymers based on their spectral fingerprints. NIR
sensors detect the specific wavelengths of light absorbed and reflected by various
plastics, enabling accurate sorting. This technology is particularly effective for
sorting opaque and colored plastics, which may not be easily distinguishable by
optical sorting alone. (Schmidt et al., 2020)
Machine Learning-Based Image Recognition: ML-based image recog-
nition has emerged as a powerful tool for enhancing plastic sorting efficiency.
By training DL models on vast datasets of plastic images, these systems can
accurately identify and classify various types of plastics (Choi et al., 2023a).
Convolutional Neural Networks (CNNs) are particularly effective in extracting
relevant features from plastic images, enabling precise sorting decisions. As
these technologies continue to advance, they hold the potential to revolution-
ize the recycling industry by improving sorting accuracy and reducing manual
labor. (Ramos et al., 2024)

2.2 Image Processing


The field of image processing has witnessed remarkable advancements in recent
years, driven by breakthroughs in DL and artificial intelligence. CNNs have
emerged as a powerful tool for a wide range of image-related tasks, from object
detection and recognition to image segmentation and generation. These models
excel at extracting complex features from images, enabling highly accurate and
efficient solutions. Additionally, advancements in hardware, such as Graphics
Processing Units (GPU) and Tensor Processing Units (TPU), have accelerated
the training and deployment of DL models for image processing applications.
As a result, the SOTA in image processing continues to evolve rapidly, with new
techniques and applications emerging regularly. (Valente et al., 2023)

2.2.1 Convolutional Neural Networks


CNNs have revolutionized the field of CV, establishing themselves as the SOTA
for a wide range of image-related tasks. CNNs are specifically designed to pro-
cess and analyze image data, leveraging their hierarchical structure to extract
meaningful features at different levels of abstraction. The convolutional layers
in CNNs apply filters to the input image, capturing local patterns and features.
These features are then combined and passed through pooling layers to reduce
dimensionality and computational complexity. Finally, fully connected layers
are used to classify or predict based on the extracted features. Advancements
in CNN architectures, such as ResNet, VGGNet, and AlexNet, have significantly
improved their performance and capabilities, enabling them to achieve remark-
able results in tasks like image classification, object detection, and semantic
segmentation. CNNs have found widespread applications in various domains,
including autonomous driving, medical image analysis, and surveillance systems.
(Valente et al., 2023)
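To make these building blocks concrete, the following minimal PyTorch sketch (an illustrative example added here for clarity, not a model used in this thesis) stacks convolutional, pooling, and fully connected layers:

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    """Minimal CNN: convolution -> pooling -> fully connected classification."""

    def __init__(self, num_classes: int = 3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # filters capture local patterns
            nn.ReLU(),
            nn.MaxPool2d(2),                              # pooling reduces spatial resolution
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 56 * 56, num_classes)  # assumes 224x224 RGB input

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)                  # hierarchical feature extraction
        return self.classifier(x.flatten(1))  # classification from extracted features

# One 224x224 RGB image in, one score per class out.
logits = SimpleCNN()(torch.randn(1, 3, 224, 224))
```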

Key Architectures
AlexNet: AlexNet is a pioneering convolutional neural network (CNN) archi-
tecture that advanced the field of DL by achieving groundbreaking results in
image classification tasks. Developed by Alex Krizhevsky, Ilya Sutskever, and
Geoffrey Hinton, AlexNet won the 2012 ImageNet Large Scale Visual Recogni-
tion Challenge (ILSVRC), demonstrating the effectiveness of DL in CV (Rus-
sakovsky et al., 2015). The network consists of eight layers, including five con-
volutional layers and three fully connected layers, and introduced innovations
like ReLU activation functions and dropout to prevent overfitting (Krizhevsky
et al., 2012).
VGGNet, another influential deep convolutional neural network archi-
tecture, was introduced in 2014 (Simonyan and Zisserman, 2015). VGGNet is
characterized by its use of multiple convolutional layers with small (3x3) filters,
stacked on top of each other. This design allows VGGNet to learn increasingly
complex features as the network deepens. The network also employs pooling
layers to reduce dimensionality and computational complexity. VGGNet’s sim-
ple yet effective architecture has made it a popular choice for various CV tasks,
and it has served as a benchmark for many subsequent models (Canziani et al.,
2016).
ResNet, introduced in 2015, made a significant breakthrough in training
extremely deep neural networks. ResNet addresses the vanishing gradient prob-
lem, which hinders the training of deeper networks, by introducing skip connec-
tions. These connections allow the network to learn residual functions, making
it easier to optimize and achieve higher accuracy. ResNet achieved remarkable
success in various CV tasks, including image classification and object detection,
demonstrating the effectiveness of its deep architecture and innovative design.
(He et al., 2015)

2.2.2 Object Detection


Object detection, a fundamental task in CV, involves identifying and localiz-
ing objects within an image or video sequence. Over the past decade, signifi-
cant advancements have been made in object detection, driven primarily by the
development of DL techniques. DL-based object detection methods have out-
performed traditional approaches by leveraging the power of CNNs to extract
complex features from images. These methods can be broadly categorized into
two main approaches:
Two-Stage Detectors: These methods employ a Region Proposal Network
(RPN) to generate potential object bounding boxes, followed by a classification
network to assign object categories to these regions. Examples include Faster
R-CNN, Mask R-CNN, and Feature Pyramid Networks (FPN). (Ren et al.,
2016)
One-Stage Detectors: These methods directly predict object bounding
boxes and classes in a single step, often using anchor boxes or keypoints. Ex-
amples include You Only Look Once (YOLO), Single Shot MultiBox Detector
(SSD), and RetinaNet. (Redmon et al., 2016)

Key Models
YOLO: YOLO is a popular one-stage object detection framework that has
gained significant attention due to its speed and accuracy (Redmon et al., 2016).
Unlike two-stage methods, YOLO directly predicts object bounding boxes and
classes in a single pass, making it more efficient. YOLO divides the input image
into a grid and predicts objectness scores, bounding box coordinates, and class
probabilities for each grid cell. This approach allows YOLO to detect multiple
objects simultaneously, making it suitable for real-time applications. The origi-
nal YOLO architecture has been further improved in subsequent versions, such
as YOLOv5 and YOLOv8, which have introduced enhancements like feature
pyramids and better backbone networks to enhance performance (Ultralytics,
2023b).
Faster R-CNN combines a Region Proposal Network with the Fast R-CNN
detector, significantly improving the speed of generating region proposals and
refining their classification and bounding boxes (Ren et al., 2016). This model
is known for its accuracy but is slower than single-stage detectors like YOLO.

RetinaNet addresses the challenge of class imbalance in object detection
using focal loss, making it a powerful single-stage detector that balances speed
and accuracy (Lin et al., 2018).

2.2.3 Image Segmentation


Image segmentation, a fundamental task in CV, involves partitioning an image
into distinct regions corresponding to different objects or features (Minaee et al.,
2021). DL techniques have significantly advanced the field, with CNNs playing
a pivotal role in achieving state-of-the-art performance.
One of the most influential architectures for image segmentation is U-Net
(Ronneberger et al., 2015). U-Net employs an encoder-decoder structure, cap-
turing context through the encoder path and recovering spatial information
through the decoder path. Skip connections between the encoder and de-
coder layers help preserve fine-grained details, resulting in accurate segmen-
tation masks.
DeepLab is another prominent architecture that incorporates atrous con-
volution to increase the effective field of view and capture more contextual
information. DeepLab also utilizes dilated convolution and a spatial pyramid
pooling module to improve segmentation accuracy. (Chen et al., 2017)
Instance segmentation, a more challenging task that requires identifying and
segmenting individual instances of objects within an image, has also seen sig-
nificant progress. Mask R-CNN is a popular approach that combines object
detection and instance segmentation, leveraging a region proposal network and
a mask prediction branch. (He et al., 2018)
Recent advancements in image segmentation include the development of
transformer-based models, such as the Detection Transformer (DETR) and its deformable variant (Zhu et al.,
2021b), which have demonstrated promising results in various segmentation
tasks. Additionally, self-supervised learning techniques are being explored to
reduce the reliance on large labeled datasets. Segment Anything Model (SAM)
is a foundation model that can perform zero-shot instance segmentation on any
image or video, simply by providing a prompt or mask (Kirillov et al., 2023a).

2.2.4 Image Enhancement and Super-Resolution


Image enhancement and super-resolution techniques, as surveyed by Yue et al. (2016),
improve the quality and resolution of images, making them clearer and more
detailed. Selected key models are presented in the following.
Super-Resolution Convolutional Neural Network (SRCNN) is a pioneering
DL-based approach for image super-resolution, introduced by Dong et al. (2016).
SRCNN consists of three layers: a feature extraction layer, a nonlinear map-
ping layer, and a reconstruction layer. The feature extraction layer learns low-
level features from the low-resolution input image, while the nonlinear mapping
layer transforms these features into high-level representations. Finally, the
reconstruction layer aggregates these features to reconstruct the high-resolution
image (in the original formulation, the low-resolution input is first upscaled by
bicubic interpolation before being fed to the network).
(Dong et al., 2016)
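As an illustration, the three-layer structure described above can be sketched in PyTorch as follows; the channel and kernel sizes follow the 9-1-5 configuration of the original paper, and this is a simplified rendition rather than code from this thesis:

```python
import torch.nn as nn

class SRCNN(nn.Module):
    """Feature extraction (9x9) -> non-linear mapping (1x1) -> reconstruction (5x5)."""

    def __init__(self, channels: int = 1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 64, kernel_size=9, padding=4),  # low-level feature extraction
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, kernel_size=1),                   # non-linear mapping
            nn.ReLU(inplace=True),
            nn.Conv2d(32, channels, kernel_size=5, padding=2),  # reconstruction
        )

    def forward(self, x):
        # x is a bicubically upscaled low-resolution image; output is the restored image.
        return self.net(x)
```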

Enhanced Super-Resolution Generative Adversarial Networks (ES-
RGAN) is a state-of-the-art DL-based approach for image super-resolution,
introduced by Wang et al. (2018). ESRGAN combines a perceptual loss function with ad-
versarial training to generate high-quality, photorealistic super-resolved images.
The perceptual loss encourages the network to learn features that are per-
ceptually meaningful to humans, rather than focusing solely on pixel-wise dif-
ferences. The adversarial training component further improves the quality of
the generated images by using a discriminator to distinguish between real and
fake images. ESRGAN has achieved impressive results in various image super-
resolution tasks, producing visually pleasing and high-fidelity images. ESRGAN
builds on the GAN framework to produce images with finer details,
improving upon previous models like SRGAN by using enhanced loss functions
and network architectures. (Wang et al., 2018)
Denoising Autoencoders can learn to remove noise from images,
enhancing their quality by training on corrupted and clean image pairs. (Kumar
et al., 2014)

2.2.5 Recent Advancements and Applications


Recent advancements in ML and DL for image processing include the develop-
ment of more efficient architectures like MobileNets (Howard et al., 2017) and
EfficientNet (Tan and Le, 2020), which are designed for deployment on edge
devices and mobile platforms. Additionally, the integration of attention mech-
anisms proposed by (Vaswani et al., 2017), such as those in Transformer-based
models, has further improved the ability of models to focus on important parts
of an image, enhancing performance across various tasks.
In practical applications, these advancements are being used across diverse
fields such as medical imaging, autonomous driving, security surveillance, and
remote sensing. For instance, in medical imaging, CNNs are used for the au-
tomatic detection of diseases from X-rays and MRIs, significantly aiding diag-
nostic processes. In autonomous driving, real-time object detection models like
YOLOv8 are critical for identifying pedestrians, vehicles, and other objects,
ensuring safe navigation (Ultralytics, 2023b).
The state-of-the-art in image processing using ML and DL is marked by con-
tinuous innovation and improvement. From foundational models like AlexNet
and ResNet to advanced applications in object detection, segmentation, and en-
hancement, these technologies are transforming the way images are analyzed and
understood. As ML and DL techniques continue to evolve, their applications
in image processing are expected to become even more powerful and important
for driving advancements in numerous fields (Schirrmeister et al., 2021).

2.3 PET detection using CV


PET is a widely used plastic material found in various products, including bev-
erage bottles, food packaging, and textiles (Xometry, 2024). Accurate detection
and sorting of PET plastics are crucial for effective recycling and waste man-
agement. CV techniques offer a promising approach to automate this process,
enabling efficient and reliable PET plastic identification (Choi et al., 2023a).
This section reviews the state-of-the-art methodologies and technologies em-
ployed in the detection of PET plastics using CV.

Machine Learning Pipelines


Modern object detection systems often employ comprehensive ML pipelines that
integrate various stages of data processing, feature extraction, model training,
and deployment (Datatron, 2023). Figure 1 shows an example of such a pipeline.

Figure 1: Example of modern machine learning pipeline (Datatron, 2023)

DL-based image segmentation techniques, including U-Net and DeepLab,
have also been employed for PET plastic detection, enabling pixel-level local-
ization and segmentation of plastic objects. These methods are particularly
useful when dealing with complex scenes or overlapping objects. (Jakovljevic
et al., 2020)
Data augmentation techniques such as flipping, rotation, and scaling are
used to increase the diversity of the training dataset, improving the model’s
ability to generalize to new, unseen data. (Chatterjee et al., 2022a)
Transfer learning, leveraging models pre-trained on large datasets like Ima-
geNet and Common Objects in Context (COCO), allows for
efficient training of PET detection models with limited labeled data. (Chatter-
jee et al., 2022b)
Automated annotation tools like the SAM automate the annotation process,
generating high-quality training data with minimal manual intervention (Kir-
illov et al., 2023b). Such tools have made the annotation task considerably
faster. (Larentzakis et al., 2023)

Case Studies and Applications
Several studies and projects have demonstrated the effectiveness of CV and DL
in PET plastic detection:
AMP Robotics is a leading provider of AI-powered robotic sorting solu-
tions for the recycling industry (Pransky, 2020). Their innovative technology
has been successfully applied to PET plastic recycling, significantly improving
efficiency and accuracy. By utilizing advanced CV algorithms and robotic arms,
AMP Robotics can accurately identify and sort PET plastics from mixed waste
streams, ensuring high-quality recycled materials. Their systems have been
deployed in various recycling facilities worldwide, contributing to the circular
economy and reducing plastic waste.(Pact, 2024)
ZEN Robotics is another prominent company specializing in AI-powered
robotic sorting solutions, with a strong focus on PET plastic recycling (Lukka
et al., 2014). Their advanced robotic systems utilize CV and machine learning
algorithms to accurately identify and sort various materials, including PET
plastics, from mixed waste streams. Zen Robotics’ technology has been deployed
in recycling facilities worldwide, demonstrating its effectiveness in improving
sorting efficiency and reducing contamination rates. By automating the sorting
process, Zen Robotics helps to increase the value of recycled PET plastics,
contributing to a more sustainable circular economy. (Corporation, 2024)
Founded in 2019, Recycleye was created with the mission to apply advanced
AI to the traditionally manual and labor-intensive process of waste sorting. The
company aims to address the inefficiencies in current waste management sys-
tems, where human workers often perform monotonous and hazardous tasks of
sorting recyclable materials. By automating this process, Recycleye seeks to im-
prove the accuracy, efficiency, and safety of waste sorting operations, ultimately
contributing to a more sustainable environment. (Ltd., 2023)

Conclusion
The integration of CV and DL in PET plastic detection has advanced the ca-
pabilities of automated recycling systems. Models like YOLO, Faster R-CNN,
and U-Net, along with spectroscopy and sensor fusion techniques, provide ro-
bust solutions for accurately identifying and sorting PET plastics (Choi et al.,
2023b). As these technologies continue to evolve, they will play an increasingly
vital role in promoting efficient recycling processes and supporting sustainable
waste management practices.

2.4 Benchmarking of models


The application of ML and DL in waste segregation has emerged as an important
advancement in the recycling industry, significantly enhancing the efficiency and
accuracy of sorting processes (Ahmed and Asadullah, 2020). These technologies
have improved the precision of automated systems in classifying and
sorting various types of waste materials. These advanced methods address the
challenges posed by the complex and heterogeneous nature of waste streams,
enabling more efficient and accurate recycling processes. Recent developments
and experiments in this area have shown promising results, indicating that ML
and DL can overcome the limitations of traditional waste sorting techniques,
thereby improving the overall effectiveness of waste management systems (Zhu
et al., 2021a). This section explores the current state-of-the-art methodologies
in machine learning (ML) and deep learning (DL) for waste segregation, em-
phasizing their capabilities, benefits, and the significant impact they have had
on enhancing recycling processes.

2.4.1 ML and DL in Waste Segregation


ML in waste segregation has shown promising potential in enhancing the accu-
racy and efficiency of sorting processes, particularly in plastic waste manage-
ment. According to a systematic review by (Ramos et al., 2024), ML models,
particularly those based on CNNs, have demonstrated significant effectiveness
in detecting and classifying plastic waste (Recycleye, 2023). The review high-
lights that these models often exceed 90% accuracy in real-world applications,
making them crucial tools in addressing the complexities of heterogeneous waste
streams and improving recycling outcomes.
DL has become a transformative tool in solid waste management, particu-
larly in waste segregation, by enhancing the accuracy and efficiency of sorting
processes. DL models such as CNNs are extensively utilized to classify waste ma-
terials with high precision. These models can handle the complexities of diverse
waste streams, offering robust solutions for detecting, classifying, and sorting
various types of waste, thereby significantly improving recycling processes and
reducing manual labor in waste management. (Zhang et al., 2022)
Advanced object detection models like YOLO , Faster R-CNN, and Reti-
naNet have been employed for real-time waste segregation. These models not
only classify waste materials but also detect their location within the image, pro-
viding a more comprehensive solution for automated sorting systems. (Sharif
et al., 2022)
The choice of a deep learning (DL) model for waste segregation hinges on
balancing speed and accuracy, with YOLOv8, Faster R-CNN, and RetinaNet
standing out as leading contenders. Each of these models offers unique advan-
tages and trade-offs that must be considered based on the specific demands of
the application. (Sharif et al., 2022)

YOLOv8
As shown in Figure 2, the YOLOv8 architecture can be conceptualized into
three fundamental components:
1. Backbone: The backbone serves as the CNN that extracts essential fea-
tures from the input image. In YOLOv8, this role is fulfilled by a custom
CSPDarknet53 backbone, which integrates Cross-Stage Partial (CSP) con-
nections. These connections are designed to enhance the flow of information
between layers, thereby improving the model’s accuracy by allowing more
effective feature reuse and gradient propagation during training.

Figure 2: YOLOv8 architecture diagram (Ultralytics, 2023a)

2. Neck: The neck, also referred to as the feature fusion module, is crucial
for combining feature maps derived from various stages of the backbone.
This process captures information at multiple scales, which is particularly
important for detecting objects of varying sizes. YOLOv8 introduces a
novel C2f module in place of the conventional FPN. The C2f module is a
faster implementation of the CSP bottleneck, and it adeptly merges high-level semantic
features with low-level spatial details, significantly enhancing detection
accuracy, especially for smaller objects.
3. Head: The head of the YOLOv8 model is tasked with generating predictions.
It employs multiple detection modules that are responsible for
predicting bounding boxes, objectness scores, and class probabilities for
each grid cell within the feature map. These predictions are subsequently
aggregated to produce the final detections, ensuring a robust and precise
identification of objects across the image.
This architecture allows YOLOv8 to deliver high-speed and accurate object
detection, making it well-suited for real-time applications in complex environ-
ments, such as automated waste segregation systems.
YOLOv8’s remarkable performance is driven by several key innovations.
YOLOv8 integrates a Spatial Attention Mechanism that enhances object
localization by concentrating on the most relevant parts of the image. This
selective focus allows the model to accurately identify and localize objects, even
in complex scenes (Agbehadji et al., 2022).
The C2f module in YOLOv8 plays a critical role in merging high-level seman-
tic features with low-level spatial details. This feature fusion is particularly
beneficial for improving the detection accuracy of small objects, which often
pose challenges in object detection tasks. (Joiya, 2022)
The CSPDarknet53 backbone in YOLOv8 incorporates bottleneck struc-
tures that reduce computational complexity without compromising accuracy.
Furthermore, the Spatial Pyramid Pooling Fast (SPPF) layer is designed
to capture features at multiple scales, which significantly enhances the model’s
overall detection performance by allowing it to detect objects of varying sizes
more effectively (Agbehadji et al., 2022).
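For orientation, the Ultralytics package wraps this architecture behind a compact Python interface. The sketch below shows a typical fine-tuning and inference flow; the model variant, dataset YAML path, and hyperparameters are illustrative assumptions rather than the configuration used in this thesis:

```python
from ultralytics import YOLO

# Start from a COCO-pretrained checkpoint and fine-tune on a custom dataset.
model = YOLO("yolov8n.pt")                # nano variant; larger variants trade speed for accuracy
model.train(data="pet_dataset.yaml",      # hypothetical dataset config (image paths + class names)
            epochs=100, imgsz=640)

# Inference on a single frame: each result carries boxes, class ids, and confidences.
results = model("conveyor_frame.jpg", conf=0.5)
for box in results[0].boxes:
    print(int(box.cls), float(box.conf), box.xyxy.tolist())
```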

Faster R-CNN
Faster R-CNN is composed of three main components as shown in Figure 3.

Figure 3: Faster R-CNN architecture diagram (Endachev et al., 2019)

• Convolutional Layers: The convolutional layers in Faster R-CNN are designed
to extract relevant features from the input image. These layers
function as a series of filters that learn to detect specific patterns asso-
ciated with the objects of interest during training. For example, if the
goal is to detect human faces, the filters will learn to recognize shapes,
textures, and colors unique to facial features. This process is analogous to
how a coffee filter works: just as a coffee filter allows only liquid to pass
through while trapping the grounds, the convolutional layers allow only
the most relevant features of the object to be passed forward, filtering out
irrelevant information. The output of these convolutional layers is a fea-
ture map that serves as a condensed representation of the original image.
CNNs, which form the backbone of Faster R-CNN, typically consist of
a sequence of convolutional layers, pooling layers, and a fully connected
layer. The convolutional layers perform the feature extraction, the pooling
layers reduce the dimensionality of the feature maps by retaining the most
significant information, and the fully connected layer usually handles the
final classification task.
• RPN: The RPN is a small, efficient neural network that slides over the
feature map generated by the convolutional layers. It scans the entire
feature map and predicts whether any region contains an object. Addi-
tionally, the RPN proposes bounding boxes around these detected objects.
The RPN’s primary function is to generate potential Regions of Interest
(ROIs) that are likely to contain objects, thus narrowing down the search
space for the subsequent steps.
• Classification and Bounding Box Prediction: After the RPN has proposed
potential object regions, these proposals are passed to a fully connected
network that performs two tasks: classification and bounding box regres-
sion. The network classifies each proposed region into a specific object
class and refines the bounding box coordinates to more accurately enclose
the object. This final step ensures that the detected objects are correctly
identified and precisely located within the image.
This architecture allows Faster R-CNN to efficiently and accurately detect
objects in images, making it a powerful tool for various CV tasks. This model
excels in accuracy, utilizing a two-stage process with a region proposal network
followed by a Fast R-CNN detector. While it provides high precision, its infer-
ence time is slower due to the more complex processing pipeline. Faster R-CNN
is suitable for applications where accuracy is paramount, but it may not meet
the real-time processing requirements of fast-paced sorting facilities. (Ren et al.,
2016)

2.4.2 RetinaNet
This subsection examines RetinaNet and its components in more detail. Figure 4 shows the
RetinaNet architecture diagram. RetinaNet is a widely used object detection
architecture known for its ability to achieve a balance between speed and ac-
curacy, particularly when detecting objects across a wide range of scales. The
architecture of RetinaNet is built on the principle of a single-stage detector, but
it incorporates a novel loss function called the Focal Loss, which addresses the
issue of class imbalance during training. (Lin et al., 2017)

Figure 4: RetinaNet architecture diagram (Lin et al., 2017)

• Backbone Network: The backbone of RetinaNet is typically a deep CNN,
such as ResNet, which is pre-trained on large datasets like ImageNet. The
backbone is responsible for extracting hierarchical features from the input
image. As the image passes through the network, the spatial resolution of
the feature maps decreases, but the depth of the feature maps increases,
capturing rich semantic information at different scales. To retain spatial
information at multiple scales, RetinaNet employs a FPN on top of the
backbone. The FPN constructs a multi-scale feature pyramid by combin-
ing low-resolution, high-semantic information with high-resolution, low-
semantic information, allowing the network to detect objects of varying
sizes effectively.
• FPN: The FPN is crucial in RetinaNet’s architecture for handling objects
at different scales. It takes the feature maps generated by the backbone
network and creates a pyramid of features at multiple scales. The FPN
enables the network to detect both small and large objects by providing a
rich feature representation at every scale. This multi-scale feature pyramid
is then used for both the classification and localization of objects.
• Subnets for Classification and Regression: RetinaNet includes two subnets
that are applied to each level of the feature pyramid:
– Classification Subnet: This subnet predicts the probability of the
presence of an object for each anchor box at every spatial location
in the feature map. The classification subnet is shared across all
pyramid levels, which helps in efficiently handling the large number
of anchor boxes.
– Regression Subnet: This subnet is responsible for predicting the
bounding box offsets for each anchor box, effectively refining the
anchor box proposals to tightly fit the detected objects. Like the
classification subnet, the regression subnet is also shared across all
pyramid levels.
• Focal Loss: One of the key innovations in RetinaNet is the introduction
of Focal Loss, which addresses the issue of class imbalance (Lin et al.,
2018). In typical object detection tasks, there are far more background
(negative) samples than foreground (positive) samples. This imbalance
can lead to the model focusing too much on easy negatives, resulting
in suboptimal performance. Focal Loss reduces the relative loss for well-
classified examples, allowing the model to focus more on hard, misclassified
examples during training. This leads to improved detection performance,
especially for small and difficult-to-detect objects.
• Anchor Boxes: RetinaNet uses anchor boxes of various sizes and aspect
ratios to detect objects. These anchor boxes are predefined bounding
boxes that serve as initial guesses for object locations. The network then
adjusts these boxes through the regression subnet to match the ground
truth more accurately.
• Inference: During inference, RetinaNet processes the image through the
backbone and FPN to generate feature maps at multiple scales. The clas-
sification and regression subnets then predict the class probabilities and
bounding box offsets for each anchor box. Non-Maximum Suppression is
applied to remove redundant boxes and produce the final set of detections.
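For reference, the focal loss described above has the following standard form (Lin et al., 2018), where p_t is the model's estimated probability for the ground-truth class, alpha_t is a class-balancing weight, and gamma is the focusing parameter (gamma = 2 is reported as a robust default):

\mathrm{FL}(p_t) = -\alpha_t \, (1 - p_t)^{\gamma} \, \log(p_t)

Setting gamma = 0 recovers the alpha-weighted cross-entropy loss, while larger values of gamma progressively down-weight well-classified examples so that training concentrates on hard, misclassified ones.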

Benchmarking Studies
The choice between YOLOv8, RetinaNet, and Faster R-CNN depends largely on
the specific requirements of the application. YOLOv8 offers unparalleled speed,
making it the go-to choice for real-time detection tasks (Joiya, 2022). RetinaNet
provides a balanced approach, offering both speed and accuracy, particularly in
scenarios with class imbalance. Faster R-CNN remains the gold standard for
tasks that demand the highest accuracy and precision, albeit at the cost of
slower processing times.
Precision and Recall are key metrics for evaluating classification models,
particularly in tasks like object detection. Precision measures the accuracy of
the positive predictions. It is the proportion of correctly predicted positive
instances out of all instances predicted as positive (Equation 1).

\text{Precision} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Positives (FP)}} \qquad (1)

Recall (or Sensitivity) measures how well the model identifies all actual positive
instances. It is the proportion of correctly predicted positive instances out of
all actual positive instances (Equation 2).

\text{Recall} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Negatives (FN)}} \qquad (2)
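In code, these metrics reduce to simple ratios over the confusion counts, as in the small illustrative helper below (the counts in the example are arbitrary):

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Compute precision and recall from confusion counts (Equations 1 and 2)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Example: 85 correct detections, 10 false alarms, 5 missed objects.
print(precision_recall(85, 10, 5))  # -> (0.894..., 0.944...)
```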

For applications like real-time video analysis or embedded systems where
speed is crucial, YOLOv8 is likely the best choice based on the various studies
done on different datasets. An experiment by Sapkota et al. (2024) in a complex
orchard environment showed that YOLOv8 offers superior speed and performance,
making it more advantageous for real-time applications in agriculture. The findings
of the experiment are shown in Table 1.

Task                      | YOLOv8                               | Mask R-CNN
Single-class segmentation | Precision: 92.9%, Recall: 97%        | Precision: 84.7%, Recall: 88%
Multi-class segmentation  | Precision: 90.6%, Recall: 95%        | Precision: 81.3%, Recall: 83.7%
Inference Speed (FPS)     | 128 (single-class), 92 (multi-class) | 78 (single-class), 64 (multi-class)

Table 1: Performance comparison between YOLOv8 and Mask R-CNN in orchard environments. (Sapkota et al., 2024)

Angelika Mulia et al. (2024) conducted an experiment on license plate recog-
nition and found that YOLOv8 performed better than the Faster R-CNN model.
Table 2 shows the experiment results.

Metric              | YOLOv8 | Faster R-CNN
AP@0.5 (Train)      | 93%    | 71%
AP@0.5-0.95 (Train) | 71%    | 50%
AP@0.5 (Test)       | 90.6%  | 74%
AP@0.5-0.95 (Test)  | 68.6%  | 51%

Table 2: Performance Comparison between YOLOv8 and Faster R-CNN for License Plate Detection. (Angelika Mulia et al., 2024)

Ezzeddini et al. also showed that YOLOv8 excels in real-time scenarios. For
highly specialized tasks where detection accuracy is non-negotiable, such as in
medical imaging or high-stakes security systems, Faster R-CNN is the preferred
model.
Benchmarking studies provide a comprehensive evaluation of these models,
highlighting their performance across various datasets and metrics. For exam-
ple, the COCO object detection challenge and other similar benchmarks have
consistently shown that YOLO models, including YOLOv8, achieve faster infer-
ence times while maintaining competitive accuracy compared to Faster R-CNN
and RetinaNet. These benchmarks underscore the trade-offs between speed and
accuracy and guide the selection of models based on specific application needs.

3 Data Processing and Model Training
This section outlines the methodological framework employed in the study. The
experimental methodology is described in subsection 3.1. Subsection 3.2 details
the data collection processes. In subsection 3.3, the data processing and cleaning
methods are explained. Subsection 3.4 discusses the rationale behind selecting
specific machine learning models for this task. Subsection 3.5 elaborates on the
training strategies and optimization techniques adopted. Subsections 3.6 and
3.7 focus on the methodologies employed for object tracking, object counting,
and the evaluation of model performance using relevant KPIs.

3.1 Methodology
This section describes the methods and necessary steps of the experiment conducted
to address the problem posed in this thesis. The thesis follows the workflow
and pipeline shown in Figure 5.

Figure 5: Workflow and experiment pipeline

The methodology involved several critical stages to ensure a comprehensive
evaluation of object detection models for PET plastic detection. Initially, data
collection was carried out by capturing live video feeds from various conveyor
belts containing clear PET, colored PET, and mixed waste materials, creating
a diverse and challenging dataset. The next step focused on data annotation,
where frames were extracted from videos to create images. Images were anno-
tated utilizing SAM to produce precise segmentations of each object of interest
within the frames. This was followed by data preparation, which included fil-
tering irrelevant segments, applying augmentation techniques such as flipping
and rotation, and splitting the dataset into training, validation, and test sets to
enhance the model’s generalization capabilities.
Finally, the trained model was tested on the captured video footage to eval-
uate its object detection performance, assessing metrics such as accuracy and
processing speed. Object counting and object tracking were implemented
to find the number of objects for each class. The result was calculated by apply-
ing the SE formula. This approach provided a rigorous assessment of the models
under conditions closely resembling real-world waste sorting environments.

3.2 Data Collection
Developing an effective model required the assembly of a comprehensive dataset
that included both clear PET and colored PET images. To meet this require-
ment, an extensive search was conducted across various internet platforms such
as Google, Roboflow, and Kaggle, to locate existing datasets. This search led
to the identification of numerous publicly accessible user-generated datasets.
However, these datasets were primarily composed of laboratory-generated im-
ages, featuring new and pristine plastic materials rather than actual waste. This
posed a significant concern, as it is crucial for the dataset to reflect real-world
conditions, where plastics are frequently exposed to environmental factors that
result in wear, fading, and damage. The lack of diversity and realism in these
datasets could hinder the model’s ability to generalize effectively in real-world
scenarios. Figure 6 shows some of these examples.

Figure 6: PET images found on online datasets (Smirnovs, 2023)

These datasets did not match the tough conditions in a sorting facility, where
plastics are often bent, crushed, or misshapen. To fix this, data was collected
directly from the sorting facility. Photos were taken to show the sorting process
clearly. The goal is to build a dataset that closely matches real-world conditions,
with plastics in various shapes, sizes, and conditions.
The focus is on capturing details like how the conveyor belt moved, how the
plastics interacted with the machines, and the lighting conditions in the facility.
These elements were included to create a thorough dataset that could teach the
model to handle different situations, making it more accurate and reliable. The
first images captured in Figure 7, show detailed views of the real environment,
highlighting the conditions the model needs to handle.
The first set of images captured had many issues, including blurriness and
overlapping objects, as shown in Figure 7. These problems made it very diffi-
cult to annotate the images and train the models effectively. The images had
poor quality due to the fast movement of plastics on the conveyor belt during
sorting, which made it hard to get clear, sharp pictures. The sorting facility’s
environment, with its complex machinery and different lighting conditions, also
added to the challenge of capturing high-quality images. These poor-quality
images made it tough to identify and label the different types of plastics, which
could negatively affect how well the model learned and generalized.

Figure 7: Initial set of objects captured in sorting facility

To address the issues with the initial dataset, individual pieces of trash were
collected and photographed separately. This method was intended to create a
higher-quality dataset to improve classification accuracy. Seven bags of various
plastics were gathered from the sorting facility. In a controlled lab setup, a
black mat was used to mimic the conveyor belt, enabling the capture of over
1,000 images. These images were labeled to ensure accuracy, providing a diverse
dataset for better model training. Some examples of these images are shown in
Figure 8.

Figure 8: Examples of objects photographed in lab setup

To improve the dataset, a further experiment was carried out to capture live
feeds of conveyor belts carrying clear PET, colored PET, and mixed plastics.
Images were extracted from the captured video feeds. The goal was to gather
real-time images that accurately represent the variability and conditions found
in a sorting facility.
The live feed capture was planned to cover the full range of the sorting envi-
ronment on different conveyor belts. This included documenting the continuous
movement of plastics on conveyor belts under different lighting conditions and
interactions with machinery. The resulting dataset contained a wide range of
images showing plastics in various states, including clear and colored PET, both
separately and mixed together.
This video-to-frame method offered several key benefits. It allowed the col-
lection of images that closely mirror real-world scenarios, where plastics are
often mixed and affected by changing light and positions. Also, the live feeds
captured the dynamic nature of the sorting process, providing the model with
a realistic dataset for learning. The images were then annotated to accurately
identify each type of plastic, helping the model learn to distinguish between
different plastics in real-world conditions.
Examples of these live feed images are shown below (Figure 9), highlighting
the enhanced quality and diversity of the dataset. This improvement is expected
to greatly increase the model’s classification accuracy by exposing it to a wide
variety of scenarios and conditions similar to those in operational environments.

Figure 9: Examples of objects captured on conveyor belt

3.3 Data Processing


Creating a robust and effective dataset for training the model requires a care-
fully designed data processing pipeline. This pipeline plays a critical role in
converting raw video feeds into a well-structured dataset, complete with accu-
rate annotations that are essential for training ML models effectively. Figure
10 shows the annotation pipeline that was followed.

Figure 10: Annotation pipeline from raw video feed

Stage 1: Frame Extraction from Video Feeds


The first step in the data processing pipeline involves extracting frames from
live video feeds captured at various conveyor belts within a sorting facility.

These video feeds include a range of plastic types such as clear PET, colored
PET, and mixed plastics. Frames are extracted at specific intervals to build
a diverse collection of static images that accurately reflect the dynamic envi-
ronment of the sorting facility. This video-to-frame method ensures that the
dataset captures the variability and complexity found in real-world conditions,
laying a foundation for the subsequent tasks of annotation and model training.
For this experiment, 10 frames per second were extracted. This frame rate provided
a comprehensive representation of the conveyor belt and ensured that the dataset
included a wide range of plastic positions, orientations, and interactions.
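A minimal frame-extraction sketch using OpenCV is given below; the file paths are placeholders, and the sampling logic assumes the source video runs at a native frame rate of at least 10 fps (this is an illustrative implementation, not necessarily the exact script used):

```python
import cv2
from pathlib import Path

def extract_frames(video_path: str, out_dir: str, target_fps: float = 10.0) -> int:
    """Save frames from a video at roughly `target_fps` images per second."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or target_fps
    step = max(1, round(native_fps / target_fps))   # keep every `step`-th frame
    saved, index = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            cv2.imwrite(str(out / f"frame_{saved:06d}.jpg"), frame)
            saved += 1
        index += 1
    cap.release()
    return saved

# Example (hypothetical paths):
# extract_frames("conveyor_clear_pet.mp4", "frames/clear_pet", target_fps=10.0)
```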

Stage 2: Automated Segmentation


The second stage involves developing an annotation pipeline that utilizes the
SAM approach to identify and annotate all segments within the extracted frame
images. This advanced technique was employed to achieve precise and com-
prehensive segmentation, facilitating accurate labeling of the various types of
plastics. The SAM pre-trained model, which outputs segment proposals, was
integrated into the pipeline to automate the segmentation process, ensuring
consistency and efficiency in generating high-quality annotations. Each seg-
ment was carefully reviewed and adjusted as needed to ensure accuracy, thereby
enhancing the dataset’s value for training robust models.
The SAM was chosen for its advanced capability to automate segmentation
with high precision and consistency as discussed in section 2.3. SAM’s robust
algorithms effectively manage the variability found in real-world environments,
such as differing lighting conditions and the dynamic movement of plastics on
conveyor belts. The scalability of SAM allows for efficient processing of large
datasets, which is essential for handling the data generated from live video
feeds. Also, SAM integrates seamlessly into the existing pipeline, streamlining
the workflow from frame extraction to annotation, improving the efficiency of
creating the training dataset and the model’s performance in practical applica-
tions.
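
As a sketch of how this stage can be implemented with the publicly available segment-anything package, the snippet below runs SAM's automatic mask generator on a single extracted frame; the checkpoint file, model type, and frame path are assumptions.

# Sketch of automatic segment-proposal generation with SAM; checkpoint and paths are assumed.
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

image = cv2.imread("frames/clear_pet/frame_000000.jpg")   # hypothetical extracted frame
image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)         # SAM expects RGB input

masks = mask_generator.generate(image_rgb)
# Each proposal is a dict containing, among other fields, a boolean "segmentation"
# mask, its "area" in pixels, and a "bbox" in XYWH format.
print(f"{len(masks)} segment proposals generated")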

Stage 3: Filtering Small and Overlapping Segments


The third stage of the data processing pipeline is dedicated to refining the
segmented dataset by filtering out small and overlapping segments. This step
is essential for ensuring that the dataset consists of meaningful and distinct
segments, which in turn improves the overall quality of the dataset.
The process begins by identifying and removing segments with a contour
area smaller than 10,000 pixels. The experimentation was carried out with
thresholds ranging from 5,000 to 15,000 pixels. It was observed that a thresh-
old of 5,000 pixels was too small, allowing non-significant segments to appear,
while 15,000 pixels was too large, excluding some significant plastic pieces.
Based on the distance between the camera and the conveyor belt, a threshold
of 10,000 pixels was set as the optimal value. This threshold is designed to eliminate minor details and noise that are unlikely to contribute significantly
to the model’s learning process. By focusing on larger segments, the dataset
retains only those portions of the image that are substantial enough to provide
valuable information during training.
After filtering based on contour area, the dataset undergoes an overlap fil-
tering process. In this step, segments that overlap more than 80% with larger
segments are excluded. This is crucial for removing redundant segments that
could introduce confusion and inefficiency during model training. By ensuring
that each segment is distinct and non-redundant, the model can better focus on
learning the unique characteristics of each type of plastic, thereby enhancing its
ability to generalize and perform accurately in real-world scenarios.
This filtering process greatly improves the dataset by retaining only mean-
ingful and distinct segments, resulting in more manageable and higher-quality
segments. Such refinement is vital for the subsequent stages of model training,
ultimately leading to enhanced accuracy and reliability in practical applications.
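
A minimal sketch of this filtering stage is given below, using the mask pixel count as a stand-in for the contour area; the thresholds follow the values described above, while the helper itself is illustrative.

# Sketch of the segment-filtering stage: drop masks below 10,000 pixels and masks
# that overlap a larger kept mask by more than 80%. Mask format follows SAM's output.
import numpy as np

MIN_AREA = 10_000   # pixels; 5,000 proved too small and 15,000 too large
MAX_OVERLAP = 0.8   # fraction of a smaller mask covered by a larger one

def filter_segments(masks):
    # 1) Remove small segments.
    masks = [m for m in masks if m["segmentation"].sum() >= MIN_AREA]
    # 2) Remove segments that are mostly contained in a larger segment.
    masks = sorted(masks, key=lambda m: m["segmentation"].sum(), reverse=True)
    kept = []
    for m in masks:
        seg = m["segmentation"]
        area = seg.sum()
        redundant = any(
            np.logical_and(seg, k["segmentation"]).sum() / area > MAX_OVERLAP
            for k in kept
        )
        if not redundant:
            kept.append(m)
    return kept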

Stage 4: Annotation of Each Filtered Segment


The fourth stage of the data processing pipeline focuses on the detailed annota-
tion of each filtered segment (Figure 11). After the mentioned filtering processes,
the remaining segments are those that are significant in size and distinct in their
characteristics. These segments are now prepared for annotation.


Figure 11: Examples of annotations of filtered segments

In this stage, each filtered segment undergoes annotation to ensure that every detail is correctly labeled. The annotation process involves identifying
and categorizing the segments based on the type of PET plastic.
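
To illustrate one way the reviewed segments can be turned into training labels, the sketch below converts a single binary mask into a YOLO-format bounding-box line; the class index mapping is an assumption about the label scheme.

# Sketch: convert a filtered SAM mask into a YOLO-style bounding-box annotation line.
# Assumed class mapping: 0 = clear-pet, 1 = colored-pet, 2 = others.
import numpy as np

def mask_to_yolo_line(mask: np.ndarray, class_id: int) -> str:
    """Return 'class x_center y_center width height', normalized to [0, 1]."""
    ys, xs = np.nonzero(mask)
    h, w = mask.shape
    x_min, x_max = xs.min(), xs.max()
    y_min, y_max = ys.min(), ys.max()
    x_c = (x_min + x_max) / 2 / w
    y_c = (y_min + y_max) / 2 / h
    return f"{class_id} {x_c:.6f} {y_c:.6f} {(x_max - x_min) / w:.6f} {(y_max - y_min) / h:.6f}"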

Stage 5: Dataset Division and Augmentation


In the fifth stage of the data processing pipeline, attention turns to enhanc-
ing the dataset’s diversity and robustness through image augmentation, after a
systematic division of the dataset for training, validation, and testing purposes.

Before augmentation, the dataset is divided into three subsets to facilitate
effective training, validation, and testing of the model. 70% of the dataset is
allocated to the training set, which serves as the primary source of data for
the model to learn underlying patterns and features. 20% is designated for the
validation set, which is used during training to fine-tune hyperparameters and
iteratively improve the model, ensuring it generalizes well beyond the training
data. The remaining 10% is reserved as the test set, providing an unbiased eval-
uation of the final model’s performance, and allowing for a thorough assessment
of its accuracy and generalization capabilities.
Table 3 shows that the ”clear-pet” class appears in 337 images with a total
of 1,730 instances, making it the most represented class in the dataset. The
”colored-pet” class is found in 289 images with 779 instances, while the ”others”
class, which likely includes miscellaneous items, appears in 109 images with 162
instances. The images used in the dataset come from two primary sources:
lab-captured images and frames extracted from video footage. All images have a
resolution of 640x640 pixels, ensuring consistency in the dataset.
Figure 12 illustrates the distribution of the number of images relative to the
number of instances per image in the dataset. On the x-axis, the ”Number of
Instances” refers to the count of distinct objects or segments present in each
image, while the y-axis represents the ”Number of Images” corresponding to
each instance count. The majority of images contain only one instance, as
indicated by the highest bar at the ”1” mark, with 450 images in this category.
This also indicates that the majority of dataset images are lab-captured images with
a single instance per image. As the number of instances per image increases, the
number of images decreases significantly. Figure 12 also shows that, on average,
there are 3.83 instances in an image.
The augmentation process involves applying several transformations to each
image in the training dataset. Horizontal and vertical flips are performed to
ensure that the model can recognize plastics regardless of their orientation.
Rotations within a range of -15° to +15° introduce angular variations. Addi-
tionally, shear transformations applied both horizontally and vertically within
a range of ±10° simulate perspective changes and enhance the model’s ability
to generalize across different viewpoints.
By dividing the dataset and integrating these augmentations, this stage en-
sures that the model is trained on a comprehensive and representative sample
of data.
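
The sketch below shows one possible way to reproduce this 70/20/10 split and the described augmentations with Albumentations; the actual tooling used for the thesis dataset may differ.

# Sketch of the dataset split and the augmentations described above (flips,
# ±15° rotation, ±10° horizontal/vertical shear), expressed with Albumentations.
import random
import albumentations as A

def split_dataset(image_paths, seed=42):
    """Return (train, val, test) lists in a 70/20/10 ratio."""
    random.Random(seed).shuffle(image_paths)
    n = len(image_paths)
    return (
        image_paths[: int(0.7 * n)],
        image_paths[int(0.7 * n): int(0.9 * n)],
        image_paths[int(0.9 * n):],
    )

augment = A.Compose(
    [
        A.HorizontalFlip(p=0.5),
        A.VerticalFlip(p=0.5),
        A.Rotate(limit=15, p=0.5),                                # -15° to +15°
        A.Affine(shear={"x": (-10, 10), "y": (-10, 10)}, p=0.5),  # shear within ±10°
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)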

Class Name Number of Images Number of Instances Ratio


clear-pet 337 1730 1:5.1
colored-pet 289 779 1:2.7
others 109 162 1:1.5

Table 3: Number of Images and Instances per Class in the dataset before augmentation

Figure 12: Number of images vs Number of instances per image

3.4 Model Selection


The next step in the workflow is model selection, which involves identifying the
most suitable ML architecture to handle the classification of various types of
plastics under real-world conditions. The primary goal is to choose a model that
can accurately distinguish between different plastic types, such as clear PET,
colored PET, and mixed plastics, even in challenging environments where items
are in motion or partially obscured.
The SOTA section reviewed models known for their robustness and effec-
tiveness in object detection and classification tasks. Considering the nature of
the problem, models from the YOLO series are evaluated due to their proven
success in real-time object detection. As discussed in the SOTA section, YOLOv8 was selected for its superiority over Faster R-CNN and RetinaNet in inference results.
Several key criteria guide the model selection process. The model must
first demonstrate high precision and recall rates in detecting and classifying
different types of plastics, ensuring that it can reliably identify and distinguish
between various materials. Additionally, inference efficiency is crucial to ensure
the model can operate in real-time without requiring excessive resources, which
is vital for maintaining operational efficiency in practical deployment scenarios.
Scalability and adaptability are also important considerations. The chosen
model should be easily fine-tuned to fit the specific dataset and seamlessly in-
tegrate with the existing data processing pipeline. YOLOv8, with its modular design and support for various customizations, meets these requirements well.
Given the priority of fast inference in this project, YOLOv8 was selected
over other models like Faster R-CNN and RetinaNet due to its superior real-
time processing capabilities.

3.5 Model Training


The model training phase utilizes the prepared and annotated dataset to train
the YOLOv8 model. The auto optimizer of YOLOv8 was used for training.
In the initial training version, the model was trained using the dataset with-
out any augmentation. This dataset only had images captured in a lab-controlled environment and images taken on the conveyor belt. This dataset provided a
foundation for the model, allowing it to learn the basic features and character-
istics of clear and colored PET plastics. However, the absence of augmented
images limited the diversity of the training data, which may have affected the
model’s ability to generalize to new, unseen scenarios. Although this training
resulted in a high mAP@50, the model failed to recognize basic colored PET.
The second version of the model training was done on images extracted from video frames and also introduced augmented images to the dataset, including transformations such as flipping, rotation, and shearing. These augmentations simulated various real-world conditions. This resulted in a high mAP@50, but inference on colored PET remained weak and did not improve with this version, because the number of colored PET images and instances in the dataset was low.
To address the imbalance, the third version of the training process included
a significantly larger number of colored PET images. This enhancement aimed
to ensure that the model received adequate exposure to colored PET plastics,
improving its accuracy in identifying this category.
In the fourth version, the dataset was further expanded to include a third
class labeled ”others,” representing miscellaneous materials found on the con-
veyor belts along with PET plastics. This addition was crucial for the model
to accurately differentiate between PET plastics and other materials, thereby
enhancing its overall classification performance. The inclusion of this third class
allowed the model to be more versatile and effective in real-world sorting appli-
cations, where various non-PET materials may be present.
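
As a sketch of the fine-tuning step with the Ultralytics API, the snippet below trains YOLOv8 on the annotated dataset with the optimizer left on its automatic setting; the dataset YAML, base weights variant, and epoch count are assumptions.

# Sketch of YOLOv8 fine-tuning with the Ultralytics API; file names are illustrative.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")          # pretrained base weights (variant assumed)
model.train(
    data="pet_dataset.yaml",        # hypothetical config listing train/val/test paths and 3 classes
    epochs=100,
    imgsz=640,                      # dataset images are 640x640
    optimizer="auto",               # lets Ultralytics choose the optimizer and learning rate
)
metrics = model.val()               # precision, recall, mAP@50, etc. on the validation set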

3.6 Object Tracking and Object Counting


To quantify the accuracy of sorting machines, object counting and tracking are
essential components as the objects move along conveyor belts. This section
details the methodology and processes implemented in the project to achieve
object detection, tracking, and counting using ML techniques. The focus is to
avoid double counting the objects.
The first step in the process involves capturing live video feeds from various
conveyor belts in the sorting facility. These videos included a mix of plastic types, such as clear PET, colored PET, and other mixed materials. Given the
continuous movement of objects on the conveyor belts, ten frames per second
are extracted to create a sequence of static images.
This approach ensured a dataset that accurately represented the flow of
materials on the conveyor belts, providing a foundation for subsequent detection,
tracking, and counting tasks. The trained version 4 model is used for object
tracking. The tracking algorithm gives a unique track ID to each detected object
in the frame. An example of object tracking and counting is shown in Figure
13.
One of the critical challenges addressed in this project was the potential for counting the same object multiple times when counting is based solely on the track ID, since the conveyor belt continuously moves and changes the position of objects in consecutive frames, especially if an object temporarily disappears due to occlusion or leaves and re-enters the frame. An algorithm was needed to overcome this issue: a virtual line is drawn across the frame at a specific position, serving as a checkpoint.

Figure 13: Object tracking and counting on different streams

For each detected object, the model outputs the bounding box coordinates,
the class of the object, the confidence score, and the track ID. These details
are used to update the tracking history and determine the object’s movement
relative to a predefined horizontal line placed at 80% of the frame height from
the bottom.
This line ensures that each object is counted only once. The algorithm
updates the count only when the object’s center crosses this line, partially pre-
venting multiple counts of the same object even if it reappears in the frame
later.
Additionally, the algorithm maintains a record of the highest confidence score
associated with each object’s classification. If an object is detected with a higher
confidence score in a subsequent frame, the classification is updated to reflect
this. This ensures that the classification accuracy is maximized throughout the
video. The total object count and the count for each class are calculated and provided as output.
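
A minimal sketch of this counting logic on top of Ultralytics tracking is shown below; the weights file, video path, tracker configuration, and exact placement of the counting line are assumptions that follow the description above.

# Sketch of line-crossing counting on top of YOLOv8 tracking; paths are illustrative.
from collections import defaultdict
import cv2
from ultralytics import YOLO

model = YOLO("v4_best.pt")                      # hypothetical trained V4 weights
cap = cv2.VideoCapture("clear_pet_output.mp4")  # hypothetical test video

counted_ids = set()                             # track IDs already counted
best_class = {}                                 # track ID -> (best confidence, class)
counts = defaultdict(int)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    line_y = int(frame.shape[0] * 0.8)          # counting line (placement per the description above)
    result = model.track(frame, persist=True, tracker="bytetrack.yaml", verbose=False)[0]
    if result.boxes.id is None:
        continue
    for box, tid, cls, conf in zip(result.boxes.xyxy, result.boxes.id, result.boxes.cls, result.boxes.conf):
        tid, cls, conf = int(tid), int(cls), float(conf)
        # Keep the highest-confidence classification seen for this track so far.
        if tid not in best_class or conf > best_class[tid][0]:
            best_class[tid] = (conf, cls)
        center_y = float((box[1] + box[3]) / 2)
        # Count each track once, the first time its centre passes the line.
        if center_y >= line_y and tid not in counted_ids:
            counted_ids.add(tid)
            counts[best_class[tid][1]] += 1

cap.release()
print(dict(counts))                             # per-class object counts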

3.7 Performance Evaluation and Accuracy KPIs


SE is a metric employed to evaluate the performance of a sorting system in
classifying items into specific categories. For a given class, it is computed by
comparing the number of correctly classified items, True Positive (TP), to the
total number of items belonging to that class (total count, TC). Mathematically,
SE is expressed as:
SE = \frac{TP}{TC} \times 100 \quad (3)
The sorting facility setup is shown in Figure 14.

Figure 14: Sorting facility setup

The input stream consists of ”Clear PET”, ”Colored PET”, and ”Others”. To calculate the SE of the colored PET stream, the following formula can be applied:
SE_{colored} = \frac{TP_{colored}}{TP_{colored} + FN_{colored}} \times 100 \quad (4)
Where:
SE_{colored}: SE of the colored PET output stream
TP_{colored}: TP for the colored output stream (number of correctly sorted colored PET items in the colored output stream)
FN_{colored}: False Negative (FN) for the colored output stream (number of colored PET items that should have been in the colored output stream but ended up in the clear PET output stream)
Similarly, the SE of the clear PET stream can be calculated by the formula:
SE_{clear} = \frac{TP_{clear}}{TP_{clear} + FN_{clear}} \times 100 \quad (5)
Where:
SE_{clear}: SE of the clear PET stream
TP_{clear}: TP for the clear output stream (number of correctly sorted clear PET items in the clear output stream)
FN_{clear}: FN for the clear output stream (number of clear PET items that should have been in the clear output stream but ended up in the colored PET output stream)
The SE is influenced by the precision of the trained model. When the model’s
Mean Average Precision (mAP) for a particular class is less than 100%, it indi-
cates that the model will misclassify some items belonging to that class. This
misclassification increases the number of False Negatives (FN), which are
items that belong to a class but are incorrectly classified as belonging to an-
other class. Conversely, this will also lead to a decrease in True Positives
(TP), which are the items correctly identified as belonging to the class.
Since the SE calculation depends on both TP and FN values, the model’s
mAP directly impacts the calculated SE. Therefore, to accurately calculate SE,
the model’s performance for each class must be considered.
The SE for a specific output stream i can be calculated as:

SE_i = \frac{TP_i \times mAP_i}{TP_i \times mAP_i + FN_i} \times 100 \quad (6)
Where:
SE_i: SE of output stream i
TP_i: TP for output stream i
FN_i: FN for output stream i
mAP_i: mAP of the model for the target class of output stream i
For example:
• If the mAP for clear PET is 81%, this means that 19% of clear PET items
will be misclassified. Consequently, this increases the FN count for clear
PET and decreases the TP count, leading to a lower SE for clear PET.

• Similarly, if the mAP for colored PET is 86%, the SE calculation for
colored PET will be similarly affected.
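
As a small numerical illustration of equation (6), the sketch below computes SE for hypothetical TP/FN counts using the mAP values quoted above; the counts are placeholders, not measured data.

# Illustrative SE calculation following equation (6); TP/FN values are placeholders.
def sorting_efficiency(tp: int, fn: int, map_score: float) -> float:
    """SE_i = (TP_i * mAP_i) / (TP_i * mAP_i + FN_i) * 100."""
    weighted_tp = tp * map_score
    return weighted_tp / (weighted_tp + fn) * 100

print(f"clear PET:   {sorting_efficiency(tp=100, fn=5, map_score=0.81):.1f}%")
print(f"colored PET: {sorting_efficiency(tp=100, fn=5, map_score=0.86):.1f}%")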

4 Results
In this section, all results of the experiment are presented. The YOLOv8 optimizer
is set to auto. The auto optimizer settings for YOLOv8 use the AdamW
optimizer with a learning rate of 0.001429 and a momentum value of 0.9. The
optimizer is configured with three parameter groups: one group of 57 weights
with no weight decay, another group of 64 weights with a decay rate of 0.0005
for regularization, and a final group of 63 biases also with no decay. Two images are used for inference to compare the results of all the experiments; both images were taken on the conveyor belt.

V1 result
The V1 model recorded a precision of 1 at a confidence level of 0.983. The recall was 0.98 at a confidence level of 0. This model achieved a mAP score of 87.8 at a confidence level of 0.5. It also recorded an F1-score of 0.83 at a confidence level of 0.663. Figure 15 shows the inference of the V1 model.

Figure 15: Inference results of V1 models

V2 result
The V2 model recorded a precision of 1 at a confidence level of 0.966. The recall was 0.96 at a confidence level of 0. This model achieved a mAP score of 81.5 at a confidence level of 0.5. It also recorded an F1-score of 0.75 at a confidence level of 0.371. Figure 16 shows the inference of the V2 model.

Figure 16: Inference results of V2 models

V3 result
The V3 model recorded a precision of 1 at a confidence level of 0.949. The recall was 0.95 at a confidence level of 0. This model achieved a mAP score of 80.1 at a confidence level of 0.5. It also recorded an F1-score of 0.74 at a confidence level of 0.471. Figure 17 shows the inference of the V3 model.

Figure 17: Inference results of V3 models

V4 result
Figure 18 shows the inference result of the V4 model. The final experimental
results show that the model achieved a classification accuracy of 84.7 mAP
(Figure 19b) for three classes (Clear PET, Colored PET, and others). Figure
19a shows the testing of the trained algorithm. Figure 20 provides a graphical
representation of the performance metrics obtained by the trained YOLO model.
In the precision graph (Figure 20a), one can see a value of 1 at a confidence
level of 0.954. In the Recall graph (Figure 20b), there is a 0.94 value recorded
at a confidence level of 0. The Precision-Recall (PR) curve (Figure 20c) shows
a mAP score of 84.5 at a confidence level of 0.5. In the F1-score graph (Figure 20d), one can see a 0.81 value at a confidence level of 0.523. The performance
metrics are also shown in Table 4.

Metric Confidence Level Value


Precision 0.954 1.00
Recall 0.00 0.94
mAP Score 0.50 84.5
F1-score 0.523 0.81

Table 4: Performance metrics obtained by the trained YOLO model at specific confidence levels.

Figure 18: Inference results of V4 models

Table 5 compares the inference results on both selected images.

Table 5: Inference Results Based on Different Model Versions

Trained Model Version Image 1 Image 2


V1 clear-pet: 20 clear-pet: 2, colored-pet: 3
V2 clear-pet: 26 clear-pet: 2, colored-pet: 5
V3 clear-pet: 36 clear-pet: 2, colored-pet: 5
V4 clear-pet: 36 clear-pet: 7, colored-pet: 6

As for Figure 21, it presents the confusion matrix generated by the version 4
trained YOLOv8 model, which categorizes objects into clear PET, colored PET,
others, and background across the entire dataset. Table 6 shows the recognition
rates of different classes.

Class TP FN FP TN
Clear PET 81% 19% 18% 0%
Colored PET 86% 14% 13% 0%
Others 90% 10% 1% 0%
Background 76% 24% 0% 0%

Table 6: Recognition Rates for Different Classes

(a) Testing of trained algorithm (b) mAP graph

Figure 19: Training result of v4 model

The YOLOv8 trained model demonstrated efficiency in processing images, with an average inference time of 7.8 ms per frame. For multi-class detection, the inference time was slightly higher at 10.9 ms per frame. This inference was performed on a Tesla T4 GPU with Python 3.10. These results confirm the model’s
suitability for real-time applications, where rapid processing is crucial.
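
The sketch below shows one simple way such per-frame timings could be measured with the Ultralytics API; the weights file and test frame are assumptions, and absolute numbers will depend on the hardware (a Tesla T4 in the experiments above).

# Sketch of a per-frame inference timing measurement; file names are illustrative.
import time
from ultralytics import YOLO

model = YOLO("v4_best.pt")                     # hypothetical trained weights
img = "frames/clear_pet/frame_000000.jpg"      # hypothetical test frame

model(img, verbose=False)                      # warm-up run (model load, CUDA init)
n = 100
start = time.perf_counter()
for _ in range(n):
    model(img, verbose=False)
elapsed_ms = (time.perf_counter() - start) / n * 1000
print(f"average end-to-end time: {elapsed_ms:.1f} ms per frame (includes pre/post-processing)")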

Test Video Manually Counted Objects Model Counted Objects


Clear PET 713 622
Colored PET 113 101

Table 7: Comparison of Manually Counted and Model Counted Objects in Testing Videos

Capturing testing videos for both input and output streams presented sig-
nificant challenges due to the inability to stop the conveyor. This made it
difficult to record videos from different streams simultaneously, leading to po-
tential discrepancies. Despite these challenges, 2-minute videos were recorded
as simultaneously as possible. As shown in Table 7, the clear PET output video had 713 manually counted objects, while the trained model counted only 622. Similarly, for the colored PET test video, the manual count was 113 objects, but the model counted only 101. These discrepancies highlight the challenges faced both in capturing the data and in the model’s analysis of it.
The SE of the trained model was calculated for the test videos: 94.6% for clear PET and 81.1% for colored PET.

(a) Precision (b) Recall

(c) PR (d) F-1 score

Figure 20: Trained model metrics

Figure 21: Trained Model Confusion Matrix

5 Discussion and Conclusion
5.1 Interpretation of Experimental Results
The interpretation and comparison of the experimental results provide valuable
insights into the effectiveness of the proposed system. The overall classification
accuracy, reflected by an 84.7 mAP for clear PET, colored PET, and other
materials, underscores the system’s high precision in distinguishing between
various types of PET plastics. This level of accuracy is particularly important
for calculating the SE of the PET sorting machine, as it directly influences the
reliability of the sorting process.
However, while the model successfully counted approximately 90% of the
objects on the conveyor belt in the test videos, it is essential to evaluate whether
this performance is sufficient for the targeted purpose. For certain applications,
a 90% counting accuracy may be acceptable, especially when rapid decision-
making is prioritized. The model’s quick inference time plays a significant role
here, enabling timely classification that keeps pace with the conveyor belt’s
speed, thereby ensuring efficient and continuous operation.
Yet, it is also crucial to examine the model’s performance across different
classes. If the 90% accuracy is consistent across all classes, it may be deemed
satisfactory. However, if certain classes, such as colored PET or others, exhibit
lower accuracy, this discrepancy could be problematic. In such cases, the model
might struggle to accurately classify and count these specific classes, leading to
potential inefficiencies in the sorting process.
The variation in accuracy across classes could stem from several factors,
including the visual similarity between classes, differences in the quality or
quantity of training data for each class, or the inherent complexity of distin-
guishing certain materials. Therefore, while the overall 90% counting accuracy
is promising, further analysis is needed to ensure that the system meets the
desired performance standards for each class, ensuring a robust and reliable
quantification of the sorting process.

5.2 Limitations and Further Improvements


While the YOLOv8 trained model demonstrates strong performance in detect-
ing, classifying, and counting PET in a real-time sorting environment, several
limitations were identified during the project. These limitations highlight ar-
eas where further improvements could enhance the model’s applicability and
effectiveness in diverse and more challenging scenarios.
The dataset used for training primarily consisted of clear PET, colored PET,
and mixed plastics. Although this dataset was diverse enough to train the model
for the intended application, it did not include a balanced number of instances
for each class. This limitation restricts the model’s generalization and skews its performance toward clear PET. Expanding the dataset to include more training images and instances would improve general performance.
The model shows some limitations in accurately detecting and classifying objects that are heavily overlapped or partially occluded by other items. In
real-world sorting facilities, objects often move close to one another or may
be partially hidden, posing significant challenges for the model in distinguish-
ing between them. While the trained model performed adequately in various
scenarios, its performance could be enhanced by addressing these complexities.
To improve the model’s ability to handle such challenging visual scenes,
several strategies could be employed. First, improving the quality of image cap-
ture, such as using higher resolution cameras or optimizing lighting conditions,
could help in obtaining clearer and more detailed images, making it easier for
the model to distinguish between overlapping objects. Additionally, creating a
more balanced and diverse dataset that includes a higher proportion of images
with overlapping and occluded objects would better train the model to recognize
and classify such scenarios.
The performance of the trained model can be affected by variations in en-
vironmental conditions, such as changes in lighting, reflections, or shadows.
Inconsistent lighting conditions in the sorting facility, for example, can lead
to variations in detection accuracy. Although the model was trained on data
reflecting a range of conditions, variations may still impact its performance. Im-
plementing more sophisticated data augmentation techniques during training,
or deploying models specifically tuned to different environmental conditions, can
mitigate this issue.
While the object tracking and counting mechanisms are effective for the
specific task at hand, they were applied only for the sorting facility conveyor
position and may struggle with more complex scenarios. For example, the model
encounters difficulties in maintaining accurate counts if objects disappear and
reappear within the frame or if they move in unpredictable patterns. The current
approach does not incorporate advanced tracking algorithms, such as Kalman
filters or optical flow, which can enhance the accuracy and reliability of object
tracking in more challenging environments.
Although YOLOv8 is designed for real-time processing, it still requires signif-
icant computational resources, particularly when handling high-resolution im-
ages or processing multiple video streams simultaneously. This could limit its
deployment in environments with limited hardware capabilities. Optimizing
the model for lower-resource environments, possibly through model pruning or
quantization, can make it more accessible for broader use cases.
Although the model achieved high precision and recall, a thorough analysis
of False Positive (FP) and FN was not conducted within the time constraints
of this project. Such an analysis is crucial for identifying the underlying causes
of these errors, which can offer valuable insights for future improvements. In
subsequent research, focusing on detailed error analysis could enable targeted
adjustments to the model or training process, ultimately reducing the occurrence
of these errors and enhancing overall performance.
In summary, while the YOLOv8 trained model proved to be promising for the
task of real-time PET plastic detection and sorting, addressing these limitations
through additional research, data collection, and model refinement can further
enhance its performance and applicability across a wider range of real-world scenarios.

5.3 Conclusion
This thesis explored the development and implementation of a YOLOv8-based
model for real-time PET plastic detection, tracking, and counting in a sorting
facility environment. The primary objective was to create a robust system
capable of accurately identifying different types of PET plastics under dynamic
and challenging conditions, thereby quantifying the efficiency of the PET sorting
process.
Through a custom-designed data processing pipeline, a training dataset was
assembled, encompassing a diverse range of clear PET, colored PET, and mixed
plastics. The dataset was annotated and augmented to ensure the model was
trained on a broad spectrum of scenarios reflective of real-world conditions. The
YOLOv8 model was selected for its superior balance between speed and accu-
racy, critical attributes for real-time applications where rapid decision-making
is essential.
The experimental results demonstrated the model’s strong performance,
with high precision, recall, and mAP scores across all classification tasks. The
model’s real-time processing capabilities were validated through its testing on
video feeds, where it effectively detected, tracked, and counted plastics as they
moved along conveyor belts in a sorting facility. These results confirm the
model’s potential as a valuable tool for quantifying the efficiency and accuracy
of automated PET plastic sorting systems.
Despite these successes, the project also identified several limitations, such
as the need for a more diverse dataset, improved handling of overlapping and
occluded objects, and better scalability to larger and more complex systems.
Addressing these limitations in future work can further enhance the model’s
robustness and applicability, making it suitable for a broader range of waste-
sorting scenarios.
In conclusion, this project demonstrates the viability of using advanced DL
models like YOLOv8 in industrial waste sorting applications. The model’s abil-
ity to deliver high-speed, accurate object detection and classification in real-time
underscores its potential to significantly improve the efficiency and effectiveness
of sorting operations. As the global emphasis on waste management and recy-
cling continues to grow, such technologies will play an increasingly critical role
in driving sustainable practices and reducing environmental impact. Future
research and development efforts should focus on expanding the model’s capa-
bilities, optimizing its performance in diverse environments, and integrating it
with larger, more complex waste management systems.

