Typical Tasks
There are four major categories in computer vision: recognition tasks, motion analysis,
image restoration and geometry reconstruction. The following figure illustrates those
tasks.
Recognition tasks
There are different types of recognition tasks in computer vision. Typical tasks involve the
detection of objects, persons, poses, or images. Object recognition deals with the estima-
tion of different classes of objects that are contained in an image (Zou et al., 2019). For
instance, a very basic classifier could be used to detect whether there is a hazardous mate-
rial label in an image or not. A more specific classifier could additionally recog-
nize information about the label type such as “flammable” or “poison.” Object recognition
is also important in the area of autonomous driving to detect other vehicles or pedes-
trians.
In object identification tasks, objects or persons that are in an image are identified using
unique features (Barik & Mondal, 2010). For person identification, for example, a computer
vision system can use characteristics, such as fingerprint, face or handwriting. Facial rec-
ognition, for instance, uses biometric features from an image and compares them to the
biometric features of other images from a given database. Person identification is com-
monly used to verify the identity of a person for access control.
Pose estimation tasks play an important role in autonomous driving. The goal is to esti-
mate the orientation and/or position of a given object relative to the camera (Chen et al.,
2020). This can, for instance, be the distance to another vehicle ahead or an obstacle on
the road.
In classical odometry, motion sensors are used to estimate the change of the position of
an object over time. Visual odometry, on the other hand, analyzes a sequence of images to
gather information about the position and orientation of the camera (Aqel et al., 2016).
Autonomous cleaning bots can, for instance, use this information to estimate the location
in a specific room.
In tracking tasks, an object is located and followed in successive frames. A frame can be
defined as a single image in a longer sequence of images, such as videos or animations
(Yilmaz et al., 2006). This can, for instance, be the tracking of people, vehicles, or animals.
Image restoration deals with the process of recovering a blurry or noisy image to an image
of better and clearer quality. This can, for instance, be old photographs, but also movies
that were damaged over time. To recover the image quality, filters like median or low-pass
filters can remove the noise (Dhruv et al., 2017). Nowadays, methods from image restoration can also be used to restore missing or damaged parts of an artwork.

Noise
In computer vision, noise refers to a quality loss of an image which is caused by a disturbed signal.

Geometry reconstruction tasks
In computer vision, there are five major challenges that must be tackled (Szeliski, 2022):
• The illumination of an object is very important. If lighting conditions change, this can
yield different results in the recognition process. For instance, red can easily be
detected as orange if the environment is bright.
• Differentiating similar objects can also be difficult in recognition tasks. If a system is
trained to recognize a ball it might also try to identify an egg as a ball.
• The size and aspect ratios of objects in images or videos pose another challenge in com-
puter vision. In an image, objects that are further away will appear to be smaller than
closer objects even if they are the same size.
• Algorithms must be able to deal with rotation of an object. If we look for instance at a
pencil on a table, it can look like a line when viewed from the top or like a circle
when we change to a different perspective.
• The location of objects can vary. In computer vision, this effect is called translation.
Going back to our example of the pencil, it should not make a difference to the algo-
rithm if the pencil is located on the center of a paper or next to it.
Because of these challenges, there is much research towards algorithms that are scale-, rotation-, and/or translation-invariant (Szeliski, 2022).
Pixels
Images are constructed as a two-dimensional pixel array (Lyra et al., 2011). A pixel is the
smallest unit of a picture. The word originates from the two terms "picture" (pix) and "element" (el) (Lyon, 2006). A pixel is normally represented as a single square with one
color. It becomes visible when zooming deep into a digital image. You can see an example
of the pixels of an image in the figure below.
The resolution of an image specifies the number of pixels it contains. The higher the resolution, the more detail the image shows. Conversely, if the resolution is low, the picture might look fuzzy or blurry.
Color representations
There are various ways to represent the color of a pixel as a numerical value. The easiest
way is to use monochrome pictures. In this case, the color of a pixel will be represented by
a single bit, being 0 or 1. In a true color image, a pixel will be represented by 24 bits.
The following table shows the most important color representations with the corresponding number of available colors (color depth).
One way to represent colors is the RGB color representation. We illustrate this using the
24-bit color representation. Using RGB, the 24 bits of a pixel are separated in three parts,
each 8 bits in length. Each of those parts represents the intensity of a color between 0 and
255. The first part represents red (R), the second green (G), and the last blue (B). Out of these three components, all other colors can be mixed additively. For instance, the color code
RGB(0, 255, 0) will yield 100 percent green. If all values are set to 0, the resulting color will
be black. If all values are set to 255 it will be white. The figure below illustrates how the
colors are mixed in an additive way.
Figure 19: Additive Mixing of Colors
Another way to represent colors is the CMYK model. In contrast to the RGB representation
it is a subtractive color model comprised of cyan, magenta, yellow and key (black). The
color values in CMYK range from 0 to 1. Therefore, to convert colors from RGB to CMYK, the
RGB values first have to be divided by 255. The values of key (black), cyan, magenta, and yellow can then be computed as follows:
$$K = 1 - \max\left(\frac{R}{255}, \frac{G}{255}, \frac{B}{255}\right)$$

$$C = \frac{1 - \frac{R}{255} - K}{1 - K}, \quad M = \frac{1 - \frac{G}{255} - K}{1 - K}, \quad Y = \frac{1 - \frac{B}{255} - K}{1 - K}$$
While the RGB is better suited for digital representation of images, CMYK is commonly
used for printed material.
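To make the conversion concrete, here is a minimal Python sketch of the formulas above (the function name and the example calls are illustrative only):

```python
def rgb_to_cmyk(r: int, g: int, b: int):
    """Convert 8-bit RGB values (0-255) into CMYK values in the range 0-1."""
    # Normalize the RGB channels to the range 0-1.
    r_n, g_n, b_n = r / 255, g / 255, b / 255
    # The key (black) component is determined by the brightest channel.
    k = 1 - max(r_n, g_n, b_n)
    if k == 1:  # pure black: avoid division by zero
        return 0.0, 0.0, 0.0, 1.0
    c = (1 - r_n - k) / (1 - k)
    m = (1 - g_n - k) / (1 - k)
    y = (1 - b_n - k) / (1 - k)
    return c, m, y, k

print(rgb_to_cmyk(0, 255, 0))      # pure green -> (1.0, 0.0, 1.0, 0.0)
print(rgb_to_cmyk(255, 255, 255))  # white      -> (0.0, 0.0, 0.0, 0.0)
```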
Images as functions
We will now discuss how an image can be built from single pixels. To do that, we need a
function that can map a two-dimensional coordinate (x,y) to a specific color value. On the
x-axis we begin on the left with a value of 0 and continue to the right until the maximum
width of an image is reached. On the y-axis, we begin with 0 at the top and reach the
height of an image at the bottom.
Let us look at the function f(x, y) for an 8-bit grayscale image. The function value f(42, 100) = 0 would mean that we have a black pixel 42 pixels to the right of and 100 pixels below the starting point. In a 24-bit image, the result of the function would be a triple indicating the RGB intensities of the specified pixel.
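As a small illustration of this view, the sketch below (assuming NumPy is available) treats a tiny grayscale array as the function f(x, y); note that NumPy arrays are indexed as (row, column), i.e., (y, x).

```python
import numpy as np

# A tiny 8-bit grayscale "image" as a 2D array (rows correspond to y, columns to x).
image = np.array([
    [  0,  64, 128],
    [192, 255,  32],
], dtype=np.uint8)

def f(x: int, y: int) -> int:
    """Treat the image as a function f(x, y): x grows to the right, y downwards."""
    return int(image[y, x])  # note the (row, column) = (y, x) indexing

print(f(0, 0))  # 0  -> black pixel at the top-left starting point
print(f(2, 1))  # 32 -> pixel 2 to the right of and 1 below the starting point
```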
Filters
Filters play an important role in computer vision when it comes to applying effects to an image, implementing techniques like smoothing or inpainting, or extracting useful information from an image, like the detection of corners or edges. A filter can be defined as a func-
tion that gets an image as an input, applies modifications to that image, and returns the
filtered image as an output (Szeliski, 2022).
2D convolution
The convolution of an image I with a kernel k with a size of n and a center coordinate a can
be calculated as follows:
$$I'(x, y) = \sum_{i=1}^{n} \sum_{j=1}^{n} I(x + i - a,\; y + j - a) \cdot k(i, j)$$

where $I'(x, y)$ is the value of the resulting image $I'$ at position $(x, y)$ and $I$ is the original image. The center coordinate for a 3x3 convolution matrix is 2, for a 5x5 convolution
matrix 3 and so forth. To understand the process, we will use the following example of a
3x3 convolution. The kernel matrix used for the convolution is shown in the middle col-
umn of the figure.
Figure 20: 2D Image Convolution
The kernel matrix is moved over each position of the input image. In our input image the
current position is marked orange. In our example we start with the center position of the
image and multiply the image values at this position with the values of the kernel matrix. The
resulting value for the center position of our filtered image is computed as follows:
0 · 41 + 0 · 26 + 0 · 86 + 0 · 27 + 0 · 42 + 1 · 47 + 0 · 44 + 0 · 88 + 0 · 41 = 47
In the next step, we shift the kernel matrix to the next position and compute the new value
of the filtered image:
0 · 26 + 0 · 86 + 0 · 41 + 0 · 42 + 0 · 47 + 1 · 93 + 0 · 88 + 0 · 41 + 0 · 24 = 93
The bottom row in our figure shows the result after all positions of the image have been
multiplied with the kernel matrix.
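The following sketch (assuming NumPy) implements this convolution naively by sliding the kernel over every interior position; the kernel at the end has a single 1 to the right of its center and therefore reproduces the behavior of the worked example, copying the right-hand neighbor of each pixel into the result.

```python
import numpy as np

def convolve2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Naive 2D convolution; border pixels are left unchanged (no padding)."""
    n = kernel.shape[0]   # kernel size (assumed square with odd side length)
    a = n // 2            # offset of the kernel center
    out = image.astype(float).copy()
    height, width = image.shape
    for y in range(a, height - a):
        for x in range(a, width - a):
            region = image[y - a:y + a + 1, x - a:x + a + 1]
            out[y, x] = np.sum(region * kernel)
    return out

# A kernel with a single 1 to the right of the center copies the right-hand
# neighbor into each position, as in the worked example above (47, 93, ...).
kernel = np.array([[0, 0, 0],
                   [0, 0, 1],
                   [0, 0, 0]], dtype=float)
```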
Padding techniques
If convolution techniques are applied to images, we face the problem that in the first and
last rows and columns of an image there will not be enough values to apply the matrix
multiplication with the convolution matrix. To solve this, we can add additional values at
the border of our input images. This process is referred to as padding (Szeliski, 2022).
There are three padding techniques that are commonly used: constant, replication, and
reflection padding.
In constant padding, a constant number (e.g., zero) is used to fill the empty cells. Replica-
tion padding uses a replication of the values from the nearest neighboring cells. In reflec-
tion padding, the value from the opposite side of a pixel is used to fill the cell. For instance,
the cell on the top left will be filled with the value on the bottom right (Szeliski, 2022).
The figure above illustrates how the three padding techniques are applied to an image.
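NumPy's padding modes roughly correspond to these three techniques, as the following sketch shows (the exact reflection convention may differ slightly from the description above):

```python
import numpy as np

image = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])

# Constant padding: fill the border with a fixed value (here zero).
print(np.pad(image, pad_width=1, mode="constant", constant_values=0))
# Replication padding: repeat the value of the nearest border pixel.
print(np.pad(image, pad_width=1, mode="edge"))
# Reflection padding: mirror the values at the image border.
print(np.pad(image, pad_width=1, mode="reflect"))
```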
Distortion
Image processing in computer vision is normally done with the assumption that an image
we receive from a camera is a linear projection of a scene. That means that if we have a
straight line in the real world we can expect it to be a straight line in the digital representa-
tion of the image (Szeliski, 2022). However, in practical scenarios camera lenses often
cause distortion. There are two kinds of distortion, radial and tangential, which are explained in the following.
Radial distortion
Radial distortion appears when lines that are normally straight bend towards the edge of
the camera lens (Wang et al., 2009). The intensity of the distortion depends on the size of
the lens. With smaller lenses we will find higher distortion. Moreover, radial distortion is
also more dominant when wide-angle lenses are used. In general, there are four types of
radial distortion (Szeliski, 2022):
1. Barrel distortion/positive radial distortion: Lines in the center of an image are bent to
the outside.
2. Pincushion distortion/negative radial distortion: Lines in the center of an image are
bent to the inside.
3. Complex distortion/mustache radial distortion: Lines with a combination of positive
and negative distortion.
4. Fisheye radial distortion: Occurs with ultra wide-angle lenses, e.g., a peephole.
Tangential distortion
Besides radial distortion, tangential distortion is another effect that can often be observed
in digital imaging. Tangential distortion is caused if the image sensor unit and the camera
lens are not properly aligned. If the camera lens and the image plane are not parallel, the
distortions will look as shown in the graphic below.
Figure 23: Tangential Distortion
To address distortion in digital image processing, mathematical models like the Brown-
Conrady model (Brown, 1966) can be used to describe and correct the effects of the distor-
tion. To be able to apply those models, it is important that the extrinsic and intrinsic
parameters of the camera are known. These parameters can be determined by calibration.
Calibration
Camera calibration estimates the extrinsic and intrinsic parameters of a camera (Szeliski,
2022). The calibration makes it possible to extract distortion from the images.
Extrinsic characteristics of a camera are, for instance, the orientation in real world coordi-
nates and the position of the camera. The intrinsic characteristics include parameters
such as the optical center, the focal length, and the lens distortion parameters.
If the camera is calibrated properly, distortion can reliably be removed from images, which allows us, for instance, to measure distances and sizes in those images in units such as meters and, therefore, to reconstruct a 3D model of the underlying real-world scene.
Techniques
Figure 24: Principle of the Pinhole Camera
The projection of a point from the 3D real world onto the 2D image plane can be described in two steps:
1. Transform the coordinates from the 3D world to the 3D camera coordinates. For this step, extrinsic parameters, such as the rotation and translation of the camera, are used.
2. Transform the 3D camera coordinates to the 2D image coordinates. In this step, intrin-
sic parameters, such as focal length, distortion parameters, and optical center are
applied.
To map the 3D coordinates from the real world to a two-dimensional image, a 3x4 projec-
tion matrix (often referred to as a camera matrix) is used. When we multiply the 3D coordi-
nates with this matrix, we will receive the 2D coordinates of the projected point on the
image plane.
The figure below illustrates the steps of the projection process when 3D real world coordi-
nates are transformed to the 2D image coordinates.
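The following sketch illustrates this matrix multiplication with made-up parameters; the focal length, optical center, rotation, and translation values are purely illustrative.

```python
import numpy as np

# Hypothetical intrinsic matrix K (focal lengths fx, fy and optical center cx, cy).
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])
# Hypothetical extrinsic parameters: identity rotation and a small translation.
R = np.eye(3)
t = np.array([[0.1], [0.0], [0.0]])

P = K @ np.hstack((R, t))                    # 3x4 projection (camera) matrix

X_world = np.array([1.0, 2.0, 5.0, 1.0])     # 3D point in homogeneous coordinates
x = P @ X_world                              # homogeneous 2D image coordinates
u, v = x[0] / x[2], x[1] / x[2]              # divide by the third component
print(u, v)                                  # pixel coordinates of the projected point
```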
To apply the projection steps illustrated above, we need to know the intrinsic and extrinsic
camera parameters. These can be estimated using camera calibration. To understand the
practical implementation of the calibration process, we will look at Zhang's flexible technique for camera calibration (Zhang, 2000).
This technique uses two or more images as an input as well as the size of the object. A
good object for camera calibration is, for instance, a checkerboard. After the calibration
process, we obtain the extrinsic parameters (rotation and translation) and the intrinsic camera parameters (optical center, focal length, and distortion coefficients). The calibration proceeds in the following steps:
1. Select at least two sample images, which should be well-structured patterns, such as
a checkerboard pattern.
2. Identify distinctive points in each image. If we use a checkerboard pattern, this can,
for instance, be the corners of the individual squares. Because of the clear structure of
the checkerboard pattern with the black and white squares, the corners are easy to
detect. They have a high gradient at the corners in both directions.
3. Localize the corners of the squares. For the checkerboard pattern, this can be
done in a very robust manner. To be able to identify the 3D coordinates of the corners
in the 3D real world, we need to know the size of the checkerboard and need two or
more sample images. Moreover, we know the 2D coordinates of the corners in the
image from the picture that was taken by the camera. Using this information, we can
calculate the camera matrix and the distortion coefficients. The distortion coefficients can then be used, for instance with the Brown-Conrady model (Brown, 1966), to correct the distortion.
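A minimal sketch of this procedure with OpenCV is shown below; the image folder, the checkerboard dimensions, and the square size are assumptions, and cv2.findChessboardCorners and cv2.calibrateCamera stand in for the corner detection and parameter estimation steps described above.

```python
import glob
import cv2
import numpy as np

pattern_size = (9, 6)   # inner corners of the checkerboard (assumed)
square_size = 0.025     # edge length of one square in meters (assumed)

# 3D coordinates of the corners in the checkerboard's own coordinate system (z = 0).
objp = np.zeros((pattern_size[0] * pattern_size[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern_size[0], 0:pattern_size[1]].T.reshape(-1, 2) * square_size

obj_points, img_points = [], []
for path in glob.glob("calibration_images/*.png"):   # hypothetical input folder
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
    found, corners = cv2.findChessboardCorners(gray, pattern_size)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# Estimate the camera matrix and the distortion coefficients from all views.
ret, camera_matrix, dist_coeffs, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print(camera_matrix)   # intrinsic parameters
print(dist_coeffs)     # radial and tangential distortion coefficients
```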
5.3 Feature Detection
In the context of computer vision, features can be defined as points of interest of an
image, which contain the required information to solve a respective problem (Hassaballah
et al., 2016). To find those features in a picture, there exists a large variety of feature detec-
tion algorithms. Once the features are detected, the semantic information about them can
be extracted. The coordinates of a feature, i.e., the position at which it is located in an image, are called the feature keypoint. The semantic information extracted about a feature is stored in a
vector, which is also called a feature descriptor or feature vector. The detection and extrac-
tion of features is often an important part of the preprocessing in machine learning appli-
cations. The extracted feature vectors can subsequently be used as an input for image
classification. In motion tracking or recognition of individuals or similar objects in multi-
ple images, feature matching can be used.
The most common types of features are blobs, edges, and corners. Blobs are formed by a
group of pixels that have some properties in common. Regions that differ in properties
belong to different blobs. This can, for instance, be different color or brightness compared
to the areas surrounding a region. Edges are indicated by a significant change of the
brightness of pixels. They can be identified by a discontinuity of the image intensity, i.e., a
sudden change in the brightness of an image (Jain et al., 1995). Corners are the connec-
tion between two edges. The image below illustrates the difference between blobs (blue),
edges (red), and corners (yellow).
If we want to detect all tomatoes in the picture, we can use an algorithm to detect all
blobs. However, there will still be the challenge of distinguishing tomatoes from other
round objects, like olives or cucumbers. This challenge can be tackled if we use a feature
description algorithm to extract the information that is characteristic of a tomato and con-
struct a feature descriptor from this information. The feature descriptor could, for
instance, include information about the surrounding n pixel values or the color of the pix-
els.
Once we have the feature descriptor for our tomato candidate, it is possible to compare it with other feature descriptors from tomato images using a feature matching algorithm. This feature matching algorithm allows us to detect all the tomatoes in the
image. As we have seen in our example, feature engineering is usually performed in three
steps:
1. Feature detection
2. Feature description/extraction
3. Feature matching
Feature detection
To detect features such as edges or corners, there exist several methods. To detect edges
in images, 2D convolution can be used. Edges are characterized by a significant difference
of the pixel values to the surrounding pixels. If we look at an edge, there will be a clear
difference in brightness and/or color compared to the surrounding pixels.
The figure above shows an example of edge detection. The edge between the road and the
surrounding grass is clearly visible in this example. In the upper left part of the zoomed-in image, we can see some variations of dark green colors, while the lower right part is filled with
variations of light gray. The edge separates both parts of the image.
Two techniques that are commonly used for edge detection are the Canny edge detector
and the Sobel filter. The Canny edge detection (Canny, 1986) analyzes the change between
pixel values. For this purpose, it uses the derivatives of the x and y coordinates. The algo-
rithm works with two-dimensional values, i.e., it works only on single-channel images such as grayscale images. The figure below shows the result of the Canny edge detection in
our example picture.
Figure 29: Example for Canny Edge Detection
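With OpenCV, Canny edge detection can be applied in a few lines; the file name and the two hysteresis thresholds below are illustrative choices.

```python
import cv2

# Canny expects a single-channel image, so the picture is loaded as grayscale.
image = cv2.imread("road.jpg", cv2.IMREAD_GRAYSCALE)
edges = cv2.Canny(image, threshold1=100, threshold2=200)  # lower/upper hysteresis thresholds
cv2.imwrite("road_edges.png", edges)
```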
When using Sobel filters for edge detection, two special kernel matrices are used, one for
each of the axes. These Sobel operators use convolution to transfer the original image into
a gradient image. High frequencies in the gradient image indicate areas with the highest
changes in pixel intensity which are likely to be edges. Therefore, in a second step, the
algorithm is often combined with a threshold function to detect the edges. The figure
below shows the Sobel edge detection for the x and y direction.
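A sketch of this approach is shown below, using the two standard 3x3 Sobel kernels; the input file name and the threshold value are illustrative.

```python
import cv2
import numpy as np

# The standard 3x3 Sobel kernel for the x direction; its transpose covers the y direction.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=np.float64)
sobel_y = sobel_x.T

image = cv2.imread("road.jpg", cv2.IMREAD_GRAYSCALE).astype(np.float64)
gx = cv2.filter2D(image, -1, sobel_x)   # gradient image in x direction
gy = cv2.filter2D(image, -1, sobel_y)   # gradient image in y direction

# Gradient magnitude; applying a threshold turns it into a binary edge map.
magnitude = np.sqrt(gx ** 2 + gy ** 2)
edges = (magnitude > 150).astype(np.uint8) * 255   # threshold chosen arbitrarily
cv2.imwrite("road_sobel_edges.png", edges)
```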
For corner detection in images, one of the most prominent algorithms is the Harris corner
detection (Harris & Stephens, 1988). This algorithm analyzes the change of the pixel values
in a sliding window that is moved in different directions. The sliding window can be as
small as, for instance, 7x7 pixels. The figure illustrates how flat areas, edges, and corners
can be detected using the sliding window technique.
Figure 31: Harris Corner Detection
The left image shows the window in a flat area with no edges or corners. In the underlying
window, there is no significant change in the values of the pixels if the window is moved
in any direction. In the middle image, the window lies on an edge but does not touch another edge. This means we only have a change in the pixel values when we move the window in the horizontal direction. If we move the window in the vertical direction, there will be no changes in the pixel values. In the image on the right, the sliding window is moved over a corner. In this case, we will have a significant change in the pixel values no matter in which direction we move the window.
Therefore, if we want to detect corners, we have to find the window where the change of
the underlying pixels is maximized in all directions. To formalize this idea mathematically,
Harris corner detection uses the Sobel operators which were explained previously.
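In practice, Harris corner detection is available, for instance, in OpenCV; the sketch below uses a 7x7 window, and the file name and parameter values are illustrative.

```python
import cv2
import numpy as np

image = cv2.imread("checkerboard.png")                     # hypothetical input image
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY).astype(np.float32)

# blockSize: size of the sliding window, ksize: Sobel aperture, k: Harris free parameter.
response = cv2.cornerHarris(gray, blockSize=7, ksize=3, k=0.04)

# Keep positions whose corner response exceeds 1 percent of the maximum response.
corners = response > 0.01 * response.max()
image[corners] = (0, 0, 255)                               # mark detected corners in red
cv2.imwrite("corners.png", image)
```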
Feature description
For further processing of the features detected in the feature detection step, it is important
to be able to describe those features in a way that a computer can use them and distin-
guish one from another. For this purpose, we use feature vectors/feature descriptors,
which contain semantic information about the features. One possibility to describe fea-
tures is the Binary Robust Independent Elementary Features (BRIEF) algorithm (Calonder
et al., 2010). To describe a feature, a binary vector is used.
The vector is constructed from an image patch, i.e., a square region with a set pixel width and height, by comparing the intensities of pairs of pixels. First, the patch p is smoothed. Afterwards, the pixel intensity p(x) at position x is computed. In a test τ, the result of the comparison of two positions x and y is coded into a binary value according to the following equation:
$$\tau(p; x, y) := \begin{cases} 1 & \text{if } p(x) < p(y) \\ 0 & \text{otherwise} \end{cases}$$
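A toy version of this test is sketched below; the patch size, the number of comparisons, and the uniform random sampling of point pairs are simplifying assumptions (BRIEF smooths the patch first and draws the pairs from a fixed sampling pattern).

```python
import numpy as np

def brief_descriptor(patch, pairs):
    """Binary descriptor built from intensity comparisons tau(p; x, y) on an image patch."""
    # In the full BRIEF algorithm, the patch would be smoothed (e.g., with a Gaussian) first.
    bits = [1 if patch[y1, x1] < patch[y2, x2] else 0 for (x1, y1), (x2, y2) in pairs]
    return np.array(bits, dtype=np.uint8)

rng = np.random.default_rng(0)
patch = rng.integers(0, 256, size=(31, 31))        # toy 31x31 image patch
pairs = [((rng.integers(0, 31), rng.integers(0, 31)),
          (rng.integers(0, 31), rng.integers(0, 31))) for _ in range(256)]
descriptor = brief_descriptor(patch, pairs)        # 256-bit binary feature vector
```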
The major advantage of the BRIEF algorithm is that it is fast to compute and easy to imple-
ment. However, feature extraction for features that are rotated more than 35 degrees is no
longer accurate (Hassaballah et al., 2016). Algorithms like Oriented FAST and Rotated
BRIEF (ORB) try to overcome this limitation (Rublee et al., 2011).
Another algorithm for feature description is the SIFT algorithm (Scale-Invariant Feature
Transform) (Lowe, 1999). The SIFT algorithm has been enhanced by the SURF algorithm
(Speeded-Up Robust Features) (Bay et al., 2008), which provides a performance improved
variation of the SIFT algorithm. However, as both algorithms have been patented, they
cannot be used as freely as, for instance, ORB. Additionally, compared to ORB, their accuracy is lower and their computational cost is higher (Rublee et al., 2011).
Feature matching
The goal of feature matching is to identify similar features in different images. This could,
for instance, be when detecting the same person in different scenarios. Feature matching
is an important component in tasks like camera calibration, motion tracking, object recog-
nition, and tracking.
One very simple technique for feature matching is brute force matching, which compares
the feature descriptors of the source and target images and computes the distance between those descriptors (Jakubovic & Velagic, 2018). For numeric feature vectors, we can use the Euclidean distance (Wang et al., 2005). For binary vectors, such as those generated by the BRIEF algorithm, the Hamming distance is an appropriate
approach to calculate the distance (Torralba et al., 2008).
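As an illustration, the OpenCV sketch below detects ORB keypoints in two hypothetical images and matches their binary descriptors by brute force using the Hamming distance.

```python
import cv2

img1 = cv2.imread("scene_a.jpg", cv2.IMREAD_GRAYSCALE)   # hypothetical input images
img2 = cv2.imread("scene_b.jpg", cv2.IMREAD_GRAYSCALE)

# ORB produces binary descriptors, so the Hamming distance is the appropriate metric.
orb = cv2.ORB_create()
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)   # brute force matcher
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
print(f"{len(matches)} matches, best distance: {matches[0].distance}")
```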
Especially when dealing with large datasets and high dimensional feature vectors, Fast
Library for Approximate Nearest Neighbors (FLANN) provides a more sophisticated
method for feature matching. It contains a set of algorithms using a nearest neighbors
search and has lower computational costs than brute force matching. The most appropri-
ate algorithm is automatically selected depending on the dataset. However, it is less accu-
rate than brute force matching (Muja & Lowe, 2009).
According to Hassaballah et al. (2016), there are several characteristics a good algorithm
for feature detection and extraction from images should have: robustness, repeatability,
accuracy, generality, efficiency and quantity. The characteristics are explained in the table
below.
Accuracy: Accurate localization of a feature in an image based on its pixel position.
When performing feature detection and extraction on an image, there are several chal-
lenges. While humans can easily identify objects no matter how they are located or lit,
those differences can pose a great challenge for a computer. Therefore, there is still much
ongoing research to develop algorithms that are less prone to factors, such as noise, vary-
ing lighting conditions, changes of camera perspectives, rotation or translation of objects,
and changes of scale.
To perform semantic segmentation, the algorithm receives an image with one or more
objects as an input, and outputs an image where each pixel is labeled according to its cat-
egory. The figure below illustrates how semantic segmentation can be applied to an
image. In the image, every pixel is either categorized as background, chair, or coffee table.
Figure 32: Example for Semantic Segmentation
The convolutional part of the network is used for feature extraction. It transforms the
image from the input into a multidimensional representation of its features. The deconvo-
lution network uses the features that have been extracted from the convolution network
to generate the shapes of the object segmentation. Its unpooling and deconvolution lay-
ers are used to identify class labels based on the pixels and predict the segmentation
masks. It generates a probability map as an output, which has the same size as the input
image. For each pixel this probability map indicates the probability of it belonging to one
of the given classes (Noh et al., 2015). Additionally, to refine the label map, it is possible to apply fully connected conditional random fields to the output of the network (Krähenbühl & Koltun, 2012).

Conditional random fields
An undirected probabilistic model that also considers neighboring samples for classification is known as a conditional random field.

Use Cases

Semantic image segmentation can be helpful in many use cases:
SUMMARY
Computer vision is an interdisciplinary field that combines methods
from computer science, engineering, and artificial intelligence. It dates
back to the 1960s when researchers first tried to mimic the visual system
of humans. Typical tasks in computer vision deal with topics such as rec-
ognition tasks, image restoration, motion analysis, and geometry recon-
struction.
In computer vision, images are represented using pixels. Models like the
Brown-Conrady model can be used to address the distortion of digital
images. Besides that, it is also important to know the calibration param-
eters of a camera to address radial and tangential distortion.
BACKMATTER
LIST OF REFERENCES
Aqel, M. O. A., Marhaban, M. H., Saripan, M. I., & Ismail, N. B. (2016). Review of visual odom-
etry: Types, approaches, challenges, and applications. SpringerPlus, 5(1), 1897. https:/
/doi.org/10.1186/s40064-016-3573-7
Barik, D., & Mondal, M. (2010). Object identification for computer vision using image seg-
mentation. In V. Mahadevan & G. S. Tomar (Eds.), ICETC 2010.2010 2nd international
conference on education technology and computer. IEEE. https://doi.org/10.1109/ICET
C.2010.5529412
Bay, H., Ess, A., Tuytelaars, T., & van Gool, L. (2008). Speeded-up robust features (SURF).
Computer Vision and Image Understanding, 110(3), 346–359.
Beel, J., Gipp, B., Langer, S., & Breitinger, C. (2016). Research-paper recommender systems:
A literature survey. International Journal on Digital Libraries, 17(4), 305–338. https://do
i.org/10.1007/s00799-015-0156-0
Buchanan, B. G. (2005). A (very) brief history of artificial intelligence. AI Mag, 26, 53–60.
Calonder, M., Lepetit, V., Strecha, C., & Fua, P. (2010). BRIEF: Binary Robust Independent
Elementary Features. In K. Daniilidis, P. Maragos, & N. Paragios (Eds.), Lecture Notes in
Computer Science. Computer Vision – ECCV 2010 (Vol. 6314, pp. 778–792). Springer. htt
ps://doi.org/10.1007/978-3-642-15561-1_56
Cer, D., Yang, Y., Kong, S., Hua, N., Limtiaco, N. L. U., John, R. S., Constant, N., Guajardo-
Céspedes, M., Yuan, S., Tar, C., Sung, Y., Strope, B., & Kurzweil, R. (2018). Universal Sen-
tence Encoder. EMNLP Demonstration. https://arxiv.org/abs/1803.11175
Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2016, June 7). Semantic
Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs. https://d
oi.org/10.48550/arXiv.1412.7062
Chen, Y., Tian, Y., & He, M. (2020). Monocular human pose estimation: A survey of deep
learning-based methods. Computer Vision and Image Understanding, 192, 102897.
Chidambaram, M., Yang, Y., Cer, D., Yuan, S., Sung, Y.–H., Strope, B., & Kurzweil, R. (2019).
Learning cross-lingual sentence representations via a multi-task dual-encoder model. ht
tp://arxiv.org/pdf/1810.12836v4
Crevier, D. (1993). AI: The tumultuous history of the search for artificial intelligence. Basic
Books, Inc.
D’Acunto, F., Prabhala, N., & Rossi, A. G. (2019). The promises and pitfalls of robo-advising.
The Review of Financial Studies, 32(5), 1983–2020. https://doi.org/10.1093/rfs/hhz014
Devlin, J., Chang, M.–W., Lee, K., & Toutanova, K. (2018, October 11). BERT: Pre-training of
Deep Bidirectional Transformers for Language Understanding. http://arxiv.org/pdf/181
0.04805v2
Dhruv, B., Mittal, N., & Modi, M. (2017). Analysis of different filters for noise reduction in
images. In 2017 Recent Developments in Control, Automation & Power Engi-
neering (RDCAPE) (pp. 410–415). IEEE. https://doi.org/10.1109/RDCAPE.2017.8358306
Blosch, M., & Fenn, J. (2018, August 20). Understanding Gartner’s hype cycles. Gartner. http
s://www.gartner.com/en/documents/388776
Gartner. (2021, September 7). Gartner identifies four trends driving near-term artificial intel-
ligence innovation [Press release]. https://www.gartner.com/en/newsroom/press-rele
ases/2021-09-07-gartner-identifies-four-trends-driving-near-term-artificial-intelligenc
e-innovation
Ghosh, A., & Veale, D. T. (2016). Fracking sarcasm using neural network. In A. Balahur, E.
van der Goot, P. Vossen, & A. Montoyo (Eds.), Proceedings of the 7th workshop on com-
putational approaches to subjectivity, sentiment and social media snalysis (pp. 161–
169). Association for Computational Linguistics. https://doi.org/10.18653/v1/W16-042
5
Giles, M. (2018, December 19). The man turning China into a quantum superpower. MIT
Technology Review. https://www.technologyreview.com/2018/12/19/1571/the-man-t
urning-china-into-a-quantum-superpower/
Giles, T. D. (2016). Aristotle writing science. An application of his theory. Journal of Techni-
cal Writing and Communication, 46(1), 83–104. https://doi.org/10.1177/0047281615600
633
Grace, K., Salvatier, J., Dafoe, A., Zhang, B., & Evans, O. (2017, May 24). When will AI exceed
human performance? Evidence from AI Experts. https://arxiv.org/abs/1705.08807?xtor=
AL-32280680#:~:text=Researchers%20believe%20there%20is%20a,much%20sooner%
20than%20North%20Americans.
Han, X.–F., Laga, H., & Bennamoun, M. (2021). Image-based 3D object reconstruction:
State-of-the-art and trends in the deep learning era. IEEE Transactions on Pattern Anal-
ysis and Machine Intelligence, 43(5), 1578–1604. https://doi.org/10.1109/TPAMI.2019.2
954885
Harris, C., & Stephens, M. (1988, September). A combined corner and edge detector. In C. J.
Taylor (Ed.), Procedings of the Alvey Vision Conference 1988 (23.1-23.6). Alvey Vision
Club. https://doi.org/10.5244/C.2.23
Hassaballah, M., Abdelmgeid, A. A., & Alshazly, H. A. (2016). Image Features Detection,
Description and Matching. In A. I. Awad & M. Hassaballah (Eds.), Image Feature Detec-
tors and Descriptors: Foundations and Applications (Vol. 630, pp. 11–45). Springer
International Publishing. https://doi.org/10.1007/978-3-319-28854-3_2
Holler, M. J. (2012). Von Neumann, Morgenstern, and the creation of game theory: From
Chess to Social Science, 1900–1960 [Review of the book Von Neumann, Morgenstern,
and the creation of game theory: From Chess to Social Science, 1900–1960, by R. Leo-
nard]. The European Journal of the History of Economic Thought, 19(1), 131–135.
Horgan, J. (1993). The mastermind of artificial intelligence. Scientific American, 269(5), 35–
38. https://doi.org/10.1038/scientificamerican1193-35
Hutchins, J. (1995). "The wisky was invisible", or persistent myths of MT. MT News Interna-
tional, 11, 17–18. https://aclanthology.org/www.mt-archive.info/90/MTNI-1995-Hutchi
ns.pdf
Hutchins, J. (1997). From first conception to first demonstration: The nascent years of
machine translation, 1947–1954. A chronology. Machine Translation, 12(3), 195–252. ht
tps://doi.org/10.1023/A:1007969630568
Işın, A., Direkoğlu, C., & Şah, M. (2016). Review of MRI-based brain tumor image segmenta-
tion using deep learning methods. Procedia Computer Science, 102, 317–324. https://d
oi.org/10.1016/j.procs.2016.09.407
Islam, N., Islam, Z., & Noor, N. (2017, October 3). A survey on optical character recognition
system. Arxiv. http://arxiv.org/pdf/1710.05703v1
Iyyer, M., Manjunatha, V., Boyd-Graber, J., & Daumé III, H. (2015). Deep unordered compo-
sition rivals syntactic methods for text classification. In C. Zong & M. Strube (Eds.), Pro-
ceedings of the 53rd annual meeting of the association for computational linguistics and
the 7th international joint conference on natural language processing: Vol. 1. Long
papers (pp. 1681–1691). Association for Computational Linguistics. https://doi.org/10.
3115/v1/P15-1162
Jain, R., Kasturi, R., & Schunck, B. G. (1995). Machine vision. McGraw-Hill Professional.
Jakubovic, A., & Velagic, J. (2018). Image feature matching and object detection using
brute-force matchers. In M. Muštra, M. Grgić, B. Zovko-Cihlar & D. Vitas (Eds.), Proceed-
ings of ELMAR-2018. 60th international symposium ELMAR-2018 (pp. 83–86). IEEE. https:/
/doi.org/10.23919/ELMAR.2018.8534641
Kaddari, Z., Mellah, Y., Berrich, J., Belkasmi, M. G., & Bouchentouf, T. (2021). Natural lan-
guage processing: Challenges and future directions. In T. Masrour, I. El Hassani & A.
Cherrafi (Eds.), Lecture Notes in Networks and Systems. Artificial intelligence and
industrial applications (Vol. 144, pp. 236–246). Springer International Publishing. https
://doi.org/10.1007/978-3-030-53970-2_22
Kaymak, Ç., & Uçar, A. (2019). A brief survey and an application of semantic image seg-
mentation for autonomous driving. In V. E. Balas, S. S. Roy, D. Sharma, & P. Samui
(Eds.), Handbook of Deep Learning Applications (Vol. 136, pp. 161–200). Springer Inter-
national Publishing. https://doi.org/10.1007/978-3-030-11479-4_9
Kim, Y., Petrov, P., Petrushkov, P., Khadivi, S., & Ney, H. (2019, September 20). Pivot-based
transfer learning for neural machine translation between non-English languages. http:/
/arxiv.org/pdf/1909.09524v1
Kiros, R., Zhu, Y., Salakhutdinov, R., Zemel, R. S., Torralba, A., Urtasun, R., & Fidler, S. (2015,
June 22). Skip-thought vectors. http://arxiv.org/pdf/1506.06726v1
Koehn, P., & Knowles, R. (2017, June 13). Six challenges for neural machine translation. http
://arxiv.org/pdf/1706.03872v1
Krähenbühl, P., & Koltun, V. (2012, 20 October). Efficient inference in fully connected CRFs
with Gaussian edge potentials. Advances in Neural Information Processing Systems, 24,
109—117. https://doi.org/10.48550/arXiv.1210.5644
Kuipers, M., & Prasad, R. (2021). Journey of Artificial Intelligence. Wireless Personal Com-
munications, 123, 3275—3290. https://doi.org/10.1007/s11277-021-09288-0
Kurzweil, R. (2014). The singularity is near. In R. L. Sandler (Ed.), Ethics and emerging tech-
nologies (pp. 393–406). Palgrave Macmillan. https://doi.org/10.1057/9781137349088_2
6
Laguarta, J., Hueto, F., & Subirana, B. (2020). COVID-19 artificial intelligence diagnosis
using only cough recordings. IEEE Open Journal of Engineering in Medicine and Biology,
1, 275–281. https://doi.org/10.1109/OJEMB.2020.3026928
Leonard, R. (2010). Von Neumann, Morgenstern, and the creation of game theory: From
chess to social science, 1900–1960. Historical perspectives on modern economics.
Cambridge University Press. https://search.ebscohost.com/login.aspx?direct=true&sc
ope=site&db=nlebk&db=nlabk&AN=783042
Li, B., Shi, Y., Qi, Z., & Chen, Z. (2018). A survey on semantic segmentation. In H. Tong, Z. Li,
F. Zhu & J. Yu (Eds.), 18th IEEE international conference on data mining workshops.
ICDMW 2018 (pp. 1233–1240). IEEE. https://doi.org/10.1109/ICDMW.2018.00176
Liu, Y., Gall, J., Stoll, C., Dai, Q., Seidel, H.-P., & Theobalt, C. (2013). Markerless motion cap-
ture of multiple characters using multiview image segmentation. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 35(11), 2720–2735. https://doi.org/10.1109/
TPAMI.2013.47
Long, J., Shelhamer, E., & Darrell, T. (2015, March 8). Fully convolutional networks for
semantic segmentation. https://doi.org/10.48550/arXiv.1411.4038
Lyra, M., Ploussi, A., & Georgantzoglou, A. (2011). MATLAB as a tool in nuclear medicine
image processing. In C. Ionescu (Ed.), MATLAB - A ubiquitous tool for the practical engi-
neer. IntechOpen. https://doi.org/10.5772/19999
Masnick, M. (2014, June 9). No, a 'Supercomputer' did not pass the Turing test for the first
time and everyone should know better. Techdirt. https://www.techdirt.com/articles/20
140609/07284327524/no-computer-did-not-pass-turing-test-first-time-everyone-shoul
d-know-better.shtml
May, C., Ferraro, F., McCree, A., Wintrode, J., Garcia-Romero, D., & van Durme, B. (2015).
Topic identification and discovery on text and speech. In L. Màrquez, C. Callison-
Burch, & J. Su (Eds.), Proceedings of the 2015 conference on empirical methods in natu-
ral language processing (pp. 2377–2387). Association for Computational Linguistics. ht
tps://doi.org/10.18653/v1/D15-1285
McCarthy, J., Minsky, M. L., Rochester, N., & Shannon, C. E. (1955). A proposal for the Dart-
mouth Summer Research Project on Artificial Intelligence. AI Magazine, 27(4). https://d
oi.org/10.1609/aimag.v27i4.1904
McKinsey & Company (2021). Global survey: The state of AI in 2021. https://www.mckinsey.c
om/~/media/McKinsey/Business%20Functions/McKinsey%20Analytics/Our%20Insigh
ts/Global%20survey%20The%20state%20of%20AI%20in%202021/Global-survey-The-
state-of-AI-in-2021.pdf
Mihalcea, R., & Tarau, P. (2004). TextRank: Bringing Order into Text. Proceedings of the
2004 Conference on empirical methods in natural language processing (pp. 404–411).
https://aclanthology.org/W04-3252
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013, September 7). Efficient estimation of
word representations in vector space. http://arxiv.org/pdf/1301.3781v3
Muja, M., & Lowe, D. G. (2009, February 5–8). Fast approximate nearest neighbors with
automatic algorithm configuration. In A. K. Ranchordas & H. Araújo (Eds.), Proceedings
of the fourth international conference on computer vision theory and applications (pp.
331–340). SciTePress. https://doi.org/10.5220/0001787803310340
Nasukawa, T., & Yi, J. (2003). Sentiment analysis: Capturing favorability using natural lan-
guage processing. In J. Gennari, B. Porter, & Y. Gil (Eds.), Proceedings of the 2nd Inter-
national Conference on Knowledge Capture, 70–77. https://doi.org/10.1145/945645.94
5658
Negnevitsky, M. (2011). Artificial Intelligence: A Guide to Intelligent Systems (3rd ed.). Addi-
son Wesley.
Newquist, H. P. (1994). The brain makers: The history of artificial intelligence – Genius, ego,
And greed in the quest for machines that think. Sams Publishing.
Nilsson, N. J. (2009). The quest for artificial intelligence. Cambridge University Press. https:
//doi.org/10.1017/CBO9780511819346
Noh, H., Hong, S., & Han, B. (2015, May 17). Learning deconvolution network for semantic
segmentation. http://arxiv.org/pdf/1505.04366v1
O’Mahony, N., Campbell, S., Carvalho, A., Harapanahalli, S., Hernandez, G. V., Krpalkova,
L., Riordan, D., & Walsh, J. (2020). Deep learning vs. traditional computer vision. In K.
Arai & S. Kapoor (Eds.), Advances in intelligent systems and computing. Advances in
computer vision (Vol. 943, pp. 128–144). Springer International Publishing. https://doi.
org/10.1007/978-3-030-17795-9_10
Pennington, J., Socher, R., & Manning, C. (2014). Glove: Global vectors for word representa-
tion. In Q. C. R. I. Alessandro Moschitti, G. Bo Pang, & U. o. A. Walter Daelemans (Eds.),
Proceedings of the 2014 conference on empirical methods in natural language process-
ing (EMNLP) (pp. 1532–1543). Association for Computational Linguistics. https://doi.or
g/10.3115/v1/D14-1162
Pollatos, V., Kouvaras, L., & Charou, E. (2020, October 13). Land cover semantic segmenta-
tion using ResUNet. http://arxiv.org/pdf/2010.06285v1
PricewaterhouseCoopers. (2018). Sizing the prize. What’s the real value of AI for your busi-
ness and how can you capitalize? https://www.pwc.com/gx/en/issues/analytics/assets/
pwc-ai-analysis-sizing-the-prize-report.pdf
Rublee, E., Rabaud, V., Konolige, K., & Bradski, G. (2011, November 6–13). ORB: An efficient
alternative to SIFT or SURF. In 2011 International Conference on Computer Vision (ICCV
2011) (pp. 2564–2571). IEEE. https://doi.org/10.1109/ICCV.2011.6126544
Russell, S. J., & Norvig, P. (2022). Artificial intelligence: A modern approach (4th ed.). Pear-
son.
Schwartz, O. (2019). In the 17th century, Leibniz dreamed of a machine that could calculate
ideas. The machine would use an “alphabet of human thoughts” and rules to combine
them. IEEE Spectrum. https://spectrum.ieee.org/in-the-17th-century-leibniz-dreamed
-of-a-machine-that-could-calculate-ideas
Searle, J. R. (1980). Minds, brains, and programs. Behavioral and Brain Sciences, 3(3), 417–
424. https://doi.org/10.1017/S0140525X00005756
Sharif, W., Samsudin, N. A., Deris, M. M., & Naseem, R. (2016, August 24–26). Effect of nega-
tion in sentiment analysis. In E. Ariwa (Ed.), 2016 sixth international conference on
innovative computing technology (INTECH) (pp. 718–723). IEEE. https://doi.org/10.1109
/INTECH.2016.7845119
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schritt-
wieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D.,
Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Grae-
pel, T., & Hassabis, D. (2016). Mastering the game of Go with deep neural networks and
tree search. Nature, 529(7587), 484–489. https://doi.org/10.1038/nature16961
Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L.,
Kumaran, D., Graepel, T., Lillicrap, T., Simonyan, K., & Hassabis, D. (2018). A general
reinforcement learning algorithm that masters chess, shogi, and Go through self-play.
Science, 362(6419), 1140–1144. https://doi.org/10.1126/science.aar6404
Smith, S. W. (1997). The scientist and engineer's guide to digital signal processing. Califor-
nia Technical Publ.
Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine
Learning, 3(1), 9–44. https://doi.org/10.1023/A:1022633531479
Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction (2nd ed.).
Adaptive computation and machine learning. MIT Press.
Szeliski, R. (2022). Computer vision: Algorithms and applications (2nd ed.). Springer Inter-
national Publishing. https://doi.org/10.1007/978-3-030-34372-9
Torralba, A., Fergus, R., & Weiss, Y. (2008, June 23–28). Small codes and large image data-
bases for recognition. In 2008 IEEE conference on computer vision and pattern recogni-
tion (pp. 1–8). IEEE. https://doi.org/10.1109/CVPR.2008.4587633
Turing, A. M. (1950). Computing machinery and intelligence. Mind, LIX(236), 433–460. https
://doi.org/10.1093/mind/LIX.236.433
van Otterlo, M., & Wiering, M. (2012). Reinforcement Learning and Markov Decision Proc-
esses. In M. Wiering & M. van Otterlo (Eds.), Reinforcement Learning: State-of-the-Art
(Vol. 12, pp. 3–42). Springer. https://doi.org/10.1007/978-3-642-27645-3_1
Wang, A., Qiu, T., & Shao, L. (2009). A simple method of radial distortion correction with
centre of distortion estimation. Journal of Mathematical Imaging and Vision, 35(3),
165–172. https://doi.org/10.1007/s10851-009-0162-1
Wang, L., Zhang, Y., & Feng, J. (2005). On the Euclidean distance of images. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence, 27(8), 1334–1339. https://doi.org/10
.1109/TPAMI.2005.165
Weizenbaum, J. (1966). ELIZA—A computer program for the study of natural language
communication between man and machine. Communications of the ACM, 9(1), 36–45.
https://doi.org/10.1145/365153.365168
Wiley, V., & Lucas, T. (2018). Computer Vision and Image Processing: A paper review. Inter-
national Journal of Artificial Intelligence Research, 2(1), 29–36. https://doi.org/10.2909
9/ijair.v2i1.42
Yilmaz, A., Javed, O., & Shah, M. (2006). Object tracking: A survey. ACM Computing Surveys,
38(4), 13. https://doi.org/10.1145/1177352.1177355
Zhang, D., Mishra, S., Brynjolfsson, E., Etchemendy, J., Ganguli, D., Grosz, B., Lyons, T.,
Manyika, J., Niebles, J. C., Sellitto, M., Shoham, Y., Clark, J., & Perrault, R. (2021, March
9). The AI index 2021 annual report. https://aiindex.stanford.edu/wp-content/uploads/
2021/11/2021-AI-Index-Report_Master.pdf
Zhang, Z. (2000). A flexible new technique for camera calibration. IEEE transactions on pat-
tern analysis and machine intelligence, 22(11), 1330–1334. https://doi.org/10.1109/34.8
88718
Zimmermann, T., Kotschenreuther, L., & Schmidt, K. (2016, June 21). Data-driven HR -
Résumé analysis based on natural language processing and machine learning. https://
doi.org/10.48550/arXiv.1606.05611