
Preface

Natural language processing (NLP), which aims to enable computers to process human languages intelligently, is an important interdisciplinary field spanning artificial intelligence, computing science, cognitive science, information processing, and
linguistics. Concerned with interactions between computers and human languages,
NLP applications such as speech recognition, dialog systems, information retrieval,
question answering, and machine translation have started to reshape the way people
identify, obtain, and make use of information.
The development of NLP can be described in terms of three major waves: ra-
tionalism, empiricism, and deep learning. In the first wave, rationalist approaches
advocated the design of hand-crafted rules to incorporate knowledge into NLP sys-
tems based on the assumption that knowledge of language in the human mind is
fixed in advance by genetic inheritance. In the second wave, empirical approaches
assume that rich sensory input and the observable language data in surface form are
required and sufficient to enable the mind to learn the detailed structure of natural
language. As a result, probabilistic models were developed to discover the regu-
larities of languages from large corpora. In the third wave, deep learning exploits
hierarchical models of non-linear processing, inspired by biological neural systems
to learn intrinsic representations from language data, in ways that aim to simulate
human cognitive abilities.
The intersection of deep learning and natural language processing has resulted in
striking successes in practical tasks. Speech recognition is the first industrial NLP
application that deep learning has strongly impacted. With the availability of large-
scale training data, deep neural networks achieved dramatically lower recognition
errors than the traditional empirical approaches. Another prominent successful ap-
plication of deep learning in NLP is machine translation. End-to-end neural ma-
chine translation that models the mapping between human languages using neural
networks has proven to improve translation quality substantially. Therefore, neural
machine translation has quickly become the new de facto technology in major com-
mercial online translation services: Google, Microsoft, Facebook, Baidu, and more.
Many other areas of NLP, including language understanding and dialogue, lexical
analysis and parsing, knowledge graph, information retrieval, question answering
from text, social computing, language generation, and text sentiment analysis, have
also seen significant progress using deep learning, riding on the third wave of NLP. Nowadays, deep learning is the dominant method applied to practically all
NLP tasks.
The main goal of this book is to provide a comprehensive survey on the re-
cent advances in deep learning applied to NLP. The book presents the state of the art of NLP-centric deep learning research and focuses on the role played by deep learning in major NLP applications, including spoken language understanding, di-
alogue systems, lexical analysis, parsing, knowledge graph, machine translation,
question answering, sentiment analysis, social computing, and natural language gen-
eration (from images). This book is suitable for readers with a technical background
in computation, including graduate students, post-doctoral researchers, educators,
and industrial researchers, and anyone interested in getting up to speed with the lat-
est techniques of deep learning associated with NLP.
The book is organized into eleven chapters as follows:
• Chapter 1: A Joint Introduction to Natural Language Processing and to Deep
Learning (Li Deng and Yang Liu)
• Chapter 2: Deep Learning in Conversational Language Understanding (Gokhan
Tur, Asli Celikyilmaz, Xiaodong He, Dilek Hakkani-Tür, and Li Deng)
• Chapter 3: Deep Learning in Spoken and Text-based Dialogue Systems (Asli
Celikyilmaz, Li Deng, and Dilek Hakkani-Tür)
• Chapter 4: Deep Learning in Lexical Analysis and Parsing (Wanxiang Che and
Yue Zhang)
• Chapter 5: Deep Learning in Knowledge Graph (Zhiyuan Liu and Xianpei Han)
• Chapter 6: Deep Learning in Machine Translation (Yang Liu and Jiajun Zhang)
• Chapter 7: Deep Learning in Question Answering (Kang Liu and Yansong Feng)
• Chapter 8: Deep Learning in Sentiment Analysis (Duyu Tang and Meishan
Zhang)
• Chapter 9: Deep Learning in Social Computing (Xin Zhao and Chenliang Li)
• Chapter 10: Deep Learning in Natural Language Generation from Images (Xi-
aodong He and Li Deng)
• Chapter 11: Epilogue (Li Deng and Yang Liu)
Chapter 1 first reviews the basics of NLP as well as the main scope of NLP cov-
ered in the following chapters of the book, and then goes in some depth into the
historical development of NLP summarized as three waves and future directions.
Then, an in-depth survey on the recent advances in deep learning applied to NLP is
organized into nine separate chapters, each covering a largely independent applica-
tion area of NLP. The main body of each chapter is written by leading researchers
and experts actively working in the respective field.
The origin of this book was the set of comprehensive tutorials given at the 15th
China National Conference on Computational Linguistics (CCL 2016) held in Oc-
tober 2016 in Yantai, Shandong, China, where both of us, editors of this book, were
active participants and took leading roles. We thank Springer’s senior
editor, Dr. Celine Lanlan Chang, who kindly invited us to create this book and who
has been providing much timely assistance needed to complete this book. We
are grateful also to Springer’s Assistant Editor, Jane Li, for offering invaluable help
through various stages of manuscript preparation.
We thank all authors of Chapters 2-10 who devoted their valuable time care-
fully preparing the content of their chapters: Gokhan Tur, Asli Celikyilmaz, Dilek
Hakkani-Tur, Wanxiang Che, Yue Zhang, Xianpei Han, Zhiyuan Liu, Jiajun Zhang,
Kang Liu, Yansong Feng, Duyu Tang, Meishan Zhang, Xin Zhao, Chenliang Li,
and Xiaodong He. The authors of Chapters 4-9 are CCL 2016 tutorial speakers.
They spent a considerable amount of time in updating their tutorial material with
the latest advances in the field since October 2016.
Further, we thank numerous reviewers and readers, including Sadaoki Furui, Andrew Ng, Fred Juang, Ken Church, Haifeng Wang, and Hongjiang Zhang, who not only gave us much-needed encouragement but also offered many constructive comments that substantially improved earlier drafts of the book.
Finally, we give our appreciation to our organizations, Microsoft Research and Citadel (for Li Deng) and Tsinghua University (for Yang Liu), which provided excellent environments, support, and encouragement that have been instrumental for us to complete this book.

Li Deng, Seattle, USA


Yang Liu, Beijing, China
October, 2017
Chapter 1
A Joint Introduction to Natural Language
Processing and to Deep Learning
Li Deng1 and Yang Liu2
1 Citadel, Chicago and Seattle, USA
2 Tsinghua University, Beijing, China

l.deng@ieee.org, liuyang2011@tsinghua.edu.cn

Abstract In this chapter, we set up the fundamental framework for the book. We
first provide an introduction to the basics of natural language processing (NLP) as an
integral part of artificial intelligence. We then survey the historical development of
NLP, spanning over five decades, in terms of three waves. The first two waves arose
as rationalism and empiricism, paving the way for the current deep learning wave. The
key pillars underlying the deep learning revolution for NLP consist of 1) distributed
representations of linguistic entities via embedding, 2) semantic generalization due
to the embedding, 3) long-span deep sequence modeling of natural language, 4)
hierarchical networks effective for representing linguistic levels from low to high,
and 5) end-to-end deep learning methods to jointly solve many NLP tasks. After
the survey, several key limitations of current deep learning technology for NLP are
analyzed. This analysis leads to five research directions for future advances in NLP.

1.1 Natural Language Processing: The Basics

Natural language processing (NLP) investigates the use of computers to process or to understand human (i.e. natural) languages for the purpose of performing use-
ful tasks. NLP is an interdisciplinary field that combines computational linguistics,
computing science, cognitive science, and artificial intelligence. From a scientific
perspective, NLP aims to model the cognitive mechanisms underlying the under-
standing and production of human languages. From an engineering perspective,
NLP is concerned with how to develop novel practical applications to facilitate the
interactions between computers and human languages. Typical applications in NLP
include speech recognition, spoken language understanding, dialogue systems, lex-
ical analysis, parsing, machine translation, knowledge graph, information retrieval,
question answering, sentiment analysis, and social computing. These NLP applica-
tion areas form the core content of this book.
Natural language is a system constructed specifically to convey meaning or se-
mantics, and is by its fundamental nature a symbolic or discrete system. The surface or observable “physical” signal of natural language is called text, always in a symbolic form. The text “signal” has its counterpart — the speech signal; the latter
can be regarded as the continuous correspondence of symbolic text, both entail-
ing the same latent linguistic hierarchy of natural language. From NLP and signal
processing perspectives, speech can be treated as a “noisy” version of text, imposing the additional difficulty of “de-noising” when performing the task of understanding the common underlying semantics. Chapters 2 and 3, as well as the current Chapter 1 of this book, cover the speech aspect of NLP in some detail, while
the remaining chapters start directly from text in discussing a wide variety of text-
oriented tasks that exemplify the pervasive NLP applications enabled by machine
learning techniques, notably deep learning.
The symbolic nature of natural language is in stark contrast to the continuous
nature of language’s neural substrate in the human brain. We will defer this discus-
sion to Section 6 of this chapter when discussing future challenges of deep learning
in NLP. A related contrast is how the symbols of natural language are encoded in
several continuous-valued modalities, such as gesture (as in sign language), hand-
writing (as an image), and, of course, speech. On the one hand, the word as a symbol
is used as a “signifier” to refer to a concept or a thing in the real world as a “signified”
object, necessarily a categorical entity. On the other hand, the continuous modalities
that encode symbols of words constitute the external signals sensed by the human
perceptual system and transmitted to the brain, which in turn operates in a continu-
ous fashion. While of great theoretical interest, the subject of contrasting the sym-
bolic nature of language versus its continuous rendering and encoding goes beyond
the scope of this book.
In the next few sections, we outline and discuss, from a historical perspective, the
development of general methodology used to study NLP as a rich interdisciplinary
field. Much like several closely related sub- and super-fields such as conversational
systems, speech recognition, and artificial intelligence, the development of NLP can be
described in terms of three major waves (Deng, 2017; Pereira, 2017), each of which
is elaborated in a separate section next.

1.2 The First Wave: Rationalism

NLP research in its first wave lasted for a long time, dating back to the 1950s. In 1950,
Alan Turing proposed the Turing test to evaluate a computer’s ability to exhibit in-
telligent behavior indistinguishable from that of a human (Turing, 1950). This test
is based on natural language conversations between a human and a computer de-
signed to generate human-like responses. In 1954, the Georgetown-IBM experiment
demonstrated the first machine translation system capable of translating more than
sixty Russian sentences into English.
The approaches, based on the belief that knowledge of language in the human
mind is fixed in advance by genetic inheritance, dominated most of NLP research between about 1960 and the late 1980s. These approaches have been called rationalist
ones (Church, 2007). The dominance of rationalist approaches in NLP was mainly
due to the widespread acceptance of arguments of Noam Chomsky for an innate
language structure and his criticism of N-grams (Chomsky, 1957). Postulating that
key parts of language are hardwired in the brain at birth as a part of the human
genetic inheritance, rationalist approaches endeavored to design hand-crafted rules
to incorporate knowledge and reasoning mechanisms into intelligent NLP systems.
Up until the 1980s, the most notable and successful NLP systems, such as ELIZA for simulating
a Rogerian psychotherapist and MARGIE for structuring real-world information
into concept ontologies, were based on complex sets of hand-written rules.
This period coincided approximately with the early development of artificial in-
telligence, characterized by expert knowledge engineering, where domain experts
devised computer programs according to their knowledge of the (very narrow) application domains (Nilsson, 1982; Winston, 1993). The experts designed
these programs using symbolic logical rules based on careful representations and en-
gineering of such knowledge. These knowledge-based artificial intelligence systems
tend to be effective in solving narrow-domain problems by examining the “head” or
most important parameters and reaching a solution about the appropriate action to
take in each specific situation. These “head” parameters are identified in advance
by human experts, leaving the “tail” parameters and cases untouched. Since such systems lack learning capability, they have difficulty generalizing their solutions to new
situations and domains. The typical approach during this period is exemplified by
the expert system, a computer system that emulates the decision-making ability of a
human expert. Such systems are designed to solve complex problems by reasoning
about knowledge (Nilsson, 1982). The first expert systems were created in the 1970s and then proliferated in the 1980s. The main “algorithm” used was the inference rules in the
form of “if-then-else” (Jackson, 1998). The main strength of these first-generation artificial intelligence systems is their transparency and interpretability within their (limited) capability for logical reasoning. Like NLP systems such as ELIZA
and MARGIE, the general expert systems in the early days used handcrafted expert
knowledge which was often effective in narrowly-defined problems, although the
reasoning could not handle uncertainty that is ubiquitous in practical applications.
In specific NLP application areas of dialogue systems and spoken language un-
derstanding, to be described in more detail in Chapters 2 and 3 of this book, such
rationalistic approaches were represented by the pervasive use of symbolic rules and
templates (Seneff et al., 1991). The designs were centered on grammatical and on-
tological constructs, which, while interpretable and easy to debug and update, experienced severe difficulties in practical deployment. When such systems worked, they often worked beautifully; unfortunately, this did not happen very often, and the domains were necessarily limited.
Likewise, speech recognition research and system design, another long-standing
NLP and artificial intelligence challenge, during this rationalist era were based
heavily on the paradigm of expert knowledge engineering, as elegantly analyzed
in (Church and Mercer, 1993). During the 1970s and early 1980s, the expert-system
approach to speech recognition was quite popular (Reddy, 1976; Zue, 1985). How-
ever, the lack of abilities to learn from data and to handle uncertainty in reasoning
was acutely recognized by researchers, leading to the second wave of speech recog-
nition, NLP, and artificial intelligence described next.

1.3 The Second Wave: Empiricism

The second wave of NLP was characterized by the exploitation of data corpora and
of (shallow) machine learning, statistical or otherwise, to make use of such data
(Manning and Schütze, 1999). As much of the structure of and theory about nat-
ural language were discounted or discarded in favor of data-driven methods, the
main approaches developed during this era have been called empirical or prag-
matic ones (Church and Mercer, 1993; Church, 2014). With the increasing avail-
ability of machine-readable data and steady increase of computational power, em-
pirical approaches have dominated NLP since around 1990. One of the major NLP
conferences was even named “Empirical Methods in Natural Language Process-
ing (EMNLP)” to reflect most directly the strongly positive sentiment of NLP re-
searchers during that era towards empirical approaches.
In contrast to rationalist approaches, empirical approaches assume that the hu-
man mind only begins with general operations for association, pattern recognition,
and generalization. Rich sensory input is required to enable the mind to learn the
detailed structure of natural language. Prevalent in linguistics between 1920 and
1960, empiricism has been undergoing a resurgence since 1990. Early empirical
approaches to NLP focused on developing generative models such as the hidden
Markov model (Baum and Petrie, 1966), the IBM translation models (Brown et al.,
1993), and the head-driven parsing models (Collins, 1997) to discover the regulari-
ties of languages from large corpora. Since the late 1990s, discriminative models have
become the de facto approach in a variety of NLP tasks. Representative discrimina-
tive models and methods in NLP include the maximum entropy model (Ratnaparkhi,
1997), support vector machines (Vapnik, 1998), conditional random fields (Laf-
ferty et al., 2001), maximum mutual information and minimum classification error
(He et al., 2008), and perceptron (Collins, 2002).
Again, this era of empiricism in NLP was paralleled with corresponding ap-
proaches in artificial intelligence as well as in speech recognition and computer vi-
sion. It came about after clear evidence that learning and perception capabilities are
crucial for complex artificial intelligence systems but missing in the expert systems
popular in the previous wave. For example, when DARPA opened its first Grand
Challenge for autonomous driving, most vehicles then relied on the knowledge-
based artificial intelligence paradigm. Much like speech recognition and NLP, the
autonomous driving and vision researchers immediately realized the limitation of
the knowledge-based paradigm due to the necessity for machine learning with un-
certainty handling and generalization capabilities.
The empiricism in NLP and speech recognition in this second wave was based
on data-intensive machine learning, which we now call “shallow” due to the general
lack of abstractions constructed by many-layer or “deep” representations of data
which would come in the third wave to be described in the next section. In ma-
chine learning, researchers do not need to concern themselves with constructing precise and
exact rules as required for the knowledge-based NLP and speech systems during the
first wave. Rather, they focus on statistical models (Bishop, 2006; Murphy, 2012)
or simple neural networks (Bishop, 1995) as an underlying engine. They then au-
tomatically learn or “tune” the parameters of the engine using ample training data
to make them handle uncertainty, and to attempt to generalize from one condition
to another and from one domain to another. The key algorithms and methods for
machine learning include EM (expectation-maximization), Bayesian networks, sup-
port vector machines, decision trees, and, for neural networks, the backpropagation algorithm.
Generally speaking, the machine learning based NLP, speech, and other artificial
intelligence systems perform much better than the earlier, knowledge-based counter-
parts. Successful examples include almost all artificial intelligence tasks in machine
perception — speech recognition (Jelinek, 1998), face recognition (Viola and Jones,
2004), visual object recognition (Fei-Fei and Perona, 2005), handwriting recogni-
tion (Plamondon and Srihari, 2000), and machine translation (Och, 2003).
More specifically, in the core NLP application area of machine translation, to be described in detail in Chapter 6 of this book as well as in (Church and Mercer, 1993), the field switched rather abruptly around 1990 from rationalistic meth-
ods outlined in Section 2 to empirical, largely statistical methods. The availability of
sentence-level alignments in the bilingual training data made it possible to acquire
surface-level translation knowledge not by rules but from data directly, at the ex-
pense of discarding or discounting structured information in natural languages. The
most representative work during this wave is that empowered by various versions of
IBM translation models (Brown et al., 1993). Subsequent developments during this
empiricist era of machine translation further significantly improved the quality of
translation systems (Och and Ney, 2002; Och, 2003; He and Deng, 2003; Chiang,
2007; He and Deng, 2012), but not to the level of massive deployment in the real world
(which would come after the next, deep learning wave).
In the dialogue and spoken language understanding areas of NLP, this empiri-
cist era was also marked prominently by data-driven machine learning approaches.
These approaches were well suited to meet the requirement for quantitative evalu-
ation and concrete deliverables. They focused on broader but shallow, surface-level
coverage of text and domains instead of detailed analyses of highly restricted text
and domains. The training data were used not to design rules for language under-
standing and response action from the dialogue systems but to learn parameters of
(shallow) statistical or neural models automatically from data. Such learning helped
reduce the cost of hand-crafted complex dialogue manager’s design, and helped im-
prove robustness against speech recognition errors in the overall spoken language
understanding and dialogue systems; for a review, see (He and Deng, 2013). More
specifically, for the dialogue policy component of dialogue systems, powerful rein-
forcement learning based on Markov decision processes had been introduced during
this era; for a review, see (Young et al., 2013). And for spoken language understand-
ing, the dominant methods moved from rule- or template-based ones during the first
wave to generative models like hidden Markov models (Wang et al., 2011) to dis-
criminative models like conditional random fields (Tur and Deng, 2011).
Similarly, in speech recognition, over close to 30 years from the early 1980s to
around 2010, the field was dominated by the (shallow) machine learning paradigm
using a statistical generative model called the Hidden Markov Model (HMM) in-
tegrated with Gaussian mixture models, along with various versions of its general-
ization (Baker et al., 2009a,b; Deng and O’Shaughnessy, 2003; Rabiner and Juang,
1993). Among many versions of the generalized HMMs were statistical and neural-
net based hidden dynamic models (Deng, 1998; Bridle et al., 1998; Deng and Yu,
2007). The former adopted EM and switching extended Kalman filter algorithms
for learning model parameters (Ma and Deng, 2004; Lee et al., 2004), and the latter
used back-propagation (Picone et al., 1999). Both of them made extensive use of
multiple latent layers of representations for the generative process of speech wave-
forms following the long-standing framework of analysis-by-synthesis in human
speech perception. More significantly, inverting this “deep” generative process to
its counterpart of an end-to-end discriminative process gave rise to the first indus-
trial success of deep learning (Deng et al., 2010, 2013; Hinton et al., 2012), which
formed a driving force of the third wave of speech recognition and NLP that will be
elaborated next.

1.4 The Third Wave: Deep Learning

While the NLP systems, including speech recognition, language understanding, and
machine translation, developed during the second wave performed a lot better and
with higher robustness than those during the first wave, they were far from human-
level performance and left much to desire. With a few exceptions, the (shallow)
machine learning models for NLP often did not have sufficient capacity to absorb the large amounts of training data. Further, the learning algorithms, meth-
ods, and infrastructures were not powerful enough. All this changed several years
ago, giving rise to the third wave of NLP, propelled by the new paradigm of deep-
structured machine learning or Deep Learning (Bengio, 2009; Deng and Yu, 2014;
LeCun et al., 2015; Goodfellow et al., 2016).
In traditional machine learning, features are designed by humans and feature
engineering is a bottleneck, requiring significant human expertise. Concurrently,
the associated shallow models lack the representation power and hence the ability
to form levels of decomposable abstractions that would automatically disentangle
complex factors in shaping the observed language data. Deep learning breaks away from these difficulties through the use of deep, layered model structures, often in the form of
neural networks, and the associated end-to-end learning algorithms. The advances in
deep learning are one major driving force behind the current NLP and more general
artificial intelligence inflection point and are responsible for the resurgence of neural
networks with a wide range of practical, including business, applications (Parloff,
2016).
More specifically, despite the success of (shallow) discriminative models in a number of important NLP tasks developed during the second wave, they suffered
from the difficulty of covering all regularities in languages by designing features
manually with domain expertise. Besides the incompleteness problem, such shal-
low models also face the sparsity problem, as features usually occur only once in the training data, especially for highly sparse high-order features. Therefore, feature design had become one of the major obstacles in statistical NLP before deep learning came to the rescue. Deep learning brings hope for addressing the human feature en-
gineering problem, with a view called “NLP from scratch” (Collobert et al., 2011),
which was in early days of deep learning considered highly unconventional. Such
deep learning approaches exploit the powerful neural networks that contain multiple
hidden layers to solve general machine learning tasks dispensing with feature engi-
neering. Unlike shallow neural networks and related machine learning models, deep
neural networks are capable of learning representations from data using a cascade
of multiple layers of non-linear processing units for feature extraction. As higher
level features are derived from lower level features, these levels form a hierarchy of
concepts.
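To make the notion of a hierarchy of learned features concrete, the following minimal sketch (in Python with NumPy; the layer sizes, random weights, and ReLU activation are illustrative assumptions, not taken from any system described here) passes a raw input vector through a cascade of non-linear layers, each building a higher-level representation on top of the previous one.

```python
import numpy as np

def relu(x):
    # a simple non-linear processing unit
    return np.maximum(0.0, x)

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(256, 64))   # raw input -> low-level features
W2 = rng.normal(scale=0.1, size=(64, 32))    # low-level -> mid-level features
W3 = rng.normal(scale=0.1, size=(32, 8))     # mid-level -> high-level, more abstract features

x = rng.normal(size=256)                     # a raw input vector (illustrative)
h1 = relu(x @ W1)                            # first level of representation
h2 = relu(h1 @ W2)                           # second level, built on the first
h3 = relu(h2 @ W3)                           # top level of the hierarchy of concepts
print(h3.shape)                              # (8,)
```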
Deep learning originated from artificial neural networks, which can be viewed
as cascading models of cell types inspired by biological neural systems. With the
advent of the back-propagation algorithm (Rumelhart et al., 1986), training deep neural networks from scratch attracted intensive attention in the 1990s. In these early days,
without large amounts of training data and without proper design and learning meth-
ods, during neural network training the learning signals vanish exponentially with
the number of layers (or more rigorously the depth of credit assignment) when prop-
agated from layer to layer, making it difficult to tune connection weights of deep
neural networks, especially the recurrent versions. Hinton et al. (2006) initially over-
came this problem by using unsupervised pre-training to first learn generally useful
feature detectors. Then, the network is further trained by supervised learning to clas-
sify labeled data. As a result, it is possible to learn the distribution of a high-level
representation using low-level representations. This seminal work marks the revival
of neural networks. A variety of network architectures have since been proposed
and developed, including deep belief networks (Hinton et al., 2006), stacked auto-
encoders (Vincent et al., 2010), deep Boltzmann machines (Hinton and Salakhutdi-
nov, 2012), deep convolutional neural networks (Krizhevsky et al., 2012), deep stack-
ing networks (Deng et al., 2012), and deep Q-networks (Mnih et al., 2015). Capable
of discovering intricate structures in high-dimensional data, deep learning has since
2010 been successfully applied to real-world tasks in artificial intelligence including
notably speech recognition (Yu et al., 2010; Hinton et al., 2012), image classification
(Krizhevsky et al., 2012; He et al., 2016), and NLP (all chapters in this book). De-
tailed analyses and reviews of deep learning have been provided in a set of tutorial
survey articles (Deng, 2014; LeCun et al., 2015; Juang, 2016).
As speech recognition is one of the core tasks in NLP, we briefly discuss it here due to its importance as the first real-world industrial NLP application strongly impacted by deep learning. Industrial applications of deep learning to large-scale
speech recognition started to take off around 2010. The endeavor was initiated with
a collaboration between academia and industry, with the original work presented
at the 2009 NIPS Workshop on Deep Learning for Speech Recognition and Re-
lated Applications. The workshop was motivated by the limitations of deep gener-
ative models of speech, and the possibility that the big-compute, big-data era war-
rants a serious exploration of deep neural networks. It was believed then that pre-
training DNNs using generative models of deep belief nets based on the contrastive-
divergence learning algorithm would overcome the main difficulties of neural nets
encountered in the 1990s (Dahl et al., 2011; Mohamed et al., 2009). However, early
into this research at Microsoft, it was discovered that without contrastive-divergence
pre-training, but with the use of large amounts of training data together with the
deep neural networks designed with corresponding large, context-dependent output
layers and with careful engineering, dramatically lower recognition errors could be
obtained than then-state-of-the-art (shallow) machine learning systems (Yu et al.,
2010, 2011; Dahl et al., 2012). This finding was quickly verified by several other
major speech recognition research groups in North America (Hinton et al., 2012;
Deng et al., 2013) and subsequently overseas. Further, the nature of recognition
errors produced by the two types of systems was found to be characteristically dif-
ferent, offering technical insights into how to integrate deep learning into the exist-
ing highly efficient, run-time speech decoding system deployed by major players in
speech recognition industry (Yu and Deng, 2015; Abdel-Hamid et al., 2014; Xiong
et al., 2016; Saon et al., 2017). Nowadays, back-propagation algorithm applied to
deep neural nets of various forms is uniformly used in all current state-of-the-art
speech recognition systems (Yu and Deng, 2015; Amodei et al., 2016; Saon et al.,
2017), and all major commercial speech recognition systems — Microsoft Cortana,
Xbox, Skype Translator, Amazon Alexa, Google Assistant, Apple Siri, Baidu and
iFlyTek voice search, and more — are based on deep learning methods.
The striking success of speech recognition in 2010-2011 heralded the arrival of
the third wave of NLP and artificial intelligence. Quickly following the success of
deep learning in speech recognition, computer vision (Krizhevsky et al., 2012) and
machine translation (Bahdanau et al., 2015) were taken over by a similar deep learning paradigm. In particular, while the powerful technique of neural word embedding was developed as early as 2001 (Bengio et al., 2001), it was not until more than ten years later that it was shown to be practically useful at large scale (Mikolov et al., 2013), due to the availability of big data and faster com-
putation. In addition, a large number of other real-world NLP applications, such
as image captioning (Karpathy and Fei-Fei, 2015; Fang et al., 2015; Gan et al.,
2017), visual question answering (Fei-Fei and Perona, 2016), speech understanding
(Mesnil et al., 2013), web search (Huang et al., 2013b), and recommendation sys-
tems, have been made successful due to deep learning, in addition to many non-NLP
tasks including drug discovery and toxicology, customer relationship management,
recommendation systems, gesture recognition, medical informatics, advertisement,
medical image analysis, robotics, self-driving vehicles, board and eSports games
(e.g. Atari, Go, Poker, and the latest, DOTA2), and so on. For more details, see
https://en.wikipedia.org/wiki/Deep_learning.
In more specific text-based NLP application areas, machine translation is perhaps the one impacted the most by deep learning. Advancing from the shallow statistical
machine translation developed during the second wave of NLP, the current best
machine translation systems in real world applications are based on deep neural
networks. For example, Google announced the first stage of its move to neural ma-
chine translation in September 2016 and Microsoft made a similar announcement
two months later. Facebook has been working on the conversion to neural machine
translation for about a year, and by August 2017 it was at full deployment. Details of
the deep learning techniques in these state-of-the-art large-scale machine translation
systems will be reviewed in Chapter 6.
In the area of spoken language understanding and dialogue systems, deep learn-
ing is also making a huge impact. The current popular techniques maintain and
expand the statistical methods developed during the second-wave era in several ways.
Like the empirical, (shallow) machine learning methods, deep learning is also based
on data-intensive methods to reduce the cost of hand-crafted complex understand-
ing and dialogue management, to be robust against speech recognition errors in noisy environments and against language understanding errors, and to exploit the
power of Markov decision processes and reinforcement learning for designing di-
alogue policy; e.g. (Gasic et al., 2017; Dhingra et al., 2017). Compared with the
earlier methods, deep neural net models and representations are much more power-
ful and they make end-to-end learning possible. However, deep learning has not yet
solved the problems of interpretability and domain scalability associated with ear-
lier empirical techniques. Details of the deep learning techniques popular for current
spoken language understanding and dialogue systems as well as their challenges will
be reviewed in Chapters 2 and 3.
Two important recent technological breakthroughs brought about in applying
deep learning to NLP problems are sequence-to-sequence learning (Sutskever et al., 2014) and attention modeling (Bahdanau et al., 2015). Sequence-to-sequence learning introduces the powerful idea of using recurrent nets to carry out both encod-
ing and decoding in an end-to-end manner. While attention modeling was initially
developed to overcome the difficulty of encoding a long sequence, subsequent de-
velopments significantly extended its power to provide highly flexible alignment of
two arbitrary sequences that can be learned together with neural net parameters. The
key concepts of sequence-to-sequence learning and of attention mechanism boosted
the performance of neural machine translation based on distributed word embedding
over the best system based on statistical learning and local representations of words
and phrases. Soon after this success, these concepts have also been applied success-
fully to a number of other NLP-related tasks such as image captioning (Karpathy
and Fei-Fei, 2015; Devlin et al., 2015), speech recognition (Chorowski et al., 2015),
meta learning for program execution, one-shot learning, syntactic parsing, lip read-
ing, text understanding, summarization, question answering, and more.
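As a rough illustration of the attention idea just described, the following sketch (Python/NumPy; the dimensions and the dot-product scoring function are assumptions chosen for simplicity) computes a context vector for one decoder step as a softmax-weighted sum of encoder states, i.e., a soft, learnable alignment between two sequences.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
T, d = 6, 8                                # source length and hidden size (illustrative)
encoder_states = rng.normal(size=(T, d))   # one hidden vector per source token
decoder_state = rng.normal(size=d)         # current decoder hidden state

scores = encoder_states @ decoder_state    # dot-product relevance of each source position
weights = softmax(scores)                  # soft alignment over the source sequence
context = weights @ encoder_states         # attention-weighted summary fed to the decoder

print(weights.round(2), context.shape)     # alignment weights and the (d,) context vector
```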
Setting aside their huge empirical successes, models of neural-network based
deep learning are often simpler and easier to design than the traditional machine
learning models developed in the earlier wave. In many applications, deep learning
is performed simultaneously for all parts of the model, from feature extraction all
the way to prediction, in an end-to-end manner. Another factor contributing to the simplicity of neural network models is that the same model building blocks (i.e. the
different types of layers) are generally used in many different applications. Using
the same building blocks for a large variety of tasks makes the adaptation of models
used for one task or data to another task or data relatively easy. In addition, software
toolkits have been developed to allow faster and more efficient implementation of
these models. For these reasons, deep neural networks are nowadays a prominent
method of choice for a large variety of machine learning and artificial intelligence
tasks over large datasets including, prominently, NLP tasks.
Although deep learning has proven effective in reshaping the processing of
speech, images, and videos in a revolutionary way, its effectiveness is less clear-cut when intersecting deep learning with text-based NLP, despite its empirical successes in
a number of practical NLP tasks. In speech, image and video processing, deep learn-
ing effectively addresses the semantic gap problem by learning high-level concepts
from raw perceptual data in a direct manner. However, in NLP, stronger theories
and structured models on morphology, syntax, and semantics have been advanced
to distill the underlying mechanisms of understanding and generation of natural lan-
guages, which have not been as easily compatible with neural networks. Compared
with speech, image, and video signals, it seems less straightforward to see that the
neural representations learned from textual data can provide equally direct insights into natural language. Therefore, applying neural networks, especially those having
sophisticated hierarchical architectures, to NLP has received increasing attention
and has become the most active area in both NLP and deep learning communi-
ties, with highly visible progress made in recent years (Deng, 2016; Manning and
Socher, 2017). Surveying the advances and analyzing the future directions in deep
learning for NLP form the main motivation for us to write this chapter and to create
this book, with the desire to help NLP researchers further accelerate the research at its current fast pace.

1.5 Transitions from Now to the Future

Before analyzing the future directions of NLP with more advanced deep learning, here
we first summarize the significance of the transition from the past waves of NLP to
the present one. We then discuss some clear limitations and challenges of the present
deep learning technology for NLP, to pave a way to examining further development
that would overcome these limitations for the next wave of innovations.

1.5.1 From Empiricism to Deep Learning: A Revolution

On the surface, the rising wave of deep learning discussed in Section 4 of this chapter appears to be a simple push of the second, empiricist wave of NLP (Section 3)
into an extreme end with bigger data, larger models, and greater computing power.
After all, the fundamental approaches developed during both waves are data-driven,
are based on machine learning and computation, and have dispensed with human-
centric “rationalistic” rules that are often brittle and costly to acquire in practical
NLP applications. However, if we analyze these approaches holistically and at a
deeper level, we can identify aspects of conceptual revolution moving from empiri-
cist machine learning to deep learning, and can subsequently analyze the future di-
rections of the field (Section 6). This revolution, in our opinion, is no less significant
than the revolution from the earlier rationalist wave to the empiricist one, as analyzed at
the beginning and end of the latter (Church and Mercer, 1993; Charniak, 2011).
Empiricist machine learning and linguistic data analysis during the second NLP
wave was started in the early 1990s by cryptanalysts and computer scientists working on natural language sources that were highly limited in vocabulary and application do-
mains. As we discussed in Section 3, surface-level text observations, i.e. words and
their sequences, are counted using discrete probabilistic models without relying on
deep structure in natural language. The basic representations were “one-hot” or lo-
calist, where no semantic similarity between words was exploited. With restrictions
in domains and associated text content, such structure-free representations and em-
pirical models are often sufficient to cover much of what needs to be covered. That
is, the shallow, count-based statistical models can naturally do well in limited and
specific NLP tasks. But when the domain and content restrictions are lifted for more
realistic NLP applications in the real world, count-based models would necessarily become ineffective, no matter how many smoothing tricks have been invented in
an attempt to mitigate the problem of combinatorial counting sparseness. This is
where deep learning for NLP truly shines — distributed representations of words
via embedding, semantic generalization due to the embedding, longer-span deep
sequence modeling, and end-to-end learning methods have all contributed to beat-
ing empiricist, count-based methods in a wide range of NLP tasks as discussed in
Section 4.
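The contrast between localist and distributed representations can be made concrete with a few lines of code; the toy vectors below are hypothetical and serve only to illustrate why one-hot representations assign zero similarity to related words while embeddings support semantic generalization.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "one-hot" (localist) representations: every word is orthogonal to every other word
vocab = ["cat", "dog", "car"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}
print(cosine(one_hot["cat"], one_hot["dog"]))   # 0.0: no similarity is captured

# distributed embeddings (toy, hand-picked values): related words lie close together
emb = {
    "cat": np.array([0.9, 0.8, 0.1]),
    "dog": np.array([0.85, 0.75, 0.2]),
    "car": np.array([0.1, 0.2, 0.95]),
}
print(cosine(emb["cat"], emb["dog"]))           # high: supports semantic generalization
print(cosine(emb["cat"], emb["car"]))           # low: unrelated words stay apart
```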

1.5.2 Limitations of Current Deep Learning Technology

Despite the spectacular successes of deep learning in NLP tasks, most notably in
speech recognition/understanding, language modeling, and in machine translation,
there remain huge challenges. The current deep learning methods based on neu-
ral networks as a black box generally lack interpretability, let alone explainability, in contrast to the “rationalist” paradigm established during the first
NLP wave where the rules devised by experts were naturally explainable. In prac-
tice, however, it is highly desirable to explain the predictions from a seemingly
“black-box” model, not only for improving the model but for providing the users of
the prediction system with interpretations of the suggested actions to take (Koh and
Liang, 2017).
In a number of applications, deep learning methods have been proven to give recognition accuracy close to or exceeding that of humans, but they require considerably more
training data, power consumption, and computing resources than humans. Also, the
accuracy results are statistically impressive but often unreliable on an individual basis. Further, most of the current deep learning models have no reasoning and explain-
ing capabilities, making them vulnerable to disastrous failures or attacks without the
ability to foresee and thus to prevent them. Moreover, the current NLP models have
not taken into account the need for developing and executing goals and plans for
decision making via ultimate NLP systems. A more specific limitation of current
NLP methods based on deep learning is their poor ability to understand and reason about inter-sentential relationships, although huge progress has been made in modeling relationships among words and phrases within sentences.
As discussed earlier, the success of deep learning in NLP has largely come from
a simple strategy thus far — given an NLP task, apply standard sequence models
based on (bi-directional) LSTMs, add attention mechanisms if information required
in the task needs to flow from another source, and then train the full models in an
end-to-end manner. However, while sequence modeling is naturally appropriate for
speech, human understanding of natural language (in text form) requires more com-
plex structure than sequence. That is, current sequence-based deep learning systems
for NLP can be further advanced by exploiting modularity, structured memories, and
recursive, tree-like representations for sentences and larger text (Manning, 2016).
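For concreteness, the standard recipe referred to at the start of this paragraph can be sketched as follows (in Python with PyTorch; the vocabulary size, dimensions, and the tagging task are illustrative assumptions): a bi-directional LSTM encodes the token sequence, a linear output layer makes per-token predictions, and the whole model can be trained end-to-end.

```python
import torch
import torch.nn as nn

# the "standard recipe": a (bi-directional) LSTM encoder over the token sequence,
# followed by a task-specific output layer, trained end-to-end
class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=64, hidden=128, n_labels=10):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, n_labels)   # 2x for the two directions

    def forward(self, token_ids):
        h, _ = self.lstm(self.emb(token_ids))   # contextualized states per token
        return self.out(h)                      # per-token label scores

model = BiLSTMTagger()
scores = model(torch.randint(0, 1000, (2, 7)))   # batch of 2 sentences, 7 tokens each
print(scores.shape)                              # torch.Size([2, 7, 10])
```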
To overcome the challenges outlined above and to achieve the ultimate success
of NLP as a core artificial intelligence field, both fundamental and applied research
are needed. The next new wave of NLP and artificial intelligence will not come until
researchers create new paradigmatic, algorithmic, and computational (including hard-
ware) breakthroughs. Here we outline several high-level directions toward potential
breakthroughs.

1.6 Future Directions of NLP

1.6.1 Neural-Symbolic Integration

A potential breakthrough is in developing advanced deep learning models and methods that are more effective than current methods in building, accessing, and exploit-
ing memories and knowledge, including, in particular, common-sense knowledge.
It is not clear how to best integrate the current deep learning methods, centered on
distributed representations (of everything), with explicit, easily interpretable, and
localist-represented knowledge about natural language and the world and with re-
lated reasoning mechanisms.
One path to this goal is to seamlessly combine neural networks and symbolic lan-
guage systems. These NLP and artificial intelligence systems will aim to discover
by themselves the underlying causes or logical rules that shape their prediction and
decision-making processes interpretable to human users in symbolic natural language forms. Recent, very preliminary work in this direction made use of an in-
tegrated neural-symbolic representation called tensor-product neural memory cells,
capable of decoding back to symbolic forms. This structured neural representation
is provably lossless in the coded information after extensive learning within the
neural-tensor domain (Palangi et al., 2017; Smolensky et al., 2016; Lee et al., 2016).
Extensions of such tensor-product representations, when applied to NLP tasks such
as machine reading and question answering, are aimed to learn to process and under-
stand massive natural language documents. After learning, the systems will be able
not only to answer questions sensibly but also to truly understand what they read, to the extent that they can convey such understanding to human users by providing clues as to
what steps have been taken to reach the answer. These steps may be in the form of
logical reasoning expressed in natural language which is thus naturally understood
by the human users of this type of machine reading and comprehension systems. In
our view, natural language understanding is not just to accurately predict an answer
from a question with relevant passages or data graphs as its contextual knowledge
in a supervised way after seeing many examples of matched questions-passages-
answers. Rather, the desired NLP system equipped with real understanding should
resemble human cognitive capabilities. As an example of such capabilities (Nguyen
et al., 2017) — after an understanding system is trained well, say, in a question
answering task (using supervised learning or otherwise), it should master all essen-
tial aspects of the observed text material provided to solve the question answering
tasks. What such mastering entails is that the learned system can subsequently per-
form well on other NLP tasks, e.g. translation, summarization, and recommendation,
etc., without seeing additional paired data such as raw text data with its summary,
or parallel English and Chinese texts, etc.
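To give a flavor of the tensor-product idea, here is a generic filler/role binding sketch (Python/NumPy). It is a textbook-style illustration of tensor-product representations with made-up vectors, not the specific neural memory cells of the cited work: symbols (“fillers”) are bound to structural “roles” by outer products, and with orthonormal roles the symbolic content can be decoded back losslessly.

```python
import numpy as np

# bind each symbol ("filler") to a structural "role" via an outer product,
# sum the bindings into one tensor, then recover fillers by unbinding with
# the (orthonormal) role vectors
rng = np.random.default_rng(0)
d_f, n_roles = 8, 3

fillers = {w: rng.normal(size=d_f) for w in ["john", "loves", "mary"]}
roles = np.linalg.qr(rng.normal(size=(n_roles, n_roles)))[0]  # orthonormal role vectors (rows)

# encode the sequence as a single d_f x n_roles tensor
T = sum(np.outer(f, r) for f, r in zip(fillers.values(), roles))

# unbind: with orthonormal roles, the original filler comes back exactly
recovered = T @ roles[1]
print(np.allclose(recovered, fillers["loves"]))   # True: lossless decoding back to the symbol
```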
One way to examine the nature of such powerful neural-symbolic systems is
to regard them as ones incorporating the strength of the “rationalist” approaches
marked by expert reasoning and structure richness popular during the first wave of
NLP discussed in Section 2. Interestingly, prior to the rise of the deep learning (third) wave of NLP, Church (2007) argued that the pendulum from rationalist to empiricist approaches had swung too far at almost the peak of the second NLP wave, and predicted that a new rationalist wave would arrive. However, rather than swinging back to a renewed rationalist era of NLP, the deep learning era arrived in full force within just a short period from the time of writing of Church (2007). Instead of adding the
rationalist flavor, deep learning has been pushing empiricism of NLP to its pinnacle
with big data and big compute, and with conceptually revolutionary ways of repre-
senting a sweeping range of linguistic entities by massive parallelism and distribut-
edness, thus drastically enhancing the generalization capability of new-generation
NLP models. Only after the sweeping successes of current deep learning methods
for NLP (Section 4) and subsequent analyses of a series of their limitations, do re-
searchers look into the next wave of NLP — not swinging back to rationalism while
abandoning empiricism but developing more advanced deep learning paradigms that
would organically integrate the missing essence of rationalism into the structured
neural methods that are aimed to approach human cognitive functions for language.
1.6.2 Structure, Memory, and Knowledge

As discussed earlier in this chapter as well as in the current NLP literature (Man-
ning and Socher, 2017), NLP researchers at present still have very primitive deep
learning methods for exploiting structure and for building and accessing memories
or knowledge. While LSTM (with attention) has been pervasively applied to NLP
tasks to beat many NLP benchmarks, LSTM is far from a good memory model
for human cognition. In particular, LSTM lacks adequate structure for simulating
episodic memory, a key component of human cognitive ability that allows one to retrieve and re-experience aspects of a past novel event or thought. This ability gives rise to
one-shot learning skills and can be crucial in reading comprehension of natural lan-
guage text or speech understanding, as well as reasoning over events described by
natural language. Many recent studies have been devoted to better memory mod-
eling, including external memory architectures with supervised learning (Vinyals
et al., 2016; Kaiser et al., 2017) and augmented memory architectures with reinforcement learning (Graves et al., 2016; Oh et al., 2016). However, these have not shown general effectiveness and have suffered from a number of limitations, most notably scalability (arising from the use of attention, which has to access every stored el-
ement in the memory). Much work remains in the direction of better modeling of
memory and exploitation of knowledge for text understanding and reasoning.

1.6.3 Unsupervised and Generative Deep Learning

Another potential breakthrough in deep learning for NLP is in new algorithms for
unsupervised deep learning, which ideally makes use of no direct teaching signals paired with inputs (token by token) to guide the learning. Word embedding discussed
in Section 4 can be viewed as a weak form of unsupervised learning, making use
of adjacent words as “cost-free” surrogate teaching signals, but for real-world NLP
prediction tasks, such as translation, understanding, summarization, etc., such em-
bedding obtained in an “unsupervised manner” has to be fed into another supervised architecture that requires costly teaching signals. In truly unsupervised learning, which requires no expensive teaching signals, new types of objective functions and new optimization algorithms are needed; e.g., the objective function for unsupervised learning should not require explicit target label data aligned with the input data, as does the cross-entropy objective most popular in supervised learning. Development
of unsupervised deep learning algorithms has been significantly behind that of su-
pervised and reinforcement deep learning where back-propagation and Q-learning
algorithms have been reasonably mature.
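As a small illustration of how adjacent words can act as “cost-free” surrogate teaching signals, the sketch below (Python; the window size and whitespace tokenization are illustrative assumptions) turns raw, unlabeled text into the kind of (center word, context word) training pairs used to learn word embeddings, with no human-provided labels.

```python
# build (center, context) pairs from raw text: the context words act as
# surrogate "labels" for the center word, so no manual annotation is needed
text = "deep learning enables distributed representations of words"
tokens = text.split()
window = 2  # illustrative context window size

pairs = []
for i, center in enumerate(tokens):
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j != i:
            pairs.append((center, tokens[j]))

print(pairs[:4])
# e.g. [('deep', 'learning'), ('deep', 'enables'), ('learning', 'deep'), ('learning', 'enables')]
```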
The most recent preliminary development in unsupervised learning takes the ap-
proach of exploiting sequential output structure and advanced optimization methods
to alleviate the need for using labels in training prediction systems (Russell and Ste-
fano, 2017; Liu et al., 2017). Future advances in unsupervised learning are promis-
ing by exploiting new sources of learning signals including the structure of input
data and the mapping relationships from input to output and vice versa. Exploiting
the relationship from output to input is closely connected to building conditional
generative models. To this end, the recent popular topic in deep learning — genera-
tive adversarial networks (Goodfellow et al., 2017) — is a highly promising direc-
tion where the long-standing concept of analysis-by-synthesis in pattern recognition
and machine learning is likely to return to the spotlight in the near future in solving NLP
tasks in new ways.
Generative adversarial networks have been formulated as neural nets, with dense
connectivity among nodes and with no probabilistic setting. On the other hand,
probabilistic and Bayesian reasoning, which often takes computational advantage
of sparse connections among “nodes” as random variables, has been one of the
principal theoretical pillars of machine learning and has been responsible for many
NLP methods developed during the empiricist wave of NLP discussed in Section 3.
What is the right interface between deep learning and probabilistic modeling? Can
probabilistic thinking help understand deep learning techniques better and motivate
new deep learning methods for NLP tasks? How about the other way around? These
issues are widely open for future research.

1.6.4 Multi-modal and Multi-task Deep Learning

Multi-modal and multi-task deep learning are related learning paradigms, both con-
cerning the exploitation of latent representations in the deep networks pooled from
different modalities (e.g. audio, speech, video, images, text, source codes, etc.) or
from multiple cross-domain tasks (e.g. point and structured prediction, ranking, rec-
ommendation, time-series forecasting, clustering, etc.). Before the deep learning
wave, multi-modal and multi-task learning had been very difficult to make effective, due to the lack of intermediate representations that are shared across modalities or tasks. For a striking example of this contrast in multi-task learning, compare multilingual speech recognition during the empiricist wave (Lin et al., 2008) with that during the deep learning wave (Huang et al., 2013a).
Multi-modal information can be exploited as low-cost supervision. For instance,
standard speech recognition, image recognition, and text classification methods
make use of supervision labels within each of the speech, image, and text modalities
separately. This, however, is far from how children learn to recognize speech and
images and to classify text. For example, children often get the distant "supervision"
signal for speech sounds from an adult pointing to an image scene, text, or handwriting
associated with the speech sounds. Similarly, children learning image categories may
exploit speech sounds or text as supervision signals. This type of learning that occurs
in children can motivate a learning scheme that leverages multi-modal data to improve
engineering systems for multi-modal deep learning. A similarity measure needs to be
defined in a common semantic space, into which speech, image, and text are all mapped
via deep neural networks that may be trained
using maximum mutual information across different modalities. The huge potential
of this scheme has not yet been fully explored in the NLP literature.
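As a rough sketch of what such a scheme might look like in code (the encoder shapes, random stand-in features, and the simple softmax matching loss below are illustrative assumptions, not a published model), two modality-specific networks map image and text features into one shared space where a similarity measure is defined and matched pairs are pulled together:

    import numpy as np

    rng = np.random.default_rng(0)

    def mlp(x, w1, w2):
        """Tiny two-layer encoder mapping raw features into the shared semantic space."""
        return np.tanh(x @ w1) @ w2

    # Illustrative dimensions: image features (2048-d), text features (300-d), shared space (128-d).
    d_img, d_txt, d_hid, d_shared = 2048, 300, 512, 128
    w_img = [rng.normal(0, 0.01, (d_img, d_hid)), rng.normal(0, 0.01, (d_hid, d_shared))]
    w_txt = [rng.normal(0, 0.01, (d_txt, d_hid)), rng.normal(0, 0.01, (d_hid, d_shared))]

    # A mini-batch of paired image/text examples (random stand-ins for real data).
    batch = 8
    img_feats = rng.normal(size=(batch, d_img))
    txt_feats = rng.normal(size=(batch, d_txt))

    img_emb = mlp(img_feats, *w_img)
    txt_emb = mlp(txt_feats, *w_txt)

    def normalize(v):
        return v / np.linalg.norm(v, axis=1, keepdims=True)

    # Cosine similarity between every image and every text in the batch.
    sim = normalize(img_emb) @ normalize(txt_emb).T          # shape (batch, batch)

    # Softmax over candidate texts for each image; maximizing the log-probability of the
    # true pairing (the diagonal) pushes matched pairs together across modalities.
    logits = sim * 10.0                                      # temperature, an arbitrary choice here
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss = -np.mean(np.diag(log_probs))
    print(f"cross-modal matching loss: {loss:.3f}")

Maximizing the diagonal log-probabilities here is only a crude stand-in for the cross-modal maximum-mutual-information criterion mentioned above; a full system would back-propagate such a loss to train the modality encoders jointly.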
Similar to multi-modal deep learning, multi-task deep learning can also benefit
from leveraging multiple latent levels of representations across tasks or domains.
The recent work on joint many-task learning solves a range of NLP tasks — from
morphological to syntactic and to semantic levels, within one single, big deep neu-
ral network model (Hashimoto et al., 2017). The model predicts different levels of
linguistic outputs at successively deep layers, accomplishing standard NLP tasks of
tagging, chunking, syntactic parsing, as well as predictions of semantic relatedness
and entailment. The strong results obtained using this single, end-to-end learned
model point toward solving more challenging NLP tasks in the real world, as well as
tasks beyond NLP.
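A schematic of this "different linguistic levels at different depths" wiring, with made-up layer sizes and an illustrative subset of task heads (POS tagging, chunking, and a sentence-level semantic score), might look as follows; it is only meant to show the layered multi-task idea, not the actual recurrent architecture or training procedure of Hashimoto et al. (2017):

    import numpy as np

    rng = np.random.default_rng(1)

    def dense(x, w):
        return np.tanh(x @ w)

    def softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    # Illustrative sizes: 100-d word vectors, 10 POS tags, 5 chunk labels.
    d_word, d_h1, d_h2, n_pos, n_chunk = 100, 128, 128, 10, 5
    w1 = rng.normal(0, 0.05, (d_word, d_h1))          # lowest shared layer
    w_pos = rng.normal(0, 0.05, (d_h1, n_pos))        # POS head reads the shallow layer
    w2 = rng.normal(0, 0.05, (d_h1 + n_pos, d_h2))    # deeper layer also sees the POS prediction
    w_chunk = rng.normal(0, 0.05, (d_h2, n_chunk))    # chunking head reads the deeper layer
    w_sem = rng.normal(0, 0.05, (d_h2, 1))            # crude sentence-level semantic head on top

    # One toy sentence of 7 tokens, each a random 100-d vector standing in for embeddings.
    tokens = rng.normal(size=(7, d_word))

    h1 = dense(tokens, w1)                             # shared low-level representation
    pos_probs = softmax(h1 @ w_pos)                    # morphological/syntactic task at a shallow layer
    h2 = dense(np.concatenate([h1, pos_probs], axis=1), w2)
    chunk_probs = softmax(h2 @ w_chunk)                # chunking task at a deeper layer
    sentence_score = float(np.mean(h2 @ w_sem))        # sentence-level (semantic) output at the top

    print(pos_probs.shape, chunk_probs.shape, sentence_score)

The key design choice illustrated here is that each deeper layer consumes the predictions of the shallower task, so that the single network is supervised at several linguistic levels at once.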

1.6.5 Meta-Learning

A further future direction for fruitful NLP and artificial intelligence research is the
paradigm of learning-to-learn or meta-learning. The goal of meta-learning is to learn
how to learn new tasks faster by reusing previous experience, instead of treating
each new task in isolation and learning to solve each of them from scratch. That
is, with the success of meta-learning, we can train a model on a variety of learning
tasks, such that it can solve new learning tasks using only a small number of train-
ing samples. In our NLP context, successful meta-learning will enable the design
of intelligent NLP systems that improve existing learning algorithms or automatically
discover new ones (e.g. sophisticated optimization algorithms for unsupervised learning)
for solving NLP tasks using small amounts of training data.
The study of meta-learning, as a subfield of machine learning, started over three
decades ago (Schmidhuber, 1987; Hochreiter et al., 2001), but it was not until recent
years, when deep learning methods had reasonably matured, that stronger evidence of
the potentially huge impact of meta-learning became apparent. Initial progress in
meta-learning can be seen in various techniques successfully applied to deep learning,
including hyper-parameter optimization (Maclaurin et al., 2015), neural network
architecture optimization (Wichrowska et al., 2017), and fast reinforcement learning
(Finn et al., 2017). The ultimate success of meta-learning in the real world would
allow the development of algorithms for solving most NLP and computer science
problems to be reformulated as deep learning problems, solvable by the uniform
infrastructure designed for deep learning today. Meta-learning is a powerful emerging
artificial intelligence and deep learning paradigm and a fertile research area expected
to impact real-world NLP applications.
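To make the learning-to-learn idea concrete, the following is a minimal first-order sketch in the spirit of model-agnostic meta-learning (Finn et al., 2017) on toy one-parameter regression tasks; the task family, learning rates, and the first-order approximation of the meta-gradient are simplifying assumptions made purely for illustration:

    import numpy as np

    rng = np.random.default_rng(2)

    # Each "task" is a 1-D regression problem y = a * x with a different slope a.
    def sample_task():
        a = rng.uniform(-2.0, 2.0)
        x = rng.uniform(-1.0, 1.0, size=10)
        return a, x, a * x

    def loss_and_grad(theta, x, y):
        """Mean-squared error of the one-parameter model y_hat = theta * x, and its gradient."""
        err = theta * x - y
        return np.mean(err ** 2), 2.0 * np.mean(err * x)

    theta = 0.0                 # meta-parameters (a single scalar here)
    alpha, beta = 0.1, 0.01     # inner- and outer-loop learning rates (illustrative values)

    for step in range(2000):
        meta_grad = 0.0
        for _ in range(5):                                     # a small batch of tasks per meta-update
            a, x, y = sample_task()
            _, g = loss_and_grad(theta, x[:5], y[:5])          # inner step on the "support" half
            theta_task = theta - alpha * g                     # task-adapted parameters
            _, g_query = loss_and_grad(theta_task, x[5:], y[5:])  # evaluate on the "query" half
            meta_grad += g_query                               # first-order MAML: ignore second derivatives
        theta -= beta * meta_grad / 5

    # After meta-training, one inner gradient step adapts the shared initialization to a new task.
    a, x, y = sample_task()
    loss_before, _ = loss_and_grad(theta, x[5:], y[5:])
    _, g = loss_and_grad(theta, x[:5], y[:5])
    loss_after, _ = loss_and_grad(theta - alpha * g, x[5:], y[5:])
    print(f"new task: query loss before adaptation {loss_before:.3f}, after one step {loss_after:.3f}")

The outer update moves the shared initialization toward a point from which a single inner gradient step on a handful of examples already adapts to a new task from the same family, which is the few-shot behavior described in the paragraph above.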

1.7 Summary

In this chapter, to set up the fundamental framework for the book, we first provided
an introduction to the basics of natural language processing (NLP), which is more
application-oriented than computational linguistics, both belonging to the field of
artificial intelligence and computer science. We then surveyed the historical development
of the NLP field, spanning several decades, in terms of three waves of NLP — starting
from rationalism and empiricism, and arriving at the current deep learning wave. The
goal of the survey is to distill insights from the historical developments that can serve
to guide future directions.
The conclusion from our three-wave analysis is that the current deep learning
technology for NLP is a conceptual and paradigmatic revolution relative to the NLP
technologies developed in the previous two waves. The key pillars underlying this
revolution consist of distributed representations of linguistic entities (sub-words,
words, phrases, sentences, paragraphs, documents, etc.) via embedding, semantic
generalization due to the embedding, long-span deep sequence modeling of language,
hierarchical networks effective for representing linguistic levels from low to high, and
end-to-end deep learning methods to jointly solve many NLP tasks. None of these
were possible before the deep learning wave, not only because of the lack of big data
and powerful computation in the previous waves but, equally importantly, because the
right framework was missing until the deep learning paradigm emerged in recent years.
After surveying the prominent successes that deep learning has brought to select NLP
application areas (with much more comprehensive coverage of successful NLP areas in
the remaining chapters of this book), we pointed out and analyzed several key
limitations of current deep learning technology in general, as well as those specific to
NLP. This investigation led us to five research directions for future advances in NLP —
frameworks for neural-symbolic integration, exploration of better memory models and
better use of knowledge, as well as better deep learning paradigms including
unsupervised and generative learning, multi-modal and multi-task learning, and
meta-learning.
In conclusion, deep learning has ushered in a world that gives our NLP field a
much brighter future than at any time in the past. Deep learning not only provides a
powerful modeling framework for representing human cognitive abilities of natural
language in computer systems but, as importantly, has already been producing superior
practical results in a number of key NLP application areas. In the remaining chapters
of this book, detailed descriptions of NLP techniques developed using the deep
learning framework will be provided, and where possible, benchmark results will be
presented contrasting deep learning with more traditional techniques developed before
the deep learning tidal wave hit the NLP shore just a few years ago. We hope this
comprehensive set of material will serve as a marker along the way as NLP researchers
develop better and more advanced deep learning methods to overcome some or all of
the current limitations discussed in this chapter, possibly inspired by the research
directions we have analyzed here.

References

Abdel-Hamid, O., Mohamed, A., Jiang, H., Deng, L., Penn, G., and Yu, D. (2014).
Convolutional neural networks for speech recognition. IEEE/ACM Trans. Audio,
Speech and Language Processing, 22.
Amodei, D., Ng, A., et al. (2016). Deep speech 2: End-to-end speech recognition in
English and Mandarin. In Proceedings of ICML.
Bahdanau, D., Cho, K., and Bengio, Y. (2015). Neural machine translation by jointly
learning to align and translate. In Proceedings of ICLR.
Baker, J. et al. (2009a). Research developments and directions in speech recognition
and understanding. IEEE Signal Processing Magazine, 26(4).
Baker, J. et al. (2009b). Updated MINDS report on speech recognition and understanding. IEEE Signal Processing Magazine, 26(4).
Baum, L. and Petrie, T. (1966). Statistical inference for probabilistic functions of finite state Markov chains. The Annals of Mathematical Statistics.
Bengio, Y. (2009). Learning Deep Architectures for AI. NOW Publishers.
Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. (2001). A neural probabilistic
language model. Proceedings of NIPS.
Bishop, C. (1995). Neural Networks for Pattern Recognition. Oxford University
Press.
Bishop, C. (2006). Pattern Recognition and Machine Learning. Springer.
Bridle, J. et al. (1998). An investigation of segmental hidden dynamic models of
speech coarticulation for automatic speech recognition. Final Report for 1998
Workshop on Language Engineering, Johns Hopkins University CLSP.
Brown, P. F., Della Pietra, S. A., Della Pietra, V. J., and Mercer, R. L. (1993). The
mathematics of statistical machine translation: Parameter estimation. Computa-
tional Linguistics.
Charniak, E. (2011). The brain as a statistical inference engine — and you can too.
Computational Linguistics.
Chiang, D. (2007). Hierarchical phrase-based translation. Computational Linguistics.
Chomsky, N. (1957). Syntactic Structures. Mouton.
Chorowski, J., Bahdanau, D., Serdyuk, D., Cho, K., and Bengio, Y. (2015).
Attention-based models for speech recognition. Proceedings of NIPS.
Church, K. (2007). A pendulum swung too far. Linguistic Issues in Language
Technology, 2(4).
Church, K. (2014). The case for empiricism (with and without statistics). In Proc.
Frame Semantics in NLP.
Church, K. and Mercer, R. (1993). Introduction to the special issue on computa-
tional linguistics using large corpora. Computational Linguistics, 9(1).
Collins, M. (1997). Head-Driven Statistical Models for Natural Language Parsing.
PhD thesis, University of Pennsylvania.
Collins, M. (2002). Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of EMNLP.
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. (2011). Natural language processing (almost) from scratch. J. Machine Learning Research.
Dahl, G., Yu, D., and Deng, L. (2011). Large-vocabulary continuous speech recognition with context-dependent DBN-HMMs. In Proceedings of ICASSP.
Dahl, G., Yu, D., Deng, L., and Acero, A. (2012). Context-dependent pre-trained
deep neural networks for large-vocabulary speech recognition. IEEE Trans. Au-
dio, Speech, and Language Processing, 20.
Deng, L. (1998). A dynamic, feature-based approach to the interface between
phonology and phonetics for speech modeling and recognition. Speech Com-
munication, 24(4).
Deng, L. (2014). A tutorial survey of architectures, algorithms, and applications for
deep learning. APSIPA Transactions on Signal and Information Processing, 3.
Deng, L. (2016). Deep learning: from speech recognition to language and multi-
modal processing. APSIPA Transactions on Signal and Information Processing,
5.
Deng, L. (2017). Paradigms and algorithms of artificial intelligence — the historical
path and future outlook. In IEEE Signal Processing Magazine.
Deng, L., Hinton, G., and Kingsbury, B. (2013). New types of deep neural network
learning for speech recognition and related applications: An overview. Proceed-
ings of ICASSP.
Deng, L. and O’Shaughnessy, D. (2003). SPEECH PROCESSING A Dynamic and
Optimization-Oriented Approach. Marcel Dekker.
Deng, L., Seltzer, M., Yu, D., Acero, A., Mohamed, A., and Hinton, G. (2010).
Binary coding of speech spectrograms using a deep autoencoder. Proceedings of
Interspeech.
Deng, L. and Yu, D. (2007). Use of differential cepstra as acoustic features in hidden
trajectory modeling for phonetic recognition. Proceedings of ICASSP.
Deng, L. and Yu, D. (2014). Deep Learning: Methods and Applications. NOW
Publishers.
Deng, L., Yu, D., and Platt, J. (2012). Scalable stacking and learning for building
deep architectures. In Proceedings of ICASSP.
Devlin, J. et al. (2015). Language models for image captioning: The quirks and
what works. Proceedings of CVPR.
Dhingra, B., Li, L., Li, X., Gao, J., Chen, Y., Ahmed, F., and Deng, L. (2017).
Towards end-to-end reinforcement learning of dialogue agents for information
access. Proceedings of ACL.
Fang, H. et al. (2015). From captions to visual concepts and back. Proceedings of
CVPR.
Fei-Fei, L. and Perona, P. (2005). A Bayesian hierarchical model for learning natural scene categories. Proceedings of CVPR.
Fei-Fei, L. and Perona, P. (2016). Stacked attention networks for image question
answering. Proceedings of CVPR.
Finn, C., Abbeel, P., and Levine, S. (2017). Model-agnostic meta-learning for fast
adaptation of deep networks. In Proceedings of ICML.
Gan, Z. et al. (2017). Semantic compositional networks for visual captioning. Pro-
ceedings of CVPR.
Gasic, M., Mrkšić, N., Rojas-Barahona, L., Su, P., Ultes, S., Vandyke, D., Wen, T., and Young, S. (2017). Dialogue manager domain adaptation using Gaussian process reinforcement learning. Computer Speech and Language, 45.
Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.
Goodfellow, I. et al. (2017). Generative adversarial networks. In Proceedings of
NIPS.
Graves, A. et al. (2016). Hybrid computing using a neural network with dynamic
external memory. Nature.
Hashimoto, K., Xiong, C., Tsuruoka, Y., and Socher, R. (2017). A joint many-task model: Growing a neural network for multiple NLP tasks. In Proceedings of EMNLP.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image
recognition. Proc. CVPR.
He, X. and Deng, L. (2003). Maximum expected BLEU training of phrase and
lexicon translation models. Proceedings of ACL.
He, X. and Deng, L. (2012). Maximum expected BLEU training of phrase and
lexicon translation models. Proceedings of ACL.
He, X. and Deng, L. (2013). Speech-centric information processing: An
optimization-oriented approach. Proceedings of the IEEE, 101.
He, X., Deng, L., and Chou, W. (2008). Discriminative learning in sequential pattern recognition. IEEE Signal Processing Magazine, 25(5).
Hinton, G., Deng, L., Yu, D., Dahl, G., Mohamed, A.-r., Jaitly, N., Senior, A., Van-
houcke, V., Nguyen, P., Kingsbury, B., and Sainath, T. (2012). Deep neural net-
works for acoustic modeling in speech recognition. IEEE Signal Processing Mag-
azine.
Hinton, G., Osindero, S., and Teh, Y.-W. (2006). A fast learning algorithm for deep
belief nets. Neural Computation.
Hinton, G. and Salakhutdinov, R. (2012). A better way to pre-train deep Boltzmann machines. In Proceedings of NIPS.
Hochreiter, S. et al. (2001). Learning to learn using gradient descent. In Proceedings
of International Conf. Artificial Neural Networks.
Huang, J.-T., Li, J., Yu, D., Deng, L., and Gong, Y. (2013a). Cross-lingual knowl-
edge transfer using multilingual deep neural networks with shared hidden layers.
In Proceedings of ICASSP.
Huang, P. et al. (2013b). Learning deep structured semantic models for web search
using clickthrough data. Proceedings of CIKM.
Jackson, P. (1998). Introduction to Expert Systems. Addison-Wesley.
Jelinek, F. (1998). Statistical Models for Speech Recognition. MIT Press.
Juang, F. (2016). Deep neural networks a developmental perspective. APSIPA
Transactions on Signal and Information Processing, 5.
Kaiser, L., Nachum, O., Roy, A., and Bengio, S. (2017). Learning to remember rare
events. In Proceedings of ICLR.
Karpathy, A. and Fei-Fei, L. (2015). Deep visual-semantic alignments for generating image descriptions. Proceedings of CVPR.
Koh, P. and Liang, P. (2017). Understanding black-box predictions via influence
functions. In Proceedings of ICML.
Krizhevsky, A., Sutskever, I., and Hinton, G. (2012). ImageNet classification with deep convolutional neural networks. In Proceedings of NIPS.
Lafferty, J., McCallum, A., and Pereira, F. (2001). Conditional random fields: Prob-
abilistic models for segmenting and labeling sequence data. In Proceedings of
ICML.
LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature, 521.
Lee, L., Attias, H., Deng, L., and Fieguth, P. (2004). A multimodal variational
approach to learning and inference in switching state space models. Proceedings
of ICASSP.
Lee, M. et al. (2016). Reasoning in vector space: An exploratory study of question
answering. Proceedings of ICLR.
Lin, H., Deng, L., Droppo, J., Yu, D., and Acero, A. (2008). Learning methods in
multilingual speech recognition. In NIPS Workshop.
Liu, Y., Chen, J., and Deng, L. (2017). An unsupervised learning method exploiting
sequential output statistics. In arXiv:1702.07817.
Ma, J. and Deng, L. (2004). Target-directed mixture dynamic models for sponta-
neous speech recognition. IEEE Trans. Speech and Audio Processing, 12(4).
Maclaurin, D., Duvenaud, D., and Adams, R. (2015). Gradient-based hyperparam-
eter optimization through reversible learning. In Proceedings of ICML.
Manning, C. (2016). Computational linguistics and deep learning. In Computational
Linguistics.
Manning, C. and Schütze, H. (1999). Foundations of Statistical Natural Language Processing. MIT Press.
Manning, C. and Socher, R. (2017). Lectures 17 and 18: Issues and Possible Ar-
chitectures for NLP; Tackling the Limits of Deep Learning for NLP. CS224N
Course: NLP with Deep Learning.
Mesnil, G., He, X., Deng, L., and Bengio, Y. (2013). Investigation of recurrent-
neural-network architectures and learning methods for spoken language under-
standing. In Proceedings of Interspeech.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. (2013). Distributed
representations of words and phrases and their compositionality. Proceedings of
NIPS.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G.,
Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie,
C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and
Hassabis, D. (2015). Human-level control through deep reinforcement learning.
Nature.
Mohamed, A., Dahl, G., and Hinton, G. (2009). Acoustic modeling using deep
belief networks. In NIPS Workshop on Speech Recognition.
Murphy, K. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.
Nguyen, T. et al. (2017). MS MARCO: A human generated machine reading comprehension dataset. arXiv:1611.09268.
Nilsson, N. (1982). Principles of Artificial Intelligence. Springer.
Och, F. (2003). Minimum error rate training in statistical machine translation. Proceedings of ACL.
Och, F. and Ney, H. (2002). Discriminative training and maximum entropy models
for statistical machine translation. Proceedings of ACL.
Oh, J., Chockalingam, V., Singh, S., and Lee, H. (2016). Control of memory, active
perception, and action in minecraft. In Proceedings of ICML.
Palangi, H., Smolensky, P., He, X., and Deng, L. (2017). Deep learn-
ing of grammatically-interpretable representations through question-answering.
arXiv:1705.08432.
Parloff, R. (2016). Why deep learning is suddenly changing your life. In Fortune
Magazine.
Pereira, F. (2017). A (computational) linguistic farce in three acts. In
http://www.earningmyturns.org.
Picone, J. et al. (1999). Initial evaluation of hidden dynamic models on conversa-
tional speech. Proceedings of ICASSP.
Plamondon, R. and Srihari, S. (2000). Online and off-line handwriting recognition:
A comprehensive survey. IEEE Trans. Pattern Analysis and Machine Intelligence.
Rabiner, L. and Juang, B.-H. (1993). Fundamentals of Speech Recognition.
Prentice-Hall.
Ratnaparkhi, A. (1997). A simple introduction to maximum entropy models for
natural language processing. Technical report, University of Pennsylvania.
Reddy, R. (1976). Speech recognition by machine: A review. Proceedings of the
IEEE, 64(4).
Rumelhart, D., Hinton, G., and Williams, R. (1986). Learning representations by
back-propagating errors. Nature.
Russell, S. and Stefano, E. (2017). Label-free supervision of neural networks with
physics and domain knowledge. In Proceedings of AAAI.
Saon, G. et al. (2017). English conversational telephone speech recognition by hu-
mans and machines. Proceedings of ICASSP.
Schmidhuber, J. (1987). Evolutionary principles in self-referential learning.
Diploma Thesis, Institute of Informatik, Tech. Univ. Munich.
Seneff, S. et al. (1991). Development and preliminary evaluation of the MIT ATIS system. Proceedings of HLT.
Smolensky, P. et al. (2016). Reasoning with tensor product representations. arXiv:1601.02745.
Sutskever, I., Vinyals, O., and Le, Q. (2014). Sequence to sequence learning with neural networks. Proceedings of NIPS.
Tur, G. and Deng, L. (2011). Intent Determination and Spoken Utterance Classi-
fication; Chapter 4 in book: Spoken Language Understanding. John Wiley and
Sons.
Turing, A. (1950). Computing Machinery and Intelligence. Mind.
Vapnik, V. (1998). Statistical Learning Theory. John Wiley and Sons.
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. (2010).
Stacked denoising autoencoders: Learning useful representations in a deep net-
work with a local denoising criterion. The Journal of Machine Learning Re-
search.
Vinyals, O. et al. (2016). Matching networks for one shot learning. In Proceedings
of NIPS.
Viola, P. and Jones, M. (2004). Robust real-time face detection. International Jour-
nal of Computer Vision, 57.
Wang, Y.-Y., Deng, L., and Acero, A. (2011). Semantic Frame Based Spoken
Language Understanding; Chapter 3 in book: Spoken Language Understanding.
John Wiley and Sons.
Wichrowska, O. et al. (2017). Learned optimizers that scale and generalize. In
Proceedings of ICML.
Winston, P. (1993). Artificial Intelligence. Addison-Wesley.
Xiong, W. et al. (2016). Achieving human parity in conversational speech recogni-
tion. Proceedings of Interspeech.
Young, S., Gasic, M., Thomson, B., and Williams, J. (2013). POMDP-based statistical spoken dialogue systems: A review. Proceedings of the IEEE.
Yu, D. and Deng, L. (2015). Automatic Speech Recognition: A Deep Learning Ap-
proach. Springer.
Yu, D., Deng, L., and Dahl, G. (2010). Roles of pre-training and fine-tuning in context-dependent DBN-HMMs for real-world speech recognition. NIPS Workshop.
Yu, D., Deng, L., Seide, F., and Li, G. (2011). Discriminative pre-training of deep neural networks. U.S. Patent No. 9,235,799, filed in 2011, granted in 2016.
Zue, V. (1985). The use of speech knowledge in automatic speech recognition. Proceedings of the IEEE, 73.
