from text, social computing, language generation, and text sentiment analysis, have also seen significant progress using deep learning, riding on the third wave of NLP. Nowadays, deep learning is the dominant method applied to practically all NLP tasks.
The main goal of this book is to provide a comprehensive survey of recent advances in deep learning applied to NLP. The book presents the state of the art of NLP-centric deep learning research and focuses on the role deep learning plays in major NLP applications, including spoken language understanding, dialogue systems, lexical analysis, parsing, knowledge graphs, machine translation, question answering, sentiment analysis, social computing, and natural language generation (from images). This book is suitable for readers with a technical background in computation, including graduate students, post-doctoral researchers, educators, industrial researchers, and anyone interested in getting up to speed with the latest deep learning techniques associated with NLP.
The book is organized into eleven chapters as follows:
• Chapter 1: A Joint Introduction to Natural Language Processing and to Deep
Learning (Li Deng and Yang Liu)
• Chapter 2: Deep Learning in Conversational Language Understanding (Gokhan
Tur, Asli Celikyilmaz, Xiaodong He, Dilek Hakkani-Tür, and Li Deng)
• Chapter 3: Deep Learning in Spoken and Text-based Dialogue Systems (Asli
Celikyilmaz, Li Deng, and Dilek Hakkani-Tür)
• Chapter 4: Deep Learning in Lexical Analysis and Parsing (Wanxiang Che and
Yue Zhang)
• Chapter 5: Deep Learning in Knowledge Graph (Zhiyuan Liu and Xianpei Han)
• Chapter 6: Deep Learning in Machine Translation (Yang Liu and Jiajun Zhang)
• Chapter 7: Deep Learning in Question Answering (Kang Liu and Yansong Feng)
• Chapter 8: Deep Learning in Sentiment Analysis (Duyu Tang and Meishan
Zhang)
• Chapter 9: Deep Learning in Social Computing (Xin Zhao and Chenliang Li)
• Chapter 10: Deep Learning in Natural Language Generation from Images (Xi-
aodong He and Li Deng)
• Chapter 11: Epilogue (Li Deng and Yang Liu)
Chapter 1 first reviews the basics of NLP as well as the main scope of NLP cov-
ered in the following chapters of the book, and then examines in some depth the historical development of NLP, summarized as three waves, together with future directions.
Then, an in-depth survey on the recent advances in deep learning applied to NLP is
organized into nine separate chapters, each covering a largely independent applica-
tion area of NLP. The main body of each chapter is written by leading researchers
and experts actively working in the respective field.
The origin of this book was the set of comprehensive tutorials given at the 15th
China National Conference on Computational Linguistics (CCL 2016) held in Oc-
tober 2016 in Yantai, Shandong, China, where both of us, editors of this book, were
active participants and took leading roles. We thank our senior editor at Springer, Dr. Celine Lanlan Chang, who kindly invited us to create this book and who provided much timely assistance needed to complete it. We are also grateful to Springer's assistant editor, Jane Li, for offering invaluable help through various stages of manuscript preparation.
We thank all authors of Chapters 2-10, who devoted their valuable time to carefully preparing the content of their chapters: Gokhan Tur, Asli Celikyilmaz, Dilek
Hakkani-Tur, Wanxiang Che, Yue Zhang, Xianpei Han, Zhiyuan Liu, Jiajun Zhang,
Kang Liu, Yansong Feng, Duyu Tang, Meishan Zhang, Xin Zhao, Chenliang Li,
and Xiaodong He. The authors of Chapters 4-9 are CCL 2016 tutorial speakers.
They spent a considerable amount of time in updating their tutorial material with
the latest advances in the field since October 2016.
Further, we thank numerous reviewers and readers, including Sadaoki Furui, Andrew Ng, Fred Juang, Ken Church, Haifeng Wang, and Hongjiang Zhang, who not only gave us much-needed encouragement but also offered many constructive comments that substantially improved earlier drafts of the book.
Finally, we express our appreciation to our organizations, Microsoft Research and Citadel (for Li Deng) and Tsinghua University (for Yang Liu), which provided the excellent environments, support, and encouragement that were instrumental in completing this book.
l.deng@ieee.org, liuyang2011@tsinghua.edu.cn
Abstract In this chapter, we set up the fundamental framework for the book. We
first provide an introduction to the basics of natural language processing (NLP) as an
integral part of artificial intelligence. We then survey the historical development of
NLP, spanning over five decades, in terms of three waves. The first two waves arose
as rationalism and empiricism, paving the way for the current deep learning wave. The
key pillars underlying the deep learning revolution for NLP consist of 1) distributed
representations of linguistic entities via embedding, 2) semantic generalization due
to the embedding, 3) long-span deep sequence modeling of natural language, 4)
hierarchical networks effective for representing linguistic levels from low to high,
and 5) end-to-end deep learning methods to jointly solve many NLP tasks. After
the survey, several key limitations of current deep learning technology for NLP are
analyzed. This analysis leads to five research directions for future advances in NLP.
1.2 The First Wave: Rationalism
NLP research in its first wave lasted for a long time, dating back to the 1950s. In 1950,
Alan Turing proposed the Turing test to evaluate a computer’s ability to exhibit in-
telligent behavior indistinguishable from that of a human (Turing, 1950). This test
is based on natural language conversations between a human and a computer de-
signed to generate human-like responses. In 1954, the Georgetown-IBM experiment
demonstrated the first machine translation system capable of translating more than
sixty Russian sentences into English.
The approaches, based on the belief that knowledge of language in the human mind is fixed in advance by genetic inheritance, dominated most of NLP research between about 1960 and the late 1980s. These approaches have been called rationalist ones (Church, 2007). The dominance of rationalist approaches in NLP was mainly
due to the widespread acceptance of arguments of Noam Chomsky for an innate
language structure and his criticism of N-grams (Chomsky, 1957). Postulating that
key parts of language are hardwired in the brain at birth as a part of the human
genetic inheritance, rationalist approaches endeavored to design hand-crafted rules
to incorporate knowledge and reasoning mechanisms into intelligent NLP systems.
Up until the 1980s, the most notably successful NLP systems, such as ELIZA for simulating
a Rogerian psychotherapist and MARGIE for structuring real-world information
into concept ontologies, were based on complex sets of hand-written rules.
This period coincided approximately with the early development of artificial in-
telligence, characterized by expert knowledge engineering, where domain experts
devised computer programs according to their knowledge of the (very narrow) application domains (Nilsson, 1982; Winston, 1993). The experts designed
these programs using symbolic logical rules based on careful representations and en-
gineering of such knowledge. These knowledge-based artificial intelligence systems
tend to be effective in solving narrow-domain problems by examining the “head” or
most important parameters and reaching a solution about the appropriate action to
take in each specific situation. These “head” parameters are identified in advance
by human experts, leaving the “tail” parameters and cases untouched. Since such systems lack learning capability, they have difficulty generalizing their solutions to new situations and domains. The typical approach during this period was exemplified by
the expert system, a computer system that emulates the decision-making ability of a
human expert. Such systems are designed to solve complex problems by reasoning
about knowledge (Nilsson, 1982). The first expert systems were created in the 1970s and then proliferated in the 1980s. The main “algorithm” used was inference rules in the form of “if-then-else” (Jackson, 1998). The main strength of these first-generation artificial intelligence systems was their transparency and interpretability in their (limited) capability of performing logical reasoning. Like NLP systems such as ELIZA
and MARGIE, the general expert systems in the early days used handcrafted expert
knowledge that was often effective for narrowly defined problems, although the reasoning could not handle the uncertainty that is ubiquitous in practical applications.
In specific NLP application areas of dialogue systems and spoken language un-
derstanding, to be described in more detail in Chapters 2 and 3 of this book, such
rationalistic approaches were represented by the pervasive use of symbolic rules and
templates (Seneff et al., 1991). The designs were centered on grammatical and on-
tological constructs, which, while interpretable and easy to debug and update, experienced severe difficulties in practical deployment. When such systems worked, they often worked beautifully; but unfortunately this did not happen very often, and the domains were necessarily limited.
Likewise, speech recognition research and system design, another long-standing
NLP and artificial intelligence challenge, during this rationalist era were based
heavily on the paradigm of expert knowledge engineering, as elegantly analyzed
in (Church and Mercer, 1993). During the 1970s and early 1980s, the expert-system approach to speech recognition was quite popular (Reddy, 1976; Zue, 1985). However, the lack of ability to learn from data and to handle uncertainty in reasoning was acutely recognized by researchers, leading to the second wave of speech recognition, NLP, and artificial intelligence described next.
1.3 The Second Wave: Empiricism

The second wave of NLP was characterized by the exploitation of data corpora and
of (shallow) machine learning, statistical or otherwise, to make use of such data
(Manning and Schütze, 1999). As much of the structure of and theory about nat-
ural language were discounted or discarded in favor of data-driven methods, the
main approaches developed during this era have been called empirical or prag-
matic ones (Church and Mercer, 1993; Church, 2014). With the increasing avail-
ability of machine-readable data and steady increase of computational power, em-
pirical approaches have dominated NLP since around 1990. One of the major NLP
conferences was even named “Empirical Methods in Natural Language Process-
ing (EMNLP)” to reflect most directly the strongly positive sentiment of NLP re-
searchers during that era towards empirical approaches.
In contrast to rationalist approaches, empirical approaches assume that the hu-
man mind only begins with general operations for association, pattern recognition,
and generalization. Rich sensory input is required to enable the mind to learn the
detailed structure of natural language. Prevalent in linguistics between 1920 and
1960, empiricism has been undergoing a resurgence since 1990. Early empirical
approaches to NLP focused on developing generative models such as the hidden
Markov model (Baum and Petrie, 1966), the IBM translation models (Brown et al.,
1993), and the head-driven parsing models (Collins, 1997) to discover the regulari-
ties of languages from large corpora. Since the late 1990s, discriminative models have become the de facto approach in a variety of NLP tasks. Representative discriminative models and methods in NLP include the maximum entropy model (Ratnaparkhi, 1997), support vector machines (Vapnik, 1998), conditional random fields (Lafferty et al., 2001), maximum mutual information and minimum classification error training (He et al., 2008), and the perceptron (Collins, 2002).
Again, this era of empiricism in NLP was paralleled by corresponding ap-
proaches in artificial intelligence as well as in speech recognition and computer vi-
sion. It came about after clear evidence that learning and perception capabilities are
crucial for complex artificial intelligence systems but missing in the expert systems
popular in the previous wave. For example, when DARPA opened its first Grand
Challenge for autonomous driving, most vehicles then relied on the knowledge-
based artificial intelligence paradigm. Much like speech recognition and NLP, the
autonomous driving and vision researchers immediately realized the limitations of the knowledge-based paradigm and the necessity of machine learning with uncertainty-handling and generalization capabilities.
The empiricism in NLP and speech recognition in this second wave was based
on data-intensive machine learning, which we now call “shallow” due to the general
lack of abstractions constructed by many-layer or “deep” representations of data, which would come in the third wave to be described in the next section. In machine learning, researchers do not need to concern themselves with constructing the precise and exact rules required for the knowledge-based NLP and speech systems during the
first wave. Rather, they focus on statistical models (Bishop, 2006; Murphy, 2012)
or simple neural networks (Bishop, 1995) as an underlying engine. They then au-
tomatically learn or “tune” the parameters of the engine using ample training data
to make them handle uncertainty, and to attempt to generalize from one condition
to another and from one domain to another. The key algorithms and methods for
machine learning include EM (expectation-maximization), Bayesian networks, sup-
port vector machines, decision trees, and, for neural networks, the backpropagation algorithm.
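To make this workflow concrete, below is a minimal, illustrative sketch of a shallow, data-driven NLP system: a bag-of-words statistical classifier whose parameters are tuned entirely from training data rather than from hand-written rules. The example uses scikit-learn, and the tiny corpus and labels are made up for illustration; nothing here is taken from a system described in this book.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up training corpus: the "ample training data" would of course be
# far larger in practice; the point is that no rules are written by hand.
texts = ["book a flight to boston", "reserve a table for two",
         "flights from seattle", "dinner reservation tonight"]
labels = ["flight", "restaurant", "flight", "restaurant"]

# A shallow statistical model: surface word counts plus a linear classifier.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(texts, labels)                           # "tune" parameters from data
print(model.predict(["cheap flights to denver"]))  # likely -> ['flight']
```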
Generally speaking, machine learning based NLP, speech, and other artificial intelligence systems perform much better than their earlier, knowledge-based counterparts. Successful examples include almost all artificial intelligence tasks in machine
perception — speech recognition (Jelinek, 1998), face recognition (Viola and Jones,
2004), visual object recognition (Fei-Fei and Perona, 2005), handwriting recogni-
tion (Plamondon and Srihari, 2000), and machine translation (Och, 2003).
More specifically, in the core NLP application area of machine translation, to be described in detail in Chapter 6 of this book as well as in (Church and Mercer, 1993), the field switched rather abruptly around 1990 from the rationalistic methods outlined in Section 1.2 to empirical, largely statistical methods. The availability of
sentence-level alignments in the bilingual training data made it possible to acquire
surface-level translation knowledge not by rules but from data directly, at the ex-
pense of discarding or discounting structured information in natural languages. The
most representative work during this wave is that empowered by various versions of
IBM translation models (Brown et al., 1993). Subsequent developments during this
empiricist era of machine translation further significantly improved the quality of
translation systems (Och and Ney, 2002; Och, 2003; He and Deng, 2003; Chiang,
2007; He and Deng, 2012), but not to the level of massive real-world deployment (which would come after the next wave of deep learning).
In the dialogue and spoken language understanding areas of NLP, this empiri-
cist era was also marked prominently by data-driven machine learning approaches.
These approaches were well suited to meet the requirement for quantitative evalu-
ation and concrete deliverables. They focused on broader but shallow, surface-level
coverage of text and domains instead of detailed analyses of highly restricted text
and domains. The training data were used not to design rules for language under-
standing and response action from the dialogue systems but to learn parameters of
(shallow) statistical or neural models automatically from data. Such learning helped reduce the cost of hand-crafting complex dialogue managers, and helped improve robustness against speech recognition errors in the overall spoken language
understanding and dialogue systems; for a review, see (He and Deng, 2013). More
specifically, for the dialogue policy component of dialogue systems, powerful reinforcement learning based on Markov decision processes was introduced during this era; for a review, see (Young et al., 2013). And for spoken language understanding, the dominant methods moved from the rule- or template-based ones of the first wave to generative models such as hidden Markov models (Wang et al., 2011), and then to discriminative models such as conditional random fields (Tur and Deng, 2011).
Similarly, in speech recognition, for close to 30 years from the early 1980s to around 2010, the field was dominated by the (shallow) machine learning paradigm
using a statistical generative model called the Hidden Markov Model (HMM) in-
tegrated with Gaussian mixture models, along with various versions of its general-
ization (Baker et al., 2009a,b; Deng and O’Shaughnessy, 2003; Rabiner and Juang,
1993). Among many versions of the generalized HMMs were statistical and neural-
net based hidden dynamic models (Deng, 1998; Bridle et al., 1998; Deng and Yu,
2007). The former adopted EM and switching extended Kalman filter algorithms
for learning model parameters (Ma and Deng, 2004; Lee et al., 2004), and the latter
used back-propagation (Picone et al., 1999). Both of them made extensive use of
multiple latent layers of representations for the generative process of speech wave-
forms following the long-standing framework of analysis-by-synthesis in human
speech perception. More significantly, inverting this “deep” generative process to
its counterpart of an end-to-end discriminative process gave rise to the first indus-
trial success of deep learning (Deng et al., 2010, 2013; Hinton et al., 2012), which
formed a driving force of the third wave of speech recognition and NLP that will be
elaborated next.
1.4 The Third Wave: Deep Learning

While the NLP systems, including speech recognition, language understanding, and
machine translation, developed during the second wave performed a lot better and
with higher robustness than those during the first wave, they were far from human-level performance and left much to be desired. With a few exceptions, the (shallow) machine learning models for NLP often did not have sufficient capacity to absorb the large amounts of available training data. Further, the learning algorithms, methods, and infrastructures were not powerful enough. All this changed several years
ago, giving rise to the third wave of NLP, propelled by the new paradigm of deep-
structured machine learning or Deep Learning (Bengio, 2009; Deng and Yu, 2014;
LeCun et al., 2015; Goodfellow et al., 2016).
In traditional machine learning, features are designed by humans and feature
engineering is a bottleneck, requiring significant human expertise. Concurrently,
the associated shallow models lack the representation power and hence the ability
to form levels of decomposable abstractions that would automatically disentangle
complex factors shaping the observed language data. Deep learning breaks away from these difficulties through the use of deep, layered model structures, often in the form of neural networks, and the associated end-to-end learning algorithms. The advances in
deep learning are one major driving force behind the current NLP and more general
artificial intelligence inflection point and are responsible for the resurgence of neural
networks with a wide range of practical, including business, applications (Parloff,
2016).
This third wave began with the successful application of deep learning to large-vocabulary speech recognition around 2010, a collaboration between academia and industry, with the original work presented
at the 2009 NIPS Workshop on Deep Learning for Speech Recognition and Re-
lated Applications. The workshop was motivated by the limitations of deep gener-
ative models of speech, and the possibility that the big-compute, big-data era war-
rants a serious exploration of deep neural networks. It was believed then that pre-
training DNNs using generative models of deep belief nets based on the contrastive-
divergence learning algorithm would overcome the main difficulties of neural nets
encountered in the 1990s (Dahl et al., 2011; Mohamed et al., 2009). However, early
into this research at Microsoft, it was discovered that without contrastive-divergence
pre-training, but with the use of large amounts of training data together with the
deep neural networks designed with corresponding large, context-dependent output
layers and with careful engineering, dramatically lower recognition errors could be obtained than with the then state-of-the-art (shallow) machine learning systems (Yu et al., 2010, 2011; Dahl et al., 2012). This finding was quickly verified by several other
major speech recognition research groups in North America (Hinton et al., 2012;
Deng et al., 2013) and subsequently overseas. Further, the nature of recognition
errors produced by the two types of systems was found to be characteristically dif-
ferent, offering technical insights into how to integrate deep learning into the exist-
ing highly efficient, run-time speech decoding system deployed by major players in
speech recognition industry (Yu and Deng, 2015; Abdel-Hamid et al., 2014; Xiong
et al., 2016; Saon et al., 2017). Nowadays, the back-propagation algorithm applied to
deep neural nets of various forms is uniformly used in all current state-of-the-art
speech recognition systems (Yu and Deng, 2015; Amodei et al., 2016; Saon et al.,
2017), and all major commercial speech recognition systems — Microsoft Cortana,
Xbox, Skype Translator, Amazon Alexa, Google Assistant, Apple Siri, Baidu and iFlyTek voice search, and more — are based on deep learning methods.
The striking success of speech recognition in 2010-2011 heralded the arrival of
the third wave of NLP and artificial intelligence. Quickly following the success of deep learning in speech recognition, computer vision (Krizhevsky et al., 2012) and machine translation (Bahdanau et al., 2015) were taken over by a similar deep learning paradigm. In particular, while the powerful technique of neural embedding of words was developed as early as 2001 (Bengio et al., 2001), it was not until more than ten years later that it was shown to be practically useful at large scale (Mikolov et al., 2013), owing to the availability of big data and faster com-
putation. In addition, a large number of other real-world NLP applications, such
as image captioning (Karpathy and Fei-Fei, 2015; Fang et al., 2015; Gan et al.,
2017), visual question answering (Fei-Fei and Perona, 2016), speech understanding
(Mesnil et al., 2013), web search (Huang et al., 2013b), and recommendation sys-
tems, have been made successful due to deep learning, in addition to many non-NLP
tasks including drug discovery and toxicology, customer relationship management,
recommendation systems, gesture recognition, medical informatics, advertisement,
medical image analysis, robotics, self-driving vehicles, board and eSports games
(e.g. Atari, Go, Poker, and the latest, DOTA2), and so on. For more details, see
https://en.wikipedia.org/wiki/Deep_learning.
1.5 Transitions from Now to the Future

Before analyzing the future directions of NLP with more advanced deep learning, here
we first summarize the significance of the transition from the past waves of NLP to
the present one. We then discuss some clear limitations and challenges of the present
deep learning technology for NLP, to pave the way for examining further developments that could overcome these limitations in the next wave of innovations.

On the surface, the rising deep learning wave discussed in Section 1.4 of this chapter appears to be a simple push of the second, empiricist wave of NLP (Section 1.3) to an extreme, with bigger data, larger models, and greater computing power.
After all, the fundamental approaches developed during both waves are data-driven,
are based on machine learning and computation, and have dispensed with human-
centric “rationalistic” rules that are often brittle and costly to acquire in practical
NLP applications. However, if we analyze these approaches holistically and at a
deeper level, we can identify aspects of a conceptual revolution in moving from empiricist machine learning to deep learning, and can subsequently analyze the future directions of the field (Section 1.6). This revolution, in our opinion, is no less significant than the revolution from the earlier rationalist wave to the empiricist one, as analyzed at the beginning and end of the latter (Church and Mercer, 1993; Charniak, 2011).
Empiricist machine learning and linguistic data analysis during the second NLP wave was started in the early 1990s by crypto-analysts and computer scientists working on natural language sources that were highly limited in vocabulary and application domains. As we discussed in Section 1.3, surface-level text observations, i.e., words and their sequences, were counted using discrete probabilistic models without relying on deep structure in natural language. The basic representations were “one-hot” or lo-
calist, where no semantic similarity between words was exploited. With restrictions
in domains and associated text content, such structure-free representations and em-
pirical models are often sufficient to cover much of what needs to be covered. That
is, the shallow, count-based statistical models can naturally do well in limited and
specific NLP tasks. But when the domain and content restrictions are lifted for more
realistic NLP applications in the real world, count-based models necessarily become ineffective, no matter how many smoothing tricks are invented in an attempt to mitigate the problem of combinatorial counting sparseness. This is
where deep learning for NLP truly shines — distributed representations of words
via embedding, semantic generalization due to the embedding, longer-span deep
sequence modeling, and end-to-end learning methods have all contributed to beat-
ing empiricist, count-based methods in a wide range of NLP tasks, as discussed in Section 1.4.
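The contrast between localist and distributed representations can be made concrete with a small sketch. The vocabulary, vectors, and dimensionality below are purely illustrative toy values chosen for this example, not drawn from any cited system; real embeddings would be learned from large corpora.

```python
import numpy as np

vocab = ["cat", "dog", "automobile"]

# "One-hot" (localist) representations from the second wave: every word is
# equally distant from every other word, so nothing learned about "cat"
# transfers to "dog".
one_hot = np.eye(len(vocab))

# Distributed representations (embeddings) from the third wave: toy 4-d
# vectors in which related words lie close together, enabling semantic
# generalization.
embedding = {
    "cat":        np.array([0.9, 0.8, 0.1, 0.0]),
    "dog":        np.array([0.85, 0.75, 0.2, 0.05]),
    "automobile": np.array([0.0, 0.1, 0.9, 0.8]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(one_hot[0], one_hot[1]))                     # 0.0 for any word pair
print(cosine(embedding["cat"], embedding["dog"]))         # ~0.99, semantically close
print(cosine(embedding["cat"], embedding["automobile"]))  # ~0.12, semantically far
```

With one-hot vectors every pair of distinct words is equally dissimilar, so nothing generalizes across words; with embeddings, proximity in the vector space supports the semantic generalization discussed above.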
Despite the spectacular successes of deep learning in NLP tasks, most notably in
speech recognition/understanding, language modeling, and in machine translation,
there remain huge challenges. Current deep learning methods, which treat neural networks as a black box, generally lack interpretability, and are even further from explainability, in contrast to the “rationalist” paradigm established during the first NLP wave, where the rules devised by experts were naturally explainable. In prac-
tice, however, it is highly desirable to explain the predictions from a seemingly
“black-box” model, not only for improving the model but for providing the users of
the prediction system with interpretations of the suggested actions to take (Koh and
Liang, 2017).
As discussed earlier in this chapter as well as in the current NLP literature (Man-
ning and Socher, 2017), NLP researchers at present still have very primitive deep
learning methods for exploiting structure and for building and accessing memories
or knowledge. While LSTM (with attention) has been pervasively applied to NLP
tasks to beat many NLP benchmarks, LSTM is far from a good memory model
for human cognition. In particular, LSTM lacks adequate structure for simulating episodic memory, a key component of human cognitive ability that allows us to retrieve and re-experience aspects of a past novel event or thought. This ability gives rise to
one-shot learning skills and can be crucial in reading comprehension of natural lan-
guage text or speech understanding, as well as reasoning over events described by
natural language. Many recent studies have been devoted to better memory mod-
eling, including external memory architectures with supervised learning (Vinyals
et al., 2016; Kaiser et al., 2017), and augmented memory architectures with reinforcement learning (Graves et al., 2016; Oh et al., 2016). However, these have not shown general effectiveness and have suffered from a number of limitations, most notably scalability (arising from the use of attention, which has to access every stored el-
ement in the memory). Much work remains in the direction of better modeling of
memory and exploitation of knowledge for text understanding and reasoning.
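The scalability limitation mentioned above can be seen in a minimal sketch of a soft, content-based read over an external memory. This is an illustrative sketch only, not any of the cited memory architectures, and the sizes are arbitrary toy values.

```python
import numpy as np

def attention_read(memory, query):
    """Soft, content-based read over an external memory.

    memory: (num_slots, dim) array of stored vectors
    query:  (dim,) read key
    Every slot is touched on every read, which is why attention-based
    memories scale poorly as num_slots grows.
    """
    scores = memory @ query / np.sqrt(memory.shape[1])  # one score per slot
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                             # softmax over all slots
    return weights @ memory                              # weighted sum of slots

memory = np.random.randn(10_000, 64)         # toy memory with many stored elements
query = np.random.randn(64)
read_vector = attention_read(memory, query)  # cost is O(num_slots * dim) per read
```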
Another potential breakthrough in deep learning for NLP is in new algorithms for
unsupervised deep learning, which ideally makes use of no direct teaching signals paired with inputs (token by token) to guide the learning. Word embedding, discussed in Section 1.4, can be viewed as a weak form of unsupervised learning, making use
of adjacent words as “cost-free” surrogate teaching signals, but for real-world NLP
prediction tasks, such as translation, understanding, summarization, etc., such em-
bedding obtained in an “unsupervised manner” has to be fed into another supervised architecture that requires costly teaching signals. In truly unsupervised learning, which requires no expensive teaching signals, new types of objective functions and new optimization algorithms are needed; e.g., the objective function for unsupervised learning should not require explicit target labels aligned with the input data, as does the cross-entropy objective that is most popular for supervised learning. Development
of unsupervised deep learning algorithms has been significantly behind that of su-
pervised and reinforcement deep learning, where the back-propagation and Q-learning algorithms are reasonably mature.
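As an illustration of the difference in teaching signals, the following sketch (written in PyTorch with made-up sizes, purely for illustration) contrasts a supervised cross-entropy objective, which needs an explicit label aligned with every input, with the kind of “cost-free” surrogate signal used for word embeddings, where the adjacent word in running text serves as the target.

```python
import torch
import torch.nn.functional as F

vocab_size, dim, batch = 1000, 32, 8
logits = torch.randn(batch, vocab_size)          # model outputs for a batch

# Supervised objective: cross entropy needs an explicit target label aligned
# with every input example -- the costly teaching signal discussed above.
labels = torch.randint(0, vocab_size, (batch,))
supervised_loss = F.cross_entropy(logits, labels)

# Weak "unsupervised" surrogate, in the spirit of word-embedding training:
# the target is simply the next (adjacent) word in running text, so no human
# annotation is needed; the corpus itself supplies the teaching signal.
tokens = torch.randint(0, vocab_size, (batch + 1,))   # a toy token stream
context, next_word = tokens[:-1], tokens[1:]
embed = torch.nn.Embedding(vocab_size, dim)
project = torch.nn.Linear(dim, vocab_size)
surrogate_loss = F.cross_entropy(project(embed(context)), next_word)
```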
The most recent preliminary development in unsupervised learning takes the ap-
proach of exploiting sequential output structure and advanced optimization methods
to alleviate the need for using labels in training prediction systems (Russell and Ste-
fano, 2017; Liu et al., 2017). Future advances in unsupervised learning look promising through exploiting new sources of learning signals, including the structure of the input data and the mapping relationships from input to output and vice versa. Exploiting
the relationship from output to input is closely connected to building conditional
generative models. To this end, the recent popular topic in deep learning — genera-
tive adversarial networks (Goodfellow et al., 2017) — is a highly promising direc-
tion where the long-standing concept of analysis-by-synthesis in pattern recognition
and machine learning is likely to return to the spotlight in the near future for solving NLP
tasks in new ways.
Generative adversarial networks have been formulated as neural nets, with dense
connectivity among nodes and with no probabilistic setting. On the other hand,
probabilistic and Bayesian reasoning, which often takes computational advantage
of sparse connections among “nodes” as random variables, has been one of the principal theoretical pillars of machine learning and has been responsible for many NLP methods developed during the empiricist wave of NLP discussed in Section 1.3.
What is the right interface between deep learning and probabilistic modeling? Can
probabilistic thinking help understand deep learning techniques better and motivate
new deep learning methods for NLP tasks? How about the other way around? These
issues are widely open for future research.
Multi-modal and multi-task deep learning are related learning paradigms, both con-
cerning the exploitation of latent representations in the deep networks pooled from
different modalities (e.g. audio, speech, video, images, text, source codes, etc.) or
from multiple cross-domain tasks (e.g. point and structured prediction, ranking, rec-
ommendation, time-series forecasting, clustering, etc.). Before the deep learning
wave, multi-modal and multi-task learning had been very difficult to make effective, due to the lack of intermediate representations shared across modalities or tasks. A most striking example of this contrast for multi-task learning is multilingual speech recognition during the empiricist wave (Lin et al., 2008) versus during the deep learning wave (Huang et al., 2013a).
Multi-modal information can be exploited as low-cost supervision. For instance,
standard speech recognition, image recognition, and text classification methods
make use of supervision labels within each of the speech, image, and text modali-
ties separately. This, however, is far from how children learn to recognize speech and images and to classify text. For example, children often get a distant “supervision” signal for speech sounds from an adult pointing to an image scene, text, or handwriting that is associated with the speech sounds. Similarly, children learning image categories may exploit speech sounds or text as supervision signals. This type of learning that occurs in children can motivate a learning scheme that leverages multi-modal data to improve engineering systems for multi-modal deep learning. A
similarity measure needs to be defined in a shared semantic space, into which speech, images, and text are all mapped via deep neural networks that may be trained using maximum mutual information across different modalities. The huge potential of this scheme has not yet been fully explored in the NLP literature.
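One plausible realization of this idea, sketched below purely for illustration, is a two-tower network that maps image and text features into a shared semantic space and scores pairs by cosine similarity. The dimensions are arbitrary assumptions, and the contrastive (InfoNCE-style) loss is used here as a simple stand-in for a mutual-information-flavored training criterion.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTowerModel(nn.Module):
    """Map two modalities (e.g., image features and text features) into one
    shared semantic space, where cosine similarity can be computed."""
    def __init__(self, image_dim=2048, text_dim=300, shared_dim=256):
        super().__init__()
        self.image_tower = nn.Sequential(nn.Linear(image_dim, 512), nn.ReLU(),
                                         nn.Linear(512, shared_dim))
        self.text_tower = nn.Sequential(nn.Linear(text_dim, 512), nn.ReLU(),
                                        nn.Linear(512, shared_dim))

    def forward(self, image_feats, text_feats):
        img = F.normalize(self.image_tower(image_feats), dim=-1)
        txt = F.normalize(self.text_tower(text_feats), dim=-1)
        return img @ txt.t()   # pairwise cosine similarities

# Toy batch: matching image/text pairs sit on the diagonal of the score matrix;
# a contrastive loss pulls matched pairs together and pushes others apart.
model = TwoTowerModel()
scores = model(torch.randn(4, 2048), torch.randn(4, 300))
loss = F.cross_entropy(scores, torch.arange(4))
```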
Similar to multi-modal deep learning, multi-task deep learning can also benefit
from leveraging multiple latent levels of representations across tasks or domains.
The recent work on joint many-task learning solves a range of NLP tasks — from
morphological to syntactic and to semantic levels, within one single, big deep neu-
ral network model (Hashimoto et al., 2017). The model predicts different levels of
linguistic outputs at successively deep layers, accomplishing standard NLP tasks of
tagging, chunking, syntactic parsing, as well as predictions of semantic relatedness
and entailment. The strong results obtained using this single, end-to-end learned model point to the promise of solving more challenging NLP tasks in the real world, as well as tasks beyond NLP.
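The following is a deliberately simplified sketch of this idea, not the actual architecture of Hashimoto et al. (2017): lower layers of one shared network feed lower-level linguistic predictions (e.g., POS tags), while deeper layers feed higher-level predictions (e.g., chunks). All sizes and task inventories are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ManyTaskSketch(nn.Module):
    """Toy joint many-task network: successive layers serve successive
    linguistic levels, sharing representations across tasks."""
    def __init__(self, vocab=5000, dim=128, n_pos=45, n_chunk=23):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.layer1 = nn.LSTM(dim, dim, batch_first=True)   # lower level
        self.layer2 = nn.LSTM(dim, dim, batch_first=True)   # higher level
        self.pos_head = nn.Linear(dim, n_pos)                # tagging task
        self.chunk_head = nn.Linear(dim, n_chunk)             # chunking task

    def forward(self, tokens):
        h1, _ = self.layer1(self.embed(tokens))   # shared low-level states
        h2, _ = self.layer2(h1)                   # shared high-level states
        return self.pos_head(h1), self.chunk_head(h2)

tokens = torch.randint(0, 5000, (2, 10))          # toy batch of 2 sentences
pos_logits, chunk_logits = ManyTaskSketch()(tokens)
# Training would sum the per-task losses, optionally weighting each level.
```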
1.6.5 Meta-Learning
A further future direction for fruitful NLP and artificial intelligence research is the
paradigm of learning-to-learn or meta-learning. The goal of meta-learning is to learn
how to learn new tasks faster by reusing previous experience, instead of treating
each new task in isolation and learning to solve each of them from scratch. That
is, with the success of meta-learning, we can train a model on a variety of learning
tasks, such that it can solve new learning tasks using only a small number of train-
ing samples. In our NLP context, successful meta-learning will enable the design
of intelligent NLP systems that improve or automatically discover new learning al-
gorithms (e.g. sophisticated optimization algorithms for unsupervised learning), for
solving NLP tasks using small amounts of training data.
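To make the learning-to-learn loop concrete, here is a crude first-order sketch (closer in spirit to a Reptile-style update than to the MAML method of Finn et al. (2017), and using a made-up toy regression task): an inner loop adapts a copy of the meta-parameters to each sampled task, and an outer loop moves the shared initialization toward the adapted weights so that future tasks can be learned from few samples.

```python
import torch
import torch.nn as nn

def make_task():
    # Toy regression task: fit y = a*x + b with randomly drawn a and b.
    a, b = torch.randn(2)
    x = torch.randn(32, 1)
    return x, a * x + b

meta_model = nn.Linear(1, 1)       # the shared initialization being meta-learned
meta_lr, inner_lr = 0.1, 0.01
for _ in range(100):
    x, y = make_task()
    fast = nn.Linear(1, 1)
    fast.load_state_dict(meta_model.state_dict())     # start from the meta init
    opt = torch.optim.SGD(fast.parameters(), lr=inner_lr)
    for _ in range(5):                                 # inner-loop adaptation
        opt.zero_grad()
        nn.functional.mse_loss(fast(x), y).backward()
        opt.step()
    with torch.no_grad():                              # outer (meta) update
        for p_meta, p_fast in zip(meta_model.parameters(), fast.parameters()):
            p_meta += meta_lr * (p_fast - p_meta)
```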
The study of meta-learning, as a subfield of machine learning, started over three
decades ago (Schmidhuber, 1987; Hochreiter et al., 2001), but it was not until recent years, when deep learning methods had reasonably matured, that stronger evidence of the potentially huge impact of meta-learning became apparent. Initial progress in meta-learning can be seen in various techniques successfully applied to deep learn-
ing, including hyper-parameter optimization (Maclaurin et al., 2015), neural net-
work architecture optimization (Wichrowska et al., 2017), and fast reinforcement
learning (Finn et al., 2017). The ultimate success of meta-learning in the real world would allow algorithms for solving most NLP and computer science problems to be reformulated as deep learning problems and to be solved by the uniform infrastructure designed for deep learning today. Meta-learning is a powerful emerging artificial intelligence and deep learning paradigm and a fertile research area expected to impact real-world NLP applications.
1.7 Summary
In this chapter, to set up the fundamental framework for the book, we first provided
an introduction to the basics of natural language processing (NLP), which is more application-oriented than computational linguistics, both belonging to the field of artificial intelligence and computer science. We then surveyed the historical development of the NLP field, spanning several decades, in terms of three waves of NLP — starting from rationalism and empiricism and arriving at the current deep learning wave. The goal of
the survey is to distill insights from the historical developments that serve to guide
the future directions.
The conclusion from our three-wave analysis is that the current deep learning
technology for NLP is a conceptual and paradigmatic revolution relative to the NLP technologies developed during the previous two waves. The key pillars underlying this
revolution consist of distributed representations of linguistic entities (sub-words,
words, phrases, sentences, paragraphs, documents, etc.) via embedding, semantic
generalization due to the embedding, long-span deep sequence modeling of lan-
guage, hierarchical networks effective for representing linguistic levels from low to
high, and end-to-end deep learning methods to jointly solve many NLP tasks. None
of these were possible before the deep learning wave, not only because of the lack of big data and powerful computation in the previous waves but, equally importantly, because the right framework was missing until the deep learning paradigm emerged in recent years.
After we surveyed the prominent successes of select NLP application areas attributed to deep learning (with much more comprehensive coverage of successful NLP areas in the remaining chapters of this book), we pointed out and an-
alyzed several key limitations of current deep learning technology in general, as
well as those for NLP more specifically. This investigation led us to five research
directions for future advances in NLP — frameworks for neural-symbolic integra-
tion, exploration of better memory models and better use of knowledge, as well
as better deep learning paradigms including unsupervised and generative learning,
multi-modal and multi-task learning, and meta-learning.
In conclusion, deep learning has ushered in a world that gives our NLP field a
much brighter future than any time in the past. Deep learning not only provides a
powerful modeling framework for representing human cognitive abilities of natural
language in computer systems, but, as importantly, it has already been creating su-
perior practical results in a number of key application areas of NLP. In the remaining
chapters of this book, detailed descriptions of NLP techniques developed using the
deep learning framework will be provided, and where possible, benchmark results
will be presented contrasting deep learning with more traditional techniques devel-
oped before the deep learning tidal wave hit the NLP shore just a few years ago. We
hope this comprehensive set of material will serve as a milestone along the way as NLP researchers develop better and more advanced deep learning methods to overcome some or all of the current limitations discussed in this chapter, possibly inspired by the research directions we have analyzed here.
References
Abdel-Hamid, O., Mohamed, A., Jiang, H., Deng, L., Penn, G., and Yu, D. (2014).
Convolutional neural networks for speech recognition. IEEE/ACM Trans. Audio,
Speech and Language Processing, 22.
Amodei, D., Ng, A., et al. (2016). Deep speech 2: End-to-end speech recognition in
English and Mandarin. In Proceedings of ICML.
Bahdanau, D., Cho, K., and Bengio, Y. (2015). Neural machine translation by jointly
learning to align and translate. In Proceedings of ICLR.
Baker, J. et al. (2009a). Research developments and directions in speech recognition
and understanding. IEEE Signal Processing Magazine, 26(4).
Baker, J. et al. (2009b). Updated MINDS report on speech recognition and understanding. IEEE Signal Processing Magazine, 26(4).
Baum, L. and Petrie, T. (1966). Statistical inference for probabilistic functions of
finite state Markov chains. The Annals of Mathematical Statistics.
Bengio, Y. (2009). Learning Deep Architectures for AI. NOW Publishers.
Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. (2001). A neural probabilistic
language model. Proceedings of NIPS.
Bishop, C. (1995). Neural Networks for Pattern Recognition. Oxford University
Press.
Bishop, C. (2006). Pattern Recognition and Machine Learning. Springer.
Bridle, J. et al. (1998). An investigation of segmental hidden dynamic models of
speech coarticulation for automatic speech recognition. Final Report for 1998
Workshop on Language Engineering, Johns Hopkins University CLSP.
Brown, P. F., Della Pietra, S. A., Della Pietra, V. J., and Mercer, R. L. (1993). The
mathematics of statistical machine translation: Parameter estimation. Computa-
tional Linguistics.
Charniak, E. (2011). The brain as a statistical inference engine — and you can too.
Computational Linguistics.
Chiang, D. (2007). Hierarchical phrase-based translation. Computational Linguistics.
Chomsky, N. (1957). Syntactic Structures. Mouton.
Chorowski, J., Bahdanau, D., Serdyuk, D., Cho, K., and Bengio, Y. (2015).
Attention-based models for speech recognition. Proceedings of NIPS.
Church, K. (2007). A pendulum swung too far. Linguistic Issues in Language
Technology, 2(4).
Church, K. (2014). The case for empiricism (with and without statistics). In Proc.
Frame Semantics in NLP.
Church, K. and Mercer, R. (1993). Introduction to the special issue on computa-
tional linguistics using large corpora. Computational Linguistics, 19(1).
Collins, M. (1997). Head-Driven Statistical Models for Natural Language Parsing.
PhD thesis, University of Pennsylvania.
Collins, M. (2002). Discriminative training methods for hidden Markov models:
Theory and experiments with perceptron algorithms. In Proceedings of EMNLP.
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, and Kuksa, P.
(2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research.
Dahl, G., Yu, D., and Deng, L. (2011). Large-vocabulary continuous speech recog-
nition with context-dependent DBN-HMMs. In Proceedings of ICASSP.
Dahl, G., Yu, D., Deng, L., and Acero, A. (2012). Context-dependent pre-trained
deep neural networks for large-vocabulary speech recognition. IEEE Trans. Au-
dio, Speech, and Language Processing, 20.
Deng, L. (1998). A dynamic, feature-based approach to the interface between
phonology and phonetics for speech modeling and recognition. Speech Com-
munication, 24(4).
Deng, L. (2014). A tutorial survey of architectures, algorithms, and applications for
deep learning. APSIPA Transactions on Signal and Information Processing, 3.
Deng, L. (2016). Deep learning: from speech recognition to language and multi-
modal processing. APSIPA Transactions on Signal and Information Processing,
5.
Deng, L. (2017). Paradigms and algorithms of artificial intelligence — the historical
path and future outlook. In IEEE Signal Processing Magazine.
Deng, L., Hinton, G., and Kingsbury, B. (2013). New types of deep neural network
learning for speech recognition and related applications: An overview. Proceed-
ings of ICASSP.
Deng, L. and O’Shaughnessy, D. (2003). Speech Processing: A Dynamic and Optimization-Oriented Approach. Marcel Dekker.
Deng, L., Seltzer, M., Yu, D., Acero, A., Mohamed, A., and Hinton, G. (2010).
Binary coding of speech spectrograms using a deep autoencoder. Proceedings of
Interspeech.
Deng, L. and Yu, D. (2007). Use of differential cepstra as acoustic features in hidden
trajectory modeling for phonetic recognition. Proceedings of ICASSP.
Deng, L. and Yu, D. (2014). Deep Learning: Methods and Applications. NOW
Publishers.
Deng, L., Yu, D., and Platt, J. (2012). Scalable stacking and learning for building
deep architectures. In Proceedings of ICASSP.
Devlin, J. et al. (2015). Language models for image captioning: The quirks and
what works. Proceedings of CVPR.
Dhingra, B., Li, L., Li, X., Gao, J., Chen, Y., Ahmed, F., and Deng, L. (2017).
Towards end-to-end reinforcement learning of dialogue agents for information
access. Proceedings of ACL.
Fang, H. et al. (2015). From captions to visual concepts and back. Proceedings of
CVPR.
Fei-Fei, L. and Perona, P. (2005). A bayesian hierarchical model for learning natural
scene categories. Proceedings of CVPR.
Fei-Fei, L. and Perona, P. (2016). Stacked attention networks for image question
answering. Proceedings of CVPR.
Finn, C., Abbeel, P., and Levine, S. (2017). Model-agnostic meta-learning for fast
adaptation of deep networks. In Proceedings of ICML.
Gan, Z. et al. (2017). Semantic compositional networks for visual captioning. Pro-
ceedings of CVPR.
Gasic, M., Mrkšić, N., Rojas-Barahona, L., Su, P., Ultes, S., Vandyke, D., Wen, T., and
Young, S. (2017). Dialogue manager domain adaptation using gaussian process
reinforcement learning. Computer Speech and Language, 45.
Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.
Goodfellow, I. et al. (2017). Generative adversarial networks. In Proceedings of
NIPS.
Graves, A. et al. (2016). Hybrid computing using a neural network with dynamic
external memory. Nature.
Hashimoto, K., Xiong, C., Tsuruoka, Y., and Socher, R. (2017). A joint many-task model: Growing a neural network for multiple NLP tasks. In Proceedings of EMNLP.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image
recognition. Proc. CVPR.
He, X. and Deng, L. (2003). Maximum expected BLEU training of phrase and
lexicon translation models. Proceedings of ACL.
He, X. and Deng, L. (2012). Maximum expected BLEU training of phrase and
lexicon translation models. Proceedings of ACL.
He, X. and Deng, L. (2013). Speech-centric information processing: An
optimization-oriented approach. Proceedings of the IEEE, 101.
He, X., Deng, L., and Chou, W. (2008). Discriminative learning in sequential pattern recognition. IEEE Signal Processing Magazine, 25(5).
Hinton, G., Deng, L., Yu, D., Dahl, G., Mohamed, A.-r., Jaitly, N., Senior, A., Van-
houcke, V., Nguyen, P., Kingsbury, B., and Sainath, T. (2012). Deep neural net-
works for acoustic modeling in speech recognition. IEEE Signal Processing Mag-
azine.
Hinton, G., Osindero, S., and Teh, Y.-W. (2006). A fast learning algorithm for deep
belief nets. Neural Computation.
Hinton, G. and Salakhutdinov, R. (2012). A better way to pre-train deep boltzmann
machines. In Proceedings of NIPS.
Hochreiter, S. et al. (2001). Learning to learn using gradient descent. In Proceedings
of International Conf. Artificial Neural Networks.
Huang, J.-T., Li, J., Yu, D., Deng, L., and Gong, Y. (2013a). Cross-lingual knowl-
edge transfer using multilingual deep neural networks with shared hidden layers.
In Proceedings of ICASSP.
Huang, P. et al. (2013b). Learning deep structured semantic models for web search
using clickthrough data. Proceedings of CIKM.
Jackson, P. (1998). Introduction to Expert Systems. Addison-Wesley.
Jelinek, F. (1998). Statistical Models for Speech Recognition. MIT Press.
Juang, F. (2016). Deep neural networks: a developmental perspective. APSIPA
Transactions on Signal and Information Processing, 5.
Kaiser, L., Nachum, O., Roy, A., and Bengio, S. (2017). Learning to remember rare
events. In Proceedings of ICLR.
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. (2010).
Stacked denoising autoencoders: Learning useful representations in a deep net-
work with a local denoising criterion. The Journal of Machine Learning Re-
search.
Vinyals, O. et al. (2016). Matching networks for one shot learning. In Proceedings
of NIPS.
Viola, P. and Jones, M. (2004). Robust real-time face detection. International Jour-
nal of Computer Vision, 57.
Wang, Y.-Y., Deng, L., and Acero, A. (2011). Semantic Frame Based Spoken
Language Understanding; Chapter 3 in book: Spoken Language Understanding.
John Wiley and Sons.
Wichrowska, O. et al. (2017). Learned optimizers that scale and generalize. In
Proceedings of ICML.
Winston, P. (1993). Artificial Intelligence. Addison-Wesley.
Xiong, W. et al. (2016). Achieving human parity in conversational speech recogni-
tion. Proceedings of Interspeech.
Young, S., Gasic, M., Thomson, B., and Williams, J. (2013). POMDP-based statistical
spoken dialogue systems: a review. Proceedings of the IEEE.
Yu, D. and Deng, L. (2015). Automatic Speech Recognition: A Deep Learning Ap-
proach. Springer.
Yu, D., Deng, L., and Dahl, G. (2010). Roles of pre-training and fine-tuning in
context-dependent dbn-hmms for real-world speech recognition. NIPS Workshop.
Yu, D., Deng, L., Seide, F., and Li, G. (2011). Discriminative pre-training of deep
neural networks. U.S. Patent No. 9,235,799, granted in 2016, filed in 2011.
Zue, V. (1985). The use of speech knowledge in automatic speech recognition.
Proceedings of the IEEE, 73.