
A Beginner’s Guide to Language Models
A language model is a probability distribution over words or word
sequences. Learn more about different types of language models and
what they can do.

Written by Mór Kapronczay


Published on Dec. 13, 2022

Extracting information from textual data has changed dramatically over the past decade. As the term natural language processing has overtaken text mining as the name of the field, the methodology has changed tremendously, too. One of the main drivers of this change was the emergence of language models as a basis for many applications aiming to distill valuable insights from raw text.

LANGUAGE MODEL DEFINITION

A language model uses machine learning to produce a probability distribution over words, which is used to predict the most likely next word in a sentence based on the previous entries. Language models learn from text and can be used for producing original text, predicting the next word in a text, speech recognition, optical character recognition and handwriting recognition.

In learning about natural language processing, I’ve been fascinated by the evolution of language models over the past years. You may have heard about GPT-3 and the potential threats it poses, but how did we get this far? How can a machine produce an article that mimics a journalist?

What Is a Language Model?

A language model is a probability distribution over words or word sequences. In practice, it gives the probability of a certain word sequence being “valid.” Validity in this context does not refer to grammatical validity. Instead, it means that the sequence resembles how people write, which is what the language model learns. This is an important point. There’s no magic to a language model; like other machine learning models, particularly deep neural networks, it’s just a tool to incorporate abundant information in a concise manner that’s reusable in an out-of-sample context.
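To make this concrete, a trained language model can assign a score to any word sequence. The sketch below is a minimal illustration using the Hugging Face transformers library and the publicly available GPT-2 checkpoint; the sentences and the exact scores are assumptions, but a well-trained model should assign a higher average log-likelihood to the natural ordering than to the shuffled one.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Load a small pretrained language model and its tokenizer.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def avg_log_likelihood(text: str) -> float:
    """Average per-token log-probability the model assigns to `text`."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # With labels == input_ids, the returned loss is the mean
        # negative log-likelihood of the next-token predictions.
        outputs = model(**inputs, labels=inputs["input_ids"])
    return -outputs.loss.item()

print(avg_log_likelihood("The cat sat on the mat."))   # higher: reads like human text
print(avg_log_likelihood("Mat the on sat cat the."))   # lower: same words, invalid order
```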

MORE ON DATA SCIENCE: Basic Probability Theory and Statistics Terms to Know

What Can a Language Model Do?

The abstract understanding of natural language, which is necessary to infer word probabilities from context, can be used for a number of tasks. Lemmatization or stemming aims to reduce a word to its most basic form, thereby dramatically decreasing the number of distinct tokens. These algorithms work better if the part-of-speech role of the word is known. A verb’s suffixes can be different from a noun’s suffixes, hence the rationale for part-of-speech tagging (or POS tagging), a common task for a language model.
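As a minimal sketch of both tasks, the snippet below uses the NLTK library; the example sentence and the expected outputs are assumptions about a default NLTK setup.

```python
import nltk
from nltk.stem import WordNetLemmatizer

# One-time downloads of the tokenizer, tagger and WordNet data.
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download("wordnet")

sentence = "The striped bats were hanging on their feet"
tokens = nltk.word_tokenize(sentence)

# Part-of-speech tagging: each token gets a Penn Treebank tag.
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('striped', 'JJ'), ('bats', 'NNS'), ...]

# Lemmatization works better when the part of speech is known.
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("hanging"))            # 'hanging' (treated as a noun by default)
print(lemmatizer.lemmatize("hanging", pos="v"))   # 'hang'
```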

With a good language model, we can perform extractive or abstractive summarization of texts. If we have models for different languages, a machine translation system can be built easily. Less straightforward use cases include answering questions (with or without context, see the example at the end of the article). Language models can also be used for speech recognition, OCR, handwriting recognition and more. There’s a whole spectrum of opportunities.

Types of Language Models

There are two types of language models:

1. Probabilistic methods.

2. Neural network-based modern language models.

It’s important to note the difference between them.


PROBABILISTIC LANGUAGE MODEL

A simple probabilistic language model is constructed by calculating n-gram probabilities. An n-gram is a sequence of n words, n being an integer greater than zero. An n-gram’s probability is the conditional probability that the n-gram’s last word follows the particular (n-1)-gram that precedes it: the proportion of occurrences of that (n-1)-gram in which it is followed by the last word. This concept is a Markov assumption: given the (n-1)-gram (the present), the probability of the n-gram (the future) does not depend on the words that came before it (the past).
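In formula form, the probability is estimated from counts: P(w_n | w_1, ..., w_{n-1}) ≈ count(w_1, ..., w_n) / count(w_1, ..., w_{n-1}). A minimal bigram (n = 2) sketch in Python, using a made-up toy corpus, might look like this:

```python
from collections import Counter

# Toy corpus; in practice this would be millions of sentences.
corpus = "the cat sat on the mat the cat ate the fish".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev_word: str, word: str) -> float:
    """P(word | prev_word) = count(prev_word, word) / count(prev_word)."""
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

print(bigram_prob("the", "cat"))  # 2 occurrences of "the cat" / 4 of "the" = 0.5
print(bigram_prob("the", "dog"))  # 0.0 -- the sparsity problem in miniature
```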

There are evident drawbacks to this approach. Most importantly, only the preceding n words affect the probability distribution of the next word. Complicated texts have deep context that may have a decisive influence on the choice of the next word. Thus, what the next word is might not be evident from the previous n words, not even if n is 20 or 50. A later term can even influence an earlier word choice: the word “United” is much more probable if it is followed by “States of America.” Let’s call this the context problem.

On top of that, it’s evident that this approach scales poorly. As n increases, the number of possible word combinations skyrockets, even though most of them never occur in the text. And all the occurring probabilities (or all n-gram counts) have to be calculated and stored. In addition, non-occurring n-grams create a sparsity problem: the granularity of the probability distribution can be quite low. Word probabilities take on few distinct values, so most of the words end up with the same probability.

NEURAL NETWORK-BASED LANGUAGE MODELS

Neural network-based language models ease the sparsity problem by the way they encode inputs. Word embedding layers create an arbitrarily sized vector for each word that incorporates semantic relationships as well. These continuous vectors create the much-needed granularity in the probability distribution of the next word. Moreover, the language model is a function, as all neural networks are, with lots of matrix computations, so it’s not necessary to store all n-gram counts to produce the probability distribution of the next word.
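A minimal sketch of this idea in PyTorch is shown below: an embedding layer turns the previous words into dense vectors, and two linear layers map them to a probability distribution over the whole vocabulary. The vocabulary size, context length and layer sizes are arbitrary assumptions, and a real model would of course be trained on a large corpus.

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 10_000   # assumed vocabulary size
CONTEXT = 4           # number of previous words fed to the model
EMBED_DIM = 64
HIDDEN_DIM = 128

class TinyNeuralLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)    # dense word vectors
        self.hidden = nn.Linear(CONTEXT * EMBED_DIM, HIDDEN_DIM)
        self.out = nn.Linear(HIDDEN_DIM, VOCAB_SIZE)         # one score per word

    def forward(self, context_ids):                          # shape: (batch, CONTEXT)
        vectors = self.embed(context_ids).flatten(start_dim=1)
        scores = self.out(torch.tanh(self.hidden(vectors)))
        return torch.softmax(scores, dim=-1)                  # next-word distribution

model = TinyNeuralLM()
fake_context = torch.randint(0, VOCAB_SIZE, (1, CONTEXT))     # stand-in word IDs
next_word_probs = model(fake_context)
print(next_word_probs.shape)   # torch.Size([1, 10000]) -- a full, smooth distribution
```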

A tutorial on the basics of language models. | Video: Victor Lavrenko

Evolution of Language Models

Even though neural networks solve the sparsity problem, the context problem remains. Language models were developed, first, to solve the context problem more and more efficiently by bringing more and more context words in to influence the probability distribution. Second, the goal was to create an architecture that gives the model the ability to learn which context words are more important than others.

The first model, which I outlined previously, is a dense (or hidden) layer and an output layer stacked on top of a continuous bag-of-words (CBOW) Word2Vec model. A CBOW Word2Vec model is trained to guess a word from its context. A Skip-Gram Word2Vec model does the opposite, guessing the context from the word. In practice, a CBOW Word2Vec model requires a lot of training examples of the following structure: the inputs are the n words before and/or after the target word, which is the output. We can see that the context problem is still intact.
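As a minimal sketch, the Gensim library can train both variants; the toy sentences and hyperparameters below are assumptions, and the sg flag switches between CBOW (sg=0) and Skip-Gram (sg=1).

```python
from gensim.models import Word2Vec

# Toy corpus: a real model needs far more text to learn useful vectors.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["a", "cat", "chased", "a", "dog"],
]

# sg=0 -> CBOW (predict the word from its context),
# sg=1 -> Skip-Gram (predict the context from the word).
cbow = Word2Vec(sentences, vector_size=32, window=2, min_count=1, sg=0)

print(cbow.wv["cat"].shape)          # (32,) -- a dense vector for each word
print(cbow.wv.most_similar("cat"))   # words whose vectors are closest to "cat"
```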

RECURRENT NEURAL NETWORKS (RNN)

Recurrent neural networks (RNNs) are an improvement in this regard. Whether built from long short-term memory (LSTM) or gated recurrent unit (GRU) cells, RNNs take all previous words into account when choosing the next word. AllenNLP’s ELMo takes this notion a step further, utilizing a bidirectional LSTM, which takes into account the context both before and after the word.
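A minimal PyTorch sketch of the idea, with assumed sizes: an LSTM reads the whole prefix, and its hidden state at the last position is used to predict the next word. Loosely speaking, passing bidirectional=True is the ELMo-style trick of also reading the sequence right to left.

```python
import torch
import torch.nn as nn

VOCAB_SIZE, EMBED_DIM, HIDDEN_DIM = 10_000, 64, 128    # assumed sizes

embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
# Pass bidirectional=True (and double the Linear input size) for ELMo-style context.
lstm = nn.LSTM(EMBED_DIM, HIDDEN_DIM, batch_first=True)
to_vocab = nn.Linear(HIDDEN_DIM, VOCAB_SIZE)

tokens = torch.randint(0, VOCAB_SIZE, (1, 12))          # a 12-word stand-in sequence
hidden_states, _ = lstm(embed(tokens))                   # one state per position, each summarizing the full prefix
next_word_logits = to_vocab(hidden_states[:, -1])        # predict the word after position 12
print(next_word_logits.shape)                            # torch.Size([1, 10000])
```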

TRANSFORMERS

The main drawback of RNN-based architectures stems from their sequential nature. As a consequence, training times soar for long sequences because there is no possibility for parallelization. The solution to this problem is the transformer architecture.

The GPT models from OpenAI and Google’s BERT both utilize the transformer architecture. These models also employ a mechanism called attention, by which the model can learn which inputs deserve more attention than others in certain cases.
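As a minimal sketch using the Hugging Face transformers library (the checkpoint names are assumptions about publicly available models), both families can be tried in a few lines:

```python
from transformers import pipeline

# BERT-style model: fill in a masked word using context on both sides.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("The capital of France is [MASK]."))   # top guesses with scores

# GPT-style model: generate a continuation left to right.
generate = pipeline("text-generation", model="gpt2")
print(generate("Language models are", max_new_tokens=20))
```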

In terms of model architecture, the main quantum leaps were, first, RNNs (specifically LSTM and GRU cells), which solved the sparsity problem and reduced the disk space language models use, and, subsequently, the transformer architecture, which made parallelization possible and introduced attention mechanisms. But architecture is not the only aspect in which a language model can excel.

Compared to the GPT-1 architecture, GPT-3 has virtually nothing novel. But it’s huge. It has 175 billion parameters, and it was trained on the largest corpus a model had ever been trained on: Common Crawl. This is partly possible because of the semi-supervised training strategy of a language model: a text can be used as a training example with some words omitted. The incredible power of GPT-3 comes from the fact that it has read more or less all the text that has appeared on the internet over the past years, and it has the capability to reflect most of the complexity natural language contains.
TRAINED FOR MULTIPLE PURPOSES

Finally, I’d like to review the T5 model from Google. Previously, language models were used for standard NLP tasks, like part-of-speech (POS) tagging or machine translation, with slight modifications. With a little retraining, for instance, BERT can be a POS tagger because of its abstract ability to understand the underlying structure of natural language.

With T5, there is no need for any modifications for NLP tasks. If it gets a text with some <M> tokens in it, it knows that those tokens are gaps to fill with the appropriate words. It can also answer questions. If it receives some context after the question, it searches the context for the answer. Otherwise, it answers from its own knowledge. Fun fact: it beat its own creators in a trivia quiz.
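As a minimal sketch with the Hugging Face transformers implementation of T5 (where the gap markers referred to above as <M> tokens are spelled <extra_id_0>, <extra_id_1> and so on), filling gaps looks like this; the checkpoint name and the example output are assumptions:

```python
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# A sentence with two gaps marked by sentinel tokens.
text = "The <extra_id_0> walks in <extra_id_1> park."
inputs = tokenizer(text, return_tensors="pt")

# The model generates the words that belong in each gap, in order.
output_ids = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(output_ids[0], skip_special_tokens=False))
# e.g. "<pad> <extra_id_0> dog <extra_id_1> the <extra_id_2> ..."
```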

MORE ON LANGUAGE MODELS: NLP for Beginners: A Complete Guide

Future of Language Models

Personally, I think this is the field in which we are closest to creating AI. There’s a lot of buzz around AI, and many simple decision systems, and almost any neural network, are called AI, but this is mainly marketing. By definition, artificial intelligence involves human-like intelligence capabilities performed by a machine. While transfer learning shines in the field of computer vision, and the notion of transfer learning is essential for an AI system, the very fact that the same model can do a wide range of NLP tasks and can infer what to do from the input is itself spectacular. It brings us one step closer to actually creating human-like intelligence systems.
