INTELLIPAAT - 2024-01-20 - Transformers Cont. and Autoencoders

The document discusses different types of tokenization used in natural language processing, including word, character, and subword tokenization. It explains how tokenization breaks down text into smaller units to make it easier for algorithms to analyze language and perform tasks like machine translation, sentiment analysis, and chatbots.


In generative question answering, the answer is rephrased (generated) rather than copied word for word from the source text.

In the world of Natural Language Processing (NLP), tokens are the building blocks! They're essentially smaller units that we break
down text into, making it easier for computers to understand and analyze. Think of it like chopping up a giant pizza into slices – it's
much easier to manage and consume that way.

There are different ways to slice that pizza (text), depending on the task at hand:

- Word Tokenization: This is the most common type, where we simply split the text into individual words. For example, "The quick brown fox jumps over the lazy dog" becomes ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"].
- Character Tokenization: Here, we go even smaller, breaking down the text into individual characters. So our example sentence would become ["T", "h", "e", " ", "q", "u", "i", "c", "k", ...].
- Subword Tokenization: This takes a middle ground, splitting text into meaningful units that are smaller than words but larger than characters. This is particularly useful for languages with complex morphology (like German) or for dealing with rare words. A short Python sketch of all three styles follows this list.
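
As a rough illustration, here is a small Python sketch of the three styles. The word and character splits use plain Python; the subword split assumes the Hugging Face transformers package and a BERT WordPiece tokenizer, which are assumptions for illustration, not something from the lecture notes.

sentence = "The quick brown fox jumps over the lazy dog"

# Word tokenization: split on whitespace
word_tokens = sentence.split()
# ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']

# Character tokenization: every character (including spaces) becomes a token
char_tokens = list(sentence)
# ['T', 'h', 'e', ' ', 'q', 'u', 'i', 'c', 'k', ...]

# Subword tokenization: pieces smaller than words but larger than characters.
# Assumes `pip install transformers`; downloads the BERT vocabulary on first use.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
subword_tokens = tokenizer.tokenize("tokenization of rare words")
# rare or long words get split into pieces, e.g. 'token', '##ization'
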
Now, why is tokenization so important? Well, computers don't naturally understand language the way humans do. By breaking
down text into smaller, more manageable units, we make it easier for NLP algorithms to identify patterns, analyze word
relationships, and perform various tasks like:

- Machine translation: Translating languages involves understanding the meaning of one sentence and expressing it in another. Tokenization helps break down the meaning into smaller pieces that can be translated more accurately.
- Sentiment analysis: Identifying the emotional tone of a text requires understanding the relationships between words. Tokenization allows us to analyze these relationships and determine if a sentence is positive, negative, or neutral.
- Chatbots and virtual assistants: These systems need to understand what users are saying in order to respond appropriately. Tokenization helps break down user queries into meaningful units that the chatbot can process and respond to.

So, the next time you interact with an NLP system, remember the tiny tokens behind the scenes, diligently working to help
computers understand and respond to human language!

BERT, which stands for Bidirectional Encoder Representations from Transformers, is based on the Transformer, a deep learning model in which every output element is connected to every input element and the weightings between them are calculated dynamically, via the attention mechanism, based on how relevant the elements are to each other.
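
To make the "dynamically calculated weightings" concrete, here is a minimal NumPy sketch of scaled dot-product self-attention, the mechanism the Transformer uses; the toy shapes and random inputs are illustrative only, not from the lecture.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # every query scored against every key
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax: each row sums to 1
    return weights @ V, weights                      # output = weighted sum of the values

# Toy self-attention over 3 token vectors of size 4 (Q = K = V = x)
rng = np.random.default_rng(0)
x = rng.standard_normal((3, 4))
output, weights = scaled_dot_product_attention(x, x, x)
print(weights)   # a 3x3 matrix of dynamically computed connection weights
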

Example context passage used for question answering: "It took place in Qatar from 18 November to 20 December 2022, after the country was awarded hosting rights in 2010."
[SEP] is a special separator tag; it marks the boundary between the question and the context passage.

When you hit encode:

input_ids = tokenizer.encode(question, text)

it encodes the question and the context into token IDs. Behind the scenes, each of these tokens is mapped to an embedding.
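
Putting the pieces together, here is a minimal sketch of extractive question answering with BERT. It assumes the Hugging Face transformers library, PyTorch, and a SQuAD-fine-tuned checkpoint; the model name and the answer shown in the final comment are illustrative assumptions, not from the lecture.

import torch
from transformers import BertTokenizer, BertForQuestionAnswering

# Assumed checkpoint: BERT fine-tuned on SQuAD for extractive QA
model_name = "bert-large-uncased-whole-word-masking-finetuned-squad"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForQuestionAnswering.from_pretrained(model_name)

question = "When did the tournament take place?"
text = ("It took place in Qatar from 18 November to 20 December 2022, "
        "after the country was awarded hosting rights in 2010.")

# encode() builds [CLS] question [SEP] context [SEP] and maps the tokens to IDs
input_ids = tokenizer.encode(question, text)

with torch.no_grad():
    outputs = model(torch.tensor([input_ids]))

# The model scores every token as a possible start/end of the answer span
start = int(torch.argmax(outputs.start_logits))
end = int(torch.argmax(outputs.end_logits)) + 1
answer = tokenizer.decode(input_ids[start:end])
print(answer)   # expected: a span like "18 november to 20 december 2022"
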

Semantic search:

1:05:45 –
We can perform a similarity exercise, such as cosine similarity, between the question and the candidate sentences.

Vector databases support similarity search.

Algorithms that support this kind of search include Locality Sensitive Hashing (LSH) and Approximate Nearest Neighbors (ANN). A small cosine-similarity sketch follows below.
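
As a sketch of the similarity exercise, the snippet below embeds a question and a few candidate sentences and ranks them by cosine similarity. It assumes the sentence-transformers package and the all-MiniLM-L6-v2 model, which are assumptions rather than part of the lecture; at scale, a vector database or an ANN/LSH index would replace the brute-force loop.

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # assumed embedding model

sentences = [
    "The tournament took place in Qatar in 2022.",
    "Autoencoders compress an input and try to reconstruct it.",
    "Tokenization splits text into smaller units.",
]
question = "When was the World Cup held?"

sentence_vecs = model.encode(sentences)     # one dense vector per sentence
question_vec = model.encode([question])[0]

def cosine(a, b):
    # cosine similarity = dot product divided by the product of the norms
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

scores = [cosine(question_vec, v) for v in sentence_vecs]
best = int(np.argmax(scores))
print(sentences[best], scores[best])   # the most semantically similar sentence
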
In generative question answering the final output is sent to a decoder, which generates the answer text.
Extractive question answering versus generative question answering: extractive QA returns a span copied directly from the context, while generative QA rephrases the answer through a decoder.
1:49:47

The encoder compresses the input and the decoder is trained to recreate the original input.
Even if there is a minor amount of loss, it should still be able to reconstruct the image well enough.

We then calculate the reconstruction error and backpropagate from that error.

The whole thing put together becomes an autoencoder.

PCA produces a linear combination of the inputs, while an autoencoder learns a nonlinear combination of the inputs (through its nonlinear activations).

from tensorflow.keras.datasets import mnist
(x_train, _), (x_test, _) = mnist.load_data()

We don’t need the target variable in the above as the data set itself is the target variable.
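
Here is a minimal sketch of a dense autoencoder on MNIST in Keras, continuing from the load_data() line above. It assumes TensorFlow 2.x; the layer sizes, epochs, and loss are illustrative choices, not values from the lecture.

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.datasets import mnist

(x_train, _), (x_test, _) = mnist.load_data()

# Scale pixels to [0, 1] and flatten each 28x28 image into a 784-dim vector
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0
x_train = x_train.reshape((len(x_train), 784))
x_test = x_test.reshape((len(x_test), 784))

# Encoder compresses 784 -> 32 (nonlinear, unlike PCA); decoder reconstructs 32 -> 784
inputs = keras.Input(shape=(784,))
encoded = layers.Dense(32, activation="relu")(inputs)
decoded = layers.Dense(784, activation="sigmoid")(encoded)
autoencoder = keras.Model(inputs, decoded)

autoencoder.compile(optimizer="adam", loss="binary_crossentropy")

# The input is also the target: the network learns to reconstruct x from x,
# and the reconstruction error is what gets backpropagated
autoencoder.fit(x_train, x_train,
                epochs=10, batch_size=256,
                validation_data=(x_test, x_test))

reconstructions = autoencoder.predict(x_test[:10])
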
