
LLM Models and RAG

Hands-on guide

Mohamed El-Zahaby
V 0.1
April 2024
This guide is primarily for technical teams developing a basic conversational AI solution
with RAG. It offers an introduction to the technical aspects.

It helps anyone with a basic technical background get started in the AI domain.

The guide combines theoretical, foundational knowledge with code implementation.

Note that most of the content is compiled from various online resources; considerable
effort went into curating and organizing this information from numerous sources.
Contents
INTRODUCTION ........................................................................................................................................8
What is Conversational AI? ..........................................................................................................................9
The Technology Behind Conversational AI .................................................................................................9
1. Speech-to-text: ..........................................................................................................................................9
2. Language processing: ...............................................................................................................................9
3. Text-To-Speech (TTS): ..........................................................................................................................10
4.Context and Multi-turn conversations: ....................................................................................................10
5. Dialogue policy: .....................................................................................................................................10
LLM Basics.................................................................................................................................................12
What is a large language model (LLM)? ....................................................................................................13
How do LLMs work? .................................................................................................................................13
Machine learning and deep learning ...........................................................................................................13
Neural networks..........................................................................................................................................14
Transformer models....................................................................................................................................14
What are the Relations and Differences between LLMs and Transformers? .............................................15
Transformers...............................................................................................................................................15
LLM (Large Language Model)...................................................................................................................15
Relation and Differences between LLMs and Transformers ......................................................................16
What are Pipelines in Transformers?..........................................................................................................17
What are Hugging Face Transformers? ......................................................................................................18
Hugging Face provides: ..............................................................................................................................18
Chains .........................................................................................................................................................19
What are chains?.........................................................................................................................................20
Some reasons you may want to use chains: ................................................................................................20
Foundational chain types in LangChain .....................................................................................................20
LLMChain ..................................................................................................................................................21
Creating an LLMChain...............................................................................................................................22
Sequential Chains .......................................................................................................................................26
SimpleSequentialChain ..............................................................................................................................26
SequentialChain ..........................................................................................................................................28
Transformation ...........................................................................................................................................30
Prompt Engineering.....................................................................................................................................36
What is Prompt Engineering? .....................................................................................................................37
Prompt ........................................................................................................................................................37
Types of Prompts ........................................................................................................................................38
Instruction Prompting .................................................................................................................................38
Role Prompting ...........................................................................................................................................39
“Standard” Prompting.................................................................................................................................41
Chain of Thought (CoT) Prompting ...........................................................................................................41
Recommendations and Tips for Prompt Engineering with OpenAI API ...................................................43
Embeddings.................................................................................................................................................48
A problem with semantic search.................................................................................................................49
What are embeddings?................................................................................................................................50
What is a vector in machine learning?........................................................................................................51
How do embeddings work? ........................................................................................................................53
How are embeddings used in large language models (LLMs)?..................................................................54
Vector Stores ...............................................................................................................................................55
What Are Vector Databases? ......................................................................................................................56
The Benefits of Using Open Source Vector Databases ..............................................................................56
Open Source Vector Databases Comparison: Chroma Vs. Milvus Vs. Weaviate ......................................57
1. Chroma ...................................................................................................................................................57
2. Milvus .....................................................................................................................................................58
3. Weaviate .................................................................................................................................................58
4.Faiss: ........................................................................................................................................................59
Chunking.....................................................................................................................................................63
Document Splitting .....................................................................................................................................64
Chunking Methods .....................................................................................................................................65
Character Splitting ......................................................................................................................................66
Recursive Character Text Splitting.............................................................................................................71
Split by Tokens ...........................................................................................................................................75
Tiktoken Tokenizer.....................................................................................................................................75
Hugging Face Tokenizer ............................................................................................................................75
Other Tokenizer ..........................................................................................................................................76
Things to Keep in Mind ..............................................................................................................................76
Quantization ................................................................................................................................................77
What is Quantization? ................................................................................................................................78
How does quantization work? ....................................................................................................................78
Hugging Face and Bitsandbytes Uses.........................................................................................................78
Loading a Model in 4-bit Quantization ......................................................................................................79
Loading a Model in 8-bit Quantization ......................................................................................................80
Changing the Compute Data Type .............................................................................................................80
Using NF4 Data Type .................................................................................................................................81
Nested Quantization for Memory Efficiency .............................................................................................81
Loading a Quantized Model from the Hub .................................................................................................81
Exploring Advanced techniques and configuration ....................................................................................82
Fine-Tuning a Model Loaded in 8-bit ........................................................................................................82
Temperature ................................................................................................................................................83
Top P and Temperature ..............................................................................................................................84
Temperature ................................................................................................................................................84
Top p ...........................................................................................................................................................84
Token length ...............................................................................................................................................85
Max tokens .................................................................................................................................................85
Stop tokens .................................................................................................................................................86
Langchain Memory .....................................................................................................................................87
What is Conversational memory?...............................................................................................................88
ConversationChain .....................................................................................................................................89
Forms of Conversational Memory ..............................................................................................................91
ConversationBufferMemory .......................................................................................................................91
ConversationSummaryMemory..................................................................................................................96
ConversationBufferWindowMemory .......................................................................................................103
ConversationSummaryBufferMemory .....................................................................................................108
Other Memory Types................................................................................................................................110
Agents & Tools .........................................................................................................................................111
Tools .........................................................................................................................................................112
Agents .......................................................................................................................................................112
Chains .......................................................................................................................................................113
Memory ....................................................................................................................................................114
Callback Handlers.....................................................................................................................................116
Walkthrough — Project Utilizing Langchain ...........................................................................................116
RAG ..........................................................................................................................................................122
The Curse Of The LLMs ..........................................................................................................................123
The Challenge ...........................................................................................................................................123
What is RAG?...........................................................................................................................................123
How does RAG help? ...............................................................................................................................125
New RAG techniques :- ........................................................................................................................126
groq ...........................................................................................................................................................128
What is groq? ............................................................................................................................................129
What is LPU?............................................................................................................................................129
How Groq's LPU Works ...........................................................................................................................129
How LPU is different from GPU ..............................................................................................................131
Groq Tools ................................................................................................................................................133
Groq and RAG Architecture Example ......................................................................................................137
What is LlamaParse ? ...............................................................................................................................139
Use Case – 1..............................................................................................................................................157
Conversational AI chatbot ........................................................................................................................158
implementation-1-A4000..........................................................................................................................158
implementation-2-A100............................................................................................................................175
implementation-3-groq .............................................................................................................................175
implementation-4-llama3-A4000 .............................................................................................................189
Use Case – 2..............................................................................................................................................192
Action integration with chatbot (google calendar booking) .....................................................................193
Source Code ..............................................................................................................................................198
INTRODUCTION
What is Conversational AI?
Conversational AI means using technology with artificial intelligence to make machines
talk to people. Basically, it figures out what someone says or writes and responds
naturally to keep the conversation going. Thanks to recent improvements, machines can
now have smart and natural conversations with humans.

The Technology Behind Conversational AI


Conversational AI relies on various components to function, spanning from speech
recognition to intent detection and concluding with a spoken or written response. The
following components constitute the core of the conversational AI technology stack:

1. Speech-to-text:
- This technology converts spoken words into text transcriptions.

2. Language processing:
2.1 Natural Language Understanding (NLU):

- NLU is the process by which technology comprehends natural human language.

- Especially crucial in voice interactions where speakers may not use specific keywords or
share longer stories.

2.2. Intent:

- Intents within conversational AI determine the actions triggered based on conversational inputs.

2.3. Intent detection:

- This process involves the bot correctly identifying the intent behind an utterance.

- More challenging in voice compared to text due to the tendency for longer stories in
speech.
2.4. Value extraction:

- AI agents extract relevant information from customer queries and store them against
corresponding 'slots.'

- Vital for handling multiple values in a single speech, ensuring natural conversations.

3. Text-To-Speech (TTS):
- This technology converts written text into spoken utterances.

- Off-the-shelf solutions may sound robotic, but voice actors can be used for natural
responses.

4. Context and Multi-turn conversations:


- Conversational bots need to maintain context across multiple turns for natural-feeling
conversations.

- Particularly important in voice interactions where chat history isn't displayed to the
customer.

- Each back-and-forth interaction in a conversation is a 'turn.'

- Multi-turn conversations involve more than one interaction, contributing to a comprehensive dialogue.

5. Dialogue policy:
- Dialogue policy guides the flow of a conversation, allowing the bot to intelligently
navigate a transaction.

- A robust dialogue policy accommodates interruptions, such as clarifying questions, and enhances the user experience.
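
To make the flow above concrete, here is a hypothetical sketch of one conversational turn. The function names (speech_to_text, detect_intent, extract_values, dialogue_policy, text_to_speech) are placeholders for the components described above, not a real library API.

def handle_turn(audio, context):
    # 1. Speech-to-text: transcribe the user's spoken input
    text = speech_to_text(audio)
    # 2. Language processing: NLU, intent detection, and value extraction
    intent = detect_intent(text, context)
    slots = extract_values(text, intent)
    # 4. Context and multi-turn: remember what happened in this turn
    context.update(intent=intent, slots=slots)
    # 5. Dialogue policy: decide what the bot should do or say next
    reply = dialogue_policy(intent, slots, context)
    # 3. Text-to-speech: speak the response back to the user
    return text_to_speech(reply), context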

ref:

https://spotintelligence.com/2024/01/30/conversational-ai-explained-top-9-tools-how-to-guide-including-gpt/

https://i0.wp.com/spotintelligence.com/wp-content/uploads/2024/01/key-components-of-conversational-ai-1024x576.webp?resize=1024,576&ssl=1
LLM Basics
What is a large language model (LLM)?
A large language model (LLM) is a type of [artificial intelligence
(AI)](https://www.cloudflare.com/learning/ai/what-is-artificial-intelligence/ ) program
that can recognize and generate text, among other tasks. LLMs are trained on [huge sets
of data](https://www.cloudflare.com/learning/ai/big-data/ ) — hence the name "large."
LLMs are built on [machine learning](https://www.cloudflare.com/learning/ai/what-is-
machine-learning/ ): specifically, a type of [neural
network](https://www.cloudflare.com/learning/ai/what-is-neural-network/ ) called a
transformer model.

In simpler terms, an LLM is a computer program that has been fed enough examples to be
able to recognize and interpret human language or other types of complex data. Many
LLMs are trained on data that has been gathered from the Internet — thousands or
millions of gigabytes' worth of text. But the quality of the samples impacts how well
LLMs will learn natural language, so an LLM's programmers may use a more curated data
set.

LLMs use a type of machine learning called [deep learning](https://www.cloudflare.com/learning/ai/what-is-deep-learning/ ) in order to
understand how characters, words, and sentences function together. Deep learning
involves the probabilistic analysis of unstructured data, which eventually enables the deep
learning model to recognize distinctions between pieces of content without human
intervention.

LLMs are then further trained via tuning: they are fine-tuned or prompt-tuned to the
particular task that the programmer wants them to do, such as interpreting questions and
generating responses, or translating text from one language to another.

How do LLMs work?


Machine learning and deep learning

At a basic level, LLMs are built on machine learning. Machine learning is a subset of AI,
and it refers to the practice of feeding a program large amounts of data in order to train the
program how to identify features of that data without human intervention.

LLMs use a type of machine learning called deep learning. Deep learning models can
essentially train themselves to recognize distinctions without human intervention,
although some human fine-tuning is typically necessary.
Deep learning uses probability in order to "learn." For instance, in the sentence "The quick
brown fox jumped over the lazy dog," the letters "e" and "o" are the most common,
appearing four times each. From this, a deep learning model could conclude (correctly)
that these characters are among the most likely to appear in English-language text.

Realistically, a deep learning model cannot actually conclude anything from a single
sentence. But after analyzing trillions of sentences, it could learn enough to predict how to
logically finish an incomplete sentence, or even generate its own sentences.
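
As a quick sanity check of the letter-counting example above, a few lines of Python reproduce the counts:

from collections import Counter

sentence = "The quick brown fox jumped over the lazy dog"
counts = Counter(c for c in sentence.lower() if c.isalpha())
print(counts["e"], counts["o"])  # both letters appear 4 times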

Neural networks
In order to enable this type of deep learning, LLMs are built on neural networks. Just as
the human brain is constructed of neurons that connect and send signals to each other, an
artificial neural network (typically shortened to "neural network") is constructed of
network nodes that connect with each other. They are composed of several "layers": an
input layer, an output layer, and one or more layers in between. The layers only pass
information to each other if their own outputs cross a certain threshold.

Transformer models
The specific kind of neural networks used for LLMs are called transformer models.
Transformer models are able to learn context — especially important for human language,
which is highly context-dependent. Transformer models use a mathematical technique
called self-attention to detect subtle ways that elements in a sequence relate to each other.
This makes them better at understanding context than other types of machine learning. It
enables them to understand, for instance, how the end of a sentence connects to the
beginning, and how the sentences in a paragraph relate to each other.

This enables LLMs to interpret human language, even when that language is vague or
poorly defined, arranged in combinations they have not encountered before, or
contextualized in new ways. On some level they "understand" semantics in that they can
associate words and concepts by their meaning, having seen them grouped together in that
way millions or billions of times.
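
To give a feel for what self-attention computes, here is a minimal NumPy sketch. It is a toy single-head version that reuses the token vectors as queries, keys, and values; a real transformer learns separate projection matrices and uses many attention heads.

import numpy as np

def self_attention(X):
    # X holds one vector per token (n_tokens x d)
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)  # how strongly each token attends to every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ X  # each output is a context-aware mix of all token vectors

tokens = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # three toy "token" embeddings
print(self_attention(tokens))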

ref: https://www.cloudflare.com/learning/ai/what-is-large-language-model/
What are the Relations and Differences
between LLMs and Transformers?

Transformers
Transformers have gained a lot of popularity in the field of natural language processing (NLP). They are
good at understanding the relationships between words in a sentence or sequence of text.
Unlike traditional models such as RNNs, transformers do not rely on sequential processing,
which allows them to compute in parallel and process sentences more efficiently.
Overall, they are powerful models that excel at capturing relationships between words
and have modernized the NLP field.

Imagine a sentence: "The cat sat on the mat." A transformer breaks down this sentence
into smaller units called "tokens" (e.g., "The," "cat," "sat," "on," "the," "mat," and
punctuation marks). Each token is represented as a vector, capturing its meaning and
context. The transformer then learns to analyse the relationships between these tokens to
understand the sentence's overall meaning.
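
As a small illustration of this tokenization step, the snippet below runs the example sentence through a standard BERT tokenizer from Hugging Face; the exact tokens depend on the chosen vocabulary.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("The cat sat on the mat.")
print(tokens)  # e.g. ['the', 'cat', 'sat', 'on', 'the', 'mat', '.']
print(tokenizer.convert_tokens_to_ids(tokens))  # each token maps to an integer id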

Example models,

- BERT (Bidirectional Encoder Representations from Transformers)

- GPT (Generative Pre-trained Transformer)

- T5 (Text-to-Text Transfer Transformer)

- DialoGPT

LLM (Large Language Model)


An LLM is a specific type of transformer that has been trained on vast amounts of text data. It has
learned to predict the next word in a sentence given the context of the previous words.
This ability allows LLMs to generate contextually correct text.

For instance, if you provide the prompt "Once upon a time in a land far" an LLM can
generate the next words as "away." The LLM bases its predictions on the patterns and
context it has learned during training on massive amounts of text. This makes LLMs
useful for various applications, such as auto-completion, translation, summarization, and
even creative writing.

- GPT-3.5 Turbo & GPT-4 by OpenAI

- BLOOM by BigScience

- LaMDA by Google

- MT-NLG by Nvidia/Microsoft

- LLaMA by Meta AI

Relation and Differences between LLMs and Transformers
Transformers and LLMs (large language models) are related concepts: LLMs are a
specific type of model built on the transformer architecture. While transformers,
in general, can be used for various tasks beyond language modeling, LLMs are
specifically trained to generate text and understand natural language (there can be
exceptions, as this field is evolving quickly and the pace of research and funding is
unprecedented).

The main differences between transformers and LLMs lie in their specific purposes and
training objectives. Transformers are a broader class of models that can be applied to
various tasks, including language translation, speech recognition, and image captioning,
while LLMs are focused on language modeling and text generation (there are some
exceptions). Transformers serve as the underlying architecture that enables LLMs to
understand and generate text by capturing contextual relationships and long-range
dependencies. Transformers are more general-purpose models, whereas LLMs are
specifically trained and optimized for language modeling and generation tasks.

Transformer models can also be divided into three categories: encoders, decoders, and
encoder-decoder architectures. This categorization is based on the different roles these
components play in the model's overall function. Encoders aim to understand the input
sequence. They focus on processing the input and capturing its meaning and
context. Decoders, on the other hand, generate output based on the information learned
by the encoder. They take the encoded representations and produce the desired output
sequence. Encoder-decoder models combine both encoder and decoder components.
They are used in tasks where the input and output sequences have different lengths or
meanings. The encoder understands the input sequence, and the decoder generates the
corresponding output sequence.
ref: https://www.linkedin.com/pulse/transformers-llms-next-frontier-ai-vijay-chaudhary/

What are Pipelines in Transformers?


- They provide an easy-to-use API, through the pipeline() method, for performing inference
over a variety of tasks.

- They are used to encapsulate the overall process of every Natural Language Processing
task, such as text cleaning, tokenization, embedding, etc.

The pipeline() method has the following structure:

from transformers import pipeline


# To use a default model & tokenizer for a given task (e.g. question-answering)
pipeline("task-name")

# To use an existing model
pipeline("task-name", model="model_name")

# To use a custom model and tokenizer
pipeline("task-name", model="model_name", tokenizer="tokenizer_name")

>This code snippet is using the transformers library to create a pipeline for natural
language processing tasks such as question-answering.

- The first line imports the pipeline function from the transformers library.

- The next three lines show how to use the pipeline function for different scenarios.

- The first scenario uses a default model and tokenizer for a given task, which is specified
in the placeholder "task-name".

- The second scenario uses an existing model, which is specified in the placeholder
"model_name", for the same task as in the first scenario.

- The third scenario uses a custom model and tokenizer, which are specified in the
placeholders "model_name" and "tokenizer_name", respectively, for the same task as in
the first two scenarios.

- Overall, the pipeline function allows for easy implementation of natural language
processing tasks with various models and tokenizers.
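
For a concrete, runnable illustration (the model name below is just one example checkpoint from the Hugging Face Hub, not something mandated by the text above):

from transformers import pipeline

# Sentiment analysis with an explicitly named model from the Hub
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
print(classifier("This guide makes transformers easy to follow."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]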

ref: https://www.datacamp.com/tutorial/an-introduction-to-using-transformers-and-hugging-face

What are Hugging Face Transformers?


[Hugging Face Transformers](https://huggingface.co/docs/transformers/index ) is an
open-source framework for deep learning created by Hugging Face. It provides APIs and
tools to download state-of-the-art pre-trained models and further tune them to maximize
performance. These models support common tasks in different modalities, such as natural
language processing, computer vision, audio, and multi-modal applications.

For many applications, such as sentiment analysis and text summarization, pre-trained
models work well without any additional model training.

Hugging Face Transformers pipelines encode best practices and have default models
selected for different tasks, making it easy to get started. Pipelines make it easy to use
GPUs when available and allow batching of items sent to the GPU for better throughput
performance.

Hugging Face provides:


- A [model hub](https://huggingface.co/models ) containing many pre-trained models.

- The [🤗 Transformers library](https://huggingface.co/docs/transformers/index ) that supports the download and use of these models for NLP applications and fine-tuning. It is common to need both a tokenizer and a model for natural language processing tasks.

- [🤗 Transformers pipelines](https://huggingface.co/docs/transformers/v4.26.1/en/pipeline_tutorial ) that have a simple interface for most natural language processing tasks.
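
A minimal sketch of loading both a tokenizer and a model from the model hub; the checkpoint name is an example, and any compatible hub model would work:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "distilbert-base-uncased-finetuned-sst-2-english"  # example hub checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

inputs = tokenizer("Hugging Face makes model reuse simple.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[int(logits.argmax())])  # e.g. POSITIVE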

ref: https://docs.databricks.com/en/machine-learning/train-model/huggingface/index.html
Chains
What are chains?
A chain is an end-to-end wrapper around multiple individual components executed in a
defined order.

Chains are one of the core concepts of LangChain. Chains allow you to go beyond just a
single API call to a language model and instead chain together multiple calls in a logical
sequence.

They allow you to combine multiple components to create a coherent application.

Some reasons you may want to use chains:


- To break down a complex task into smaller steps that can be handled sequentially by
different models or utilities. This allows you to leverage the different strengths of different
systems.

- To add state and memory between calls. The output of one call can be fed as input to
the next call to provide context and state.

- To add additional processing, filtering or validation logic between calls.

- For easier debugging and instrumentation of a sequence of calls.

Foundational chain types in LangChain


The `LLMChain`, `RouterChain`, `SimpleSequentialChain`, and `TransformChain` are
considered the core foundational building blocks that many other more complex chains
build on top of. They provide basic patterns like chaining LLMs, conditional logic,
sequential workflows, and data transformations.

• `LLMChain`: Chains together multiple calls to language models. Useful for breaking
down complex prompts.

• `RouterChain`: Allows conditionally routing between different chains based on logic. Enables branching logic.

• `SimpleSequentialChain`: Chains together multiple chains in sequence. Useful for linear workflows.

• `TransformChain`: Applies a data transformation between chains. Helpful for data munging and preprocessing.

Other key chain types like `Agents` and `RetrievalChain` build on top of these
foundations to enable more advanced use cases like goal-oriented conversations and
knowledge-grounded generation.

However, the foundational four provide the basic patterns for chain construction in
LangChain.

LLMChain

The most commonly used type of chain is an LLMChain.

The LLMChain consists of a PromptTemplate, a language model, and an optional output
parser. For example, you can create a chain that takes user input, formats it with a
PromptTemplate, and then passes the formatted response to an LLM. You can build more
complex chains by combining multiple chains, or by combining chains with other
components.

The main differences between using an LLMChain versus directly passing a prompt to an
LLM are:

- LLMChain allows chaining multiple prompts together, while directly passing a prompt
only allows one. With LLMChain, you can break down a complex prompt into multiple
more straightforward prompts and chain them together.

- LLMChain maintains state and memory between prompts. The output of one prompt
can be fed as input to the following prompt to provide context. Directly passing prompts
lacks this memory.

- LLMChain makes adding preprocessing logic, validation, and instrumentation between
prompts easier. This helps with debugging and quality control.

- LLMChain provides some convenience methods like `apply` and `generate` that make
it easy to run the chain over multiple inputs.
Creating an LLMChain
To create an LLMChain, you need to specify:

- The language model to use

- The prompt template

Code Example:

from langchain import PromptTemplate, OpenAI, LLMChain


# the language model
llm = OpenAI(temperature=0)
# the prompt template
prompt_template = "Act like a comedian and write a super funny two-sentence short story about {thing}?"
llm_chain = LLMChain(
llm=llm,
prompt=PromptTemplate.from_template(prompt_template)
)
llm_chain("A toddler hiding his dad's laptop")

{'thing': "A toddler hiding his dad's laptop",


'text': '\n\nThe toddler thought he was being sneaky, but little did he know his
dad was watching the whole time from the other room, laughing.'}

Use `apply` when you have a list of inputs and want to get the LLM to generate text for
each one, it will run the LLMChain for every input dictionary in the list and return a list of
outputs.

input_list = [
{"thing": "a Punjabi rapper who eats too many samosas"},
{"thing": "a blind eye doctor"},
{"thing": "a data scientist who can't do math"}
]
llm_chain.apply(input_list)

[{'text': "\n\nThe Punjabi rapper was so famous that he was known as the 'Samosa
King', but his fame was short-lived when he ate so many samosas that he had to be
hospitalized for a stomachache!"},
{'text': "\n\nA blind eye doctor was so successful that he was able to cure his own
vision - but he still couldn't find his glasses."},
{'text': '\n\nA data scientist was so bad at math that he had to hire a calculator
to do his calculations for him. Unfortunately, the calculator was even worse at math
than he was!'}]

`generate` is similar to `apply`, except it returns an `LLMResult` instead of a string. Use
this when you want the entire `LLMResult` object returned, not just the generated text.
This gives you access to metadata like the number of tokens used.
llm_chain.generate(input_list)

LLMResult(generations=
[[Generation(text="\n\nThe Punjabi rapper was so famous that he was known as the
'Samosa King',
but his fame was short-lived when he ate so many samosas that he had to be
hospitalized for a stomachache!",
generation_info={'finish_reason': 'stop', 'logprobs': None})],

[Generation(text="\n\nA blind eye doctor was so successful that he was able to cure
his own vision - but he still couldn't find his glasses.",
generation_info={'finish_reason': 'stop', 'logprobs': None})],

[Generation(text='\n\nA data scientist was so bad at math that he had to hire a


calculator to do his calculations for him. Unfortunately, the calculator was even
worse at math than he was!', generation_info={'finish_reason': 'stop', 'logprobs':
None})]],

llm_output={'token_usage': {'prompt_tokens': 75, 'total_tokens': 187, 'completion_tokens': 112}, 'model_name': 'text-davinci-003'},
run=[RunInfo(run_id=UUID('b638d2c6-77d9-4346-8494-866892e36bc5')),
RunInfo(run_id=UUID('427f9e51-4848-49d3-83c1-e96131f2b34f')),
RunInfo(run_id=UUID('4201eea9-1616-42e7-8cb2-a5b26128decd'))])

Use `predict` when you want to pass inputs as keyword arguments instead of a dictionary.
This can be convenient if you don’t want to construct an input dictionary.

llm_chain.predict(thing="colorful socks")

The socks were so colorful that when the washing machine finished its cycle, the socks
had formed a rainbow in the laundry basket!

Use `LLMChain.run` when you want to pass the input as a dictionary and get the raw
text output from the LLM.

`LLMChain.run` is convenient when your LLMChain has a single input key and a single
output key.

llm_chain.run("the red hot chili peppers")

['1. Wear a Hawaiian shirt\n2. Sing along to the wrong lyrics\n3. Bring a beach ball
to the concert\n4. Try to start a mosh pit\n5. Bring a kazoo and try to join in on
the music']
Parsing output

To parse the output, you simply pass an output parser directly to `LLMChain`.

from langchain.output_parsers import CommaSeparatedListOutputParser


llm = OpenAI(temperature=0)
# the prompt template

prompt_template = "Act like a Captain Obvious and list 5 funny things to not do at {place}?"

output_parser=CommaSeparatedListOutputParser()
llm_chain = LLMChain(
llm=llm,
prompt=PromptTemplate.from_template(prompt_template),
output_parser= output_parser
)

llm_chain.predict(place='Disneyland')

['1. Wear a costume of a Disney villain.\n2. Bring your own food and drinks into the
park.\n3. Try to ride the roller coasters without a ticket.\n4. Try to sneak into
the VIP area.\n5. Try to take a selfie with a Disney character without asking
permission.']

Router Chains

Router chains allow routing inputs to different destination chains based on the input text.
This allows the building of chatbots and assistants that can handle diverse requests.

- Router chains examine the input text and route it to the appropriate destination chain

- Destination chains handle the actual execution based on the input

- Router chains are powerful for building multi-purpose chatbots/assistants

The following example will show routing chains used in a `MultiPromptChain` to create a
question-answering chain that selects the prompt which is most relevant for a given
question and then answers the question using that prompt.
from langchain.chains.router import MultiPromptChain
from langchain.llms import OpenAI
from langchain.chains import ConversationChain
from langchain.chains.llm import LLMChain
from langchain.prompts import PromptTemplate

llm = OpenAI(temperature=0)  # the language model shared by the destination chains below

physics_template = """You are a very smart physics professor. \


You are great at answering questions about physics in a concise and easy to
understand manner. \
When you don't know the answer to a question you admit that you don't know.

Here is a question:
{input}"""

math_template = """You are a very good mathematician. You are great at answering
math questions. \
You are so good because you are able to break down hard problems into their
component parts, \
answer the component parts, and then put them together to answer the broader
question.

Here is a question:
{input}"""

prompt_infos = [
{
"name": "physics",
"description": "Good for answering questions about physics",
"prompt_template": physics_template,
},
{
"name": "math",
"description": "Good for answering math questions",
"prompt_template": math_template,
},
]

destination_chains = {}

for p_info in prompt_infos:
    name = p_info["name"]
    prompt_template = p_info["prompt_template"]
    prompt = PromptTemplate(template=prompt_template, input_variables=["input"])
    chain = LLMChain(llm=llm, prompt=prompt)
    destination_chains[name] = chain

default_chain = ConversationChain(llm=llm, output_key="text")

default_chain.run("What is math?")

Math is the study of numbers, shapes, and patterns. It is used to solve problems and
understand the world around us. It is a fundamental part of our lives and is used in many
different fields, from engineering to finance.
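
The snippet above defines the destination chains and a default chain but stops short of wiring up the router itself. A hedged sketch of that final step is below, assuming the `MultiPromptChain.from_prompts` helper available in legacy LangChain versions; if your version lacks it, the router can also be built manually from `LLMRouterChain` and `RouterOutputParser`.

# Build the router on top of the destination chains defined above.
chain = MultiPromptChain.from_prompts(
    llm,
    prompt_infos,
    default_chain=default_chain,
    verbose=True,
)

print(chain.run("What is black body radiation?"))    # should route to the physics chain
print(chain.run("What is the derivative of x**2?"))  # should route to the math chain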
Sequential Chains
Sometimes, you might want to make a series of calls to a language model, take the output
from one call and use it as the input to another. Sequential chains allow you to connect
multiple chains and compose them into pipelines executing a specific scenario.

There are two types of sequential chains:

1) `SimpleSequentialChain`: The simplest form of sequential chains, where each step has
a singular input/output, and the output of one step is the input to the next.

2) `SequentialChain`: A more general form of sequential chain that allows multiple inputs/outputs.

SimpleSequentialChain
The simplest form of a sequential chain is where each step has a single input and output.

The output of one step is passed as input to the next step in the chain. You would use
`SimpleSequentialChain` it when you have a linear pipeline where each step has a single
input and output. `SimpleSequentialChain` implicitly passes the output of one step as
input to the next.

This is great for composing a precise sequence of LLMChains where each builds directly
on the previous output.

### When to use:

- You have a clear pipeline of steps, each with a single input and output

- Each step builds directly off the previous step’s output

- Useful for simple linear pipelines with one input and output per step

### How to use:

1) Define each step as an `LLMChain` with a single input and output

2) Create a `SimpleSequentialChain` passing a list of the `LLMChain` steps

3) Call `run()` on the `SimpleSequentialChain` with the initial input

from langchain.llms import OpenAI


from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

# This is an LLMChain to write a rap.


llm = OpenAI(temperature=.7)

template = """

You are a Punjabi Jatt rapper, like AP Dhillon or Sidhu Moosewala.

Given a topic, it is your job to spit bars on of pure heat.

Topic: {topic}
"""
prompt_template = PromptTemplate(input_variables=["topic"], template=template)

rap_chain = LLMChain(llm=llm, prompt=prompt_template)

# This is an LLMChain to write a diss track

llm = OpenAI(temperature=.7)

template = """

You are an extremely competitive Punjabi Rapper.

Given the rap from another rapper, it's your job to write a diss track which
tears apart the rap and shames the original rapper.

Rap:
{rap}
"""

prompt_template = PromptTemplate(input_variables=["rap"], template=template)

diss_chain = LLMChain(llm=llm, prompt=prompt_template)

# This is the overall chain where we run these two chains in sequence.
from langchain.chains import SimpleSequentialChain

overall_chain = SimpleSequentialChain(chains=[rap_chain, diss_chain], verbose=True)

review = overall_chain.run("Drinking Crown Royal and mobbin in my red Challenger")


SequentialChain
A more general form of sequential chain allows multiple inputs and outputs per step.

You would use `SequentialChain` when you have a more complex pipeline where steps
might have multiple inputs and outputs.

`SequentialChain` allows you to explicitly specify all the input and output variables at
each step and map outputs from one step to inputs of the next. This provides more
flexibility when steps might have multiple dependencies or produce multiple results to
pass along.

### When to use:

- You have a sequence of steps but with more complex input/output requirements

- You need to track multiple variables across steps in the chain

### How to use

- Define each step as an LLMChain, specifying multiple input/output variables

- Create a SequentialChain specifying all input/output variables

- Map outputs from one step to inputs of the next

- Call run() passing a dict of all input variables

- The key difference is `SimpleSequentialChain` handles implicit variable passing, whereas `SequentialChain` allows explicit variable specification and mapping.

### When you would use SequentialChain vs SimpleSequentialChain

Use `SimpleSequentialChain` for linear sequences with a single input/output. Use
`SequentialChain` for more complex sequences with multiple inputs/outputs.

### The key difference

`SimpleSequentialChain` is for linear pipelines with a single input/output per step. It implicitly passes variables.

`SequentialChain` handles more complex pipelines with multiple inputs/outputs per step.
Allows explicitly mapping variables.
The following example uses a standard OpenAI model and prompt templates, chaining two
`LLMChain` steps with `SequentialChain` so that the output of the first (the rap) feeds the
second (the review) and both are returned.

llm = OpenAI(temperature=.7)

template = """

You are a Punjabi Jatt rapper, like AP Dhillon or Sidhu Moosewala.

Given two topics, it is your job to create a rhyme of two verses and one chorus
for each topic.

Topic: {topic1} and {topic2}

Rap:

"""

prompt_template = PromptTemplate(input_variables=["topic1", "topic2"], template=template)

rap_chain = LLMChain(llm=llm, prompt=prompt_template, output_key="rap")

template = """

You are a rap critic from the Rolling Stone magazine and Metacritic.

Given a rap, it is your job to write a review for that rap.

Your review style should be scathing, critical, and no holds barred.

Rap:

{rap}

Review from the Rolling Stone magazine and Metacritic critic of the above rap:

"""

prompt_template = PromptTemplate(input_variables=["rap"], template=template)

review_chain = LLMChain(llm=llm, prompt=prompt_template, output_key="review")

# This is the overall chain where we run these two chains in sequence.
from langchain.chains import SequentialChain

overall_chain = SequentialChain(
chains=[rap_chain, review_chain],
input_variables=["topic1", "topic2"],
# Here we return multiple variables
output_variables=["rap", "review"],
verbose=True)

overall_chain({"topic1":"Tractors and sugar canes", "topic2": "Dasuya, Punjab"})

> Entering new SequentialChain chain...

> Finished chain.

{'topic1': 'Tractors and sugar canes',

'topic2': 'Dasuya, Punjab',

'rap': "Verse 1\nI come from a place with lots of fame\nDasuya, Punjab, where the
tractors reign\nI'm a Jatt rapper with a game to play\nSo I'm gonna take it up and make it
my way\n\nChorus\nTractors and sugar canes, that's what I'm talking about\nTractors and
sugar canes, it's all about\nDasuya, Punjab, a place so grand\nTractors and sugar canes,
that's our jam\n\nVerse 2\nFrom Punjab's beauty I derive my pride\nMy heart belongs to
the place, where the sugar canes reside\nWhere the soil is my home, I'm never
apart\nFrom the tractors and sugar canes of Dasuya, Punjab\n\nChorus\nTractors and
sugar canes, that's what I'm talking about\nTractors and sugar canes, it's all
about\nDasuya, Punjab, a place so grand\nTractors and sugar canes, that's our jam",

'review': "\nThis rap artist hails from the small town of Dasuya, Punjab, and takes pride in
his hometown's culture and agricultural way of life. While the lyrical content of this rap is
filled with references to tractors and sugar canes, unfortunately the artist's delivery falls
flat and fails to capture the unique essence of his home. The basic rhyme scheme,
repetitive chorus, and lack of originality make this a forgettable track. The artist's
enthusiasm for his hometown is admirable, but unfortunately it is not enough to make this
rap stand out from the crowd."}

Transformation
Transformation chains allow you to define custom data transformation logic as a step in
your LangChain pipeline. This is useful when you must preprocess or transform data
before passing it to the next step.
from langchain.chains import TransformChain, LLMChain, SimpleSequentialChain
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate

!wget https://www.gutenberg.org/files/2680/2680-0.txt

with open("/content/2680-0.txt") as f:
meditations = f.read()

def transform_func(inputs: dict) -> dict:
    """
    Extracts specific sections from a given text based on newline separators.

    The function assumes the input text is divided into sections or paragraphs
    separated by single newline characters (`\n`). It extracts lines 922 to 950
    of the text and returns them in a dictionary.

    Parameters:
    - inputs (dict): A dictionary containing the key "text" with the input text as
      its value.

    Returns:
    - dict: A dictionary containing the key "output_text" with the extracted
      sections as its value.
    """
    text = inputs["text"]
    shortened_text = "\n".join(text.split("\n")[921:950])
    return {"output_text": shortened_text}

transform_chain = TransformChain(
input_variables=["text"], output_variables=["output_text"],
transform=transform_func, verbose=True
)

transform_chain.run(meditations)

II. Let it be thy earnest and incessant care as a Roman and a man to

perform whatsoever it is that thou art about, with true and unfeigned

gravity, natural affection, freedom and justice: and as for all other

cares, and imaginations, how thou mayest ease thy mind of them. Which

thou shalt do; if thou shalt go about every action as thy last action,

free from all vanity, all passionate and wilful aberration from reason,

and from all hypocrisy, and self-love, and dislike of those things,

which by the fates or appointment of God have happened unto thee. Thou

seest that those things, which for a man to hold on in a prosperous

course, and to live a divine life, are requisite and necessary, are not

many, for the gods will require no more of any man, that shall but keep

and observe these things.


III. Do, soul, do; abuse and contemn thyself; yet a while and the time

for thee to respect thyself, will be at an end. Every man's happiness

depends from himself, but behold thy life is almost at an end, whiles

affording thyself no respect, thou dost make thy happiness to consist in

the souls, and conceits of other men.

IV. Why should any of these things that happen externally, so much

distract thee? Give thyself leisure to learn some good thing, and cease

roving and wandering to and fro. Thou must also take heed of another

kind of wandering, for they are idle in their actions, who toil and

labour in this life, and have no certain scope to which to direct all

their motions, and desires. V. For not observing the state of another

man's soul, scarce was ever any man known to be unhappy. Tell whosoever

they be that intend not, and guide not by reason and discretion the

motions of their own souls, they must of necessity be unhappy.

template = """

Rephrase this text:

{output_text}

In the style of a 90s gangster rapper speaking to his homies.

Rephrased:"""

prompt = PromptTemplate(input_variables=["output_text"], template=template)

llm_chain = LLMChain(llm=OpenAI(), prompt=prompt)

sequential_chain = SimpleSequentialChain(chains=[transform_chain, llm_chain], verbose=True)

sequential_chain.run(meditations)
> Entering new SimpleSequentialChain chain...

> Entering new TransformChain chain...

> Finished chain.

II. Let it be thy earnest and incessant care as a Roman and a man to

perform whatsoever it is that thou art about, with true and unfeigned

gravity, natural affection, freedom and justice: and as for all other

cares, and imaginations, how thou mayest ease thy mind of them. Which

thou shalt do; if thou shalt go about every action as thy last action,

free from all vanity, all passionate and wilful aberration from reason,

and from all hypocrisy, and self-love, and dislike of those things,

which by the fates or appointment of God have happened unto thee. Thou

seest that those things, which for a man to hold on in a prosperous

course, and to live a divine life, are requisite and necessary, are not

many, for the gods will require no more of any man, that shall but keep

and observe these things.

III. Do, soul, do; abuse and contemn thyself; yet a while and the time

for thee to respect thyself, will be at an end. Every man's happiness


depends from himself, but behold thy life is almost at an end, whiles

affording thyself no respect, thou dost make thy happiness to consist in

the souls, and conceits of other men.

IV. Why should any of these things that happen externally, so much

distract thee? Give thyself leisure to learn some good thing, and cease

roving and wandering to and fro. Thou must also take heed of another

kind of wandering, for they are idle in their actions, who toil and

labour in this life, and have no certain scope to which to direct all

their motions, and desires. V. For not observing the state of another

man's soul, scarce was ever any man known to be unhappy. Tell whosoever

they be that intend not, and guide not by reason and discretion the

motions of their own souls, they must of necessity be unhappy.

Yo, listen up my homies, it's time to get serious. We gotta take care of our business and
act with true gravity, natural affection, freedom, and justice. So forget all those other cares
and worries, and just do every action like it's your last, stayin' away from vanity and all
that phony stuff. We don't need much for true happiness. All the gods ask is that we keep
it real and show some respect for ourselves. Don't let nothin' from the outside distract you.
Take time to learn something good and make sure you got a goal to get to. Don't worry
'bout anybody else, 'cause if you don't look after your own soul, you gonna end up real
unhappy.

> Finished chain.

Yo, listen up my homies, it's time to get serious. We gotta take care of our business and
act with true gravity, natural affection, freedom, and justice. So forget all those other cares
and worries, and just do every action like it's your last, stayin' away from vanity and all
that phony stuff. We don't need much for true happiness. All the gods ask is that we keep
it real and show some respect for ourselves. Don't let nothin' from the outside distract you.
Take time to learn something good and make sure you got a goal to get to. Don't worry
'bout anybody else, 'cause if you don't look after your own soul, you gonna end up real
unhappy.

ref: https://www.comet.com/site/blog/chaining-the-future-an-in-depth-dive-into-
langchain/
Prompt Engineering
What is Prompt Engineering?
The ability to provide a good starting point for the model and guide it to produce the right
output plays a key role for applications that can integrate into daily work and make life
easier. The output produced by language models varies significantly with the prompt
served.

“Prompt Engineering” is the practice of guiding the language model with a clear,
detailed, well-defined, and optimized prompt in order to achieve a desired output.

There are two basic elements of a prompt. The language model needs a user-supplied
instruction to generate a response. In other words, when a user provides an instruction, the
language model produces a response.

Prompt

ref: Prompt Engineering [Guide](https://github.com/dair-ai/Prompt-Engineering-Guide/blob/main/lecture/Prompt-Engineering-Lecture-Elvis.pdf)

- Instructions: This is the section where the task description is expressed. The task to be
done must be clearly stated.

- Context: A task can be understood differently depending on its context. For this
reason, providing the command without its context can cause the language model to
output something other than what is expected.

- Input data: Indicates which and what kind of data the command will be executed on.
Presenting it clearly to the language model in a structured format increases the quality of
the response.
- Output indicator: This is an indicator of the expected output. Here, what the expected
output is can be defined structurally, so that output in a certain format can be produced.

Types of Prompts
It is a well-known fact that the better the prompt, the better the output! So, what kinds
of prompts are there? Let’s try to understand the different types of prompts! Before you
know it, you’ll be a prompt engineer yourself!

Many advanced prompting techniques have been designed to improve performance on


complex tasks, but first let’s get acquainted with simpler prompt types, starting with the
most basic.

Instruction Prompting
Simple instructions provide some guidance for producing useful outputs. For example, an
instruction can express a clear and simple mathematical operation such as “Add the
numbers 1 to 99.”

Or, you could try your hand at a slightly more complicated command. For example,
maybe you want to analyze customer reviews for a restaurant separately according to
taste, location, service, speed and price. You can easily do this with the command below:
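
An illustrative version of such a command (the exact wording and the `{customer_reviews}` placeholder are assumptions, not from the source) might be:

Classify each of the customer reviews below into one or more of the following categories: taste, location, service, speed, price. Return one line per review in the format "review number: categories".

Reviews:
{customer_reviews}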
Role Prompting
Another approach is to assign a role to the artificial intelligence entity before the
instructions. This technique generates somewhat more successful, or at least specific,
outputs.

Now, let’s observe the difference when first assigning a role within the prompt. Let’s
imagine a user who needs help to relieve tooth sensitivity to cold foods.

First, we try a simple command: “I need help addressing my sensitivity to cold foods.”
Now, let’s ask for advice again but this time we’ll assign the artificial intelligence a
dentist role.
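
For illustration, the two prompts might look something like this (the exact wording is an assumption):

Simple prompt: I need help addressing my sensitivity to cold foods.

Role prompt: You are an experienced dentist. A patient tells you: "I need help addressing my sensitivity to cold foods." Provide professional advice.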
You can clearly see a difference in both the tone and content of the response, given the
role assignment.

“Standard” Prompting
Prompts are considered “standard” when they consist of only one question. For example,
‘Ankara is the capital of which country?’ would qualify as a standard prompt.

Few shot standard prompts

Few shot standard prompts can be thought of as standard prompts in which a few samples
are presented first. This approach is beneficial in that it facilitates learning in context. It is
an approach that allows us to provide examples in the prompts to guide model
performance and improvement.
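
A minimal illustrative few-shot prompt for sentiment classification (the examples are assumptions, not from the source) could be:

Text: The staff were friendly and the food arrived quickly. // Sentiment: Positive
Text: We waited an hour and the pasta was cold. // Sentiment: Negative
Text: Great view, but the prices are far too high for the portion size. // Sentiment:

The model is expected to continue the pattern and label the final example.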

Chain of Thought (CoT) Prompting


Chain of Thought prompting is a way of simulating the reasoning process while answering
a question, similar to the way the human mind might think about it. If this reasoning process is
explained with examples, the AI can generally achieve more accurate results.
Comparison of models on the GSM8K benchmark

Now let’s try to see the difference through an example.

Source: [Chain of Thought Prompting Elicits Reasoning in Large Language Models (2022)](https://ai.googleblog.com/2022/05/language-models-perform-reasoning-via.html)

Above, an example of how the language model should think step-by-step is first presented
to demonstrate how the AI should “think” through the problem or interpret it.

“Zero Shot Chain of Thought (Zero-Shot CoT)”

“Zero Shot Chain of Thought (Zero-Shot CoT)” differs slightly from this
approach to prompt engineering. This time, it is seen that the model's reasoning ability can be
increased by adding a directive command like “Let’s think step by step” without
presenting an example to the language model.
Source: [Zero Shot Chain of Thought](https://github.com/dair-ai/Prompt-Engineering-
Guide/blob/main/lecture/Prompt-Engineering-Lecture-Elvis.pdf )
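
An illustrative zero-shot CoT prompt (wording assumed, in the spirit of the cited paper) simply appends the trigger phrase to the question:

Q: A juggler has 16 balls. Half of the balls are golf balls, and half of the golf balls are blue. How many blue golf balls are there?
A: Let's think step by step.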

In the experiments, it is seen that the “Zero Shot Chain of Thought” approach alone is
not as effective as the Chain of Thought Prompting approach. On the other hand, the choice of
directive command matters greatly, and at this point it has been observed that
the “Let’s think step by step” command produces more successful results than many other
commands.

Recommendations and Tips for Prompt Engineering with OpenAI API
Let’s try to summarize some of OpenAI’s suggested tips and usage recommendations on
how to give clear and effective instructions to GPT-3 and Codex when prompt
engineering.

Use latest models for the best results:

If you are going to use it to generate text, the most current model is “text-davinci-003”
and to generate code, it is “code-davinci-002” (November, 2022). You can [check
here](https://platform.openai.com/docs/models/gpt-3) to follow the current models and for
more detailed information about the models.
Instructions must be at the beginning of the prompt, and the instruction
and content must be separated by separators such as ### or """:
First of all, we must clearly state the instructions to the language model, and then use
various separators to define the instruction and its content. Thus, it is presented to the
language model in a more understandable way.

Source: [Best practices for OpenAI](https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-openai-api)
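
An illustrative prompt following this structure (the summarization task is an assumption) could be:

Summarize the text below as a bullet point list of the most important points.

Text: """
{text input here}
"""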

Give instructions that are specific, descriptive and as detailed as possible:

By typing clear commands on topics such as context, text length, format, style, you can
get better outputs. For example, instead of an open-ended command like `Write a poem
about OpenAI.` , you could write a more detailed command like `Write a short inspiring
poem about OpenAI, focusing on the recent DALL-E product launch (DALL-E is a text to
image ML model) in the style of a famous poet`

Provide the output format expressed with examples:

If you have a preferred output format in mind, we recommend providing a format example, as shown below:

Less effective :

Extract the entities mentioned in the text below.

Extract the following 4 entity types: company names, people names, specific topics and
themes.
Text: {text}

Better :

Extract the important entities mentioned in the text below.

First extract all company names, then extract all people names, then extract specific topics
which fit the content and finally extract general overarching themes

Desired format:

Company names: comma_separated_list_of_company_names

People names: -||-

Specific topics: -||-

General themes: -||-

Text: {text}

Try zero-shot first, then continue with few-shot examples and fine-tune
if you still don’t get the output you want:

You can try zero-shot prompt engineering for your command without providing any
examples to the language model. If you don’t get as successful output as you want, you
can try few-shot methods by guiding the model with a few examples. If you still don’t
produce as good an output as you intended, you can try fine-tuning.

There are examples of both zero-shot and few-shot prompts in the previous sections. You
can check out [this best practices](https://docs.google.com/document/d/1h-
GTjNDDKPKU_Rsd0t1lXCAnHltaXTAzQ8K2HRhQf9U/edit ) for fine-tune.

Avoid imprecise explanations:

When presenting a command to the language model, use clear and understandable
language. Avoid unnecessary clarifications and details.
Less effective :

The description for this product should be fairly short, a few sentences only,

and not too much more.

Better :

Use a 3 to 5 sentence paragraph to describe this product.

Tell what to do rather than what not to do:

Avoiding negative sentences and emphasizing intent will lead to better results.

Less effective :

The following is a conversation between an Agent and a Customer. DO NOT ASK


USERNAME OR PASSWORD. DO NOT REPEAT.

Customer: I can't log in to my account.

Agent:

Better :

The following is a conversation between an Agent and a Customer. The agent will attempt
to diagnose the problem and suggest a solution, whilst refraining from asking any
questions related to PII. Instead of asking for PII, such as username or password, refer the
user to the help article www.samplewebsite.com/help/faq

Customer: I can’t log in to my account.

Agent:

Code Generation Specific — Use “leading words” to nudge the model toward
a particular pattern:

It may be necessary to provide some hints to guide the language model when asking it to
generate a piece of code. For example, a starting point can be provided, such as “import”
when it needs to start writing code in Python, or “SELECT” when it needs to write an
SQL query.
ref: https://www.comet.com/site/blog/prompt-engineering/
Embeddings
A problem with semantic search

The basic design of a semantic search system, as pitched by most vector search vendors,
has two _easy_ (this is irony) steps:

1. Compute embeddings for your documents and queries. Somewhere. Somehow. Figure
it out by yourself.

2. Upload them to a vector search engine and enjoy a better semantic search.

A good embedding model is essential for semantic search. Image by author.

Your semantic search is as good as your embedding model, but choosing the model is
often considered out of scope for most early adopters. So everyone just takes a [sentence-
transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-
MiniLM-L6-v2 ) and hopes for the best.
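
As a concrete starting point, here is a minimal sketch of step 1 using that model (the example sentences are made up; it assumes the `sentence-transformers` package is installed):

```
from sentence_transformers import SentenceTransformer, util

# Load the default model most early adopters reach for
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

docs = [
    "The cat sat on the mat.",
    "A feline was resting on the rug.",
    "Quarterly revenue grew by 8 percent.",
]
query = "Where did the cat sit?"

doc_embeddings = model.encode(docs)      # shape: (3, 384)
query_embedding = model.encode(query)    # shape: (384,)

# Cosine similarity between the query and every document
scores = util.cos_sim(query_embedding, doc_embeddings)
print(scores)  # the first two documents should score noticeably higher than the third
```

The resulting vectors are what would then be uploaded to the vector search engine in step 2.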

But this approach has more open questions than answers:

- Is there a difference between embedding models? Are paid models from OpenAI and
Cohere better?

- How do they handle multiple languages? Is there a benefit in large 1B+ models?

- Dense retrieval using embeddings is one of many semantic search methods. Is it better
than new-age sparse approaches like [SPLADEv2](https://arxiv.org/abs/2109.10086) and [ELSER](https://www.elastic.co/guide/en/machine-learning/8.8/ml-nlp-elser.html)?
What are embeddings?
Embeddings are representations of values or objects like text, images, and audio that are
designed to be consumed by [machine
learning](https://www.cloudflare.com/learning/ai/what-is-machine-learning/ ) models and
semantic search algorithms. They translate objects like these into a mathematical form
according to the factors or traits each one may or may not have, and the categories they
belong to.

Essentially, embeddings enable machine learning models to find similar objects. Given a
photo or a document, a machine learning model that uses embeddings could find a similar
photo or document. Since embeddings make it possible for computers to understand the
relationships between words and other objects, they are foundational for [artificial
intelligence (AI)](https://www.cloudflare.com/learning/ai/what-is-artificial-intelligence/ ).

For example, the documents in the upper right of this two-dimensional space may be
relevant to each other:
Technically, embeddings are _vectors_ created by machine learning models for the
purpose of capturing meaningful data about each object.

What is a vector in machine learning?


In mathematics, a vector is an array of numbers that define a point in a dimensional space.
In more practical terms, a vector is a list of numbers — like 1989, 22, 9, 180. Each
number indicates where the object is along a specified dimension.

In machine learning, the use of vectors makes it possible to search for similar objects. A
vector-searching algorithm simply has to find two vectors that are close together in
a [vector database](https://www.cloudflare.com/learning/ai/what-is-vector-database/ ).

To understand this better, think about latitude and longitude. These two dimensions —
north-south and east-west, respectively — can indicate the location of any place on Earth.
The city of Vancouver, British Columbia, Canada can be represented as the latitude and
longitude coordinates 49°15'40"N, 123°06'50"W. This list of two values is a simple
vector.
Now, imagine trying to find a city that is very near Vancouver. A person would just look
at a map, while a machine learning model could instead look at the latitude and longitude
(or vector) and find a place with a similar latitude and longitude. The city of Burnaby is at
49°16'N, 122°58'W — very close to 49°15'40"N, 123°06'50"W. Therefore, the model can
conclude, correctly, that Burnaby is located near Vancouver.

Adding more dimensions to vectors

Now, imagine trying to find a city that is not only close to Vancouver, but of similar size.
To this model of locations, let us add a third "dimension" to latitude and longitude:
population size. Population can be added to each city's vector, and population size can be
treated like a Z-axis, with latitude and longitude as the Y- and X-axes.

The vector for Vancouver is now 49°15'40"N, 123°06'50"W, 662,248*. With this third
dimension added, Burnaby is no longer particularly close to Vancouver, as its population
is only 249,125*. The model might instead find the city of Seattle, Washington, US,
which has a vector of 47°36'35"N 122°19'59"W, 749,256**.

_* As of 2021. ** As of 2022._

This is a fairly simple example of how vectors and similarity search work. But to be of
use, machine learning models may want to generate more than three dimensions, resulting
in much more complex vectors.

Even more multi-dimensional vectors

For instance, how can a model tell which TV shows are similar to each other, and
therefore likely to be watched by the same people? There are any number of factors to
take into account: episode length, number of episodes, genre classification, number of
viewers in common, actors in each show, year each show debuted, and so on. All of these
can be "dimensions," and each show represented as a point along each of these
dimensions.

Multi-dimensional vectors can help us determine if the sitcom _Seinfeld_ is similar to
the horror show _Wednesday_. _Seinfeld_ debuted in 1989, _Wednesday_ in 2022.
The two shows have different episode lengths, with _Seinfeld_ at 22-24 minutes
and _Wednesday_ at 46-57 minutes — and so on. By looking at their vectors, we can see
that these shows likely occupy very different points in a dimensional representation of TV
shows.

| TV show | Genre | Year debuted | Episode length | Seasons (through 2023) | Episodes (through 2023) |
| --- | --- | --- | --- | --- | --- |
| _Seinfeld_ | Sitcom | 1989 | 22-24 min | 9 | 180 |
| _Wednesday_ | Horror | 2022 | 46-57 min | 1 | 8 |
We can express these as vectors, just as we did with latitude and longitude, but with more
values:

_Seinfeld_ vector: [Sitcom], 1989, 22-24, 9, 180

_Wednesday_ vector: [Horror], 2022, 46-57, 1, 8

A machine learning model might identify the sitcom _Cheers_ as being much more
similar to _Seinfeld_. It is of the same genre, debuted in 1982, features an episode length
of 21-25 minutes, has 11 seasons, and has 275 episodes.

_Seinfeld_ vector: [Sitcom], 1989, 22-24, 9, 180

_Cheers_ vector: [Sitcom], 1982, 21-25, 11, 275

In our examples above, a city was a point along the two dimensions of latitude and
longitude; we then added a third dimension of population. We also analyzed the location
of these TV shows along five dimensions.

Instead of two, three, or five dimensions, a TV show within a machine learning model is a
point along perhaps a hundred or a thousand dimensions — however many the model
wants to include.
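
To make this concrete, here is a toy sketch (not from the source) that encodes the three shows as numeric vectors and compares them; the one-hot genre encoding and min-max scaling are assumptions made purely for illustration:

```
import numpy as np

# [is_sitcom, is_horror, year_debuted, avg_episode_minutes, seasons, episodes]
seinfeld  = np.array([1, 0, 1989, 23, 9, 180], dtype=float)
wednesday = np.array([0, 1, 2022, 51, 1, 8], dtype=float)
cheers    = np.array([1, 0, 1982, 23, 11, 275], dtype=float)

shows = np.stack([seinfeld, wednesday, cheers])

# Scale each dimension to [0, 1] so years and episode counts don't swamp the genre flags
mins, maxs = shows.min(axis=0), shows.max(axis=0)
scaled = (shows - mins) / np.where(maxs - mins == 0, 1, maxs - mins)

def distance(a, b):
    return np.linalg.norm(a - b)  # Euclidean distance: smaller means more similar

print("Seinfeld vs Wednesday:", distance(scaled[0], scaled[1]))
print("Seinfeld vs Cheers:   ", distance(scaled[0], scaled[2]))
# Seinfeld comes out much closer to Cheers than to Wednesday, as the text suggests
```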

How do embeddings work?


Embedding is the process of creating vectors using [deep
learning](https://www.cloudflare.com/learning/ai/what-is-deep-learning/ ). An
"embedding" is the output of this process — in other words, the vector that is created by a
deep learning model for the purpose of similarity searches by that model.

Embeddings that are close to each other — just as Seattle and Vancouver have latitude
and longitude values close to each other and comparable populations — can be considered
similar. Using embeddings, an algorithm can suggest a relevant TV show, find similar
locations, or identify which words are likely to be used together or similar to each other,
as in language models.
How are embeddings used in large language
models (LLMs)?
For LLMs, embedding is taken a step further. The context of every word becomes an
embedding, in addition to the word itself. The meanings of entire sentences, paragraphs,
and articles can be searched and analyzed. Although this takes quite a bit of computational
power, the context for queries can be stored as embeddings, saving time and compute
power for future queries.

ref: https://www.cloudflare.com/learning/ai/what-are-embeddings/

https://medium.com/the-ai-forum/rag-on-complex-pdf-using-llamaparse-langchain-and-
groq-5b132bd1f9f3
Vector Stores
What Are Vector Databases?

In its most simplistic definition, a vector database stores information as vectors (vector
embeddings), which are a numerical version of a data object.

As such, vector embeddings are a powerful method of indexing and searching across very
large and unstructured or semi-
unstructured [datasets](https://www.kdnuggets.com/datasets/index.html ). These datasets
can consist of text, images, or sensor data and a vector database orders this information
into a manageable format.

Vector databases work using high-dimensional vectors which can contain hundreds of
different dimensions, each linked to a specific property of a data object. Thus creating an
unrivaled level of complexity.

Not to be confused with a vector index or a vector search library, a vector database is a
complete management solution to store and filter metadata in a way that:

- Is completely scalable

- Can be easily backed up

- Enables dynamic data changes

- Provides a high level of security

The Benefits of Using Open Source Vector Databases
Open source vector databases provide numerous benefits over licensed alternatives, such
as:

- They are a flexible solution that can be easily modified to suit specific needs, unlike
licensed options which are typically designed for a particular project.

- Open source vector databases are supported by a large community of developers who are
ready to assist with any issues or provide advice on how projects could be improved.

- An open-source solution is budget-friendly, with no licensing fees, subscription fees,
or any unexpected costs during the project.

- Due to the transparent nature of open-source vector databases, developers can work
more effectively, understanding every component and how the database was built.

- Open source products are constantly being improved and evolving with changes in
technology, as they are backed by active communities.

Open Source Vector Databases Comparison: Chroma Vs. Milvus Vs. Weaviate
Now that we have an understanding of what a vector database is and the benefits of an
open-source solution, let’s consider some of the most popular options on the market. We
will focus on the strengths, features, and uses of Chroma, Milvus, and Weaviate, before
moving on to a direct head-to-head comparison to determine the best option for your
needs.

1. Chroma
- Focus: ChromaDB is specifically designed for managing and searching large-scale
color data, particularly in the context of computer vision and image processing. It is
optimized for working with color histograms and other color-based representations.

- Features:

  _Color-specific indexing:_ ChromaDB provides indexing methods tailored for color
  data, allowing for efficient storage and retrieval of color information.

  _Querying by color similarity:_ It’s designed to quickly find similar colors based on
  certain criteria, which is useful in applications like image retrieval or analysis.

Use Cases: ChromaDB is commonly used in applications where color plays a crucial role,
such as image and video processing, where similarity searches based on color are
essential.

One of Chroma’s key strengths is its support for audio data, making it a top choice for
audio-based search engines, music recommendation applications, and other sound-based
projects.
2. Milvus
Milvus has gained a strong reputation in the world of ML and [data
science](https://www.kdnuggets.com/tag/data-science ), boasting impressive capabilities
in terms of vector indexing and querying. Utilizing powerful algorithms, Milvus offers
lightning-fast processing and data retrieval speeds [and GPU
support](https://milvus.io/blog/unveiling-milvus-2-3-milestone-release-offering-support-
for-gpu-arm64-cdc-and-other-features.md), even when working with very large datasets.
Milvus can also be integrated with other popular frameworks such as PyTorch and
TensorFlow, allowing it to be added to existing ML workflows.

Use Cases

Milvus is renowned for its capabilities in similarity search and analytics, with extensive
support for multiple programming languages. This flexibility means developers aren't
limited to backend operations and can even perform tasks typically reserved for server-
side languages on the front end. For example, you could [generate PDFs with
JavaScript](http://apryse.com/blog/javascript/how-to-generate-pdfs-with-javascript) while
leveraging real-time data from Milvus. This opens up new avenues for application
development, especially for educational content and apps focusing on accessibility.

This open-source vector database can be used across a wide range of industries and in a
large number of applications. Another prominent example involves eCommerce, where
Milvus can power accurate recommendation systems to suggest products based on a
customer’s preferences and buying habits.

It’s also suitable for image/ video analysis projects, assisting with image similarity
searches, object recognition, and content-based image retrieval. Another key use case
is [natural language processing](https://www.kdnuggets.com/tag/natural-language-
processing ) (NLP), providing document clustering and semantic search capabilities, as
well as providing the backbone to question and answer systems.

3. Weaviate
The third open source vector database in our honest comparison is Weaviate, which is
available in [both a self-hosted and fully-managed
solution](https://weaviate.io/blog/weaviate-1-21-release). Countless businesses are using
Weaviate to handle and manage large datasets due to its excellent level of performance, its
simplicity, and its highly scalable nature.

Capable of managing a range of data types, Weaviate is very flexible and can store both
vectors and data objects which makes it ideal for applications that need a range of search
techniques (E.G. vector searches and keyword searches).
Use Cases

In terms of its use, Weaviate is perfect for projects like Data classification in enterprise
resource planning software or applications that involve:

- Similarity searches

- Semantic searches

- Image searches

- eCommerce product searches

- Recommendation engines

- Cybersecurity threat analysis and detection

- Anomaly detection

- Automated data harmonization

Now we have a brief understanding of what each vector database can offer, let’s consider
the finer details that set each open source solution apart in our handy comparison table.

4. Faiss
- Focus: Faiss (Facebook AI Similarity Search) is a more general-purpose library
designed for similarity search in large-scale vector databases. It is not limited to any
specific type of data and can be applied to a wide range of applications.

- Features:

  _Versatility:_ Faiss supports various indexing methods and similarity metrics,
  making it flexible for different types of vector data.

  _Efficiency:_ It is highly optimized for speed and memory usage, making it suitable
  for handling large datasets efficiently.

  _Integration with deep learning frameworks:_ Faiss is often used in conjunction
  with deep learning models to perform similarity searches on learned embeddings.

Use Cases:

Faiss is widely used in applications where similarity search is critical, such as
recommendation systems, natural language processing, and image retrieval. Its versatility
makes it suitable for handling different types of vector data.
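
A minimal, self-contained Faiss usage sketch (random vectors and dimensions chosen purely for illustration) might look like this:

```
import numpy as np
import faiss  # pip install faiss-cpu

d = 384                                              # embedding dimensionality
xb = np.random.random((1000, d)).astype("float32")   # "document" vectors to index
xq = np.random.random((5, d)).astype("float32")      # query vectors

index = faiss.IndexFlatL2(d)   # exact L2 (brute-force) index, no training needed
index.add(xb)                  # add the database vectors

distances, ids = index.search(xq, 3)   # top-3 nearest neighbours for each query
print(ids.shape)                       # (5, 3)
```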
|  | Chroma | Milvus | Weaviate |
| --- | --- | --- | --- |
| Open Source Status | Yes - Apache-2.0 license | Yes - Apache-2.0 license | Yes - BSD-3-Clause license |
| Publication Date | Feb-23 | Oct-19 | Jan-21 |
| Use Cases | Suitable for a wide range of applications, with support for multiple data types and formats. Specializes in audio-based search projects and image/video retrieval. | Suitable for a wide range of applications, with support for a plethora of data types and formats. Perfect for eCommerce recommendation systems, natural language processing, and image/video-based analysis. | Suitable for a wide range of applications, with support for multiple data types and formats. Ideal for data classification in enterprise resource planning software. |
| Key Features | Impressive ease of use. Development, testing, and production environments all use the same API on a Jupyter Notebook. Powerful search, filter, and density estimation functionality. | Uses both in-memory and persistent storage to provide high-speed query and insert performance. Provides automatic data partitioning, load balancing, and fault tolerance for large-scale vector data handling. Supports a variety of vector similarity search algorithms. | Offers a GraphQL-based API, providing flexibility and efficiency when interacting with the knowledge graph. Supports real-time data updates to ensure the knowledge graph remains up-to-date with the latest changes. Its schema inference feature automates the process of defining data structures. |
| Supported Programming Languages | Python or JavaScript | Python, Java, C++, and Go | Python, JavaScript, and Go |
| Community and Industry Recognition | Strong community with a Discord channel available to answer live queries. | Active community on GitHub, Slack, Reddit, and Twitter. Over 1000 enterprise users. Extensive documentation. | Dedicated forum and active Slack, Twitter, and LinkedIn communities. Plus regular podcasts and newsletters. Extensive documentation. |
| GitHub Stars | 9k | 23.5k | 7.8k |
In summary, the choice between ChromaDB and Faiss depends on the nature of your data
and the specific requirements of your application. If your primary concern is efficient color-
based similarity search, ChromaDB might be more suitable. If you need a general-purpose
library for similarity search on large-scale vector data, Faiss is a versatile and powerful
option.

ref: https://medium.com/@sujathamudadla1213/chromadb-vsfaiss-65cdae3012ab

https://www.kdnuggets.com/an-honest-comparison-of-open-source-vector-databases
Chunking
Document Splitting
Once the data is loaded, the next step in the indexing pipeline is splitting the documents
into manageable chunks. The question arises around the need of this step. Why is splitting
of documents necessary? There are two reasons for that:

- Ease of Search

Large chunks of data are harder to search over. Splitting data into smaller chunks
therefore helps in better indexation.

- Context Window Size

LLMs allow only a finite number of tokens in prompts and completions. The context
therefore cannot be larger than what the context window permits.

## Chunking Strategies

While splitting documents into chunks might sound like a simple concept, there are certain
best practices that researchers have discovered. There are a few considerations that may
influence the overall chunking strategy.

- Nature of Content

Consider whether you are working with lengthy documents, such as articles or books, or
shorter content like tweets or instant messages. The chosen model for your goal and,
consequently, the appropriate chunking strategy depend on your response.

- Embedding Model being Used

We will discuss embeddings in detail in the next section but the choice of embedding
model also dictates the chunking strategy. Some models perform better with chunks of
specific length

- Expected Length and Complexity of User Queries

Determine whether the content will be short and specific or long and complex. This factor
will influence the approach to chunking the content, ensuring a closer correlation between
the embedded query and the embedded chunks

- Application Specific Requirements

The application use case, such as semantic search, question answering, summarization, or
other purposes will also determine how text should be chunked. If the results need to be
input into another language model with a token limit, it is crucial to factor this into your
decision-making process.

Chunking Methods
Depending on the aforementioned considerations, a number of `text splitters` are
available. At a broad level, text splitters operate in the following manner:

- Divide the text into compact, `semantically meaningful units`, often sentences.

- Merge these smaller units into larger chunks until a specific size is achieved, measured
by a `length function`.

- Upon reaching the predetermined size, treat that chunk as an independent segment of
text. Thereafter, start creating a new text chunk with `some degree of overlap` to maintain
contextual continuity between chunks.

Two areas to focus on, therefore are:

- How the text is split?

- How the chunk size is measured?

Levels Of Text Splitting

- [Character Splitting](https://github.com/FullStackRetrieval-
com/RetrievalTutorials/blob/8a30b5710b3dd99ef2239fb60c7b54bc38d3613d/tutorials/Le
velsOfTextSplitting/#CharacterSplitting ) - Simple static character chunks of data

- [Recursive Character Text Splitting](https://github.com/FullStackRetrieval-


com/RetrievalTutorials/blob/8a30b5710b3dd99ef2239fb60c7b54bc38d3613d/tutorials/Le
velsOfTextSplitting/#RecursiveCharacterSplitting ) - Recursive chunking based on a list
of separators

- [Document Specific Splitting](https://github.com/FullStackRetrieval-


com/RetrievalTutorials/blob/8a30b5710b3dd99ef2239fb60c7b54bc38d3613d/tutorials/Le
velsOfTextSplitting/#DocumentSpecific ) - Various chunking methods for different
document types (PDF, Python, Markdown)

- [Semantic Splitting](https://github.com/FullStackRetrieval-
com/RetrievalTutorials/blob/8a30b5710b3dd99ef2239fb60c7b54bc38d3613d/tutorials/Le
velsOfTextSplitting/#SemanticChunking ) - Embedding walk based chunking

- [Agentic Splitting](https://github.com/FullStackRetrieval-
com/RetrievalTutorials/blob/8a30b5710b3dd99ef2239fb60c7b54bc38d3613d/tutorials/Le
velsOfTextSplitting/#AgenticChunking ) - Experimental method of splitting text with an
agent-like system. Good for if you believe that token cost will trend to $0.00

- [Alternative Representation Chunking + Indexing]


(https://github.com/FullStackRetrieval-
com/RetrievalTutorials/blob/8a30b5710b3dd99ef2239fb60c7b54bc38d3613d/tutorials/Le
velsOfTextSplitting/#BonusLevel ) - Derivative representations of your raw text that will
aid in retrieval and indexing

A very common approach is where we `pre-determine` the size of the text chunks.

Additionally, we can specify the `overlap between chunks` (Remember, overlap is


preferred to maintain contextual continuity between chunks).

This approach is simple and cheap and is, therefore, widely used. Let’s look at

some examples:

Split by Character

In this approach, the text is split based on a character and the chunk size is measured by
the number of characters.

Example text: alice_in_wonderland.txt (the book in .txt format) using LangChain’s `CharacterTextSplitter`

Character Splitting
Character splitting is the most basic form of splitting up your text. It is the process of
simply dividing your text into N-character sized chunks regardless of their content or
form.

This method isn't recommended for any applications - but it's a great starting point for us
to understand the basics.

- Pros: Easy & Simple

- Cons: Very rigid and doesn't take into account the structure of your text

Concepts to know:

- Chunk Size - The number of characters you would like in your chunks. 50, 100,
100,000, etc.
- Chunk Overlap - The amount you would like your sequential chunks to overlap. This
is to try to avoid cutting a single piece of context into multiple pieces. This will create
duplicate data across chunks.

First let's get some sample text

In [1]:

text = "This is the text I would like to chunk up. It is the example text for this
exercise"

Then let's split this text manually

In [2]:

# Create a list that will hold your chunks
chunks = []

chunk_size = 35 # Characters

# Run through a range over the length of your text, stepping by chunk_size
for i in range(0, len(text), chunk_size):
    chunk = text[i:i + chunk_size]
    chunks.append(chunk)
chunks

Out[2]:

['This is the text I would like to ch',

'unk up. It is the example text for ',

'this exercise']

Congratulations! You just split your first text. We have long way to go but you're already
making progress. Feel like a language model practitioner yet?

When working with text in the language model world, we don't deal with raw strings. It is
more common to work with documents. Documents are objects that hold the text you're
concerned with, but also additional metadata which makes filtering and manipulation
easier later.

We could convert our list of strings into documents, but I'd rather start from scratch and
create the docs.
Let's load up LangChain's `CharacterTextSplitter` to do this for us

In [3]:

from langchain.text_splitter import CharacterTextSplitter

Then let's load up this text splitter. I need to specify `chunk_overlap` and `separator` or
else we'll get funky results. We'll get into those next.

In [4]:

text_splitter = CharacterTextSplitter(chunk_size = 35, chunk_overlap=0,
                                      separator='', strip_whitespace=False)

Then we can actually split our text via `create_documents`.


Note: `create_documents` expects a list of texts, so if you just have a string (like we do)
you'll need to wrap it in `[]`

In [5]:
text_splitter.create_documents([text])

Out[5]:

[Document(page_content='This is the text I would like to ch'),

Document(page_content='unk up. It is the example text for '),

Document(page_content='this exercise')]

Notice how this time we have the same chunks, but they are in documents. These will play
nicely with the rest of the LangChain world. Also notice how the trailing whitespace on
the end of the 2nd chunk is missing. This is because LangChain removes it, see [this
line](https://github.com/langchain-
ai/langchain/blob/f36ef0739dbb548cabdb4453e6819fc3d826414f/libs/langchain/langchai
n/text_splitter.py#L167) for where they do it. You can avoid this
with `strip_whitespace=False`

Chunk Overlap & Separators

Chunk overlap will blend together our chunks so that the tail of Chunk #1 will be the
same thing as the head of Chunk #2, and so on and so forth.

This time I'll load up my overlap with a value of 4; this means 4 characters of overlap.

In [6]:

```
text_splitter = CharacterTextSplitter(chunk_size = 35, chunk_overlap=4,
separator='')

```

In [7]:

```
text_splitter.create_documents([text])

```

Out[7]:

```

[Document(page_content='This is the text I would like to ch'),

Document(page_content='o chunk up. It is the example text'),

Document(page_content='ext for this exercise')]

```

Notice how we have the same chunks, but now there is overlap between 1 & 2 and 2 & 3.
The 'o ch' on the tail of Chunk #1 matches the 'o ch' of the head of Chunk #2.

Check [ChunkViz.com](https://github.com/FullStackRetrieval-
com/RetrievalTutorials/blob/8a30b5710b3dd99ef2239fb60c7b54bc38d3613d/tutorials/Le
velsOfTextSplitting/www.chunkviz.com ) to help show it. Here's what the same text
looks like.
Check out how we have three colors, with two overlapping sections.

Separators are character(s) sequences you would like to split on. Say you wanted to
chunk your data at `ch`, you can specify it.

In [8]:

```
text_splitter = CharacterTextSplitter(chunk_size = 35, chunk_overlap=0,
separator='ch')

```

In [9]:

```
text_splitter.create_documents([text])

```

Out[9]:

```

[Document(page_content='This is the text I would like to'),

Document(page_content='unk up. It is the example text for this exercise')]


```

Recursive Character Text Splitting


Let's jump a level of complexity.

The problem with Level #1 is that we don't take into account the structure of our
document at all. We simply split by a fixed number of characters.

The Recursive Character Text Splitter helps with this. With it, we'll specify a series of
separators which will be used to split our docs.

You can see the default separators for LangChain [here](https://github.com/langchain-


ai/langchain/blob/9ef2feb6747f5a69d186bd623b569ad722829a5e/libs/langchain/langchai
n/text_splitter.py#L842 ). Let's take a look at them one by one.

- "\n\n" - Double new line, or most commonly paragraph breaks

- "\n" - New lines

- " " - Spaces

- "" - Characters

I'm not sure why a period (".") isn't included on the list, perhaps it is not universal
enough? If you know, let me know.

This is the swiss army knife of splitters and my first choice when mocking up a quick
application. If you don't know which splitter to start with, this is a good first bet.

Let's try it out

In [16]:

```
from langchain.text_splitter import RecursiveCharacterTextSplitter

```

Then let's load up a larger piece of text


In [17]:
```
text = """
One of the most important things I didn't understand about the world when I was a
child is the degree to which the returns for performance are superlinear.

Teachers and coaches implicitly told us the returns were linear. "You get out," I
heard a thousand times, "what you put in." They meant well, but this is rarely true.
If your product is only half as good as your competitor's, you don't get half as
many customers. You get no customers, and you go out of business.

It's obviously true that the returns for performance are superlinear in business.
Some think this is a flaw of capitalism, and that if we changed the rules it would
stop being true. But superlinear returns for performance are a feature of the world,
not an artifact of rules we've invented. We see the same pattern in fame, power,
military victories, knowledge, and even benefit to humanity. In all of these, the
rich get richer. [1]
"""

```

Now let's make our text splitter

In [18]:

```
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 65, chunk_overlap=0)

```

In [19]:

```
text_splitter.create_documents([text])

```

Out[19]:

```

[Document(page_content="One of the most important things I didn't understand about the"),
 Document(page_content='world when I was a child is the degree to which the returns for'),
 Document(page_content='performance are superlinear.'),
 Document(page_content='Teachers and coaches implicitly told us the returns were linear.'),
 Document(page_content='"You get out," I heard a thousand times, "what you put in." They'),
 Document(page_content='meant well, but this is rarely true. If your product is only'),
 Document(page_content="half as good as your competitor's, you don't get half as many"),
 Document(page_content='customers. You get no customers, and you go out of business.'),
 Document(page_content="It's obviously true that the returns for performance are"),
 Document(page_content='superlinear in business. Some think this is a flaw of'),
 Document(page_content='capitalism, and that if we changed the rules it would stop being'),
 Document(page_content='true. But superlinear returns for performance are a feature of'),
 Document(page_content="the world, not an artifact of rules we've invented. We see the"),
 Document(page_content='same pattern in fame, power, military victories, knowledge, and'),
 Document(page_content='even benefit to humanity. In all of these, the rich get richer.'),
 Document(page_content='[1]')]

```

Notice how now there are more chunks that end with a period ".". This is because those
likely are the end of a paragraph and the splitter first looks for double new lines
(paragraph break).

Once paragraphs are split, then it looks at the chunk size, if a chunk is too big, then it'll
split by the next separator. If the chunk is still too big, then it'll move onto the next one
and so forth.
For text of this size, let's split on something bigger.

In [20]:

```
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 450, chunk_overlap=0)
text_splitter.create_documents([text])

```

Out[20]:

```

[Document(page_content="One of the most important things I didn't understand about the world when I was a child is the degree to which the returns for performance are superlinear."),
 Document(page_content='Teachers and coaches implicitly told us the returns were linear. "You get out," I heard a thousand times, "what you put in." They meant well, but this is rarely true. If your product is only half as good as your competitor\'s, you don\'t get half as many customers. You get no customers, and you go out of business.'),
 Document(page_content="It's obviously true that the returns for performance are superlinear in business. Some think this is a flaw of capitalism, and that if we changed the rules it would stop being true. But superlinear returns for performance are a feature of the world, not an artifact of rules we've invented. We see the same pattern in fame, power, military victories, knowledge, and even benefit to humanity. In all of these, the rich get richer. [1]")]

```

For this text, 450 splits the paragraphs perfectly. You can even switch the chunk size to
469 and get the same splits. This is because this splitter builds in a bit of cushion and
wiggle room to allow your chunks to 'snap' to the nearest separator.

Let's view this visually


Split by Tokens
For those well versed with Large Language Models, tokens is not a new concept.

All LLMs have a token limit in their respective context windows which we cannot exceed.
It is therefore a good idea to count the tokens while creating chunks. All LLMs also have
their tokenizers.

Tiktoken Tokenizer
The Tiktoken tokenizer was created by OpenAI for its family of models. Using this
strategy, the split still happens based on characters; however, the length of each chunk is
determined by the number of tokens.

example: LangChain’s `TokenTextSplitter` (a brief sketch follows below)

Tokenizers are helpful in creating chunks that sit well within the context window of an
LLM.
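
A brief sketch of what this could look like (the chunk sizes are arbitrary, `text` is assumed to hold the document loaded earlier, and `tiktoken` must be installed):

```
from langchain.text_splitter import TokenTextSplitter

# chunk_size and chunk_overlap are now measured in tokens, not characters
text_splitter = TokenTextSplitter(chunk_size=100, chunk_overlap=10)

docs = text_splitter.create_documents([text])  # `text` holds the document loaded earlier
print(len(docs), docs[0].page_content[:80])
```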

Hugging Face Tokenizer


Hugging Face has become the go-to platform for anyone building apps using LLMs or
even other models. All models available via Hugging Face are also accompanied by their
tokenizers.

example: `GPT2TokenizerFast`

https://huggingface.co/docs/transformers/tokenizer_summary
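
A possible sketch of combining a Hugging Face tokenizer with a LangChain splitter (the values are illustrative, and `text` is again assumed to hold the loaded document):

```
from transformers import GPT2TokenizerFast
from langchain.text_splitter import CharacterTextSplitter

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# Chunk length is measured by the number of GPT-2 tokens instead of characters
text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer, chunk_size=100, chunk_overlap=10
)
docs = text_splitter.create_documents([text])
```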

Other Tokenizer
Other libraries like Spacy, NLTK and SentenceTransformers also provide splitters.

Things to Keep in Mind


- Ensure data quality by preprocessing it before determining the optimal chunk size.
Examples include removing HTML tags or eliminating specific elements that contribute
noise, particularly when data is sourced from the web.

- Consider factors such as content nature (e.g., short messages or lengthy documents),
embedding model characteristics, and capabilities like token limits in choosing chunk
sizes. Aim for a balance between preserving context and maintaining accuracy.

- Test different chunk sizes. Create embeddings for the chosen chunk sizes and store them
in your index or indices. Run a series of queries to evaluate quality and compare the
performance of different chunk sizes.

ref: https://github.com/FullStackRetrieval-
com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Spl
itting.ipynb

https://www.linkedin.com/in/abhinav-kimothi/
Quantization
What is Quantization?

Quantization is a compression technique that involves mapping high precision values to
lower precision ones. For an LLM, that means modifying the precision of its weights and
activations, making it less memory intensive. This does have an impact on the
capabilities of the model, including its accuracy. Whether to use a quantized model is
often a trade-off based on the use case; in some cases it is possible to
achieve comparable results with significantly lower precision. Quantization improves
performance by reducing memory bandwidth requirements and increasing cache utilization.

Instead of using high-precision data types, such as 32-bit floating-point numbers,


quantization represents values using lower-precision data types, such as 8-bit integers.
This process significantly reduces memory usage and can speed up model execution while
maintaining acceptable accuracy.

With an LLM model, quantization process at different precision levels enables a model to
be run on wider range of devices.

How does quantization work?


LLMs are generally trained with full (float32) or half (float16) precision floating point
numbers. One float16 value occupies 16 bits (2 bytes), so a one-billion-parameter model
stored in FP16 requires about two gigabytes of memory.

The process of quantization therefore comes down to finding a way to map the range
([min, max] of the datatype) of FP32 weight values onto a lower precision type like FP16
or even INT4 (4-bit integer). The typical case is going from FP32 to INT8.

The overall impact on the quality of the LLM depends on the technique used.
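
To build intuition, here is a minimal sketch of naive symmetric FP32-to-INT8 quantization of a weight matrix (this is an illustration only, not how production libraries such as bitsandbytes implement it):

```
import numpy as np

def quantize_int8(weights):
    # Map the observed FP32 range symmetrically onto the INT8 range [-127, 127]
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover an approximation of the original FP32 weights
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)   # toy "weight matrix"
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print(np.abs(w - w_hat).max())  # small reconstruction (quantization) error
```

Each INT8 weight now takes 1 byte instead of 4, at the cost of the small rounding error printed above.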

Hugging Face and Bitsandbytes Uses


Hugging Face’s Transformers library is a go-to choice for working with pre-trained
language models. To make the process of model quantization more accessible, Hugging
Face has seamlessly integrated with the Bitsandbytes library. This integration simplifies
the quantization process and empowers users to achieve efficient models with just a few
lines of code.
Install latest accelerate from source:

pip install git+https://github.com/huggingface/accelerate.git

Install latest transformers from source and bitsandbytes:

pip install git+https://github.com/huggingface/transformers.git

pip install bitsandbytes

Hugging Face and Bitsandbytes Integration Uses

Loading a Model in 4-bit Quantization


One of the key features of this integration is the ability to load models in 4-bit
quantization. This can be done by setting the `load_in_4bit=True` argument when calling
the `.from_pretrained` method. By doing so, you can reduce memory usage by
approximately fourfold.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigscience/bloom-1b7"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto",
load_in_4bit=True)
Loading a Model in 8-bit Quantization
For further memory optimization, you can load a model in 8-bit quantization. This can be
achieved by using the `load_in_8bit=True` argument when calling `.from_pretrained`.
This reduces the memory footprint by approximately half.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigscience/bloom-1b7"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto",
load_in_8bit=True)

You can even check the memory footprint of your model using
the `get_memory_footprint` method:

print(model.get_memory_footprint())

Other Use cases:

The Hugging Face and Bitsandbytes integration goes beyond basic quantization
techniques. Here are some use cases you can explore:

Changing the Compute Data Type


You can modify the data type used during computation by setting
the `bnb_4bit_compute_dtype` to a different value, such as `torch.bfloat16`. This can
result in speed improvements in specific scenarios. Here's an example:

import torch
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True,
                                         bnb_4bit_compute_dtype=torch.bfloat16)
Using NF4 Data Type
The NF4 data type is designed for weights initialized using a normal distribution. You can
use it by specifying `bnb_4bit_quant_type="nf4"`:

from transformers import BitsAndBytesConfig

nf4_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")

model_nf4 = AutoModelForCausalLM.from_pretrained(model_id,
quantization_config=nf4_config)

Nested Quantization for Memory Efficiency


The integration also recommends using the nested quantization technique for even greater
memory efficiency without sacrificing performance. This technique has proven beneficial,
especially when fine-tuning large models:

from transformers import BitsAndBytesConfig

double_quant_config = BitsAndBytesConfig(load_in_4bit=True,
bnb_4bit_use_double_quant=True)

model_double_quant = AutoModelForCausalLM.from_pretrained(model_id,
quantization_config=double_quant_config)

Loading a Quantized Model from the Hub


A quantized model can be loaded with ease using the `from_pretrained` method. Make
sure the saved weights are quantized by checking the `quantization_config` attribute in
the model configuration:

model = AutoModelForCausalLM.from_pretrained("model_name", device_map="auto")

In this case, you don’t need to specify the `load_in_8bit=True` argument, but you must
have both Bitsandbytes and Accelerate library installed.
Exploring Advanced techniques and
configuration

There are additional techniques and configurations to consider:

Offloading Between CPU and GPU

One advanced use case involves loading a model and distributing weights between the
CPU and GPU. This can be achieved by
setting `llm_int8_enable_fp32_cpu_offload=True`. This feature is beneficial for users
who need to fit large models and distribute them between the GPU and CPU.

Adjusting Outlier Threshold

Experiment with the `llm_int8_threshold` argument to change the threshold for outliers.
This parameter impacts inference speed and can be fine-tuned to suit your specific use
case.

Skipping the Conversion of Some Modules

In certain situations, you may want to skip the conversion of specific modules to 8-bit.
You can do this using the `llm_int8_skip_modules` argument.

Fine-Tuning a Model Loaded in 8-bit


With the support of adapters in the Hugging Face ecosystem, you can fine-tune models loaded
in 8-bit quantization, enabling the fine-tuning of large models with ease.

ref: https://medium.com/@rakeshrajpurohit/model-quantization-with-hugging-face-
transformers-and-bitsandbytes-integration-b4c9983e8996

https://medium.com/@techresearchspace/what-is-quantization-in-llm-
01ba61968a51#:~:text=Quantization%20is%20a%20compression%20technique,the%20m
odel%20including%20the%20accuracy.
Temperature
Top P and Temperature
Large Language Models(LLMs) are essential tools in natural language processing (NLP)
and have been used in a variety of applications, such as text completion, translation, and
question answering.

The output of large language models can be affected by various hyperparameters


including temperature, top p, token length, max tokens and stop tokens.

Temperature
Temperature is a hyperparameter that controls the randomness of language model output.

A high temperature produces more unpredictable and creative results, while a low
temperature produces more deterministic and conservative output. In other words, a lower
temperature setting causes the model to be more “confident” in its output, while a higher
temperature setting yields more varied and creative output.

For example, if you adjust the temperature to 0.5, the model will generate text that is
more predictable and less creative than if you set the temperature to 1.0.

temperature: Controls the randomness of responses. A lower temperature leads to more


predictable outputs, while a higher temperature results in more varied and sometimes
more creative outputs

Top p
Top p: also known as nucleus sampling, is another hyperparameter that controls the
randomness of language model output.

It sets a threshold probability and selects the top tokens whose cumulative probability
exceeds the threshold. The model then randomly samples from this set of tokens to
generate output. This method can produce more diverse and interesting output than
traditional methods that randomly sample the entire vocabulary.

For example, if you set top p to 0.9, the model will only consider the most likely words
that make up 90% of the probability mass.
top_p: can be considered a method of text generation that selects the next token from the
probability distribution of the top p most likely tokens. This balances exploration and
exploitation during generation.

Token length
This is the number of words or characters in a sequence or text that is fed to the LLM.

It varies depending on the language and the tokenization method used for the particular
LLM.

The length of the input text affects the output of the LLM.

A very short input may not have enough context to generate a meaningful completion.

Conversely, a rather long input may make the model inefficiently process or it may cause
the model to generate an irrelevant output.

Max tokens
This is the maximum number of tokens that the LLM generates.

Within this, is the token limit; the maximum number of tokens that can be used in the
prompt and the completion of the model. Determined by the architecture of the model
LLM, it refers to the maximum tokens that can be processed at once.

The computational cost and the memory requirements are directly proportional to the max
tokens. Set a longer max token, and you will have greater context and coherent output
text. Set a shorter max token, and you will use less memory and have a faster response but
your output is prone to errors and inconsistencies.

During the training and fine-tuning of the LLM, the max token is set.

Contrary to fine-tuning token length during the generation of output, the coherence and
length of the output is carefully set at inception, based on the specific task &
requirements, without affecting other parameters that will likely need adjusting.

max_tokens: The maximum number of tokens that the model can process in a single
response. This limit ensures computational efficiency and resource management
Stop tokens
In simple terms, it is the length of the output or response of an LLM.

So it signifies the end of a sequence in terms of either a paragraph or a sentence.

Similar to max tokens, the inference budget is reduced when the stop tokens are set low.

For example, when the stop tokens are set at 2, the generated text or output will be
limited to a paragraph. If the stop tokens are set at 1, the generated text will be limited to a
sentence.
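
To show where these parameters typically appear in practice, here is a sketch using the legacy (pre-1.0) OpenAI Python client; the values are illustrative rather than recommendations, and `stop` here is the API's stop-sequence parameter, which halts generation when the given string is produced:

```
import openai

openai.api_key = "YOUR_API_KEY"

response = openai.Completion.create(
    model="text-davinci-003",
    prompt="Write a one-paragraph summary of what vector databases are.",
    temperature=0.7,    # higher -> more varied and creative output
    top_p=0.9,          # nucleus sampling over the top 90% probability mass
    max_tokens=150,     # upper bound on the number of generated tokens
    stop=["\n\n"],      # generation stops when this sequence is produced
)
print(response["choices"][0]["text"])
```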

ref: https://medium.com/@dixnjakindah/top-p-temperature-and-other-parameters-
1a53d2f8d7d7
Langchain Memory
What is Conversational memory?
Conversational memory is how a chatbot can respond to multiple queries in a chat-like
manner. It enables a coherent conversation, and without it, every query would be treated
as an entirely independent input without considering past interactions.

The LLM with and without conversational memory. The blue boxes are user prompts and
in grey are the LLMs responses. Without conversational memory (right), the LLM cannot
respond using knowledge of previous interactions.

The memory allows LLM to remember previous interactions with the user. By default,
LLMs are _stateless_ — meaning each incoming query is processed independently of
other interactions. The only thing that exists for a stateless agent is the current input,
nothing else.

There are many applications where remembering previous interactions is very important,
such as chatbots. Conversational memory allows us to do that.

There are several ways that we can implement conversational memory. In the context of
[LangChain](/learn/langchain-intro/), they are all built on top of the `ConversationChain`.
ConversationChain
We can start by initializing the ConversationChain. We will use OpenAI’s text-davinci-
003 as the LLM, but other models like gpt-3.5-turbo can be used.

```
from langchain import OpenAI
from langchain.chains import ConversationChain

# first initialize the large language model
llm = OpenAI(
    temperature=0,
    openai_api_key="OPENAI_API_KEY",
    model_name="text-davinci-003"
)

# now initialize the conversation chain
conversation = ConversationChain(llm=llm)

```

We can see the prompt template used by the ConversationChain like so:

In[8]:

```
print(conversation.prompt.template)

```

Out[8]:

```

The following is a friendly conversation between a human and an AI. The AI is talkative
and provides lots of specific details from its context. If the AI does not know the answer
to a question, it truthfully says it does not know.

Current conversation:
{history}
Human: {input}
AI:
```

Here, the prompt primes the model by telling it that the following is a conversation
between a human (us) and an AI (text-davinci-003). The prompt attempts to
reduce _hallucinations_ (where a model makes things up) by stating:

"If the AI does not know the answer to a question, it truthfully says it does not know."

This can help but does not solve the problem of hallucinations — but we will save this for
the topic of a future chapter.

Following the initial prompt, we see two parameters: history and input. The input is
where we’d place the latest human query; it is the input entered into a chatbot text box.

The history is where conversational memory is used. Here, we feed in information about
the conversation history between the human and AI.

These two parameters — history and input — are passed to the LLM within the prompt
template we just saw, and the output that we (hopefully) return is simply the predicted
continuation of the conversation.
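
To make this concrete, here is a small illustrative sketch (not taken from LangChain's
source) showing how a template with history and input placeholders is filled in using
PromptTemplate; the example strings are assumptions for demonstration:

```
from langchain.prompts import PromptTemplate

template = """The following is a friendly conversation between a human and an AI.

Current conversation:
{history}
Human: {input}
AI:"""

prompt = PromptTemplate(input_variables=["history", "input"], template=template)

# the memory supplies `history`, the chatbot text box supplies `input`
print(prompt.format(
    history="Human: Good morning AI!\nAI: Good morning! How can I help you?",
    input="What did I just say to you?"
))
```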
Forms of Conversational Memory
We can use several types of conversational memory with the ConversationChain. They
modify the text passed to the history parameter.

ConversationBufferMemory
_(Follow along with our_ _[Jupyter notebooks](https://github.com/pinecone-
io/examples/blob/master/learn/generation/langchain/handbook/03-langchain-
conversational-memory.ipynb ))_

The ConversationBufferMemory is the most straightforward conversational memory in
LangChain. As we described above, the past conversation between the human and AI is
passed — in its raw form — to the history parameter.

In[11]:

```
from langchain.chains.conversation.memory import ConversationBufferMemory

conversation_buf = ConversationChain(
llm=llm,
memory=ConversationBufferMemory()
)

```

In[32]:

```
conversation_buf("Good morning AI!")

```

Out[32]:

```

{'input': 'Good morning AI!',
 'history': '',
 'response': " Good morning! It's a beautiful day today, isn't it? How can I help you?"}
```

We return the first response from the conversational agent. Let’s continue the
conversation, writing prompts that the LLM can only answer _if_ it considers the
conversation history. We also add a count_tokens function so we can see how many
tokens are being used by each interaction.

In[6]:

```
from langchain.callbacks import get_openai_callback

def count_tokens(chain, query):
    with get_openai_callback() as cb:
        result = chain.run(query)
        print(f'Spent a total of {cb.total_tokens} tokens')
    return result

```

In[33]:

```
count_tokens(
conversation_buf,
"My interest here is to explore the potential of integrating Large Language
Models with external knowledge"
)

```

Out[33]:

```

Spent a total of 179 tokens

```

Out[33]:

```

' Interesting! Large Language Models are a type of artificial intelligence that can process
natural language and generate text. They can be used to generate text from a given
context, or to answer questions about a given context. Integrating them with external
knowledge can help them to better understand the context and generate more accurate
results. Is there anything else I can help you with?'

```

In[34]:

```
count_tokens(
conversation_buf,
"I just want to analyze the different possibilities. What can you think of?"
)

```

Out[34]:

```

Spent a total of 268 tokens

```

Out[34]:

```

' Well, integrating Large Language Models with external knowledge can open up a lot of
possibilities. For example, you could use them to generate more accurate and detailed
summaries of text, or to answer questions about a given context more accurately. You
could also use them to generate more accurate translations, or to generate more accurate
predictions about future events.'

```

In[35]:

```
count_tokens(
conversation_buf,
"Which data source types could be used to give context to the model?"
)

```
Out[35]:

```

Spent a total of 360 tokens

```

Out[35]:

```

' There are a variety of data sources that could be used to give context to a Large
Language Model. These include structured data sources such as databases, unstructured
data sources such as text documents, and even audio and video data sources. Additionally,
you could use external knowledge sources such as Wikipedia or other online
encyclopedias to provide additional context.'

```

In[36]:

```
count_tokens(
conversation_buf,
"What is my aim again?"
)

```

Out[36]:

```

Spent a total of 388 tokens

```

Out[36]:

```

' Your aim is to explore the potential of integrating Large Language Models with external
knowledge.'

```
The LLM can clearly remember the history of the conversation. Let’s take a look
at _how_ this conversation history is stored by the ConversationBufferMemory:

In[37]:

```
print(conversation_buf.memory.buffer)

```

Out[37]:

```

Human: Good morning AI!

AI: Good morning! It's a beautiful day today, isn't it? How can I help you?

Human: My interest here is to explore the potential of integrating Large Language Models
with external knowledge

AI: Interesting! Large Language Models are a type of artificial intelligence that can
process natural language and generate text. They can be used to generate text from a given
context, or to answer questions about a given context. Integrating them with external
knowledge can help them to better understand the context and generate more accurate
results. Is there anything else I can help you with?

Human: I just want to analyze the different possibilities. What can you think of?

AI: Well, integrating Large Language Models with external knowledge can open up a lot
of possibilities. For example, you could use them to generate more accurate and detailed
summaries of text, or to answer questions about a given context more accurately. You
could also use them to generate more accurate translations, or to generate more accurate
predictions about future events.

Human: Which data source types could be used to give context to the model?

AI: There are a variety of data sources that could be used to give context to a Large
Language Model. These include structured data sources such as databases, unstructured
data sources such as text documents, and even audio and video data sources. Additionally,
you could use external knowledge sources such as Wikipedia or other online
encyclopedias to provide additional context.

Human: What is my aim again?


AI: Your aim is to explore the potential of integrating Large Language Models with
external knowledge.

```

We can see that the buffer saves every interaction in the chat history directly. There are a
few pros and cons to this approach. In short, they are:

| Pros | Cons |
| --- | --- |
| Storing everything gives the LLM the maximum amount of information | More tokens mean slower response times and higher costs |
| Storing everything is simple and intuitive | Long conversations cannot be remembered as we hit the LLM token limit (4096 tokens for text-davinci-003 and gpt-3.5-turbo) |

The `ConversationBufferMemory` is an excellent option to get started with but is limited
by the storage of every interaction. Let’s take a look at other options that help remedy this.

ConversationSummaryMemory
Using `ConversationBufferMemory`, we very quickly use _a lot_ of tokens and even
exceed the context window limit of even the most advanced LLMs available today.

To avoid excessive token usage, we can use `ConversationSummaryMemory`. As the
name would suggest, this form of memory _summarizes_ the conversation history before
it is passed to the history parameter.

We initialize the `ConversationChain` with the summary memory like so:

```
from langchain.chains.conversation.memory import ConversationSummaryMemory

conversation_sum = ConversationChain(
    llm=llm,
    memory=ConversationSummaryMemory(llm=llm)
)

```
When using ConversationSummaryMemory, we need to pass an LLM to the object
because the summarization is powered by an LLM. We can see the prompt used to do this
here:

In[19]:

```
print(conversation_sum.memory.prompt.template)

```

Out[19]:

```

Progressively summarize the lines of conversation provided, adding onto the previous
summary returning a new summary.

EXAMPLE

Current summary:

The human asks what the AI thinks of artificial intelligence. The AI thinks artificial
intelligence is a force for good.

New lines of conversation:

Human: Why do you think artificial intelligence is a force for good?

AI: Because artificial intelligence will help humans reach their full potential.

New summary:

The human asks what the AI thinks of artificial intelligence. The AI thinks artificial
intelligence is a force for good because it will help humans reach their full potential.

END OF EXAMPLE
Current summary:
{summary}

New lines of conversation:
{new_lines}

New summary:

```

Using this, we can summarize every new interaction and append it to a “running
summary” of all past interactions. Let’s have another conversation utilizing this approach.

In[40]:

```
# without count_tokens we'd call `conversation_sum("Good morning AI!")`
# but let's keep track of our tokens:
count_tokens(
conversation_sum,
"Good morning AI!"
)

```

Out[40]:

```

Spent a total of 290 tokens

```

Out[40]:
```

" Good morning! It's a beautiful day today, isn't it? How can I help you?"

```

In[41]:

```
count_tokens(
conversation_sum,
"My interest here is to explore the potential of integrating Large Language
Models with external knowledge"
)

```

Out[41]:

```

Spent a total of 440 tokens

```

Out[41]:

```

" That sounds like an interesting project! I'm familiar with Large Language Models, but
I'm not sure how they could be integrated with external knowledge. Could you tell me
more about what you have in mind?"

```

In[42]:

```
count_tokens(
conversation_sum,
"I just want to analyze the different possibilities. What can you think of?"
)

```
Out[42]:

```

Spent a total of 664 tokens

```

Out[42]:

```

' I can think of a few possibilities. One option is to use a large language model to generate
a set of candidate answers to a given query, and then use external knowledge to filter out
the most relevant answers. Another option is to use the large language model to generate a
set of candidate answers, and then use external knowledge to score and rank the answers.
Finally, you could use the large language model to generate a set of candidate answers,
and then use external knowledge to refine the answers.'

```

In[43]:

```
count_tokens(
conversation_sum,
"Which data source types could be used to give context to the model?"
)

```

Out[43]:

```

Spent a total of 799 tokens

```

Out[43]:

```

' There are many different types of data sources that could be used to give context to the
model. These could include structured data sources such as databases, unstructured data
sources such as text documents, or even external APIs that provide access to external
knowledge. Additionally, the model could be trained on a combination of these data
sources to provide a more comprehensive understanding of the context.'

```

In[44]:

```
count_tokens(
conversation_sum,
"What is my aim again?"
)

```

Out[44]:

```

Spent a total of 853 tokens

```

Out[44]:

```

' Your aim is to explore the potential of integrating Large Language Models with external
knowledge.'

```

In this case the summary contains enough information for the LLM to “remember” our
original aim. We can see this summary in its raw form like so:

In[45]:

```
print(conversation_sum.memory.buffer)

```

Out[45]:

```
The human greeted the AI with a good morning, to which the AI responded with a good
morning and asked how it could help. The human expressed interest in exploring the
potential of integrating Large Language Models with external knowledge, to which the AI
responded positively and asked for more information. The human asked the AI to think of
different possibilities, and the AI suggested three options: using the large language model
to generate a set of candidate answers and then using external knowledge to filter out the
most relevant answers, score and rank the answers, or refine the answers. The human then
asked which data source types could be used to give context to the model, to which the AI
responded that there are many different types of data sources that could be used, such as
structured data sources, unstructured data sources, or external APIs. Additionally, the
model could be trained on a combination of these data sources to provide a more
comprehensive understanding of the context. The human then asked what their aim was
again, to which the AI responded that their aim was to explore the potential of integrating
Large Language Models with external knowledge.

```

The number of tokens being used for this conversation is greater than when using
the ConversationBufferMemory, so is there any advantage to
using ConversationSummaryMemory over the buffer memory?

Token count (y-axis) for the buffer memory vs. summary memory as the number of
interactions (x-axis) increases.
For longer conversations, yes. [Here](https://github.com/pinecone-
io/examples/blob/master/learn/generation/langchain/handbook/03a-token-counter.ipynb ),
we have a longer conversation. As shown above, the summary memory initially uses far
more tokens. However, as the conversation progresses, the summarization approach grows
more slowly. In contrast, the buffer memory continues to grow linearly with the number
of tokens in the chat.

We can summarize the pros and cons of `ConversationSummaryMemory` as follows:

| Pros | Cons |
| --- | --- |
| Shortens the number of tokens for long conversations | Can result in higher token usage for smaller conversations |
| Enables much longer conversations | Memorization of the conversation history is wholly reliant on the summarization ability of the intermediate summarization LLM |
| Relatively straightforward implementation, intuitively simple to understand | Also requires token usage for the summarization LLM; this increases costs (but does not limit conversation length) |

Conversation summarization is a good approach for cases where long conversations are
expected. Yet, it is still fundamentally limited by token limits. After a certain amount of
time, we still exceed context window limits.

ConversationBufferWindowMemory
The `ConversationBufferWindowMemory` acts in the same way as our earlier _“buffer
memory”_ but adds a _window_ to the memory. This means that we only keep a given
number of past interactions before _“forgetting”_ them. We use it like so:

```
from langchain.chains.conversation.memory import ConversationBufferWindowMemory

conversation_bufw = ConversationChain(
    llm=llm,
    memory=ConversationBufferWindowMemory(k=1)
)

```
In this instance, we set k=1 — this means the window will remember the single latest
interaction between the human and AI. That is the latest human response and the latest AI
response. We can see the effect of this below:

In[61]:

```
count_tokens(
conversation_bufw,
"Good morning AI!"
)

```

Out[61]:

```

Spent a total of 85 tokens

```

Out[61]:

```

" Good morning! It's a beautiful day today, isn't it? How can I help you?"

```

In[62]:

```
count_tokens(
conversation_bufw,
"My interest here is to explore the potential of integrating Large Language
Models with external knowledge"
)

```

Out[62]:

```
Spent a total of 178 tokens

```

Out[62]:

```

' Interesting! Large Language Models are a type of artificial intelligence that can process
natural language and generate text. They can be used to generate text from a given
context, or to answer questions about a given context. Integrating them with external
knowledge can help them to better understand the context and generate more accurate
results. Do you have any specific questions about this integration?'

```

In[63]:

```
count_tokens(
conversation_bufw,
"I just want to analyze the different possibilities. What can you think of?"
)

```

Out[63]:

```

Spent a total of 233 tokens

```

Out[63]:

```

' There are many possibilities for integrating Large Language Models with external
knowledge. For example, you could use external knowledge to provide additional context
to the model, or to provide additional training data. You could also use external
knowledge to help the model better understand the context of a given text, or to help it
generate more accurate results.'

```
In[64]:

```
count_tokens(
conversation_bufw,
"Which data source types could be used to give context to the model?"
)

```

Out[64]:

```

Spent a total of 245 tokens

```

Out[64]:

```

' Data sources that could be used to give context to the model include text corpora,
structured databases, and ontologies. Text corpora provide a large amount of text data that
can be used to train the model and provide additional context. Structured databases
provide structured data that can be used to provide additional context to the model.
Ontologies provide a structured representation of knowledge that can be used to provide
additional context to the model.'

```

In[65]:

```
count_tokens(
conversation_bufw,
"What is my aim again?"
)

```
Out[65]:

```

Spent a total of 186 tokens

```

Out[65]:

```

' Your aim is to use data sources to give context to the model.'

```

By the end of the conversation, when we ask "What is my aim again?", the answer to this
was contained in the human response _three_ interactions ago. As we only kept the most
recent interaction (k=1), the model had forgotten and could not give the correct answer.

We can see the effective “memory” of the model like so:

In[66]:

```
bufw_history = conversation_bufw.memory.load_memory_variables(
inputs=[]
)['history']

```

In[67]:

```
print(bufw_history)

```

Out[67]:

```

Human: What is my aim again?

AI: Your aim is to use data sources to give context to the model.
```

Although this method isn’t suitable for remembering distant interactions, it is good at
limiting the number of tokens being used — a number that we can increase/decrease
depending on our needs. For the [longer conversation](https://github.com/pinecone-
io/examples/blob/master/learn/generation/langchain/handbook/03a-token-counter.ipynb
) used in our earlier comparison, we can set k=6 and reach ~1.5K tokens per interaction
after 27 total interactions:

Token count including the ConversationBufferWindowMemory at k=6 and k=12.

If we only need memory of recent interactions, this is a great option. However, for a mix
of both distant and recent interactions, there are other options.

ConversationSummaryBufferMemory
The ConversationSummaryBufferMemory is a mix of
the ConversationSummaryMemory and the ConversationBufferWindowMemory. It
summarizes the earliest interactions in a conversation while maintaining
the max_token_limit most recent tokens in their conversation. It is initialized like so:

```
from langchain.chains.conversation.memory import ConversationSummaryBufferMemory

conversation_sum_bufw = ConversationChain(
    llm=llm,
    memory=ConversationSummaryBufferMemory(
        llm=llm,
        max_token_limit=650
    )
)
```

When applying this to our earlier conversation, we can set max_token_limit to a small
number and yet the LLM can remember our earlier “aim”.

This is because that information is captured by the “summarization” component of the
memory, despite being missed by the “buffer window” component.

Naturally, the pros and cons of this component are a mix of the earlier components on
which this is based.

| Pros | Cons |
| --- | --- |
| Summarizer means we can remember distant interactions | Summarizer increases token count for shorter conversations |
| Buffer prevents us from missing information from the most recent interactions | Storing the raw interactions (even if just the most recent interactions) increases token count |

Although requiring more tweaking on what to summarize and what to maintain within the
buffer window, the ConversationSummaryBufferMemory does give us plenty of
flexibility and is the only one of our memory types (so far) that allows us to remember
distant interactions _and_ store the most recent interactions in their raw — and most
information-rich — form.
Token count comparisons including the ConversationSummaryBufferMemory type with
max_token_limit values of 650 and 1300.

We can also see that despite including a summary of past interactions _and_ the raw
form of recent interactions — the increase in token count
of ConversationSummaryBufferMemory is competitive with other methods.

Other Memory Types
The memory types we have covered here are great for getting started and give a good
balance between remembering as much as possible and minimizing tokens.

However, we have other options — particularly
the `ConversationKnowledgeGraphMemory` and `ConversationEntityMemory`.
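
As a hedged sketch of the knowledge-graph option, the snippet below exercises the class
exposed in older LangChain releases as `ConversationKGMemory` on its own via
`save_context` and `load_memory_variables`; the example facts and strings are illustrative
assumptions:

```
from langchain import OpenAI
from langchain.memory import ConversationKGMemory

llm = OpenAI(temperature=0, openai_api_key="OPENAI_API_KEY")

# the knowledge-graph memory uses the LLM to extract (subject, relation, object) triples
kg_memory = ConversationKGMemory(llm=llm)

kg_memory.save_context(
    {"input": "Sam is a data engineer who works at Acme"},
    {"output": "Got it, Sam is a data engineer at Acme."},
)

# retrieve what the graph knows that is relevant to a new query
print(kg_memory.load_memory_variables({"input": "What does Sam do?"}))
```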

That’s it for this introduction to conversational memory for LLMs using LangChain. As
we’ve seen, there are plenty of options for helping _stateless_ LLMs interact as if they
were in a _stateful_ environment — able to consider and refer back to past interactions.

ref: https://www.pinecone.io/learn/series/langchain/langchain-conversational-memory/
Agents & Tools
Agents are like characters or personas with specific capabilities. They use chains and tools
to perform their functions.

Chains are sequences of processing steps for prompts. They are used within agents to
define how the agent processes information.

Tools are specialized functionalities that can be used by agents or within chains for
specific tasks.

Tools
Tools are functions that agents can use to interact with the world and perform specific
duties. These tools can be generic utilities (e.g. Google search, database lookups,
mathematical operations etc.), other chains, or even other agents.

Tools allow the LLM to interact with the outside world, and since they are customizable
they can pretty much be coded to do anything you like, not just a limited set of pre-defined
operations.
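
As a minimal sketch of a custom tool (the helper function and names here are hypothetical,
not from any project in this guide), a plain Python function can be wrapped as a LangChain
tool like so:

```
from langchain.agents import Tool

def word_count(text: str) -> str:
    """Hypothetical helper: counts the words in the input text."""
    return str(len(text.split()))

word_count_tool = Tool.from_function(
    func=word_count,
    name="word_counter",
    description="Useful for counting how many words are in a piece of text.",
)
```

The description matters: the agent relies on it to decide when the tool is appropriate.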

Agents
Some applications will require not just a predetermined chain of calls to LLMs/other
tools, but potentially an unknown chain that depends on the user’s input. In these types of
chains, there is an “agent” which has access to a suite of tools. Depending on the user input,
the agent can then decide which, if any, of these tools to call.

The core idea of agents is to use an LLM to choose a sequence of actions to take. In
chains, a sequence of actions is hardcoded (in code). In agents, a language model is used
as a reasoning engine to determine which actions to take and in which order.

Simply put, Agent = Tools + Memory


Looking at the diagram, when receiving a request, Agents make use of an LLM to decide
on which Action to take.

After an Action is completed, the Agent enters the Observation step.

From Observation step Agent shares a Thought; if a final answer is not reached, the Agent
cycles back to another Action in order to move closer to a Final Answer.

There is a whole array of Action options available to the LangChain Agent.

Actions are taken by the agent via various tools. The more tools are available to an Agent,
the more actions can be taken by the Agent.

There are many types of agents such as — Conversations, ReAct etc. Custom agents can
be made as well.
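
Below is a hedged, self-contained sketch of this idea using a ReAct-style agent; the toy
tool, model settings, and prompt are illustrative assumptions rather than a recommended
setup:

```
from langchain import OpenAI
from langchain.agents import AgentType, Tool, initialize_agent

llm = OpenAI(temperature=0, openai_api_key="OPENAI_API_KEY")

# a trivial illustrative tool; real agents would use search, SQL, math tools, etc.
def reverse_text(text: str) -> str:
    return text[::-1]

tools = [
    Tool.from_function(
        func=reverse_text,
        name="text_reverser",
        description="Useful for reversing a string of text.",
    )
]

# the agent uses the LLM as a reasoning engine to pick which tool to call, if any
agent = initialize_agent(
    tools,
    llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True,
)

print(agent.run("Reverse the phrase 'hello world' for me."))
```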

Chains
Using an LLM in isolation is fine for simple applications, but more complex applications
require chaining LLMs — either with each other or with other components.

LangChain provides the Chain interface for such “chained” applications. We define a
Chain very generically as a sequence of calls to components, which can include other
chains.

In the sample project explained in the walkthrough below, the Sequential Chain is used,
which will give very clear insight into how these chains work.

Langchain has 4 types of foundational chains -

1. **LLM** — A simple chain with a prompt template that can process multiple inputs.

2. **Router** — A gateway that uses the large language model (LLM) to select the
most suitable processing chain.

3. **Sequential** — A family of chains which processes input in a sequential manner.
This means that the output of the first node in the chain becomes the input of the second
node, the output of the second becomes the input of the third, and so on (see the sketch
after this list).

4. **Transformation** — A type of chain that allows Python function calls for
customizable text manipulation.
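
A minimal sketch of a sequential chain under assumed toy prompts (the prompt templates
and variable names are illustrative, not from the project described later):

```
from langchain import OpenAI
from langchain.chains import LLMChain, SimpleSequentialChain
from langchain.prompts import PromptTemplate

llm = OpenAI(temperature=0, openai_api_key="OPENAI_API_KEY")

# chain 1: produce a title from a topic
title_chain = LLMChain(
    llm=llm,
    prompt=PromptTemplate(
        input_variables=["topic"],
        template="Write a blog post title about {topic}.",
    ),
)

# chain 2: turn the title produced by chain 1 into an outline
outline_chain = LLMChain(
    llm=llm,
    prompt=PromptTemplate(
        input_variables=["title"],
        template="Write a three-bullet outline for a post titled: {title}",
    ),
)

# the output of the first chain becomes the input of the second
overall_chain = SimpleSequentialChain(chains=[title_chain, outline_chain], verbose=True)
print(overall_chain.run("vector databases"))
```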

Memory
You can attach memory to your chain or agent so that it remembers the context of the
conversation and responds accordingly.

1. **Buffer Memory:** The Buffer Memory in Langchain is a simple memory buffer
that stores the history of the conversation. It has a buffer property that returns the list of
messages in the chat memory. The load_memory_variables function returns the history
buffer. This type of memory is useful for storing and retrieving the immediate history of a
conversation.

2. **Buffer Window Memory:** Buffer Window Memory is a variant of Buffer
Memory. It also stores the conversation history but with a twist. It has a property k which
determines the number of previous interactions to be stored. The buffer property returns
the last k*2 messages from the chat memory. This type of memory is useful when you
want to limit the history to a certain number of previous interactions.

3. **Entity Memory:** The Entity Memory in Langchain is a more complex type of
memory. It not only stores the conversation history but also extracts and summarizes
entities from the conversation. It uses the Langchain Language Model (LLM) to predict
and extract entities from the conversation. The extracted entities are then stored in an
entity store which can be either in-memory or Redis-backed. This type of memory is
useful when you want to extract and store specific information from the conversation.
Each of these memory types has its own use cases and trade-offs. Buffer Memory and
Buffer Window Memory are simpler and faster but they only store the conversation
history. Entity Memory, on the other hand, is more complex and slower but it provides
more functionality by extracting and summarizing entities from the conversation.

As for the data structures and algorithms used, it seems that Langchain primarily uses lists
and dictionaries to store the memory. The algorithms are mostly related to text processing
and entity extraction, which involve the use of the Langchain Language Model.

1. **Conversation Knowledge Graph Memory:** The Conversation Knowledge Graph
Memory is a sophisticated memory type that integrates with an external knowledge graph
to store and retrieve information about knowledge triples in the conversation. It uses the
Langchain Language Model (LLM) to predict and extract entities and knowledge triples
from the conversation. The extracted entities and knowledge triples are then stored in a
NetworkxEntityGraph, which is a type of graph data structure provided by the NetworkX
library. This memory type is useful when you want to extract, store, and retrieve
structured information from the conversation in the form of a knowledge graph.

2. **ConversationSummaryMemory:** The ConversationSummaryMemory is a type
of memory that summarizes the conversation history. It uses the LangChain Language
Model (LLM) to generate a summary of the conversation. The summary is stored in a
buffer and is updated every time a new message is added to the conversation. This
memory type is useful when you want to maintain a concise summary of the conversation
that can be used for reference or to provide context for future interactions.

3. **ConversationSummaryBufferMemory:** ConversationSummaryBufferMemory
is similar to the ConversationSummaryMemory but with an added feature of pruning. If
the conversation becomes too long (exceeds a specified token limit), the memory prunes
the conversation by summarizing the pruned part and adding it to a moving summary
buffer. This ensures that the memory does not exceed its capacity while still retaining the
essential information from the conversation.

4. **ConversationTokenBufferMemory:** ConversationTokenBufferMemory is a
type of memory that stores the conversation history in a buffer. It also has a pruning
feature similar to the ConversationSummaryBufferMemory. If the conversation exceeds a
specified token limit, the memory prunes the earliest messages until it is within the limit.
This memory type is useful when you want to maintain a fixed-size memory of the most
recent conversation history.

5. **VectorStore-Backed Memory:** The VectorStore-Backed Memory is a memory
type that is backed by a VectorStoreRetriever. The VectorStoreRetriever is used to
retrieve relevant documents based on a query. The retrieved documents are then stored in
the memory. This memory type is useful when you want to store and retrieve information
in the form of vectors, which is particularly useful for tasks such as semantic search or
similarity computation (a brief sketch follows this list).
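
A hedged sketch of the vector-store-backed variant, assuming OpenAI embeddings and a
local FAISS index (requires the faiss package) purely for illustration:

```
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.memory import VectorStoreRetrieverMemory
from langchain.vectorstores import FAISS

embeddings = OpenAIEmbeddings(openai_api_key="OPENAI_API_KEY")

# seed the index with a placeholder document so it can be created
vectorstore = FAISS.from_texts(["initial placeholder"], embedding=embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})

memory = VectorStoreRetrieverMemory(retriever=retriever)

# past exchanges are embedded and stored in the vector store
memory.save_context({"input": "My favourite sport is padel"}, {"output": "Noted!"})

# later, the most similar past exchanges are retrieved for the new query
print(memory.load_memory_variables({"prompt": "what sport do I like?"}))
```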
Callback Handlers
LangChain provides a callbacks system that allows you to hook into the various stages of
your LLM application. This is useful for logging, monitoring, streaming, and other tasks.
The BaseCallbackHandler class is used to define the actions to be performed inside the
hook functions.

Some available hooks are — on_llm_start, on_agent_end, on_chain_start. The names of
these hooks are self explanatory. Code can be written inside these functions which has to
be performed when those functions are called.

The object of the BaseCallbackHandler class can be provided to the appropriate agent,
chain, tool etc.

Walkthrough — Project Utilizing Langchain

The following image displays the architecture I’ve used in a project that helps in
answering questions on data available in a large SQL database, by creating SQL queries to
fetch relevant data, analyzing the fetched data, and then returning a response in the form
of an answer.
In the image above it can be seen that the agent has two chains available to it as tools
which are -

1. Analysis Chain (For doing analysis on data in memory)

2. Sequential Chain (For writing SQL queries)

**NOTE: While there are predefined and configured agents, tools and chains available,
custom versions of all of these can be made.**

**NOTE: Chains can be provided as tools to the agent. Similarly, Tools can be made
available as a chain segment in chains as well. The user has a lot of freedom to customize
these agents, tools, chains and can plug, sequence them according to their needs.**

```
tools = [
    Tool.from_function(
        func=sequentialchain._run,
        name="tool1",
        description="Useful when user wants information about revenue, margin, "
                    "employee and projects. Input is a descriptive plain text formed "
                    "using user question and chat history and output is the result."
    ),
    Tool.from_function(
        func=analysis._run,
        name="tool2",
        description="Useful when you want to do some calculations and statistical "
                    "analysis using the memory. Input is a list of numbers with "
                    "description of what is to be done to it or a mathematical "
                    "equation of numbers and output is the result."
    )
]
```

The code snippet above shows a tools array in which two chains, namely — sequentialchain and
analysis chain are provided as tools.

```
memory = ConversationBufferWindowMemory(memory_key="chat_history", return_messages=True, k=7)
llm = AzureChatOpenAI(
temperature=0,
deployment_name="********************",
model_name="gpt-35-turbo-16k",
openai_api_base="***************************",
openai_api_version="2023-07-01-preview",
openai_api_key="**************",
openai_api_type="azure"
)
agent_chain=initialize_agent(
tools,
llm,
agent=AgentType.OPENAI_FUNCTIONS,
verbose=True,
agent_kwargs=agent_kwargs,
memory=memory,
callbacks=[MyCustomHandler()]
)

```

The initialize_agent function creates an agent object with the specifications you have
entered in the function as arguments.

This agent is what manages the whole interaction with the LLM. The agent is run like this
→ answer=agent_chain.run(“the query put in by the user”)

The tools and memory are provided to the agent. I have used the
ConversationBufferWindowMemory() which allows me to specify the value k as 7. This
means that the last 7 conversations (input and output) are available to the LLM when you
ask a new question.

```
class sequentialchain(BaseTool):
    def _run(self, run_manager: Optional[CallbackManagerForToolRun] = None) -> str:
        tables = similarity_search(self)
        print(tables)
        sql_chain = SQLAgent(tables)
        querycheckchain = querycheckfunc(tables)
        executorchainobj = QueryExecutorChain(user_query=self)
        overall_chain = SimpleSequentialChain(
            chains=[sql_chain, querycheckchain, executorchainobj], verbose=True
        )
        review = overall_chain.run(self)
        return review
```

The similarity_search() function gets the appropriate table descriptions from the vector db
and provides them as input variables for the chains so they can write proper SQL queries.

The SimpleSequentialChain() has 3 chains passed to it (sql_chain, querycheckchain,
executorchainobj) which are run in succession. The output of the first chain is passed to
the second chain as an input variable and the output of the second chain is passed to the
third chain as an input variable.

The **sql_chain** — based on a prompt on how to create SQL queries and table
descriptions makes SQL queries.

The **querycheckchain** — Receives the SQL query from sql_chain, then corrects all
the errors, syntax, adds missing elements if any and makes it compliant to the standards
described in prompt.

The **executorchainobj** — This chain segment is actually a tool passed as a chain. It
receives the SQL query that is ready to be run on the database.

The output or fetched data after running the SQL query is then received by the agent
which had called the sequentialchain. The agent interprets the fetched data in accordance
with the user’s input question, formats it and provides the final answer/response to the
user. If the agent wants to do some analysis on the fetched data, it can send this data to the
analysis chain, the output of which can then be formatted into a final answer/response.

If the question asked by a user is a follow-up question, the agent can look at the memory;
if it finds the necessary data there, it can formulate the answer based on the memory alone,
or, if some analysis is needed, it can send that data directly to the analysis chain.

Agent decides when to use the memory, which tool to use or if to use any tool at all.

**NOTE: I have used a custom chain (analysis chain) provided as a tool to the agent.
There are predefined tools for all sorts of purposes like math, SQL connections, google
drive connections, AWS Lambda connections etc.**

The analysis chain is a normal LLM call chain and has prompt instructions to do various
types of statistical analysis (mean, median, standard deviation, variance etc.), calculate
growth, percentages and other mathematical operations.
Callback Handlers can also be added to perform various tasks at certain defined stages of
the application run cycle.

```
class MyCustomHandler(BaseCallbackHandler):
    def on_llm_new_token(self, token: str, **kwargs) -> None:
        print(f"My custom handler, token: {token}")
        for key, value in kwargs.items():
            print("%s == %s" % (key, value))

    def on_llm_end(self, outputs, *, run_id, parent_run_id, **kwargs):
        """Run when llm call ends running."""
        print(run_id)

    def on_chain_end(self, outputs, *, run_id, parent_run_id, **kwargs):
        """Run when chain ends running."""
        print(run_id)

```

The CallbackHandler MyCustomHandler() has been configured with a certain set of
code that runs on on_chain_end and on_llm_end. The names of these hooks are self
explanatory. When the object of this class is provided to the appropriate agent, tool, chain
etc., the code inside these hooks runs as their names suggest.

All sorts of hooks such as on_chain_start, on_chain_end, on_tool_start, on_tool_end are
available which can be specified to do certain tasks under the BaseCallbackHandler class.

```

prompt_template = PromptTemplate(input_variables=["query"], template=template)

query_check_chain = LLMChain(
    llm=llm,
    prompt=prompt_template,
    output_key="review",
    callbacks=[MyCustomHandler()]
)

```
The handler in this case, MyCustomHandler(), can be provided to the appropriate agent,
tool or chain in the callbacks argument.

When all of this is set up and the agent is run (agent_chain.run("user's input question")),
the application can write the SQL queries itself, run them to fetch data from the database,
analyse the data, and give proper information as output to the user. The user never has to
open the database, write SQL queries, fetch the data, or dig through it for analysis.
Everything happens automatically from start to finish.

ref: https://medium.com/@saumitra1joshi/langchain-agents-tools-chains-memory-for-
utilizing-the-full-potential-of-llms-211e5dfee3fa

https://community.deeplearning.ai/t/agents-vs-chains-vs-tools/516148/2
RAG
The Curse Of The LLMs
As usage exploded, so did the expectations. Many users started using ChatGPT as a
source of information, like an alternative to Google. As a result, they also started
encountering prominent weaknesses of the system. Concerns around copyright, privacy,
security, ability to do mathematical calculations etc. aside, people realised that there are
two major limitations of Large Language Models.

Curse of the LLMs

> _Users look at LLMs for knowledge and wisdom, yet LLMs are sophisticated predictors
of what word comes next._

The Challenge
- Make LLMs respond with up-to-date information

- Make LLMs not respond with factually inaccurate information

- Make LLMs aware of proprietary information

What is RAG?
In 2023, RAG became one of the most used techniques in the domain of Large Language
Models.

Retrieval Augmented Generation

- _User writes a prompt or a query that is passed to an orchestrator_

- _Orchestrator sends a search query to the retriever_

- _Retriever fetches the relevant information from the knowledge sources and sends it back_

- _Orchestrator augments the prompt with the retrieved context and sends it to the LLM_

- _LLM responds with the generated text which is displayed to the user via the orchestrator_
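
The orchestration flow above can be sketched in plain Python; `retriever` and `llm` below
are hypothetical callables standing in for a real vector-store search and a real model call:

```
def rag_answer(query, retriever, llm):
    # 1. the retriever fetches the most relevant chunks from the knowledge sources
    context_chunks = retriever(query)

    # 2. the orchestrator augments the prompt with the retrieved context
    prompt = (
        "Answer the question using only the context below.\n\n"
        "Context:\n" + "\n".join(context_chunks) + "\n\n"
        "Question: " + query
    )

    # 3. the LLM generates a grounded response, which is returned to the user
    return llm(prompt)
```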

How does RAG help?

Unlimited Knowledge

The Retriever of an RAG system can have access to external sources of information.
Therefore, the LLM is not limited to its internal knowledge. The external sources can be
proprietary documents and data or even the internet.

[Figure: Expanding LLM Memory with RAG]

Confidence in Responses

With the context (extra information that is retrieved) made available to the LLM, the
confidence in LLM responses is increased.

[Figure: Increasing Confidence in LLM Responses]

As the RAG technique evolves and becomes accessible with frameworks
like [LangChain](https://www.linkedin.com/company/langchain/) and
[LlamaIndex](https://www.linkedin.com/company/llamaindex/), it is finding more and
more applications in LLM-powered systems like QnA with documents, conversational
agents, recommendation systems and content generation.

ref: https://www.linkedin.com/pulse/context-key-significance-rag-language-models-
abhinav-kimothi-nebnc/

**New RAG techniques:**

1. **Chain of Note (CoN)** - CoN generates notes for the documents that have been
retrieved, which results in a more factually correct answer; and because the notes are
generated at the intermediate steps used to break down the problem, the trustworthiness of
the final answer also increases. https://cobusgreyling.medium.com/chain-of-note-con-retrieval-for-llms-763ead1ae5c5

2. **Corrective RAG** - This technique adds a binary decision step: if the retrieved
answer is ambiguous, the query is passed to web search, the search results are collected,
and the LLM is triggered again to answer the query while considering both the RAG
documents and the search results. https://medium.com/the-ai-forum/implementing-a-flavor-of-corrective-rag-using-langchain-chromadb-zephyr-7b-beta-and-openai-30d63e222563

3. **RAG Fusion** - A query is broken into small sub-queries in this approach. These
sub-queries are then given to a vector DB to retrieve the most relevant documents for each
query. Finally, using the Reciprocal Rank Fusion algorithm, the most relevant information
is prioritized. (In [LlamaIndex](https://www.linkedin.com/company/llamaindex/), when I
used the combination of Recursive Retrieval and Semantic Chunking
+ [Pinecone](https://www.linkedin.com/company/pinecone-io/) as the vector DB, results
came out best for our RAG application.)

- RAG-Fusion improves traditional search systems by overcoming their limitations
through a multi-query approach. It expands user queries into multiple diverse perspectives
using a Language Model (LLM). This strategy goes beyond capturing explicit information
and delves into uncovering deeper, transformative knowledge. The fusion process
involves conducting parallel vector searches for both the original and expanded queries,
intelligently re-ranking to optimize results, and pairing the best outcomes with new
queries.

https://medium.com/@kbdhunga/advanced-rag-rag-fusion-using-langchain-772733da00b7

4. **Self-RAG** - A technique where the LLM performs self-reflection for dynamic
retrieval, critique, and generation. https://github.com/run-llama/llama-hub/blob/main/llama_hub/llama_packs/self_rag/self_rag.ipynb

https://cobusgreyling.medium.com/self-reflective-retrieval-augmented-generation-self-rag-f5cbad4412d5

ref:
https://www.linkedin.com/feed/update/urn:li:activity:7185147270554681344/?commentU
rn=urn%3Ali%3Acomment%3A(ugcPost%3A7183502198595629058%2C718616583672
8979456)&dashCommentUrn=urn%3Ali%3Afsd_comment%3A(7186165836728979456
%2Curn%3Ali%3AugcPost%3A7183502198595629058)&dashReplyUrn=urn%3Ali%3A
fsd_comment%3A(7186182955751354368%2Curn%3Ali%3AugcPost%3A71835021985
95629058)&replyUrn=urn%3Ali%3Acomment%3A(ugcPost%3A7183502198595629058
%2C7186182955751354368)

https://www.linkedin.com/feed/update/urn:li:activity:7180436217006600194/
groq
What is groq?
Groq, founded in 2016 by Jonathan Ross, has built chips specifically designed for
inference, that is, running generative AI models. It says its chips, dubbed "language
processing units" (LPUs), are not only quicker but also one-tenth the cost of conventional
AI hardware.

What is LPU?
**Groq's Language Processing Unit (LPU)** represents a paradigm shift in processor
architecture, designed to revolutionize high-performance computing (HPC) and artificial
intelligence (AI) workloads. This section will delve into the components, architecture, and
workings of the LPU, highlighting its potential to transform the landscape of HPC and AI.

How Groq's LPU Works

The LPU's unique architecture enables it to outperform traditional CPUs and GPUs in
HPC and AI workloads. Here's a step-by-step breakdown of how the LPU works:

**1. Data Input:** Data is fed into the LPU, triggering the Centralized Control Unit to
issue instructions to the Processing Elements (PEs).

**2. Massively Parallel Processing:** The PEs, organized in SIMD arrays, execute the
same instruction on different data points concurrently, resulting in massively parallel
processing.

**3. High-Bandwidth Memory Hierarchy:** The LPU's memory hierarchy, including
on-chip SRAM and off-chip memory, ensures high-bandwidth, low-latency data access.

**4. Centralized Control Unit:** The Centralized Control Unit manages the flow of
data and instructions, coordinating the execution of thousands of operations in a single
clock cycle.
**5. Network-on-Chip (NoC):** A high-bandwidth Network-on-Chip (NoC)
interconnects the PEs, the CU, and the memory hierarchy, enabling fast, efficient
communication between different components of the LPU.

**6. Processing Elements:** The Processing Elements consist of Arithmetic Logic
Units, Vector Units, and Scalar Units, executing operations on large data sets
simultaneously.

**7. Data Output:** The LPU outputs data based on the computations performed by the
Processing Elements.
How LPU is different from GPU
**1. Architecture:**

**- LPU:** An LPU is designed specifically for natural language processing tasks, with
a multi-stage pipeline that includes tokenization, parsing, semantic analysis, feature
extraction, machine learning models, and inference/prediction.

**- GPU:** A GPU has a more complex architecture, consisting of multiple streaming
multiprocessors (SMs) or compute units, each containing multiple CUDA cores or stream
processors.
**2. Instruction Set:**

**- LPU:** The LPU's instruction set is optimized for natural language processing tasks,
with support for tokenization, parsing, semantic analysis, and feature extraction.

**- GPU:** A GPU has a more general-purpose instruction set, designed for high-
throughput, high-bandwidth data processing.

**3. Memory Hierarchy:**

**- LPU:** The LPU's memory hierarchy is optimized for natural language processing
tasks, with a focus on efficient data access and processing.

**- GPU:** A GPU has a more complex memory hierarchy, including registers, shared
memory, L1/L2 caches, and off-chip memory. The memory hierarchy in GPUs is designed
for high-throughput, high-bandwidth data access, but may have higher latency compared
to the LPU for specific NLP tasks.

**4. Power Efficiency and Performance:**

**- LPU:** The LPU is designed for high power efficiency and performance, with a
focus on natural language processing tasks. It can deliver superior performance per watt
compared to GPUs for specific NLP workloads.

**- GPU:** GPUs are designed for high throughput and performance, particularly for
graphics rendering and parallel computations. However, they may consume more power
than an LPU for the same NLP workload due to their more complex architecture and
larger number of processing units.

**5. Applications:**

**- LPU:** The LPU is well-suited for natural language processing tasks, such as
tokenization, parsing, semantic analysis, feature extraction, and machine learning model
inference.

**- GPU:** GPUs are widely used in applications such as gaming, computer-aided
design (CAD), scientific simulations, and machine learning. However, they are not
optimized for natural language processing tasks, and an LPU would generally provide
better performance and power efficiency for such tasks.

In summary, the LPU and GPU have different architectural designs and use cases. The
LPU is designed specifically for natural language processing tasks, while GPUs are
designed for high-throughput, high-bandwidth data processing, particularly for graphics
rendering and parallel computations. The LPU offers a more streamlined, power-efficient
architecture for natural language processing tasks, while GPUs provide a more complex,
feature-rich architecture for a broader range of applications.

ref: https://www.linkedin.com/pulse/groqs-lpu-revolutionary-leap-processing-computing-
ai-abhijit-singh-y0rdc/

Groq Tools
Groq API endpoints support tool use for programmatic execution of specified operations
through requests with explicitly defined operations. With tool use, Groq API model
endpoints deliver structured JSON output that can be used to directly invoke functions
from desired codebases.

[Models](https://console.groq.com/docs/tool-use#models )

These following models powered by Groq all support tool use:

- **llama3-70b**

- **llama3-8b**

- **llama2-70b**

- **mixtral-8x7b**

- **gemma-7b-it**

Parallel tool calling is enabled for both Llama3 models.

[Use Cases](https://console.groq.com/docs/tool-use#use-cases )

- **Convert natural language into API calls:** Interpreting user queries in natural
language, such as “What’s the weather in Palo Alto today?”, and translating them into
specific API requests to fetch the requested information.

- **Call external API:** Automating the process of periodically gathering stock prices
by calling an API, comparing these prices with predefined thresholds and automatically
sending alerts when these thresholds are met.
- **Resume parsing for recruitment:** Analyzing resumes in natural language to
extract structured data such as candidate name, skillsets, work history, and education, that
can be used to populate a database of candidates matching certain criteria.

[Example](https://console.groq.com/docs/tool-use#example )

```
from groq import Groq
import os
import json

client = Groq(api_key=os.getenv('GROQ_API_KEY'))

MODEL = 'mixtral-8x7b-32768'

# Example dummy function hard coded to return the score of an NBA game
def get_game_score(team_name):
    """Get the current score for a given NBA game"""
    if "warriors" in team_name.lower():
        return json.dumps({"game_id": "401585601", "status": 'Final', "home_team": "Los Angeles Lakers", "home_team_score": 121, "away_team": "Golden State Warriors", "away_team_score": 128})
    elif "lakers" in team_name.lower():
        return json.dumps({"game_id": "401585601", "status": 'Final', "home_team": "Los Angeles Lakers", "home_team_score": 121, "away_team": "Golden State Warriors", "away_team_score": 128})
    elif "nuggets" in team_name.lower():
        return json.dumps({"game_id": "401585577", "status": 'Final', "home_team": "Miami Heat", "home_team_score": 88, "away_team": "Denver Nuggets", "away_team_score": 100})
    elif "heat" in team_name.lower():
        return json.dumps({"game_id": "401585577", "status": 'Final', "home_team": "Miami Heat", "home_team_score": 88, "away_team": "Denver Nuggets", "away_team_score": 100})
    else:
        return json.dumps({"team_name": team_name, "score": "unknown"})

def run_conversation(user_prompt):
    # Step 1: send the conversation and available functions to the model
    messages = [
        {
            "role": "system",
            "content": "You are a function calling LLM that uses the data extracted from the get_game_score function to answer questions around NBA game scores. Include the team and their opponent in your response."
        },
        {
            "role": "user",
            "content": user_prompt,
        }
    ]
    tools = [
        {
            "type": "function",
            "function": {
                "name": "get_game_score",
                "description": "Get the score for a given NBA game",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "team_name": {
                            "type": "string",
                            "description": "The name of the NBA team (e.g. 'Golden State Warriors')",
                        }
                    },
                    "required": ["team_name"],
                },
            },
        }
    ]
    response = client.chat.completions.create(
        model=MODEL,
        messages=messages,
        tools=tools,
        tool_choice="auto",
        max_tokens=4096
    )

    response_message = response.choices[0].message
    tool_calls = response_message.tool_calls

    # Step 2: check if the model wanted to call a function
    if tool_calls:
        # Step 3: call the function
        # Note: the JSON response may not always be valid; be sure to handle errors
        available_functions = {
            "get_game_score": get_game_score,
        }  # only one function in this example, but you can have multiple
        messages.append(response_message)  # extend conversation with assistant's reply

        # Step 4: send the info for each function call and function response to the model
        for tool_call in tool_calls:
            function_name = tool_call.function.name
            function_to_call = available_functions[function_name]
            function_args = json.loads(tool_call.function.arguments)
            function_response = function_to_call(
                team_name=function_args.get("team_name")
            )
            messages.append(
                {
                    "tool_call_id": tool_call.id,
                    "role": "tool",
                    "name": function_name,
                    "content": function_response,
                }
            )  # extend conversation with function response
        second_response = client.chat.completions.create(
            model=MODEL,
            messages=messages
        )  # get a new response from the model where it can see the function response
        return second_response.choices[0].message.content

user_prompt = "What was the score of the Warriors game?"

print(run_conversation(user_prompt))

```

[Sequence of Steps](https://console.groq.com/docs/tool-use#sequence-of-steps )

- **Initialize the API client**: Set up the Groq Python client with your API key and
specify the model to be used for generating [conversational
responses](https://console.groq.com/docs/text-chat#streaming-a-chat-completion).

- **Define the function and conversation parameters**: Create a user query and
define a function (`get_game_score`) that can be called by the model, detailing its
purpose, input parameters, and expected output format.

- **Process the model’s request**: Submit the initial conversation to the model, and if
the model requests to call the defined function, extract the necessary parameters from the
model’s request and execute the function to get the response.

- **Incorporate function response into conversation**: Append the function’s output


to the conversation and a structured message and resubmit to the model, allowing it to
generate a response that includes or reacts to the information provided by the function
call.

[Tools Specifications](https://console.groq.com/docs/tool-use#tools-specifications )

- `tools`: an array with each element representing a tool

- `type`: a string indicating the category of the tool

- `function`: an object that includes:

- `description` - a string that describes the function’s purpose, guiding the model on
when and how to use it

- `name`: a string serving as the function’s identifier

- `parameters`: an object that defines the parameters the function accepts


[Tool Choice](https://console.groq.com/docs/tool-use#tool-choice )

- `tool_choice`: A parameter that dictates if the model can invoke functions.

- `auto`: The default setting where the model decides between sending a text response
or calling a function

- `none`: Equivalent to not providing any tool specification; the model won't call any
functions

- Specifying a Function:

- To mandate a specific function call, use `{"type": "function", "function":


{"name":"get_financial_data"}}`

- The model is constrained to utilize the specified function

[Known limitations](https://console.groq.com/docs/tool-use#known-limitations )

- Parallel tool use is disabled because of limitations of the Mixtral model. The endpoint
will always return at most a single `tool_call` at a time.

ref: https://console.groq.com/docs/tool-use

Groq and RAG Architecture Example

**Retrieval-Augmented Generation (RAG)** is a new approach that leverages Large
Language Models (LLMs) to automate knowledge search, synthesis, extraction, and
planning from unstructured data sources. This method has gained prominence over the
past year due to its ability to enhance LLM applications with contextual information. The
RAG data stack consists of several key components:

- **Loading Data**: Initially, data is ingested from various sources, such as text
documents, websites, or databases. This data can be in a raw or preprocessed format.

- **Processing Data**: The data undergoes preprocessing steps to clean and structure it
for further analysis. This may include tasks like tokenization, stemming, and removing
stop words.

- **Embedding Data**: Each piece of data is converted into a numerical representation
called an embedding. This embedding captures semantic information about the data,
making it easier for the LLM to understand and process.

- **Vector Database:** The embeddings are stored in a vector database, which allows
for efficient retrieval based on similarity metrics. This database enables quick access to
relevant data points during the generation process.

- **Retrieval and Prompting:** During the generation process, the LLM can retrieve
relevant data points from the vector database based on the context of the current input.
This retrieval mechanism helps the LLM provide more accurate and contextually relevant
outputs.

Overall, the RAG approach enhances the capabilities of LLMs by enabling them to
leverage external knowledge sources in a systematic and efficient manner. This can lead to
more powerful and contextually aware applications in various domains, such as natural
language understanding, information retrieval, and decision-making.

Building a production-grade RAG system remains a complex and subtle problem. Some of the associated challenges are as follows:

- **Results aren’t accurate enough:** the application is not able to produce satisfactory results for a long tail of input tasks/queries.

- **The number of parameters to tune is overwhelming:** it’s not clear which parameters to tune across data parsing, ingestion, and retrieval.

- **PDFs are specifically a problem:** complex documents often have lots of messy formatting. How do we represent this in the right way so the LLM can understand it?

- **Data syncing is a challenge:** production data often updates regularly, and continuously syncing new data brings a new set of challenges.

With the sole intent of solving the above problems, on February 20, 2024 LlamaIndex launched LlamaCloud and LlamaParse, a new generation of managed parsing, ingestion, and retrieval services designed to bring **production-grade** **context-augmentation** to our LLM and RAG applications.

The main intuition behind LlamaCloud is to let teams focus on writing the business logic rather than on data wrangling: process large volumes of production data and immediately get better response quality. It has the following two components:

1. **LlamaParse:** Proprietary parsing for complex documents with embedded objects such as tables and figures. LlamaParse integrates directly with LlamaIndex ingestion and retrieval, letting you build retrieval over complex, semi-structured documents. It promises to answer complex questions that simply weren’t possible previously.

2. **Managed Ingestion and Retrieval API:** An API which allows you to easily
load, process, and store data for your RAG app and consume it in any language. Backed
by data sources in [LlamaHub](https://llamahub.ai/), including LlamaParse, and data
storage integrations.

What is LlamaParse?
LlamaParse is a proprietary parsing service that is incredibly good at parsing PDFs with complex tables into a well-structured markdown format.
The service is available in **public preview mode**: open to everyone, with a usage limit of 1k pages per day (7,000 free pages per week); beyond that, $0.003 per page ($3 per 1,000 pages). It operates as a standalone service that can also be plugged into the managed ingestion and retrieval API.

```
from llama_parse import LlamaParse

parser = LlamaParse(
api_key="llx-...", # can also be set in your env as LLAMA_CLOUD_API_KEY
result_type="markdown", # "markdown" and "text" are available
verbose=True
)

```

Currently LlamaParse primarily supports PDFs with tables, but better support for figures and an expanded set of the most popular document types (.docx, .pptx, .html) is being built out as part of the next enhancements.

Rich table support

Since we first released LlamaParse it has featured [industry-leading table extraction](https://github.com/run-llama/llama_parse/blob/main/examples/demo_advanced.ipynb) capabilities. Under the hood, this has been using LLM intelligence since the start. It seamlessly integrates with the advanced indexing/retrieval capabilities that the open-source framework offers, enabling users to build state-of-the-art document RAG. Now with JSON mode (see below) and parsing instructions, you can take this even further.
Example 2: parsing comic books

Parsing translated manga presents a particular challenge for a parser since a regular parser
interprets the panels as cells in a table, and the reading order is right-to-left even though
the book is in English, as shown in this extract from "The manga guide to calculus", by
Hiroyuki Kojima:

Using LlamaParse, you can give the parser plain, English-language instructions on what to
do:

```
The provided document is a manga comic book.
Most pages do NOT have title. It does not contain tables.
Try to reconstruct the dialogue happening in a cohesive way.

```
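
These instructions are passed to the parser through the `parsing_instruction` parameter (listed in the parameters section further below); a minimal sketch, where the input file path is a hypothetical placeholder:

```
from llama_parse import LlamaParse

parsing_instruction = """The provided document is a manga comic book.
Most pages do NOT have title. It does not contain tables.
Try to reconstruct the dialogue happening in a cohesive way."""

parser = LlamaParse(
    api_key="llx-...",                        # or set LLAMA_CLOUD_API_KEY in your env
    result_type="markdown",
    parsing_instruction=parsing_instruction,  # plain-English guidance for the parser
)
documents = parser.load_data("./manga_guide_to_calculus.pdf")  # hypothetical path
```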

(You can see the full code in our [demonstration notebook](https://colab.research.google.com/drive/1dO2cwDCXjj9pS9yQDZ2vjg-0b5sRXQYo), including what it looks like to parse this without the instructions.)

The result is a perfect parse!

```
# The Asagake Times

Sanda-Cho Distributor

A newspaper distributor?

Do I have the wrong map?

```

Example 3: mathematical equations

Another challenging format for parsing is complex mathematical equations (by coincidence, the manga we picked as an example is all about how to do mathematics). To parse this, we take the same instructions as before and add one sentence: `Output any math equation in LATEX markdown (between $$)`. The result of parsing is clean LaTeX, which renders the equations perfectly.

For local deployment using Docker, see: https://blog.gopenai.com/running-pdf-parsers-in-docker-containers-5e7a7ed829c8

Code Implementation

The code is implemented in Google Colab (cpu)

Install required dependencies


```
%%writefile requirements.txt
langchain
langchain-community
llama-parse
fastembed
chromadb
python-dotenv
langchain-groq
chainlit
unstructured[md]
```

Then install them in a separate cell (the `%%writefile` magic writes the whole cell to the file, so the install command must not share that cell):

```
!pip install -r requirements.txt
```

Set up the environment variables

```
from google.colab import userdata

llamaparse_api_key = userdata.get('LLAMA_CLOUD_API_KEY')
groq_api_key = userdata.get("GROQ_API_KEY")
```

Import required dependencies

```
##### LLAMAPARSE #####
from llama_parse import LlamaParse

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings.fastembed import FastEmbedEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import UnstructuredMarkdownLoader
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA
#
from groq import Groq
from langchain_groq import ChatGroq
#
import joblib
import os
import nest_asyncio  # noqa: E402
nest_asyncio.apply()
```

LlamaParse Parameters

```
* api_key: str = Field(
      default="",
      description="The API key for the LlamaParse API.",
  )
* base_url: str = Field(
      default=DEFAULT_BASE_URL,
      description="The base URL of the Llama Parsing API.",
  )
* result_type: ResultType = Field(
      default=ResultType.TXT, description="The result type for the parser."
  )
* num_workers: int = Field(
      default=4,
      gt=0,
      lt=10,
      description="The number of workers to use when sending API requests for parsing.",
  )
* check_interval: int = Field(
      default=1,
      description="The interval in seconds to check if the parsing is done.",
  )
* max_timeout: int = Field(
      default=2000,
      description="The maximum timeout in seconds to wait for the parsing to finish.",
  )
* verbose: bool = Field(
      default=True, description="Whether to print the progress of the parsing."
  )
* language: Language = Field(
      default=Language.ENGLISH, description="The language of the text to parse."
  )
* parsing_instruction: Optional[str] = Field(
      default="",
      description="The parsing instruction for the parser.",
  )

```

Helper function to load and parse the input data

```
!mkdir data
#
def load_or_parse_data():
    data_file = "./data/parsed_data.pkl"

    if os.path.exists(data_file):
        # Load the parsed data from the file
        parsed_data = joblib.load(data_file)
    else:
        # Perform the parsing step and store the result in llama_parse_documents
        parsingInstructionUber10k = """The provided document is a quarterly report filed by Uber Technologies, Inc. with the Securities and Exchange Commission (SEC).
This form provides detailed financial information about the company's performance for a specific quarter.
It includes unaudited financial statements, management discussion and analysis, and other relevant disclosures required by the SEC.
It contains many tables.
Try to be precise while answering the questions"""
        parser = LlamaParse(api_key=llamaparse_api_key,
                            result_type="markdown",
                            parsing_instruction=parsingInstructionUber10k,
                            max_timeout=5000,)
        llama_parse_documents = parser.load_data("./data/uber_10q_march_2022 (1).pdf")

        # Save the parsed data to a file
        print("Saving the parse results in .pkl format ..........")
        joblib.dump(llama_parse_documents, data_file)

        # Set the parsed data to the variable
        parsed_data = llama_parse_documents

    return parsed_data

```

Helper function to load chunks into vectorstore.

```
# Create vector database
def create_vector_database():
    """
    Creates a vector database using document loaders and embeddings.

    This function loads the parsed documents, splits them into chunks,
    transforms them into embeddings using FastEmbedEmbeddings,
    and finally persists the embeddings into a Chroma vector database.
    """
    # Call the function to either load or parse the data
    llama_parse_documents = load_or_parse_data()
    print(llama_parse_documents[0].text[:300])

    with open('data/output.md', 'a') as f:  # Open the file in append mode ('a')
        for doc in llama_parse_documents:
            f.write(doc.text + '\n')

    markdown_path = "/content/data/output.md"
    loader = UnstructuredMarkdownLoader(markdown_path)
    #loader = DirectoryLoader('data/', glob="**/*.md", show_progress=True)
    documents = loader.load()

    # Split loaded documents into chunks
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=100)
    docs = text_splitter.split_documents(documents)

    #len(docs)
    print(f"length of documents loaded: {len(documents)}")
    print(f"total number of document chunks generated :{len(docs)}")
    #docs[0]

    # Initialize Embeddings
    embed_model = FastEmbedEmbeddings(model_name="BAAI/bge-base-en-v1.5")

    # Create and persist a Chroma vector database from the chunked documents
    vs = Chroma.from_documents(
        documents=docs,
        embedding=embed_model,
        persist_directory="chroma_db_llamaparse1",  # local mode, persisted to disk
        collection_name="rag"
    )

    #query it
    #query = "what is the agenda of Financial Statements for 2022 ?"
    #found_doc = qdrant.similarity_search(query, k=3)
    #print(found_doc[0][:100])
    #print(qdrant.get())

    print('Vector DB created successfully !')
    return vs, embed_model

```

Process the data and create Vector Store

```
vs,embed_model = create_vector_database()

```

Instantiate LLM

```
chat_model = ChatGroq(temperature=0,
model_name="mixtral-8x7b-32768",
api_key=userdata.get("GROQ_API_KEY"),)

```

The above code does the following:

- Creates a new ChatGroq object named chat_model

- Sets the temperature parameter to 0, indicating that the responses should be more
predictable

- Sets the model_name parameter to “mixtral-8x7b-32768“, specifying the language model to use

- Passes the Groq API key retrieved from the Colab user data

Instantiate Vectorstore

```
vectorstore = Chroma(embedding_function=embed_model,
persist_directory="chroma_db_llamaparse1",
collection_name="rag")
#
retriever=vectorstore.as_retriever(search_kwargs={'k': 3})

```

Create a Custom Prompt Template

```
custom_prompt_template = """Use the following pieces of information to answer the
user's question.
If you don't know the answer, just say that you don't know, don't try to make up an
answer.

Context: {context}
Question: {question}

Only return the helpful answer below and nothing else.


Helpful answer:
"""

```

Helper Function to format the prompt

```
def set_custom_prompt():
"""
Prompt template for QA retrieval for each vectorstore
"""
prompt = PromptTemplate(template=custom_prompt_template,
input_variables=['context', 'question'])
return prompt
#
prompt = set_custom_prompt()
prompt

########################### RESPONSE ###########################

PromptTemplate(input_variables=['context', 'question'], template="Use the following pieces of information to answer the user's question.\nIf you don't know the answer, just say that you don't know, don't try to make up an answer.\n\nContext: {context}\nQuestion: {question}\n\nOnly return the helpful answer below and nothing else.\nHelpful answer:\n")
```

Instantiate the Retrieval Question Answering Chain

```
qa = RetrievalQA.from_chain_type(llm=chat_model,
chain_type="stuff",
retriever=retriever,
return_source_documents=True,
chain_type_kwargs={"prompt": prompt})

```

Invoke the Retrieval QA Chain

```
response = qa.invoke({"query": "what is the Balance of UBER TECHNOLOGIES, INC.as of
December 31, 2021?"})

```

Response Synthesized

```
response['result']

########################### RESPONSE ###########################

Based on the provided balance sheet of Uber Technologies, Inc. as of December 31, 2021,
the total assets are $38,774 million, total liabilities are $23,425 million, and total equity is
$9,613 million.

```
Question 2

```
response = qa.invoke({"query": "What is the Cash flows from operating activities
associated with bad expense specified in the document ?"})
response['result']

######################## RESPONSE ###############################

The Cash flows from operating activities associated with bad debt expense is 23 for the
year 2021 and 18 for the year 2022.

```

Question 3

```
response = qa.invoke({"query": "what is Loss (income) from equity method
investments, net ?"})
response["result"]

############################### RESPONSE #############################

The loss from equity method investments, net, is calculated as the sum of the impairment
of equity method investment and the revaluation of MLU B.V. call option, which
amounted to $182 million and $181 million, respectively. This results in a total loss from
equity method investments, net, of $363 million. This loss is included in the net loss
attributable to Uber Technologies, Inc. of $5.9 billion.
```

Question 4

```
response = qa.invoke({"query": "What is the Total cash and cash equivalents, and
restricted cash and cash equivalents for reconciliation ?"})
response['result']

######################## RESPONSE ####################################

The total cash and cash equivalents, and restricted cash and cash equivalents for
reconciliation is $6,607 million. This amount is obtained by adding the cash and cash
equivalents of $4,836 million and the restricted cash and cash equivalents - current of
$247 million, and the restricted cash and cash equivalents - non-current of $1,524 million.

```

Question 5
```
response = qa.invoke({"query":"Based on the CONDENSED CONSOLIDATED STATEMENTS OF
REDEEMABLE NON-CONTROLLING INTERESTS AND EQUITY what is the Balance as of March 31,
2021?"})
print(response['result'])

############# RESPONSE ##################

The balance as of March 31, 2021 was $473 for Redeemable Non-Controlling Interests,
1,867,369 shares for Common Stock, $— for Additional Paid-In Capital, $36,182 for
Other Comprehensive Income (Loss), $654 for Non-Controlling Interests, and $654 for
Total Equity.

```

Question 6

```
response = qa.invoke({"query":"Based on the condensed consolidated statements of
comprehensive Income(loss) what is the Comprehensive income (loss) attributable to
Uber Technologies, Inc.for the three months ended March 31, 2022"})
response['result']

######################### RESPONSE####################################

The Comprehensive income (loss) attributable to Uber Technologies, Inc. for the three
months ended March 31, 2022 was $(5,911) million. This information can be found on the
Uber Technologies, Inc. - Condensed Consolidated Statements of Comprehensive Income
(Loss) provided in the quarterly report.

```

Question 7
```
response = qa.invoke({"query":"Based on the condensed consolidated statements of
comprehensive Income(loss) what is the Comprehensive income (loss) attributable to
Uber Technologies?"})
response['result']

##################### RESPONSE #################################

The Comprehensive income (loss) attributable to Uber Technologies, Inc. for the three
months ended March 31, 2021 is $1,081 million, and for the three months ended March
31, 2022 is -$5,911 million.

```

Question 8

```
response = qa.invoke({"query":"Based on the condensed consolidated statements of
comprehensive Income(loss) what is the Net loss including non-controlling
interests"})
response['result']

################ RESPONSE #######################################

The Net loss including non-controlling interests is $(122) million for the three months
ended March 31, 2021 and $(5,918) million for the three months ended March 31, 2022.

```

Question 9

```
response = qa.invoke({"query":"what is the Net cash used in operating activities for
Mrach 31,2021? "})
response['result']

############## RESPONSE ###############################

Net cash used in operating activities for March 31, 2021 was $611 million.

```

Question 10

```
query = "Based on the CONDENSED CONSOLIDATED STATEMENTS OF CASH FLOWS What is the
value of Purchases of property and equipment ?"
response = qa.invoke({"query":query})
response['result']

####################### RESPONSE #####################################


The value of purchases of property and equipment for the three months ended March 31,
2021 and 2022 can be found in the 'Cash flows from investing activities' section of the
condensed consolidated statements of cash flows.

For the three months ended March 31, 2021: $71 million

For the three months ended March 31, 2022: $62 million

```

Question 11

```
query = "Based on the CONDENSED CONSOLIDATED STATEMENTS OF CASH FLOWS what is the
Purchases of property and equipment for the year 2022?"
response = qa.invoke({"query":query})
response['result']

########### RESPONSE #####################################

The purchases of property and equipment for the year 2022 based on the CONDENSED
CONSOLIDATED STATEMENTS OF CASH FLOWS is -62.

```

From the above implementation we can conclude that LlamaParse is comparatively good at parsing complex PDF documents, although we still have to experiment with more tabular structures. A comparison of LlamaParse with PyPDF can be found in the references below.

ref:

https://www.llamaindex.ai/blog/launching-the-first-genai-native-document-parsing-
platform

https://medium.com/the-ai-forum/rag-on-complex-pdf-using-llamaparse-langchain-and-
groq-5b132bd1f9f3

https://wow.groq.com/retrieval-augmented-generation-with-groq-api/

Use Case – 1
Conversational AI chatbot

implementation-1-A4000

- We use 2XA4000 GPUs with low memory and the Mistral 7B model in this experiment.

- The first big challenge is running the model within this limited memory; for this we had to use quantization.


# Code Implementation

**Import required dependencies**

```
# import dependencies
import pysqlite3
import sys
sys.modules["sqlite3"] = sys.modules.pop("pysqlite3")

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,
pipeline, AutoConfig, TextStreamer, TextIteratorStreamer

import os
import gradio as gr

from langchain.llms import HuggingFacePipeline


from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain import HuggingFacePipeline
from langchain.document_loaders import PyPDFDirectoryLoader #for pdf
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory, ConversationSummaryMemory
from langchain.document_loaders import UnstructuredURLLoader #for html
from langchain_community.vectorstores import FAISS
from IPython.display import Audio, display
from gtts import gTTS
from io import BytesIO
import base64
import time
from langchain_community.document_loaders import DirectoryLoader
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

```

**define template:**

```
template = """You are an assistant for question-answering tasks. Use the following
pieces of retrieved context to answer the question. If you don't know the answer,
just say that you don't know. Use two sentences maximum and keep the answer concise.
Question: {question}
Context: {context}
Answer:"""

prompt = PromptTemplate.from_template(template)

```

This code defines a template for a question-answering system using the `langchain.prompts` module. The template is a string that contains placeholders for a question and context. The system will use this context to answer the given question.

Here's a breakdown of the template string:

1. `You are an assistant for question-answering tasks.` - This line introduces the purpose
of the system.

2. `Use the following pieces of retrieved context to answer the question.` - This line
instructs the system to use the provided context to generate an answer.

3. `If you don't know the answer, just say that you don't know.` - This line sets an
expectation for the system to respond honestly when it can't answer a question.

4. `Use two sentences maximum and keep the answer concise.` - This line encourages the
system to provide brief and to-the-point answers.

5. `Question: {question}` - This placeholder will be replaced with the actual question at runtime.

6. `Context: {context}` - This placeholder will be replaced with the actual context at runtime.

7. `Answer:` - This line separates the context from the system-generated answer.
The `prompt` variable is created using the `PromptTemplate.from_template()` function,
which converts the template string into a `PromptTemplate` object. This object can then
be used to generate prompts for the question-answering system.

**ASR:**

```
import whisper
model_whisper = whisper.load_model("base")

def transcribe(audio):

    start_time = time.time()

    language = 'en'

    # load audio and pad/trim it to fit 30 seconds
    audio = whisper.load_audio(audio)
    audio = whisper.pad_or_trim(audio)

    # make log-Mel spectrogram and move to the same device as the model
    mel = whisper.log_mel_spectrogram(audio).to(model_whisper.device)

    # detect the spoken language
    _, probs = model_whisper.detect_language(mel)
    print(f"Detected language: {max(probs, key=probs.get)}")

    # decode the audio
    options = whisper.DecodingOptions()
    result = whisper.decode(model_whisper, mel, options)

    print("---ASR: %s seconds ---" % (time.time() - start_time))

    return result.text
#################################################

```

This code is using the `whisper` library for automatic speech recognition (ASR) to
transcribe audio files into text.

1. `import whisper` - Import the `whisper` library, which is a Python library for speech
recognition.

2. `model_whisper = whisper.load_model("base")` - Load the pre-trained base model


from the `whisper` library.

3. `def transcribe(audio):` - Define a function called `transcribe` that takes an audio file
as input.

4. `start_time = time.time()` - Record the start time for measuring the transcription time.
5. `language = 'en'` - Set the language for the transcription to English.

6. `audio = whisper.load_audio(audio)` - Load the audio file using


the `whisper.load_audio()` function.

7. `audio = whisper.pad_or_trim(audio)` - Pad or trim the audio to fit a length of 30


seconds.

8. `mel = whisper.log_mel_spectrogram(audio).to(model_whisper.device)` - Convert the


audio into a log-Mel spectrogram and move it to the same device as the model.

9. `_, probs = model_whisper.detect_language(mel)` - Detect the language spoken in the


audio using the `detect_language()` function.

10. `options = whisper.DecodingOptions()` - Create an instance of


the `DecodingOptions` class for decoding the audio.

11. `result = whisper.decode(model_whisper, mel, options)` - Decode the audio using


the `decode()` function, which returns a `DecodingResult` object.

12. `print("---ASR: %s seconds ---" % (time.time() - start_time))` - Calculate and print


the time taken for the transcription.

13. `return result.text` - Return the transcribed text from the `DecodingResult` object.

The `transcribe()` function can be called by passing an audio file path as an argument to
transcribe the audio into text.
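
For example, assuming a local recording at a hypothetical path such as `sample.wav`:

```
text = transcribe("sample.wav")  # hypothetical audio file path
print(text)
```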

**utilize the two GPUs**

```
device_ids = [0, 1] # Modify this list according to your GPU configuration
primary_device = f'cuda:{device_ids[1]}' # Primary device
torch.cuda.set_device(primary_device)

```

This code sets the primary GPU device for PyTorch to use for computations.

1. `device_ids = [0, 1]` - Define a list of GPU device IDs available for use. In this case,
both devices with IDs 0 and 1 are included. Modify this list according to your GPU
configuration.

2. `primary_device = f'cuda:{device_ids[1]}'` - Set the primary device to the second


GPU in the list (index 1). In this case, it is set to 'cuda:1' assuming that the GPU at index 1
is available.

3. `torch.cuda.set_device(primary_device)` - Set the primary device for PyTorch to use


for computations using the `torch.cuda.set_device()` function.

After running this code, PyTorch will use the specified GPU as the primary device for
computations. If you have multiple GPUs and want to utilize them for parallel processing,
you can modify the `device_ids` list and the `primary_device` assignment accordingly.

**initialize tokenizer:**

```
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

```

This code initializes a tokenizer for a pre-trained model and sets padding configurations.

1. `tokenizer = AutoTokenizer.from_pretrained(model_name,
trust_remote_code=True)` - Initialize a tokenizer for a pre-trained model using
the `AutoTokenizer.from_pretrained()` function.
The `trust_remote_code=True` argument allows the function to download and execute
pre-trained model scripts from a remote location if necessary.

2. `tokenizer.pad_token = tokenizer.eos_token` - Set the padding token for the tokenizer


to be the end-of-sentence token. This ensures that sequences are padded with the
appropriate token during tokenization.

3. `tokenizer.padding_side = "right"` - Set the padding side to the right. This means that
when sequences are padded, the padding tokens will be added to the right side of the
sequence.

After running this code, you will have a tokenizer object configured for a pre-trained
model with padding settings applied. This tokenizer can be used for tokenizing input
sequences and preparing them for input into a pre-trained model.

**quantization:**

```
config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16)

```

This code initializes a `BitsAndBytesConfig` object for configuring quantization settings


for a model.

1. `BitsAndBytesConfig(...)` - Create a `BitsAndBytesConfig` object to configure


quantization settings for a model.

2. `load_in_4bit=True` - Enable loading the model in 4-bit precision.

3. `bnb_4bit_use_double_quant=True` - Enable double quantization for 4-bit models.

4. `bnb_4bit_quant_type="nf4"` - Set the quantization type to "nf4" (4-bit NormalFloat, the data type introduced in the QLoRA paper).

5. `bnb_4bit_compute_dtype=torch.bfloat16` - Set the compute dtype


to `torch.bfloat16` for 4-bit models.

After running this code, you will have a `BitsAndBytesConfig` object with the specified
quantization settings. This object can be used for configuring a model to use 4-bit
quantization during training or inference.

**initialize LLM:**

```
model_name = 'mistralai/Mistral-7B-Instruct-v0.2'
model_config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             quantization_config=config,
                                             config=model_config,
                                             device_map='auto')

```

This code initializes a pre-trained language model with quantization settings.

1. `model_name='mistralai/Mistral-7B-Instruct-v0.2'` - Specify the name of the pre-


trained model.

2. `model_config = AutoConfig.from_pretrained(model_name,
trust_remote_code=True)` - Initialize a model configuration for the pre-trained model
using the `AutoConfig.from_pretrained()` function.

3. `model =
AutoModelForCausalLM.from_pretrained(model_name,quantization_config=config,confi
g=model_config,device_map='auto')` - Initialize the pre-trained model using
the `AutoModelForCausalLM.from_pretrained()` function.
The `quantization_config` argument is set to the `config` object created earlier, which
enables quantization for the model. The `config` argument is set to
the `model_config` object, which specifies the model's configuration.
The `device_map` argument is set to 'auto', which automatically maps the model's layers
to the available GPUs.

After running this code, you will have a pre-trained language model initialized with the
specified quantization settings. This model can be used for natural language processing
tasks such as text generation, question answering, and more.

**Use both GPUs together:**

```
# Move model to GPUs
model = torch.nn.DataParallel(model, device_ids=device_ids)

```

This code moves the model to the specified GPUs using


PyTorch's `DataParallel` module.

1. `model = torch.nn.DataParallel(model, device_ids=device_ids)` - Wrap the model with


PyTorch's `DataParallel` module. This module replicates the model across the specified
GPUs and handles data distribution and synchronization during training.

- `model` - The model to be parallelized.

- `device_ids` - A list of GPU IDs to use for parallel processing.

After running this code, the model will be parallelized across the specified GPUs,
allowing for efficient data distribution and computation during training.

**initialize pipeline:**

```
#streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True,
skip_special_tokens=True)

pipeline = pipeline(task='text-generation',
model=model.module,
tokenizer=tokenizer,
# temperature=0.1,
repetition_penalty=1.1,
return_full_text=True,
max_new_tokens=1500,
do_sample=False,
pad_token_id = tokenizer.eos_token_id,
eos_token_id = tokenizer.eos_token_id,
streamer = streamer
)

llm = HuggingFacePipeline(pipeline=pipeline)

```

This code creates a Hugging Face pipeline for text generation using a pre-trained language
model.

1. `streamer = TextIteratorStreamer(tokenizer, skip_prompt=True,


skip_special_tokens=True)` - Initialize a `TextIteratorStreamer` object for streaming
tokenized input and output.

- `tokenizer` - The tokenizer associated with the pre-trained model.

- `skip_prompt=True` - Skip the prompt token during streaming.

- `skip_special_tokens=True` - Skip special tokens during streaming.

2. `pipeline = pipeline(task='text-generation', ...)` - Initialize a Hugging Face pipeline for


text generation.

- `task='text-generation'` - Specify the task for the pipeline.

- `model=model.module` - The pre-trained model to use for text generation.

- `tokenizer=tokenizer` - The tokenizer associated with the pre-trained model.

- `repetition_penalty=1.1` - Apply a repetition penalty to discourage repeating the


same phrases.

- `return_full_text=True` - Return the full text instead of individual tokens.

- `max_new_tokens=1500` - Set the maximum number of new tokens to generate.

- `do_sample=False` - Disable sampling and use greedy decoding.

- `pad_token_id = tokenizer.eos_token_id` - Set the padding token ID to the end-of-


sentence token ID.

- `eos_token_id = tokenizer.eos_token_id` - Set the end-of-sentence token ID.

- `streamer = streamer` - Set the `TextIteratorStreamer` object for streaming


tokenized input and output.

3. `llm = HuggingFacePipeline(pipeline=pipeline)` - Wrap the pipeline in


a `HuggingFacePipeline` object for easier use.

After running this code, you will have a Hugging Face pipeline for text generation using a
pre-trained language model. You can use the `llm` object to generate text based on input
prompts.
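
As a quick sanity check you can prompt the wrapped model directly; a minimal sketch (depending on your LangChain version the call is `llm.invoke(prompt)` or simply `llm(prompt)`):

```
print(llm.invoke("Explain retrieval-augmented generation in one sentence."))
```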

**loading RAG data:**

```
text_loader_kwargs = {'autodetect_encoding': True}
loader_txt = DirectoryLoader("txt/", glob="./*.txt", loader_cls=TextLoader,
                             loader_kwargs=text_loader_kwargs)
documents_txt = loader_txt.load()

if 1 == 0:
    # load pdfs
    loader_pdfs = PyPDFDirectoryLoader('pdfs/')
    documents_pdfs = loader_pdfs.load()
    #print(documents)

if 1 == 0:
    urls = [
        "https://url1/",
        "https://url2/",
        "https://url3/"
    ]

    loader_urls = UnstructuredURLLoader(urls=urls)
    documents_htmls = loader_urls.load()

```

This code loads text documents from different sources, such as text files, PDFs, and web
pages.

1. `text_loader_kwargs={'autodetect_encoding': True}` - Set


the `autodetect_encoding` option to `True` for the text loader.

2. `loader_txt = DirectoryLoader("txt/", glob="./*.txt", loader_cls=TextLoader,


loader_kwargs=text_loader_kwargs)` - Initialize a `DirectoryLoader` object for loading
text files from the "txt" directory.

- `DirectoryLoader` - A loader for loading documents from a directory.

- `"txt/"` - The directory path for the text files.


- `glob="./*.txt"` - The glob pattern for matching text files.

- `loader_cls=TextLoader` - The loader class for loading text files.

- `loader_kwargs=text_loader_kwargs` - The loader arguments for the text loader.

3. `documents_txt = loader_txt.load()` - Load the text documents from the specified


directory.

4. The disabled code block guarded by `if 1==0:` loads PDFs from the "pdfs" directory using `PyPDFDirectoryLoader`.

5. The second disabled code block guarded by `if 1==0:` loads web pages from a list of URLs using `UnstructuredURLLoader`.

After running this code, you will have the text documents loaded into memory as a list
of `Document` objects. You can then use these documents for further processing, such as
text classification, information extraction, or other natural language processing tasks.

**initialize embeddings:**

```
#################################################
##### Embeddings Model setup
##### Vectorization

text_splitter = CharacterTextSplitter(chunk_size=50, chunk_overlap=5)

all_splits = text_splitter.split_documents(documents_txt)

# specify embedding model (using huggingface sentence transformer)


embedding_model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {"device": "cuda"}
embeddings = HuggingFaceEmbeddings(model_name=embedding_model_name,
model_kwargs=model_kwargs)

```

This code sets up the embedding model for vectorization of text documents.

1. `text_splitter = CharacterTextSplitter(chunk_size=50, chunk_overlap=5)` - Initialize


a `CharacterTextSplitter` object for splitting documents into smaller chunks.

2. `all_splits = text_splitter.split_documents(documents_txt)` - Split


the `documents_txt` list of `Document` objects into smaller chunks.

3. `embedding_model_name = "sentence-transformers/all-mpnet-base-v2"` - Specify the


embedding model name using Hugging Face Sentence Transformer.
4. `model_kwargs = {"device": "cuda"}` - Set up the model arguments for the embedding
model.

5. `embeddings = HuggingFaceEmbeddings(model_name=embedding_model_name,
model_kwargs=model_kwargs)` - Initialize the `HuggingFaceEmbeddings` object for
generating embeddings using the specified embedding model.

After running this code, you will have a `HuggingFaceEmbeddings` object that can be
used for generating embeddings for the text chunks. These embeddings can then be used
for various natural language processing tasks such as clustering, classification, or
similarity search.
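
For instance, a single query string can be embedded directly through the standard LangChain embeddings interface; a minimal sketch:

```
vector = embeddings.embed_query("What topics do the documents cover?")
print(len(vector))  # all-mpnet-base-v2 produces 768-dimensional vectors
```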

**initialize vectorstore:**

```
# document chunks and embeddings
vectordb = FAISS.from_documents(all_splits, embeddings)

retriever = vectordb.as_retriever()

```

This code creates a vector database using the FAISS library and a retriever for the
document chunks and their corresponding embeddings.

1. `vectordb = FAISS.from_documents(all_splits, embeddings)` - Create a vector


database using the FAISS library with the `all_splits` list of text chunks and their
corresponding embeddings generated by the `embeddings` object.

2. `retriever = vectordb.as_retriever()` - Create a retriever object from the vector


database for efficient similarity search.

After running this code, you will have a vector database and a retriever object that can be
used for efficient similarity search and retrieval of document chunks based on their
embeddings.
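
A quick sketch of querying the retriever on its own (older LangChain versions expose `get_relevant_documents`, newer ones also accept `retriever.invoke(...)`):

```
docs = retriever.get_relevant_documents("What topics do the documents cover?")
for d in docs:
    print(d.page_content[:100])  # preview the top matching chunks
```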

**initialize chain**

```
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever,
chain_type_kwargs={"prompt": prompt})

```

This code creates a RetrievalQA chain from the specified language model (`llm`), retriever (`retriever`), and prompt.
1. `RetrievalQA.from_chain_type(llm=llm, retriever=retriever, chain_type_kwargs={"prompt": prompt})` - Build the chain, wiring the language model and retriever together and passing the custom prompt through `chain_type_kwargs`.

- `llm` - The language model to be used for generating answers.

- `retriever` - The retriever object for efficient similarity search and retrieval of
document chunks based on their embeddings.

- `chain_type_kwargs` - A dictionary of keyword arguments for the RetrievalQA


chain.

- `"prompt"` - A prompt for the RetrievalQA chain.

After running this code, you will have a RetrievalQA chain that can be used for question
answering tasks by combining the language model's ability to generate answers and the
retriever's ability to efficiently search and retrieve relevant document chunks based on
their embeddings.

## RetrievalQA Chain

We will first see how to do question answering after multiple relevant splits have been
retrieved from the vector store. We may also need to compress the relevant splits to fit
into the LLM context. Finally, we send these splits along with a system prompt and
human question to the language model to get the answer.

By default, we pass all the chunks into the same context window, into the same call of the
language model. But, we can also use other methods in case the number of documents is
high and if we can't pass them all in the same context window. MapReduce, Refine, and
MapRerank are three methods that can be used if the number of documents is high. Now,
we will look into these methods in detail.
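
Switching between these strategies is done through the `chain_type` argument of `RetrievalQA.from_chain_type`; a minimal sketch, leaving the rest of the setup unchanged (the custom prompt above is written for the "stuff" chain and is therefore omitted here):

```
qa_chain_mr = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="map_reduce",   # alternatives: "refine", "map_rerank"
    retriever=retriever,
)
result = qa_chain_mr({"query": "Summarize the main topics of the documents."})
print(result["result"])
```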

**handle conversation**

```
# create conversation using rag in memory
def create_conversation(query: str, chat_history: list) -> tuple:
    try:
        start_time = time.time()
        result = qa_chain(query)
        chat_history.append((query, result["result"]))

        # return '', chat_history, text_to_speech(result['answer'])
        return '', chat_history, text_to_speech(result['result'])

    except Exception as e:
        chat_history.append((query, e))
        return '', chat_history, ''
```

This code defines a function `create_conversation` that takes a user query and a chat
history as input and returns a tuple containing an empty string, the updated chat history,
and a text-to-speech converted response.

1. `def create_conversation(query: str, chat_history: list) -> tuple:` - Define a


function `create_conversation` that takes a user query (`query`) and a chat history
(`chat_history`) as input and returns a tuple.

2. `try:` - Begin a try block for error handling.

3. `start_time = time.time()` - Record the start time for calculating the response time.

4. `result = qa_chain(query)` - Run the RetrievalQA chain on the user query to retrieve relevant chunks and generate an answer.

5. `chat_history.append((query, result["result"]))` - Append the user query and the corresponding response to the chat history.

6. `return '', chat_history, text_to_speech(result['result'])` - Return an empty string, the updated chat history, and a text-to-speech converted response.

7. `except Exception as e:` - Catch any exceptions that occur during the execution of the function.

8. `chat_history.append((query, e))` - Append the user query and the corresponding error message to the chat history.

9. `return '', chat_history, ''` - Return an empty string, the updated chat history, and an empty string.

The function `create_conversation` is designed to handle user queries and update the chat
history with the corresponding responses or error messages. The text-to-speech converted
response is also returned along with the chat history.
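
The `text_to_speech` helper used above is not shown in this section; a minimal sketch of what it could look like, assuming gTTS (already imported earlier) and an HTML `<audio>` tag that the Gradio `html` component can render:

```
def text_to_speech(text: str) -> str:
    """Convert the answer text to speech and return an autoplaying HTML audio tag."""
    mp3_buffer = BytesIO()
    gTTS(text=text, lang="en").write_to_fp(mp3_buffer)  # synthesize speech into memory
    mp3_buffer.seek(0)
    b64 = base64.b64encode(mp3_buffer.read()).decode()
    return f'<audio controls autoplay src="data:audio/mp3;base64,{b64}"></audio>'
```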

## RetrievalQA chain with Prompt

Let’s try to understand a little bit better what’s going on underneath the hood. First, we
define the prompt template. The prompt template has instructions about how to use the
context. It also has a placeholder for a context variable. We will use prompts to get
answers to a question. Here, the prompt takes in the documents and the question and
passes it to a language model.

**Using Gradio to speed up chatbot building**

```
def bot(history):
    print("Question: ", history[-1][0])
    llm_chain.run(question=history[-1][0])
    history[-1][1] = ""
    for character in llm.streamer:
        print(character)
        history[-1][1] += character
        yield history

# build gradio ui
with gr.Blocks() as bot_interface:

    with gr.Row():
        chatbot = gr.Chatbot()
    with gr.Row():
        with gr.Column():
            html = gr.HTML()
    with gr.Row():
        with gr.Column():
            msg = gr.Textbox()
    with gr.Row():
        with gr.Column():
            audio_input = gr.Audio(type="filepath")
            user_input = gr.Textbox()
            gr.Interface(
                fn=transcribe,
                inputs=[audio_input],
                outputs=[user_input],
                live=True)

    # TEXT INPUT
    msg.submit(create_conversation, [msg, chatbot], [msg, chatbot, html])

```

This code defines a function `bot` that takes a chat history as input and generates a
response using a language model and updates the chat history. It also builds a Gradio user
interface for the chatbot.

1. `def bot(history):` - Define a function `bot` that takes a chat history as input.

2. `print("Question: ", history[-1][0])` - Print the user's question.

3. `llm_chain.run(question=history[-1][0])` - Run the language model with the user's


question.

4. `history[-1][1] = ""` - Clear the previous response.


5. `for character in llm.streamer:` - Iterate over the characters in the language model's
response.

6. `history[-1][1] += character` - Append each character to the response.

7. `yield history` - Yield the updated chat history.

8. `with gr.Blocks() as bot_interface:` - Define a Gradio user interface for the chatbot.

9. `with gr.Row():` - Define a row in the user interface.

10. `chatbot = gr.Chatbot()` - Define a chatbot component.

11. `with gr.Row():` - Define a row in the user interface.

12. `with gr.Column():` - Define a column in the user interface.

13. `html = gr.HTML()` - Define an HTML component.

14. `with gr.Row():` - Define a row in the user interface.

15. `with gr.Column():` - Define a column in the user interface.

16. `msg = gr.Textbox()` - Define a textbox for user input.

17. `audio_input=gr.Audio(type="filepath")` - Define an audio input component.

18. `user_input = gr.Textbox()` - Define a textbox for the transcription of the user's audio
input.

19. `gr.Interface(fn=transcribe, inputs=[audio_input], outputs=[user_input], live=True)` -


Define an interface for transcribing the user's audio input.

20. `msg.submit(create_conversation, [msg, chatbot], [msg, chatbot, html])` - Define a


submit button for the textbox that triggers the `create_conversation` function.

The `bot` function generates a response using a language model and updates the chat
history. The Gradio user interface allows the user to interact with the chatbot through text
or audio input. The `create_conversation` function is called when the user submits a text
input, updating the chat history with the user's question and the language model's
response. The HTML component can be used to display additional information, such as
the response time or the confidence score of the language model's response.
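
Finally, the interface is served with Gradio's `launch` method (pass `share=True` if a temporary public URL is needed):

```
bot_interface.launch(server_name="0.0.0.0", server_port=7860)
```
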
## limitations:

- This code uses the RetrievalQA chain, which is not the best option for dialogue and
conversation; we used this chain due to the server's limited resources. (RetrievalQA is
faster than other chains).

- As per the results, we will not be able to use the server in production.

## RetrievalQA limitations

One of the biggest disadvantages of RetrievalQA chain is that the QA chain fails to
preserve conversational history. This can be checked as follows:

```
# Create a QA Chain
qa_chain = RetrievalQA.from_chain_type(
llm,
retriever=vectordb.as_retriever()
)

```

We will now ask a question to the chain.

```
question = "Is probability a class topic?"
result = qa_chain({"query": question})
result["result"]

```

Now, we will ask a second question to the chain.

```
question = "why are those prerequesites needed?"
result = qa_chain({"query": question})
result["result"]

```

We were able to get a reply from the chain which was not related to the previous answer. Basically, the RetrievalQA chain doesn’t have any concept of state: it doesn’t remember what the previous questions or answers were. In order for the chain to remember the previous question or answer, we need to introduce the concept of memory. This ability to remember previous questions and answers is required for chatbots, where users ask follow-up questions or ask for clarification about previous answers.
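
One way to introduce such memory, using classes already imported above, is LangChain's `ConversationalRetrievalChain` combined with `ConversationBufferMemory`; a minimal sketch:

```
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

conv_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=vectordb.as_retriever(),
    memory=memory,
)

print(conv_chain({"question": "Is probability a class topic?"})["answer"])
print(conv_chain({"question": "Why are those prerequisites needed?"})["answer"])
```
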
implementation-2-A100
- We use 1XA100 GPUs @GCP with the Mistral 7B model in this experiment.

# Code Implementation

I will list only the differences here:

- Handling of **multiple GPUs** has been **removed**, as we are using a single GPU.

- **Quantization** has been **removed**, as the full model now fits in GPU memory.

implementation-3-groq
- In this implementation, we use remote calls for ASR, TTS, and the LLM.

- We are utilizing Deepgram for ASR and TTS, and Groq for inference.

- We need to create an API key for Groq here: https://console.groq.com/keys

- We need to create an API key for Deepgram as per the doc: https://developers.deepgram.com/docs/create-additional-api-keys

# Code Implementation

**Import required dependencies**

```
import asyncio
from dotenv import load_dotenv
import shutil
import subprocess
import requests
import time
import os
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import DirectoryLoader
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.prompts import PromptTemplate
from langchain.chains import ConversationalRetrievalChain, ConversationChain
from langchain.chains.qa_with_sources import load_qa_with_sources_chain
from langchain.chains import create_history_aware_retriever
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.messages import HumanMessage
from langchain_core.prompts import ChatPromptTemplate
from langchain_groq import ChatGroq
from langchain_openai import ChatOpenAI
from langchain.memory import ConversationBufferMemory, VectorStoreRetrieverMemory
from langchain.prompts import (
ChatPromptTemplate,
MessagesPlaceholder,
SystemMessagePromptTemplate,
HumanMessagePromptTemplate,
)
from langchain.chains import LLMChain
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
from langchain.llms import HuggingFacePipeline
from langchain_core.chat_history import BaseChatMessageHistory
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory
import sys
from deepgram import (
DeepgramClient,
DeepgramClientOptions,
LiveTranscriptionEvents,
LiveOptions,
Microphone,
)

```

**Import the ChatGroq class and initialize it with a model:**

```
self.llm = ChatGroq(temperature=0, model_name="mixtral-8x7b-32768",
groq_api_key=os.getenv("GROQ_API_KEY"))

```

This code snippet initializes a LangChain `ChatGroq` chat model that talks to the Groq API. It retrieves the API key from an environment variable named GROQ_API_KEY and passes it via the `groq_api_key` argument, enabling API calls to the Large Language Models hosted on Groq servers.

**Load docs for RAG the same way as in the previous implementations**

**Use the same embedding model (sentence-transformers/all-mpnet-base-v2)**

**Use the same vector DB (FAISS)**


# Add chat history

In many Q&A applications we want to allow the user to have a back-and-forth conversation,
meaning the application needs some sort of “memory” of past questions and answers, and
some logic for incorporating those into its current thinking.

In this guide we focus on **adding logic for incorporating historical messages.** Further
details on chat history management is [covered
here](https://python.langchain.com/docs/expression_language/how_to/message_history/).

We’ll work off of the Q&A app we built over the [LLM Powered Autonomous
Agents](https://lilianweng.github.io/posts/2023-06-23-agent/ ) blog post by Lilian Weng in
the
[Quickstart](https://python.langchain.com/docs/use_cases/question_answering/quickstart/ ).
We’ll need to update two things about our existing app:

1. **Prompt**: Update our prompt to support historical messages as an input.

2. **Contextualizing questions**: Add a sub-chain that takes the latest user question and
reformulates it in the context of the chat history. This is needed in case the latest question
references some context from past messages. For example, if a user asks a follow-up
question like “Can you elaborate on the second point?”, this cannot be understood without
the context of the previous message. Therefore we can’t effectively perform retrieval with a
question like this.

## [Contextualizing the question](https://python.langchain.com/docs/use_cases/question_answering/chat_history/#contextualizing-the-question)

First we’ll need to define a sub-chain that takes historical messages and the latest user
question, and reformulates the question if it makes reference to any information in the
historical information.

We’ll use a prompt that includes a `MessagesPlaceholder` variable under the name
“chat_history”. This allows us to pass in a list of Messages to the prompt using the
“chat_history” input key, and these messages will be inserted after the system message and
before the human message containing the latest question.

Note that we leverage a helper function


[create_history_aware_retriever](https://api.python.langchain.com/en/latest/chains/langchain
.chains.history_aware_retriever.create_history_aware_retriever.html ) for this step, which
manages the case where `chat_history` is empty, and otherwise applies `prompt | llm |
StrOutputParser() | retriever` in sequence.

`create_history_aware_retriever` constructs a chain that accepts keys `input` and


`chat_history` as input, and has the same output schema as a retriever.

```
contextualize_q_system_prompt = """Given a chat history and the latest user question \
which might reference context in the chat history, formulate a standalone question \
which can be understood without the chat history. Do NOT answer the question, \
just reformulate it if needed and otherwise return it as is."""
contextualize_q_prompt = ChatPromptTemplate.from_messages(
[
("system", contextualize_q_system_prompt),
MessagesPlaceholder("chat_history"),
("human", "{input}"),
]
)
history_aware_retriever = create_history_aware_retriever(
llm, retriever, contextualize_q_prompt
)

```

This chain prepends a rephrasing of the input query to our retriever, so that the retrieval
incorporates the context of the conversation.

## [Chain with chat history](https://python.langchain.com/docs/use_cases/question_answering/chat_history/#chain-with-chat-history)

And now we can build our full QA chain.

Here we use
[create_stuff_documents_chain](https://api.python.langchain.com/en/latest/chains/langchain.
chains.combine_documents.stuff.create_stuff_documents_chain.html ) to generate a
`question_answer_chain`, with input keys `context`, `chat_history`, and `input`– it accepts
the retrieved context alongside the conversation history and query to generate an answer.
We build our final `rag_chain` with
[create_retrieval_chain](https://api.python.langchain.com/en/latest/chains/langchain.chains.r
etrieval.create_retrieval_chain.html ). This chain applies the `history_aware_retriever` and
`question_answer_chain` in sequence, retaining intermediate outputs such as the retrieved
context for convenience. It has input keys `input` and `chat_history`, and includes `input`,
`chat_history`, `context`, and `answer` in its output.

```
qa_system_prompt = """You are an assistant for question-answering tasks. \
Use the following pieces of retrieved context to answer the question. \
If you don't know the answer, just say that you don't know. \
Use three sentences maximum and keep the answer concise.\

{context}"""
qa_prompt = ChatPromptTemplate.from_messages(
[
("system", qa_system_prompt),
MessagesPlaceholder("chat_history"),
("human", "{input}"),
]
)

question_answer_chain = create_stuff_documents_chain(llm, qa_prompt)

rag_chain = create_retrieval_chain(history_aware_retriever, question_answer_chain)

```

## Code Flow:

Here we’ve gone over how to add application logic for incorporating historical outputs, but
we’re still manually updating the chat history and inserting it into each input. In a real Q&A
application we’ll want some way of persisting chat history and some way of automatically
inserting and updating it.

For this we can use:

-
[BaseChatMessageHistory](https://python.langchain.com/docs/modules/memory/chat_messa
ges/ ): Store chat history.

-
[RunnableWithMessageHistory](https://python.langchain.com/docs/expression_language/ho
w_to/message_history/ ): Wrapper for an LCEL chain and a `BaseChatMessageHistory`
that handles injecting chat history into inputs and updating it after each invocation.

For a detailed walkthrough of how to use these classes together to create a stateful
conversational chain, head to the [How to add message history
(memory)](https://python.langchain.com/docs/expression_language/how_to/message_histor
y/ ) LCEL page.

Below, we implement a simple example of the second option, in which chat histories are
stored in a simple dict.

Full Code:

```
# Note: this snippet is taken from inside a class method, hence the `self.` references.

### Contextualize question ###
contextualize_q_system_prompt = """Given a chat history and the latest user question \
which might reference context in the chat history, formulate a standalone question \
which can be understood without the chat history. Do NOT answer the question, \
just reformulate it if needed and otherwise return it as is."""
contextualize_q_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", contextualize_q_system_prompt),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}"),
    ]
)

history_aware_retriever = create_history_aware_retriever(
    self.llm, self.retriever, contextualize_q_prompt
)

### Answer question ###
qa_system_prompt = """You are an assistant for question-answering tasks. \
Use the following pieces of retrieved context to answer the question. \
If you don't know the answer, just say that you don't know. \
Use three sentences maximum and keep the answer concise.\

{context}"""
qa_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", qa_system_prompt),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}"),
    ]
)
question_answer_chain = create_stuff_documents_chain(self.llm, qa_prompt)

self.rag_chain = create_retrieval_chain(history_aware_retriever, question_answer_chain)

def get_session_history(session_id: str) -> BaseChatMessageHistory:
    if session_id not in self.store:
        self.store[session_id] = ChatMessageHistory()
    return self.store[session_id]

self.conversational_rag_chain = RunnableWithMessageHistory(
    self.rag_chain,
    get_session_history,
    input_messages_key="input",
    history_messages_key="chat_history",
    output_messages_key="answer",
)

start_time = time.time()

# Go get the response from the LLM
response = self.conversational_rag_chain.invoke(
    {"input": text},
    config={"configurable": {"session_id": "abc123"}},
)
end_time = time.time()
elapsed_time = int((end_time - start_time) * 1000)
print(f"LLM ({elapsed_time}ms): {response['answer']}")
self.chat_history.extend([HumanMessage(content=text), response["answer"]])
return response["answer"]

```

## ASR with Deepgram:

```
class TranscriptCollector:
    def __init__(self):
        self.reset()

    def reset(self):
        self.transcript_parts = []

    def add_part(self, part):
        self.transcript_parts.append(part)

    def get_full_transcript(self):
        return ' '.join(self.transcript_parts)

transcript_collector = TranscriptCollector()
```

`TranscriptCollector` is a class that is used to collect and manage parts of a transcript. A transcript is typically a written or printed record of what was said during a conversation, meeting, or interview.

Here's a breakdown of the class:

- `__init__`: This is a special method that is automatically called when an object of the
class is created. In this case, it calls the `reset` method, which initializes an empty list
`transcript_parts`.

- `reset`: This method resets the `transcript_parts` list to an empty list.

- `add_part`: This method adds a new part to the `transcript_parts` list.

- `get_full_transcript`: This method returns the full transcript by joining all the parts in the
`transcript_parts` list with a space.

The last line creates an instance of the `TranscriptCollector` class and assigns it to the
variable `transcript_collector`.
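A quick usage sketch (not part of the original code) showing how the collector accumulates partial results and is reset between sentences:

```
collector = TranscriptCollector()
collector.add_part("Hello,")            # interim result from the ASR stream
collector.add_part("how are you?")      # final part of the sentence

print(collector.get_full_transcript())  # -> "Hello, how are you?"

collector.reset()                       # start fresh for the next sentence
print(collector.get_full_transcript())  # -> ""
```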

```
import asyncio

# deepgram-sdk v3.x imports
from deepgram import (
    DeepgramClient,
    DeepgramClientOptions,
    LiveTranscriptionEvents,
    LiveOptions,
    Microphone,
)

async def get_transcript(callback):
    transcription_complete = asyncio.Event()  # Event to signal transcription completion

    try:
        # Example of setting up a client config. Logging values: WARNING, VERBOSE, DEBUG, SPAM
        config = DeepgramClientOptions(options={"keepalive": "true"})
        deepgram: DeepgramClient = DeepgramClient("", config)

        dg_connection = deepgram.listen.asynclive.v("1")
        print("Listening...")

        async def on_message(self, result, **kwargs):
            sentence = result.channel.alternatives[0].transcript

            if not result.speech_final:
                transcript_collector.add_part(sentence)
            else:
                # This is the final part of the current sentence
                transcript_collector.add_part(sentence)
                full_sentence = transcript_collector.get_full_transcript()
                # Check if the full_sentence is not empty before printing
                if len(full_sentence.strip()) > 0:
                    full_sentence = full_sentence.strip()
                    print(f"Human: {full_sentence}")
                    callback(full_sentence)  # Call the callback with the full_sentence
                    transcript_collector.reset()
                    transcription_complete.set()  # Signal to stop transcription and exit

        dg_connection.on(LiveTranscriptionEvents.Transcript, on_message)

        options = LiveOptions(
            model="nova-2",
            punctuate=True,
            language="en-US",
            encoding="linear16",
            channels=1,
            sample_rate=16000,
            endpointing=300,
            smart_format=True,
        )

        await dg_connection.start(options)

        # Open a microphone stream on the default input device
        microphone = Microphone(dg_connection.send)
        microphone.start()

        # Wait for the transcription to complete instead of looping indefinitely
        await transcription_complete.wait()

        # Wait for the microphone to close
        microphone.finish()

        # Indicate that we've finished
        await dg_connection.finish()

    except Exception as e:
        print(f"Could not open socket: {e}")
        return
```

This code is an implementation of a speech-to-text system using the Deepgram API. Here's a
breakdown of what the code does:

1. Creates an instance of `TranscriptCollector`.

2. The `get_transcript` function is an asynchronous function that starts a transcription process. It uses the Deepgram API to listen to the user's microphone and transcribe the audio in real-time.

3. The function sets up a Deepgram client with a configuration that keeps the connection alive.

4. It then starts a live transcription session with the `listen.asynclive.v("1")` method. This method returns a `dg_connection` object that is used to send and receive data.

5. The function defines an `on_message` function that is called whenever a new message is
received from the Deepgram API. This function is responsible for processing the
transcription data.

6. The `on_message` function checks if the received message is the final part of a sentence.
If it is, it adds the sentence to the `TranscriptCollector` and resets it. It then prints the full
transcript and calls the `callback` function with the full transcript.

7. The function then starts the transcription process by calling `dg_connection.start(options)`. This method starts the transcription with the specified options.

8. It then opens a microphone stream and starts it.

9. The function waits for the transcription to complete by calling `transcription_complete.wait()`. This call blocks until the transcription is complete.

10. After the transcription is complete, the function waits for the microphone to close and
then finishes the transcription process by calling `dg_connection.finish()`.

11. If any exceptions occur during the transcription process, the function catches them and prints an error message.

The `callback` function is not defined in this code snippet, but it is likely a function that is called when the transcription is complete. It is passed the full transcript as an argument.
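For illustration only, here is a minimal sketch (not from the original code) of what such a callback could look like and how `get_transcript` might be driven on its own; the `ConversationManager` shown later wires this up properly:

```
# Hypothetical standalone driver for get_transcript
def print_sentence(full_sentence):
    # Receives the finished sentence emitted by on_message
    print(f"Callback received: {full_sentence}")

if __name__ == "__main__":
    import asyncio
    asyncio.run(get_transcript(print_sentence))
```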

## TTS with Deepgram:

```
import os
import shutil
import subprocess
import time

import requests

class TextToSpeech:
    # Set your Deepgram API Key and desired voice model
    DG_API_KEY = os.getenv("DEEPGRAM_API_KEY")
    MODEL_NAME = "aura-helios-en"  # Example model name, change as needed

    @staticmethod
    def is_installed(lib_name: str) -> bool:
        lib = shutil.which(lib_name)
        return lib is not None

    def speak(self, text):
        if not self.is_installed("ffplay"):
            raise ValueError("ffplay not found, necessary to stream audio.")

        DEEPGRAM_URL = (
            f"https://api.deepgram.com/v1/speak"
            f"?model={self.MODEL_NAME}&performance=some&encoding=linear16&sample_rate=24000"
        )
        headers = {
            "Authorization": f"Token {self.DG_API_KEY}",
            "Content-Type": "application/json"
        }
        payload = {
            "text": text
        }

        # Stream the returned audio straight into ffplay
        player_command = ["ffplay", "-autoexit", "-", "-nodisp"]
        player_process = subprocess.Popen(
            player_command,
            stdin=subprocess.PIPE,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )

        start_time = time.time()  # Record the time before sending the request
        first_byte_time = None    # Time when the first audio byte is received

        with requests.post(DEEPGRAM_URL, stream=True, headers=headers, json=payload) as r:
            for chunk in r.iter_content(chunk_size=1024):
                if chunk:
                    if first_byte_time is None:  # First chunk received
                        first_byte_time = time.time()
                        ttfb = int((first_byte_time - start_time) * 1000)  # Time to first byte
                        print(f"TTS Time to First Byte (TTFB): {ttfb}ms\n")
                    player_process.stdin.write(chunk)
                    player_process.stdin.flush()

        if player_process.stdin:
            player_process.stdin.close()
        player_process.wait()
```

`TextToSpeech` is a class that uses the Deepgram API to convert text to speech. Here's a
breakdown of the code:

**Class variables**

- `DG_API_KEY`: The Deepgram API key, set using the `os.getenv` function to retrieve
an environment variable named `DEEPGRAM_API_KEY`.

- `MODEL_NAME`: The name of the Deepgram model to use for text-to-speech conversion, set to `"aura-helios-en"` (English).

**`is_installed` method**

- This method checks if a given library (e.g., `ffplay`) is installed on the system.

- It uses the `shutil.which` function to search for the executable in the system's PATH.

- If the executable is found, the method returns `True`, otherwise it returns `False`.

**`speak` method**

- This method takes a `text` parameter and converts it to speech using the Deepgram API.

- It checks if `ffplay` is installed using the `is_installed` method. If not, it raises a `ValueError`.

- It sets the Deepgram API URL, headers, and payload for the request.

- It uses the `requests` library to send a POST request to the Deepgram API with the text
to be converted to speech.

- It uses the `ffplay` command-line tool to play the audio stream.

- It records the time before sending the request and calculates the time to first byte (TTFB) by measuring the time between sending the request and receiving the first byte of the audio stream.

- It writes the audio stream to the `ffplay` process's stdin and flushes the buffer.

- Finally, it closes the stdin stream and waits for the `ffplay` process to finish.

**Notes**

- The `speak` method assumes that `ffplay` is installed and available on the system.

- The `MODEL_NAME` variable can be changed to use a different Deepgram model.

- The `DG_API_KEY` variable should be set to a valid Deepgram API key.

- The `speak` method returns no value, but it prints the TTFB time to the console.

Overall, this code provides a simple way to convert text to speech using the Deepgram API
and play the audio stream using `ffplay`.
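A short usage sketch (assuming the `DEEPGRAM_API_KEY` environment variable is set and `ffplay` is installed):

```
tts = TextToSpeech()
tts.speak("Hello! This sentence is synthesized by Deepgram and streamed to ffplay.")
```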

## Manage the Conversation:

```
class ConversationManager:
    def __init__(self):
        self.transcription_response = ""
        self.llm = LanguageModelProcessor()

    async def main(self):
        def handle_full_sentence(full_sentence):
            self.transcription_response = full_sentence

        # Loop indefinitely until "goodbye" is detected
        while True:
            await get_transcript(handle_full_sentence)

            # Check for "goodbye" to exit the loop
            if "goodbye" in self.transcription_response.lower():
                break

            llm_response = self.llm.process(self.transcription_response)

            tts = TextToSpeech()
            tts.speak(llm_response)

            # Reset transcription_response for the next loop iteration
            self.transcription_response = ""

if __name__ == "__main__":
    manager = ConversationManager()
    asyncio.run(manager.main())
```

1. The `ConversationManager` class is initialized with an empty string `transcription_response` and an instance of `LanguageModelProcessor` (LLM) for language processing.

2. The `main` method is an asynchronous function that runs indefinitely until the user says
"goodbye".

3. Inside the `main` method, it calls the `get_transcript` function with a callback function
`handle_full_sentence`. This function is called whenever a full sentence is transcribed.

4. The `handle_full_sentence` function updates the `transcription_response` with the full sentence.

5. The code then checks if the `transcription_response` contains the word "goodbye" (case-insensitive). If it does, the loop breaks and the program exits.

6. If the `transcription_response` does not contain "goodbye", the code processes the
`transcription_response` using the `LanguageModelProcessor` (LLM) to generate a
response.

7. The LLM response is then converted to speech using a `TextToSpeech` object and
spoken to the user.

8. Finally, the `transcription_response` is reset to an empty string for the next iteration of the loop.

# Implementation 4: Llama3 on A4000

- In this experiment we use 2× A4000 GPUs (low memory) with the Llama 3 model.

# Code Implementation

- You first have to be granted access to the gated model (request it and wait for the approval email) here: https://huggingface.co/meta-llama/Meta-Llama-3-8B

- Then log in with a Hugging Face access token (https://huggingface.co/settings/tokens) using `huggingface-cli login` (see https://huggingface.co/welcome ); a Python alternative is sketched below.
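If you prefer to authenticate from Python instead of the CLI, a minimal sketch using the `huggingface_hub` package (the token value below is a placeholder):

```
from huggingface_hub import login

# Paste the access token generated at https://huggingface.co/settings/tokens
login(token="hf_xxx_your_token_here")
```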

- Use the model

```
model_name = "meta-llama/Meta-Llama-3-8B-Instruct"

```

- Set up quantization (to be able to run the model with limited GPU memory)

```
import torch
import transformers
from transformers import (
    AutoConfig,
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)

device_ids = [0, 1]  # Modify this list according to your GPU configuration
primary_device = f'cuda:{device_ids[0]}'  # Primary device
torch.cuda.set_device(primary_device)

# 4-bit quantization so the 8B model fits on low-memory GPUs
config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=config,
    config=model_config,
    device_map='auto',
)

model = torch.nn.DataParallel(model, device_ids=device_ids)

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

pipeline = transformers.pipeline(
    "text-generation",
    tokenizer=tokenizer,
    model=model.module,
    model_kwargs={
        "torch_dtype": torch.float16,
        "quantization_config": {"load_in_4bit": True},
    },
)
```
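As a quick sanity check (not in the original code), you can confirm that 4-bit loading actually shrank the model; `get_memory_footprint()` is a standard `transformers` helper on the underlying model:

```
# Rough check that quantization worked: the 8B model should report a few GB, not ~16 GB
footprint_gb = model.module.get_memory_footprint() / (1024 ** 3)
print(f"Quantized model memory footprint: {footprint_gb:.2f} GB")
```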

- This model uses different terminator tokens

```
terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

```

- Set up the prompt (Llama 3 uses a chat template, so the prompt is built differently here)

```
class Llama3_8B_gen:
    def __init__(self, pipeline):
        self.pipeline = pipeline

    def generate_prompt(self, query, retrieved_text):
        # Build the chat messages and render them with the Llama 3 chat template
        messages = [
            {"role": "system", "content": "Answer the Question for the Given below context and information and not prior knowledge, only give the output result \n\ncontext:\n\n{}".format(retrieved_text)},
            {"role": "user", "content": query},
        ]
        return self.pipeline.tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )

    def generate(self, query, retrieved_context):
        prompt = self.generate_prompt(query, retrieved_context)
        output = self.pipeline(
            prompt,
            max_new_tokens=512,
            eos_token_id=terminators,
            do_sample=False,
        )
        # Return only the newly generated text, not the echoed prompt
        return output[0]["generated_text"][len(prompt):]

```

- Set up RAG

```
class Langchain_RAG:
    def __init__(self):
        # self.embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5")
        self.embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

        # Load all .txt files from the txt/ directory
        text_loader_kwargs = {'autodetect_encoding': True}
        loader_txt = DirectoryLoader("txt/", glob="./*.txt", loader_cls=TextLoader, loader_kwargs=text_loader_kwargs)
        documents_txt = loader_txt.load()
        text_splitter = CharacterTextSplitter(chunk_size=50, chunk_overlap=5)

        self.texts = text_splitter.split_documents(documents_txt)
        self.get_vec_value = FAISS.from_documents(self.texts, self.embeddings)
        self.retriever = self.get_vec_value.as_retriever(search_kwargs={"k": 4})

    def __call__(self, query):
        rev = self.retriever.get_relevant_documents(query)
        return "".join([i.page_content for i in rev])

text_gen = Llama3_8B_gen(pipeline=pipeline)
retriever = Langchain_RAG()

```

- Finally, query the model + RAG pipeline

```
query = "what is blacklist feature?"
start_time = time.time()
retriever_context = retriever(query)
result = text_gen.generate("what is blacklist feature?",retriever_context)
print(result)
print("--- answering in: %s seconds ---" % (time.time() - start_time))

```
# Use Case – 2: Action integration with chatbot (Google Calendar booking)

- Here we utilize the same stack (Groq + Deepgram) to build a calendar-scheduler POC.

- We integrate the Google (Gmail) API, using API credentials, to access Google Calendar.

- We use Groq tool calling to perform actions based on the conversation.

# Code Implementation

```
def run_conversation_book(user_prompt):
    # Step 1: send the conversation and available functions to the model
    messages = [
        {
            "role": "system",
            "content": "Today is 17 April 2024, You are a function calling LLM that Book an event on the calendar at the provided datetime in ISO format with the provided period. if the provided period is not an integer, set the default period to 1"
        },
        {
            "role": "user",
            "content": user_prompt,
        }
    ]
    tools = [
        {
            "type": "function",
            "function": {
                "name": "book_event",
                "description": "Set calendar events",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "str_datetime": {
                            "type": "string",
                            "description": "The event date and time",
                        },
                        "period": {
                            "type": "integer",
                            "description": "The event period",
                        }
                    },
                    "required": ["str_datetime", "period"],
                },
            },
        }
    ]
    response = client.chat.completions.create(
        model=MODEL,
        messages=messages,
        tools=tools,
        tool_choice="auto",
        max_tokens=4096
    )

    response_message = response.choices[0].message
    tool_calls = response_message.tool_calls

    # Step 2: check if the model wanted to call a function
    if tool_calls:
        # Step 3: call the function
        # Note: the JSON response may not always be valid; be sure to handle errors
        available_functions = {
            "book_event": book_event,
        }  # only one function in this example, but you can have multiple
        messages.append(response_message)  # extend conversation with assistant's reply

        # Step 4: send the info for each function call and function response to the model
        for tool_call in tool_calls:
            function_name = tool_call.function.name
            function_to_call = available_functions[function_name]
            function_args = json.loads(tool_call.function.arguments)
            print(function_args)

            function_response = function_to_call(
                str_datetime=function_args.get("str_datetime"),
                period=function_args.get("period")
            )
            messages.append(
                {
                    "tool_call_id": tool_call.id,
                    "role": "tool",
                    "name": function_name,
                    "content": function_response,
                }
            )  # extend conversation with function response
        second_response = client.chat.completions.create(
            model=MODEL,
            messages=messages
        )  # get a new response from the model where it can see the function response
        return second_response.choices[0].message.content

print("[AI] : hi, please pick a date and time for the meeting")
```

- Defining the prompt here is very important:

  "content": "Today is 17 April 2024, You are a function calling LLM that Book an event on the calendar at the provided datetime in ISO format with the provided period. if the provided period is not an integer, set the default period to 1"

- The model is not aware of the current date and time, so I am informing the model about today's date (see the sketch below for computing this dynamically).

- I am also telling the model via the prompt about the expected datetime format, so any date-time conversion is limited to a specific (ISO) format.

- Finally, I am telling the model via the prompt to use a default period of 1 hour for the meeting in case none is provided.
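Since the date is hard-coded in this POC, one simple improvement (not in the original code) is to inject today's real date into the system prompt at runtime:

```
from datetime import date

# Build the system prompt with the actual current date instead of a hard-coded one
today_str = date.today().strftime("%d %B %Y")  # e.g. "17 April 2024"
system_content = (
    f"Today is {today_str}, you are a function calling LLM that books an event on the "
    "calendar at the provided datetime in ISO format with the provided period. "
    "If the provided period is not an integer, set the default period to 1."
)
```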

- Then I define the Groq tool template:

```
tools = [
    {
        "type": "function",
        "function": {
            "name": "book_event",
            "description": "Set calendar events",
            "parameters": {
                "type": "object",
                "properties": {
                    "str_datetime": {
                        "type": "string",
                        "description": "The event date and time",
                    },
                    "period": {
                        "type": "integer",
                        "description": "The event period",
                    }
                },
                "required": ["str_datetime", "period"],
            },
        },
    }
]
```

- Make sure to define the correct data types.

- Here, I am informing the model that I expect two mandatory params (`str_datetime`, `period`) with types (string, integer), respectively.

- I am telling the model that the action function is the `book_event` function.

- Finally, the extracted params are passed to the function; an example of what the model returns is sketched below.
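For illustration (a hedged example, the exact values depend on the conversation), the model's tool call arrives as a JSON string that `json.loads` turns into a dict before the params are handed to `book_event`:

```
import json

# Example of what tool_call.function.arguments might contain (values are illustrative)
raw_arguments = '{"str_datetime": "2024-04-18T15:00:00", "period": 2}'

function_args = json.loads(raw_arguments)
book_event(
    str_datetime=function_args.get("str_datetime"),  # "2024-04-18T15:00:00"
    period=function_args.get("period"),              # 2 (hours)
)
```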

## book_event function and Google Calendar API

```
# Create a Calendar event
def book_event(str_datetime, period):
    # period already arrives as an integer number of hours
    hours = period

    # Authenticate Google Calendar API (authenticate_google is defined elsewhere in the repo)
    oauth2_client_secret_file = './cred.json'
    scopes = ['https://www.googleapis.com/auth/calendar']
    service = authenticate_google(scopes=scopes, oauth2_client_secret_file=oauth2_client_secret_file)

    # Get email-ids of all subscribed calendars
    calendars_result = service.calendarList().list().execute()
    calendars = calendars_result.get('items', [])

    print(str_datetime)
    # Normalize the datetime string to "YYYY-MM-DD HH:MM"
    str_datetime = str_datetime.replace('AM', '').replace('PM', '').replace('T', ' ')
    if (len(str_datetime.split(':'))) > 2:
        str_datetime = str_datetime[:-3]  # drop seconds if present
    print(str_datetime)

    # Insert an event
    event = {
        'summary': 'AI-Reserved Meeting',
        'location': 'Zoom meeting',
        'description': 'A meeting scheduled by AI.',
        'start': {
            'dateTime': (datetime.strptime(str_datetime, '%Y-%m-%d %H:%M')).isoformat(),
            'timeZone': 'America/Los_Angeles',
        },
        'end': {
            'dateTime': (datetime.strptime(str_datetime, '%Y-%m-%d %H:%M') + timedelta(hours=hours)).isoformat(),
            'timeZone': 'America/Los_Angeles',
        },
    }
    created_event = service.events().insert(calendarId="zahaby@gmail.com", body=event).execute()
    print(f"Created event: {created_event['id']}")
    return json.dumps({"Created event": created_event['description']})

```
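`authenticate_google` is not shown in this snippet. A possible implementation (a sketch based on the standard Google API Python quickstart; adjust paths and token caching to your setup) could look like this:

```
import os

from google.auth.transport.requests import Request
from google.oauth2.credentials import Credentials
from google_auth_oauthlib.flow import InstalledAppFlow
from googleapiclient.discovery import build

def authenticate_google(scopes, oauth2_client_secret_file):
    creds = None
    # token.json caches the user's access/refresh tokens after the first login
    if os.path.exists('token.json'):
        creds = Credentials.from_authorized_user_file('token.json', scopes)
    if not creds or not creds.valid:
        if creds and creds.expired and creds.refresh_token:
            creds.refresh(Request())
        else:
            flow = InstalledAppFlow.from_client_secrets_file(oauth2_client_secret_file, scopes)
            creds = flow.run_local_server(port=0)
        with open('token.json', 'w') as token:
            token.write(creds.to_json())
    # Build the Calendar v3 service used by book_event
    return build('calendar', 'v3', credentials=creds)
```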

## TODO:

- Check whether the requested date-time is already reserved (by querying the calendar first); if it is, suggest other free slots. A sketch of such a check follows this list.

- This is a stateless calendar-reservation POC; it doesn't handle any kind of conversation or stateful flow yet. The POC still needs to be placed in a full conversational context.

- Handle exceptions, as there is a wide window for different error scenarios.
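For the first TODO item, a hedged sketch (not in the original code) using the Calendar API's `freebusy` endpoint to test whether a slot is taken before booking:

```
def is_slot_free(service, calendar_id, time_min_rfc3339, time_max_rfc3339):
    """Return True if the calendar has no busy period between the two RFC3339 timestamps."""
    body = {
        "timeMin": time_min_rfc3339,   # e.g. "2024-04-18T15:00:00-07:00"
        "timeMax": time_max_rfc3339,   # e.g. "2024-04-18T16:00:00-07:00"
        "items": [{"id": calendar_id}],
    }
    result = service.freebusy().query(body=body).execute()
    busy_periods = result["calendars"][calendar_id]["busy"]
    return len(busy_periods) == 0
```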

### useful links:

https://developers.google.com/gmail/api/quickstart/python#authorize_credentials_for_a_desktop_application

https://python.langchain.com/docs/integrations/toolkits/gmail/

# Source Code

The source code for this tutorial is available in this repo: https://github.com/zahaby/llm-rag
