
2024 Future Computing Summer Internship

at the
Laboratory for Physical Sciences (LPS)

Improving Retrieval Augmented Generation

Eric Mollard
Anant Patel
Lynn Pham
Reuben Trachtenberg

August 9, 2024
ABSTRACT
Despite the recent popularity of large language models (LLMs), they have limitations, notably that
their information can become outdated due to a training data cutoff. Retrieval augmented generation
proves to be a good solution to this, giving the LLM access to recent information through the use of
databases. We
test a combination of methods like chunking, query re-ranking, query expansion, and knowledge
graphs. Chunking tests what chunk sizes provide the best context for the LLM, resulting in single-
sentence chunks being the most effective. Furthermore, query re-ranking is where the ordering of
documents is modified to include the most relevant documents at the top of a search. We find that
query re-ranking has mixed results compared to the baseline, where some methods perform better
with certain k values. Next, we looked at query expansion, which uses an LLM to generate another
response to further expand on the question. Using three different methods of query expansion, we
find that they underperform compared to the baseline. Finally, we use a knowledge graph to
model relationships between different points of data to better understand and connect informa-
tion. With this, the results were more nuanced due to the nature of the testing. Optimizing the
retrieval augmented generation pipeline is a promising way to increase accuracy and efficiency in
real-world scenarios; however, in our testing, we find that these optimization methods are not as
clear cut as expected.

Keywords:
Bi-Encoder: A model that processes a query and input documents independently
with its two encoders that then each produce an embedding. The embeddings
are later compared against each other for similarity and relevance score.
Closed-book Setting: A setting in which the model will not have access to external
resources but must rely solely on its pre-trained parameters.
Cross-Encoder: A model that processes the query and input documents together
with a single encoder to produce a joint embedding representation of both. This
model tends to have greater overall accuracy as it captures the interactions be-
tween the items within the input during processing.
Knowledge Graph: A graphical representation of relationships between entities
generated via semantic linking.
Large Language Model (LLM): A computational deep-learning model that is trained
on large amounts of data used for natural language processing.
Lost in the Middle: An issue relevant to re-ranking for LLMs, in which an LLM
tends to perform best when relevant documents are placed at the beginning
and the end of the input context and worst when relevant documents are placed
in the middle of the input context.

Multilingual-E5: A multi-language embedding model that is specifically trained to
embed natural language into vector representations.
Oracle Setting: A setting in which the model will be given a single document that
contains the correct answer to the question to generate the answer.
Pseudo-Relevance Feedback (PRF): A query expansion approach that uses the
retrieved documents from the original query as “pseudo-relevant” documents to
retrieve new query terms.
Query Expansion: An optimization technique that allows the model to semantically
retrieve the relevant documents that might not share the same keywords, thus, ex-
panding the diversity and perspective of the LLM-generated response. By giving
the model broader context, it enriches the model’s comprehension and increases
the chances of getting the correct answer.
Re-Ranking: The process of taking the initial outputs of a vector database and
reordering the documents based on some metric (i.e. relevance, age, size, etc.)
to improve the performance of a retrieval augmented generation pipeline.
Retrieval Augmented Generation (RAG): A technique for improving the perfor-
mance of a LLM by using external data (i.e. documents) to ground the response
of the LLM to reduce errors and improve generated responses.
Stop Words: Words that are filtered out of a dataset because they are deemed
insignificant to the meaning of the data. Common stop words include ’the’, ’and’,
and ’is’.
Vector Database: A specialized database that stores and searches with vector
data.
Vector Embedding: A way of storing data using large-dimensional vectors of num-
bers that represent the semantic value of the data.

CONTENTS
Abstract i

1 Introduction 1

2 Background 1
2.1 Large Language Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2.2 Retrieval-Augmented Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2.3 Knowledge Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

3 Literature Review 3

4 Methods 5
4.1 Technologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
4.2 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
4.2.1 Dataset Pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
4.2.2 Storing the Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
4.3 Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
4.3.1 Questions Pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
4.4 Baseline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
4.5 Chunking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
4.5.1 Fixed-size Chunking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
4.5.2 Recursive Chunking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
4.5.3 Semantic Chunking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
4.6 Query Re-Ranking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
4.6.1 Temporal Re-ranking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
4.7 Query Expansion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
4.7.1 New Queries with Keywords Extraction . . . . . . . . . . . . . . . . . . . . . . . . . 11
4.7.2 New Queries Based On the Original Query . . . . . . . . . . . . . . . . . . . . . . 14
4.7.3 Answers Based On the Original Query . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.8 Knowledge Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.8.1 Generating Relationships . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.8.2 Neo4j Database Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.8.3 Querying Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.9 Testing Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

5 Performance Analysis 18
5.1 Qdrant Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
5.1.1 Chunking Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
5.1.2 Re-ranking Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
5.1.3 Query Expansion Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
5.2 Knowledge Graph Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5.2.1 Node Query Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
5.2.2 Path Query Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
5.2.3 Combining Node and Path Query Results . . . . . . . . . . . . . . . . . . . . . . . 25
5.2.4 Manually Reviewed Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.2.5 Graph and Vector Results Compared . . . . . . . . . . . . . . . . . . . . . . . . . 25

6 Conclusion 26
6.1 Lessons Learned . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
6.2 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
6.3 Future Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

References 28

1 INTRODUCTION
With the recent introduction of large language models (LLMs) into various industries, there is a bur-
geoning desire to make them more efficient and effective, as is often the case with any new tech-
nology. One proposed improvement, retrieval augmented generation (RAG), has proven to be an
important advancement for large language models [1]. Retrieval augmented generation provides
a way for large language models to be context specific and generate responses based on stored
data. This means models do not require retraining to draw on new information. However, there is
still much work to be done before RAG systems are ready for real-world use cases.
This paper intends to explore several methods with the aim of improving RAG models with optimiza-
tions and improvements to the RAG pipeline. We propose testing various optimization methods to
test their effectiveness with real-world data. The standard RAG pipeline shows a couple of areas
with room for refinement. We look at improving the quality of chunking, cleanliness of data, con-
ciseness of queries, and prioritization during document retrieval for a standard RAG pipeline, as well
as the potential upsides of knowledge graphs as a means of interacting with the model.
We first explore chunking techniques, experimenting with various methods like character-based
chunking, semantic chunking, and token-based chunking. We compare the success of each
chunking method based on the results of the retrieval from the database. The next item that we
look at is the quality of our data. We approach this by studying common methods in natural lan-
guage processing (NLP) that are used in areas like classical machine learning. Here, we experiment
with techniques such as stemming, eliminating stop words, and cleaning up unnecessary informa-
tion with the use of regex strings. Another improvement that we consider is query expansion. Query
expansion allows for the large language model to generate additional responses in order to obtain
a greater number of documents from the database. Finally, we use re-ranking to prioritize some
documents over others to further better the response from the large language model. Knowledge
graphs are the other consideration for improving RAG. A key benefit of knowledge graphs is their
relationship-building, which allows for greater complexity in responses and multi-hop querying ca-
pabilities [2]. As this proves to be a lofty endeavor, we dedicate half of our team’s efforts to this
task alone.
In this work, we go more in-depth into the techniques discussed and the effects they have
on the retrieval augmented generation system. We will compare various implementations to find
effective methods to optimize the pipeline.

2 BACKGROUND
2.1 Large Language Models
Large language models, or LLMs, are models with a large number of parameters that can interpret and
generate human language and that have impressive learning capabilities [3]. Many of these mod-
els are trained on large sets of data, which is held in the model’s network via its weights. Furthermore,
LLMs are often fine-tuned to produce more accurate and plausible results [4].
LLMs work through a process of encoding and decoding. When a model receives input, it tok-
enizes that input and proceeds to create a vector embedding of each token based on the token’s
meaning. The result of these embeddings is a large vector space containing points that are near
to similar entities and far from dissimilar entities [5]. When a user asks an LLM, like ChatGPT, a ques-
tion, the model embeds the text and passes the embedding through several layers of encoders
and decoders in the transformer to build a response to the question. Eventually, the model is able
to generate a response based on its training data [5]. However, training model weights becomes
much more expensive as the model grows. This leads to the unfortunate downside that large language
models are not always up to date because their training is cut off from recent information.

2.2 Retrieval-Augmented Generation


Retrieval-Augmented Generation (RAG) gives LLMs the capability to access data from external
sources [1]. The obvious upside of this advancement is that models can produce responses with
current data. They would not be limited to the data that they were fed at the time of their training,
solving the issue of having outdated information. Additionally, external knowledge stores are far
easier to update than the model itself, making RAG a very efficient way for models to retrieve
information.
With classical RAG, documents containing text data are split into chunks, which are then vectorized,
and stored in a vector database. When the LLM is queried, the system searches the database for the
chunks most relevant to the posed question. These chunks, along with the original question, are
what the LLM eventually takes as input, and its answer is generated from this data [1].
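As a concrete illustration of this flow, the minimal sketch below uses the tools described later in this
report (Qdrant, Multilingual-E5, and an OpenAI chat model). The collection name, payload keys, and
model names are illustrative placeholders rather than our exact configuration.

from openai import OpenAI
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("intfloat/multilingual-e5-large")
qdrant = QdrantClient(host="localhost", port=6333)
llm = OpenAI()

def answer(question: str, k: int = 5) -> str:
    # Embed the question (E5 models expect a "query: " prefix).
    query_vector = embedder.encode("query: " + question).tolist()
    # Retrieve the k most similar chunks from the vector database.
    hits = qdrant.search(collection_name="conflicts", query_vector=query_vector, limit=k)
    context = "\n\n".join(hit.payload["raw_text"] for hit in hits)
    # Pass the retrieved chunks and the original question to the LLM.
    response = llm.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Answer using only this context:\n{context}\n\nQuestion: {question}"}],
    )
    return response.choices[0].message.content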

2.3 Knowledge Graphs


While classical RAG has shown promising results, it is not the only way to approach this technique.
Rather than use a vector database, some models refer to a knowledge graph for their augmenta-
tion. Knowledge graphs are ”graph-based abstractions of knowledge” wherein nodes represent
entities and edges represent relationships between those entities [6]. A major upside to knowledge
graphs is their more coherent structure [2]. Instead of vectors floating in space, they contain en-
tities with meaningful connections. Also, the added context helps to reduce hallucinations and
increase nuanced responses from the model. Below is an example sub-graph from our knowledge
graph made in Neo4j.

Figure 1: Knowledge Graph Snippet

This sub-graph shows a cluster of nodes from our knowledge graph containing information about
the China-Taiwan conflict. Here, the focal point is on the ”Taiwan’s waters” node, which has links
to several related nodes. We can get a better picture of which entities are being related to this
node in the image below.

Figure 2: Node Neighbors

Here, we can see that the relationship between ”Tugboat” and ”Taiwan’s waters” is ”left”. Apart
from this relationship information, the graph also holds the chunk of text containing the relationship,
as well as the name of the file from which the chunk came (not pictured). These additional bits of
information are useful for the LLM as it’s generating a response to a query.

3 LITERATURE REVIEW
Advanced RAG Techniques: In the effort to improve the performance of LLMs, many techniques
have been introduced by experts and researchers in this area. ”Retrieval-Augmented Generation
for Large Language Models: A Survey” is one of the papers that provided inspiration on multiple
aspects of RAG pipeline optimization strategies, including indexing, querying, embedding, and re-
trieval. For indexing, there are different strategies for chunking or for enhancing the chunks with
better organization and more detailed metadata. For queries, there are recommendations for query
expansion, multi-query, sub-query, chain-of-verification, and query-transformation techniques. The
survey also suggests the use of a re-ranker for its impact on increasing the relevance and accuracy
of an LLM’s generated responses. The survey presents a great number of optimization techniques
that our team integrated within our RAG pipeline as we began our study to measure whether the
performance of an LLM is improved by a given technique.

Chunking strategy: In order to improve the performance of an LLM, multiple chunking strategies
have been introduced and experimented with. Some of the key prior works include [2] and [], which
presented several of these strategies. First is the simplest and most straightforward approach, the
”fixed size strategy”, which divides the text into fixed-size segments. Second is the ”recursive
strategy”, which handles more complex data by grouping text according to certain conditions or
operations to produce chunks of similar size. Third is the splitting strategy based on context, also
known as semantic chunking, which clusters text together based on its contextual relevance.

Re-ranking: Prior work has examined whether the specific placement of retrieved chunks affects
the LLM’s ability to produce accurate responses. The paper ”Lost in the Middle: How Language
Models Use Long Contexts” provides experimental results and analysis showing that encoder-only
models, such as BERT, which is the model we used for this study, tend to perform best when the
relevant documents are placed at the very beginning or end of the input context. However, when
the relevant content is located in the middle of the input, performance tends to degrade signifi-
cantly. On the other hand, encoder-decoder (sequence-to-sequence) models can, in certain cases,
still perform well when the relevant document is placed in the middle of the input. This paper gave
us an idea of the possible impact of re-ranking the retrieved documents in the RAG pipeline, leading
to our experiments with different re-ranking techniques to integrate within our RAG model.

Query Expansion: Among the techniques that have proven helpful in boosting an LLM’s accuracy,
query expansion is one of the standout options that many have picked. One of the prior works
on query expansion is ”Query Expansion by Prompting Large Language Models”. This paper
studies the impact of different LLM prompting styles, including zero-shot, few-shot, and Chain-of-
Thought (CoT), and compares them to Pseudo-Relevance Feedback (PRF), a traditional query
expansion method. The issue with PRF approaches is that they assume the top retrieved documents
are the most relevant; this gives poor results when the first set of retrieved documents is not
relevant enough due to a poorly written query. Their methodology used an LLM to generate new
query terms and concatenate them onto the original query, q′ = Concat(q, q, q, q, q, LLM(prompt(q))),
to broaden the relevant keywords for the query and increase the chance of getting the correct
answer. Among all the approaches, their experiments showed that CoT/PRF prompts generally
yield the best performance. This prior work influenced us to experiment with a similar method to
optimize our RAG pipeline.

An Alternate Approach to Knowledge Graph RAG: A primary challenge in implementing RAG sys-
tems with knowledge graphs is determining how to query the graph data. In our case, we wanted
to be able to compare the results of our knowledge graph system to those of our vector database
system. This meant the queries for our knowledge graph had to produce comparable results to
the queries for our vector database. We ended up querying our knowledge graph directly using
Cypher queries (the language used for interacting with Neo4j databases). With this method, we
received nodes, files, and chunks relevant to the question asked in our query. As will later be ex-
plained, there were some issues with this technique. Perhaps a better idea would have been to
follow in the footsteps of Matsumoto et al., who, per their paper ”KRAGEN: a knowledge graph-
enhanced RAG framework for biomedical problem solving using large language models”, vector-
ized the relationships in their knowledge graph and did vector searches on the data. In essence,
their methodology combined the two separate mediums that we sought to compare. Additionally,
they used a Graph of Thoughts (GoT) framework within KRAGEN to enhance the performance of
their LLM. Their results were promising, as they saw leaps in accuracy across several question types with
GoT over traditional input/output knowledge graph RAG. The paper illustrates a potential break-
through in improving knowledge graph RAG systems that we might consider testing in the future.

ChatGPT and Knowledge Graphs: One group that found success in having an LLM generate queries
for a knowledge graph query language published their findings in a paper titled ”LLM-assisted
Knowledge Graph Engineering: Experiments with ChatGPT”. Meyer et al. created a knowledge
graph and then had ChatGPT generate queries in the SPARQL language to ask of the graph. They
found that both GPT-3.5 and GPT-4 were fairly consistent in producing plausible, syntactically valid
queries. Our trials with creating valid Cypher queries were less successful; however, our methodolo-
gies differed and, after all, these are two different query languages. One important difference is
that this group of researchers had GPT re-enter the query generation process if the last query re-
sulted in the empty set when sent to the knowledge graph. This allowed for a cycle of generation
until a query was created that produced one or more results. We did not take this step in our code,
but it might have proven valuable. Too often our generated queries were resulting in empty sets, so
we believed that the only way to use GPT to create queries would be a different approach outlined
later in the paper. Apart from the fruitful query generation performed by this research group, there
were also positive results found regarding the capabilities of LLMs to generate knowledge graphs
and schemas. These findings confirm that LLMs still have room for improvement but are at least
able to produce the necessary components to run a RAG knowledge graph pipeline.

4 METHODS
4.1 Technologies
With the boom in demand for efficient, low-cost LLMs, researchers, hobbyist programmers, and
software developers have produced many resources for those looking to improve on the newest
tech phenomenon. We use some of these techniques, tools, and libraries to aid our work. For
the retrieval augmented generation system, we first used Mistral’s Mixtral-8x7B, but later switched
to various models from OpenAI due to some constraints. The models were then interfaced with
Qdrant, an open-source vector database, and used to generate queries for query expansion. To
work with the data and the vector database, we used Multilingual-E5 for the embedding model.
These additional models are integrated into the pipeline alongside several libraries: LangChain,
an open-source library that helps build large language model applications; datasets, a Hugging
Face library with ready-to-use data processing methods; and the Natural Language Toolkit (NLTK) for
language processing. To help with data organization, we also employ pandas DataFrames. Ad-
ditionally, we tested the efficiency of knowledge graphs versus a traditional RAG system. To do
this, we hosted a database on Neo4j. This tool lets users create and interact with graphs locally or
through the website’s interface. We chose the latter.

4.2 Dataset
We used a large dataset containing information about three ongoing conflicts. The dataset con-
tains data on the China-Taiwan, Israel-Hamas, and Russia-Ukraine conflicts and comes from the Institute
for the Study of War. China-Taiwan is the smallest part of the dataset with only 20 text files. Israel-
Hamas contains 198 text files. The Russia-Ukraine data was a little different because it is broken
down based on year, starting from 2022 and going through 2024. For Russia-Ukraine, there were
325, 363, and 119 text files that were contained within 2022, 2023, and 2024 respectively, for a to-
tal of 807 text files. In total, there are 1,026 text files. The text files mostly contain information from
news reports that detail current events regarding the conflicts such as government plans, wartime
politics, and ongoing situations.
The data was collected by scraping the Institute for the Study of War’s web pages for the three conflicts.
The scraped data was then saved into text files and sorted based on the conflict and, for the Russia-
Ukraine conflict, on year as well. Upon further inspection of the data, we took note of the format of
the data and how it was presented to the reader. For the China-Taiwan reports, the data followed
a similar pattern between all the files within the conflict. The China-Taiwan files start with a large
header that contains the title, some hyperlinks to other pages (i.e. to download the page as a
PDF), the authors of the report, and the purpose of the reporting on the China-Taiwan conflict.
After the header, there is a section for the key takeaways from the report. The key takeaways
mostly are summarized points from the report itself. The main content of the report comes after
the key takeaways. The body, or the main report, is broken down into multiple paragraphs, where
each paragraph has some in-text citations, which consist of a number enclosed in square brackets.
After the body, the numbered citations are listed alongside the footer which contains a little more
information on the website.
The Israel-Hamas reports followed a similar format to the China-Taiwan reports. They share the same
header, body, and footer. However, the only major difference is that in between the purpose
of the report and the key takeaways, there was an extra section that provided some preliminary
information that is important but is not contained within the key takeaways. Along with the greater
number of documents, the script to clean the data became slightly more complicated. The Russia-
Ukraine reports also followed mostly the same pattern. It is important to note that sometimes, the
reports do not contain the key-takeaways portion or preliminary information.

4.2.1 Dataset Pre-processing

Pre-processing is a standard technique applied to large amounts of data related to machine learn-
ing and neural networks. We applied some pre-processing methods to determine the best ways
to optimize data retrieval. We first attempted to use the data as-is, entirely unprocessed. We also
used minimally cleaned data, doing some testing afterward to see how well it performed.
Looking at the data, we saw that there were some errors. Since the data was stored in text files,
any images were turned into special symbols that were unable to be interpreted. This was not
very useful information to keep given that it is not a part of natural language. In addition to this,
there were also a significant number of extra lines and spaces from the original formatting of the
files. Furthermore, there were many hyperlinks and citations contained in the body of each report.
Hyperlinks were unnecessary for our RAG model as they did not provide any relevant context and
only served to clutter the data we wanted to use. Removing special characters and hyperlinks
could reduce the size of the files and create better, more relevant chunks. Some of the files within
the folders were just blank because they were supposed to represent a map or some illustration,
which could not be parsed correctly. We worked with multiple methods to find the most effective
way to process the data.
We began by looking at existing pre-processing methods that are commonly used in areas like clas-
sical machine learning. Such methods include stemming words and eliminating stop words. To im-
plement this, we used the natural language tool kit library (NLTK). With NLTK, we removed instances
of stop words and stemmed the data. However, we did not continue with this approach, seeing
as it made sentences harder to understand and removed some key context from the sentence.
Large language models are already efficient with natural language processing, so stemming and
eliminating stop words was not needed.
Additionally, we also removed all instances of hyperlinks from the data. To remove the hyperlinks,
we used Python’s regex library, re. Using re, we were able to settle on various regex strings to elimi-
nate large chunks of the text by pattern matching to specific keywords. For instance, to eliminate
the citations near the bottom of the report, the pattern would look for a ”[1]”, skip a character,
and then look for specific keywords like ”http” or ”Sources available upon request” to determine
if it is a citation or not. Using re, we were able to remove unnecessary portions of the text like
the header, key takeaways, citations, and footers, along with the in-text citations. This excelled in
quickly removing large portions of text.
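The snippet below is a simplified sketch of this kind of regex-based cleanup; the exact patterns we
used were tuned to the report format, so the ones shown here are only illustrative.

import re

def clean_report(text: str) -> str:
    # Drop numbered citation lines near the bottom of the report,
    # e.g. "[1] http..." or "[2] Sources available upon request".
    text = re.sub(r"^\[\d+\].*(?:http|Sources available upon request).*$",
                  "", text, flags=re.MULTILINE)
    # Drop in-text citations such as "[1]" or "[23]".
    text = re.sub(r"\[\d+\]", "", text)
    # Drop any remaining hyperlinks in the body.
    text = re.sub(r"https?://\S+", "", text)
    # Collapse the extra blank lines left behind.
    text = re.sub(r"\n\s*\n+", "\n\n", text)
    return text.strip()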
Once the data was processed, we chunked it. Chunking is a process by which the data is broken
up based on a set number of characters, tokens, sentences, paragraphs, or some other criteria.
With chunking, we could also specify overlap and strings to separate with. This became another
area for experimentation. Once chunks were created, we stored them in a CSV file with columns
representing filenames and chunked data. We created three different CSV files (one for each of
the conflicts) to separate the data.

4.2.2 Storing the Dataset

Once the data pre-processing was finished, working with the data was another question. We
needed a way for the LLM to access and query the data when a question is asked of it. For this,
we utilized Qdrant. Qdrant stores data in collections, which are groups of points, which contain
the data. Points in the database represented chunks of the text and were variable in size depend-
ing on the chunking method we were testing. For each point in the collection, we added the file
name, the date the report was published, raw text, and the embedding of that chunk to the pay-
load. Points in Qdrant utilize vectors to store and locate data given a similarity metric, along with
an optional payload. In this case, the chosen similarity metric was cosine similarity, which seems like
the standard for vector databases [7]. Additionally, we used a locally hosted instance of Qdrant
on a high-performance computer, allowing us to quickly upload and manage our collections with
little downtime.
What we included in each Qdrant point was necessary, as we needed multiple pieces of informa-
tion to test the retrieval quality. The information that we needed was included in the payload. The
first item is the file name that the text chunk comes from. The file name is used for testing to ensure
that the correct chunks are being retrieved. The second item is the date the file that the chunk
comes from was published. This date is used for temporal re-ranking, which will be covered in-depth
later on. The third item is the raw text of the chunk. This is the data that will be given to
the LLM when it needs to generate an answer. The last item is the embedding of the raw text. Like
the file name, this is used for testing and making sure that the correct data is being retrieved.
To store the data, we needed to convert it into vectors. Our first idea was to test the system by
using a simple hash function that would assign each new entry a hash. Each hash was assigned
by iterating over a counter variable. Since Qdrant expects vectors, not a single value, we stored
the hashes as one-dimensional lists to act as the vectors for each point. This worked and reassured
that the points were being created properly and being stored in the Qdrant collection. However,
the problem with storing the chunks like this is that hashing does not provide any relevance for
a point when there is a query on the database. Slight changes to the raw text can completely
change a hash. To solve this, we used an embedding model to turn the chunks into vectors. We
used a pre-trained embedding model called Multilingual-E5 from Hugging Face. Multilingual-E5 is a
multilingual model that is used for embedding words into vectors [8]. By embedding with this model,
we got vectors that were made up of 1,024 floating-point numbers. This is the vector that will be
used for locating Qdrant points. We completed this embedding process with all of the chunks from
the three conflicts and stored the embeddings in three different arrow files. Once the format was
decided on the points, we worked to upload the data to the Qdrant collection. Qdrant provides a
Python library called ”qdrant-client” to help upload information using ”upsert”. This function made
it easy to send points to the database in bulk, reducing waiting times to upload a large number of
points.
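The following is a minimal sketch of this embedding and upload process, assuming a locally hosted
Qdrant instance and the sentence-transformers release of Multilingual-E5; the collection name and
payload schema are illustrative.

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("intfloat/multilingual-e5-large")  # 1,024-dimensional vectors
client = QdrantClient(host="localhost", port=6333)

client.recreate_collection(
    collection_name="conflicts",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)

def upsert_chunks(chunks):
    # chunks: iterable of dicts with "file", "date", and "text" keys (illustrative schema).
    points = []
    for i, chunk in enumerate(chunks):
        vector = embedder.encode("passage: " + chunk["text"]).tolist()
        points.append(PointStruct(
            id=i,
            vector=vector,
            payload={"file": chunk["file"], "date": chunk["date"],
                     "raw_text": chunk["text"], "embedding": vector},
        ))
    # Bulk upload the points to the collection.
    client.upsert(collection_name="conflicts", points=points)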

4.3 Questions
In addition to the text data, we created a database of 9,774 questions used to query the text
database. The questions were created using an LLM. During generation, the LLM was provided
chunks of text from the dataset to serve as context for the questions. The questions were specifi-
cally created so that they did not directly contain information from the chunks but still maintained
strong connections with the chunks used to generate each of them. Later in the experiments,
as we worked with the data more, we extended our testing with a new testing approach. As a
result, we created a new set of 1,977 questions that were generated in proportion to the size of
each folder.

4.3.1 Questions Pre-processing

The pre-processing for the questions was similar to that of the text data. We first started with a
manual review of the original 9,774 questions. We found that there were inconsistencies among
the questions. Some of the generated questions included direct answers to the generated question
within the same block of generated text. To resolve this, we opened the data using Hugging Face
datasets and removed any instances of text that came after the newline following the question.
This effectively got rid of all extra blocks of generated text that may have contained an answer to
the question.
The questions file also contains, for each question, a ground-truth paragraph and the name of the
file from which that ground truth came. These two pieces of information are designed for perfor-
mance evaluation of the RAG pipeline. Like the questions themselves, these two fields also needed
pre-processing. As we studied the file, we noticed multiple ground-truth entries containing extra
spaces, unreadable symbols, and hyperlinks that were useless for evaluating performance, so we
approached this with a technique similar to how we cleaned the questions and removed the un-
necessary content. For each entry, there were three columns: one containing the question, one
containing the ground-truth text, and one containing the filename.
We then embedded the questions with Hugging Face’s multilingual-e5-large and mapped the vec-
tor data into the existing file as the fourth column, which was later transferred to an arrow file. With
the questions embedded, we were able to query the database and find the k-nearest neighbors
to each question vector, with k being an argument that we specified when performing the query.
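A minimal sketch of this step is shown below, assuming the cleaned questions are in a CSV with
question, ground-truth, and filename columns; the column and file names are placeholders.

import pandas as pd
from datasets import Dataset
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("intfloat/multilingual-e5-large")

# Assumed columns in the cleaned questions file: question, ground_truth, filename.
questions = pd.read_csv("questions_clean.csv")
# E5 expects a "query: " prefix when embedding search queries.
questions["embedding"] = questions["question"].map(
    lambda q: embedder.encode("query: " + q).tolist()
)
# Persist the four-column table to Arrow via the Hugging Face datasets library.
Dataset.from_pandas(questions).save_to_disk("questions_embedded")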

4.4 Baseline
To evaluate the performance of future optimizations, we created a baseline. The baseline utilizes
single-sentence chunks. The baseline is rudimentary, as it does not contain any optimizations. The
baseline was created with little data pre-processing and no use of re-ranking or query expansion.
The baseline only has to vectorize the input and use it to query the database to get the documents
to pass to the Mistral model. This baseline was needed to measure and compare the improvements
of our methods.

4.5 Chunking
4.5.1 Fixed-size Chunking

Fixed-size chunking splits the document into chunks of a fixed size until the file
reader reaches the end of the file. In our experiments, we used LangChain’s RecursiveCharacter-
TextSplitter to pre-process our dataset into the sizes listed below:
• 150 tokens with 50 overlapping tokens
• 200 tokens with 50 overlapping tokens
• 250 tokens with 75 overlapping tokens
• 500 tokens with 100 overlapping tokens
• 2000 characters with 500 overlapping characters
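A minimal sketch of two of these configurations is shown below; depending on the LangChain
version, the splitter may need to be imported from langchain_text_splitters instead, and the file
path is a placeholder.

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Token-based fixed-size chunking (the 200/50 configuration above);
# from_tiktoken_encoder measures chunk_size in tokens (requires the tiktoken package).
token_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=200, chunk_overlap=50
)
# Character-based fixed-size chunking (the 2000/500 configuration above).
char_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=500)

with open("report.txt", encoding="utf-8") as f:
    text = f.read()
chunks = token_splitter.split_text(text)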

4.5.2 Recursive Chunking

To add variety to our collections, we applied recursive chunking to create paragraph chunks
and single-sentence chunks, the latter of which is also our baseline collection for testing.
For single-sentence chunking, we used Python’s re.split with the pattern r'[.?!]\s+'
to split the text on common sentence-ending punctuation marks (periods, question marks, and ex-
clamation marks) followed by white space. We wrapped each operation within a try-except block
in order to catch any errors that might happen due to unreadable characters or minor
patterns that we could not anticipate given the size of the dataset.
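A minimal sketch of this single-sentence splitting, with the same pattern and a broad try-except
guard, is shown below.

import re

def sentence_chunks(text):
    chunks = []
    try:
        # Split on sentence-ending punctuation followed by white space.
        chunks = [s.strip() for s in re.split(r"[.?!]\s+", text) if s.strip()]
    except Exception:
        # Skip files with unreadable characters or patterns we could not anticipate.
        pass
    return chunks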
For paragraph chunking, we also used re.split, this time with a "\n" pattern, along with the
strip() function to ensure all extra spaces were cleaned up. After experimenting with the
paragraph chunks a few times, we noticed that section/subsection titles and short sentences in a
list format tended to become their own chunks, which defeated the purpose of paragraph chunking.
Thus, we coded an algorithm to filter out section labels and list-formatted sentences, then attached
the section labels to the following paragraph and grouped the list-formatted sentences together.

4.5.3 Semantic Chunking

Semantic chunking splits the text based on its contextual meaning. Re-using the single-sentence
collection, we applied cosine similarity from the scikit-learn library to compute a percentage
threshold by scoring pairs of consecutive sentences until the end of each file. Using this
calculated threshold, we found all the breakpoint indices at which a sentence is not
relevant enough to the preceding text. We then grouped the related sentences between
breakpoints into new chunks based on their contextual relevance. These semantic chunks were
then re-embedded and uploaded into Qdrant.

Figure 4.6.3a Semantic Chunking
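Below is a minimal sketch of this semantic chunking procedure. The percentile-based threshold is an
illustrative assumption standing in for the percentage threshold described above.

import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

embedder = SentenceTransformer("intfloat/multilingual-e5-large")

def semantic_chunks(sentences, percentile=25):
    if len(sentences) < 2:
        return list(sentences)
    # Embed every sentence and score each consecutive pair.
    vectors = embedder.encode(["passage: " + s for s in sentences])
    sims = [cosine_similarity([vectors[i]], [vectors[i + 1]])[0][0]
            for i in range(len(vectors) - 1)]
    # Treat unusually low similarities as breakpoints between topics.
    threshold = np.percentile(sims, percentile)
    chunks, current = [], [sentences[0]]
    for i, sim in enumerate(sims):
        if sim < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i + 1])
    chunks.append(" ".join(current))
    return chunks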

4.6 Query Re-Ranking


Query re-ranking is a growing method to increase the effectiveness of document retrieval [9]–[11].
To do this, documents are re-ordered depending on some metric. In theory, re-ordering documents
can prioritize more relevant information that can be given to the LLM, thus improving the quality of
responses. Many of the existing re-ranking methods make use of existing models like BERT to help
with the re-ranking process [9]–[11]. However, we do not need to leverage an LLM or a re-ranking
model to sort the documents for us. The method that we propose uses simple math and the data
that Qdrant already gives us to re-rank the documents.

4.6.1 Temporal Re-ranking

For our use case, we implemented temporal re-ranking, which uses basic math. Temporal re-
ranking accounts for both a document’s relevance and its age to determine the overall score that is
given to the document. We would like to prioritize new information since that is the most
relevant and up-to-date information that the LLM can pull from. However, it is important to note
that recent information is not always relevant to the question being asked, so we need a
way to account for both relevance and age. We implement temporal re-ranking using a linear
combination as shown below:

S = α × R + β × T

Here, R is the relevancy score of the chunk. This is the score that is returned by Qdrant after
document retrieval; each returned point contains the relevance of the document to the original
search query. R is multiplied by α, a hyper-parameter weight that can be adjusted to
tweak the importance of R. T is the temporal score of the chunk. This score is calculated
from the date contained in the payload of the chunk: the document’s publication date is
subtracted from the current date to give the document’s age in days. The
age in days is then plugged into the rational function shown below:

T = 1 / (δ + 1)

Here, δ is the age of the document in days. Adding 1 to the document’s age caps T at a maximum
value of 1 and prevents a division-by-zero error for documents published the same day. T is then
multiplied by β, the second weight, used to tweak the importance of T. This linear combination is
then applied to reorder the results from Qdrant, resulting in a new ordering of the points. The weights
can be modified as necessary to put more emphasis on one factor or the other. This is explored
more during testing.
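A minimal sketch of this re-ranking step is shown below; the α and β values and the payload date
format are illustrative assumptions.

from datetime import date

def temporal_rerank(hits, alpha=0.7, beta=0.3):
    # hits: Qdrant results whose payload includes the publication date as "YYYY-MM-DD".
    rescored = []
    for hit in hits:
        relevance = hit.score                        # R: similarity score returned by Qdrant
        published = date.fromisoformat(hit.payload["date"])
        age_days = (date.today() - published).days   # delta: document age in days
        temporal = 1.0 / (age_days + 1)              # T: rational decay, at most 1 for same-day reports
        rescored.append((alpha * relevance + beta * temporal, hit))  # S = alpha*R + beta*T
    # Reorder the retrieved points by the combined score.
    return [hit for _, hit in sorted(rescored, key=lambda pair: pair[0], reverse=True)]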

4.7 Query Expansion


Previous papers have explored and demonstrated the improvements that query expansion can pro-
vide [9]–[12]. While prior studies have shown query expansion approaches to be efficient in boosting
model performance by providing longer and more in-depth questions, we are aware that different
datasets and pipeline structures might yield varying results. Thus, we selected several different ap-
proaches to query expansion. The first is generating new queries with keywords extracted from the
top-k retrieved chunks. The second is generating new questions based solely on the original queries.
The third is generating answers to the original queries.

4.7.1 New Queries with Keywords Extraction

Generating new queries with the keyword extraction approach is slightly more complex than
the second approach, as it involves more steps. Due to this complexity and resource limitations,
we only examined it on the first 100 and the first 300 questions. This approach is the
combination of Chain-of-Thought (CoT), a technique that asks the LLM model to break down its
logical thinking process in providing the final answer, and Pseudo-Relevance Feedback (PRF), a
technique that is based on the assumption that the top documents of each query search will be
the most relevant documents, thus, automatically using it to generate the new queries.
In detail, the top-k chunks of each round are extracted from the first 100 query searches
and saved into a CSV file. Using Mistral’s Mixtral-8x7B model and a customized prompt
[Figure 3.8.1b] based on the CoT technique, we gave step-by-step instructions on the extraction
procedure we wanted the model to follow, along with having the model explain its reasoning, to
extract the keywords from the previously saved chunks; we then stored the keywords in a multi-
dimensional list (list of lists). Each keyword list and its corresponding question is later used to build
the prompt [Figure 3.8.1d] for generating the new question. Once the set of new questions is gen-
erated, they are saved in a CSV file and embedded. After being embedded, we ran a query search
using the new set of generated queries. The query searches were recorded and evaluated using
our baseline testing.

sample =

"""
I have the following document:
* The website mentions that it only takes a couple of days to deliver but I still
have not received mine.

Let's think step by step to extract the keywords from this document.

Step 1: Identify the main subject of the document.


- The main subject is about a delay in the delivery mentioned on the website.

Step 2: Highlight the key components of the subject.


- The key components are the website, the mention of delivery time, and the delay.

Step 3: Break down the components into individual keywords.


- Keywords should capture the essence of the document.

Extract 5 keywords from that document.

Keywords: website, mentions, deliver, couple, days


"""

Figure 3.8.1a Sample to Instruct LLM on How to Extract Keywords

prompt =

"""
Here is the provided document:
{chunks['top3_chunks'][idx]}
Here is the provided matching question:
{q_list[idx]}
You are an assistant that will leverage your generative abilities to extract
important and relevant keywords in the document. Here is a sample of how you will
do it:
{sample}
End of sample.
Take the document and the question, following the sample, understand their meaning
and connection, then identify the keywords.
Extract ONLY EIGHT relevant keywords that you found in the previous step, ensuring
they are no longer than 4 words each. Return the EIGHT keywords separated by
commas, following this format: "Keywords: " in a clear and concise, human-readable
format, and end with a period.
"""

Figure 3.8.1b Keywords Extraction Prompt

sample2 =

"""
Original question: What is the significance of Ko Wen-je's shift in messaging
towards pan-Green voters and how does it relate to the decline in support among
KMT-identifying voters and the importance of youth turnout in the Taiwan election?

Keywords: Ko Wen-je, messaging, pan-Green, DPP-aligned, support decline, KMT-


identifying voters, polling data, joint ticket collapse, Hou Yu-ih, KMT base.

Let's think step by step to generate a new question based on these keywords and
the original question.

Step 1: Understand the context of the original question. - The original question
is focused on the strategic shift in messaging by Ko Wen-je, its impact on KMT-
identifying voters, and the role of youth turnout in the election.

Step 2: Identify the core components of the original question and how the keywords
relate to these components.
- Core components: significance of messaging shift, impact on voter support, role
of youth turnout.
- Related keywords: Ko Wen-je, messaging, pan-Green, DPP-aligned, support decline,
KMT-identifying voters, polling data, joint ticket collapse, Hou Yu-ih, KMT base.

Step 3: Consider additional angles or aspects that could be explored based on the
keywords.
- Potential angles: broader implications of the messaging shift, comparison with
other candidates, specific impacts on different voter demographics, longer-term
electoral strategies.

Step 4: Formulate a new question that expands on the original question using the
keywords.

Generate a new question based on the original question and the list of keywords.

New Question: How does Ko Wen-je's strategic shift in messaging towards pan-Green
(DPP-aligned) voters impact the broader electoral landscape in Taiwan, particularly
in comparison to Hou Yu-ih's consolidation of the KMT base following the joint
ticket collapse, and what role do changing polling data and youth turnout play in
this dynamic?
"""

Figure 3.8.1c Sample to Instruct LLM on How to Generate New Question

prompt =

"""
Based on this sample
{sample2}
End of sample.

Here are the given keywords:


{keyword_list[idx]}
Here is the given corresponding question for the keywords above:
{q_list[idx]}

You are an assistant that will leverage your generative abilities to generalize a
given question.
Following how the given sample above break down the thought process to while
generate the new question base on the keywords, generate and return a new question
for the given question and given keywords above in a clear and concise format that
is humanreadable, starting with "New Question: " and ending with a period. Ensure
to use varied question structures and not just follow the sample strictly.

New Question:
"""

Figure 3.8.1d Generate New Query Prompt

4.7.2 New Queries Based On the Original Query

This method aims to generate a query that is similar to the provided question. This utilizes prompt
engineering to have the LLM generate an output meeting specific criteria. The idea behind this
method is to generate a query that expands on the original query, giving more room for interpre-
tation. Expanding on the query means reinterpreting it to give a broader context and bring in new
words. This, in turn, should help to return a more diverse set of documents from the database, lead-
ing to a better response generation by the LLM. The LLM is encouraged to add new keywords to
the question to achieve this. For this, the following specific prompt was created for the LLM:
”You are an assistant that will leverage your generative abilities to generalize a given question.
Here is how you will do it: 1. Take the question and understand its meaning, identifying keywords
and phrases. 2. Generate a question that is similar to the original. Consider adding other keywords
and phrases to expand on the question. 3. Return the generated question and only the question
in English. Provide an in-depth question in a clear and concise format that is human-readable.
The question should not contain leading phrases like ’Here is the expanded message’. Here is the
provided question: question.”
The query generated by the LLM is then used in two ways. The first way is embedding the generated
query and using the new embedding to search the database. This method replaces the original
query with the new one for when the search happens. The second method is to combine the
original query and the generated query into one prompt, and then embed the combined query
to search the database. This way, both queries can be leveraged, giving the embedding access
to a more diverse set of words.
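Below is a minimal sketch of this method using the OpenAI client; the model name is a placeholder,
and the expansion prompt is abbreviated since the full text is quoted above.

from openai import OpenAI

llm = OpenAI()

EXPANSION_PROMPT = (
    "You are an assistant that will leverage your generative abilities to generalize "
    "a given question. ... Here is the provided question: {question}"
)  # abbreviated; the full prompt is quoted above

def expand_query(question):
    response = llm.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": EXPANSION_PROMPT.format(question=question)}],
    )
    expanded = response.choices[0].message.content.strip()
    # First variant: embed and search with the expanded query alone.
    # Second variant: concatenate the original and expanded queries before embedding.
    combined = question + " " + expanded
    return expanded, combined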

4.7.3 Answers Based On the Original Query

This method is similar to the last approach, except that the LLM is asked to answer the query rather
than generate a new one. The intuition is that creating an answer can give new keywords that can be used to
search the database. Answering the query provides a new way to look at the question, giving
room for more interpretation, and leading to new results. The prompt for answer generation is very
similar to query generation:
”You are an assistant that will leverage your generative abilities to answer a given question. This is
what you will do: 1. Take in the question and understand its meaning and what it is asking for. 2.
Generate an answer that gives the best possible answer to the question. 3. Return the generated
answer and only the answer in English. Provide a short answer in a clear and concise format that is
human-readable. The question should not contain leading phrases like ’Here is the answer’. Here
is the provided question: question.”
We use the same testing methods as the second query expansion method.

4.8 Knowledge Graph


4.8.1 Generating Relationships

The process of creating a knowledge graph involves several steps. First, we had to generate rela-
tionships from our text files. This was the most complex part of the process. Initially, we used Mistral’s
Mixtral-8x7B to scan the texts and form relevant relationships. We wrote a prompt for the model that
asked it to identify meaningful words and relationships in the text and store them as we wanted.
There is room for improvement in this area as prompt engineering played a massive role in how
effectively Mixtral completed our task. For future work, taking the time to experiment with many
prompts to find the best-performing format is advisable. As for the relationship generation process,
it went something like this:

Find entities and relation words

Store relationships as triplets

Extract triplets

Figure 4.9.1a Relationship Generation Process

The triplets that the model created were of the form (subject, object, relationship). Each of these
items was surrounded by angled brackets containing keyword identifiers. For example, a triplet
would look like this: <triplet><subj>Joe Biden</subj><obj>United States</obj><rel>president of</rel>
</triplet>. This made it easier for us to extract the relationships from the model output later. Once
we had all the triplets extracted, we stored them in a CSV file. Each row of the CSV contained the
subject, object, and relationship for a triplet, as well as the chunk that the relationship came from
and the file that the chunk came from. After we had generated relationships for all three of our
conflicts, we were ready to upsert them into Neo4j.
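Given the triplet format above, extraction can be done with a simple regular expression. The sketch
below is illustrative; the output file name is a placeholder.

import csv
import re

TRIPLET_PATTERN = re.compile(
    r"<triplet><subj>(.*?)</subj><obj>(.*?)</obj><rel>(.*?)</rel></triplet>"
)

def extract_triplets(model_output, chunk, filename):
    # Pull every (subject, object, relationship) triplet out of the model output and
    # keep the chunk and file it came from alongside it.
    return [(subj, obj, rel, chunk, filename)
            for subj, obj, rel in TRIPLET_PATTERN.findall(model_output)]

def save_triplets(rows, path="relationships.csv"):
    with open(path, "a", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)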
After having tested Mixtral, we found that the relationships being generated were not of the caliber
we were looking for. Often, the model would mix up where to place each part of a triplet. Addi-
tionally, because we were running our code in Kaggle, the process of generating relationships was
taking quite a while. We decided to swap our model to OpenAI’s GPT-4o-mini. The change was
made so that we could run our code on a stronger computer (which we could not do with Mixtral
due to limitations). Also, we hoped that GPT would provide higher-quality relationships, which it did.
Perhaps this can be attributed in part to better prompt engineering, but we also believe that GPT
performed better at extracting entities and relationships generally. Below is the prompt we used
with GPT-4o-mini.

Figure 4.9.1b GPT-4o-mini Prompt

The process for generating relationships with GPT was much the same as it was with Mixtral. The
primary differences were the more powerful system we ran the generation on, and the inclusion
of concurrent requests via asyncio to further speed up our runtime. Between these two additions, the
runtime dropped to almost 1% of what it had been. This allowed us to test more prompts and chunk
sizes.

4.8.2 Neo4j Database Structure

We used the Neo4j graph database to store the relationships. For interacting with the database, such
as creating and querying data, we used the Cypher query language. We stored the data in Neo4j’s cloud
service, Aura, as it has useful graph visualization tools.
From the relationships generated, the subjects and objects were imported as nodes and the rela-
tionships as the edges between these nodes. Each node has a label corresponding to the conflict.
Both the nodes and edges in the database had the following properties:

• text: A string, representing the subject, object, or relationship.
• files: A list, storing the file names in which the subject was found.
• chunks: A list, storing the text chunks in which the subject was found.
• weight: An integer. For nodes, this represents the frequency of the subject’s occurrence in
relationships. For edges, this represents the number of times this specific relationship
is mentioned between the same nodes.
This weight property served as a measure of a subject’s or relationship’s commonality in the
overall text corpus. It aided in querying to find documents and chunks for specific niche
topics. All text was made lowercase to increase the ease of querying.
In total, the database stored over 80 thousand nodes and over 225 thousand edges.

Figure 4.9.2a Neo4j Database Information
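Below is a hedged sketch of how such relationships can be upserted with the official neo4j Python
driver, using MERGE so that repeated subjects and relationships increment the weight property
described above. The connection URI, credentials, node label, and relationship type are placeholders,
not our exact schema.

from neo4j import GraphDatabase

driver = GraphDatabase.driver("neo4j+s://<aura-instance>.databases.neo4j.io",
                              auth=("neo4j", "<password>"))

# "ChinaTaiwan" stands in for the per-conflict node label; RELATED is a generic
# relationship type carrying the relationship text as a property.
UPSERT_RELATIONSHIP = """
MERGE (s:ChinaTaiwan {text: $subj})
  ON CREATE SET s.weight = 1, s.files = [$file], s.chunks = [$chunk]
  ON MATCH SET s.weight = s.weight + 1, s.files = s.files + [$file], s.chunks = s.chunks + [$chunk]
MERGE (o:ChinaTaiwan {text: $obj})
  ON CREATE SET o.weight = 1, o.files = [$file], o.chunks = [$chunk]
  ON MATCH SET o.weight = o.weight + 1, o.files = o.files + [$file], o.chunks = o.chunks + [$chunk]
MERGE (s)-[r:RELATED {text: $rel}]->(o)
  ON CREATE SET r.weight = 1, r.files = [$file], r.chunks = [$chunk]
  ON MATCH SET r.weight = r.weight + 1
"""

def upsert_relationship(subj, obj, rel, chunk, file):
    with driver.session() as session:
        session.run(UPSERT_RELATIONSHIP,
                    subj=subj.lower(), obj=obj.lower(), rel=rel.lower(),
                    chunk=chunk, file=file)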

4.8.3 Querying Methods

The first method attempted was full creation of Cypher queries using OpenAI. This was done by
giving the LLM the graph schema and prompting it to create Cypher queries that would help answer
the question. This had issues: Cypher queries were returned with syntax errors, and some queries
would time out due to overcomplication or infinite loops. LangChain’s GraphCypherQAChain
class, for query creation with OpenAI, was also tested but tended to return queries too simple
for the questions.
For the remaining methods, the LLM was first used to categorize the question into one of the three
conflicts, and that category was used in the query to specify the node label.
The second method was a node search. First, OpenAI created a list of keywords and phrases from
the question. Given this list, a query found every node whose text matched one of the keywords
or phrases. We then compiled all of the files and chunks those nodes came from and returned the
most common ones. To break ties between files and chunks, each occurrence was weighted: the
file weight is 1 + 1/(number of files in the node), and the chunk weight is 1 + 1/(number of chunks in
the node). A sketch of this search follows.
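The sketch below shows the general shape of this node search, assuming the neo4j Python driver and the schema described earlier; the label, property names, and exact Cypher are simplified stand-ins for our query.

from collections import Counter
from neo4j import GraphDatabase

driver = GraphDatabase.driver("neo4j+s://<aura-host>", auth=("neo4j", "<password>"))

# Find nodes whose text exactly matches one of the extracted keywords or phrases.
NODE_QUERY = """
MATCH (n:Conflict)
WHERE n.t IN $keywords
RETURN n.files AS files, n.chunks AS chunks
"""

def node_search(keywords, k=10):
    """Aggregate the files and chunks of matching nodes into weighted top-k lists."""
    file_scores, chunk_scores = Counter(), Counter()
    with driver.session() as session:
        for record in session.run(NODE_QUERY, keywords=[kw.lower() for kw in keywords]):
            files, chunks = record["files"], record["chunks"]
            # Each node votes for its files and chunks; the 1 + 1/len(...) term
            # breaks ties in favour of nodes tied to fewer sources.
            for f in files:
                file_scores[f] += 1 + 1 / len(files)
            for c in chunks:
                chunk_scores[c] += 1 + 1 / len(chunks)
    return file_scores.most_common(k), chunk_scores.most_common(k)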
The third method was a modified version of the keyword search that limits the number of returned
nodes by searching for paths. As above, OpenAI creates a list of keywords and phrases from the
question. Given this list, a path query looks for any path of length one that connects two of the
keywords or phrases; if none exist, the path length is increased to two and the query re-run. The
query also separates high-weighted nodes from low-weighted nodes, so the resulting paths connect
general, frequently occurring keywords with more specific words and ideas. Searching by path
imposes stronger constraints, which in some instances causes nothing to be returned or files to be
overlooked. As a result, the accuracy was lower, but the quality of the files and chunks that were
returned was usually greater.
The last method, which ended up working the best, combined the node search and the path search.
It ran both queries and then merged the weighted lists of files and chunks. The node search was
general and provided a large amount of data; combining it with the path search's fewer but more
targeted returns boosted the files and chunks that were most relevant.
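A minimal sketch of the merge step is given below; it assumes both searches return weighted Counter objects like the node_search sketch above, and the names are illustrative.

from collections import Counter

def combine_results(node_scores: Counter, path_scores: Counter, k: int = 10):
    """Merge the broad node-search weights with the stricter path-search weights.

    Files or chunks found by both queries have their weights summed, so the few
    but highly relevant path results boost items the node search already found.
    """
    combined = Counter(node_scores)
    combined.update(path_scores)  # Counter.update adds weights element-wise
    return combined.most_common(k)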

4.9 Testing Methods


For testing, each query returns nodes or paths, and by processing their files and chunks properties
we obtain a top-k list of files and a top-k list of chunks. We compare the file name that the question
was generated from to the file names in the top-k results. If the correct file name appears at least
once in the top-k results, the query counts as a hit; if it does not appear, the query counts as a miss.
Accuracy is the number of hits divided by the total number of hits and misses. We repeat the same
process using chunks instead of file names.
For the second set of 1,977 questions, we added an additional testing approach. Since each
question was generated from a paragraph of between 1,500 and 4,000 characters, we checked
whether any of the chunks retrieved for a query was part of that question's ground-truth paragraph.
If at least one retrieved chunk is part of the ground truth, the query counts as one hit. As in the first
testing approach, the final accuracy is the total number of hits divided by the total number of
search queries.
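The accuracy computation for both approaches reduces to a simple top-k hit check; the sketch below assumes each test record carries the ground-truth file or paragraph and the ranked retrievals under the illustrative field names shown.

def topk_accuracy(records, k=5, mode="file"):
    """Share of questions whose correct source appears in the top-k retrievals.

    mode="file"  : hit if the ground-truth file name is among the top-k files.
    mode="chunk" : hit if any top-k chunk is contained in the ground-truth paragraph.
    """
    hits = 0
    for r in records:
        if mode == "file":
            top = [name for name, _ in r["top_files"][:k]]
            hit = r["ground_truth_file"] in top
        else:
            top = [chunk for chunk, _ in r["top_chunks"][:k]]
            hit = any(chunk in r["ground_truth_paragraph"] for chunk in top)
        hits += int(hit)
    return hits / len(records)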

5 PERFORMANCE ANALYSIS
5.1 Qdrant Results
5.1.1 Chunking Results

During the first couple of rounds of analysis, using the Qdrant collections created from the minimally
cleaned dataset, we encountered several issues. One of them was having multiple points in Qdrant
that express the same idea in nearly identical wording, as shown in Figure 5.1.1a. We therefore
applied a cosine-similarity check that pulled the first 1,000 points from a Qdrant collection and
compared them pairwise, which revealed many more near-duplicate points than we expected. As
seen in Figures 5.1.1a and 5.1.1b, those points contain the same sentences but come from different
text files. After careful examination, we came to two realizations: first, text files that share the same
topic, such as the China-Taiwan conflict, often have very similar or identical opening sentences;
second, their Abstract/Key Takeaways paragraphs tend to be repeated across files and within the
main content. Because of these duplicated points in the collections, the results calculated from
Qdrant at this stage were not entirely accurate, and the issue also affected how the query expansion
experiment would turn out. We therefore decided to perform the optimal cleaning process on the
raw database and to re-chunk and re-embed it.
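The duplicate check amounted to pulling a batch of points and comparing their embeddings pairwise; a minimal sketch using the qdrant-client scroll API and NumPy follows, with the collection name and similarity threshold as illustrative values.

import numpy as np
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")  # placeholder connection

# Pull the first 1,000 points, including their vectors, from one collection.
points, _ = client.scroll(
    collection_name="single_sentence",  # illustrative collection name
    limit=1000,
    with_vectors=True,
    with_payload=True,
)

vectors = np.array([p.vector for p in points])
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)  # normalise for cosine similarity
similarities = vectors @ vectors.T

# Flag pairs of distinct points whose cosine similarity exceeds a high threshold.
THRESHOLD = 0.98
duplicates = [
    (points[i].id, points[j].id, float(similarities[i, j]))
    for i in range(len(points))
    for j in range(i + 1, len(points))
    if similarities[i, j] > THRESHOLD
]
print(f"Found {len(duplicates)} near-duplicate pairs")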

Figure 5.1.1a Duplicated Points Check - Abstract Paragraph

Figure 5.1.1b Duplicated Points Check - Introduction Section

We experimented with 10 different chunking techniques, with one collection per technique: 10
collections for dataset 1 (minimal cleaning) and another 10 for dataset 2 (optimal cleaning), for a
final total of 20 collections.
As mentioned above, as the experiment progressed we performed additional cleaning on our
database, which yielded the final optimally cleaned database and a set of 1,977 new questions.
The results below come from the optimally cleaned database with the 1,977 generated questions.

Result Set | Optimal Cleaning | Second Testing Approach | 1,977-question Set

Chunking method            | Accuracy (k = 1) | Accuracy (k = 5) | Accuracy (k = 10)
Baseline (single sentence) | 51.7451%         | 69.6510%         | 74.8103%
150 tokens (50 overlap)    | 18.7658%         | 35.8624%         | 41.4770%
200 tokens (50 overlap)    | 15.4780%         | 29.2868%         | 34.2438%
250 tokens (75 overlap)    | 11.8867%         | 22.2054%         | 25.0885%
500 tokens (100 overlap)   | 0.3541%          | 0.5058%          | 0.6070%
2000 characters            | 20.5362%         | 33.1310%         | 38.7962%
Semantic                   | 6.5250%          | 8.7001%          | 9.1553%
Paragraph                  | 0.4552%          | 0.5564%          | 0.6576%
Single sentence w/spaCy    | 49.9241%         | 68.1335%         | 73.0905%
Two sentences w/spaCy      | 12.1396%         | 21.8007%         | 26.5048%

Per-Conflict Test Results

Chunking method         | China-Taiwan | Ukraine-Russia | Israel-Hamas
Single sentence         | 55.56%       | 72.67%         | 93.33%
Single sentence w/spaCy | 55.56%       | 68.14%         | 80.00%

Since the 1,977-question dataset was generated from individual paragraphs of between 1,500 and
4,000 characters, our baseline test is the second approach, which checks whether the ground-truth
paragraph contains any of the retrieved chunks. Given the results above, we chose single-sentence
chunking as our baseline collection.

5.1.2 Re-ranking Results

Temporal re-ranking:

Method   | Accuracy (k = 1) | Accuracy (k = 5) | Accuracy (k = 10)
Baseline | 51.8462%         | 69.6510%         | 74.8103%
50/50    | 51.4922%         | 69.2969%         | 74.5574%
60/40    | 51.9474%         | 69.3981%         | 74.7597%
70/30    | 51.5427%         | 69.4487%         | 75.0126%
80/20    | 51.2898%         | 69.3475%         | 74.9621%
40/60    | 50.0759%         | 69.2463%         | 74.4057%
30/70    | 45.4224%         | 68.1841%         | 74.4562%
20/80    | 36.3177%         | 64.3905%         | 73.2423%
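Assuming each split in the table denotes the relative weight of the retrieval similarity score versus a document-recency score, a minimal sketch of the blend is shown below; the tuple layout, recency normalisation, and field names are illustrative rather than our exact implementation.

from datetime import datetime, timezone

def temporal_rerank(hits, alpha=0.6, now=None):
    """Re-rank hits by a weighted blend of similarity and recency.

    hits  : list of (chunk, similarity, published_at) tuples, with timezone-aware dates.
    alpha : weight on the similarity score; alpha=0.6 corresponds to a 60/40 split.
    """
    now = now or datetime.now(timezone.utc)
    ages = [(now - published).days for _, _, published in hits]
    max_age = max(ages) or 1  # avoid division by zero when all dates are equal
    rescored = []
    for (chunk, sim, _), age in zip(hits, ages):
        recency = 1 - age / max_age  # 1 for the newest document, 0 for the oldest
        rescored.append((chunk, alpha * sim + (1 - alpha) * recency))
    return sorted(rescored, key=lambda item: item[1], reverse=True)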

5.1.3 Query Expansion Results

Keyword Extraction

Configuration 1 - All 5 chunks from each query search; first 100 questions of the 1,977-question set
• No limit on the number of keywords; the LLM decides how many to generate.
• A few entries produced quite long keywords (more than 3-4 words).
• Only the retrieved chunks were fed to the LLM for keyword extraction (zero-shot).
Accuracy: original query 72% | expanded query 21%

Configuration 2 - Top 3 of the 5 retrieved chunks; first 100 questions of the 1,977-question set
• Limited to 8 keywords.
• Generated with new prompts.
• Attempted few-shot learning.
Accuracy: original query 69% | expanded query 23%

Configuration 3 - No chunks; keywords extracted solely from the original query; first 100 questions of the 1,977-question set
• Limited to 10 keywords.
• Instead of extracting keywords from chunks, this approach extracts them from the original query.
• New prompts with few-shot learning.
Accuracy: original query 69% | expanded query 48%

New Queries and Answers Generated from the Original Query

Method                                   | Accuracy (k = 1) | Accuracy (k = 5) | Accuracy (k = 10)
Baseline (no query expansion)            | 44%              | 69%              | 72%
Generated question                       | 36%              | 55%              | 61%
Original question and generated question | 43%              | 64%              | 70%
Generated answer                         | 38%              | 59%              | 67%
Original question and generated answer   | 40%              | 58%              | 71%
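For reference, the expansion variants in the table can each be produced with a single LLM call per question; the sketch below assumes the v1 openai Python client, and the prompt wording and mode names are illustrative rather than our exact prompts.

from openai import OpenAI  # assumes the v1 openai Python client

client = OpenAI()

def expand_query(question: str, mode: str = "original+answer") -> str:
    """Build an expanded query in the spirit of the variants in the table above."""
    if "question" in mode:
        instruction = "Rewrite this question as a different, more detailed question."
    else:
        instruction = "Write a short, plausible answer to this question."
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"{instruction}\n\n{question}"}],
    )
    generated = response.choices[0].message.content.strip()
    # "original+..." modes concatenate the original question with the generated text.
    return f"{question} {generated}" if mode.startswith("original+") else generated

# Example: expanded = expand_query("Why did Russia invade Ukraine?", "original+answer")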

Although prior studies have shown that query expansion with keyword extraction generally improves
the performance of a base RAG pipeline, for us it performed slightly worse than the baseline. We
considered several hypotheses that might explain this result.
• Improper baseline testing: Our baseline test may not be the right way to evaluate query expansion,
since query expansion is meant to broaden the content of the retrieved chunks. Chunks retrieved
with expanded queries are expected to contain more information than chunks retrieved with the
original queries, so they may not contain the exact sentence from the ground-truth paragraph that
the baseline test expects. Even if the expansion improves the context by supplying more terms and
related information, our baseline test would not count it as an improvement.
• Different large language model: Each model responds best to prompts written in its own preferred
style. Without the time limitation, studying how our LLM behaves with different prompts might have
allowed us to improve the prompts and, in turn, the results.

5.2 Knowledge Graph Results


We have categorized our outcomes into four sections: results from the node query method, results
from the path query method, results from the combined method, and manually reviewed tests.
Each test was run with 100 questions randomly selected from the questions database. China-
Taiwan has the smallest number of questions, 148, while Israel-Hamas and Russia-Ukraine have 1,485
and 3,195 respectively. Given this, the results have been split by conflict, and any column representing
all conflicts is heavily influenced by Russia-Ukraine and Israel-Hamas.

5.2.1 Node Query Method

Node Query Results

Conflict       | File (k=1) | File (k=5) | File (k=10) | Chunk (k=1) | Chunk (k=5) | Chunk (k=10)
China-Taiwan   | 60%        | 82%        | 89%         | 51%         | 69%         | 75%
Israel-Hamas   | 24%        | 47%        | 52%         | 32%         | 42%         | 47%
Russia-Ukraine | 27%        | 46%        | 56%         | 28%         | 45%         | 48%

The node query method was the most successful way of querying the knowledge graph. The top-k
ranking comes from the frequency of each file or chunk across the returned nodes. If k is not limited,
the file and chunk accuracy rates are 96% and 84% respectively, so this method returns broadly
and the optimization lies in limiting the results and choosing how to weight the returned files and
chunks. We tried different weighting options involving the node weight and the file or chunk counts.
The option that showed the greatest promise weighted each file as 1 + 1/(number of files in the
node) and each chunk as 1 + 1/(number of chunks in the node). This weighting increased accuracy
by around 15%.

5.2.2 Path Query Results

Path Query Results

Conflict       | File (k=1) | File (k=5) | File (k=10) | Chunk (k=1) | Chunk (k=5) | Chunk (k=10)
China-Taiwan   | 47%        | 64%        | 67%         | 31%         | 41%         | 49%
Israel-Hamas   | 20%        | 32%        | 37%         | 19%         | 29%         | 32%
Russia-Ukraine | 20%        | 30%        | 31%         | 21%         | 29%         | 30%

The path query method's results are significantly worse than the node (keyword) results. This is
primarily because the path query uses the same keywords but limits the results to valid paths. In
the query shown in Listing 1, the condition n1.w < 100 excludes nodes with a weight of 100 or more;
this prevents queries from only returning paths between two highly weighted, common nodes.

MATCH path = (n1:label)-[l:link]-(n2:label)
// the start node must match a keyword and have a weight below 100, and either
// the edge type contains a keyword and the end node matches a related term ...
WHERE (ANY(x IN keyword WHERE n1.t = x AND n1.w < 100)
AND ANY(x IN keyword WHERE l.ty CONTAINS x) AND ANY(x IN rel WHERE n2.t = x))
// ... or only the start node and the edge type need to match
OR (ANY(x IN keyword WHERE n1.t = x AND n1.w < 100)
AND ANY(x IN keyword WHERE l.ty CONTAINS x))
RETURN path
LIMIT 150
Listing 1: Cypher Query

Although the results are lower, this method could still be optimized to produce better results. For
most queries, the number of files and chunks returned is around five, and sometimes none at all,
which can be seen in the diminishing returns as k is increased. This indicates that querying by path
is too constrictive. It could be improved by creating more relationships: on inspection, certain
connections one would assume to be present between nodes are missing. With the China dataset,
we tested combining five runs of relationship creation and found a 10% improvement in accuracy.
This was not continued for the larger datasets due to model constraints, but it showed promising
results.

5.2.3 Combining Node and Path Query Results

Combined Node and Path Query Results

Conflict       | File (k=1) | File (k=5) | File (k=10) | Chunk (k=1) | Chunk (k=5) | Chunk (k=10)
China-Taiwan   | 70%        | 86%        | 90%         | 45%         | 66%         | 71%
Israel-Hamas   | 41%        | 51%        | 56%         | 30%         | 43%         | 49%
Russia-Ukraine | 43%        | 50%        | 53%         | 33%         | 42%         | 47%

Combining the node and path search queries led to improved results. The methods are combined
by running both queries separately to get each one's list of files and chunks, then merging those lists
so that a new top-k can be retrieved. This lets the more general node method return many files while
the stricter path queries boost the files and chunks that both methods agree on.

5.2.4 Manually Reviewed Results

Some of the questions are very general or can be answered by texts other than the one the question
was generated from. Given this, we manually reviewed 20 questions to see whether the retrieved
chunks helped answer the question at k = 5. For the node method the computed accuracy was
0.40, but when manually evaluating the chunks, accuracy was determined to be 0.85. Repeating
this with the path query gave a computed accuracy of 0.3, and after inspection accuracy was
measured as 0.5. This suggests that retrieval is working better than the tests indicate and that the
testing criteria could be re-evaluated.

5.2.5 Graph and Vector Results Compared

These results come from testing 100 random questions from the 1,977-question dataset. These ques-
tions are almost all about the conflict in Russia and Ukraine.

Results for 100 Questions from the 1,977-Question Dataset

Method                  | File (k=1) | File (k=5) | File (k=10) | Chunk (k=1) | Chunk (k=5) | Chunk (k=10)
Vector, single sentence | 50%        | 72%        | 78%         | 51%         | 71%         | 78%
Graph, nodes and path   | 47%        | 59%        | 63%         | 29%         | 45%         | 48%

Vector retrieval produced better results than graph retrieval. Even so, knowledge graphs show some
promise for RAG with LLMs and can be comparable to traditional vector-based RAG.

6 CONCLUSION
6.1 Lessons Learned
Perhaps the biggest lesson we have taken away from this project is the importance of clean data.
In the first few weeks of the project, we were using the data without any form of cleaning. Aside
from our results taking hits in accuracy, we also had to find workarounds for data entries that
contained certain special characters, citations, or hyperlinks, or that were entirely empty. Working with
cleaner data was far easier and produced better results for us.
Another point of reflection is our prompt engineering. Working with LLMs requires a level of under-
standing of how a model functions so that one can draft prompts that will yield desirable responses.
Given that we were new to this subject, we believe that the prompts we fed Mixtral-8x7B and GPT-
4o-mini could have been more concise and better-structured. When working with query expansion,
the expanded queries slightly under-performed, which was not expected; with better prompting,
the LLM might have generated better responses. For the knowledge graph, our first prompt was so
poorly written that the model hardly knew what its task was and often returned triplets with only
one entity, or relationships that made little sense. Our prompting improved as we tested, and
eventually we were able to get usable results; however, it became clear over the course of the
project that engineering prompts is truly a skill, requiring time and testing to fully understand.

6.2 Challenges
Early into development, there were delays in getting a model approved for use on our high-performance
computer (HPC), making the process of working with the data and Qdrant less streamlined. To work
with the data and the model in another setting, we resorted to Kaggle, an online service for the
data science and machine learning community. However, we quickly learned that Kaggle has
significant limitations: the capped GPU compute time (30 hours per week) and fewer GPUs meant
we had to wait much longer for simple tasks like embedding and getting a model running. This also
created significantly more overhead: since Qdrant was running on the HPC, we had to constantly
transfer data back and forth whenever we needed to upload or test it.
Another challenge was keeping the data clean throughout the project. Sometimes we believed
the data was in a good state, only to find that a couple of files had problems we needed to fix.
This back and forth, along with the overhead from using Kaggle, made the process more
time-intensive than it otherwise would have been.

6.3 Future Works


Given the short lifespan of this project, there were many ideas that we did not have the time to
experiment with.
For one, we only tested query expansion with the first 100 questions in the questions database. Due
to time constraints, we could not do an expansion with all 1,977 questions. Testing with the rest of
the questions should provide a better idea of how query expansion performs in the long run and if it
is worth the computational effort or not. The 100 questions were good enough to provide a rough
estimate, but not the complete picture.
As noted above, prompt engineering is another area we would spend more time on, along with
getting to know how our large language models work with different prompts. Tweaking prompts
and figuring out what works best would provide valuable insight and show how much our methods
improve with better prompting.
We believe that better organizing the graph through the use of more labels, as well as creating more
relationships, would have improved query performance.
Combining both methods to create a vectorized knowledge graph could also greatly improve
query results while keeping the advantages of both RAG methods; this is a promising approach
that we hope to test in the future [2].
As said above, given the time limitation, our testing baseline is quite straightforward and is not the
most appropriate way to evaluate the performance of our RAG pipeline, especially when it comes
to query expansion. With more time, we would want to evaluate our experiments in a more rigorous
setup that takes the retrieved chunks, generates a response from them, and then evaluates the
response's relevance to the ground truth. This would potentially be a fairer way to weigh our
experiments.

REFERENCES
[1] Y. Gao, Y. Xiong, X. Gao, et al., Retrieval-augmented generation for large language models: A survey, 2024. arXiv: 2312.10997 [cs.CL]. [Online]. Available: https://arxiv.org/abs/2312.10997.
[2] N. Matsumoto, J. Moran, H. Choi, et al., "KRAGEN: a knowledge graph-enhanced RAG framework for biomedical problem solving using large language models", Bioinformatics, vol. 40, no. 6, btae353, Jun. 2024, ISSN: 1367-4811. DOI: 10.1093/bioinformatics/btae353. [Online]. Available: https://doi.org/10.1093/bioinformatics/btae353.
[3] Y. Chang, X. Wang, J. Wang, et al., "A survey on evaluation of large language models", ACM Trans. Intell. Syst. Technol., vol. 15, no. 3, 2024, ISSN: 2157-6904. DOI: 10.1145/3641289. [Online]. Available: https://doi.org/10.1145/3641289.
[4] W. X. Zhao, K. Zhou, J. Li, et al., A survey of large language models, 2023. arXiv: 2303.18223 [cs.CL]. [Online]. Available: https://arxiv.org/abs/2303.18223.
[5] H. Naveed, A. U. Khan, S. Qiu, et al., A comprehensive overview of large language models, 2024. arXiv: 2307.06435 [cs.CL]. [Online]. Available: https://arxiv.org/abs/2307.06435.
[6] A. Hogan, E. Blomqvist, M. Cochez, et al., "Knowledge graphs", ACM Comput. Surv., vol. 54, no. 4, 2021, ISSN: 0360-0300. DOI: 10.1145/3447772. [Online]. Available: https://doi.org/10.1145/3447772.
[7] N. Kunungo, "How vector databases search by similarity: A comprehensive prime", 2023. (accessed: Aug. 1, 2024).
[8] L. Wang, N. Yang, X. Huang, L. Yang, R. Majumder, and F. Wei, "Multilingual E5 text embeddings: A technical report", arXiv preprint arXiv:2402.05672, 2024.
[9] R. Nogueira, W. Yang, K. Cho, and J. Lin, Multi-stage document ranking with BERT, 2019. arXiv: 1910.14424 [cs.IR]. [Online]. Available: https://arxiv.org/abs/1910.14424.
[10] R. Nogueira and K. Cho, Passage re-ranking with BERT, 2020. arXiv: 1901.04085 [cs.IR]. [Online]. Available: https://arxiv.org/abs/1901.04085.
[11] Y.-S. Chuang, W. Fang, S.-W. Li, W.-t. Yih, and J. Glass, Expand, rerank, and retrieve: Query reranking for open-domain question answering, 2023. arXiv: 2305.17080 [cs.CL]. [Online]. Available: https://arxiv.org/abs/2305.17080.
[12] L. Wang, N. Yang, and F. Wei, Query2doc: Query expansion with large language models, 2023. arXiv: 2303.07678 [cs.IR]. [Online]. Available: https://arxiv.org/abs/2303.07678.
[13] R. Jagerman, H. Zhuang, Z. Qin, X. Wang, and M. Bendersky, Query expansion by prompting large language models, 2023. arXiv: 2305.03653 [cs.IR]. [Online]. Available: https://arxiv.org/abs/2305.03653.
[14] N. F. Liu, et al., "Lost in the middle: How language models use long context", 2023. [Online]. Available: https://doi.org/10.1162/tacl_a_00638.
[15] R. Jagerman, H. Zhuang, Z. Qin, X. Wang, and M. Bendersky, Query expansion by prompting large language models, 2023. arXiv: 2305.03653 [cs.CL]. [Online]. Available: https://arxiv.org/pdf/2305.03653.
