
SAKU-MATTI SYRJÄ

Retrieval-Augmented Generation
Utilizing SQL Database

Case: Web Sport Statistics Application

DEGREE PROGRAMME IN DATA ENGINEERING


2024
ABSTRACT

This thesis investigates the implementation of Retrieval-Augmented Genera-


tion (RAG) techniques in conjunction with SQL databases to enhance the nu-
merical accuracy of Large Language Models (LLMs). While LLMs have
demonstrated remarkable capabilities in natural language processing, they of-
ten struggle with numerical precision and statistical information retrieval. This
research addresses this limitation through a case study implementing a sports
statistics application, specifically focusing on NHL data from 2008 to 2024.

The study comprises a comprehensive literature review examining transformer


architectures, SQL database systems, and RAG methodologies, followed by a
detailed implementation of a full-stack web application. The application utilizes
OpenAI's GPT-4o-mini model integrated with a PostgreSQL database contain-
ing comprehensive NHL statistics. The implementation demonstrates a novel
approach to RAG by employing direct text-to-SQL query generation rather than
traditional vector search methods.

Key findings reveal that while LLMs can effectively generate SQL queries for
statistical retrieval, challenges persist in database design paradigms, where
traditional normalization principles proved counterproductive for RAG applica-
tions. The study identifies specific limitations in handling season formatting and
column selection, while also highlighting the potential for production-level ap-
plications. The research contributes to the field by presenting a practical frame-
work for implementing SQL-based RAG systems and identifies areas for future
improvement, including dataset optimization and model fine-tuning opportuni-
ties.

This work provides valuable insights into the integration of LLMs with struc-
tured databases and offers a foundation for developing more accurate and re-
liable statistical retrieval systems.
PREFACE

This thesis represents the culmination of my studies in 2022-2025 at Satakunta


University of Applied Sciences (SAMK). The work combines my interest in artificial intelligence with the practical applications of modern language models. The theme of the study touches on one of my long-time interests and was therefore very pleasant to conduct.

I would like to express my gratitude to my thesis supervisor Aleksi Postari for


their invaluable guidance and feedback throughout this research process. I
would also like to express my gratitude to Jussi Bergman, Toni Aaltonen and Jeffrey Salahub, who have been invaluable teachers throughout my journey in these years.

Special thanks go to MoneyPuck for providing access to their comprehensive


NHL statistics database, which formed the foundation of this research.

This work would not have been possible without the support of the open-source
community, particularly the developers of various tools and libraries utilized in
this project. Their contributions to the field have been invaluable.

Special thanks to the companies that provide invaluable tools for research and study purposes: OpenAI, Anthropic, Perplexity and Meta.
CONTENTS

1 INTRODUCTION ........................................................................................ 7
1.1 Motivation ............................................................................................. 7
1.2 Goals .................................................................................................... 8
2 LITERATURE REVIEW .............................................................................. 8
2.1 Artificial Intelligence (AI) ....................................................................... 8
2.2 History of language model architectures ............................................ 10
2.3 Transformers Encoder Architecture .................................................... 11
2.3.1 Input Embeddings ...................................................................... 14
2.3.2 Positional Encoding ................................................................... 14
2.3.3 Multi-Head Attention .................................................................. 15
2.3.4 Layer Normalization and Residual Connections ........................ 18
2.4 Transformers Decoder Architecture ................................................... 20
2.4.1 Decoder Embeddings and Positional Encoding ......................... 22
2.4.2 Masked Multi-head Attention ..................................................... 22
2.4.3 Encoder-Decoder Attention ....................................................... 24
2.4.4 Decoders Token Prediction ....................................................... 25
2.5 Scale and Computational Requirements ............................................ 26
2.6 Retrieval Augmented Generation (RAG) ............................................ 26
2.7 SQL database .................................................................................... 27
2.7.1 Text-to-SQL query ..................................................................... 28
3 CASE STUDY ........................................................................................... 29
3.1 Planning phase ................................................................................... 29
3.2 Data collection .................................................................................... 30
3.2.1 Database Normalization ............................................................ 31
3.3 Backend Architecture ......................................................................... 31
3.3.1 Handling User Queries .............................................................. 33
3.3.2 Generating SQL queries ............................................................ 38
3.3.3 Testing And Correcting Generated SQL Query ......................... 41
3.3.4 Generating Answer .................................................................... 42
3.3.5 Chat History ............................................................................... 43
3.3.6 Backend Integration Architecture ............................................... 43
3.4 Frontend Architecture ......................................................................... 44
3.4.1 User Interface and Interaction.................................................... 45
3.5 Deployment of the Application ............................................................ 48
3.5.1 Web Deployment ....................................................................... 49
4 DISCUSSION............................................................................................ 50
5 CONCLUSION .......................................................................................... 51
REFERENCES ............................................................................................ 53
LIST OF SYMBOLS AND TERMS

AI = Artificial Intelligence
LLM = Large Language Model
SQL = Structured Query Language
UI = User Interface
API = Application Programming Interface
JSON = JavaScript Object Notation

1 INTRODUCTION

The emergence of advanced language models has transformed the landscape


of natural language processing (Brown et al., 2020), while simultaneously re-
vealing limitations in handling precise numerical data (Zhu, Dai, & Sui, 2024).
This study explores an approach to address these limitations through direct
integration with structured databases, focusing specifically on sports statistics
applications.

1.1 Motivation

Since the public release of ChatGPT-3.5 in November 2022 (OpenAI, 2022),


large language models have gained significant popularity. Transformer-based
language models have established themselves as state-of-the-art in natural
language processing. However, these models fundamentally operate by using
probabilistic token prediction, with their knowledge limited to their training data
(Vaswani et al., 2017). Research has shown that they particularly struggle with
numerical accuracy and specific statistical information, often producing hallu-
cinations or approximations when dealing with precise numerical data (Zhu,
Dai, & Sui, 2024). This limitation is especially evident in domains requiring
accurate statistical recall, such as sports analytics, where models might un-
derstand conceptual aspects but fail to provide accurate numerical data (Xia
et al., 2024). To address these limitations, researchers have developed ap-
proaches to augment language models with external knowledge sources. One
such approach is Retrieval-Augmented Generation (RAG), which allows lan-
guage models to access and utilize current, accurate information beyond their
training data. This method has shown promising results in improving the accu-
racy and reliability of language model responses, particularly in domains re-
quiring precise factual information or up-to-date knowledge (Lewis et al.,
2020).

1.2 Goals

The primary objective of this study is to investigate methods for enabling large
language models to accurately handle numerical data through direct SQL da-
tabase integration. This research addresses the following questions:

1. How can Language Models be utilized in Retrieval Augmented Gener-


ation from SQL databases?
2. What are the practical considerations and challenges in implementing
such a system for real-world statistics applications?

To address these research questions, this study consists of two main compo-
nents. First, a comprehensive literature review examines the underlying tech-
nologies: transformer architecture in large language models, SQL database
systems, and Retrieval-Augmented Generation (RAG). Second, through a de-
tailed case study, a practical implementation and integration of these technol-
ogies is demonstrated, providing empirical evidence of their effectiveness and
limitations in real-world applications.

2 LITERATURE REVIEW

This literature review examines the fundamental concepts and technologies


relevant to implementing language model systems with statistical database in-
tegration. The review begins with the foundations of artificial intelligence, pro-
gresses through the evolution of language models, and concludes with current
approaches to retrieval-augmented generation and database integration.

2.1 Artificial Intelligence (AI)

Artificial Intelligence can be defined in many ways. As shown in Figure 1, the
definitions of artificial intelligence can be categorized into thinking humanly,
thinking rationally, acting humanly, and acting rationally (Russell & Norvig,
2010).

Figure 1 Definitions of artificial intelligence, organized into four categories. From Artificial Intelligence: A
Modern Approach (3rd ed., p. 2), by S. J. Russell & P. Norvig, 2010, Pearson.

Based on these categorizations, it becomes apparent that defining AI is chal-


lenging due to the varying approaches and interpretations of intelligence
(McCarthy & Hayes, 1969; Nilsson, 2009). The complexity of establishing a
singular definition arises from AI's multifaceted objectives: replicating human-
like thinking, implementing rational action, and achieving autonomous opera-
tion (Poole & Mackworth, 2017). The difficulty in defining AI stems from the
diversity of objectives within the field. Some researchers focus on replicating
human-like behaviour, aiming to create systems that can perform tasks in a
manner indistinguishable from humans (Turing, 1950; Searle, 1980), while oth-
ers prioritize creating systems that act based on logic and reasoning (Simon,
1996; Legg & Hutter, 2007). This variance in objectives raises a question on
the purpose of AI: Should the field aim to replicate human intelligence and
behaviour, or should it focus on developing AI systems optimized for specific
applications (Brooks, 1991; Nilsson, 2009)?

Russell and Norvig (2010) draw an insightful parallel between the development of artificial flight and artificial intelligence, pointing out that successful powered flight was achieved only when inventors stopped trying to replicate bird flight and instead focused on understanding fundamental aerodynamics.

I think this analogy is particularly relevant to AI development - rather than at-


tempting to precisely replicate human cognitive processes, the field should fo-
cus on leveraging the unique capabilities of computers. Just as aeronautical
engineering evolved beyond biomimicry to develop its own principles, AI de-
velopment should prioritize finding efficient computational solutions rather than
strictly adhering to human-like approaches.

2.2 History of language model architectures

The modern architecture of language models was introduced by Vaswani et


al. (2017). This seminal research paper titled "Attention is all you need" revo-
lutionized Natural Language Processing (NLP) and translation tasks through
the introduction of the transformer architecture, with self-attention as its key
feature (Vaswani et al., 2017). Prior to this breakthrough, NLP and other se-
quential modelling tasks were predominantly handled with recurrent neural
networks (RNN) (Mikolov et al., 2010). The primary advantage of RNNs was
their long short-term memory (LSTM) implementation utilizing gate mecha-
nisms (Hochreiter & Schmidhuber, 1997). LSTM, a specialized type of RNN,
was designed to capture long-term dependencies in sequential data through
its gating mechanism (Graves, 2013). This mechanism enables LSTMs to se-
lectively remember or forget information over long sequences, addressing the
vanishing gradient problem and making them particularly effective for tasks
requiring memory of earlier inputs (Pascanu et al., 2013). However, Vaswani
et al. (2017) identified a fundamental limitation of RNNs: their inability to sup-
port parallel training. Due to the sequential nature of memory gates relying on
previous tokens, parallel computation is not possible. Instead, calculations
must be performed sequentially, making the process time-consuming and impractical for large-scale datasets (Young et al., 2018).

2.3 Transformers Encoder Architecture

The Transformer architecture, introduced by Vaswani et al. (2017) in their pa-


per "Attention is All You Need," represents a significant advancement in natu-
ral language processing, particularly in its ability to perform parallel computa-
tions. The model they introduce consists of an encoder and a decoder, as
shown in Figure 2. The encoder handles the vectorization of the input, and the
decoder makes token predictions. The outline of the encoder architecture is
shown in Figure 3.

Figure 2 Transformer model architecture (Vaswani et al., 2017, p. 3)



Figure 3 Encoder Architecture



2.3.1 Input Embeddings

Natural Language Processing (NLP) models utilize input embeddings to trans-


form discrete textual input sequences into continuous vector representations
(Vaswani et al., 2017). Rather than employing traditional word-based vocabu-
laries, modern NLP systems implement subword tokenization strategies,
where tokens can represent entire words, word fragments, or individual char-
acters (Sennrich et al., 2016). For instance, the sentence "I dislike raspberries"
might be tokenized as ["I", "dis", "like", "rasp", "ber", "ries"], enabling efficient
vocabulary compression while maintaining semantic expressivity. This ap-
proach substantially reduces the required vocabulary size while preserving the
model's ability to handle out-of-vocabulary words (Wu et al., 2016).

In the seminal Transformer architecture paper, Vaswani et al. (2017) demon-


strated that a vocabulary of 37,000 tokens was sufficient to represent both
English and German languages in their machine translation task. Each token
was mapped to a 512-dimensional embedding vector using 32-bit floating-
point precision, a dimensionality carefully selected to balance representational
capacity with computational efficiency (Vaswani et al., 2017; Devlin et al.,
2019).

2.3.2 Positional Encoding

The Transformer architecture incorporates positional encodings to preserve


sequential information during parallel processing of input tokens (Vaswani et
al., 2017). These encodings are implemented through element-wise addition
of position-specific vectors to the token embeddings. The canonical approach
employs sinusoidal functions of varying frequencies to generate these posi-
tional vectors, as defined by:

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

where pos represents the token position and i denotes the dimension (Vaswani
et al., 2017). This formulation enables the model to generalize to sequence
lengths unseen during training, as the sinusoidal pattern provides a consistent
positional signal (Dehghani et al., 2019).

2.3.3 Multi-Head Attention

The multi-head attention mechanism employs parallel attention operations,


each utilizing distinct learned linear transformations of the input representa-
tions (Vaswani et al., 2017). For each attention head h, three weight matrices
W_Q, W_K, and W_V project the input sequences into query (Q), key (K), and
value (V) representations respectively. These transformations can be formally
expressed as:

𝑄 = 𝑋𝑊_𝑄,
𝐾 = 𝑋𝑊_𝐾,
𝑉 = 𝑋𝑊_𝑉

where X represents the input token embeddings. Each attention head operates
on representations with dimensionality d_k = d_model/n_heads, where
d_model denotes the model's dimension and n_heads represents the number
of parallel attention operations. In the original implementation, with d_model =
512 and n_heads = 8, each head processed 64-dimensional projections of the
input sequence (Vaswani et al., 2017). This dimensional partitioning maintains
computational efficiency while enabling the model to attend to information from
different representation subspaces simultaneously (Voita et al., 2019; Michel
et al., 2019).

Query (Q), key (K), and value (V) vectors are learned representations that fa-
cilitate token interaction within the attention mechanism (Vaswani et al., 2017).
The query vector (Q) of a token functions as a learned representation for infor-
mation seeking, determining which other tokens in the sequence are most rel-
evant for its contextual understanding. The key vector serves as a learned
compatibility measure, enabling other tokens to assess its relevance for their
contextual needs. The value vector encodes the semantic content that is ag-
gregated during the attention computation (Bahdanau et al., 2015; Vaswani et
al., 2017). This query-key-value formulation derives conceptually from infor-
mation retrieval systems, where queries are matched against keys to retrieve
associated values (Graves et al., 2014).

Figure 4 Inside attention head

Within each attention head, attention weights are computed through scaled
dot-product operations between query and key vectors (Vaswani et al., 2017).
For a sequence of length n, each query vector qi is multiplied with all key vec-
tors kj, generating n attention weights per token. These dot products are then
scaled by dividing them by √d_k, where d_k is the dimensionality of the key
vectors. The scaled values are passed through a softmax function to produce
the final attention weights:

𝑎𝑡𝑡𝑒𝑛𝑡𝑖𝑜𝑛_𝑤𝑒𝑖𝑔ℎ𝑡𝑠 = 𝑠𝑜𝑓𝑡𝑚𝑎𝑥(𝑄𝐾^𝑇/√𝑑_𝑘)

The softmax function normalizes its inputs z_i according to:



𝑠𝑜𝑓𝑡𝑚𝑎𝑥(𝑧_𝑖) = 𝑒𝑥𝑝(𝑧_𝑖) / 𝛴_𝑗 𝑒𝑥𝑝(𝑧_𝑗)

This transformation ensures the attention weights sum to 1 and fall within the
range [0,1], creating a proper probability distribution over the input sequence.
The scaling factor √d_k is important for maintaining stable gradients during
training (Vaswani et al., 2017). Without scaling, the dot products would grow
large in magnitude as d_k increases, pushing the softmax function into regions
where gradients become vanishingly small, as the exponential nature of soft-
max would cause the output distribution to concentrate heavily on the maxi-
mum values (Xu et al., 2019).

The attention weights derived from the softmax function are subsequently ap-
plied to the value vectors through matrix multiplication, yielding the final atten-
tion output. The complete attention mechanism within each head can be for-
mally expressed as:

𝐴𝑡𝑡𝑒𝑛𝑡𝑖𝑜𝑛(𝑄, 𝐾, 𝑉) = 𝑠𝑜𝑓𝑡𝑚𝑎𝑥((𝑄𝐾^𝑇) / √𝑑_𝑘)𝑉

where Q, K, and V represent the query, key, and value matrices respectively
(Vaswani et al., 2017). This computation is performed independently in each
attention head, allowing different heads to capture distinct relationships in the
input sequence. The multi-head attention mechanism then aggregates infor-
mation across all heads through concatenation:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_O

where W_O is a learned parameter matrix that projects the concatenated attention outputs back to the model's dimensional space (Vaswani et al., 2017). This
final linear transformation integrates the diverse attention patterns captured by
individual heads into a unified representation.
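The following NumPy sketch illustrates scaled dot-product attention for a single head with random, untrained projection matrices; it is an illustration of the equations above rather than an implementation from the original paper or from the case study.

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)            # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))           # softmax(QK^T / sqrt(d_k))
    return weights @ V                                   # weighted sum of value vectors

# One head with random (untrained) projection matrices:
n, d_model, n_heads = 4, 512, 8
d_k = d_model // n_heads                                 # 64-dimensional projections per head
X = np.random.randn(n, d_model)                          # n token embeddings
W_Q, W_K, W_V = [np.random.randn(d_model, d_k) for _ in range(3)]
head_output = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)   # shape (n, d_k)
# MultiHead(Q, K, V) concatenates all h head outputs and multiplies by W_O.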

2.3.4 Layer Normalization and Residual Connections

The multi-head attention output is combined with the input through a residual
connection, followed by layer normalization (Vaswani et al., 2017). This can be
formally expressed as:

𝑂𝑢𝑡𝑝𝑢𝑡 = 𝐿𝑎𝑦𝑒𝑟𝑁𝑜𝑟𝑚(𝑥 + 𝑀𝑢𝑙𝑡𝑖𝐻𝑒𝑎𝑑𝐴𝑡𝑡𝑒𝑛𝑡𝑖𝑜𝑛(𝑥))

where x represents the input sequence prior to the attention operations. The
residual connection, implemented through element-wise addition, creates a di-
rect path for information flow from lower layers by adding the input directly to
the transformed representation. This architecture mitigates the vanishing gra-
dient problem in deep networks and enables more effective training of the
transformer's deep structure (He et al., 2016). Layer normalization stabilizes
the network by normalizing the activations across the feature dimension, en-
suring consistent scaling throughout the network depth (Ba et al., 2016).
This attention block structure is repeated N times in the encoder, with each
iteration refining the representational quality of the sequence. The combination
of residual connections and layer normalization enables the network to pre-
serve and gradually enhance relevant information from the input while learning
increasingly sophisticated features at higher layers (Vaswani et al., 2017).

The LayerNorm used in the equation is expanded here:

y_i = γ · (x_i − μ) / √(σ² + ε) + β,  where μ = (1/n) Σ_j x_j and σ² = (1/n) Σ_j (x_j − μ)²

Layer normalization operates across all feature dimensions for each sequence
position independently. For a given input vector, the algorithm computes the
mean and variance across its features, then normalizes the values by subtract-
ing the mean and dividing by the standard deviation (Ba et al., 2016). This
standardization centers the features around zero and scales them to unit vari-
ance. The normalized values are then transformed using learnable parameters
γ and β, enabling the model to adaptively scale and shift the normalized fea-
tures. In the original transformer implementation (Vaswani et al., 2017), with a
maximum sequence length of 256 and dimension size of 512, layer normaliza-
tion is applied independently to each position's 512-dimensional feature vec-
tor. Thus, for each position in the sequence, the normalization statistics (mean,
variance, and standard deviation) are computed using its corresponding 512
feature values.
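A minimal NumPy sketch of this per-position normalization, with the learnable γ and β passed in as plain arrays, is shown below; it is illustrative only.

import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    """Normalize each position's feature vector independently, then scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)      # mean over the 512 features of each position
    var = x.var(axis=-1, keepdims=True)        # variance over the same features
    x_hat = (x - mean) / np.sqrt(var + eps)    # zero mean, unit variance
    return gamma * x_hat + beta                # learnable per-dimension scale and shift

sequence = np.random.randn(256, 512)           # 256 positions, d_model = 512
normalized = layer_norm(sequence, gamma=np.ones(512), beta=np.zeros(512))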

Figure 5 illustrates the application of layer normalization across a sequence of


256 inputs, where each input vector spans 512 dimensions (d₁-d₅₁₂). The nor-
malization process employs dimension-specific learnable parameters: a scal-
ing factor γ and a shift parameter β. While the transformer encoder is designed
to handle sequences up to length 256, it can process variable-length se-
quences below this maximum threshold (Vaswani et al., 2017).

Figure 5 Layer normalization

The learnable parameters serve distinct roles in the normalization process.


The shift parameter β enables linear translation of normalized values, which is
particularly significant preceding the ReLU activation function. This translation
can suppress feature activation by shifting values into the negative domain,
where ReLU maps them to zero, or enhance feature propagation by shifting
values into the positive domain, preserving them through the ReLU operation.
The scale parameter γ modulates the magnitude of normalized values through
multiplication, preserving their sign while adjusting their scale. Values of γ > 1
amplify the feature response, while γ < 1 attenuates it, allowing the model to
learn optimal feature scaling for each dimension (Ba et al., 2016).

Following layer normalization, each position in the sequence is processed in-


dependently through a position-wise feed-forward network (FFN). The FFN
transformation, as defined in Vaswani et al. (2017), can be expressed as:

𝐹𝐹𝑁(𝑥) = 𝑚𝑎𝑥(0, 𝑥 𝑊_1 + 𝑏_1) 𝑊_2 + 𝑏_2

where x represents individual input vectors of dimension d_model = 512. The


first linear transformation employs a weight matrix W₁ ∈ ℝ^(d_model × d_ff),
where d_ff = 2048, projecting the input into a higher-dimensional space. Fol-
lowing the addition of bias term b₁, a ReLU activation function introduces non-
linearity by mapping negative values to zero. The second linear transformation,
utilizing W₂ ∈ ℝ^(d_ff × d_model), projects the intermediate representation
back to the original d_model dimensionality. This architectural choice of ex-
panding to a larger intermediate dimension enables the network to capture
more complex functional relationships before returning to the model's standard
dimensionality (Vaswani et al., 2017).
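The FFN transformation can be sketched in NumPy as follows; the weights are random and purely illustrative.

import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise FFN: FFN(x) = max(0, x W_1 + b_1) W_2 + b_2."""
    hidden = np.maximum(0, x @ W1 + b1)        # expand to d_ff = 2048 and apply ReLU
    return hidden @ W2 + b2                    # project back to d_model = 512

d_model, d_ff = 512, 2048
W1, b1 = np.random.randn(d_model, d_ff), np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model), np.zeros(d_model)
x = np.random.randn(256, d_model)              # applied independently to every position
y = feed_forward(x, W1, b1, W2, b2)            # shape (256, d_model)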

After the position-wise feed-forward network, the sequence again goes


through a residual connection and layer normalization. This output is then fed
as input to a new iteration of the multi-head attention mechanism. The full en-
coder component, as described in the original paper, repeats this entire pro-
cess of multi-head attention, feed-forward network, residual connection, and
layer normalization for N=6 layers. The final output of the encoder is produced
after this sequence of N=6 encoding layers (Vaswani et al., 2017).

2.4 Transformers Decoder Architecture

The decoder architecture mirrors the encoder's structure with N = 6 identical


layers but incorporates an additional sub-layer for cross-attention over the en-
coder's output as shown in figure 6. While the encoder processes the input
sequence, the decoder generates output predictions through a distinct layered
architecture that leverages the encoder's representations through cross-attention mechanisms (Vaswani et al., 2017).

Figure 6 Decoder architecture



2.4.1 Decoder Embeddings and Positional Encoding

The decoder begins its sequence generation with a special <start> token. This
token is embedded into the same dimensional space as other tokens in the
model's vocabulary, but its specific purpose is to initiate the generation of the
output sequence. Similar to the encoder embeddings, each token in the de-
coder, including the <start> token, is combined with positional encodings to
preserve sequential information (Vaswani et al., 2017).

2.4.2 Masked Multi-head Attention

The decoder architecture contains three sub-layers in each layer: "self-atten-


tion, encoder-decoder (cross) attention, and a feed-forward network" (Vaswani
et al., 2017, p. 3). A key feature of the decoder's self-attention mechanism is
its masking strategy, which "prevent[s] positions from attending to subsequent
positions" (Vaswani et al., 2017, p. 3). For a sequence of length T, this masking
is implemented as a T×T matrix where the entry at position (i, j) is 1 if j ≤ i, and
0 otherwise. When applied to the attention weights, this effectively prevents
positions from attending to future positions by setting those attention scores to
negative infinity.
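This mask construction can be sketched in NumPy as follows; the example is illustrative only.

import numpy as np

def causal_mask(T):
    """T x T matrix with 1 where j <= i (allowed) and 0 where j > i (future positions)."""
    return np.tril(np.ones((T, T)))

def apply_causal_mask(scores):
    """Set masked attention scores to -inf so softmax assigns them zero weight."""
    mask = causal_mask(scores.shape[0])
    return np.where(mask == 1, scores, -np.inf)

raw_scores = np.random.randn(5, 5)             # QK^T / sqrt(d_k) for a 5-token sequence
masked_scores = apply_causal_mask(raw_scores)  # upper triangle becomes -inf before softmax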

During training, the decoder processes the entire target sequence simultane-
ously, with all tokens shifted one position right and the <start> token prepended
to the sequence. This offset, combined with the masking mechanism, ensures
that predictions for any position i can only depend on known outputs at posi-
tions less than i. This technique enables parallel training while maintaining the
sequential nature of the generation process (Vaswani et al., 2017). During in-
ference, however, the decoder operates autoregressively, generating one to-
ken at a time and using each new prediction as input for the next position
(Sutskever et al., 2014).

The decoder's masking mechanism can be implemented in different ways, with


causal masking being the standard approach used in the original transformer
architecture. Figure 7 illustrates the causal masking pattern, where each
position can only attend to itself and previous positions. This masking strategy
maintains consistency between training and inference by enforcing the same
autoregressive constraints in both phases (Vaswani et al., 2017).

Figure 7 Srambical, F. (2024, June 8). Causal mask in large-scale language modeling. p(doom). Re-
trieved September 26, 2024, from https://pdoom.org/causal_mask.html

An alternative approach, known as causal masking with prefix, was later intro-
duced and is shown in Figure 8. This variant allows previous positions to attend
to positions up to the current prediction point. For instance, when generating
the fourth token, the first and second tokens can attend to the third token, un-
like in standard causal masking where such connections are prohibited. While
this approach potentially enhances context utilization during training and im-
proves sequence coherence, it introduces a training-inference discrepancy.
The model learns to utilize information patterns during training that are not
available during standard autoregressive generation, potentially leading to de-
graded inference performance (Srambical, 2024).

Figure 8 Srambical. F. (2024, June 8). Causal mask in large-scale language modeling. p(doom). Re-
trieved September 27, 2024, from https://pdoom.org/causal_mask.html

2.4.3 Encoder-Decoder Attention

The encoder-decoder attention mechanism, also known as cross-attention, en-


ables the decoder to integrate information from the encoder's output. As de-
scribed in the original architecture, "in encoder-decoder attention layers, the
queries come from the previous decoder layer, and the memory keys and val-
ues come from the output of the encoder. This allows every position in the
decoder to attend over all positions in the input sequence" (Vaswani et al.,
2017, p. 5).

This cross-attention mechanism differs from self-attention in its input compo-


sition: while query vectors (Q) are derived from the decoder's previous layer
(Vaswani et al., 2017), the key (K) and value (V) vectors are computed from
the encoder's final output (Vaswani et al., 2017). This architectural design en-
ables each decoder position to selectively retrieve relevant contextual infor-
mation from the encoder's representation, establishing direct information path-
ways between the input and output sequences (Bahdanau, Cho, & Bengio,
2015). Unlike the self-attention mechanisms, the cross-attention does not uti-
lize the decoder's own key and value vectors, instead relying entirely on the
encoder's output for these representations (Vaswani et al., 2017).

2.4.4 Decoders Token Prediction

The final stage of the transformer's decoding process involves converting the
decoder's output into token predictions. As described in the original paper, "We
use learned embeddings to convert the input tokens and output tokens to vec-
tors of dimension dmodel. We also use the usual learned linear transformation
and SoftMax function to convert the decoder output to predicted next-token
probabilities. In our model, we share the same weight matrix between the two
embedding layers and the pre-SoftMax linear transformation" (Vaswani et al.,
2017, p. 5).

Following the N decoder layers, the output sequence undergoes a linear pro-
jection using the shared embedding matrix. This weight sharing between the
input embeddings and output projection is a key architectural choice that re-
duces the model's parameter count while maintaining performance. The pro-
jected vectors are then processed through a softmax function, producing a
probability distribution over the model's entire vocabulary for each position.
The model generates tokens sequentially based on these probability distribu-
tions until it predicts a special <end> token, signaling the completion of the
output sequence (Vaswani et al., 2017).

2.5 Scale and Computational Requirements

The evolution of transformer-based language models demonstrates a remark-


able increase in training data scale and computational requirements. The orig-
inal transformer model (Vaswani et al., 2017) was trained on 4.5 million sen-
tence pairs, comprising approximately 216 million tokens, using 8 NVIDIA
P100 GPUs over 3.5 days. This scale of training was significantly surpassed
by GPT-3, which processed 45 terabytes of compressed plaintext, filtered to
570 GB, equivalent to 400 billion byte-pair-encoded tokens (Brown et al.,
2020). More recently, the Llama 3 model further expanded this scale, training
on approximately 15 trillion multilingual tokens using 16,000 H100 GPUs over
54 days (Llama Team, AI @ Meta, 2024).

From my point of view, this exponential growth in training scale, increasing by


several orders of magnitude within seven years, has been enabled by concur-
rent advances in computing infrastructure. The progression from P100 to H100
GPUs, coupled with the ability to efficiently parallelize training across thou-
sands of processors, illustrates the fundamental role of hardware advance-
ment in enabling larger and more sophisticated language models.

2.6 Retrieval Augmented Generation (RAG)

Amazon Web Services (AWS, n.d. -a) describes Retrieval-Augmented Gener-


ation (RAG) as a method to enhance language model outputs by enabling ac-
cess to external knowledge bases beyond their training data. RAG allows lan-
guage models to reference authoritative sources during response generation,
extending their capabilities to specific domains without model retraining. This
approach provides a cost-effective way to improve accuracy and relevance of
LLM outputs across various contexts. The core principle of RAG is to provide
custom context to the language model during generation. This context typically
contains information not present in the model's training data, thereby expand-
ing its knowledge base for more accurate and specific responses (AWS, n.d.).

Inference-based RAG always works on the same principle. A language model has a context window of some number of tokens, and the user query and the retrieved information are placed into this context one after the other. What is retrieved depends on the user query, and there are many techniques for retrieving the required context information. NirDiamant's GitHub page (NirDiamant, n.d.-a) lists over 30 ways to implement RAG.
The best technique depends on the structure of the data and the use case. A
common RAG framework is LangChain (LangChain, n.d.). LangChain is a pop-
ular open-source framework that simplifies the implementation of RAG sys-
tems. LangChain provides modular components that can be combined to cre-
ate sophisticated RAG pipelines. The framework's primary strength lies in its
abstraction of common RAG operations and its extensive integration ecosys-
tem. The framework offers essential functionalities for handling document pro-
cessing, including document loaders for various data formats and configurable
text splitting mechanisms. It provides seamless integration with vector stores
for efficient information retrieval. The framework's chain components facilitate
the creation of sequential processing pipelines, allowing developers to con-
struct complex RAG systems that can effectively augment language model re-
sponses with external knowledge.
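The shared principle can be sketched with a small framework-agnostic helper; the function name and the character-based length budget are simplifying assumptions, since production systems count tokens rather than characters.

def build_rag_prompt(user_query, retrieved_chunks, max_chars=8000):
    """Concatenate retrieved context and the user query into one prompt, truncating
    the context so the combined text stays within a rough length budget."""
    context = ""
    for chunk in retrieved_chunks:
        if len(context) + len(chunk) > max_chars:
            break                                   # stop before exceeding the budget
        context += chunk + "\n"
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n"
        f"Question: {user_query}"
    )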

2.7 SQL database

Relational databases, managed through Structured Query Language (SQL),


serve as a fundamental technology for storing and processing information in
tabular structures. SQL enables users to perform various operations including
storing, updating, removing, searching, and retrieving data from databases or-
ganized in rows and columns (AWS, n.d.-b). These database systems form
the backbone of modern information systems, with SQL serving as the stand-
ard language for both basic queries and complex analytical processing.
Through its structured approach, SQL provides consistent and reliable data-
base interaction capabilities, supporting both routine operations and sophisti-
cated data analysis tasks (Garcia-Molina et al., 2008).

2.7.1 Text-to-SQL query

In a 2024 survey paper, Zhu et al. examine text-to-SQL query generation and its evolution from traditional methods to modern approaches. Early implementations relied on bi-structured
models using LSTM and Transformer architectures to generate SQL queries
by learning contextual representations between natural language questions
and database schemas. The field underwent significant transformation with the
introduction of pre-trained models like BERT, GPT, and T5, which demon-
strated enhanced capability in capturing semantic relationships between natu-
ral language and SQL. These models, trained on extensive text data, showed
remarkable flexibility in handling complex queries and cross-domain tasks, es-
tablishing themselves as fundamental components of contemporary text-to-
SQL systems. The emergence of LLM Agents represents the latest advance-
ment, offering interactive dialogue capabilities for dynamic query generation
adjustment (Zhu et al., 2024).

Zhu et al. (2024) identify several significant challenges in text-to-SQL systems


when bridging the gap between natural language and structured database
queries. A primary challenge lies in handling various forms of ambiguity in nat-
ural language processing. This includes word segmentation ambiguity, partic-
ularly challenging in languages like Chinese and Japanese where words aren't
clearly separated, and semantic ambiguity where terms can have multiple in-
terpretations depending on context. Database complexity presents another
substantial hurdle, as real-world databases often contain hundreds of tables
and columns with intricate relationships, making it difficult to incorporate com-
plete schema information within system constraints. Databases across differ-
ent domains may employ varying naming conventions, formats, and structures,
often including non-intuitive column names or numerous abbreviations that re-
quire sophisticated reasoning capabilities. The complexity of SQL queries
themselves adds another layer of difficulty, requiring systems to handle oper-
ations such as multi-table joins, nested subqueries, and complex conditional
filtering. Certain queries may require domain-specific SQL functions or opera-
tions, further complicating the generation process. Robustness and efficiency
pose ongoing challenges, as systems must handle imperfect user inputs, in-
cluding spelling mistakes and grammatical errors, while ensuring generated
queries are not only syntactically correct but also optimized for execution per-
formance, particularly crucial for large-scale databases (Zhu et al., 2024).

While GPT-4 has demonstrated state-of-the-art performance in text-to-SQL


tasks, the field utilizes a range of large language models including LLaMA-2,
PaLM-2, and various specialized models like CodeLlama-Instruct, each offer-
ing different advantages in specific contexts (Zhu et al., 2024).

3 CASE STUDY

The emergence of large language models has revolutionized how information


is accessed and processed, yet these models demonstrate significant limita-
tions in handling precise numerical data and statistical information (Zhu et al.,
2024). Sports statistics present an ideal testing ground for addressing this lim-
itation, as they combine structured numerical data with natural language que-
ries. This case study explores the development of a full-stack web application
that bridges this gap by integrating an LLM with a comprehensive NHL statis-
tics database. The application implements a conversational user interface for
accessing sports statistics, combining natural language processing with pre-
cise database queries.

3.1 Planning phase

The abundance of available sports data presents significant opportunities for


artificial intelligence applications, as data serves as the fundamental building
block of AI systems (Goodfellow, Bengio, & Courville, 2016). The objective was
to develop a full-stack application functioning as a statistical retrieval system.
The application architecture consists of two primary components: a conversa-
tional user interface similar to existing LLM applications, enhanced with direct
access to a statistical database, and a backend LLM pipeline for query pro-
cessing and data retrieval. The technical implementation utilizes OpenAI's
GPT-4o-mini model for natural language processing and incorporates NHL
statistics spanning from 2008 to 2024, with a React-based single-page appli-
cation serving as the frontend interface.

3.2 Data collection

The dataset was sourced from MoneyPuck, a comprehensive repository of


hockey statistics (MoneyPuck, 2024). The dataset encompasses player statis-
tics, goalie performance metrics, line combinations, defensive pairings, and
team-level data. This data was subsequently integrated into a PostgreSQL da-
tabase to facilitate efficient retrieval operations. All columns were configured
with numerical data types to enable computational operations within the SQL
environment. The dataset underwent preprocessing to eliminate duplicate en-
tries and redundant columns, though the initial data demonstrated high con-
sistency with minimal required preparation. The resulting database schema
comprises nine distinct tables, as illustrated in Figure 9.

Figure 9 Database Schema

The full database schema with all the column explanations is located at the
project's GitHub repository (Syrjä, 2024).

3.2.1 Database Normalization

The dataset exhibited significant data redundancy, primarily attributable to the


representation of player statistics across five distinct game situations: all,
5on4, 4on5, 4on4, and other. This structure results in multiple entries per
player, as exemplified in Figure 10, where identical player information (season,
name, team, position) is replicated across different game situations.

Figure 10 Example data

Initial attempts at optimization involved normalizing the database to First Nor-


mal Form (1NF), resulting in three normalized tables: player information, player
seasons, and player statistics. However, this normalization approach necessi-
tated complex table joins in SQL queries, which exceeded the current capabil-
ities of LLMs in generating accurate queries. Consequently, the decision was
made to denormalize the database structure, sacrificing storage efficiency for
query simplicity and improved LLM compatibility.
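The trade-off can be illustrated with two hypothetical queries; the table and column names below are assumptions for illustration, not the project's actual schema. The normalized design requires multi-table joins, which the LLM generated unreliably, while the denormalized design needs only a single-table filter.

# Normalized design: player attributes split across three tables, so even a
# simple per-player total requires joins (illustrative table names).
NORMALIZED_QUERY = """
SELECT p.name, s.goals
FROM players p
JOIN player_seasons ps ON ps.player_id = p.player_id
JOIN player_stats   s  ON s.player_season_id = ps.player_season_id
WHERE p.name = 'Connor McDavid' AND ps.season = 2023 AND s.situation = 'all';
"""

# Denormalized design: one wide table, one simple filter.
DENORMALIZED_QUERY = """
SELECT name, goals
FROM skater_stats
WHERE name = 'Connor McDavid' AND season = 2023 AND situation = 'all';
"""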

3.3 Backend Architecture

The backend implementation consists of two primary components: App.py and


rag_OpenAI.py. The App.py module serves as the central controller, managing
backend operations and establishing API endpoints for frontend communica-
tion. The core functionality resides in the query_process function, imported
from rag_OpenAI.py, which executes the computational logic. The architec-
tural structure of this implementation is illustrated in Figures 11 and 12,
providing a comprehensive visualization of the backend components and their interactions.

Figure 11 Backend Architecture Part 1



Figure 12 Backend Architecture part 2

3.3.1 Handling User Queries

The system implements a sequential processing pipeline for handling user


queries. The initial phase involves the parse_and_expand_query function,
which performs preliminary query analysis and expansion. This function incor-
porates a structured set of instructions that guide the language model's inter-
pretation and processing of incoming queries. Critical components of this
function are illustrated in Figures 13 and 14, with the complete implementation
available in the project's source repository.

Figure 13 Parse and Expand Query function part 1



Figure 14 Parse and Expand Function part 2

The “system_content”, “prompt” and “messages” shown in Figures 13 and 14 follow OpenAI’s API call structure (OpenAI, 2024). The “system_content” variable contains instructions on how the model should respond, and the “prompt” variable contains the user input. Ideally, the “prompt” would contain only the user query, but in this case it also includes instructions, because specifying the desired output structure in the “prompt” leads to more consistent JSON outputs from the model.

The system_content variable serves as a knowledge base initialization com-


ponent. Within this component, comprehensive definitions of the nine data-
base tables and their corresponding attributes are provided. This initialization
is necessary as the base language model lacks inherent knowledge of the da-
tabase schema and structure. The provision of table definitions enables the
model to understand and interact with the specific database architecture im-
plemented in this system.

The system implements a query analysis protocol with multiple components.


The primary component, query expansion, performs semantic reformulation of
the original user input to enhance precision. This process includes correction
of typographical errors and standardization of player name entities. The accu-
racy of player name standardization is particularly critical for subsequent SQL
query generation, as database queries require exact string matching for suc-
cessful execution.

In the “hockey related” part the model analyzes whether the user query is
hockey related at all. The model then returns true or false. This flag optimizes
the next steps since most of them can be skipped if the query is not hockey
related. In that case the information is passed on to the generate_natural_language_answer_non_hockey function, which returns an answer recommending that the user ask hockey statistics related questions.

The “query intent” is similar to “hockey_related”. In this part the model evalu-
ates whether the user query is general hockey knowledge or if it asks for
hockey statistics. If the model deems the query to be general hockey
knowledge it won’t fetch any data and will answer the query based on its
hockey knowledge.

In the “player names” and “team abbreviations” part the model lists all player names mentioned in the query as well as the abbreviations of any teams mentioned. The database stores team names as their three-letter abbreviations.

The system implements table selection logic to identify the minimal set of da-
tabase tables required for query resolution. As illustrated in Figure 15, this se-
lection process serves a crucial role in token optimization. By limiting database
schema exposure to only essential tables, the system achieves more efficient
token utilization in subsequent model interactions.

Figure 15 Parsing columns

In the last parts of the function, the schemas for the required tables are added to the JSON object. This JSON object is passed to the model multiple times, and tokens are used each time it is provided. To minimize token usage, only the schemas of the required tables are added to the JSON object. These schemas list the columns in the required tables, which are needed later for SQL query generation.

The “situation” flag is critical. As mentioned in 3.2.1 Database Normalization, the database has five situations to choose from: all, 5on4, 4on5, 4on4 and other. In most cases the situation is not specified in the query, and the model defaults to “all”.

In the “seasons” part the model decides which season to take into consideration. The issue here is that an NHL season spans two calendar years, so a season is commonly written as 202x-202x+1, covering both calendar years. In the database, however, a season is marked by the year it started, so the 2023-2024 season is stored as 2023. This would not be a problem if the model were called only once, but it is called several times along the pipeline. Even if the model marked seasons correctly 100% of the time in this part of the code, that is not guaranteed for the following parts. Here is an example:
“user query”: “Mcdavid goals in 2023-2024 season.”
“seasons”: “2023”
This is the output after parse_and_expand_query. When it is passed to the generate_sql_query function, the model in some cases sets seasons = 2023-2024 even though the correct season is explicitly stated in the JSON. This causes errors, since the SQL query does not work with 2023-2024. This issue is still unsolved.

The model analyses all of these and returns a JSON object containing the
information about each part.
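A simplified sketch of this analysis step is shown below. The key names, the abbreviated instructions and the example table names are assumptions based on the description above, not the project's exact code; the OpenAI client usage follows the current chat completions API.

import json
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

def parse_and_expand_query(user_query):
    """Sketch of the analysis step: ask the model for a JSON object with the flags
    described above. Table definitions and instructions are heavily abbreviated."""
    system_content = (
        "You analyze NHL statistics questions. "
        "The database tables are: skater_stats, goalie_stats, team_stats, ... "  # full schema text in the real prompt
        "Seasons are stored as the starting year (the 2023-2024 season is 2023)."
    )
    prompt = (
        f'User query: "{user_query}"\n'
        "Return a JSON object with exactly these keys: expanded_query, hockey_related, "
        "query_intent, player_names, team_abbreviations, required_tables, situation, seasons."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_content},
            {"role": "user", "content": prompt},
        ],
    )
    return json.loads(response.choices[0].message.content)

# Intended output for "Mcdavid goals in 2023-2024 season.":
# {"expanded_query": "Connor McDavid goals in the 2023 season", "hockey_related": true,
#  "query_intent": "statistics", "player_names": ["Connor McDavid"], "team_abbreviations": [],
#  "required_tables": ["skater_stats"], "situation": "all", "seasons": ["2023"]}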

3.3.2 Generating SQL queries

The most important parts of the generate_sql_query function are shown in fig-
ures 16 and 17. The “system_content” variable consists of instructions to the
model on how to construct the query.

Figure 16 Generating SQL queries function part 1



Figure 17 Generating SQL queries part 2

“1. Use only the specified tables and columns.” This is important because without it the model might hallucinate columns that do not exist, causing errors in SQL query generation. Rule 8 almost repeats the same thing, but repetition seems to help the model follow the instructions more consistently.

Rules 2, 6, 9 and 10 are basic SQL instructions and the model would follow
these rules even if they weren’t stated. But they are stated for consistency
reasons.

Rules 3 and 13 are repetitive, but for a very good reason. If the situation is not defined, the stats from all situations are summed. Since there is already a situation called “all”, leaving the situation undefined means the stats are calculated for “all” plus the other four situations, producing incorrect totals.

Rules 5, 7 and 12 give the model instructions on how to behave in specific cases of query formation. Without rule 12 the model would often hallucinate an assists column even when no such column is provided in the schema. It makes sense that there would be an assists column since there is a goals column; the model has a strong built-in correlation between the two, making instruction following secondary in this case. "LLMs


generate predictions of the ‘statistically likely continuations of word sequences’
based on brute-force iterative training on massive corpuses of digital text data.
As sequence predictors, these models draw on the underlying statistical distri-
bution of previously generated text to stitch together vectorized symbol strings
based on the probabilities of their co-occurrence" (Nature Reviews Physics,
2023). This quote highlights that LLMs prioritize statistical likelihood over in-
tent, which can lead to outputs that reflect the correlations in the data rather
than following specific instructions. In this context, the model's tendency to
hallucinate an assist column is driven by its learned correlations between goals
and assists, making instruction following less likely.

Rules 4 and 11 are crucial for maintaining the relevance and clarity of the query results. Rule 4 ensures that the season from the provided JSON is taken into consideration. As mentioned in 3.3.1 Handling User Queries, fetching the correct season is not a straightforward process due to the multiyear nature of NHL seasons. This rule forces the model to consider only the season from the analyzed JSON data, not the original user query. Rule 11 is a flexible guideline. The expectation that the model would choose the right number of columns is overridden by setting the maximum to 10: the model always uses 10 columns even when this is not necessary, which suggests that the model is unable to determine when all 10 columns are not needed.

Rule 14 serves as a safeguard to manage the volume of the output. There are
cases where the results would be thousands of rows long. This amount of data
overwhelms the model when forming the final answer. It also takes up a huge
amount of tokens.

In the prompt part the model is fed all the required columns, player names,
teams, situations and seasons. These are from the analysed JSON object.
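A condensed sketch of this generation step is given below; the rule list is abbreviated from the discussion above, and the exact wording is an assumption rather than the project's verbatim prompt.

from openai import OpenAI

client = OpenAI()

def generate_sql_query(parsed, table_schemas):
    """Sketch of the SQL generation step; the rule list is abbreviated."""
    system_content = (
        "You write a single PostgreSQL SELECT statement.\n"
        "1. Use only the specified tables and columns.\n"
        "2. Use the season values from the provided JSON, never the original phrasing.\n"
        "3. Always filter on the given situation (default 'all').\n"
        "4. Limit the result to at most 10 rows.\n"
        "Return only the SQL query, with no explanation."
    )
    prompt = (
        f"Schemas:\n{table_schemas}\n\n"
        f"Players: {parsed['player_names']}\n"
        f"Teams: {parsed['team_abbreviations']}\n"
        f"Situation: {parsed['situation']}\n"
        f"Seasons: {parsed['seasons']}\n"
        f"Question: {parsed['expanded_query']}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_content},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content.strip()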

3.3.3 Testing And Correcting Generated SQL Query

As illustrated in Figures 18 and 19, the system implements a query validation


protocol following SQL query generation.

Figure 18 Testing SQL query function part 1

Figure 19 Testing SQL query function part 2



This protocol executes the generated query against the database and imple-
ments error handling procedures when necessary. Upon encountering execu-
tion errors, the system invokes a correction mechanism that utilizes compre-
hensive error data and contextual information to attempt query reconstruction.
If query reconstruction fails to resolve the errors, the system terminates the
data retrieval process.
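The validation loop can be sketched as follows; fix_sql_query is a hypothetical helper standing in for the LLM-based correction call described above, and the connection handling is simplified.

import psycopg2

def test_and_correct_query(sql, conn_params, max_attempts=2):
    """Run the generated SQL; on failure, hand the error back to the model once for repair."""
    for attempt in range(max_attempts):
        try:
            with psycopg2.connect(**conn_params) as conn, conn.cursor() as cur:
                cur.execute(sql)
                return cur.fetchall()              # query succeeded, return the rows
        except psycopg2.Error as error:
            if attempt + 1 == max_attempts:
                return None                        # reconstruction failed, abort data retrieval
            sql = fix_sql_query(sql, str(error))   # hypothetical LLM-based correction call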

3.3.4 Generating Answer

The response generation system implements four distinct functions, each de-
signed for specific query scenarios:
1. Non-domain Query Response Generation
2. General Domain Knowledge Response Generation
3. Data-driven Response Generation
4. Error State Response Generation
Each function implements a specific set of response parameters and format-
ting protocols. As illustrated in Figure 20, the non-domain query response func-
tion implements a unique protocol: rather than providing direct answers, it gen-
erates responses that direct users toward domain-specific (hockey statistics)
queries.

Figure 20 Generating NL answer for non-hockey queries -function

The response generation system integrates with the backend architecture


through the app.py endpoint, which manages the final response delivery to the
user interface.

3.3.5 Chat History

The system implements a context management protocol through a chat history


mechanism initialized at system launch. This implementation utilizes a deque
data structure to maintain a rolling context window of three query-response
pairs. The architecture comprises three primary components:
1. update_chat_history: Manages the insertion of new query-response ex-
changes
2. get_chat_history: Facilitates state retrieval operations
3. query_requires_history: Implements contextual dependency analysis
for incoming queries

As illustrated in Figure 21, this architecture optimizes token utilization through selective context inclusion. Rather than implementing automatic context inheritance, which would result in token consumption proportional to the cumulative length of all query-response pairs, the system employs dynamic context inclusion based on semantic analysis of incoming queries.

Figure 21 Chat history implementation
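
A compact sketch of this mechanism, assuming the three helpers named above, is shown below. The keyword heuristic in query_requires_history is only a stand-in for the semantic analysis described in the text.

# Sketch of a rolling three-exchange chat history built on collections.deque.
# The keyword check is a crude stand-in for the semantic dependency analysis.
from collections import deque

chat_history: deque = deque(maxlen=3)  # keeps only the last three query-response pairs

def update_chat_history(query: str, response: str) -> None:
    """Insert a new exchange; the oldest pair is discarded automatically."""
    chat_history.append((query, response))

def get_chat_history() -> str:
    """Return the stored exchanges as plain text for optional prompt inclusion."""
    return "\n".join(f"User: {q}\nAssistant: {a}" for q, a in chat_history)

def query_requires_history(query: str) -> bool:
    """Decide whether earlier context should be attached to the new query."""
    follow_up_markers = ("he ", "she ", "they ", "that player", "what about", "and in")
    return any(marker in query.lower() for marker in follow_up_markers)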

3.3.6 Backend Integration Architecture

The backend integration architecture is implemented through the app.py module, which functions as the central control system for backend operations. This Flask-based implementation defines the application programming interface (API) endpoints that facilitate frontend-backend communication. The primary endpoint, '/api/query', implements POST request handling for user query processing.

As illustrated in Figure 22, the system architecture implements query processing through the process_query function, imported from the rag_OpenAI module. The implementation includes:
1. Static File Management: Integration of React frontend components
through static file serving
2. Network Configuration: Implementation of port specification (default:
5000)
3. Cross-Origin Protocol: Integration of CORS (Cross-Origin Resource
Sharing) settings for secure frontend communication

Figure 22 API call function
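
A simplified sketch of such an app.py is given below, based on the components named above (the /api/query endpoint, process_query from rag_OpenAI, static serving of the React build, CORS, and port 5000). Exact error handling and response fields in the actual module may differ.

# Simplified sketch of the Flask entry point; response field names and the
# static-folder layout are assumptions based on the description above.
import os

from flask import Flask, jsonify, request, send_from_directory
from flask_cors import CORS

from rag_OpenAI import process_query  # the query-processing pipeline

app = Flask(__name__, static_folder="frontend/build")
CORS(app)  # allow the React frontend to call the API from another origin

@app.route("/api/query", methods=["POST"])
def handle_query():
    user_query = request.get_json().get("query", "")
    answer = process_query(user_query)
    return jsonify({"response": answer})

@app.route("/", defaults={"path": ""})
@app.route("/<path:path>")
def serve_frontend(path):
    # Serve the built React frontend; fall back to index.html for client routes.
    full_path = os.path.join(app.static_folder, path)
    if path and os.path.exists(full_path):
        return send_from_directory(app.static_folder, path)
    return send_from_directory(app.static_folder, "index.html")

if __name__ == "__main__":
    app.run(port=int(os.environ.get("PORT", 5000)))  # default port 5000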

3.4 Frontend Architecture

The frontend implementation utilizes React, a JavaScript library optimized for user interface development. The architecture implements a component-based design pattern to ensure modularity and maintainability. The system's core functionality is distributed across several key components: a ChatContainer for chat interface management, a QueryInput component for user query submission, and a ResponseDisplay component for rendering AI responses and statistical data.

State management within the application is implemented through React's hook system, utilizing useState for local component state operations and useContext for cross-component data distribution. The styling implementation leverages Tailwind CSS, providing responsive design capabilities while maintaining visual coherence across the application.

The current implementation represents an architectural evolution from an initial HTML/CSS/JavaScript structure. This transition was necessitated by design complexity limitations in the original implementation. The adoption of React's component-based architecture significantly enhanced development efficiency and functional modularity.

3.4.1 User Interface and Interaction

Figure 23 shows the responsive UI at mobile size. The user interface of the Hockey AI app is designed to be intuitive and user-friendly, mimicking a chat-like experience similar to popular AI assistants.

Figure 23 App UI

The main screen features a chat window where user queries and AI responses are displayed in a conversational format. At the bottom of the screen is a text input field where users can type their queries, which they submit with a send button or by pressing Enter.

When a query is submitted, the interface displays a loading indicator to show that the system is processing the request. Responses from the AI are then displayed in the chat window, often including both textual explanations and formatted tables to visualize statistical data. The UI is designed to handle various types of responses, from simple text answers to complex statistical breakdowns.

The interface is responsive, adapting to different screen sizes to ensure a consistent experience across desktop and mobile devices. This responsiveness is achieved with Tailwind CSS, allowing for flexible layouts and components that adjust to the user's device.

The logo and background images were AI-generated with GPT-4o. The color scheme was chosen to match the logo and the background. Lengthy responses make the chat window scrollable both vertically and horizontally.

Figure 24 Info button

The info button provides information on how the application works and serves as a guide for new visitors.

3.5 Deployment of the Application

The local deployment architecture implements a multi-stage configuration process initiated through version control integration via GitHub repository cloning. The backend configuration encompasses Python environment initialization through virtual environment protocols, followed by dependency management via pip package installation. The architecture requires PostgreSQL database configuration with schema initialization, while implementing security protocols through environment variable configuration. The system utilizes Flask server initialization on localhost for backend services.

The frontend implementation architecture requires Node.js package management integration coupled with React development server configuration. The system architecture necessitates OpenAI API authentication for full functionality. Comprehensive deployment documentation is maintained within the project repository.

3.5.1 Web Deployment

The production deployment architecture utilizes Heroku's Platform as a Service (PaaS) infrastructure, incorporating automated deployment pipelines triggered by Git repository integration. The backend deployment architecture implements Python runtime configuration through Heroku buildpacks, with production server deployment utilizing Gunicorn WSGI HTTP Server. The database architecture employs PostgreSQL provisioning with automated schema migration, while implementing secure environment variable management protocols.

The frontend deployment architecture implements Node.js-based build optimization and static file serving configuration, fully integrated with backend services. The production environment incorporates automatic scaling protocols and load balancing mechanisms, ensuring optimal performance under varying traffic conditions.

4 DISCUSSION

The case study demonstrates several key findings regarding the integration of
LLMs with SQL databases for statistical data retrieval. The implementation of
text-to-SQL generation through LLMs, while successful in basic query han-
dling, revealed both opportunities and challenges in this approach. The sys-
tem's ability to process natural language queries and convert them to accurate
SQL statements varied significantly based on query complexity and database
structure.

The database normalization experiments yielded particularly interesting results. While theoretical database design principles suggest normalization for data integrity, the practical implementation revealed that denormalized structures were more compatible with current LLM capabilities. This finding suggests a potential trade-off between database optimization and LLM compatibility that warrants further investigation.

The system's agent architecture revealed important insights regarding the bal-
ance between processing accuracy and response time. The implementation
identified an optimal configuration of two specialized agents: the first perform-
ing query analysis and table selection, and the second generating SQL queries
based on the selected schema. This architecture proved efficient as it avoided
the token consumption that would be required for single-agent processing of
the complete database schema, which would consume approximately 3-4
times more tokens. While additional agent specialization could theoretically
improve accuracy, empirical testing demonstrated that the two-agent system
achieved sufficient precision while maintaining efficient token usage. The po-
tential for implementing larger language models for specific agents presents
an interesting trade-off between processing power and accuracy, suggesting
promising directions for future optimization research.
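
As a schematic illustration of this two-agent configuration (not the repository code), the sketch below shows how the first agent's table selection narrows the schema context passed to the SQL-generating agent; the model name follows the GPT-4o-mini choice described earlier, while the prompt wording and helper names are assumptions.

# Schematic two-agent pipeline: agent 1 selects relevant tables, agent 2 writes
# SQL against only those schemas; prompts and helpers are illustrative only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(system: str, user: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    return resp.choices[0].message.content

def two_agent_sql(query: str, table_summaries: dict, table_schemas: dict) -> str:
    # Agent 1: pick the relevant tables from short one-line summaries,
    # avoiding the token cost of sending every table schema.
    selected = ask("Select the relevant tables, comma-separated.",
                   f"Question: {query}\nTables: {table_summaries}")
    chosen = [t.strip() for t in selected.split(",") if t.strip() in table_schemas]
    # Agent 2: generate SQL from only the chosen schemas.
    schema_context = "\n".join(table_schemas[t] for t in chosen)
    return ask("Write one PostgreSQL query only.",
               f"Question: {query}\nSchemas:\n{schema_context}")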

The token optimization strategies implemented in the system proved crucial for both performance and cost efficiency. However, the necessity to balance context preservation with token limitations highlighted a fundamental challenge in LLM-based systems. The chat history implementation demonstrated that selective context retention could effectively address this challenge without significantly impacting system performance.

5 CONCLUSION

This thesis addressed the challenge of enhancing Large Language Models' accuracy in handling numerical data retrieval through SQL databases. The main focus was on developing a system that could accurately retrieve sports statistics based on natural language queries. The implementation consisted of a full-stack web application utilizing RAG techniques with NHL statistics data, employing a two-agent approach for query processing and SQL generation.
The research was guided by two main research questions:

1. How can Language Models be effectively utilized in RAG implementations with SQL databases?
2. What practical considerations and challenges arise when implementing such systems?

Regarding the first research question, the implementation demonstrated an effective approach to utilizing Language Models in RAG with SQL databases through a two-agent system. The first agent analysed queries and selected relevant database components, while the second generated SQL queries from this filtered context. The study revealed that success was heavily dependent on database structure and naming conventions. The decision to denormalize the database, while contradicting traditional database design principles, proved crucial for enabling more accurate query generation. This finding suggests that implementing LLMs in RAG systems may require different approaches to database design than conventional systems.

Concerning the second research question, several practical considerations and challenges were identified:

• Database design trade-offs between normalization and LLM compatibility
• Impact of column naming conventions on query accuracy
• Token optimization requirements and their effect on system design
• Resource constraints affecting model fine-tuning possibilities
• Challenges in handling complex statistical queries

The implementation demonstrated significant potential for practical applications, particularly in enhancing existing sports statistics platforms. The system could serve as a complementary natural language interface to traditional database access methods, though several limitations need to be addressed for production-ready implementation.

Future research opportunities include:

• Development of specialized database schemas optimized for LLM interaction
• Investigation of efficient token usage strategies for complex queries
• Exploration of fine-tuning approaches specific to sports statistics domains
• Integration strategies with existing statistical platforms

While this research provides a foundation for implementing LLM-based SQL query generation systems, it also highlights the complexities involved in bridging natural language processing with traditional database systems. The findings contribute to the growing body of knowledge in applied AI systems, particularly in the domain of sports statistics access and retrieval.

