Saku-Matti Syrjä
Retrieval-Augmented Generation
Utilizing SQL Database
Key findings reveal that while LLMs can effectively generate SQL queries for statistical retrieval, challenges persist in database design paradigms, where traditional normalization principles proved counterproductive for RAG applications. The study identifies specific limitations in handling season formatting and column selection, while also highlighting the potential for production-level applications. The research contributes to the field by presenting a practical framework for implementing SQL-based RAG systems and identifies areas for future improvement, including dataset optimization and model fine-tuning opportunities.
This work provides valuable insights into the integration of LLMs with structured databases and offers a foundation for developing more accurate and reliable statistical retrieval systems.
PREFACE
This work would not have been possible without the support of the open-source
community, particularly the developers of various tools and libraries utilized in
this project. Their contributions to the field have been invaluable.
Special thanks to these companies who provide invaluable tools for research
and study purposes: OpenAI, Anthropic, Perplexity and Meta.
CONTENTS
1 INTRODUCTION ........................................................................................ 7
1.1 Motivation ............................................................................................. 7
1.2 Goals .................................................................................................... 8
2 LITERATURE REVIEW .............................................................................. 8
2.1 Artificial Intelligence (AI) ....................................................................... 8
2.2 History of language model architectures ............................................ 10
2.3 Transformers Encoder Architecture .................................................... 11
2.3.1 Input Embeddings ...................................................................... 14
2.3.2 Positional Encoding ................................................................... 14
2.3.3 Multi-Head Attention .................................................................. 15
2.3.4 Layer Normalization and Residual Connections ........................ 18
2.4 Transformers Decoder Architecture ................................................... 20
2.4.1 Decoder Embeddings and Positional Encoding ......................... 22
2.4.2 Masked Multi-head Attention ..................................................... 22
2.4.3 Encoder-Decoder Attention ....................................................... 24
2.4.4 Decoders Token Prediction ....................................................... 25
2.5 Scale and Computational Requirements ............................................ 26
2.6 Retrieval Augmented Generation (RAG) ............................................ 26
2.7 SQL database .................................................................................... 27
2.7.1 Text-to-SQL query ..................................................................... 28
3 CASE STUDY ........................................................................................... 29
3.1 Planning phase ................................................................................... 29
3.2 Data collection .................................................................................... 30
3.2.1 Database Normalization ............................................................ 31
3.3 Backend Architecture ......................................................................... 31
3.3.1 Handling User Queries .............................................................. 33
3.3.2 Generating SQL queries ............................................................ 38
3.3.3 Testing And Correcting Generated SQL Query ......................... 41
3.3.4 Generating Answer .................................................................... 42
3.3.5 Chat History ............................................................................... 43
3.3.6 Backend Integration Architecture ............................................... 43
3.4 Frontend Architecture ......................................................................... 44
3.4.1 User Interface and Interaction.................................................... 45
3.5 Deployment of the Application ............................................................ 48
3.5.1 Web Deployment ....................................................................... 49
4 DISCUSSION............................................................................................ 50
5 CONCLUSION .......................................................................................... 51
REFERENCES ............................................................................................ 53
LIST OF SYMBOLS AND TERMS
AI = Artificial Intelligence
LLM = Large Language Model
SQL = Structured Query Language
UI = User Interface
API = Application Programming Interface
JSON = JavaScript Object Notation
1 INTRODUCTION
1.1 Motivation
1.2 Goals
The primary objective of this study is to investigate methods for enabling large
language models to accurately handle numerical data through direct SQL da-
tabase integration. This research addresses the following questions:
To address these research questions, this study consists of two main compo-
nents. First, a comprehensive literature review examines the underlying tech-
nologies: transformer architecture in large language models, SQL database
systems, and Retrieval-Augmented Generation (RAG). Second, through a de-
tailed case study, a practical implementation and integration of these technol-
ogies is demonstrated, providing empirical evidence of their effectiveness and
limitations in real-world applications.
2 LITERATURE REVIEW
Artificial Intelligence can be defined in many ways. As shown in Figure 1, the
definitions of artificial intelligence can be categorized into thinking humanly,
thinking rationally, acting humanly, and acting rationally (Russell & Norvig,
2010).
Figure 1 Definitions of artificial intelligence, organized into four categories. From Artificial Intelligence: A
Modern Approach (3rd ed., p. 2), by S. J. Russell & P. Norvig, 2010, Pearson.
Russell and Norvig (2010) draw an insightful parallel between the development
of artificial flight and artificial intelligence, pointing out that successful powered
flight was achieved only when inventors stopped trying to replicate bird flight
and instead focused on understanding fundamental aerodynamics.
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
where pos represents the token position and i denotes the dimension (Vaswani
et al., 2017). This formulation enables the model to generalize to sequence
lengths unseen during training, as the sinusoidal pattern provides a consistent
positional signal (Dehghani et al., 2019).
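To make this formulation concrete, the following minimal Python sketch computes the sinusoidal encodings for an entire sequence. It is illustrative only; the function name and array shapes are not taken from any particular library.

import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of sinusoidal position signals."""
    positions = np.arange(seq_len)[:, None]                 # token positions pos
    dims = np.arange(0, d_model, 2)[None, :]                # even dimensions 2i
    angle_rates = 1.0 / np.power(10000.0, dims / d_model)   # 1 / 10000^(2i/d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(positions * angle_rates)           # PE(pos, 2i)
    pe[:, 1::2] = np.cos(positions * angle_rates)           # PE(pos, 2i+1)
    return pe

# The encoding is simply added to the token embeddings:
# embedded_input = token_embeddings + positional_encoding(seq_len, d_model)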
Q = X W_Q
K = X W_K
V = X W_V
where X represents the input token embeddings. Each attention head operates
on representations with dimensionality d_k = d_model/n_heads, where
d_model denotes the model's dimension and n_heads represents the number
of parallel attention operations. In the original implementation, with d_model =
512 and n_heads = 8, each head processed 64-dimensional projections of the
input sequence (Vaswani et al., 2017). This dimensional partitioning maintains
computational efficiency while enabling the model to attend to information from
different representation subspaces simultaneously (Voita et al., 2019; Michel
et al., 2019).
Query (Q), key (K), and value (V) vectors are learned representations that fa-
cilitate token interaction within the attention mechanism (Vaswani et al., 2017).
The query vector (Q) of a token functions as a learned representation for infor-
mation seeking, determining which other tokens in the sequence are most rel-
evant for its contextual understanding. The key vector serves as a learned
compatibility measure, enabling other tokens to assess its relevance for their
contextual needs. The value vector encodes the semantic content that is ag-
gregated during the attention computation (Bahdanau et al., 2015; Vaswani et
al., 2017). This query-key-value formulation derives conceptually from infor-
mation retrieval systems, where queries are matched against keys to retrieve
associated values (Graves et al., 2014).
Within each attention head, attention weights are computed through scaled
dot-product operations between query and key vectors (Vaswani et al., 2017).
For a sequence of length n, each query vector qi is multiplied with all key vec-
tors kj, generating n attention weights per token. These dot products are then
scaled by dividing them by √d_k, where d_k is the dimensionality of the key
vectors. The scaled values are passed through a softmax function to produce
the final attention weights:
attention_weights = softmax(Q K^T / √d_k)
This transformation ensures the attention weights sum to 1 and fall within the
range [0,1], creating a proper probability distribution over the input sequence.
The scaling factor √d_k is important for maintaining stable gradients during
training (Vaswani et al., 2017). Without scaling, the dot products would grow
large in magnitude as d_k increases, pushing the softmax function into regions
where gradients become vanishingly small, as the exponential nature of soft-
max would cause the output distribution to concentrate heavily on the maxi-
mum values (Xu et al., 2020).
The attention weights derived from the softmax function are subsequently ap-
plied to the value vectors through matrix multiplication, yielding the final atten-
tion output. The complete attention mechanism within each head can be for-
mally expressed as:
Attention(Q, K, V) = softmax(Q K^T / √d_k) V
where Q, K, and V represent the query, key, and value matrices respectively
(Vaswani et al., 2017). This computation is performed independently in each
attention head, allowing different heads to capture distinct relationships in the
input sequence. The multi-head attention mechanism then aggregates infor-
mation across all heads through concatenation:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_O
where each head_i is the attention output of one head and W_O is a learned output projection matrix (Vaswani et al., 2017).
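The mechanism described above can be summarized in a short NumPy sketch. It is illustrative only: the weight matrices are random placeholders, whereas a trained model learns W_Q, W_K, W_V and W_O.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)        # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))      # (n, n) attention weights per token
    return weights @ V                             # weighted sum of value vectors

def multi_head_attention(X, n_heads=8, d_model=512):
    d_k = d_model // n_heads                       # 64-dimensional projections per head
    rng = np.random.default_rng(0)
    heads = []
    for _ in range(n_heads):
        W_Q, W_K, W_V = (rng.standard_normal((d_model, d_k)) for _ in range(3))
        heads.append(scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V))
    W_O = rng.standard_normal((d_model, d_model))
    return np.concatenate(heads, axis=-1) @ W_O    # Concat(head_1, ..., head_h) W_O

X = np.random.default_rng(1).standard_normal((10, 512))   # 10 tokens, d_model = 512
print(multi_head_attention(X).shape)                       # (10, 512)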
The multi-head attention output is combined with the input through a residual
connection, followed by layer normalization (Vaswani et al., 2017). This can be
formally expressed as:
output = LayerNorm(x + MultiHead(x))
where x represents the input sequence prior to the attention operations. The
residual connection, implemented through element-wise addition, creates a di-
rect path for information flow from lower layers by adding the input directly to
the transformed representation. This architecture mitigates the vanishing gra-
dient problem in deep networks and enables more effective training of the
transformer's deep structure (He et al., 2016). Layer normalization stabilizes
the network by normalizing the activations across the feature dimension, en-
suring consistent scaling throughout the network depth (Ba et al., 2016).
This attention block structure is repeated N times in the encoder, with each
iteration refining the representational quality of the sequence. The combination
of residual connections and layer normalization enables the network to pre-
serve and gradually enhance relevant information from the input while learning
increasingly sophisticated features at higher layers (Vaswani et al., 2017).
y_i = γ · (x_i − μ) / √(σ^2 + ε) + β
where μ = (1/n) Σ_{j=1..n} x_j is the mean and σ^2 = (1/n) Σ_{j=1..n} (x_j − μ)^2 is the variance computed over the n feature values.
Layer normalization operates across all feature dimensions for each sequence
position independently. For a given input vector, the algorithm computes the
mean and variance across its features, then normalizes the values by subtract-
ing the mean and dividing by the standard deviation (Ba et al., 2016). This
standardization centers the features around zero and scales them to unit vari-
ance. The normalized values are then transformed using learnable parameters
γ and β, enabling the model to adaptively scale and shift the normalized fea-
tures. In the original transformer implementation (Vaswani et al., 2017), with a
maximum sequence length of 256 and dimension size of 512, layer normaliza-
tion is applied independently to each position's 512-dimensional feature vec-
tor. Thus, for each position in the sequence, the normalization statistics (mean,
variance, and standard deviation) are computed using its corresponding 512
feature values.
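A minimal sketch of this per-position normalization, with γ and β reduced to fixed placeholder scalars instead of learned vectors, could look as follows.

import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """x: (seq_len, d_model); statistics are computed per position over its features."""
    mean = x.mean(axis=-1, keepdims=True)            # mean over the d_model feature values
    var = x.var(axis=-1, keepdims=True)              # variance over the same values
    normalized = (x - mean) / np.sqrt(var + eps)     # zero mean, unit variance
    return gamma * normalized + beta                 # learnable scale and shift

x = np.random.default_rng(0).standard_normal((256, 512))   # seq_len = 256, d_model = 512
out = layer_norm(x)
print(out.mean(axis=-1)[:3], out.std(axis=-1)[:3])          # means ~0, standard deviations ~1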
The decoder begins its sequence generation with a special <start> token. This
token is embedded into the same dimensional space as other tokens in the
model's vocabulary, but its specific purpose is to initiate the generation of the
output sequence. Similar to the encoder embeddings, each token in the de-
coder, including the <start> token, is combined with positional encodings to
preserve sequential information (Vaswani et al., 2017).
During training, the decoder processes the entire target sequence simultane-
ously, with all tokens shifted one position right and the <start> token prepended
to the sequence. This offset, combined with the masking mechanism, ensures
that predictions for any position i can only depend on known outputs at posi-
tions less than i. This technique enables parallel training while maintaining the
sequential nature of the generation process (Vaswani et al., 2017). During in-
ference, however, the decoder operates autoregressively, generating one to-
ken at a time and using each new prediction as input for the next position
(Sutskever et al., 2014).
In the masked self-attention sublayer, a causal mask is applied to the attention scores so that each
position can only attend to itself and previous positions. This masking strategy
maintains consistency between training and inference by enforcing the same
autoregressive constraints in both phases (Vaswani et al., 2017).
Figure 7 Srambical, F. (2024, June 8). Causal mask in large-scale language modeling. p(doom). Re-
trieved September 26, 2024, from https://pdoom.org/causal_mask.html
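A standard causal mask of this kind can be sketched in a few lines. The additive negative-infinity formulation shown here is one common way to implement it and is assumed purely for illustration.

import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    mask = np.triu(np.ones((seq_len, seq_len)), k=1)   # 1s above the diagonal (future positions)
    return np.where(mask == 1, -np.inf, 0.0)           # -inf removes them after the softmax

scores = np.random.default_rng(0).standard_normal((5, 5))   # raw Q K^T / √d_k scores
masked_scores = scores + causal_mask(5)                      # applied before the softmax
# Each row i now has -inf in columns j > i, so position i only attends to positions j <= i.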
An alternative approach, known as causal masking with prefix, was later intro-
duced and is shown in Figure 8. This variant allows previous positions to attend
to positions up to the current prediction point. For instance, when generating
the fourth token, the first and second tokens can attend to the third token, un-
like in standard causal masking where such connections are prohibited. While
this approach potentially enhances context utilization during training and im-
proves sequence coherence, it introduces a training-inference discrepancy.
The model learns to utilize information patterns during training that are not
available during standard autoregressive generation, potentially leading to de-
graded inference performance (Srambical, 2024).
Figure 8 Srambical, F. (2024, June 8). Causal mask in large-scale language modeling. p(doom). Re-
trieved September 27, 2024, from https://pdoom.org/causal_mask.html
In the encoder-decoder (cross-attention) sublayer, the decoder's queries attend to keys and values derived from the encoder's final output (Vaswani et al., 2017). This architectural design en-
ables each decoder position to selectively retrieve relevant contextual infor-
mation from the encoder's representation, establishing direct information path-
ways between the input and output sequences (Bahdanau, Cho, & Bengio,
2015). Unlike the self-attention mechanisms, the cross-attention does not uti-
lize the decoder's own key and value vectors, instead relying entirely on the
encoder's output for these representations (Vaswani et al., 2017).
The final stage of the transformer's decoding process involves converting the
decoder's output into token predictions. As described in the original paper, "We
use learned embeddings to convert the input tokens and output tokens to vec-
tors of dimension dmodel. We also use the usual learned linear transformation
and SoftMax function to convert the decoder output to predicted next-token
probabilities. In our model, we share the same weight matrix between the two
embedding layers and the pre-SoftMax linear transformation" (Vaswani et al.,
2017, p. 5).
Following the N decoder layers, the output sequence undergoes a linear pro-
jection using the shared embedding matrix. This weight sharing between the
input embeddings and output projection is a key architectural choice that re-
duces the model's parameter count while maintaining performance. The pro-
jected vectors are then processed through a softmax function, producing a
probability distribution over the model's entire vocabulary for each position.
The model generates tokens sequentially based on these probability distribu-
tions until it predicts a special <end> token, signaling the completion of the
output sequence (Vaswani et al., 2017).
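The projection-and-softmax step with weight tying can be illustrated with a small sketch; the vocabulary size, dimensions, and greedy argmax choice are illustrative assumptions.

import numpy as np

vocab_size, d_model = 32000, 512
rng = np.random.default_rng(0)
embedding_matrix = rng.standard_normal((vocab_size, d_model))   # shared with the input embeddings

def predict_next_token(decoder_output: np.ndarray) -> int:
    """decoder_output: (d_model,) vector for the last position after N decoder layers."""
    logits = decoder_output @ embedding_matrix.T     # tied weights act as the output projection
    exp = np.exp(logits - logits.max())
    probs = exp / exp.sum()                          # softmax over the whole vocabulary
    return int(np.argmax(probs))                     # greedy choice of the next token

h = rng.standard_normal(d_model)
print(predict_next_token(h))                         # index of the predicted next token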
Inference-based RAG always follows the same principle. A language model has
a context window of some number of tokens, and the user query together with
the information retrieved by the RAG system must fit into this context length.
What is retrieved depends on the user query, and there are many techniques
for retrieving the required context information. NirDiamant's GitHub repository
(NirDiamant, n.d.-a) lists over 30 ways to implement RAG.
The best technique depends on the structure of the data and the use case. A
common RAG framework is LangChain (LangChain, n.d.). LangChain is a pop-
ular open-source framework that simplifies the implementation of RAG sys-
tems. LangChain provides modular components that can be combined to cre-
ate sophisticated RAG pipelines. The framework's primary strength lies in its
abstraction of common RAG operations and its extensive integration ecosys-
tem. The framework offers essential functionalities for handling document pro-
cessing, including document loaders for various data formats and configurable
text splitting mechanisms. It provides seamless integration with vector stores
for efficient information retrieval. The framework's chain components facilitate
the creation of sequential processing pipelines, allowing developers to con-
struct complex RAG systems that can effectively augment language model re-
sponses with external knowledge.
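Independently of the framework used, the core inference-time principle can be sketched as follows. The retrieve and count_tokens callables are hypothetical stand-ins for whatever retriever and tokenizer a concrete pipeline (LangChain-based or otherwise) would provide.

def build_prompt(user_query: str, retrieve, count_tokens, context_limit: int = 8000) -> str:
    """Pack the most relevant retrieved passages and the user query into the context window."""
    passages = retrieve(user_query)              # most relevant passages first
    selected = []
    used = count_tokens(user_query)
    for passage in passages:
        cost = count_tokens(passage)
        if used + cost > context_limit:          # stop before overflowing the context length
            break
        selected.append(passage)
        used += cost
    context = "\n\n".join(selected)
    return f"Context:\n{context}\n\nQuestion: {user_query}\nAnswer:"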
In a 2024 survey, Zhu et al. examine text-to-SQL query generation and its
evolution from traditional methods to modern approaches. Early implementations relied on bi-structured
models using LSTM and Transformer architectures to generate SQL queries
by learning contextual representations between natural language questions
and database schemas. The field underwent significant transformation with the
introduction of pre-trained models like BERT, GPT, and T5, which demon-
strated enhanced capability in capturing semantic relationships between natu-
ral language and SQL. These models, trained on extensive text data, showed
remarkable flexibility in handling complex queries and cross-domain tasks, es-
tablishing themselves as fundamental components of contemporary text-to-
SQL systems. The emergence of LLM Agents represents the latest advance-
ment, offering interactive dialogue capabilities for dynamic query generation
adjustment (Zhu et al., 2024).
Robustness and query efficiency pose ongoing challenges, as systems must handle imperfect user inputs, in-
cluding spelling mistakes and grammatical errors, while ensuring generated
queries are not only syntactically correct but also optimized for execution per-
formance, particularly crucial for large-scale databases (Zhu et al., 2024).
3 CASE STUDY
The case study implements a conversational application consisting of a user-facing frontend, access to a statistical database, and a backend LLM pipeline for query pro-
cessing and data retrieval. The technical implementation utilizes OpenAI's
GPT-4o-mini model for natural language processing and incorporates NHL
statistics spanning from 2008 to 2024, with a React-based single-page appli-
cation serving as the frontend interface.
The full database schema with all the column explanations is located at the
project's GitHub repository (Syrjä, 2024).
The key parts of the parse_and_expand_query function are illustrated in figures 13 and 14, with the complete implementation
available in the project's source repository.
In the “hockey related” part the model analyzes whether the user query is
hockey-related at all and returns true or false. This flag optimizes the following
steps, since most of them can be skipped if the query is not hockey-related. In
that case the information is passed on to the generate_natural_language_answer_non_hockey
function, which returns an answer recommending that the user ask
hockey-statistics-related questions.
The “query intent” part is similar to “hockey_related”. Here the model evaluates
whether the user query asks for general hockey knowledge or for hockey
statistics. If the model deems the query to be general hockey knowledge, it
won’t fetch any data and will answer based on its own hockey knowledge.
In the “player names” and “team abbreviations” parts the model lists all player
names mentioned in the query as well as the three-letter abbreviations of any
teams mentioned, since teams are stored in the database by their three-letter abbreviations.
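As a hypothetical illustration, the analysis step could produce a JSON object along these lines for a query such as "How many goals did McDavid score for EDM in the 2023-2024 season?". The exact keys are defined in the project code and may differ.

analysis = {
    "hockey_related": True,          # if False, the rest of the pipeline is skipped
    "query_intent": "statistics",    # "statistics" vs. general hockey knowledge
    "player_names": ["Connor McDavid"],
    "team_abbreviations": ["EDM"],   # teams are stored as three-letter abbreviations
    "situations": ["all"],
    "seasons": ["2023"],             # a season is stored as its starting year
}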
The system implements table selection logic to identify the minimal set of da-
tabase tables required for query resolution. As illustrated in Figure 15, this se-
lection process serves a crucial role in token optimization. By limiting database
schema exposure to only essential tables, the system achieves more efficient
token utilization in subsequent model interactions.
In the last parts of the function the schemas for the required tables are added
to the JSON object. This JSON object is passed to the model multiple times,
and each time it is provided tokens are consumed. To minimize token usage,
only the schemas of the required tables are added to the JSON object. These
schemas contain the columns of the required tables, which are needed later
for SQL query generation.
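The token-saving idea can be sketched as follows; the table and column names are placeholders, not the project's actual schema.

FULL_SCHEMA = {
    "skater_stats": ["player_name", "team", "season", "situation", "goals", "assists"],
    "goalie_stats": ["player_name", "team", "season", "situation", "save_pct"],
    "team_stats":   ["team", "season", "situation", "goals_for", "goals_against"],
}

def schemas_for(required_tables: list[str]) -> dict[str, list[str]]:
    """Return only the column lists of the tables needed for this query."""
    return {table: FULL_SCHEMA[table] for table in required_tables if table in FULL_SCHEMA}

# Only the minimal schema is added to the JSON object passed to later model calls.
print(schemas_for(["skater_stats"]))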
In the “seasons” part the model decides which seasons to take into consideration.
The issue here is that an NHL season spans two calendar years, so a season
is commonly written as 202x-202x+1, containing both calendar years it covers.
In the database, however, a season is stored as the year in which it started,
so the 2023-2024 season is stored as 2023. This would not be a problem if the
model were called only once, but it is called repeatedly in the pipeline: even if
the model marked seasons correctly 100% of the time in this part of the code,
that is not guaranteed for the following parts. Here is an example:
“user query”: “Mcdavid goals in 2023-2024 season.”
“seasons: “2023”
This is the output after parse_and_expand_query. When it is passed to the
generate_sql_query function, the model in some cases still sets seasons =
2023-2024 even though the correct season is explicitly stated in the JSON.
This returns errors, since the SQL query does not work with 2023-2024. This
issue remains unsolved.
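The format mismatch itself is easy to state in code. The following sketch shows one possible deterministic normalization of a season string to the database's starting-year convention; it illustrates the convention rather than the approach used in the project, where the value is produced by the model.

import re

def normalize_season(season: str) -> int:
    """Map "2023-2024" or "2023" to 2023, the year the season started."""
    match = re.match(r"^\s*(\d{4})(?:\s*-\s*\d{4})?\s*$", season)
    if not match:
        raise ValueError(f"Unrecognized season format: {season!r}")
    return int(match.group(1))

print(normalize_season("2023-2024"))   # 2023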
The model analyses all of these and returns a JSON object containing the
information about each part.
The most important parts of the generate_sql_query function are shown in fig-
ures 16 and 17. The “system_content” variable consists of instructions to the
model on how to construct the query.
“1. Use only the specified tables and columns.” This is important since without
this rule the model might hallucinate columns that don’t exist, causing errors
in SQL query generation. Rule 8 repeats almost the same thing, but repetition
seems to help the model follow the instructions more consistently.
Rules 2, 6, 9 and 10 are basic SQL instructions that the model would likely
follow even if they weren’t stated, but they are included for consistency.
Rules 3 and 13 are repetitive, but for a good reason. If the situation is not
defined, then the stats from all situations are summed. Since there is already
a situation called “all”, leaving the situation undefined means the stats for “all”
are added to the other four situations, producing incorrect totals.
Rules 4 and 11 are crucial for maintaining the relevance and clarity of the query
results. Rule 4 ensures that the season from the provided JSON is taken into
consideration. As mentioned in section 3.3.1 Handling User Queries, fetching
the correct season is not a straightforward process due to the multiyear nature
of NHL seasons. This rule forces the model to consider only the season from
the analyzed JSON data, not the original user query. Rule 11 is a flexible
guideline. The expectation that the model would choose an appropriate number
of columns is overridden by the maximum of 10: the model always uses 10
columns even when it is not necessary, which suggests the model is unable to
determine when all 10 columns are unneeded.
Rule 14 serves as a safeguard to manage the volume of the output. In some
cases the results would be thousands of rows long, an amount of data that
overwhelms the model when forming the final answer and also consumes a
large number of tokens.
In the prompt part the model is fed all the required columns, player names,
teams, situations and seasons, taken from the analyzed JSON object.
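A hedged sketch of how such a call could be assembled with the OpenAI Python client is shown below. The rule text is paraphrased and heavily abbreviated; the real instructions and helper structure live in the project repository.

import json
from openai import OpenAI

client = OpenAI()

def generate_sql_query(analysis: dict) -> str:
    system_content = (
        "You write a single SQL query for the hockey statistics database.\n"
        "1. Use only the specified tables and columns.\n"
        "2. Filter by the seasons given in the JSON, not by the original user query.\n"
        "3. Always filter the situation column; default to situation = 'all'.\n"
        "4. Limit the result to at most 10 rows."
    )
    prompt = "Analyzed query JSON:\n" + json.dumps(analysis, indent=2)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_content},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content.strip()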
This protocol executes the generated query against the database and imple-
ments error handling procedures when necessary. Upon encountering execu-
tion errors, the system invokes a correction mechanism that utilizes compre-
hensive error data and contextual information to attempt query reconstruction.
If query reconstruction fails to resolve the errors, the system terminates the
data retrieval process.
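A minimal sketch of this test-and-correct protocol, assuming an SQLite connection and a hypothetical correct_sql_query helper for the model-based repair step, could look like this.

import sqlite3

def execute_with_correction(connection: sqlite3.Connection, sql: str,
                            analysis: dict, correct_sql_query) -> list | None:
    for attempt in range(2):                      # the original query, then one repaired attempt
        try:
            return connection.execute(sql).fetchall()
        except sqlite3.Error as error:
            if attempt == 1:
                return None                       # reconstruction failed: terminate retrieval
            # Feed the error message and context back to the model for query reconstruction.
            sql = correct_sql_query(sql, str(error), analysis)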
The response generation system implements four distinct functions, each de-
signed for specific query scenarios:
1. Non-domain Query Response Generation
2. General Domain Knowledge Response Generation
3. Data-driven Response Generation
4. Error State Response Generation
Each function implements a specific set of response parameters and format-
ting protocols. As illustrated in Figure 20, the non-domain query response func-
tion implements a unique protocol: rather than providing direct answers, it gen-
erates responses that direct users toward domain-specific (hockey statistics)
queries.
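The selection between these four scenarios can be sketched as a simple dispatch. Apart from generate_natural_language_answer_non_hockey, which is named earlier in the text, the callables here are placeholders for the project's actual response functions.

def generate_response(analysis: dict, query_results, error_occurred: bool,
                      non_hockey, general_knowledge, data_driven, error_state) -> str:
    if not analysis["hockey_related"]:
        return non_hockey(analysis)               # 1. non-domain query: redirect toward hockey stats
    if analysis["query_intent"] != "statistics":
        return general_knowledge(analysis)        # 2. general domain knowledge, no data fetched
    if error_occurred or query_results is None:
        return error_state(analysis)              # 4. error state: SQL execution and correction failed
    return data_driven(analysis, query_results)   # 3. data-driven response from retrieved rows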
The responsive UI is shown in mobile size in figure 23. The user interface of
the Hockey AI app is designed to be intuitive and user-friendly, mimicking a
chat-like experience similar to popular AI assistants.
Figure 23 App UI
The main screen features a chat window where user queries and AI responses
are displayed in a conversational format. At the bottom of the screen, users
can find a text input field where they can type their queries. The app uses a
send button or allows users to press enter to submit their queries.
The logo and background images are AI generated with gpt-4o. The color
scheme was made to match nicely with the logo and the background. Lengthy
responses make the chat window scrollable both vertically and horizontally.
The info button provides information on how the application works, functioning
as a guide for new visitors.
4 DISCUSSION
The case study demonstrates several key findings regarding the integration of
LLMs with SQL databases for statistical data retrieval. The implementation of
text-to-SQL generation through LLMs, while successful in basic query han-
dling, revealed both opportunities and challenges in this approach. The sys-
tem's ability to process natural language queries and convert them to accurate
SQL statements varied significantly based on query complexity and database
structure.
The system's agent architecture revealed important insights regarding the bal-
ance between processing accuracy and response time. The implementation
identified an optimal configuration of two specialized agents: the first perform-
ing query analysis and table selection, and the second generating SQL queries
based on the selected schema. This architecture proved efficient as it avoided
the token consumption that would be required for single-agent processing of
the complete database schema, which would consume approximately 3-4
times more tokens. While additional agent specialization could theoretically
improve accuracy, empirical testing demonstrated that the two-agent system
achieved sufficient precision while maintaining efficient token usage. The po-
tential for implementing larger language models for specific agents presents
an interesting trade-off between processing power and accuracy, suggesting
promising directions for future optimization research.
5 CONCLUSION
REFERENCES
Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural machine translation by
jointly learning to align and translate. arXiv.
https://doi.org/10.48550/arXiv.1409.0473
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P.,
Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss,
A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J.,
Winter, C., ... Amodei, D. (2020). Language models are few-shot learners.
arXiv. https://arxiv.org/pdf/2005.14165
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training
of deep bidirectional transformers for language understanding. In Proceedings
of the 2019 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, Volume 1 (Long
and Short Papers) (pp. 4171–4186). Association for Computational Linguistics.
https://doi.org/10.18653/v1/N19-1423
Garcia-Molina, H., Ullman, J. D., & Widom, J. (2008). Database systems: The
complete book (2nd ed.). Pearson Prentice Hall.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press.
Graves, A., Wayne, G., & Danihelka, I. (2014). Neural Turing machines. arXiv.
https://doi.org/10.48550/arXiv.1410.5401
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image
recognition. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition (CVPR) (pp. 770–778). IEEE.
https://doi.org/10.1109/CVPR.2016.90
Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., ... Kiela,
D. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks.
Advances in Neural Information Processing Systems, 33, 9179-9191.
McCarthy, J., & Hayes, P. J. (1969). Some philosophical problems from the
standpoint of artificial intelligence. Machine Intelligence, 4, 463-502.
Michel, P., Levy, O., & Neubig, G. (2019). Are sixteen heads really better than
one? In Advances in Neural Information Processing Systems (NeurIPS 2019).
arXiv. https://doi.org/10.48550/arXiv.1905.10650
Mikolov, T., Karafiát, M., Burget, L., Černocký, J., & Khudanpur, S. (2010).
Recurrent neural network based language model. In Proceedings of the 11th
Annual Conference of the International Speech Communication Association
(INTERSPEECH 2010).
Nilsson, N. J. (2009). The quest for artificial intelligence: A history of ideas and
achievements. Cambridge University Press.
Pascanu, R., Mikolov, T., & Bengio, Y. (2013). On the difficulty of training re-
current neural networks. In International Conference on Machine Learning (pp.
1310-1318).
Searle, J. R. (1980). Minds, brains, and programs. Behavioral and Brain Sci-
ences, 3(3), 417-424.
Sennrich, R., Haddow, B., & Birch, A. (2016). Neural machine translation of
rare words with subword units. In Proceedings of the 54th Annual Meeting of
the Association for Computational Linguistics (Volume 1: Long Papers) (pp.
1715–1725). Association for Computational Linguistics.
https://doi.org/10.18653/v1/P16-1162
Simon, H. A. (1996). The sciences of the artificial (3rd ed.). MIT Press.
Srambical, F. (2024, June 8). Causal mask in large-scale language modeling.
p(doom). Retrieved September 26, 2024, from
https://pdoom.org/causal_mask.html
Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning
with neural networks. arXiv. https://doi.org/10.48550/arXiv.1409.3215
Turing, A. M. (1950). Computing machinery and intelligence. Mind, 59(236),
433-460.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N.,
Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. arXiv preprint
arXiv:1706.03762. https://arxiv.org/pdf/1706.03762
Voita, E., Talbot, D., Moiseev, F., Sennrich, R., & Titov, I. (2019). Analyzing
multi-head self-attention: Specialized heads do the heavy lifting, the rest can
be pruned. In Proceedings of the 57th Annual Meeting of the Association for
Computational Linguistics (pp. 5797–5808). Association for Computational
Linguistics. https://doi.org/10.18653/v1/P19-1580
Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., … Dean,
J. (2016). Google’s neural machine translation system: Bridging the gap be-
tween human and machine translation. arXiv.
https://doi.org/10.48550/arXiv.1609.08144
Xu, K., Zhang, M., Li, J., Du, S. S., Kawarabayashi, K., & Jegelka, S. (2020).
How neural networks extrapolate: From feedforward to graph neural networks.
arXiv. https://doi.org/10.48550/arXiv.2009.11848
Young, T., Hazarika, D., Poria, S., & Cambria, E. (2018). Recent trends in deep
learning based natural language processing. IEEE Computational Intelligence
Magazine, 13(3), 55-75.
Zhu, F., Dai, D., & Sui, Z. (2024). Language models understand numbers, at
least partially. arXiv. https://arxiv.org/html/2401.03735v3
Zhu, X., Li, Q., Cui, L., & Liu, Y. (2024). Large language model enhanced text-
to-SQL generation: A survey. arXiv.
https://doi.org/10.48550/arXiv.2410.06011