Large Language Models Need Symbolic AI
Abstract
The capability of systems based on large language models (LLMs), such as ChatGPT, to generate human-
like text has captured the attention of the public and the scientific community. It has prompted both
predictions that systems such as ChatGPT will transform AI and enumerations of system problems
with hopes of solving them by scale and training. This position paper argues that both over-optimistic
views and disappointments reflect misconceptions of the fundamental nature of LLMs as language models.
As such, they are statistical models of language production and fluency, with associated strengths and
limitations; they are not—and should not be expected to be—knowledge models of the world, nor do they
reflect the core role of language beyond the statistics: communication. The paper argues that realizing
that role will require driving LLMs with symbolic systems based on goals, facts, reasoning, and memory.
Keywords
ChatGPT, Large language models, Natural Language Understanding, Neuro-Symbolic AI
1. Introduction
The language generation capabilities of systems based on large language models, and ChatGPT
in particular, have captured the attention of the general public, scientific community, and
educators. Their ability to produce human-like language has spurred predictions that they will
transform AI. Though they are powerful, there seems to be a deep misunderstanding as to what
they actually are—which has led to an ongoing enumeration of problems with their ability to
reason causally and to produce facts reliably, combined with their propensity to hallucinate.
This, in turn, has led both to attempts at banning the technology and to approaches to solving
these issues through scale-up, under the hypothesis that size is the solution and training with
even more data is the key.
Our argument is that the perceived issues associated with language models flow from a mis-
understanding of what the models are. Ironically, we need only look to their name, language
models, to understand that they are engines for language production and fluency rather than
information systems or repositories of fact. They are exceptionally good at producing language
that expresses ideas and potential facts but were not developed to generate the ideas themselves.
In fact, we argue that the statistical nature of these systems makes them, by design, incapable of
“remembering” facts about the world. There is a difference between seeing words and features
in terms of “the odds are this is right” and actually recalling the facts associated with an object
or doing inference. The former is essential to language while the latter is what is needed for
reasoning.
We argue that LLMs need to be seen as the fluency components of larger systems that
integrate classical reasoning, data analytics, and even look-up as the producers of the facts
that are used to automatically craft the prompts for the models. Language does not exist in a
vacuum; it is a medium for communication, and that communication depends on goals, facts,
and knowledge. The paper argues for addressing these problems by integrating LLMs with
symbolic systems that drive their communication.
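To make this division of labor concrete, the following minimal sketch (in Python, with hypothetical names such as craft_prompt and call_llm; no particular LLM API is assumed) shows a symbolic component supplying the goal and the facts, with the LLM asked only to phrase them:

```python
from dataclasses import dataclass

@dataclass
class Fact:
    """A symbolic fact supplied by reasoning, analytics, or look-up."""
    subject: str
    relation: str
    value: str

def craft_prompt(goal: str, facts: list[Fact]) -> str:
    """Automatically craft a prompt that asks the LLM only to phrase the given facts."""
    fact_lines = "\n".join(f"- {f.subject} {f.relation} {f.value}" for f in facts)
    return (
        f"Communication goal: {goal}\n"
        "Using only the facts below, and adding no others, write one fluent sentence.\n"
        f"Facts:\n{fact_lines}"
    )

def call_llm(prompt: str) -> str:
    """Placeholder for a call to any text-generation model."""
    return "<fluent sentence generated from the prompt>"

facts = [Fact("GPT-3", "has parameter count", "175 billion")]
print(call_llm(craft_prompt("report the model's size", facts)))
```

The particular functions are placeholders; the point is the direction of information flow, in which facts originate outside the model and constrain what it is asked to say.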
3. Scaling Up
LLM research has developed a sequence of increasingly large models, from 117 million parameters
for GPT-1 to 1.5 billion for GPT-2, to 175 billion for GPT-3, to speculative estimates of
100 trillion parameters for the recently introduced GPT-4. When GPT-3 was introduced, it was
seen to illustrate the power of model size, supporting the principle that “scaling up language
models greatly improves task-agnostic, few-shot performance” [1]. An optimistic view of the
power of LLM size sees refining the models and performing large-scale training as a primary
solution to observed gaps. Such approaches have shown benefits, though large size does not
guarantee superior performance (and has its own potential drawbacks such as training cost,
which has given rise to interest in distilled models). However, we argue that the key issue is
not one of data size or training, but instead the fundamental, in-principle issue of what LLMs
are and, hence, what they are capable of doing:
• Facts: LLMs can only propose assertions as likely (“the odds are that…”), and in different
instances might change the assertions (see the toy sketch after this list).
• Causality: They capture correlations from text, which may or may not reflect the structure
of causal reality.
• Reasoning: They can capture likely alternatives but cannot identify conclusions as defini-
tive.
• Ephemera: They depend on pretrained models requiring enormous computational re-
sources to train, resulting in a time lag in model coverage. Responses of the current
version of ChatGPT are based on 2021 data.
• Memories: They have no capability to learn long-term memories from interactions.
• Explanations: They cannot provide provenance information to account for their conclu-
sions.
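To illustrate the first of these points, the toy sketch below (with hypothetical probabilities, standing in for no particular model) shows why a statistical model proposes likely assertions rather than recalling a fact: it assigns probabilities to alternative continuations, so repeated queries can surface different answers.

```python
import random

# Toy, assumed distribution over completions of "The capital of Australia is ...";
# a real LLM's next-token probabilities play the same role.
completions = {"Canberra": 0.6, "Sydney": 0.3, "Melbourne": 0.1}

def sample_assertion() -> str:
    """Sample one completion in proportion to its probability, as decoding
    with nonzero temperature does."""
    names = list(completions)
    return random.choices(names, weights=[completions[n] for n in names])[0]

# Across repeated queries, more than one "fact" is typically asserted.
print({sample_assertion() for _ in range(10)})
```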
Various ongoing research efforts aim to address specific aspects of these issues. For example,
In-Context Retrieval-Augmented Language Models [9] are promising for supporting explanation
by increasing the ability to attribute information to its sources. As another example, the
Selection-Inference framework [10] applies an alternation of LLM steps to build more focused
inference chains, with the goal of inferences that can be seen as more causally based. Much
recent research focuses on augmented language models, which add the capability to decompose
complex tasks and enable an LLM to call external modules to augment its performance;
Mialon et al. survey these approaches [11]. As Mialon et al. point out, such systems are no
longer “pure” language models, though language models are still the drivers. We propose instead
placing LLMs in integrated systems in which symbolic reasoning drives processing in light of
goals and determines which components to apply, using LLMs for fluency and assessing their results.
Taking an even broader view, this process arises from the needs of agents pursuing goals and plans
in the physical and mental worlds. Serving those goals requires AI systems that handle what LLMs
cannot: providing facts, capturing and relaying relevant ephemeral information, making inferences,
and remembering.
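The following sketch, again in Python with hypothetical names (the dispatch table, the verbalization prompt, and the containment check are placeholder choices rather than a prescribed design), illustrates such an integration: a goal-driven controller selects a symbolic component, gathers facts, uses the LLM only for fluency, and assesses the result before releasing it.

```python
from typing import Callable

def database_lookup(goal: str) -> list[str]:
    """Stand-in for a fact repository or data-analytics component."""
    return [f"retrieved fact relevant to {goal}"]

def rule_based_inference(goal: str) -> list[str]:
    """Stand-in for a classical symbolic reasoner."""
    return [f"conclusion inferred about {goal}"]

# The symbolic controller, not the LLM, decides which component serves the goal.
COMPONENTS: dict[str, Callable[[str], list[str]]] = {
    "retrieve": database_lookup,
    "infer": rule_based_inference,
}

def call_llm(prompt: str) -> str:
    """Placeholder for any text-generation model; here it simply echoes its input."""
    return prompt

def respond(goal: str, task: str) -> str:
    facts = COMPONENTS[task](goal)  # a symbolic component supplies the content
    draft = call_llm("Restate fluently, adding nothing:\n" + "\n".join(facts))
    # Assess the LLM's output; if a supplied fact is missing, fall back to the raw facts.
    return draft if all(f in draft for f in facts) else "; ".join(facts)

print(respond("battery thermal safety", "retrieve"))
```

A fuller system would replace the containment check with a more substantive assessment, but the control relationship is the same: the symbolic layer decides, the LLM phrases.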
7. Conclusions
LLMs are receiving enormous attention from both the AI and cognitive science communities
and the general public. Implicit in many commentaries is the view that LLMs can form the
heart of a general mechanism for intelligence, with observed gaps treated as surprising; to
address them, a proposed path is scale-up and training. Another view is that LLMs should be
augmented with additional capabilities to function under the “LLM umbrella.” We have argued
that “LLM-first” systems have fundamental limitations due to the nature of LLMs as statistical
language models. In our view, fully realizing the opportunity provided by LLMs will depend
on integrations of symbolic AI with LLMs in which goal-based symbolic systems drive LLMs
and provide knowledge. Language models used as language models, to articulate facts guaranteed
by other components, are very different from systems that attempt to rely on language models for
discovering facts. Realizing the potential of LLMs depends on cognizance of their intrinsic
capabilities, both strengths and limitations, and on symbolic systems for guidance, mediation,
and support.
Acknowledgments
Funding for the first author’s work was provided by UL Research Institutes through the Center
for Advancing Safety of Machine Intelligence. The second author’s work was funded by the US
Department of Defense (Contract W52P1J2093009), and by the Department of the Navy, Office
of Naval Research (Award N00014-19-1-2655).
References
[1] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan,
P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan,
R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin,
S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei,
Language models are few-shot learners, in: Advances in Neural Information Processing
Systems, volume 33, Curran, 2020, pp. 1877–1901.
[2] F. Manjoo, ChatGPT has a devastating sense of humor, 2022. URL: https://www.nytimes.
com/2022/12/16/opinion/conversation-with-chatgpt.html.
[3] D. Milmo, ChatGPT reaches 100 million users two months after
launch, 2023. URL: https://www.theguardian.com/technology/2023/feb/02/
chatgpt-100-million-users-open-ai-fastest-growing-app.
[4] R. Shiffrin, M. Mitchell, Probing the psychology of ai models, Proceedings of the National
Academy of Sciences 120 (2023) e2300963120.
[5] M. Binz, E. Schulz, Using cognitive psychology to understand GPT-3, Proceedings of the
National Academy of Sciences 120 (2023) e2218523120.
[6] J. Warner, In case it’s not clear the #ChatGPT just makes stuff up, 2022. URL: https:
//twitter.com/biblioracle/status/1599545554006003712.
[7] K. Roose, A conversation with Bing’s chatbot left me deeply unsettled, 2023. URL: https:
//www.nytimes.com/2023/02/16/technology/bing-chatbot-microsoft-chatgpt.html.
[8] M. Klee, AI chat bots are running amok—and we have no clue how to
stop them, 2023. URL: https://www.rollingstone.com/culture/culture-features/
ai-chat-bots-misinformation-hate-speech-1234677574/.
[9] O. Ram, Y. Levine, I. Dalmedigos, D. Muhlgay, A. Shashua, K. Leyton-Brown, Y. Shoham,
In-context retrieval-augmented language models, 2023. arXiv:2302.00083.
[10] A. Creswell, M. Shanahan, I. Higgins, Selection-inference: Exploiting large language
models for interpretable logical reasoning, 2022. arXiv:2205.09712.
[11] G. Mialon, R. Dessì, M. Lomeli, C. Nalmpantis, R. Pasunuru, R. Raileanu, B. Rozière,
T. Schick, J. Dwivedi-Yu, A. Celikyilmaz, E. Grave, Y. LeCun, T. Scialom, Augmented
language models: a survey, 2023. arXiv:2302.07842.
[12] R. Schank, Conceptual Information Processing, volume 3 of Fundamental Studies in Com-
puter Science, North-Holland, Amsterdam, 1975.
[13] R. López de Mántaras, D. McSherry, D. Bridge, D. Leake, B. Smyth, S. Craw, B. Faltings,
M. Maher, M. Cox, K. Forbus, M. Keane, A. Aamodt, I. Watson, Retrieval, reuse, revision,
and retention in CBR, Knowledge Engineering Review 20 (2005) 215–240.
[14] A. Ram, D. Leake (Eds.), Goal-Driven Learning, MIT Press, 1995.
[15] D. Leake, Evaluating Explanations: A Content Theory, Lawrence Erlbaum, Hillsdale, NJ,
1992.