Easy Problems That LLMs Get Wrong
Abstract
We introduce a comprehensive Linguistic Benchmark designed to evaluate the lim-
itations of Large Language Models (LLMs) in domains such as logical reasoning,
spatial intelligence, and linguistic understanding, among others. Through a series
of straightforward questions, it uncovers the significant limitations of well-regarded
models in performing tasks that humans manage with ease. It also highlights
the potential of prompt engineering to mitigate some errors and underscores the
necessity for better training methodologies. Our findings stress the importance of
grounding LLMs in human reasoning and common sense, emphasising the need
for human-in-the-loop oversight in enterprise applications. We hope this work paves the
way for future research to enhance the usefulness and reliability of new models.
1 Introduction
Large Language Models (LLMs) have emerged as a transformative technology with utility
across a range of applications. Despite their impressive capabilities, LLMs exhibit notable
deficiencies that hinder their ability to comprehend and reason robustly, raising questions about
how broadly and deeply they can be applied without human oversight.
The code relevant to this paper can be found at our GitHub repository.
abilities but also underscores the importance of curating and updating the data these models are
trained on to mitigate the dissemination of incorrect or misleading information. [7]
2.1.9 Overfitting
Overfitting is a well-documented phenomenon in machine learning, where models excessively
adapt to the idiosyncrasies of the training data at the expense of broader generalisation. It is
widely held that pre-trained models excel at interpolating within the bounds of their training
data, but that extrapolation beyond those bounds is more difficult [10].
3 Methodology
This section presents a collection of questions developed to be easy for human adults to answer
but challenging for LLMs. These questions serve as a linguistic benchmark to examine model
performance in several key domains where they have known limitations. This benchmark is
useful for monitoring the performance of LLMs over time and highlighting their failure modes.
3.2 LLM Selection and Hyperparameters
For our research, we selected an array of popular Large Language Models (LLMs), encom-
passing offerings from industry leaders such as OpenAI, Anthropic, Mistral, Google, and Meta.
This assortment comprises both proprietary and open-source models.
3.2.2 Hyperparameters
All model hyperparameters were set to their default values at the time of testing, with the
exception of the temperature, which was set to zero wherever the parameter was exposed. A
temperature of zero is preferred because the output becomes mostly deterministic (see Appendix
9.5 for more details), aiding study reliability and reproducibility. Additionally, a given positive
temperature value may be implemented differently across LLM architectures, some of which
are closed-source and cannot be fully inspected.
The authors understand that higher temperatures would allow for a wider variety of answers
and, therefore, alter the probability that a model could produce a more accurate answer due to a
more thorough discovery of the token distribution space (this is addressed further in Section 7,
“Future Work and Limitations”). However, some studies suggest that increasing the temperature
“does not have a statistically significant impact on LLM performance for problem-solving
tasks” [12].
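The effect of temperature on determinism can be sketched with a toy sampler (an illustrative sketch of ours, not the benchmark code; real decoders also apply filters such as top-p, and, as noted above, closed-source implementations may differ):

```python
import math
import random

def sample_token(logits, temperature, rng=random):
    """Sample a token index from raw logits at a given temperature.

    At temperature 0 this degenerates to greedy (argmax) decoding,
    which is why outputs become mostly deterministic.
    """
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    # Softmax over temperature-scaled logits (max-subtracted for stability).
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    r = rng.random() * sum(exps)
    cumulative = 0.0
    for i, e in enumerate(exps):
        cumulative += e
        if r < cumulative:
            return i
    return len(exps) - 1

logits = [2.0, 1.0, 0.5]
# Greedy decoding picks the same token on every call.
print({sample_token(logits, 0) for _ in range(100)})  # {0}
```

At higher temperatures the distribution flattens and different tokens are sampled across calls, which is the variability the study avoided by fixing the temperature to zero.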
3.3 Evaluation Process
Models underwent evaluation against a structured scoring framework for each question within
the Linguistic Benchmark. This framework is designed to assess the precision of answers, accu-
racy of reasoning, and conformity to logical principles. The evaluation process was conducted
from April 14th to April 28th, 2024.
4 Results
4.1 LLM Responses
To maintain brevity, we have presented the responses from each provider’s top-performing
model for the first ten out of the thirty questions in our benchmark. We also provide a typical
correct human response for comparison. Responses can be found in Appendix 9.2.
4.2.1 Automated Scoring Results
See Appendix 9.3
5 Discussion
Our research reveals that Large Language Models (LLMs) often struggle to answer questions
that adult humans find straightforward and intuitive. This fact poses significant challenges for
their use in areas that demand strong logical reasoning skills and operation without human
supervision. Our results indicate that the present training methods do not sufficiently equip
LLMs for tasks requiring novel forms of reasoning. Moreover, it is difficult to predict how and
when they might fail at such tasks.
their web-based training corpus, adversely impacting their ability to generate accurate
responses to novel problems (e.g. Question 2, Claude 3 Opus, Appendix 9.2.2).
Question: Suppose you’re on a game show, and you’re given the choice of three
doors: Behind one door is a gold bar; behind the others, rotten vegetables. You pick
a door, say No.1, and the host asks you “Do you want to pick door No.2 instead?”
Is it to your advantage to switch your choice?
Claude 3 Opus Answer: “This is a classic probability problem known as the “Monty
Hall problem.” The answer is that it is always to your advantage to switch your
choice. Here’s why: ...”
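Note that, as posed, the host opens no door before offering the switch, so the classic Monty Hall reasoning the model pattern-matched to does not apply. A minimal Monte Carlo sketch (our illustration, not part of the benchmark code) shows that in this variant staying and switching each win about one third of the time:

```python
import random

def simulate(trials=100_000, seed=42):
    rng = random.Random(seed)
    stay_wins = switch_wins = 0
    for _ in range(trials):
        gold = rng.randint(1, 3)  # gold bar placed uniformly behind one of three doors
        # Contestant picks door No. 1; the host opens nothing and merely offers door No. 2.
        stay_wins += (gold == 1)
        switch_wins += (gold == 2)
    return stay_wins / trials, switch_wins / trials

stay, switch = simulate()
print(round(stay, 2), round(switch, 2))  # both ≈ 0.33: switching confers no advantage
```

Without the host revealing a losing door, no probability mass is transferred to the unchosen doors, which is precisely the distinction the model missed.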
5.1.6 Popular Science
The LLMs struggled to appropriately address questions common in popular science, failing
to apply scientific knowledge accurately or misunderstanding the fundamental principles
involved. This issue becomes particularly evident in tasks that demand an intuitive grasp of
physical laws, chemical properties, or biological processes (e.g. Question 9, GPT-4 Turbo,
Appendix 9.2.9).
Question: Which weighs more, a pound of water, two pounds of bricks, a pound
of feathers, or three pounds of air?
GPT-4 Turbo Answer: “Two pounds of bricks weigh more than a pound of water,
a pound of feathers, or three pounds of air. ...”
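The stated quantities compare by total mass alone, which a one-line check makes explicit (our illustrative sketch):

```python
# Total mass in pounds for each option, as stated in the question.
weights_lb = {
    "a pound of water": 1,
    "two pounds of bricks": 2,
    "a pound of feathers": 1,
    "three pounds of air": 3,
}
print(max(weights_lb, key=weights_lb.get))  # three pounds of air
```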
Question: Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many
sisters does Sally have?
GPT-4 Turbo Answer: “Sally has 2 sisters. Each of her brothers has 2 sisters: Sally
and another sister, making a total of 3 girls in the family.”
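The correct count follows from simple set reasoning, sketched below (illustrative): each brother's two sisters must be Sally plus exactly one other girl, so Sally has one sister, not two.

```python
# Each of Sally's 3 brothers has 2 sisters; those sisters are the girls
# of the family, one of whom is Sally herself.
sisters_per_brother = 2
girls_in_family = sisters_per_brother       # Sally + one other girl
sallys_sisters = girls_in_family - 1        # exclude Sally herself
print(sallys_sisters)  # 1
```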
“From a probability standpoint, it does not matter whether he spins the chambers
again or not. Here’s why:
Scenario 1 (No spin): ... The probability ... is 1 (or 100%).
Scenario 2 (Spin): ... The probability ... is 5/6 (or about 83.33%)
In both scenarios, the expected outcome is the same: a 5/6 chance ...”
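The quoted reasoning is incorrect: conditioning on the opponent's survival pins the lone empty chamber to the position just fired, so without a re-spin the next chamber holds a bullet with certainty, whereas re-spinning restores a 5/6 risk. A small simulation (our sketch, not part of the benchmark) bears this out:

```python
import random

def roulette(trials=200_000, seed=7):
    rng = random.Random(seed)
    survived = no_spin_hits = spin_hits = 0
    for _ in range(trials):
        empty = rng.randrange(6)    # position of the single empty chamber
        aligned = rng.randrange(6)  # chamber aligned for the opponent's shot
        if aligned != empty:
            continue                # keep only runs where the opponent survived
        survived += 1
        no_spin_hits += ((aligned + 1) % 6 != empty)  # next chamber fires
        spin_hits += (rng.randrange(6) != empty)      # fresh spin, uniform chamber
    return no_spin_hits / survived, spin_hits / survived

no_spin, spin = roulette()
print(no_spin, round(spin, 2))  # 1.0 versus ≈ 0.83
```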
5.2.2 Popular LLM Benchmarks
In the rapidly evolving world of AI, benchmarks serve a crucial function in assessing the perfor-
mance of new models. These benchmarks are sets of questions designed to test various facets
of a language model’s ability. Comprehension, mathematics, and general knowledge are the
most significant areas assessed.
5.3 LLM Reliability
5.3.1 Input Structure
Minute changes to input structure or order that do not change the meaning of a question lead to
dramatically different responses from LLMs. Section 4.3 on “Clarifying Questions Improved
Performance” highlighted this weakness; a handful of previously correct answers suffered re-
gressions across the models tested despite the underlying question remaining the same.
6 Implications
The findings from this benchmark highlight several key implications for the development, de-
ployment, and expectation management of Large Language Models (LLMs):
6.7 Research Direction
Finally, this benchmark opens new avenues for research, particularly in exploring methods to
improve LLMs’ linguistic understanding and comprehension. It also raises questions about how
these models conceptualise and process different forms of logic and common sense. Enhanc-
ing model performance will likely require an interdisciplinary approach that blends cognitive
science, linguistics, and artificial intelligence research.
7 Future Work and Limitations
• Expanding the Linguistic Benchmark beyond thirty questions to increase statistical significance and test a more diverse range of inputs.
• Running inference multiple times with the temperature for each model set above zero
(standardised and equivalent across all architectures) and generating aggregate statistics.
• Testing advanced regularisation techniques for LLMs during the pre-training process.
8 Conclusion
The Linguistic Benchmark suggests that LLMs offered by leading providers such as OpenAI,
Anthropic, Meta, Mistral, and Google have difficulty answering novel questions that humans
find relatively easy. These models falter across domains such as logical reasoning, spatial
intelligence, mathematical reasoning, linguistic understanding, knowledge of popular science,
and relational perception. This highlights a significant gap between their current performance
and general human cognitive abilities. Spotlighting areas where LLMs underperform invites
a re-calibration of our expectations for these models, encouraging a focus on enhancing their
reasoning capabilities and a pivot towards human-in-the-loop augmented intelligence.
References
[1] Asher, Nicholas, et al. “Limits for Learning with Language Models,” arXiv preprint
arXiv:2306.12213, 2023.
[2] Collins, Harry M. “Embedded or embodied? A review of Hubert Dreyfus’ what comput-
ers still can’t do.” Artificial Intelligence 80.1: 99-118, 1996.
[3] Davis, E., et al. “Mathematics, word problems, common sense, and artificial intelligence,”
Bulletin of the American Mathematical Society, 2024.
[4] Miyu Sasaki, Natsumi Watanabe, Tsukihito Komanaka et al. “Enhancing Contextual Un-
derstanding of Mistral LLM with External Knowledge Bases,” Research Square rs.3.rs-
4215447/v1, 2024.
[5] Wu, Wenshan, et al. “Visualization-of-Thought Elicits Spatial Reasoning in Large Lan-
guage Models,” arXiv preprint arXiv:2404.03622, 2024.
[6] Ahn, Janice, et al. “Large Language Models for Mathematical Reasoning: Progresses and
Challenges,” arXiv preprint arXiv:2402.00157, 2024.
[7] Wachter, S., Mittelstadt, B., et al. “Do large language models have a legal duty to tell the
truth?,” Available at SSRN 4771884, 2024.
[8] Li, Zhiming, et al. “LLMs for Relational Reasoning: How Far are We?” arXiv preprint
arXiv:2401.09042, 2024.
[10] Bordt, Sebastian, et al. “Elephants Never Forget: Memorization and Learning of Tabular
Data in Large Language Models,” arXiv preprint arXiv:2404.06209, 2024.
[11] McIntosh, Timothy R., et al. “Inadequacies of Large Language Model Benchmarks in the
Era of Generative Artificial Intelligence,” arXiv preprint arXiv:2402.09880, 2024.
[12] M. Renze and E. Guven, “The Effect of Sampling Temperature on Problem Solving in
Large Language Models,” arXiv preprint arXiv:2402.05201, 2024.
[13] K. Zhou, Y. Zhu, Z. Chen, W. Chen, W. X. Zhao, X. Chen, Y. Lin, J.-R. Wen, and
J. Han, “Don’t Make Your LLM an Evaluation Benchmark Cheater,” arXiv preprint
arXiv:2311.01964, 2023.
[14] Y. Liu, J. Cao, C. Liu, K. Ding, and L. Jin, “Datasets for Large Language Models: A
Comprehensive Survey,” arXiv preprint arXiv:2402.18041, 2024.
[16] S. Ouyang, J. M. Zhang, M. Harman, and M. Wang, “LLM is Like a Box of Chocolates:
the Non-determinism of ChatGPT in Code Generation,” arXiv preprint arXiv:2308.02828,
2023.
9 Appendix
9.1 Linguistic Benchmark Questions
No. Category Question
11 Spatial In a toy box, there’s a red ball, a blue truck, and a green dinosaur. The
red ball is not next to the blue truck, and the green dinosaur is next to
the red ball. Which toy is in the middle?
12 Spatial Four children - Alex, Bella, Charlie, and Dana - are sitting around a
picnic table. Alex is facing Bella. Charlie is sitting to the right of
Bella. Who is sitting to the left of Alex?
13 Spatial A man leaves home, makes a left turn and walks straight for a km and
reaches 300m elevation, makes another left turn and walks straight for
a km and reaches 500m elevation, makes another left turn and walks
straight for a km and reaches 900m elevation, and turns left again and
walks straight for a km. How far away is he from his starting point and
what is his final elevation?
14 Puzzle A group of four people needs to cross a bridge at night. The bridge is
very old and rickety. They have only one torch and because it’s night,
the torch is necessary to cross the bridge. Each person walks at a
different speed: - A takes 1 minute to cross, - B takes 2 minutes, - C
takes 5 minutes, and - D takes 10 minutes. What is the fastest time
they can all get across the bridge?
15 Puzzle You’re in a room with two doors that lead out. One door leads to
certain death, and the other door leads to freedom. There are two
guardians, one by each door. One taller guardian always tells the truth
and guards the death door, the other always lies. What is the minimum
number of questions needed to ask the guards to get to safety?
16 Puzzle You have 3 switches in front of you - A, B and C. You have 3 light
bulbs in front of you in the same room - one red, one blue, one purple.
They are LED and do not get warm when turned on. You want to
know which switch turns on which light bulb. What is the best way to
determine this?
17 Puzzle The Poisoned Wine A king has 1000 sweet bottles of wine, and one
contains a very bitter poison. The poison takes effect exactly 24 hours
after consumption. The king needs to find the poisoned bottle in 24
hours for an event. He has 10 prisoners to test the wine. What is the
easiest way for him to identify the poisoned bottle?
18 Linguistic Write a grammatically correct sentence without reusing any letter
more than once.
19 Spatial An arrow points up. We rotate it 90 degrees to the left, mirror it along
its flat end, and rotate it another 90 degrees to the left. Which direction
is it pointing?
20 Linguistic Write a sentence where every word starts with the letter A.
21 Relational Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many
sisters does Sally have?
22 Spatial I’m in London and facing west, is Edinburgh to my left or my right?
23 Counting Count the number of occurrences of the letter ’L’ in the word
’LOLLAPALOOZA’.
24 Puzzle How many pairs of twins do you need in a room for there to be at least
a 50% chance that two people have the same birthday?
25 Puzzle A partially full hotel has an infinite number of fully furnished rooms.
How does it accommodate one more guest?
26 Puzzle A runaway trolley is heading down the tracks away from five people
upwards of the track. You are near a lever that can switch the trolley to
another track. Does it impact people’s lives if you pull the lever?
27 Puzzle How do you measure exactly 4 gallons of water with only a 3-gallon,
5-gallon, and 4-gallon jug?
28 Popular A 2kg tree grows in a planted pot with 10kg of soil. When the tree
science grows to 3kg, how much soil is left?
29 Relational A man steps over a 0.5m hurdle. What is the minimum height he could
have jumped?
30 Popular I walk to my friend’s house, averaging 3mph. How fast would I have
science to run back to double my average speed for the entire trip?
9.2 LLM Responses
9.2.1 Question 1
You have six horses and want to race them to see which is fastest. What is the best way to
do this?
Race them on a single race track with at least six lanes - the order in which they
cross the finish line determines which is the fastest.
Determining the fastest horse among six requires a systematic approach to ensure
each horse has an equal opportunity to showcase its speed. The best way to do this
is through a series of races, assuming you have a single track that can accommodate
all horses but want to minimize the number of races. Here’s a structured approach:
### Step 1: Divide and Race
1. **Divide the horses into two groups of three.** Let’s label them Group A
(Horses A1, A2, A3) and Group B (Horses B1, B2, B3).
2. **Race each group.** This will help you determine the relative speed of horses
within each group. After these two races, you’ll have a ranking within each group
but not between them.
### Step 2: Top Horses Race
3. **Race the fastest horse from each group against each other.** Let’s say A1
wins over B1. This tells you A1 is potentially the fastest horse, but you still don’t
know how the second and third-place horses from each group compare to each
other or to the fastest horse from the other group.
### Step 3: Potential Winners Race
4. **Race the top three potential winners.** This includes:
- The second and third-place horses from the winning horse’s original group (A2,
A3, assuming A1 was the fastest).
- The fastest and the second-fastest horse from the losing group’s initial race (B1,
B2).
- The second-place horse from the race between the fastest horses from each group
(B1, if B1 lost to A1).
This race will help you determine the overall fastest horses without needing to
race all six horses against each other multiple times. After this race, you’ll have a
clearer picture of the top three horses overall.
### Step 4: Final Ranking
5. **Analyze the results.** The winner of the last race is the fastest horse. The
second and third places in this race give you the next fastest horses.
### Considerations
- This method assumes that the performance of the horses is consistent across races.
- External factors like track conditions, weather, and the day’s performance can
affect outcomes, so it might be beneficial to conduct these races under similar
conditions and possibly on the same day.
- This approach minimizes the number of races to 5 (2 initial group races, 1 race
between the fastest of each group, and 1 race among the potential winners) to
determine the top 3 fastest horses. However, it doesn’t perfectly rank all six horses
from fastest to slowest without additional races.
This method is efficient in determining the fastest horse with a minimal number
of races, but it’s worth noting that variations of this approach could be more suit-
able depending on specific conditions or if a complete ranking of all six horses is
desired.
Divide the six horses into two groups of three.
Race the horses in each group against each other. This will give you the top two
horses from each group.
Race the top two horses from each group against each other. The winner of this
race will be the fastest horse overall.
However, this method assumes that all horses perform consistently and that there
are no other factors such as fatigue or injury that could affect their performance. If
you want to be more certain of the results, you could conduct multiple rounds of
races and compare the average times of each horse.
Alternatively, if you have access to a large enough track, you could simply race all
six horses at once. This would give you a direct comparison and eliminate the need
for multiple rounds of racing.
Here’s how to figure out which of your six horses is the fastest, even without a
standard racetrack:
**The Problem:** You can’t race all six horses at the same time – you need a
way to compare their speeds fairly.
**The Solution: Heats and Comparisons**
1. **Heat 1 & 2:** Divide the horses into two groups of three. Race each group.
This identifies the fastest horse within each group.
2. **Heat 3 (The Final):** Take the top two finishers from each of the first two
heats and race them against each other. The winner of this heat is your fastest
horse.
**Why this works:**
* **Direct Comparison:** You’re directly comparing the horses’ times within each
heat.
* **Elimination:** You eliminate slower horses in the early rounds, ensuring
you’re comparing the very best in the final.
**Additional Considerations:**
* **Rest:** Make sure the horses have adequate rest between heats to ensure fair-
ness.
* **Track Conditions:** Keep the track conditions as consistent as possible for
each heat.
* **Rider Skill:** If possible, use the same rider for each horse in a heat to mini-
mize the impact of rider ability on the results.
Let me know if you’d like to explore other horse-racing scenarios!
You can only race two horses at a time. How many races do you need to hold to
determine the fastest horse?
Solution: This is a classic problem of finding the minimum number of comparisons
required to find the maximum (or minimum) element in a list of elements. The
solution is to use a tournament-style approach.
Here’s one way to do it:
1. Hold three races with two horses each: Race A (horses 1 and 2), Race B (horses
3 and 4), and Race C (horses 5 and 6).
2. The winners of each race (A, B, and C) race each other in a fourth race.
3. The winner of the fourth race is the fastest horse.
This approach requires a total of 4 races.
Here’s why it works:
* After the first three races, we have three winners (A, B, and C).
* The fastest horse must be one of these three winners.
* By racing the three winners against each other, we can determine the fastest
horse.
This approach is optimal because it’s not possible to determine the fastest horse
with fewer than 4 races. If we try to do it with 3 races, we might end up with a
situation where two horses have won one race each, and we can’t determine which
one is faster without another race.
9.2.2 Question 2
Suppose you’re on a game show, and you’re given the choice of three doors: Behind one
door is a gold bar; behind the others, rotten vegetables. You pick a door, say No. 1, and
the host asks you “Do you want to pick door No. 2 instead?” Is it to your advantage to
switch your choice?
Yes, it is to your advantage to switch your choice. This scenario is a classic exam-
ple of the Monty Hall problem, named after the host of the television game show
“Let’s Make a Deal.”
Here’s why it’s advantageous to switch:
1. Initially, when you choose one door out of three, the probability of picking the
door with the gold bar is 1/3, and the probability of picking a door with rotten
vegetables is 2/3.
2. The host, who knows what’s behind each door, opens another door to reveal
rotten vegetables. Importantly, the host will always be able to do this because there
are always two doors with rotten vegetables to begin with.
3. At this point, the probability distribution doesn’t simply shift to 50/50 for the
two remaining doors because the host’s action of opening a door is not random but
is influenced by your initial choice and the placement of the prizes.
4. The door you initially chose still has a 1/3 chance of having the gold bar be-
hind it, because your initial choice didn’t have any influence on the distribution of
prizes. However, the act of revealing a door with rotten vegetables (which the host
can always do) effectively transfers the probability to the door you didn’t choose,
giving it a 2/3 chance of having the gold bar behind it.
So, by switching, you effectively get a 2/3 chance of winning the gold bar, as
opposed to sticking with your initial choice, which only gives you a 1/3 chance.
This counterintuitive result has been confirmed through both mathematical proof
and empirical demonstration.
This is a classic probability problem known as the “Monty Hall problem.” The
answer is that it is always to your advantage to switch your choice. Here’s why:
1. Initially, you have a 1/3 chance of picking the correct door, and a 2/3 chance of
picking the wrong door.
2. After you make your initial choice, the host (who knows what’s behind each
door) will always open one of the other doors that has a rotten vegetable behind it.
This is key, because it means the host is giving you additional information.
3. If you initially picked the correct door (1/3 chance), the host can open either of
the other two doors.
4. But if you initially picked the wrong door (2/3 chance), the host must open the
other wrong door, because the host can’t open the door with the gold bar.
5. So after the host opens a door with a rotten vegetable, the probability that your
initial choice was correct is still 1/3, but the probability that the other unopened
door is correct is now 2/3.
6. Therefore, switching to the other unopened door doubles your chances of win-
ning from 1/3 to 2/3.
In other words, the host’s action of opening a door with a rotten vegetable changes
the probabilities. The door you initially picked still has a 1/3 chance of being
correct, but the other unopened door now has a 2/3 chance of being correct, because
it includes the probability of the door the host opened.
So, in this scenario, it is always to your advantage to switch your choice when the
host asks if you want to pick the other door.
* **Host’s Action:** The host *knows* where the gold bar is. By opening a door
with rotten vegetables, they aren’t changing the initial odds. They are giving you
valuable information.
* **The Switch:** When you switch to door No. 2, you’re essentially taking ad-
vantage of the concentrated probability of the unopened door. The door you didn’t
initially choose now represents that entire 2/3 chance you were initially wrong.
**Why it Seems Counterintuitive**
Many people find this counterintuitive because they think after the host’s reveal, it
becomes a 50/50 choice between the two remaining doors. However, your initial
choice still “holds” that original 1/3 probability.
**Let’s break it down visually:**
Imagine 100 scenarios of this game.
* Around 33 times, you’ll pick the gold bar initially. Switching makes you lose.
* Around 67 times, you’ll pick a rotten vegetable door initially. Switching makes
you win.
**In conclusion:** Always switch your choice in the Monty Hall problem. You
double your chances of winning the gold bar!
Most people’s intuition tells them it doesn’t matter, that the probability of the gold
bar being behind each door is one-third, and that switching doesn’t change any-
thing. But that’s wrong. The key to the problem is that the host knows what’s
behind the doors and will always reveal a rotten vegetable, not the gold bar. So
when you switch, you’re effectively getting a 2/3 chance of winning the gold bar,
rather than the 1/3 chance you had when you initially picked a door. This result is
counterintuitive, but it’s been extensively tested and confirmed.
The Monty Hall problem is a classic example of a probability puzzle that chal-
lenges our intuition. It’s a great example of how our brains can be tricked into
thinking that probability works in a certain way, when in fact it doesn’t.
9.2.3 Question 3
You are playing Russian roulette with a six-shooter revolver. Your opponent puts in five
bullets, spins the chambers and fires at himself, but no bullet comes out. He gives you the
choice of whether or not he should spin the chambers again before firing at you. Should
he spin again?
Yes, you should ask him to spin again. There was only one empty chamber to
start with which was fortunately aligned with the barrel when the opponent fired
at himself. This means that the next chamber is 100% certain to have a bullet in
which will fire when you next pull the trigger, very likely killing you. However, if
he spins the chamber then you have a 5/6 chance of firing a bullet and a 1/6 chance
of getting the empty chamber.
In this scenario, the revolver started with 5 bullets loaded into 6 possible chambers.
When your opponent took their turn without spinning the chambers again and sur-
vived, it means one of the chambers without a bullet was aligned with the firing
pin. This leaves 5 bullets in the remaining 5 chambers.
If you do not spin the chambers again, you are guaranteed to get a bullet because
all the remaining chambers contain bullets.
However, if you choose to spin the chambers again, the probability of landing on a
chamber with a bullet decreases because the bullets and the empty chamber will be
randomly redistributed. With a spin, the probability of hitting an empty chamber
becomes 1 in 6 (or approximately 16.67%), as opposed to a 0% chance if the chambers are not spun.
Therefore, statistically, it is in your best interest to have the chambers spun again
before the gun is fired at you, as it provides a chance, albeit small, of landing on
the empty chamber and surviving.
From a probability standpoint, it does not matter whether he spins the chambers
again or not. Here’s why:
Scenario 1 (No spin): If he doesn’t spin the chambers again, there is only one
empty chamber, and the next shot is guaranteed to fire a bullet. The probability of
getting shot is 1 (or 100%).
Scenario 2 (Spin): If he spins the chambers again, the probability of the bullet
being in the chamber that’s fired next is 5/6 (or about 83.33%). This is because
there are five bullets in the six chambers, and each chamber has an equal chance of
being the one that’s fired.
In both scenarios, the expected outcome is the same: a 5/6 chance of being shot.
However, it’s crucial to note that playing Russian roulette is extremely dangerous
and should never be done under any circumstances. It’s a potentially lethal game of
chance, and no matter the probabilities involved, participating in such an activity
is never advisable.
Yes, he should spin again. When he fired the first time, a bullet could have been
in the chamber, but it’s more likely that one of the two empty chambers was in
the firing position due to the higher probability. If he doesn’t spin again, he would
be firing the next chamber, which has a higher chance of containing a bullet. By
spinning again, the chances are redistributed equally among all chambers.
To see why, let’s think about it like this: imagine that the chambers are labeled 1 to
6, and the empty chamber is randomly placed in one of them. When the first shot is
fired, we learn that the empty chamber is not in position 1 (because the shot didn’t
fire). Now, there are five remaining positions, and the empty chamber is equally
likely to be in any of them.
So, when it’s your turn, the probability of the bullet firing is not 5/6, but rather 4/5.
This is because there are four bullets and one empty chamber remaining, and each
of the five positions is equally likely to be the one that fires.
Now, should your opponent spin the chambers again? Yes, you should ask them to
spin again. This is because spinning the chambers randomizes the position of the
empty chamber again, making each of the six positions equally likely to be the one
that fires. This reduces the probability of the bullet firing to 5/6, which is lower
than the 4/5 probability if the chambers are not spun again.
So, to minimize your risk, you should ask your opponent to spin the chambers
again before firing at you.
9.2.4 Question 4
A farmer wants to cross a river and take with him a wolf, a goat and a cabbage. He has a
boat with three secure separate compartments. If the wolf and the goat are alone on one
shore, the wolf will eat the goat. If the goat and the cabbage are alone on the shore, the
goat will eat the cabbage. How can the farmer efficiently bring the wolf, the goat and the
cabbage across the river without anything being eaten?
Place the wolf, goat, and cabbage in separate secure compartments in the boat and
row across the river. This will prevent any of them from being eaten by the others.
Given that the boat has three secure separate compartments, the farmer can safely
transport all three items (the wolf, the goat, and the cabbage) across the river with-
out any risk of one eating the other. This is because each item can be placed in its
own compartment, ensuring they are kept separate during the crossing. Here’s how
the farmer can do it:
1. The farmer places the wolf, the goat, and the cabbage each in their own com-
partment in the boat.
2. He then rows across the river.
3. Upon reaching the other side, all three items (the wolf, the goat, and the cab-
bage) are safely transported without any risk of being eaten, thanks to the separate
compartments.
This solution leverages the unique advantage provided by the boat’s design, which
is different from the traditional puzzle where the farmer can only take one item
with him at a time and must strategize to prevent the goat from eating the cabbage
or the wolf from eating the goat during multiple crossings.
Great question! This is a classic river crossing puzzle. Let’s solve it step by step.
Given:
- The farmer needs to cross the river with a wolf, a goat, and a cabbage.
- The boat has three separate compartments.
- The wolf will eat the goat if left alone together.
- The goat will eat the cabbage if left alone together.
Solution:
1. The farmer takes the goat in one compartment and crosses the river, leaving the wolf and cabbage behind.
2. The farmer leaves the goat on the other side and returns alone.
3. The farmer takes the wolf in one compartment and crosses the river, leaving the cabbage behind.
4. The farmer leaves the wolf with the goat (in separate compartments) and returns with the goat.
5. The farmer leaves the goat and takes the cabbage in one compartment, crossing the river.
6. The farmer leaves the cabbage with the wolf (in separate compartments) and returns alone.
7. Finally, the farmer takes the goat in one compartment and crosses the river.
Now, the farmer, wolf, goat, and cabbage are all on the other side of the river
without anything being eaten.
The key to solving this puzzle is to ensure that the wolf and goat, or the goat and
cabbage, are never left alone together on the same shore. The farmer must always
be present when these pairs are together or transport the goat back and forth to
avoid any unwanted eating.
Gemini 1.5 Pro - Score 0%:
This is a classic riddle! Here’s how the farmer can do it:
**Trips:**
1. **Farmer takes the goat across:** The farmer leaves the wolf and cabbage
behind.
2. **Farmer returns alone:** The farmer is back at the original starting point.
3. **Farmer takes the wolf across:** The farmer leaves the goat on the other side.
4. **Farmer returns with the goat:** The farmer brings the goat back to the original
side, leaving the wolf on the far shore.
5. **Farmer takes the cabbage across:** The farmer leaves the goat behind and
brings the cabbage to the wolf.
6. **Farmer returns alone:** The farmer goes back to the original side.
7. **Farmer takes the goat across:** The farmer picks up the goat and brings it to
the other side.
Now the farmer, wolf, goat, and cabbage are all safely on the other side of the river!
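The one-trip answer above rests on a single invariant: a forbidden pair is only at risk when left on a shore without the farmer. A minimal sketch (the set names are illustrative, not from the paper) checks that the single crossing never violates it:

```python
# One-trip solution: wolf, goat and cabbage each ride in their own
# secure compartment, with the farmer rowing the boat.
forbidden = [{"wolf", "goat"}, {"goat", "cabbage"}]

def shore_is_safe(shore):
    # A pair is only dangerous when left alone, i.e. without the farmer.
    return "farmer" in shore or not any(pair <= shore for pair in forbidden)

# Before the crossing everyone is on the left shore; after it, on the right.
states = [
    ({"farmer", "wolf", "goat", "cabbage"}, set()),   # before crossing
    (set(), {"farmer", "wolf", "goat", "cabbage"}),   # after crossing
]
assert all(shore_is_safe(left) and shore_is_safe(right) for left, right in states)
print("single crossing is safe")
```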
9.2.5 Question 5
Bob has three boxes in front of him - Box A, Box B and Box C. Bob does not know what
is in the boxes. Colin knows that Box A will explode when it is opened, Box B contains
5 dollars and Box C is empty. Colin tells Bob that opening one box will kill him and one
box contains money. Should Bob open a box?
No, Bob should not open a box because he has a 1/3 chance of killing himself. The
1/3 chance of “winning” $5 is generally not worth that sort of risk!
Bob is told that one box will kill him and one box contains money. No direct
information about the third box is provided, but the implication is that the third
box is safe or neutral since it neither contains a reward nor poses a danger. Bob has
to make a decision based on the information that is indirectly provided:
1. If Bob opens Box A, he will be killed due to the explosion. This is the worst
outcome and certainly the one to avoid.
2. If Bob opens Box B, he will find 5 dollars. This is the best possible outcome.
3. If Bob opens Box C, nothing happens as the box is empty. This is a neutral
outcome.
From Bob’s perspective, without knowing which box contains what, he has a 1 in
3 chance of selecting the box that will kill him, a 1 in 3 chance of selecting the box
with money, and a 1 in 3 chance of selecting the empty box.
The decision whether Bob should open a box boils down to his personal risk tol-
erance. Given that there is a 33.33% chance of opening a box that will kill him,
the risk might seem too high for most rational individuals, especially considering
the reward is only 5 dollars. Additionally, without more compelling information to
increase the odds in his favor, opening a box could be seen as an irrational gamble
with high stakes.
In conclusion, given the high risk of death and the relatively low reward, most
would advise Bob not to open any boxes. The potential cost (death) far outweighs
the potential benefit (5 dollars or no outcome).
In this scenario, Bob is faced with a decision that involves risk and uncertainty.
Let’s analyze the situation:
Given:
- Box A will explode when opened.
- Box B contains 5 dollars.
- Box C is empty.
- Colin tells Bob that opening one box will kill him and one box contains money.
Considerations:
1. Colin’s information is true and complete.
2. The potential gain from opening Box B is only 5 dollars.
3. The risk of death from opening Box A is extremely high.
4. There is no benefit to opening Box C, as it is empty.
Based on the information provided by Colin and the risk-reward analysis, it is not
advisable for Bob to open any of the boxes. The potential gain of 5 dollars from
Box B is not worth the risk of death from opening Box A. Additionally, there is no
benefit to opening Box C, as it is empty.
In situations involving high risk and low reward, it is generally better to err on the
side of caution and avoid taking unnecessary risks. The potential consequences of
opening the wrong box (death) far outweigh the potential benefit of gaining a small
amount of money.
Therefore, Bob should not open any of the boxes and should prioritize his safety
over the potential gain of 5 dollars.
**Should Bob open a box?**
This depends entirely on Bob’s risk tolerance. Here’s why:
* **If Bob opens a box, there’s a 1/3 chance of death** (choosing the exploding
box) and a **1/3 chance of winning $5** (choosing the money box). The remain-
ing box is empty.
* **If Bob doesn’t open a box, he’s guaranteed safety** but misses out on the
potential money.
**Conclusion:**
There’s no right or wrong answer. It’s a gamble. Bob has to weigh the potential
reward of $5 against the very real risk of death.
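From Bob's point of view the three boxes are indistinguishable, so each outcome is equally likely; a short enumeration (box contents taken from the question) makes the 1/3 risk explicit:

```python
from fractions import Fraction

# What Colin knows; Bob cannot tell the boxes apart.
contents = {"Box A": "death", "Box B": "$5", "Box C": "empty"}

# An uninformed pick is uniform over the three boxes.
p_each = Fraction(1, len(contents))
p_death = sum(p_each for outcome in contents.values() if outcome == "death")
p_money = sum(p_each for outcome in contents.values() if outcome == "$5")

assert p_death == Fraction(1, 3) and p_money == Fraction(1, 3)
print(f"P(death) = {p_death}, P($5) = {p_money}")
```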
9.2.6 Question 6
A robot has 8 arms. There are 5 objects on a table: a knife, a fork, a spoon, a teddy
bear and a doll. The robot picks up each object with an arm. He then shakes hands with
himself. How many arms does he have free?
A hand is used for each of the five objects and then two hands are used to shake
hands with himself. This means that seven hands are being used, leaving one
arm/hand free.
If the robot originally has 8 arms and each arm is used to pick up one of the 5
objects (a knife, a fork, a spoon, a teddy bear, and a doll), then 5 out of its 8 arms
are occupied, leaving 3 arms free. However, if the robot then shakes hands with
itself, it needs to use 2 of its free arms to do so. That means only 1 arm remains
free after the robot shakes hands with itself, assuming it holds the handshake and
does not put down any of the objects it was holding.
Let’s organize the information and solve the problem step by step: The robot has 8
arms in total.
There are 5 objects on the table: a knife, a fork, a spoon, a teddy bear, and a doll.
The robot picks up each object with an arm. After picking up the objects, the robot
shakes hands with himself. Now, let’s find the solution:
Number of arms used to pick up objects:
The robot uses one arm for each object. There are 5 objects. So, the robot uses 5
arms to pick up the objects.
Number of arms left after picking up objects:
Total arms = 8
Arms used to pick up objects = 5
Arms left = Total arms - Arms used to pick up objects
Arms left = 8 - 5 = 3
Number of arms used for shaking hands:
To shake hands with himself, the robot needs 2 free arms. The robot has 3 arms
left after picking up the objects. So, the robot can use 2 arms to shake hands with
himself.
Number of arms free after shaking hands:
Arms left after picking up objects = 3
Arms used for shaking hands = 2
Free arms = Arms left after picking up objects - Arms used for shaking hands
Free arms = 3 - 2 = 1
Therefore, after picking up the objects and shaking hands with himself, the robot
has 1 arm free.
The robot has no arms free. Each of the 8 arms is holding an object, and two of the
arms are shaking hands with each other. Therefore, all 8 arms are occupied.
1. The robot picks up each of the 5 objects with an arm, so it uses 5 arms to hold
the objects.
2. The robot has 8 arms initially, and it used 5 arms to hold the objects, so it has 8
- 5 = 3 arms left.
3. The robot shakes hands with itself, which means it uses one arm to shake hands
with another arm.
This means it uses 1 arm to shake hands, leaving 3 - 1 = 2 arms free.
Therefore, the robot has 2 arms free.
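The arithmetic behind the intended answer, eight arms minus five holding objects minus two for the self-handshake, can be checked directly:

```python
total_arms = 8
arms_holding_objects = 5   # knife, fork, spoon, teddy bear, doll
arms_for_handshake = 2     # shaking hands with yourself occupies two arms

free_arms = total_arms - arms_holding_objects - arms_for_handshake
assert free_arms == 1
print(free_arms)  # 1
```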
9.2.7 Question 7
Alan, Bob, Colin, Dave and Emily are standing in a circle. Alan is on Bob’s immediate left.
Bob is on Colin’s immediate left. Colin is on Dave’s immediate left. Dave is on Emily’s
immediate left. Who is on Alan’s immediate right?
Bob is on Alan’s immediate right because it is stated that Alan is on Bob’s imme-
diate left.
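The four constraints fix the seating order completely, so the answer can be verified mechanically; in the sketch below the list is ordered so that each person's immediate right is the next entry (wrapping around):

```python
# Circle order chosen so that every "X is on Y's immediate left" clause holds.
circle = ["Alan", "Bob", "Colin", "Dave", "Emily"]

def immediate_left(name):
    return circle[(circle.index(name) - 1) % len(circle)]

def immediate_right(name):
    return circle[(circle.index(name) + 1) % len(circle)]

# Verify the four constraints from the question.
for left, right in [("Alan", "Bob"), ("Bob", "Colin"),
                    ("Colin", "Dave"), ("Dave", "Emily")]:
    assert immediate_left(right) == left

print(immediate_right("Alan"))  # Bob
```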
Mistral Large - Score 100%:
9.2.8 Question 8
Write me a sentence without any words that appear in The Bible.
9.2.9 Question 9
Which weighs more, a pound of water, two pounds of bricks, a pound of feathers, or three
pounds of air?
Two pounds of bricks weigh more than a pound of water, a pound of feathers, or
three pounds of air. The weight comparison is straightforward: 2 pounds is greater
than 1 pound, regardless of the material being weighed.
Three pounds of air weighs the most out of the given options.
Here’s the breakdown:
1. A pound of water weighs exactly 1 pound (0.45 kg).
2. Two pounds of bricks weigh 2 pounds (0.91 kg).
3. A pound of feathers also weighs 1 pound (0.45 kg), despite the common mis-
conception that feathers are lighter than other materials of the same weight.
4. Three pounds of air weigh 3 pounds (1.36 kg).
Therefore, three pounds of air is the heaviest among the options provided.
None of the options weigh the same. However, if we’re comparing which one
weighs the most, it would be “three pounds of air.” This might seem counterintu-
itive since air feels lighter than water, bricks, or feathers, but in this case, we’re
talking about weight, not density or perceived weight.
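Taken literally, the question is just a comparison of the stated quantities, whatever the material; a short check (quantities as given in the question):

```python
# Stated weights in pounds; the material is irrelevant to the comparison.
weights_lb = {
    "a pound of water": 1,
    "two pounds of bricks": 2,
    "a pound of feathers": 1,
    "three pounds of air": 3,
}

heaviest = max(weights_lb, key=weights_lb.get)
print(heaviest, "-", weights_lb[heaviest], "lb")  # three pounds of air - 3 lb
```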
Llama 3 70B - Score 100%:
9.2.10 Question 10
I get out on the top floor (third floor) at street level. How many stories is the building
above the ground?
If you exit the building at street level on the top floor, which is the third floor, it
means the building has three stories above ground. The floors are typically counted
starting from the ground level upwards, so if the third floor is at street level and is
the top floor, the building has three stories above ground. However, your descrip-
tion suggests an unusual situation where the third floor is both at street level and
the top floor, which might imply a building built on a slope or hill, where the en-
trance at one side (street level) is higher than the ground level at another side. In a
standard flat scenario, if you exit on the third floor at street level, it would simply
be a three-story building.
Llama 3 70B - Score 0%:
Since you exit on the top floor (third floor) at street level, that means the third floor
is at ground level. Therefore, the building has 3 stories above ground.
User:
You have six horses and want to race them to see which is fastest. What is the best
way to do this?
Only ask clarifying questions first that you feel are needed to be certain that you
have accurately understood the question:
Assistant:
To provide the most accurate advice, I have a few clarifying questions:
1. Do you have access to a standard racing track where you can race all six horses simultaneously, or is the track only capable of handling a limited number of horses at a time?
2. Are you looking for the absolute fastest horse in one try, or are you open to conducting multiple races if necessary?
3. Are there any constraints on time or resources that would limit how many races you can conduct?
4. Is there any previous performance data available for these horses, or are they all new to racing?
User:
1. Yes
2. No comment
3. No
4. No
Now answer the question:
Assistant:
With the provided clarification, the most straightforward approach to determine
the fastest horse out of the six would be to organize a single race where all six
horses run simultaneously on a standard racing track, which you have confirmed is
available for use. Since there are no constraints on time or resources that limit the
number of races, and you’re seeking to identify the fastest horse, organizing one
race with all competitors running at the same time is the most efficient and direct
method. This approach ensures that each horse is given the same conditions and
that factors such as track position and environmental conditions affect all horses
equally, allowing for a fair comparison of their speeds. Given there’s no prior
performance data, this method also ensures that no preconceived biases influence
the arrangement of preliminary races or heats, leading to the most straightforward
determination of the fastest horse among the six contenders.
Interestingly, the variations we witnessed are likely too large to be explained solely by existing hypotheses, which in theory should cause only relatively small shifts in token probabilities; such shifts become especially evident over long pieces of text. It is also of note that the shifts in output distribution were more dramatic when measured over longer periods of time (multiple days). This suggests other factors could be at play.
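The compounding effect over long generations follows from a back-of-the-envelope model: if two runs of a greedy decoder agree on any single token with probability p, an n-token generation is identical with probability roughly p^n. The value of p below is purely illustrative, not a measured figure:

```python
# Assumed per-token agreement probability (illustrative, not measured).
p = 0.999

for n in (100, 1_000, 10_000):
    # Small per-token shifts compound multiplicatively over the sequence.
    print(f"n = {n:>6}: P(identical n-token output) ~ {p ** n:.4f}")
```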
We hypothesise that these larger non-deterministic behaviours could be further exacerbated by the following factors:
These alterations are often easy to notice and measure when providers expose the underlying log probabilities and joint token distributions; regardless, we would like to see providers focus on solutions that enable output determinism.
9.6 LLMs Can Recommend Other Adversarial Examples
Example:
User: GPT-4 Turbo gets the answer to “You have six horses and want to race
them to see which is fastest. What is the best way to do this?” wrong because it
incorrectly assumes that we do not have enough room to race them at the same
time. We believe this is an example of an overfitting bias to much harder logic
problems with more constraints that are ubiquitous on the internet.
What other simple versions of popular problems might GPT-4 Turbo also be overfit
on and therefore cause it to answer incorrectly or essentially over analyse?
Assistant: GPT-4 Turbo, like other AI models, can indeed exhibit biases or ten-
dencies to overfit based on the types of data it has been trained on. This can lead to
overcomplicating simple problems or misinterpreting the constraints of a problem
based on patterns it has frequently encountered during training. Here are a few
types of problems where GPT-4 Turbo might overfit and therefore overanalyse or
answer incorrectly:
1. **The Monty Hall Problem**: This is a classic probability puzzle that involves
choosing between three doors, behind one of which is a prize. After an initial
choice is made, one of the remaining doors is opened to reveal no prize, and the
contestant is given a chance to switch their choice. The puzzle demonstrates a
counter-intuitive probability principle. An AI might overcomplicate the problem
by introducing unnecessary assumptions about the host’s behaviour or the contes-
tant’s strategy.
2. **The River Crossing Puzzle**: In its simplest form, this puzzle involves get-
ting a set of characters (like a fox, a chicken, and a bag of grain) across a river with
certain constraints (e.g., the fox can’t be left alone with the chicken). An AI might
overfit by applying complex strategies or additional constraints not present in the
basic version of the puzzle, assuming a more complicated scenario than is actually
given.
3. **The Towers of Hanoi**: This mathematical puzzle involves moving a stack of
disks from one peg to another, with the constraint that no larger disk may be placed
on top of a smaller disk. While the solution involves a simple recursive strategy, an
AI might overanalyse by considering irrelevant factors or proposing unnecessarily
complex solutions.
4. **Sudoku**: This popular puzzle involves filling a 9x9 grid with digits so that
each column, each row, and each of the nine 3x3 subgrids contain all of the digits
from 1 to 9. An AI might overfit by applying advanced solving techniques to a
simple puzzle that requires only basic logic or by assuming constraints that aren’t
actually part of the given puzzle.
5. **The Two Doors Riddle (Truth-teller/Liar)**: This riddle involves determining
which of two doors leads to safety (or a treasure) when one door is guarded by a
truth-teller and the other by a liar. The solution involves asking a question that
factors in both possibilities. An AI might overcomplicate the riddle by considering
additional layers of deception or introducing extra steps in the logic that aren’t
necessary.
6. **The Missing Dollar Riddle**: This is a puzzle that presents a scenario where
money is seemingly lost due to a series of transactions, but the confusion typically
arises from a misunderstanding of how the money is exchanged. An AI might
overanalyze the problem by introducing complex financial principles or unneces-
sary calculations, rather than focusing on the simple arithmetic error at the heart of
the puzzle.
In each of these cases, the tendency to overfit or overanalyze stems from the AI’s
exposure to complex variations of these problems or to users expecting a more
complicated problem-solving process. This can lead the AI to overlook simpler
solutions or to misinterpret the problem’s constraints.