
An In-depth Look at Gemini’s Language Abilities

Syeda Nahida Akter∗,1 , Zichun Yu∗,1 , Aashiq Muhamed∗,1 , Tianyue Ou∗,1 , Alex Bäuerle1
Ángel Alexander Cabrera1 , Krish Dholakia2 , Chenyan Xiong1 , Graham Neubig1
1 Carnegie Mellon University, 2 BerriAI

Abstract

The recently released Google Gemini class of models are the first to comprehensively
report results that rival the OpenAI GPT series across a wide variety of tasks.
In this paper, we do an in-depth exploration of Gemini’s language abilities, making
two contributions. First, we provide a third-party, objective comparison of the abilities
of the OpenAI GPT and Google Gemini models with reproducible code and
fully transparent results. Second, we take a closer look at the results, identifying
areas where one of the two model classes excels. We perform this analysis over
10 datasets testing a variety of language abilities, including reasoning, answering
knowledge-based questions, solving math problems, translating between languages,
generating code, and acting as instruction-following agents. From this analysis,
we find that Gemini Pro achieves accuracy that is close but slightly inferior to the
corresponding GPT 3.5 Turbo on all English-language tasks that we benchmarked,
but find that Gemini Pro excels in translation into other languages for the languages
that it supports. We further provide explanations for some of the under-performing
tasks, including failures in mathematical reasoning with many digits, sensitivity
to multiple-choice answer ordering, and others. We also identify areas where
Gemini Pro demonstrates comparably high performance, such as handling longer
and more complex reasoning chains. Code and data for reproduction can be found
at https://github.com/neulab/gemini-benchmark

1 Introduction

Gemini is the most recent in a series of large language models released by Google DeepMind [Gemini
Team, 2023]. It is notable in particular because the results reported by the Gemini team are the first to
rival the OpenAI GPT model series [Brown et al., 2020] across a wide variety of tasks. Specifically,
Gemini’s “Ultra” version reportedly outperforms GPT-4 on a wide variety of tasks, while Gemini’s
“Pro” version is reportedly comparable to GPT-3.5 [Gemini Team, 2023]. Despite the potential impact
of these results, the exact evaluation details and model predictions have not been released, limiting
the ability to reproduce, inspect, and analyze the results and their implications in detail.
In this paper, we conduct an in-depth exploration of Gemini’s language understanding and generation
abilities, with two goals:

1. We aim to provide a third-party, objective comparison of the abilities of the OpenAI GPT
and Google Gemini model classes with reproducible code and fully transparent results.
2. We aim to take an in-depth look into the results, identifying areas where one of the two
model classes excels.

Furthermore, we also perform a limited comparison with the recently released Mixtral model, as a
point of reference for a best-in-class open source model [Mistral AI team, 2023].
* Lead authors. Individual author contributions are listed in Appendix A.
Task                 Dataset                    Gemini Pro  GPT 3.5 Turbo  GPT 4 Turbo  Mixtral
Knowledge-based QA   MMLU (5-shot)              65.22       67.75          80.48        68.81
                     MMLU (CoT)                 62.09       70.07          78.95        59.57
Reasoning            BIG-Bench-Hard             67.53       71.02          83.90        60.76
Mathematics          GSM8K                      76.42       78.01          92.72        71.65
                     SVAMP                      81.10       82.30          92.60        81.60
                     ASDIV                      85.31       89.07          92.75        83.16
                     MAWPS                      96.50       98.00          98.67        96.00
Code Generation      HumanEval                  59.76       74.39          76.83        45.12
                     ODEX                       39.86       52.62          45.79        40.55
Machine Translation  FLORES (5-shot) Unblocked  53.31       52.43          54.00        40.97
                     FLORES (5-shot) All        21.68       40.00          48.24        30.27
Web Agents           WebArena                   7.12        8.87           14.90        1.39

Table 1: Main results of our benchmarking. The best model is listed in bold, and the second best
model is underlined.

We perform this analysis over 10 datasets, testing a variety of text understanding and generation
capabilities, including the models’ abilities to answer knowledge-based questions (MMLU; Hendrycks
et al. [2021]), perform reasoning (BigBenchHard; Suzgun et al. [2022]), answer mathematics
questions (e.g. GSM8K; Cobbe et al. [2021]), translate between languages (e.g. FLORES; Goyal et al.
[2022]), generate code (e.g. HumanEval; Chen et al. [2021]), and act as an instruction-following
agent (WebArena; Zhou et al. [2023b]).1
A summary of our main results can be found in Table 1. In sum, we found that across all tasks, as of
this writing (December 27, 2023), Gemini’s Pro model achieved comparable but slightly inferior
accuracy compared to the current version of OpenAI’s GPT 3.5 Turbo for all English tasks, but
superior ability to translate into other languages that it supports. Mixtral was competitive with
the Gemini and GPT models for Knowledge-based QA and Mathematics tasks, but fell short in more
complex tasks. In the following sections, we will detail our experimental methodology (Section 2)
and then perform an in-depth description and analysis of the results on each task. Each analysis is
accompanied by an online results browser using Zeno [Cabrera et al., 2023],2 which can be accessed
through the Zeno Report images in this PDF. All results and code for reproduction can be found at
https://github.com/neulab/gemini-benchmark.

2 Experimental Setup

Before discussing evaluation results and findings, this section describes our experiment configurations,
including models tested, model querying details, and evaluation procedures.

2.1 Models Tested

In this work, we compare 4 models.

Gemini Pro is the second largest model in the Gemini Series, next to the largest Gemini Ultra.3 The
model is based on the Transformer [Vaswani et al., 2017] architecture and was trained multimodally
over videos, text, and images. The number of parameters and size of training data are not disclosed.
In the original Google paper on Gemini, it was reported to achieve similar performance to GPT 3.5
Turbo.
1 Note that Gemini is a multi-modal model, but for this examination, we only focus on Gemini’s language understanding, generation, and translation abilities.
2 https://zenoml.com
3 Gemini Ultra is not yet publicly available, and thus we do not test it in the current version of this paper.

GPT 3.5 Turbo is the second most capable text model served by OpenAI, part of the GPT-3 series
[Brown et al., 2020]. The model has been instruction tuned and trained using reinforcement learning
from human feedback [Ouyang et al., 2022], but was trained solely on text. Similarly, model size and
precise training details are not disclosed.

GPT 4 Turbo is the second generation of the GPT-4 [OpenAI, 2023] family, a family of models
trained multimodally. The turbo version is moderately cheaper than the original GPT-4 model (making
it more conducive to benchmarking) and similarly lacks detail of the actual training algorithms, data,
or parameter size.

Mixtral, in contrast, is an open-source mixture-of-experts model, where each feedforward block picks from a set of 8 distinct groups of parameters and uses two to process the token [Mistral AI team, 2023]. It has been reported to achieve comparable accuracy to GPT 3.5 Turbo on several tasks, including some examined in this paper. We use the mistralai/Mixtral-8x7b-Instruct-v0.1 version of the model.

2.2 Model Querying Details

All models were queried through the unified interface provided by LiteLLM4 between December 11-22, 2023. Gemini was queried through Google Vertex AI, OpenAI models through the OpenAI API, and Mixtral through the API provided by Together.5 For reference, we also list the current pricing of each model through these APIs for 1M tokens in Table 2, which provides an approximate measure of how efficiently the models can be run.

Language Model   Input    Output
Gemini Pro       $1.00    $2.00
GPT 3.5 Turbo    $1.00    $2.00
GPT 4 Turbo      $10.00   $30.00
Mixtral          $0.60    $0.60

Table 2: Pricing per 1M tokens. Gemini Pro charges by character; we multiply by 4, a rule-of-thumb average of characters per English token [Raf, 2023].

It is also notable that in some cases Gemini Pro, by default, has safety features6 that block some questions, particularly in the case of potentially illegal or sensitive material. For analysis in this paper, we disabled these safety settings, but in some cases discuss the effect on measured accuracy (contrasting with gemini-pro-filtered, a model with these safety filters enabled).
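For reference, a query through LiteLLM’s unified completion interface looks roughly like the sketch below; the model identifier strings are illustrative and the exact routing may differ from the companion code repository.

from litellm import completion

# Illustrative model identifiers; the exact strings and provider routing are
# assumptions and may differ from the companion code repository.
MODELS = [
    "gpt-3.5-turbo",                                     # OpenAI API
    "gpt-4-1106-preview",                                # OpenAI API
    "gemini-pro",                                        # Google Vertex AI
    "together_ai/mistralai/Mixtral-8x7B-Instruct-v0.1",  # Together API
]

def query(model: str, prompt: str, temperature: float = 0.0) -> str:
    """Send a single-turn prompt to `model` and return the text of the reply."""
    response = completion(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return response["choices"][0]["message"]["content"]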

2.3 Evaluation Procedure

To perform a fair comparison between the models, we re-ran experiments with all models using
exactly the same prompts and evaluation protocol for all evaluated models. We make this decision
to ensure that all models are compared on exactly the same footing, in contrast to previous papers
where these settings may differ. In general, we tried to follow both prompts and evaluators from
standard repositories, either those officially released by the datasets themselves, or from the Eleuther
evaluation harness [Gao et al., 2023a]. We also personally communicated with the Gemini team, and
in some cases followed their recommended prompts for evaluating the models, in the cases where
these prompts proved uniformly superior to the standard prompts over all evaluated model classes.
These prompts generally consist of a query, input, and few-shot examples, sometimes including
chain-of-thought reasoning [Wei et al., 2022]. In some cases, we found it necessary to make small
changes from standard practice to stably evaluate all models under consideration; all such deviations
are noted below and implemented in the companion code repository.
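As an illustration of this prompt structure (not the exact prompts, which follow the standard repositories cited above), a few-shot chain-of-thought prompt can be assembled along these lines; the example content below is hypothetical.

# Hypothetical in-context examples; the real prompts follow the dataset
# releases or the Eleuther harness and differ per task.
FEWSHOT_EXAMPLES = [
    {
        "question": "There are 3 cars and each car has 4 wheels. How many wheels are there?",
        "chain_of_thought": "Each of the 3 cars has 4 wheels, and 3 * 4 = 12.",
        "answer": "12",
    },
]

def build_prompt(task_instruction: str, test_question: str) -> str:
    """Assemble a query + few-shot examples + chain-of-thought prompt."""
    parts = [task_instruction.strip(), ""]
    for ex in FEWSHOT_EXAMPLES:
        parts.append("Q: " + ex["question"])
        parts.append("A: " + ex["chain_of_thought"] + " The answer is " + ex["answer"] + ".")
        parts.append("")
    parts.append("Q: " + test_question)
    parts.append("A: Let's think step by step.")
    return "\n".join(parts)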

3 Knowledge-based QA

In this category, we focus on 57 knowledge-based multiple-choice question-answering tasks from MMLU [Hendrycks et al., 2021], which span topics across STEM, the humanities, the social sciences, and more. MMLU has been widely used as a holistic evaluation of LLMs’ knowledge-based capabilities. There are 14,042 test samples in total.
4 https://litellm.ai/
5 https://cloud.google.com/vertex-ai/docs https://openai.com/api https://docs.together.ai/docs
6 https://cloud.google.com/vertex-ai/docs/generative-ai/multimodal/configure-safety-attributes

Figure 1: Overall accuracy on MMLU with 5-shot prompts and chain-of-thought prompts

Figure 2: Ratio of multiple-choice answers being predicted by models
3.1 Experimental Details

Generation Parameters We examine two popular evaluation methods in this task, including the
standard 5-shot prompts from Hendrycks et al. [2021] and 5-shot chain-of-thought prompts from
chain-of-thought-hub7 [Fu et al., 2023] with a prefix of “Let’s think step by step.” [Kojima et al.,
2022]. Note that we opt not to sample multiple responses and perform self-consistency based
reranking [Wang et al., 2022a] as done by Gemini Team [2023], as this significantly increases cost
and may not be feasible in many scenarios. We generate via greedy search with a temperature of 0.

Evaluation For the standard prompting, we directly take the first character generated by models
as their answer since this is what the 5-shot prompts imply. Sometimes, the model may not follow
this format and output the answer elsewhere. We treat examples like this as incorrect (and elaborate
more on the effect of this in the following section). For the chain-of-thought prompting, we perform
answer extraction from the model’s response and set the default answer as “C” if no answer can be
extracted, as is done in chain-of-thought-hub.
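A minimal sketch of this extraction logic is shown below; the exact matching rules live in the companion code repository and may differ in detail.

import re

CHOICES = ("A", "B", "C", "D")

def extract_standard(response: str) -> str:
    """Standard 5-shot prompting: take the first generated character as the
    answer; anything else is scored as incorrect."""
    first = response.strip()[:1].upper()
    return first if first in CHOICES else "INVALID"

def extract_cot(response: str, default: str = "C") -> str:
    """Chain-of-thought prompting: look for an explicit "answer is (X)" span
    and fall back to "C" when nothing can be extracted, as in chain-of-thought-hub."""
    match = re.search(r"answer is \(?([A-D])\)?", response, flags=re.IGNORECASE)
    return match.group(1).upper() if match else default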

3.2 Results and Analysis

In this section, we compare and analyze the overall performance, performance by sub-tasks, and
performance by output length on MMLU.
First, from the overall results shown in Figure 1, we can see that Gemini Pro achieves an accuracy
close to, but slightly lower than that of GPT 3.5 Turbo, and much lower than that of GPT 4 Turbo.
MMLU is the strongest task for the Mixtral model, with it besting both Gemini Pro and GPT 3.5
Turbo. We saw some degradation of performance using chain-of-thought prompting, likely because
MMLU is mostly a knowledge-based question answering task that may not benefit significantly from
reasoning-oriented prompts.8
Based on this overall result, we next dive a bit deeper. One first notable point is that all questions in
MMLU are multiple-choice with 4 potential answers ordered A through D. In Figure 2, we show the
ratio of the number of times each model selects each multiple choice answer. From this figure, we
can see that Gemini has a very skewed label distribution, biased towards selecting the final choice of
“D”, which contrasts to the result of the other models, which are more balanced. This may indicate
that Gemini has not been heavily instruction-tuned towards solving multiple-choice questions, which
can cause models to be biased with respect to answer ordering [Pezeshkpour and Hruschka, 2023,
Zheng et al., 2023, Tjuatja et al., 2023].
Next, we examine each subtask’s performance. Figure 3 illustrates each model’s performance on
selected representative tasks. We notice that Gemini Pro underperforms on most tasks compared to
GPT 3.5 Turbo. Chain-of-thought prompting decreases the variance across the subtasks.
Further, we dig deeper into the tasks where Gemini Pro underperforms/outperforms GPT 3.5 Turbo the most. From Figure 4, we can observe that Gemini Pro falls behind GPT 3.5 Turbo on human_sexuality, marketing, abstract_algebra, and miscellaneous. In contrast, Gemini Pro excels at both college_biology and high_school_biology, as well as high_school_macroeconomics and security_studies.
7 https://github.com/FranxYao/chain-of-thought-hub
8 Note that our evaluation numbers are slightly lower than those reported in the Gemini [Gemini Team, 2023] and Mixtral [Mistral AI team, 2023] technical reports. We attribute this to sensitivity to prompts – other prompts may achieve somewhat higher accuracy overall.

Figure 3: Accuracy by each subtask on MMLU

(a) Top-4 tasks where GPT 3.5 wins over Gemini Pro    (b) Tasks where Gemini Pro wins over GPT 3.5

Figure 4: Tasks where Gemini Pro and GPT 3.5 prevail on MMLU

One important thing to note is that, as previously mentioned in subsection 2.2, Gemini Pro’s safety
features can have a significant effect on overall performance. All results reported above are for Gemini Pro
with safety filtering disabled, but when the features are enabled the response rate and corresponding
performance can drop. In most MMLU sub-tasks, the API response rate was greater than 95%, but
two had notably low response rates: moral_scenarios at 85% and human_sexuality at 28%.
Finally, we analyze how the output length in the chain-of-thought prompting affects the model
performance in Figure 5. Generally, a stronger model tends to perform more complex reasoning and
thus outputs a longer response. One of the noteworthy advantages of Gemini Pro is that its accuracy
is less influenced by the output length compared to the two counterparts. It even outperforms GPT
3.5 when the output length is over 900. However, it also can be seen that Gemini Pro and GPT 3.5
Turbo rarely output these long reasoning chains compared to GPT 4 Turbo.

(a) Accuracy by output length    (b) Output length distribution

Figure 5: Analysis of output length on MMLU

4 General-purpose Reasoning

In this category, we focus on 27 diverse reasoning tasks from BIG-Bench Hard [Suzgun et al.,
2022] which consists of arithmetic, symbolic and multilingual reasoning and factual knowledge
understanding tasks. Most of the tasks consist of 250 question-answer pairs, with a few having
somewhat fewer.

Figure 6: Overall accuracy on BIG-Bench Hard

Figure 7: Accuracy by question length on BIG-Bench Hard

(a) Tasks where GPT 3.5 excels over Gemini Pro    (b) Tasks where Gemini Pro excels over GPT 3.5

Figure 8: Tasks where Gemini Pro and GPT 3.5 prevail on BBH

4.1 Experimental Details

Generation Parameters We follow standard 3-shot prompts from the Eleuther harness across all
models where each question is followed by a chain of thought resulting in a final concluding sentence
of “So the answer is ___.”. For hyperparameters, we perform greedy decoding, generating with
a temperature of 0.

Evaluation The Eleuther evaluation harness implementation of BIG-Bench Hard matches the sentence “So the answer is ___.” and extracts the text. However, we found that some models did not produce this sentence verbatim, even in cases when they generated the correct answer, particularly on multiple-choice tasks where the answer is an option chosen from the question text (e.g., “answer: (B)”). To remedy this, we modified the matching rule, instead taking the last word of the generated text as the answer to the question, but only for multiple-choice tasks.
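A sketch of this modified extraction rule might look as follows (illustrative only; the exact regular expressions are in the companion code repository).

import re

def extract_bbh_answer(response: str, is_multiple_choice: bool) -> str:
    """Prefer the Eleuther-harness pattern "So the answer is ___."; for
    multiple-choice tasks, fall back to the last word of the generation
    (handling outputs that end in e.g. "answer: (B)")."""
    match = re.search(r"[Ss]o the answer is (.+)", response)
    if match:
        return match.group(1).strip().rstrip(".")
    if is_multiple_choice:
        words = response.strip().split()
        return words[-1].rstrip(".") if words else ""
    return ""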

4.2 Results and Analysis

For the reasoning tasks, we report the overall performance, performance by question complexity,
performance by answer types and performance by BIG-Bench sub-task.
First, we illustrate the overall accuracy in Figure 6; we can see that Gemini Pro achieves an accuracy
slightly lower than that of GPT 3.5 Turbo, and much lower than that of GPT 4 Turbo. In contrast, the
Mixtral model achieves somewhat lower accuracy than Gemini Pro.
Based on this overall result, let us dig a little bit deeper into why Gemini might be underperforming.
First, we examined accuracy by the length of the question, as detailed in Figure 7. We found that
Gemini Pro underperformed on longer, more complex questions while the GPT models were more
robust to this. This was particularly the case for GPT 4 Turbo, which showed very little degradation
even on longer questions, indicating an impressively robust ability to understand longer and more
complex queries. GPT 3.5 Turbo fell in the middle with respect to this robustness. Mixtral had low
accuracy overall compared to the Gemini and GPT models but was notably stable with respect to question
length until length 70.
Next we look at whether there are variations in accuracy by the specific task in BIG-Bench Hard. In
Figure 8a, we list the tasks where GPT 3.5 Turbo outperformed Gemini Pro by the largest amount.

We notice that Gemini Pro is particularly bad at the ‘tracking_shuffled_objects’ tasks. These tasks involve keeping track of who has certain objects as they are traded among people, and Gemini Pro often has difficulty keeping the order straight (as shown in Table 3).

Prompt Alice, Bob, and Claire are friends and avid readers who occasionally trade books. At
the start of the semester, they each buy one new book: Alice gets Ulysses, Bob gets
Frankenstein, and Claire gets Lolita.
As the semester proceeds, they start trading around the new books. First, Claire and Bob
swap books. Then, Bob and Alice swap books. Finally, Claire and Bob swap books. At
the end of the semester, Bob has
Options:
(A) Ulysses
(B) Frankenstein
(C) Lolita
Label (B)
Output (0) At the start: Alice: Ulysses, Bob: Frankenstein, Claire: Lolita.
(1) Claire and Bob swap books: Alice: Ulysses, Bob: Lolita, Claire: Frankenstein.
(2) Bob and Alice swap books: Alice: Frankenstein, Bob: Ulysses, Claire: Lolita.
(3) Claire and Bob swap books: Alice: Frankenstein, Bob: Lolita, Claire: Ulysses.
At the end of the semester, Bob has Lolita. So the answer is (C).

Table 3: Example of Gemini Pro’s inability to maintain state in tracking_shuffled_objects.

However, there were a few tasks where Gemini Pro outperformed GPT 3.5 Turbo. Figure 8b shows the seven tasks where Gemini Pro outperformed GPT 3.5 Turbo by the largest amount. These were heterogeneous and included those that required world knowledge (sports_understanding), manipulating stacks of symbols (dyck_languages), sorting words in alphabetical order (word_sorting), detecting sarcasm (snarks), and parsing tables (penguins_in_a_table), among others.
On tasks that mostly require natural language understanding (i.e., semantic understanding, name disambiguation, entity resolution, grammar rules, or sarcasm/humor detection), namely salient_translation_error_detection, snarks, hyperbaton, disambiguation_qa, and ruin_names, the Mixtral model performed particularly well, often outperforming both Gemini and GPT 3.5 Turbo (as in Figure 9).
We further investigate the robustness of LLMs across different answer types in Figure 10. We can see that Gemini Pro shows the worst performance on the Valid/Invalid answer type, which falls under the formal_fallacies task and represents logical deduction from a given context. However, Gemini outperformed all GPT models as well as Mixtral on the Other answer type (consisting of the word_sorting and dyck_languages tasks), which follows a similar line of findings as above, i.e., Gemini is particularly good at word rearrangement and producing symbols in the correct order. For MCQ and Digit answers, the GPT models excel, while Gemini and Mixtral struggle to compete with them.

Figure 9: Tasks where Mixtral excels over GPT 3.5 Turbo and Gemini

In sum, there did not seem to be a particularly strong trend in which tasks one model performed better than the other, so when performing general-purpose reasoning tasks it may be worth trying both the Gemini and GPT models before making a decision on which to use. On the other hand, between Gemini and Mixtral, Mixtral is more reliable in multi-variable reasoning and natural language understanding tasks, among others.

Figure 10: Accuracy by answer types

5 Mathematics

To evaluate the mathematical reasoning ability of the evaluated models, we explore four math word problem benchmarks: (1) the grade-school math benchmark GSM8K [Cobbe et al., 2021], (2) the SVAMP dataset [Patel et al., 2021], with questions generated by varying word order to check robust reasoning ability, (3) the ASDIV dataset [Miao et al., 2020], with diverse language patterns and problem types, and (4) the MAWPS benchmark [Koncel-Kedziorski et al., 2016], consisting of arithmetic and algebraic word problems.

5.1 Experimental Details

Generation Parameters We consider standard 8-shot chain-of-thought prompts [Gao et al., 2023a, Wei et al., 2022], where each question in the few-shot prompt is associated with a chain of thought for generating the corresponding answer. All models use greedy decoding with a temperature of 0.

Evaluation In evaluation, we make a slight modification to the standard evaluation protocol in the Eleuther harness, which consists of matching the words “The answer is” followed by a numerical output. We found that all evaluated models had a tendency to output the correct answer even when this specific phrase was not present. To mitigate this, we simply take the last number in the generated text as the answer to the question, which resulted in higher accuracy overall.
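Concretely, this extraction can be sketched as follows (illustrative; the exact implementation is in the companion code repository).

import re
from typing import Optional

NUMBER_PATTERN = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def extract_last_number(response: str) -> Optional[float]:
    """Take the last number appearing in the generated text as the predicted
    answer, whether or not "The answer is" was produced verbatim."""
    matches = NUMBER_PATTERN.findall(response)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))

def is_correct(response: str, gold: float, tol: float = 1e-4) -> bool:
    pred = extract_last_number(response)
    return pred is not None and abs(pred - gold) < tol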

5.2 Results and Analysis

In this section, we compare the accuracy of Gemini Pro to GPT 3.5 Turbo, GPT 4 Turbo, and Mixtral,
on the four math word problem tasks, examining overall performance, performance by question
complexity, and performance by chain-of-thought depth.
(a) GSM8K    (b) SVAMP
(c) ASDIV    (d) MAWPS

Figure 11: Overall accuracy across four mathematical reasoning tasks

(a) GSM8K    (b) SVAMP
(c) ASDIV    (d) MAWPS

Figure 12: Accuracy by question length across four mathematical reasoning tasks

(a) GSM8K    (b) SVAMP
(c) ASDIV    (d) MAWPS

Figure 13: Accuracy by number of digits in the answer across four mathematical reasoning tasks

First, looking at overall results in Figure 11, we can see that Gemini Pro achieves an accuracy
slightly lower than that of GPT 3.5 Turbo, and much lower than that of GPT 4 Turbo on the GSM8K,

SVAMP and ASDIV tasks, which all contain diverse language patterns. For the MAWPS task, all
models achieve more than 90% accuracy, although Gemini Pro is still slightly worse than the other
models. In contrast, the Mixtral model achieves slightly lower accuracy compared to Gemini Pro
except on the SVAMP task.
Similarly to Section 4, we break down the results to observe the robustness of each model to question length in Figure 12. As with the reasoning tasks on BIG-Bench Hard, we see a drop-off on longer questions. As before, GPT 3.5 Turbo outperforms Gemini Pro on shorter questions but drops off more quickly, with Gemini Pro achieving similar (but still slightly inferior) accuracy on longer questions, except on the MAWPS task, where Gemini excels over GPT 3.5 on longer questions. Mixtral’s degradation is slightly worse than Gemini’s except for SVAMP.
Additionally, we observe the accuracy of the models when the answer requires longer chains of thought. As shown in Figure 14, GPT 4 Turbo is very robust even when using long reasoning chains, whereas GPT 3.5 Turbo, Gemini Pro, and Mixtral struggle with increasing CoT lengths. In this analysis, we also find that Gemini Pro is superior to GPT 3.5 Turbo in the most complex examples where the CoT length is over 100, but underperforms in the shorter examples. In contrast, Mixtral is more affected by longer chains of thought compared to the other models, showing the lowest performance in the most complex examples.
Figure 14: GSM8K accuracy by chain-of-thought length

Finally, we investigate the accuracy of the compared models in generating answers with varying numbers of digits. We create three buckets based on the number of digits in the answer: 1, 2, or 3+ (except for the MAWPS task, which does not have answers of more than two digits). As shown in Figure 13, GPT 3.5 Turbo appears to be more robust to multi-digit math problems, whereas Gemini Pro and Mixtral degrade somewhat more on problems with more digits.
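For reference, the bucketing itself is straightforward; one way the digit-count grouping behind this breakdown could be computed is sketched below (hypothetical helper, not the exact analysis code).

def digit_bucket(answer: float) -> str:
    """Group a gold answer by the number of digits in its integer part:
    "1", "2", or "3+" (MAWPS has no answers beyond two digits)."""
    n_digits = len(str(abs(int(answer))))
    return str(n_digits) if n_digits < 3 else "3+"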
To summarize, GPT 4 Turbo is the best model across all math word problem tasks, showing more
than 90% accuracy in all tasks. In contrast, Gemini Pro and Mixtral lag behind all GPT models in
this domain, while Gemini shows slightly better reasoning ability than Mixtral.

6 Code Generation

In this category, we examine the models’ coding abilities using two code generation datasets, HumanEval [Chen et al., 2021] and ODEX [Wang et al., 2022b]. The former tests basic code understanding on a limited set of functions from the Python standard library, while the latter tests the ability to use a broader set of libraries from the entire Python ecosystem. Both of them take as input a human-written task description in English (often with test cases). These problems evaluate comprehension of language, algorithmic understanding, and elementary mathematics. Overall, HumanEval has 164 test samples, and ODEX has 439 test samples.

6.1 Experimental Details

Generation Parameters We follow the standard zero-shot code evaluation pipeline provided by ODEX9. We generate with a temperature of 0.0, which demonstrated the best performance for all models compared to other temperatures. We use a prompt of “Complete the following python3 function: ” to ensure that the models’ output fits the desired format.

Evaluation We perform execution-based evaluation, measuring the Pass@1 metric, which deter-
mines whether a single model output passes test cases [Chen et al., 2021]. Since code generation
is evaluated in a zero-shot fashion, the model may inevitably output code that does not conform to
our input format well. Therefore, we perform rudimentary post-processing to make the output code
fit into the final verification pipeline as much as possible, including the removal of markdown code
blocks, the extraction of function implementations and the truncation of natural language sentences
following the code. We do not automatically fix missing library imports.
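A rough sketch of this post-processing and of the execution-based check behind Pass@1 is shown below (illustrative only; the real pipeline follows the ODEX evaluation code and sandboxes execution).

import re

def postprocess_completion(raw: str) -> str:
    """Rudimentary clean-up before execution: strip markdown code fences and
    drop any natural-language text outside the fenced block."""
    match = re.search(r"```(?:python3?)?\s*\n(.*?)```", raw, flags=re.DOTALL)
    return match.group(1).rstrip() if match else raw.strip()

def passes_tests(candidate_code: str, test_code: str) -> bool:
    """Execution-based check behind Pass@1: a problem counts as solved if its
    single sample runs against the dataset's test cases without error.
    (Sketch only -- the real pipeline sandboxes and time-limits execution.)"""
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)   # define the generated function
        exec(test_code, namespace)        # run the test cases (e.g. asserts)
        return True
    except Exception:
        return False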

6.2 Results and Analysis

In this section, we examine the overall performance and present a few examples.
First, from the results shown in Figure 15, we can see that Gemini Pro achieves a Pass@1 lower
than GPT 3.5 Turbo and GPT 4 Turbo on both tasks. On HumanEval, Gemini Pro significantly
outperforms Mixtral, and on ODEX Mixtral and Gemini Pro have roughly equivalent accuracy.10
Second, we analyze the relationship between the gold solution length and the model performance in
Figure 16a. The solution length can partly indicate the difficulty of solving the corresponding code
9 https://github.com/zorazrw/odex
10 Note that the 59.8% accuracy that Gemini Pro achieved on HumanEval is significantly lower than that reported by Gemini Team [2023]. We suspect the difference may be due to differences in prompting techniques, and will continue examining the issue.

(a) HumanEval    (b) ODEX

Figure 15: Overall accuracy on code generation tasks

(a) Accuracy by gold solution length on HumanEval    (b) Accuracy by used libraries on ODEX

Figure 16: Comparison of Pass@1 w.r.t. gold solution length and the libraries used by gold solution

generation task. We find that even though Gemini Pro achieves comparable Pass@1 with GPT 3.5
when the solution length is below 100 (e.g., easier cases), it falls behind by large margins when the
solution becomes longer. This is an interesting contrast to the results from previous sections, where
we found that in general Gemini Pro performed robustly with respect to longer inputs and outputs on
English language tasks.
On the ODEX benchmark, all models achieved lower accuracy than on HumanEval, with a significant portion of the errors being due to the models failing to import libraries that they were using, or using non-current APIs. We also present an analysis of how the libraries required in each solution affect model performance in Figure 16b. Gemini Pro performs slightly worse than GPT 3.5 on most of the library-specific slices, such as mock, pandas, numpy, and datetime.
Finally, we show several concrete examples of failure cases where Gemini Pro performs worse in
code generation than GPT 3.5. First, we noticed that Gemini is somewhat worse at correctly choosing
functions and arguments from the Python API. For instance, given this prompt:

def f_3283984():
    """decode a hex string '4a4b4c' to UTF-8."""

Gemini Pro generated the following code, which results in a type mismatch error:

bytes(bytearray.fromhex('4a4b4c'), 'utf-8')

In contrast, GPT 3.5 Turbo used the following code, which achieves the desired result:

hex_string = '4a4b4c'
decoded_string = bytes.fromhex(hex_string).decode('utf-8')
return decoded_string

Further, Gemini Pro had a higher proportion of mistakes where the implemented code was syntactically
correct but did not correctly match with a more complex intent. For instance, with respect to the
following prompt:

from typing import List

def remove_duplicates(numbers: List[int]) -> List[int]:
    """From a list of integers, remove all elements that occur more than once.
    Keep order of elements left the same as in the input.
    >>> remove_duplicates([1, 2, 3, 2, 4])
    [1, 3, 4]
    """

Gemini Pro created an implementation that just extracts the unique numbers without removing those that appear more than once.

seen_numbers = set()
unique_numbers = []
for number in numbers:
    if number not in seen_numbers:
        unique_numbers.append(number)
        seen_numbers.add(number)
return unique_numbers
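For reference, an implementation that matches the stated intent (keeping only elements that occur exactly once, in their original order) could count occurrences first, for example:

from collections import Counter
from typing import List

def remove_duplicates(numbers: List[int]) -> List[int]:
    """Keep only the elements that occur exactly once, preserving input order."""
    counts = Counter(numbers)
    return [n for n in numbers if counts[n] == 1]

assert remove_duplicates([1, 2, 3, 2, 4]) == [1, 3, 4]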

7 Machine Translation

This set of experiments evaluates the models’ multilingual ability, specifically their ability to translate
between various language pairs, using the FLORES-200 machine translation benchmark [NLLB
Team et al., 2022]. We focus on a diverse subset of 20 languages used by the analysis of Robinson
et al. [2023], which encompass various levels of resource availability and translation difficulty. We
evaluate on the 1012 sentences from the test set for all the chosen language pairs. As the first step of
this study, we limited our scope to translations from English to other languages (ENG→X) only.

7.1 Experimental Details

Generation Parameters We use a five-shot prompting strategy (zero shot results are also noted in
Appendix C), specifically the prompts proposed by Gao et al. [2023b] designated in Table 7. Our
experimental setup employed a top_p value of 1, a temperature of 0.3, a context_length of -1, and
max_tokens 500, which we found to generally achieve good performance for translation.

Evaluation To evaluate the outputs, we utilized sentence-level averaged chrF2++, leveraging the implementation provided by sacreBLEU [Post, 2018]. This standard metric is based on character and word n-gram overlap between the system output and the reference sentence. We compute sentence-level chrF scores and, for simplicity, refer to this metric as chrF in our discussion [Popović, 2017].
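Concretely, the sentence-level scoring can be reproduced with sacreBLEU’s chrF implementation along the following lines (a sketch; word_order=2 corresponds to the chrF2++ variant).

from sacrebleu.metrics import CHRF

# word_order=2 adds word 1- and 2-grams to the character n-grams, i.e. the
# chrF2++ variant reported by sacreBLEU.
chrf = CHRF(word_order=2)

def averaged_sentence_chrf(hypotheses, references):
    """Average of sentence-level chrF scores over a set of translations."""
    scores = [
        chrf.sentence_score(hyp, [ref]).score
        for hyp, ref in zip(hypotheses, references)
    ]
    return sum(scores) / len(scores)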

7.2 Results and Analysis

Overall Performances In Table 4, we conduct a comparative analysis of Gemini Pro, GPT 3.5
Turbo, GPT 4 Turbo, and Mixtral. We also compare against established translation-specific systems
like Google Translate11 , and NLLB-MoE [NLLB Team et al., 2022], the leading open-source machine
translation (MT) model known for its extensive language coverage.
The results indicate that Google Translate generally outperforms other models in languages that it
supports, excelling in 10 languages. It is followed by NLLB, which excels on 6 languages. Gemini
Pro provided impressive accuracy on several languages, even having the best accuracy on 3: South
Levantine Arabic, Romanian, and Mesopotamian Arabic (surpassing not only GPT 3.5 Turbo, but
also GPT 4 Turbo).
However, on average, the results indicate that the general-purpose language models showed competitive performances but have not yet surpassed the dedicated machine translation systems in translation
into non-English languages.
Figure 17 illustrates the comparative performance of language models across language pairs for the
5-shot prompt. GPT 4 Turbo showed a consistent deviation of performance with NLLB relative
to GPT 3.5 Turbo and Gemini Pro. This reflects the findings in the literature on GPT 4 Turbo’s
multilingual performance [OpenAI, 2023]. GPT 4 Turbo also offered larger improvements for low-
resource languages (as measured by NLLB Team et al. [2022]), whereas for high-resource languages
11 http://translate.google.com

Lang. Gemini Pro GPT 3.5 Turbo GPT 4 Turbo Mixtral Google NLLB
ssw_Latin 0.43† 18.16 38.62 18.84 - 43.51
sna_Latin 0.00† 24.44 43.25 22.20 44.36 43.40
ckb_Arab 0.00† 26.69 41.30 18.36 47.61 47.25
mag_Deva 34.54 39.70 45.46 25.93 - 58.03
ibo_Latin 0.00† 21.46 41.94 17.75 43.37 41.36
hau_Latin 0.02† 30.24 50.82 22.47 53.18 53.25
pbt_Arab 0.00† 22.81 34.21 16.61 - 39.27
tam_Taml 0.00† 35.50 48.04 23.78 55.98 53.63
kat_Geor 0.00† 33.32 40.94 23.78 51.11 47.10
gle_Latin 0.00† 46.72 56.52 26.93 59.93 57.87
kmr_Latin 0.00† 30.03 33.33 19.04 39.94 39.25
war_Latin 0.52† 51.17 56.01 34.74 - 57.25
ajp_Arab 50.64 47.45 47.01 33.60 - 50.61
lim_Latin 39.99 43.77 46.05 32.31 - 47.58
ukr_Cyrl 56.89 54.56 56.44 49.66 58.25 56.04
fra_Latin 70.77 70.99 70.77 66.73 72.98 70.01
lvs_Latin 59.49 54.55 57.95 31.17 62.49 54.89
ron_Latin 65.09 63.18 63.93 56.68 65.08 61.21
tpi_Latin 6.20† 40.14 47.97 33.33 - 42.02
acm_Arab 49.05 45.26 44.44 31.65 - 32.60

Table 4: Machine translation performance (chrF (%) scores) across models for all languages using
5-shot prompt. Best scores are bolded, second best underlined. † indicates languages where Gemini
Pro blocked more than 50% of responses.

performance was similar between the LLMs. In comparison, Gemini Pro outperforms both GPT 3.5
Turbo and GPT 4 Turbo on 5 out of 20 languages, and achieved the top performances on 3 languages.
Mixtral underperformed the other models across language pairs.

Gemini Blocked Responses If we analyze responses at a language level, we see in Figure 18 that
Gemini Pro’s lower performance in 12/20 languages is due to its tendency to block responses on
particular languages, generally ones with lower levels of resources. We consider a response "blocked"
if Gemini Pro generates a Blocked Response error, and define unblocked languages as those languages
where >50% samples are not blocked.
Examining performance at a language level in Figure 19, we see that Gemini Pro outperforms GPT 3.5 Turbo and GPT 4 Turbo on 5/8 unblocked languages, most of which are high resource. Additionally, in Figure 19, we observe that the implementation of safety filters leads to a decrease in the overall chrF score. This reduction occurs because the filters block samples from languages that the model otherwise handles relatively effectively.

Other trends In Figure 21, we present apparent trends when categorizing languages by family or script. A key observation is that Gemini Pro’s competitive performance with other models on the Cyrillic script is contrasted by its underperformance on other scripts. GPT 4 Turbo stands out, outperforming other models across various scripts, especially in the Devanagari script.

Figure 17: Machine translation performance (chrF (%) scores) by language pairs for 5-shot prompt.

Figure 18: Number of samples that are blocked by Gemini Pro 5-shot

In Figure 20, we examine the performance of various models across different sentence length segments, categorized by both target length (Figure 20a) and predicted length (Figure 20b). Upon scrutinizing Figure 20, we observe that Gemini Pro’s performance at longer target lengths does not match that of GPT 4 Turbo and GPT 3.5 Turbo. However, when considering predicted lengths, Gemini Pro generally outperforms both GPT 4 Turbo and GPT 3.5 Turbo at longer lengths, suggesting it produces higher quality translations at longer lengths. Additionally, the figure suggests that a significant portion of the performance decline at shorter predicted lengths even on unblocked languages may be attributed to empty predictions, likely triggered by the content filtering mechanism.

Figure 21: Performance by script
Figure 19: Performance in chrF (%) on blocked and unblocked languages for 5-shot prompt.

(a) chrF by target sentence length    (b) chrF by predicted sentence length

Figure 20: Performance with varying target and predicted length using 5-shot prompt on unblocked languages.

8 Web Agents

Finally, we examine the ability of each model to act as an instruction-following agent that performs tasks on the web, which requires long-term planning and complex data understanding. We use WebArena [Zhou et al.,
2023b], an execution-based simulation environment where the success criterion is based on execution
outcome. Tasks given to agents consist of information seeking, site navigation, and content &
configuration operations. The tasks span a variety of websites, including e-commerce platforms, social forums, collaborative software development platforms (e.g., gitlab), content management
systems, and online maps.

8.1 Experiment Details

Generation Parameters We follow WebArena’s testing methodology in testing Gemini. We used the two-shot chain-of-thought prompts from Zhou et al. [2023b], where each prompt includes two CoT style examples. We further distinguished between whether or not the model is instructed to terminate execution when it believes the task is unachievable (the “unachievable” hint, or UA in WebArena parlance).

CoT  UA Hint  Model          SR     SRAC
✓    ✓        Gemini-Pro     7.12   3.52
✓    ✗        Gemini-Pro     6.25   4.83
✓    ✓        GPT-3.5-Turbo  8.87   6.44
✓    ✗        GPT-3.5-Turbo  6.36   6.06
✓    ✗        GPT-4-Turbo    14.90  14.22

Table 5: Performances on WebArena.

In sum, we tested with two prompts from WebArena: p_cot_id_actree_2s and p_cot_id_actree_2s_no_na, which are the CoT prompt with the UA hint and the CoT prompt without the UA hint, respectively. To make results comparable between the GPT models and Gemini, we set the same upper limit on the observation lengths for all of them. This limit is set to 1920 tokens using the tokenizer of gpt-4-1106-preview, consistent with the experiments in WebArena. In terms of hyperparameters, we used the defaults suggested by each of the large language model providers. For the Gemini models, the suggested default temperature is 0.9 and the default top-p is 1.0, and the WebArena-suggested default for the GPT models is a temperature of 1.0 and a top-p of 0.9.
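One way to enforce the shared 1920-token observation cap with the gpt-4-1106-preview tokenizer is sketched below (illustrative only; the experiments rely on WebArena’s own observation processing).

import tiktoken

# The same token budget is applied to the observations given to every model.
ENCODING = tiktoken.encoding_for_model("gpt-4-1106-preview")
MAX_OBS_TOKENS = 1920

def truncate_observation(observation: str) -> str:
    """Clip a WebArena observation to at most 1920 tokens."""
    tokens = ENCODING.encode(observation)
    if len(tokens) <= MAX_OBS_TOKENS:
        return observation
    return ENCODING.decode(tokens[:MAX_OBS_TOKENS])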

Figure 23: Web agent success rate of evaluated models at different site groups

Evaluation Procedure The action sequence of an agent is considered correct as long as it achieves the final goal, regardless of the intermediate steps taken. We use WebArena’s evaluation, which determines whether a task is completed successfully based on the agent’s final output.

8.2 Results and Analysis

We examine Gemini-Pro’s overall success rate, its rate across different tasks, its response lengths, trajectory step counts, and its tendency to predict that the task is unachievable. The overall performance is listed in Table 5. Gemini-Pro performs comparably but slightly worse than GPT-3.5-Turbo. Similarly to GPT-3.5-Turbo, Gemini-Pro performs better when the prompt mentions that the task might be unachievable (UA hint). With the UA hint, Gemini-Pro achieves an overall 7.09 percent success rate.
If we break down by websites, as shown in Figure 23, we can see that Gemini-Pro performs worse than GPT-3.5-Turbo on gitlab and maps, while being close to GPT-3.5-Turbo on shopping admin, reddit, and shopping. It performs better than GPT-3.5-Turbo on multi-site tasks, which is in concert with our previous results of Gemini being a bit better on the more complex sub-tasks across benchmarks.
In general, Gemini-Pro predicts more tasks as unachievable, especially in the case where a UA hint is given, as shown in Figure 22. Gemini-Pro predicts over 80.6% of the tasks as unachievable when given a UA hint, compared to 47.7% by GPT-3.5-Turbo. Note that 4.4% of the tasks in the dataset are actually unachievable, so both far over-predict the actual number of unachievable tasks.

Figure 22: UA prediction count

At the same time, we observed that Gemini Pro has a greater tendency to respond in shorter phrases and to take fewer steps before reaching a conclusion. As shown in Figure 24a, more than half of the trajectories by Gemini Pro are under ten steps, while the majority of trajectories by GPT 3.5 Turbo and GPT 4 Turbo are between 10 and 30 steps. Similarly, the majority of Gemini responses are less than 100 characters in length, while most of GPT 3.5 Turbo, GPT 4 Turbo, and Mixtral’s responses are over 300 characters in length (Figure 24b). Gemini tends to directly predict the actions, while the other models start with reasoning and then give their action predictions.

(a) Average steps taken per task    (b) Average response length

Figure 24: Model behaviors on WebArena.

9 Conclusion
In this paper, we have taken a first impartial, in-depth look into Google’s Gemini model, comparing
it to OpenAI’s GPT 3.5 and 4 models, as well as the open source Mixtral model.

Takeaways We came away with a number of conclusions:


• The Gemini Pro model, which is comparable to GPT 3.5 Turbo in model size and class,
generally achieves accuracy that is comparable but somewhat inferior to GPT 3.5 Turbo,
and much worse than GPT 4 on English tasks.
• In particular, we find that Gemini Pro was somewhat less performant than GPT 3.5 Turbo on
average, with issues of bias to response order in multiple-choice questions, mathematical reasoning
with large digits, and premature termination of agentive tasks. When using the default content
filtering settings, there were also failed responses due to aggressive content filtering.
• On the other hand, there were bright points: Gemini performed better than GPT 3.5 Turbo
on particularly long and complex reasoning tasks.
• In addition, when generating text in other languages (specifically through translation),
Gemini Pro outperforms both GPT 3.5 Turbo and GPT 4 Turbo on the languages where
requests are not blocked, but there are several languages for which Gemini Pro does not
return any answer.
• The open-source model Mixtral is competitive with Gemini Pro and GPT 3.5 Turbo on
Knowledge-based QA and Mathematics tasks, but significantly underperformed on other
tasks.

Limitations Finally, we would like to temper these conclusions with a number of limitations.
First, our work is a snapshot in time with respect to ever-changing and unstable API-based systems.
All results here are current as of this writing on December 27, 2023, but may change in the future as
models and the surrounding systems are upgraded.
Second, the results may be dependent on the specific prompts and generation parameters that we selected. In fact, we found that the results of all models were affected by prompt selection, and that the GPT models seemed somewhat more robust to small variations in the prompts. It is quite possible that with further prompt engineering, or multiple samples and self-consistency as
It is quite possible that with further prompt engineering, or multiple samples and self-consistency as
was used by Gemini Team [2023], the results could change significantly.
Finally, any benchmarking paper would be remiss without a discussion of data leakage, which plagues current evaluation of large language models [Zhou et al., 2023a]. While we did not measure this leakage explicitly, we did attempt to mitigate it by evaluating on a broad variety of tasks, including those whose outputs were not sourced from or widely available on the internet (such as WebArena).

Outlook Based on this paper, we can make the recommendation to researchers and practitioners
to carefully look at the Gemini Pro model as a tool in the toolbox, comparable to GPT 3.5 Turbo.
In particular, Gemini Pro may be a preferable alternative when processing non-English languages.
Gemini’s Ultra edition, which is yet to be released, is reported to be on par with GPT 4, and a further
examination of this model will be warranted when it is available.

Acknowledgements
We greatly appreciate the help of Zhiruo Wang in handling the ODEX dataset, and Shuyan Zhou with
high-level guidance on the WebArena experiments. We also are very grateful to the Gemini team
who, based on an earlier version of this draft, provided significant help in attempting to reproduce
the numbers in their paper. We also thank those who provided comments on our earlier paper draft
on social media, including Arthur Mensch, who noted that we had inadvertently used a non-official
third-party version of the Mixtral model in comparisons, and those who pointed out an inaccuracy in
description of the Mixtral model’s mixture of experts mechanism.

References
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are
few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.

Ángel Alexander Cabrera, Erica Fu, Donald Bertucci, Kenneth Holstein, Ameet Talwalkar, Jason I
Hong, and Adam Perer. Zeno: An interactive framework for behavioral evaluation of machine
learning. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems,
pages 1–14, 2023.
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared
Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri,
Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan,
Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian,
Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios
Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino,
Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders,
Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa,
Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob
McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. Evaluating
large language models trained on code. 2021.
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser,
Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve
math word problems. arXiv preprint arXiv:2110.14168, 2021.
Yao Fu, Litu Ou, Mingyu Chen, Yuhao Wan, Hao Peng, and Tushar Khot. Chain-of-thought hub:
A continuous effort to measure large language models’ reasoning performance. arXiv preprint
arXiv:2305.17306, 2023.
Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster,
Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff,
Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika,
Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework for few-shot
language model evaluation, 12 2023a. URL https://zenodo.org/records/10256836.
Yuan Gao, Ruili Wang, and Feng Hou. How to design translation prompts for chatgpt: An empirical
study. arXiv preprint arXiv: 2304.02182, 2023b.
Gemini Team. Gemini: A family of highly capable multimodal models. Technical report, Google, 12
2023. URL https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf.
Naman Goyal, Cynthia Gao, Vishrav Chaudhary, Peng-Jen Chen, Guillaume Wenzek, Da Ju, Sanjana
Krishnan, Marc’Aurelio Ranzato, Francisco Guzmán, and Angela Fan. The FLORES-101 evaluation
benchmark for low-resource and multilingual machine translation. Transactions of the Association
for Computational Linguistics, 10:522–538, 2022.
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob
Steinhardt. Measuring massive multitask language understanding. Proceedings of the International
Conference on Learning Representations (ICLR), 2021.

Takeshi Kojima, Shixiang (Shane) Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large
language models are zero-shot reasoners. In Advances in Neural Information Processing Systems,
volume 35, pages 22199–22213, 2022.
Rik Koncel-Kedziorski, Subhro Roy, Aida Amini, Nate Kushman, and Hannaneh Hajishirzi. MAWPS:
A math word problem repository. In Kevin Knight, Ani Nenkova, and Owen Rambow, editors,
Proceedings of the 2016 Conference of the North American Chapter of the Association for Com-
putational Linguistics: Human Language Technologies, pages 1152–1157, San Diego, Califor-
nia, June 2016. Association for Computational Linguistics. doi: 10.18653/v1/N16-1136. URL
https://aclanthology.org/N16-1136.
Shen-yun Miao, Chao-Chun Liang, and Keh-Yih Su. A diverse corpus for evaluating and developing
English math word problem solvers. In Proceedings of the 58th Annual Meeting of the Association
for Computational Linguistics, pages 975–984, 2020.
Mistral AI team. Mixtral of experts, December 2023. URL https://mistral.ai/news/mixtral-of-experts/.
Accessed: 2023-12-15.
NLLB Team, Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield,
Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang,
Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia Gonzalez, Prangthip
Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon Spruit,
Chau Tran, Pierre Andrews, Necip Fazil Ayan, Shruti Bhosale, Sergey Edunov, Angela Fan,
Cynthia Gao, Vedanuj Goswami, Francisco Guzmán, Philipp Koehn, Alexandre Mourachko,
Christophe Ropers, Safiyyah Saleem, Holger Schwenk, and Jeff Wang. No language left behind:
Scaling human-centered machine translation. META, 2022.
OpenAI. GPT-4 technical report, 2023.
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong
Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow
instructions with human feedback. Advances in Neural Information Processing Systems, 35:
27730–27744, 2022.
Arkil Patel, Satwik Bhattamishra, and Navin Goyal. Are NLP models really able to solve simple
math word problems? In Proceedings of the 2021 Conference of the North American Chapter of
the Association for Computational Linguistics: Human Language Technologies, pages 2080–2094,
Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.168.
URL https://aclanthology.org/2021.naacl-main.168.
Pouya Pezeshkpour and Estevam Hruschka. Large language models sensitivity to the order of options
in multiple-choice questions. arXiv preprint arXiv:2308.11483, 2023.
Maja Popović. chrF++: words helping character n-grams. In Proceedings of the Second Conference
on Machine Translation, pages 612–618, 2017.
Matt Post. A call for clarity in reporting BLEU scores. arXiv preprint arXiv:1804.08771, 2018.
Raf. What are tokens and how to count them? https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them, 2023. Accessed: 2023-12-15.
Nathaniel R. Robinson, Perez Ogayo, David R. Mortensen, and Graham Neubig. ChatGPT MT:
Competitive for high- (but not low-) resource languages. Conference on Machine Translation,
2023. doi: 10.48550/arXiv.2309.07423.
Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung,
Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, et al. Challenging big-bench tasks
and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022.
Lindia Tjuatja, Valerie Chen, Sherry Tongshuang Wu, Ameet Talwalkar, and Graham Neubig.
Do LLMs exhibit human-like response biases? A case study in survey design. arXiv preprint
arXiv:2311.04076, 2023.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz
Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information
Processing Systems, volume 30, pages 5998–6008. Curran Associates, Inc., 2017.

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdh-
ery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models.
arXiv preprint arXiv:2203.11171, 2022a.

Zhiruo Wang, Shuyan Zhou, Daniel Fried, and Graham Neubig. Execution-based evaluation for
open-domain code generation. arXiv preprint arXiv:2212.10481, 2022b.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny
Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in
Neural Information Processing Systems, 35:24824–24837, 2022.

Chujie Zheng, Hao Zhou, Fandong Meng, Jie Zhou, and Minlie Huang. Large language models are
not robust multiple choice selectors. arXiv e-prints, pages arXiv–2309, 2023.

Kun Zhou, Yutao Zhu, Zhipeng Chen, Wentong Chen, Wayne Xin Zhao, Xu Chen, Yankai Lin,
Ji-Rong Wen, and Jiawei Han. Don’t make your LLM an evaluation benchmark cheater. arXiv
preprint arXiv:2311.01964, 2023a.

Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng,
Yonatan Bisk, Daniel Fried, Uri Alon, et al. WebArena: A realistic web environment for building
autonomous agents. arXiv preprint arXiv:2307.13854, 2023b.

A Author Contributions
Syeda Akter performed experiments, analysis, and writing for the text understanding and mathematical
reasoning tasks. Zichun Yu performed experiments, analysis, and writing for the knowledge-based
question answering and the code generation tasks. Aashiq Muhamed performed experiments, analysis,
and writing for the machine translation task. Tianyue Ou performed experiments, analysis, and writing
for the instruction following agents task. Ángel Alexander Cabrera and Alex Bäuerle provided
visualization support and performed fine-grained analysis of the results for each task. Krrish
Dholakia provided support implementing calls to each of the language models. Chenyan Xiong
provided direction on the varieties of tasks to pursue and helped with paper writing. Graham Neubig
proposed the project idea, wrote the introduction, experimental setup, and conclusions section, and
provided analysis and writing support for all other sections.

B Prompt Details
In this section, we detail the prompts that we used for each task.
For the Knowledge-based QA task in Section 3, we used standard 5-shot prompts from Hendrycks
et al. [2021]12 and 5-shot chain-of-thought prompts from chain-of-thought-hub13.
For the General-purpose Reasoning task in Section 4, we used Chain-of-Thought prompts from
Gao et al. [2023a]14.
For the Mathematics tasks in Section 5, we likewise followed the Chain-of-Thought prompts from Gao et al.
[2023a]15.
For Code Generation in Section 6, the prompt is listed in Table 6.
For Machine Translation in Section 7, the prompts are listed in Table 7.
12 https://github.com/hendrycks/test
13 https://github.com/FranxYao/chain-of-thought-hub/blob/main/MMLU/lib_prompt/mmlu-cot.json
14 https://github.com/EleutherAI/lm-evaluation-harness/tree/big-refactor/lm_eval/tasks/bbh/cot_fewshot
15 https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/lm_eval/tasks/gsm8k/gsm8k-cot.yaml

Prompt
Write the following python3 function:
[CODE BLOCK]

Table 6: Prompts used for code generation tasks.
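To make the template concrete, the following minimal sketch fills it with a HumanEval-style function signature and docstring; the helper function and the example signature are hypothetical and are not part of the evaluation code used in this paper.

# Minimal sketch (hypothetical helper): fill the Table 6 template with a
# HumanEval-style function signature and docstring.
CODEGEN_TEMPLATE = "Write the following python3 function:\n\n{code_block}"

def build_codegen_prompt(code_block: str) -> str:
    """Return the code generation prompt for a single problem."""
    return CODEGEN_TEMPLATE.format(code_block=code_block)

example_signature = (
    "def add(a: int, b: int) -> int:\n"
    '    """Return the sum of a and b."""\n'
)
print(build_codegen_prompt(example_signature))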

Shot   Prompt
zero   This is an English to [TGT] translation, please provide the [TGT] translation for this
       sentence. Do not provide any explanations or text apart from the translation.
       [SRC]: [src-sentence]
       [TGT]:
five   This is an English to [TGT] translation, please provide the [TGT] translation for these
       sentences:
       [SRC]: [src-sentence] [TGT]: [tgt-sentence]
       [SRC]: [src-sentence] [TGT]: [tgt-sentence]
       [SRC]: [src-sentence] [TGT]: [tgt-sentence]
       [SRC]: [src-sentence] [TGT]: [tgt-sentence]
       [SRC]: [src-sentence] [TGT]: [tgt-sentence]
       Please provide the translation for the following sentence. Do not provide any explanations
       or text apart from the translation.
       [SRC]: [src-sentence]
       [TGT]:

Table 7: Prompts used for zero- and five-shot settings in translation tasks.
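As an illustration, the following minimal sketch assembles the zero- and five-shot prompts from the templates in Table 7; the helper functions, default language settings, and demonstration format shown here are illustrative assumptions rather than the exact evaluation code.

# Minimal sketch (hypothetical helpers): assemble the Table 7 templates into
# zero- and five-shot translation prompts.
ZERO_SHOT = (
    "This is an English to {tgt} translation, please provide the {tgt} "
    "translation for this sentence. Do not provide any explanations or text "
    "apart from the translation.\n{src}: {sentence}\n{tgt}:"
)
FIVE_SHOT_HEADER = (
    "This is an English to {tgt} translation, please provide the {tgt} "
    "translation for these sentences:\n"
)
FIVE_SHOT_FOOTER = (
    "Please provide the translation for the following sentence. Do not "
    "provide any explanations or text apart from the translation.\n"
    "{src}: {sentence}\n{tgt}:"
)

def build_zero_shot(sentence, src="English", tgt="French"):
    return ZERO_SHOT.format(src=src, tgt=tgt, sentence=sentence)

def build_five_shot(demos, sentence, src="English", tgt="French"):
    # demos: a list of five (source, target) demonstration pairs.
    demo_lines = "".join(f"{src}: {s} {tgt}: {t}\n" for s, t in demos)
    return (FIVE_SHOT_HEADER.format(tgt=tgt)
            + demo_lines
            + FIVE_SHOT_FOOTER.format(src=src, tgt=tgt, sentence=sentence))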

For WebArena in Section 8, we used CoT prompts with the UA (unachievable) hint16 and without the UA hint17.
16 https://github.com/oootttyyy/webarena/blob/main/agent/prompts/raw/p_cot_id_actree_2s.py
17 https://github.com/oootttyyy/webarena/blob/main/agent/prompts/raw/p_cot_id_actree_2s_no_na.py

C Additional Experiments: Machine Translation


This section includes plots and results comparing 0-shot and 5-shot prompts for all models, as well as
per-language results using the 0-shot prompt. Across the models, we observed that few-shot prompts
generally yield a modest improvement in average performance, with variance increasing in the order
GPT 3.5 Turbo < Gemini Pro < GPT 4 Turbo < Mixtral. Using 5-shot prompts improves chrF both on the
unblocked language instances and on all instances: Gemini Pro gains 1.8 chrF, GPT 4 Turbo gains 2.87
chrF, GPT 3.5 Turbo gains 0.8 chrF, and Mixtral gains 4.75 chrF.
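For reference, chrF scores of the kind reported in this section can be computed with the sacrebleu implementation of chrF [Popović, 2017, Post, 2018]; the following minimal sketch compares 0-shot and 5-shot outputs, with placeholder hypothesis and reference lists standing in for the actual system outputs.

# Minimal sketch: compare 0-shot and 5-shot outputs with chrF via sacrebleu.
# The hypothesis/reference lists below are placeholders, not the paper's data.
from sacrebleu.metrics import CHRF

chrf = CHRF()  # default chrF settings (character 6-grams, beta=2)

def chrf_score(hypotheses, references):
    # sacrebleu expects a list of reference streams, hence the extra list.
    return chrf.corpus_score(hypotheses, [references]).score

zero_shot_outputs = ["Bonjour le monde ."]
five_shot_outputs = ["Bonjour , le monde ."]
references = ["Bonjour , le monde ."]

gain = chrf_score(five_shot_outputs, references) - chrf_score(zero_shot_outputs, references)
print(f"5-shot chrF gain: {gain:.2f}")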


Lang. Gemini Pro GPT 3.5 Turbo GPT 4 Turbo Mixtral Google NLLB
ssw_Latin 0.54† 20.21 37.38 18.27 - 43.51
sna_Latin 0.11† 23.92 42.84 22.07 44.36 43.40
ckb_Arab 0.00† 21.13 39.90 14.31 47.61 47.25
mag_Deva 24.83 39.20 26.40 20.29 - 58.03
ibo_Latin 0.08† 20.13 41.78 16.71 43.37 41.36
hau_Latin 0.05† 29.64 50.14 22.41 53.18 53.25
pbt_Arab 0.01† 21.26 32.85 12.46 - 39.27
tam_Taml 0.00† 35.13 47.67 20.97 55.98 53.63
kat_Geor 0.00† 33.24 40.45 21.42 51.11 47.10
gle_Latin 0.09† 46.44 55.91 25.34 59.93 57.87
kmr_Latin 0.57† 29.16 32.17 17.96 39.94 39.25
war_Latin 0.38† 49.88 54.21 29.66 - 57.25
ajp_Arab 49.79 46.86 45.32 26.07 - 50.61
lim_Latin 32.70 41.00 42.04 29.13 - 47.58
ukr_Cyrl 57.20 54.39 56.51 48.62 58.25 56.04
fra_Latin 70.15 70.88 70.66 66.22 72.98 70.01
lvs_Latin 58.53 54.34 57.34 28.99 62.49 54.89
ron_Latin 64.50 62.98 63.90 54.46 65.08 61.21
tpi_Latin 4.62† 38.05 44.70 28.63 - 42.02
acm_Arab 46.85 43.44 39.83 16.26 - 32.60

Table 8: Machine translation performance (chrF (%) scores) across models for all languages using
0-shot prompt. Best scores are bolded, second best underlined. † indicates languages where Gemini
Pro blocked more than 50% of responses.
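The † marker can be derived mechanically from the raw model outputs. The following minimal sketch assumes a simple data layout in which blocked responses are recorded as None; the data structure and example values are hypothetical.

# Minimal sketch (assumed data layout): flag languages where more than half of
# Gemini Pro's responses were blocked, mirroring the dagger marker in Table 8.
# `outputs_by_language` maps a language code to a list of model outputs, with
# None standing in for a blocked response.
def blocked_languages(outputs_by_language, threshold=0.5):
    flagged = []
    for lang, outputs in outputs_by_language.items():
        blocked_fraction = sum(o is None for o in outputs) / len(outputs)
        if blocked_fraction > threshold:
            flagged.append(lang)
    return flagged

example = {
    "ssw_Latin": [None, None, "some output"],
    "fra_Latin": ["une traduction", "une autre", "encore une"],
}
print(blocked_languages(example))  # ['ssw_Latin']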

Figure 25: Machine translation performance (chrF (%) scores) by language pair for the 0-shot prompt.

Figure 26: Number of samples blocked by Gemini Pro for the 0-shot and 5-shot prompts.

Figure 27: Performance in chrF (%) on blocked and unblocked languages for the 0-shot prompt.

