Skip to content

Commit 32e3da7

Browse files
authored
test(verification): more tests, multiturn tool use tests (meta-llama#1954)
# What does this PR do? ## Test Plan (myenv) ➜ llama-stack python tests/verifications/generate_report.py --providers fireworks,together,openai --run-tests https://github.com/meta-llama/llama-stack/blob/f27f61762980925330fb46da5e9e74e3a1b999a2/tests/verifications/REPORT.md
1 parent 86c6f1f commit 32e3da7

File tree

6 files changed

+6270
-595
lines changed

6 files changed

+6270
-595
lines changed

tests/verifications/REPORT.md

Lines changed: 49 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
# Test Results Report
22

3-
*Generated on: 2025-04-10 16:48:18*
3+
*Generated on: 2025-04-14 18:11:37*
44

55
*This report was generated by running `python tests/verifications/generate_report.py`*
66

@@ -15,15 +15,15 @@
1515

1616
| Provider | Pass Rate | Tests Passed | Total Tests |
1717
| --- | --- | --- | --- |
18-
| Together | 64.7% | 22 | 34 |
19-
| Fireworks | 82.4% | 28 | 34 |
20-
| Openai | 100.0% | 24 | 24 |
18+
| Together | 48.7% | 37 | 76 |
19+
| Fireworks | 47.4% | 36 | 76 |
20+
| Openai | 100.0% | 52 | 52 |
2121

2222

2323

2424
## Together
2525

26-
*Tests run on: 2025-04-10 16:46:35*
26+
*Tests run on: 2025-04-14 18:08:14*
2727

2828
```bash
2929
# Run all tests for this provider:
@@ -48,19 +48,33 @@ pytest tests/verifications/openai_api/test_chat_completion.py --provider=togethe
4848
| test_chat_non_streaming_basic (earth) ||||
4949
| test_chat_non_streaming_basic (saturn) ||||
5050
| test_chat_non_streaming_image ||||
51+
| test_chat_non_streaming_multi_turn_tool_calling (add_product_tool) ||||
52+
| test_chat_non_streaming_multi_turn_tool_calling (compare_monthly_expense_tool) ||||
53+
| test_chat_non_streaming_multi_turn_tool_calling (get_then_create_event_tool) ||||
54+
| test_chat_non_streaming_multi_turn_tool_calling (text_then_weather_tool) ||||
55+
| test_chat_non_streaming_multi_turn_tool_calling (weather_tool_then_text) ||||
5156
| test_chat_non_streaming_structured_output (calendar) ||||
5257
| test_chat_non_streaming_structured_output (math) ||||
5358
| test_chat_non_streaming_tool_calling ||||
59+
| test_chat_non_streaming_tool_choice_none ||||
60+
| test_chat_non_streaming_tool_choice_required ||||
5461
| test_chat_streaming_basic (earth) ||||
5562
| test_chat_streaming_basic (saturn) ||||
5663
| test_chat_streaming_image ||||
64+
| test_chat_streaming_multi_turn_tool_calling (add_product_tool) ||||
65+
| test_chat_streaming_multi_turn_tool_calling (compare_monthly_expense_tool) ||||
66+
| test_chat_streaming_multi_turn_tool_calling (get_then_create_event_tool) ||||
67+
| test_chat_streaming_multi_turn_tool_calling (text_then_weather_tool) ||||
68+
| test_chat_streaming_multi_turn_tool_calling (weather_tool_then_text) ||||
5769
| test_chat_streaming_structured_output (calendar) ||||
5870
| test_chat_streaming_structured_output (math) ||||
5971
| test_chat_streaming_tool_calling ||||
72+
| test_chat_streaming_tool_choice_none ||||
73+
| test_chat_streaming_tool_choice_required ||||
6074

6175
## Fireworks
6276

63-
*Tests run on: 2025-04-10 16:44:44*
77+
*Tests run on: 2025-04-14 18:04:06*
6478

6579
```bash
6680
# Run all tests for this provider:
@@ -85,19 +99,33 @@ pytest tests/verifications/openai_api/test_chat_completion.py --provider=firewor
8599
| test_chat_non_streaming_basic (earth) ||||
86100
| test_chat_non_streaming_basic (saturn) ||||
87101
| test_chat_non_streaming_image ||||
102+
| test_chat_non_streaming_multi_turn_tool_calling (add_product_tool) ||||
103+
| test_chat_non_streaming_multi_turn_tool_calling (compare_monthly_expense_tool) ||||
104+
| test_chat_non_streaming_multi_turn_tool_calling (get_then_create_event_tool) ||||
105+
| test_chat_non_streaming_multi_turn_tool_calling (text_then_weather_tool) ||||
106+
| test_chat_non_streaming_multi_turn_tool_calling (weather_tool_then_text) ||||
88107
| test_chat_non_streaming_structured_output (calendar) ||||
89108
| test_chat_non_streaming_structured_output (math) ||||
90109
| test_chat_non_streaming_tool_calling ||||
110+
| test_chat_non_streaming_tool_choice_none ||||
111+
| test_chat_non_streaming_tool_choice_required ||||
91112
| test_chat_streaming_basic (earth) ||||
92113
| test_chat_streaming_basic (saturn) ||||
93114
| test_chat_streaming_image ||||
115+
| test_chat_streaming_multi_turn_tool_calling (add_product_tool) ||||
116+
| test_chat_streaming_multi_turn_tool_calling (compare_monthly_expense_tool) ||||
117+
| test_chat_streaming_multi_turn_tool_calling (get_then_create_event_tool) ||||
118+
| test_chat_streaming_multi_turn_tool_calling (text_then_weather_tool) ||||
119+
| test_chat_streaming_multi_turn_tool_calling (weather_tool_then_text) ||||
94120
| test_chat_streaming_structured_output (calendar) ||||
95121
| test_chat_streaming_structured_output (math) ||||
96122
| test_chat_streaming_tool_calling ||||
123+
| test_chat_streaming_tool_choice_none ||||
124+
| test_chat_streaming_tool_choice_required ||||
97125

98126
## Openai
99127

100-
*Tests run on: 2025-04-10 16:47:28*
128+
*Tests run on: 2025-04-14 18:09:51*
101129

102130
```bash
103131
# Run all tests for this provider:
@@ -121,12 +149,26 @@ pytest tests/verifications/openai_api/test_chat_completion.py --provider=openai
121149
| test_chat_non_streaming_basic (earth) |||
122150
| test_chat_non_streaming_basic (saturn) |||
123151
| test_chat_non_streaming_image |||
152+
| test_chat_non_streaming_multi_turn_tool_calling (add_product_tool) |||
153+
| test_chat_non_streaming_multi_turn_tool_calling (compare_monthly_expense_tool) |||
154+
| test_chat_non_streaming_multi_turn_tool_calling (get_then_create_event_tool) |||
155+
| test_chat_non_streaming_multi_turn_tool_calling (text_then_weather_tool) |||
156+
| test_chat_non_streaming_multi_turn_tool_calling (weather_tool_then_text) |||
124157
| test_chat_non_streaming_structured_output (calendar) |||
125158
| test_chat_non_streaming_structured_output (math) |||
126159
| test_chat_non_streaming_tool_calling |||
160+
| test_chat_non_streaming_tool_choice_none |||
161+
| test_chat_non_streaming_tool_choice_required |||
127162
| test_chat_streaming_basic (earth) |||
128163
| test_chat_streaming_basic (saturn) |||
129164
| test_chat_streaming_image |||
165+
| test_chat_streaming_multi_turn_tool_calling (add_product_tool) |||
166+
| test_chat_streaming_multi_turn_tool_calling (compare_monthly_expense_tool) |||
167+
| test_chat_streaming_multi_turn_tool_calling (get_then_create_event_tool) |||
168+
| test_chat_streaming_multi_turn_tool_calling (text_then_weather_tool) |||
169+
| test_chat_streaming_multi_turn_tool_calling (weather_tool_then_text) |||
130170
| test_chat_streaming_structured_output (calendar) |||
131171
| test_chat_streaming_structured_output (math) |||
132172
| test_chat_streaming_tool_calling |||
173+
| test_chat_streaming_tool_choice_none |||
174+
| test_chat_streaming_tool_choice_required |||

tests/verifications/openai_api/fixtures/test_cases/chat_completion.yaml

Lines changed: 218 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -131,3 +131,221 @@ test_tool_calling:
131131
type: object
132132
type: function
133133
output: get_weather_tool_call
134+
135+
test_chat_multi_turn_tool_calling:
136+
test_name: test_chat_multi_turn_tool_calling
137+
test_params:
138+
case:
139+
- case_id: "text_then_weather_tool"
140+
input:
141+
messages:
142+
- - role: user
143+
content: "What's the name of the Sun in latin?"
144+
- - role: user
145+
content: "What's the weather like in San Francisco?"
146+
tools:
147+
- function:
148+
description: Get the current weather
149+
name: get_weather
150+
parameters:
151+
type: object
152+
properties:
153+
location:
154+
description: "The city and state (both required), e.g. San Francisco, CA."
155+
type: string
156+
required: ["location"]
157+
type: function
158+
tool_responses:
159+
- response: "{'response': '70 degrees and foggy'}"
160+
expected:
161+
- num_tool_calls: 0
162+
answer: ["sol"]
163+
- num_tool_calls: 1
164+
tool_name: get_weather
165+
tool_arguments:
166+
location: "San Francisco, CA"
167+
- num_tool_calls: 0
168+
answer: ["foggy", "70 degrees"]
169+
- case_id: "weather_tool_then_text"
170+
input:
171+
messages:
172+
- - role: user
173+
content: "What's the weather like in San Francisco?"
174+
tools:
175+
- function:
176+
description: Get the current weather
177+
name: get_weather
178+
parameters:
179+
type: object
180+
properties:
181+
location:
182+
description: "The city and state (both required), e.g. San Francisco, CA."
183+
type: string
184+
required: ["location"]
185+
type: function
186+
tool_responses:
187+
- response: "{'response': '70 degrees and foggy'}"
188+
expected:
189+
- num_tool_calls: 1
190+
tool_name: get_weather
191+
tool_arguments:
192+
location: "San Francisco, CA"
193+
- num_tool_calls: 0
194+
answer: ["foggy", "70 degrees"]
195+
- case_id: "add_product_tool"
196+
input:
197+
messages:
198+
- - role: user
199+
content: "Please add a new product with name 'Widget', price 19.99, in stock, and tags ['new', 'sale'] and give me the product id."
200+
tools:
201+
- function:
202+
description: Add a new product
203+
name: addProduct
204+
parameters:
205+
type: object
206+
properties:
207+
name:
208+
description: "Name of the product"
209+
type: string
210+
price:
211+
description: "Price of the product"
212+
type: number
213+
inStock:
214+
description: "Availability status of the product."
215+
type: boolean
216+
tags:
217+
description: "List of product tags"
218+
type: array
219+
items:
220+
type: string
221+
required: ["name", "price", "inStock"]
222+
type: function
223+
tool_responses:
224+
- response: "{'response': 'Successfully added product with id: 123'}"
225+
expected:
226+
- num_tool_calls: 1
227+
tool_name: addProduct
228+
tool_arguments:
229+
name: "Widget"
230+
price: 19.99
231+
inStock: true
232+
tags:
233+
- "new"
234+
- "sale"
235+
- num_tool_calls: 0
236+
answer: ["123", "product id: 123"]
237+
- case_id: "get_then_create_event_tool"
238+
input:
239+
messages:
240+
- - role: system
241+
content: "Todays date is 2025-03-01."
242+
- role: user
243+
content: "Do i have any meetings on March 3rd at 10 am? Yes or no?"
244+
- - role: user
245+
content: "Alright then, Create an event named 'Team Building', scheduled for that time same time, in the 'Main Conference Room' and add Alice, Bob, Charlie to it. Give me the created event id."
246+
tools:
247+
- function:
248+
description: Create a new event
249+
name: create_event
250+
parameters:
251+
type: object
252+
properties:
253+
name:
254+
description: "Name of the event"
255+
type: string
256+
date:
257+
description: "Date of the event in ISO format"
258+
type: string
259+
time:
260+
description: "Event Time (HH:MM)"
261+
type: string
262+
location:
263+
description: "Location of the event"
264+
type: string
265+
participants:
266+
description: "List of participant names"
267+
type: array
268+
items:
269+
type: string
270+
required: ["name", "date", "time", "location", "participants"]
271+
type: function
272+
- function:
273+
description: Get an event by date and time
274+
name: get_event
275+
parameters:
276+
type: object
277+
properties:
278+
date:
279+
description: "Date of the event in ISO format"
280+
type: string
281+
time:
282+
description: "Event Time (HH:MM)"
283+
type: string
284+
required: ["date", "time"]
285+
type: function
286+
tool_responses:
287+
- response: "{'response': 'No events found for 2025-03-03 at 10:00'}"
288+
- response: "{'response': 'Successfully created new event with id: e_123'}"
289+
expected:
290+
- num_tool_calls: 1
291+
tool_name: get_event
292+
tool_arguments:
293+
date: "2025-03-03"
294+
time: "10:00"
295+
- num_tool_calls: 0
296+
answer: ["no", "no events found", "no meetings"]
297+
- num_tool_calls: 1
298+
tool_name: create_event
299+
tool_arguments:
300+
name: "Team Building"
301+
date: "2025-03-03"
302+
time: "10:00"
303+
location: "Main Conference Room"
304+
participants:
305+
- "Alice"
306+
- "Bob"
307+
- "Charlie"
308+
- num_tool_calls: 0
309+
answer: ["e_123", "event id: e_123"]
310+
- case_id: "compare_monthly_expense_tool"
311+
input:
312+
messages:
313+
- - role: system
314+
content: "Todays date is 2025-03-01."
315+
- role: user
316+
content: "what was my monthly expense in Jan of this year?"
317+
- - role: user
318+
content: "Was it less than Feb of last year? Only answer with yes or no."
319+
tools:
320+
- function:
321+
description: Get monthly expense summary
322+
name: getMonthlyExpenseSummary
323+
parameters:
324+
type: object
325+
properties:
326+
month:
327+
description: "Month of the year (1-12)"
328+
type: integer
329+
year:
330+
description: "Year"
331+
type: integer
332+
required: ["month", "year"]
333+
type: function
334+
tool_responses:
335+
- response: "{'response': 'Total expenses for January 2025: $1000'}"
336+
- response: "{'response': 'Total expenses for February 2024: $2000'}"
337+
expected:
338+
- num_tool_calls: 1
339+
tool_name: getMonthlyExpenseSummary
340+
tool_arguments:
341+
month: 1
342+
year: 2025
343+
- num_tool_calls: 0
344+
answer: ["1000", "$1,000", "1,000"]
345+
- num_tool_calls: 1
346+
tool_name: getMonthlyExpenseSummary
347+
tool_arguments:
348+
month: 2
349+
year: 2024
350+
- num_tool_calls: 0
351+
answer: ["yes"]

0 commit comments

Comments
 (0)
pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy