
Commit f34f22f

feat: add batch inference API to llama stack inference (meta-llama#1945)
# What does this PR do?

This PR adds two methods to the Inference API:
- `batch_completion`
- `batch_chat_completion`

The motivation is evaluations targeting a local inference engine (like meta-reference or vllm), where batch APIs provide a substantial amount of acceleration.

Why did I not add this to `Api.batch_inference` instead? That would have meant a _lot_ more book-keeping given the structure of Llama Stack: I would have needed to create a notion of a "batch model" resource, set up routing based on it, and so on. That does not sound ideal.

So what's the future of the batch inference API? I am not sure. Maybe we can keep it for true _asynchronous_ execution, where you submit requests and it returns a Job instance, etc.

## Test Plan

Run meta-reference-gpu using:

```bash
export INFERENCE_MODEL=meta-llama/Llama-4-Scout-17B-16E-Instruct
export INFERENCE_CHECKPOINT_DIR=../checkpoints/Llama-4-Scout-17B-16E-Instruct-20250331210000
export MODEL_PARALLEL_SIZE=4
export MAX_BATCH_SIZE=32
export MAX_SEQ_LEN=6144

LLAMA_MODELS_DEBUG=1 llama stack run meta-reference-gpu
```

Then run the batch inference test case.
1 parent 854c2ad commit f34f22f
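The schema changes in this commit rename `model` to `model_id` and fold the old top-level `tool_choice` / `tool_prompt_format` fields into a single nested `tool_config` object. A request body for the renamed `POST /v1/inference/batch-chat-completion` endpoint could be assembled as sketched below; the `build_batch_chat_request` helper and the sample messages are hypothetical illustrations, not part of Llama Stack:

```python
# Hypothetical helper: assemble a BatchChatCompletionRequest body matching
# the schema in this commit. Not part of the Llama Stack client library.

def build_batch_chat_request(model_id, messages_batch, tool_config=None):
    """Build a request body for POST /v1/inference/batch-chat-completion.

    `model_id` and `messages_batch` are required by the schema; tool
    settings now live under a single nested `tool_config` object.
    """
    body = {"model_id": model_id, "messages_batch": messages_batch}
    if tool_config is not None:
        body["tool_config"] = tool_config
    return body


payload = build_batch_chat_request(
    "meta-llama/Llama-4-Scout-17B-16E-Instruct",
    [
        # One conversation (list of messages) per batch entry.
        [{"role": "user", "content": "What is the capital of France?"}],
        [{"role": "user", "content": "Name a prime number greater than 10."}],
    ],
    tool_config={"tool_choice": "auto"},
)
```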

File tree

23 files changed: +702 −393 lines


docs/_static/llama-stack-spec.html

Lines changed: 60 additions & 75 deletions
@@ -85,7 +85,7 @@
         }
       }
     },
-    "/v1/batch-inference/chat-completion": {
+    "/v1/inference/batch-chat-completion": {
       "post": {
         "responses": {
           "200": {
@@ -112,7 +112,7 @@
           }
         },
         "tags": [
-          "BatchInference (Coming Soon)"
+          "Inference"
         ],
         "description": "",
         "parameters": [],
@@ -128,7 +128,7 @@
         }
       }
     },
-    "/v1/batch-inference/completion": {
+    "/v1/inference/batch-completion": {
       "post": {
         "responses": {
           "200": {
@@ -155,7 +155,7 @@
           }
         },
         "tags": [
-          "BatchInference (Coming Soon)"
+          "Inference"
         ],
         "description": "",
         "parameters": [],
@@ -239,7 +239,7 @@
           }
         },
         "tags": [
-          "Inference"
+          "BatchInference (Coming Soon)"
         ],
         "description": "Generate a chat completion for the given messages using the specified model.",
         "parameters": [],
@@ -287,7 +287,7 @@
           }
         },
         "tags": [
-          "Inference"
+          "BatchInference (Coming Soon)"
         ],
         "description": "Generate a completion for the given content using the specified model.",
         "parameters": [],
@@ -4366,6 +4366,51 @@
       ],
       "title": "ToolCall"
     },
+    "ToolConfig": {
+      "type": "object",
+      "properties": {
+        "tool_choice": {
+          "oneOf": [
+            {
+              "type": "string",
+              "enum": [
+                "auto",
+                "required",
+                "none"
+              ],
+              "title": "ToolChoice",
+              "description": "Whether tool use is required or automatic. This is a hint to the model which may not be followed. It depends on the Instruction Following capabilities of the model."
+            },
+            {
+              "type": "string"
+            }
+          ],
+          "default": "auto",
+          "description": "(Optional) Whether tool use is automatic, required, or none. Can also specify a tool name to use a specific tool. Defaults to ToolChoice.auto."
+        },
+        "tool_prompt_format": {
+          "type": "string",
+          "enum": [
+            "json",
+            "function_tag",
+            "python_list"
+          ],
+          "description": "(Optional) Instructs the model how to format tool calls. By default, Llama Stack will attempt to use a format that is best adapted to the model. - `ToolPromptFormat.json`: The tool calls are formatted as a JSON object. - `ToolPromptFormat.function_tag`: The tool calls are enclosed in a <function=function_name> tag. - `ToolPromptFormat.python_list`: The tool calls are output as Python syntax -- a list of function calls."
+        },
+        "system_message_behavior": {
+          "type": "string",
+          "enum": [
+            "append",
+            "replace"
+          ],
+          "description": "(Optional) Config for how to override the default system prompt. - `SystemMessageBehavior.append`: Appends the provided system message to the default system prompt. - `SystemMessageBehavior.replace`: Replaces the default system prompt with the provided system message. The system message can include the string '{{function_definitions}}' to indicate where the function definitions should be inserted.",
+          "default": "append"
+        }
+      },
+      "additionalProperties": false,
+      "title": "ToolConfig",
+      "description": "Configuration for tool use."
+    },
     "ToolDefinition": {
       "type": "object",
       "properties": {
@@ -4554,7 +4599,7 @@
     "BatchChatCompletionRequest": {
       "type": "object",
       "properties": {
-        "model": {
+        "model_id": {
           "type": "string"
         },
         "messages_batch": {
@@ -4575,25 +4620,8 @@
             "$ref": "#/components/schemas/ToolDefinition"
           }
         },
-        "tool_choice": {
-          "type": "string",
-          "enum": [
-            "auto",
-            "required",
-            "none"
-          ],
-          "title": "ToolChoice",
-          "description": "Whether tool use is required or automatic. This is a hint to the model which may not be followed. It depends on the Instruction Following capabilities of the model."
-        },
-        "tool_prompt_format": {
-          "type": "string",
-          "enum": [
-            "json",
-            "function_tag",
-            "python_list"
-          ],
-          "title": "ToolPromptFormat",
-          "description": "Prompt format for calling custom / zero shot tools."
+        "tool_config": {
+          "$ref": "#/components/schemas/ToolConfig"
         },
         "response_format": {
           "$ref": "#/components/schemas/ResponseFormat"
@@ -4613,7 +4641,7 @@
       },
       "additionalProperties": false,
       "required": [
-        "model",
+        "model_id",
         "messages_batch"
       ],
       "title": "BatchChatCompletionRequest"
@@ -4710,7 +4738,7 @@
     "BatchCompletionRequest": {
       "type": "object",
      "properties": {
-        "model": {
+        "model_id": {
           "type": "string"
         },
         "content_batch": {
@@ -4740,7 +4768,7 @@
       },
       "additionalProperties": false,
       "required": [
-        "model",
+        "model_id",
         "content_batch"
       ],
       "title": "BatchCompletionRequest"
@@ -4812,51 +4840,6 @@
       ],
       "title": "CancelTrainingJobRequest"
     },
-    "ToolConfig": {
-      "type": "object",
-      "properties": {
-        "tool_choice": {
-          "oneOf": [
-            {
-              "type": "string",
-              "enum": [
-                "auto",
-                "required",
-                "none"
-              ],
-              "title": "ToolChoice",
-              "description": "Whether tool use is required or automatic. This is a hint to the model which may not be followed. It depends on the Instruction Following capabilities of the model."
-            },
-            {
-              "type": "string"
-            }
-          ],
-          "default": "auto",
-          "description": "(Optional) Whether tool use is automatic, required, or none. Can also specify a tool name to use a specific tool. Defaults to ToolChoice.auto."
-        },
-        "tool_prompt_format": {
-          "type": "string",
-          "enum": [
-            "json",
-            "function_tag",
-            "python_list"
-          ],
-          "description": "(Optional) Instructs the model how to format tool calls. By default, Llama Stack will attempt to use a format that is best adapted to the model. - `ToolPromptFormat.json`: The tool calls are formatted as a JSON object. - `ToolPromptFormat.function_tag`: The tool calls are enclosed in a <function=function_name> tag. - `ToolPromptFormat.python_list`: The tool calls are output as Python syntax -- a list of function calls."
-        },
-        "system_message_behavior": {
-          "type": "string",
-          "enum": [
-            "append",
-            "replace"
-          ],
-          "description": "(Optional) Config for how to override the default system prompt. - `SystemMessageBehavior.append`: Appends the provided system message to the default system prompt. - `SystemMessageBehavior.replace`: Replaces the default system prompt with the provided system message. The system message can include the string '{{function_definitions}}' to indicate where the function definitions should be inserted.",
-          "default": "append"
-        }
-      },
-      "additionalProperties": false,
-      "title": "ToolConfig",
-      "description": "Configuration for tool use."
-    },
     "ChatCompletionRequest": {
       "type": "object",
       "properties": {
@@ -11173,7 +11156,9 @@
       "x-displayName": "Agents API for creating and interacting with agentic systems."
     },
     {
-      "name": "BatchInference (Coming Soon)"
+      "name": "BatchInference (Coming Soon)",
+      "description": "This is an asynchronous API. If the request is successful, the response will be a job which can be polled for completion.\n\nNOTE: This API is not yet implemented and is subject to change in concert with other asynchronous APIs\nincluding (post-training, evals, etc).",
+      "x-displayName": "Batch inference API for generating completions and chat completions."
     },
     {
       "name": "Benchmarks"
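The `ToolConfig` schema moved in this diff declares defaults (`tool_choice: "auto"`, `system_message_behavior: "append"`) and closed enums for two of its three fields. A minimal sketch of how those constraints could be applied, for illustration only (the actual server validates via generated Pydantic models, not this code):

```python
# Sketch: defaults and enum checks from the ToolConfig schema above.
# normalize_tool_config is a hypothetical helper, not Llama Stack code.

TOOL_PROMPT_FORMATS = {"json", "function_tag", "python_list"}
SYSTEM_MESSAGE_BEHAVIORS = {"append", "replace"}


def normalize_tool_config(cfg: dict) -> dict:
    """Fill in schema defaults and reject out-of-enum values."""
    cfg = dict(cfg)
    # tool_choice is a oneOf: either the ToolChoice enum ("auto" /
    # "required" / "none") or an arbitrary tool name, i.e. any string,
    # so it cannot be strictly validated here. Its default is "auto".
    cfg.setdefault("tool_choice", "auto")
    cfg.setdefault("system_message_behavior", "append")
    fmt = cfg.get("tool_prompt_format")
    if fmt is not None and fmt not in TOOL_PROMPT_FORMATS:
        raise ValueError(f"invalid tool_prompt_format: {fmt!r}")
    behavior = cfg["system_message_behavior"]
    if behavior not in SYSTEM_MESSAGE_BEHAVIORS:
        raise ValueError(f"invalid system_message_behavior: {behavior!r}")
    return cfg
```

Note that because `tool_choice` accepts any string (to name a specific tool), a typo like `"requried"` would silently be treated as a tool name rather than rejected, a trade-off inherent to the `oneOf` in the schema.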
