This is a simple e2e example of an agent + isolated tools that leverages llama.cpp's new universal tool call support (which I've added in ggml-org/llama.cpp#9639).

While any model should work (using some generic support), bigger and more recent models fine-tuned specifically for function calling will give the best results.

Here's how to run an agent w/ local tool calls:
- Install prerequisites:

  - `uv` (used to simplify Python deps)

  - llama.cpp: see its build / install instructions (on a Mac, just install w/ `brew install llama.cpp`), or build from source:

    ```bash
    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp
    cmake -B build -DLLAMA_CURL=1
    cmake --build build --parallel
    cmake --install build
    ```
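
  If you want to sanity-check the installs, both tools can report their version from the command line (this assumes `uv` and `llama-server` ended up on your `PATH`):

  ```bash
  uv --version            # prints the installed uv version
  llama-server --version  # prints the llama.cpp build version
  ```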
- Clone this gist and cd to its contents:

  ```bash
  git clone https://gist.github.com/9246d289b7d38d49e1ee2755698d6c79.git llama.cpp-agent-example
  cd llama.cpp-agent-example
  ```
- Run `llama-server` w/ any model good at function calling. Bigger / more recent = better (try `Q8_0` quants if you can). You might also need a chat template override to make the most of your model's native tool call format support (see server/README.md). The following models have been verified:

  ```bash
  # Native support for Llama 3.x, Mistral Nemo, Qwen 2.5, Hermes 3, Functionary 3.x, Firefunction v2...
  llama-server --jinja -fa -hf bartowski/Qwen2.5-7B-Instruct-GGUF:Q4_K_M
  llama-server --jinja -fa -hf bartowski/phi-4-GGUF:Q4_0
  llama-server --jinja -fa -hf bartowski/Mistral-Nemo-Instruct-2407-GGUF:Q6_K_L
  llama-server --jinja -fa -hf bartowski/functionary-small-v3.2-GGUF:Q4_K_M

  llama-server --jinja -fa -hf bartowski/Hermes-3-Llama-3.1-8B-GGUF:Q4_K_M \
    --chat-template-file <( python scripts/get_chat_template.py NousResearch/Hermes-3-Llama-3.1-8B tool_use )

  llama-server --jinja -fa -hf bartowski/Llama-3.3-70B-Instruct-GGUF:Q4_K_M

  llama-server --jinja -fa -hf bartowski/firefunction-v2-GGUF -hff firefunction-v2-Q5_K_M.gguf \
    --chat-template-file <( python scripts/get_chat_template.py fireworks-ai/llama-3-firefunction-v2 tool_use )

  llama-server --jinja -fa -hf bartowski/Hermes-2-Pro-Llama-3-8B-GGUF:Q4_K_M \
    --chat-template-file <( python scripts/get_chat_template.py NousResearch/Hermes-2-Pro-Llama-3-8B tool_use )
  ```
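
  Once the server is up, you can optionally verify tool calling end-to-end with a plain request to its OpenAI-compatible chat completions endpoint. This is just a minimal sketch: it assumes the default port (8080) and uses a made-up `get_weather` tool purely for illustration:

  ```bash
  curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city (illustrative tool, not part of this gist)",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'
  ```

  If tool call support is working, the assistant message in the response should contain a `tool_calls` entry targeting `get_weather` rather than a plain-text answer.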
- Run the tools in this gist inside a Docker container for some level of isolation (+ sneaky logging of outgoing http and https traffic: you wanna watch over those agents' shoulders for the time being 🧐). Check http://localhost:8088/docs to see the tools exposed (or list them from the command line, as shown below the warning).

  ```bash
  export BRAVE_SEARCH_API_KEY=... # Get one at https://api.search.brave.com/
  ./serve_tools_inside_docker.sh
  ```
  > [!WARNING]
  > The command above gives tools (and your agent) access to the web (and read-only access to `examples/agent/**`). You can loosen / restrict web access in `examples/agent/squid/conf/squid.conf`.
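
  To list the exposed tools from the command line, you can fetch the tool server's OpenAPI schema. This assumes the server is a FastAPI-style app serving its schema at `/openapi.json` (which the `/docs` URL suggests):

  ```bash
  # Print the paths (i.e. tool endpoints) declared in the tool server's OpenAPI schema.
  curl -s http://localhost:8088/openapi.json \
    | python -c 'import json, sys; print("\n".join(json.load(sys.stdin)["paths"]))'
  ```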
- Run the agent with some goal:

  ```bash
  uv run agent.py "What is the sum of 2535 squared and 32222000403?"
  ```

  Output w/ Hermes-3-Llama-3.1-8B:

  ```
  🛠️ Tools: python, fetch_page, brave_search
  ⚙️ python(code="print(2535**2 + 32222000403)") → 15 chars
  The sum of 2535 squared and 32222000403 is 32228426628.
  ```
uv run agent.py "What is the best BBQ joint in Laguna Beach?"
See output w/ Hermes-3-Llama-3.1-8B
🛠️ Tools: python, fetch_page, brave_search ⚙️ brave_search(query="best bbq joint in laguna beach") → 4283 chars Based on the search results, Beach Pit BBQ seems to be a popular and highly-rated BBQ joint in Laguna Beach. They offer a variety of BBQ options, including ribs, pulled pork, brisket, salads, wings, and more. They have dine-in, take-out, and catering options available.
uv run agent.py "Search (with brave), fetch and summarize the homepage of llama.cpp"
See output w/ Hermes-3-Llama-3.1-8B
🛠️ Tools: python, fetch_page, brave_search ⚙️ brave_search(query="llama.cpp") → 3330 chars Llama.cpp is an open-source software library written in C++ that performs inference on various Large Language Models (LLMs). Alongside the library, it includes a CLI and web server. It is co-developed alongside the GGML project, a general-purpose tensor library. Llama.cpp is also available with Python bindings, known as llama.cpp-python. It has gained popularity for its ability to run LLMs on local machines, such as Macs with NVIDIA RTX systems. Users can leverage this library to accelerate LLMs and integrate them into various applications. There are numerous resources available, including tutorials and guides, for getting started with Llama.cpp and llama.cpp-python.
- To compare the above results w/ a cloud provider's tool usage behaviour, just set the `--provider` flag (accepts `openai`, `together`, `groq`) and/or use `--endpoint`, `--api-key`, and `--model`:

  ```bash
  export LLAMA_API_KEY=...    # for --provider=llama.cpp https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md
  export OPENAI_API_KEY=...   # for --provider=openai    https://platform.openai.com/api-keys
  export TOGETHER_API_KEY=... # for --provider=together  https://api.together.ai/settings/api-keys
  export GROQ_API_KEY=...     # for --provider=groq      https://console.groq.com/keys

  uv run agent.py "Search for, fetch and summarize the homepage of llama.cpp" --provider=openai
  ```
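
  The explicit flags are handy for endpoints not covered by `--provider`. The following is only an illustrative sketch: the endpoint URL, API key variable, and model name are placeholders for whatever OpenAI-compatible service you actually want to hit:

  ```bash
  # Hypothetical example: point the agent at an arbitrary OpenAI-compatible endpoint.
  uv run agent.py "Search for, fetch and summarize the homepage of llama.cpp" \
    --endpoint https://api.example.com/v1 \
    --api-key "$MY_API_KEY" \
    --model some-provider/some-model
  ```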