
Commit 03f171e

example: LLM inference with Ray Serve (abetlen#1465)
1 parent b564d05 commit 03f171e

File tree

3 files changed: +42 -0 lines changed


examples/ray/README.md

Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
This is an example of doing LLM inference with [Ray](https://docs.ray.io/en/latest/index.html) and [Ray Serve](https://docs.ray.io/en/latest/serve/index.html).

First, install the requirements:

```bash
$ pip install -r requirements.txt
```

Deploy a GGUF model to Ray Serve with the following command:

```bash
$ serve run llm:llm_builder model_path='../models/mistral-7b-instruct-v0.2.Q4_K_M.gguf'
```
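
You can optionally confirm that the application is up with Ray Serve's `serve status` CLI command; this check is a small addition and not part of the original example.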

This will start an API endpoint at `http://localhost:8000/`. You can query the model like this:

```bash
$ curl -k -d '{"prompt": "tell me a joke", "max_tokens": 128}' -X POST http://localhost:8000
```
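
You can also query the endpoint from Python. The snippet below is a sketch rather than part of the original example: it assumes the `requests` package is installed and that the response follows llama-cpp-python's OpenAI-style completion format.

```python
# Sketch: query the Ray Serve endpoint from Python (assumes `requests` is installed).
import requests

response = requests.post(
    "http://localhost:8000",
    json={"prompt": "tell me a joke", "max_tokens": 128},
)
response.raise_for_status()
completion = response.json()

# llama-cpp-python returns an OpenAI-style completion dict; the generated
# text is expected under choices[0]["text"].
print(completion["choices"][0]["text"])
```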

examples/ray/llm.py

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
from starlette.requests import Request
from typing import Dict
from ray import serve
from ray.serve import Application
from llama_cpp import Llama

@serve.deployment
class LlamaDeployment:
    def __init__(self, model_path: str):
        # Load the GGUF model once per replica.
        self._llm = Llama(model_path=model_path)

    async def __call__(self, http_request: Request) -> Dict:
        # Expects a JSON body such as {"prompt": "...", "max_tokens": 128}.
        input_json = await http_request.json()
        prompt = input_json["prompt"]
        max_tokens = input_json.get("max_tokens", 64)
        return self._llm(prompt, max_tokens=max_tokens)


def llm_builder(args: Dict[str, str]) -> Application:
    # Application builder used by `serve run llm:llm_builder model_path=...`.
    return LlamaDeployment.bind(args["model_path"])
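
For reference, the `serve run llm:llm_builder model_path='...'` command in the README is roughly what the following Python sketch does by hand. It is not part of the commit; it assumes a GGUF model exists at the given path and that the driver process is kept alive while serving.

```python
# Sketch: build and run the Serve application from Python instead of the CLI.
from ray import serve

from llm import llm_builder

app = llm_builder({"model_path": "../models/mistral-7b-instruct-v0.2.Q4_K_M.gguf"})
serve.run(app)  # deploys the app; HTTP requests are served on port 8000 by default

# Keep the driver alive so the deployment stays up (the `serve run` CLI does this for you).
input("Serving on http://localhost:8000 - press Enter to shut down...\n")
serve.shutdown()
```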

examples/ray/requirements.txt

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
ray[serve]
--extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
llama-cpp-python

0 commit comments
