
Commit 7da8e0f

Merge branch 'main' of github.com:abetlen/llama_cpp_python into main
2 parents 8474665 + 40b2290 commit 7da8e0f


README.md

Lines changed: 14 additions & 7 deletions
@@ -106,14 +106,14 @@ Below is a short example demonstrating how to use the high-level API to generate

```python
>>> from llama_cpp import Llama
->>> llm = Llama(model_path="./models/7B/ggml-model.bin")
+>>> llm = Llama(model_path="./models/7B/llama-model.gguf")
>>> output = llm("Q: Name the planets in the solar system? A: ", max_tokens=32, stop=["Q:", "\n"], echo=True)
>>> print(output)
{
  "id": "cmpl-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "object": "text_completion",
  "created": 1679561337,
-  "model": "./models/7B/ggml-model.bin",
+  "model": "./models/7B/llama-model.gguf",
  "choices": [
    {
      "text": "Q: Name the planets in the solar system? A: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune and Pluto.",
@@ -136,15 +136,15 @@ The context window of the Llama models determines the maximum number of tokens t
For instance, if you want to work with larger contexts, you can expand the context window by setting the n_ctx parameter when initializing the Llama object:

```python
-llm = Llama(model_path="./models/7B/ggml-model.bin", n_ctx=2048)
+llm = Llama(model_path="./models/7B/llama-model.gguf", n_ctx=2048)
```

### Loading llama-2 70b

Llama2 70b must set the `n_gqa` parameter (grouped-query attention factor) to 8 when loading:

```python
-llm = Llama(model_path="./models/70B/ggml-model.bin", n_gqa=8)
+llm = Llama(model_path="./models/70B/llama-model.gguf", n_gqa=8)
```

## Web Server
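
As a side note on the `n_ctx` change above: the context window bounds the combined length of the prompt and the generated tokens. Below is a minimal sketch of budgeting `max_tokens` against that window, assuming the high-level `Llama.tokenize` helper; the model path is a placeholder and this code is not part of the commit.

```python
# Sketch only: budget max_tokens against the configured context window.
from llama_cpp import Llama

llm = Llama(model_path="./models/7B/llama-model.gguf", n_ctx=2048)

prompt = "Q: Summarize the history of the solar system. A: "
prompt_tokens = llm.tokenize(prompt.encode("utf-8"))

# Leave room for the prompt inside the 2048-token window.
budget = 2048 - len(prompt_tokens)
output = llm(prompt, max_tokens=budget)
print(output["choices"][0]["text"])
```
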
@@ -156,17 +156,24 @@ To install the server package and get started:

```bash
pip install llama-cpp-python[server]
-python3 -m llama_cpp.server --model models/7B/ggml-model.bin
+python3 -m llama_cpp.server --model models/7B/llama-model.gguf
+```
+
+Similar to the Hardware Acceleration section above, you can also install with GPU (cuBLAS) support like this:
+
+```bash
+CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python[server]
+python3 -m llama_cpp.server --model models/7B/llama-model.gguf --n_gpu_layers 35
```

Navigate to [http://localhost:8000/docs](http://localhost:8000/docs) to see the OpenAPI documentation.

## Docker image

A Docker image is available on [GHCR](https://ghcr.io/abetlen/llama-cpp-python). To run the server:

```bash
-docker run --rm -it -p 8000:8000 -v /path/to/models:/models -e MODEL=/models/ggml-model-name.bin ghcr.io/abetlen/llama-cpp-python:latest
+docker run --rm -it -p 8000:8000 -v /path/to/models:/models -e MODEL=/models/llama-model.gguf ghcr.io/abetlen/llama-cpp-python:latest
```
[Docker on termux (requires root)](https://gist.github.com/FreddieOliveira/efe850df7ff3951cb62d74bd770dce27) is currently the only known way to run this on phones, see [termux support issue](https://github.com/abetlen/llama-cpp-python/issues/389)
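
For context on the server commands in the hunk above: once `llama_cpp.server` is running, it serves an OpenAI-compatible HTTP API on port 8000. The following is a minimal sketch of querying it, assuming the `/v1/completions` route and using the `requests` library; neither is part of this diff.

```python
# Sketch only: call the locally running llama_cpp.server instance.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "prompt": "Q: Name the planets in the solar system? A: ",
        "max_tokens": 32,
        "stop": ["Q:", "\n"],
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```
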

@@ -183,7 +190,7 @@ Below is a short example demonstrating how to use the low-level API to tokenize
>>> llama_cpp.llama_backend_init(numa=False) # Must be called once at the start of each program
>>> params = llama_cpp.llama_context_default_params()
# use bytes for char * params
->>> model = llama_cpp.llama_load_model_from_file(b"./models/7b/ggml-model.bin", params)
+>>> model = llama_cpp.llama_load_model_from_file(b"./models/7b/llama-model.gguf", params)
>>> ctx = llama_cpp.llama_new_context_with_model(model, params)
>>> max_tokens = params.n_ctx
# use ctypes arrays for array params
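
One caveat worth noting alongside the low-level example updated here: the low-level ctypes-style API does not manage resources for you, so contexts and models should be freed explicitly. Below is a minimal sketch, not part of this commit, assuming the `llama_free` and `llama_free_model` bindings are exposed; the model path is a placeholder.

```python
# Sketch only: explicit setup/teardown around the low-level API shown above.
import llama_cpp

llama_cpp.llama_backend_init(numa=False)  # once per process
params = llama_cpp.llama_context_default_params()

model = llama_cpp.llama_load_model_from_file(b"./models/7b/llama-model.gguf", params)
ctx = llama_cpp.llama_new_context_with_model(model, params)
try:
    ...  # tokenize / evaluate / sample with the low-level calls
finally:
    # Free in reverse order of creation (assumed binding names).
    llama_cpp.llama_free(ctx)
    llama_cpp.llama_free_model(model)
```
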
