
Commit 7da8e0f

Merge branch 'main' of github.com:abetlen/llama_cpp_python into main
2 parents 8474665 + 40b2290 commit 7da8e0f


README.md

Lines changed: 14 additions & 7 deletions
@@ -106,14 +106,14 @@ Below is a short example demonstrating how to use the high-level API to generate

```python
>>> from llama_cpp import Llama
->>> llm = Llama(model_path="./models/7B/ggml-model.bin")
+>>> llm = Llama(model_path="./models/7B/llama-model.gguf")
>>> output = llm("Q: Name the planets in the solar system? A: ", max_tokens=32, stop=["Q:", "\n"], echo=True)
>>> print(output)
{
  "id": "cmpl-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "object": "text_completion",
  "created": 1679561337,
-  "model": "./models/7B/ggml-model.bin",
+  "model": "./models/7B/llama-model.gguf",
  "choices": [
    {
      "text": "Q: Name the planets in the solar system? A: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune and Pluto.",
@@ -136,15 +136,15 @@ The context window of the Llama models determines the maximum number of tokens t
For instance, if you want to work with larger contexts, you can expand the context window by setting the n_ctx parameter when initializing the Llama object:

```python
-llm = Llama(model_path="./models/7B/ggml-model.bin", n_ctx=2048)
+llm = Llama(model_path="./models/7B/llama-model.gguf", n_ctx=2048)
```

### Loading llama-2 70b

Llama2 70b must set the `n_gqa` parameter (grouped-query attention factor) to 8 when loading:

```python
-llm = Llama(model_path="./models/70B/ggml-model.bin", n_gqa=8)
+llm = Llama(model_path="./models/70B/llama-model.gguf", n_gqa=8)
```

## Web Server
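
As a side note on the `n_ctx` change above: the context window bounds the combined length of the prompt and the generated tokens. Below is a minimal sketch of budgeting `max_tokens` against that window, assuming the high-level `Llama.tokenize` helper; the model path is a placeholder and this code is not part of the commit.

```python
# Sketch only: budget max_tokens against the configured context window.
from llama_cpp import Llama

llm = Llama(model_path="./models/7B/llama-model.gguf", n_ctx=2048)

prompt = "Q: Summarize the history of the solar system. A: "
prompt_tokens = llm.tokenize(prompt.encode("utf-8"))

# Leave room for the prompt inside the 2048-token window.
budget = 2048 - len(prompt_tokens)
output = llm(prompt, max_tokens=budget)
print(output["choices"][0]["text"])
```
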
@@ -156,17 +156,24 @@ To install the server package and get started:

```bash
pip install llama-cpp-python[server]
-python3 -m llama_cpp.server --model models/7B/ggml-model.bin
+python3 -m llama_cpp.server --model models/7B/llama-model.gguf
+```
+
+Similar to the Hardware Acceleration section above, you can also install with GPU (cuBLAS) support like this:
+
+```bash
+CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python[server]
+python3 -m llama_cpp.server --model models/7B/llama-model.gguf --n_gpu_layers 35
```

Navigate to [http://localhost:8000/docs](http://localhost:8000/docs) to see the OpenAPI documentation.

## Docker image

A Docker image is available on [GHCR](https://ghcr.io/abetlen/llama-cpp-python). To run the server:

```bash
-docker run --rm -it -p 8000:8000 -v /path/to/models:/models -e MODEL=/models/ggml-model-name.bin ghcr.io/abetlen/llama-cpp-python:latest
+docker run --rm -it -p 8000:8000 -v /path/to/models:/models -e MODEL=/models/llama-model.gguf ghcr.io/abetlen/llama-cpp-python:latest
```
[Docker on termux (requires root)](https://gist.github.com/FreddieOliveira/efe850df7ff3951cb62d74bd770dce27) is currently the only known way to run this on phones, see [termux support issue](https://github.com/abetlen/llama-cpp-python/issues/389)
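
For context on the server commands in the hunk above: once `llama_cpp.server` is running, it serves an OpenAI-compatible HTTP API on port 8000. The following is a minimal sketch of querying it, assuming the `/v1/completions` route and using the `requests` library; neither is part of this diff.

```python
# Sketch only: call the locally running llama_cpp.server instance.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "prompt": "Q: Name the planets in the solar system? A: ",
        "max_tokens": 32,
        "stop": ["Q:", "\n"],
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```
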

@@ -183,7 +190,7 @@ Below is a short example demonstrating how to use the low-level API to tokenize
>>> llama_cpp.llama_backend_init(numa=False) # Must be called once at the start of each program
>>> params = llama_cpp.llama_context_default_params()
# use bytes for char * params
->>> model = llama_cpp.llama_load_model_from_file(b"./models/7b/ggml-model.bin", params)
+>>> model = llama_cpp.llama_load_model_from_file(b"./models/7b/llama-model.gguf", params)
>>> ctx = llama_cpp.llama_new_context_with_model(model, params)
>>> max_tokens = params.n_ctx
# use ctypes arrays for array params
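
One caveat worth noting alongside the low-level example updated here: the low-level ctypes-style API does not manage resources for you, so contexts and models should be freed explicitly. Below is a minimal sketch, not part of this commit, assuming the `llama_free` and `llama_free_model` bindings are exposed; the model path is a placeholder.

```python
# Sketch only: explicit setup/teardown around the low-level API shown above.
import llama_cpp

llama_cpp.llama_backend_init(numa=False)  # once per process
params = llama_cpp.llama_context_default_params()

model = llama_cpp.llama_load_model_from_file(b"./models/7b/llama-model.gguf", params)
ctx = llama_cpp.llama_new_context_with_model(model, params)
try:
    ...  # tokenize / evaluate / sample with the low-level calls
finally:
    # Free in reverse order of creation (assumed binding names).
    llama_cpp.llama_free(ctx)
    llama_cpp.llama_free_model(model)
```
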
