Optimizing Large Language Models with the OpenVINO™ Toolkit
Solution White Paper | Artificial Intelligence
This solution white paper focuses on causal LLMs: models that predict the next word in a sequence, the type most commonly
used for AI chatbots.
Training LLMs
LLMs are trained from an extensive repository of text data, which is usually collected by crawling web pages across the
internet. LLMs are trained for a simple task: given a sequence of words, predict the next word in the sequence. To accomplish
this task, the billions of parameters in the LLM’s transformer network gradually adjust during training until they can accurately
predict the content of the text documents in the training data. Essentially, this training process compresses the entirety of
publicly available text on the internet (> 10TB of data) into a neural network of parameters (1GB - 200GB of data).
Training LLMs from scratch is an expensive process! Millions of GPU hours are required for the model to fit the training data.
Meta's* Llama 2 models were trained using over 3 million GPU hours at an estimated cost of $2,000,000 USD (reference 2).
Training from scratch is typically only done by research groups in large companies. Fortunately, many trained models are
released to the public under open-source licenses so they can be fine-tuned to a unique application.
Fine-tuning LLMs
Fine-tuning a large language model allows it to adapt its pre-trained knowledge to specific tasks. While the model may be good
at predicting words in a document, it is not yet adapted to providing human-like responses to questions. It is only trained on
publicly available data up to a specific cut-off date, so it may not have knowledge of current events. In the fine-tuning stage, a
new training dataset is used that consists of a smaller number of high-quality text examples that are often hand-selected or
generated by humans. Fine tuning aligns the model to this narrower dataset related to the target task so it can generate more
accurate responses.
For example, to convert the Llama 2 base model into a question-answering AI assistant, Meta researchers fine-tuned it using a
dataset of about 100,000 samples of hand-picked documents or human-written responses to prompts. Similarly sized
datasets may be used to align an LLM to provide more customized responses and improve its knowledge on specific topics.
Fine-tuning an LLM requires far fewer resources than training an LLM from scratch. Parameter-efficient fine-tuning (PEFT)
techniques such as Low-Rank Adaptation (LoRA)3 and QLoRA reduce the memory requirements. With these techniques, fine-
tuning can be accomplished on a computer equipped with a high-end GPU. Fine-tuning Llama 2-7B using Hugging Face's
PEFT LoRA method takes about 16 hours on a single GPU and uses less than 10GB of GPU memory.
Several popular open-source LLMs are available as starting points for fine-tuning, including:
▪ Llama 2 family: Llama 2-7B, Llama 2-13B, and Llama 2-70B (Meta*)
▪ Mistral 7B (Mistral* AI)
▪ Zephyr 7B (Hugging Face)
▪ Vicuna 7B (LMSYS*)
▪ Phi-2 (Microsoft*)
▪ Mixtral-8x7B (Mistral AI)
These LLMs are all supported by OpenVINO™, along with additional versions of these models with larger weights.
Note: OpenAI’s* GPT-4 model, and Anthropic’s* Claude 2 model outperform all the open-source models listed above in terms
of response accuracy and quality. However, both models are closed source and not available for fine-tuning.
Running an LLM locally places several demands on the system:
▪ Storage requirements: The system must have enough disk space to store the model files. The total file size of LLMs can
range from 2GB for small models to over 300GB for large models.
▪ RAM requirements: The amount of RAM used during inference depends on the context length, model size, and other
factors. As a rule of thumb, the system should have at least as much RAM as the size of the model file. If it doesn’t have
enough RAM available to load the full model, the system may resort to using disk storage as swap space for memory, which
will cause inference to run slowly. Or it may just crash when memory is exhausted.
▪ GPU vRAM requirements (if running on GPU): The GPU must have at least as much vRAM as the size of the model file with
sufficient memory padding for other OS tasks.
Fortunately, these requirements can be significantly reduced using weight compression. Compressing the model's weights
from FP32 to INT8 reduces the model to roughly one quarter of its original size, and converting to INT4 format reduces it to
roughly one eighth. For example, the
Zephyr-7B-beta model in FP32 format has a file size of 28 GB. When it is compressed to INT4 using OpenVINO™ NNCF, it
reduces the file size to 4GB while maintaining similar accuracy. To learn more, see the Weight Compression With OpenVINO™
section of this document.
Generally, a computer that has a mid-to-high-end processor and 16GB of RAM can safely run quantized 7B models such as
Llama-7B or Zephyr-7B-beta. These models provide a good entry point to experiment with text generation pipelines. As
models increase in size and number of parameters, higher-end hardware is needed to support them.
Slim Deployment
OpenVINO™ is a self-contained package that requires fewer dependencies than Hugging Face, PyTorch, and other machine
learning frameworks. Hugging Face and PyTorch environments require several gigabytes worth of dependencies, while
OpenVINO™ only requires several hundred megabytes.
Figure 2. A benefit of OpenVINO™ is its reduced footprint compared to other frameworks
The slimmer binary size and memory footprint of OpenVINO™ reduce the storage requirements for target hardware and make
containers easier to deploy and update. Fewer dependencies mean less headache with package and version management for
deployment environments.
Speed
OpenVINO™ provides optimized inference for LLMs, and it is constantly being improved for even faster performance.
Solutions built with OpenVINO™ are as fast as or faster than other third-party solutions such as the open-source llama.cpp
distribution.
Most other LLM-capable runtimes rely on Python code executing through a Python interpreter. OpenVINO™ is one of the only
runtime libraries that provides a full C/C++ API for inference with LLMs, targeted for resource-optimized production
environments. OpenVINO™ allows LLM applications to be built and optimized for target processors using C/C++.
Of course, OpenVINO™ also offers a Python API, which allows for quicker development of algorithms and programs:
Prototype a solution in Python, and then optimize it in C++ using OpenVINO™.
See the OpenVINO Performance Benchmarks page for graphs showing latency and throughput of LLMs on various platforms.
For more information on how to benchmark LLMs, see the Benchmarking LLMs and Measuring Accuracy section of this
solution white paper.
Below, we include a snapshot of benchmarking data on a few select CPU-only and GPU-based Intel hardware platforms:
Table 1 shows benchmark latency (milliseconds per token) for three Intel processors with three generative AI models,
chatGLM2-6b, Llama-2-7b-chat, and Mistral-7b, with FP16, INT8, and INT4 weights.
Table 2 shows benchmark latency (milliseconds per token) for GPU-based Intel processors (both discrete and integrated
GPU) with the three generative AI models, chatGLM2-6b, Llama-2-7b-chat, and Mistral-7b, with INT4 weights.
Table 2. Generative AI benchmarks (latency in milliseconds per token, INT4 weights) on Intel GPU processors:

Intel® Data Center GPU Flex 170 (dGPU): chatGLM2-6b 81.85, Llama-2-7b-chat 295.37, Mistral-7b 170.35
Intel® Arc™ A-Series Graphics (dGPU): chatGLM2-6b 112.15, Llama-2-7b-chat -, Mistral-7b 191.33
Intel® Data Center GPU Flex 140 (dGPU): chatGLM2-6b 174.13, Llama-2-7b-chat -, Mistral-7b 246.55
Intel® Core™ i7-1360P Processor (iGPU): chatGLM2-6b 279.64, Llama-2-7b-chat 328.57, Mistral-7b 322.98
OpenVINO™ is developed and maintained on a monthly release schedule. Every month, new features and patches will be
shipped to keep OpenVINO™ up to speed with the fast-moving field of AI and LLMs. As new technology advancements occur,
other community-developed libraries may not release patches to support those advancements.
Flexibility
OpenVINO™ is a flexible and efficient library for developing and deploying AI applications. OpenVINO’s flexibility makes it
simple to import and run deep learning models from all popular frameworks. While specialized deployment solutions like
llama.cpp only support LLMs, OpenVINO™ supports all kinds of models and architectures. This enables development of
multimodal applications spanning computer vision, image generation, text-to-speech, data classification, and much more.
Training frameworks such as PyTorch also have great flexibility for developing and deploying deep learning models, but they
are not as optimized and efficient as OpenVINO™. The C/C++ APIs in OpenVINO™ provide an inherent advantage over the
Python APIs of other frameworks. OpenVINO™ allows developers to write an application once and deploy it anywhere, with
maximum performance from hardware.
Hardware support
OpenVINO™ supports LLM deployment on a wide range of hardware devices, including CPUs, integrated GPUs and discrete
GPUs. It supports ARM-based architectures as well as x86/x64 architectures. This range of hardware support enables LLMs
to be deployed on a wide variety of targets, ranging from high-powered servers to compact edge devices.
OpenVINO’s automated optimization squeezes a maximum amount of performance out of the target hardware without
needing to reconfigure the application. On newer hardware (such as Intel® Advanced Matrix Extensions (Intel® AMX)-enabled
products like our newer Intel® Xeon® processors), OpenVINO™ automatically selects the best data type for inference. See the
"Optimize Inference" and "Precision Control" pages for additional information, including how INT8 quantized models are run
by default with BF16 plus INT8 mixed precision, taking full advantage of the Intel AMX capability of 4th Generation
Intel® Xeon® Scalable Processors.
There are two main ways to run LLMs with OpenVINO™:
1. Hugging Face: Use OpenVINO™ as a backend for the Hugging Face Transformers API through the Optimum Intel
extension.
2. Native OpenVINO™: Use OpenVINO native APIs (Python and C++) with custom pipeline code.
In both cases, the OpenVINO™ runtime is used as the backend for inference, and OpenVINO™ tools are used for model
optimization. The main differences are in ease of use, footprint size, and customizability.
The Hugging Face API is easy to learn and provides a simpler interface for the developer. It hides the complexity of model
initialization and text generation through high-level methods and classes. However, it has more dependencies, it abstracts
away the text generation loop, scheduler, tokenizer, and other elements of the LLM workflow (leaving fewer options for
detailed customization), and it cannot be ported to C/C++.
The native OpenVINO™ API requires fewer dependencies, minimizing the size of the application footprint, and can be used to
build efficient C/C++ applications. However, it has a steeper learning curve, and it requires explicit implementation of the text
generation loop, tokenization functions, and scheduler functions used in a typical LLM pipeline.
Ease of use: Hugging Face API - lower learning curve, quick to integrate. Native OpenVINO™ API - higher learning curve,
requires more effort to integrate.
Ideal use case: Hugging Face API - ideal for Python-centric projects. Native OpenVINO™ API - best suited for
high-performance, resource-optimized production environments.
Hugging Face's libraries make it easy to experiment with different models when building out an application. To leverage
further optimizations, the model and application can then be ported to OpenVINO™ using the OpenVINO™ APIs.
This Solution White Paper shows how to optimize and deploy LLMs using both Hugging Face and native OpenVINO™. It
begins with a Quick Start example using Hugging Face, and then provides more details on how to load LLMs, compress them,
and run inference.
Install Dependencies
To get started with OpenVINO™, set up a Python virtual environment for OpenVINO™ by following the OpenVINO Installation
Instructions.
Once the environment is created and activated, install Optimum Intel, OpenVINO™, NNCF and their dependencies in a Python
environment by issuing:
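A typical install command looks like the following (the exact package extras may vary by release; check the Optimum Intel documentation):

pip install optimum[openvino,nncf]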
The Quick Start example, sketched below, does the following:
1. It loads the zephyr-7b-beta4 LLM from Hugging Face using the Optimum Intel API, which converts it to
OpenVINO™ Intermediate Representation (IR) format and sets OpenVINO™ as the backend for inference.
2. It automatically compresses the model to INT8 format using OpenVINO™ NNCF by default.
3. It loads a tokenizer for converting an input text prompt into tokens that can be understood by the model.
4. It sets up an inference pipeline with the model and tokenizer, passes in an input prompt, and prints the resulting
response.
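A minimal sketch of such an example is shown below. The HuggingFaceH4/zephyr-7b-beta model name comes from the text above, while the prompt and generation settings are illustrative assumptions:

from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer, pipeline

model_id = "HuggingFaceH4/zephyr-7b-beta"

# Load the model from Hugging Face and convert it to OpenVINO IR format.
# Weights are compressed to INT8 by default during export.
model = OVModelForCausalLM.from_pretrained(model_id, export=True)
# model.to("GPU")  # uncomment to run on GPU instead of CPU

# Load a tokenizer that converts the input prompt into tokens for the model
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Set up an inference pipeline, pass in a prompt, and print the response
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
prompt = "My cat's favorite foods are"  # example prompt; try your own
results = pipe(prompt, max_new_tokens=50)
print(results[0]["generated_text"])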
By default, this example runs the model on CPU, but the GPU can be used instead by uncommenting the model.to("GPU")
line.
Here’s the output from running the example. Try changing the prompt to see what other text outputs can be generated!
"My cat's favorite foods are wet cat food, especially those with gravy or pate, and dry cat food
with chicken or fish flavors. She also enjoys treats with chicken or tuna flavors."
The above example hides much of the complexity involved with LLM inference behind the high-level pipeline and
AutoTokenizer classes. These are great for setting up simple examples, but they don’t allow for much customization. Also,
they are Python classes, so they can’t be used in C/C++ applications. Let’s dive into more details about how to create and
customize LLM applications in OpenVINO™.
To initialize a model from Hugging Face, use the OVModelForCausalLM.from_pretrained method as shown in the snippet
below. By setting the parameter export=True, the model is converted to OpenVINO™ IR format on the fly.
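A minimal sketch, assuming the zephyr-7b-beta model used elsewhere in this paper:

from optimum.intel import OVModelForCausalLM

model_id = "HuggingFaceH4/zephyr-7b-beta"  # any supported Hugging Face model ID
model = OVModelForCausalLM.from_pretrained(model_id, export=True)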
To save and export a model and its tokenizer, use model.save_pretrained("your-model-name") and
tokenizer.save_pretrained("your-model-name") as shown in the snippet below.
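A minimal sketch, continuing from the loading example above (the tokenizer is created with Hugging Face's AutoTokenizer):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_id)

model.save_pretrained("your-model-name")
tokenizer.save_pretrained("your-model-name")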
The model will be exported in OpenVINO™ IR format (openvino_model.bin, openvino_model.xml) and saved to a new
folder in the specified directory. The tokenizer will also be saved to the directory.
To load the model and tokenizer in a future session, use OVModelForCausalLM.from_pretrained("your-model-name") and
AutoTokenizer.from_pretrained("your-model-name").
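Converting a Hugging Face Model to OpenVINO™ IR Using CLI
Models can also be converted with the optimum-cli tool. A sketch of the generic command is shown below; its parameters are described after it:

optimum-cli export openvino --model <MODEL_NAME> <NEW_MODEL_NAME>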
--model <MODEL_NAME>: This part of the command specifies the name of the model to be converted. Replace
<MODEL_NAME> with the actual model name from Hugging Face.
<NEW_MODEL_NAME>: Here, you specify the name you want to give to the new model in the OpenVINO™ IR format. Replace
<NEW_MODEL_NAME> with your desired name.
For example, to convert the Llama 2-7B model from Hugging Face (whose full model name is meta-llama/Llama-2-7b-chat-hf)
to an OpenVINO™ IR model and name it "ov_llama_2", use the following command:
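A sketch of the command, following the generic form above:

optimum-cli export openvino --model meta-llama/Llama-2-7b-chat-hf ov_llama_2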
In this example, meta-llama/Llama-2-7b-chat-hf is the Hugging Face model name, and ov_llama_2 is the new name for the
converted OpenVINO™ IR model.
Additionally, you can specify the --weight-format argument to apply 8-bit or 4-bit weight quantization when exporting your
model with the CLI. An example command applying 8-bit quantization to the model gpt2 is below:
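A sketch of such a command (the output directory name gpt2_int8_ov is an illustrative assumption):

optimum-cli export openvino --model gpt2 --weight-format int8 gpt2_int8_ov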
Models that have been fine-tuned using LoRA require a few extra steps when being loaded. The trained weight adapters
(which are produced during LoRA training) must be merged into the baseline model using the merge_and_unload() function
before the model is used for inference. For example:
from transformers import AutoModelForCausalLM
from peft import PeftModelForCausalLM

model_id = "meta-llama/Llama-2-7b-chat-hf"
lora_adaptor = "./lora_adaptor"

# Load the base model, attach the trained LoRA adapter, and merge the adapter weights
model = AutoModelForCausalLM.from_pretrained(model_id, use_cache=True)
model = PeftModelForCausalLM.from_pretrained(model, lora_adaptor)
model.merge_and_unload()
model.get_base_model().save_pretrained("fused_lora_model")
Now the model can be converted to OpenVINO™ using the Optimum Intel API or CLI options mentioned above.
Unlike full model quantization, where weights and activations are quantized, weight compression in NNCF only targets the
model's weights. This approach allows the activations to remain as floating-point numbers, preserving most of the model's
accuracy. It's a subtle yet important difference that ensures the accuracy of the model is maintained while improving its speed
and reducing its size. The reduction in size is especially noticeable with larger models. For instance, the Llama-2 7B model can
be reduced from about 25GB to 4GB using 4-bit weight compression.
Weight compression should be performed offline rather than in a real-time application. The LLM can be compressed and
exported in a development environment and then used in a deployment environment. Weight compression provides the
following benefits:
▪ Enables inference of exceptionally large models that cannot otherwise be accommodated in the device memory
▪ Reduces storage and memory overhead by implementing model weight compression, making models more lightweight
and less resource intensive for deployment
▪ Improves inference speed by reducing the latency of memory access when calculating outputs of each layer (the weights
are smaller and thus faster to load from memory)
▪ Unlike full quantization, weight compression does not require sample data to calibrate the range of activation values, so it is
easier to perform
Comparison table
Table 4 summarizes the benefits and trade-offs for each compression type in terms of memory reduction, speed gain, and
accuracy loss.
[Table 4: compression types compared by Memory Reduction, Latency Improvement, and Accuracy Loss]
For more information on group size, see the Weight Compression page in OpenVINO™ documentation. In the example below,
compression_option="int8" is used to indicate that INT8 weight compression should be performed.
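A sketch of the loading step that the surrounding text describes; the model name is an illustrative assumption, and the compression_option parameter name is taken from the text above (it may differ across Optimum Intel versions):

from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer, pipeline

model_id = "HuggingFaceH4/zephyr-7b-beta"  # example model; substitute your own
# Export to OpenVINO IR and compress weights to INT8
model = OVModelForCausalLM.from_pretrained(model_id, export=True, compression_option="int8")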
# Inference
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
phrase = "The weather is"
results = pipe(phrase)
print(results)

# Save the compressed model and tokenizer for reuse in future sessions
# (the directory name is illustrative)
model.save_pretrained("zephyr-7b-beta-int8")
tokenizer.save_pretrained("zephyr-7b-beta-int8")
Note that at the end of the example, the compressed model and its tokenizer are saved so they can be imported for use in a
future session. This avoids having to re-compress the model each time it is used in a new session. For more information, see
the Saving and Loading Models section.
The compression type used by the nncf.compress_weights method is set using the mode argument, which accepts one of
the three following options:
▪ mode=CompressWeightsMode.INT8
▪ mode=CompressWeightsMode.INT4_SYM
▪ mode=CompressWeightsMode.INT4_ASYM
In this example, the INT4_SYM mode is set to use 4-bit symmetric quantization.
NNCF also allows for configuring the group_size and ratio compression parameters, which can be used to tweak the size
and inference speed of the compressed model. For more information on these parameters, see the Weight Compression page
in OpenVINO™ documentation.
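A sketch of applying 4-bit symmetric weight compression with NNCF is shown below; the file paths, group_size, and ratio values are illustrative assumptions:

import openvino as ov
import nncf
from nncf import CompressWeightsMode

core = ov.Core()
# Read a model that was previously exported to OpenVINO IR
model = core.read_model("openvino_model/openvino_model.xml")

# Compress weights to INT4 with symmetric quantization
compressed_model = nncf.compress_weights(
    model,
    mode=CompressWeightsMode.INT4_SYM,
    group_size=128,
    ratio=0.8,
)

# Save the compressed model for later use
ov.save_model(compressed_model, "openvino_model_int4/openvino_model.xml")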
Figure 3. To generate text, LLM predicts the next best word in the input sequence, appends it to the input, and repeats the
process until an “end of sequence” token is generated.
The token generation inference loop is one stage of a text generation application. The other stages are loading the model,
tokenizing the input text, and processing the output tokens to be displayed to the user. The four stages are listed in order
below:
1. Load model
2. Tokenize input text
3. Execute inference loop
4. Process output tokens
These stages look very different depending on whether Hugging Face or the native OpenVINO™ API is used for implementation.
This section of the Solution White Paper shows how to implement LLM text generation using both approaches.
Shown below is the Quick Start example given earlier in this document that uses Hugging Face Transformers and Optimum
Intel to set up and run a simple text generation pipeline.
Figure 4. A basic text generation example using Hugging Face Transformers and Optimum Intel.
▪ OVModelForCausalLM.from_pretrained from Optimum Intel: Loads the LLM from Hugging Face, converts it to
OpenVINO™ IR format, and compiles it on a target device using OpenVINO™ as the inference backend
▪ AutoTokenizer from Hugging Face Transformers: Initializes a text tokenizer for the LLM
▪ Pipeline from Hugging Face Transformers: Handles the bulk of text generation, including tokenizing the inputs,
executing the inference loop, and processing the outputs
These classes provide a simple interface for setting up text generation. Each class and method has more parameters that can
be used to further configure the model or the text generation process. The Transformers API also has other features that give
more control over inference parameters, such as the model.generate() method. To learn more, visit the Hugging Face
Transformers documentation6 page.
There are two options to select which device (CPU, iGPU, GPU, etc) the LLM is compiled on:
1. Specify the device parameter in the .from_pretrained() call. For example, use
OVModelForCausalLM.from_pretrained(model_id, export=True, device="GPU.0") to run the model on the
GPU. See the Device Query documentation for more information on how target devices are named and enumerated.
2. Use the model.to method after the model has been loaded and pass in the name of the target device. For example, use
model.to("GPU.0") to run the model on the GPU.
While the Hugging Face APIs greatly simplify the code for implementing text generation, one drawback is that they cannot be
implemented in C/C++. In contrast, the native OpenVINO™ API supports building solutions with C/C++.
This section provides code snippets showing how to implement each stage with the native OpenVINO Python API. These
snippets implement a stateful model technique to increase the memory efficiency of LLMs. With this technique, the context of
the model, i.e. its internal state (the KV cache), is shared among multiple iterations of inference: the KV cache that belongs
to a particular text sequence is accumulated inside the model during the generation loop. The stateful model implementation
supports both greedy search and beam search (preview) for LLMs. This technique also reduces the memory footprint of
LLMs, which helps, for example, when running INT4 models.
In the same Python virtual environment that was set up in the Install Dependencies section of this Solution White Paper, install
OpenVINO™ Tokenizers by issuing:
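A typical install command (check the OpenVINO Tokenizers documentation for the exact package name and extras):

pip install openvino-tokenizers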
OpenVINO™ Tokenizer comes equipped with a CLI tool, convert_tokenizer, that converts tokenizers from the Hugging
Face Hub to OpenVINO™ IR format:
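A sketch of the conversion command used in this walkthrough:

convert_tokenizer HuggingFaceH4/zephyr-7b-beta --with-detokenizer -o openvino_tokenizer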
The example above transforms the HuggingFaceH4/zephyr-7b-beta tokenizer from the Hugging Face Hub. The
--with-detokenizer argument tells the command to also convert the detokenizer. The -o argument specifies the name of the
output directory where the converted objects will be saved (openvino_tokenizer, in this case).
Next, convert the LLM itself to OpenVINO™ IR format using optimum-cli, as shown in the Converting a Hugging Face Model
to OpenVINO™ IR Using CLI section of this document. For example, the following command is used to convert the
HuggingFaceH4/zephyr-7b-beta model from Hugging Face to an OpenVINO™ IR model and save it in a folder named
openvino_model:
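A sketch of the export command:

optimum-cli export openvino --model HuggingFaceH4/zephyr-7b-beta openvino_model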
The model and tokenizer are now saved in the openvino_model and openvino_tokenizer folders.
The converted tokenizer, detokenizer, and model are loaded and compiled for inference using OpenVINO's compile_model
method.
import numpy as np
from pathlib import Path
import openvino_tokenizers
from openvino import compile_model, Tensor
model_dir = Path("path/to/model/directory")
# Compile the tokenizer, model, and detokenizer using OpenVINO. These files are XML
# representations of the models optimized for OpenVINO
tokenizer = compile_model(model_dir / "openvino_tokenizer.xml")
detokenizer = compile_model(model_dir / "openvino_detokenizer.xml")
infer_request = compile_model(model_dir / "openvino_model.xml").create_infer_request()
The model and tokenizer are now compiled and ready to be used for inference.
Figure 5. An example phrase broken into tokens, where each token has its own numerical value.
The compiled tokenizer can be used to convert the input text string into tokens, as shown below.
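A sketch of the tokenization step; the prompt is illustrative, and the exact input names depend on how the model and tokenizer were exported:

text_input = ["What is OpenVINO?"]  # example prompt
model_input = {
    name.any_name: Tensor(output)
    for name, output in tokenizer(text_input).items()
}

# Depending on the exported model, additional inputs such as position_ids or
# beam_idx may also need to be added to model_input.
max_infer = 100  # maximum number of tokens to generate (illustrative)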
# end of sentence token - the model signifies the end of text generation
# for now it can be obtained from the original tokenizer: `original_tokenizer.eos_token_id`
eos_token = 2

tokens_result = [[]]
for _ in range(max_infer):
    infer_request.start_async(model_input)
    infer_request.wait()
    # use greedy decoding to get the most probable token as the model prediction
    output_token = np.argmax(infer_request.get_output_tensor().data[:, -1, :], axis=-1, keepdims=True)
    tokens_result = np.hstack((tokens_result, output_token))
    if output_token[0][0] == eos_token:
        break
    # (in the full sample, model_input is updated here with the newly generated
    # token before the next iteration of the stateful generation loop)
#include <openvino/openvino.hpp>

namespace {
std::pair<ov::Tensor, ov::Tensor> tokenize(ov::InferRequest& tokenizer, std::string&& prompt) {
    constexpr size_t BATCH_SIZE = 1;
    tokenizer.set_input_tensor(ov::Tensor{ov::element::string, {BATCH_SIZE}, &prompt});
    tokenizer.infer();
    return {tokenizer.get_tensor("input_ids"), tokenizer.get_tensor("attention_mask")};
}

    // end() is a member of the TextStreamer helper struct used below (its other
    // members are not shown in this excerpt); it flushes and prints any remaining
    // detokenized text.
    void end() {
        std::string text = detokenize(detokenizer, token_cache);
        std::cout << std::string_view{text.data() + print_len, text.size() - print_len} << '\n';
        token_cache.clear();
        print_len = 0;
    }
};
}
lm.get_tensor("input_ids").set_shape({BATCH_SIZE, 1});
position_ids.set_shape({BATCH_SIZE, 1});
TextStreamer text_streamer{std::move(detokenizer)};
// There's no way to extract special token values from the detokenizer for now
constexpr int64_t SPECIAL_EOS_TOKEN = 2;
while (out_token != SPECIAL_EOS_TOKEN) {
lm.get_tensor("input_ids").data<int64_t>()[0] = out_token;
lm.get_tensor("attention_mask").set_shape({BATCH_SIZE,
lm.get_tensor("attention_mask").get_shape().at(1) + 1});
std::fill_n(lm.get_tensor("attention_mask").data<int64_t>(),
lm.get_tensor("attention_mask").get_size(), 1);
position_ids.data<int64_t>()[0] = int64_t(lm.get_tensor("attention_mask").get_size() -
2);
lm.start_async();
text_streamer.put(out_token);
lm.wait();
logits = lm.get_tensor("logits").data<float>();
out_token = std::max_element(logits, logits + vocab_size) - logits;
}
text_streamer.end();
// Model is stateful which means that context (kv-cache) which belongs to a particular
// text sequence is accumulated inside the model during the generation loop above.
// This context should be reset before processing the next text sequence.
// While it is not required to reset context in this sample as only one sequence is
processed,
// it is called for education purposes:
lm.reset_state();
} catch (const std::exception& error) {
std::cerr << error.what() << '\n';
return EXIT_FAILURE;
} catch (...) {
std::cerr << "Non-exception object thrown\n";
return EXIT_FAILURE;
}
Using OpenVINO™ Model Server (OVMS) for LLMs can be beneficial because LLMs require a significant amount of resources to run on a local system.
Larger models can require 140GB or more of RAM, and they take a considerable amount of processing power to run inference.
Hosting a model on a cloud platform or dedicated server using OVMS offloads the hardware requirements from the local
device. The application on the local device can simply send a prompt to the LLM on Model Server and receive a response back,
while the heavy lifting occurs on the server.
This section provides an abbreviated walkthrough of the demo, showing how to build the server image, mount an LLM, send it
an inference request via a client application, and receive a streamed response back from the LLM on the server. For more
details, see the guide on GitHub.
Requirements
A Linux host with Docker installed and sufficient RAM for loading the model is required for running the demo. The demo was
tested on a host with an Intel® Xeon® Gold 6430 and an Intel® Data Center GPU Flex 170. Running the demo with small models
like tiny-llama-1B-chat requires about 4GB of RAM.
Build Image
First, build the Model Server image by issuing the following commands:
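A sketch of the build steps (the repository URL and make target are assumptions based on the OpenVINO Model Server repository; check the demo's README for the exact commands):

git clone https://github.com/openvinotoolkit/model_server.git
cd model_server
make python_image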
This creates an image called openvino/model_server:py, which can then be run with Docker.
Download Model
Next, install requirements and download the model using the download_model.py script:
cd demos/python_demos/llm_text_generation
pip install -r requirements.txt
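Then download and convert the model; the flag below is a sketch, so check the script's --help output for the exact usage:

python download_model.py --model tiny-llama-1b-chat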
This will download the tiny-llama-1b-chat model, convert it to OpenVINO™ IR format, and save the converted model in the
./tiny-llama-1b-chat directory. The download_model.py script supports several popular LLMs. To see a full list of downloadable
LLMs, issue:
python download_model.py --help
The script creates new directories with compressed versions of the model with FP16, INT8 and INT4 precisions. For example,
the INT8 model files are saved in ./tiny-llama-1b-chat_INT8_compressed_weights. The compressed models can be
used in place of the original as they have compatible inputs and outputs.
When the Model Server container is started, it mounts the tiny-llama-1b-chat INT8 model in the /model directory on the
container. It also mounts the servable_stream directory, which contains scripts for configuring the container.
Then run the streaming client to send a prompt to the server:

python3 client_stream.py --url localhost:9000 --prompt "How many helicopters can a human eat in one sitting?"
Example output (the generated text will be flushed to the console in chunks, as soon as it is available on the server):
Question:
How many helicopters can a human eat in one sitting?
I don't have access to this information. However, we don't generally ask numbers from our
clients. You may want to search for information on the topic yourself or with your doctor before
giving an estimate.
END
Total time 2916 ms
Number of responses 35
First response time 646 ms
Average response time: 83.31 ms
More Information
To learn more about how this demo works, visit its repository on GitHub. For more information on how to host models using
OVMS, see the Model Server documentation.
Prerequisites
Install benchmarking dependencies using requirements.txt
pip install -r requirements.txt
Note: You can specify the installed OpenVINO version through pip install, e.g.:

pip install openvino==2023.3.0
Run the benchmarking script, e.g.:

python benchmark.py -m models/llama-2-7b-chat/pytorch/dldt/FP32 -n 2
python benchmark.py -m models/llama-2-7b-chat/pytorch/dldt/FP32 -p "What is openvino?" -n 2
python benchmark.py -m models/llama-2-7b-chat/pytorch/dldt/FP32 -pf prompts/llama-2-7b-chat_l.jsonl -n 2
Command parameters:
▪ -m - model path
▪ -d - inference device (default=cpu)
▪ -r - report csv
▪ -f - framework (default=ov)
▪ -p - interactive prompt text
▪ -pf - path of JSONL file including interactive prompts
▪ -n - number of benchmarking iterations; if the value is greater than 0, the first iteration is excluded (default=0)
Measuring Accuracy
The lm-evaluation-harness tool is a third-party test harness for measuring LLM accuracy. It recently added support for
OpenVINO™. Visit the repository for more information on how to use it to measure model accuracy.
The LLM Chatbot notebook walks through the process of loading a model, compressing its weights using NNCF, and compiling it to run on a
specific device. It provides code showing how to set up an interactive chatbot that takes in user prompts, runs inference on
them, and returns a result. It shows how to use Gradio to create an interactive UI for the chatbot, as shown in the image below.
Figure 6. The LLM Chatbot example shows how to build a full chatbot UI using Gradio*.
One exciting feature of this example is that it can be used with a variety of open-source models, such as llama-2-7b, mistral-7b,
and more. This allows for trying out different models to compare their responses, and it also teaches how to interface with
various types of models using OpenVINO™. It also supports Retrieval Augmented Generation (RAG) capabilities.
Learning Resources
Explore the OpenVINO™ toolkit from Intel’s product page, learn and practice coding with Jupyter* Notebooks at the
OpenVINO™ Github or Hugging Face Optimum Intel, and access the developer sandbox in the Intel® Developer Cloud.
▪ OpenVINO™ Product Page
▪ OpenVINO™ Github*
o OpenVINO Notebooks
o OpenVINO GenAI GitHub
▪ Hugging Face* Optimum Intel
▪ Intel® Developer Cloud
Additional Support
If you need additional support getting your solution deployed, wish to report an issue or bug, have feature requests, or require
further optimizations, please contact ryan.loney@intel.com.
1. When It Comes to AI Models, Bigger Isn't Always Better | Scientific American, Lauren Leffer, November 21, 2023
2. Llama 2: Open Foundation and Fine-Tuned Chat Models, v2 | arxiv.org, Thomas Scialom et al., July 19, 2023
3. LoRA: Low-Rank Adaptation of Large Language Models | GitHub, Edward J. Hu et al.
4. HuggingFaceH4 Zephyr-7B-beta | Hugging Face, MIT
5. LoRA Conceptual Guide | Hugging Face
6. Hugging Face Transformers Documentation | Hugging Face