Logs-Lovefy - VLLM Server (Bacchus) - FB

The document contains log entries from a serverless vLLM worker process, detailing warnings and informational messages about memory management, NCCL process-group handling, and model loading. It highlights issues such as leaked shared-memory objects and the need to destroy process groups explicitly before exit so that other group members are not blocked, and it records metrics on memory usage, throughput, and the status of in-flight jobs.


2025-01-10 13:37:16.562 | info | 3yn75aympd52v1 |
warnings.warn('resource_tracker: There appear to be %d '\n
2025-01-10 13:37:15.658 | info | 3yn75aympd52v1 |
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning:
resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at
shutdown\n
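
The leaked shared_memory warning above is emitted by Python's multiprocessing resource_tracker when a SharedMemory segment is never unlinked before shutdown. A minimal illustration of the clean pattern (the segment name "demo_buf" is hypothetical, not taken from these logs):

    from multiprocessing import shared_memory

    shm = shared_memory.SharedMemory(create=True, size=1024, name="demo_buf")
    try:
        shm.buf[:5] = b"hello"   # use the segment
    finally:
        shm.close()              # release this process's mapping
        shm.unlink()             # destroy the segment; skipping this triggers the resource_tracker warning
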
2025-01-10 13:37:14.772 | warning | 3yn75aympd52v1 | [rank0]:[W110
16:37:14.435087749 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has
NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the
application should call destroy_process_group to ensure that any pending NCCL
operations have finished in this process. In rare cases this process can exit
before this point and block the progress of another member of the process group.
This constraint has always been present, but this warning has only been added
since PyTorch 2.4 (function operator())\n
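
The ProcessGroupNCCL warning asks the application to tear down the process group explicitly before exit. A hedged sketch of doing so with an atexit hook, assuming torch.distributed was initialized elsewhere in the engine:

    import atexit
    import torch.distributed as dist

    def _teardown():
        # destroy the default process group so pending NCCL ops finish before interpreter shutdown
        if dist.is_available() and dist.is_initialized():
            dist.destroy_process_group()

    atexit.register(_teardown)
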
2025-01-10 13:37:14.772 | info | 3yn75aympd52v1 | INFO 01-10 16:37:13
async_llm_engine.py:62] Engine is gracefully shutting down.\n
2025-01-10 13:37:13.161 | info | 3yn75aympd52v1 | Kill worker.
2025-01-10 13:37:13.161 | info | 3yn75aympd52v1 | Finished.
2025-01-10 13:28:32.226 | info | 3yn75aympd52v1 | Finished running generator.
2025-01-10 13:28:32.133 | info | 3yn75aympd52v1 | INFO 01-10 16:28:32
async_llm_engine.py:176] Finished request b91c6d62c0a648399cd7e92d15d493c4.\n
2025-01-10 13:28:32.133 | info | 3yn75aympd52v1 | INFO 01-10 16:28:31
metrics.py:465] Prefix cache hit rate: GPU: 0.00%, CPU: 0.00%\n
2025-01-10 13:28:31.274 | info | 3yn75aympd52v1 | INFO 01-10 16:28:31
metrics.py:449] Avg prompt throughput: 624.7 tokens/s, Avg generation throughput:
3.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 2.5%, CPU KV cache usage: 0.0%.\n
2025-01-10 13:28:31.274 | info | 3yn75aympd52v1 | INFO 01-10 16:28:27
async_llm_engine.py:208] Added request b91c6d62c0a648399cd7e92d15d493c4.\n
2025-01-10 13:28:31.274 | info | 3yn75aympd52v1 | Jobs in progress: 1
2025-01-10 13:28:31.274 | info | 3yn75aympd52v1 | Jobs in queue: 1
2025-01-10 13:28:27.392 | info | 3yn75aympd52v1 | --- Starting Serverless Worker |
Version 1.7.5 ---\n
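
The "Jobs in queue" / "Added request" / "Finished running generator" lines above trace a streaming job through the serverless worker. A hypothetical generator handler of that shape, assuming the RunPod serverless SDK (the token loop here is a placeholder for the real vLLM engine call):

    import runpod

    def handler(job):
        prompt = job["input"].get("prompt", "")
        # placeholder stream: the real worker yields chunks produced by the vLLM engine
        for token in prompt.split():
            yield {"token": token}

    runpod.serverless.start({"handler": handler})
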
2025-01-10 13:28:27.392 | info | 3yn75aympd52v1 | INFO 01-10 16:28:26
chat_utils.py:431] ' }}{% endif %}\n
2025-01-10 13:28:27.392 | info | 3yn75aympd52v1 | INFO 01-10 16:28:26
chat_utils.py:431] \n
2025-01-10 13:28:27.392 | info | 3yn75aympd52v1 | INFO 01-10 16:28:26
chat_utils.py:431] '+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0
== 0 %}{% set content = '<|begin_of_text|>' + content %}{% endif %}{{ content }}{%
endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|
end_header_id|>\n
2025-01-10 13:28:27.392 | info | 3yn75aympd52v1 | INFO 01-10 16:28:26
chat_utils.py:431] \n
2025-01-10 13:28:27.392 | info | 3yn75aympd52v1 | INFO 01-10 16:28:26
chat_utils.py:431] {% set loop_messages = messages %}{% for message in
loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|
end_header_id|>\n
2025-01-10 13:28:27.392 | info | 3yn75aympd52v1 | INFO 01-10 16:28:26
chat_utils.py:431] Using supplied chat template:\n
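
The supplied chat template logged above is the Llama-3-style header/eot format. A minimal sketch of rendering it with transformers (the messages are illustrative; the path is the snapshot directory from the logs):

    from transformers import AutoTokenizer

    snapshot = ("/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/"
                "snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83")
    tokenizer = AutoTokenizer.from_pretrained(snapshot)

    messages = [{"role": "user", "content": "Hello!"}]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    # prompt begins with <|begin_of_text|><|start_header_id|>user<|end_header_id|> ...
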
2025-01-10 13:28:26.954 | info | 3yn75aympd52v1 | tokenizer_name_or_path:
/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83, tokenizer_revision: None,
trust_remote_code: False\n
2025-01-10 13:28:26.954 | info | 3yn75aympd52v1 | engine.py :26 2025-
01-10 16:28:26,252 Engine args:
AsyncEngineArgs(model='/models/huggingface-cache/hub/models--loremipsum3658--
Bacchus-v4.6-AWQ/snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83',
served_model_name=None, tokenizer='/models/huggingface-cache/hub/models--
loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', task='auto', skip_tokenizer_init=False,
tokenizer_mode='auto', chat_template_text_format='string', trust_remote_code=False,
allowed_local_media_path='', download_dir=None, load_format='auto',
config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto',
quantization_param_path=None, seed=0, max_model_len=8192, worker_use_ray=False,
distributed_executor_backend=None, pipeline_parallel_size=1,
tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16,
enable_prefix_caching=True, disable_sliding_window=False,
use_v2_block_manager='true', swap_space=4, cpu_offload_gb=0,
gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256,
max_logprobs=20, disable_log_stats=False, revision=None, code_revision=None,
rope_scaling=None, rope_theta=None, hf_overrides=None, tokenizer_revision=None,
quantization='awq', enforce_eager=False, max_seq_len_to_capture=8192,
disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray',
tokenizer_pool_extra_config=None, limit_mm_per_prompt=None,
mm_processor_kwargs=None, enable_lora=False, enable_lora_bias=False, max_loras=1,
max_lora_rank=16, enable_prompt_adapter=False, max_prompt_adapters=1,
max_prompt_adapter_token=0, fully_sharded_loras=False, lora_extra_vocab_size=256,
long_lora_scaling_factors=None, lora_dtype='auto', max_cpu_loras=None,
device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True,
ray_workers_use_nsight=False, num_gpu_blocks_override=None, num_lookahead_slots=0,
model_loader_extra_config=None, ignore_patterns=None, preemption_mode=None,
scheduler_delay_factor=0.0, enable_chunked_prefill=None,
guided_decoding_backend='outlines', speculative_model=None,
speculative_model_quantization=None, speculative_draft_tensor_parallel_size=None,
num_speculative_tokens=None, speculative_disable_mqa_scorer=False,
speculative_max_model_len=None, speculative_disable_by_batch_size=None,
ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None,
spec_decoding_acceptance_method='rejection_sampler',
typical_acceptance_sampler_posterior_threshold=None,
typical_acceptance_sampler_posterior_alpha=None, qlora_adapter_name_or_path=None,
disable_logprobs_during_spec_decoding=None, otlp_traces_endpoint=None,
collect_detailed_traces=None, disable_async_output_proc=False,
scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None,
disable_log_requests=False)\n
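
For reference, the key arguments in the dump above can be reproduced directly when constructing the engine; a minimal sketch, assuming vLLM 0.6.x as logged (only non-default values shown):

    from vllm.engine.arg_utils import AsyncEngineArgs
    from vllm.engine.async_llm_engine import AsyncLLMEngine

    engine_args = AsyncEngineArgs(
        model=("/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/"
               "snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83"),
        quantization="awq",
        tensor_parallel_size=2,
        max_model_len=8192,
        gpu_memory_utilization=0.9,
        enable_prefix_caching=True,
        max_num_seqs=256,
    )
    engine = AsyncLLMEngine.from_engine_args(engine_args)
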
2025-01-10 13:28:26.954 | info | 3yn75aympd52v1 | engine_args.py :126 2025-
01-10 16:28:26,242 Using baked in model with args: {'MODEL_NAME':
'/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', 'TOKENIZER_NAME': '/models/huggingface-
cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83'}\n
2025-01-10 13:28:26.242 | info | 3yn75aympd52v1 | engine.py :112 2025-
01-10 16:28:26,241 Initialized vLLM engine in 78.28s\n
2025-01-10 13:28:26.242 | info | 3yn75aympd52v1 | INFO 01-10 16:28:25
model_runner.py:1518] Graph capturing finished in 34 secs, took 0.99 GiB\n
2025-01-10 13:28:25.681 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=228)#[0;0m INFO 01-10 16:28:25 model_runner.py:1518] Graph capturing finished
in 34 secs, took 0.99 GiB\n
2025-01-10 13:28:25.627 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=228)#[0;0m INFO 01-10 16:28:25 custom_all_reduce.py:224] Registering 5635 cuda
graph addresses\n
2025-01-10 13:28:24.655 | info | 3yn75aympd52v1 | INFO 01-10 16:28:24
custom_all_reduce.py:224] Registering 5635 cuda graph addresses\n
2025-01-10 13:28:24.655 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=228)#[0;0m INFO 01-10 16:27:52 model_runner.py:1404] If out-of-memory error
occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or
switching to eager mode. You can also reduce the `max_num_seqs` as needed to
decrease memory usage.\n
2025-01-10 13:27:52.080 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=228)#[0;0m INFO 01-10 16:27:52 model_runner.py:1400] Capturing cudagraphs for
decoding. This may lead to unexpected consequences if the model is not static. To
run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in
the CLI.\n
2025-01-10 13:27:52.080 | info | 3yn75aympd52v1 | INFO 01-10 16:27:51
model_runner.py:1404] If out-of-memory error occurs during cudagraph capture,
consider decreasing `gpu_memory_utilization` or switching to eager mode. You can
also reduce the `max_num_seqs` as needed to decrease memory usage.\n
2025-01-10 13:27:51.941 | info | 3yn75aympd52v1 | INFO 01-10 16:27:51
model_runner.py:1400] Capturing cudagraphs for decoding. This may lead to
unexpected consequences if the model is not static. To run the model in eager mode,
set 'enforce_eager=True' or use '--enforce-eager' in the CLI.\n
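
The two advisories above name three knobs for cudagraph-capture memory pressure. A hedged sketch of applying them through the offline LLM entrypoint (the non-logged values 0.85 and 128 are illustrative):

    from vllm import LLM

    llm = LLM(
        model=("/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/"
               "snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83"),
        enforce_eager=True,            # skip cudagraph capture entirely
        gpu_memory_utilization=0.85,   # or lower the 0.90 budget to leave capture headroom
        max_num_seqs=128,              # or cap concurrent sequences (logged default: 256)
    )
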
2025-01-10 13:27:51.941 | info | 3yn75aympd52v1 | INFO 01-10 16:27:49
distributed_gpu_executor.py:61] Maximum concurrency for 8192 tokens per request:
15.62x\n
2025-01-10 13:27:49.120 | info | 3yn75aympd52v1 | INFO 01-10 16:27:49
distributed_gpu_executor.py:57] # GPU blocks: 7999, # CPU blocks: 1638\n
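
The 15.62x figure follows directly from the block accounting in the line above, with block_size=16 from the engine args:

    gpu_blocks, block_size, max_model_len = 7999, 16, 8192
    print(gpu_blocks * block_size / max_model_len)   # ~15.62 full-length (8192-token) requests fit in the GPU KV cache
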
2025-01-10 13:27:48.940 | info | 3yn75aympd52v1 | INFO 01-10 16:27:48
worker.py:232] Memory profiling results: total_gpu_memory=44.34GiB
initial_memory_usage=19.04GiB peak_torch_memory=19.88GiB
memory_usage_post_profile=19.09Gib non_torch_memory=0.50GiB kv_cache_size=19.53GiB
gpu_memory_utilization=0.90\n
2025-01-10 13:27:48.868 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=228)#[0;0m INFO 01-10 16:27:48 worker.py:232] Memory profiling results:
total_gpu_memory=44.34GiB initial_memory_usage=19.04GiB peak_torch_memory=19.84GiB
memory_usage_post_profile=19.09Gib non_torch_memory=0.50GiB kv_cache_size=19.57GiB
gpu_memory_utilization=0.90\n
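
The kv_cache_size in the profiling lines is (approximately) whatever remains of the 0.90 utilization budget after the peak torch and non-torch allocations, e.g. for the rank-0 numbers above:

    total_gpu, util = 44.34, 0.90
    peak_torch, non_torch = 19.88, 0.50
    print(util * total_gpu - peak_torch - non_torch)   # ~19.53 GiB, matching kv_cache_size
    # the worker-process line works out the same way: 39.906 - 19.84 - 0.50 = ~19.57 GiB
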
2025-01-10 13:27:40.958 | info | 3yn75aympd52v1 | INFO 01-10 16:27:40
model_runner.py:1077] Loading model weights took 18.5769 GB\n
2025-01-10 13:27:40.958 | info | 3yn75aympd52v1 | \n
2025-01-10 13:27:40.958 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 100% Completed | 9/9 [00:11<00:00, 1.29s/it]\n
2025-01-10 13:27:40.736 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 100% Completed | 9/9 [00:11<00:00, 1.03s/it]\n
2025-01-10 13:27:40.736 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 89% Completed | 8/9 [00:11<00:01, 1.38s/it]\n
2025-01-10 13:27:40.437 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=228)#[0;0m INFO 01-10 16:27:40 model_runner.py:1077] Loading model weights took
18.5769 GB\n
2025-01-10 13:27:39.384 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 78% Completed | 7/9 [00:10<00:03, 1.51s/it]\n
2025-01-10 13:27:37.869 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 67% Completed | 6/9 [00:08<00:04, 1.51s/it]\n
2025-01-10 13:27:36.344 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 56% Completed | 5/9 [00:07<00:06, 1.51s/it]\n
2025-01-10 13:27:34.823 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 44% Completed | 4/9 [00:05<00:07, 1.50s/it]\n
2025-01-10 13:27:33.265 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 33% Completed | 3/9 [00:04<00:08, 1.46s/it]\n
2025-01-10 13:27:31.655 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 22% Completed | 2/9 [00:02<00:09, 1.33s/it]\n
2025-01-10 13:27:30.106 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 11% Completed | 1/9 [00:01<00:08, 1.02s/it]\n
2025-01-10 13:27:29.082 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 0% Completed | 0/9 [00:00<?, ?it/s]\n
2025-01-10 13:27:29.082 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=228)#[0;0m INFO 01-10 16:27:28 model_runner.py:1072] Starting to load model
/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83...\n
2025-01-10 13:27:29.082 | info | 3yn75aympd52v1 | INFO 01-10 16:27:28
model_runner.py:1072] Starting to load model /models/huggingface-cache/hub/models--
loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83...\n
2025-01-10 13:27:29.082 | info | 3yn75aympd52v1 | INFO 01-10 16:27:28
shm_broadcast.py:236] vLLM message queue communication handle:
Handle(connect_ip='127.0.0.1', local_reader_ranks=[1],
buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at
0x7809c6281f90>, local_subscribe_port=41563, remote_subscribe_port=None)\n
2025-01-10 13:27:29.082 | info | 3yn75aympd52v1 | INFO 01-10 16:27:28
custom_all_reduce_utils.py:242] reading GPU P2P access cache from
/root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json\n
2025-01-10 13:27:28.706 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=228)#[0;0m INFO 01-10 16:27:28 custom_all_reduce_utils.py:242] reading GPU P2P
access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json\n
2025-01-10 13:27:15.794 | info | 3yn75aympd52v1 | INFO 01-10 16:27:15
custom_all_reduce_utils.py:204] generating GPU P2P access cache in
/root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json\n
2025-01-10 13:27:15.794 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=228)#[0;0m INFO 01-10 16:27:15 pynccl.py:69] vLLM is using nccl==2.21.5\n
2025-01-10 13:27:15.794 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=228)#[0;0m INFO 01-10 16:27:15 utils.py:960] Found nccl from library
libnccl.so.2\n
2025-01-10 13:27:15.794 | info | 3yn75aympd52v1 | INFO 01-10 16:27:15
pynccl.py:69] vLLM is using nccl==2.21.5\n
2025-01-10 13:27:15.274 | info | 3yn75aympd52v1 | INFO 01-10 16:27:15
utils.py:960] Found nccl from library libnccl.so.2\n
2025-01-10 13:27:15.274 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=228)#[0;0m INFO 01-10 16:27:13 multiproc_worker_utils.py:215] Worker ready;
awaiting tasks\n
2025-01-10 13:27:15.274 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=228)#[0;0m INFO 01-10 16:27:13 selector.py:135] Using Flash Attention backend.\n
2025-01-10 13:27:13.937 | info | 3yn75aympd52v1 | INFO 01-10 16:27:13
selector.py:135] Using Flash Attention backend.\n
2025-01-10 13:27:13.781 | info | 3yn75aympd52v1 | INFO 01-10 16:27:13
custom_cache_manager.py:17] Setting Triton cache manager to:
vllm.triton_utils.custom_cache_manager:CustomCacheManager\n
2025-01-10 13:27:13.729 | warning | 3yn75aympd52v1 | WARNING 01-10 16:27:13
multiproc_gpu_executor.py:56] Reducing Torch parallelism from 48 threads to 1 to
avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment
to tune this value as needed.\n
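
The warning above can be tuned by setting OMP_NUM_THREADS before the executor spawns its workers; a sketch (the value 8 is illustrative):

    import os
    os.environ.setdefault("OMP_NUM_THREADS", "8")   # must run before torch/vLLM initialize their thread pools
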
2025-01-10 13:27:13.729 | info | 3yn75aympd52v1 | INFO 01-10 16:27:13
llm_engine.py:249] Initializing an LLM engine (v0.6.4) with config:
model='/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/
snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', speculative_config=None,
tokenizer='/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/
snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', skip_tokenizer_init=False,
tokenizer_mode=auto, revision=None, override_neuron_config=None,
tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16,
max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO,
tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False,
quantization=awq, enforce_eager=False, kv_cache_dtype=auto,
quantization_param_path=None, device_config=cuda,
decoding_config=DecodingConfig(guided_decoding_backend='outlines'),
observability_config=ObservabilityConfig(otlp_traces_endpoint=None,
collect_model_forward_time=False, collect_model_execute_time=False), seed=0,
served_model_name=/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-
v4.6-AWQ/snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83, num_scheduler_steps=1,
chunked_prefill_enabled=False multi_step_stream_outputs=True,
enable_prefix_caching=True, use_async_output_proc=True, use_cached_outputs=False,
chat_template_text_format=string, mm_processor_kwargs=None, pooler_config=None)\n
2025-01-10 13:27:13.729 | info | 3yn75aympd52v1 | INFO 01-10 16:27:13
config.py:1020] Defaulting to use mp for distributed inference\n
2025-01-10 13:27:13.729 | warning | 3yn75aympd52v1 | WARNING 01-10 16:27:13
config.py:428] awq quantization is not fully optimized yet. The speed can be slower
than non-quantized models.\n
2025-01-10 13:27:13.729 | info | 3yn75aympd52v1 | INFO 01-10 16:27:13
awq_marlin.py:113] Detected that the model can run with awq_marlin, however you
specified quantization=awq explicitly, so forcing awq. Use quantization=awq_marlin
for faster inference\n
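
Per the hint above, the forced-awq path can be avoided by requesting the marlin kernels explicitly; a sketch of the single changed argument (all other args as in the dumps above):

    from vllm.engine.arg_utils import AsyncEngineArgs

    engine_args = AsyncEngineArgs(
        model=("/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/"
               "snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83"),
        quantization="awq_marlin",   # instead of 'awq', per awq_marlin.py:113
    )
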
2025-01-10 13:27:13.729 | info | 3yn75aympd52v1 | INFO 01-10 16:27:13
config.py:350] This model supports multiple tasks: {'generate', 'embedding'}.
Defaulting to 'generate'.\n
2025-01-10 13:27:13.154 | info | 3yn75aympd52v1 | tokenizer_name_or_path:
/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83, tokenizer_revision: None,
trust_remote_code: False\n
2025-01-10 13:27:13.154 | info | 3yn75aympd52v1 | engine.py :26 2025-
01-10 16:27:07,459 Engine args:
AsyncEngineArgs(model='/models/huggingface-cache/hub/models--loremipsum3658--
Bacchus-v4.6-AWQ/snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83',
served_model_name=None, tokenizer='/models/huggingface-cache/hub/models--
loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', task='auto', skip_tokenizer_init=False,
tokenizer_mode='auto', chat_template_text_format='string', trust_remote_code=False,
allowed_local_media_path='', download_dir=None, load_format='auto',
config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto',
quantization_param_path=None, seed=0, max_model_len=8192, worker_use_ray=False,
distributed_executor_backend=None, pipeline_parallel_size=1,
tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16,
enable_prefix_caching=True, disable_sliding_window=False,
use_v2_block_manager='true', swap_space=4, cpu_offload_gb=0,
gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256,
max_logprobs=20, disable_log_stats=False, revision=None, code_revision=None,
rope_scaling=None, rope_theta=None, hf_overrides=None, tokenizer_revision=None,
quantization='awq', enforce_eager=False, max_seq_len_to_capture=8192,
disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray',
tokenizer_pool_extra_config=None, limit_mm_per_prompt=None,
mm_processor_kwargs=None, enable_lora=False, enable_lora_bias=False, max_loras=1,
max_lora_rank=16, enable_prompt_adapter=False, max_prompt_adapters=1,
max_prompt_adapter_token=0, fully_sharded_loras=False, lora_extra_vocab_size=256,
long_lora_scaling_factors=None, lora_dtype='auto', max_cpu_loras=None,
device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True,
ray_workers_use_nsight=False, num_gpu_blocks_override=None, num_lookahead_slots=0,
model_loader_extra_config=None, ignore_patterns=None, preemption_mode=None,
scheduler_delay_factor=0.0, enable_chunked_prefill=None,
guided_decoding_backend='outlines', speculative_model=None,
speculative_model_quantization=None, speculative_draft_tensor_parallel_size=None,
num_speculative_tokens=None, speculative_disable_mqa_scorer=False,
speculative_max_model_len=None, speculative_disable_by_batch_size=None,
ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None,
spec_decoding_acceptance_method='rejection_sampler',
typical_acceptance_sampler_posterior_threshold=None,
typical_acceptance_sampler_posterior_alpha=None, qlora_adapter_name_or_path=None,
disable_logprobs_during_spec_decoding=None, otlp_traces_endpoint=None,
collect_detailed_traces=None, disable_async_output_proc=False,
scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None,
disable_log_requests=False)\n
2025-01-10 13:27:07.424 | info | 3yn75aympd52v1 | engine_args.py :126 2025-
01-10 16:27:07,424 Using baked in model with args: {'MODEL_NAME':
'/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', 'TOKENIZER_NAME': '/models/huggingface-
cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83'}\n
2025-01-10 12:14:08.168 | info | 3yn75aympd52v1 |
warnings.warn('resource_tracker: There appear to be %d '\n
2025-01-10 12:14:07.128 | info | 3yn75aympd52v1 |
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning:
resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at
shutdown\n
2025-01-10 12:14:06.092 | warning | 3yn75aympd52v1 | [rank0]:[W110
15:14:06.755547102 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has
NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the
application should call destroy_process_group to ensure that any pending NCCL
operations have finished in this process. In rare cases this process can exit
before this point and block the progress of another member of the process group.
This constraint has always been present, but this warning has only been added
since PyTorch 2.4 (function operator())\n
2025-01-10 12:14:06.092 | info | 3yn75aympd52v1 | INFO 01-10 15:14:04
async_llm_engine.py:62] Engine is gracefully shutting down.\n
2025-01-10 12:14:04.505 | info | 3yn75aympd52v1 | Kill worker.
2025-01-10 11:54:21.263 | info | 3yn75aympd52v1 | Finished.
2025-01-10 11:54:21.211 | info | 3yn75aympd52v1 | Finished running generator.
2025-01-10 11:54:21.073 | info | 3yn75aympd52v1 | INFO 01-10 14:54:21
async_llm_engine.py:176] Finished request 59eadcbebddf4f3986ece12f5a9e9ad3.\n
2025-01-10 11:54:21.073 | info | 3yn75aympd52v1 | INFO 01-10 14:54:18
metrics.py:465] Prefix cache hit rate: GPU: 36.76%, CPU: 0.00%\n
2025-01-10 11:54:18.434 | info | 3yn75aympd52v1 | INFO 01-10 14:54:18
metrics.py:449] Avg prompt throughput: 4.5 tokens/s, Avg generation throughput: 0.0
tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage:
2.2%, CPU KV cache usage: 0.0%.\n
2025-01-10 11:54:18.434 | info | 3yn75aympd52v1 | INFO 01-10 14:54:17
async_llm_engine.py:208] Added request 59eadcbebddf4f3986ece12f5a9e9ad3.\n
2025-01-10 11:54:18.434 | info | 3yn75aympd52v1 | Jobs in progress: 1
2025-01-10 11:54:17.516 | info | 3yn75aympd52v1 | Jobs in queue: 1
2025-01-10 11:54:17.516 | info | 3yn75aympd52v1 | Finished.
2025-01-10 11:43:53.600 | info | 3yn75aympd52v1 | Finished running generator.
2025-01-10 11:43:53.496 | info | 3yn75aympd52v1 | INFO 01-10 14:43:53
async_llm_engine.py:176] Finished request 55c8f07229ae4b338946079e292342da.\n
2025-01-10 11:43:53.496 | info | 3yn75aympd52v1 | INFO 01-10 14:43:52
metrics.py:465] Prefix cache hit rate: GPU: 0.00%, CPU: 0.00%\n
2025-01-10 11:43:52.945 | info | 3yn75aympd52v1 | INFO 01-10 14:43:52
metrics.py:449] Avg prompt throughput: 523.8 tokens/s, Avg generation throughput:
9.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 2.1%, CPU KV cache usage: 0.0%.\n
2025-01-10 11:43:52.945 | info | 3yn75aympd52v1 | INFO 01-10 14:43:48
async_llm_engine.py:208] Added request 55c8f07229ae4b338946079e292342da.\n
2025-01-10 11:43:52.944 | info | 3yn75aympd52v1 | Jobs in progress: 1
2025-01-10 11:43:52.944 | info | 3yn75aympd52v1 | Jobs in queue: 1
2025-01-10 11:43:48.632 | info | 3yn75aympd52v1 | --- Starting Serverless Worker |
Version 1.7.5 ---\n
2025-01-10 11:43:48.632 | info | 3yn75aympd52v1 | INFO 01-10 14:43:48
chat_utils.py:431] ' }}{% endif %}\n
2025-01-10 11:43:48.632 | info | 3yn75aympd52v1 | INFO 01-10 14:43:48
chat_utils.py:431] \n
2025-01-10 11:43:48.632 | info | 3yn75aympd52v1 | INFO 01-10 14:43:48
chat_utils.py:431] '+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0
== 0 %}{% set content = '<|begin_of_text|>' + content %}{% endif %}{{ content }}{%
endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|
end_header_id|>\n
2025-01-10 11:43:48.632 | info | 3yn75aympd52v1 | INFO 01-10 14:43:48
chat_utils.py:431] \n
2025-01-10 11:43:48.632 | info | 3yn75aympd52v1 | INFO 01-10 14:43:48
chat_utils.py:431] {% set loop_messages = messages %}{% for message in
loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|
end_header_id|>\n
2025-01-10 11:43:48.632 | info | 3yn75aympd52v1 | INFO 01-10 14:43:48
chat_utils.py:431] Using supplied chat template:\n
2025-01-10 11:43:48.225 | info | 3yn75aympd52v1 | tokenizer_name_or_path:
/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83, tokenizer_revision: None,
trust_remote_code: False\n
2025-01-10 11:43:48.225 | info | 3yn75aympd52v1 | engine.py :26 2025-
01-10 14:43:47,934 Engine args:
AsyncEngineArgs(model='/models/huggingface-cache/hub/models--loremipsum3658--
Bacchus-v4.6-AWQ/snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83',
served_model_name=None, tokenizer='/models/huggingface-cache/hub/models--
loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', task='auto', skip_tokenizer_init=False,
tokenizer_mode='auto', chat_template_text_format='string', trust_remote_code=False,
allowed_local_media_path='', download_dir=None, load_format='auto',
config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto',
quantization_param_path=None, seed=0, max_model_len=8192, worker_use_ray=False,
distributed_executor_backend=None, pipeline_parallel_size=1,
tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16,
enable_prefix_caching=True, disable_sliding_window=False,
use_v2_block_manager='true', swap_space=4, cpu_offload_gb=0,
gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256,
max_logprobs=20, disable_log_stats=False, revision=None, code_revision=None,
rope_scaling=None, rope_theta=None, hf_overrides=None, tokenizer_revision=None,
quantization='awq', enforce_eager=False, max_seq_len_to_capture=8192,
disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray',
tokenizer_pool_extra_config=None, limit_mm_per_prompt=None,
mm_processor_kwargs=None, enable_lora=False, enable_lora_bias=False, max_loras=1,
max_lora_rank=16, enable_prompt_adapter=False, max_prompt_adapters=1,
max_prompt_adapter_token=0, fully_sharded_loras=False, lora_extra_vocab_size=256,
long_lora_scaling_factors=None, lora_dtype='auto', max_cpu_loras=None,
device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True,
ray_workers_use_nsight=False, num_gpu_blocks_override=None, num_lookahead_slots=0,
model_loader_extra_config=None, ignore_patterns=None, preemption_mode=None,
scheduler_delay_factor=0.0, enable_chunked_prefill=None,
guided_decoding_backend='outlines', speculative_model=None,
speculative_model_quantization=None, speculative_draft_tensor_parallel_size=None,
num_speculative_tokens=None, speculative_disable_mqa_scorer=False,
speculative_max_model_len=None, speculative_disable_by_batch_size=None,
ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None,
spec_decoding_acceptance_method='rejection_sampler',
typical_acceptance_sampler_posterior_threshold=None,
typical_acceptance_sampler_posterior_alpha=None, qlora_adapter_name_or_path=None,
disable_logprobs_during_spec_decoding=None, otlp_traces_endpoint=None,
collect_detailed_traces=None, disable_async_output_proc=False,
scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None,
disable_log_requests=False)\n
2025-01-10 11:43:48.225 | info | 3yn75aympd52v1 | engine_args.py :126 2025-
01-10 14:43:47,929 Using baked in model with args: {'MODEL_NAME':
'/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', 'TOKENIZER_NAME': '/models/huggingface-
cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83'}\n
2025-01-10 11:43:47.929 | info | 3yn75aympd52v1 | engine.py :112 2025-
01-10 14:43:47,929 Initialized vLLM engine in 69.75s\n
2025-01-10 11:43:47.645 | info | 3yn75aympd52v1 | INFO 01-10 14:43:47
model_runner.py:1518] Graph capturing finished in 30 secs, took 0.99 GiB\n
2025-01-10 11:43:47.645 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=228)#[0;0m INFO 01-10 14:43:47 model_runner.py:1518] Graph capturing finished
in 30 secs, took 0.99 GiB\n
2025-01-10 11:43:47.587 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=228)#[0;0m INFO 01-10 14:43:47 custom_all_reduce.py:224] Registering 5635 cuda
graph addresses\n
2025-01-10 11:43:47.374 | info | 3yn75aympd52v1 | INFO 01-10 14:43:47
custom_all_reduce.py:224] Registering 5635 cuda graph addresses\n
2025-01-10 11:43:47.374 | info | 3yn75aympd52v1 | INFO 01-10 14:43:17
model_runner.py:1404] If out-of-memory error occurs during cudagraph capture,
consider decreasing `gpu_memory_utilization` or switching to eager mode. You can
also reduce the `max_num_seqs` as needed to decrease memory usage.\n
2025-01-10 11:43:47.374 | info | 3yn75aympd52v1 | INFO 01-10 14:43:17
model_runner.py:1400] Capturing cudagraphs for decoding. This may lead to
unexpected consequences if the model is not static. To run the model in eager mode,
set 'enforce_eager=True' or use '--enforce-eager' in the CLI.\n
2025-01-10 11:43:47.373 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=228)#[0;0m INFO 01-10 14:43:17 model_runner.py:1404] If out-of-memory error
occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or
switching to eager mode. You can also reduce the `max_num_seqs` as needed to
decrease memory usage.\n
2025-01-10 11:43:17.355 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=228)#[0;0m INFO 01-10 14:43:17 model_runner.py:1400] Capturing cudagraphs for
decoding. This may lead to unexpected consequences if the model is not static. To
run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in
the CLI.\n
2025-01-10 11:43:17.355 | info | 3yn75aympd52v1 | INFO 01-10 14:43:14
distributed_gpu_executor.py:61] Maximum concurrency for 8192 tokens per request:
15.53x\n
2025-01-10 11:43:14.762 | info | 3yn75aympd52v1 | INFO 01-10 14:43:14
distributed_gpu_executor.py:57] # GPU blocks: 7949, # CPU blocks: 1638\n
2025-01-10 11:43:14.593 | info | 3yn75aympd52v1 | INFO 01-10 14:43:14
worker.py:232] Memory profiling results: total_gpu_memory=44.34GiB
initial_memory_usage=19.09GiB peak_torch_memory=19.88GiB
memory_usage_post_profile=19.21Gib non_torch_memory=0.62GiB kv_cache_size=19.41GiB
gpu_memory_utilization=0.90\n
2025-01-10 11:43:14.528 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=228)#[0;0m INFO 01-10 14:43:14 worker.py:232] Memory profiling results:
total_gpu_memory=44.34GiB initial_memory_usage=19.09GiB peak_torch_memory=19.84GiB
memory_usage_post_profile=19.19Gib non_torch_memory=0.60GiB kv_cache_size=19.46GiB
gpu_memory_utilization=0.90\n
2025-01-10 11:43:14.528 | info | 3yn75aympd52v1 | INFO 01-10 14:43:07
model_runner.py:1077] Loading model weights took 18.5769 GB\n
2025-01-10 11:43:07.716 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=228)#[0;0m INFO 01-10 14:43:07 model_runner.py:1077] Loading model weights took
18.5769 GB\n
2025-01-10 11:43:07.716 | info | 3yn75aympd52v1 | \n
2025-01-10 11:43:07.716 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 100% Completed | 9/9 [00:12<00:00, 1.44s/it]\n
2025-01-10 11:43:07.489 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 100% Completed | 9/9 [00:12<00:00, 1.16s/it]\n
2025-01-10 11:43:07.263 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 89% Completed | 8/9 [00:12<00:01, 1.58s/it]\n
2025-01-10 11:43:05.903 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 78% Completed | 7/9 [00:11<00:03, 1.69s/it]\n
2025-01-10 11:43:04.164 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 67% Completed | 6/9 [00:09<00:04, 1.66s/it]\n
2025-01-10 11:43:02.439 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 56% Completed | 5/9 [00:07<00:06, 1.63s/it]\n
2025-01-10 11:43:00.784 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 44% Completed | 4/9 [00:06<00:08, 1.62s/it]\n
2025-01-10 11:42:59.133 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 33% Completed | 3/9 [00:04<00:09, 1.60s/it]\n
2025-01-10 11:42:57.392 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 22% Completed | 2/9 [00:02<00:10, 1.48s/it]\n
2025-01-10 11:42:55.703 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 11% Completed | 1/9 [00:01<00:09, 1.17s/it]\n
2025-01-10 11:42:54.532 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 0% Completed | 0/9 [00:00<?, ?it/s]\n
2025-01-10 11:42:54.532 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=228)#[0;0m INFO 01-10 14:42:54 model_runner.py:1072] Starting to load model
/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83...\n
2025-01-10 11:42:54.532 | info | 3yn75aympd52v1 | INFO 01-10 14:42:54
model_runner.py:1072] Starting to load model /models/huggingface-cache/hub/models--
loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83...\n
2025-01-10 11:42:54.532 | info | 3yn75aympd52v1 | INFO 01-10 14:42:54
shm_broadcast.py:236] vLLM message queue communication handle:
Handle(connect_ip='127.0.0.1', local_reader_ranks=[1],
buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at
0x78f709e7de70>, local_subscribe_port=38717, remote_subscribe_port=None)\n
2025-01-10 11:42:54.532 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=228)#[0;0m INFO 01-10 14:42:54 custom_all_reduce_utils.py:242] reading GPU P2P
access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json\n
2025-01-10 11:42:54.234 | info | 3yn75aympd52v1 | INFO 01-10 14:42:54
custom_all_reduce_utils.py:242] reading GPU P2P access cache from
/root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json\n
2025-01-10 11:42:54.234 | info | 3yn75aympd52v1 | INFO 01-10 14:42:43
custom_all_reduce_utils.py:204] generating GPU P2P access cache in
/root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json\n
2025-01-10 11:42:54.234 | info | 3yn75aympd52v1 | INFO 01-10 14:42:43
pynccl.py:69] vLLM is using nccl==2.21.5\n
2025-01-10 11:42:54.234 | info | 3yn75aympd52v1 | INFO 01-10 14:42:43
utils.py:960] Found nccl from library libnccl.so.2\n
2025-01-10 11:42:54.233 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=228)#[0;0m INFO 01-10 14:42:43 pynccl.py:69] vLLM is using nccl==2.21.5\n
2025-01-10 11:42:54.233 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=228)#[0;0m INFO 01-10 14:42:43 utils.py:960] Found nccl from library
libnccl.so.2\n
2025-01-10 11:42:54.233 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=228)#[0;0m INFO 01-10 14:42:42 multiproc_worker_utils.py:215] Worker ready;
awaiting tasks\n
2025-01-10 11:42:54.233 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=228)#[0;0m INFO 01-10 14:42:42 selector.py:135] Using Flash Attention backend.\n
2025-01-10 11:42:54.233 | info | 3yn75aympd52v1 | INFO 01-10 14:42:42
selector.py:135] Using Flash Attention backend.\n
2025-01-10 11:42:54.233 | info | 3yn75aympd52v1 | INFO 01-10 14:42:42
custom_cache_manager.py:17] Setting Triton cache manager to:
vllm.triton_utils.custom_cache_manager:CustomCacheManager\n
2025-01-10 11:42:54.233 | warning | 3yn75aympd52v1 | WARNING 01-10 14:42:42
multiproc_gpu_executor.py:56] Reducing Torch parallelism from 48 threads to 1 to
avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment
to tune this value as needed.\n
2025-01-10 11:42:54.233 | info | 3yn75aympd52v1 | INFO 01-10 14:42:42
llm_engine.py:249] Initializing an LLM engine (v0.6.4) with config:
model='/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/
snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', speculative_config=None,
tokenizer='/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/
snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', skip_tokenizer_init=False,
tokenizer_mode=auto, revision=None, override_neuron_config=None,
tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16,
max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO,
tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False,
quantization=awq, enforce_eager=False, kv_cache_dtype=auto,
quantization_param_path=None, device_config=cuda,
decoding_config=DecodingConfig(guided_decoding_backend='outlines'),
observability_config=ObservabilityConfig(otlp_traces_endpoint=None,
collect_model_forward_time=False, collect_model_execute_time=False), seed=0,
served_model_name=/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-
v4.6-AWQ/snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83, num_scheduler_steps=1,
chunked_prefill_enabled=False multi_step_stream_outputs=True,
enable_prefix_caching=True, use_async_output_proc=True, use_cached_outputs=False,
chat_template_text_format=string, mm_processor_kwargs=None, pooler_config=None)\n
2025-01-10 11:42:54.233 | info | 3yn75aympd52v1 | INFO 01-10 14:42:42
config.py:1020] Defaulting to use mp for distributed inference\n
2025-01-10 11:42:54.233 | warning | 3yn75aympd52v1 | WARNING 01-10 14:42:42
config.py:428] awq quantization is not fully optimized yet. The speed can be slower
than non-quantized models.\n
2025-01-10 11:42:54.233 | info | 3yn75aympd52v1 | INFO 01-10 14:42:42
awq_marlin.py:113] Detected that the model can run with awq_marlin, however you
specified quantization=awq explicitly, so forcing awq. Use quantization=awq_marlin
for faster inference\n
2025-01-10 11:42:54.233 | info | 3yn75aympd52v1 | INFO 01-10 14:42:42
config.py:350] This model supports multiple tasks: {'generate', 'embedding'}.
Defaulting to 'generate'.\n
2025-01-10 11:42:54.233 | info | 3yn75aympd52v1 | tokenizer_name_or_path:
/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83, tokenizer_revision: None,
trust_remote_code: False\n
2025-01-10 11:38:52.063 | info | 3yn75aympd52v1 |
warnings.warn('resource_tracker: There appear to be %d '\n
2025-01-10 11:38:51.018 | info | 3yn75aympd52v1 |
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning:
resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at
shutdown\n
2025-01-10 11:38:50.069 | warning | 3yn75aympd52v1 | [rank0]:[W110
14:38:50.732630541 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has
NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the
application should call destroy_process_group to ensure that any pending NCCL
operations have finished in this process. In rare cases this process can exit
before this point and block the progress of another member of the process group.
This constraint has always been present, but this warning has only been added
since PyTorch 2.4 (function operator())\n
2025-01-10 11:38:50.069 | info | 3yn75aympd52v1 | INFO 01-10 14:38:48
async_llm_engine.py:62] Engine is gracefully shutting down.\n
2025-01-10 11:38:48.520 | info | 3yn75aympd52v1 | Kill worker.
2025-01-10 11:38:48.520 | info | 3yn75aympd52v1 | Finished.
2025-01-10 11:34:06.349 | info | 3yn75aympd52v1 | Finished running generator.
2025-01-10 11:34:06.225 | info | 3yn75aympd52v1 | INFO 01-10 14:34:06
async_llm_engine.py:176] Finished request 48af1202e7a645bba14346888c841c1e.\n
2025-01-10 11:34:06.225 | info | 3yn75aympd52v1 | INFO 01-10 14:34:03
metrics.py:465] Prefix cache hit rate: GPU: 67.26%, CPU: 0.00%\n
2025-01-10 11:34:03.554 | info | 3yn75aympd52v1 | INFO 01-10 14:34:03
metrics.py:449] Avg prompt throughput: 27.0 tokens/s, Avg generation throughput:
1.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 1.0%, CPU KV cache usage: 0.0%.\n
2025-01-10 11:34:03.554 | info | 3yn75aympd52v1 | INFO 01-10 14:34:03
async_llm_engine.py:208] Added request 48af1202e7a645bba14346888c841c1e.\n
2025-01-10 11:34:03.554 | info | 3yn75aympd52v1 | Jobs in progress: 1
2025-01-10 11:34:03.481 | info | 3yn75aympd52v1 | Jobs in queue: 1
2025-01-10 11:34:03.480 | info | 3yn75aympd52v1 | Finished.
2025-01-10 11:33:19.149 | info | 3yn75aympd52v1 | Finished running generator.
2025-01-10 11:33:19.039 | info | 3yn75aympd52v1 | INFO 01-10 14:33:19
async_llm_engine.py:176] Finished request 7b0b0d8e7bf34ea6ac6521c531da989c.\n
2025-01-10 11:33:19.039 | info | 3yn75aympd52v1 | INFO 01-10 14:33:16
metrics.py:465] Prefix cache hit rate: GPU: 49.66%, CPU: 0.00%\n
2025-01-10 11:33:16.236 | info | 3yn75aympd52v1 | INFO 01-10 14:33:16
metrics.py:449] Avg prompt throughput: 31.4 tokens/s, Avg generation throughput:
1.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 1.0%, CPU KV cache usage: 0.0%.\n
2025-01-10 11:33:16.236 | info | 3yn75aympd52v1 | INFO 01-10 14:33:16
async_llm_engine.py:208] Added request 7b0b0d8e7bf34ea6ac6521c531da989c.\n
2025-01-10 11:33:16.236 | info | 3yn75aympd52v1 | Jobs in progress: 1
2025-01-10 11:33:16.109 | info | 3yn75aympd52v1 | Jobs in queue: 1
2025-01-10 11:33:16.109 | info | 3yn75aympd52v1 | Finished.
2025-01-10 11:32:39.708 | info | 3yn75aympd52v1 | Finished running generator.
2025-01-10 11:32:39.597 | info | 3yn75aympd52v1 | INFO 01-10 14:32:39
async_llm_engine.py:176] Finished request 3690ce4c84b6483c822fbff4803c3ca6.\n
2025-01-10 11:32:39.597 | info | 3yn75aympd52v1 | INFO 01-10 14:32:37
metrics.py:465] Prefix cache hit rate: GPU: 0.00%, CPU: 0.00%\n
2025-01-10 11:32:37.897 | info | 3yn75aympd52v1 | INFO 01-10 14:32:37
metrics.py:449] Avg prompt throughput: 3.9 tokens/s, Avg generation throughput: 0.0
tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage:
0.9%, CPU KV cache usage: 0.0%.\n
2025-01-10 11:32:37.897 | info | 3yn75aympd52v1 | INFO 01-10 14:32:36
async_llm_engine.py:208] Added request 3690ce4c84b6483c822fbff4803c3ca6.\n
2025-01-10 11:32:37.897 | info | 3yn75aympd52v1 | Jobs in progress: 1
2025-01-10 11:32:36.737 | info | 3yn75aympd52v1 | Jobs in queue: 1
2025-01-10 11:32:36.737 | info | 3yn75aympd52v1 | Finished.
2025-01-10 11:27:43.693 | info | 3yn75aympd52v1 | Finished running generator.
2025-01-10 11:27:43.577 | info | 3yn75aympd52v1 | INFO 01-10 14:27:43
async_llm_engine.py:176] Finished request 9c5abc581911481b95d316c73ba566b1.\n
2025-01-10 11:27:43.577 | info | 3yn75aympd52v1 | INFO 01-10 14:27:42
metrics.py:465] Prefix cache hit rate: GPU: 0.00%, CPU: 0.00%\n
2025-01-10 11:27:42.981 | info | 3yn75aympd52v1 | INFO 01-10 14:27:42
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
21.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.5%, CPU KV cache usage: 0.0%.\n
2025-01-10 11:27:42.980 | info | 3yn75aympd52v1 | INFO 01-10 14:27:37
metrics.py:465] Prefix cache hit rate: GPU: 0.00%, CPU: 0.00%\n
2025-01-10 11:27:37.970 | info | 3yn75aympd52v1 | INFO 01-10 14:27:37
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
21.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.4%, CPU KV cache usage: 0.0%.\n
2025-01-10 11:27:37.970 | info | 3yn75aympd52v1 | INFO 01-10 14:27:32
metrics.py:465] Prefix cache hit rate: GPU: 0.00%, CPU: 0.00%\n
2025-01-10 11:27:32.945 | info | 3yn75aympd52v1 | INFO 01-10 14:27:32
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
21.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.4%, CPU KV cache usage: 0.0%.\n
2025-01-10 11:27:32.945 | info | 3yn75aympd52v1 | INFO 01-10 14:27:27
metrics.py:465] Prefix cache hit rate: GPU: 0.00%, CPU: 0.00%\n
2025-01-10 11:27:27.936 | info | 3yn75aympd52v1 | INFO 01-10 14:27:27
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
21.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.3%, CPU KV cache usage: 0.0%.\n
2025-01-10 11:27:27.936 | info | 3yn75aympd52v1 | INFO 01-10 14:27:22
metrics.py:465] Prefix cache hit rate: GPU: 0.00%, CPU: 0.00%\n
2025-01-10 11:27:22.890 | info | 3yn75aympd52v1 | INFO 01-10 14:27:22
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
21.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.2%, CPU KV cache usage: 0.0%.\n
2025-01-10 11:27:22.890 | info | 3yn75aympd52v1 | INFO 01-10 14:27:17
metrics.py:465] Prefix cache hit rate: GPU: 0.00%, CPU: 0.00%\n
2025-01-10 11:27:17.851 | info | 3yn75aympd52v1 | INFO 01-10 14:27:17
metrics.py:449] Avg prompt throughput: 4.6 tokens/s, Avg generation throughput:
18.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.1%, CPU KV cache usage: 0.0%.\n
2025-01-10 11:27:17.851 | info | 3yn75aympd52v1 | INFO 01-10 14:27:13
async_llm_engine.py:208] Added request 9c5abc581911481b95d316c73ba566b1.\n
2025-01-10 11:27:17.851 | info | 3yn75aympd52v1 | Jobs in progress: 1
2025-01-10 11:27:17.850 | info | 3yn75aympd52v1 | Jobs in queue: 1
2025-01-10 11:27:13.558 | info | 3yn75aympd52v1 | --- Starting Serverless Worker |
Version 1.7.5 ---\n
2025-01-10 11:27:13.558 | info | 3yn75aympd52v1 | INFO 01-10 14:27:13
chat_utils.py:431] ' }}{% endif %}\n
2025-01-10 11:27:13.558 | info | 3yn75aympd52v1 | INFO 01-10 14:27:13
chat_utils.py:431] \n
2025-01-10 11:27:13.558 | info | 3yn75aympd52v1 | INFO 01-10 14:27:13
chat_utils.py:431] '+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0
== 0 %}{% set content = '<|begin_of_text|>' + content %}{% endif %}{{ content }}{%
endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|
end_header_id|>\n
2025-01-10 11:27:13.558 | info | 3yn75aympd52v1 | INFO 01-10 14:27:13
chat_utils.py:431] \n
2025-01-10 11:27:13.558 | info | 3yn75aympd52v1 | INFO 01-10 14:27:13
chat_utils.py:431] {% set loop_messages = messages %}{% for message in
loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|
end_header_id|>\n
2025-01-10 11:27:13.557 | info | 3yn75aympd52v1 | INFO 01-10 14:27:13
chat_utils.py:431] Using supplied chat template:\n
2025-01-10 11:27:13.197 | info | 3yn75aympd52v1 | tokenizer_name_or_path:
/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83, tokenizer_revision: None,
trust_remote_code: False\n
2025-01-10 11:27:13.197 | info | 3yn75aympd52v1 | engine.py :26 2025-
01-10 14:27:12,842 Engine args:
AsyncEngineArgs(model='/models/huggingface-cache/hub/models--loremipsum3658--
Bacchus-v4.6-AWQ/snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83',
served_model_name=None, tokenizer='/models/huggingface-cache/hub/models--
loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', task='auto', skip_tokenizer_init=False,
tokenizer_mode='auto', chat_template_text_format='string', trust_remote_code=False,
allowed_local_media_path='', download_dir=None, load_format='auto',
config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto',
quantization_param_path=None, seed=0, max_model_len=8192, worker_use_ray=False,
distributed_executor_backend=None, pipeline_parallel_size=1,
tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16,
enable_prefix_caching=True, disable_sliding_window=False,
use_v2_block_manager='true', swap_space=4, cpu_offload_gb=0,
gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256,
max_logprobs=20, disable_log_stats=False, revision=None, code_revision=None,
rope_scaling=None, rope_theta=None, hf_overrides=None, tokenizer_revision=None,
quantization='awq', enforce_eager=False, max_seq_len_to_capture=8192,
disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray',
tokenizer_pool_extra_config=None, limit_mm_per_prompt=None,
mm_processor_kwargs=None, enable_lora=False, enable_lora_bias=False, max_loras=1,
max_lora_rank=16, enable_prompt_adapter=False, max_prompt_adapters=1,
max_prompt_adapter_token=0, fully_sharded_loras=False, lora_extra_vocab_size=256,
long_lora_scaling_factors=None, lora_dtype='auto', max_cpu_loras=None,
device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True,
ray_workers_use_nsight=False, num_gpu_blocks_override=None, num_lookahead_slots=0,
model_loader_extra_config=None, ignore_patterns=None, preemption_mode=None,
scheduler_delay_factor=0.0, enable_chunked_prefill=None,
guided_decoding_backend='outlines', speculative_model=None,
speculative_model_quantization=None, speculative_draft_tensor_parallel_size=None,
num_speculative_tokens=None, speculative_disable_mqa_scorer=False,
speculative_max_model_len=None, speculative_disable_by_batch_size=None,
ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None,
spec_decoding_acceptance_method='rejection_sampler',
typical_acceptance_sampler_posterior_threshold=None,
typical_acceptance_sampler_posterior_alpha=None, qlora_adapter_name_or_path=None,
disable_logprobs_during_spec_decoding=None, otlp_traces_endpoint=None,
collect_detailed_traces=None, disable_async_output_proc=False,
scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None,
disable_log_requests=False)\n
2025-01-10 11:27:13.197 | info | 3yn75aympd52v1 | engine_args.py :126 2025-
01-10 14:27:12,837 Using baked in model with args: {'MODEL_NAME':
'/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', 'TOKENIZER_NAME': '/models/huggingface-
cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83'}\n
2025-01-10 11:27:12.838 | info | 3yn75aympd52v1 | engine.py :112 2025-
01-10 14:27:12,837 Initialized vLLM engine in 66.86s\n
2025-01-10 11:27:12.838 | info | 3yn75aympd52v1 | INFO 01-10 14:27:12
model_runner.py:1518] Graph capturing finished in 28 secs, took 0.99 GiB\n
2025-01-10 11:27:12.838 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=227)#[0;0m INFO 01-10 14:27:12 model_runner.py:1518] Graph capturing finished
in 28 secs, took 0.99 GiB\n
2025-01-10 11:27:12.838 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=227)#[0;0m INFO 01-10 14:27:12 custom_all_reduce.py:224] Registering 5635 cuda
graph addresses\n
2025-01-10 11:27:12.555 | info | 3yn75aympd52v1 | INFO 01-10 14:27:12
custom_all_reduce.py:224] Registering 5635 cuda graph addresses\n
2025-01-10 11:27:12.555 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=227)#[0;0m INFO 01-10 14:26:44 model_runner.py:1404] If out-of-memory error
occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or
switching to eager mode. You can also reduce the `max_num_seqs` as needed to
decrease memory usage.\n
2025-01-10 11:27:12.555 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=227)#[0;0m INFO 01-10 14:26:44 model_runner.py:1400] Capturing cudagraphs for
decoding. This may lead to unexpected consequences if the model is not static. To
run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in
the CLI.\n
2025-01-10 11:27:12.555 | info | 3yn75aympd52v1 | INFO 01-10 14:26:44
model_runner.py:1404] If out-of-memory error occurs during cudagraph capture,
consider decreasing `gpu_memory_utilization` or switching to eager mode. You can
also reduce the `max_num_seqs` as needed to decrease memory usage.\n
2025-01-10 11:26:44.515 | info | 3yn75aympd52v1 | INFO 01-10 14:26:44
model_runner.py:1400] Capturing cudagraphs for decoding. This may lead to
unexpected consequences if the model is not static. To run the model in eager mode,
set 'enforce_eager=True' or use '--enforce-eager' in the CLI.\n
2025-01-10 11:26:44.515 | info | 3yn75aympd52v1 | INFO 01-10 14:26:42
distributed_gpu_executor.py:61] Maximum concurrency for 8192 tokens per request:
15.62x\n
2025-01-10 11:26:42.190 | info | 3yn75aympd52v1 | INFO 01-10 14:26:42
distributed_gpu_executor.py:57] # GPU blocks: 7999, # CPU blocks: 1638\n
2025-01-10 11:26:42.050 | info | 3yn75aympd52v1 | INFO 01-10 14:26:42
worker.py:232] Memory profiling results: total_gpu_memory=44.34GiB
initial_memory_usage=19.04GiB peak_torch_memory=19.88GiB
memory_usage_post_profile=19.09Gib non_torch_memory=0.50GiB kv_cache_size=19.53GiB
gpu_memory_utilization=0.90\n
2025-01-10 11:26:41.994 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=227)#[0;0m INFO 01-10 14:26:41 worker.py:232] Memory profiling results:
total_gpu_memory=44.34GiB initial_memory_usage=19.04GiB peak_torch_memory=19.84GiB
memory_usage_post_profile=19.09Gib non_torch_memory=0.50GiB kv_cache_size=19.57GiB
gpu_memory_utilization=0.90\n
2025-01-10 11:26:41.994 | info | 3yn75aympd52v1 | INFO 01-10 14:26:34
model_runner.py:1077] Loading model weights took 18.5769 GB\n
2025-01-10 11:26:34.578 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=227)#[0;0m INFO 01-10 14:26:34 model_runner.py:1077] Loading model weights took
18.5769 GB\n
2025-01-10 11:26:34.578 | info | 3yn75aympd52v1 | \n
2025-01-10 11:26:34.578 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 100% Completed | 9/9 [00:12<00:00, 1.42s/it]\n
2025-01-10 11:26:34.435 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 100% Completed | 9/9 [00:12<00:00, 1.16s/it]\n
2025-01-10 11:26:34.120 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 89% Completed | 8/9 [00:12<00:01, 1.55s/it]\n
2025-01-10 11:26:32.821 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 78% Completed | 7/9 [00:11<00:03, 1.66s/it]\n
2025-01-10 11:26:31.165 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 67% Completed | 6/9 [00:09<00:05, 1.67s/it]\n
2025-01-10 11:26:29.276 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 56% Completed | 5/9 [00:07<00:06, 1.55s/it]\n
2025-01-10 11:26:27.750 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 44% Completed | 4/9 [00:06<00:07, 1.57s/it]\n
2025-01-10 11:26:26.122 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 33% Completed | 3/9 [00:04<00:09, 1.53s/it]\n
2025-01-10 11:26:24.496 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 22% Completed | 2/9 [00:02<00:10, 1.46s/it]\n
2025-01-10 11:26:22.825 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 11% Completed | 1/9 [00:01<00:09, 1.15s/it]\n
2025-01-10 11:26:21.678 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 0% Completed | 0/9 [00:00<?, ?it/s]\n
2025-01-10 11:26:21.678 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=227)#[0;0m INFO 01-10 14:26:21 model_runner.py:1072] Starting to load model
/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83...\n
2025-01-10 11:26:21.678 | info | 3yn75aympd52v1 | INFO 01-10 14:26:21
model_runner.py:1072] Starting to load model /models/huggingface-cache/hub/models--
loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83...\n
2025-01-10 11:26:21.677 | info | 3yn75aympd52v1 | INFO 01-10 14:26:21
shm_broadcast.py:236] vLLM message queue communication handle:
Handle(connect_ip='127.0.0.1', local_reader_ranks=[1],
buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at
0x757adcf7df00>, local_subscribe_port=59477, remote_subscribe_port=None)\n
2025-01-10 11:26:21.677 | info | 3yn75aympd52v1 | (VllmWorkerProcess
pid=227) INFO 01-10 14:26:21 custom_all_reduce_utils.py:242] reading GPU P2P
access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json\n
2025-01-10 11:26:21.424 | info | 3yn75aympd52v1 | INFO 01-10 14:26:21
custom_all_reduce_utils.py:242] reading GPU P2P access cache from
/root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json\n
2025-01-10 11:26:21.424 | info | 3yn75aympd52v1 | INFO 01-10 14:26:11
custom_all_reduce_utils.py:204] generating GPU P2P access cache in
/root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json\n
2025-01-10 11:26:21.424 | info | 3yn75aympd52v1 | INFO 01-10 14:26:11
pynccl.py:69] vLLM is using nccl==2.21.5\n
2025-01-10 11:26:21.424 | info | 3yn75aympd52v1 | (VllmWorkerProcess
pid=227) INFO 01-10 14:26:11 pynccl.py:69] vLLM is using nccl==2.21.5\n
2025-01-10 11:26:21.424 | info | 3yn75aympd52v1 | INFO 01-10 14:26:11
utils.py:960] Found nccl from library libnccl.so.2\n
2025-01-10 11:26:21.424 | info | 3yn75aympd52v1 | (VllmWorkerProcess
pid=227) INFO 01-10 14:26:11 utils.py:960] Found nccl from library
libnccl.so.2\n
2025-01-10 11:26:21.424 | info | 3yn75aympd52v1 | (VllmWorkerProcess
pid=227) INFO 01-10 14:26:10 multiproc_worker_utils.py:215] Worker ready;
awaiting tasks\n
2025-01-10 11:26:21.424 | info | 3yn75aympd52v1 | (VllmWorkerProcess
pid=227) INFO 01-10 14:26:10 selector.py:135] Using Flash Attention backend.\n
2025-01-10 11:26:21.424 | info | 3yn75aympd52v1 | INFO 01-10 14:26:10
selector.py:135] Using Flash Attention backend.\n
2025-01-10 11:26:21.424 | info | 3yn75aympd52v1 | INFO 01-10 14:26:10
custom_cache_manager.py:17] Setting Triton cache manager to:
vllm.triton_utils.custom_cache_manager:CustomCacheManager\n
2025-01-10 11:26:21.424 | warning | 3yn75aympd52v1 | WARNING 01-10 14:26:10
multiproc_gpu_executor.py:56] Reducing Torch parallelism from 48 threads to 1 to
avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment
to tune this value as needed.\n
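The warning above pins Torch to a single CPU thread per process. If more intra-op parallelism is wanted, OMP_NUM_THREADS has to be set in the environment before torch is imported; a minimal sketch (the value 8 is illustrative, not taken from this deployment):

    import os
    os.environ["OMP_NUM_THREADS"] = "8"   # must happen before `import torch`
    import torch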
2025-01-10 11:26:21.424 | info | 3yn75aympd52v1 | INFO 01-10 14:26:09
llm_engine.py:249] Initializing an LLM engine (v0.6.4) with config:
model='/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/
snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', speculative_config=None,
tokenizer='/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/
snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', skip_tokenizer_init=False,
tokenizer_mode=auto, revision=None, override_neuron_config=None,
tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16,
max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO,
tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False,
quantization=awq, enforce_eager=False, kv_cache_dtype=auto,
quantization_param_path=None, device_config=cuda,
decoding_config=DecodingConfig(guided_decoding_backend='outlines'),
observability_config=ObservabilityConfig(otlp_traces_endpoint=None,
collect_model_forward_time=False, collect_model_execute_time=False), seed=0,
served_model_name=/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-
v4.6-AWQ/snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83, num_scheduler_steps=1,
chunked_prefill_enabled=False multi_step_stream_outputs=True,
enable_prefix_caching=True, use_async_output_proc=True, use_cached_outputs=False,
chat_template_text_format=string, mm_processor_kwargs=None, pooler_config=None)\n
2025-01-10 11:26:21.424 | info | 3yn75aympd52v1 | INFO 01-10 14:26:09
config.py:1020] Defaulting to use mp for distributed inference\n
2025-01-10 11:26:21.424 | warning | 3yn75aympd52v1 | WARNING 01-10 14:26:09
config.py:428] awq quantization is not fully optimized yet. The speed can be slower
than non-quantized models.\n
2025-01-10 11:26:21.424 | info | 3yn75aympd52v1 | INFO 01-10 14:26:09
awq_marlin.py:113] Detected that the model can run with awq_marlin, however you
specified quantization=awq explicitly, so forcing awq. Use quantization=awq_marlin
for faster inference\n
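The entry above forces plain awq because it was requested explicitly, and suggests awq_marlin for faster kernels. A hedged sketch of following that hint at the engine-args level (model path copied from this log; whether the endpoint exposes this setting is not shown here):

    from vllm import AsyncEngineArgs

    engine_args = AsyncEngineArgs(
        model="/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/"
              "snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83",
        quantization="awq_marlin",   # instead of quantization="awq"
    )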
2025-01-10 11:26:21.424 | info | 3yn75aympd52v1 | INFO 01-10 14:26:09
config.py:350] This model supports multiple tasks: {'embedding', 'generate'}.
Defaulting to 'generate'.\n
2025-01-10 11:26:21.423 | info | 3yn75aympd52v1 | tokenizer_name_or_path:
/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83, tokenizer_revision: None,
trust_remote_code: False\n
2025-01-10 11:17:34.388 | info | q5xcbyb3ls1u0i |
warnings.warn('resource_tracker: There appear to be %d '\n
2025-01-10 11:17:33.010 | info | q5xcbyb3ls1u0i |
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning:
resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at
shutdown\n
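The resource_tracker warning above means one multiprocessing shared-memory segment (vLLM's shm ring buffer) was still registered when the interpreter exited. For reference, the close/unlink pattern the tracker expects, shown as a generic example unrelated to vLLM internals:

    from multiprocessing import shared_memory

    shm = shared_memory.SharedMemory(create=True, size=1024)
    try:
        shm.buf[:5] = b"hello"
    finally:
        shm.close()    # drop this process's mapping
        shm.unlink()   # remove the segment so nothing leaks at shutdown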
2025-01-10 11:17:31.904 | warning | q5xcbyb3ls1u0i | [rank0]:[W110
14:17:31.188496691 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has
NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the
application should call destroy_process_group to ensure that any pending NCCL
operations have finished in this process. In rare cases this process can exit
before this point and block the progress of another member of the process group.
This constraint has always been present, but this warning has only been added
since PyTorch 2.4 (function operator())\n
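The NCCL warning above is PyTorch asking for an explicit teardown of the default process group before the process exits. In plain torch.distributed terms (a sketch, not the worker's actual shutdown path):

    import torch.distributed as dist

    if dist.is_initialized():
        dist.destroy_process_group()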
2025-01-10 11:17:31.903 | info | q5xcbyb3ls1u0i | INFO 01-10 14:17:29
async_llm_engine.py:62] Engine is gracefully shutting down.\n
2025-01-10 11:17:29.698 | info | q5xcbyb3ls1u0i | Kill worker.
2025-01-10 11:14:16.474 | info | q5xcbyb3ls1u0i | Finished.
2025-01-10 11:14:16.170 | info | q5xcbyb3ls1u0i | Finished running generator.
2025-01-10 11:14:15.932 | info | q5xcbyb3ls1u0i | INFO 01-10 14:14:15
async_llm_engine.py:176] Finished request 0f1b5a743e554fe7aa083edc5659dbce.\n
2025-01-10 11:14:15.932 | info | q5xcbyb3ls1u0i | INFO 01-10 14:14:11
metrics.py:465] Prefix cache hit rate: GPU: 50.00%, CPU: 0.00%\n
2025-01-10 11:14:11.717 | info | q5xcbyb3ls1u0i | INFO 01-10 14:14:11
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.5%, CPU KV cache usage: 0.0%.\n
2025-01-10 11:14:11.717 | info | q5xcbyb3ls1u0i | INFO 01-10 14:14:06
metrics.py:465] Prefix cache hit rate: GPU: 50.00%, CPU: 0.00%\n
2025-01-10 11:14:06.691 | info | q5xcbyb3ls1u0i | INFO 01-10 14:14:06
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.9 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.4%, CPU KV cache usage: 0.0%.\n
2025-01-10 11:14:06.690 | info | q5xcbyb3ls1u0i | INFO 01-10 14:14:01
metrics.py:465] Prefix cache hit rate: GPU: 50.00%, CPU: 0.00%\n
2025-01-10 11:14:01.672 | info | q5xcbyb3ls1u0i | INFO 01-10 14:14:01
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.3%, CPU KV cache usage: 0.0%.\n
2025-01-10 11:14:01.672 | info | q5xcbyb3ls1u0i | INFO 01-10 14:13:56
metrics.py:465] Prefix cache hit rate: GPU: 50.00%, CPU: 0.00%\n
2025-01-10 11:13:56.655 | info | q5xcbyb3ls1u0i | INFO 01-10 14:13:56
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.1%, CPU KV cache usage: 0.0%.\n
2025-01-10 11:13:56.655 | info | q5xcbyb3ls1u0i | INFO 01-10 14:13:51
metrics.py:465] Prefix cache hit rate: GPU: 50.00%, CPU: 0.00%\n
2025-01-10 11:13:51.652 | info | q5xcbyb3ls1u0i | INFO 01-10 14:13:51
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.2
tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage:
0.0%, CPU KV cache usage: 0.0%.\n
2025-01-10 11:13:51.652 | info | q5xcbyb3ls1u0i | INFO 01-10 14:13:51
async_llm_engine.py:208] Added request 0f1b5a743e554fe7aa083edc5659dbce.\n
2025-01-10 11:13:51.652 | info | q5xcbyb3ls1u0i | Jobs in progress: 1
2025-01-10 11:13:51.587 | info | q5xcbyb3ls1u0i | Jobs in queue: 1
2025-01-10 11:01:45.203 | info | q5xcbyb3ls1u0i | Finished.
2025-01-10 11:01:45.088 | info | q5xcbyb3ls1u0i | Finished running generator.
2025-01-10 11:01:44.963 | info | q5xcbyb3ls1u0i | INFO 01-10 14:01:44
async_llm_engine.py:176] Finished request 25c7acd97002407a93c11a34afcc2c7d.\n
2025-01-10 11:01:44.963 | info | q5xcbyb3ls1u0i | INFO 01-10 14:01:40
metrics.py:465] Prefix cache hit rate: GPU: 0.00%, CPU: 0.00%\n
2025-01-10 11:01:40.042 | info | q5xcbyb3ls1u0i | INFO 01-10 14:01:40
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.4%, CPU KV cache usage: 0.0%.\n
2025-01-10 11:01:40.042 | info | q5xcbyb3ls1u0i | INFO 01-10 14:01:35
metrics.py:465] Prefix cache hit rate: GPU: 0.00%, CPU: 0.00%\n
2025-01-10 11:01:35.040 | info | q5xcbyb3ls1u0i | INFO 01-10 14:01:35
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.2%, CPU KV cache usage: 0.0%.\n
2025-01-10 11:01:35.040 | info | q5xcbyb3ls1u0i | INFO 01-10 14:01:30
metrics.py:465] Prefix cache hit rate: GPU: 0.00%, CPU: 0.00%\n
2025-01-10 11:01:30.036 | info | q5xcbyb3ls1u0i | INFO 01-10 14:01:30
metrics.py:449] Avg prompt throughput: 4.6 tokens/s, Avg generation throughput:
29.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.1%, CPU KV cache usage: 0.0%.\n
2025-01-10 11:01:30.036 | info | q5xcbyb3ls1u0i | INFO 01-10 14:01:25
async_llm_engine.py:208] Added request 25c7acd97002407a93c11a34afcc2c7d.\n
2025-01-10 11:01:30.036 | info | q5xcbyb3ls1u0i | Jobs in progress: 1
2025-01-10 11:01:30.036 | info | q5xcbyb3ls1u0i | Jobs in queue: 1
2025-01-10 11:01:25.518 | info | q5xcbyb3ls1u0i | --- Starting Serverless Worker |
Version 1.7.5 ---\n
2025-01-10 11:01:25.518 | info | q5xcbyb3ls1u0i | INFO 01-10 14:01:25
chat_utils.py:431] ' }}{% endif %}\n
2025-01-10 11:01:25.518 | info | q5xcbyb3ls1u0i | INFO 01-10 14:01:25
chat_utils.py:431] \n
2025-01-10 11:01:25.518 | info | q5xcbyb3ls1u0i | INFO 01-10 14:01:25
chat_utils.py:431] '+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0
== 0 %}{% set content = '<|begin_of_text|>' + content %}{% endif %}{{ content }}{%
endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|
end_header_id|>\n
2025-01-10 11:01:25.518 | info | q5xcbyb3ls1u0i | INFO 01-10 14:01:25
chat_utils.py:431] \n
2025-01-10 11:01:25.518 | info | q5xcbyb3ls1u0i | INFO 01-10 14:01:25
chat_utils.py:431] {% set loop_messages = messages %}{% for message in
loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|
end_header_id|>\n
2025-01-10 11:01:25.518 | info | q5xcbyb3ls1u0i | INFO 01-10 14:01:25
chat_utils.py:431] Using supplied chat template:\n
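Because the log prints newest entries first, the chat_utils.py fragments above read bottom-to-top. Reassembled for readability, the supplied template appears to be the usual Llama-3-style chat template:

    {% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = '<|begin_of_text|>' + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}{% endif %}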
2025-01-10 11:01:25.355 | info | q5xcbyb3ls1u0i | tokenizer_name_or_path:
/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83, tokenizer_revision: None,
trust_remote_code: False\n
2025-01-10 11:01:25.355 | info | q5xcbyb3ls1u0i | engine.py :26 2025-
01-10 14:01:25,043 Engine args:
AsyncEngineArgs(model='/models/huggingface-cache/hub/models--loremipsum3658--
Bacchus-v4.6-AWQ/snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83',
served_model_name=None, tokenizer='/models/huggingface-cache/hub/models--
loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', task='auto', skip_tokenizer_init=False,
tokenizer_mode='auto', chat_template_text_format='string', trust_remote_code=False,
allowed_local_media_path='', download_dir=None, load_format='auto',
config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto',
quantization_param_path=None, seed=0, max_model_len=8192, worker_use_ray=False,
distributed_executor_backend=None, pipeline_parallel_size=1,
tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16,
enable_prefix_caching=True, disable_sliding_window=False,
use_v2_block_manager='true', swap_space=4, cpu_offload_gb=0,
gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256,
max_logprobs=20, disable_log_stats=False, revision=None, code_revision=None,
rope_scaling=None, rope_theta=None, hf_overrides=None, tokenizer_revision=None,
quantization='awq', enforce_eager=False, max_seq_len_to_capture=8192,
disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray',
tokenizer_pool_extra_config=None, limit_mm_per_prompt=None,
mm_processor_kwargs=None, enable_lora=False, enable_lora_bias=False, max_loras=1,
max_lora_rank=16, enable_prompt_adapter=False, max_prompt_adapters=1,
max_prompt_adapter_token=0, fully_sharded_loras=False, lora_extra_vocab_size=256,
long_lora_scaling_factors=None, lora_dtype='auto', max_cpu_loras=None,
device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True,
ray_workers_use_nsight=False, num_gpu_blocks_override=None, num_lookahead_slots=0,
model_loader_extra_config=None, ignore_patterns=None, preemption_mode=None,
scheduler_delay_factor=0.0, enable_chunked_prefill=None,
guided_decoding_backend='outlines', speculative_model=None,
speculative_model_quantization=None, speculative_draft_tensor_parallel_size=None,
num_speculative_tokens=None, speculative_disable_mqa_scorer=False,
speculative_max_model_len=None, speculative_disable_by_batch_size=None,
ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None,
spec_decoding_acceptance_method='rejection_sampler',
typical_acceptance_sampler_posterior_threshold=None,
typical_acceptance_sampler_posterior_alpha=None, qlora_adapter_name_or_path=None,
disable_logprobs_during_spec_decoding=None, otlp_traces_endpoint=None,
collect_detailed_traces=None, disable_async_output_proc=False,
scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None,
disable_log_requests=False)\n
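These AsyncEngineArgs are what the handler passes to vLLM when it builds the async engine whose "Added request" / "Finished request" entries appear throughout this log. A minimal sketch of that flow, repeating only the key arguments, with an illustrative prompt and request id (not taken from the log):

    import asyncio
    from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

    engine_args = AsyncEngineArgs(
        model="/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/"
              "snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83",
        quantization="awq",
        tensor_parallel_size=2,
        max_model_len=8192,
        gpu_memory_utilization=0.9,
        enable_prefix_caching=True,
    )
    engine = AsyncLLMEngine.from_engine_args(engine_args)

    async def run_one(prompt: str) -> str:
        params = SamplingParams(max_tokens=256)
        final = None
        async for output in engine.generate(prompt, params, request_id="example-1"):
            final = output              # each step yields the partial RequestOutput
        return final.outputs[0].text

    print(asyncio.run(run_one("Hello")))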
2025-01-10 11:01:25.355 | info | q5xcbyb3ls1u0i | engine_args.py :126 2025-
01-10 14:01:25,036 Using baked in model with args: {'MODEL_NAME':
'/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', 'TOKENIZER_NAME': '/models/huggingface-
cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83'}\n
2025-01-10 11:01:25.036 | info | q5xcbyb3ls1u0i | engine.py :112 2025-
01-10 14:01:25,036 Initialized vLLM engine in 57.79s\n
2025-01-10 11:01:24.706 | info | q5xcbyb3ls1u0i | INFO 01-10 14:01:24
model_runner.py:1518] Graph capturing finished in 26 secs, took 0.99 GiB\n
2025-01-10 11:01:24.706 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess
pid=228) INFO 01-10 14:01:24 model_runner.py:1518] Graph capturing finished
in 26 secs, took 0.99 GiB\n
2025-01-10 11:01:24.639 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess
pid=228) INFO 01-10 14:01:24 custom_all_reduce.py:224] Registering 5635 cuda
graph addresses\n
2025-01-10 11:01:22.860 | info | q5xcbyb3ls1u0i | INFO 01-10 14:01:22
custom_all_reduce.py:224] Registering 5635 cuda graph addresses\n
2025-01-10 11:01:22.860 | info | q5xcbyb3ls1u0i | INFO 01-10 14:00:58
model_runner.py:1404] If out-of-memory error occurs during cudagraph capture,
consider decreasing `gpu_memory_utilization` or switching to eager mode. You can
also reduce the `max_num_seqs` as needed to decrease memory usage.\n
2025-01-10 11:01:22.860 | info | q5xcbyb3ls1u0i | INFO 01-10 14:00:58
model_runner.py:1400] Capturing cudagraphs for decoding. This may lead to
unexpected consequences if the model is not static. To run the model in eager mode,
set 'enforce_eager=True' or use '--enforce-eager' in the CLI.\n
2025-01-10 11:01:22.860 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess
pid=228) INFO 01-10 14:00:58 model_runner.py:1404] If out-of-memory error
occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or
switching to eager mode. You can also reduce the `max_num_seqs` as needed to
decrease memory usage.\n
2025-01-10 11:00:58.461 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess
pid=228) INFO 01-10 14:00:58 model_runner.py:1400] Capturing cudagraphs for
decoding. This may lead to unexpected consequences if the model is not static. To
run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in
the CLI.\n
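The cudagraph entries above list the three knobs for capture-time OOMs: disable capture with enforce_eager, lower gpu_memory_utilization, or shrink max_num_seqs. As engine arguments (values illustrative, not this endpoint's settings):

    from vllm import AsyncEngineArgs

    MODEL_PATH = ("/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ"
                  "/snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83")

    engine_args = AsyncEngineArgs(
        model=MODEL_PATH,
        enforce_eager=True,            # skip cudagraph capture entirely, or...
        gpu_memory_utilization=0.85,   # ...keep capture but leave more headroom
        max_num_seqs=128,              # ...and/or cap the decode batch size
    )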
2025-01-10 11:00:58.461 | info | q5xcbyb3ls1u0i | INFO 01-10 14:00:55
distributed_gpu_executor.py:61] Maximum concurrency for 8192 tokens per request:
17.62x\n
2025-01-10 11:00:55.231 | info | q5xcbyb3ls1u0i | INFO 01-10 14:00:55
distributed_gpu_executor.py:57] # GPU blocks: 9020, # CPU blocks: 1638\n
2025-01-10 11:00:55.063 | info | q5xcbyb3ls1u0i | INFO 01-10 14:00:55
worker.py:232] Memory profiling results: total_gpu_memory=47.50GiB
initial_memory_usage=19.30GiB peak_torch_memory=19.88GiB
memory_usage_post_profile=19.44Gib non_torch_memory=0.85GiB kv_cache_size=22.02GiB
gpu_memory_utilization=0.90\n
2025-01-10 11:00:54.997 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess
pid=228) INFO 01-10 14:00:54 worker.py:232] Memory profiling results:
total_gpu_memory=47.50GiB initial_memory_usage=19.30GiB peak_torch_memory=19.84GiB
memory_usage_post_profile=19.42Gib non_torch_memory=0.83GiB kv_cache_size=22.08GiB
gpu_memory_utilization=0.90\n
2025-01-10 11:00:49.914 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess
pid=228) INFO 01-10 14:00:49 model_runner.py:1077] Loading model weights took
18.5769 GB\n
2025-01-10 11:00:49.815 | info | q5xcbyb3ls1u0i | INFO 01-10 14:00:49
model_runner.py:1077] Loading model weights took 18.5769 GB\n
2025-01-10 11:00:49.815 | info | q5xcbyb3ls1u0i | \n
2025-01-10 11:00:49.815 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 100% Completed | 9/9 [00:10<00:00, 1.17s/it]\n
2025-01-10 11:00:49.518 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 100% Completed | 9/9 [00:10<00:00, 1.07it/s]\n
2025-01-10 11:00:49.293 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 89% Completed | 8/9 [00:10<00:01, 1.26s/it]\n
2025-01-10 11:00:48.298 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 78% Completed | 7/9 [00:09<00:02, 1.38s/it]\n
2025-01-10 11:00:46.902 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 67% Completed | 6/9 [00:07<00:04, 1.37s/it]\n
2025-01-10 11:00:45.525 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 56% Completed | 5/9 [00:06<00:05, 1.37s/it]\n
2025-01-10 11:00:44.111 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 44% Completed | 4/9 [00:05<00:06, 1.34s/it]\n
2025-01-10 11:00:42.691 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 33% Completed | 3/9 [00:03<00:07, 1.29s/it]\n
2025-01-10 11:00:41.272 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 22% Completed | 2/9 [00:02<00:08, 1.18s/it]\n
2025-01-10 11:00:39.886 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 11% Completed | 1/9 [00:00<00:07, 1.14it/s]\n
2025-01-10 11:00:39.008 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 0% Completed | 0/9 [00:00<?, ?it/s]\n
2025-01-10 11:00:39.008 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess
pid=228) INFO 01-10 14:00:38 model_runner.py:1072] Starting to load model
/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83...\n
2025-01-10 11:00:39.008 | info | q5xcbyb3ls1u0i | INFO 01-10 14:00:38
model_runner.py:1072] Starting to load model /models/huggingface-cache/hub/models--
loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83...\n
2025-01-10 11:00:39.008 | info | q5xcbyb3ls1u0i | INFO 01-10 14:00:38
shm_broadcast.py:236] vLLM message queue communication handle:
Handle(connect_ip='127.0.0.1', local_reader_ranks=[1],
buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at
0x776f611e5f60>, local_subscribe_port=57731, remote_subscribe_port=None)\n
2025-01-10 11:00:39.007 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess
pid=228) INFO 01-10 14:00:38 custom_all_reduce_utils.py:242] reading GPU P2P
access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json\n
2025-01-10 11:00:38.745 | info | q5xcbyb3ls1u0i | INFO 01-10 14:00:38
custom_all_reduce_utils.py:242] reading GPU P2P access cache from
/root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json\n
2025-01-10 11:00:38.745 | info | q5xcbyb3ls1u0i | INFO 01-10 14:00:31
custom_all_reduce_utils.py:204] generating GPU P2P access cache in
/root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json\n
2025-01-10 11:00:38.745 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess
pid=228) INFO 01-10 14:00:31 pynccl.py:69] vLLM is using nccl==2.21.5\n
2025-01-10 11:00:38.745 | info | q5xcbyb3ls1u0i | INFO 01-10 14:00:31
pynccl.py:69] vLLM is using nccl==2.21.5\n
2025-01-10 11:00:38.745 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess
pid=228) INFO 01-10 14:00:31 utils.py:960] Found nccl from library
libnccl.so.2\n
2025-01-10 11:00:38.745 | info | q5xcbyb3ls1u0i | INFO 01-10 14:00:31
utils.py:960] Found nccl from library libnccl.so.2\n
2025-01-10 11:00:38.745 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess
pid=228) INFO 01-10 14:00:30 multiproc_worker_utils.py:215] Worker ready;
awaiting tasks\n
2025-01-10 11:00:38.745 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess
pid=228) INFO 01-10 14:00:30 selector.py:135] Using Flash Attention backend.\n
2025-01-10 11:00:38.745 | info | q5xcbyb3ls1u0i | INFO 01-10 14:00:30
selector.py:135] Using Flash Attention backend.\n
2025-01-10 11:00:38.745 | info | q5xcbyb3ls1u0i | INFO 01-10 14:00:30
custom_cache_manager.py:17] Setting Triton cache manager to:
vllm.triton_utils.custom_cache_manager:CustomCacheManager\n
2025-01-10 11:00:38.745 | warning | q5xcbyb3ls1u0i | WARNING 01-10 14:00:30
multiproc_gpu_executor.py:56] Reducing Torch parallelism from 64 threads to 1 to
avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment
to tune this value as needed.\n
2025-01-10 11:00:38.745 | info | q5xcbyb3ls1u0i | INFO 01-10 14:00:30
llm_engine.py:249] Initializing an LLM engine (v0.6.4) with config:
model='/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/
snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', speculative_config=None,
tokenizer='/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/
snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', skip_tokenizer_init=False,
tokenizer_mode=auto, revision=None, override_neuron_config=None,
tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16,
max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO,
tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False,
quantization=awq, enforce_eager=False, kv_cache_dtype=auto,
quantization_param_path=None, device_config=cuda,
decoding_config=DecodingConfig(guided_decoding_backend='outlines'),
observability_config=ObservabilityConfig(otlp_traces_endpoint=None,
collect_model_forward_time=False, collect_model_execute_time=False), seed=0,
served_model_name=/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-
v4.6-AWQ/snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83, num_scheduler_steps=1,
chunked_prefill_enabled=False multi_step_stream_outputs=True,
enable_prefix_caching=True, use_async_output_proc=True, use_cached_outputs=False,
chat_template_text_format=string, mm_processor_kwargs=None, pooler_config=None)\n
2025-01-10 11:00:38.745 | info | q5xcbyb3ls1u0i | INFO 01-10 14:00:30
config.py:1020] Defaulting to use mp for distributed inference\n
2025-01-10 11:00:38.745 | warning | q5xcbyb3ls1u0i | WARNING 01-10 14:00:30
config.py:428] awq quantization is not fully optimized yet. The speed can be slower
than non-quantized models.\n
2025-01-10 11:00:38.745 | info | q5xcbyb3ls1u0i | INFO 01-10 14:00:30
awq_marlin.py:113] Detected that the model can run with awq_marlin, however you
specified quantization=awq explicitly, so forcing awq. Use quantization=awq_marlin
for faster inference\n
2025-01-10 11:00:38.745 | info | q5xcbyb3ls1u0i | INFO 01-10 14:00:30
config.py:350] This model supports multiple tasks: {'embedding', 'generate'}.
Defaulting to 'generate'.\n
2025-01-10 11:00:38.745 | info | q5xcbyb3ls1u0i | tokenizer_name_or_path:
/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83, tokenizer_revision: None,
trust_remote_code: False\n
2025-01-10 11:00:38.745 | info | q5xcbyb3ls1u0i | engine.py :26 2025-
01-10 14:00:26,956 Engine args:
AsyncEngineArgs(model='/models/huggingface-cache/hub/models--loremipsum3658--
Bacchus-v4.6-AWQ/snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83',
served_model_name=None, tokenizer='/models/huggingface-cache/hub/models--
loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', task='auto', skip_tokenizer_init=False,
tokenizer_mode='auto', chat_template_text_format='string', trust_remote_code=False,
allowed_local_media_path='', download_dir=None, load_format='auto',
config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto',
quantization_param_path=None, seed=0, max_model_len=8192, worker_use_ray=False,
distributed_executor_backend=None, pipeline_parallel_size=1,
tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16,
enable_prefix_caching=True, disable_sliding_window=False,
use_v2_block_manager='true', swap_space=4, cpu_offload_gb=0,
gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256,
max_logprobs=20, disable_log_stats=False, revision=None, code_revision=None,
rope_scaling=None, rope_theta=None, hf_overrides=None, tokenizer_revision=None,
quantization='awq', enforce_eager=False, max_seq_len_to_capture=8192,
disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray',
tokenizer_pool_extra_config=None, limit_mm_per_prompt=None,
mm_processor_kwargs=None, enable_lora=False, enable_lora_bias=False, max_loras=1,
max_lora_rank=16, enable_prompt_adapter=False, max_prompt_adapters=1,
max_prompt_adapter_token=0, fully_sharded_loras=False, lora_extra_vocab_size=256,
long_lora_scaling_factors=None, lora_dtype='auto', max_cpu_loras=None,
device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True,
ray_workers_use_nsight=False, num_gpu_blocks_override=None, num_lookahead_slots=0,
model_loader_extra_config=None, ignore_patterns=None, preemption_mode=None,
scheduler_delay_factor=0.0, enable_chunked_prefill=None,
guided_decoding_backend='outlines', speculative_model=None,
speculative_model_quantization=None, speculative_draft_tensor_parallel_size=None,
num_speculative_tokens=None, speculative_disable_mqa_scorer=False,
speculative_max_model_len=None, speculative_disable_by_batch_size=None,
ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None,
spec_decoding_acceptance_method='rejection_sampler',
typical_acceptance_sampler_posterior_threshold=None,
typical_acceptance_sampler_posterior_alpha=None, qlora_adapter_name_or_path=None,
disable_logprobs_during_spec_decoding=None, otlp_traces_endpoint=None,
collect_detailed_traces=None, disable_async_output_proc=False,
scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None,
disable_log_requests=False)\n
2025-01-10 11:00:38.744 | info | q5xcbyb3ls1u0i | engine_args.py :126 2025-
01-10 14:00:26,932 Using baked in model with args: {'MODEL_NAME':
'/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', 'TOKENIZER_NAME': '/models/huggingface-
cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83'}\n
2025-01-10 11:00:17.932 | info | q5xcbyb3ls1u0i | INFO 01-10 14:00:14
metrics.py:465] Prefix cache hit rate: GPU: 90.00%, CPU: 0.00%\n
2025-01-10 11:00:14.461 | info | q5xcbyb3ls1u0i | INFO 01-10 14:00:14
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
32.9 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.5%, CPU KV cache usage: 0.0%.\n
2025-01-10 11:00:12.218 | info | q5xcbyb3ls1u0i | Jobs in queue: 1
2025-01-10 11:00:12.218 | info | q5xcbyb3ls1u0i | INFO 01-10 14:00:09
metrics.py:465] Prefix cache hit rate: GPU: 90.00%, CPU: 0.00%\n
2025-01-10 11:00:09.446 | info | q5xcbyb3ls1u0i | INFO 01-10 14:00:09
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.4%, CPU KV cache usage: 0.0%.\n
2025-01-10 11:00:09.446 | info | q5xcbyb3ls1u0i | INFO 01-10 14:00:04
metrics.py:465] Prefix cache hit rate: GPU: 90.00%, CPU: 0.00%\n
2025-01-10 11:00:05.733 | info | 0a13jc93spuadh |
warnings.warn('resource_tracker: There appear to be %d '\n
2025-01-10 11:00:04.668 | info | 0a13jc93spuadh |
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning:
resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at
shutdown\n
2025-01-10 11:00:04.668 | info | 0a13jc93spuadh | Extension modules:
multidict._multidict, yarl._quoting_c, propcache._helpers_c, _brotli,
aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket.mask,
aiohttp._websocket.reader_c, _cffi_backend, frozenlist._frozenlist,
charset_normalizer.md, requests.packages.charset_normalizer.md,
requests.packages.chardet.md, ujson, numpy.core._multiarray_umath,
numpy.core._multiarray_tests, numpy.linalg._umath_linalg,
numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator,
numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand,
numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64,
numpy.random._generator, torch._C, torch._C._dynamo.autograd_compiler,
torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils,
torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse,
torch._C._special, yaml._yaml, markupsafe._speedups, PIL._imaging,
psutil._psutil_linux, psutil._psutil_posix, msgspec._core,
sentencepiece._sentencepiece, regex._regex, PIL._imagingft, msgpack._cmsgpack,
google._upb._message, setproctitle, uvloop.loop, ray._raylet,
zmq.backend.cython._zmq (total: 53)\n
2025-01-10 11:00:04.668 | info | 0a13jc93spuadh | \n
2025-01-10 11:00:04.668 | info | 0a13jc93spuadh | <no Python frame>\n
2025-01-10 11:00:04.668 | info | 0a13jc93spuadh | Current thread
0x000072a27cfc4480 (most recent call first):\n
2025-01-10 11:00:04.668 | info | 0a13jc93spuadh | \n
2025-01-10 11:00:04.668 | info | 0a13jc93spuadh | Python runtime state: finalizing
(tstate=0x0000640cf844e0d0)\n
2025-01-10 11:00:04.435 | info | q5xcbyb3ls1u0i | INFO 01-10 14:00:04
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.3%, CPU KV cache usage: 0.0%.\n
2025-01-10 11:00:03.218 | info | 0a13jc93spuadh | Fatal Python error:
_enter_buffered_busy: could not acquire lock for <_io.BufferedWriter
name='<stdout>'> at interpreter shutdown, possibly due to daemon threads\n
2025-01-10 11:00:02.218 | error | 0a13jc93spuadh | ERROR 01-10 14:00:02
multiproc_worker_utils.py:116] Worker VllmWorkerProcess pid 228 died, exit code: -
15\n
2025-01-10 11:00:02.218 | info | 0a13jc93spuadh | Kill worker.
2025-01-10 11:00:01.111 | info | 0a13jc93spuadh | --- Starting Serverless Worker |
Version 1.7.5 ---\n
2025-01-10 11:00:01.111 | info | 0a13jc93spuadh | INFO 01-10 13:54:39
chat_utils.py:431] ' }}{% endif %}\n
2025-01-10 11:00:01.111 | info | 0a13jc93spuadh | INFO 01-10 13:54:39
chat_utils.py:431] \n
2025-01-10 11:00:01.111 | info | 0a13jc93spuadh | INFO 01-10 13:54:39
chat_utils.py:431] '+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0
== 0 %}{% set content = '<|begin_of_text|>' + content %}{% endif %}{{ content }}{%
endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|
end_header_id|>\n
2025-01-10 11:00:01.110 | info | 0a13jc93spuadh | INFO 01-10 13:54:39
chat_utils.py:431] \n
2025-01-10 11:00:01.110 | info | 0a13jc93spuadh | INFO 01-10 13:54:39
chat_utils.py:431] {% set loop_messages = messages %}{% for message in
loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|
end_header_id|>\n
2025-01-10 11:00:01.110 | info | 0a13jc93spuadh | INFO 01-10 13:54:39
chat_utils.py:431] Using supplied chat template:\n
2025-01-10 11:00:00.338 | info | q5xcbyb3ls1u0i | Kill worker.
2025-01-10 11:00:00.338 | info | q5xcbyb3ls1u0i | INFO 01-10 13:59:59
metrics.py:465] Prefix cache hit rate: GPU: 90.00%, CPU: 0.00%\n
2025-01-10 10:59:59.433 | info | q5xcbyb3ls1u0i | INFO 01-10 13:59:59
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.1%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:59:59.433 | info | q5xcbyb3ls1u0i | INFO 01-10 13:59:54
metrics.py:465] Prefix cache hit rate: GPU: 90.00%, CPU: 0.00%\n
2025-01-10 10:59:59.433 | info | q5xcbyb3ls1u0i | INFO 01-10 13:59:54
metrics.py:449] Avg prompt throughput: 2.7 tokens/s, Avg generation throughput:
13.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.0%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:59:59.433 | info | q5xcbyb3ls1u0i | INFO 01-10 13:59:54
async_llm_engine.py:208] Added request 83462dab2c104cc6a83abef0d9084599.\n
2025-01-10 10:59:59.433 | info | q5xcbyb3ls1u0i | Jobs in progress: 1
2025-01-10 10:59:54.367 | info | q5xcbyb3ls1u0i | Jobs in queue: 1
2025-01-10 10:59:49.448 | info | q5xcbyb3ls1u0i | Finished.
2025-01-10 10:59:49.296 | info | q5xcbyb3ls1u0i | Finished running generator.
2025-01-10 10:59:49.172 | info | q5xcbyb3ls1u0i | INFO 01-10 13:59:49
async_llm_engine.py:176] Finished request 6ceebc6f9aad4ddcb5622989e8de5661.\n
2025-01-10 10:59:49.172 | info | q5xcbyb3ls1u0i | INFO 01-10 13:59:45
metrics.py:465] Prefix cache hit rate: GPU: 88.89%, CPU: 0.00%\n
2025-01-10 10:59:45.877 | info | q5xcbyb3ls1u0i | INFO 01-10 13:59:45
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.4%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:59:45.877 | info | q5xcbyb3ls1u0i | INFO 01-10 13:59:40
metrics.py:465] Prefix cache hit rate: GPU: 88.89%, CPU: 0.00%\n
2025-01-10 10:59:40.872 | info | q5xcbyb3ls1u0i | INFO 01-10 13:59:40
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.3%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:59:40.872 | info | q5xcbyb3ls1u0i | INFO 01-10 13:59:35
metrics.py:465] Prefix cache hit rate: GPU: 88.89%, CPU: 0.00%\n
2025-01-10 10:59:35.845 | info | q5xcbyb3ls1u0i | INFO 01-10 13:59:35
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.1%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:59:35.845 | info | q5xcbyb3ls1u0i | INFO 01-10 13:59:30
metrics.py:465] Prefix cache hit rate: GPU: 88.89%, CPU: 0.00%\n
2025-01-10 10:59:30.844 | info | q5xcbyb3ls1u0i | INFO 01-10 13:59:30
metrics.py:449] Avg prompt throughput: 3.1 tokens/s, Avg generation throughput:
14.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.0%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:59:30.843 | info | q5xcbyb3ls1u0i | INFO 01-10 13:59:30
async_llm_engine.py:208] Added request 6ceebc6f9aad4ddcb5622989e8de5661.\n
2025-01-10 10:59:30.843 | info | q5xcbyb3ls1u0i | Jobs in progress: 1
2025-01-10 10:59:30.792 | info | q5xcbyb3ls1u0i | Finished.
2025-01-10 10:59:30.701 | info | q5xcbyb3ls1u0i | Finished running generator.
2025-01-10 10:59:26.483 | info | q5xcbyb3ls1u0i | INFO 01-10 13:59:26
async_llm_engine.py:176] Finished request a357e73a5d3d43cdbe6bfa113cf198d9.\n
2025-01-10 10:59:26.483 | info | q5xcbyb3ls1u0i | INFO 01-10 13:59:23
metrics.py:465] Prefix cache hit rate: GPU: 87.50%, CPU: 0.00%\n
2025-01-10 10:59:23.398 | info | q5xcbyb3ls1u0i | INFO 01-10 13:59:23
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.5%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:59:23.398 | info | q5xcbyb3ls1u0i | INFO 01-10 13:59:18
metrics.py:465] Prefix cache hit rate: GPU: 87.50%, CPU: 0.00%\n
2025-01-10 10:59:18.380 | info | q5xcbyb3ls1u0i | INFO 01-10 13:59:18
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.4%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:59:18.380 | info | q5xcbyb3ls1u0i | INFO 01-10 13:59:13
metrics.py:465] Prefix cache hit rate: GPU: 87.50%, CPU: 0.00%\n
2025-01-10 10:59:13.367 | info | q5xcbyb3ls1u0i | INFO 01-10 13:59:13
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.3%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:59:13.367 | info | q5xcbyb3ls1u0i | INFO 01-10 13:59:08
metrics.py:465] Prefix cache hit rate: GPU: 87.50%, CPU: 0.00%\n
2025-01-10 10:59:08.362 | info | q5xcbyb3ls1u0i | INFO 01-10 13:59:08
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.1%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:59:03.707 | info | q5xcbyb3ls1u0i | Jobs in queue: 1
2025-01-10 10:59:03.707 | info | q5xcbyb3ls1u0i | INFO 01-10 13:59:03
metrics.py:465] Prefix cache hit rate: GPU: 87.50%, CPU: 0.00%\n
2025-01-10 10:59:03.342 | info | q5xcbyb3ls1u0i | INFO 01-10 13:59:03
metrics.py:449] Avg prompt throughput: 0.5 tokens/s, Avg generation throughput: 0.3
tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage:
0.0%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:59:03.342 | info | q5xcbyb3ls1u0i | INFO 01-10 13:59:03
async_llm_engine.py:208] Added request a357e73a5d3d43cdbe6bfa113cf198d9.\n
2025-01-10 10:59:03.342 | info | q5xcbyb3ls1u0i | Jobs in progress: 1
2025-01-10 10:59:03.286 | info | q5xcbyb3ls1u0i | Jobs in queue: 1
2025-01-10 10:58:20.677 | info | q5xcbyb3ls1u0i | Finished.
2025-01-10 10:58:20.580 | info | q5xcbyb3ls1u0i | Finished running generator.
2025-01-10 10:58:20.468 | info | q5xcbyb3ls1u0i | INFO 01-10 13:58:20
async_llm_engine.py:176] Finished request 1ff44cd19f2a46bd84bc0183ca24b0b2.\n
2025-01-10 10:58:20.468 | info | q5xcbyb3ls1u0i | INFO 01-10 13:58:20
metrics.py:465] Prefix cache hit rate: GPU: 85.71%, CPU: 0.00%\n
2025-01-10 10:58:20.168 | info | q5xcbyb3ls1u0i | INFO 01-10 13:58:20
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
40.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.6%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:58:20.168 | info | q5xcbyb3ls1u0i | Jobs in progress: 1
2025-01-10 10:58:20.168 | info | q5xcbyb3ls1u0i | Finished.
2025-01-10 10:58:16.487 | error | q5xcbyb3ls1u0i | Failed to return job results. |
400, message='Bad Request', url='https://api.runpod.ai/v2/vllm-vw4dt9p5fckyh4/job-
done/q5xcbyb3ls1u0i/fee8ee7a-829f-4d4e-b042-b6bbe5fe17eb-u1?
gpu=NVIDIA+RTX+6000+Ada+Generation&isStream=true'
2025-01-10 10:58:16.396 | info | q5xcbyb3ls1u0i | Finished running generator.
2025-01-10 10:58:16.269 | info | q5xcbyb3ls1u0i | INFO 01-10 13:58:16
async_llm_engine.py:176] Finished request 8c1cc66405214aab80f5f2cad5928737.\n
2025-01-10 10:58:16.269 | info | q5xcbyb3ls1u0i | INFO 01-10 13:58:15
metrics.py:465] Prefix cache hit rate: GPU: 85.71%, CPU: 0.00%\n
2025-01-10 10:58:15.146 | info | q5xcbyb3ls1u0i | INFO 01-10 13:58:15
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
65.9 tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.9%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:58:15.146 | info | q5xcbyb3ls1u0i | INFO 01-10 13:58:10
metrics.py:465] Prefix cache hit rate: GPU: 85.71%, CPU: 0.00%\n
2025-01-10 10:58:10.135 | info | q5xcbyb3ls1u0i | INFO 01-10 13:58:10
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
65.3 tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.6%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:58:10.135 | info | q5xcbyb3ls1u0i | INFO 01-10 13:58:05
metrics.py:465] Prefix cache hit rate: GPU: 85.71%, CPU: 0.00%\n
2025-01-10 10:58:05.108 | info | q5xcbyb3ls1u0i | INFO 01-10 13:58:05
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
65.4 tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.4%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:58:05.108 | info | q5xcbyb3ls1u0i | INFO 01-10 13:58:00
metrics.py:465] Prefix cache hit rate: GPU: 85.71%, CPU: 0.00%\n
2025-01-10 10:58:00.091 | info | q5xcbyb3ls1u0i | INFO 01-10 13:58:00
metrics.py:449] Avg prompt throughput: 9.2 tokens/s, Avg generation throughput:
54.0 tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.2%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:58:00.091 | info | q5xcbyb3ls1u0i | INFO 01-10 13:57:56
async_llm_engine.py:208] Added request 8c1cc66405214aab80f5f2cad5928737.\n
2025-01-10 10:58:00.091 | info | q5xcbyb3ls1u0i | INFO 01-10 13:57:56
async_llm_engine.py:208] Added request 1ff44cd19f2a46bd84bc0183ca24b0b2.\n
2025-01-10 10:58:00.091 | info | q5xcbyb3ls1u0i | Jobs in progress: 2
2025-01-10 10:57:56.669 | info | q5xcbyb3ls1u0i | Finished.
2025-01-10 10:57:56.560 | info | q5xcbyb3ls1u0i | Finished running generator.
2025-01-10 10:57:56.417 | info | q5xcbyb3ls1u0i | INFO 01-10 13:57:56
async_llm_engine.py:176] Finished request d9ab2f54431a48b59339b37c177f4896.\n
2025-01-10 10:57:56.417 | info | q5xcbyb3ls1u0i | INFO 01-10 13:57:55
metrics.py:465] Prefix cache hit rate: GPU: 80.00%, CPU: 0.00%\n
2025-01-10 10:57:55.076 | info | q5xcbyb3ls1u0i | INFO 01-10 13:57:55
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.4%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:57:55.076 | info | q5xcbyb3ls1u0i | INFO 01-10 13:57:50
metrics.py:465] Prefix cache hit rate: GPU: 80.00%, CPU: 0.00%\n
2025-01-10 10:57:50.069 | info | q5xcbyb3ls1u0i | INFO 01-10 13:57:50
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.3%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:57:45.862 | info | q5xcbyb3ls1u0i | Jobs in queue: 2
2025-01-10 10:57:45.862 | info | q5xcbyb3ls1u0i | INFO 01-10 13:57:45
metrics.py:465] Prefix cache hit rate: GPU: 80.00%, CPU: 0.00%\n
2025-01-10 10:57:45.039 | info | q5xcbyb3ls1u0i | INFO 01-10 13:57:45
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.1%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:57:40.739 | info | q5xcbyb3ls1u0i | Jobs in queue: 1
2025-01-10 10:57:40.739 | info | q5xcbyb3ls1u0i | INFO 01-10 13:57:40
metrics.py:465] Prefix cache hit rate: GPU: 80.00%, CPU: 0.00%\n
2025-01-10 10:57:40.030 | info | q5xcbyb3ls1u0i | INFO 01-10 13:57:40
metrics.py:449] Avg prompt throughput: 0.3 tokens/s, Avg generation throughput: 0.2
tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage:
0.0%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:57:40.030 | info | q5xcbyb3ls1u0i | INFO 01-10 13:57:39
async_llm_engine.py:208] Added request d9ab2f54431a48b59339b37c177f4896.\n
2025-01-10 10:57:40.030 | info | q5xcbyb3ls1u0i | Jobs in progress: 1
2025-01-10 10:57:39.972 | info | q5xcbyb3ls1u0i | Jobs in queue: 1
2025-01-10 10:56:12.300 | info | q5xcbyb3ls1u0i | Finished.
2025-01-10 10:56:12.204 | info | q5xcbyb3ls1u0i | Finished running generator.
2025-01-10 10:56:12.071 | info | q5xcbyb3ls1u0i | INFO 01-10 13:56:12
async_llm_engine.py:176] Finished request 9fa24c31ca4d4b34b9b7f02eb3f04f5c.\n
2025-01-10 10:56:12.071 | info | q5xcbyb3ls1u0i | INFO 01-10 13:56:11
metrics.py:465] Prefix cache hit rate: GPU: 75.00%, CPU: 0.00%\n
2025-01-10 10:56:11.592 | info | q5xcbyb3ls1u0i | INFO 01-10 13:56:11
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.6%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:56:11.591 | info | q5xcbyb3ls1u0i | INFO 01-10 13:56:06
metrics.py:465] Prefix cache hit rate: GPU: 75.00%, CPU: 0.00%\n
2025-01-10 10:56:06.589 | info | q5xcbyb3ls1u0i | INFO 01-10 13:56:06
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.5%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:56:06.589 | info | q5xcbyb3ls1u0i | INFO 01-10 13:56:01
metrics.py:465] Prefix cache hit rate: GPU: 75.00%, CPU: 0.00%\n
2025-01-10 10:56:01.580 | info | q5xcbyb3ls1u0i | INFO 01-10 13:56:01
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.4%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:56:01.580 | info | q5xcbyb3ls1u0i | INFO 01-10 13:55:56
metrics.py:465] Prefix cache hit rate: GPU: 75.00%, CPU: 0.00%\n
2025-01-10 10:55:56.551 | info | q5xcbyb3ls1u0i | INFO 01-10 13:55:56
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.3%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:55:56.551 | info | q5xcbyb3ls1u0i | INFO 01-10 13:55:51
metrics.py:465] Prefix cache hit rate: GPU: 75.00%, CPU: 0.00%\n
2025-01-10 10:55:51.543 | info | q5xcbyb3ls1u0i | INFO 01-10 13:55:51
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.1%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:55:51.543 | info | q5xcbyb3ls1u0i | INFO 01-10 13:55:46
metrics.py:465] Prefix cache hit rate: GPU: 75.00%, CPU: 0.00%\n
2025-01-10 10:55:51.543 | info | q5xcbyb3ls1u0i | INFO 01-10 13:55:46
metrics.py:449] Avg prompt throughput: 2.5 tokens/s, Avg generation throughput: 8.0
tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage:
0.0%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:55:51.543 | info | q5xcbyb3ls1u0i | INFO 01-10 13:55:46
async_llm_engine.py:208] Added request 9fa24c31ca4d4b34b9b7f02eb3f04f5c.\n
2025-01-10 10:55:51.543 | info | q5xcbyb3ls1u0i | Jobs in progress: 1
2025-01-10 10:55:46.473 | info | q5xcbyb3ls1u0i | Jobs in queue: 1
2025-01-10 10:55:39.621 | info | q5xcbyb3ls1u0i | Finished.
2025-01-10 10:55:39.511 | info | q5xcbyb3ls1u0i | Finished running generator.
2025-01-10 10:55:39.380 | info | q5xcbyb3ls1u0i | INFO 01-10 13:55:39
async_llm_engine.py:176] Finished request 8b26b1b484dc4fba9edcfe8c151df159.\n
2025-01-10 10:55:39.380 | info | q5xcbyb3ls1u0i | INFO 01-10 13:55:37
metrics.py:465] Prefix cache hit rate: GPU: 66.67%, CPU: 0.00%\n
2025-01-10 10:55:37.184 | info | q5xcbyb3ls1u0i | INFO 01-10 13:55:37
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.6%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:55:37.184 | info | q5xcbyb3ls1u0i | INFO 01-10 13:55:32
metrics.py:465] Prefix cache hit rate: GPU: 66.67%, CPU: 0.00%\n
2025-01-10 10:55:32.158 | info | q5xcbyb3ls1u0i | INFO 01-10 13:55:32
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.5%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:55:32.158 | info | q5xcbyb3ls1u0i | INFO 01-10 13:55:27
metrics.py:465] Prefix cache hit rate: GPU: 66.67%, CPU: 0.00%\n
2025-01-10 10:55:27.138 | info | q5xcbyb3ls1u0i | INFO 01-10 13:55:27
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.4%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:55:27.138 | info | q5xcbyb3ls1u0i | INFO 01-10 13:55:22
metrics.py:465] Prefix cache hit rate: GPU: 66.67%, CPU: 0.00%\n
2025-01-10 10:55:22.119 | info | q5xcbyb3ls1u0i | INFO 01-10 13:55:22
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.3%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:55:22.118 | info | q5xcbyb3ls1u0i | INFO 01-10 13:55:17
metrics.py:465] Prefix cache hit rate: GPU: 66.67%, CPU: 0.00%\n
2025-01-10 10:55:17.091 | info | q5xcbyb3ls1u0i | INFO 01-10 13:55:17
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.1%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:55:17.091 | info | q5xcbyb3ls1u0i | INFO 01-10 13:55:12
metrics.py:465] Prefix cache hit rate: GPU: 66.67%, CPU: 0.00%\n
2025-01-10 10:55:12.072 | info | q5xcbyb3ls1u0i | INFO 01-10 13:55:12
metrics.py:449] Avg prompt throughput: 0.7 tokens/s, Avg generation throughput: 3.1
tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage:
0.0%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:55:12.072 | info | q5xcbyb3ls1u0i | INFO 01-10 13:55:12
async_llm_engine.py:208] Added request 8b26b1b484dc4fba9edcfe8c151df159.\n
2025-01-10 10:55:12.072 | info | q5xcbyb3ls1u0i | Jobs in progress: 1
2025-01-10 10:55:12.008 | info | q5xcbyb3ls1u0i | Jobs in queue: 1
2025-01-10 10:54:42.868 | info | q5xcbyb3ls1u0i | Finished.
2025-01-10 10:54:42.792 | info | q5xcbyb3ls1u0i | Finished running generator.
2025-01-10 10:54:42.681 | info | q5xcbyb3ls1u0i | INFO 01-10 13:54:42
async_llm_engine.py:176] Finished request 1ee6e7ea73b0473e815ecb461b63e5be.\n
2025-01-10 10:54:42.681 | info | q5xcbyb3ls1u0i | INFO 01-10 13:54:39
metrics.py:465] Prefix cache hit rate: GPU: 50.00%, CPU: 0.00%\n
2025-01-10 10:54:39.654 | info | q5xcbyb3ls1u0i | INFO 01-10 13:54:39
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.6%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:54:39.654 | info | q5xcbyb3ls1u0i | Jobs in progress: 1
2025-01-10 10:54:39.648 | info | 0a13jc93spuadh | tokenizer_name_or_path:
/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83, tokenizer_revision: None,
trust_remote_code: False\n
2025-01-10 10:54:39.648 | info | 0a13jc93spuadh | engine.py :26 2025-
01-10 13:54:39,307 Engine args:
AsyncEngineArgs(model='/models/huggingface-cache/hub/models--loremipsum3658--
Bacchus-v4.6-AWQ/snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83',
served_model_name=None, tokenizer='/models/huggingface-cache/hub/models--
loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', task='auto', skip_tokenizer_init=False,
tokenizer_mode='auto', chat_template_text_format='string', trust_remote_code=False,
allowed_local_media_path='', download_dir=None, load_format='auto',
config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto',
quantization_param_path=None, seed=0, max_model_len=8192, worker_use_ray=False,
distributed_executor_backend=None, pipeline_parallel_size=1,
tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16,
enable_prefix_caching=True, disable_sliding_window=False,
use_v2_block_manager='true', swap_space=4, cpu_offload_gb=0,
gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256,
max_logprobs=20, disable_log_stats=False, revision=None, code_revision=None,
rope_scaling=None, rope_theta=None, hf_overrides=None, tokenizer_revision=None,
quantization='awq', enforce_eager=False, max_seq_len_to_capture=8192,
disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray',
tokenizer_pool_extra_config=None, limit_mm_per_prompt=None,
mm_processor_kwargs=None, enable_lora=False, enable_lora_bias=False, max_loras=1,
max_lora_rank=16, enable_prompt_adapter=False, max_prompt_adapters=1,
max_prompt_adapter_token=0, fully_sharded_loras=False, lora_extra_vocab_size=256,
long_lora_scaling_factors=None, lora_dtype='auto', max_cpu_loras=None,
device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True,
ray_workers_use_nsight=False, num_gpu_blocks_override=None, num_lookahead_slots=0,
model_loader_extra_config=None, ignore_patterns=None, preemption_mode=None,
scheduler_delay_factor=0.0, enable_chunked_prefill=None,
guided_decoding_backend='outlines', speculative_model=None,
speculative_model_quantization=None, speculative_draft_tensor_parallel_size=None,
num_speculative_tokens=None, speculative_disable_mqa_scorer=False,
speculative_max_model_len=None, speculative_disable_by_batch_size=None,
ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None,
spec_decoding_acceptance_method='rejection_sampler',
typical_acceptance_sampler_posterior_threshold=None,
typical_acceptance_sampler_posterior_alpha=None, qlora_adapter_name_or_path=None,
disable_logprobs_during_spec_decoding=None, otlp_traces_endpoint=None,
collect_detailed_traces=None, disable_async_output_proc=False,
scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None,
disable_log_requests=False)\n
2025-01-10 10:54:39.648 | info | 0a13jc93spuadh | engine_args.py :126 2025-
01-10 13:54:39,303 Using baked in model with args: {'MODEL_NAME':
'/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', 'TOKENIZER_NAME': '/models/huggingface-
cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83'}\n
2025-01-10 10:54:39.303 | info | 0a13jc93spuadh | engine.py :112 2025-
01-10 13:54:39,302 Initialized vLLM engine in 80.50s\n
2025-01-10 10:54:39.011 | info | 0a13jc93spuadh | INFO 01-10 13:54:39
model_runner.py:1518] Graph capturing finished in 34 secs, took 0.99 GiB\n
2025-01-10 10:54:39.011 | info | 0a13jc93spuadh | (VllmWorkerProcess
pid=228) INFO 01-10 13:54:38 model_runner.py:1518] Graph capturing finished
in 34 secs, took 0.99 GiB\n
2025-01-10 10:54:38.950 | info | 0a13jc93spuadh | (VllmWorkerProcess
pid=228) INFO 01-10 13:54:38 custom_all_reduce.py:224] Registering 5635 cuda
graph addresses\n
2025-01-10 10:54:38.080 | info | q5xcbyb3ls1u0i | Finished.
2025-01-10 10:54:37.992 | info | q5xcbyb3ls1u0i | Finished running generator.
2025-01-10 10:54:37.992 | info | q5xcbyb3ls1u0i | INFO 01-10 13:54:34
metrics.py:465] Prefix cache hit rate: GPU: 50.00%, CPU: 0.00%\n
2025-01-10 10:54:35.635 | info | 0a13jc93spuadh | INFO 01-10 13:54:35
custom_all_reduce.py:224] Registering 5635 cuda graph addresses\n
2025-01-10 10:54:35.635 | info | 0a13jc93spuadh | INFO 01-10 13:54:05
model_runner.py:1404] If out-of-memory error occurs during cudagraph capture,
consider decreasing `gpu_memory_utilization` or switching to eager mode. You can
also reduce the `max_num_seqs` as needed to decrease memory usage.\n
2025-01-10 10:54:35.635 | info | 0a13jc93spuadh | INFO 01-10 13:54:05
model_runner.py:1400] Capturing cudagraphs for decoding. This may lead to
unexpected consequences if the model is not static. To run the model in eager mode,
set 'enforce_eager=True' or use '--enforce-eager' in the CLI.\n
2025-01-10 10:54:35.635 | info | 0a13jc93spuadh | (VllmWorkerProcess
pid=228) INFO 01-10 13:54:05 model_runner.py:1404] If out-of-memory error
occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or
switching to eager mode. You can also reduce the `max_num_seqs` as needed to
decrease memory usage.\n
2025-01-10 10:54:34.644 | info | q5xcbyb3ls1u0i | INFO 01-10 13:54:34
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
60.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.5%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:54:33.860 | info | q5xcbyb3ls1u0i | INFO 01-10 13:54:33
async_llm_engine.py:176] Finished request 9c282aeeba774afcaa7857b84d712d8e.\n
2025-01-10 10:54:33.860 | info | q5xcbyb3ls1u0i | INFO 01-10 13:54:29
metrics.py:465] Prefix cache hit rate: GPU: 50.00%, CPU: 0.00%\n
2025-01-10 10:54:29.630 | info | q5xcbyb3ls1u0i | INFO 01-10 13:54:29
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
65.6 tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.7%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:54:29.629 | info | q5xcbyb3ls1u0i | INFO 01-10 13:54:24
metrics.py:465] Prefix cache hit rate: GPU: 50.00%, CPU: 0.00%\n
2025-01-10 10:54:24.626 | info | q5xcbyb3ls1u0i | INFO 01-10 13:54:24
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
65.1 tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.5%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:54:24.626 | info | q5xcbyb3ls1u0i | INFO 01-10 13:54:19
metrics.py:465] Prefix cache hit rate: GPU: 50.00%, CPU: 0.00%\n
2025-01-10 10:54:19.619 | info | q5xcbyb3ls1u0i | INFO 01-10 13:54:19
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
65.3 tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.3%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:54:19.619 | info | q5xcbyb3ls1u0i | INFO 01-10 13:54:14
metrics.py:465] Prefix cache hit rate: GPU: 50.00%, CPU: 0.00%\n
2025-01-10 10:54:14.599 | info | q5xcbyb3ls1u0i | INFO 01-10 13:54:14
metrics.py:449] Avg prompt throughput: 9.2 tokens/s, Avg generation throughput: 6.8
tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage:
0.0%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:54:14.599 | info | q5xcbyb3ls1u0i | INFO 01-10 13:54:13
async_llm_engine.py:208] Added request 1ee6e7ea73b0473e815ecb461b63e5be.\n
2025-01-10 10:54:14.599 | info | q5xcbyb3ls1u0i | INFO 01-10 13:54:13
async_llm_engine.py:208] Added request 9c282aeeba774afcaa7857b84d712d8e.\n
2025-01-10 10:54:14.599 | info | q5xcbyb3ls1u0i | Jobs in progress: 2
2025-01-10 10:54:14.599 | info | q5xcbyb3ls1u0i | Jobs in queue: 2
2025-01-10 10:54:13.993 | info | q5xcbyb3ls1u0i | --- Starting Serverless Worker |
Version 1.7.5 ---\n
2025-01-10 10:54:13.993 | info | q5xcbyb3ls1u0i | INFO 01-10 13:54:09
chat_utils.py:431] ' }}{% endif %}\n
2025-01-10 10:54:13.993 | info | q5xcbyb3ls1u0i | INFO 01-10 13:54:09
chat_utils.py:431] \n
2025-01-10 10:54:13.993 | info | q5xcbyb3ls1u0i | INFO 01-10 13:54:09
chat_utils.py:431] '+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0
== 0 %}{% set content = '<|begin_of_text|>' + content %}{% endif %}{{ content }}{%
endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|
end_header_id|>\n
2025-01-10 10:54:13.993 | info | q5xcbyb3ls1u0i | INFO 01-10 13:54:09
chat_utils.py:431] \n
2025-01-10 10:54:13.993 | info | q5xcbyb3ls1u0i | INFO 01-10 13:54:09
chat_utils.py:431] {% set loop_messages = messages %}{% for message in
loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|
end_header_id|>\n
2025-01-10 10:54:13.992 | info | q5xcbyb3ls1u0i | INFO 01-10 13:54:09
chat_utils.py:431] Using supplied chat template:\n
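The chat_utils lines above show the supplied Llama-3-style Jinja template. As a sketch of how such a template is rendered outside the engine (the snapshot path and messages below are placeholders, not values from these logs):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("/models/.../snapshots/<snapshot>")
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ]
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,  # appends the assistant header, as the template's final branch does
    )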
2025-01-10 10:54:09.834 | info | q5xcbyb3ls1u0i | tokenizer_name_or_path:
/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83, tokenizer_revision: None,
trust_remote_code: False\n
2025-01-10 10:54:09.834 | info | q5xcbyb3ls1u0i | engine.py :26 2025-
01-10 13:54:09,578 Engine args:
AsyncEngineArgs(model='/models/huggingface-cache/hub/models--loremipsum3658--
Bacchus-v4.6-AWQ/snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83',
served_model_name=None, tokenizer='/models/huggingface-cache/hub/models--
loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', task='auto', skip_tokenizer_init=False,
tokenizer_mode='auto', chat_template_text_format='string', trust_remote_code=False,
allowed_local_media_path='', download_dir=None, load_format='auto',
config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto',
quantization_param_path=None, seed=0, max_model_len=8192, worker_use_ray=False,
distributed_executor_backend=None, pipeline_parallel_size=1,
tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16,
enable_prefix_caching=True, disable_sliding_window=False,
use_v2_block_manager='true', swap_space=4, cpu_offload_gb=0,
gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256,
max_logprobs=20, disable_log_stats=False, revision=None, code_revision=None,
rope_scaling=None, rope_theta=None, hf_overrides=None, tokenizer_revision=None,
quantization='awq', enforce_eager=False, max_seq_len_to_capture=8192,
disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray',
tokenizer_pool_extra_config=None, limit_mm_per_prompt=None,
mm_processor_kwargs=None, enable_lora=False, enable_lora_bias=False, max_loras=1,
max_lora_rank=16, enable_prompt_adapter=False, max_prompt_adapters=1,
max_prompt_adapter_token=0, fully_sharded_loras=False, lora_extra_vocab_size=256,
long_lora_scaling_factors=None, lora_dtype='auto', max_cpu_loras=None,
device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True,
ray_workers_use_nsight=False, num_gpu_blocks_override=None, num_lookahead_slots=0,
model_loader_extra_config=None, ignore_patterns=None, preemption_mode=None,
scheduler_delay_factor=0.0, enable_chunked_prefill=None,
guided_decoding_backend='outlines', speculative_model=None,
speculative_model_quantization=None, speculative_draft_tensor_parallel_size=None,
num_speculative_tokens=None, speculative_disable_mqa_scorer=False,
speculative_max_model_len=None, speculative_disable_by_batch_size=None,
ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None,
spec_decoding_acceptance_method='rejection_sampler',
typical_acceptance_sampler_posterior_threshold=None,
typical_acceptance_sampler_posterior_alpha=None, qlora_adapter_name_or_path=None,
disable_logprobs_during_spec_decoding=None, otlp_traces_endpoint=None,
collect_detailed_traces=None, disable_async_output_proc=False,
scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None,
disable_log_requests=False)\n
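The AsyncEngineArgs dump above is what the later "Added request" / "Finished request" lines act on. A hedged sketch of that request lifecycle (the engine object and sampling values are assumptions; only the shape of the calls is shown):

    import uuid
    from vllm import SamplingParams

    async def run_one(engine, prompt: str) -> str:
        # `engine` is assumed to be an AsyncLLMEngine built from args like those above.
        request_id = uuid.uuid4().hex                   # appears in logs as "Added request <id>"
        params = SamplingParams(max_tokens=256, temperature=0.7)
        text = ""
        async for output in engine.generate(prompt, params, request_id):
            text = output.outputs[0].text               # streamed partial completion
        return text                                     # followed by "Finished request <id>" in the logs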
2025-01-10 10:54:09.834 | info | q5xcbyb3ls1u0i | engine_args.py :126 2025-
01-10 13:54:09,573 Using baked in model with args: {'MODEL_NAME':
'/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', 'TOKENIZER_NAME': '/models/huggingface-
cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83'}\n
2025-01-10 10:54:09.573 | info | q5xcbyb3ls1u0i | engine.py :112 2025-
01-10 13:54:09,573 Initialized vLLM engine in 56.31s\n
2025-01-10 10:54:09.573 | info | q5xcbyb3ls1u0i | INFO 01-10 13:54:09
model_runner.py:1518] Graph capturing finished in 24 secs, took 0.99 GiB\n
2025-01-10 10:54:09.573 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess pid=228) INFO 01-10 13:54:09 model_runner.py:1518] Graph capturing finished
in 24 secs, took 0.99 GiB\n
2025-01-10 10:54:09.258 | info | q5xcbyb3ls1u0i | INFO 01-10 13:54:09
custom_all_reduce.py:224] Registering 5635 cuda graph addresses\n
2025-01-10 10:54:07.765 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess pid=228) INFO 01-10 13:54:07 custom_all_reduce.py:224] Registering 5635 cuda
graph addresses\n
2025-01-10 10:54:07.765 | info | q5xcbyb3ls1u0i | INFO 01-10 13:53:45
model_runner.py:1404] If out-of-memory error occurs during cudagraph capture,
consider decreasing `gpu_memory_utilization` or switching to eager mode. You can
also reduce the `max_num_seqs` as needed to decrease memory usage.\n
2025-01-10 10:54:07.765 | info | q5xcbyb3ls1u0i | INFO 01-10 13:53:45
model_runner.py:1400] Capturing cudagraphs for decoding. This may lead to
unexpected consequences if the model is not static. To run the model in eager mode,
set 'enforce_eager=True' or use '--enforce-eager' in the CLI.\n
2025-01-10 10:54:07.765 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess pid=228) INFO 01-10 13:53:45 model_runner.py:1404] If out-of-memory error
occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or
switching to eager mode. You can also reduce the `max_num_seqs` as needed to
decrease memory usage.\n
2025-01-10 10:54:05.108 | info | 0a13jc93spuadh | (VllmWorkerProcess pid=228) INFO 01-10 13:54:05 model_runner.py:1400] Capturing cudagraphs for
decoding. This may lead to unexpected consequences if the model is not static. To
run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in
the CLI.\n
2025-01-10 10:54:05.107 | info | 0a13jc93spuadh | INFO 01-10 13:54:02
distributed_gpu_executor.py:61] Maximum concurrency for 8192 tokens per request:
15.62x\n
2025-01-10 10:54:02.730 | info | 0a13jc93spuadh | INFO 01-10 13:54:02
distributed_gpu_executor.py:57] # GPU blocks: 7999, # CPU blocks: 1638\n
2025-01-10 10:54:02.571 | info | 0a13jc93spuadh | INFO 01-10 13:54:02
worker.py:232] Memory profiling results: total_gpu_memory=44.34GiB
initial_memory_usage=19.04GiB peak_torch_memory=19.88GiB
memory_usage_post_profile=19.09Gib non_torch_memory=0.50GiB kv_cache_size=19.53GiB
gpu_memory_utilization=0.90\n
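The profiling, block-count, and concurrency lines above fit together arithmetically. A rough check (the per-block KV size itself is model-dependent and not derivable from these logs):

    # Numbers taken from the 44.34 GiB worker's log lines above.
    total_gpu_memory = 44.34            # GiB
    gpu_memory_utilization = 0.90
    peak_torch_memory = 19.88           # GiB
    non_torch_memory = 0.50             # GiB

    kv_cache_size = total_gpu_memory * gpu_memory_utilization - (peak_torch_memory + non_torch_memory)
    print(round(kv_cache_size, 2))      # ~19.53 GiB, matching kv_cache_size above

    max_concurrency = 7999 * 16 / 8192  # GPU blocks * block_size / max_model_len
    print(round(max_concurrency, 2))    # ~15.62, matching "Maximum concurrency ... 15.62x"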
2025-01-10 10:54:02.513 | info | 0a13jc93spuadh | (VllmWorkerProcess pid=228) INFO 01-10 13:54:02 worker.py:232] Memory profiling results:
total_gpu_memory=44.34GiB initial_memory_usage=19.04GiB peak_torch_memory=19.84GiB
memory_usage_post_profile=19.09Gib non_torch_memory=0.50GiB kv_cache_size=19.57GiB
gpu_memory_utilization=0.90\n
2025-01-10 10:53:54.866 | info | 0a13jc93spuadh | (VllmWorkerProcess pid=228) INFO 01-10 13:53:54 model_runner.py:1077] Loading model weights took
18.5769 GB\n
2025-01-10 10:53:54.811 | info | 0a13jc93spuadh | INFO 01-10 13:53:54
model_runner.py:1077] Loading model weights took 18.5769 GB\n
2025-01-10 10:53:54.811 | info | 0a13jc93spuadh | \n
2025-01-10 10:53:54.811 | info | 0a13jc93spuadh | \rLoading safetensors checkpoint
shards: 100% Completed | 9/9 [00:17<00:00, 2.00s/it]\n
2025-01-10 10:53:54.570 | info | 0a13jc93spuadh | \rLoading safetensors checkpoint
shards: 100% Completed | 9/9 [00:17<00:00, 1.61s/it]\n
2025-01-10 10:53:54.052 | info | 0a13jc93spuadh | \rLoading safetensors checkpoint
shards: 89% Completed | 8/9 [00:17<00:02, 2.10s/it]\n
2025-01-10 10:53:52.326 | info | 0a13jc93spuadh | \rLoading safetensors checkpoint
shards: 78% Completed | 7/9 [00:15<00:04, 2.28s/it]\n
2025-01-10 10:53:50.153 | info | 0a13jc93spuadh | \rLoading safetensors checkpoint
shards: 67% Completed | 6/9 [00:13<00:06, 2.33s/it]\n
2025-01-10 10:53:47.788 | info | 0a13jc93spuadh | \rLoading safetensors checkpoint
shards: 56% Completed | 5/9 [00:11<00:09, 2.31s/it]\n
2025-01-10 10:53:45.376 | info | 0a13jc93spuadh | \rLoading safetensors checkpoint
shards: 44% Completed | 4/9 [00:08<00:11, 2.26s/it]\n
2025-01-10 10:53:45.315 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess pid=228) INFO 01-10 13:53:45 model_runner.py:1400] Capturing cudagraphs for
decoding. This may lead to unexpected consequences if the model is not static. To
run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in
the CLI.\n
2025-01-10 10:53:45.315 | info | q5xcbyb3ls1u0i | INFO 01-10 13:53:42
distributed_gpu_executor.py:61] Maximum concurrency for 8192 tokens per request:
17.62x\n
2025-01-10 10:53:43.050 | info | 0a13jc93spuadh | \rLoading safetensors checkpoint
shards: 33% Completed | 3/9 [00:06<00:13, 2.21s/it]\n
2025-01-10 10:53:42.239 | info | q5xcbyb3ls1u0i | INFO 01-10 13:53:42
distributed_gpu_executor.py:57] # GPU blocks: 9020, # CPU blocks: 1638\n
2025-01-10 10:53:42.073 | info | q5xcbyb3ls1u0i | INFO 01-10 13:53:42
worker.py:232] Memory profiling results: total_gpu_memory=47.50GiB
initial_memory_usage=19.30GiB peak_torch_memory=19.88GiB
memory_usage_post_profile=19.44Gib non_torch_memory=0.85GiB kv_cache_size=22.02GiB
gpu_memory_utilization=0.90\n
2025-01-10 10:53:42.009 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess pid=228) INFO 01-10 13:53:42 worker.py:232] Memory profiling results:
total_gpu_memory=47.50GiB initial_memory_usage=19.30GiB peak_torch_memory=19.84GiB
memory_usage_post_profile=19.42Gib non_torch_memory=0.83GiB kv_cache_size=22.08GiB
gpu_memory_utilization=0.90\n
2025-01-10 10:53:40.802 | info | 0a13jc93spuadh | \rLoading safetensors checkpoint
shards: 22% Completed | 2/9 [00:04<00:15, 2.18s/it]\n
2025-01-10 10:53:38.263 | info | 0a13jc93spuadh | \rLoading safetensors checkpoint
shards: 11% Completed | 1/9 [00:01<00:13, 1.66s/it]\n
2025-01-10 10:53:37.146 | info | q5xcbyb3ls1u0i | INFO 01-10 13:53:37
model_runner.py:1077] Loading model weights took 18.5769 GB\n
2025-01-10 10:53:37.146 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess pid=228) INFO 01-10 13:53:36 model_runner.py:1077] Loading model weights took
18.5769 GB\n
2025-01-10 10:53:37.146 | info | q5xcbyb3ls1u0i | \n
2025-01-10 10:53:37.146 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 100% Completed | 9/9 [00:10<00:00, 1.14s/it]\n
2025-01-10 10:53:36.856 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 100% Completed | 9/9 [00:10<00:00, 1.14it/s]\n
2025-01-10 10:53:36.636 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 89% Completed | 8/9 [00:10<00:01, 1.17s/it]\n
2025-01-10 10:53:36.607 | info | 0a13jc93spuadh | \rLoading safetensors checkpoint
shards: 0% Completed | 0/9 [00:00<?, ?it/s]\n
2025-01-10 10:53:36.607 | info | 0a13jc93spuadh | (VllmWorkerProcess pid=228) INFO 01-10 13:53:36 model_runner.py:1072] Starting to load model
/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83...\n
2025-01-10 10:53:36.607 | info | 0a13jc93spuadh | INFO 01-10 13:53:36
model_runner.py:1072] Starting to load model /models/huggingface-cache/hub/models--
loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83...\n
2025-01-10 10:53:36.607 | info | 0a13jc93spuadh | INFO 01-10 13:53:36
shm_broadcast.py:236] vLLM message queue communication handle:
Handle(connect_ip='127.0.0.1', local_reader_ranks=[1],
buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at
0x72a0efd89ea0>, local_subscribe_port=39357, remote_subscribe_port=None)\n
2025-01-10 10:53:36.607 | info | 0a13jc93spuadh | (VllmWorkerProcess pid=228) INFO 01-10 13:53:36 custom_all_reduce_utils.py:242] reading GPU P2P
access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json\n
2025-01-10 10:53:36.286 | info | 0a13jc93spuadh | INFO 01-10 13:53:36
custom_all_reduce_utils.py:242] reading GPU P2P access cache from
/root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json\n
2025-01-10 10:53:36.286 | info | 0a13jc93spuadh | INFO 01-10 13:53:26
custom_all_reduce_utils.py:204] generating GPU P2P access cache in
/root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json\n
2025-01-10 10:53:36.286 | info | 0a13jc93spuadh | INFO 01-10 13:53:25
pynccl.py:69] vLLM is using nccl==2.21.5\n
2025-01-10 10:53:36.286 | info | 0a13jc93spuadh | INFO 01-10 13:53:25
utils.py:960] Found nccl from library libnccl.so.2\n
2025-01-10 10:53:36.286 | info | 0a13jc93spuadh | (VllmWorkerProcess pid=228) INFO 01-10 13:53:25 pynccl.py:69] vLLM is using nccl==2.21.5\n
2025-01-10 10:53:36.286 | info | 0a13jc93spuadh | (VllmWorkerProcess pid=228) INFO 01-10 13:53:25 utils.py:960] Found nccl from library libnccl.so.2\n
2025-01-10 10:53:36.285 | info | 0a13jc93spuadh | (VllmWorkerProcess pid=228) INFO 01-10 13:53:24 multiproc_worker_utils.py:215] Worker ready; awaiting tasks\n
2025-01-10 10:53:36.285 | info | 0a13jc93spuadh | (VllmWorkerProcess pid=228) INFO 01-10 13:53:24 selector.py:135] Using Flash Attention backend.\n
2025-01-10 10:53:36.285 | info | 0a13jc93spuadh | INFO 01-10 13:53:24
selector.py:135] Using Flash Attention backend.\n
2025-01-10 10:53:36.285 | info | 0a13jc93spuadh | INFO 01-10 13:53:24
custom_cache_manager.py:17] Setting Triton cache manager to:
vllm.triton_utils.custom_cache_manager:CustomCacheManager\n
2025-01-10 10:53:36.285 | warning | 0a13jc93spuadh | WARNING 01-10 13:53:24
multiproc_gpu_executor.py:56] Reducing Torch parallelism from 48 threads to 1 to
avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment
to tune this value as needed.\n
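The warning above points at OMP_NUM_THREADS as the external override. A minimal sketch (the value 8 is purely illustrative):

    # Hedged sketch: normally exported in the container environment before the worker starts;
    # setting it from Python only works if done before torch and the worker processes spin up.
    import os
    os.environ.setdefault("OMP_NUM_THREADS", "8")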
2025-01-10 10:53:36.285 | info | 0a13jc93spuadh | INFO 01-10 13:53:24
llm_engine.py:249] Initializing an LLM engine (v0.6.4) with config:
model='/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/
snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', speculative_config=None,
tokenizer='/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/
snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', skip_tokenizer_init=False,
tokenizer_mode=auto, revision=None, override_neuron_config=None,
tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16,
max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO,
tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False,
quantization=awq, enforce_eager=False, kv_cache_dtype=auto,
quantization_param_path=None, device_config=cuda,
decoding_config=DecodingConfig(guided_decoding_backend='outlines'),
observability_config=ObservabilityConfig(otlp_traces_endpoint=None,
collect_model_forward_time=False, collect_model_execute_time=False), seed=0,
served_model_name=/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-
v4.6-AWQ/snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83, num_scheduler_steps=1,
chunked_prefill_enabled=False multi_step_stream_outputs=True,
enable_prefix_caching=True, use_async_output_proc=True, use_cached_outputs=False,
chat_template_text_format=string, mm_processor_kwargs=None, pooler_config=None)\n
2025-01-10 10:53:36.285 | info | 0a13jc93spuadh | INFO 01-10 13:53:24
config.py:1020] Defaulting to use mp for distributed inference\n
2025-01-10 10:53:36.285 | warning | 0a13jc93spuadh | WARNING 01-10 13:53:24
config.py:428] awq quantization is not fully optimized yet. The speed can be slower
than non-quantized models.\n
2025-01-10 10:53:36.285 | info | 0a13jc93spuadh | INFO 01-10 13:53:24
awq_marlin.py:113] Detected that the model can run with awq_marlin, however you
specified quantization=awq explicitly, so forcing awq. Use quantization=awq_marlin
for faster inference\n
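Following the awq_marlin hint above only requires changing the quantization value; a sketch assuming the same checkpoint (which the log reports as awq_marlin-compatible):

    from vllm import AsyncEngineArgs

    # Placeholder path; only `quantization` differs from the logged configuration.
    engine_args = AsyncEngineArgs(model="/models/.../snapshots/<snapshot>", quantization="awq_marlin")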
2025-01-10 10:53:36.285 | info | 0a13jc93spuadh | INFO 01-10 13:53:24
config.py:350] This model supports multiple tasks: {'embedding', 'generate'}.
Defaulting to 'generate'.\n
2025-01-10 10:53:36.285 | info | 0a13jc93spuadh | tokenizer_name_or_path:
/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83, tokenizer_revision: None,
trust_remote_code: False\n
2025-01-10 10:53:36.285 | info | 0a13jc93spuadh | engine.py :26 2025-
01-10 13:53:18,382 Engine args:
AsyncEngineArgs(model='/models/huggingface-cache/hub/models--loremipsum3658--
Bacchus-v4.6-AWQ/snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83',
served_model_name=None, tokenizer='/models/huggingface-cache/hub/models--
loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', task='auto', skip_tokenizer_init=False,
tokenizer_mode='auto', chat_template_text_format='string', trust_remote_code=False,
allowed_local_media_path='', download_dir=None, load_format='auto',
config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto',
quantization_param_path=None, seed=0, max_model_len=8192, worker_use_ray=False,
distributed_executor_backend=None, pipeline_parallel_size=1,
tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16,
enable_prefix_caching=True, disable_sliding_window=False,
use_v2_block_manager='true', swap_space=4, cpu_offload_gb=0,
gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256,
max_logprobs=20, disable_log_stats=False, revision=None, code_revision=None,
rope_scaling=None, rope_theta=None, hf_overrides=None, tokenizer_revision=None,
quantization='awq', enforce_eager=False, max_seq_len_to_capture=8192,
disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray',
tokenizer_pool_extra_config=None, limit_mm_per_prompt=None,
mm_processor_kwargs=None, enable_lora=False, enable_lora_bias=False, max_loras=1,
max_lora_rank=16, enable_prompt_adapter=False, max_prompt_adapters=1,
max_prompt_adapter_token=0, fully_sharded_loras=False, lora_extra_vocab_size=256,
long_lora_scaling_factors=None, lora_dtype='auto', max_cpu_loras=None,
device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True,
ray_workers_use_nsight=False, num_gpu_blocks_override=None, num_lookahead_slots=0,
model_loader_extra_config=None, ignore_patterns=None, preemption_mode=None,
scheduler_delay_factor=0.0, enable_chunked_prefill=None,
guided_decoding_backend='outlines', speculative_model=None,
speculative_model_quantization=None, speculative_draft_tensor_parallel_size=None,
num_speculative_tokens=None, speculative_disable_mqa_scorer=False,
speculative_max_model_len=None, speculative_disable_by_batch_size=None,
ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None,
spec_decoding_acceptance_method='rejection_sampler',
typical_acceptance_sampler_posterior_threshold=None,
typical_acceptance_sampler_posterior_alpha=None, qlora_adapter_name_or_path=None,
disable_logprobs_during_spec_decoding=None, otlp_traces_endpoint=None,
collect_detailed_traces=None, disable_async_output_proc=False,
scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None,
disable_log_requests=False)\n
2025-01-10 10:53:36.284 | info | 0a13jc93spuadh | engine_args.py :126 2025-
01-10 13:53:18,340 Using baked in model with args: {'MODEL_NAME':
'/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', 'TOKENIZER_NAME': '/models/huggingface-
cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83'}\n
2025-01-10 10:53:35.723 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 78% Completed | 7/9 [00:09<00:02, 1.30s/it]\n
2025-01-10 10:53:34.471 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 67% Completed | 6/9 [00:07<00:03, 1.32s/it]\n
2025-01-10 10:53:33.204 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 56% Completed | 5/9 [00:06<00:05, 1.34s/it]\n
2025-01-10 10:53:31.929 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 44% Completed | 4/9 [00:05<00:06, 1.38s/it]\n
2025-01-10 10:53:30.648 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 33% Completed | 3/9 [00:04<00:08, 1.45s/it]\n
2025-01-10 10:53:28.834 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 22% Completed | 2/9 [00:02<00:08, 1.15s/it]\n
2025-01-10 10:53:27.577 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 11% Completed | 1/9 [00:00<00:07, 1.01it/s]\n
2025-01-10 10:53:26.588 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 0% Completed | 0/9 [00:00<?, ?it/s]\n
2025-01-10 10:53:26.588 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess pid=228) INFO 01-10 13:53:26 model_runner.py:1072] Starting to load model
/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83...\n
2025-01-10 10:53:26.588 | info | q5xcbyb3ls1u0i | INFO 01-10 13:53:26
model_runner.py:1072] Starting to load model /models/huggingface-cache/hub/models--
loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83...\n
2025-01-10 10:53:26.588 | info | q5xcbyb3ls1u0i | INFO 01-10 13:53:26
shm_broadcast.py:236] vLLM message queue communication handle:
Handle(connect_ip='127.0.0.1', local_reader_ranks=[1],
buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at
0x78019ec69ed0>, local_subscribe_port=55929, remote_subscribe_port=None)\n
2025-01-10 10:53:26.588 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess pid=228) INFO 01-10 13:53:26 custom_all_reduce_utils.py:242] reading GPU P2P
access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json\n
2025-01-10 10:53:26.325 | info | q5xcbyb3ls1u0i | INFO 01-10 13:53:26
custom_all_reduce_utils.py:242] reading GPU P2P access cache from
/root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json\n
2025-01-10 10:53:18.634 | info | q5xcbyb3ls1u0i | INFO 01-10 13:53:18
custom_all_reduce_utils.py:204] generating GPU P2P access cache in
/root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json\n
2025-01-10 10:53:18.634 | info | q5xcbyb3ls1u0i | INFO 01-10 13:53:18
pynccl.py:69] vLLM is using nccl==2.21.5\n
2025-01-10 10:53:18.634 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess pid=228) INFO 01-10 13:53:18 pynccl.py:69] vLLM is using nccl==2.21.5\n
2025-01-10 10:53:18.634 | info | q5xcbyb3ls1u0i | INFO 01-10 13:53:18
utils.py:960] Found nccl from library libnccl.so.2\n
2025-01-10 10:53:18.424 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess pid=228) INFO 01-10 13:53:18 utils.py:960] Found nccl from library
libnccl.so.2\n
2025-01-10 10:53:18.424 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess pid=228) INFO 01-10 13:53:17 multiproc_worker_utils.py:215] Worker ready; awaiting tasks\n
2025-01-10 10:53:18.424 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess pid=228) INFO 01-10 13:53:17 selector.py:135] Using Flash Attention backend.\n
2025-01-10 10:53:17.247 | info | q5xcbyb3ls1u0i | INFO 01-10 13:53:17
selector.py:135] Using Flash Attention backend.\n
2025-01-10 10:53:17.247 | info | q5xcbyb3ls1u0i | INFO 01-10 13:53:17
custom_cache_manager.py:17] Setting Triton cache manager to:
vllm.triton_utils.custom_cache_manager:CustomCacheManager\n
2025-01-10 10:53:17.162 | warning | q5xcbyb3ls1u0i | WARNING 01-10 13:53:17
multiproc_gpu_executor.py:56] Reducing Torch parallelism from 64 threads to 1 to
avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment
to tune this value as needed.\n
2025-01-10 10:53:17.162 | info | q5xcbyb3ls1u0i | INFO 01-10 13:53:16
llm_engine.py:249] Initializing an LLM engine (v0.6.4) with config:
model='/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/
snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', speculative_config=None,
tokenizer='/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/
snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', skip_tokenizer_init=False,
tokenizer_mode=auto, revision=None, override_neuron_config=None,
tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16,
max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO,
tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False,
quantization=awq, enforce_eager=False, kv_cache_dtype=auto,
quantization_param_path=None, device_config=cuda,
decoding_config=DecodingConfig(guided_decoding_backend='outlines'),
observability_config=ObservabilityConfig(otlp_traces_endpoint=None,
collect_model_forward_time=False, collect_model_execute_time=False), seed=0,
served_model_name=/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-
v4.6-AWQ/snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83, num_scheduler_steps=1,
chunked_prefill_enabled=False multi_step_stream_outputs=True,
enable_prefix_caching=True, use_async_output_proc=True, use_cached_outputs=False,
chat_template_text_format=string, mm_processor_kwargs=None, pooler_config=None)\n
2025-01-10 10:53:17.162 | info | q5xcbyb3ls1u0i | INFO 01-10 13:53:16
config.py:1020] Defaulting to use mp for distributed inference\n
2025-01-10 10:53:17.162 | warning | q5xcbyb3ls1u0i | WARNING 01-10 13:53:16
config.py:428] awq quantization is not fully optimized yet. The speed can be slower
than non-quantized models.\n
2025-01-10 10:53:17.162 | info | q5xcbyb3ls1u0i | INFO 01-10 13:53:16
awq_marlin.py:113] Detected that the model can run with awq_marlin, however you
specified quantization=awq explicitly, so forcing awq. Use quantization=awq_marlin
for faster inference\n
2025-01-10 10:53:17.162 | info | q5xcbyb3ls1u0i | INFO 01-10 13:53:16
config.py:350] This model supports multiple tasks: {'generate', 'embedding'}.
Defaulting to 'generate'.\n
2025-01-10 10:53:16.851 | info | q5xcbyb3ls1u0i | tokenizer_name_or_path:
/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83, tokenizer_revision: None,
trust_remote_code: False\n
2025-01-10 10:53:16.850 | info | q5xcbyb3ls1u0i | engine.py :26 2025-
01-10 13:53:12,979 Engine args:
AsyncEngineArgs(model='/models/huggingface-cache/hub/models--loremipsum3658--
Bacchus-v4.6-AWQ/snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83',
served_model_name=None, tokenizer='/models/huggingface-cache/hub/models--
loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', task='auto', skip_tokenizer_init=False,
tokenizer_mode='auto', chat_template_text_format='string', trust_remote_code=False,
allowed_local_media_path='', download_dir=None, load_format='auto',
config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto',
quantization_param_path=None, seed=0, max_model_len=8192, worker_use_ray=False,
distributed_executor_backend=None, pipeline_parallel_size=1,
tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16,
enable_prefix_caching=True, disable_sliding_window=False,
use_v2_block_manager='true', swap_space=4, cpu_offload_gb=0,
gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256,
max_logprobs=20, disable_log_stats=False, revision=None, code_revision=None,
rope_scaling=None, rope_theta=None, hf_overrides=None, tokenizer_revision=None,
quantization='awq', enforce_eager=False, max_seq_len_to_capture=8192,
disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray',
tokenizer_pool_extra_config=None, limit_mm_per_prompt=None,
mm_processor_kwargs=None, enable_lora=False, enable_lora_bias=False, max_loras=1,
max_lora_rank=16, enable_prompt_adapter=False, max_prompt_adapters=1,
max_prompt_adapter_token=0, fully_sharded_loras=False, lora_extra_vocab_size=256,
long_lora_scaling_factors=None, lora_dtype='auto', max_cpu_loras=None,
device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True,
ray_workers_use_nsight=False, num_gpu_blocks_override=None, num_lookahead_slots=0,
model_loader_extra_config=None, ignore_patterns=None, preemption_mode=None,
scheduler_delay_factor=0.0, enable_chunked_prefill=None,
guided_decoding_backend='outlines', speculative_model=None,
speculative_model_quantization=None, speculative_draft_tensor_parallel_size=None,
num_speculative_tokens=None, speculative_disable_mqa_scorer=False,
speculative_max_model_len=None, speculative_disable_by_batch_size=None,
ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None,
spec_decoding_acceptance_method='rejection_sampler',
typical_acceptance_sampler_posterior_threshold=None,
typical_acceptance_sampler_posterior_alpha=None, qlora_adapter_name_or_path=None,
disable_logprobs_during_spec_decoding=None, otlp_traces_endpoint=None,
collect_detailed_traces=None, disable_async_output_proc=False,
scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None,
disable_log_requests=False)\n
2025-01-10 10:53:12.942 | info | q5xcbyb3ls1u0i | engine_args.py :126 2025-
01-10 13:53:12,941 Using baked in model with args: {'MODEL_NAME':
'/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', 'TOKENIZER_NAME': '/models/huggingface-
cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83'}\n
2025-01-10 10:51:05.935 | info | q5xcbyb3ls1u0i |
warnings.warn('resource_tracker: There appear to be %d '\n
2025-01-10 10:51:04.561 | info | q5xcbyb3ls1u0i |
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning:
resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at
shutdown\n
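The resource_tracker warning above refers to a SharedMemory segment that was never unlinked before shutdown. A generic sketch of the expected cleanup pattern (not code from this worker):

    from multiprocessing import shared_memory

    shm = shared_memory.SharedMemory(create=True, size=1024)
    try:
        shm.buf[:5] = b"hello"   # ... use the segment ...
    finally:
        shm.close()              # drop this process's mapping
        shm.unlink()             # remove the segment, so nothing is left for the resource_tracker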
2025-01-10 10:51:03.196 | warning | q5xcbyb3ls1u0i | [rank0]:[W110
13:51:03.481207011 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has
NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the
application should call destroy_process_group to ensure that any pending NCCL
operations have finished in this process. In rare cases this process can exit
before this point and block the progress of another member of the process group.
This constraint has always been present, but this warning has only been added
since PyTorch 2.4 (function operator())\n
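The ProcessGroupNCCL warning above names its own fix: call destroy_process_group before exit. A minimal sketch with torch.distributed (initialization details are assumed, not taken from this worker):

    import torch.distributed as dist

    def shutdown_distributed() -> None:
        if dist.is_available() and dist.is_initialized():
            dist.barrier()                # let pending NCCL work drain first
            dist.destroy_process_group()  # what the warning asks callers to do on normal exit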
2025-01-10 10:51:02.661 | info | q5xcbyb3ls1u0i | INFO 01-10 13:51:02
multiproc_worker_utils.py:120] Killing local vLLM worker processes\n
2025-01-10 10:51:02.661 | info | q5xcbyb3ls1u0i | INFO 01-10 13:51:00
async_llm_engine.py:62] Engine is gracefully shutting down.\n
2025-01-10 10:51:02.661 | info | q5xcbyb3ls1u0i | Finished.
2025-01-10 10:51:00.983 | error | q5xcbyb3ls1u0i | Failed to return job results. |
400, message='Bad Request', url='https://api.runpod.ai/v2/vllm-vw4dt9p5fckyh4/job-
done/q5xcbyb3ls1u0i/efd80437-4eac-4482-bd95-83a96461c99e-u1?
gpu=NVIDIA+RTX+6000+Ada+Generation&isStream=true'
2025-01-10 10:51:00.861 | info | q5xcbyb3ls1u0i | Finished running generator.
2025-01-10 10:51:00.753 | info | q5xcbyb3ls1u0i | INFO 01-10 13:51:00
async_llm_engine.py:176] Finished request 5696d6de05a043a9892f8b91e7272e63.\n
2025-01-10 10:51:00.753 | info | q5xcbyb3ls1u0i | INFO 01-10 13:50:59
metrics.py:465] Prefix cache hit rate: GPU: 66.58%, CPU: 0.00%\n
2025-01-10 10:50:59.675 | info | q5xcbyb3ls1u0i | INFO 01-10 13:50:59
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.6%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:50:59.675 | info | q5xcbyb3ls1u0i | INFO 01-10 13:50:54
metrics.py:465] Prefix cache hit rate: GPU: 66.58%, CPU: 0.00%\n
2025-01-10 10:50:54.647 | info | q5xcbyb3ls1u0i | INFO 01-10 13:50:54
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.5%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:50:51.034 | info | q5xcbyb3ls1u0i | Kill worker.
2025-01-10 10:50:51.034 | info | q5xcbyb3ls1u0i | INFO 01-10 13:50:49
metrics.py:465] Prefix cache hit rate: GPU: 66.58%, CPU: 0.00%\n
2025-01-10 10:50:49.644 | info | q5xcbyb3ls1u0i | INFO 01-10 13:50:49
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.4%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:50:49.644 | info | q5xcbyb3ls1u0i | INFO 01-10 13:50:44
metrics.py:465] Prefix cache hit rate: GPU: 66.58%, CPU: 0.00%\n
2025-01-10 10:50:44.615 | info | q5xcbyb3ls1u0i | INFO 01-10 13:50:44
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.3%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:50:44.615 | info | q5xcbyb3ls1u0i | INFO 01-10 13:50:39
metrics.py:465] Prefix cache hit rate: GPU: 66.58%, CPU: 0.00%\n
2025-01-10 10:50:39.611 | info | q5xcbyb3ls1u0i | INFO 01-10 13:50:39
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.1%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:50:39.611 | info | q5xcbyb3ls1u0i | Jobs in progress: 1
2025-01-10 10:50:39.611 | info | q5xcbyb3ls1u0i | Finished running generator.
2025-01-10 10:50:34.721 | info | q5xcbyb3ls1u0i | Finished.
2025-01-10 10:50:34.721 | info | q5xcbyb3ls1u0i | INFO 01-10 13:50:34
metrics.py:465] Prefix cache hit rate: GPU: 66.58%, CPU: 0.00%\n
2025-01-10 10:50:34.721 | info | q5xcbyb3ls1u0i | INFO 01-10 13:50:34
metrics.py:449] Avg prompt throughput: 3.0 tokens/s, Avg generation throughput:
34.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.0%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:50:34.721 | info | q5xcbyb3ls1u0i | Jobs in progress: 2
2025-01-10 10:50:34.721 | info | q5xcbyb3ls1u0i | Finished.
2025-01-10 10:50:34.721 | error | q5xcbyb3ls1u0i | n must be an int, but is of
type <class 'NoneType'>
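The "n must be an int" error above looks like a request that reached the sampling parameters with n=None. A hedged guess at a guard (the request shape is an assumption based only on this message):

    from vllm import SamplingParams

    def build_sampling_params(raw: dict) -> SamplingParams:
        # Drop None-valued fields so an absent "n" falls back to the library default
        # instead of failing validation with the error logged above.
        cleaned = {k: v for k, v in raw.items() if v is not None}
        return SamplingParams(**cleaned)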
2025-01-10 10:50:34.721 | info | q5xcbyb3ls1u0i | INFO 01-10 13:50:34
async_llm_engine.py:208] Added request 5696d6de05a043a9892f8b91e7272e63.\n
2025-01-10 10:50:34.721 | info | q5xcbyb3ls1u0i | Jobs in progress: 3
2025-01-10 10:50:34.563 | info | q5xcbyb3ls1u0i | Finished.
2025-01-10 10:50:34.563 | info | q5xcbyb3ls1u0i | Finished running generator.
2025-01-10 10:50:34.459 | info | q5xcbyb3ls1u0i | Finished running generator.
2025-01-10 10:50:31.802 | info | q5xcbyb3ls1u0i | INFO 01-10 13:50:31
async_llm_engine.py:176] Finished request ba1d0bc7343446598f0d8a09a8e54953.\n
2025-01-10 10:50:30.284 | info | q5xcbyb3ls1u0i | INFO 01-10 13:50:30
async_llm_engine.py:176] Finished request e3f164415a914240a3137449e682cffb.\n
2025-01-10 10:50:30.284 | info | q5xcbyb3ls1u0i | INFO 01-10 13:50:27
metrics.py:465] Prefix cache hit rate: GPU: 66.54%, CPU: 0.00%\n
2025-01-10 10:50:27.033 | info | q5xcbyb3ls1u0i | INFO 01-10 13:50:27
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
66.4 tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 1.0%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:50:27.033 | info | q5xcbyb3ls1u0i | INFO 01-10 13:50:22
metrics.py:465] Prefix cache hit rate: GPU: 66.54%, CPU: 0.00%\n
2025-01-10 10:50:22.030 | info | q5xcbyb3ls1u0i | INFO 01-10 13:50:22
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
66.5 tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.7%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:50:22.030 | info | q5xcbyb3ls1u0i | INFO 01-10 13:50:17
metrics.py:465] Prefix cache hit rate: GPU: 66.54%, CPU: 0.00%\n
2025-01-10 10:50:17.011 | info | q5xcbyb3ls1u0i | INFO 01-10 13:50:17
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
66.3 tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.5%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:50:12.866 | info | q5xcbyb3ls1u0i | Jobs in queue: 2
2025-01-10 10:50:12.494 | info | q5xcbyb3ls1u0i | Jobs in queue: 1
2025-01-10 10:50:12.494 | info | q5xcbyb3ls1u0i | INFO 01-10 13:50:12
metrics.py:465] Prefix cache hit rate: GPU: 66.54%, CPU: 0.00%\n
2025-01-10 10:50:12.004 | info | q5xcbyb3ls1u0i | INFO 01-10 13:50:12
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
66.3 tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.2%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:50:12.004 | info | q5xcbyb3ls1u0i | INFO 01-10 13:50:06
metrics.py:465] Prefix cache hit rate: GPU: 66.54%, CPU: 0.00%\n
2025-01-10 10:50:06.999 | info | q5xcbyb3ls1u0i | INFO 01-10 13:50:06
metrics.py:449] Avg prompt throughput: 0.1 tokens/s, Avg generation throughput: 0.1
tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage:
0.0%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:50:06.999 | info | q5xcbyb3ls1u0i | INFO 01-10 13:50:06
async_llm_engine.py:208] Added request e3f164415a914240a3137449e682cffb.\n
2025-01-10 10:50:06.999 | info | q5xcbyb3ls1u0i | INFO 01-10 13:50:06
async_llm_engine.py:208] Added request ba1d0bc7343446598f0d8a09a8e54953.\n
2025-01-10 10:50:06.999 | info | q5xcbyb3ls1u0i | Jobs in progress: 2
2025-01-10 10:50:06.934 | info | q5xcbyb3ls1u0i | Jobs in queue: 2
2025-01-10 10:43:33.752 | info | q5xcbyb3ls1u0i | Finished.
2025-01-10 10:43:33.655 | info | q5xcbyb3ls1u0i | Finished running generator.
2025-01-10 10:43:33.510 | info | q5xcbyb3ls1u0i | INFO 01-10 13:43:33
async_llm_engine.py:176] Finished request a1efcafc7b714dbe9c156914a037f729.\n
2025-01-10 10:43:33.510 | info | q5xcbyb3ls1u0i | INFO 01-10 13:43:32
metrics.py:465] Prefix cache hit rate: GPU: 66.46%, CPU: 0.00%\n
2025-01-10 10:43:32.190 | info | q5xcbyb3ls1u0i | INFO 01-10 13:43:32
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.5%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:43:32.190 | info | q5xcbyb3ls1u0i | INFO 01-10 13:43:27
metrics.py:465] Prefix cache hit rate: GPU: 66.46%, CPU: 0.00%\n
2025-01-10 10:43:27.179 | info | q5xcbyb3ls1u0i | INFO 01-10 13:43:27
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.4%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:43:27.179 | info | q5xcbyb3ls1u0i | INFO 01-10 13:43:22
metrics.py:465] Prefix cache hit rate: GPU: 66.46%, CPU: 0.00%\n
2025-01-10 10:43:22.169 | info | q5xcbyb3ls1u0i | INFO 01-10 13:43:22
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.3%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:43:22.169 | info | q5xcbyb3ls1u0i | INFO 01-10 13:43:17
metrics.py:465] Prefix cache hit rate: GPU: 66.46%, CPU: 0.00%\n
2025-01-10 10:43:17.167 | info | q5xcbyb3ls1u0i | INFO 01-10 13:43:17
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.1%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:43:17.167 | info | q5xcbyb3ls1u0i | INFO 01-10 13:43:12
metrics.py:465] Prefix cache hit rate: GPU: 66.46%, CPU: 0.00%\n
2025-01-10 10:43:12.140 | info | q5xcbyb3ls1u0i | INFO 01-10 13:43:12
metrics.py:449] Avg prompt throughput: 0.5 tokens/s, Avg generation throughput: 3.3
tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage:
0.0%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:43:12.140 | info | q5xcbyb3ls1u0i | INFO 01-10 13:43:12
async_llm_engine.py:208] Added request a1efcafc7b714dbe9c156914a037f729.\n
2025-01-10 10:43:12.140 | info | q5xcbyb3ls1u0i | Jobs in progress: 1
2025-01-10 10:43:12.081 | info | q5xcbyb3ls1u0i | Jobs in queue: 1
2025-01-10 10:42:31.666 | info | q5xcbyb3ls1u0i | Finished.
2025-01-10 10:42:31.572 | info | q5xcbyb3ls1u0i | Finished running generator.
2025-01-10 10:42:31.424 | info | q5xcbyb3ls1u0i | INFO 01-10 13:42:31
async_llm_engine.py:176] Finished request 24e868554299472f88b8710ada09f895.\n
2025-01-10 10:42:31.424 | info | q5xcbyb3ls1u0i | INFO 01-10 13:42:26
metrics.py:465] Prefix cache hit rate: GPU: 66.41%, CPU: 0.00%\n
2025-01-10 10:42:26.966 | info | q5xcbyb3ls1u0i | INFO 01-10 13:42:26
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
32.9 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.5%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:42:26.966 | info | q5xcbyb3ls1u0i | INFO 01-10 13:42:21
metrics.py:465] Prefix cache hit rate: GPU: 66.41%, CPU: 0.00%\n
2025-01-10 10:42:21.946 | info | q5xcbyb3ls1u0i | INFO 01-10 13:42:21
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.4%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:42:21.946 | info | q5xcbyb3ls1u0i | INFO 01-10 13:42:16
metrics.py:465] Prefix cache hit rate: GPU: 66.41%, CPU: 0.00%\n
2025-01-10 10:42:16.925 | info | q5xcbyb3ls1u0i | INFO 01-10 13:42:16
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.3%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:42:16.925 | info | q5xcbyb3ls1u0i | INFO 01-10 13:42:11
metrics.py:465] Prefix cache hit rate: GPU: 66.41%, CPU: 0.00%\n
2025-01-10 10:42:11.906 | info | q5xcbyb3ls1u0i | INFO 01-10 13:42:11
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.1%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:42:11.906 | info | q5xcbyb3ls1u0i | INFO 01-10 13:42:06
metrics.py:465] Prefix cache hit rate: GPU: 66.41%, CPU: 0.00%\n
2025-01-10 10:42:06.891 | info | q5xcbyb3ls1u0i | INFO 01-10 13:42:06
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0
tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage:
0.0%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:42:06.891 | info | q5xcbyb3ls1u0i | INFO 01-10 13:42:06
async_llm_engine.py:208] Added request 24e868554299472f88b8710ada09f895.\n
2025-01-10 10:42:06.891 | info | q5xcbyb3ls1u0i | Jobs in progress: 1
2025-01-10 10:42:06.821 | info | q5xcbyb3ls1u0i | Jobs in queue: 1
2025-01-10 10:38:12.535 | info | 3yn75aympd52v1 |
warnings.warn('resource_tracker: There appear to be %d '\n
2025-01-10 10:38:11.619 | info | 3yn75aympd52v1 |
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning:
resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at
shutdown\n
2025-01-10 10:38:10.729 | warning | 3yn75aympd52v1 | [rank0]:[W110
13:38:10.392588627 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has
NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the
application should call destroy_process_group to ensure that any pending NCCL
operations have finished in this process. In rare cases this process can exit
before this point and block the progress of another member of the process group.
This constraint has always been present, but this warning has only been added
since PyTorch 2.4 (function operator())\n
2025-01-10 10:38:10.729 | info | 3yn75aympd52v1 | INFO 01-10 13:38:10
multiproc_worker_utils.py:120] Killing local vLLM worker processes\n
2025-01-10 10:38:10.319 | error | 3yn75aympd52v1 | ERROR 01-10 13:38:10
multiproc_worker_utils.py:116] Worker VllmWorkerProcess pid 228 died, exit code: -15\n
2025-01-10 10:38:10.319 | info | 3yn75aympd52v1 | Kill worker.
2025-01-10 10:38:09.185 | info | 3yn75aympd52v1 | --- Starting Serverless Worker |
Version 1.7.5 ---\n
2025-01-10 10:38:09.185 | info | 3yn75aympd52v1 | INFO 01-10 13:12:47
chat_utils.py:431] ' }}{% endif %}\n
2025-01-10 10:38:09.185 | info | 3yn75aympd52v1 | INFO 01-10 13:12:47
chat_utils.py:431] \n
2025-01-10 10:38:09.185 | info | 3yn75aympd52v1 | INFO 01-10 13:12:47
chat_utils.py:431] '+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0
== 0 %}{% set content = '<|begin_of_text|>' + content %}{% endif %}{{ content }}{%
endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|
end_header_id|>\n
2025-01-10 10:38:09.185 | info | 3yn75aympd52v1 | INFO 01-10 13:12:47
chat_utils.py:431] \n
2025-01-10 10:38:09.184 | info | 3yn75aympd52v1 | INFO 01-10 13:12:47
chat_utils.py:431] {% set loop_messages = messages %}{% for message in
loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|
end_header_id|>\n
2025-01-10 10:38:09.184 | info | 3yn75aympd52v1 | INFO 01-10 13:12:47
chat_utils.py:431] Using supplied chat template:\n
2025-01-10 10:21:10.112 | info | q5xcbyb3ls1u0i | Finished.
2025-01-10 10:21:09.804 | info | q5xcbyb3ls1u0i | Finished running generator.
2025-01-10 10:21:09.582 | info | q5xcbyb3ls1u0i | INFO 01-10 13:21:09
async_llm_engine.py:176] Finished request 4949b4cdea9345e381c515af15a04d0a.\n
2025-01-10 10:21:09.582 | info | q5xcbyb3ls1u0i | INFO 01-10 13:21:08
metrics.py:465] Prefix cache hit rate: GPU: 66.50%, CPU: 0.00%\n
2025-01-10 10:21:08.115 | info | q5xcbyb3ls1u0i | INFO 01-10 13:21:08
metrics.py:449] Avg prompt throughput: 26.1 tokens/s, Avg generation throughput:
1.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.9%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:21:08.115 | info | q5xcbyb3ls1u0i | INFO 01-10 13:21:08
async_llm_engine.py:208] Added request 4949b4cdea9345e381c515af15a04d0a.\n
2025-01-10 10:21:08.115 | info | q5xcbyb3ls1u0i | Jobs in progress: 1
2025-01-10 10:21:08.046 | info | q5xcbyb3ls1u0i | Jobs in queue: 1
2025-01-10 10:20:24.344 | info | q5xcbyb3ls1u0i | Finished.
2025-01-10 10:20:24.049 | info | q5xcbyb3ls1u0i | Finished running generator.
2025-01-10 10:20:23.820 | info | q5xcbyb3ls1u0i | INFO 01-10 13:20:23
async_llm_engine.py:176] Finished request 85c21cf88ece4591bee714e4b27dd6dc.\n
2025-01-10 10:20:23.820 | info | q5xcbyb3ls1u0i | INFO 01-10 13:20:21
metrics.py:465] Prefix cache hit rate: GPU: 63.06%, CPU: 0.00%\n
2025-01-10 10:20:21.490 | info | q5xcbyb3ls1u0i | INFO 01-10 13:20:21
metrics.py:449] Avg prompt throughput: 18.2 tokens/s, Avg generation throughput:
0.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.8%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:20:21.489 | info | q5xcbyb3ls1u0i | INFO 01-10 13:20:21
async_llm_engine.py:208] Added request 85c21cf88ece4591bee714e4b27dd6dc.\n
2025-01-10 10:20:21.489 | info | q5xcbyb3ls1u0i | Jobs in progress: 1
2025-01-10 10:20:21.414 | info | q5xcbyb3ls1u0i | Jobs in queue: 1
2025-01-10 10:19:40.964 | info | 0a13jc93spuadh | INFO 01-10 13:11:19
pynccl.py:69] vLLM is using nccl==2.21.5\n
2025-01-10 10:19:40.964 | info | 0a13jc93spuadh | INFO 01-10 13:11:19
utils.py:960] Found nccl from library libnccl.so.2\n
2025-01-10 10:19:40.964 | info | 0a13jc93spuadh | (VllmWorkerProcess pid=228) INFO 01-10 13:11:19 pynccl.py:69] vLLM is using nccl==2.21.5\n
2025-01-10 10:19:21.643 | info | q5xcbyb3ls1u0i | Finished.
2025-01-10 10:19:21.421 | info | q5xcbyb3ls1u0i | Finished running generator.
2025-01-10 10:19:21.186 | info | q5xcbyb3ls1u0i | INFO 01-10 13:19:21
async_llm_engine.py:176] Finished request 66db82b88d184b9bb360d41144af72f9.\n
2025-01-10 10:19:21.186 | info | q5xcbyb3ls1u0i | INFO 01-10 13:19:19
metrics.py:465] Prefix cache hit rate: GPU: 59.19%, CPU: 0.00%\n
2025-01-10 10:19:19.924 | info | q5xcbyb3ls1u0i | INFO 01-10 13:19:19
metrics.py:449] Avg prompt throughput: 9.9 tokens/s, Avg generation throughput: 0.7
tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage:
0.7%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:19:19.924 | info | q5xcbyb3ls1u0i | INFO 01-10 13:19:19
async_llm_engine.py:208] Added request 66db82b88d184b9bb360d41144af72f9.\n
2025-01-10 10:19:19.924 | info | q5xcbyb3ls1u0i | Jobs in progress: 1
2025-01-10 10:19:19.778 | info | q5xcbyb3ls1u0i | Jobs in queue: 1
2025-01-10 10:12:56.795 | info | q5xcbyb3ls1u0i | Finished.
2025-01-10 10:12:56.795 | info | q5xcbyb3ls1u0i | Jobs in progress: 1
2025-01-10 10:12:56.695 | info | q5xcbyb3ls1u0i | Finished.
2025-01-10 10:12:56.695 | info | q5xcbyb3ls1u0i | Jobs in progress: 2
2025-01-10 10:12:56.494 | info | q5xcbyb3ls1u0i | Finished.
2025-01-10 10:12:56.294 | info | q5xcbyb3ls1u0i | Finished running generator.
2025-01-10 10:12:56.294 | info | q5xcbyb3ls1u0i | Finished running generator.
2025-01-10 10:12:56.183 | info | q5xcbyb3ls1u0i | Finished running generator.
2025-01-10 10:12:56.062 | info | q5xcbyb3ls1u0i | INFO 01-10 13:12:56
async_llm_engine.py:176] Finished request dd6e24d3701a49f5868710e068277e58.\n
2025-01-10 10:12:56.062 | info | q5xcbyb3ls1u0i | INFO 01-10 13:12:55
async_llm_engine.py:176] Finished request 18cf0fcd4eaf48d7ae15e4bf3d6a84fd.\n
2025-01-10 10:12:55.952 | info | q5xcbyb3ls1u0i | INFO 01-10 13:12:55
async_llm_engine.py:176] Finished request 82b21d1b42f54ce9aaa89837d52a8c91.\n
2025-01-10 10:12:55.952 | info | q5xcbyb3ls1u0i | Jobs in progress: 3
2025-01-10 10:12:54.584 | info | q5xcbyb3ls1u0i | Finished.
2025-01-10 10:12:54.232 | info | q5xcbyb3ls1u0i | Finished running generator.
2025-01-10 10:12:53.753 | info | q5xcbyb3ls1u0i | INFO 01-10 13:12:53
async_llm_engine.py:176] Finished request c705e34266f54a2086b1ab660a7c7a44.\n
2025-01-10 10:12:53.753 | info | q5xcbyb3ls1u0i | INFO 01-10 13:12:53
metrics.py:465] Prefix cache hit rate: GPU: 55.38%, CPU: 0.00%\n
2025-01-10 10:12:53.201 | info | q5xcbyb3ls1u0i | INFO 01-10 13:12:53
metrics.py:449] Avg prompt throughput: 623.0 tokens/s, Avg generation throughput:
0.3 tokens/s, Running: 4 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 2.9%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:12:53.201 | info | q5xcbyb3ls1u0i | INFO 01-10 13:12:47
async_llm_engine.py:208] Added request 18cf0fcd4eaf48d7ae15e4bf3d6a84fd.\n
2025-01-10 10:12:53.201 | info | q5xcbyb3ls1u0i | INFO 01-10 13:12:47
async_llm_engine.py:208] Added request 82b21d1b42f54ce9aaa89837d52a8c91.\n
2025-01-10 10:12:53.200 | info | q5xcbyb3ls1u0i | INFO 01-10 13:12:47
async_llm_engine.py:208] Added request c705e34266f54a2086b1ab660a7c7a44.\n
2025-01-10 10:12:53.200 | info | q5xcbyb3ls1u0i | INFO 01-10 13:12:47
async_llm_engine.py:208] Added request dd6e24d3701a49f5868710e068277e58.\n
2025-01-10 10:12:53.200 | info | q5xcbyb3ls1u0i | Jobs in progress: 4
2025-01-10 10:12:53.200 | info | q5xcbyb3ls1u0i | Jobs in queue: 4
2025-01-10 10:12:47.960 | info | q5xcbyb3ls1u0i | --- Starting Serverless Worker |
Version 1.7.5 ---\n
2025-01-10 10:12:47.960 | info | q5xcbyb3ls1u0i | INFO 01-10 13:12:43
chat_utils.py:431] ' }}{% endif %}\n
2025-01-10 10:12:47.960 | info | q5xcbyb3ls1u0i | INFO 01-10 13:12:43
chat_utils.py:431] \n
2025-01-10 10:12:47.960 | info | q5xcbyb3ls1u0i | INFO 01-10 13:12:43
chat_utils.py:431] '+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0
== 0 %}{% set content = '<|begin_of_text|>' + content %}{% endif %}{{ content }}{%
endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|
end_header_id|>\n
2025-01-10 10:12:47.960 | info | q5xcbyb3ls1u0i | INFO 01-10 13:12:43
chat_utils.py:431] \n
2025-01-10 10:12:47.960 | info | q5xcbyb3ls1u0i | INFO 01-10 13:12:43
chat_utils.py:431] {% set loop_messages = messages %}{% for message in
loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|
end_header_id|>\n
2025-01-10 10:12:47.959 | info | q5xcbyb3ls1u0i | INFO 01-10 13:12:43
chat_utils.py:431] Using supplied chat template:\n
2025-01-10 10:12:47.452 | info | 3yn75aympd52v1 | tokenizer_name_or_path:
/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83, tokenizer_revision: None,
trust_remote_code: False\n
2025-01-10 10:12:47.452 | info | 3yn75aympd52v1 | engine.py :26 2025-
01-10 13:12:47,162 Engine args:
AsyncEngineArgs(model='/models/huggingface-cache/hub/models--loremipsum3658--
Bacchus-v4.6-AWQ/snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83',
served_model_name=None, tokenizer='/models/huggingface-cache/hub/models--
loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', task='auto', skip_tokenizer_init=False,
tokenizer_mode='auto', chat_template_text_format='string', trust_remote_code=False,
allowed_local_media_path='', download_dir=None, load_format='auto',
config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto',
quantization_param_path=None, seed=0, max_model_len=8192, worker_use_ray=False,
distributed_executor_backend=None, pipeline_parallel_size=1,
tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16,
enable_prefix_caching=True, disable_sliding_window=False,
use_v2_block_manager='true', swap_space=4, cpu_offload_gb=0,
gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256,
max_logprobs=20, disable_log_stats=False, revision=None, code_revision=None,
rope_scaling=None, rope_theta=None, hf_overrides=None, tokenizer_revision=None,
quantization='awq', enforce_eager=False, max_seq_len_to_capture=8192,
disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray',
tokenizer_pool_extra_config=None, limit_mm_per_prompt=None,
mm_processor_kwargs=None, enable_lora=False, enable_lora_bias=False, max_loras=1,
max_lora_rank=16, enable_prompt_adapter=False, max_prompt_adapters=1,
max_prompt_adapter_token=0, fully_sharded_loras=False, lora_extra_vocab_size=256,
long_lora_scaling_factors=None, lora_dtype='auto', max_cpu_loras=None,
device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True,
ray_workers_use_nsight=False, num_gpu_blocks_override=None, num_lookahead_slots=0,
model_loader_extra_config=None, ignore_patterns=None, preemption_mode=None,
scheduler_delay_factor=0.0, enable_chunked_prefill=None,
guided_decoding_backend='outlines', speculative_model=None,
speculative_model_quantization=None, speculative_draft_tensor_parallel_size=None,
num_speculative_tokens=None, speculative_disable_mqa_scorer=False,
speculative_max_model_len=None, speculative_disable_by_batch_size=None,
ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None,
spec_decoding_acceptance_method='rejection_sampler',
typical_acceptance_sampler_posterior_threshold=None,
typical_acceptance_sampler_posterior_alpha=None, qlora_adapter_name_or_path=None,
disable_logprobs_during_spec_decoding=None, otlp_traces_endpoint=None,
collect_detailed_traces=None, disable_async_output_proc=False,
scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None,
disable_log_requests=False)\n
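The "Engine args" record above lists the full AsyncEngineArgs this worker started with. As a minimal sketch (not the worker's actual startup code) of building an async engine from the key values seen in that record, assuming vLLM 0.6.4 as logged:

from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

# Values copied from the Engine args record above; the snapshot path is the baked-in model.
engine_args = AsyncEngineArgs(
    model="/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/"
          "snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83",
    quantization="awq",
    tensor_parallel_size=2,
    max_model_len=8192,
    gpu_memory_utilization=0.9,
    enable_prefix_caching=True,
    swap_space=4,
)
engine = AsyncLLMEngine.from_engine_args(engine_args)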
2025-01-10 10:12:47.451 | info | 3yn75aympd52v1 | engine_args.py :126 2025-
01-10 13:12:47,157 Using baked in model with args: {'MODEL_NAME':
'/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', 'TOKENIZER_NAME': '/models/huggingface-
cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83'}\n
2025-01-10 10:12:47.158 | info | 3yn75aympd52v1 | engine.py :112 2025-
01-10 13:12:47,157 Initialized vLLM engine in 70.69s\n
2025-01-10 10:12:47.158 | info | 3yn75aympd52v1 | INFO 01-10 13:12:46
model_runner.py:1518] Graph capturing finished in 28 secs, took 0.99 GiB\n
2025-01-10 10:12:47.158 | info | 3yn75aympd52v1 | (VllmWorkerProcess pid=228) INFO 01-10 13:12:46 model_runner.py:1518] Graph capturing finished in 28 secs, took 0.99 GiB\n
2025-01-10 10:12:46.819 | info | 3yn75aympd52v1 | (VllmWorkerProcess pid=228) INFO 01-10 13:12:46 custom_all_reduce.py:224] Registering 5635 cuda graph addresses\n
2025-01-10 10:12:46.757 | info | 3yn75aympd52v1 | INFO 01-10 13:12:46
custom_all_reduce.py:224] Registering 5635 cuda graph addresses\n
2025-01-10 10:12:46.757 | info | 3yn75aympd52v1 | INFO 01-10 13:12:18
model_runner.py:1404] If out-of-memory error occurs during cudagraph capture,
consider decreasing `gpu_memory_utilization` or switching to eager mode. You can
also reduce the `max_num_seqs` as needed to decrease memory usage.\n
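The model_runner.py:1404 hint above names the knobs to reach for if cudagraph capture runs out of memory. A minimal sketch of applying them through engine arguments; the specific values are illustrative assumptions, not settings from this deployment (which keeps gpu_memory_utilization=0.9, enforce_eager=False, max_num_seqs=256):

from vllm.engine.arg_utils import AsyncEngineArgs

args = AsyncEngineArgs(
    model="...",                   # model path elided; see the Engine args records in this log
    enforce_eager=True,            # skip cudagraph capture entirely
    gpu_memory_utilization=0.85,   # leave more headroom than the default 0.9
    max_num_seqs=128,              # fewer concurrent sequences to capture graphs for
)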
2025-01-10 10:12:43.039 | info | q5xcbyb3ls1u0i | tokenizer_name_or_path:
/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83, tokenizer_revision: None,
trust_remote_code: False\n
2025-01-10 10:12:43.039 | info | q5xcbyb3ls1u0i | engine.py :26 2025-
01-10 13:12:42,779 Engine args:
AsyncEngineArgs(model='/models/huggingface-cache/hub/models--loremipsum3658--
Bacchus-v4.6-AWQ/snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83',
served_model_name=None, tokenizer='/models/huggingface-cache/hub/models--
loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', task='auto', skip_tokenizer_init=False,
tokenizer_mode='auto', chat_template_text_format='string', trust_remote_code=False,
allowed_local_media_path='', download_dir=None, load_format='auto',
config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto',
quantization_param_path=None, seed=0, max_model_len=8192, worker_use_ray=False,
distributed_executor_backend=None, pipeline_parallel_size=1,
tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16,
enable_prefix_caching=True, disable_sliding_window=False,
use_v2_block_manager='true', swap_space=4, cpu_offload_gb=0,
gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256,
max_logprobs=20, disable_log_stats=False, revision=None, code_revision=None,
rope_scaling=None, rope_theta=None, hf_overrides=None, tokenizer_revision=None,
quantization='awq', enforce_eager=False, max_seq_len_to_capture=8192,
disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray',
tokenizer_pool_extra_config=None, limit_mm_per_prompt=None,
mm_processor_kwargs=None, enable_lora=False, enable_lora_bias=False, max_loras=1,
max_lora_rank=16, enable_prompt_adapter=False, max_prompt_adapters=1,
max_prompt_adapter_token=0, fully_sharded_loras=False, lora_extra_vocab_size=256,
long_lora_scaling_factors=None, lora_dtype='auto', max_cpu_loras=None,
device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True,
ray_workers_use_nsight=False, num_gpu_blocks_override=None, num_lookahead_slots=0,
model_loader_extra_config=None, ignore_patterns=None, preemption_mode=None,
scheduler_delay_factor=0.0, enable_chunked_prefill=None,
guided_decoding_backend='outlines', speculative_model=None,
speculative_model_quantization=None, speculative_draft_tensor_parallel_size=None,
num_speculative_tokens=None, speculative_disable_mqa_scorer=False,
speculative_max_model_len=None, speculative_disable_by_batch_size=None,
ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None,
spec_decoding_acceptance_method='rejection_sampler',
typical_acceptance_sampler_posterior_threshold=None,
typical_acceptance_sampler_posterior_alpha=None, qlora_adapter_name_or_path=None,
disable_logprobs_during_spec_decoding=None, otlp_traces_endpoint=None,
collect_detailed_traces=None, disable_async_output_proc=False,
scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None,
disable_log_requests=False)\n
2025-01-10 10:12:43.039 | info | q5xcbyb3ls1u0i | engine_args.py :126 2025-
01-10 13:12:42,774 Using baked in model with args: {'MODEL_NAME':
'/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', 'TOKENIZER_NAME': '/models/huggingface-
cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83'}\n
2025-01-10 10:12:42.774 | info | q5xcbyb3ls1u0i | engine.py :112 2025-
01-10 13:12:42,773 Initialized vLLM engine in 67.10s\n
2025-01-10 10:12:42.774 | info | q5xcbyb3ls1u0i | INFO 01-10 13:12:42
model_runner.py:1518] Graph capturing finished in 24 secs, took 0.99 GiB\n
2025-01-10 10:12:42.774 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess pid=228) INFO 01-10 13:12:42 model_runner.py:1518] Graph capturing finished in 24 secs, took 0.99 GiB\n
2025-01-10 10:12:42.460 | info | q5xcbyb3ls1u0i | INFO 01-10 13:12:42
custom_all_reduce.py:224] Registering 5635 cuda graph addresses\n
2025-01-10 10:12:40.414 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess pid=228) INFO 01-10 13:12:40 custom_all_reduce.py:224] Registering 5635 cuda graph addresses\n
2025-01-10 10:12:40.414 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess pid=228) INFO 01-10 13:12:18 model_runner.py:1404] If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.\n
2025-01-10 10:12:40.414 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess pid=228) INFO 01-10 13:12:18 model_runner.py:1400] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.\n
2025-01-10 10:12:40.414 | info | q5xcbyb3ls1u0i | INFO 01-10 13:12:18
model_runner.py:1404] If out-of-memory error occurs during cudagraph capture,
consider decreasing `gpu_memory_utilization` or switching to eager mode. You can
also reduce the `max_num_seqs` as needed to decrease memory usage.\n
2025-01-10 10:12:18.934 | info | 3yn75aympd52v1 | INFO 01-10 13:12:18
model_runner.py:1400] Capturing cudagraphs for decoding. This may lead to
unexpected consequences if the model is not static. To run the model in eager mode,
set 'enforce_eager=True' or use '--enforce-eager' in the CLI.\n
2025-01-10 10:12:18.934 | info | 3yn75aympd52v1 | (VllmWorkerProcess pid=228) INFO 01-10 13:12:18 model_runner.py:1404] If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.\n
2025-01-10 10:12:18.613 | info | 3yn75aympd52v1 | (VllmWorkerProcess pid=228) INFO 01-10 13:12:18 model_runner.py:1400] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.\n
2025-01-10 10:12:18.613 | info | 3yn75aympd52v1 | INFO 01-10 13:12:16
distributed_gpu_executor.py:61] Maximum concurrency for 8192 tokens per request:
15.62x\n
2025-01-10 10:12:18.386 | info | q5xcbyb3ls1u0i | INFO 01-10 13:12:18
model_runner.py:1400] Capturing cudagraphs for decoding. This may lead to
unexpected consequences if the model is not static. To run the model in eager mode,
set 'enforce_eager=True' or use '--enforce-eager' in the CLI.\n
2025-01-10 10:12:18.386 | info | q5xcbyb3ls1u0i | INFO 01-10 13:12:15
distributed_gpu_executor.py:61] Maximum concurrency for 8192 tokens per request:
17.62x\n
2025-01-10 10:12:16.562 | info | 3yn75aympd52v1 | INFO 01-10 13:12:16
distributed_gpu_executor.py:57] # GPU blocks: 7999, # CPU blocks: 1638\n
2025-01-10 10:12:16.407 | info | 3yn75aympd52v1 | INFO 01-10 13:12:16
worker.py:232] Memory profiling results: total_gpu_memory=44.34GiB
initial_memory_usage=19.04GiB peak_torch_memory=19.88GiB
memory_usage_post_profile=19.09Gib non_torch_memory=0.50GiB kv_cache_size=19.53GiB
gpu_memory_utilization=0.90\n
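The worker.py:232, distributed_gpu_executor.py:57 and :61 records above fit together arithmetically. A small worked check of those numbers (taken from rank 0 of worker 3yn75aympd52v1; block_size=16 comes from the Engine args record):

# Numbers copied from the log records above.
total_gpu_memory = 44.34          # GiB
gpu_memory_utilization = 0.90
peak_torch_memory = 19.88         # GiB
non_torch_memory = 0.50           # GiB

kv_cache_size = total_gpu_memory * gpu_memory_utilization - peak_torch_memory - non_torch_memory
print(round(kv_cache_size, 2))    # ~19.53, matching kv_cache_size=19.53GiB

num_gpu_blocks = 7999             # from "# GPU blocks: 7999"
block_size = 16                   # tokens per KV-cache block
max_model_len = 8192
max_concurrency = num_gpu_blocks * block_size / max_model_len
print(round(max_concurrency, 2))  # ~15.62, matching "Maximum concurrency ... 15.62x"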
2025-01-10 10:12:16.345 | info | 3yn75aympd52v1 | (VllmWorkerProcess pid=228) INFO 01-10 13:12:16 worker.py:232] Memory profiling results: total_gpu_memory=44.34GiB initial_memory_usage=19.04GiB peak_torch_memory=19.84GiB memory_usage_post_profile=19.09Gib non_torch_memory=0.50GiB kv_cache_size=19.57GiB gpu_memory_utilization=0.90\n
2025-01-10 10:12:16.345 | info | 3yn75aympd52v1 | INFO 01-10 13:12:08
model_runner.py:1077] Loading model weights took 18.5769 GB\n
2025-01-10 10:12:15.445 | info | q5xcbyb3ls1u0i | INFO 01-10 13:12:15
distributed_gpu_executor.py:57] # GPU blocks: 9020, # CPU blocks: 1638\n
2025-01-10 10:12:15.272 | info | q5xcbyb3ls1u0i | INFO 01-10 13:12:15
worker.py:232] Memory profiling results: total_gpu_memory=47.50GiB
initial_memory_usage=19.30GiB peak_torch_memory=19.88GiB
memory_usage_post_profile=19.44Gib non_torch_memory=0.85GiB kv_cache_size=22.02GiB
gpu_memory_utilization=0.90\n
2025-01-10 10:12:15.217 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess pid=228) INFO 01-10 13:12:15 worker.py:232] Memory profiling results: total_gpu_memory=47.50GiB initial_memory_usage=19.30GiB peak_torch_memory=19.84GiB memory_usage_post_profile=19.42Gib non_torch_memory=0.83GiB kv_cache_size=22.08GiB gpu_memory_utilization=0.90\n
2025-01-10 10:12:10.318 | info | q5xcbyb3ls1u0i | INFO 01-10 13:12:10
model_runner.py:1077] Loading model weights took 18.5769 GB\n
2025-01-10 10:12:10.318 | info | q5xcbyb3ls1u0i | \n
2025-01-10 10:12:10.318 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 100% Completed | 9/9 [00:21<00:00, 2.35s/it]\n
2025-01-10 10:12:10.030 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 100% Completed | 9/9 [00:21<00:00, 2.19s/it]\n
2025-01-10 10:12:09.806 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess pid=228) INFO 01-10 13:12:09 model_runner.py:1077] Loading model weights took 18.5769 GB\n
2025-01-10 10:12:08.613 | info | 3yn75aympd52v1 | (VllmWorkerProcess pid=228) INFO 01-10 13:12:08 model_runner.py:1077] Loading model weights took 18.5769 GB\n
2025-01-10 10:12:08.612 | info | 3yn75aympd52v1 | \n
2025-01-10 10:12:08.612 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 100% Completed | 9/9 [00:14<00:00, 1.66s/it]\n
2025-01-10 10:12:08.481 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 89% Completed | 8/9 [00:19<00:02, 2.48s/it]\n
2025-01-10 10:12:08.449 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 100% Completed | 9/9 [00:14<00:00, 1.29s/it]\n
2025-01-10 10:12:08.142 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 89% Completed | 8/9 [00:14<00:01, 1.73s/it]\n
2025-01-10 10:12:06.827 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 78% Completed | 7/9 [00:13<00:03, 1.93s/it]\n
2025-01-10 10:12:06.358 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 78% Completed | 7/9 [00:17<00:05, 2.64s/it]\n
2025-01-10 10:12:04.874 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 67% Completed | 6/9 [00:11<00:05, 1.91s/it]\n
2025-01-10 10:12:03.580 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 67% Completed | 6/9 [00:14<00:07, 2.58s/it]\n
2025-01-10 10:12:02.941 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 56% Completed | 5/9 [00:09<00:07, 1.90s/it]\n
2025-01-10 10:12:01.013 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 44% Completed | 4/9 [00:07<00:09, 1.89s/it]\n
2025-01-10 10:12:00.935 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 56% Completed | 5/9 [00:12<00:10, 2.54s/it]\n
2025-01-10 10:11:59.227 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 33% Completed | 3/9 [00:05<00:11, 1.95s/it]\n
2025-01-10 10:11:58.142 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 44% Completed | 4/9 [00:09<00:12, 2.40s/it]\n
2025-01-10 10:11:57.252 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 22% Completed | 2/9 [00:03<00:13, 1.93s/it]\n
2025-01-10 10:11:55.487 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 33% Completed | 3/9 [00:06<00:13, 2.24s/it]\n
2025-01-10 10:11:55.081 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 11% Completed | 1/9 [00:01<00:12, 1.59s/it]\n
2025-01-10 10:11:53.488 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 0% Completed | 0/9 [00:00<?, ?it/s]\n
2025-01-10 10:11:53.488 | info | 3yn75aympd52v1 | (VllmWorkerProcess pid=228) INFO 01-10 13:11:53 model_runner.py:1072] Starting to load model /models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83...\n
2025-01-10 10:11:53.488 | info | 3yn75aympd52v1 | INFO 01-10 13:11:53
model_runner.py:1072] Starting to load model /models/huggingface-cache/hub/models--
loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83...\n
2025-01-10 10:11:53.488 | info | 3yn75aympd52v1 | INFO 01-10 13:11:53
shm_broadcast.py:236] vLLM message queue communication handle:
Handle(connect_ip='127.0.0.1', local_reader_ranks=[1],
buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at
0x72bdaf971fc0>, local_subscribe_port=58619, remote_subscribe_port=None)\n
2025-01-10 10:11:53.488 | info | 3yn75aympd52v1 | (VllmWorkerProcess pid=228) INFO 01-10 13:11:53 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json\n
2025-01-10 10:11:53.369 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 22% Completed | 2/9 [00:04<00:16, 2.34s/it]\n
2025-01-10 10:11:53.183 | info | 3yn75aympd52v1 | INFO 01-10 13:11:53
custom_all_reduce_utils.py:242] reading GPU P2P access cache from
/root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json\n
2025-01-10 10:11:53.183 | info | 3yn75aympd52v1 | INFO 01-10 13:11:42
custom_all_reduce_utils.py:204] generating GPU P2P access cache in
/root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json\n
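custom_all_reduce_utils generates and then re-reads a JSON cache recording whether GPU 0 and GPU 1 can access each other's memory directly. A minimal way to check the same peer-to-peer capability by hand with PyTorch (illustrative only, not vLLM's internal probe):

import torch

# Same question the cached gpu_p2p_access_cache_for_0,1.json answers:
# can each GPU directly access the other's memory?
if torch.cuda.device_count() >= 2:
    p2p_01 = torch.cuda.can_device_access_peer(0, 1)
    p2p_10 = torch.cuda.can_device_access_peer(1, 0)
    print(f"GPU0 -> GPU1 P2P: {p2p_01}, GPU1 -> GPU0 P2P: {p2p_10}")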
2025-01-10 10:11:53.183 | info | 3yn75aympd52v1 | INFO 01-10 13:11:42
pynccl.py:69] vLLM is using nccl==2.21.5\n
2025-01-10 10:11:53.183 | info | 3yn75aympd52v1 | (VllmWorkerProcess pid=228) INFO 01-10 13:11:42 pynccl.py:69] vLLM is using nccl==2.21.5\n
2025-01-10 10:11:53.183 | info | 3yn75aympd52v1 | INFO 01-10 13:11:42
utils.py:960] Found nccl from library libnccl.so.2\n
2025-01-10 10:11:53.183 | info | 3yn75aympd52v1 | (VllmWorkerProcess pid=228) INFO 01-10 13:11:42 utils.py:960] Found nccl from library libnccl.so.2\n
2025-01-10 10:11:53.183 | info | 3yn75aympd52v1 | (VllmWorkerProcess pid=228) INFO 01-10 13:11:41 multiproc_worker_utils.py:215] Worker ready; awaiting tasks\n
2025-01-10 10:11:53.183 | info | 3yn75aympd52v1 | (VllmWorkerProcess pid=228) INFO 01-10 13:11:41 selector.py:135] Using Flash Attention backend.\n
2025-01-10 10:11:53.183 | info | 3yn75aympd52v1 | INFO 01-10 13:11:41
selector.py:135] Using Flash Attention backend.\n
2025-01-10 10:11:53.183 | info | 3yn75aympd52v1 | INFO 01-10 13:11:41
custom_cache_manager.py:17] Setting Triton cache manager to:
vllm.triton_utils.custom_cache_manager:CustomCacheManager\n
2025-01-10 10:11:53.183 | warning | 3yn75aympd52v1 | WARNING 01-10 13:11:41
multiproc_gpu_executor.py:56] Reducing Torch parallelism from 48 threads to 1 to
avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment
to tune this value as needed.\n
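The multiproc_gpu_executor warning above means vLLM clamps Torch's CPU thread pool to 1 unless OMP_NUM_THREADS is set in the environment it is launched from. A minimal sketch of pinning it before the engine and its worker processes are created (the value 8 is an illustrative choice, not taken from this deployment):

import os

# Must be set before vLLM / PyTorch spawn their worker processes.
os.environ["OMP_NUM_THREADS"] = "8"  # illustrative; tune per host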
2025-01-10 10:11:53.183 | info | 3yn75aympd52v1 | INFO 01-10 13:11:40
llm_engine.py:249] Initializing an LLM engine (v0.6.4) with config:
model='/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/
snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', speculative_config=None,
tokenizer='/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/
snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', skip_tokenizer_init=False,
tokenizer_mode=auto, revision=None, override_neuron_config=None,
tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16,
max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO,
tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False,
quantization=awq, enforce_eager=False, kv_cache_dtype=auto,
quantization_param_path=None, device_config=cuda,
decoding_config=DecodingConfig(guided_decoding_backend='outlines'),
observability_config=ObservabilityConfig(otlp_traces_endpoint=None,
collect_model_forward_time=False, collect_model_execute_time=False), seed=0,
served_model_name=/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-
v4.6-AWQ/snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83, num_scheduler_steps=1,
chunked_prefill_enabled=False multi_step_stream_outputs=True,
enable_prefix_caching=True, use_async_output_proc=True, use_cached_outputs=False,
chat_template_text_format=string, mm_processor_kwargs=None, pooler_config=None)\n
2025-01-10 10:11:53.183 | info | 3yn75aympd52v1 | INFO 01-10 13:11:40
config.py:1020] Defaulting to use mp for distributed inference\n
2025-01-10 10:11:53.183 | warning | 3yn75aympd52v1 | WARNING 01-10 13:11:40
config.py:428] awq quantization is not fully optimized yet. The speed can be slower
than non-quantized models.\n
2025-01-10 10:11:53.183 | info | 3yn75aympd52v1 | INFO 01-10 13:11:40
awq_marlin.py:113] Detected that the model can run with awq_marlin, however you
specified quantization=awq explicitly, so forcing awq. Use quantization=awq_marlin
for faster inference\n
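The awq_marlin.py:113 record above notes the checkpoint could use the faster awq_marlin kernels but honours the explicit quantization=awq setting. A minimal sketch of opting into the suggested backend via engine args; whether that is appropriate for this checkpoint is an assumption the log does not confirm:

from vllm.engine.arg_utils import AsyncEngineArgs

args = AsyncEngineArgs(
    model="...",                  # snapshot path elided; see the Engine args records in this log
    quantization="awq_marlin",    # take the hint from awq_marlin.py:113
    tensor_parallel_size=2,
    max_model_len=8192,
)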
2025-01-10 10:11:53.183 | info | 3yn75aympd52v1 | INFO 01-10 13:11:40
config.py:350] This model supports multiple tasks: {'generate', 'embedding'}.
Defaulting to 'generate'.\n
2025-01-10 10:11:53.183 | info | 3yn75aympd52v1 | tokenizer_name_or_path:
/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83, tokenizer_revision: None,
trust_remote_code: False\n
2025-01-10 10:11:50.665 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 11% Completed | 1/9 [00:01<00:14, 1.81s/it]\n
2025-01-10 10:11:48.853 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 0% Completed | 0/9 [00:00<?, ?it/s]\n
2025-01-10 10:11:48.853 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess pid=228) INFO 01-10 13:11:48 model_runner.py:1072] Starting to load model /models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83...\n
2025-01-10 10:11:48.853 | info | q5xcbyb3ls1u0i | INFO 01-10 13:11:48
model_runner.py:1072] Starting to load model /models/huggingface-cache/hub/models--
loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83...\n
2025-01-10 10:11:48.853 | info | q5xcbyb3ls1u0i | INFO 01-10 13:11:48
shm_broadcast.py:236] vLLM message queue communication handle:
Handle(connect_ip='127.0.0.1', local_reader_ranks=[1],
buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at
0x79db90ae5ea0>, local_subscribe_port=49363, remote_subscribe_port=None)\n
2025-01-10 10:11:48.853 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess pid=228) INFO 01-10 13:11:48 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json\n
2025-01-10 10:11:48.565 | info | q5xcbyb3ls1u0i | INFO 01-10 13:11:48
custom_all_reduce_utils.py:242] reading GPU P2P access cache from
/root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json\n
2025-01-10 10:11:48.565 | info | q5xcbyb3ls1u0i | INFO 01-10 13:11:40
custom_all_reduce_utils.py:204] generating GPU P2P access cache in
/root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json\n
2025-01-10 10:11:48.565 | info | q5xcbyb3ls1u0i | INFO 01-10 13:11:40
pynccl.py:69] vLLM is using nccl==2.21.5\n
2025-01-10 10:11:48.565 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess pid=228) INFO 01-10 13:11:40 pynccl.py:69] vLLM is using nccl==2.21.5\n
2025-01-10 10:11:48.565 | info | q5xcbyb3ls1u0i | INFO 01-10 13:11:40
utils.py:960] Found nccl from library libnccl.so.2\n
2025-01-10 10:11:48.565 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess pid=228) INFO 01-10 13:11:40 utils.py:960] Found nccl from library libnccl.so.2\n
2025-01-10 10:11:48.565 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess pid=228) INFO 01-10 13:11:39 multiproc_worker_utils.py:215] Worker ready; awaiting tasks\n
2025-01-10 10:11:48.565 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess pid=228) INFO 01-10 13:11:39 selector.py:135] Using Flash Attention backend.\n
2025-01-10 10:11:48.565 | info | q5xcbyb3ls1u0i | INFO 01-10 13:11:39
selector.py:135] Using Flash Attention backend.\n
2025-01-10 10:11:48.565 | info | q5xcbyb3ls1u0i | INFO 01-10 13:11:39
custom_cache_manager.py:17] Setting Triton cache manager to:
vllm.triton_utils.custom_cache_manager:CustomCacheManager\n
2025-01-10 10:11:48.565 | warning | q5xcbyb3ls1u0i | WARNING 01-10 13:11:39
multiproc_gpu_executor.py:56] Reducing Torch parallelism from 64 threads to 1 to
avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment
to tune this value as needed.\n
2025-01-10 10:11:48.565 | info | q5xcbyb3ls1u0i | INFO 01-10 13:11:39
llm_engine.py:249] Initializing an LLM engine (v0.6.4) with config:
model='/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/
snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', speculative_config=None,
tokenizer='/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/
snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', skip_tokenizer_init=False,
tokenizer_mode=auto, revision=None, override_neuron_config=None,
tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16,
max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO,
tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False,
quantization=awq, enforce_eager=False, kv_cache_dtype=auto,
quantization_param_path=None, device_config=cuda,
decoding_config=DecodingConfig(guided_decoding_backend='outlines'),
observability_config=ObservabilityConfig(otlp_traces_endpoint=None,
collect_model_forward_time=False, collect_model_execute_time=False), seed=0,
served_model_name=/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-
v4.6-AWQ/snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83, num_scheduler_steps=1,
chunked_prefill_enabled=False multi_step_stream_outputs=True,
enable_prefix_caching=True, use_async_output_proc=True, use_cached_outputs=False,
chat_template_text_format=string, mm_processor_kwargs=None, pooler_config=None)\n
2025-01-10 10:11:48.565 | info | q5xcbyb3ls1u0i | INFO 01-10 13:11:39
config.py:1020] Defaulting to use mp for distributed inference\n
2025-01-10 10:11:48.565 | warning | q5xcbyb3ls1u0i | WARNING 01-10 13:11:39
config.py:428] awq quantization is not fully optimized yet. The speed can be slower
than non-quantized models.\n
2025-01-10 10:11:48.565 | info | q5xcbyb3ls1u0i | INFO 01-10 13:11:39
awq_marlin.py:113] Detected that the model can run with awq_marlin, however you
specified quantization=awq explicitly, so forcing awq. Use quantization=awq_marlin
for faster inference\n
2025-01-10 10:11:48.565 | info | q5xcbyb3ls1u0i | INFO 01-10 13:11:39
config.py:350] This model supports multiple tasks: {'generate', 'embedding'}.
Defaulting to 'generate'.\n
2025-01-10 10:11:48.565 | info | q5xcbyb3ls1u0i | tokenizer_name_or_path:
/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83, tokenizer_revision: None,
trust_remote_code: False\n
2025-01-10 10:11:48.565 | info | q5xcbyb3ls1u0i | engine.py :26 2025-
01-10 13:11:35,381 Engine args:
AsyncEngineArgs(model='/models/huggingface-cache/hub/models--loremipsum3658--
Bacchus-v4.6-AWQ/snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83',
served_model_name=None, tokenizer='/models/huggingface-cache/hub/models--
loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', task='auto', skip_tokenizer_init=False,
tokenizer_mode='auto', chat_template_text_format='string', trust_remote_code=False,
allowed_local_media_path='', download_dir=None, load_format='auto',
config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto',
quantization_param_path=None, seed=0, max_model_len=8192, worker_use_ray=False,
distributed_executor_backend=None, pipeline_parallel_size=1,
tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16,
enable_prefix_caching=True, disable_sliding_window=False,
use_v2_block_manager='true', swap_space=4, cpu_offload_gb=0,
gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256,
max_logprobs=20, disable_log_stats=False, revision=None, code_revision=None,
rope_scaling=None, rope_theta=None, hf_overrides=None, tokenizer_revision=None,
quantization='awq', enforce_eager=False, max_seq_len_to_capture=8192,
disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray',
tokenizer_pool_extra_config=None, limit_mm_per_prompt=None,
mm_processor_kwargs=None, enable_lora=False, enable_lora_bias=False, max_loras=1,
max_lora_rank=16, enable_prompt_adapter=False, max_prompt_adapters=1,
max_prompt_adapter_token=0, fully_sharded_loras=False, lora_extra_vocab_size=256,
long_lora_scaling_factors=None, lora_dtype='auto', max_cpu_loras=None,
device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True,
ray_workers_use_nsight=False, num_gpu_blocks_override=None, num_lookahead_slots=0,
model_loader_extra_config=None, ignore_patterns=None, preemption_mode=None,
scheduler_delay_factor=0.0, enable_chunked_prefill=None,
guided_decoding_backend='outlines', speculative_model=None,
speculative_model_quantization=None, speculative_draft_tensor_parallel_size=None,
num_speculative_tokens=None, speculative_disable_mqa_scorer=False,
speculative_max_model_len=None, speculative_disable_by_batch_size=None,
ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None,
spec_decoding_acceptance_method='rejection_sampler',
typical_acceptance_sampler_posterior_threshold=None,
typical_acceptance_sampler_posterior_alpha=None, qlora_adapter_name_or_path=None,
disable_logprobs_during_spec_decoding=None, otlp_traces_endpoint=None,
collect_detailed_traces=None, disable_async_output_proc=False,
scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None,
disable_log_requests=False)\n
2025-01-10 10:11:48.564 | info | q5xcbyb3ls1u0i | engine_args.py :126 2025-
01-10 13:11:35,342 Using baked in model with args: {'MODEL_NAME':
'/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', 'TOKENIZER_NAME': '/models/huggingface-
cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83'}\n
2025-01-10 10:11:19.636 | info | 0a13jc93spuadh | (VllmWorkerProcess pid=228) INFO 01-10 13:11:19 utils.py:960] Found nccl from library libnccl.so.2\n
2025-01-10 10:11:19.636 | info | 0a13jc93spuadh | (VllmWorkerProcess pid=228) INFO 01-10 13:11:18 multiproc_worker_utils.py:215] Worker ready; awaiting tasks\n
2025-01-10 10:11:19.636 | info | 0a13jc93spuadh | (VllmWorkerProcess pid=228) INFO 01-10 13:11:18 selector.py:135] Using Flash Attention backend.\n
2025-01-10 10:11:18.708 | info | 0a13jc93spuadh | INFO 01-10 13:11:18
selector.py:135] Using Flash Attention backend.\n
2025-01-10 10:11:18.708 | info | 0a13jc93spuadh | INFO 01-10 13:11:18
custom_cache_manager.py:17] Setting Triton cache manager to:
vllm.triton_utils.custom_cache_manager:CustomCacheManager\n
2025-01-10 10:11:18.605 | warning | 0a13jc93spuadh | WARNING 01-10 13:11:18
multiproc_gpu_executor.py:56] Reducing Torch parallelism from 48 threads to 1 to
avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment
to tune this value as needed.\n
2025-01-10 10:11:18.605 | info | 0a13jc93spuadh | INFO 01-10 13:11:18
llm_engine.py:249] Initializing an LLM engine (v0.6.4) with config:
model='/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/
snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', speculative_config=None,
tokenizer='/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/
snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', skip_tokenizer_init=False,
tokenizer_mode=auto, revision=None, override_neuron_config=None,
tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16,
max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO,
tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False,
quantization=awq, enforce_eager=False, kv_cache_dtype=auto,
quantization_param_path=None, device_config=cuda,
decoding_config=DecodingConfig(guided_decoding_backend='outlines'),
observability_config=ObservabilityConfig(otlp_traces_endpoint=None,
collect_model_forward_time=False, collect_model_execute_time=False), seed=0,
served_model_name=/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-
v4.6-AWQ/snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83, num_scheduler_steps=1,
chunked_prefill_enabled=False multi_step_stream_outputs=True,
enable_prefix_caching=True, use_async_output_proc=True, use_cached_outputs=False,
chat_template_text_format=string, mm_processor_kwargs=None, pooler_config=None)\n
2025-01-10 10:11:18.605 | info | 0a13jc93spuadh | INFO 01-10 13:11:18
config.py:1020] Defaulting to use mp for distributed inference\n
2025-01-10 10:11:18.605 | warning | 0a13jc93spuadh | WARNING 01-10 13:11:18
config.py:428] awq quantization is not fully optimized yet. The speed can be slower
than non-quantized models.\n
2025-01-10 10:11:18.605 | info | 0a13jc93spuadh | INFO 01-10 13:11:18
awq_marlin.py:113] Detected that the model can run with awq_marlin, however you
specified quantization=awq explicitly, so forcing awq. Use quantization=awq_marlin
for faster inference\n
2025-01-10 10:11:18.605 | info | 0a13jc93spuadh | INFO 01-10 13:11:18
config.py:350] This model supports multiple tasks: {'generate', 'embedding'}.
Defaulting to 'generate'.\n
2025-01-10 10:11:18.122 | info | 0a13jc93spuadh | tokenizer_name_or_path:
/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83, tokenizer_revision: None,
trust_remote_code: False\n
2025-01-10 10:11:18.122 | info | 0a13jc93spuadh | engine.py :26 2025-
01-10 13:11:13,117 Engine args:
AsyncEngineArgs(model='/models/huggingface-cache/hub/models--loremipsum3658--
Bacchus-v4.6-AWQ/snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83',
served_model_name=None, tokenizer='/models/huggingface-cache/hub/models--
loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', task='auto', skip_tokenizer_init=False,
tokenizer_mode='auto', chat_template_text_format='string', trust_remote_code=False,
allowed_local_media_path='', download_dir=None, load_format='auto',
config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto',
quantization_param_path=None, seed=0, max_model_len=8192, worker_use_ray=False,
distributed_executor_backend=None, pipeline_parallel_size=1,
tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16,
enable_prefix_caching=True, disable_sliding_window=False,
use_v2_block_manager='true', swap_space=4, cpu_offload_gb=0,
gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256,
max_logprobs=20, disable_log_stats=False, revision=None, code_revision=None,
rope_scaling=None, rope_theta=None, hf_overrides=None, tokenizer_revision=None,
quantization='awq', enforce_eager=False, max_seq_len_to_capture=8192,
disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray',
tokenizer_pool_extra_config=None, limit_mm_per_prompt=None,
mm_processor_kwargs=None, enable_lora=False, enable_lora_bias=False, max_loras=1,
max_lora_rank=16, enable_prompt_adapter=False, max_prompt_adapters=1,
max_prompt_adapter_token=0, fully_sharded_loras=False, lora_extra_vocab_size=256,
long_lora_scaling_factors=None, lora_dtype='auto', max_cpu_loras=None,
device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True,
ray_workers_use_nsight=False, num_gpu_blocks_override=None, num_lookahead_slots=0,
model_loader_extra_config=None, ignore_patterns=None, preemption_mode=None,
scheduler_delay_factor=0.0, enable_chunked_prefill=None,
guided_decoding_backend='outlines', speculative_model=None,
speculative_model_quantization=None, speculative_draft_tensor_parallel_size=None,
num_speculative_tokens=None, speculative_disable_mqa_scorer=False,
speculative_max_model_len=None, speculative_disable_by_batch_size=None,
ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None,
spec_decoding_acceptance_method='rejection_sampler',
typical_acceptance_sampler_posterior_threshold=None,
typical_acceptance_sampler_posterior_alpha=None, qlora_adapter_name_or_path=None,
disable_logprobs_during_spec_decoding=None, otlp_traces_endpoint=None,
collect_detailed_traces=None, disable_async_output_proc=False,
scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None,
disable_log_requests=False)\n
2025-01-10 10:11:13.084 | info | 0a13jc93spuadh | engine_args.py :126 2025-
01-10 13:11:13,083 Using baked in model with args: {'MODEL_NAME':
'/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', 'TOKENIZER_NAME': '/models/huggingface-
cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83'}\n
2025-01-09 19:17:04.549 | info | uog4wi2bfchi7t |
warnings.warn('resource_tracker: There appear to be %d '\n
2025-01-09 19:17:03.533 | info | uog4wi2bfchi7t |
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning:
resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at
shutdown\n
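The resource_tracker warning above is Python's multiprocessing noticing a SharedMemory segment that was never unlinked before shutdown (vLLM's shm_broadcast ring buffer is the likely owner here). A minimal sketch of the close/unlink pattern the tracker expects, using a hypothetical segment name rather than anything from vLLM:

from multiprocessing import shared_memory

# Hypothetical segment for illustration; vLLM manages its own ring buffer internally.
shm = shared_memory.SharedMemory(create=True, size=1024, name="example_segment")
try:
    shm.buf[:5] = b"hello"
finally:
    shm.close()    # each process closes its mapping
    shm.unlink()   # exactly one process unlinks, or the tracker warns at shutdown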
2025-01-09 19:17:02.690 | warning | uog4wi2bfchi7t | [rank0]:[W109
22:17:02.363404277 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has
NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the
application should call destroy_process_group to ensure that any pending NCCL
operations have finished in this process. In rare cases this process can exit
before this point and block the progress of another member of the process group.
This constraint has always been present, but this warning has only been added
since PyTorch 2.4 (function operator())\n
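The ProcessGroupNCCL warning above asks applications to tear the process group down explicitly before exit. A minimal sketch of the shutdown call PyTorch is referring to, guarded because the group may already be gone by the time cleanup runs:

import torch.distributed as dist

# Call once per process during shutdown, after all pending NCCL work has finished.
if dist.is_available() and dist.is_initialized():
    dist.destroy_process_group()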
2025-01-09 19:17:02.690 | info | uog4wi2bfchi7t | INFO 01-09 22:17:01
async_llm_engine.py:62] Engine is gracefully shutting down.\n
2025-01-09 19:17:01.099 | info | uog4wi2bfchi7t | Kill worker.
2025-01-09 19:14:39.867 | info | uog4wi2bfchi7t | Finished.
2025-01-09 19:14:39.799 | info | uog4wi2bfchi7t | Finished running generator.
2025-01-09 19:14:39.678 | info | uog4wi2bfchi7t | INFO 01-09 22:14:39
async_llm_engine.py:176] Finished request cc8c94c4ae0449868aea11ca75885470.\n
2025-01-09 19:14:39.677 | info | uog4wi2bfchi7t | INFO 01-09 22:14:38
metrics.py:465] Prefix cache hit rate: GPU: 0.00%, CPU: 0.00%\n
2025-01-09 19:14:38.880 | info | uog4wi2bfchi7t | INFO 01-09 22:14:38
metrics.py:449] Avg prompt throughput: 510.2 tokens/s, Avg generation throughput:
7.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 3.1%, CPU KV cache usage: 0.0%.\n
2025-01-09 19:14:38.880 | info | uog4wi2bfchi7t | INFO 01-09 22:14:34
async_llm_engine.py:208] Added request cc8c94c4ae0449868aea11ca75885470.\n
2025-01-09 19:14:38.880 | info | uog4wi2bfchi7t | Jobs in progress: 1
2025-01-09 19:14:38.880 | info | uog4wi2bfchi7t | Jobs in queue: 1
2025-01-09 19:14:34.774 | info | uog4wi2bfchi7t | --- Starting Serverless Worker |
Version 1.7.5 ---\n
2025-01-09 19:14:34.774 | info | uog4wi2bfchi7t | INFO 01-09 22:14:34
chat_utils.py:431] ' }}{% endif %}\n
2025-01-09 19:14:34.774 | info | uog4wi2bfchi7t | INFO 01-09 22:14:34
chat_utils.py:431] \n
2025-01-09 19:14:34.774 | info | uog4wi2bfchi7t | INFO 01-09 22:14:34
chat_utils.py:431] '+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0
== 0 %}{% set content = '<|begin_of_text|>' + content %}{% endif %}{{ content }}{%
endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|
end_header_id|>\n
2025-01-09 19:14:34.774 | info | uog4wi2bfchi7t | INFO 01-09 22:14:34
chat_utils.py:431] \n
2025-01-09 19:14:34.774 | info | uog4wi2bfchi7t | INFO 01-09 22:14:34
chat_utils.py:431] {% set loop_messages = messages %}{% for message in
loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|
end_header_id|>\n
2025-01-09 19:14:34.774 | info | uog4wi2bfchi7t | INFO 01-09 22:14:34
chat_utils.py:431] Using supplied chat template:\n
2025-01-09 19:14:34.219 | info | uog4wi2bfchi7t | tokenizer_name_or_path:
/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83, tokenizer_revision: None,
trust_remote_code: False\n
2025-01-09 19:14:34.219 | info | uog4wi2bfchi7t | engine.py :26 2025-
01-09 22:14:33,860 Engine args:
AsyncEngineArgs(model='/models/huggingface-cache/hub/models--loremipsum3658--
Bacchus-v4.6-AWQ/snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83',
served_model_name=None, tokenizer='/models/huggingface-cache/hub/models--
loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', task='auto', skip_tokenizer_init=False,
tokenizer_mode='auto', chat_template_text_format='string', trust_remote_code=False,
allowed_local_media_path='', download_dir=None, load_format='auto',
config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto',
quantization_param_path=None, seed=0, max_model_len=8192, worker_use_ray=False,
distributed_executor_backend=None, pipeline_parallel_size=1,
tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16,
enable_prefix_caching=True, disable_sliding_window=False,
use_v2_block_manager='true', swap_space=4, cpu_offload_gb=0,
gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256,
max_logprobs=20, disable_log_stats=False, revision=None, code_revision=None,
rope_scaling=None, rope_theta=None, hf_overrides=None, tokenizer_revision=None,
quantization='awq', enforce_eager=False, max_seq_len_to_capture=8192,
disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray',
tokenizer_pool_extra_config=None, limit_mm_per_prompt=None,
mm_processor_kwargs=None, enable_lora=False, enable_lora_bias=False, max_loras=1,
max_lora_rank=16, enable_prompt_adapter=False, max_prompt_adapters=1,
max_prompt_adapter_token=0, fully_sharded_loras=False, lora_extra_vocab_size=256,
long_lora_scaling_factors=None, lora_dtype='auto', max_cpu_loras=None,
device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True,
ray_workers_use_nsight=False, num_gpu_blocks_override=None, num_lookahead_slots=0,
model_loader_extra_config=None, ignore_patterns=None, preemption_mode=None,
scheduler_delay_factor=0.0, enable_chunked_prefill=None,
guided_decoding_backend='outlines', speculative_model=None,
speculative_model_quantization=None, speculative_draft_tensor_parallel_size=None,
num_speculative_tokens=None, speculative_disable_mqa_scorer=False,
speculative_max_model_len=None, speculative_disable_by_batch_size=None,
ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None,
spec_decoding_acceptance_method='rejection_sampler',
typical_acceptance_sampler_posterior_threshold=None,
typical_acceptance_sampler_posterior_alpha=None, qlora_adapter_name_or_path=None,
disable_logprobs_during_spec_decoding=None, otlp_traces_endpoint=None,
collect_detailed_traces=None, disable_async_output_proc=False,
scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None,
disable_log_requests=False)\n
2025-01-09 19:14:34.218 | info | uog4wi2bfchi7t | engine_args.py :126 2025-
01-09 22:14:33,855 Using baked in model with args: {'MODEL_NAME':
'/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', 'TOKENIZER_NAME': '/models/huggingface-
cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83'}\n
2025-01-09 19:14:33.855 | info | uog4wi2bfchi7t | engine.py :112 2025-
01-09 22:14:33,854 Initialized vLLM engine in 69.96s\n
2025-01-09 19:14:33.855 | info | uog4wi2bfchi7t | INFO 01-09 22:14:33
model_runner.py:1518] Graph capturing finished in 29 secs, took 0.99 GiB\n
2025-01-09 19:14:33.855 | info | uog4wi2bfchi7t | (VllmWorkerProcess pid=228) INFO 01-09 22:14:33 model_runner.py:1518] Graph capturing finished in 29 secs, took 0.99 GiB\n
2025-01-09 19:14:33.855 | info | uog4wi2bfchi7t | (VllmWorkerProcess pid=228) INFO 01-09 22:14:33 custom_all_reduce.py:224] Registering 5635 cuda graph addresses\n
2025-01-09 19:14:33.562 | info | uog4wi2bfchi7t | INFO 01-09 22:14:33
custom_all_reduce.py:224] Registering 5635 cuda graph addresses\n
2025-01-09 19:14:33.562 | info | uog4wi2bfchi7t | (VllmWorkerProcess pid=228) INFO 01-09 22:14:04 model_runner.py:1404] If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.\n
2025-01-09 19:14:33.562 | info | uog4wi2bfchi7t | (VllmWorkerProcess pid=228) INFO 01-09 22:14:04 model_runner.py:1400] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.\n
2025-01-09 19:14:33.562 | info | uog4wi2bfchi7t | INFO 01-09 22:14:04
model_runner.py:1404] If out-of-memory error occurs during cudagraph capture,
consider decreasing `gpu_memory_utilization` or switching to eager mode. You can
also reduce the `max_num_seqs` as needed to decrease memory usage.\n
2025-01-09 19:14:04.430 | info | uog4wi2bfchi7t | INFO 01-09 22:14:04
model_runner.py:1400] Capturing cudagraphs for decoding. This may lead to
unexpected consequences if the model is not static. To run the model in eager mode,
set 'enforce_eager=True' or use '--enforce-eager' in the CLI.\n
2025-01-09 19:14:04.430 | info | uog4wi2bfchi7t | INFO 01-09 22:14:01
distributed_gpu_executor.py:61] Maximum concurrency for 8192 tokens per request:
10.14x\n
2025-01-09 19:14:01.851 | info | uog4wi2bfchi7t | INFO 01-09 22:14:01
distributed_gpu_executor.py:57] # GPU blocks: 5190, # CPU blocks: 1638\n
2025-01-09 19:14:01.710 | info | uog4wi2bfchi7t | INFO 01-09 22:14:01
worker.py:232] Memory profiling results: total_gpu_memory=44.34GiB
initial_memory_usage=25.90GiB peak_torch_memory=19.88GiB
memory_usage_post_profile=25.95Gib non_torch_memory=7.35GiB kv_cache_size=12.67GiB
gpu_memory_utilization=0.90\n
2025-01-09 19:14:01.652 | info | uog4wi2bfchi7t | (VllmWorkerProcess pid=228) INFO 01-09 22:14:01 worker.py:232] Memory profiling results: total_gpu_memory=44.34GiB initial_memory_usage=19.04GiB peak_torch_memory=19.84GiB memory_usage_post_profile=19.09Gib non_torch_memory=0.50GiB kv_cache_size=19.57GiB gpu_memory_utilization=0.90\n
2025-01-09 19:14:01.652 | info | uog4wi2bfchi7t | (VllmWorkerProcess pid=228) INFO 01-09 22:13:54 model_runner.py:1077] Loading model weights took 18.5769 GB\n
2025-01-09 19:13:54.245 | info | uog4wi2bfchi7t | INFO 01-09 22:13:54
model_runner.py:1077] Loading model weights took 18.5769 GB\n
2025-01-09 19:13:54.245 | info | uog4wi2bfchi7t | \n
2025-01-09 19:13:54.245 | info | uog4wi2bfchi7t | \rLoading safetensors checkpoint
shards: 100% Completed | 9/9 [00:15<00:00, 1.67s/it]\n
2025-01-09 19:13:54.062 | info | uog4wi2bfchi7t | \rLoading safetensors checkpoint
shards: 100% Completed | 9/9 [00:15<00:00, 1.33s/it]\n
2025-01-09 19:13:53.694 | info | uog4wi2bfchi7t | \rLoading safetensors checkpoint
shards: 89% Completed | 8/9 [00:14<00:01, 1.76s/it]\n
2025-01-09 19:13:52.289 | info | uog4wi2bfchi7t | \rLoading safetensors checkpoint
shards: 78% Completed | 7/9 [00:13<00:03, 1.93s/it]\n
2025-01-09 19:13:50.367 | info | uog4wi2bfchi7t | \rLoading safetensors checkpoint
shards: 67% Completed | 6/9 [00:11<00:05, 1.93s/it]\n
2025-01-09 19:13:48.394 | info | uog4wi2bfchi7t | \rLoading safetensors checkpoint
shards: 56% Completed | 5/9 [00:09<00:07, 1.92s/it]\n
2025-01-09 19:13:46.479 | info | uog4wi2bfchi7t | \rLoading safetensors checkpoint
shards: 44% Completed | 4/9 [00:07<00:09, 1.92s/it]\n
2025-01-09 19:13:44.523 | info | uog4wi2bfchi7t | \rLoading safetensors checkpoint
shards: 33% Completed | 3/9 [00:05<00:11, 1.89s/it]\n
2025-01-09 19:13:42.552 | info | uog4wi2bfchi7t | \rLoading safetensors checkpoint
shards: 22% Completed | 2/9 [00:03<00:12, 1.82s/it]\n
2025-01-09 19:13:42.552 | info | uog4wi2bfchi7t | \rLoading safetensors checkpoint
shards: 11% Completed | 1/9 [00:01<00:12, 1.55s/it]\n
2025-01-09 19:13:42.552 | info | uog4wi2bfchi7t | \rLoading safetensors checkpoint
shards: 0% Completed | 0/9 [00:00<?, ?it/s]\n
2025-01-09 19:13:42.552 | info | uog4wi2bfchi7t | (VllmWorkerProcess pid=228) INFO 01-09 22:13:38 model_runner.py:1072] Starting to load model /models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83...\n
2025-01-09 19:13:42.552 | info | uog4wi2bfchi7t | INFO 01-09 22:13:38
model_runner.py:1072] Starting to load model /models/huggingface-cache/hub/models--
loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83...\n
2025-01-09 19:13:42.552 | info | uog4wi2bfchi7t | INFO 01-09 22:13:38
shm_broadcast.py:236] vLLM message queue communication handle:
Handle(connect_ip='127.0.0.1', local_reader_ranks=[1],
buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at
0x740b82e65ea0>, local_subscribe_port=57835, remote_subscribe_port=None)\n
2025-01-09 19:13:42.551 | info | uog4wi2bfchi7t | (VllmWorkerProcess pid=228) INFO 01-09 22:13:38 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json\n
2025-01-09 19:13:42.551 | info | uog4wi2bfchi7t | INFO 01-09 22:13:38
custom_all_reduce_utils.py:242] reading GPU P2P access cache from
/root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json\n
2025-01-06 17:01:35.014 | info | olmw98xr7nfth5 |
warnings.warn('resource_tracker: There appear to be %d '\n
2025-01-06 17:01:34.173 | info | olmw98xr7nfth5 |
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning:
resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at
shutdown\n
2025-01-06 17:01:33.130 | warning | olmw98xr7nfth5 | [rank0]:[W106
20:01:33.989801744 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has
NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the
application should call destroy_process_group to ensure that any pending NCCL
operations have finished in this process. In rare cases this process can exit
before this point and block the progress of another member of the process group.
This constraint has always been present, but this warning has only been added
since PyTorch 2.4 (function operator())\n
2025-01-06 17:01:33.129 | info | olmw98xr7nfth5 | INFO 01-06 20:01:31
async_llm_engine.py:62] Engine is gracefully shutting down.\n
2025-01-06 17:01:31.624 | info | olmw98xr7nfth5 | Kill worker.
2025-01-06 16:55:37.119 | info | 0a13jc93spuadh |
warnings.warn('resource_tracker: There appear to be %d '\n
2025-01-06 16:55:36.026 | info | 0a13jc93spuadh |
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning:
resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at
shutdown\n
2025-01-06 16:55:34.721 | warning | 0a13jc93spuadh | [rank0]:[W106
19:55:34.437777753 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has
NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the
application should call destroy_process_group to ensure that any pending NCCL
operations have finished in this process. In rare cases this process can exit
before this point and block the progress of another member of the process group.
This constraint has always been present, but this warning has only been added
since PyTorch 2.4 (function operator())\n
2025-01-06 16:55:34.720 | info | 0a13jc93spuadh | Kill worker.
2025-01-06 16:55:32.868 | info | 0a13jc93spuadh | --- Starting Serverless Worker |
Version 1.7.5 ---\n
2025-01-06 16:55:32.868 | info | 0a13jc93spuadh | INFO 01-06 19:48:05
chat_utils.py:431] ' }}{% endif %}\n
2025-01-06 16:55:32.868 | info | 0a13jc93spuadh | INFO 01-06 19:48:05
chat_utils.py:431] \n
2025-01-06 16:55:32.868 | info | 0a13jc93spuadh | INFO 01-06 19:48:05
chat_utils.py:431] '+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0
== 0 %}{% set content = '<|begin_of_text|>' + content %}{% endif %}{{ content }}{%
endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|
end_header_id|>\n
2025-01-06 16:55:32.868 | info | 0a13jc93spuadh | INFO 01-06 19:48:05
chat_utils.py:431] \n
2025-01-06 16:55:32.868 | info | 0a13jc93spuadh | INFO 01-06 19:48:05
chat_utils.py:431] {% set loop_messages = messages %}{% for message in
loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|
end_header_id|>\n
2025-01-06 16:55:32.868 | info | 0a13jc93spuadh | INFO 01-06 19:48:05
chat_utils.py:431] Using supplied chat template:\n
2025-01-06 16:48:05.604 | info | 0a13jc93spuadh | tokenizer_name_or_path:
/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83, tokenizer_revision: None,
trust_remote_code: False\n
2025-01-06 16:48:05.604 | info | 0a13jc93spuadh | engine.py :26 2025-
01-06 19:48:05,296 Engine args:
AsyncEngineArgs(model='/models/huggingface-cache/hub/models--loremipsum3658--
Bacchus-v4.6-AWQ/snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83',
served_model_name=None, tokenizer='/models/huggingface-cache/hub/models--
loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', task='auto', skip_tokenizer_init=False,
tokenizer_mode='auto', chat_template_text_format='string', trust_remote_code=False,
allowed_local_media_path='', download_dir=None, load_format='auto',
config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto',
quantization_param_path=None, seed=0, max_model_len=8192, worker_use_ray=False,
distributed_executor_backend=None, pipeline_parallel_size=1,
tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16,
enable_prefix_caching=True, disable_sliding_window=False,
use_v2_block_manager='true', swap_space=4, cpu_offload_gb=0,
gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256,
max_logprobs=20, disable_log_stats=False, revision=None, code_revision=None,
rope_scaling=None, rope_theta=None, hf_overrides=None, tokenizer_revision=None,
quantization='awq', enforce_eager=False, max_seq_len_to_capture=8192,
disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray',
tokenizer_pool_extra_config=None, limit_mm_per_prompt=None,
mm_processor_kwargs=None, enable_lora=False, enable_lora_bias=False, max_loras=1,
max_lora_rank=16, enable_prompt_adapter=False, max_prompt_adapters=1,
max_prompt_adapter_token=0, fully_sharded_loras=False, lora_extra_vocab_size=256,
long_lora_scaling_factors=None, lora_dtype='auto', max_cpu_loras=None,
device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True,
ray_workers_use_nsight=False, num_gpu_blocks_override=None, num_lookahead_slots=0,
model_loader_extra_config=None, ignore_patterns=None, preemption_mode=None,
scheduler_delay_factor=0.0, enable_chunked_prefill=None,
guided_decoding_backend='outlines', speculative_model=None,
speculative_model_quantization=None, speculative_draft_tensor_parallel_size=None,
num_speculative_tokens=None, speculative_disable_mqa_scorer=False,
speculative_max_model_len=None, speculative_disable_by_batch_size=None,
ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None,
spec_decoding_acceptance_method='rejection_sampler',
typical_acceptance_sampler_posterior_threshold=None,
typical_acceptance_sampler_posterior_alpha=None, qlora_adapter_name_or_path=None,
disable_logprobs_during_spec_decoding=None, otlp_traces_endpoint=None,
collect_detailed_traces=None, disable_async_output_proc=False,
scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None,
disable_log_requests=False)\n
2025-01-06 16:48:05.604 | info | 0a13jc93spuadh | engine_args.py :126 2025-
01-06 19:48:05,291 Using baked in model with args: {'MODEL_NAME':
'/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', 'TOKENIZER_NAME': '/models/huggingface-
cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83'}\n
2025-01-06 16:48:05.291 | info | 0a13jc93spuadh | engine.py :112 2025-
01-06 19:48:05,290 Initialized vLLM engine in 76.47s\n
2025-01-06 16:48:05.291 | info | 0a13jc93spuadh | INFO 01-06 19:48:04
model_runner.py:1518] Graph capturing finished in 30 secs, took 0.99 GiB\n
2025-01-06 16:48:05.291 | info | 0a13jc93spuadh | (VllmWorkerProcess pid=228) INFO 01-06 19:48:04 model_runner.py:1518] Graph capturing finished in 30 secs, took 0.99 GiB\n
2025-01-06 16:48:04.929 | info | 0a13jc93spuadh | INFO 01-06 19:48:04
custom_all_reduce.py:224] Registering 5635 cuda graph addresses\n
2025-01-06 16:48:04.671 | info | 0a13jc93spuadh | (VllmWorkerProcess pid=228) INFO 01-06 19:48:04 custom_all_reduce.py:224] Registering 5635 cuda graph addresses\n
2025-01-06 16:48:04.671 | info | 0a13jc93spuadh | (VllmWorkerProcess pid=228) INFO 01-06 19:47:35 model_runner.py:1404] If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.\n
2025-01-06 16:48:04.671 | info | 0a13jc93spuadh | (VllmWorkerProcess pid=228) INFO 01-06 19:47:35 model_runner.py:1400] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.\n
2025-01-06 16:48:04.670 | info | 0a13jc93spuadh | INFO 01-06 19:47:35
model_runner.py:1404] If out-of-memory error occurs during cudagraph capture,
consider decreasing `gpu_memory_utilization` or switching to eager mode. You can
also reduce the `max_num_seqs` as needed to decrease memory usage.\n
2025-01-06 16:47:49.033 | info | olmw98xr7nfth5 | Finished.
2025-01-06 16:47:48.735 | info | olmw98xr7nfth5 | Finished running generator.
2025-01-06 16:47:48.506 | info | olmw98xr7nfth5 | INFO 01-06 19:47:48
async_llm_engine.py:176] Finished request 0640fa52ad8d415ebb9a4227191be936.\n
2025-01-06 16:47:48.506 | info | olmw98xr7nfth5 | INFO 01-06 19:47:46
metrics.py:465] Prefix cache hit rate: GPU: 50.00%, CPU: 0.00%\n
2025-01-06 16:47:46.696 | info | olmw98xr7nfth5 | INFO 01-06 19:47:46
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
31.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.6%, CPU KV cache usage: 0.0%.\n
2025-01-06 16:47:46.696 | info | olmw98xr7nfth5 | INFO 01-06 19:47:41
metrics.py:465] Prefix cache hit rate: GPU: 50.00%, CPU: 0.00%\n
2025-01-06 16:47:41.672 | info | olmw98xr7nfth5 | INFO 01-06 19:47:41
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
44.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.5%, CPU KV cache usage: 0.0%.\n
2025-01-06 16:47:41.672 | info | olmw98xr7nfth5 | Jobs in progress: 1
2025-01-06 16:47:39.625 | info | olmw98xr7nfth5 | Finished.
2025-01-06 16:47:39.322 | info | olmw98xr7nfth5 | Finished running generator.
2025-01-06 16:47:39.101 | info | olmw98xr7nfth5 | INFO 01-06 19:47:39
async_llm_engine.py:176] Finished request e9f056edf5ff49d59cfb5406730f3572.\n
2025-01-06 16:47:39.101 | info | olmw98xr7nfth5 | INFO 01-06 19:47:36
metrics.py:465] Prefix cache hit rate: GPU: 50.00%, CPU: 0.00%\n
2025-01-06 16:47:36.667 | info | olmw98xr7nfth5 | INFO 01-06 19:47:36
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
59.7 tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.8%, CPU KV cache usage: 0.0%.\n
2025-01-06 16:47:36.667 | info | olmw98xr7nfth5 | INFO 01-06 19:47:31
metrics.py:465] Prefix cache hit rate: GPU: 50.00%, CPU: 0.00%\n
2025-01-06 16:47:35.339 | info | 0a13jc93spuadh | INFO 01-06 19:47:35
model_runner.py:1400] Capturing cudagraphs for decoding. This may lead to
unexpected consequences if the model is not static. To run the model in eager mode,
set 'enforce_eager=True' or use '--enforce-eager' in the CLI.\n
2025-01-06 16:47:35.339 | info | 0a13jc93spuadh | INFO 01-06 19:47:32
distributed_gpu_executor.py:61] Maximum concurrency for 8192 tokens per request:
15.62x\n
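
The "15.62x" concurrency figure is consistent with the block count reported in the next entry, assuming vLLM's default KV-cache block size of 16 tokens (the block size itself is not printed in this log):

# Back-of-envelope check of the reported maximum concurrency.
gpu_blocks = 7999          # "# GPU blocks" from the entry below
block_size = 16            # assumed default block size (tokens per KV-cache block)
max_model_len = 8192       # tokens per request, as logged

kv_cache_tokens = gpu_blocks * block_size           # 127,984 tokens of KV-cache capacity
print(f"{kv_cache_tokens / max_model_len:.2f}x")    # -> 15.62x
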
2025-01-06 16:47:32.501 | info | 0a13jc93spuadh | INFO 01-06 19:47:32
distributed_gpu_executor.py:57] # GPU blocks: 7999, # CPU blocks: 1638\n
2025-01-06 16:47:32.302 | info | 0a13jc93spuadh | INFO 01-06 19:47:32
worker.py:232] Memory profiling results: total_gpu_memory=44.34GiB
initial_memory_usage=19.04GiB peak_torch_memory=19.88GiB
memory_usage_post_profile=19.09Gib non_torch_memory=0.50GiB kv_cache_size=19.53GiB
gpu_memory_utilization=0.90\n
2025-01-06 16:47:32.242 | info | 0a13jc93spuadh | (VllmWorkerProcess pid=228) INFO 01-06 19:47:32 worker.py:232] Memory profiling results: total_gpu_memory=44.34GiB initial_memory_usage=19.04GiB peak_torch_memory=19.84GiB memory_usage_post_profile=19.09Gib non_torch_memory=0.50GiB kv_cache_size=19.57GiB gpu_memory_utilization=0.90\n
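
The two profiling entries are internally consistent: the KV-cache budget is what remains of the 90% utilization target after PyTorch and non-torch allocations. The relationship below is inferred from the logged numbers rather than quoted from vLLM source; the same check reproduces the worker process's 19.57 GiB figure from its 19.84 GiB torch peak:

# Rank-0 profiling entry above, all values in GiB as logged.
total_gpu_memory = 44.34
gpu_memory_utilization = 0.90
peak_torch_memory = 19.88
non_torch_memory = 0.50

kv_cache_size = (total_gpu_memory * gpu_memory_utilization
                 - peak_torch_memory - non_torch_memory)
print(round(kv_cache_size, 2))  # -> 19.53, matching kv_cache_size=19.53GiB
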
2025-01-06 16:47:32.242 | info | 0a13jc93spuadh | INFO 01-06 19:47:24
model_runner.py:1077] Loading model weights took 18.5769 GB\n
2025-01-06 16:47:31.644 | info | olmw98xr7nfth5 | INFO 01-06 19:47:31
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
60.1 tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.6%, CPU KV cache usage: 0.0%.\n
2025-01-06 16:47:31.644 | info | olmw98xr7nfth5 | INFO 01-06 19:47:26
metrics.py:465] Prefix cache hit rate: GPU: 50.00%, CPU: 0.00%\n
2025-01-06 16:47:26.621 | info | olmw98xr7nfth5 | INFO 01-06 19:47:26
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
59.7 tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.4%, CPU KV cache usage: 0.0%.\n
2025-01-06 16:47:26.620 | info | olmw98xr7nfth5 | INFO 01-06 19:47:21
metrics.py:465] Prefix cache hit rate: GPU: 50.00%, CPU: 0.00%\n
2025-01-06 16:47:24.733 | info | 0a13jc93spuadh | (VllmWorkerProcess pid=228) INFO 01-06 19:47:24 model_runner.py:1077] Loading model weights took 18.5769 GB\n
