Logs-Lovefy - VLLM Server (Bacchus) - FB

The document contains log entries from a serverless vLLM worker process, detailing warnings and informational messages about memory management, NCCL process-group handling, and model loading. It highlights issues such as leaked shared-memory objects and the need to destroy process groups explicitly before exit so that other group members are not blocked, and it records metrics on memory usage, throughput, and the status of in-flight jobs.


2025-01-10 13:37:16.562 | info | 3yn75aympd52v1 |
warnings.warn('resource_tracker: There appear to be %d '\n
2025-01-10 13:37:15.658 | info | 3yn75aympd52v1 |
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning:
resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at
shutdown\n
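
The leaked shared_memory warning above is emitted by Python's multiprocessing resource_tracker when a SharedMemory segment is never unlinked before shutdown. A minimal illustration of the clean pattern (the segment name "demo_buf" is hypothetical, not taken from these logs):

    from multiprocessing import shared_memory

    shm = shared_memory.SharedMemory(create=True, size=1024, name="demo_buf")
    try:
        shm.buf[:5] = b"hello"   # use the segment
    finally:
        shm.close()              # release this process's mapping
        shm.unlink()             # destroy the segment; skipping this triggers the resource_tracker warning
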
2025-01-10 13:37:14.772 | warning | 3yn75aympd52v1 | [rank0]:[W110
16:37:14.435087749 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has
NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the
application should call destroy_process_group to ensure that any pending NCCL
operations have finished in this process. In rare cases this process can exit
before this point and block the progress of another member of the process group.
This constraint has always been present, but this warning has only been added
since PyTorch 2.4 (function operator())\n
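
The ProcessGroupNCCL warning asks the application to tear down the process group explicitly before exit. A hedged sketch of doing so with an atexit hook, assuming torch.distributed was initialized elsewhere in the engine:

    import atexit
    import torch.distributed as dist

    def _teardown():
        # destroy the default process group so pending NCCL ops finish before interpreter shutdown
        if dist.is_available() and dist.is_initialized():
            dist.destroy_process_group()

    atexit.register(_teardown)
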
2025-01-10 13:37:14.772 | info | 3yn75aympd52v1 | INFO 01-10 16:37:13
async_llm_engine.py:62] Engine is gracefully shutting down.\n
2025-01-10 13:37:13.161 | info | 3yn75aympd52v1 | Kill worker.
2025-01-10 13:37:13.161 | info | 3yn75aympd52v1 | Finished.
2025-01-10 13:28:32.226 | info | 3yn75aympd52v1 | Finished running generator.
2025-01-10 13:28:32.133 | info | 3yn75aympd52v1 | INFO 01-10 16:28:32
async_llm_engine.py:176] Finished request b91c6d62c0a648399cd7e92d15d493c4.\n
2025-01-10 13:28:32.133 | info | 3yn75aympd52v1 | INFO 01-10 16:28:31
metrics.py:465] Prefix cache hit rate: GPU: 0.00%, CPU: 0.00%\n
2025-01-10 13:28:31.274 | info | 3yn75aympd52v1 | INFO 01-10 16:28:31
metrics.py:449] Avg prompt throughput: 624.7 tokens/s, Avg generation throughput:
3.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 2.5%, CPU KV cache usage: 0.0%.\n
2025-01-10 13:28:31.274 | info | 3yn75aympd52v1 | INFO 01-10 16:28:27
async_llm_engine.py:208] Added request b91c6d62c0a648399cd7e92d15d493c4.\n
2025-01-10 13:28:31.274 | info | 3yn75aympd52v1 | Jobs in progress: 1
2025-01-10 13:28:31.274 | info | 3yn75aympd52v1 | Jobs in queue: 1
2025-01-10 13:28:27.392 | info | 3yn75aympd52v1 | --- Starting Serverless Worker |
Version 1.7.5 ---\n
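
The "Jobs in queue" / "Added request" / "Finished running generator" lines above trace a streaming job through the serverless worker. A hypothetical generator handler of that shape, assuming the RunPod serverless SDK (the token loop here is a placeholder for the real vLLM engine call):

    import runpod

    def handler(job):
        prompt = job["input"].get("prompt", "")
        # placeholder stream: the real worker yields chunks produced by the vLLM engine
        for token in prompt.split():
            yield {"token": token}

    runpod.serverless.start({"handler": handler})
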
2025-01-10 13:28:27.392 | info | 3yn75aympd52v1 | INFO 01-10 16:28:26
chat_utils.py:431] ' }}{% endif %}\n
2025-01-10 13:28:27.392 | info | 3yn75aympd52v1 | INFO 01-10 16:28:26
chat_utils.py:431] \n
2025-01-10 13:28:27.392 | info | 3yn75aympd52v1 | INFO 01-10 16:28:26
chat_utils.py:431] '+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0
== 0 %}{% set content = '<|begin_of_text|>' + content %}{% endif %}{{ content }}{%
endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|
end_header_id|>\n
2025-01-10 13:28:27.392 | info | 3yn75aympd52v1 | INFO 01-10 16:28:26
chat_utils.py:431] \n
2025-01-10 13:28:27.392 | info | 3yn75aympd52v1 | INFO 01-10 16:28:26
chat_utils.py:431] {% set loop_messages = messages %}{% for message in
loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|
end_header_id|>\n
2025-01-10 13:28:27.392 | info | 3yn75aympd52v1 | INFO 01-10 16:28:26
chat_utils.py:431] Using supplied chat template:\n
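
The supplied chat template logged above is the Llama-3-style header/eot format. A minimal sketch of rendering it with transformers (the messages are illustrative; the path is the snapshot directory from the logs):

    from transformers import AutoTokenizer

    snapshot = ("/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/"
                "snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83")
    tokenizer = AutoTokenizer.from_pretrained(snapshot)

    messages = [{"role": "user", "content": "Hello!"}]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    # prompt begins with <|begin_of_text|><|start_header_id|>user<|end_header_id|> ...
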
2025-01-10 13:28:26.954 | info | 3yn75aympd52v1 | tokenizer_name_or_path:
/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83, tokenizer_revision: None,
trust_remote_code: False\n
2025-01-10 13:28:26.954 | info | 3yn75aympd52v1 | engine.py :26 2025-
01-10 16:28:26,252 Engine args:
AsyncEngineArgs(model='/models/huggingface-cache/hub/models--loremipsum3658--
Bacchus-v4.6-AWQ/snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83',
served_model_name=None, tokenizer='/models/huggingface-cache/hub/models--
loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', task='auto', skip_tokenizer_init=False,
tokenizer_mode='auto', chat_template_text_format='string', trust_remote_code=False,
allowed_local_media_path='', download_dir=None, load_format='auto',
config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto',
quantization_param_path=None, seed=0, max_model_len=8192, worker_use_ray=False,
distributed_executor_backend=None, pipeline_parallel_size=1,
tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16,
enable_prefix_caching=True, disable_sliding_window=False,
use_v2_block_manager='true', swap_space=4, cpu_offload_gb=0,
gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256,
max_logprobs=20, disable_log_stats=False, revision=None, code_revision=None,
rope_scaling=None, rope_theta=None, hf_overrides=None, tokenizer_revision=None,
quantization='awq', enforce_eager=False, max_seq_len_to_capture=8192,
disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray',
tokenizer_pool_extra_config=None, limit_mm_per_prompt=None,
mm_processor_kwargs=None, enable_lora=False, enable_lora_bias=False, max_loras=1,
max_lora_rank=16, enable_prompt_adapter=False, max_prompt_adapters=1,
max_prompt_adapter_token=0, fully_sharded_loras=False, lora_extra_vocab_size=256,
long_lora_scaling_factors=None, lora_dtype='auto', max_cpu_loras=None,
device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True,
ray_workers_use_nsight=False, num_gpu_blocks_override=None, num_lookahead_slots=0,
model_loader_extra_config=None, ignore_patterns=None, preemption_mode=None,
scheduler_delay_factor=0.0, enable_chunked_prefill=None,
guided_decoding_backend='outlines', speculative_model=None,
speculative_model_quantization=None, speculative_draft_tensor_parallel_size=None,
num_speculative_tokens=None, speculative_disable_mqa_scorer=False,
speculative_max_model_len=None, speculative_disable_by_batch_size=None,
ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None,
spec_decoding_acceptance_method='rejection_sampler',
typical_acceptance_sampler_posterior_threshold=None,
typical_acceptance_sampler_posterior_alpha=None, qlora_adapter_name_or_path=None,
disable_logprobs_during_spec_decoding=None, otlp_traces_endpoint=None,
collect_detailed_traces=None, disable_async_output_proc=False,
scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None,
disable_log_requests=False)\n
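
For reference, the key arguments in the dump above can be reproduced directly when constructing the engine; a minimal sketch, assuming vLLM 0.6.x as logged (only non-default values shown):

    from vllm.engine.arg_utils import AsyncEngineArgs
    from vllm.engine.async_llm_engine import AsyncLLMEngine

    engine_args = AsyncEngineArgs(
        model=("/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/"
               "snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83"),
        quantization="awq",
        tensor_parallel_size=2,
        max_model_len=8192,
        gpu_memory_utilization=0.9,
        enable_prefix_caching=True,
        max_num_seqs=256,
    )
    engine = AsyncLLMEngine.from_engine_args(engine_args)
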
2025-01-10 13:28:26.954 | info | 3yn75aympd52v1 | engine_args.py :126 2025-
01-10 16:28:26,242 Using baked in model with args: {'MODEL_NAME':
'/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', 'TOKENIZER_NAME': '/models/huggingface-
cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83'}\n
2025-01-10 13:28:26.242 | info | 3yn75aympd52v1 | engine.py :112 2025-
01-10 16:28:26,241 Initialized vLLM engine in 78.28s\n
2025-01-10 13:28:26.242 | info | 3yn75aympd52v1 | INFO 01-10 16:28:25
model_runner.py:1518] Graph capturing finished in 34 secs, took 0.99 GiB\n
2025-01-10 13:28:25.681 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=228)#[0;0m INFO 01-10 16:28:25 model_runner.py:1518] Graph capturing finished
in 34 secs, took 0.99 GiB\n
2025-01-10 13:28:25.627 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=228)#[0;0m INFO 01-10 16:28:25 custom_all_reduce.py:224] Registering 5635 cuda
graph addresses\n
2025-01-10 13:28:24.655 | info | 3yn75aympd52v1 | INFO 01-10 16:28:24
custom_all_reduce.py:224] Registering 5635 cuda graph addresses\n
2025-01-10 13:28:24.655 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=228)#[0;0m INFO 01-10 16:27:52 model_runner.py:1404] If out-of-memory error
occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or
switching to eager mode. You can also reduce the `max_num_seqs` as needed to
decrease memory usage.\n
2025-01-10 13:27:52.080 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=228)#[0;0m INFO 01-10 16:27:52 model_runner.py:1400] Capturing cudagraphs for
decoding. This may lead to unexpected consequences if the model is not static. To
run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in
the CLI.\n
2025-01-10 13:27:52.080 | info | 3yn75aympd52v1 | INFO 01-10 16:27:51
model_runner.py:1404] If out-of-memory error occurs during cudagraph capture,
consider decreasing `gpu_memory_utilization` or switching to eager mode. You can
also reduce the `max_num_seqs` as needed to decrease memory usage.\n
2025-01-10 13:27:51.941 | info | 3yn75aympd52v1 | INFO 01-10 16:27:51
model_runner.py:1400] Capturing cudagraphs for decoding. This may lead to
unexpected consequences if the model is not static. To run the model in eager mode,
set 'enforce_eager=True' or use '--enforce-eager' in the CLI.\n
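
The two advisories above name three knobs for cudagraph-capture memory pressure. A hedged sketch of applying them through the offline LLM entrypoint (the non-logged values 0.85 and 128 are illustrative):

    from vllm import LLM

    llm = LLM(
        model=("/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/"
               "snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83"),
        enforce_eager=True,            # skip cudagraph capture entirely
        gpu_memory_utilization=0.85,   # or lower the 0.90 budget to leave capture headroom
        max_num_seqs=128,              # or cap concurrent sequences (logged default: 256)
    )
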
2025-01-10 13:27:51.941 | info | 3yn75aympd52v1 | INFO 01-10 16:27:49
distributed_gpu_executor.py:61] Maximum concurrency for 8192 tokens per request:
15.62x\n
2025-01-10 13:27:49.120 | info | 3yn75aympd52v1 | INFO 01-10 16:27:49
distributed_gpu_executor.py:57] # GPU blocks: 7999, # CPU blocks: 1638\n
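
The 15.62x figure follows directly from the block accounting in the line above, with block_size=16 from the engine args:

    gpu_blocks, block_size, max_model_len = 7999, 16, 8192
    print(gpu_blocks * block_size / max_model_len)   # ~15.62 full-length (8192-token) requests fit in the GPU KV cache
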
2025-01-10 13:27:48.940 | info | 3yn75aympd52v1 | INFO 01-10 16:27:48
worker.py:232] Memory profiling results: total_gpu_memory=44.34GiB
initial_memory_usage=19.04GiB peak_torch_memory=19.88GiB
memory_usage_post_profile=19.09Gib non_torch_memory=0.50GiB kv_cache_size=19.53GiB
gpu_memory_utilization=0.90\n
2025-01-10 13:27:48.868 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=228)#[0;0m INFO 01-10 16:27:48 worker.py:232] Memory profiling results:
total_gpu_memory=44.34GiB initial_memory_usage=19.04GiB peak_torch_memory=19.84GiB
memory_usage_post_profile=19.09Gib non_torch_memory=0.50GiB kv_cache_size=19.57GiB
gpu_memory_utilization=0.90\n
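
The kv_cache_size in the profiling lines is (approximately) whatever remains of the 0.90 utilization budget after the peak torch and non-torch allocations, e.g. for the rank-0 numbers above:

    total_gpu, util = 44.34, 0.90
    peak_torch, non_torch = 19.88, 0.50
    print(util * total_gpu - peak_torch - non_torch)   # ~19.53 GiB, matching kv_cache_size
    # the worker-process line works out the same way: 39.906 - 19.84 - 0.50 = ~19.57 GiB
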
2025-01-10 13:27:40.958 | info | 3yn75aympd52v1 | INFO 01-10 16:27:40
model_runner.py:1077] Loading model weights took 18.5769 GB\n
2025-01-10 13:27:40.958 | info | 3yn75aympd52v1 | \n
2025-01-10 13:27:40.958 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 100% Completed | 9/9 [00:11<00:00, 1.29s/it]\n
2025-01-10 13:27:40.736 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 100% Completed | 9/9 [00:11<00:00, 1.03s/it]\n
2025-01-10 13:27:40.736 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 89% Completed | 8/9 [00:11<00:01, 1.38s/it]\n
2025-01-10 13:27:40.437 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=228)#[0;0m INFO 01-10 16:27:40 model_runner.py:1077] Loading model weights took
18.5769 GB\n
2025-01-10 13:27:39.384 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 78% Completed | 7/9 [00:10<00:03, 1.51s/it]\n
2025-01-10 13:27:37.869 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 67% Completed | 6/9 [00:08<00:04, 1.51s/it]\n
2025-01-10 13:27:36.344 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 56% Completed | 5/9 [00:07<00:06, 1.51s/it]\n
2025-01-10 13:27:34.823 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 44% Completed | 4/9 [00:05<00:07, 1.50s/it]\n
2025-01-10 13:27:33.265 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 33% Completed | 3/9 [00:04<00:08, 1.46s/it]\n
2025-01-10 13:27:31.655 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 22% Completed | 2/9 [00:02<00:09, 1.33s/it]\n
2025-01-10 13:27:30.106 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 11% Completed | 1/9 [00:01<00:08, 1.02s/it]\n
2025-01-10 13:27:29.082 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 0% Completed | 0/9 [00:00<?, ?it/s]\n
2025-01-10 13:27:29.082 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=228)#[0;0m INFO 01-10 16:27:28 model_runner.py:1072] Starting to load model
/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83...\n
2025-01-10 13:27:29.082 | info | 3yn75aympd52v1 | INFO 01-10 16:27:28
model_runner.py:1072] Starting to load model /models/huggingface-cache/hub/models--
loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83...\n
2025-01-10 13:27:29.082 | info | 3yn75aympd52v1 | INFO 01-10 16:27:28
shm_broadcast.py:236] vLLM message queue communication handle:
Handle(connect_ip='127.0.0.1', local_reader_ranks=[1],
buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at
0x7809c6281f90>, local_subscribe_port=41563, remote_subscribe_port=None)\n
2025-01-10 13:27:29.082 | info | 3yn75aympd52v1 | INFO 01-10 16:27:28
custom_all_reduce_utils.py:242] reading GPU P2P access cache from
/root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json\n
2025-01-10 13:27:28.706 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=228)#[0;0m INFO 01-10 16:27:28 custom_all_reduce_utils.py:242] reading GPU P2P
access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json\n
2025-01-10 13:27:15.794 | info | 3yn75aympd52v1 | INFO 01-10 16:27:15
custom_all_reduce_utils.py:204] generating GPU P2P access cache in
/root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json\n
2025-01-10 13:27:15.794 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=228)#[0;0m INFO 01-10 16:27:15 pynccl.py:69] vLLM is using nccl==2.21.5\n
2025-01-10 13:27:15.794 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=228)#[0;0m INFO 01-10 16:27:15 utils.py:960] Found nccl from library
libnccl.so.2\n
2025-01-10 13:27:15.794 | info | 3yn75aympd52v1 | INFO 01-10 16:27:15
pynccl.py:69] vLLM is using nccl==2.21.5\n
2025-01-10 13:27:15.274 | info | 3yn75aympd52v1 | INFO 01-10 16:27:15
utils.py:960] Found nccl from library libnccl.so.2\n
2025-01-10 13:27:15.274 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=228)#[0;0m INFO 01-10 16:27:13 multiproc_worker_utils.py:215] Worker ready;
awaiting tasks\n
2025-01-10 13:27:15.274 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=228)#[0;0m INFO 01-10 16:27:13 selector.py:135] Using Flash Attention backend.\n
2025-01-10 13:27:13.937 | info | 3yn75aympd52v1 | INFO 01-10 16:27:13
selector.py:135] Using Flash Attention backend.\n
2025-01-10 13:27:13.781 | info | 3yn75aympd52v1 | INFO 01-10 16:27:13
custom_cache_manager.py:17] Setting Triton cache manager to:
vllm.triton_utils.custom_cache_manager:CustomCacheManager\n
2025-01-10 13:27:13.729 | warning | 3yn75aympd52v1 | WARNING 01-10 16:27:13
multiproc_gpu_executor.py:56] Reducing Torch parallelism from 48 threads to 1 to
avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment
to tune this value as needed.\n
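
The warning above can be tuned by setting OMP_NUM_THREADS before the executor spawns its workers; a sketch (the value 8 is illustrative):

    import os
    os.environ.setdefault("OMP_NUM_THREADS", "8")   # must run before torch/vLLM initialize their thread pools
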
2025-01-10 13:27:13.729 | info | 3yn75aympd52v1 | INFO 01-10 16:27:13
llm_engine.py:249] Initializing an LLM engine (v0.6.4) with config:
model='/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/
snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', speculative_config=None,
tokenizer='/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/
snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', skip_tokenizer_init=False,
tokenizer_mode=auto, revision=None, override_neuron_config=None,
tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16,
max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO,
tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False,
quantization=awq, enforce_eager=False, kv_cache_dtype=auto,
quantization_param_path=None, device_config=cuda,
decoding_config=DecodingConfig(guided_decoding_backend='outlines'),
observability_config=ObservabilityConfig(otlp_traces_endpoint=None,
collect_model_forward_time=False, collect_model_execute_time=False), seed=0,
served_model_name=/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-
v4.6-AWQ/snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83, num_scheduler_steps=1,
chunked_prefill_enabled=False multi_step_stream_outputs=True,
enable_prefix_caching=True, use_async_output_proc=True, use_cached_outputs=False,
chat_template_text_format=string, mm_processor_kwargs=None, pooler_config=None)\n
2025-01-10 13:27:13.729 | info | 3yn75aympd52v1 | INFO 01-10 16:27:13
config.py:1020] Defaulting to use mp for distributed inference\n
2025-01-10 13:27:13.729 | warning | 3yn75aympd52v1 | WARNING 01-10 16:27:13
config.py:428] awq quantization is not fully optimized yet. The speed can be slower
than non-quantized models.\n
2025-01-10 13:27:13.729 | info | 3yn75aympd52v1 | INFO 01-10 16:27:13
awq_marlin.py:113] Detected that the model can run with awq_marlin, however you
specified quantization=awq explicitly, so forcing awq. Use quantization=awq_marlin
for faster inference\n
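
Per the hint above, the forced-awq path can be avoided by requesting the marlin kernels explicitly; a sketch of the single changed argument (all other args as in the dumps above):

    from vllm.engine.arg_utils import AsyncEngineArgs

    engine_args = AsyncEngineArgs(
        model=("/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/"
               "snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83"),
        quantization="awq_marlin",   # instead of 'awq', per awq_marlin.py:113
    )
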
2025-01-10 13:27:13.729 | info | 3yn75aympd52v1 | INFO 01-10 16:27:13
config.py:350] This model supports multiple tasks: {'generate', 'embedding'}.
Defaulting to 'generate'.\n
2025-01-10 13:27:13.154 | info | 3yn75aympd52v1 | tokenizer_name_or_path:
/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83, tokenizer_revision: None,
trust_remote_code: False\n
2025-01-10 13:27:13.154 | info | 3yn75aympd52v1 | engine.py :26 2025-
01-10 16:27:07,459 Engine args:
AsyncEngineArgs(model='/models/huggingface-cache/hub/models--loremipsum3658--
Bacchus-v4.6-AWQ/snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83',
served_model_name=None, tokenizer='/models/huggingface-cache/hub/models--
loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', task='auto', skip_tokenizer_init=False,
tokenizer_mode='auto', chat_template_text_format='string', trust_remote_code=False,
allowed_local_media_path='', download_dir=None, load_format='auto',
config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto',
quantization_param_path=None, seed=0, max_model_len=8192, worker_use_ray=False,
distributed_executor_backend=None, pipeline_parallel_size=1,
tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16,
enable_prefix_caching=True, disable_sliding_window=False,
use_v2_block_manager='true', swap_space=4, cpu_offload_gb=0,
gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256,
max_logprobs=20, disable_log_stats=False, revision=None, code_revision=None,
rope_scaling=None, rope_theta=None, hf_overrides=None, tokenizer_revision=None,
quantization='awq', enforce_eager=False, max_seq_len_to_capture=8192,
disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray',
tokenizer_pool_extra_config=None, limit_mm_per_prompt=None,
mm_processor_kwargs=None, enable_lora=False, enable_lora_bias=False, max_loras=1,
max_lora_rank=16, enable_prompt_adapter=False, max_prompt_adapters=1,
max_prompt_adapter_token=0, fully_sharded_loras=False, lora_extra_vocab_size=256,
long_lora_scaling_factors=None, lora_dtype='auto', max_cpu_loras=None,
device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True,
ray_workers_use_nsight=False, num_gpu_blocks_override=None, num_lookahead_slots=0,
model_loader_extra_config=None, ignore_patterns=None, preemption_mode=None,
scheduler_delay_factor=0.0, enable_chunked_prefill=None,
guided_decoding_backend='outlines', speculative_model=None,
speculative_model_quantization=None, speculative_draft_tensor_parallel_size=None,
num_speculative_tokens=None, speculative_disable_mqa_scorer=False,
speculative_max_model_len=None, speculative_disable_by_batch_size=None,
ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None,
spec_decoding_acceptance_method='rejection_sampler',
typical_acceptance_sampler_posterior_threshold=None,
typical_acceptance_sampler_posterior_alpha=None, qlora_adapter_name_or_path=None,
disable_logprobs_during_spec_decoding=None, otlp_traces_endpoint=None,
collect_detailed_traces=None, disable_async_output_proc=False,
scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None,
disable_log_requests=False)\n
2025-01-10 13:27:07.424 | info | 3yn75aympd52v1 | engine_args.py :126 2025-
01-10 16:27:07,424 Using baked in model with args: {'MODEL_NAME':
'/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', 'TOKENIZER_NAME': '/models/huggingface-
cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83'}\n
2025-01-10 12:14:08.168 | info | 3yn75aympd52v1 |
warnings.warn('resource_tracker: There appear to be %d '\n
2025-01-10 12:14:07.128 | info | 3yn75aympd52v1 |
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning:
resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at
shutdown\n
2025-01-10 12:14:06.092 | warning | 3yn75aympd52v1 | [rank0]:[W110
15:14:06.755547102 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has
NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the
application should call destroy_process_group to ensure that any pending NCCL
operations have finished in this process. In rare cases this process can exit
before this point and block the progress of another member of the process group.
This constraint has always been present, but this warning has only been added
since PyTorch 2.4 (function operator())\n
2025-01-10 12:14:06.092 | info | 3yn75aympd52v1 | INFO 01-10 15:14:04
async_llm_engine.py:62] Engine is gracefully shutting down.\n
2025-01-10 12:14:04.505 | info | 3yn75aympd52v1 | Kill worker.
2025-01-10 11:54:21.263 | info | 3yn75aympd52v1 | Finished.
2025-01-10 11:54:21.211 | info | 3yn75aympd52v1 | Finished running generator.
2025-01-10 11:54:21.073 | info | 3yn75aympd52v1 | INFO 01-10 14:54:21
async_llm_engine.py:176] Finished request 59eadcbebddf4f3986ece12f5a9e9ad3.\n
2025-01-10 11:54:21.073 | info | 3yn75aympd52v1 | INFO 01-10 14:54:18
metrics.py:465] Prefix cache hit rate: GPU: 36.76%, CPU: 0.00%\n
2025-01-10 11:54:18.434 | info | 3yn75aympd52v1 | INFO 01-10 14:54:18
metrics.py:449] Avg prompt throughput: 4.5 tokens/s, Avg generation throughput: 0.0
tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage:
2.2%, CPU KV cache usage: 0.0%.\n
2025-01-10 11:54:18.434 | info | 3yn75aympd52v1 | INFO 01-10 14:54:17
async_llm_engine.py:208] Added request 59eadcbebddf4f3986ece12f5a9e9ad3.\n
2025-01-10 11:54:18.434 | info | 3yn75aympd52v1 | Jobs in progress: 1
2025-01-10 11:54:17.516 | info | 3yn75aympd52v1 | Jobs in queue: 1
2025-01-10 11:54:17.516 | info | 3yn75aympd52v1 | Finished.
2025-01-10 11:43:53.600 | info | 3yn75aympd52v1 | Finished running generator.
2025-01-10 11:43:53.496 | info | 3yn75aympd52v1 | INFO 01-10 14:43:53
async_llm_engine.py:176] Finished request 55c8f07229ae4b338946079e292342da.\n
2025-01-10 11:43:53.496 | info | 3yn75aympd52v1 | INFO 01-10 14:43:52
metrics.py:465] Prefix cache hit rate: GPU: 0.00%, CPU: 0.00%\n
2025-01-10 11:43:52.945 | info | 3yn75aympd52v1 | INFO 01-10 14:43:52
metrics.py:449] Avg prompt throughput: 523.8 tokens/s, Avg generation throughput:
9.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 2.1%, CPU KV cache usage: 0.0%.\n
2025-01-10 11:43:52.945 | info | 3yn75aympd52v1 | INFO 01-10 14:43:48
async_llm_engine.py:208] Added request 55c8f07229ae4b338946079e292342da.\n
2025-01-10 11:43:52.944 | info | 3yn75aympd52v1 | Jobs in progress: 1
2025-01-10 11:43:52.944 | info | 3yn75aympd52v1 | Jobs in queue: 1
2025-01-10 11:43:48.632 | info | 3yn75aympd52v1 | --- Starting Serverless Worker |
Version 1.7.5 ---\n
2025-01-10 11:43:48.632 | info | 3yn75aympd52v1 | INFO 01-10 14:43:48
chat_utils.py:431] ' }}{% endif %}\n
2025-01-10 11:43:48.632 | info | 3yn75aympd52v1 | INFO 01-10 14:43:48
chat_utils.py:431] \n
2025-01-10 11:43:48.632 | info | 3yn75aympd52v1 | INFO 01-10 14:43:48
chat_utils.py:431] '+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0
== 0 %}{% set content = '<|begin_of_text|>' + content %}{% endif %}{{ content }}{%
endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|
end_header_id|>\n
2025-01-10 11:43:48.632 | info | 3yn75aympd52v1 | INFO 01-10 14:43:48
chat_utils.py:431] \n
2025-01-10 11:43:48.632 | info | 3yn75aympd52v1 | INFO 01-10 14:43:48
chat_utils.py:431] {% set loop_messages = messages %}{% for message in
loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|
end_header_id|>\n
2025-01-10 11:43:48.632 | info | 3yn75aympd52v1 | INFO 01-10 14:43:48
chat_utils.py:431] Using supplied chat template:\n
2025-01-10 11:43:48.225 | info | 3yn75aympd52v1 | tokenizer_name_or_path:
/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83, tokenizer_revision: None,
trust_remote_code: False\n
2025-01-10 11:43:48.225 | info | 3yn75aympd52v1 | engine.py :26 2025-
01-10 14:43:47,934 Engine args:
AsyncEngineArgs(model='/models/huggingface-cache/hub/models--loremipsum3658--
Bacchus-v4.6-AWQ/snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83',
served_model_name=None, tokenizer='/models/huggingface-cache/hub/models--
loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', task='auto', skip_tokenizer_init=False,
tokenizer_mode='auto', chat_template_text_format='string', trust_remote_code=False,
allowed_local_media_path='', download_dir=None, load_format='auto',
config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto',
quantization_param_path=None, seed=0, max_model_len=8192, worker_use_ray=False,
distributed_executor_backend=None, pipeline_parallel_size=1,
tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16,
enable_prefix_caching=True, disable_sliding_window=False,
use_v2_block_manager='true', swap_space=4, cpu_offload_gb=0,
gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256,
max_logprobs=20, disable_log_stats=False, revision=None, code_revision=None,
rope_scaling=None, rope_theta=None, hf_overrides=None, tokenizer_revision=None,
quantization='awq', enforce_eager=False, max_seq_len_to_capture=8192,
disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray',
tokenizer_pool_extra_config=None, limit_mm_per_prompt=None,
mm_processor_kwargs=None, enable_lora=False, enable_lora_bias=False, max_loras=1,
max_lora_rank=16, enable_prompt_adapter=False, max_prompt_adapters=1,
max_prompt_adapter_token=0, fully_sharded_loras=False, lora_extra_vocab_size=256,
long_lora_scaling_factors=None, lora_dtype='auto', max_cpu_loras=None,
device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True,
ray_workers_use_nsight=False, num_gpu_blocks_override=None, num_lookahead_slots=0,
model_loader_extra_config=None, ignore_patterns=None, preemption_mode=None,
scheduler_delay_factor=0.0, enable_chunked_prefill=None,
guided_decoding_backend='outlines', speculative_model=None,
speculative_model_quantization=None, speculative_draft_tensor_parallel_size=None,
num_speculative_tokens=None, speculative_disable_mqa_scorer=False,
speculative_max_model_len=None, speculative_disable_by_batch_size=None,
ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None,
spec_decoding_acceptance_method='rejection_sampler',
typical_acceptance_sampler_posterior_threshold=None,
typical_acceptance_sampler_posterior_alpha=None, qlora_adapter_name_or_path=None,
disable_logprobs_during_spec_decoding=None, otlp_traces_endpoint=None,
collect_detailed_traces=None, disable_async_output_proc=False,
scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None,
disable_log_requests=False)\n
2025-01-10 11:43:48.225 | info | 3yn75aympd52v1 | engine_args.py :126 2025-
01-10 14:43:47,929 Using baked in model with args: {'MODEL_NAME':
'/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', 'TOKENIZER_NAME': '/models/huggingface-
cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83'}\n
2025-01-10 11:43:47.929 | info | 3yn75aympd52v1 | engine.py :112 2025-
01-10 14:43:47,929 Initialized vLLM engine in 69.75s\n
2025-01-10 11:43:47.645 | info | 3yn75aympd52v1 | INFO 01-10 14:43:47
model_runner.py:1518] Graph capturing finished in 30 secs, took 0.99 GiB\n
2025-01-10 11:43:47.645 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=228)#[0;0m INFO 01-10 14:43:47 model_runner.py:1518] Graph capturing finished
in 30 secs, took 0.99 GiB\n
2025-01-10 11:43:47.587 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=228)#[0;0m INFO 01-10 14:43:47 custom_all_reduce.py:224] Registering 5635 cuda
graph addresses\n
2025-01-10 11:43:47.374 | info | 3yn75aympd52v1 | INFO 01-10 14:43:47
custom_all_reduce.py:224] Registering 5635 cuda graph addresses\n
2025-01-10 11:43:47.374 | info | 3yn75aympd52v1 | INFO 01-10 14:43:17
model_runner.py:1404] If out-of-memory error occurs during cudagraph capture,
consider decreasing `gpu_memory_utilization` or switching to eager mode. You can
also reduce the `max_num_seqs` as needed to decrease memory usage.\n
2025-01-10 11:43:47.374 | info | 3yn75aympd52v1 | INFO 01-10 14:43:17
model_runner.py:1400] Capturing cudagraphs for decoding. This may lead to
unexpected consequences if the model is not static. To run the model in eager mode,
set 'enforce_eager=True' or use '--enforce-eager' in the CLI.\n
2025-01-10 11:43:47.373 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=228)#[0;0m INFO 01-10 14:43:17 model_runner.py:1404] If out-of-memory error
occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or
switching to eager mode. You can also reduce the `max_num_seqs` as needed to
decrease memory usage.\n
2025-01-10 11:43:17.355 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=228)#[0;0m INFO 01-10 14:43:17 model_runner.py:1400] Capturing cudagraphs for
decoding. This may lead to unexpected consequences if the model is not static. To
run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in
the CLI.\n
2025-01-10 11:43:17.355 | info | 3yn75aympd52v1 | INFO 01-10 14:43:14
distributed_gpu_executor.py:61] Maximum concurrency for 8192 tokens per request:
15.53x\n
2025-01-10 11:43:14.762 | info | 3yn75aympd52v1 | INFO 01-10 14:43:14
distributed_gpu_executor.py:57] # GPU blocks: 7949, # CPU blocks: 1638\n
2025-01-10 11:43:14.593 | info | 3yn75aympd52v1 | INFO 01-10 14:43:14
worker.py:232] Memory profiling results: total_gpu_memory=44.34GiB
initial_memory_usage=19.09GiB peak_torch_memory=19.88GiB
memory_usage_post_profile=19.21Gib non_torch_memory=0.62GiB kv_cache_size=19.41GiB
gpu_memory_utilization=0.90\n
2025-01-10 11:43:14.528 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=228)#[0;0m INFO 01-10 14:43:14 worker.py:232] Memory profiling results:
total_gpu_memory=44.34GiB initial_memory_usage=19.09GiB peak_torch_memory=19.84GiB
memory_usage_post_profile=19.19Gib non_torch_memory=0.60GiB kv_cache_size=19.46GiB
gpu_memory_utilization=0.90\n
2025-01-10 11:43:14.528 | info | 3yn75aympd52v1 | INFO 01-10 14:43:07
model_runner.py:1077] Loading model weights took 18.5769 GB\n
2025-01-10 11:43:07.716 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=228)#[0;0m INFO 01-10 14:43:07 model_runner.py:1077] Loading model weights took
18.5769 GB\n
2025-01-10 11:43:07.716 | info | 3yn75aympd52v1 | \n
2025-01-10 11:43:07.716 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 100% Completed | 9/9 [00:12<00:00, 1.44s/it]\n
2025-01-10 11:43:07.489 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 100% Completed | 9/9 [00:12<00:00, 1.16s/it]\n
2025-01-10 11:43:07.263 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 89% Completed | 8/9 [00:12<00:01, 1.58s/it]\n
2025-01-10 11:43:05.903 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 78% Completed | 7/9 [00:11<00:03, 1.69s/it]\n
2025-01-10 11:43:04.164 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 67% Completed | 6/9 [00:09<00:04, 1.66s/it]\n
2025-01-10 11:43:02.439 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 56% Completed | 5/9 [00:07<00:06, 1.63s/it]\n
2025-01-10 11:43:00.784 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 44% Completed | 4/9 [00:06<00:08, 1.62s/it]\n
2025-01-10 11:42:59.133 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 33% Completed | 3/9 [00:04<00:09, 1.60s/it]\n
2025-01-10 11:42:57.392 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 22% Completed | 2/9 [00:02<00:10, 1.48s/it]\n
2025-01-10 11:42:55.703 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 11% Completed | 1/9 [00:01<00:09, 1.17s/it]\n
2025-01-10 11:42:54.532 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 0% Completed | 0/9 [00:00<?, ?it/s]\n
2025-01-10 11:42:54.532 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=228)#[0;0m INFO 01-10 14:42:54 model_runner.py:1072] Starting to load model
/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83...\n
2025-01-10 11:42:54.532 | info | 3yn75aympd52v1 | INFO 01-10 14:42:54
model_runner.py:1072] Starting to load model /models/huggingface-cache/hub/models--
loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83...\n
2025-01-10 11:42:54.532 | info | 3yn75aympd52v1 | INFO 01-10 14:42:54
shm_broadcast.py:236] vLLM message queue communication handle:
Handle(connect_ip='127.0.0.1', local_reader_ranks=[1],
buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at
0x78f709e7de70>, local_subscribe_port=38717, remote_subscribe_port=None)\n
2025-01-10 11:42:54.532 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=228)#[0;0m INFO 01-10 14:42:54 custom_all_reduce_utils.py:242] reading GPU P2P
access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json\n
2025-01-10 11:42:54.234 | info | 3yn75aympd52v1 | INFO 01-10 14:42:54
custom_all_reduce_utils.py:242] reading GPU P2P access cache from
/root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json\n
2025-01-10 11:42:54.234 | info | 3yn75aympd52v1 | INFO 01-10 14:42:43
custom_all_reduce_utils.py:204] generating GPU P2P access cache in
/root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json\n
2025-01-10 11:42:54.234 | info | 3yn75aympd52v1 | INFO 01-10 14:42:43
pynccl.py:69] vLLM is using nccl==2.21.5\n
2025-01-10 11:42:54.234 | info | 3yn75aympd52v1 | INFO 01-10 14:42:43
utils.py:960] Found nccl from library libnccl.so.2\n
2025-01-10 11:42:54.233 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=228)#[0;0m INFO 01-10 14:42:43 pynccl.py:69] vLLM is using nccl==2.21.5\n
2025-01-10 11:42:54.233 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=228)#[0;0m INFO 01-10 14:42:43 utils.py:960] Found nccl from library
libnccl.so.2\n
2025-01-10 11:42:54.233 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=228)#[0;0m INFO 01-10 14:42:42 multiproc_worker_utils.py:215] Worker ready;
awaiting tasks\n
2025-01-10 11:42:54.233 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=228)#[0;0m INFO 01-10 14:42:42 selector.py:135] Using Flash Attention backend.\n
2025-01-10 11:42:54.233 | info | 3yn75aympd52v1 | INFO 01-10 14:42:42
selector.py:135] Using Flash Attention backend.\n
2025-01-10 11:42:54.233 | info | 3yn75aympd52v1 | INFO 01-10 14:42:42
custom_cache_manager.py:17] Setting Triton cache manager to:
vllm.triton_utils.custom_cache_manager:CustomCacheManager\n
2025-01-10 11:42:54.233 | warning | 3yn75aympd52v1 | WARNING 01-10 14:42:42
multiproc_gpu_executor.py:56] Reducing Torch parallelism from 48 threads to 1 to
avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment
to tune this value as needed.\n
2025-01-10 11:42:54.233 | info | 3yn75aympd52v1 | INFO 01-10 14:42:42
llm_engine.py:249] Initializing an LLM engine (v0.6.4) with config:
model='/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/
snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', speculative_config=None,
tokenizer='/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/
snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', skip_tokenizer_init=False,
tokenizer_mode=auto, revision=None, override_neuron_config=None,
tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16,
max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO,
tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False,
quantization=awq, enforce_eager=False, kv_cache_dtype=auto,
quantization_param_path=None, device_config=cuda,
decoding_config=DecodingConfig(guided_decoding_backend='outlines'),
observability_config=ObservabilityConfig(otlp_traces_endpoint=None,
collect_model_forward_time=False, collect_model_execute_time=False), seed=0,
served_model_name=/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-
v4.6-AWQ/snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83, num_scheduler_steps=1,
chunked_prefill_enabled=False multi_step_stream_outputs=True,
enable_prefix_caching=True, use_async_output_proc=True, use_cached_outputs=False,
chat_template_text_format=string, mm_processor_kwargs=None, pooler_config=None)\n
2025-01-10 11:42:54.233 | info | 3yn75aympd52v1 | INFO 01-10 14:42:42
config.py:1020] Defaulting to use mp for distributed inference\n
2025-01-10 11:42:54.233 | warning | 3yn75aympd52v1 | WARNING 01-10 14:42:42
config.py:428] awq quantization is not fully optimized yet. The speed can be slower
than non-quantized models.\n
2025-01-10 11:42:54.233 | info | 3yn75aympd52v1 | INFO 01-10 14:42:42
awq_marlin.py:113] Detected that the model can run with awq_marlin, however you
specified quantization=awq explicitly, so forcing awq. Use quantization=awq_marlin
for faster inference\n
2025-01-10 11:42:54.233 | info | 3yn75aympd52v1 | INFO 01-10 14:42:42
config.py:350] This model supports multiple tasks: {'generate', 'embedding'}.
Defaulting to 'generate'.\n
2025-01-10 11:42:54.233 | info | 3yn75aympd52v1 | tokenizer_name_or_path:
/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83, tokenizer_revision: None,
trust_remote_code: False\n
2025-01-10 11:38:52.063 | info | 3yn75aympd52v1 |
warnings.warn('resource_tracker: There appear to be %d '\n
2025-01-10 11:38:51.018 | info | 3yn75aympd52v1 |
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning:
resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at
shutdown\n
2025-01-10 11:38:50.069 | warning | 3yn75aympd52v1 | [rank0]:[W110
14:38:50.732630541 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has
NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the
application should call destroy_process_group to ensure that any pending NCCL
operations have finished in this process. In rare cases this process can exit
before this point and block the progress of another member of the process group.
This constraint has always been present, but this warning has only been added
since PyTorch 2.4 (function operator())\n
2025-01-10 11:38:50.069 | info | 3yn75aympd52v1 | INFO 01-10 14:38:48
async_llm_engine.py:62] Engine is gracefully shutting down.\n
2025-01-10 11:38:48.520 | info | 3yn75aympd52v1 | Kill worker.
2025-01-10 11:38:48.520 | info | 3yn75aympd52v1 | Finished.
2025-01-10 11:34:06.349 | info | 3yn75aympd52v1 | Finished running generator.
2025-01-10 11:34:06.225 | info | 3yn75aympd52v1 | INFO 01-10 14:34:06
async_llm_engine.py:176] Finished request 48af1202e7a645bba14346888c841c1e.\n
2025-01-10 11:34:06.225 | info | 3yn75aympd52v1 | INFO 01-10 14:34:03
metrics.py:465] Prefix cache hit rate: GPU: 67.26%, CPU: 0.00%\n
2025-01-10 11:34:03.554 | info | 3yn75aympd52v1 | INFO 01-10 14:34:03
metrics.py:449] Avg prompt throughput: 27.0 tokens/s, Avg generation throughput:
1.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 1.0%, CPU KV cache usage: 0.0%.\n
2025-01-10 11:34:03.554 | info | 3yn75aympd52v1 | INFO 01-10 14:34:03
async_llm_engine.py:208] Added request 48af1202e7a645bba14346888c841c1e.\n
2025-01-10 11:34:03.554 | info | 3yn75aympd52v1 | Jobs in progress: 1
2025-01-10 11:34:03.481 | info | 3yn75aympd52v1 | Jobs in queue: 1
2025-01-10 11:34:03.480 | info | 3yn75aympd52v1 | Finished.
2025-01-10 11:33:19.149 | info | 3yn75aympd52v1 | Finished running generator.
2025-01-10 11:33:19.039 | info | 3yn75aympd52v1 | INFO 01-10 14:33:19
async_llm_engine.py:176] Finished request 7b0b0d8e7bf34ea6ac6521c531da989c.\n
2025-01-10 11:33:19.039 | info | 3yn75aympd52v1 | INFO 01-10 14:33:16
metrics.py:465] Prefix cache hit rate: GPU: 49.66%, CPU: 0.00%\n
2025-01-10 11:33:16.236 | info | 3yn75aympd52v1 | INFO 01-10 14:33:16
metrics.py:449] Avg prompt throughput: 31.4 tokens/s, Avg generation throughput:
1.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 1.0%, CPU KV cache usage: 0.0%.\n
2025-01-10 11:33:16.236 | info | 3yn75aympd52v1 | INFO 01-10 14:33:16
async_llm_engine.py:208] Added request 7b0b0d8e7bf34ea6ac6521c531da989c.\n
2025-01-10 11:33:16.236 | info | 3yn75aympd52v1 | Jobs in progress: 1
2025-01-10 11:33:16.109 | info | 3yn75aympd52v1 | Jobs in queue: 1
2025-01-10 11:33:16.109 | info | 3yn75aympd52v1 | Finished.
2025-01-10 11:32:39.708 | info | 3yn75aympd52v1 | Finished running generator.
2025-01-10 11:32:39.597 | info | 3yn75aympd52v1 | INFO 01-10 14:32:39
async_llm_engine.py:176] Finished request 3690ce4c84b6483c822fbff4803c3ca6.\n
2025-01-10 11:32:39.597 | info | 3yn75aympd52v1 | INFO 01-10 14:32:37
metrics.py:465] Prefix cache hit rate: GPU: 0.00%, CPU: 0.00%\n
2025-01-10 11:32:37.897 | info | 3yn75aympd52v1 | INFO 01-10 14:32:37
metrics.py:449] Avg prompt throughput: 3.9 tokens/s, Avg generation throughput: 0.0
tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage:
0.9%, CPU KV cache usage: 0.0%.\n
2025-01-10 11:32:37.897 | info | 3yn75aympd52v1 | INFO 01-10 14:32:36
async_llm_engine.py:208] Added request 3690ce4c84b6483c822fbff4803c3ca6.\n
2025-01-10 11:32:37.897 | info | 3yn75aympd52v1 | Jobs in progress: 1
2025-01-10 11:32:36.737 | info | 3yn75aympd52v1 | Jobs in queue: 1
2025-01-10 11:32:36.737 | info | 3yn75aympd52v1 | Finished.
2025-01-10 11:27:43.693 | info | 3yn75aympd52v1 | Finished running generator.
2025-01-10 11:27:43.577 | info | 3yn75aympd52v1 | INFO 01-10 14:27:43
async_llm_engine.py:176] Finished request 9c5abc581911481b95d316c73ba566b1.\n
2025-01-10 11:27:43.577 | info | 3yn75aympd52v1 | INFO 01-10 14:27:42
metrics.py:465] Prefix cache hit rate: GPU: 0.00%, CPU: 0.00%\n
2025-01-10 11:27:42.981 | info | 3yn75aympd52v1 | INFO 01-10 14:27:42
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
21.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.5%, CPU KV cache usage: 0.0%.\n
2025-01-10 11:27:42.980 | info | 3yn75aympd52v1 | INFO 01-10 14:27:37
metrics.py:465] Prefix cache hit rate: GPU: 0.00%, CPU: 0.00%\n
2025-01-10 11:27:37.970 | info | 3yn75aympd52v1 | INFO 01-10 14:27:37
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
21.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.4%, CPU KV cache usage: 0.0%.\n
2025-01-10 11:27:37.970 | info | 3yn75aympd52v1 | INFO 01-10 14:27:32
metrics.py:465] Prefix cache hit rate: GPU: 0.00%, CPU: 0.00%\n
2025-01-10 11:27:32.945 | info | 3yn75aympd52v1 | INFO 01-10 14:27:32
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
21.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.4%, CPU KV cache usage: 0.0%.\n
2025-01-10 11:27:32.945 | info | 3yn75aympd52v1 | INFO 01-10 14:27:27
metrics.py:465] Prefix cache hit rate: GPU: 0.00%, CPU: 0.00%\n
2025-01-10 11:27:27.936 | info | 3yn75aympd52v1 | INFO 01-10 14:27:27
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
21.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.3%, CPU KV cache usage: 0.0%.\n
2025-01-10 11:27:27.936 | info | 3yn75aympd52v1 | INFO 01-10 14:27:22
metrics.py:465] Prefix cache hit rate: GPU: 0.00%, CPU: 0.00%\n
2025-01-10 11:27:22.890 | info | 3yn75aympd52v1 | INFO 01-10 14:27:22
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
21.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.2%, CPU KV cache usage: 0.0%.\n
2025-01-10 11:27:22.890 | info | 3yn75aympd52v1 | INFO 01-10 14:27:17
metrics.py:465] Prefix cache hit rate: GPU: 0.00%, CPU: 0.00%\n
2025-01-10 11:27:17.851 | info | 3yn75aympd52v1 | INFO 01-10 14:27:17
metrics.py:449] Avg prompt throughput: 4.6 tokens/s, Avg generation throughput:
18.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.1%, CPU KV cache usage: 0.0%.\n
2025-01-10 11:27:17.851 | info | 3yn75aympd52v1 | INFO 01-10 14:27:13
async_llm_engine.py:208] Added request 9c5abc581911481b95d316c73ba566b1.\n
2025-01-10 11:27:17.851 | info | 3yn75aympd52v1 | Jobs in progress: 1
2025-01-10 11:27:17.850 | info | 3yn75aympd52v1 | Jobs in queue: 1
2025-01-10 11:27:13.558 | info | 3yn75aympd52v1 | --- Starting Serverless Worker |
Version 1.7.5 ---\n
2025-01-10 11:27:13.558 | info | 3yn75aympd52v1 | INFO 01-10 14:27:13
chat_utils.py:431] ' }}{% endif %}\n
2025-01-10 11:27:13.558 | info | 3yn75aympd52v1 | INFO 01-10 14:27:13
chat_utils.py:431] \n
2025-01-10 11:27:13.558 | info | 3yn75aympd52v1 | INFO 01-10 14:27:13
chat_utils.py:431] '+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0
== 0 %}{% set content = '<|begin_of_text|>' + content %}{% endif %}{{ content }}{%
endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|
end_header_id|>\n
2025-01-10 11:27:13.558 | info | 3yn75aympd52v1 | INFO 01-10 14:27:13
chat_utils.py:431] \n
2025-01-10 11:27:13.558 | info | 3yn75aympd52v1 | INFO 01-10 14:27:13
chat_utils.py:431] {% set loop_messages = messages %}{% for message in
loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|
end_header_id|>\n
2025-01-10 11:27:13.557 | info | 3yn75aympd52v1 | INFO 01-10 14:27:13
chat_utils.py:431] Using supplied chat template:\n
2025-01-10 11:27:13.197 | info | 3yn75aympd52v1 | tokenizer_name_or_path:
/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83, tokenizer_revision: None,
trust_remote_code: False\n
2025-01-10 11:27:13.197 | info | 3yn75aympd52v1 | engine.py :26 2025-
01-10 14:27:12,842 Engine args:
AsyncEngineArgs(model='/models/huggingface-cache/hub/models--loremipsum3658--
Bacchus-v4.6-AWQ/snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83',
served_model_name=None, tokenizer='/models/huggingface-cache/hub/models--
loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', task='auto', skip_tokenizer_init=False,
tokenizer_mode='auto', chat_template_text_format='string', trust_remote_code=False,
allowed_local_media_path='', download_dir=None, load_format='auto',
config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto',
quantization_param_path=None, seed=0, max_model_len=8192, worker_use_ray=False,
distributed_executor_backend=None, pipeline_parallel_size=1,
tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16,
enable_prefix_caching=True, disable_sliding_window=False,
use_v2_block_manager='true', swap_space=4, cpu_offload_gb=0,
gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256,
max_logprobs=20, disable_log_stats=False, revision=None, code_revision=None,
rope_scaling=None, rope_theta=None, hf_overrides=None, tokenizer_revision=None,
quantization='awq', enforce_eager=False, max_seq_len_to_capture=8192,
disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray',
tokenizer_pool_extra_config=None, limit_mm_per_prompt=None,
mm_processor_kwargs=None, enable_lora=False, enable_lora_bias=False, max_loras=1,
max_lora_rank=16, enable_prompt_adapter=False, max_prompt_adapters=1,
max_prompt_adapter_token=0, fully_sharded_loras=False, lora_extra_vocab_size=256,
long_lora_scaling_factors=None, lora_dtype='auto', max_cpu_loras=None,
device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True,
ray_workers_use_nsight=False, num_gpu_blocks_override=None, num_lookahead_slots=0,
model_loader_extra_config=None, ignore_patterns=None, preemption_mode=None,
scheduler_delay_factor=0.0, enable_chunked_prefill=None,
guided_decoding_backend='outlines', speculative_model=None,
speculative_model_quantization=None, speculative_draft_tensor_parallel_size=None,
num_speculative_tokens=None, speculative_disable_mqa_scorer=False,
speculative_max_model_len=None, speculative_disable_by_batch_size=None,
ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None,
spec_decoding_acceptance_method='rejection_sampler',
typical_acceptance_sampler_posterior_threshold=None,
typical_acceptance_sampler_posterior_alpha=None, qlora_adapter_name_or_path=None,
disable_logprobs_during_spec_decoding=None, otlp_traces_endpoint=None,
collect_detailed_traces=None, disable_async_output_proc=False,
scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None,
disable_log_requests=False)\n
2025-01-10 11:27:13.197 | info | 3yn75aympd52v1 | engine_args.py :126 2025-
01-10 14:27:12,837 Using baked in model with args: {'MODEL_NAME':
'/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', 'TOKENIZER_NAME': '/models/huggingface-
cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83'}\n
2025-01-10 11:27:12.838 | info | 3yn75aympd52v1 | engine.py :112 2025-
01-10 14:27:12,837 Initialized vLLM engine in 66.86s\n
2025-01-10 11:27:12.838 | info | 3yn75aympd52v1 | INFO 01-10 14:27:12
model_runner.py:1518] Graph capturing finished in 28 secs, took 0.99 GiB\n
2025-01-10 11:27:12.838 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=227)#[0;0m INFO 01-10 14:27:12 model_runner.py:1518] Graph capturing finished
in 28 secs, took 0.99 GiB\n
2025-01-10 11:27:12.838 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=227)#[0;0m INFO 01-10 14:27:12 custom_all_reduce.py:224] Registering 5635 cuda
graph addresses\n
2025-01-10 11:27:12.555 | info | 3yn75aympd52v1 | INFO 01-10 14:27:12
custom_all_reduce.py:224] Registering 5635 cuda graph addresses\n
2025-01-10 11:27:12.555 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=227)#[0;0m INFO 01-10 14:26:44 model_runner.py:1404] If out-of-memory error
occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or
switching to eager mode. You can also reduce the `max_num_seqs` as needed to
decrease memory usage.\n
2025-01-10 11:27:12.555 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=227)#[0;0m INFO 01-10 14:26:44 model_runner.py:1400] Capturing cudagraphs for
decoding. This may lead to unexpected consequences if the model is not static. To
run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in
the CLI.\n
2025-01-10 11:27:12.555 | info | 3yn75aympd52v1 | INFO 01-10 14:26:44
model_runner.py:1404] If out-of-memory error occurs during cudagraph capture,
consider decreasing `gpu_memory_utilization` or switching to eager mode. You can
also reduce the `max_num_seqs` as needed to decrease memory usage.\n
2025-01-10 11:26:44.515 | info | 3yn75aympd52v1 | INFO 01-10 14:26:44
model_runner.py:1400] Capturing cudagraphs for decoding. This may lead to
unexpected consequences if the model is not static. To run the model in eager mode,
set 'enforce_eager=True' or use '--enforce-eager' in the CLI.\n
2025-01-10 11:26:44.515 | info | 3yn75aympd52v1 | INFO 01-10 14:26:42
distributed_gpu_executor.py:61] Maximum concurrency for 8192 tokens per request:
15.62x\n
2025-01-10 11:26:42.190 | info | 3yn75aympd52v1 | INFO 01-10 14:26:42
distributed_gpu_executor.py:57] # GPU blocks: 7999, # CPU blocks: 1638\n
2025-01-10 11:26:42.050 | info | 3yn75aympd52v1 | INFO 01-10 14:26:42
worker.py:232] Memory profiling results: total_gpu_memory=44.34GiB
initial_memory_usage=19.04GiB peak_torch_memory=19.88GiB
memory_usage_post_profile=19.09Gib non_torch_memory=0.50GiB kv_cache_size=19.53GiB
gpu_memory_utilization=0.90\n
2025-01-10 11:26:41.994 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=227)#[0;0m INFO 01-10 14:26:41 worker.py:232] Memory profiling results:
total_gpu_memory=44.34GiB initial_memory_usage=19.04GiB peak_torch_memory=19.84GiB
memory_usage_post_profile=19.09Gib non_torch_memory=0.50GiB kv_cache_size=19.57GiB
gpu_memory_utilization=0.90\n
2025-01-10 11:26:41.994 | info | 3yn75aympd52v1 | INFO 01-10 14:26:34
model_runner.py:1077] Loading model weights took 18.5769 GB\n
2025-01-10 11:26:34.578 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=227)#[0;0m INFO 01-10 14:26:34 model_runner.py:1077] Loading model weights took
18.5769 GB\n
2025-01-10 11:26:34.578 | info | 3yn75aympd52v1 | \n
2025-01-10 11:26:34.578 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 100% Completed | 9/9 [00:12<00:00, 1.42s/it]\n
2025-01-10 11:26:34.435 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 100% Completed | 9/9 [00:12<00:00, 1.16s/it]\n
2025-01-10 11:26:34.120 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 89% Completed | 8/9 [00:12<00:01, 1.55s/it]\n
2025-01-10 11:26:32.821 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 78% Completed | 7/9 [00:11<00:03, 1.66s/it]\n
2025-01-10 11:26:31.165 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 67% Completed | 6/9 [00:09<00:05, 1.67s/it]\n
2025-01-10 11:26:29.276 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 56% Completed | 5/9 [00:07<00:06, 1.55s/it]\n
2025-01-10 11:26:27.750 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 44% Completed | 4/9 [00:06<00:07, 1.57s/it]\n
2025-01-10 11:26:26.122 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 33% Completed | 3/9 [00:04<00:09, 1.53s/it]\n
2025-01-10 11:26:24.496 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 22% Completed | 2/9 [00:02<00:10, 1.46s/it]\n
2025-01-10 11:26:22.825 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 11% Completed | 1/9 [00:01<00:09, 1.15s/it]\n
2025-01-10 11:26:21.678 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 0% Completed | 0/9 [00:00<?, ?it/s]\n
2025-01-10 11:26:21.678 | info | 3yn75aympd52v1 | #[1;36m(VllmWorkerProcess
pid=227)#[0;0m INFO 01-10 14:26:21 model_runner.py:1072] Starting to load model
/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83...\n
2025-01-10 11:26:21.678 | info | 3yn75aympd52v1 | INFO 01-10 14:26:21
model_runner.py:1072] Starting to load model /models/huggingface-cache/hub/models--
loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83...\n
2025-01-10 11:26:21.677 | info | 3yn75aympd52v1 | INFO 01-10 14:26:21
shm_broadcast.py:236] vLLM message queue communication handle:
Handle(connect_ip='127.0.0.1', local_reader_ranks=[1],
buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at
0x757adcf7df00>, local_subscribe_port=59477, remote_subscribe_port=None)\n
2025-01-10 11:26:21.677 | info | 3yn75aympd52v1 | (VllmWorkerProcess
pid=227) INFO 01-10 14:26:21 custom_all_reduce_utils.py:242] reading GPU P2P
access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json\n
2025-01-10 11:26:21.424 | info | 3yn75aympd52v1 | INFO 01-10 14:26:21
custom_all_reduce_utils.py:242] reading GPU P2P access cache from
/root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json\n
2025-01-10 11:26:21.424 | info | 3yn75aympd52v1 | INFO 01-10 14:26:11
custom_all_reduce_utils.py:204] generating GPU P2P access cache in
/root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json\n
2025-01-10 11:26:21.424 | info | 3yn75aympd52v1 | INFO 01-10 14:26:11
pynccl.py:69] vLLM is using nccl==2.21.5\n
2025-01-10 11:26:21.424 | info | 3yn75aympd52v1 | (VllmWorkerProcess
pid=227) INFO 01-10 14:26:11 pynccl.py:69] vLLM is using nccl==2.21.5\n
2025-01-10 11:26:21.424 | info | 3yn75aympd52v1 | INFO 01-10 14:26:11
utils.py:960] Found nccl from library libnccl.so.2\n
2025-01-10 11:26:21.424 | info | 3yn75aympd52v1 | (VllmWorkerProcess
pid=227) INFO 01-10 14:26:11 utils.py:960] Found nccl from library
libnccl.so.2\n
2025-01-10 11:26:21.424 | info | 3yn75aympd52v1 | (VllmWorkerProcess
pid=227) INFO 01-10 14:26:10 multiproc_worker_utils.py:215] Worker ready;
awaiting tasks\n
2025-01-10 11:26:21.424 | info | 3yn75aympd52v1 | (VllmWorkerProcess
pid=227) INFO 01-10 14:26:10 selector.py:135] Using Flash Attention backend.\n
2025-01-10 11:26:21.424 | info | 3yn75aympd52v1 | INFO 01-10 14:26:10
selector.py:135] Using Flash Attention backend.\n
2025-01-10 11:26:21.424 | info | 3yn75aympd52v1 | INFO 01-10 14:26:10
custom_cache_manager.py:17] Setting Triton cache manager to:
vllm.triton_utils.custom_cache_manager:CustomCacheManager\n
2025-01-10 11:26:21.424 | warning | 3yn75aympd52v1 | WARNING 01-10 14:26:10
multiproc_gpu_executor.py:56] Reducing Torch parallelism from 48 threads to 1 to
avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment
to tune this value as needed.\n
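The warning above pins Torch to a single CPU thread per process. If more intra-op parallelism is wanted, OMP_NUM_THREADS has to be set in the environment before torch is imported; a minimal sketch (the value 8 is illustrative, not taken from this deployment):

    import os
    os.environ["OMP_NUM_THREADS"] = "8"   # must happen before `import torch`
    import torch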
2025-01-10 11:26:21.424 | info | 3yn75aympd52v1 | INFO 01-10 14:26:09
llm_engine.py:249] Initializing an LLM engine (v0.6.4) with config:
model='/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/
snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', speculative_config=None,
tokenizer='/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/
snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', skip_tokenizer_init=False,
tokenizer_mode=auto, revision=None, override_neuron_config=None,
tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16,
max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO,
tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False,
quantization=awq, enforce_eager=False, kv_cache_dtype=auto,
quantization_param_path=None, device_config=cuda,
decoding_config=DecodingConfig(guided_decoding_backend='outlines'),
observability_config=ObservabilityConfig(otlp_traces_endpoint=None,
collect_model_forward_time=False, collect_model_execute_time=False), seed=0,
served_model_name=/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-
v4.6-AWQ/snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83, num_scheduler_steps=1,
chunked_prefill_enabled=False multi_step_stream_outputs=True,
enable_prefix_caching=True, use_async_output_proc=True, use_cached_outputs=False,
chat_template_text_format=string, mm_processor_kwargs=None, pooler_config=None)\n
2025-01-10 11:26:21.424 | info | 3yn75aympd52v1 | INFO 01-10 14:26:09
config.py:1020] Defaulting to use mp for distributed inference\n
2025-01-10 11:26:21.424 | warning | 3yn75aympd52v1 | WARNING 01-10 14:26:09
config.py:428] awq quantization is not fully optimized yet. The speed can be slower
than non-quantized models.\n
2025-01-10 11:26:21.424 | info | 3yn75aympd52v1 | INFO 01-10 14:26:09
awq_marlin.py:113] Detected that the model can run with awq_marlin, however you
specified quantization=awq explicitly, so forcing awq. Use quantization=awq_marlin
for faster inference\n
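The entry above forces plain awq because it was requested explicitly, and suggests awq_marlin for faster kernels. A hedged sketch of following that hint at the engine-args level (model path copied from this log; whether the endpoint exposes this setting is not shown here):

    from vllm import AsyncEngineArgs

    engine_args = AsyncEngineArgs(
        model="/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/"
              "snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83",
        quantization="awq_marlin",   # instead of quantization="awq"
    )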
2025-01-10 11:26:21.424 | info | 3yn75aympd52v1 | INFO 01-10 14:26:09
config.py:350] This model supports multiple tasks: {'embedding', 'generate'}.
Defaulting to 'generate'.\n
2025-01-10 11:26:21.423 | info | 3yn75aympd52v1 | tokenizer_name_or_path:
/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83, tokenizer_revision: None,
trust_remote_code: False\n
2025-01-10 11:17:34.388 | info | q5xcbyb3ls1u0i |
warnings.warn('resource_tracker: There appear to be %d '\n
2025-01-10 11:17:33.010 | info | q5xcbyb3ls1u0i |
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning:
resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at
shutdown\n
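The resource_tracker warning above means one multiprocessing shared-memory segment (vLLM's shm ring buffer) was still registered when the interpreter exited. For reference, the close/unlink pattern the tracker expects, shown as a generic example unrelated to vLLM internals:

    from multiprocessing import shared_memory

    shm = shared_memory.SharedMemory(create=True, size=1024)
    try:
        shm.buf[:5] = b"hello"
    finally:
        shm.close()    # drop this process's mapping
        shm.unlink()   # remove the segment so nothing leaks at shutdown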
2025-01-10 11:17:31.904 | warning | q5xcbyb3ls1u0i | [rank0]:[W110
14:17:31.188496691 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has
NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the
application should call destroy_process_group to ensure that any pending NCCL
operations have finished in this process. In rare cases this process can exit
before this point and block the progress of another member of the process group.
This constraint has always been present, but this warning has only been added
since PyTorch 2.4 (function operator())\n
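The NCCL warning above is PyTorch asking for an explicit teardown of the default process group before the process exits. In plain torch.distributed terms (a sketch, not the worker's actual shutdown path):

    import torch.distributed as dist

    if dist.is_initialized():
        dist.destroy_process_group()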
2025-01-10 11:17:31.903 | info | q5xcbyb3ls1u0i | INFO 01-10 14:17:29
async_llm_engine.py:62] Engine is gracefully shutting down.\n
2025-01-10 11:17:29.698 | info | q5xcbyb3ls1u0i | Kill worker.
2025-01-10 11:14:16.474 | info | q5xcbyb3ls1u0i | Finished.
2025-01-10 11:14:16.170 | info | q5xcbyb3ls1u0i | Finished running generator.
2025-01-10 11:14:15.932 | info | q5xcbyb3ls1u0i | INFO 01-10 14:14:15
async_llm_engine.py:176] Finished request 0f1b5a743e554fe7aa083edc5659dbce.\n
2025-01-10 11:14:15.932 | info | q5xcbyb3ls1u0i | INFO 01-10 14:14:11
metrics.py:465] Prefix cache hit rate: GPU: 50.00%, CPU: 0.00%\n
2025-01-10 11:14:11.717 | info | q5xcbyb3ls1u0i | INFO 01-10 14:14:11
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.5%, CPU KV cache usage: 0.0%.\n
2025-01-10 11:14:11.717 | info | q5xcbyb3ls1u0i | INFO 01-10 14:14:06
metrics.py:465] Prefix cache hit rate: GPU: 50.00%, CPU: 0.00%\n
2025-01-10 11:14:06.691 | info | q5xcbyb3ls1u0i | INFO 01-10 14:14:06
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.9 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.4%, CPU KV cache usage: 0.0%.\n
2025-01-10 11:14:06.690 | info | q5xcbyb3ls1u0i | INFO 01-10 14:14:01
metrics.py:465] Prefix cache hit rate: GPU: 50.00%, CPU: 0.00%\n
2025-01-10 11:14:01.672 | info | q5xcbyb3ls1u0i | INFO 01-10 14:14:01
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.3%, CPU KV cache usage: 0.0%.\n
2025-01-10 11:14:01.672 | info | q5xcbyb3ls1u0i | INFO 01-10 14:13:56
metrics.py:465] Prefix cache hit rate: GPU: 50.00%, CPU: 0.00%\n
2025-01-10 11:13:56.655 | info | q5xcbyb3ls1u0i | INFO 01-10 14:13:56
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.1%, CPU KV cache usage: 0.0%.\n
2025-01-10 11:13:56.655 | info | q5xcbyb3ls1u0i | INFO 01-10 14:13:51
metrics.py:465] Prefix cache hit rate: GPU: 50.00%, CPU: 0.00%\n
2025-01-10 11:13:51.652 | info | q5xcbyb3ls1u0i | INFO 01-10 14:13:51
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.2
tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage:
0.0%, CPU KV cache usage: 0.0%.\n
2025-01-10 11:13:51.652 | info | q5xcbyb3ls1u0i | INFO 01-10 14:13:51
async_llm_engine.py:208] Added request 0f1b5a743e554fe7aa083edc5659dbce.\n
2025-01-10 11:13:51.652 | info | q5xcbyb3ls1u0i | Jobs in progress: 1
2025-01-10 11:13:51.587 | info | q5xcbyb3ls1u0i | Jobs in queue: 1
2025-01-10 11:01:45.203 | info | q5xcbyb3ls1u0i | Finished.
2025-01-10 11:01:45.088 | info | q5xcbyb3ls1u0i | Finished running generator.
2025-01-10 11:01:44.963 | info | q5xcbyb3ls1u0i | INFO 01-10 14:01:44
async_llm_engine.py:176] Finished request 25c7acd97002407a93c11a34afcc2c7d.\n
2025-01-10 11:01:44.963 | info | q5xcbyb3ls1u0i | INFO 01-10 14:01:40
metrics.py:465] Prefix cache hit rate: GPU: 0.00%, CPU: 0.00%\n
2025-01-10 11:01:40.042 | info | q5xcbyb3ls1u0i | INFO 01-10 14:01:40
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.4%, CPU KV cache usage: 0.0%.\n
2025-01-10 11:01:40.042 | info | q5xcbyb3ls1u0i | INFO 01-10 14:01:35
metrics.py:465] Prefix cache hit rate: GPU: 0.00%, CPU: 0.00%\n
2025-01-10 11:01:35.040 | info | q5xcbyb3ls1u0i | INFO 01-10 14:01:35
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.2%, CPU KV cache usage: 0.0%.\n
2025-01-10 11:01:35.040 | info | q5xcbyb3ls1u0i | INFO 01-10 14:01:30
metrics.py:465] Prefix cache hit rate: GPU: 0.00%, CPU: 0.00%\n
2025-01-10 11:01:30.036 | info | q5xcbyb3ls1u0i | INFO 01-10 14:01:30
metrics.py:449] Avg prompt throughput: 4.6 tokens/s, Avg generation throughput:
29.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.1%, CPU KV cache usage: 0.0%.\n
2025-01-10 11:01:30.036 | info | q5xcbyb3ls1u0i | INFO 01-10 14:01:25
async_llm_engine.py:208] Added request 25c7acd97002407a93c11a34afcc2c7d.\n
2025-01-10 11:01:30.036 | info | q5xcbyb3ls1u0i | Jobs in progress: 1
2025-01-10 11:01:30.036 | info | q5xcbyb3ls1u0i | Jobs in queue: 1
2025-01-10 11:01:25.518 | info | q5xcbyb3ls1u0i | --- Starting Serverless Worker |
Version 1.7.5 ---\n
2025-01-10 11:01:25.518 | info | q5xcbyb3ls1u0i | INFO 01-10 14:01:25
chat_utils.py:431] ' }}{% endif %}\n
2025-01-10 11:01:25.518 | info | q5xcbyb3ls1u0i | INFO 01-10 14:01:25
chat_utils.py:431] \n
2025-01-10 11:01:25.518 | info | q5xcbyb3ls1u0i | INFO 01-10 14:01:25
chat_utils.py:431] '+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0
== 0 %}{% set content = '<|begin_of_text|>' + content %}{% endif %}{{ content }}{%
endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|
end_header_id|>\n
2025-01-10 11:01:25.518 | info | q5xcbyb3ls1u0i | INFO 01-10 14:01:25
chat_utils.py:431] \n
2025-01-10 11:01:25.518 | info | q5xcbyb3ls1u0i | INFO 01-10 14:01:25
chat_utils.py:431] {% set loop_messages = messages %}{% for message in
loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|
end_header_id|>\n
2025-01-10 11:01:25.518 | info | q5xcbyb3ls1u0i | INFO 01-10 14:01:25
chat_utils.py:431] Using supplied chat template:\n
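Because the log prints newest entries first, the chat_utils.py fragments above read bottom-to-top. Reassembled for readability, the supplied template appears to be the usual Llama-3-style chat template:

    {% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = '<|begin_of_text|>' + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}{% endif %}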
2025-01-10 11:01:25.355 | info | q5xcbyb3ls1u0i | tokenizer_name_or_path:
/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83, tokenizer_revision: None,
trust_remote_code: False\n
2025-01-10 11:01:25.355 | info | q5xcbyb3ls1u0i | engine.py :26 2025-
01-10 14:01:25,043 Engine args:
AsyncEngineArgs(model='/models/huggingface-cache/hub/models--loremipsum3658--
Bacchus-v4.6-AWQ/snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83',
served_model_name=None, tokenizer='/models/huggingface-cache/hub/models--
loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', task='auto', skip_tokenizer_init=False,
tokenizer_mode='auto', chat_template_text_format='string', trust_remote_code=False,
allowed_local_media_path='', download_dir=None, load_format='auto',
config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto',
quantization_param_path=None, seed=0, max_model_len=8192, worker_use_ray=False,
distributed_executor_backend=None, pipeline_parallel_size=1,
tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16,
enable_prefix_caching=True, disable_sliding_window=False,
use_v2_block_manager='true', swap_space=4, cpu_offload_gb=0,
gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256,
max_logprobs=20, disable_log_stats=False, revision=None, code_revision=None,
rope_scaling=None, rope_theta=None, hf_overrides=None, tokenizer_revision=None,
quantization='awq', enforce_eager=False, max_seq_len_to_capture=8192,
disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray',
tokenizer_pool_extra_config=None, limit_mm_per_prompt=None,
mm_processor_kwargs=None, enable_lora=False, enable_lora_bias=False, max_loras=1,
max_lora_rank=16, enable_prompt_adapter=False, max_prompt_adapters=1,
max_prompt_adapter_token=0, fully_sharded_loras=False, lora_extra_vocab_size=256,
long_lora_scaling_factors=None, lora_dtype='auto', max_cpu_loras=None,
device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True,
ray_workers_use_nsight=False, num_gpu_blocks_override=None, num_lookahead_slots=0,
model_loader_extra_config=None, ignore_patterns=None, preemption_mode=None,
scheduler_delay_factor=0.0, enable_chunked_prefill=None,
guided_decoding_backend='outlines', speculative_model=None,
speculative_model_quantization=None, speculative_draft_tensor_parallel_size=None,
num_speculative_tokens=None, speculative_disable_mqa_scorer=False,
speculative_max_model_len=None, speculative_disable_by_batch_size=None,
ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None,
spec_decoding_acceptance_method='rejection_sampler',
typical_acceptance_sampler_posterior_threshold=None,
typical_acceptance_sampler_posterior_alpha=None, qlora_adapter_name_or_path=None,
disable_logprobs_during_spec_decoding=None, otlp_traces_endpoint=None,
collect_detailed_traces=None, disable_async_output_proc=False,
scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None,
disable_log_requests=False)\n
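These AsyncEngineArgs are what the handler passes to vLLM when it builds the async engine whose "Added request" / "Finished request" entries appear throughout this log. A minimal sketch of that flow, repeating only the key arguments, with an illustrative prompt and request id (not taken from the log):

    import asyncio
    from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

    engine_args = AsyncEngineArgs(
        model="/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/"
              "snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83",
        quantization="awq",
        tensor_parallel_size=2,
        max_model_len=8192,
        gpu_memory_utilization=0.9,
        enable_prefix_caching=True,
    )
    engine = AsyncLLMEngine.from_engine_args(engine_args)

    async def run_one(prompt: str) -> str:
        params = SamplingParams(max_tokens=256)
        final = None
        async for output in engine.generate(prompt, params, request_id="example-1"):
            final = output              # each step yields the partial RequestOutput
        return final.outputs[0].text

    print(asyncio.run(run_one("Hello")))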
2025-01-10 11:01:25.355 | info | q5xcbyb3ls1u0i | engine_args.py :126 2025-
01-10 14:01:25,036 Using baked in model with args: {'MODEL_NAME':
'/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', 'TOKENIZER_NAME': '/models/huggingface-
cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83'}\n
2025-01-10 11:01:25.036 | info | q5xcbyb3ls1u0i | engine.py :112 2025-
01-10 14:01:25,036 Initialized vLLM engine in 57.79s\n
2025-01-10 11:01:24.706 | info | q5xcbyb3ls1u0i | INFO 01-10 14:01:24
model_runner.py:1518] Graph capturing finished in 26 secs, took 0.99 GiB\n
2025-01-10 11:01:24.706 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess
pid=228) INFO 01-10 14:01:24 model_runner.py:1518] Graph capturing finished
in 26 secs, took 0.99 GiB\n
2025-01-10 11:01:24.639 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess
pid=228) INFO 01-10 14:01:24 custom_all_reduce.py:224] Registering 5635 cuda
graph addresses\n
2025-01-10 11:01:22.860 | info | q5xcbyb3ls1u0i | INFO 01-10 14:01:22
custom_all_reduce.py:224] Registering 5635 cuda graph addresses\n
2025-01-10 11:01:22.860 | info | q5xcbyb3ls1u0i | INFO 01-10 14:00:58
model_runner.py:1404] If out-of-memory error occurs during cudagraph capture,
consider decreasing `gpu_memory_utilization` or switching to eager mode. You can
also reduce the `max_num_seqs` as needed to decrease memory usage.\n
2025-01-10 11:01:22.860 | info | q5xcbyb3ls1u0i | INFO 01-10 14:00:58
model_runner.py:1400] Capturing cudagraphs for decoding. This may lead to
unexpected consequences if the model is not static. To run the model in eager mode,
set 'enforce_eager=True' or use '--enforce-eager' in the CLI.\n
2025-01-10 11:01:22.860 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess
pid=228) INFO 01-10 14:00:58 model_runner.py:1404] If out-of-memory error
occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or
switching to eager mode. You can also reduce the `max_num_seqs` as needed to
decrease memory usage.\n
2025-01-10 11:00:58.461 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess
pid=228) INFO 01-10 14:00:58 model_runner.py:1400] Capturing cudagraphs for
decoding. This may lead to unexpected consequences if the model is not static. To
run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in
the CLI.\n
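The cudagraph entries above list the three knobs for capture-time OOMs: disable capture with enforce_eager, lower gpu_memory_utilization, or shrink max_num_seqs. As engine arguments (values illustrative, not this endpoint's settings):

    from vllm import AsyncEngineArgs

    MODEL_PATH = ("/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ"
                  "/snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83")

    engine_args = AsyncEngineArgs(
        model=MODEL_PATH,
        enforce_eager=True,            # skip cudagraph capture entirely, or...
        gpu_memory_utilization=0.85,   # ...keep capture but leave more headroom
        max_num_seqs=128,              # ...and/or cap the decode batch size
    )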
2025-01-10 11:00:58.461 | info | q5xcbyb3ls1u0i | INFO 01-10 14:00:55
distributed_gpu_executor.py:61] Maximum concurrency for 8192 tokens per request:
17.62x\n
2025-01-10 11:00:55.231 | info | q5xcbyb3ls1u0i | INFO 01-10 14:00:55
distributed_gpu_executor.py:57] # GPU blocks: 9020, # CPU blocks: 1638\n
2025-01-10 11:00:55.063 | info | q5xcbyb3ls1u0i | INFO 01-10 14:00:55
worker.py:232] Memory profiling results: total_gpu_memory=47.50GiB
initial_memory_usage=19.30GiB peak_torch_memory=19.88GiB
memory_usage_post_profile=19.44Gib non_torch_memory=0.85GiB kv_cache_size=22.02GiB
gpu_memory_utilization=0.90\n
2025-01-10 11:00:54.997 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess
pid=228) INFO 01-10 14:00:54 worker.py:232] Memory profiling results:
total_gpu_memory=47.50GiB initial_memory_usage=19.30GiB peak_torch_memory=19.84GiB
memory_usage_post_profile=19.42Gib non_torch_memory=0.83GiB kv_cache_size=22.08GiB
gpu_memory_utilization=0.90\n
2025-01-10 11:00:49.914 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess
pid=228) INFO 01-10 14:00:49 model_runner.py:1077] Loading model weights took
18.5769 GB\n
2025-01-10 11:00:49.815 | info | q5xcbyb3ls1u0i | INFO 01-10 14:00:49
model_runner.py:1077] Loading model weights took 18.5769 GB\n
2025-01-10 11:00:49.815 | info | q5xcbyb3ls1u0i | \n
2025-01-10 11:00:49.815 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 100% Completed | 9/9 [00:10<00:00, 1.17s/it]\n
2025-01-10 11:00:49.518 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 100% Completed | 9/9 [00:10<00:00, 1.07it/s]\n
2025-01-10 11:00:49.293 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 89% Completed | 8/9 [00:10<00:01, 1.26s/it]\n
2025-01-10 11:00:48.298 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 78% Completed | 7/9 [00:09<00:02, 1.38s/it]\n
2025-01-10 11:00:46.902 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 67% Completed | 6/9 [00:07<00:04, 1.37s/it]\n
2025-01-10 11:00:45.525 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 56% Completed | 5/9 [00:06<00:05, 1.37s/it]\n
2025-01-10 11:00:44.111 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 44% Completed | 4/9 [00:05<00:06, 1.34s/it]\n
2025-01-10 11:00:42.691 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 33% Completed | 3/9 [00:03<00:07, 1.29s/it]\n
2025-01-10 11:00:41.272 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 22% Completed | 2/9 [00:02<00:08, 1.18s/it]\n
2025-01-10 11:00:39.886 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 11% Completed | 1/9 [00:00<00:07, 1.14it/s]\n
2025-01-10 11:00:39.008 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 0% Completed | 0/9 [00:00<?, ?it/s]\n
2025-01-10 11:00:39.008 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess
pid=228) INFO 01-10 14:00:38 model_runner.py:1072] Starting to load model
/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83...\n
2025-01-10 11:00:39.008 | info | q5xcbyb3ls1u0i | INFO 01-10 14:00:38
model_runner.py:1072] Starting to load model /models/huggingface-cache/hub/models--
loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83...\n
2025-01-10 11:00:39.008 | info | q5xcbyb3ls1u0i | INFO 01-10 14:00:38
shm_broadcast.py:236] vLLM message queue communication handle:
Handle(connect_ip='127.0.0.1', local_reader_ranks=[1],
buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at
0x776f611e5f60>, local_subscribe_port=57731, remote_subscribe_port=None)\n
2025-01-10 11:00:39.007 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess
pid=228) INFO 01-10 14:00:38 custom_all_reduce_utils.py:242] reading GPU P2P
access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json\n
2025-01-10 11:00:38.745 | info | q5xcbyb3ls1u0i | INFO 01-10 14:00:38
custom_all_reduce_utils.py:242] reading GPU P2P access cache from
/root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json\n
2025-01-10 11:00:38.745 | info | q5xcbyb3ls1u0i | INFO 01-10 14:00:31
custom_all_reduce_utils.py:204] generating GPU P2P access cache in
/root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json\n
2025-01-10 11:00:38.745 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess
pid=228) INFO 01-10 14:00:31 pynccl.py:69] vLLM is using nccl==2.21.5\n
2025-01-10 11:00:38.745 | info | q5xcbyb3ls1u0i | INFO 01-10 14:00:31
pynccl.py:69] vLLM is using nccl==2.21.5\n
2025-01-10 11:00:38.745 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess
pid=228) INFO 01-10 14:00:31 utils.py:960] Found nccl from library
libnccl.so.2\n
2025-01-10 11:00:38.745 | info | q5xcbyb3ls1u0i | INFO 01-10 14:00:31
utils.py:960] Found nccl from library libnccl.so.2\n
2025-01-10 11:00:38.745 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess
pid=228) INFO 01-10 14:00:30 multiproc_worker_utils.py:215] Worker ready;
awaiting tasks\n
2025-01-10 11:00:38.745 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess
pid=228) INFO 01-10 14:00:30 selector.py:135] Using Flash Attention backend.\n
2025-01-10 11:00:38.745 | info | q5xcbyb3ls1u0i | INFO 01-10 14:00:30
selector.py:135] Using Flash Attention backend.\n
2025-01-10 11:00:38.745 | info | q5xcbyb3ls1u0i | INFO 01-10 14:00:30
custom_cache_manager.py:17] Setting Triton cache manager to:
vllm.triton_utils.custom_cache_manager:CustomCacheManager\n
2025-01-10 11:00:38.745 | warning | q5xcbyb3ls1u0i | WARNING 01-10 14:00:30
multiproc_gpu_executor.py:56] Reducing Torch parallelism from 64 threads to 1 to
avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment
to tune this value as needed.\n
2025-01-10 11:00:38.745 | info | q5xcbyb3ls1u0i | INFO 01-10 14:00:30
llm_engine.py:249] Initializing an LLM engine (v0.6.4) with config:
model='/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/
snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', speculative_config=None,
tokenizer='/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/
snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', skip_tokenizer_init=False,
tokenizer_mode=auto, revision=None, override_neuron_config=None,
tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16,
max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO,
tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False,
quantization=awq, enforce_eager=False, kv_cache_dtype=auto,
quantization_param_path=None, device_config=cuda,
decoding_config=DecodingConfig(guided_decoding_backend='outlines'),
observability_config=ObservabilityConfig(otlp_traces_endpoint=None,
collect_model_forward_time=False, collect_model_execute_time=False), seed=0,
served_model_name=/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-
v4.6-AWQ/snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83, num_scheduler_steps=1,
chunked_prefill_enabled=False multi_step_stream_outputs=True,
enable_prefix_caching=True, use_async_output_proc=True, use_cached_outputs=False,
chat_template_text_format=string, mm_processor_kwargs=None, pooler_config=None)\n
2025-01-10 11:00:38.745 | info | q5xcbyb3ls1u0i | INFO 01-10 14:00:30
config.py:1020] Defaulting to use mp for distributed inference\n
2025-01-10 11:00:38.745 | warning | q5xcbyb3ls1u0i | WARNING 01-10 14:00:30
config.py:428] awq quantization is not fully optimized yet. The speed can be slower
than non-quantized models.\n
2025-01-10 11:00:38.745 | info | q5xcbyb3ls1u0i | INFO 01-10 14:00:30
awq_marlin.py:113] Detected that the model can run with awq_marlin, however you
specified quantization=awq explicitly, so forcing awq. Use quantization=awq_marlin
for faster inference\n
2025-01-10 11:00:38.745 | info | q5xcbyb3ls1u0i | INFO 01-10 14:00:30
config.py:350] This model supports multiple tasks: {'embedding', 'generate'}.
Defaulting to 'generate'.\n
2025-01-10 11:00:38.745 | info | q5xcbyb3ls1u0i | tokenizer_name_or_path:
/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83, tokenizer_revision: None,
trust_remote_code: False\n
2025-01-10 11:00:38.745 | info | q5xcbyb3ls1u0i | engine.py :26 2025-
01-10 14:00:26,956 Engine args:
AsyncEngineArgs(model='/models/huggingface-cache/hub/models--loremipsum3658--
Bacchus-v4.6-AWQ/snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83',
served_model_name=None, tokenizer='/models/huggingface-cache/hub/models--
loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', task='auto', skip_tokenizer_init=False,
tokenizer_mode='auto', chat_template_text_format='string', trust_remote_code=False,
allowed_local_media_path='', download_dir=None, load_format='auto',
config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto',
quantization_param_path=None, seed=0, max_model_len=8192, worker_use_ray=False,
distributed_executor_backend=None, pipeline_parallel_size=1,
tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16,
enable_prefix_caching=True, disable_sliding_window=False,
use_v2_block_manager='true', swap_space=4, cpu_offload_gb=0,
gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256,
max_logprobs=20, disable_log_stats=False, revision=None, code_revision=None,
rope_scaling=None, rope_theta=None, hf_overrides=None, tokenizer_revision=None,
quantization='awq', enforce_eager=False, max_seq_len_to_capture=8192,
disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray',
tokenizer_pool_extra_config=None, limit_mm_per_prompt=None,
mm_processor_kwargs=None, enable_lora=False, enable_lora_bias=False, max_loras=1,
max_lora_rank=16, enable_prompt_adapter=False, max_prompt_adapters=1,
max_prompt_adapter_token=0, fully_sharded_loras=False, lora_extra_vocab_size=256,
long_lora_scaling_factors=None, lora_dtype='auto', max_cpu_loras=None,
device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True,
ray_workers_use_nsight=False, num_gpu_blocks_override=None, num_lookahead_slots=0,
model_loader_extra_config=None, ignore_patterns=None, preemption_mode=None,
scheduler_delay_factor=0.0, enable_chunked_prefill=None,
guided_decoding_backend='outlines', speculative_model=None,
speculative_model_quantization=None, speculative_draft_tensor_parallel_size=None,
num_speculative_tokens=None, speculative_disable_mqa_scorer=False,
speculative_max_model_len=None, speculative_disable_by_batch_size=None,
ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None,
spec_decoding_acceptance_method='rejection_sampler',
typical_acceptance_sampler_posterior_threshold=None,
typical_acceptance_sampler_posterior_alpha=None, qlora_adapter_name_or_path=None,
disable_logprobs_during_spec_decoding=None, otlp_traces_endpoint=None,
collect_detailed_traces=None, disable_async_output_proc=False,
scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None,
disable_log_requests=False)\n
2025-01-10 11:00:38.744 | info | q5xcbyb3ls1u0i | engine_args.py :126 2025-
01-10 14:00:26,932 Using baked in model with args: {'MODEL_NAME':
'/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', 'TOKENIZER_NAME': '/models/huggingface-
cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83'}\n
2025-01-10 11:00:17.932 | info | q5xcbyb3ls1u0i | INFO 01-10 14:00:14
metrics.py:465] Prefix cache hit rate: GPU: 90.00%, CPU: 0.00%\n
2025-01-10 11:00:14.461 | info | q5xcbyb3ls1u0i | INFO 01-10 14:00:14
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
32.9 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.5%, CPU KV cache usage: 0.0%.\n
2025-01-10 11:00:12.218 | info | q5xcbyb3ls1u0i | Jobs in queue: 1
2025-01-10 11:00:12.218 | info | q5xcbyb3ls1u0i | INFO 01-10 14:00:09
metrics.py:465] Prefix cache hit rate: GPU: 90.00%, CPU: 0.00%\n
2025-01-10 11:00:09.446 | info | q5xcbyb3ls1u0i | INFO 01-10 14:00:09
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.4%, CPU KV cache usage: 0.0%.\n
2025-01-10 11:00:09.446 | info | q5xcbyb3ls1u0i | INFO 01-10 14:00:04
metrics.py:465] Prefix cache hit rate: GPU: 90.00%, CPU: 0.00%\n
2025-01-10 11:00:05.733 | info | 0a13jc93spuadh |
warnings.warn('resource_tracker: There appear to be %d '\n
2025-01-10 11:00:04.668 | info | 0a13jc93spuadh |
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning:
resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at
shutdown\n
2025-01-10 11:00:04.668 | info | 0a13jc93spuadh | Extension modules:
multidict._multidict, yarl._quoting_c, propcache._helpers_c, _brotli,
aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket.mask,
aiohttp._websocket.reader_c, _cffi_backend, frozenlist._frozenlist,
charset_normalizer.md, requests.packages.charset_normalizer.md,
requests.packages.chardet.md, ujson, numpy.core._multiarray_umath,
numpy.core._multiarray_tests, numpy.linalg._umath_linalg,
numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator,
numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand,
numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64,
numpy.random._generator, torch._C, torch._C._dynamo.autograd_compiler,
torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils,
torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse,
torch._C._special, yaml._yaml, markupsafe._speedups, PIL._imaging,
psutil._psutil_linux, psutil._psutil_posix, msgspec._core,
sentencepiece._sentencepiece, regex._regex, PIL._imagingft, msgpack._cmsgpack,
google._upb._message, setproctitle, uvloop.loop, ray._raylet,
zmq.backend.cython._zmq (total: 53)\n
2025-01-10 11:00:04.668 | info | 0a13jc93spuadh | \n
2025-01-10 11:00:04.668 | info | 0a13jc93spuadh | <no Python frame>\n
2025-01-10 11:00:04.668 | info | 0a13jc93spuadh | Current thread
0x000072a27cfc4480 (most recent call first):\n
2025-01-10 11:00:04.668 | info | 0a13jc93spuadh | \n
2025-01-10 11:00:04.668 | info | 0a13jc93spuadh | Python runtime state: finalizing
(tstate=0x0000640cf844e0d0)\n
2025-01-10 11:00:04.435 | info | q5xcbyb3ls1u0i | INFO 01-10 14:00:04
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.3%, CPU KV cache usage: 0.0%.\n
2025-01-10 11:00:03.218 | info | 0a13jc93spuadh | Fatal Python error:
_enter_buffered_busy: could not acquire lock for <_io.BufferedWriter
name='<stdout>'> at interpreter shutdown, possibly due to daemon threads\n
2025-01-10 11:00:02.218 | error | 0a13jc93spuadh | ERROR 01-10 14:00:02
multiproc_worker_utils.py:116] Worker VllmWorkerProcess pid 228 died, exit code: -
15\n
2025-01-10 11:00:02.218 | info | 0a13jc93spuadh | Kill worker.
2025-01-10 11:00:01.111 | info | 0a13jc93spuadh | --- Starting Serverless Worker |
Version 1.7.5 ---\n
2025-01-10 11:00:01.111 | info | 0a13jc93spuadh | INFO 01-10 13:54:39
chat_utils.py:431] ' }}{% endif %}\n
2025-01-10 11:00:01.111 | info | 0a13jc93spuadh | INFO 01-10 13:54:39
chat_utils.py:431] \n
2025-01-10 11:00:01.111 | info | 0a13jc93spuadh | INFO 01-10 13:54:39
chat_utils.py:431] '+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0
== 0 %}{% set content = '<|begin_of_text|>' + content %}{% endif %}{{ content }}{%
endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|
end_header_id|>\n
2025-01-10 11:00:01.110 | info | 0a13jc93spuadh | INFO 01-10 13:54:39
chat_utils.py:431] \n
2025-01-10 11:00:01.110 | info | 0a13jc93spuadh | INFO 01-10 13:54:39
chat_utils.py:431] {% set loop_messages = messages %}{% for message in
loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|
end_header_id|>\n
2025-01-10 11:00:01.110 | info | 0a13jc93spuadh | INFO 01-10 13:54:39
chat_utils.py:431] Using supplied chat template:\n
2025-01-10 11:00:00.338 | info | q5xcbyb3ls1u0i | Kill worker.
2025-01-10 11:00:00.338 | info | q5xcbyb3ls1u0i | INFO 01-10 13:59:59
metrics.py:465] Prefix cache hit rate: GPU: 90.00%, CPU: 0.00%\n
2025-01-10 10:59:59.433 | info | q5xcbyb3ls1u0i | INFO 01-10 13:59:59
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.1%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:59:59.433 | info | q5xcbyb3ls1u0i | INFO 01-10 13:59:54
metrics.py:465] Prefix cache hit rate: GPU: 90.00%, CPU: 0.00%\n
2025-01-10 10:59:59.433 | info | q5xcbyb3ls1u0i | INFO 01-10 13:59:54
metrics.py:449] Avg prompt throughput: 2.7 tokens/s, Avg generation throughput:
13.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.0%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:59:59.433 | info | q5xcbyb3ls1u0i | INFO 01-10 13:59:54
async_llm_engine.py:208] Added request 83462dab2c104cc6a83abef0d9084599.\n
2025-01-10 10:59:59.433 | info | q5xcbyb3ls1u0i | Jobs in progress: 1
2025-01-10 10:59:54.367 | info | q5xcbyb3ls1u0i | Jobs in queue: 1
2025-01-10 10:59:49.448 | info | q5xcbyb3ls1u0i | Finished.
2025-01-10 10:59:49.296 | info | q5xcbyb3ls1u0i | Finished running generator.
2025-01-10 10:59:49.172 | info | q5xcbyb3ls1u0i | INFO 01-10 13:59:49
async_llm_engine.py:176] Finished request 6ceebc6f9aad4ddcb5622989e8de5661.\n
2025-01-10 10:59:49.172 | info | q5xcbyb3ls1u0i | INFO 01-10 13:59:45
metrics.py:465] Prefix cache hit rate: GPU: 88.89%, CPU: 0.00%\n
2025-01-10 10:59:45.877 | info | q5xcbyb3ls1u0i | INFO 01-10 13:59:45
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.4%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:59:45.877 | info | q5xcbyb3ls1u0i | INFO 01-10 13:59:40
metrics.py:465] Prefix cache hit rate: GPU: 88.89%, CPU: 0.00%\n
2025-01-10 10:59:40.872 | info | q5xcbyb3ls1u0i | INFO 01-10 13:59:40
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.3%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:59:40.872 | info | q5xcbyb3ls1u0i | INFO 01-10 13:59:35
metrics.py:465] Prefix cache hit rate: GPU: 88.89%, CPU: 0.00%\n
2025-01-10 10:59:35.845 | info | q5xcbyb3ls1u0i | INFO 01-10 13:59:35
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.1%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:59:35.845 | info | q5xcbyb3ls1u0i | INFO 01-10 13:59:30
metrics.py:465] Prefix cache hit rate: GPU: 88.89%, CPU: 0.00%\n
2025-01-10 10:59:30.844 | info | q5xcbyb3ls1u0i | INFO 01-10 13:59:30
metrics.py:449] Avg prompt throughput: 3.1 tokens/s, Avg generation throughput:
14.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.0%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:59:30.843 | info | q5xcbyb3ls1u0i | INFO 01-10 13:59:30
async_llm_engine.py:208] Added request 6ceebc6f9aad4ddcb5622989e8de5661.\n
2025-01-10 10:59:30.843 | info | q5xcbyb3ls1u0i | Jobs in progress: 1
2025-01-10 10:59:30.792 | info | q5xcbyb3ls1u0i | Finished.
2025-01-10 10:59:30.701 | info | q5xcbyb3ls1u0i | Finished running generator.
2025-01-10 10:59:26.483 | info | q5xcbyb3ls1u0i | INFO 01-10 13:59:26
async_llm_engine.py:176] Finished request a357e73a5d3d43cdbe6bfa113cf198d9.\n
2025-01-10 10:59:26.483 | info | q5xcbyb3ls1u0i | INFO 01-10 13:59:23
metrics.py:465] Prefix cache hit rate: GPU: 87.50%, CPU: 0.00%\n
2025-01-10 10:59:23.398 | info | q5xcbyb3ls1u0i | INFO 01-10 13:59:23
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.5%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:59:23.398 | info | q5xcbyb3ls1u0i | INFO 01-10 13:59:18
metrics.py:465] Prefix cache hit rate: GPU: 87.50%, CPU: 0.00%\n
2025-01-10 10:59:18.380 | info | q5xcbyb3ls1u0i | INFO 01-10 13:59:18
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.4%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:59:18.380 | info | q5xcbyb3ls1u0i | INFO 01-10 13:59:13
metrics.py:465] Prefix cache hit rate: GPU: 87.50%, CPU: 0.00%\n
2025-01-10 10:59:13.367 | info | q5xcbyb3ls1u0i | INFO 01-10 13:59:13
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.3%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:59:13.367 | info | q5xcbyb3ls1u0i | INFO 01-10 13:59:08
metrics.py:465] Prefix cache hit rate: GPU: 87.50%, CPU: 0.00%\n
2025-01-10 10:59:08.362 | info | q5xcbyb3ls1u0i | INFO 01-10 13:59:08
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.1%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:59:03.707 | info | q5xcbyb3ls1u0i | Jobs in queue: 1
2025-01-10 10:59:03.707 | info | q5xcbyb3ls1u0i | INFO 01-10 13:59:03
metrics.py:465] Prefix cache hit rate: GPU: 87.50%, CPU: 0.00%\n
2025-01-10 10:59:03.342 | info | q5xcbyb3ls1u0i | INFO 01-10 13:59:03
metrics.py:449] Avg prompt throughput: 0.5 tokens/s, Avg generation throughput: 0.3
tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage:
0.0%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:59:03.342 | info | q5xcbyb3ls1u0i | INFO 01-10 13:59:03
async_llm_engine.py:208] Added request a357e73a5d3d43cdbe6bfa113cf198d9.\n
2025-01-10 10:59:03.342 | info | q5xcbyb3ls1u0i | Jobs in progress: 1
2025-01-10 10:59:03.286 | info | q5xcbyb3ls1u0i | Jobs in queue: 1
2025-01-10 10:58:20.677 | info | q5xcbyb3ls1u0i | Finished.
2025-01-10 10:58:20.580 | info | q5xcbyb3ls1u0i | Finished running generator.
2025-01-10 10:58:20.468 | info | q5xcbyb3ls1u0i | INFO 01-10 13:58:20
async_llm_engine.py:176] Finished request 1ff44cd19f2a46bd84bc0183ca24b0b2.\n
2025-01-10 10:58:20.468 | info | q5xcbyb3ls1u0i | INFO 01-10 13:58:20
metrics.py:465] Prefix cache hit rate: GPU: 85.71%, CPU: 0.00%\n
2025-01-10 10:58:20.168 | info | q5xcbyb3ls1u0i | INFO 01-10 13:58:20
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
40.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.6%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:58:20.168 | info | q5xcbyb3ls1u0i | Jobs in progress: 1
2025-01-10 10:58:20.168 | info | q5xcbyb3ls1u0i | Finished.
2025-01-10 10:58:16.487 | error | q5xcbyb3ls1u0i | Failed to return job results. |
400, message='Bad Request', url='https://api.runpod.ai/v2/vllm-vw4dt9p5fckyh4/job-
done/q5xcbyb3ls1u0i/fee8ee7a-829f-4d4e-b042-b6bbe5fe17eb-u1?
gpu=NVIDIA+RTX+6000+Ada+Generation&isStream=true'
2025-01-10 10:58:16.396 | info | q5xcbyb3ls1u0i | Finished running generator.
2025-01-10 10:58:16.269 | info | q5xcbyb3ls1u0i | INFO 01-10 13:58:16
async_llm_engine.py:176] Finished request 8c1cc66405214aab80f5f2cad5928737.\n
2025-01-10 10:58:16.269 | info | q5xcbyb3ls1u0i | INFO 01-10 13:58:15
metrics.py:465] Prefix cache hit rate: GPU: 85.71%, CPU: 0.00%\n
2025-01-10 10:58:15.146 | info | q5xcbyb3ls1u0i | INFO 01-10 13:58:15
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
65.9 tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.9%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:58:15.146 | info | q5xcbyb3ls1u0i | INFO 01-10 13:58:10
metrics.py:465] Prefix cache hit rate: GPU: 85.71%, CPU: 0.00%\n
2025-01-10 10:58:10.135 | info | q5xcbyb3ls1u0i | INFO 01-10 13:58:10
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
65.3 tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.6%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:58:10.135 | info | q5xcbyb3ls1u0i | INFO 01-10 13:58:05
metrics.py:465] Prefix cache hit rate: GPU: 85.71%, CPU: 0.00%\n
2025-01-10 10:58:05.108 | info | q5xcbyb3ls1u0i | INFO 01-10 13:58:05
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
65.4 tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.4%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:58:05.108 | info | q5xcbyb3ls1u0i | INFO 01-10 13:58:00
metrics.py:465] Prefix cache hit rate: GPU: 85.71%, CPU: 0.00%\n
2025-01-10 10:58:00.091 | info | q5xcbyb3ls1u0i | INFO 01-10 13:58:00
metrics.py:449] Avg prompt throughput: 9.2 tokens/s, Avg generation throughput:
54.0 tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.2%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:58:00.091 | info | q5xcbyb3ls1u0i | INFO 01-10 13:57:56
async_llm_engine.py:208] Added request 8c1cc66405214aab80f5f2cad5928737.\n
2025-01-10 10:58:00.091 | info | q5xcbyb3ls1u0i | INFO 01-10 13:57:56
async_llm_engine.py:208] Added request 1ff44cd19f2a46bd84bc0183ca24b0b2.\n
2025-01-10 10:58:00.091 | info | q5xcbyb3ls1u0i | Jobs in progress: 2
2025-01-10 10:57:56.669 | info | q5xcbyb3ls1u0i | Finished.
2025-01-10 10:57:56.560 | info | q5xcbyb3ls1u0i | Finished running generator.
2025-01-10 10:57:56.417 | info | q5xcbyb3ls1u0i | INFO 01-10 13:57:56
async_llm_engine.py:176] Finished request d9ab2f54431a48b59339b37c177f4896.\n
2025-01-10 10:57:56.417 | info | q5xcbyb3ls1u0i | INFO 01-10 13:57:55
metrics.py:465] Prefix cache hit rate: GPU: 80.00%, CPU: 0.00%\n
2025-01-10 10:57:55.076 | info | q5xcbyb3ls1u0i | INFO 01-10 13:57:55
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.4%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:57:55.076 | info | q5xcbyb3ls1u0i | INFO 01-10 13:57:50
metrics.py:465] Prefix cache hit rate: GPU: 80.00%, CPU: 0.00%\n
2025-01-10 10:57:50.069 | info | q5xcbyb3ls1u0i | INFO 01-10 13:57:50
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.3%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:57:45.862 | info | q5xcbyb3ls1u0i | Jobs in queue: 2
2025-01-10 10:57:45.862 | info | q5xcbyb3ls1u0i | INFO 01-10 13:57:45
metrics.py:465] Prefix cache hit rate: GPU: 80.00%, CPU: 0.00%\n
2025-01-10 10:57:45.039 | info | q5xcbyb3ls1u0i | INFO 01-10 13:57:45
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.1%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:57:40.739 | info | q5xcbyb3ls1u0i | Jobs in queue: 1
2025-01-10 10:57:40.739 | info | q5xcbyb3ls1u0i | INFO 01-10 13:57:40
metrics.py:465] Prefix cache hit rate: GPU: 80.00%, CPU: 0.00%\n
2025-01-10 10:57:40.030 | info | q5xcbyb3ls1u0i | INFO 01-10 13:57:40
metrics.py:449] Avg prompt throughput: 0.3 tokens/s, Avg generation throughput: 0.2
tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage:
0.0%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:57:40.030 | info | q5xcbyb3ls1u0i | INFO 01-10 13:57:39
async_llm_engine.py:208] Added request d9ab2f54431a48b59339b37c177f4896.\n
2025-01-10 10:57:40.030 | info | q5xcbyb3ls1u0i | Jobs in progress: 1
2025-01-10 10:57:39.972 | info | q5xcbyb3ls1u0i | Jobs in queue: 1
2025-01-10 10:56:12.300 | info | q5xcbyb3ls1u0i | Finished.
2025-01-10 10:56:12.204 | info | q5xcbyb3ls1u0i | Finished running generator.
2025-01-10 10:56:12.071 | info | q5xcbyb3ls1u0i | INFO 01-10 13:56:12
async_llm_engine.py:176] Finished request 9fa24c31ca4d4b34b9b7f02eb3f04f5c.\n
2025-01-10 10:56:12.071 | info | q5xcbyb3ls1u0i | INFO 01-10 13:56:11
metrics.py:465] Prefix cache hit rate: GPU: 75.00%, CPU: 0.00%\n
2025-01-10 10:56:11.592 | info | q5xcbyb3ls1u0i | INFO 01-10 13:56:11
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.6%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:56:11.591 | info | q5xcbyb3ls1u0i | INFO 01-10 13:56:06
metrics.py:465] Prefix cache hit rate: GPU: 75.00%, CPU: 0.00%\n
2025-01-10 10:56:06.589 | info | q5xcbyb3ls1u0i | INFO 01-10 13:56:06
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.5%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:56:06.589 | info | q5xcbyb3ls1u0i | INFO 01-10 13:56:01
metrics.py:465] Prefix cache hit rate: GPU: 75.00%, CPU: 0.00%\n
2025-01-10 10:56:01.580 | info | q5xcbyb3ls1u0i | INFO 01-10 13:56:01
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.4%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:56:01.580 | info | q5xcbyb3ls1u0i | INFO 01-10 13:55:56
metrics.py:465] Prefix cache hit rate: GPU: 75.00%, CPU: 0.00%\n
2025-01-10 10:55:56.551 | info | q5xcbyb3ls1u0i | INFO 01-10 13:55:56
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.3%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:55:56.551 | info | q5xcbyb3ls1u0i | INFO 01-10 13:55:51
metrics.py:465] Prefix cache hit rate: GPU: 75.00%, CPU: 0.00%\n
2025-01-10 10:55:51.543 | info | q5xcbyb3ls1u0i | INFO 01-10 13:55:51
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.1%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:55:51.543 | info | q5xcbyb3ls1u0i | INFO 01-10 13:55:46
metrics.py:465] Prefix cache hit rate: GPU: 75.00%, CPU: 0.00%\n
2025-01-10 10:55:51.543 | info | q5xcbyb3ls1u0i | INFO 01-10 13:55:46
metrics.py:449] Avg prompt throughput: 2.5 tokens/s, Avg generation throughput: 8.0
tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage:
0.0%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:55:51.543 | info | q5xcbyb3ls1u0i | INFO 01-10 13:55:46
async_llm_engine.py:208] Added request 9fa24c31ca4d4b34b9b7f02eb3f04f5c.\n
2025-01-10 10:55:51.543 | info | q5xcbyb3ls1u0i | Jobs in progress: 1
2025-01-10 10:55:46.473 | info | q5xcbyb3ls1u0i | Jobs in queue: 1
2025-01-10 10:55:39.621 | info | q5xcbyb3ls1u0i | Finished.
2025-01-10 10:55:39.511 | info | q5xcbyb3ls1u0i | Finished running generator.
2025-01-10 10:55:39.380 | info | q5xcbyb3ls1u0i | INFO 01-10 13:55:39
async_llm_engine.py:176] Finished request 8b26b1b484dc4fba9edcfe8c151df159.\n
2025-01-10 10:55:39.380 | info | q5xcbyb3ls1u0i | INFO 01-10 13:55:37
metrics.py:465] Prefix cache hit rate: GPU: 66.67%, CPU: 0.00%\n
2025-01-10 10:55:37.184 | info | q5xcbyb3ls1u0i | INFO 01-10 13:55:37
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.6%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:55:37.184 | info | q5xcbyb3ls1u0i | INFO 01-10 13:55:32
metrics.py:465] Prefix cache hit rate: GPU: 66.67%, CPU: 0.00%\n
2025-01-10 10:55:32.158 | info | q5xcbyb3ls1u0i | INFO 01-10 13:55:32
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.5%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:55:32.158 | info | q5xcbyb3ls1u0i | INFO 01-10 13:55:27
metrics.py:465] Prefix cache hit rate: GPU: 66.67%, CPU: 0.00%\n
2025-01-10 10:55:27.138 | info | q5xcbyb3ls1u0i | INFO 01-10 13:55:27
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.4%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:55:27.138 | info | q5xcbyb3ls1u0i | INFO 01-10 13:55:22
metrics.py:465] Prefix cache hit rate: GPU: 66.67%, CPU: 0.00%\n
2025-01-10 10:55:22.119 | info | q5xcbyb3ls1u0i | INFO 01-10 13:55:22
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.3%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:55:22.118 | info | q5xcbyb3ls1u0i | INFO 01-10 13:55:17
metrics.py:465] Prefix cache hit rate: GPU: 66.67%, CPU: 0.00%\n
2025-01-10 10:55:17.091 | info | q5xcbyb3ls1u0i | INFO 01-10 13:55:17
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.1%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:55:17.091 | info | q5xcbyb3ls1u0i | INFO 01-10 13:55:12
metrics.py:465] Prefix cache hit rate: GPU: 66.67%, CPU: 0.00%\n
2025-01-10 10:55:12.072 | info | q5xcbyb3ls1u0i | INFO 01-10 13:55:12
metrics.py:449] Avg prompt throughput: 0.7 tokens/s, Avg generation throughput: 3.1
tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage:
0.0%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:55:12.072 | info | q5xcbyb3ls1u0i | INFO 01-10 13:55:12
async_llm_engine.py:208] Added request 8b26b1b484dc4fba9edcfe8c151df159.\n
2025-01-10 10:55:12.072 | info | q5xcbyb3ls1u0i | Jobs in progress: 1
2025-01-10 10:55:12.008 | info | q5xcbyb3ls1u0i | Jobs in queue: 1
2025-01-10 10:54:42.868 | info | q5xcbyb3ls1u0i | Finished.
2025-01-10 10:54:42.792 | info | q5xcbyb3ls1u0i | Finished running generator.
2025-01-10 10:54:42.681 | info | q5xcbyb3ls1u0i | INFO 01-10 13:54:42
async_llm_engine.py:176] Finished request 1ee6e7ea73b0473e815ecb461b63e5be.\n
2025-01-10 10:54:42.681 | info | q5xcbyb3ls1u0i | INFO 01-10 13:54:39
metrics.py:465] Prefix cache hit rate: GPU: 50.00%, CPU: 0.00%\n
2025-01-10 10:54:39.654 | info | q5xcbyb3ls1u0i | INFO 01-10 13:54:39
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.6%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:54:39.654 | info | q5xcbyb3ls1u0i | Jobs in progress: 1
2025-01-10 10:54:39.648 | info | 0a13jc93spuadh | tokenizer_name_or_path:
/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83, tokenizer_revision: None,
trust_remote_code: False\n
2025-01-10 10:54:39.648 | info | 0a13jc93spuadh | engine.py :26 2025-
01-10 13:54:39,307 Engine args:
AsyncEngineArgs(model='/models/huggingface-cache/hub/models--loremipsum3658--
Bacchus-v4.6-AWQ/snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83',
served_model_name=None, tokenizer='/models/huggingface-cache/hub/models--
loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', task='auto', skip_tokenizer_init=False,
tokenizer_mode='auto', chat_template_text_format='string', trust_remote_code=False,
allowed_local_media_path='', download_dir=None, load_format='auto',
config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto',
quantization_param_path=None, seed=0, max_model_len=8192, worker_use_ray=False,
distributed_executor_backend=None, pipeline_parallel_size=1,
tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16,
enable_prefix_caching=True, disable_sliding_window=False,
use_v2_block_manager='true', swap_space=4, cpu_offload_gb=0,
gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256,
max_logprobs=20, disable_log_stats=False, revision=None, code_revision=None,
rope_scaling=None, rope_theta=None, hf_overrides=None, tokenizer_revision=None,
quantization='awq', enforce_eager=False, max_seq_len_to_capture=8192,
disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray',
tokenizer_pool_extra_config=None, limit_mm_per_prompt=None,
mm_processor_kwargs=None, enable_lora=False, enable_lora_bias=False, max_loras=1,
max_lora_rank=16, enable_prompt_adapter=False, max_prompt_adapters=1,
max_prompt_adapter_token=0, fully_sharded_loras=False, lora_extra_vocab_size=256,
long_lora_scaling_factors=None, lora_dtype='auto', max_cpu_loras=None,
device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True,
ray_workers_use_nsight=False, num_gpu_blocks_override=None, num_lookahead_slots=0,
model_loader_extra_config=None, ignore_patterns=None, preemption_mode=None,
scheduler_delay_factor=0.0, enable_chunked_prefill=None,
guided_decoding_backend='outlines', speculative_model=None,
speculative_model_quantization=None, speculative_draft_tensor_parallel_size=None,
num_speculative_tokens=None, speculative_disable_mqa_scorer=False,
speculative_max_model_len=None, speculative_disable_by_batch_size=None,
ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None,
spec_decoding_acceptance_method='rejection_sampler',
typical_acceptance_sampler_posterior_threshold=None,
typical_acceptance_sampler_posterior_alpha=None, qlora_adapter_name_or_path=None,
disable_logprobs_during_spec_decoding=None, otlp_traces_endpoint=None,
collect_detailed_traces=None, disable_async_output_proc=False,
scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None,
disable_log_requests=False)\n
2025-01-10 10:54:39.648 | info | 0a13jc93spuadh | engine_args.py :126 2025-
01-10 13:54:39,303 Using baked in model with args: {'MODEL_NAME':
'/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', 'TOKENIZER_NAME': '/models/huggingface-
cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83'}\n
2025-01-10 10:54:39.303 | info | 0a13jc93spuadh | engine.py :112 2025-
01-10 13:54:39,302 Initialized vLLM engine in 80.50s\n
2025-01-10 10:54:39.011 | info | 0a13jc93spuadh | INFO 01-10 13:54:39
model_runner.py:1518] Graph capturing finished in 34 secs, took 0.99 GiB\n
2025-01-10 10:54:39.011 | info | 0a13jc93spuadh | (VllmWorkerProcess
pid=228) INFO 01-10 13:54:38 model_runner.py:1518] Graph capturing finished
in 34 secs, took 0.99 GiB\n
2025-01-10 10:54:38.950 | info | 0a13jc93spuadh | (VllmWorkerProcess
pid=228) INFO 01-10 13:54:38 custom_all_reduce.py:224] Registering 5635 cuda
graph addresses\n
2025-01-10 10:54:38.080 | info | q5xcbyb3ls1u0i | Finished.
2025-01-10 10:54:37.992 | info | q5xcbyb3ls1u0i | Finished running generator.
2025-01-10 10:54:37.992 | info | q5xcbyb3ls1u0i | INFO 01-10 13:54:34
metrics.py:465] Prefix cache hit rate: GPU: 50.00%, CPU: 0.00%\n
2025-01-10 10:54:35.635 | info | 0a13jc93spuadh | INFO 01-10 13:54:35
custom_all_reduce.py:224] Registering 5635 cuda graph addresses\n
2025-01-10 10:54:35.635 | info | 0a13jc93spuadh | INFO 01-10 13:54:05
model_runner.py:1404] If out-of-memory error occurs during cudagraph capture,
consider decreasing `gpu_memory_utilization` or switching to eager mode. You can
also reduce the `max_num_seqs` as needed to decrease memory usage.\n
2025-01-10 10:54:35.635 | info | 0a13jc93spuadh | INFO 01-10 13:54:05
model_runner.py:1400] Capturing cudagraphs for decoding. This may lead to
unexpected consequences if the model is not static. To run the model in eager mode,
set 'enforce_eager=True' or use '--enforce-eager' in the CLI.\n
2025-01-10 10:54:35.635 | info | 0a13jc93spuadh | (VllmWorkerProcess
pid=228) INFO 01-10 13:54:05 model_runner.py:1404] If out-of-memory error
occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or
switching to eager mode. You can also reduce the `max_num_seqs` as needed to
decrease memory usage.\n
2025-01-10 10:54:34.644 | info | q5xcbyb3ls1u0i | INFO 01-10 13:54:34
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
60.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.5%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:54:33.860 | info | q5xcbyb3ls1u0i | INFO 01-10 13:54:33
async_llm_engine.py:176] Finished request 9c282aeeba774afcaa7857b84d712d8e.\n
2025-01-10 10:54:33.860 | info | q5xcbyb3ls1u0i | INFO 01-10 13:54:29
metrics.py:465] Prefix cache hit rate: GPU: 50.00%, CPU: 0.00%\n
2025-01-10 10:54:29.630 | info | q5xcbyb3ls1u0i | INFO 01-10 13:54:29
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
65.6 tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.7%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:54:29.629 | info | q5xcbyb3ls1u0i | INFO 01-10 13:54:24
metrics.py:465] Prefix cache hit rate: GPU: 50.00%, CPU: 0.00%\n
2025-01-10 10:54:24.626 | info | q5xcbyb3ls1u0i | INFO 01-10 13:54:24
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
65.1 tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.5%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:54:24.626 | info | q5xcbyb3ls1u0i | INFO 01-10 13:54:19
metrics.py:465] Prefix cache hit rate: GPU: 50.00%, CPU: 0.00%\n
2025-01-10 10:54:19.619 | info | q5xcbyb3ls1u0i | INFO 01-10 13:54:19
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
65.3 tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.3%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:54:19.619 | info | q5xcbyb3ls1u0i | INFO 01-10 13:54:14
metrics.py:465] Prefix cache hit rate: GPU: 50.00%, CPU: 0.00%\n
2025-01-10 10:54:14.599 | info | q5xcbyb3ls1u0i | INFO 01-10 13:54:14
metrics.py:449] Avg prompt throughput: 9.2 tokens/s, Avg generation throughput: 6.8
tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage:
0.0%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:54:14.599 | info | q5xcbyb3ls1u0i | INFO 01-10 13:54:13
async_llm_engine.py:208] Added request 1ee6e7ea73b0473e815ecb461b63e5be.\n
2025-01-10 10:54:14.599 | info | q5xcbyb3ls1u0i | INFO 01-10 13:54:13
async_llm_engine.py:208] Added request 9c282aeeba774afcaa7857b84d712d8e.\n
2025-01-10 10:54:14.599 | info | q5xcbyb3ls1u0i | Jobs in progress: 2
2025-01-10 10:54:14.599 | info | q5xcbyb3ls1u0i | Jobs in queue: 2
2025-01-10 10:54:13.993 | info | q5xcbyb3ls1u0i | --- Starting Serverless Worker |
Version 1.7.5 ---\n
2025-01-10 10:54:13.993 | info | q5xcbyb3ls1u0i | INFO 01-10 13:54:09
chat_utils.py:431] ' }}{% endif %}\n
2025-01-10 10:54:13.993 | info | q5xcbyb3ls1u0i | INFO 01-10 13:54:09
chat_utils.py:431] \n
2025-01-10 10:54:13.993 | info | q5xcbyb3ls1u0i | INFO 01-10 13:54:09
chat_utils.py:431] '+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0
== 0 %}{% set content = '<|begin_of_text|>' + content %}{% endif %}{{ content }}{%
endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|
end_header_id|>\n
2025-01-10 10:54:13.993 | info | q5xcbyb3ls1u0i | INFO 01-10 13:54:09
chat_utils.py:431] \n
2025-01-10 10:54:13.993 | info | q5xcbyb3ls1u0i | INFO 01-10 13:54:09
chat_utils.py:431] {% set loop_messages = messages %}{% for message in
loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|
end_header_id|>\n
2025-01-10 10:54:13.992 | info | q5xcbyb3ls1u0i | INFO 01-10 13:54:09
chat_utils.py:431] Using supplied chat template:\n
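The chat_utils lines above show the supplied Llama-3-style Jinja template. As a sketch of how such a template is rendered outside the engine (the snapshot path and messages below are placeholders, not values from these logs):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("/models/.../snapshots/<snapshot>")
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ]
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,  # appends the assistant header, as the template's final branch does
    )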
2025-01-10 10:54:09.834 | info | q5xcbyb3ls1u0i | tokenizer_name_or_path:
/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83, tokenizer_revision: None,
trust_remote_code: False\n
2025-01-10 10:54:09.834 | info | q5xcbyb3ls1u0i | engine.py :26 2025-
01-10 13:54:09,578 Engine args:
AsyncEngineArgs(model='/models/huggingface-cache/hub/models--loremipsum3658--
Bacchus-v4.6-AWQ/snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83',
served_model_name=None, tokenizer='/models/huggingface-cache/hub/models--
loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', task='auto', skip_tokenizer_init=False,
tokenizer_mode='auto', chat_template_text_format='string', trust_remote_code=False,
allowed_local_media_path='', download_dir=None, load_format='auto',
config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto',
quantization_param_path=None, seed=0, max_model_len=8192, worker_use_ray=False,
distributed_executor_backend=None, pipeline_parallel_size=1,
tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16,
enable_prefix_caching=True, disable_sliding_window=False,
use_v2_block_manager='true', swap_space=4, cpu_offload_gb=0,
gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256,
max_logprobs=20, disable_log_stats=False, revision=None, code_revision=None,
rope_scaling=None, rope_theta=None, hf_overrides=None, tokenizer_revision=None,
quantization='awq', enforce_eager=False, max_seq_len_to_capture=8192,
disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray',
tokenizer_pool_extra_config=None, limit_mm_per_prompt=None,
mm_processor_kwargs=None, enable_lora=False, enable_lora_bias=False, max_loras=1,
max_lora_rank=16, enable_prompt_adapter=False, max_prompt_adapters=1,
max_prompt_adapter_token=0, fully_sharded_loras=False, lora_extra_vocab_size=256,
long_lora_scaling_factors=None, lora_dtype='auto', max_cpu_loras=None,
device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True,
ray_workers_use_nsight=False, num_gpu_blocks_override=None, num_lookahead_slots=0,
model_loader_extra_config=None, ignore_patterns=None, preemption_mode=None,
scheduler_delay_factor=0.0, enable_chunked_prefill=None,
guided_decoding_backend='outlines', speculative_model=None,
speculative_model_quantization=None, speculative_draft_tensor_parallel_size=None,
num_speculative_tokens=None, speculative_disable_mqa_scorer=False,
speculative_max_model_len=None, speculative_disable_by_batch_size=None,
ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None,
spec_decoding_acceptance_method='rejection_sampler',
typical_acceptance_sampler_posterior_threshold=None,
typical_acceptance_sampler_posterior_alpha=None, qlora_adapter_name_or_path=None,
disable_logprobs_during_spec_decoding=None, otlp_traces_endpoint=None,
collect_detailed_traces=None, disable_async_output_proc=False,
scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None,
disable_log_requests=False)\n
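The AsyncEngineArgs dump above is what the later "Added request" / "Finished request" lines act on. A hedged sketch of that request lifecycle (the engine object and sampling values are assumptions; only the shape of the calls is shown):

    import uuid
    from vllm import SamplingParams

    async def run_one(engine, prompt: str) -> str:
        # `engine` is assumed to be an AsyncLLMEngine built from args like those above.
        request_id = uuid.uuid4().hex                   # appears in logs as "Added request <id>"
        params = SamplingParams(max_tokens=256, temperature=0.7)
        text = ""
        async for output in engine.generate(prompt, params, request_id):
            text = output.outputs[0].text               # streamed partial completion
        return text                                     # followed by "Finished request <id>" in the logs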
2025-01-10 10:54:09.834 | info | q5xcbyb3ls1u0i | engine_args.py :126 2025-
01-10 13:54:09,573 Using baked in model with args: {'MODEL_NAME':
'/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', 'TOKENIZER_NAME': '/models/huggingface-
cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83'}\n
2025-01-10 10:54:09.573 | info | q5xcbyb3ls1u0i | engine.py :112 2025-
01-10 13:54:09,573 Initialized vLLM engine in 56.31s\n
2025-01-10 10:54:09.573 | info | q5xcbyb3ls1u0i | INFO 01-10 13:54:09
model_runner.py:1518] Graph capturing finished in 24 secs, took 0.99 GiB\n
2025-01-10 10:54:09.573 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess pid=228) INFO 01-10 13:54:09 model_runner.py:1518] Graph capturing finished
in 24 secs, took 0.99 GiB\n
2025-01-10 10:54:09.258 | info | q5xcbyb3ls1u0i | INFO 01-10 13:54:09
custom_all_reduce.py:224] Registering 5635 cuda graph addresses\n
2025-01-10 10:54:07.765 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess pid=228) INFO 01-10 13:54:07 custom_all_reduce.py:224] Registering 5635 cuda
graph addresses\n
2025-01-10 10:54:07.765 | info | q5xcbyb3ls1u0i | INFO 01-10 13:53:45
model_runner.py:1404] If out-of-memory error occurs during cudagraph capture,
consider decreasing `gpu_memory_utilization` or switching to eager mode. You can
also reduce the `max_num_seqs` as needed to decrease memory usage.\n
2025-01-10 10:54:07.765 | info | q5xcbyb3ls1u0i | INFO 01-10 13:53:45
model_runner.py:1400] Capturing cudagraphs for decoding. This may lead to
unexpected consequences if the model is not static. To run the model in eager mode,
set 'enforce_eager=True' or use '--enforce-eager' in the CLI.\n
2025-01-10 10:54:07.765 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess pid=228) INFO 01-10 13:53:45 model_runner.py:1404] If out-of-memory error
occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or
switching to eager mode. You can also reduce the `max_num_seqs` as needed to
decrease memory usage.\n
2025-01-10 10:54:05.108 | info | 0a13jc93spuadh | (VllmWorkerProcess pid=228) INFO 01-10 13:54:05 model_runner.py:1400] Capturing cudagraphs for
decoding. This may lead to unexpected consequences if the model is not static. To
run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in
the CLI.\n
2025-01-10 10:54:05.107 | info | 0a13jc93spuadh | INFO 01-10 13:54:02
distributed_gpu_executor.py:61] Maximum concurrency for 8192 tokens per request:
15.62x\n
2025-01-10 10:54:02.730 | info | 0a13jc93spuadh | INFO 01-10 13:54:02
distributed_gpu_executor.py:57] # GPU blocks: 7999, # CPU blocks: 1638\n
2025-01-10 10:54:02.571 | info | 0a13jc93spuadh | INFO 01-10 13:54:02
worker.py:232] Memory profiling results: total_gpu_memory=44.34GiB
initial_memory_usage=19.04GiB peak_torch_memory=19.88GiB
memory_usage_post_profile=19.09Gib non_torch_memory=0.50GiB kv_cache_size=19.53GiB
gpu_memory_utilization=0.90\n
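The profiling, block-count, and concurrency lines above fit together arithmetically. A rough check (the per-block KV size itself is model-dependent and not derivable from these logs):

    # Numbers taken from the 44.34 GiB worker's log lines above.
    total_gpu_memory = 44.34            # GiB
    gpu_memory_utilization = 0.90
    peak_torch_memory = 19.88           # GiB
    non_torch_memory = 0.50             # GiB

    kv_cache_size = total_gpu_memory * gpu_memory_utilization - (peak_torch_memory + non_torch_memory)
    print(round(kv_cache_size, 2))      # ~19.53 GiB, matching kv_cache_size above

    max_concurrency = 7999 * 16 / 8192  # GPU blocks * block_size / max_model_len
    print(round(max_concurrency, 2))    # ~15.62, matching "Maximum concurrency ... 15.62x"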
2025-01-10 10:54:02.513 | info | 0a13jc93spuadh | (VllmWorkerProcess pid=228) INFO 01-10 13:54:02 worker.py:232] Memory profiling results:
total_gpu_memory=44.34GiB initial_memory_usage=19.04GiB peak_torch_memory=19.84GiB
memory_usage_post_profile=19.09Gib non_torch_memory=0.50GiB kv_cache_size=19.57GiB
gpu_memory_utilization=0.90\n
2025-01-10 10:53:54.866 | info | 0a13jc93spuadh | (VllmWorkerProcess pid=228) INFO 01-10 13:53:54 model_runner.py:1077] Loading model weights took
18.5769 GB\n
2025-01-10 10:53:54.811 | info | 0a13jc93spuadh | INFO 01-10 13:53:54
model_runner.py:1077] Loading model weights took 18.5769 GB\n
2025-01-10 10:53:54.811 | info | 0a13jc93spuadh | \n
2025-01-10 10:53:54.811 | info | 0a13jc93spuadh | \rLoading safetensors checkpoint
shards: 100% Completed | 9/9 [00:17<00:00, 2.00s/it]\n
2025-01-10 10:53:54.570 | info | 0a13jc93spuadh | \rLoading safetensors checkpoint
shards: 100% Completed | 9/9 [00:17<00:00, 1.61s/it]\n
2025-01-10 10:53:54.052 | info | 0a13jc93spuadh | \rLoading safetensors checkpoint
shards: 89% Completed | 8/9 [00:17<00:02, 2.10s/it]\n
2025-01-10 10:53:52.326 | info | 0a13jc93spuadh | \rLoading safetensors checkpoint
shards: 78% Completed | 7/9 [00:15<00:04, 2.28s/it]\n
2025-01-10 10:53:50.153 | info | 0a13jc93spuadh | \rLoading safetensors checkpoint
shards: 67% Completed | 6/9 [00:13<00:06, 2.33s/it]\n
2025-01-10 10:53:47.788 | info | 0a13jc93spuadh | \rLoading safetensors checkpoint
shards: 56% Completed | 5/9 [00:11<00:09, 2.31s/it]\n
2025-01-10 10:53:45.376 | info | 0a13jc93spuadh | \rLoading safetensors checkpoint
shards: 44% Completed | 4/9 [00:08<00:11, 2.26s/it]\n
2025-01-10 10:53:45.315 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess pid=228) INFO 01-10 13:53:45 model_runner.py:1400] Capturing cudagraphs for
decoding. This may lead to unexpected consequences if the model is not static. To
run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in
the CLI.\n
2025-01-10 10:53:45.315 | info | q5xcbyb3ls1u0i | INFO 01-10 13:53:42
distributed_gpu_executor.py:61] Maximum concurrency for 8192 tokens per request:
17.62x\n
2025-01-10 10:53:43.050 | info | 0a13jc93spuadh | \rLoading safetensors checkpoint
shards: 33% Completed | 3/9 [00:06<00:13, 2.21s/it]\n
2025-01-10 10:53:42.239 | info | q5xcbyb3ls1u0i | INFO 01-10 13:53:42
distributed_gpu_executor.py:57] # GPU blocks: 9020, # CPU blocks: 1638\n
2025-01-10 10:53:42.073 | info | q5xcbyb3ls1u0i | INFO 01-10 13:53:42
worker.py:232] Memory profiling results: total_gpu_memory=47.50GiB
initial_memory_usage=19.30GiB peak_torch_memory=19.88GiB
memory_usage_post_profile=19.44Gib non_torch_memory=0.85GiB kv_cache_size=22.02GiB
gpu_memory_utilization=0.90\n
2025-01-10 10:53:42.009 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess pid=228) INFO 01-10 13:53:42 worker.py:232] Memory profiling results:
total_gpu_memory=47.50GiB initial_memory_usage=19.30GiB peak_torch_memory=19.84GiB
memory_usage_post_profile=19.42Gib non_torch_memory=0.83GiB kv_cache_size=22.08GiB
gpu_memory_utilization=0.90\n
2025-01-10 10:53:40.802 | info | 0a13jc93spuadh | \rLoading safetensors checkpoint
shards: 22% Completed | 2/9 [00:04<00:15, 2.18s/it]\n
2025-01-10 10:53:38.263 | info | 0a13jc93spuadh | \rLoading safetensors checkpoint
shards: 11% Completed | 1/9 [00:01<00:13, 1.66s/it]\n
2025-01-10 10:53:37.146 | info | q5xcbyb3ls1u0i | INFO 01-10 13:53:37
model_runner.py:1077] Loading model weights took 18.5769 GB\n
2025-01-10 10:53:37.146 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess pid=228) INFO 01-10 13:53:36 model_runner.py:1077] Loading model weights took
18.5769 GB\n
2025-01-10 10:53:37.146 | info | q5xcbyb3ls1u0i | \n
2025-01-10 10:53:37.146 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 100% Completed | 9/9 [00:10<00:00, 1.14s/it]\n
2025-01-10 10:53:36.856 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 100% Completed | 9/9 [00:10<00:00, 1.14it/s]\n
2025-01-10 10:53:36.636 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 89% Completed | 8/9 [00:10<00:01, 1.17s/it]\n
2025-01-10 10:53:36.607 | info | 0a13jc93spuadh | \rLoading safetensors checkpoint
shards: 0% Completed | 0/9 [00:00<?, ?it/s]\n
2025-01-10 10:53:36.607 | info | 0a13jc93spuadh | (VllmWorkerProcess pid=228) INFO 01-10 13:53:36 model_runner.py:1072] Starting to load model
/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83...\n
2025-01-10 10:53:36.607 | info | 0a13jc93spuadh | INFO 01-10 13:53:36
model_runner.py:1072] Starting to load model /models/huggingface-cache/hub/models--
loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83...\n
2025-01-10 10:53:36.607 | info | 0a13jc93spuadh | INFO 01-10 13:53:36
shm_broadcast.py:236] vLLM message queue communication handle:
Handle(connect_ip='127.0.0.1', local_reader_ranks=[1],
buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at
0x72a0efd89ea0>, local_subscribe_port=39357, remote_subscribe_port=None)\n
2025-01-10 10:53:36.607 | info | 0a13jc93spuadh | (VllmWorkerProcess pid=228) INFO 01-10 13:53:36 custom_all_reduce_utils.py:242] reading GPU P2P
access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json\n
2025-01-10 10:53:36.286 | info | 0a13jc93spuadh | INFO 01-10 13:53:36
custom_all_reduce_utils.py:242] reading GPU P2P access cache from
/root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json\n
2025-01-10 10:53:36.286 | info | 0a13jc93spuadh | INFO 01-10 13:53:26
custom_all_reduce_utils.py:204] generating GPU P2P access cache in
/root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json\n
2025-01-10 10:53:36.286 | info | 0a13jc93spuadh | INFO 01-10 13:53:25
pynccl.py:69] vLLM is using nccl==2.21.5\n
2025-01-10 10:53:36.286 | info | 0a13jc93spuadh | INFO 01-10 13:53:25
utils.py:960] Found nccl from library libnccl.so.2\n
2025-01-10 10:53:36.286 | info | 0a13jc93spuadh | (VllmWorkerProcess pid=228) INFO 01-10 13:53:25 pynccl.py:69] vLLM is using nccl==2.21.5\n
2025-01-10 10:53:36.286 | info | 0a13jc93spuadh | (VllmWorkerProcess pid=228) INFO 01-10 13:53:25 utils.py:960] Found nccl from library libnccl.so.2\n
2025-01-10 10:53:36.285 | info | 0a13jc93spuadh | (VllmWorkerProcess pid=228) INFO 01-10 13:53:24 multiproc_worker_utils.py:215] Worker ready; awaiting tasks\n
2025-01-10 10:53:36.285 | info | 0a13jc93spuadh | (VllmWorkerProcess pid=228) INFO 01-10 13:53:24 selector.py:135] Using Flash Attention backend.\n
2025-01-10 10:53:36.285 | info | 0a13jc93spuadh | INFO 01-10 13:53:24
selector.py:135] Using Flash Attention backend.\n
2025-01-10 10:53:36.285 | info | 0a13jc93spuadh | INFO 01-10 13:53:24
custom_cache_manager.py:17] Setting Triton cache manager to:
vllm.triton_utils.custom_cache_manager:CustomCacheManager\n
2025-01-10 10:53:36.285 | warning | 0a13jc93spuadh | WARNING 01-10 13:53:24
multiproc_gpu_executor.py:56] Reducing Torch parallelism from 48 threads to 1 to
avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment
to tune this value as needed.\n
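The warning above points at OMP_NUM_THREADS as the external override. A minimal sketch (the value 8 is purely illustrative):

    # Hedged sketch: normally exported in the container environment before the worker starts;
    # setting it from Python only works if done before torch and the worker processes spin up.
    import os
    os.environ.setdefault("OMP_NUM_THREADS", "8")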
2025-01-10 10:53:36.285 | info | 0a13jc93spuadh | INFO 01-10 13:53:24
llm_engine.py:249] Initializing an LLM engine (v0.6.4) with config:
model='/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/
snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', speculative_config=None,
tokenizer='/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/
snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', skip_tokenizer_init=False,
tokenizer_mode=auto, revision=None, override_neuron_config=None,
tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16,
max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO,
tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False,
quantization=awq, enforce_eager=False, kv_cache_dtype=auto,
quantization_param_path=None, device_config=cuda,
decoding_config=DecodingConfig(guided_decoding_backend='outlines'),
observability_config=ObservabilityConfig(otlp_traces_endpoint=None,
collect_model_forward_time=False, collect_model_execute_time=False), seed=0,
served_model_name=/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-
v4.6-AWQ/snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83, num_scheduler_steps=1,
chunked_prefill_enabled=False multi_step_stream_outputs=True,
enable_prefix_caching=True, use_async_output_proc=True, use_cached_outputs=False,
chat_template_text_format=string, mm_processor_kwargs=None, pooler_config=None)\n
2025-01-10 10:53:36.285 | info | 0a13jc93spuadh | INFO 01-10 13:53:24
config.py:1020] Defaulting to use mp for distributed inference\n
2025-01-10 10:53:36.285 | warning | 0a13jc93spuadh | WARNING 01-10 13:53:24
config.py:428] awq quantization is not fully optimized yet. The speed can be slower
than non-quantized models.\n
2025-01-10 10:53:36.285 | info | 0a13jc93spuadh | INFO 01-10 13:53:24
awq_marlin.py:113] Detected that the model can run with awq_marlin, however you
specified quantization=awq explicitly, so forcing awq. Use quantization=awq_marlin
for faster inference\n
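Following the awq_marlin hint above only requires changing the quantization value; a sketch assuming the same checkpoint (which the log reports as awq_marlin-compatible):

    from vllm import AsyncEngineArgs

    # Placeholder path; only `quantization` differs from the logged configuration.
    engine_args = AsyncEngineArgs(model="/models/.../snapshots/<snapshot>", quantization="awq_marlin")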
2025-01-10 10:53:36.285 | info | 0a13jc93spuadh | INFO 01-10 13:53:24
config.py:350] This model supports multiple tasks: {'embedding', 'generate'}.
Defaulting to 'generate'.\n
2025-01-10 10:53:36.285 | info | 0a13jc93spuadh | tokenizer_name_or_path:
/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83, tokenizer_revision: None,
trust_remote_code: False\n
2025-01-10 10:53:36.285 | info | 0a13jc93spuadh | engine.py :26 2025-
01-10 13:53:18,382 Engine args:
AsyncEngineArgs(model='/models/huggingface-cache/hub/models--loremipsum3658--
Bacchus-v4.6-AWQ/snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83',
served_model_name=None, tokenizer='/models/huggingface-cache/hub/models--
loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', task='auto', skip_tokenizer_init=False,
tokenizer_mode='auto', chat_template_text_format='string', trust_remote_code=False,
allowed_local_media_path='', download_dir=None, load_format='auto',
config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto',
quantization_param_path=None, seed=0, max_model_len=8192, worker_use_ray=False,
distributed_executor_backend=None, pipeline_parallel_size=1,
tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16,
enable_prefix_caching=True, disable_sliding_window=False,
use_v2_block_manager='true', swap_space=4, cpu_offload_gb=0,
gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256,
max_logprobs=20, disable_log_stats=False, revision=None, code_revision=None,
rope_scaling=None, rope_theta=None, hf_overrides=None, tokenizer_revision=None,
quantization='awq', enforce_eager=False, max_seq_len_to_capture=8192,
disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray',
tokenizer_pool_extra_config=None, limit_mm_per_prompt=None,
mm_processor_kwargs=None, enable_lora=False, enable_lora_bias=False, max_loras=1,
max_lora_rank=16, enable_prompt_adapter=False, max_prompt_adapters=1,
max_prompt_adapter_token=0, fully_sharded_loras=False, lora_extra_vocab_size=256,
long_lora_scaling_factors=None, lora_dtype='auto', max_cpu_loras=None,
device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True,
ray_workers_use_nsight=False, num_gpu_blocks_override=None, num_lookahead_slots=0,
model_loader_extra_config=None, ignore_patterns=None, preemption_mode=None,
scheduler_delay_factor=0.0, enable_chunked_prefill=None,
guided_decoding_backend='outlines', speculative_model=None,
speculative_model_quantization=None, speculative_draft_tensor_parallel_size=None,
num_speculative_tokens=None, speculative_disable_mqa_scorer=False,
speculative_max_model_len=None, speculative_disable_by_batch_size=None,
ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None,
spec_decoding_acceptance_method='rejection_sampler',
typical_acceptance_sampler_posterior_threshold=None,
typical_acceptance_sampler_posterior_alpha=None, qlora_adapter_name_or_path=None,
disable_logprobs_during_spec_decoding=None, otlp_traces_endpoint=None,
collect_detailed_traces=None, disable_async_output_proc=False,
scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None,
disable_log_requests=False)\n
2025-01-10 10:53:36.284 | info | 0a13jc93spuadh | engine_args.py :126 2025-
01-10 13:53:18,340 Using baked in model with args: {'MODEL_NAME':
'/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', 'TOKENIZER_NAME': '/models/huggingface-
cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83'}\n
2025-01-10 10:53:35.723 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 78% Completed | 7/9 [00:09<00:02, 1.30s/it]\n
2025-01-10 10:53:34.471 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 67% Completed | 6/9 [00:07<00:03, 1.32s/it]\n
2025-01-10 10:53:33.204 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 56% Completed | 5/9 [00:06<00:05, 1.34s/it]\n
2025-01-10 10:53:31.929 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 44% Completed | 4/9 [00:05<00:06, 1.38s/it]\n
2025-01-10 10:53:30.648 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 33% Completed | 3/9 [00:04<00:08, 1.45s/it]\n
2025-01-10 10:53:28.834 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 22% Completed | 2/9 [00:02<00:08, 1.15s/it]\n
2025-01-10 10:53:27.577 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 11% Completed | 1/9 [00:00<00:07, 1.01it/s]\n
2025-01-10 10:53:26.588 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 0% Completed | 0/9 [00:00<?, ?it/s]\n
2025-01-10 10:53:26.588 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess pid=228) INFO 01-10 13:53:26 model_runner.py:1072] Starting to load model
/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83...\n
2025-01-10 10:53:26.588 | info | q5xcbyb3ls1u0i | INFO 01-10 13:53:26
model_runner.py:1072] Starting to load model /models/huggingface-cache/hub/models--
loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83...\n
2025-01-10 10:53:26.588 | info | q5xcbyb3ls1u0i | INFO 01-10 13:53:26
shm_broadcast.py:236] vLLM message queue communication handle:
Handle(connect_ip='127.0.0.1', local_reader_ranks=[1],
buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at
0x78019ec69ed0>, local_subscribe_port=55929, remote_subscribe_port=None)\n
2025-01-10 10:53:26.588 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess pid=228) INFO 01-10 13:53:26 custom_all_reduce_utils.py:242] reading GPU P2P
access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json\n
2025-01-10 10:53:26.325 | info | q5xcbyb3ls1u0i | INFO 01-10 13:53:26
custom_all_reduce_utils.py:242] reading GPU P2P access cache from
/root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json\n
2025-01-10 10:53:18.634 | info | q5xcbyb3ls1u0i | INFO 01-10 13:53:18
custom_all_reduce_utils.py:204] generating GPU P2P access cache in
/root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json\n
2025-01-10 10:53:18.634 | info | q5xcbyb3ls1u0i | INFO 01-10 13:53:18
pynccl.py:69] vLLM is using nccl==2.21.5\n
2025-01-10 10:53:18.634 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess pid=228) INFO 01-10 13:53:18 pynccl.py:69] vLLM is using nccl==2.21.5\n
2025-01-10 10:53:18.634 | info | q5xcbyb3ls1u0i | INFO 01-10 13:53:18
utils.py:960] Found nccl from library libnccl.so.2\n
2025-01-10 10:53:18.424 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess pid=228) INFO 01-10 13:53:18 utils.py:960] Found nccl from library
libnccl.so.2\n
2025-01-10 10:53:18.424 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess pid=228) INFO 01-10 13:53:17 multiproc_worker_utils.py:215] Worker ready; awaiting tasks\n
2025-01-10 10:53:18.424 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess pid=228) INFO 01-10 13:53:17 selector.py:135] Using Flash Attention backend.\n
2025-01-10 10:53:17.247 | info | q5xcbyb3ls1u0i | INFO 01-10 13:53:17
selector.py:135] Using Flash Attention backend.\n
2025-01-10 10:53:17.247 | info | q5xcbyb3ls1u0i | INFO 01-10 13:53:17
custom_cache_manager.py:17] Setting Triton cache manager to:
vllm.triton_utils.custom_cache_manager:CustomCacheManager\n
2025-01-10 10:53:17.162 | warning | q5xcbyb3ls1u0i | WARNING 01-10 13:53:17
multiproc_gpu_executor.py:56] Reducing Torch parallelism from 64 threads to 1 to
avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment
to tune this value as needed.\n
2025-01-10 10:53:17.162 | info | q5xcbyb3ls1u0i | INFO 01-10 13:53:16
llm_engine.py:249] Initializing an LLM engine (v0.6.4) with config:
model='/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/
snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', speculative_config=None,
tokenizer='/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/
snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', skip_tokenizer_init=False,
tokenizer_mode=auto, revision=None, override_neuron_config=None,
tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16,
max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO,
tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False,
quantization=awq, enforce_eager=False, kv_cache_dtype=auto,
quantization_param_path=None, device_config=cuda,
decoding_config=DecodingConfig(guided_decoding_backend='outlines'),
observability_config=ObservabilityConfig(otlp_traces_endpoint=None,
collect_model_forward_time=False, collect_model_execute_time=False), seed=0,
served_model_name=/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-
v4.6-AWQ/snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83, num_scheduler_steps=1,
chunked_prefill_enabled=False multi_step_stream_outputs=True,
enable_prefix_caching=True, use_async_output_proc=True, use_cached_outputs=False,
chat_template_text_format=string, mm_processor_kwargs=None, pooler_config=None)\n
2025-01-10 10:53:17.162 | info | q5xcbyb3ls1u0i | INFO 01-10 13:53:16
config.py:1020] Defaulting to use mp for distributed inference\n
2025-01-10 10:53:17.162 | warning | q5xcbyb3ls1u0i | WARNING 01-10 13:53:16
config.py:428] awq quantization is not fully optimized yet. The speed can be slower
than non-quantized models.\n
2025-01-10 10:53:17.162 | info | q5xcbyb3ls1u0i | INFO 01-10 13:53:16
awq_marlin.py:113] Detected that the model can run with awq_marlin, however you
specified quantization=awq explicitly, so forcing awq. Use quantization=awq_marlin
for faster inference\n
2025-01-10 10:53:17.162 | info | q5xcbyb3ls1u0i | INFO 01-10 13:53:16
config.py:350] This model supports multiple tasks: {'generate', 'embedding'}.
Defaulting to 'generate'.\n
2025-01-10 10:53:16.851 | info | q5xcbyb3ls1u0i | tokenizer_name_or_path:
/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83, tokenizer_revision: None,
trust_remote_code: False\n
2025-01-10 10:53:16.850 | info | q5xcbyb3ls1u0i | engine.py :26 2025-
01-10 13:53:12,979 Engine args:
AsyncEngineArgs(model='/models/huggingface-cache/hub/models--loremipsum3658--
Bacchus-v4.6-AWQ/snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83',
served_model_name=None, tokenizer='/models/huggingface-cache/hub/models--
loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', task='auto', skip_tokenizer_init=False,
tokenizer_mode='auto', chat_template_text_format='string', trust_remote_code=False,
allowed_local_media_path='', download_dir=None, load_format='auto',
config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto',
quantization_param_path=None, seed=0, max_model_len=8192, worker_use_ray=False,
distributed_executor_backend=None, pipeline_parallel_size=1,
tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16,
enable_prefix_caching=True, disable_sliding_window=False,
use_v2_block_manager='true', swap_space=4, cpu_offload_gb=0,
gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256,
max_logprobs=20, disable_log_stats=False, revision=None, code_revision=None,
rope_scaling=None, rope_theta=None, hf_overrides=None, tokenizer_revision=None,
quantization='awq', enforce_eager=False, max_seq_len_to_capture=8192,
disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray',
tokenizer_pool_extra_config=None, limit_mm_per_prompt=None,
mm_processor_kwargs=None, enable_lora=False, enable_lora_bias=False, max_loras=1,
max_lora_rank=16, enable_prompt_adapter=False, max_prompt_adapters=1,
max_prompt_adapter_token=0, fully_sharded_loras=False, lora_extra_vocab_size=256,
long_lora_scaling_factors=None, lora_dtype='auto', max_cpu_loras=None,
device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True,
ray_workers_use_nsight=False, num_gpu_blocks_override=None, num_lookahead_slots=0,
model_loader_extra_config=None, ignore_patterns=None, preemption_mode=None,
scheduler_delay_factor=0.0, enable_chunked_prefill=None,
guided_decoding_backend='outlines', speculative_model=None,
speculative_model_quantization=None, speculative_draft_tensor_parallel_size=None,
num_speculative_tokens=None, speculative_disable_mqa_scorer=False,
speculative_max_model_len=None, speculative_disable_by_batch_size=None,
ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None,
spec_decoding_acceptance_method='rejection_sampler',
typical_acceptance_sampler_posterior_threshold=None,
typical_acceptance_sampler_posterior_alpha=None, qlora_adapter_name_or_path=None,
disable_logprobs_during_spec_decoding=None, otlp_traces_endpoint=None,
collect_detailed_traces=None, disable_async_output_proc=False,
scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None,
disable_log_requests=False)\n
2025-01-10 10:53:12.942 | info | q5xcbyb3ls1u0i | engine_args.py :126 2025-
01-10 13:53:12,941 Using baked in model with args: {'MODEL_NAME':
'/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', 'TOKENIZER_NAME': '/models/huggingface-
cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83'}\n
2025-01-10 10:51:05.935 | info | q5xcbyb3ls1u0i |
warnings.warn('resource_tracker: There appear to be %d '\n
2025-01-10 10:51:04.561 | info | q5xcbyb3ls1u0i |
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning:
resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at
shutdown\n
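The resource_tracker warning above refers to a SharedMemory segment that was never unlinked before shutdown. A generic sketch of the expected cleanup pattern (not code from this worker):

    from multiprocessing import shared_memory

    shm = shared_memory.SharedMemory(create=True, size=1024)
    try:
        shm.buf[:5] = b"hello"   # ... use the segment ...
    finally:
        shm.close()              # drop this process's mapping
        shm.unlink()             # remove the segment, so nothing is left for the resource_tracker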
2025-01-10 10:51:03.196 | warning | q5xcbyb3ls1u0i | [rank0]:[W110
13:51:03.481207011 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has
NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the
application should call destroy_process_group to ensure that any pending NCCL
operations have finished in this process. In rare cases this process can exit
before this point and block the progress of another member of the process group.
This constraint has always been present, but this warning has only been added
since PyTorch 2.4 (function operator())\n
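The ProcessGroupNCCL warning above names its own fix: call destroy_process_group before exit. A minimal sketch with torch.distributed (initialization details are assumed, not taken from this worker):

    import torch.distributed as dist

    def shutdown_distributed() -> None:
        if dist.is_available() and dist.is_initialized():
            dist.barrier()                # let pending NCCL work drain first
            dist.destroy_process_group()  # what the warning asks callers to do on normal exit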
2025-01-10 10:51:02.661 | info | q5xcbyb3ls1u0i | INFO 01-10 13:51:02
multiproc_worker_utils.py:120] Killing local vLLM worker processes\n
2025-01-10 10:51:02.661 | info | q5xcbyb3ls1u0i | INFO 01-10 13:51:00
async_llm_engine.py:62] Engine is gracefully shutting down.\n
2025-01-10 10:51:02.661 | info | q5xcbyb3ls1u0i | Finished.
2025-01-10 10:51:00.983 | error | q5xcbyb3ls1u0i | Failed to return job results. |
400, message='Bad Request', url='https://api.runpod.ai/v2/vllm-vw4dt9p5fckyh4/job-
done/q5xcbyb3ls1u0i/efd80437-4eac-4482-bd95-83a96461c99e-u1?
gpu=NVIDIA+RTX+6000+Ada+Generation&isStream=true'
2025-01-10 10:51:00.861 | info | q5xcbyb3ls1u0i | Finished running generator.
2025-01-10 10:51:00.753 | info | q5xcbyb3ls1u0i | INFO 01-10 13:51:00
async_llm_engine.py:176] Finished request 5696d6de05a043a9892f8b91e7272e63.\n
2025-01-10 10:51:00.753 | info | q5xcbyb3ls1u0i | INFO 01-10 13:50:59
metrics.py:465] Prefix cache hit rate: GPU: 66.58%, CPU: 0.00%\n
2025-01-10 10:50:59.675 | info | q5xcbyb3ls1u0i | INFO 01-10 13:50:59
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.6%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:50:59.675 | info | q5xcbyb3ls1u0i | INFO 01-10 13:50:54
metrics.py:465] Prefix cache hit rate: GPU: 66.58%, CPU: 0.00%\n
2025-01-10 10:50:54.647 | info | q5xcbyb3ls1u0i | INFO 01-10 13:50:54
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.5%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:50:51.034 | info | q5xcbyb3ls1u0i | Kill worker.
2025-01-10 10:50:51.034 | info | q5xcbyb3ls1u0i | INFO 01-10 13:50:49
metrics.py:465] Prefix cache hit rate: GPU: 66.58%, CPU: 0.00%\n
2025-01-10 10:50:49.644 | info | q5xcbyb3ls1u0i | INFO 01-10 13:50:49
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.4%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:50:49.644 | info | q5xcbyb3ls1u0i | INFO 01-10 13:50:44
metrics.py:465] Prefix cache hit rate: GPU: 66.58%, CPU: 0.00%\n
2025-01-10 10:50:44.615 | info | q5xcbyb3ls1u0i | INFO 01-10 13:50:44
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.3%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:50:44.615 | info | q5xcbyb3ls1u0i | INFO 01-10 13:50:39
metrics.py:465] Prefix cache hit rate: GPU: 66.58%, CPU: 0.00%\n
2025-01-10 10:50:39.611 | info | q5xcbyb3ls1u0i | INFO 01-10 13:50:39
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.1%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:50:39.611 | info | q5xcbyb3ls1u0i | Jobs in progress: 1
2025-01-10 10:50:39.611 | info | q5xcbyb3ls1u0i | Finished running generator.
2025-01-10 10:50:34.721 | info | q5xcbyb3ls1u0i | Finished.
2025-01-10 10:50:34.721 | info | q5xcbyb3ls1u0i | INFO 01-10 13:50:34
metrics.py:465] Prefix cache hit rate: GPU: 66.58%, CPU: 0.00%\n
2025-01-10 10:50:34.721 | info | q5xcbyb3ls1u0i | INFO 01-10 13:50:34
metrics.py:449] Avg prompt throughput: 3.0 tokens/s, Avg generation throughput:
34.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.0%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:50:34.721 | info | q5xcbyb3ls1u0i | Jobs in progress: 2
2025-01-10 10:50:34.721 | info | q5xcbyb3ls1u0i | Finished.
2025-01-10 10:50:34.721 | error | q5xcbyb3ls1u0i | n must be an int, but is of
type <class 'NoneType'>
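The "n must be an int" error above looks like a request that reached the sampling parameters with n=None. A hedged guess at a guard (the request shape is an assumption based only on this message):

    from vllm import SamplingParams

    def build_sampling_params(raw: dict) -> SamplingParams:
        # Drop None-valued fields so an absent "n" falls back to the library default
        # instead of failing validation with the error logged above.
        cleaned = {k: v for k, v in raw.items() if v is not None}
        return SamplingParams(**cleaned)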
2025-01-10 10:50:34.721 | info | q5xcbyb3ls1u0i | INFO 01-10 13:50:34
async_llm_engine.py:208] Added request 5696d6de05a043a9892f8b91e7272e63.\n
2025-01-10 10:50:34.721 | info | q5xcbyb3ls1u0i | Jobs in progress: 3
2025-01-10 10:50:34.563 | info | q5xcbyb3ls1u0i | Finished.
2025-01-10 10:50:34.563 | info | q5xcbyb3ls1u0i | Finished running generator.
2025-01-10 10:50:34.459 | info | q5xcbyb3ls1u0i | Finished running generator.
2025-01-10 10:50:31.802 | info | q5xcbyb3ls1u0i | INFO 01-10 13:50:31
async_llm_engine.py:176] Finished request ba1d0bc7343446598f0d8a09a8e54953.\n
2025-01-10 10:50:30.284 | info | q5xcbyb3ls1u0i | INFO 01-10 13:50:30
async_llm_engine.py:176] Finished request e3f164415a914240a3137449e682cffb.\n
2025-01-10 10:50:30.284 | info | q5xcbyb3ls1u0i | INFO 01-10 13:50:27
metrics.py:465] Prefix cache hit rate: GPU: 66.54%, CPU: 0.00%\n
2025-01-10 10:50:27.033 | info | q5xcbyb3ls1u0i | INFO 01-10 13:50:27
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
66.4 tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 1.0%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:50:27.033 | info | q5xcbyb3ls1u0i | INFO 01-10 13:50:22
metrics.py:465] Prefix cache hit rate: GPU: 66.54%, CPU: 0.00%\n
2025-01-10 10:50:22.030 | info | q5xcbyb3ls1u0i | INFO 01-10 13:50:22
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
66.5 tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.7%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:50:22.030 | info | q5xcbyb3ls1u0i | INFO 01-10 13:50:17
metrics.py:465] Prefix cache hit rate: GPU: 66.54%, CPU: 0.00%\n
2025-01-10 10:50:17.011 | info | q5xcbyb3ls1u0i | INFO 01-10 13:50:17
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
66.3 tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.5%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:50:12.866 | info | q5xcbyb3ls1u0i | Jobs in queue: 2
2025-01-10 10:50:12.494 | info | q5xcbyb3ls1u0i | Jobs in queue: 1
2025-01-10 10:50:12.494 | info | q5xcbyb3ls1u0i | INFO 01-10 13:50:12
metrics.py:465] Prefix cache hit rate: GPU: 66.54%, CPU: 0.00%\n
2025-01-10 10:50:12.004 | info | q5xcbyb3ls1u0i | INFO 01-10 13:50:12
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
66.3 tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.2%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:50:12.004 | info | q5xcbyb3ls1u0i | INFO 01-10 13:50:06
metrics.py:465] Prefix cache hit rate: GPU: 66.54%, CPU: 0.00%\n
2025-01-10 10:50:06.999 | info | q5xcbyb3ls1u0i | INFO 01-10 13:50:06
metrics.py:449] Avg prompt throughput: 0.1 tokens/s, Avg generation throughput: 0.1
tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage:
0.0%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:50:06.999 | info | q5xcbyb3ls1u0i | INFO 01-10 13:50:06
async_llm_engine.py:208] Added request e3f164415a914240a3137449e682cffb.\n
2025-01-10 10:50:06.999 | info | q5xcbyb3ls1u0i | INFO 01-10 13:50:06
async_llm_engine.py:208] Added request ba1d0bc7343446598f0d8a09a8e54953.\n
2025-01-10 10:50:06.999 | info | q5xcbyb3ls1u0i | Jobs in progress: 2
2025-01-10 10:50:06.934 | info | q5xcbyb3ls1u0i | Jobs in queue: 2
2025-01-10 10:43:33.752 | info | q5xcbyb3ls1u0i | Finished.
2025-01-10 10:43:33.655 | info | q5xcbyb3ls1u0i | Finished running generator.
2025-01-10 10:43:33.510 | info | q5xcbyb3ls1u0i | INFO 01-10 13:43:33
async_llm_engine.py:176] Finished request a1efcafc7b714dbe9c156914a037f729.\n
2025-01-10 10:43:33.510 | info | q5xcbyb3ls1u0i | INFO 01-10 13:43:32
metrics.py:465] Prefix cache hit rate: GPU: 66.46%, CPU: 0.00%\n
2025-01-10 10:43:32.190 | info | q5xcbyb3ls1u0i | INFO 01-10 13:43:32
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.5%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:43:32.190 | info | q5xcbyb3ls1u0i | INFO 01-10 13:43:27
metrics.py:465] Prefix cache hit rate: GPU: 66.46%, CPU: 0.00%\n
2025-01-10 10:43:27.179 | info | q5xcbyb3ls1u0i | INFO 01-10 13:43:27
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.4%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:43:27.179 | info | q5xcbyb3ls1u0i | INFO 01-10 13:43:22
metrics.py:465] Prefix cache hit rate: GPU: 66.46%, CPU: 0.00%\n
2025-01-10 10:43:22.169 | info | q5xcbyb3ls1u0i | INFO 01-10 13:43:22
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.3%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:43:22.169 | info | q5xcbyb3ls1u0i | INFO 01-10 13:43:17
metrics.py:465] Prefix cache hit rate: GPU: 66.46%, CPU: 0.00%\n
2025-01-10 10:43:17.167 | info | q5xcbyb3ls1u0i | INFO 01-10 13:43:17
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.1%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:43:17.167 | info | q5xcbyb3ls1u0i | INFO 01-10 13:43:12
metrics.py:465] Prefix cache hit rate: GPU: 66.46%, CPU: 0.00%\n
2025-01-10 10:43:12.140 | info | q5xcbyb3ls1u0i | INFO 01-10 13:43:12
metrics.py:449] Avg prompt throughput: 0.5 tokens/s, Avg generation throughput: 3.3
tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage:
0.0%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:43:12.140 | info | q5xcbyb3ls1u0i | INFO 01-10 13:43:12
async_llm_engine.py:208] Added request a1efcafc7b714dbe9c156914a037f729.\n
2025-01-10 10:43:12.140 | info | q5xcbyb3ls1u0i | Jobs in progress: 1
2025-01-10 10:43:12.081 | info | q5xcbyb3ls1u0i | Jobs in queue: 1
2025-01-10 10:42:31.666 | info | q5xcbyb3ls1u0i | Finished.
2025-01-10 10:42:31.572 | info | q5xcbyb3ls1u0i | Finished running generator.
2025-01-10 10:42:31.424 | info | q5xcbyb3ls1u0i | INFO 01-10 13:42:31
async_llm_engine.py:176] Finished request 24e868554299472f88b8710ada09f895.\n
2025-01-10 10:42:31.424 | info | q5xcbyb3ls1u0i | INFO 01-10 13:42:26
metrics.py:465] Prefix cache hit rate: GPU: 66.41%, CPU: 0.00%\n
2025-01-10 10:42:26.966 | info | q5xcbyb3ls1u0i | INFO 01-10 13:42:26
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
32.9 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.5%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:42:26.966 | info | q5xcbyb3ls1u0i | INFO 01-10 13:42:21
metrics.py:465] Prefix cache hit rate: GPU: 66.41%, CPU: 0.00%\n
2025-01-10 10:42:21.946 | info | q5xcbyb3ls1u0i | INFO 01-10 13:42:21
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.4%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:42:21.946 | info | q5xcbyb3ls1u0i | INFO 01-10 13:42:16
metrics.py:465] Prefix cache hit rate: GPU: 66.41%, CPU: 0.00%\n
2025-01-10 10:42:16.925 | info | q5xcbyb3ls1u0i | INFO 01-10 13:42:16
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.3%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:42:16.925 | info | q5xcbyb3ls1u0i | INFO 01-10 13:42:11
metrics.py:465] Prefix cache hit rate: GPU: 66.41%, CPU: 0.00%\n
2025-01-10 10:42:11.906 | info | q5xcbyb3ls1u0i | INFO 01-10 13:42:11
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
33.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.1%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:42:11.906 | info | q5xcbyb3ls1u0i | INFO 01-10 13:42:06
metrics.py:465] Prefix cache hit rate: GPU: 66.41%, CPU: 0.00%\n
2025-01-10 10:42:06.891 | info | q5xcbyb3ls1u0i | INFO 01-10 13:42:06
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0
tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage:
0.0%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:42:06.891 | info | q5xcbyb3ls1u0i | INFO 01-10 13:42:06
async_llm_engine.py:208] Added request 24e868554299472f88b8710ada09f895.\n
2025-01-10 10:42:06.891 | info | q5xcbyb3ls1u0i | Jobs in progress: 1
2025-01-10 10:42:06.821 | info | q5xcbyb3ls1u0i | Jobs in queue: 1
2025-01-10 10:38:12.535 | info | 3yn75aympd52v1 |
warnings.warn('resource_tracker: There appear to be %d '\n
2025-01-10 10:38:11.619 | info | 3yn75aympd52v1 |
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning:
resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at
shutdown\n
2025-01-10 10:38:10.729 | warning | 3yn75aympd52v1 | [rank0]:[W110
13:38:10.392588627 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has
NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the
application should call destroy_process_group to ensure that any pending NCCL
operations have finished in this process. In rare cases this process can exit
before this point and block the progress of another member of the process group.
This constraint has always been present, but this warning has only been added
since PyTorch 2.4 (function operator())\n
2025-01-10 10:38:10.729 | info | 3yn75aympd52v1 | INFO 01-10 13:38:10
multiproc_worker_utils.py:120] Killing local vLLM worker processes\n
2025-01-10 10:38:10.319 | error | 3yn75aympd52v1 | ERROR 01-10 13:38:10
multiproc_worker_utils.py:116] Worker VllmWorkerProcess pid 228 died, exit code: -15\n
2025-01-10 10:38:10.319 | info | 3yn75aympd52v1 | Kill worker.
2025-01-10 10:38:09.185 | info | 3yn75aympd52v1 | --- Starting Serverless Worker |
Version 1.7.5 ---\n
2025-01-10 10:38:09.185 | info | 3yn75aympd52v1 | INFO 01-10 13:12:47
chat_utils.py:431] ' }}{% endif %}\n
2025-01-10 10:38:09.185 | info | 3yn75aympd52v1 | INFO 01-10 13:12:47
chat_utils.py:431] \n
2025-01-10 10:38:09.185 | info | 3yn75aympd52v1 | INFO 01-10 13:12:47
chat_utils.py:431] '+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0
== 0 %}{% set content = '<|begin_of_text|>' + content %}{% endif %}{{ content }}{%
endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|
end_header_id|>\n
2025-01-10 10:38:09.185 | info | 3yn75aympd52v1 | INFO 01-10 13:12:47
chat_utils.py:431] \n
2025-01-10 10:38:09.184 | info | 3yn75aympd52v1 | INFO 01-10 13:12:47
chat_utils.py:431] {% set loop_messages = messages %}{% for message in
loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|
end_header_id|>\n
2025-01-10 10:38:09.184 | info | 3yn75aympd52v1 | INFO 01-10 13:12:47
chat_utils.py:431] Using supplied chat template:\n
2025-01-10 10:21:10.112 | info | q5xcbyb3ls1u0i | Finished.
2025-01-10 10:21:09.804 | info | q5xcbyb3ls1u0i | Finished running generator.
2025-01-10 10:21:09.582 | info | q5xcbyb3ls1u0i | INFO 01-10 13:21:09
async_llm_engine.py:176] Finished request 4949b4cdea9345e381c515af15a04d0a.\n
2025-01-10 10:21:09.582 | info | q5xcbyb3ls1u0i | INFO 01-10 13:21:08
metrics.py:465] Prefix cache hit rate: GPU: 66.50%, CPU: 0.00%\n
2025-01-10 10:21:08.115 | info | q5xcbyb3ls1u0i | INFO 01-10 13:21:08
metrics.py:449] Avg prompt throughput: 26.1 tokens/s, Avg generation throughput:
1.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.9%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:21:08.115 | info | q5xcbyb3ls1u0i | INFO 01-10 13:21:08
async_llm_engine.py:208] Added request 4949b4cdea9345e381c515af15a04d0a.\n
2025-01-10 10:21:08.115 | info | q5xcbyb3ls1u0i | Jobs in progress: 1
2025-01-10 10:21:08.046 | info | q5xcbyb3ls1u0i | Jobs in queue: 1
2025-01-10 10:20:24.344 | info | q5xcbyb3ls1u0i | Finished.
2025-01-10 10:20:24.049 | info | q5xcbyb3ls1u0i | Finished running generator.
2025-01-10 10:20:23.820 | info | q5xcbyb3ls1u0i | INFO 01-10 13:20:23
async_llm_engine.py:176] Finished request 85c21cf88ece4591bee714e4b27dd6dc.\n
2025-01-10 10:20:23.820 | info | q5xcbyb3ls1u0i | INFO 01-10 13:20:21
metrics.py:465] Prefix cache hit rate: GPU: 63.06%, CPU: 0.00%\n
2025-01-10 10:20:21.490 | info | q5xcbyb3ls1u0i | INFO 01-10 13:20:21
metrics.py:449] Avg prompt throughput: 18.2 tokens/s, Avg generation throughput:
0.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.8%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:20:21.489 | info | q5xcbyb3ls1u0i | INFO 01-10 13:20:21
async_llm_engine.py:208] Added request 85c21cf88ece4591bee714e4b27dd6dc.\n
2025-01-10 10:20:21.489 | info | q5xcbyb3ls1u0i | Jobs in progress: 1
2025-01-10 10:20:21.414 | info | q5xcbyb3ls1u0i | Jobs in queue: 1
2025-01-10 10:19:40.964 | info | 0a13jc93spuadh | INFO 01-10 13:11:19
pynccl.py:69] vLLM is using nccl==2.21.5\n
2025-01-10 10:19:40.964 | info | 0a13jc93spuadh | INFO 01-10 13:11:19
utils.py:960] Found nccl from library libnccl.so.2\n
2025-01-10 10:19:40.964 | info | 0a13jc93spuadh | (VllmWorkerProcess pid=228) INFO 01-10 13:11:19 pynccl.py:69] vLLM is using nccl==2.21.5\n
2025-01-10 10:19:21.643 | info | q5xcbyb3ls1u0i | Finished.
2025-01-10 10:19:21.421 | info | q5xcbyb3ls1u0i | Finished running generator.
2025-01-10 10:19:21.186 | info | q5xcbyb3ls1u0i | INFO 01-10 13:19:21
async_llm_engine.py:176] Finished request 66db82b88d184b9bb360d41144af72f9.\n
2025-01-10 10:19:21.186 | info | q5xcbyb3ls1u0i | INFO 01-10 13:19:19
metrics.py:465] Prefix cache hit rate: GPU: 59.19%, CPU: 0.00%\n
2025-01-10 10:19:19.924 | info | q5xcbyb3ls1u0i | INFO 01-10 13:19:19
metrics.py:449] Avg prompt throughput: 9.9 tokens/s, Avg generation throughput: 0.7
tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage:
0.7%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:19:19.924 | info | q5xcbyb3ls1u0i | INFO 01-10 13:19:19
async_llm_engine.py:208] Added request 66db82b88d184b9bb360d41144af72f9.\n
2025-01-10 10:19:19.924 | info | q5xcbyb3ls1u0i | Jobs in progress: 1
2025-01-10 10:19:19.778 | info | q5xcbyb3ls1u0i | Jobs in queue: 1
2025-01-10 10:12:56.795 | info | q5xcbyb3ls1u0i | Finished.
2025-01-10 10:12:56.795 | info | q5xcbyb3ls1u0i | Jobs in progress: 1
2025-01-10 10:12:56.695 | info | q5xcbyb3ls1u0i | Finished.
2025-01-10 10:12:56.695 | info | q5xcbyb3ls1u0i | Jobs in progress: 2
2025-01-10 10:12:56.494 | info | q5xcbyb3ls1u0i | Finished.
2025-01-10 10:12:56.294 | info | q5xcbyb3ls1u0i | Finished running generator.
2025-01-10 10:12:56.294 | info | q5xcbyb3ls1u0i | Finished running generator.
2025-01-10 10:12:56.183 | info | q5xcbyb3ls1u0i | Finished running generator.
2025-01-10 10:12:56.062 | info | q5xcbyb3ls1u0i | INFO 01-10 13:12:56
async_llm_engine.py:176] Finished request dd6e24d3701a49f5868710e068277e58.\n
2025-01-10 10:12:56.062 | info | q5xcbyb3ls1u0i | INFO 01-10 13:12:55
async_llm_engine.py:176] Finished request 18cf0fcd4eaf48d7ae15e4bf3d6a84fd.\n
2025-01-10 10:12:55.952 | info | q5xcbyb3ls1u0i | INFO 01-10 13:12:55
async_llm_engine.py:176] Finished request 82b21d1b42f54ce9aaa89837d52a8c91.\n
2025-01-10 10:12:55.952 | info | q5xcbyb3ls1u0i | Jobs in progress: 3
2025-01-10 10:12:54.584 | info | q5xcbyb3ls1u0i | Finished.
2025-01-10 10:12:54.232 | info | q5xcbyb3ls1u0i | Finished running generator.
2025-01-10 10:12:53.753 | info | q5xcbyb3ls1u0i | INFO 01-10 13:12:53
async_llm_engine.py:176] Finished request c705e34266f54a2086b1ab660a7c7a44.\n
2025-01-10 10:12:53.753 | info | q5xcbyb3ls1u0i | INFO 01-10 13:12:53
metrics.py:465] Prefix cache hit rate: GPU: 55.38%, CPU: 0.00%\n
2025-01-10 10:12:53.201 | info | q5xcbyb3ls1u0i | INFO 01-10 13:12:53
metrics.py:449] Avg prompt throughput: 623.0 tokens/s, Avg generation throughput:
0.3 tokens/s, Running: 4 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 2.9%, CPU KV cache usage: 0.0%.\n
2025-01-10 10:12:53.201 | info | q5xcbyb3ls1u0i | INFO 01-10 13:12:47
async_llm_engine.py:208] Added request 18cf0fcd4eaf48d7ae15e4bf3d6a84fd.\n
2025-01-10 10:12:53.201 | info | q5xcbyb3ls1u0i | INFO 01-10 13:12:47
async_llm_engine.py:208] Added request 82b21d1b42f54ce9aaa89837d52a8c91.\n
2025-01-10 10:12:53.200 | info | q5xcbyb3ls1u0i | INFO 01-10 13:12:47
async_llm_engine.py:208] Added request c705e34266f54a2086b1ab660a7c7a44.\n
2025-01-10 10:12:53.200 | info | q5xcbyb3ls1u0i | INFO 01-10 13:12:47
async_llm_engine.py:208] Added request dd6e24d3701a49f5868710e068277e58.\n
2025-01-10 10:12:53.200 | info | q5xcbyb3ls1u0i | Jobs in progress: 4
2025-01-10 10:12:53.200 | info | q5xcbyb3ls1u0i | Jobs in queue: 4
2025-01-10 10:12:47.960 | info | q5xcbyb3ls1u0i | --- Starting Serverless Worker |
Version 1.7.5 ---\n
2025-01-10 10:12:47.960 | info | q5xcbyb3ls1u0i | INFO 01-10 13:12:43
chat_utils.py:431] ' }}{% endif %}\n
2025-01-10 10:12:47.960 | info | q5xcbyb3ls1u0i | INFO 01-10 13:12:43
chat_utils.py:431] \n
2025-01-10 10:12:47.960 | info | q5xcbyb3ls1u0i | INFO 01-10 13:12:43
chat_utils.py:431] '+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0
== 0 %}{% set content = '<|begin_of_text|>' + content %}{% endif %}{{ content }}{%
endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|
end_header_id|>\n
2025-01-10 10:12:47.960 | info | q5xcbyb3ls1u0i | INFO 01-10 13:12:43
chat_utils.py:431] \n
2025-01-10 10:12:47.960 | info | q5xcbyb3ls1u0i | INFO 01-10 13:12:43
chat_utils.py:431] {% set loop_messages = messages %}{% for message in
loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|
end_header_id|>\n
2025-01-10 10:12:47.959 | info | q5xcbyb3ls1u0i | INFO 01-10 13:12:43
chat_utils.py:431] Using supplied chat template:\n
2025-01-10 10:12:47.452 | info | 3yn75aympd52v1 | tokenizer_name_or_path:
/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83, tokenizer_revision: None,
trust_remote_code: False\n
2025-01-10 10:12:47.452 | info | 3yn75aympd52v1 | engine.py :26 2025-
01-10 13:12:47,162 Engine args:
AsyncEngineArgs(model='/models/huggingface-cache/hub/models--loremipsum3658--
Bacchus-v4.6-AWQ/snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83',
served_model_name=None, tokenizer='/models/huggingface-cache/hub/models--
loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', task='auto', skip_tokenizer_init=False,
tokenizer_mode='auto', chat_template_text_format='string', trust_remote_code=False,
allowed_local_media_path='', download_dir=None, load_format='auto',
config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto',
quantization_param_path=None, seed=0, max_model_len=8192, worker_use_ray=False,
distributed_executor_backend=None, pipeline_parallel_size=1,
tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16,
enable_prefix_caching=True, disable_sliding_window=False,
use_v2_block_manager='true', swap_space=4, cpu_offload_gb=0,
gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256,
max_logprobs=20, disable_log_stats=False, revision=None, code_revision=None,
rope_scaling=None, rope_theta=None, hf_overrides=None, tokenizer_revision=None,
quantization='awq', enforce_eager=False, max_seq_len_to_capture=8192,
disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray',
tokenizer_pool_extra_config=None, limit_mm_per_prompt=None,
mm_processor_kwargs=None, enable_lora=False, enable_lora_bias=False, max_loras=1,
max_lora_rank=16, enable_prompt_adapter=False, max_prompt_adapters=1,
max_prompt_adapter_token=0, fully_sharded_loras=False, lora_extra_vocab_size=256,
long_lora_scaling_factors=None, lora_dtype='auto', max_cpu_loras=None,
device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True,
ray_workers_use_nsight=False, num_gpu_blocks_override=None, num_lookahead_slots=0,
model_loader_extra_config=None, ignore_patterns=None, preemption_mode=None,
scheduler_delay_factor=0.0, enable_chunked_prefill=None,
guided_decoding_backend='outlines', speculative_model=None,
speculative_model_quantization=None, speculative_draft_tensor_parallel_size=None,
num_speculative_tokens=None, speculative_disable_mqa_scorer=False,
speculative_max_model_len=None, speculative_disable_by_batch_size=None,
ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None,
spec_decoding_acceptance_method='rejection_sampler',
typical_acceptance_sampler_posterior_threshold=None,
typical_acceptance_sampler_posterior_alpha=None, qlora_adapter_name_or_path=None,
disable_logprobs_during_spec_decoding=None, otlp_traces_endpoint=None,
collect_detailed_traces=None, disable_async_output_proc=False,
scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None,
disable_log_requests=False)\n
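The "Engine args" record above lists the full AsyncEngineArgs this worker started with. As a minimal sketch (not the worker's actual startup code) of building an async engine from the key values seen in that record, assuming vLLM 0.6.4 as logged:

from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

# Values copied from the Engine args record above; the snapshot path is the baked-in model.
engine_args = AsyncEngineArgs(
    model="/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/"
          "snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83",
    quantization="awq",
    tensor_parallel_size=2,
    max_model_len=8192,
    gpu_memory_utilization=0.9,
    enable_prefix_caching=True,
    swap_space=4,
)
engine = AsyncLLMEngine.from_engine_args(engine_args)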
2025-01-10 10:12:47.451 | info | 3yn75aympd52v1 | engine_args.py :126 2025-
01-10 13:12:47,157 Using baked in model with args: {'MODEL_NAME':
'/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', 'TOKENIZER_NAME': '/models/huggingface-
cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83'}\n
2025-01-10 10:12:47.158 | info | 3yn75aympd52v1 | engine.py :112 2025-
01-10 13:12:47,157 Initialized vLLM engine in 70.69s\n
2025-01-10 10:12:47.158 | info | 3yn75aympd52v1 | INFO 01-10 13:12:46
model_runner.py:1518] Graph capturing finished in 28 secs, took 0.99 GiB\n
2025-01-10 10:12:47.158 | info | 3yn75aympd52v1 | (VllmWorkerProcess pid=228) INFO 01-10 13:12:46 model_runner.py:1518] Graph capturing finished in 28 secs, took 0.99 GiB\n
2025-01-10 10:12:46.819 | info | 3yn75aympd52v1 | (VllmWorkerProcess pid=228) INFO 01-10 13:12:46 custom_all_reduce.py:224] Registering 5635 cuda graph addresses\n
2025-01-10 10:12:46.757 | info | 3yn75aympd52v1 | INFO 01-10 13:12:46
custom_all_reduce.py:224] Registering 5635 cuda graph addresses\n
2025-01-10 10:12:46.757 | info | 3yn75aympd52v1 | INFO 01-10 13:12:18
model_runner.py:1404] If out-of-memory error occurs during cudagraph capture,
consider decreasing `gpu_memory_utilization` or switching to eager mode. You can
also reduce the `max_num_seqs` as needed to decrease memory usage.\n
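The model_runner.py:1404 hint above names the knobs to reach for if cudagraph capture runs out of memory. A minimal sketch of applying them through engine arguments; the specific values are illustrative assumptions, not settings from this deployment (which keeps gpu_memory_utilization=0.9, enforce_eager=False, max_num_seqs=256):

from vllm.engine.arg_utils import AsyncEngineArgs

args = AsyncEngineArgs(
    model="...",                   # model path elided; see the Engine args records in this log
    enforce_eager=True,            # skip cudagraph capture entirely
    gpu_memory_utilization=0.85,   # leave more headroom than the default 0.9
    max_num_seqs=128,              # fewer concurrent sequences to capture graphs for
)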
2025-01-10 10:12:43.039 | info | q5xcbyb3ls1u0i | tokenizer_name_or_path:
/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83, tokenizer_revision: None,
trust_remote_code: False\n
2025-01-10 10:12:43.039 | info | q5xcbyb3ls1u0i | engine.py :26 2025-
01-10 13:12:42,779 Engine args:
AsyncEngineArgs(model='/models/huggingface-cache/hub/models--loremipsum3658--
Bacchus-v4.6-AWQ/snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83',
served_model_name=None, tokenizer='/models/huggingface-cache/hub/models--
loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', task='auto', skip_tokenizer_init=False,
tokenizer_mode='auto', chat_template_text_format='string', trust_remote_code=False,
allowed_local_media_path='', download_dir=None, load_format='auto',
config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto',
quantization_param_path=None, seed=0, max_model_len=8192, worker_use_ray=False,
distributed_executor_backend=None, pipeline_parallel_size=1,
tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16,
enable_prefix_caching=True, disable_sliding_window=False,
use_v2_block_manager='true', swap_space=4, cpu_offload_gb=0,
gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256,
max_logprobs=20, disable_log_stats=False, revision=None, code_revision=None,
rope_scaling=None, rope_theta=None, hf_overrides=None, tokenizer_revision=None,
quantization='awq', enforce_eager=False, max_seq_len_to_capture=8192,
disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray',
tokenizer_pool_extra_config=None, limit_mm_per_prompt=None,
mm_processor_kwargs=None, enable_lora=False, enable_lora_bias=False, max_loras=1,
max_lora_rank=16, enable_prompt_adapter=False, max_prompt_adapters=1,
max_prompt_adapter_token=0, fully_sharded_loras=False, lora_extra_vocab_size=256,
long_lora_scaling_factors=None, lora_dtype='auto', max_cpu_loras=None,
device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True,
ray_workers_use_nsight=False, num_gpu_blocks_override=None, num_lookahead_slots=0,
model_loader_extra_config=None, ignore_patterns=None, preemption_mode=None,
scheduler_delay_factor=0.0, enable_chunked_prefill=None,
guided_decoding_backend='outlines', speculative_model=None,
speculative_model_quantization=None, speculative_draft_tensor_parallel_size=None,
num_speculative_tokens=None, speculative_disable_mqa_scorer=False,
speculative_max_model_len=None, speculative_disable_by_batch_size=None,
ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None,
spec_decoding_acceptance_method='rejection_sampler',
typical_acceptance_sampler_posterior_threshold=None,
typical_acceptance_sampler_posterior_alpha=None, qlora_adapter_name_or_path=None,
disable_logprobs_during_spec_decoding=None, otlp_traces_endpoint=None,
collect_detailed_traces=None, disable_async_output_proc=False,
scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None,
disable_log_requests=False)\n
2025-01-10 10:12:43.039 | info | q5xcbyb3ls1u0i | engine_args.py :126 2025-
01-10 13:12:42,774 Using baked in model with args: {'MODEL_NAME':
'/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', 'TOKENIZER_NAME': '/models/huggingface-
cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83'}\n
2025-01-10 10:12:42.774 | info | q5xcbyb3ls1u0i | engine.py :112 2025-
01-10 13:12:42,773 Initialized vLLM engine in 67.10s\n
2025-01-10 10:12:42.774 | info | q5xcbyb3ls1u0i | INFO 01-10 13:12:42
model_runner.py:1518] Graph capturing finished in 24 secs, took 0.99 GiB\n
2025-01-10 10:12:42.774 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess pid=228) INFO 01-10 13:12:42 model_runner.py:1518] Graph capturing finished in 24 secs, took 0.99 GiB\n
2025-01-10 10:12:42.460 | info | q5xcbyb3ls1u0i | INFO 01-10 13:12:42
custom_all_reduce.py:224] Registering 5635 cuda graph addresses\n
2025-01-10 10:12:40.414 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess pid=228) INFO 01-10 13:12:40 custom_all_reduce.py:224] Registering 5635 cuda graph addresses\n
2025-01-10 10:12:40.414 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess pid=228) INFO 01-10 13:12:18 model_runner.py:1404] If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.\n
2025-01-10 10:12:40.414 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess pid=228) INFO 01-10 13:12:18 model_runner.py:1400] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.\n
2025-01-10 10:12:40.414 | info | q5xcbyb3ls1u0i | INFO 01-10 13:12:18
model_runner.py:1404] If out-of-memory error occurs during cudagraph capture,
consider decreasing `gpu_memory_utilization` or switching to eager mode. You can
also reduce the `max_num_seqs` as needed to decrease memory usage.\n
2025-01-10 10:12:18.934 | info | 3yn75aympd52v1 | INFO 01-10 13:12:18
model_runner.py:1400] Capturing cudagraphs for decoding. This may lead to
unexpected consequences if the model is not static. To run the model in eager mode,
set 'enforce_eager=True' or use '--enforce-eager' in the CLI.\n
2025-01-10 10:12:18.934 | info | 3yn75aympd52v1 | (VllmWorkerProcess pid=228) INFO 01-10 13:12:18 model_runner.py:1404] If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.\n
2025-01-10 10:12:18.613 | info | 3yn75aympd52v1 | (VllmWorkerProcess pid=228) INFO 01-10 13:12:18 model_runner.py:1400] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.\n
2025-01-10 10:12:18.613 | info | 3yn75aympd52v1 | INFO 01-10 13:12:16
distributed_gpu_executor.py:61] Maximum concurrency for 8192 tokens per request:
15.62x\n
2025-01-10 10:12:18.386 | info | q5xcbyb3ls1u0i | INFO 01-10 13:12:18
model_runner.py:1400] Capturing cudagraphs for decoding. This may lead to
unexpected consequences if the model is not static. To run the model in eager mode,
set 'enforce_eager=True' or use '--enforce-eager' in the CLI.\n
2025-01-10 10:12:18.386 | info | q5xcbyb3ls1u0i | INFO 01-10 13:12:15
distributed_gpu_executor.py:61] Maximum concurrency for 8192 tokens per request:
17.62x\n
2025-01-10 10:12:16.562 | info | 3yn75aympd52v1 | INFO 01-10 13:12:16
distributed_gpu_executor.py:57] # GPU blocks: 7999, # CPU blocks: 1638\n
2025-01-10 10:12:16.407 | info | 3yn75aympd52v1 | INFO 01-10 13:12:16
worker.py:232] Memory profiling results: total_gpu_memory=44.34GiB
initial_memory_usage=19.04GiB peak_torch_memory=19.88GiB
memory_usage_post_profile=19.09Gib non_torch_memory=0.50GiB kv_cache_size=19.53GiB
gpu_memory_utilization=0.90\n
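The worker.py:232, distributed_gpu_executor.py:57 and :61 records above fit together arithmetically. A small worked check of those numbers (taken from rank 0 of worker 3yn75aympd52v1; block_size=16 comes from the Engine args record):

# Numbers copied from the log records above.
total_gpu_memory = 44.34          # GiB
gpu_memory_utilization = 0.90
peak_torch_memory = 19.88         # GiB
non_torch_memory = 0.50           # GiB

kv_cache_size = total_gpu_memory * gpu_memory_utilization - peak_torch_memory - non_torch_memory
print(round(kv_cache_size, 2))    # ~19.53, matching kv_cache_size=19.53GiB

num_gpu_blocks = 7999             # from "# GPU blocks: 7999"
block_size = 16                   # tokens per KV-cache block
max_model_len = 8192
max_concurrency = num_gpu_blocks * block_size / max_model_len
print(round(max_concurrency, 2))  # ~15.62, matching "Maximum concurrency ... 15.62x"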
2025-01-10 10:12:16.345 | info | 3yn75aympd52v1 | (VllmWorkerProcess pid=228) INFO 01-10 13:12:16 worker.py:232] Memory profiling results: total_gpu_memory=44.34GiB initial_memory_usage=19.04GiB peak_torch_memory=19.84GiB memory_usage_post_profile=19.09Gib non_torch_memory=0.50GiB kv_cache_size=19.57GiB gpu_memory_utilization=0.90\n
2025-01-10 10:12:16.345 | info | 3yn75aympd52v1 | INFO 01-10 13:12:08
model_runner.py:1077] Loading model weights took 18.5769 GB\n
2025-01-10 10:12:15.445 | info | q5xcbyb3ls1u0i | INFO 01-10 13:12:15
distributed_gpu_executor.py:57] # GPU blocks: 9020, # CPU blocks: 1638\n
2025-01-10 10:12:15.272 | info | q5xcbyb3ls1u0i | INFO 01-10 13:12:15
worker.py:232] Memory profiling results: total_gpu_memory=47.50GiB
initial_memory_usage=19.30GiB peak_torch_memory=19.88GiB
memory_usage_post_profile=19.44Gib non_torch_memory=0.85GiB kv_cache_size=22.02GiB
gpu_memory_utilization=0.90\n
2025-01-10 10:12:15.217 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess pid=228) INFO 01-10 13:12:15 worker.py:232] Memory profiling results: total_gpu_memory=47.50GiB initial_memory_usage=19.30GiB peak_torch_memory=19.84GiB memory_usage_post_profile=19.42Gib non_torch_memory=0.83GiB kv_cache_size=22.08GiB gpu_memory_utilization=0.90\n
2025-01-10 10:12:10.318 | info | q5xcbyb3ls1u0i | INFO 01-10 13:12:10
model_runner.py:1077] Loading model weights took 18.5769 GB\n
2025-01-10 10:12:10.318 | info | q5xcbyb3ls1u0i | \n
2025-01-10 10:12:10.318 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 100% Completed | 9/9 [00:21<00:00, 2.35s/it]\n
2025-01-10 10:12:10.030 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 100% Completed | 9/9 [00:21<00:00, 2.19s/it]\n
2025-01-10 10:12:09.806 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess pid=228) INFO 01-10 13:12:09 model_runner.py:1077] Loading model weights took 18.5769 GB\n
2025-01-10 10:12:08.613 | info | 3yn75aympd52v1 | (VllmWorkerProcess pid=228) INFO 01-10 13:12:08 model_runner.py:1077] Loading model weights took 18.5769 GB\n
2025-01-10 10:12:08.612 | info | 3yn75aympd52v1 | \n
2025-01-10 10:12:08.612 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 100% Completed | 9/9 [00:14<00:00, 1.66s/it]\n
2025-01-10 10:12:08.481 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 89% Completed | 8/9 [00:19<00:02, 2.48s/it]\n
2025-01-10 10:12:08.449 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 100% Completed | 9/9 [00:14<00:00, 1.29s/it]\n
2025-01-10 10:12:08.142 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 89% Completed | 8/9 [00:14<00:01, 1.73s/it]\n
2025-01-10 10:12:06.827 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 78% Completed | 7/9 [00:13<00:03, 1.93s/it]\n
2025-01-10 10:12:06.358 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 78% Completed | 7/9 [00:17<00:05, 2.64s/it]\n
2025-01-10 10:12:04.874 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 67% Completed | 6/9 [00:11<00:05, 1.91s/it]\n
2025-01-10 10:12:03.580 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 67% Completed | 6/9 [00:14<00:07, 2.58s/it]\n
2025-01-10 10:12:02.941 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 56% Completed | 5/9 [00:09<00:07, 1.90s/it]\n
2025-01-10 10:12:01.013 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 44% Completed | 4/9 [00:07<00:09, 1.89s/it]\n
2025-01-10 10:12:00.935 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 56% Completed | 5/9 [00:12<00:10, 2.54s/it]\n
2025-01-10 10:11:59.227 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 33% Completed | 3/9 [00:05<00:11, 1.95s/it]\n
2025-01-10 10:11:58.142 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 44% Completed | 4/9 [00:09<00:12, 2.40s/it]\n
2025-01-10 10:11:57.252 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 22% Completed | 2/9 [00:03<00:13, 1.93s/it]\n
2025-01-10 10:11:55.487 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 33% Completed | 3/9 [00:06<00:13, 2.24s/it]\n
2025-01-10 10:11:55.081 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 11% Completed | 1/9 [00:01<00:12, 1.59s/it]\n
2025-01-10 10:11:53.488 | info | 3yn75aympd52v1 | \rLoading safetensors checkpoint
shards: 0% Completed | 0/9 [00:00<?, ?it/s]\n
2025-01-10 10:11:53.488 | info | 3yn75aympd52v1 | (VllmWorkerProcess pid=228) INFO 01-10 13:11:53 model_runner.py:1072] Starting to load model /models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83...\n
2025-01-10 10:11:53.488 | info | 3yn75aympd52v1 | INFO 01-10 13:11:53
model_runner.py:1072] Starting to load model /models/huggingface-cache/hub/models--
loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83...\n
2025-01-10 10:11:53.488 | info | 3yn75aympd52v1 | INFO 01-10 13:11:53
shm_broadcast.py:236] vLLM message queue communication handle:
Handle(connect_ip='127.0.0.1', local_reader_ranks=[1],
buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at
0x72bdaf971fc0>, local_subscribe_port=58619, remote_subscribe_port=None)\n
2025-01-10 10:11:53.488 | info | 3yn75aympd52v1 | (VllmWorkerProcess pid=228) INFO 01-10 13:11:53 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json\n
2025-01-10 10:11:53.369 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 22% Completed | 2/9 [00:04<00:16, 2.34s/it]\n
2025-01-10 10:11:53.183 | info | 3yn75aympd52v1 | INFO 01-10 13:11:53
custom_all_reduce_utils.py:242] reading GPU P2P access cache from
/root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json\n
2025-01-10 10:11:53.183 | info | 3yn75aympd52v1 | INFO 01-10 13:11:42
custom_all_reduce_utils.py:204] generating GPU P2P access cache in
/root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json\n
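custom_all_reduce_utils generates and then re-reads a JSON cache recording whether GPU 0 and GPU 1 can access each other's memory directly. A minimal way to check the same peer-to-peer capability by hand with PyTorch (illustrative only, not vLLM's internal probe):

import torch

# Same question the cached gpu_p2p_access_cache_for_0,1.json answers:
# can each GPU directly access the other's memory?
if torch.cuda.device_count() >= 2:
    p2p_01 = torch.cuda.can_device_access_peer(0, 1)
    p2p_10 = torch.cuda.can_device_access_peer(1, 0)
    print(f"GPU0 -> GPU1 P2P: {p2p_01}, GPU1 -> GPU0 P2P: {p2p_10}")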
2025-01-10 10:11:53.183 | info | 3yn75aympd52v1 | INFO 01-10 13:11:42
pynccl.py:69] vLLM is using nccl==2.21.5\n
2025-01-10 10:11:53.183 | info | 3yn75aympd52v1 | (VllmWorkerProcess pid=228) INFO 01-10 13:11:42 pynccl.py:69] vLLM is using nccl==2.21.5\n
2025-01-10 10:11:53.183 | info | 3yn75aympd52v1 | INFO 01-10 13:11:42
utils.py:960] Found nccl from library libnccl.so.2\n
2025-01-10 10:11:53.183 | info | 3yn75aympd52v1 | (VllmWorkerProcess pid=228) INFO 01-10 13:11:42 utils.py:960] Found nccl from library libnccl.so.2\n
2025-01-10 10:11:53.183 | info | 3yn75aympd52v1 | (VllmWorkerProcess pid=228) INFO 01-10 13:11:41 multiproc_worker_utils.py:215] Worker ready; awaiting tasks\n
2025-01-10 10:11:53.183 | info | 3yn75aympd52v1 | (VllmWorkerProcess pid=228) INFO 01-10 13:11:41 selector.py:135] Using Flash Attention backend.\n
2025-01-10 10:11:53.183 | info | 3yn75aympd52v1 | INFO 01-10 13:11:41
selector.py:135] Using Flash Attention backend.\n
2025-01-10 10:11:53.183 | info | 3yn75aympd52v1 | INFO 01-10 13:11:41
custom_cache_manager.py:17] Setting Triton cache manager to:
vllm.triton_utils.custom_cache_manager:CustomCacheManager\n
2025-01-10 10:11:53.183 | warning | 3yn75aympd52v1 | WARNING 01-10 13:11:41
multiproc_gpu_executor.py:56] Reducing Torch parallelism from 48 threads to 1 to
avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment
to tune this value as needed.\n
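The multiproc_gpu_executor warning above means vLLM clamps Torch's CPU thread pool to 1 unless OMP_NUM_THREADS is set in the environment it is launched from. A minimal sketch of pinning it before the engine and its worker processes are created (the value 8 is an illustrative choice, not taken from this deployment):

import os

# Must be set before vLLM / PyTorch spawn their worker processes.
os.environ["OMP_NUM_THREADS"] = "8"  # illustrative; tune per host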
2025-01-10 10:11:53.183 | info | 3yn75aympd52v1 | INFO 01-10 13:11:40
llm_engine.py:249] Initializing an LLM engine (v0.6.4) with config:
model='/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/
snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', speculative_config=None,
tokenizer='/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/
snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', skip_tokenizer_init=False,
tokenizer_mode=auto, revision=None, override_neuron_config=None,
tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16,
max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO,
tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False,
quantization=awq, enforce_eager=False, kv_cache_dtype=auto,
quantization_param_path=None, device_config=cuda,
decoding_config=DecodingConfig(guided_decoding_backend='outlines'),
observability_config=ObservabilityConfig(otlp_traces_endpoint=None,
collect_model_forward_time=False, collect_model_execute_time=False), seed=0,
served_model_name=/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-
v4.6-AWQ/snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83, num_scheduler_steps=1,
chunked_prefill_enabled=False multi_step_stream_outputs=True,
enable_prefix_caching=True, use_async_output_proc=True, use_cached_outputs=False,
chat_template_text_format=string, mm_processor_kwargs=None, pooler_config=None)\n
2025-01-10 10:11:53.183 | info | 3yn75aympd52v1 | INFO 01-10 13:11:40
config.py:1020] Defaulting to use mp for distributed inference\n
2025-01-10 10:11:53.183 | warning | 3yn75aympd52v1 | WARNING 01-10 13:11:40
config.py:428] awq quantization is not fully optimized yet. The speed can be slower
than non-quantized models.\n
2025-01-10 10:11:53.183 | info | 3yn75aympd52v1 | INFO 01-10 13:11:40
awq_marlin.py:113] Detected that the model can run with awq_marlin, however you
specified quantization=awq explicitly, so forcing awq. Use quantization=awq_marlin
for faster inference\n
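The awq_marlin.py:113 record above notes the checkpoint could use the faster awq_marlin kernels but honours the explicit quantization=awq setting. A minimal sketch of opting into the suggested backend via engine args; whether that is appropriate for this checkpoint is an assumption the log does not confirm:

from vllm.engine.arg_utils import AsyncEngineArgs

args = AsyncEngineArgs(
    model="...",                  # snapshot path elided; see the Engine args records in this log
    quantization="awq_marlin",    # take the hint from awq_marlin.py:113
    tensor_parallel_size=2,
    max_model_len=8192,
)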
2025-01-10 10:11:53.183 | info | 3yn75aympd52v1 | INFO 01-10 13:11:40
config.py:350] This model supports multiple tasks: {'generate', 'embedding'}.
Defaulting to 'generate'.\n
2025-01-10 10:11:53.183 | info | 3yn75aympd52v1 | tokenizer_name_or_path:
/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83, tokenizer_revision: None,
trust_remote_code: False\n
2025-01-10 10:11:50.665 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 11% Completed | 1/9 [00:01<00:14, 1.81s/it]\n
2025-01-10 10:11:48.853 | info | q5xcbyb3ls1u0i | \rLoading safetensors checkpoint
shards: 0% Completed | 0/9 [00:00<?, ?it/s]\n
2025-01-10 10:11:48.853 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess pid=228) INFO 01-10 13:11:48 model_runner.py:1072] Starting to load model /models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83...\n
2025-01-10 10:11:48.853 | info | q5xcbyb3ls1u0i | INFO 01-10 13:11:48
model_runner.py:1072] Starting to load model /models/huggingface-cache/hub/models--
loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83...\n
2025-01-10 10:11:48.853 | info | q5xcbyb3ls1u0i | INFO 01-10 13:11:48
shm_broadcast.py:236] vLLM message queue communication handle:
Handle(connect_ip='127.0.0.1', local_reader_ranks=[1],
buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at
0x79db90ae5ea0>, local_subscribe_port=49363, remote_subscribe_port=None)\n
2025-01-10 10:11:48.853 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess pid=228) INFO 01-10 13:11:48 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json\n
2025-01-10 10:11:48.565 | info | q5xcbyb3ls1u0i | INFO 01-10 13:11:48
custom_all_reduce_utils.py:242] reading GPU P2P access cache from
/root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json\n
2025-01-10 10:11:48.565 | info | q5xcbyb3ls1u0i | INFO 01-10 13:11:40
custom_all_reduce_utils.py:204] generating GPU P2P access cache in
/root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json\n
2025-01-10 10:11:48.565 | info | q5xcbyb3ls1u0i | INFO 01-10 13:11:40
pynccl.py:69] vLLM is using nccl==2.21.5\n
2025-01-10 10:11:48.565 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess pid=228) INFO 01-10 13:11:40 pynccl.py:69] vLLM is using nccl==2.21.5\n
2025-01-10 10:11:48.565 | info | q5xcbyb3ls1u0i | INFO 01-10 13:11:40
utils.py:960] Found nccl from library libnccl.so.2\n
2025-01-10 10:11:48.565 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess pid=228) INFO 01-10 13:11:40 utils.py:960] Found nccl from library libnccl.so.2\n
2025-01-10 10:11:48.565 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess pid=228) INFO 01-10 13:11:39 multiproc_worker_utils.py:215] Worker ready; awaiting tasks\n
2025-01-10 10:11:48.565 | info | q5xcbyb3ls1u0i | (VllmWorkerProcess pid=228) INFO 01-10 13:11:39 selector.py:135] Using Flash Attention backend.\n
2025-01-10 10:11:48.565 | info | q5xcbyb3ls1u0i | INFO 01-10 13:11:39
selector.py:135] Using Flash Attention backend.\n
2025-01-10 10:11:48.565 | info | q5xcbyb3ls1u0i | INFO 01-10 13:11:39
custom_cache_manager.py:17] Setting Triton cache manager to:
vllm.triton_utils.custom_cache_manager:CustomCacheManager\n
2025-01-10 10:11:48.565 | warning | q5xcbyb3ls1u0i | WARNING 01-10 13:11:39
multiproc_gpu_executor.py:56] Reducing Torch parallelism from 64 threads to 1 to
avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment
to tune this value as needed.\n
2025-01-10 10:11:48.565 | info | q5xcbyb3ls1u0i | INFO 01-10 13:11:39
llm_engine.py:249] Initializing an LLM engine (v0.6.4) with config:
model='/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/
snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', speculative_config=None,
tokenizer='/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/
snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', skip_tokenizer_init=False,
tokenizer_mode=auto, revision=None, override_neuron_config=None,
tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16,
max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO,
tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False,
quantization=awq, enforce_eager=False, kv_cache_dtype=auto,
quantization_param_path=None, device_config=cuda,
decoding_config=DecodingConfig(guided_decoding_backend='outlines'),
observability_config=ObservabilityConfig(otlp_traces_endpoint=None,
collect_model_forward_time=False, collect_model_execute_time=False), seed=0,
served_model_name=/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-
v4.6-AWQ/snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83, num_scheduler_steps=1,
chunked_prefill_enabled=False multi_step_stream_outputs=True,
enable_prefix_caching=True, use_async_output_proc=True, use_cached_outputs=False,
chat_template_text_format=string, mm_processor_kwargs=None, pooler_config=None)\n
2025-01-10 10:11:48.565 | info | q5xcbyb3ls1u0i | INFO 01-10 13:11:39
config.py:1020] Defaulting to use mp for distributed inference\n
2025-01-10 10:11:48.565 | warning | q5xcbyb3ls1u0i | WARNING 01-10 13:11:39
config.py:428] awq quantization is not fully optimized yet. The speed can be slower
than non-quantized models.\n
2025-01-10 10:11:48.565 | info | q5xcbyb3ls1u0i | INFO 01-10 13:11:39
awq_marlin.py:113] Detected that the model can run with awq_marlin, however you
specified quantization=awq explicitly, so forcing awq. Use quantization=awq_marlin
for faster inference\n
2025-01-10 10:11:48.565 | info | q5xcbyb3ls1u0i | INFO 01-10 13:11:39
config.py:350] This model supports multiple tasks: {'generate', 'embedding'}.
Defaulting to 'generate'.\n
2025-01-10 10:11:48.565 | info | q5xcbyb3ls1u0i | tokenizer_name_or_path:
/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83, tokenizer_revision: None,
trust_remote_code: False\n
2025-01-10 10:11:48.565 | info | q5xcbyb3ls1u0i | engine.py :26 2025-
01-10 13:11:35,381 Engine args:
AsyncEngineArgs(model='/models/huggingface-cache/hub/models--loremipsum3658--
Bacchus-v4.6-AWQ/snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83',
served_model_name=None, tokenizer='/models/huggingface-cache/hub/models--
loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', task='auto', skip_tokenizer_init=False,
tokenizer_mode='auto', chat_template_text_format='string', trust_remote_code=False,
allowed_local_media_path='', download_dir=None, load_format='auto',
config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto',
quantization_param_path=None, seed=0, max_model_len=8192, worker_use_ray=False,
distributed_executor_backend=None, pipeline_parallel_size=1,
tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16,
enable_prefix_caching=True, disable_sliding_window=False,
use_v2_block_manager='true', swap_space=4, cpu_offload_gb=0,
gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256,
max_logprobs=20, disable_log_stats=False, revision=None, code_revision=None,
rope_scaling=None, rope_theta=None, hf_overrides=None, tokenizer_revision=None,
quantization='awq', enforce_eager=False, max_seq_len_to_capture=8192,
disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray',
tokenizer_pool_extra_config=None, limit_mm_per_prompt=None,
mm_processor_kwargs=None, enable_lora=False, enable_lora_bias=False, max_loras=1,
max_lora_rank=16, enable_prompt_adapter=False, max_prompt_adapters=1,
max_prompt_adapter_token=0, fully_sharded_loras=False, lora_extra_vocab_size=256,
long_lora_scaling_factors=None, lora_dtype='auto', max_cpu_loras=None,
device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True,
ray_workers_use_nsight=False, num_gpu_blocks_override=None, num_lookahead_slots=0,
model_loader_extra_config=None, ignore_patterns=None, preemption_mode=None,
scheduler_delay_factor=0.0, enable_chunked_prefill=None,
guided_decoding_backend='outlines', speculative_model=None,
speculative_model_quantization=None, speculative_draft_tensor_parallel_size=None,
num_speculative_tokens=None, speculative_disable_mqa_scorer=False,
speculative_max_model_len=None, speculative_disable_by_batch_size=None,
ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None,
spec_decoding_acceptance_method='rejection_sampler',
typical_acceptance_sampler_posterior_threshold=None,
typical_acceptance_sampler_posterior_alpha=None, qlora_adapter_name_or_path=None,
disable_logprobs_during_spec_decoding=None, otlp_traces_endpoint=None,
collect_detailed_traces=None, disable_async_output_proc=False,
scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None,
disable_log_requests=False)\n
2025-01-10 10:11:48.564 | info | q5xcbyb3ls1u0i | engine_args.py :126 2025-
01-10 13:11:35,342 Using baked in model with args: {'MODEL_NAME':
'/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', 'TOKENIZER_NAME': '/models/huggingface-
cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83'}\n
2025-01-10 10:11:19.636 | info | 0a13jc93spuadh | (VllmWorkerProcess pid=228) INFO 01-10 13:11:19 utils.py:960] Found nccl from library libnccl.so.2\n
2025-01-10 10:11:19.636 | info | 0a13jc93spuadh | (VllmWorkerProcess pid=228) INFO 01-10 13:11:18 multiproc_worker_utils.py:215] Worker ready; awaiting tasks\n
2025-01-10 10:11:19.636 | info | 0a13jc93spuadh | (VllmWorkerProcess pid=228) INFO 01-10 13:11:18 selector.py:135] Using Flash Attention backend.\n
2025-01-10 10:11:18.708 | info | 0a13jc93spuadh | INFO 01-10 13:11:18
selector.py:135] Using Flash Attention backend.\n
2025-01-10 10:11:18.708 | info | 0a13jc93spuadh | INFO 01-10 13:11:18
custom_cache_manager.py:17] Setting Triton cache manager to:
vllm.triton_utils.custom_cache_manager:CustomCacheManager\n
2025-01-10 10:11:18.605 | warning | 0a13jc93spuadh | WARNING 01-10 13:11:18
multiproc_gpu_executor.py:56] Reducing Torch parallelism from 48 threads to 1 to
avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment
to tune this value as needed.\n
2025-01-10 10:11:18.605 | info | 0a13jc93spuadh | INFO 01-10 13:11:18
llm_engine.py:249] Initializing an LLM engine (v0.6.4) with config:
model='/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/
snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', speculative_config=None,
tokenizer='/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/
snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', skip_tokenizer_init=False,
tokenizer_mode=auto, revision=None, override_neuron_config=None,
tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16,
max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO,
tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False,
quantization=awq, enforce_eager=False, kv_cache_dtype=auto,
quantization_param_path=None, device_config=cuda,
decoding_config=DecodingConfig(guided_decoding_backend='outlines'),
observability_config=ObservabilityConfig(otlp_traces_endpoint=None,
collect_model_forward_time=False, collect_model_execute_time=False), seed=0,
served_model_name=/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-
v4.6-AWQ/snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83, num_scheduler_steps=1,
chunked_prefill_enabled=False multi_step_stream_outputs=True,
enable_prefix_caching=True, use_async_output_proc=True, use_cached_outputs=False,
chat_template_text_format=string, mm_processor_kwargs=None, pooler_config=None)\n
2025-01-10 10:11:18.605 | info | 0a13jc93spuadh | INFO 01-10 13:11:18
config.py:1020] Defaulting to use mp for distributed inference\n
2025-01-10 10:11:18.605 | warning | 0a13jc93spuadh | WARNING 01-10 13:11:18
config.py:428] awq quantization is not fully optimized yet. The speed can be slower
than non-quantized models.\n
2025-01-10 10:11:18.605 | info | 0a13jc93spuadh | INFO 01-10 13:11:18
awq_marlin.py:113] Detected that the model can run with awq_marlin, however you
specified quantization=awq explicitly, so forcing awq. Use quantization=awq_marlin
for faster inference\n
2025-01-10 10:11:18.605 | info | 0a13jc93spuadh | INFO 01-10 13:11:18
config.py:350] This model supports multiple tasks: {'generate', 'embedding'}.
Defaulting to 'generate'.\n
2025-01-10 10:11:18.122 | info | 0a13jc93spuadh | tokenizer_name_or_path:
/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83, tokenizer_revision: None,
trust_remote_code: False\n
2025-01-10 10:11:18.122 | info | 0a13jc93spuadh | engine.py :26 2025-
01-10 13:11:13,117 Engine args:
AsyncEngineArgs(model='/models/huggingface-cache/hub/models--loremipsum3658--
Bacchus-v4.6-AWQ/snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83',
served_model_name=None, tokenizer='/models/huggingface-cache/hub/models--
loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', task='auto', skip_tokenizer_init=False,
tokenizer_mode='auto', chat_template_text_format='string', trust_remote_code=False,
allowed_local_media_path='', download_dir=None, load_format='auto',
config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto',
quantization_param_path=None, seed=0, max_model_len=8192, worker_use_ray=False,
distributed_executor_backend=None, pipeline_parallel_size=1,
tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16,
enable_prefix_caching=True, disable_sliding_window=False,
use_v2_block_manager='true', swap_space=4, cpu_offload_gb=0,
gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256,
max_logprobs=20, disable_log_stats=False, revision=None, code_revision=None,
rope_scaling=None, rope_theta=None, hf_overrides=None, tokenizer_revision=None,
quantization='awq', enforce_eager=False, max_seq_len_to_capture=8192,
disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray',
tokenizer_pool_extra_config=None, limit_mm_per_prompt=None,
mm_processor_kwargs=None, enable_lora=False, enable_lora_bias=False, max_loras=1,
max_lora_rank=16, enable_prompt_adapter=False, max_prompt_adapters=1,
max_prompt_adapter_token=0, fully_sharded_loras=False, lora_extra_vocab_size=256,
long_lora_scaling_factors=None, lora_dtype='auto', max_cpu_loras=None,
device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True,
ray_workers_use_nsight=False, num_gpu_blocks_override=None, num_lookahead_slots=0,
model_loader_extra_config=None, ignore_patterns=None, preemption_mode=None,
scheduler_delay_factor=0.0, enable_chunked_prefill=None,
guided_decoding_backend='outlines', speculative_model=None,
speculative_model_quantization=None, speculative_draft_tensor_parallel_size=None,
num_speculative_tokens=None, speculative_disable_mqa_scorer=False,
speculative_max_model_len=None, speculative_disable_by_batch_size=None,
ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None,
spec_decoding_acceptance_method='rejection_sampler',
typical_acceptance_sampler_posterior_threshold=None,
typical_acceptance_sampler_posterior_alpha=None, qlora_adapter_name_or_path=None,
disable_logprobs_during_spec_decoding=None, otlp_traces_endpoint=None,
collect_detailed_traces=None, disable_async_output_proc=False,
scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None,
disable_log_requests=False)\n
2025-01-10 10:11:13.084 | info | 0a13jc93spuadh | engine_args.py :126 2025-
01-10 13:11:13,083 Using baked in model with args: {'MODEL_NAME':
'/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', 'TOKENIZER_NAME': '/models/huggingface-
cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83'}\n
2025-01-09 19:17:04.549 | info | uog4wi2bfchi7t |
warnings.warn('resource_tracker: There appear to be %d '\n
2025-01-09 19:17:03.533 | info | uog4wi2bfchi7t |
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning:
resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at
shutdown\n
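The resource_tracker warning above is Python's multiprocessing noticing a SharedMemory segment that was never unlinked before shutdown (vLLM's shm_broadcast ring buffer is the likely owner here). A minimal sketch of the close/unlink pattern the tracker expects, using a hypothetical segment name rather than anything from vLLM:

from multiprocessing import shared_memory

# Hypothetical segment for illustration; vLLM manages its own ring buffer internally.
shm = shared_memory.SharedMemory(create=True, size=1024, name="example_segment")
try:
    shm.buf[:5] = b"hello"
finally:
    shm.close()    # each process closes its mapping
    shm.unlink()   # exactly one process unlinks, or the tracker warns at shutdown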
2025-01-09 19:17:02.690 | warning | uog4wi2bfchi7t | [rank0]:[W109
22:17:02.363404277 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has
NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the
application should call destroy_process_group to ensure that any pending NCCL
operations have finished in this process. In rare cases this process can exit
before this point and block the progress of another member of the process group.
This constraint has always been present, but this warning has only been added
since PyTorch 2.4 (function operator())\n
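The ProcessGroupNCCL warning above asks applications to tear the process group down explicitly before exit. A minimal sketch of the shutdown call PyTorch is referring to, guarded because the group may already be gone by the time cleanup runs:

import torch.distributed as dist

# Call once per process during shutdown, after all pending NCCL work has finished.
if dist.is_available() and dist.is_initialized():
    dist.destroy_process_group()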
2025-01-09 19:17:02.690 | info | uog4wi2bfchi7t | INFO 01-09 22:17:01
async_llm_engine.py:62] Engine is gracefully shutting down.\n
2025-01-09 19:17:01.099 | info | uog4wi2bfchi7t | Kill worker.
2025-01-09 19:14:39.867 | info | uog4wi2bfchi7t | Finished.
2025-01-09 19:14:39.799 | info | uog4wi2bfchi7t | Finished running generator.
2025-01-09 19:14:39.678 | info | uog4wi2bfchi7t | INFO 01-09 22:14:39
async_llm_engine.py:176] Finished request cc8c94c4ae0449868aea11ca75885470.\n
2025-01-09 19:14:39.677 | info | uog4wi2bfchi7t | INFO 01-09 22:14:38
metrics.py:465] Prefix cache hit rate: GPU: 0.00%, CPU: 0.00%\n
2025-01-09 19:14:38.880 | info | uog4wi2bfchi7t | INFO 01-09 22:14:38
metrics.py:449] Avg prompt throughput: 510.2 tokens/s, Avg generation throughput:
7.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 3.1%, CPU KV cache usage: 0.0%.\n
2025-01-09 19:14:38.880 | info | uog4wi2bfchi7t | INFO 01-09 22:14:34
async_llm_engine.py:208] Added request cc8c94c4ae0449868aea11ca75885470.\n
2025-01-09 19:14:38.880 | info | uog4wi2bfchi7t | Jobs in progress: 1
2025-01-09 19:14:38.880 | info | uog4wi2bfchi7t | Jobs in queue: 1
2025-01-09 19:14:34.774 | info | uog4wi2bfchi7t | --- Starting Serverless Worker |
Version 1.7.5 ---\n
2025-01-09 19:14:34.774 | info | uog4wi2bfchi7t | INFO 01-09 22:14:34
chat_utils.py:431] ' }}{% endif %}\n
2025-01-09 19:14:34.774 | info | uog4wi2bfchi7t | INFO 01-09 22:14:34
chat_utils.py:431] \n
2025-01-09 19:14:34.774 | info | uog4wi2bfchi7t | INFO 01-09 22:14:34
chat_utils.py:431] '+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0
== 0 %}{% set content = '<|begin_of_text|>' + content %}{% endif %}{{ content }}{%
endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|
end_header_id|>\n
2025-01-09 19:14:34.774 | info | uog4wi2bfchi7t | INFO 01-09 22:14:34
chat_utils.py:431] \n
2025-01-09 19:14:34.774 | info | uog4wi2bfchi7t | INFO 01-09 22:14:34
chat_utils.py:431] {% set loop_messages = messages %}{% for message in
loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|
end_header_id|>\n
2025-01-09 19:14:34.774 | info | uog4wi2bfchi7t | INFO 01-09 22:14:34
chat_utils.py:431] Using supplied chat template:\n
2025-01-09 19:14:34.219 | info | uog4wi2bfchi7t | tokenizer_name_or_path:
/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83, tokenizer_revision: None,
trust_remote_code: False\n
2025-01-09 19:14:34.219 | info | uog4wi2bfchi7t | engine.py :26 2025-
01-09 22:14:33,860 Engine args:
AsyncEngineArgs(model='/models/huggingface-cache/hub/models--loremipsum3658--
Bacchus-v4.6-AWQ/snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83',
served_model_name=None, tokenizer='/models/huggingface-cache/hub/models--
loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', task='auto', skip_tokenizer_init=False,
tokenizer_mode='auto', chat_template_text_format='string', trust_remote_code=False,
allowed_local_media_path='', download_dir=None, load_format='auto',
config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto',
quantization_param_path=None, seed=0, max_model_len=8192, worker_use_ray=False,
distributed_executor_backend=None, pipeline_parallel_size=1,
tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16,
enable_prefix_caching=True, disable_sliding_window=False,
use_v2_block_manager='true', swap_space=4, cpu_offload_gb=0,
gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256,
max_logprobs=20, disable_log_stats=False, revision=None, code_revision=None,
rope_scaling=None, rope_theta=None, hf_overrides=None, tokenizer_revision=None,
quantization='awq', enforce_eager=False, max_seq_len_to_capture=8192,
disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray',
tokenizer_pool_extra_config=None, limit_mm_per_prompt=None,
mm_processor_kwargs=None, enable_lora=False, enable_lora_bias=False, max_loras=1,
max_lora_rank=16, enable_prompt_adapter=False, max_prompt_adapters=1,
max_prompt_adapter_token=0, fully_sharded_loras=False, lora_extra_vocab_size=256,
long_lora_scaling_factors=None, lora_dtype='auto', max_cpu_loras=None,
device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True,
ray_workers_use_nsight=False, num_gpu_blocks_override=None, num_lookahead_slots=0,
model_loader_extra_config=None, ignore_patterns=None, preemption_mode=None,
scheduler_delay_factor=0.0, enable_chunked_prefill=None,
guided_decoding_backend='outlines', speculative_model=None,
speculative_model_quantization=None, speculative_draft_tensor_parallel_size=None,
num_speculative_tokens=None, speculative_disable_mqa_scorer=False,
speculative_max_model_len=None, speculative_disable_by_batch_size=None,
ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None,
spec_decoding_acceptance_method='rejection_sampler',
typical_acceptance_sampler_posterior_threshold=None,
typical_acceptance_sampler_posterior_alpha=None, qlora_adapter_name_or_path=None,
disable_logprobs_during_spec_decoding=None, otlp_traces_endpoint=None,
collect_detailed_traces=None, disable_async_output_proc=False,
scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None,
disable_log_requests=False)\n
2025-01-09 19:14:34.218 | info | uog4wi2bfchi7t | engine_args.py :126 2025-
01-09 22:14:33,855 Using baked in model with args: {'MODEL_NAME':
'/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', 'TOKENIZER_NAME': '/models/huggingface-
cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83'}\n
2025-01-09 19:14:33.855 | info | uog4wi2bfchi7t | engine.py :112 2025-
01-09 22:14:33,854 Initialized vLLM engine in 69.96s\n
2025-01-09 19:14:33.855 | info | uog4wi2bfchi7t | INFO 01-09 22:14:33
model_runner.py:1518] Graph capturing finished in 29 secs, took 0.99 GiB\n
2025-01-09 19:14:33.855 | info | uog4wi2bfchi7t | (VllmWorkerProcess pid=228) INFO 01-09 22:14:33 model_runner.py:1518] Graph capturing finished in 29 secs, took 0.99 GiB\n
2025-01-09 19:14:33.855 | info | uog4wi2bfchi7t | (VllmWorkerProcess pid=228) INFO 01-09 22:14:33 custom_all_reduce.py:224] Registering 5635 cuda graph addresses\n
2025-01-09 19:14:33.562 | info | uog4wi2bfchi7t | INFO 01-09 22:14:33
custom_all_reduce.py:224] Registering 5635 cuda graph addresses\n
2025-01-09 19:14:33.562 | info | uog4wi2bfchi7t | (VllmWorkerProcess pid=228) INFO 01-09 22:14:04 model_runner.py:1404] If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.\n
2025-01-09 19:14:33.562 | info | uog4wi2bfchi7t | (VllmWorkerProcess pid=228) INFO 01-09 22:14:04 model_runner.py:1400] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.\n
2025-01-09 19:14:33.562 | info | uog4wi2bfchi7t | INFO 01-09 22:14:04
model_runner.py:1404] If out-of-memory error occurs during cudagraph capture,
consider decreasing `gpu_memory_utilization` or switching to eager mode. You can
also reduce the `max_num_seqs` as needed to decrease memory usage.\n
2025-01-09 19:14:04.430 | info | uog4wi2bfchi7t | INFO 01-09 22:14:04
model_runner.py:1400] Capturing cudagraphs for decoding. This may lead to
unexpected consequences if the model is not static. To run the model in eager mode,
set 'enforce_eager=True' or use '--enforce-eager' in the CLI.\n
2025-01-09 19:14:04.430 | info | uog4wi2bfchi7t | INFO 01-09 22:14:01
distributed_gpu_executor.py:61] Maximum concurrency for 8192 tokens per request:
10.14x\n
2025-01-09 19:14:01.851 | info | uog4wi2bfchi7t | INFO 01-09 22:14:01
distributed_gpu_executor.py:57] # GPU blocks: 5190, # CPU blocks: 1638\n
2025-01-09 19:14:01.710 | info | uog4wi2bfchi7t | INFO 01-09 22:14:01
worker.py:232] Memory profiling results: total_gpu_memory=44.34GiB
initial_memory_usage=25.90GiB peak_torch_memory=19.88GiB
memory_usage_post_profile=25.95Gib non_torch_memory=7.35GiB kv_cache_size=12.67GiB
gpu_memory_utilization=0.90\n
2025-01-09 19:14:01.652 | info | uog4wi2bfchi7t | (VllmWorkerProcess pid=228) INFO 01-09 22:14:01 worker.py:232] Memory profiling results: total_gpu_memory=44.34GiB initial_memory_usage=19.04GiB peak_torch_memory=19.84GiB memory_usage_post_profile=19.09Gib non_torch_memory=0.50GiB kv_cache_size=19.57GiB gpu_memory_utilization=0.90\n
2025-01-09 19:14:01.652 | info | uog4wi2bfchi7t | (VllmWorkerProcess pid=228) INFO 01-09 22:13:54 model_runner.py:1077] Loading model weights took 18.5769 GB\n
2025-01-09 19:13:54.245 | info | uog4wi2bfchi7t | INFO 01-09 22:13:54
model_runner.py:1077] Loading model weights took 18.5769 GB\n
2025-01-09 19:13:54.245 | info | uog4wi2bfchi7t | \n
2025-01-09 19:13:54.245 | info | uog4wi2bfchi7t | \rLoading safetensors checkpoint
shards: 100% Completed | 9/9 [00:15<00:00, 1.67s/it]\n
2025-01-09 19:13:54.062 | info | uog4wi2bfchi7t | \rLoading safetensors checkpoint
shards: 100% Completed | 9/9 [00:15<00:00, 1.33s/it]\n
2025-01-09 19:13:53.694 | info | uog4wi2bfchi7t | \rLoading safetensors checkpoint
shards: 89% Completed | 8/9 [00:14<00:01, 1.76s/it]\n
2025-01-09 19:13:52.289 | info | uog4wi2bfchi7t | \rLoading safetensors checkpoint
shards: 78% Completed | 7/9 [00:13<00:03, 1.93s/it]\n
2025-01-09 19:13:50.367 | info | uog4wi2bfchi7t | \rLoading safetensors checkpoint
shards: 67% Completed | 6/9 [00:11<00:05, 1.93s/it]\n
2025-01-09 19:13:48.394 | info | uog4wi2bfchi7t | \rLoading safetensors checkpoint
shards: 56% Completed | 5/9 [00:09<00:07, 1.92s/it]\n
2025-01-09 19:13:46.479 | info | uog4wi2bfchi7t | \rLoading safetensors checkpoint
shards: 44% Completed | 4/9 [00:07<00:09, 1.92s/it]\n
2025-01-09 19:13:44.523 | info | uog4wi2bfchi7t | \rLoading safetensors checkpoint
shards: 33% Completed | 3/9 [00:05<00:11, 1.89s/it]\n
2025-01-09 19:13:42.552 | info | uog4wi2bfchi7t | \rLoading safetensors checkpoint
shards: 22% Completed | 2/9 [00:03<00:12, 1.82s/it]\n
2025-01-09 19:13:42.552 | info | uog4wi2bfchi7t | \rLoading safetensors checkpoint
shards: 11% Completed | 1/9 [00:01<00:12, 1.55s/it]\n
2025-01-09 19:13:42.552 | info | uog4wi2bfchi7t | \rLoading safetensors checkpoint
shards: 0% Completed | 0/9 [00:00<?, ?it/s]\n
2025-01-09 19:13:42.552 | info | uog4wi2bfchi7t | (VllmWorkerProcess pid=228) INFO 01-09 22:13:38 model_runner.py:1072] Starting to load model /models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83...\n
2025-01-09 19:13:42.552 | info | uog4wi2bfchi7t | INFO 01-09 22:13:38
model_runner.py:1072] Starting to load model /models/huggingface-cache/hub/models--
loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83...\n
2025-01-09 19:13:42.552 | info | uog4wi2bfchi7t | INFO 01-09 22:13:38
shm_broadcast.py:236] vLLM message queue communication handle:
Handle(connect_ip='127.0.0.1', local_reader_ranks=[1],
buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at
0x740b82e65ea0>, local_subscribe_port=57835, remote_subscribe_port=None)\n
2025-01-09 19:13:42.551 | info | uog4wi2bfchi7t | (VllmWorkerProcess pid=228) INFO 01-09 22:13:38 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json\n
2025-01-09 19:13:42.551 | info | uog4wi2bfchi7t | INFO 01-09 22:13:38
custom_all_reduce_utils.py:242] reading GPU P2P access cache from
/root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json\n
2025-01-06 17:01:35.014 | info | olmw98xr7nfth5 |
warnings.warn('resource_tracker: There appear to be %d '\n
2025-01-06 17:01:34.173 | info | olmw98xr7nfth5 |
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning:
resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at
shutdown\n
2025-01-06 17:01:33.130 | warning | olmw98xr7nfth5 | [rank0]:[W106
20:01:33.989801744 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has
NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the
application should call destroy_process_group to ensure that any pending NCCL
operations have finished in this process. In rare cases this process can exit
before this point and block the progress of another member of the process group.
This constraint has always been present, but this warning has only been added
since PyTorch 2.4 (function operator())\n
2025-01-06 17:01:33.129 | info | olmw98xr7nfth5 | INFO 01-06 20:01:31
async_llm_engine.py:62] Engine is gracefully shutting down.\n
2025-01-06 17:01:31.624 | info | olmw98xr7nfth5 | Kill worker.
2025-01-06 16:55:37.119 | info | 0a13jc93spuadh |
warnings.warn('resource_tracker: There appear to be %d '\n
2025-01-06 16:55:36.026 | info | 0a13jc93spuadh |
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning:
resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at
shutdown\n
2025-01-06 16:55:34.721 | warning | 0a13jc93spuadh | [rank0]:[W106
19:55:34.437777753 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has
NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the
application should call destroy_process_group to ensure that any pending NCCL
operations have finished in this process. In rare cases this process can exit
before this point and block the progress of another member of the process group.
This constraint has always been present, but this warning has only been added
since PyTorch 2.4 (function operator())\n
2025-01-06 16:55:34.720 | info | 0a13jc93spuadh | Kill worker.
2025-01-06 16:55:32.868 | info | 0a13jc93spuadh | --- Starting Serverless Worker |
Version 1.7.5 ---\n
2025-01-06 16:55:32.868 | info | 0a13jc93spuadh | INFO 01-06 19:48:05
chat_utils.py:431] ' }}{% endif %}\n
2025-01-06 16:55:32.868 | info | 0a13jc93spuadh | INFO 01-06 19:48:05
chat_utils.py:431] \n
2025-01-06 16:55:32.868 | info | 0a13jc93spuadh | INFO 01-06 19:48:05
chat_utils.py:431] '+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0
== 0 %}{% set content = '<|begin_of_text|>' + content %}{% endif %}{{ content }}{%
endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|
end_header_id|>\n
2025-01-06 16:55:32.868 | info | 0a13jc93spuadh | INFO 01-06 19:48:05
chat_utils.py:431] \n
2025-01-06 16:55:32.868 | info | 0a13jc93spuadh | INFO 01-06 19:48:05
chat_utils.py:431] {% set loop_messages = messages %}{% for message in
loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|
end_header_id|>\n
2025-01-06 16:55:32.868 | info | 0a13jc93spuadh | INFO 01-06 19:48:05
chat_utils.py:431] Using supplied chat template:\n
2025-01-06 16:48:05.604 | info | 0a13jc93spuadh | tokenizer_name_or_path:
/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83, tokenizer_revision: None,
trust_remote_code: False\n
2025-01-06 16:48:05.604 | info | 0a13jc93spuadh | engine.py :26 2025-
01-06 19:48:05,296 Engine args:
AsyncEngineArgs(model='/models/huggingface-cache/hub/models--loremipsum3658--
Bacchus-v4.6-AWQ/snapshots/4e04a8040020bd72ab6bebcc89b5e0463d9d8c83',
served_model_name=None, tokenizer='/models/huggingface-cache/hub/models--
loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', task='auto', skip_tokenizer_init=False,
tokenizer_mode='auto', chat_template_text_format='string', trust_remote_code=False,
allowed_local_media_path='', download_dir=None, load_format='auto',
config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto',
quantization_param_path=None, seed=0, max_model_len=8192, worker_use_ray=False,
distributed_executor_backend=None, pipeline_parallel_size=1,
tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16,
enable_prefix_caching=True, disable_sliding_window=False,
use_v2_block_manager='true', swap_space=4, cpu_offload_gb=0,
gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256,
max_logprobs=20, disable_log_stats=False, revision=None, code_revision=None,
rope_scaling=None, rope_theta=None, hf_overrides=None, tokenizer_revision=None,
quantization='awq', enforce_eager=False, max_seq_len_to_capture=8192,
disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray',
tokenizer_pool_extra_config=None, limit_mm_per_prompt=None,
mm_processor_kwargs=None, enable_lora=False, enable_lora_bias=False, max_loras=1,
max_lora_rank=16, enable_prompt_adapter=False, max_prompt_adapters=1,
max_prompt_adapter_token=0, fully_sharded_loras=False, lora_extra_vocab_size=256,
long_lora_scaling_factors=None, lora_dtype='auto', max_cpu_loras=None,
device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True,
ray_workers_use_nsight=False, num_gpu_blocks_override=None, num_lookahead_slots=0,
model_loader_extra_config=None, ignore_patterns=None, preemption_mode=None,
scheduler_delay_factor=0.0, enable_chunked_prefill=None,
guided_decoding_backend='outlines', speculative_model=None,
speculative_model_quantization=None, speculative_draft_tensor_parallel_size=None,
num_speculative_tokens=None, speculative_disable_mqa_scorer=False,
speculative_max_model_len=None, speculative_disable_by_batch_size=None,
ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None,
spec_decoding_acceptance_method='rejection_sampler',
typical_acceptance_sampler_posterior_threshold=None,
typical_acceptance_sampler_posterior_alpha=None, qlora_adapter_name_or_path=None,
disable_logprobs_during_spec_decoding=None, otlp_traces_endpoint=None,
collect_detailed_traces=None, disable_async_output_proc=False,
scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None,
disable_log_requests=False)\n
2025-01-06 16:48:05.604 | info | 0a13jc93spuadh | engine_args.py :126 2025-
01-06 19:48:05,291 Using baked in model with args: {'MODEL_NAME':
'/models/huggingface-cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83', 'TOKENIZER_NAME': '/models/huggingface-
cache/hub/models--loremipsum3658--Bacchus-v4.6-AWQ/snapshots/
4e04a8040020bd72ab6bebcc89b5e0463d9d8c83'}\n
2025-01-06 16:48:05.291 | info | 0a13jc93spuadh | engine.py :112 2025-
01-06 19:48:05,290 Initialized vLLM engine in 76.47s\n
2025-01-06 16:48:05.291 | info | 0a13jc93spuadh | INFO 01-06 19:48:04
model_runner.py:1518] Graph capturing finished in 30 secs, took 0.99 GiB\n
2025-01-06 16:48:05.291 | info | 0a13jc93spuadh | (VllmWorkerProcess pid=228) INFO 01-06 19:48:04 model_runner.py:1518] Graph capturing finished in 30 secs, took 0.99 GiB\n
2025-01-06 16:48:04.929 | info | 0a13jc93spuadh | INFO 01-06 19:48:04
custom_all_reduce.py:224] Registering 5635 cuda graph addresses\n
2025-01-06 16:48:04.671 | info | 0a13jc93spuadh | (VllmWorkerProcess pid=228) INFO 01-06 19:48:04 custom_all_reduce.py:224] Registering 5635 cuda graph addresses\n
2025-01-06 16:48:04.671 | info | 0a13jc93spuadh | (VllmWorkerProcess pid=228) INFO 01-06 19:47:35 model_runner.py:1404] If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.\n
2025-01-06 16:48:04.671 | info | 0a13jc93spuadh | (VllmWorkerProcess pid=228) INFO 01-06 19:47:35 model_runner.py:1400] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.\n
2025-01-06 16:48:04.670 | info | 0a13jc93spuadh | INFO 01-06 19:47:35
model_runner.py:1404] If out-of-memory error occurs during cudagraph capture,
consider decreasing `gpu_memory_utilization` or switching to eager mode. You can
also reduce the `max_num_seqs` as needed to decrease memory usage.\n
2025-01-06 16:47:49.033 | info | olmw98xr7nfth5 | Finished.
2025-01-06 16:47:48.735 | info | olmw98xr7nfth5 | Finished running generator.
2025-01-06 16:47:48.506 | info | olmw98xr7nfth5 | INFO 01-06 19:47:48
async_llm_engine.py:176] Finished request 0640fa52ad8d415ebb9a4227191be936.\n
2025-01-06 16:47:48.506 | info | olmw98xr7nfth5 | INFO 01-06 19:47:46
metrics.py:465] Prefix cache hit rate: GPU: 50.00%, CPU: 0.00%\n
2025-01-06 16:47:46.696 | info | olmw98xr7nfth5 | INFO 01-06 19:47:46
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
31.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.6%, CPU KV cache usage: 0.0%.\n
2025-01-06 16:47:46.696 | info | olmw98xr7nfth5 | INFO 01-06 19:47:41
metrics.py:465] Prefix cache hit rate: GPU: 50.00%, CPU: 0.00%\n
2025-01-06 16:47:41.672 | info | olmw98xr7nfth5 | INFO 01-06 19:47:41
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
44.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.5%, CPU KV cache usage: 0.0%.\n
2025-01-06 16:47:41.672 | info | olmw98xr7nfth5 | Jobs in progress: 1
2025-01-06 16:47:39.625 | info | olmw98xr7nfth5 | Finished.
2025-01-06 16:47:39.322 | info | olmw98xr7nfth5 | Finished running generator.
2025-01-06 16:47:39.101 | info | olmw98xr7nfth5 | INFO 01-06 19:47:39
async_llm_engine.py:176] Finished request e9f056edf5ff49d59cfb5406730f3572.\n
2025-01-06 16:47:39.101 | info | olmw98xr7nfth5 | INFO 01-06 19:47:36
metrics.py:465] Prefix cache hit rate: GPU: 50.00%, CPU: 0.00%\n
2025-01-06 16:47:36.667 | info | olmw98xr7nfth5 | INFO 01-06 19:47:36
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
59.7 tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.8%, CPU KV cache usage: 0.0%.\n
2025-01-06 16:47:36.667 | info | olmw98xr7nfth5 | INFO 01-06 19:47:31
metrics.py:465] Prefix cache hit rate: GPU: 50.00%, CPU: 0.00%\n
2025-01-06 16:47:35.339 | info | 0a13jc93spuadh | INFO 01-06 19:47:35
model_runner.py:1400] Capturing cudagraphs for decoding. This may lead to
unexpected consequences if the model is not static. To run the model in eager mode,
set 'enforce_eager=True' or use '--enforce-eager' in the CLI.\n
2025-01-06 16:47:35.339 | info | 0a13jc93spuadh | INFO 01-06 19:47:32
distributed_gpu_executor.py:61] Maximum concurrency for 8192 tokens per request:
15.62x\n
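
The "15.62x" concurrency figure is consistent with the block count reported in the next entry, assuming vLLM's default KV-cache block size of 16 tokens (the block size itself is not printed in this log):

# Back-of-envelope check of the reported maximum concurrency.
gpu_blocks = 7999          # "# GPU blocks" from the entry below
block_size = 16            # assumed default block size (tokens per KV-cache block)
max_model_len = 8192       # tokens per request, as logged

kv_cache_tokens = gpu_blocks * block_size           # 127,984 tokens of KV-cache capacity
print(f"{kv_cache_tokens / max_model_len:.2f}x")    # -> 15.62x
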
2025-01-06 16:47:32.501 | info | 0a13jc93spuadh | INFO 01-06 19:47:32
distributed_gpu_executor.py:57] # GPU blocks: 7999, # CPU blocks: 1638\n
2025-01-06 16:47:32.302 | info | 0a13jc93spuadh | INFO 01-06 19:47:32
worker.py:232] Memory profiling results: total_gpu_memory=44.34GiB
initial_memory_usage=19.04GiB peak_torch_memory=19.88GiB
memory_usage_post_profile=19.09Gib non_torch_memory=0.50GiB kv_cache_size=19.53GiB
gpu_memory_utilization=0.90\n
2025-01-06 16:47:32.242 | info | 0a13jc93spuadh | (VllmWorkerProcess pid=228) INFO 01-06 19:47:32 worker.py:232] Memory profiling results: total_gpu_memory=44.34GiB initial_memory_usage=19.04GiB peak_torch_memory=19.84GiB memory_usage_post_profile=19.09Gib non_torch_memory=0.50GiB kv_cache_size=19.57GiB gpu_memory_utilization=0.90\n
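
The two profiling entries are internally consistent: the KV-cache budget is what remains of the 90% utilization target after PyTorch and non-torch allocations. The relationship below is inferred from the logged numbers rather than quoted from vLLM source; the same check reproduces the worker process's 19.57 GiB figure from its 19.84 GiB torch peak:

# Rank-0 profiling entry above, all values in GiB as logged.
total_gpu_memory = 44.34
gpu_memory_utilization = 0.90
peak_torch_memory = 19.88
non_torch_memory = 0.50

kv_cache_size = (total_gpu_memory * gpu_memory_utilization
                 - peak_torch_memory - non_torch_memory)
print(round(kv_cache_size, 2))  # -> 19.53, matching kv_cache_size=19.53GiB
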
2025-01-06 16:47:32.242 | info | 0a13jc93spuadh | INFO 01-06 19:47:24
model_runner.py:1077] Loading model weights took 18.5769 GB\n
2025-01-06 16:47:31.644 | info | olmw98xr7nfth5 | INFO 01-06 19:47:31
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
60.1 tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.6%, CPU KV cache usage: 0.0%.\n
2025-01-06 16:47:31.644 | info | olmw98xr7nfth5 | INFO 01-06 19:47:26
metrics.py:465] Prefix cache hit rate: GPU: 50.00%, CPU: 0.00%\n
2025-01-06 16:47:26.621 | info | olmw98xr7nfth5 | INFO 01-06 19:47:26
metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput:
59.7 tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache
usage: 0.4%, CPU KV cache usage: 0.0%.\n
2025-01-06 16:47:26.620 | info | olmw98xr7nfth5 | INFO 01-06 19:47:21
metrics.py:465] Prefix cache hit rate: GPU: 50.00%, CPU: 0.00%\n
2025-01-06 16:47:24.733 | info | 0a13jc93spuadh | (VllmWorkerProcess pid=228) INFO 01-06 19:47:24 model_runner.py:1077] Loading model weights took 18.5769 GB\n
