kv-cache : use ggml_set_rows #14285
Conversation
*Force-pushed from `1e86597` to `2b940c0`.*
I tried this PR with the following change in the RPC backend:

```diff
diff --git a/ggml/src/ggml-rpc/ggml-rpc.cpp b/ggml/src/ggml-rpc/ggml-rpc.cpp
index f468f796..dcbede89 100644
--- a/ggml/src/ggml-rpc/ggml-rpc.cpp
+++ b/ggml/src/ggml-rpc/ggml-rpc.cpp
@@ -761,6 +761,8 @@ static enum ggml_status ggml_backend_rpc_graph_compute(ggml_backend_t backend, g
     ggml_backend_rpc_context * rpc_ctx = (ggml_backend_rpc_context *)backend->context;
     std::vector<uint8_t> input;
     serialize_graph(cgraph, input);
+    auto graph_hash = fnv_hash(input.data(), input.size());
+    printf("RPC graph compute: hash = 0x%" PRIx64 ", size = %zu\n", graph_hash, input.size());
     rpc_msg_graph_compute_rsp response;
     auto sock = get_socket(rpc_ctx->endpoint);
     bool status = send_rpc_cmd(sock, RPC_CMD_GRAPH_COMPUTE, input.data(), input.size(), &response, sizeof(response));
```

The compute graph doesn't change and produces the same hash with the gpt2, tinyllama and mistral-7b models. However, the hash does change with gemma3 models. The serialized graph includes tensor addresses, so it's possible that we rebuild the same tensors at different addresses, resulting in a different graph hash. But in any case, this looks like great progress!
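For reference, a self-contained sketch of what the `fnv_hash` helper used in the patch above computes, assuming the standard FNV-1a 64-bit parameters (the actual helper in `ggml-rpc.cpp` may differ in detail):

```cpp
#include <cstdint>
#include <cstddef>

// FNV-1a 64-bit hash over a byte buffer: XOR each byte into the state,
// then multiply by the FNV prime. Standard offset basis and prime.
static uint64_t fnv_hash(const uint8_t * data, size_t len) {
    const uint64_t fnv_prime = 0x100000001b3ULL;
    uint64_t hash = 0xcbf29ce484222325ULL;
    for (size_t i = 0; i < len; ++i) {
        hash ^= data[i];
        hash *= fnv_prime;
    }
    return hash;
}
```

Since the serialized graph embeds tensor addresses, two byte-identical graphs hash the same, but rebuilding the same logical graph at different addresses produces a different hash, which explains the gemma3 observation above.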
*Force-pushed from `8f1c5e3` to `5f87f28`.*
Yes, this is expected. I've applied the change only for the unified cache. For the unified+iswa, it is still using the original view + cpy approach.
Should work with Gemma now as well.
*Force-pushed from `4d0c0ea` to `db0cd69`.*
The non-FA path is now also supported, though I am not 100% sure this is the best way to do it.
I don't observe any performance regression with a CPU-only build, so I think the implementation should be good enough.
Add ggml_set_rows(a, b, c) which copies rows from 'b' into 'a' using indices from 'c'. ref: #8366
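To make the new op's contract concrete, here is a minimal reference sketch of the row-copy semantics for a contiguous 2D float case (illustrative only, not the actual ggml kernel, which also handles other types, strides and broadcasting):

```cpp
#include <cstdint>
#include <cstring>

// Reference semantics of ggml_set_rows(a, b, c):
// for each i, row i of 'b' is written into row c[i] of 'a'.
static void set_rows_ref(float * a, const float * b, const int64_t * c,
                         int64_t n_rows_b, int64_t n_cols) {
    for (int64_t i = 0; i < n_rows_b; ++i) {
        std::memcpy(a + c[i] * n_cols,  // destination row index taken from 'c'
                    b + i    * n_cols,  // source row from 'b'
                    n_cols * sizeof(float));
    }
}
```

Because the destination indices are arbitrary, this removes the requirement that the destination rows form one contiguous block.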
*Force-pushed from `a0c0fb6` to `d40f705`.*
*Force-pushed from `d40f705` to `d1da992`.*
*Force-pushed from `1031a5d` to `14554a8`.*
Depends on #14274.

Utilize `ggml_set_rows()` for updating the KV cache `head` offset. Currently enabled only if the environment variable `LLAMA_SET_ROWS` is defined. If not, we fall back to the original way of updating the KV cache using a view + cpy of contiguous slots. This is needed until the `ggml_set_rows()` implementation is finalized and supported by all backends.

Will merge after #14274.
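As a rough illustration of the gating described above (a sketch, not the actual llama.cpp code; the function name, tensor names and the `row_ids` tensor are hypothetical), the two paths could look like:

```cpp
#include <cstdlib>
#include "ggml.h"

// Hypothetical sketch of the LLAMA_SET_ROWS gate for the K cache update;
// the real code in llama.cpp differs in structure and naming.
static void build_k_cache_update(
        ggml_context * ctx, ggml_cgraph * gf,
        ggml_tensor  * k_cache,   // full K cache buffer
        ggml_tensor  * k_cur,     // new K rows for this batch
        ggml_tensor  * row_ids,   // I64 destination slot per token (set_rows path)
        int64_t n_tokens, int64_t n_embd_k_gqa, int64_t head) {
    if (getenv("LLAMA_SET_ROWS") != nullptr) {
        // New path: scatter rows of k_cur into k_cache at indices row_ids,
        // so the destination slots need not be contiguous.
        ggml_build_forward_expand(gf, ggml_set_rows(ctx, k_cache, k_cur, row_ids));
    } else {
        // Original path: view of contiguous slots starting at `head`,
        // then copy the new rows into that view.
        ggml_tensor * k_view = ggml_view_1d(ctx, k_cache,
                n_tokens*n_embd_k_gqa,
                head*ggml_row_size(k_cache->type, n_embd_k_gqa));
        ggml_build_forward_expand(gf, ggml_cpy(ctx, k_cur, k_view));
    }
}
```

The fallback branch mirrors the existing view + cpy mechanism, which is why the environment variable can gate the change without touching backends that don't yet implement the new op.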
Testing
Next PRs
- `llama_kv_cache_unified`: support virtual sequences