
Conversation

montanalow (Contributor) commented Jun 16, 2023

Fixes #717 ("any plans to add support for llama.cpp?").

GPT-2 comparison:
I'm no longer able to reproduce a significant speed difference in just GPT-2 (the GGML version initially measured ~4x faster), but that may be because the model fits entirely in GPU memory either way.

SELECT pgml.transform(
    task => '{
      "task": "text-generation",
      "model": "gpt2"
    }'::JSONB,
    inputs => ARRAY[
        'Once upon a time,',
        'It was the best of times'
    ],
    args => '{"max_new_tokens": 32}'::JSONB
);

                                   transform
--------------------------------------------------------------------------------
 [[{"generated_text": "Once upon a time, they felt the feeling. Now, it would not be the first time that this was happening with them, nor would it be the first time that it had occurred"}], [{"generated_text": "It was the best of times,\" the actor replied.\n\nSue Pearce, a New York theater actress who plays the mother of Sam (Cameron Diaz) on \"House of Cards"}]]
(1 row)

Time: 458.381 ms

SELECT pgml.transform(
    task => '{
      "task": "text-generation",
      "model": "marella/gpt-2-ggml"
    }'::JSONB,
    inputs => ARRAY[
        'Once upon a time,',
        'It was the best of times'
    ],
    args => '{"max_new_tokens": 32}'::JSONB
);

                                   transform
--------------------------------------------------------------------------------
 [" I was not only able to make the best of my situation but also have some fun with it.\n\nI've been playing games for years now and even", ". I didn't have to go through that, but it's not like you can do anything else with your life.\"\n"]
(1 row)

Time: 455.720 ms

SELECT pgml.transform(
    task => '{
      "task": "text-generation",
      "model": "mlabonne/gpt2-GPTQ-4bit"
    }'::JSONB,
    inputs => ARRAY[
        'Once upon a time,',
        'It was the best of times'
    ],
    args => '{"max_new_tokens": 32}'::JSONB
);

                                   transform
--------------------------------------------------------------------------------
 ["Once upon a time, the world was a place of great beauty and great danger. The world was a place of great danger. The world was a place of great danger. The world", "It was the best of times.\n\n\"I'm not going to be a part of that team,\" he said. \"I'm going to be a part of the team. I"]
(1 row)

Time: 577.400 ms
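
For reference, the GGML result above can be reproduced outside of Postgres. Below is a minimal sketch in Python, assuming the ctransformers library (GGML bindings by the author of the marella/gpt-2-ggml model); the exact wiring inside pgml.transform may differ:

# Minimal sketch, assuming `pip install ctransformers`; pgml.transform's
# internals may differ from this.
from ctransformers import AutoModelForCausalLM

# Downloads the GGML weights from the Hugging Face Hub on first use.
llm = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml")

for prompt in ["Once upon a time,", "It was the best of times"]:
    print(llm(prompt, max_new_tokens=32))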

@montanalow marked this pull request as ready for review June 16, 2023 05:44
levkk (Contributor) commented Jun 16, 2023

Huh that's interesting. Would be cool to try llama too, if it's on HF. I wonder if the Rust interface is faster? Probably not significantly. I think it would be "cooler" to implement the llm crate eventually just so we can run some of these models without Python.

Tostino commented Jun 17, 2023

The quality does often degrade slightly with quantization for the same model, but you can fit a larger, more capable model in the same RAM when using a quantized model vs. full precision. I can fit a 30b parameter model on one 3090, or a 65b parameter model on two 3090s, with either GGML or GPTQ quantization. Those will give vastly better answers than the largest full-precision models I can fit on the same hardware.
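
To make the memory math concrete, a back-of-the-envelope sketch (weights only; real usage also needs headroom for activations and the KV cache):

# Rough weight-memory estimate; weights only, ignoring activations and KV cache.
def vram_gb(params_billion, bits_per_weight):
    # params (x 1e9) * bits / 8 bits-per-byte; the 1e9 cancels against GB.
    return params_billion * bits_per_weight / 8

for params in (30, 65):
    print(f"{params}b fp16:  {vram_gb(params, 16):6.1f} GB")
    print(f"{params}b 4-bit: {vram_gb(params, 4):6.1f} GB")

# 30b at 4 bits is ~15 GB, which fits a single 24 GB 3090; 65b at 4 bits is
# ~32.5 GB, which fits across two 3090s (48 GB). At fp16 the same models need
# ~60 GB and ~130 GB, beyond either setup.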

@montanalow changed the title from "ggml compatibility" to "GGML and GPTQ compatibility" Jun 17, 2023
@montanalow merged commit 996b514 into master Jun 17, 2023
@montanalow deleted the montana/ggml branch June 17, 2023 18:49
SilasMarvin pushed a commit that referenced this pull request Oct 5, 2023 (Co-authored-by: Montana Low <montanalow@gmail.com>)
VGerris commented Nov 19, 2023

@Tostino can you explain or point to documentation that shows how to set up multiple GPUs and use them? Thank you!
