
Conversation

montanalow (Contributor) commented Jun 16, 2023

Fixes #717 ("any plans to add support for llama.cpp?").

GPT-2 comparison:
I'm no longer able to reproduce a significant speed difference in just GPT-2 (the GGML version initially measured ~4x faster), but that may be because the model fits entirely in GPU memory either way.

SELECT pgml.transform(
    task => '{
      "task": "text-generation",
      "model": "gpt2"
    }'::JSONB,
    inputs => ARRAY[
        'Once upon a time,',
        'It was the best of times'
    ],
    args => '{"max_new_tokens": 32}'::JSONB
);

                                   transform
--------------------------------------------------------------------------------
 [[{"generated_text": "Once upon a time, they felt the feeling. Now, it would not be the first time that this was happening with them, nor would it be the first time that it had occurred"}], [{"generated_text": "It was the best of times,\" the actor replied.\n\nSue Pearce, a New York theater actress who plays the mother of Sam (Cameron Diaz) on \"House of Cards"}]]
(1 row)

Time: 458.381 ms

SELECT pgml.transform(
    task => '{
      "task": "text-generation",
      "model": "marella/gpt-2-ggml"
    }'::JSONB,
    inputs => ARRAY[
        'Once upon a time,',
        'It was the best of times'
    ],
    args => '{"max_new_tokens": 32}'::JSONB
);

                                   transform
--------------------------------------------------------------------------------
 [" I was not only able to make the best of my situation but also have some fun with it.\n\nI've been playing games for years now and even", ". I didn't have to go through that, but it's not like you can do anything else with your life.\"\n"]
(1 row)

Time: 455.720 ms

SELECT pgml.transform(
    task => '{
      "task": "text-generation",
      "model": "mlabonne/gpt2-GPTQ-4bit"
    }'::JSONB,
    inputs => ARRAY[
        'Once upon a time,',
        'It was the best of times'
    ],
    args => '{"max_new_tokens": 32}'::JSONB
);

                                   transform
--------------------------------------------------------------------------------
 ["Once upon a time, the world was a place of great beauty and great danger. The world was a place of great danger. The world was a place of great danger. The world", "It was the best of times.\n\n\"I'm not going to be a part of that team,\" he said. \"I'm going to be a part of the team. I"]
(1 row)

Time: 577.400 ms
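
For reference, the GGML result above can be reproduced outside of Postgres. Below is a minimal sketch in Python, assuming the ctransformers library (GGML bindings by the author of the marella/gpt-2-ggml model); the exact wiring inside pgml.transform may differ:

# Minimal sketch, assuming `pip install ctransformers`; pgml.transform's
# internals may differ from this.
from ctransformers import AutoModelForCausalLM

# Downloads the GGML weights from the Hugging Face Hub on first use.
llm = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml")

for prompt in ["Once upon a time,", "It was the best of times"]:
    print(llm(prompt, max_new_tokens=32))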

@montanalow marked this pull request as ready for review June 16, 2023 05:44
levkk (Contributor) commented Jun 16, 2023

Huh that's interesting. Would be cool to try llama too, if it's on HF. I wonder if the Rust interface is faster? Probably not significantly. I think it would be "cooler" to implement the llm crate eventually just so we can run some of these models without Python.

Tostino commented Jun 17, 2023

The quality does often degrade slightly with quantization for the same model, but you can fit a larger, more capable model in the same RAM when using a quantized model vs. full precision. I can fit a 30b parameter model on one 3090, or a 65b parameter model on two 3090s, with either GGML or GPTQ quantization. Those will give vastly better answers than the largest full-precision models I can fit on the same hardware.
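
To make the memory math concrete, a back-of-the-envelope sketch (weights only; real usage also needs headroom for activations and the KV cache):

# Rough weight-memory estimate; weights only, ignoring activations and KV cache.
def vram_gb(params_billion, bits_per_weight):
    # params (x 1e9) * bits / 8 bits-per-byte; the 1e9 cancels against GB.
    return params_billion * bits_per_weight / 8

for params in (30, 65):
    print(f"{params}b fp16:  {vram_gb(params, 16):6.1f} GB")
    print(f"{params}b 4-bit: {vram_gb(params, 4):6.1f} GB")

# 30b at 4 bits is ~15 GB, which fits a single 24 GB 3090; 65b at 4 bits is
# ~32.5 GB, which fits across two 3090s (48 GB). At fp16 the same models need
# ~60 GB and ~130 GB, beyond either setup.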

@montanalow changed the title from "ggml compatibility" to "GGML and GPTQ compatibility" Jun 17, 2023
@montanalow merged commit 996b514 into master Jun 17, 2023
@montanalow deleted the montana/ggml branch June 17, 2023 18:49
SilasMarvin pushed a commit that referenced this pull request Oct 5, 2023 (Co-authored-by: Montana Low <montanalow@gmail.com>)
VGerris commented Nov 19, 2023

@Tostino can you explain or point to documentation that shows how to set up multiple GPUs and use them? Thank you!
