Commit c2cf3a2

v2 huggingface support (#546)
1 parent d1d8e04 commit c2cf3a2

19 files changed: +1864 -499 lines changed

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -161,3 +161,4 @@ cython_debug/

 # local scratch pad
 scratch.sql
+scratch.py

pgml-dashboard/src/models.rs

Lines changed: 8 additions & 2 deletions
@@ -60,16 +60,22 @@ impl Project {

     pub fn key_metric_name(&self) -> anyhow::Result<&'static str> {
         match self.task.as_ref().unwrap().as_str() {
-            "classification" | "text-classification" => Ok("f1"),
+            "classification" | "text_classification" | "question_answering" => Ok("f1"),
             "regression" => Ok("r2"),
+            "summarization" => Ok("rouge_ngram_f1"),
+            "translation" => Ok("bleu"),
+            "text_generation" | "text2text" => Ok("perplexity"),
             task => Err(anyhow::anyhow!("Unhandled task: {}", task)),
         }
     }

     pub fn key_metric_display_name(&self) -> anyhow::Result<&'static str> {
         match self.task.as_ref().unwrap().as_str() {
-            "classification" | "text-classification" => Ok("F<sup>1</sup>"),
+            "classification" | "text_classification" | "question_answering" => Ok("F<sup>1</sup>"),
             "regression" => Ok("R<sup>2</sup>"),
+            "summarization" => Ok("Rouge Ngram F<sup>1</sup>"),
+            "translation" => Ok("Bleu"),
+            "text_generation" | "text2text" => Ok("Perplexity"),
             task => Err(anyhow::anyhow!("Unhandled task: {}", task)),
         }
     }
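
Reviewer note: the new match arms can be exercised in isolation with a small sketch like the one below. It is not the code from this commit. The free-function form and the `String` error type are assumptions made so the example compiles without the `anyhow` crate or the `Project` struct; only the task names and metric names are taken from the hunk above.

```rust
/// Map a task name to the metric treated as its key metric.
///
/// The arms mirror those added to `Project::key_metric_name` in this commit.
/// Assumption: a plain `&str` argument and a `String` error stand in for
/// `self.task` and `anyhow::Error` to keep the sketch dependency-free.
pub fn key_metric_name(task: &str) -> Result<&'static str, String> {
    match task {
        "classification" | "text_classification" | "question_answering" => Ok("f1"),
        "regression" => Ok("r2"),
        "summarization" => Ok("rouge_ngram_f1"),
        "translation" => Ok("bleu"),
        "text_generation" | "text2text" => Ok("perplexity"),
        // Any other spelling, including the old hyphenated names, is rejected.
        other => Err(format!("Unhandled task: {}", other)),
    }
}
```

With these arms, `key_metric_name("translation")` yields `Ok("bleu")`, while the previously accepted `"text-classification"` (hyphenated) now falls through to the error arm, so callers that still pass hyphenated task names would need updating.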

pgml-docs/docs/user_guides/transformers/fine_tuning.md

Lines changed: 77 additions & 7 deletions
@@ -34,18 +34,63 @@ You can view the newly loaded data in your Postgres database:
 103 | {"en": "ROLES_OF_TRANSLATORS", "es": "Rafael Osuna rosuna@wol. es Traductor"}
 (5 rows)
 ```
+This huggingface dataset stores the data as language key pairs in a JSON document. To use it with PostgresML, we'll need to provide a `VIEW` that structures the data into more primitively typed columns.
+
+=== "SQL"
+
+```sql linenums="1"
+CREATE OR REPLACE VIEW kde4_en_to_es AS
+SELECT translation->>'en' AS "en", translation->>'es' AS "es"
+FROM pgml.kde4
+LIMIT 10;
+```
+
+=== "Result"
+
+```sql linenums="1"
+CREATE VIEW
+```
+
+Now, we can see the data in more normalized form. The exact column names don't matter for now, we'll specify which one is the target during the training call, and the other one will be used as the input.
+
+=== "SQL"
+
+```sql linenums="1"
+SELECT * FROM kde4_en_to_es LIMIT 10;
+```
+
+=== "Result"
+
+```sql linenums="1"
+en | es
+
+--------------------------------------------------------------------------------------------+--------------------------------------------------------------------------
+------------------------------
+Lauri Watts | Lauri Watts
+& Lauri. Watts. mail; | & Lauri. Watts. mail;
+ROLES_OF_TRANSLATORS | Rafael Osuna rosuna@wol. es Traductor Miguel Revilla Rodríguez yo@miguelr
+evilla. com Traductor
+2006-02-26 3.5.1 | 2006-02-26 3.5.1
+The Babel & konqueror; plugin gives you quick access to the Babelfish translation service. | La extensión Babel de & konqueror; le permite un acceso rápido al servici
+o de traducción de Babelfish.
+KDE | KDE
+kdeaddons | kdeaddons
+konqueror | konqueror
+plugins | extensiones
+babelfish | babelfish
+(10 rows)
+```

-When you're constructing your own datasets for translation, it's important to mirror the same table structure. You'll need a `JSONB` column named `translation`, that has first has a "from" language name/value pair, and then a "to" language name/value pair. In this English to Spanish example we use from "en" to "es". You'll pass a `y_column_name` of `translation` to tune the model.

 ### Tune the model
 Tuning is very similar to training with PostgresML, although we specify a `model_name` to download from Hugging Face instead of the base `algorithm`.

 ```sql linenums="1" title="tune.sql"
 SELECT pgml.tune(
 'Translate English to Spanish',
-task => 'translation_en_to_es',
-relation_name => 'pgml.kde4',
-y_column_name => 'translation',
+task => 'translation',
+relation_name => 'kde4_en_to_es',
+y_column_name => 'es', -- translate into spanish
 model_name => 'Helsinki-NLP/opus-mt-en-es',
 hyperparams => '{
 "learning_rate": 2e-5,
@@ -289,7 +334,8 @@ Or, it might be interesting to concat the title to the text field to see how rel

 ```sql linenums="1" title="concat_title.sql"
 CREATE OR REPLACE VIEW billsum_training_data
-AS SELECT title || '\n' || "text" AS "text", summary FROM pgml.billsum;
+AS SELECT title || '\n' || "text" AS "text", summary FROM pgml.billsum
+LIMIT 10;
 ```


@@ -310,14 +356,14 @@ SELECT pgml.tune(
 "per_device_eval_batch_size": 2,
 "num_train_epochs": 1,
 "weight_decay": 0.01,
-"max_input_length": 1024,
-"max_summary_length": 128
+"max_length": 1024
 }',
 test_size => 0.2,
 test_sampling => 'last'
 );
 ```

+
 ### Make predictions

 === "SQL"
@@ -355,3 +401,27 @@ The default for predict in a classification problem classifies the statement as
 This shows that there is a 6.26% chance for category 0 (negative sentiment), and a 93.73% chance it's category 1 (positive sentiment).

 See the [task documentation](https://huggingface.co/tasks/text-classification) for more examples, use cases, models and datasets.
+
+
+
+## Text Generation
+
+```postgresql linenums="1"
+SELECT pgml.load_dataset('bookcorpus', "limit" => 100);
+
+SELECT pgml.tune(
+'GPT Generator',
+task => 'text-generation',
+relation_name => 'pgml.bookcorpus',
+y_column_name => 'text',
+model_name => 'gpt2',
+hyperparams => '{
+"learning_rate": 2e-5,
+"num_train_epochs": 1
+}',
+test_size => 0.2,
+test_sampling => 'last'
+);
+
+SELECT pgml.generate('GPT Generator', 'While I wandered weak and weary');
+```
