pgml-docs/docs/user_guides/transformers/fine_tuning.md
Lines changed: 77 additions & 7 deletions
@@ -34,18 +34,63 @@ You can view the newly loaded data in your Postgres database:
 103 | {"en": "ROLES_OF_TRANSLATORS", "es": "Rafael Osuna rosuna@wol. es Traductor"}
 (5 rows)
 ```
+This Hugging Face dataset stores each translation as language key/value pairs in a JSON document. To use it with PostgresML, we'll need to provide a `VIEW` that structures the data into more primitively typed columns.
+
+=== "SQL"
+
+    ```sql linenums="1"
+    CREATE OR REPLACE VIEW kde4_en_to_es AS
+    SELECT translation->>'en' AS "en", translation->>'es' AS "es"
+    FROM pgml.kde4
+    LIMIT 10;
+    ```
+
+=== "Result"
+
+    ```sql linenums="1"
+    CREATE VIEW
+    ```
+
+Now we can see the data in a more normalized form. The exact column names don't matter for now; we'll specify which one is the target during the training call, and the other will be used as the input.
+ROLES_OF_TRANSLATORS | Rafael Osuna rosuna@wol. es Traductor Miguel Revilla Rodríguez yo@miguelrevilla. com Traductor
+2006-02-26 3.5.1 | 2006-02-26 3.5.1
+The Babel & konqueror; plugin gives you quick access to the Babelfish translation service. | La extensión Babel de & konqueror; le permite un acceso rápido al servicio de traducción de Babelfish.
+KDE | KDE
+kdeaddons | kdeaddons
+konqueror | konqueror
+plugins | extensiones
+babelfish | babelfish
+(10 rows)
+```
 
-When you're constructing your own datasets for translation, it's important to mirror the same table structure. You'll need a `JSONB` column named `translation` that first has a "from" language name/value pair, and then a "to" language name/value pair. In this English to Spanish example, we go from "en" to "es". You'll pass a `y_column_name` of `translation` to tune the model.
 
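When constructing your own translation dataset, the same pattern as the `kde4_en_to_es` view should carry over. The following is a minimal sketch, assuming a hypothetical table `my_translations` and a hypothetical view `my_en_to_es` (neither name appears in this guide). It only shows how a `JSONB` `translation` column can be flattened into one text column per language:

```sql
-- Hypothetical table of your own translation pairs, shaped like pgml.kde4.
CREATE TABLE my_translations (
    id BIGSERIAL PRIMARY KEY,
    translation JSONB NOT NULL
);

-- Two illustrative rows; each document holds an "en" and an "es" key.
INSERT INTO my_translations (translation) VALUES
    ('{"en": "Open file", "es": "Abrir archivo"}'),
    ('{"en": "Save", "es": "Guardar"}');

-- Flatten the JSON into two text columns, as the kde4_en_to_es view does.
CREATE OR REPLACE VIEW my_en_to_es AS
SELECT translation->>'en' AS "en", translation->>'es' AS "es"
FROM my_translations;

SELECT * FROM my_en_to_es;
```

A view like this could then be passed as `relation_name`, with the target language column as `y_column_name`.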
 ### Tune the model
 Tuning is very similar to training with PostgresML, although we specify a `model_name` to download from Hugging Face instead of the base `algorithm`.
 
 ```sql linenums="1" title="tune.sql"
 SELECT pgml.tune(
     'Translate English to Spanish',
-    task => 'translation_en_to_es',
-    relation_name => 'pgml.kde4',
-    y_column_name => 'translation',
+    task => 'translation',
+    relation_name => 'kde4_en_to_es',
+    y_column_name => 'es', -- translate into Spanish
     model_name => 'Helsinki-NLP/opus-mt-en-es',
     hyperparams => '{
         "learning_rate": 2e-5,
@@ -289,7 +334,8 @@ Or, it might be interesting to concat the title to the text field to see how rel
 
 ```sql linenums="1" title="concat_title.sql"
 CREATE OR REPLACE VIEW billsum_training_data
-AS SELECT title || '\n' || "text" AS "text", summary FROM pgml.billsum;
+AS SELECT title || '\n' || "text" AS "text", summary FROM pgml.billsum
+LIMIT 10;
 ```
 
 
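One detail worth checking in `concat_title.sql`: in PostgreSQL with `standard_conforming_strings` on (the default since 9.1), the plain literal `'\n'` is a backslash followed by the letter n, not a line break. If a real newline between the title and the text is wanted, an escape string constant such as `E'\n'` is one option. A small sketch for comparing the two, independent of the `billsum` data:

```sql
-- Assumes standard_conforming_strings = on (the PostgreSQL default).
-- Plain literal: '\n' stays as two characters, a backslash and an "n".
-- Escape string constant: E'\n' is a single newline character.
SELECT length('title' || '\n' || 'text')  AS plain_length,   -- 11
       length('title' || E'\n' || 'text') AS escape_length;  -- 10
```

Whether the literal backslash matters for tuning is not something this guide states, so treat this as a point to verify rather than a required change.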
@@ -310,14 +356,14 @@ SELECT pgml.tune(
         "per_device_eval_batch_size": 2,
         "num_train_epochs": 1,
         "weight_decay": 0.01,
-        "max_input_length": 1024,
-        "max_summary_length": 128
+        "max_length": 1024
     }',
     test_size => 0.2,
     test_sampling => 'last'
 );
 ```
 
+
 ### Make predictions
 
 === "SQL"
@@ -355,3 +401,27 @@ The default for predict in a classification problem classifies the statement as
 This shows that there is a 6.26% chance for category 0 (negative sentiment), and a 93.73% chance it's category 1 (positive sentiment).
 
 See the [task documentation](https://huggingface.co/tasks/text-classification) for more examples, use cases, models and datasets.