---
base_model: t5-small
library_name: transformers
license: apache-2.0
tags:
- generated_from_trainer
model-index:
- name: en-af-sql-training-1727527893
  results: []
datasets:
- b-mc2/sql-create-context
- Clinton/Text-to-sql-v1
- knowrohit07/know_sql
language:
- af
- en
---

# en-af-sql-training-1727527893

This model is a fine-tuned version of [t5-small](https://huggingface.co/t5-small) on three datasets: b-mc2/sql-create-context, Clinton/Text-to-sql-v1, and knowrohit07/know_sql.
It achieves the following results on the evaluation set:
- Loss: 0.0210

## Model description

This is a fine-tuned Afrikaans-to-SQL model: given a question in Afrikaans and the schema of the target database, it generates the corresponding SQL query. It was obtained by fine-tuning the pretrained [t5-small](https://huggingface.co/t5-small) checkpoint.

## Training and Evaluation Datasets

As mentioned, we trained the model on a combination of three datasets, each split into training, testing, and validation sets. The datasets can be found at the following links:

- [b-mc2/sql-create-context](https://huggingface.co/datasets/b-mc2/sql-create-context)
- [Clinton/Text-to-sql-v1](https://huggingface.co/datasets/Clinton/Text-to-sql-v1)
- [knowrohit07/know_sql](https://huggingface.co/datasets/knowrohit07/know_sql)

We did an 80-10-10 split on each dataset and then combined them into a single `DatasetDict` object with `train`, `test`, and `validation` sets.
```python
DatasetDict({
    train: Dataset({
        features: ['answer', 'question', 'context', 'afr question'],
        num_rows: 118692
    })
    test: Dataset({
        features: ['answer', 'question', 'context', 'afr question'],
        num_rows: 14838
    })
    validation: Dataset({
        features: ['answer', 'question', 'context', 'afr question'],
        num_rows: 14838
    })
})
```
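
For reference, here is a minimal sketch of how such a split could be produced with the `datasets` library. This is not the exact preprocessing code: the split names and the pre-existing `afr question` translation column are assumptions.

```python
from datasets import load_dataset, concatenate_datasets, DatasetDict

def split_80_10_10(dataset, seed=42):
    # Carve off 20% of the rows, then split that portion evenly
    # into test and validation (10% each of the original).
    first = dataset.train_test_split(test_size=0.2, seed=seed)
    second = first["test"].train_test_split(test_size=0.5, seed=seed)
    return first["train"], second["train"], second["test"]

# Assumption: each dataset has already been preprocessed to share the
# 'answer', 'question', 'context', and 'afr question' columns.
names = ["b-mc2/sql-create-context", "Clinton/Text-to-sql-v1", "knowrohit07/know_sql"]
parts = [split_80_10_10(load_dataset(name, split="train")) for name in names]

combined = DatasetDict({
    "train": concatenate_datasets([p[0] for p in parts]),
    "test": concatenate_datasets([p[1] for p in parts]),
    "validation": concatenate_datasets([p[2] for p in parts]),
})
```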

The pretrained model was then fine-tuned on the dataset splits. Rather than using only the `question`, the model also takes the schema context as input so that it can generate more accurate queries for a given database.

*Input prompt* (the Afrikaans question roughly translates to "Which home team has a team melbourne?"; the misspelling "mebbourne" is kept verbatim from the dataset)
```python
Table context: CREATE TABLE table_55794 (
    "Home team" text,
    "Home team score" text,
    "Away team" text,
    "Away team score" text,
    "Venue" text,
    "Crowd" real,
    "Date" text
)
Question: Watter tuisspan het'n span mebbourne?
Answer:
```
*Expected Output*
```sql
SELECT "Home team score" FROM table_55794 WHERE "Away team" = 'melbourne'
```

## Intended uses & limitations

This model takes in a single prompt (similar to the one above) that is tokenized, and it then uses the `input_ids` to generate an output SQL query. However, the prompt must be structured in a specific way: it must start with the table/schema description, followed by the question, followed by an empty answer. The usage example below illustrates this. After tokenization, our combined dataset looks as follows:

*Tokenized Dataset*
```python
DatasetDict({
    train: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 118692
    })
    test: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 14838
    })
    validation: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 14838
    })
})
```
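
For reference, a minimal sketch of how such a tokenized dataset could be produced. The prompt template and `max_length` below are illustrative assumptions, not the exact training code; `combined` is the `DatasetDict` shown earlier.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")

def tokenize_example(example, max_length=512):
    # Hypothetical preprocessing: build the prompt from the schema context and
    # the Afrikaans question, then tokenize the prompt and the target query.
    prompt = f"Tables:\n{example['context']}\n\nQuestion:\n{example['afr question']}\n\nAnswer:\n"
    input_ids = tokenizer(prompt, truncation=True, max_length=max_length).input_ids
    labels = tokenizer(example['answer'], truncation=True, max_length=max_length).input_ids
    return {"input_ids": input_ids, "labels": labels}

# Map over every split and drop the raw text columns.
tokenized_datasets = combined.map(
    tokenize_example,
    remove_columns=['answer', 'question', 'context', 'afr question'],
)
```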
*Usage*
```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the model and tokenizer from the Hugging Face Hub
repo_name = "JsteReubsSoftware/en-af-sql-training-1727527893"
en_af_sql_model = AutoModelForSeq2SeqLM.from_pretrained(repo_name, torch_dtype=torch.bfloat16)
en_af_sql_model = en_af_sql_model.to('cuda')
tokenizer = AutoTokenizer.from_pretrained(repo_name)

question = "Watter tuisspan het'n span mebbourne?"
context = """CREATE TABLE table_55794 (
    "Home team" text,
    "Home team score" text,
    "Away team" text,
    "Away team score" text,
    "Venue" text,
    "Crowd" real,
    "Date" text
)"""

# Schema context first, then the question, then an empty answer
prompt = f"""Tables:
{context}

Question:
{question}

Answer:
"""

inputs = tokenizer(prompt, return_tensors='pt')
inputs = inputs.to('cuda')

# Generate the SQL query and decode it back to text
output = tokenizer.decode(
    en_af_sql_model.generate(
        inputs["input_ids"],
        max_new_tokens=200,
    )[0],
    skip_special_tokens=True
)

print("Predicted SQL Query:")
print(output)
```

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.005
- train_batch_size: 32
- eval_batch_size: 32
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 2

We used the following setup in our training script:
```python
import time
from transformers import Trainer, TrainingArguments

output_dir = f'./en-af-sql-training-{str(int(time.time()))}'

training_args = TrainingArguments(
    output_dir=output_dir,
    learning_rate=5e-3,
    num_train_epochs=2,
    per_device_train_batch_size=16,     # batch size per device during training
    per_device_eval_batch_size=16,      # batch size per device during evaluation
    weight_decay=0.01,
    logging_steps=50,
    evaluation_strategy='steps',        # evaluate every eval_steps steps
    eval_steps=500,                     # number of steps between evaluations
)

# finetuned_model is the pretrained t5-small model being fine-tuned;
# tokenized_datasets is the tokenized DatasetDict shown earlier.
trainer = Trainer(
    model=finetuned_model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation'],
)
```
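
Training is then launched with `trainer.train()`.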

### Training results

| Training Loss | Epoch  | Step | Validation Loss |
|:-------------:|:------:|:----:|:---------------:|
| 0.0573        | 0.1348 | 500  | 0.0452          |
| 0.0424        | 0.2695 | 1000 | 0.0364          |
| 0.037         | 0.4043 | 1500 | 0.0323          |
| 0.0356        | 0.5391 | 2000 | 0.0287          |
| 0.0328        | 0.6739 | 2500 | 0.0269          |
| 0.0281        | 0.8086 | 3000 | 0.0255          |
| 0.0286        | 0.9434 | 3500 | 0.0238          |
| 0.0269        | 1.0782 | 4000 | 0.0233          |
| 0.0247        | 1.2129 | 4500 | 0.0225          |
| 0.0245        | 1.3477 | 5000 | 0.0217          |
| 0.0226        | 1.4825 | 5500 | 0.0214          |
| 0.0245        | 1.6173 | 6000 | 0.0211          |
| 0.024         | 1.7520 | 6500 | 0.0210          |
| 0.0249        | 1.8868 | 7000 | 0.0210          |

### Testing results

After our model was trained and validated, we evaluated it on the test set using four metrics (a minimal sketch of the string-based metrics follows this list).

- *Exact Match Accuracy:* the percentage of predictions that exactly match the target SQL query.
- *TSED score:* ranges from 0 to 1 and was proposed in [this](https://dl.acm.org/doi/abs/10.1145/3639477.3639732) paper. It estimates the execution performance of the output query, which lets us approximate the model's execution accuracy.
- *SQAM accuracy:* similar to TSED, we used this to estimate the output query's execution accuracy (also see [this](https://dl.acm.org/doi/abs/10.1145/3639477.3639732) paper).
- *BLEU score:* measures the similarity between the output query and the target query.
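
A minimal sketch of the two string-based metrics, assuming the `sacrebleu` package for BLEU. TSED and SQAM rely on query-tree comparison tooling from the cited paper and are omitted here.

```python
import sacrebleu

def exact_match_accuracy(predictions, targets):
    # Percentage of predictions identical to the target query
    # after trimming surrounding whitespace (a strict comparison).
    matches = sum(p.strip() == t.strip() for p, t in zip(predictions, targets))
    return 100.0 * matches / len(targets)

def bleu_score(predictions, targets):
    # Corpus-level BLEU between predicted and target queries, rescaled to 0-1.
    return sacrebleu.corpus_bleu(predictions, [targets]).score / 100.0

# Example: a perfect prediction scores 100% exact match and BLEU 1.0.
predictions = ['SELECT "Home team score" FROM table_55794 WHERE "Away team" = \'melbourne\'']
targets = ['SELECT "Home team score" FROM table_55794 WHERE "Away team" = \'melbourne\'']
print(exact_match_accuracy(predictions, targets))  # 100.0
print(bleu_score(predictions, targets))            # 1.0
```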

The following results were obtained on the test set (14,838 records):

- Exact Match accuracy: 35.98%
- TSED score: 0.897
- SQAM accuracy: 74.31%
- BLEU score: 0.762

### Framework versions

- Transformers 4.44.2
- Pytorch 2.4.0
- Datasets 3.0.0
- Tokenizers 0.19.1