---
base_model: t5-small
library_name: transformers
license: apache-2.0
tags:
- generated_from_trainer
model-index:
- name: en-af-sql-training-1727527893
results: []
datasets:
- b-mc2/sql-create-context
- Clinton/Text-to-sql-v1
- knowrohit07/know_sql
language:
- af
- en
---
# en-af-sql-training-1727527893
This model is a fine-tuned version of [t5-small](https://huggingface.co/t5-small) on three datasets: b-mc2/sql-create-context, Clinton/Text-to-sql-v1, and knowrohit07/know_sql.
It achieves the following results on the evaluation set:
- Loss: 0.0210
## Model description
This is a fine-tuned Afrikaans-to-SQL model, obtained by fine-tuning the pretrained [t5-small](https://huggingface.co/t5-small) checkpoint.
## Training and Evaluation Datasets
As mentioned, the model was trained on a combination of three datasets, each split into training, testing, and validation sets. The datasets can be found at the following links:
- [b-mc2/sql-create-context](https://huggingface.co/datasets/b-mc2/sql-create-context)
- [Clinton/Text-to-sql-v1](https://huggingface.co/datasets/Clinton/Text-to-sql-v1)
- [knowrohit07/know_sql](https://huggingface.co/datasets/knowrohit07/know_sql)
We did an 80-10-10 split on each dataset and then combined them into a single `DatasetDict` object with `train`, `test`, and `validation` sets.
```python
DatasetDict({
train: Dataset({
features: ['answer', 'question', 'context', 'afr question'],
num_rows: 118692
})
test: Dataset({
features: ['answer', 'question', 'context', 'afr question'],
num_rows: 14838
})
validation: Dataset({
features: ['answer', 'question', 'context', 'afr question'],
num_rows: 14838
})
})
```
The pretrained model was then fine-tuned on the dataset splits. Rather than using only the `question`, the model also takes in the schema context such that it can generate more accurate queries for a given database.
*Input prompt*
```text
Table context: CREATE TABLE table_55794 (
"Home team" text,
"Home team score" text,
"Away team" text,
"Away team score" text,
"Venue" text,
"Crowd" real,
"Date" text
)
Question: Watter tuisspan het'n span mebbourne?
Answer:
```
*Expected Output*
```sql
SELECT "Home team score" FROM table_55794 WHERE "Away team" = 'melbourne'
```
## Intended uses & limitations
This model takes a single prompt (similar to the one above), tokenizes it, and uses the resulting `input_ids` to generate an output SQL query. However, the prompt must be structured in a specific way: it must start with the table/schema description, followed by the question, followed by an empty answer. Below we illustrate how to use the model. Furthermore, our combined tokenized dataset looks as follows:
*Tokenized Dataset*
```python
DatasetDict({
train: Dataset({
features: ['input_ids', 'labels'],
num_rows: 118692
})
test: Dataset({
features: ['input_ids', 'labels'],
num_rows: 14838
})
validation: Dataset({
features: ['input_ids', 'labels'],
num_rows: 14838
})
})
```
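Before tokenization, each record's schema context and question are assembled into a prompt of the shape shown in the usage example below. A minimal sketch (the `build_prompt` helper is our own illustration, not part of the released code):

```python
def build_prompt(context: str, question: str) -> str:
    # Schema description first, then the question, then an empty answer slot.
    return f"Tables:\n{context}\nQuestion:\n{question}\nAnswer:\n"

prompt = build_prompt(
    'CREATE TABLE table_55794 ("Home team" text, "Away team" text)',
    "Watter tuisspan het'n span mebbourne?",
)
print(prompt)
```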
*Usage*
```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the model and tokenizer from the Hugging Face Hub
repo_name = "JsteReubsSoftware/en-af-sql-training-1727527893"
en_af_sql_model = AutoModelForSeq2SeqLM.from_pretrained(repo_name, torch_dtype=torch.bfloat16)
en_af_sql_model = en_af_sql_model.to('cuda')
tokenizer = AutoTokenizer.from_pretrained(repo_name)

question = "Watter tuisspan het'n span mebbourne?"
context = """CREATE TABLE table_55794 (
    "Home team" text,
    "Home team score" text,
    "Away team" text,
    "Away team score" text,
    "Venue" text,
    "Crowd" real,
    "Date" text
)"""
prompt = f"""Tables:
{context}
Question:
{question}
Answer:
"""
inputs = tokenizer(prompt, return_tensors='pt')
inputs = inputs.to('cuda')
output = tokenizer.decode(
en_af_sql_model.generate(
inputs["input_ids"],
max_new_tokens=200,
)[0],
skip_special_tokens=True
)
print("Predicted SQL Query:")
print(output)
```
## Training procedure
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 0.005
- train_batch_size: 32
- eval_batch_size: 32
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 2
We used the following in our program:
```python
import time
from transformers import Trainer, TrainingArguments

output_dir = f'./en-af-sql-training-{str(int(time.time()))}'
training_args = TrainingArguments(
output_dir=output_dir,
learning_rate=5e-3,
num_train_epochs=2,
per_device_train_batch_size=16, # batch size per device during training
per_device_eval_batch_size=16, # batch size for evaluation
weight_decay=0.01,
logging_steps=50,
evaluation_strategy='steps', # evaluation strategy to adopt during training
eval_steps=500, # number of steps between evaluation
)
trainer = Trainer(
model=finetuned_model,
args=training_args,
train_dataset=tokenized_datasets['train'],
eval_dataset=tokenized_datasets['validation'],
)
trainer.train()
```
### Training results
| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:------:|:----:|:---------------:|
| 0.0573 | 0.1348 | 500 | 0.0452 |
| 0.0424 | 0.2695 | 1000 | 0.0364 |
| 0.037 | 0.4043 | 1500 | 0.0323 |
| 0.0356 | 0.5391 | 2000 | 0.0287 |
| 0.0328 | 0.6739 | 2500 | 0.0269 |
| 0.0281 | 0.8086 | 3000 | 0.0255 |
| 0.0286 | 0.9434 | 3500 | 0.0238 |
| 0.0269 | 1.0782 | 4000 | 0.0233 |
| 0.0247 | 1.2129 | 4500 | 0.0225 |
| 0.0245 | 1.3477 | 5000 | 0.0217 |
| 0.0226 | 1.4825 | 5500 | 0.0214 |
| 0.0245 | 1.6173 | 6000 | 0.0211 |
| 0.024 | 1.7520 | 6500 | 0.0210 |
| 0.0249 | 1.8868 | 7000 | 0.0210 |
### Testing results
After our model was trained and validated, we evaluated it on the test set using four metrics:
- *Exact Match accuracy:* the fraction of predictions that match the target SQL query exactly.
- *TSED score:* ranges from 0 to 1 and was proposed in [this](https://dl.acm.org/doi/abs/10.1145/3639477.3639732) paper; it estimates the execution performance of the output query, giving a proxy for the model's execution accuracy.
- *SQAM accuracy:* similar to TSED, we use this to estimate the output query's execution accuracy (also see [this](https://dl.acm.org/doi/abs/10.1145/3639477.3639732) paper).
- *BLEU score:* measures the textual similarity between the output query and the target query.
The following results were obtained on the testing set (14,838 records):
- Exact Match accuracy: 35.98%
- TSED score: 0.897
- SQAM accuracy: 74.31%
- BLEU score: 0.762
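Exact match is the simplest of these metrics to reproduce. A minimal sketch in plain Python (the whitespace normalisation is our own assumption; the original evaluation code may compare queries differently):

```python
def exact_match_accuracy(predictions, targets):
    # Normalise whitespace so cosmetic spacing differences do not count as errors.
    norm = lambda q: " ".join(q.split())
    hits = sum(norm(p) == norm(t) for p, t in zip(predictions, targets))
    return 100.0 * hits / len(targets)

# Toy example: the first pair differs only in spacing, the second differs in content.
preds = ["SELECT name FROM users WHERE id = 1", "SELECT name FROM users"]
golds = ["SELECT name  FROM users WHERE id = 1", "SELECT id FROM users"]
print(exact_match_accuracy(preds, golds))  # → 50.0
```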
### Framework versions
- Transformers 4.44.2
- Pytorch 2.4.0
- Datasets 3.0.0
- Tokenizers 0.19.1