Commit e658123 (verified) · Parent: d021b67
JsteReubsSoftware committed: Added descriptions and results about the model and datasets

Files changed (1): README.md (+159, −11)
 
model-index:
- name: en-af-sql-training-1727527893
  results: []
datasets:
- b-mc2/sql-create-context
- Clinton/Text-to-sql-v1
- knowrohit07/know_sql
language:
- af
- en
---

# en-af-sql-training-1727527893

This model is a fine-tuned version of [t5-small](https://huggingface.co/t5-small) on three datasets: b-mc2/sql-create-context, Clinton/Text-to-sql-v1, and knowrohit07/know_sql.
It achieves the following results on the evaluation set:
- Loss: 0.0210

## Model description

This is a fine-tuned Afrikaans-to-SQL model. The pretrained [t5-small](https://huggingface.co/t5-small) was used as the base for our SQL model.

## Training and Evaluation Datasets

As mentioned, to train the model we used a combination of three datasets, which we split into training, testing, and validation sets. The datasets can be found by following these links:

- [b-mc2/sql-create-context](https://huggingface.co/datasets/b-mc2/sql-create-context)
- [Clinton/Text-to-sql-v1](https://huggingface.co/datasets/Clinton/Text-to-sql-v1)
- [knowrohit07/know_sql](https://huggingface.co/datasets/knowrohit07/know_sql)

We did an 80-10-10 split on each dataset and then combined them into a single `DatasetDict` object with `train`, `test`, and `validation` sets (a sketch of this split follows the structure below):
```text
DatasetDict({
    train: Dataset({
        features: ['answer', 'question', 'context', 'afr question'],
        num_rows: 118692
    })
    test: Dataset({
        features: ['answer', 'question', 'context', 'afr question'],
        num_rows: 14838
    })
    validation: Dataset({
        features: ['answer', 'question', 'context', 'afr question'],
        num_rows: 14838
    })
})
```
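
For illustration, here is a minimal sketch of how such an 80-10-10 split and merge can be done with the `datasets` library. The seed, the column alignment, and the Afrikaans translation step are our assumptions, not the card's actual code:

```python
from datasets import DatasetDict, concatenate_datasets, load_dataset

SOURCES = [
    "b-mc2/sql-create-context",
    "Clinton/Text-to-sql-v1",
    "knowrohit07/know_sql",
]

parts = {"train": [], "test": [], "validation": []}
for name in SOURCES:
    ds = load_dataset(name, split="train")
    # 80-10-10: carve off 20%, then halve it into test and validation.
    first = ds.train_test_split(test_size=0.2, seed=42)   # assumed seed
    rest = first["test"].train_test_split(test_size=0.5, seed=42)
    parts["train"].append(first["train"])
    parts["test"].append(rest["train"])
    parts["validation"].append(rest["test"])

# In practice the three sources have different column names, so each split
# must first be renamed/aligned to ['answer', 'question', 'context'] and the
# translated 'afr question' column added before concatenation.
combined = DatasetDict({k: concatenate_datasets(v) for k, v in parts.items()})
```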

The pretrained model was then fine-tuned on the dataset splits. Rather than using only the `question`, the model also takes in the schema context so that it can generate more accurate queries for a given database.

*Input prompt*
```text
Table context: CREATE TABLE table_55794 (
    "Home team" text,
    "Home team score" text,
    "Away team" text,
    "Away team score" text,
    "Venue" text,
    "Crowd" real,
    "Date" text
)
Question: Watter tuisspan het'n span mebbourne?
Answer:
```
*Expected Output*
```sql
SELECT "Home team score" FROM table_55794 WHERE "Away team" = 'melbourne'
```

## Intended uses & limitations

This model takes in a single prompt (similar to the one above), tokenizes it, and then uses the `input_ids` to generate an output SQL query. However, the prompt must be structured in a specific way: it must start with the table/schema description, followed by the question, followed by an empty answer. Below we illustrate how to use it. Furthermore, our combined tokenized dataset looks as follows (see the preprocessing sketch after the structure):

*Tokenized Dataset*
```text
DatasetDict({
    train: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 118692
    })
    test: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 14838
    })
    validation: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 14838
    })
})
```
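
A minimal sketch of the kind of preprocessing that can produce this structure, reusing `combined` from the split sketch earlier and assuming the prompt template from the *Usage* example below (the exact function is our assumption, not the card's code):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")

def tokenize_fn(example):
    # Build the prompt from the schema context and the Afrikaans question,
    # then tokenize the prompt and the target SQL query.
    prompt = (
        f"Tables:\n{example['context']}\n\n"
        f"Question:\n{example['afr question']}\n\n"
        "Answer:\n"
    )
    features = {"input_ids": tokenizer(prompt, truncation=True).input_ids}
    features["labels"] = tokenizer(example["answer"], truncation=True).input_ids
    return features

tokenized_datasets = combined.map(
    tokenize_fn, remove_columns=combined["train"].column_names
)
```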
*Usage*
```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the model and tokenizer from the Hugging Face Hub
repo_name = "JsteReubsSoftware/en-af-sql-training-1727527893"
en_af_sql_model = AutoModelForSeq2SeqLM.from_pretrained(repo_name, torch_dtype=torch.bfloat16)
en_af_sql_model = en_af_sql_model.to('cuda')
tokenizer = AutoTokenizer.from_pretrained(repo_name)

question = "Watter tuisspan het'n span mebbourne?"
context = """CREATE TABLE table_55794 (
    "Home team" text,
    "Home team score" text,
    "Away team" text,
    "Away team score" text,
    "Venue" text,
    "Crowd" real,
    "Date" text
)"""

prompt = f"""Tables:
{context}

Question:
{question}

Answer:
"""

# Tokenize the prompt and generate the SQL query
inputs = tokenizer(prompt, return_tensors='pt')
inputs = inputs.to('cuda')

output = tokenizer.decode(
    en_af_sql_model.generate(
        inputs["input_ids"],
        max_new_tokens=200,
    )[0],
    skip_special_tokens=True
)

print("Predicted SQL Query:")
print(output)
```
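
If everything is set up correctly, the printed query for this example should resemble the *Expected Output* shown earlier, i.e. `SELECT "Home team score" FROM table_55794 WHERE "Away team" = 'melbourne'`.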

## Training procedure

The following hyperparameters were used during training:
- lr_scheduler_type: linear
- num_epochs: 2
 
We used the following training setup in our program:
```python
import time
from transformers import Trainer, TrainingArguments

output_dir = f'./en-af-sql-training-{str(int(time.time()))}'

training_args = TrainingArguments(
    output_dir=output_dir,
    learning_rate=5e-3,
    num_train_epochs=2,
    per_device_train_batch_size=16,   # batch size per device during training
    per_device_eval_batch_size=16,    # batch size for evaluation
    weight_decay=0.01,
    logging_steps=50,
    evaluation_strategy='steps',      # evaluation strategy to adopt during training
    eval_steps=500,                   # number of steps between evaluations
)

trainer = Trainer(
    model=finetuned_model,            # the t5-small model being fine-tuned
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation'],
)

trainer.train()
```

### Training results

| Training Loss | Epoch  | Step | Validation Loss |
|:-------------:|:------:|:----:|:---------------:|
| 0.024         | 1.7520 | 6500 | 0.0210          |
| 0.0249        | 1.8868 | 7000 | 0.0210          |

### Testing results

After the model was trained and validated, we evaluated it using four evaluation metrics (a sketch of how two of them can be computed follows this list):

- *Exact Match Accuracy:* measures how often the model predicts exactly the same SQL query as the target query.
- *TSED score:* ranges from 0 to 1 and was proposed by [this paper](https://dl.acm.org/doi/abs/10.1145/3639477.3639732). It allows us to estimate the execution performance of the output query, and thereby the model's execution accuracy.
- *SQAM accuracy:* similar to TSED, this can be used to estimate the output query's execution accuracy (also see [the same paper](https://dl.acm.org/doi/abs/10.1145/3639477.3639732)).
- *BLEU score:* measures the similarity between the output query and the target query.
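
For reference, a minimal sketch of how the exact-match and BLEU numbers can be computed; using `sacrebleu` is our assumption, and TSED/SQAM follow the cited paper's method so they are omitted here:

```python
import sacrebleu

def exact_match_accuracy(preds, refs):
    # Share of predictions identical (up to surrounding whitespace) to the target query.
    return sum(p.strip() == r.strip() for p, r in zip(preds, refs)) / len(refs)

def bleu(preds, refs):
    # Corpus-level BLEU, rescaled from sacrebleu's 0-100 range to 0-1.
    return sacrebleu.corpus_bleu(preds, [refs]).score / 100

preds = ['SELECT "Home team" FROM table_55794']
refs = ['SELECT "Home team" FROM table_55794']
print(exact_match_accuracy(preds, refs), bleu(preds, refs))  # 1.0 1.0
```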

The following results were obtained over the test set (14838 records):

- Exact Match accuracy: 35.98%
- TSED score: 0.897
- SQAM accuracy: 74.31%
- BLEU score: 0.762

### Framework versions

- Transformers 4.44.2
- Pytorch 2.4.0
- Datasets 3.0.0
- Tokenizers 0.19.1