JsteReubsSoftware
/

en-af-sql-training-1727527893

@@ -7,28 +7,137 @@ tags:
 model-index:
 - name: en-af-sql-training-1727527893
   results: []
 ---
-<!-- This model card has been generated automatically according to the information the Trainer had access to. You
-should probably proofread and complete it, then remove this comment. -->
 # en-af-sql-training-1727527893
-This model is a fine-tuned version of [t5-small](https://huggingface.co/t5-small) on the None dataset.
 It achieves the following results on the evaluation set:
 - Loss: 0.0210
 ## Model description
-More information needed
 ## Intended uses & limitations
-More information needed
-## Training and evaluation data
-More information needed
 ## Training procedure
@@ -43,6 +152,30 @@ The following hyperparameters were used during training:
 - lr_scheduler_type: linear
 - num_epochs: 2
 ### Training results
 | Training Loss | Epoch  | Step | Validation Loss |
@@ -62,10 +195,25 @@ The following hyperparameters were used during training:
 | 0.024         | 1.7520 | 6500 | 0.0210          |
 | 0.0249        | 1.8868 | 7000 | 0.0210          |
 ### Framework versions
 - Transformers 4.44.2
 - Pytorch 2.4.0
 - Datasets 3.0.0
-- Tokenizers 0.19.1

 model-index:
 - name: en-af-sql-training-1727527893
   results: []
+datasets:
+- b-mc2/sql-create-context
+- Clinton/Text-to-sql-v1
+- knowrohit07/know_sql
+language:
+- af
+- en
 ---
 # en-af-sql-training-1727527893
+This model is a fine-tuned version of [t5-small](https://huggingface.co/t5-small) on three datasets: b-mc2/sql-create-context, Clinton/Text-to-sql-v1, knowrohit07/know-sql.
 It achieves the following results on the evaluation set:
 - Loss: 0.0210
 ## Model description
+This is a fine-tuned Afrikaans-to-SQL model. The pretrained [t5-small](https://huggingface.co/t5-small) was used to train our SQL model.
+## Training and Evaluation Datasets
+As mentioned, to train the model we used a combination of three dataset which we split into training, testing, and validation sets. THe dataset can be found by following these links:
+- [b-mc2/sql-create-context](https://huggingface.co/datasets/b-mc2/sql-create-context)
+- [Clinton/Text-to-sql-v1](https://huggingface.co/datasets/Clinton/Text-to-sql-v1)
+- [knowrohit07/know-sql](https://huggingface.co/datasets/knowrohit07/know_sql)
+We did a 80-10-10 split on each dataset and then combined them into a single `DatasetDict` object with `train`, `test,` and `validation` sets.
+```json
+DatasetDict({
+    train: Dataset({
+        features: ['answer', 'question', 'context', 'afr question'],
+        num_rows: 118692
+    })
+    test: Dataset({
+        features: ['answer', 'question', 'context', 'afr question'],
+        num_rows: 14838
+    })
+    validation: Dataset({
+        features: ['answer', 'question', 'context', 'afr question'],
+        num_rows: 14838
+    })
+})
+```
+The pretrained model was then fine-tuned on the dataset splits. Rather than using only the `question`, the model also takes in the schema context such that it can generate more accurate queries for a given database.
+*Input prompt*
+```python
+Table context: CREATE TABLE table_55794 (
+    "Home team" text,
+    "Home team score" text,
+    "Away team" text,
+    "Away team score" text,
+    "Venue" text,
+    "Crowd" real,
+    "Date" text
+)
+Question: Watter tuisspan het'n span mebbourne?
+Answer:
+```
+*Expected Output*
+```sql
+SELECT "Home team score" FROM table_55794 WHERE "Away team" = 'melbourne'
+```
 ## Intended uses & limitations
+This model takes in a single prompt (similar to the one above) that is tokenized and it then uses the `input_ids` to generate an output SQL query. However the prompt must be structured in a specific way.
+The `prompt` must start with the table/schema description followed by the question followed by an empty answer. Below we illustrate an example on how to use it. Furthermore, our combined dataset looks as follows:
+*Tokenized Dataset*
+```json
+DatasetDict({
+    train: Dataset({
+        features: ['input_ids', 'labels'],
+        num_rows: 118692
+    })
+    test: Dataset({
+        features: ['input_ids', 'labels'],
+        num_rows: 14838
+    })
+    validation: Dataset({
+        features: ['input_ids', 'labels'],
+        num_rows: 14838
+    })
+})
+```
+*Usage*
+```python
+from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, Trainer, TrainingArguments
+# Load the model and tokenizer from Hugging Face Hub
+repo_name = "JsteReubsSoftware/en-af-sql-training-1727527893"
+en_af_sql_model = AutoModelForSeq2SeqLM.from_pretrained(repo_name, torch_dtype=torch.bfloat16)
+en_af_sql_model = en_af_sql_model.to('cuda')
+tokenizer = AutoTokenizer.from_pretrained(repo_name)
+question = "Watter tuisspan het'n span mebbourne?"
+context = "CREATE TABLE table_55794 (
+    "Home team" text,
+    "Home team score" text,
+    "Away team" text,
+    "Away team score" text,
+    "Venue" text,
+    "Crowd" real,
+    "Date" text
+)"
+prompt = f"""Tables:
+{context}
+Question:
+{question}
+Answer:
+"""
+inputs = tokenizer(prompt, return_tensors='pt')
+inputs = inputs.to('cuda')
+output = tokenizer.decode(
+    en_af_sql_model.generate(
+        inputs["input_ids"],
+        max_new_tokens=200,
+    )[0],
+    skip_special_tokens=True
+)
+print("Predicted SQL Query:")
+print(output)
+```
 ## Training procedure
 - lr_scheduler_type: linear
 - num_epochs: 2
+We used the following in our program:
+```python
+output_dir = f'./en-af-sql-training-{str(int(time.time()))}'
+training_args = TrainingArguments(
+        output_dir=output_dir,
+        learning_rate=5e-3,
+        num_train_epochs=2,
+        per_device_train_batch_size=16,     # batch size per device during training
+        per_device_eval_batch_size=16,      # batch size for evaluation
+        weight_decay=0.01,
+        logging_steps=50,
+        evaluation_strategy='steps',        # evaluation strategy to adopt during training
+        eval_steps=500,                     # number of steps between evaluation
+)
+trainer = Trainer(
+        model=finetuned_model,
+        args=training_args,
+        train_dataset=tokenized_datasets['train'],
+        eval_dataset=tokenized_datasets['validation'],
+)
+```
 ### Training results
 | Training Loss | Epoch  | Step | Validation Loss |
 | 0.024         | 1.7520 | 6500 | 0.0210          |
 | 0.0249        | 1.8868 | 7000 | 0.0210          |
+### Testing results
+After our model was trained and validated, we evaluated the model using four evaluation metrics.
+- *Exact Match Accuracy:* This measured the accuracy of our model predicting the exact same SQL query as the target query.
+- *TSED score:* This metric ranges from 0 to 1 and was proposed by [this](https://dl.acm.org/doi/abs/10.1145/3639477.3639732) paper. It allows us to estimate the execution performance of the output query, allowing us to estimate the model's execution accuracy.
+- *SQAM accuracy:* Similar to TSED, we can used this to estimate the output query's execution accuracy (also see [this](https://dl.acm.org/doi/abs/10.1145/3639477.3639732) paper).
+- *BLEU score:* This helps us measure the similarity between the output query and the target query.
+The following were the obtained results over the testing set (14838 records):
+- Exact Match = 35.98 %
+- TSED score: 0.897
+- SQAM score: 74.31 %
+- BLEU score: 0.762
 ### Framework versions
 - Transformers 4.44.2
 - Pytorch 2.4.0
 - Datasets 3.0.0
+- Tokenizers 0.19.1