awni00 committed · Commit 4d14164 · verified · 1 Parent(s): b448ce4

Upload README.md with huggingface_hub

Files changed (1): README.md (+101, -3)
README.md CHANGED
@@ -1,9 +1,107 @@
  ---
  tags:
  - model_hub_mixin
  - pytorch_model_hub_mixin
  ---

- This model has been pushed to the Hub using the [PytorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration:
- - Library: [More Information Needed]
- - Docs: [More Information Needed]

---
language: en
license: mit
pipeline_tag: text-generation
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
dataset: HuggingFaceFW/fineweb-edu
---

# DAT-sa8-ra8-nr32-ns1024-sh8-nkvh4-343M

<!-- Provide a quick summary of what the model is/does. -->

This is a Dual-Attention Transformer language model trained on the `fineweb-edu` dataset. It has 344M parameters.


## Model Details

| Size | Training Tokens | Layers | Model Dimension | Self-Attention Heads | Relational Attention Heads | Relation Dimension | Context Length |
|--|--|--|--|--|--|--|--|
| 344M | 10B | 24 | 1024 | 8 | 8 | 32 | 1024 |
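
As a quick sanity check on the parameter count in the table, you can load the model and count its parameters. This is a minimal sketch assuming the HFHub wrapper (installed via the `dual-attention` package, see Model Usage below) behaves like a standard `torch.nn.Module`.

```python
# Minimal sketch: verify the parameter count listed above.
# Assumes `pip install dual-attention` (see Model Usage below) and that the
# loaded wrapper behaves like a standard torch.nn.Module.
from dual_attention.hf import DualAttnTransformerLM_HFHub

dat_lm = DualAttnTransformerLM_HFHub.from_pretrained(
    'awni00/DAT-sa8-ra8-nr32-ns1024-sh8-nkvh4-343M')
n_params = sum(p.numel() for p in dat_lm.parameters())
print(f'{n_params / 1e6:.0f}M parameters')  # expected to be roughly 344M
```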


### Model Description

- **Developed by:** Awni Altabaa, John Lafferty
- **Model type:** Decoder-only Dual Attention Transformer
- **Tokenizer:** GPT-2 BPE tokenizer
- **Language(s):** English
<!-- - **License:** MIT -->
<!-- - **Contact:** [email protected] -->
- **Date:** 21, 2024

### Model Sources

- **Repository:** https://github.com/Awni00/abstract_transformer
- **Paper:** [Disentangling and Integrating Relational and Sensory Information in Transformer Architectures](https://arxiv.org/abs/2405.16727)
- **Huggingface Collection:** [Dual Attention Transformer Collection](https://huggingface.co/collections/awni00/dual-attention-transformer-66c23425a545b0cefe4b9489)


## Model Usage

Use the code below to get started with the model. First, install the `dual-attention` [python package hosted on PyPI](https://pypi.org/project/dual-attention/) via `pip install dual-attention`.

To load the model directly from the Huggingface Hub, use the HFHub wrapper:
```python
from dual_attention.hf import DualAttnTransformerLM_HFHub

model = DualAttnTransformerLM_HFHub.from_pretrained('awni00/DAT-sa8-ra8-nr32-ns1024-sh8-nkvh4-343M')
```
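
Since the model uses the GPT-2 BPE tokenizer, you can then tokenize a prompt and run a forward pass. The snippet below is a minimal sketch, not taken from the package documentation: it assumes the loaded model acts as a standard decoder-only language model mapping a `(batch, seq_len)` tensor of token ids to next-token logits; consult the `dual-attention` docs for the exact interface.

```python
# Minimal sketch (assumed interface): tokenize a prompt with the GPT-2 BPE
# tokenizer and inspect the next-token prediction. Assumes the model maps
# token ids of shape (batch, seq_len) to logits of shape
# (batch, seq_len, vocab_size); see the dual-attention docs for the exact API.
import tiktoken
import torch

enc = tiktoken.get_encoding('gpt2')
prompt_ids = torch.tensor([enc.encode('The capital of France is')])

model.eval()
with torch.no_grad():
    logits = model(prompt_ids)       # assumed output: (1, seq_len, vocab_size)
    if isinstance(logits, tuple):    # some LM implementations also return a loss
        logits = logits[0]

next_token = logits[0, -1].argmax().item()
print(enc.decode([next_token]))
```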

Alternatively, you can download the PyTorch checkpoint, which contains the model config and state dict. To download it, run:

```bash
wget https://huggingface.co/awni00/DAT-sa8-ra8-nr32-ns1024-sh8-nkvh4-343M/resolve/main/pytorch_checkpoint.pt
```

Then, you can load the model weights via:

```python
import torch

from dual_attention.language_models import DualAttnTransformerLM

# Path to the checkpoint downloaded above
ckpt_path = 'pytorch_checkpoint.pt'

ckpt = torch.load(ckpt_path)
model_config = ckpt['config']
model_state_dict = ckpt['model']

model = DualAttnTransformerLM(**model_config)
model.load_state_dict(model_state_dict)
```
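
From here, a simple greedy decoding loop looks roughly like the following. This is an illustrative sketch under the same assumptions as above (the model returns next-token logits over the GPT-2 vocabulary); it is not the package's built-in generation utility.

```python
# Illustrative greedy decoding loop (not the package's own generation API).
# Assumes the model maps (batch, seq_len) token ids to (batch, seq_len, vocab)
# logits and was trained with a 1024-token context window.
import tiktoken
import torch

enc = tiktoken.get_encoding('gpt2')
ids = torch.tensor([enc.encode('Relational reasoning is')])

model.eval()
with torch.no_grad():
    for _ in range(50):
        logits = model(ids[:, -1024:])   # truncate to the context length
        if isinstance(logits, tuple):
            logits = logits[0]
        next_id = logits[0, -1].argmax().reshape(1, 1)
        ids = torch.cat([ids, next_id], dim=1)

print(enc.decode(ids[0].tolist()))
```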

## Training Details

The model was trained using the following setup:
- **Architecture:** Decoder-only Dual Attention Transformer
- **Framework:** PyTorch
- **Optimizer:** AdamW
- **Learning Rate:** 6e-4 (peak)
- **Weight Decay:** 0.1
- **Batch Size:** 524,288 tokens
- **Sequence Length:** 1024 tokens
- **Total Training Tokens:** 10B tokens

For more detailed training information, please refer to the paper.
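
For illustration only, the optimizer and batch-size settings above correspond roughly to the following PyTorch setup. The actual learning-rate schedule, micro-batch size, and distributed configuration are described in the paper and repository; the micro-batch size below is a hypothetical placeholder used only to show the gradient-accumulation arithmetic.

```python
# Illustrative only: how the settings above map onto a PyTorch optimizer and
# gradient accumulation. The real schedule and micro-batch size are in the
# paper/repository; the value marked "assumed" is a placeholder.
import torch

seq_len = 1024                 # sequence length (from the table above)
tokens_per_step = 524_288      # effective batch size in tokens
micro_batch_size = 16          # assumed per-device micro-batch (placeholder)

sequences_per_step = tokens_per_step // seq_len          # 512 sequences
accum_steps = sequences_per_step // micro_batch_size     # 32 micro-batches

optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4, weight_decay=0.1)
print(f'{sequences_per_step} sequences/step, {accum_steps} accumulation steps')
```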

## Evaluation

See the paper for evaluation results.


## Model Interpretability Analysis

The [DAT-LM-Visualization app](https://huggingface.co/spaces/awni00/DAT-LM-Visualization/) is built to visualize the representations learned in a Dual Attention Transformer language model. It is hosted on Huggingface Spaces using their free CPU resources. You can select a pre-trained DAT-LM model, enter a prompt, and visualize the internal representations in different parts of the model. You can also run the app locally (e.g., to use your own GPU) via the PyPI package.

See also the paper.

## Citation

```bibtex
@misc{altabaa2024disentanglingintegratingrelationalsensory,
  title={Disentangling and Integrating Relational and Sensory Information in Transformer Architectures},
  author={Awni Altabaa and John Lafferty},
  year={2024},
  eprint={2405.16727},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2405.16727},
}
```