|
--- |
|
language: en |
|
license: mit |
|
pipeline_tag: text-generation |
|
tags: |
|
- model_hub_mixin |
|
- pytorch_model_hub_mixin |
|
datasets:
- HuggingFaceFW/fineweb-edu
|
--- |
|
|
|
# DAT-sa8-ra8-nr32-ns1024-sh8-nkvh4-343M |
|
|
|
|
|
|
This is a Dual-Attention Transformer language model trained on the `fineweb-edu` dataset. The model has 344M parameters.
|
|
|
|
|
## Model Details |
|
|
|
| Size | Training Tokens | Layers | Model Dimension | Self-Attention Heads | Relational Attention Heads | Relation Dimension | Context Length |
|---|---|---|---|---|---|---|---|
| 344M | 10B | 24 | 1024 | 8 | 8 | 32 | 1024 |
|
|
|
|
|
### Model Description |
|
|
|
- **Developed by:** Awni Altabaa, John Lafferty |
|
- **Model type:** Decoder-only Dual Attention Transformer |
|
- **Tokenizer:** GPT-2 BPE tokenizer |
|
- **Language(s):** English |
|
<!-- - **License:** MIT --> |
|
<!-- - **Contact:** [email protected] --> |
|
- **Date:** August 2024
|
|
|
### Model Sources |
|
|
|
- **Repository:** https://github.com/Awni00/abstract_transformer |
|
- **Paper:** [Disentangling and Integrating Relational and Sensory Information in Transformer Architectures](https://arxiv.org/abs/2405.16727) |
|
- **Hugging Face Collection:** [Dual Attention Transformer Collection](https://huggingface.co/collections/awni00/dual-attention-transformer-66c23425a545b0cefe4b9489)
|
|
|
|
|
## Model Usage |
|
|
|
Use the code below to get started with the model. First, install the `dual-attention` [Python package hosted on PyPI](https://pypi.org/project/dual-attention/) via `pip install dual-attention`.
|
|
|
To load the model directly from the Hugging Face Hub, use the `DualAttnTransformerLM_HFHub` wrapper:
|
```python
from dual_attention.hf import DualAttnTransformerLM_HFHub

# Download the pretrained weights from the Hub and instantiate the model
model = DualAttnTransformerLM_HFHub.from_pretrained('awni00/DAT-sa8-ra8-nr32-ns1024-sh8-nkvh4-343M')
```
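
Once loaded, the model can be used for text generation. The snippet below is a minimal greedy-decoding sketch, not the package's official generation API: it assumes the loaded object is a standard PyTorch module that maps GPT-2 token IDs of shape `(batch, seq)` to next-token logits, and it uses the GPT-2 BPE tokenizer noted above. Consult the `dual-attention` documentation for the exact interface.

```python
import torch
from transformers import GPT2TokenizerFast
from dual_attention.hf import DualAttnTransformerLM_HFHub

model = DualAttnTransformerLM_HFHub.from_pretrained('awni00/DAT-sa8-ra8-nr32-ns1024-sh8-nkvh4-343M')
model.eval()

# Model card states the GPT-2 BPE tokenizer was used for training
tokenizer = GPT2TokenizerFast.from_pretrained('gpt2')
ids = tokenizer('The theory of relativity states that', return_tensors='pt').input_ids

# Greedy decoding loop; the input is truncated to the 1024-token context length.
with torch.no_grad():
    for _ in range(50):
        out = model(ids[:, -1024:])       # assumption: forward pass returns next-token logits
        logits = out[0] if isinstance(out, tuple) else out
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=1)

print(tokenizer.decode(ids[0]))
```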
|
|
|
## Training Details |
|
|
|
The model was trained using the following setup (a rough PyTorch sketch of the optimizer configuration follows the list):
|
- **Architecture:** Decoder-only Dual Attention Transformer |
|
- **Framework:** PyTorch |
|
- **Optimizer:** AdamW |
|
- **Learning Rate:** 6e-4 (peak) |
|
- **Weight Decay:** 0.1 |
|
- **Batch Size:** 524,288 tokens
|
- **Sequence Length:** 1024 tokens |
|
- **Total Training Tokens:** 10B tokens
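
As a rough illustration of these hyperparameters, the sketch below configures an AdamW optimizer in plain PyTorch and works out the sequences-per-step implied by the token batch size. It is an assumption-laden stand-in (placeholder module, no learning-rate schedule, warmup, or gradient accumulation shown); the actual training code lives in the linked repository.

```python
import torch
import torch.nn as nn

# Placeholder module standing in for the DAT-LM; only the hyperparameters listed
# above come from this model card, everything else is illustrative.
model = nn.Linear(1024, 1024)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=6e-4,           # peak learning rate from the list above
    weight_decay=0.1,  # weight decay from the list above
)

# 524,288 tokens per batch at sequence length 1024 implies 512 sequences per
# optimizer step, typically realized via gradient accumulation over micro-batches.
tokens_per_batch = 524_288
seq_len = 1024
print(tokens_per_batch // seq_len)  # 512
```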
|
|
|
For more detailed training information, please refer to the paper. |
|
|
|
## Evaluation |
|
|
|
See the [paper](https://arxiv.org/abs/2405.16727) for evaluation results.
|
|
|
|
|
## Model Interpretability Analysis |
|
|
|
The [DAT-LM-Visualization app](https://huggingface.co/spaces/awni00/DAT-LM-Visualization/) visualizes the representations learned by a Dual Attention Transformer language model. It is hosted on Hugging Face Spaces using the free CPU tier. You can select a pre-trained DAT-LM model, enter a prompt, and visualize the internal representations in different parts of the model. You can also run the app locally (e.g., to use your own GPU) via the PyPI package.
|
|
|
For further analysis, see the paper.
|
|
|
## Citation |
|
|
|
```bibtex
@misc{altabaa2024disentanglingintegratingrelationalsensory,
      title={Disentangling and Integrating Relational and Sensory Information in Transformer Architectures},
      author={Awni Altabaa and John Lafferty},
      year={2024},
      eprint={2405.16727},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2405.16727},
}
```