---
license: apache-2.0
language:
- en
library_name: transformers
base_model:
- Qwen/Qwen2.5-1.5B-Instruct
pipeline_tag: text-generation
model-index:
- name: Bellatrix-1.5B-xElite
results:
- task:
type: text-generation
name: Text Generation
dataset:
name: IFEval (0-Shot)
type: wis-k/instruction-following-eval
split: train
args:
num_few_shot: 0
metrics:
- type: inst_level_strict_acc and prompt_level_strict_acc
value: 19.64
name: averaged accuracy
source:
url: >-
https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=prithivMLmods%2FBellatrix-1.5B-xElite
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: BBH (3-Shot)
type: SaylorTwift/bbh
split: test
args:
num_few_shot: 3
metrics:
- type: acc_norm
value: 9.49
name: normalized accuracy
source:
url: >-
https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=prithivMLmods%2FBellatrix-1.5B-xElite
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: MATH Lvl 5 (4-Shot)
type: lighteval/MATH-Hard
split: test
args:
num_few_shot: 4
metrics:
- type: exact_match
value: 12.61
name: exact match
source:
url: >-
https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=prithivMLmods%2FBellatrix-1.5B-xElite
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: GPQA (0-shot)
type: Idavidrein/gpqa
split: train
args:
num_few_shot: 0
metrics:
- type: acc_norm
value: 3.8
name: acc_norm
source:
url: >-
https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=prithivMLmods%2FBellatrix-1.5B-xElite
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: MuSR (0-shot)
type: TAUR-Lab/MuSR
args:
num_few_shot: 0
metrics:
- type: acc_norm
value: 4.44
name: acc_norm
source:
url: >-
https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=prithivMLmods%2FBellatrix-1.5B-xElite
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: MMLU-PRO (5-shot)
type: TIGER-Lab/MMLU-Pro
config: main
split: test
args:
num_few_shot: 5
metrics:
- type: acc
value: 7.3
name: accuracy
source:
url: >-
https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=prithivMLmods%2FBellatrix-1.5B-xElite
name: Open LLM Leaderboard
tags:
- qwen
- qwq
---
<pre align="center">
____ ____ __ __ __ ____ ____ ____ _ _
( _ \( ___)( ) ( ) /__\ (_ _)( _ \(_ _)( \/ )
) _ < )__) )(__ )(__ /(__)\ )( ) / _)(_ ) (
(____/(____)(____)(____)(__)(__)(__) (_)\_)(____)(_/\_)
</pre>
# **Bellatrix-1.5B-xElite**
Bellatrix-1.5B-xElite is a reasoning-oriented model built around QWQ synthetic dataset entries. Its instruction-tuned, text-only variants are optimized for multilingual dialogue use cases, including agentic retrieval and summarization tasks, and outperform many of the available open-source options. Bellatrix is an auto-regressive language model based on an optimized transformer architecture; the tuned versions use supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF).
# **Quickstart with Transformers**
The code snippet below uses `apply_chat_template` to show how to load the tokenizer and model and how to generate content.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "prithivMLmods/Bellatrix-1.5B-xElite"
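# Load the model in its saved precision and place it on available devices automatically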
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype="auto",
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
prompt = "Give me a short introduction to large language model."
messages = [
{"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
{"role": "user", "content": prompt}
]
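# Format the conversation with the chat template and append the assistant generation prompt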
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(
**model_inputs,
max_new_tokens=512
)
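# Drop the prompt tokens so that only the newly generated response is decoded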
generated_ids = [
output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
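print(response)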
```
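For interactive or latency-sensitive use, the standard `transformers` streaming API can surface tokens as they are generated. The sketch below is illustrative: the sampling values (temperature, top-p) and the example prompt are assumptions, not settings recommended by this model card.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model_name = "prithivMLmods/Bellatrix-1.5B-xElite"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

messages = [
    {"role": "user", "content": "Explain supervised fine-tuning in two sentences."}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Stream tokens to stdout as they are produced; sampling values here are illustrative.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
_ = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    streamer=streamer,
)
```

Streaming only changes when tokens become visible, not what is generated.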
# **Intended Use:**
1. **Multilingual Dialogue Systems:**
- Designed for conversational AI applications, capable of handling dialogue across multiple languages.
- Useful in customer service, chatbots, and other dialogue-centric use cases.
2. **Reasoning and QWQ Dataset Applications:**
- Optimized for tasks requiring logical reasoning and contextual understanding, particularly in synthetic datasets like QWQ.
3. **Agentic Retrieval:**
- Supports retrieval-augmented generation tasks, helping systems fetch and synthesize information effectively.
4. **Summarization Tasks:**
- Excels at summarizing long or complex text while maintaining coherence and relevance; a minimal usage sketch follows this list.
5. **Instruction-Following Tasks:**
- Can execute tasks based on specific user instructions due to instruction-tuning during training.
6. **Language Generation:**
- Suitable for generating coherent and contextually relevant text in various domains and styles.
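
As a minimal sketch of the summarization use case above (the placeholder document, system prompt, and token budget are illustrative assumptions, not part of this model card), the same chat-template flow from the quickstart applies:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "prithivMLmods/Bellatrix-1.5B-xElite"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

document = "..."  # replace with the text to summarize
messages = [
    {"role": "system", "content": "You are a helpful assistant that writes concise, faithful summaries."},
    {"role": "user", "content": f"Summarize the following text in three bullet points:\n\n{document}"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Decode only the newly generated tokens, skipping the prompt.
output_ids = model.generate(**inputs, max_new_tokens=256)
summary = tokenizer.decode(output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(summary)
```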
# **Limitations:**
1. **Synthetic Dataset Bias:**
- Optimization for QWQ and similar datasets may make the model less effective on real-world or less structured data.
2. **Data Dependency:**
- Performance may degrade on tasks or languages not well-represented in the training dataset.
3. **Computational Requirements:**
- The optimized transformer architecture may demand significant computational resources, especially for fine-tuning or large-scale deployments; a quantized-loading sketch follows this list.
4. **Potential Hallucinations:**
- Like most auto-regressive models, it may generate plausible-sounding but factually incorrect or nonsensical outputs.
5. **RLHF-Specific Biases:**
- Reinforcement learning from human feedback (RLHF) can introduce biases reflecting the preferences of the annotators involved in the feedback process.
6. **Limited Domain Adaptability:**
- While effective in reasoning and dialogue tasks, it may struggle with highly specialized domains or out-of-distribution tasks.
7. **Multilingual Limitations:**
- Although optimized for multilingual use, certain low-resource languages may exhibit poorer performance compared to high-resource ones.
8. **Ethical Concerns:**
- May inadvertently generate inappropriate or harmful content if safeguards are not applied, particularly in sensitive applications.
9. **Real-Time Usability:**
- Inference latency could limit its effectiveness in real-time applications or when scaling to large user bases.
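
To reduce the memory footprint noted in limitation 3, one common option is 4-bit quantized loading via `bitsandbytes`. This is a sketch assuming `bitsandbytes` is installed and a CUDA GPU is available; it is not an official recommendation from this model card, and quantization typically costs some accuracy.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "prithivMLmods/Bellatrix-1.5B-xElite"

# NF4 4-bit quantization trades a small amount of quality for a much smaller memory footprint.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```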
# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard)
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/prithivMLmods__Bellatrix-1.5B-xElite-details)!
Summarized results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/contents/viewer/default/train?q=prithivMLmods%2FBellatrix-1.5B-xElite&sort[column]=Average%20%E2%AC%86%EF%B8%8F&sort[direction]=desc)!
| Metric |Value (%)|
|-------------------|--------:|
|**Average** | 9.55|
|IFEval (0-Shot) | 19.64|
|BBH (3-Shot) | 9.49|
|MATH Lvl 5 (4-Shot)| 12.61|
|GPQA (0-shot) | 3.80|
|MuSR (0-shot) | 4.44|
|MMLU-PRO (5-shot) | 7.30|