---
language:
- en
tags:
- llama
- instruct
- instruction
- empirischtech
pipeline_tag: text-generation
base_model:
- meta-llama/Llama-3.1-8B-Instruct
license: llama3.1
---

# Llama-3.1-10B-Instruct model card

## Model Details

* **Developed by**: [EmpirischTech](https://empirischtech.at)/[ChaperoneAI](https://chaperoneai.net)
* **Backbone Model**: [Llama 3.1](https://github.com/meta-llama/llama3)
* **Language(s)**: English
* **Library**: [HuggingFace Transformers](https://github.com/huggingface/transformers)
* **License**: This model is under a **Non-commercial** Bespoke License and governed by the Meta license. You should only use this repository if you have been granted access to the model by filling out [this form](https://docs.google.com/forms/d/e/1FAIpQLSfqNECQnMkycAp2jP4Z9TFX0cGR4uf7b_fBxjY_OjhJILlKGA/viewform), but have either lost your copy of the weights or encountered issues converting them to the Transformers format
* **Where to send comments**: Instructions on how to provide feedback or comments on the model can be found by opening an issue in the [model repository's community tab](https://huggingface.co/empirischtech/Llama-3.1-10B-Instruct/discussions)
* **Contact**: For questions and comments about the model, please reach out via [contact-us](https://chaperoneai.net/contact)

## Training

Bigger models, more data, and better hardware have consistently improved deep learning performance. Whether in NLP or computer vision, larger models have led to major breakthroughs. However, most cutting-edge models are still trained from scratch, meaning they start with randomly initialized weights. The problem? Training costs are skyrocketing.

To address the escalating computational cost of training large-scale models, various approaches have been proposed. We present our results validating depth up-scaling (DUS), a method that combines depthwise scaling with continued pretraining. Unlike other LLM up-scaling approaches that rely on mixture-of-experts, DUS requires no complex modifications for efficient training and inference, making it a simple yet effective strategy for scaling high-performance LLMs from smaller models.

In this work, we take a step toward realizing such an approach. Specifically, we extend an existing **8B**-parameter model to **10B** parameters by initializing the additional layers with pretrained weights, followed by continued pretraining on a smaller dataset across multiple epochs. Due to budget constraints, we were unable to surpass the foundation model on the **EleutherAI** evaluation benchmark. However, the average scores are very close, demonstrating the potential for cost-efficient scaling strategies in large language model development.
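To make the depth up-scaling step concrete, the following is a minimal sketch, assuming the added decoder layers are initialized by duplicating a block of pretrained layers from the 8B backbone before continued pretraining. The duplicated-layer count (`n_extra = 8`), the choice to copy the last layers, and the output path are illustrative assumptions, not the exact recipe used for this model.

```python
# Minimal sketch of depthwise scaling (illustrative; not the exact recipe used here):
# deepen a pretrained 8B Llama model by duplicating decoder layers, then save the
# result as the starting point for continued pretraining.
import copy

import torch
from torch import nn
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",   # pretrained 8B backbone
    torch_dtype=torch.bfloat16,
)

layers = base.model.layers    # nn.ModuleList of the 32 decoder layers in the 8B model
n_extra = 8                   # hypothetical number of added layers (~10B parameters total)

# Initialize the added depth by duplicating the last n_extra pretrained layers.
extra = [copy.deepcopy(layers[i]) for i in range(len(layers) - n_extra, len(layers))]
base.model.layers = nn.ModuleList(list(layers) + extra)
base.config.num_hidden_layers = len(base.model.layers)

# Re-index attention modules so each layer addresses its own KV-cache slot
# (the attribute path may differ across transformers versions).
for idx, layer in enumerate(base.model.layers):
    layer.self_attn.layer_idx = idx

base.save_pretrained("llama-3.1-10b-init")  # continued pretraining starts from this checkpoint
```

Continued pretraining then proceeds on this up-scaled checkpoint with the standard causal-language-modeling objective; no routing or other architectural changes are required, which is what keeps DUS simpler than mixture-of-experts up-scaling.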
## Usage

- Tested on an A100 80GB GPU
- Our model can handle up to 128K (131,072) input tokens, as supported by the Llama-3.1 architecture.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model_id = "empirischtech/Llama-3.1-10B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,
)

prompt = (
    "### User:\nEmma feels perfectly fine, yet she still has an appointment "
    "at the hospital. What might be the reasons?\n\n### Assistant:\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
inputs.pop("token_type_ids", None)  # drop if present; the Llama tokenizer does not return it

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
output = model.generate(**inputs, streamer=streamer, use_cache=True, max_new_tokens=1024)
output_text = tokenizer.decode(output[0], skip_special_tokens=True)
```

## Hardware and Software

* **Hardware**: We utilized 8× A100 GPUs for training our model
* **Training Factors**: The model was pretrained using a combination of the [DeepSpeed library](https://github.com/microsoft/DeepSpeed) and the [HuggingFace Trainer](https://huggingface.co/docs/transformers/main_classes/trainer)

## Evaluation Results

### Harness Evaluation

- The performance evaluation is based on tasks from the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard). The model is evaluated on four benchmark tasks: `ARC-Challenge`, `HellaSwag`, `MMLU-Pro`, and `IFEval`, using the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) library.

#### Main Results

| Benchmark   | **Llama-3.1-8B-Instruct** | **Llama-3.1-10B-Instruct** |
|-------------|:-------------------------:|:--------------------------:|
| ARC         | 55.05                     | 52.47                      |
| HellaSwag   | 79.28                     | 77.08                      |
| MMLU-Pro    | 40.34                     | 33.59                      |
| IFEval      | 59.95                     | 54.80                      |
| **Average** | **58.66**                 | **54.49**                  |

#### Scripts to generate evaluation results

```python
# Install the harness first: pip install "lm-eval>=0.4.7"
# (see https://github.com/EleutherAI/lm-evaluation-harness)
import json

from lm_eval import evaluator

tasks_list = ["arc_challenge", "ifeval", "mmlu_pro", "hellaswag"]  # benchmark tasks
model_path = "empirischtech/Llama-3.1-10B-Instruct"

# Run evaluation
results = evaluator.simple_evaluate(
    model="hf",                          # Hugging Face model backend
    cache_requests=False,
    model_args=f"pretrained={model_path}",
    tasks=tasks_list,
    batch_size=4,
    device="cuda:0",
)

# Extract and serialize the per-task results
results = results["results"]
json_string = json.dumps(results, indent=4)
```

## Ethical Issues

### Ethical Considerations

- There were no ethical issues involved, as we did not include the benchmark test set or the training set in the model's training process.

## Contact Us

### Why Our LLMs?

- [EmpirischTech](https://empirischtech.at)/[ChaperoneAI](https://chaperoneai.net): Unlock the full potential of private LLMs for your business with ease. Customize and fine-tune them using your own data for a solution that fits your unique needs. Want a seamless integration? Let’s connect! ► [Get in touch](https://chaperoneai.net/contact)