FlexiDepth-Llama-3-8B-Instruct

This model is the implementation of the paper Adaptive Layer-skipping in Pre-trained LLMs. Explore its layer-skipping patterns at xuan-luo/FlexiPatterns-Llama-3-8B-Instruct.

Model Details

Model Description

FlexiDepth-Llama-3-8B-Instruct is an enhanced version of Llama-3-8B-Instruct that incorporates the FlexiDepth method to enable adaptive layer-skipping during text generation. This approach reveals distinct layer-allocation patterns, showing how computational demand varies across tokens. The token depth map visualization (see below) shows that summarization typically requires more layers than extractive question answering, and that in mathematical reasoning tasks such as addition, tokens on the left-hand side of an equation use fewer layers than those on the right. For further examples, refer to the dataset at steven2521/FlexiPatterns-Llama-3-8B-Instruct.

[Figure: FlexiDepth banner with token depth map]
  • Developed by: Xuan Luo, Weizhi Wang, Xifeng Yan
  • Model type: Causal Language Model with adaptive layer-skipping
  • Language(s) (NLP): English (en)
  • License: Apache-2.0
  • Finetuned from model: meta-llama/Meta-Llama-3-8B-Instruct
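
To make the mechanism concrete, below is a minimal, hypothetical sketch of per-token gated layer-skipping in PyTorch. It is illustrative only and does not reproduce the actual FlexiDepth code; the SkippableLayer wrapper and its router are assumptions for exposition.

import torch
import torch.nn as nn


class SkippableLayer(nn.Module):
    """Hypothetical wrapper: a tiny router decides, per token, whether to
    run the wrapped transformer layer or pass the hidden state through.
    A sketch for exposition, not the FlexiDepth implementation."""

    def __init__(self, layer: nn.Module, hidden_size: int):
        super().__init__()
        self.layer = layer
        self.router = nn.Linear(hidden_size, 1)  # per-token "execute" logit

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # gate: (batch, seq_len, 1), probability of executing the layer
        gate = torch.sigmoid(self.router(hidden_states))
        executed = self.layer(hidden_states)
        # Soft mixture for illustration; at inference, tokens whose gate
        # falls below a threshold would skip the layer's computation entirely.
        return gate * executed + (1.0 - gate) * hidden_states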

Get the number of layers used when generating different tokens

import torch
import transformers
from transformers.generation.streamers import BaseStreamer


class TokenStreamer(BaseStreamer):
    """
    Simple token streamer that prints each token with its corresponding layers used.
    
    Parameters:
        tokenizer (`AutoTokenizer`):
            The tokenizer used to decode the tokens.
        skip_prompt (`bool`, *optional*, defaults to `True`):
            Whether to skip the prompt tokens in the output. Useful for chatbots.
    """

    def __init__(self, tokenizer, skip_prompt=True):
        self.tokenizer = tokenizer
        self.skip_prompt = skip_prompt
        self.next_tokens_are_prompt = True

    def put(self, value):
        """
        Receives tokens and prints each one surrounded by brackets.
        """
        if len(value.shape) > 1 and value.shape[0] > 1:
            raise ValueError("TokenStreamer only supports batch size 1")
        elif len(value.shape) > 1:
            value = value[0]

        if self.skip_prompt and self.next_tokens_are_prompt:
            self.next_tokens_are_prompt = False
            return

        # Process each token in the received tensor
        for token_id in value.tolist():
            token_text = self.tokenizer.decode([token_id])
            print(f"={repr(token_text)}", end="\n", flush=True)

    def end(self):
        """Resets the prompt flag and prints a trailing newline at the end of generation."""
        self.next_tokens_are_prompt = True
        print()



model_id = "xuan-luo/FlexiDepth-Llama-3-8B-Instruct"

# Load the tokenizer and model; trust_remote_code pulls in the custom FlexiDepth code
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

# The model and tokenizer are already loaded above, so no extra kwargs are needed here
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

messages = [
    {
        "role": "user",
        "content": (
            "Jan has three times the number of pets as Marcia. Marcia has "
            "two more pets than Cindy. If Cindy has four pets, how many "
            "total pets do the three have?"
        ),
    },
]

# Stop on either the default EOS token or Llama 3's end-of-turn token
terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]


streamer = TokenStreamer(tokenizer)
outputs = pipeline(
    messages,
    max_new_tokens=512,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=1.0,
    streamer=streamer,
)
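
Equivalently, the streamer can be used with the standard generate API instead of the pipeline wrapper. This is a sketch; the generation settings mirror the pipeline call above.

prompt_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

_ = model.generate(
    prompt_ids,
    max_new_tokens=512,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=1.0,
    streamer=TokenStreamer(tokenizer),
)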

Evaluation

The performance of FlexiDepth-Llama-3-8B-Instruct was evaluated using the lm_eval framework (version 0.4.8) and compared against the original Llama-3-8B-Instruct model. Below are the results for both models across multiple benchmarks, including metric scores and, for FlexiDepth, the average number of layers used.
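
The harness can be driven from Python as well as the command line. The following is a sketch of how such a run might look, assuming lm_eval's standard simple_evaluate entry point and task names; the authors' exact configuration may differ.

import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=xuan-luo/FlexiDepth-Llama-3-8B-Instruct,"
        "trust_remote_code=True,dtype=bfloat16"
    ),
    tasks=["mmlu", "hellaswag", "winogrande", "gsm8k"],  # 5-shot tasks
    num_fewshot=5,
)
print(results["results"])

The 0-shot benchmarks (HumanEval, CoQA) would be run the same way with num_fewshot=0.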

FlexiDepth-Llama-3-8B-Instruct

| Benchmark  | Shots | Metric       | Score  | Avg. Layers |
|------------|-------|--------------|--------|-------------|
| MMLU       | 5     | acc          | 0.6634 | 27.88       |
| Hellaswag  | 5     | acc_norm     | 0.7430 | 30.00       |
| Winogrande | 5     | acc          | 0.7556 | 28.03       |
| GSM8K      | 5     | strict-match | 0.6573 | 21.58       |
| HumanEval  | 0     | pass@1       | 0.3232 | 22.55       |
| CoQA       | 0     | f1           | 0.8028 | 24.56       |

Llama-3-8B-Instruct

| Benchmark  | Shots | Metric       | Score  | Layers |
|------------|-------|--------------|--------|--------|
| MMLU       | 5     | acc          | 0.6733 | 32     |
| Hellaswag  | 5     | acc_norm     | 0.7117 | 32     |
| Winogrande | 5     | acc          | 0.7427 | 32     |
| GSM8K      | 5     | strict-match | 0.6732 | 32     |
| HumanEval  | 0     | pass@1       | 0.2927 | 32     |
| CoQA       | 0     | f1           | 0.7846 | 32     |

These results show that FlexiDepth-Llama-3-8B-Instruct maintains comparable or improved performance on most benchmarks while using fewer layers on average.
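
As a quick sanity check on the layer savings, the averages from the table above work out as follows (a small sketch using only the reported numbers):

full_depth = 32  # Llama-3-8B-Instruct always executes all 32 layers
avg_layers = {
    "MMLU": 27.88, "Hellaswag": 30.00, "Winogrande": 28.03,
    "GSM8K": 21.58, "HumanEval": 22.55, "CoQA": 24.56,
}

for task, layers in avg_layers.items():
    print(f"{task}: {layers:.2f}/{full_depth} layers ({layers / full_depth:.1%})")

mean = sum(avg_layers.values()) / len(avg_layers)
print(f"Mean across benchmarks: {mean:.2f} layers ({mean / full_depth:.1%})")

This comes out to roughly 25.8 layers on average, about 80% of the full depth, with the largest savings on GSM8K and HumanEval.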

Model Card Authors

Xuan Luo, Weizhi Wang, Xifeng Yan

Model Card Contact

For questions or inquiries, please contact [email protected].
