---
license: mit
datasets:
- CreitinGameplays/Raiden-DeepSeek-R1-llama3.1
language:
- en
base_model:
- meta-llama/Llama-3.1-8B-Instruct
pipeline_tag: text-generation
library_name: transformers
---

## Llama 3.1 8B R1 v0.1

![Llama](https://autumn.revolt.chat/attachments/Dpj0Up0lYE2-BVOQRTDXeLk5xa7EE0WxBntXqgJGAo/DALL%C2%B7E%202025-02-19%2010.03.42%20-%20A%20futuristic%20robotic%20white%20llama%20with%20sleek%20metallic%20plating%20and%20glowing%20blue%20eyes.%20The%20llama%20has%20intricate%20mechanical%20joints%20and%20a%20high-tech%20design.%20.png)

Fine-tuning took **28 hours** on **2x Nvidia RTX A6000** GPUs with the following settings:

- Batch size: 8
- Gradient accumulation steps: 1
- Epochs: 2
- Learning rate: 1e-4
- Warmup ratio: 0.1

Run the model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer, BitsAndBytesConfig
import bitsandbytes

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True
)

model_id = "CreitinGameplays/Llama-3.1-8B-R1-v0.1"

# Initialize model and tokenizer with streaming support
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=quantization_config
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Custom streamer that collects the output into a string while streaming
class CollectingStreamer(TextStreamer):
    def __init__(self, tokenizer):
        # skip_prompt=True so only newly generated text is printed and collected
        super().__init__(tokenizer, skip_prompt=True)
        self.output = ""

    def on_finalized_text(self, text: str, stream_end: bool = False):
        self.output += text
        # Let TextStreamer print the text as it is generated
        super().on_finalized_text(text, stream_end=stream_end)

print("Chat session started. Type 'exit' to quit.\n")

# Initialize chat history as a list of messages
chat_history = []
chat_history.append({"role": "system", "content": "You are an AI assistant made by Meta AI."})

while True:
    user_input = input("You: ")
    if user_input.strip().lower() == "exit":
        break

    # Append the user message to the chat history
    chat_history.append({"role": "user", "content": user_input})

    # Prepare the prompt by formatting the complete chat history
    inputs = tokenizer.apply_chat_template(
        chat_history,
        add_generation_prompt=True,
        return_tensors="pt"
    ).to(model.device)

    # Create a new streamer for the current generation
    streamer = CollectingStreamer(tokenizer)

    # Generate the streamed response
    model.generate(
        inputs,
        streamer=streamer,
        temperature=0.6,
        top_p=0.9,
        top_k=50,
        repetition_penalty=1.1,
        max_new_tokens=6112,
        do_sample=True
    )

    # The complete response text is collected in streamer.output
    response_text = streamer.output
    print("\nAssistant:", response_text)

    # Append the assistant response to the chat history
    chat_history.append({"role": "assistant", "content": response_text})
```

### Current Limitations

The model may not output the final response after the reasoning step.
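Because the final answer can be missing after the reasoning step, it may help to separate the two parts before appending the turn to the chat history. The sketch below assumes the model wraps its reasoning in DeepSeek-R1-style `<think>...</think>` tags (an assumption based on the training dataset, not confirmed by this card); adjust the markers if your outputs use a different format.

```python
import re

def split_reasoning(response_text: str):
    """Split a raw model response into (reasoning, final_answer).

    Assumes DeepSeek-R1-style <think>...</think> tags around the reasoning.
    If the closing tag is missing, the whole response is treated as reasoning
    and the final answer is returned as an empty string.
    """
    match = re.search(r"<think>(.*?)</think>", response_text, flags=re.DOTALL)
    if match:
        reasoning = match.group(1).strip()
        final_answer = response_text[match.end():].strip()
        return reasoning, final_answer
    # No closing tag: the model likely stopped before producing the final answer
    return response_text.strip(), ""

# Example usage with the chat loop above:
# reasoning, answer = split_reasoning(streamer.output)
# if not answer:
#     print("[No final answer produced; consider re-generating or raising max_new_tokens.]")
```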
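Separately from the reasoning issue, if 8-bit loading with FP32 CPU offload still does not fit your GPU, a 4-bit NF4 configuration via bitsandbytes is a common lower-memory alternative. This is a minimal sketch, not part of the original card; actual memory savings and quality impact depend on your hardware.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "CreitinGameplays/Llama-3.1-8B-R1-v0.1"

# 4-bit NF4 quantization: roughly halves weight memory compared to 8-bit
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=quantization_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```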