---
license: mit
datasets:
- CreitinGameplays/Raiden-DeepSeek-R1-llama3.1
language:
- en
base_model:
- meta-llama/Llama-3.1-8B-Instruct
pipeline_tag: text-generation
library_name: transformers
---

## Llama 3.1 8B R1 v0.1

![Llama](https://autumn.revolt.chat/attachments/Dpj0Up0lYE2-BVOQRTDXeLk5xa7EE0WxBntXqgJGAo/DALL%C2%B7E%202025-02-19%2010.03.42%20-%20A%20futuristic%20robotic%20white%20llama%20with%20sleek%20metallic%20plating%20and%20glowing%20blue%20eyes.%20The%20llama%20has%20intricate%20mechanical%20joints%20and%20a%20high-tech%20design.%20.png)

Fine-tuning took **28 hours** on **2x Nvidia RTX A6000** GPUs with the following settings:

- Batch size: 8
- Gradient accumulation steps: 1
- Epochs: 2
- Learning rate: 1e-4
- Warmup ratio: 0.1

Run the model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer, BitsAndBytesConfig
import bitsandbytes

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True
)

model_id = "CreitinGameplays/Llama-3.1-8B-R1-v0.1"

# Initialize model and tokenizer with streaming support
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=quantization_config
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Custom streamer that collects the output into a string while streaming
class CollectingStreamer(TextStreamer):
    def __init__(self, tokenizer):
        # skip_prompt=True so only newly generated text is printed and collected
        super().__init__(tokenizer, skip_prompt=True)
        self.output = ""

    def on_finalized_text(self, text: str, stream_end: bool = False):
        self.output += text
        # Let TextStreamer print the text as it is generated
        super().on_finalized_text(text, stream_end=stream_end)

print("Chat session started. Type 'exit' to quit.\n")

# Initialize chat history as a list of messages
chat_history = []
chat_history.append({"role": "system", "content": "You are an AI assistant made by Meta AI."})

while True:
    user_input = input("You: ")
    if user_input.strip().lower() == "exit":
        break

    # Append the user message to the chat history
    chat_history.append({"role": "user", "content": user_input})

    # Prepare the prompt by formatting the complete chat history
    inputs = tokenizer.apply_chat_template(
        chat_history,
        add_generation_prompt=True,
        return_tensors="pt"
    ).to(model.device)

    # Create a new streamer for the current generation
    streamer = CollectingStreamer(tokenizer)

    # Generate the streamed response
    model.generate(
        inputs,
        streamer=streamer,
        temperature=0.6,
        top_p=0.9,
        top_k=50,
        repetition_penalty=1.1,
        max_new_tokens=6112,
        do_sample=True
    )

    # The complete response text is collected in streamer.output
    response_text = streamer.output
    print("\nAssistant:", response_text)

    # Append the assistant response to the chat history
    chat_history.append({"role": "assistant", "content": response_text})
```

### Current Limitations

The model may not output the final response after the reasoning step.
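Because the final answer can be missing after the reasoning step, it may help to separate the two parts before appending the turn to the chat history. The sketch below assumes the model wraps its reasoning in DeepSeek-R1-style `<think>...</think>` tags (an assumption based on the training dataset, not confirmed by this card); adjust the markers if your outputs use a different format.

```python
import re

def split_reasoning(response_text: str):
    """Split a raw model response into (reasoning, final_answer).

    Assumes DeepSeek-R1-style <think>...</think> tags around the reasoning.
    If the closing tag is missing, the whole response is treated as reasoning
    and the final answer is returned as an empty string.
    """
    match = re.search(r"<think>(.*?)</think>", response_text, flags=re.DOTALL)
    if match:
        reasoning = match.group(1).strip()
        final_answer = response_text[match.end():].strip()
        return reasoning, final_answer
    # No closing tag: the model likely stopped before producing the final answer
    return response_text.strip(), ""

# Example usage with the chat loop above:
# reasoning, answer = split_reasoning(streamer.output)
# if not answer:
#     print("[No final answer produced; consider re-generating or raising max_new_tokens.]")
```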
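Separately from the reasoning issue, if 8-bit loading with FP32 CPU offload still does not fit your GPU, a 4-bit NF4 configuration via bitsandbytes is a common lower-memory alternative. This is a minimal sketch, not part of the original card; actual memory savings and quality impact depend on your hardware.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "CreitinGameplays/Llama-3.1-8B-R1-v0.1"

# 4-bit NF4 quantization: roughly halves weight memory compared to 8-bit
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=quantization_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```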