OlympicCoder-7B GGUF Models

Choosing the Right Model Format

Selecting the correct model format depends on your hardware capabilities and memory constraints.

BF16 (Brain Float 16) – Use if BF16 acceleration is available

  • A 16-bit floating-point format designed for faster computation while retaining good precision.
  • Provides similar dynamic range as FP32 but with lower memory usage.
  • Recommended if your hardware supports BF16 acceleration (check your device’s specs, or run the quick check shown below).
  • Ideal for high-performance inference with reduced memory footprint compared to FP32.

📌 Use BF16 if:
✔ Your hardware has native BF16 support (e.g., newer GPUs, TPUs).
✔ You want higher precision while saving memory.
✔ You plan to requantize the model into another format.

📌 Avoid BF16 if:
❌ Your hardware does not support BF16 (it may fall back to FP32 and run slower).
❌ You need compatibility with older devices that lack BF16 optimization.
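
For a quick way to confirm BF16 support, here is a minimal sketch (assuming PyTorch with CUDA is installed; not specific to this model):

# pip install torch
import torch

# Rough check for native BF16 support on the current CUDA device.
if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
    print("BF16 acceleration available: the BF16 GGUF is a good fit")
else:
    print("No native BF16 support: prefer F16 or a quantized variant")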


F16 (Float 16) – More widely supported than BF16

  • A 16-bit floating-point format with high precision but a narrower range of values than BF16.
  • Works on most devices with FP16 acceleration support (including many GPUs and some CPUs).
  • Slightly lower numerical precision than BF16 but generally sufficient for inference.

📌 Use F16 if:
✔ Your hardware supports FP16 but not BF16.
✔ You need a balance between speed, memory usage, and accuracy.
✔ You are running on a GPU or another device optimized for FP16 computations.

📌 Avoid F16 if:
❌ Your device lacks native FP16 support (it may run slower than expected).
❌ You have memory limitations.


Quantized Models (Q4_K, Q6_K, Q8, etc.) – For CPU & Low-VRAM Inference

Quantization reduces model size and memory usage while maintaining as much accuracy as possible.

  • Lower-bit models (Q4_K) → Best for minimal memory usage, may have lower precision.
  • Higher-bit models (Q6_K, Q8_0) → Better accuracy, requires more memory.

📌 Use Quantized Models if:
✔ You are running inference on a CPU and need an optimized model (see the sketch after this list).
✔ Your device has low VRAM and cannot load full-precision models.
✔ You want to reduce memory footprint while keeping reasonable accuracy.

📌 Avoid Quantized Models if:
❌ You need maximum accuracy (full-precision models are better for this).
❌ Your hardware has enough VRAM for higher-precision formats (BF16/F16).
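
As an example of CPU inference with one of the quantized files, here is a minimal sketch. It assumes the llama-cpp-python package is installed and that OlympicCoder-7B-q4_k.gguf has already been downloaded locally:

# pip install llama-cpp-python
from llama_cpp import Llama

# Load the Q4_K file on CPU; tune n_threads to your core count.
llm = Llama(model_path="OlympicCoder-7B-q4_k.gguf", n_ctx=4096, n_threads=8)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a python program to calculate the 10th Fibonacci number"}],
    max_tokens=512,
)
print(out["choices"][0]["message"]["content"])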


Summary Table: Model Format Selection

| Model Format | Precision | Memory Usage | Device Requirements | Best Use Case |
|--------------|-----------|--------------|---------------------|---------------|
| BF16 | Highest | High | BF16-supported GPU/CPUs | High-speed inference with reduced memory |
| F16 | High | High | FP16-supported devices | GPU inference when BF16 isn’t available |
| Q4_K | Low | Very Low | CPU or low-VRAM devices | Best for memory-constrained environments |
| Q6_K | Medium-Low | Low | CPU with more memory | Better accuracy while still being quantized |
| Q8_0 | Medium | Moderate | CPU or GPU with enough VRAM | Best accuracy among quantized models |

Included Files & Details

OlympicCoder-7B-bf16.gguf

  • Model weights preserved in BF16.
  • Use this if you want to requantize the model into a different format.
  • Best if your device supports BF16 acceleration.

OlympicCoder-7B-f16.gguf

  • Model weights stored in F16.
  • Use if your device supports FP16, especially if BF16 is not available.

OlympicCoder-7B-bf16-q8_0.gguf

  • Output & embeddings remain in BF16.
  • All other layers quantized to Q8_0.
  • Use if your device supports BF16 and you want a quantized version.

OlympicCoder-7B-f16-q8_0.gguf

  • Output & embeddings remain in F16.
  • All other layers quantized to Q8_0.

OlympicCoder-7B-q4_k.gguf

  • Output & embeddings quantized to Q8_0.
  • All other layers quantized to Q4_K.
  • Good for CPU inference with limited memory.

OlympicCoder-7B-q4_k_s.gguf

  • Smallest Q4_K variant, using less memory at the cost of accuracy.
  • Best for very low-memory setups.

OlympicCoder-7B-q6_k.gguf

  • Output & embeddings quantized to Q8_0.
  • All other layers quantized to Q6_K.

OlympicCoder-7B-q8_0.gguf

  • Fully Q8 quantized model for better accuracy.
  • Requires more memory but offers higher precision.
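
To fetch a single variant from this repo, here is a small sketch using huggingface_hub (swap the filename for whichever file above fits your hardware):

# pip install huggingface_hub
from huggingface_hub import hf_hub_download

# Download one GGUF file from this repo into the local Hugging Face cache.
path = hf_hub_download(
    repo_id="Mungert/OlympicCoder-7B-GGUF",
    filename="OlympicCoder-7B-q4_k.gguf",
)
print(path)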

🚀 If you find these models useful

Please click Like ❤. I’d also really appreciate it if you could test my Network Monitor Assistant at 👉 Network Monitor Assistant.

💬 Click the chat icon (bottom right of the main and dashboard pages). Choose an LLM; toggle between the LLM types: TurboLLM -> FreeLLM -> TestLLM.

What I'm Testing

I'm experimenting with function calling against my network monitoring service, using small open-source models. I'm focused on the question: how small can a model go and still function?

🟡 TestLLM – Runs the current testing model using llama.cpp on six threads of a CPU VM (it takes about 15 seconds to load, inference is quite slow, and it only processes one user prompt at a time; still working on scaling!). If you're curious, I'd be happy to share how it works!

Other Available AI Assistants

🟢 TurboLLM – Uses gpt-4o-mini. Fast! Note: tokens are limited since OpenAI models are pricey, but you can Login or Download the Free Network Monitor agent to get more tokens. Alternatively, use the FreeLLM.

🔵 FreeLLM – Runs open-source Hugging Face models at medium speed (unlimited, subject to Hugging Face API availability).

Model Card for OlympicCoder-7B

OlympicCoder-7B is a code model that achieves strong performance on competitive coding benchmarks such as LiveCodeBench and the 2024 International Olympiad in Informatics.

Model description

  • Model type: A 7B parameter model fine-tuned on a decontaminated version of the Codeforces dataset.
  • Language(s) (NLP): Primarily English
  • License: apache-2.0
  • Finetuned from model: Qwen/Qwen2.5-Coder-7B-Instruct

Usage

Here's how you can run the model using the pipeline() function from 🤗 Transformers:

# pip install transformers
# pip install accelerate

import torch
from transformers import pipeline

pipe = pipeline("text-generation", model="open-r1/OlympicCoder-7B", torch_dtype=torch.bfloat16, device_map="auto")

# We use the tokenizer's chat template to format each message - see https://huggingface.co/docs/transformers/main/en/chat_templating
messages = [
    {"role": "user", "content": "Write a python program to calculate the 10th Fibonacci number"},
]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=8000, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])
#<|im_start|>user
#Write a python program to calculate the 10th fibonacci number<|im_end|>
#<|im_start|>assistant
#<think>Okay, I need to write a Python program that calculates the 10th Fibonacci number. Hmm, the Fibonacci sequence starts with 0 and 1. Each subsequent number is the sum of the two preceding ones. So the sequence goes: 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, and so on. ...

To ensure that the model consistently outputs a long chain-of-thought, we have edited the chat template to prefill the first assistant turn with a <think> token. As a result, the outputs from this model will not show the opening <think> token if you use the model's generate() method. To apply reinforcement learning with a format reward, either prepend the <think> token to the model's completions or amend the chat template to remove the prefill.
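
For example, a minimal sketch of the prepend approach (the completion string is a placeholder, not real model output):

# generate() returns text that starts inside the reasoning block, so restore
# the opening <think> tag before applying a format reward.
completions = ["Okay, I need to compute the 10th Fibonacci number...</think>..."]  # placeholder
completions = ["<think>" + c for c in completions]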

Training procedure

Training hyper-parameters

The following hyperparameters were used during training:

  • dataset: open-r1/codeforces-cots
  • learning_rate: 4.0e-5
  • train_batch_size: 2
  • seed: 42
  • packing: false
  • distributed_type: deepspeed-zero-3
  • num_devices: 8
  • gradient_accumulation_steps: 8
  • total_train_batch_size: 16
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine_with_min_lr
  • min_lr_rate: 0.1
  • lr_scheduler_warmup_ratio: 0.03
  • num_epochs: 10.0
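
As a rough illustration only, the hyperparameters above might map onto a TRL SFTConfig as sketched below; the field names are assumptions based on current trl/transformers releases, not the actual open-r1 training script:

# pip install trl
from trl import SFTConfig

# Hypothetical mapping of the listed hyperparameters; the real recipe may differ.
config = SFTConfig(
    learning_rate=4.0e-5,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=10.0,
    seed=42,
    packing=False,
    lr_scheduler_type="cosine_with_min_lr",
    lr_scheduler_kwargs={"min_lr_rate": 0.1},
    warmup_ratio=0.03,
)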