--- pipeline_tag: text-generation inference: true widget: - text: 'def print_hello_world():' example_title: Hello world group: Python license: bigscience-openrail-m pretrain-datasets: - books - arxiv - c4 - falcon-refinedweb - wiki - github-issues - stack_markdown - self-made dataset of permissive github code datasets: - bigcode/the-stack-dedup - rombodawg/2XUNCENSORED_MegaCodeTraining188k - bigcode/commitpackft metrics: - code_eval library_name: transformers tags: - code model-index: - name: Refact-1.6B results: - task: type: text-generation dataset: type: openai_humaneval name: HumanEval metrics: - name: pass@1 (T=0.01) type: pass@1 value: 32.0 verified: false - name: pass@1 (T=0.2) type: pass@1 value: 31.5 verified: false - name: pass@10 (T=0.8) type: pass@10 value: 53.0 verified: false - name: pass@100 (T=0.8) type: pass@100 value: 76.9 verified: false - task: type: text-generation dataset: type: bigcode/humanevalpack name: HumanEvalSynthesize Python metrics: - name: pass@1 (T=0.2) type: pass@1 value: 35.8 verified: false - task: type: text-generation dataset: type: bigcode/humanevalpack name: HumanEvalSynthesize JavaScript metrics: - name: pass@1 (T=0.2) type: pass@1 value: 31.6 verified: false - task: type: text-generation dataset: type: bigcode/humanevalpack name: HumanEvalSynthesize Java metrics: - name: pass@1 (T=0.2) type: pass@1 value: 29.1 verified: false - task: type: text-generation dataset: type: bigcode/humanevalpack name: HumanEvalSynthesize Go metrics: - name: pass@1 (T=0.2) type: pass@1 value: -1 verified: false - task: type: text-generation dataset: type: bigcode/humanevalpack name: HumanEvalSynthesize C++ metrics: - name: pass@1 (T=0.2) type: pass@1 value: 26.3 verified: false - task: type: text-generation dataset: type: bigcode/humanevalpack name: HumanEvalSynthesize Rust metrics: - name: pass@1 (T=0.2) type: pass@1 value: -1 verified: false - task: type: text-generation dataset: type: bigcode/humanevalpack name: HumanEvalSynthesize Average 
metrics: - name: pass@1 (T=0.2) type: pass@1 value: -1 verified: false - task: type: text-generation dataset: type: bigcode/humanevalpack name: HumanEvalFixTests Python metrics: - name: pass@1 (T=0.2) type: pass@1 value: 18.38 verified: false - task: type: text-generation dataset: type: bigcode/humanevalpack name: HumanEvalFixTests JavaScript metrics: - name: pass@1 (T=0.2) type: pass@1 value: 12.28 verified: false - task: type: text-generation dataset: type: bigcode/humanevalpack name: HumanEvalFixTests Java metrics: - name: pass@1 (T=0.2) type: pass@1 value: 15.12 verified: false - task: type: text-generation dataset: type: bigcode/humanevalpack name: HumanEvalFixTests Go metrics: - name: pass@1 (T=0.2) type: pass@1 value: -1 verified: false - task: type: text-generation dataset: type: bigcode/humanevalpack name: HumanEvalFixTests C++ metrics: - name: pass@1 (T=0.2) type: pass@1 value: 13.17 verified: false - task: type: text-generation dataset: type: bigcode/humanevalpack name: HumanEvalFixTests Rust metrics: - name: pass@1 (T=0.2) type: pass@1 value: 2.8 verified: false - task: type: text-generation dataset: type: bigcode/humanevalpack name: HumanEvalFixTests Average metrics: - name: pass@1 (T=0.2) type: pass@1 value: -1 verified: false - task: type: text-generation dataset: type: bigcode/humanevalpack name: HumanEvalFixDocs Python metrics: - name: pass@1 (T=0.2) type: pass@1 value: 26.92 verified: false - task: type: text-generation dataset: type: bigcode/humanevalpack name: HumanEvalFixDocs JavaScript metrics: - name: pass@1 (T=0.2) type: pass@1 value: 26.85 verified: false - task: type: text-generation dataset: type: bigcode/humanevalpack name: HumanEvalFixDocs Java metrics: - name: pass@1 (T=0.2) type: pass@1 value: 30.76 verified: false - task: type: text-generation dataset: type: bigcode/humanevalpack name: HumanEvalFixDocs Go metrics: - name: pass@1 (T=0.2) type: pass@1 value: -1 verified: false - task: type: text-generation dataset: type: 
bigcode/humanevalpack name: HumanEvalFixDocs C++ metrics: - name: pass@1 (T=0.2) type: pass@1 value: 25.94 verified: false - task: type: text-generation dataset: type: bigcode/humanevalpack name: HumanEvalFixDocs Rust metrics: - name: pass@1 (T=0.2) type: pass@1 value: 8.44 verified: false - task: type: text-generation dataset: type: bigcode/humanevalpack name: HumanEvalFixDocs Average metrics: - name: pass@1 (T=0.2) type: pass@1 value: -1 verified: false - task: type: text-generation dataset: type: bigcode/humanevalpack name: HumanEvalExplain Python metrics: - name: pass@1 (T=0.2) type: pass@1 value: 26.46 verified: false - task: type: text-generation dataset: type: bigcode/humanevalpack name: HumanEvalExplain JavaScript metrics: - name: pass@1 (T=0.2) type: pass@1 value: 17.86 verified: false - task: type: text-generation dataset: type: bigcode/humanevalpack name: HumanEvalExplain Java metrics: - name: pass@1 (T=0.2) type: pass@1 value: 20.94 verified: false - task: type: text-generation dataset: type: bigcode/humanevalpack name: HumanEvalExplain Go metrics: - name: pass@1 (T=0.2) type: pass@1 value: -1 verified: false - task: type: text-generation dataset: type: bigcode/humanevalpack name: HumanEvalExplain C++ metrics: - name: pass@1 (T=0.2) type: pass@1 value: 18.78 verified: false - task: type: text-generation dataset: type: bigcode/humanevalpack name: HumanEvalExplain Rust metrics: - name: pass@1 (T=0.2) type: pass@1 value: -1 verified: false - task: type: text-generation dataset: type: bigcode/humanevalpack name: HumanEvalExplain Average metrics: - name: pass@1 (T=0.2) type: pass@1 value: -1 verified: false - task: type: text-generation dataset: type: mbpp name: MBPP metrics: - name: pass@1 (T=0.01) type: pass@1 value: 31.15 verified: false - task: type: text-generation dataset: type: ds1000 name: DS-1000 (Overall Completion) metrics: - name: pass@1 (T=0.2) type: pass@1 value: 10.1 verified: false - task: type: text-generation dataset: type: nuprl/MultiPL-E 
name: MultiPL-HumanEval (C++) metrics: - name: pass@1 (T=0.2) type: pass@1 value: 21.61 verified: false - task: type: text-generation dataset: type: nuprl/MultiPL-E name: MultiPL-HumanEval (C#) metrics: - name: pass@1 (T=0.2) type: pass@1 value: 13.91 verified: false - task: type: text-generation dataset: type: nuprl/MultiPL-E name: MultiPL-HumanEval (D) metrics: - name: pass@1 (T=0.2) type: pass@1 value: 9.5 verified: false - task: type: text-generation dataset: type: nuprl/MultiPL-E name: MultiPL-HumanEval (Go) metrics: - name: pass@1 (T=0.2) type: pass@1 value: 53.57 verified: false - task: type: text-generation dataset: type: nuprl/MultiPL-E name: MultiPL-HumanEval (Java) metrics: - name: pass@1 (T=0.2) type: pass@1 value: 21.58 verified: false - task: type: text-generation dataset: type: nuprl/MultiPL-E name: MultiPL-HumanEval (Julia) metrics: - name: pass@1 (T=0.2) type: pass@1 value: 13.75 verified: false - task: type: text-generation dataset: type: nuprl/MultiPL-E name: MultiPL-HumanEval (JavaScript) metrics: - name: pass@1 (T=0.2) type: pass@1 value: 26.88 verified: false - task: type: text-generation dataset: type: nuprl/MultiPL-E name: MultiPL-HumanEval (Lua) metrics: - name: pass@1 (T=0.2) type: pass@1 value: 15.26 verified: false - task: type: text-generation dataset: type: nuprl/MultiPL-E name: MultiPL-HumanEval (PHP) metrics: - name: pass@1 (T=0.2) type: pass@1 value: 23.04 verified: false - task: type: text-generation dataset: type: nuprl/MultiPL-E name: MultiPL-HumanEval (Perl) metrics: - name: pass@1 (T=0.2) type: pass@1 value: 12.1 verified: false - task: type: text-generation dataset: type: nuprl/MultiPL-E name: MultiPL-HumanEval (Python) metrics: - name: pass@1 (T=0.2) type: pass@1 value: 29.6 verified: false - task: type: text-generation dataset: type: nuprl/MultiPL-E name: MultiPL-HumanEval (R) metrics: - name: pass@1 (T=0.2) type: pass@1 value: 13.77 verified: false - task: type: text-generation dataset: type: nuprl/MultiPL-E name: 
MultiPL-HumanEval (Ruby) metrics: - name: pass@1 (T=0.2) type: pass@1 value: 12.68 verified: false - task: type: text-generation dataset: type: nuprl/MultiPL-E name: MultiPL-HumanEval (Racket) metrics: - name: pass@1 (T=0.2) type: pass@1 value: 4.29 verified: false - task: type: text-generation dataset: type: nuprl/MultiPL-E name: MultiPL-HumanEval (Rust) metrics: - name: pass@1 (T=0.2) type: pass@1 value: 19.54 verified: false - task: type: text-generation dataset: type: nuprl/MultiPL-E name: MultiPL-HumanEval (Scala) metrics: - name: pass@1 (T=0.2) type: pass@1 value: 18.33 verified: false - task: type: text-generation dataset: type: nuprl/MultiPL-E name: MultiPL-HumanEval (Bash) metrics: - name: pass@1 (T=0.2) type: pass@1 value: 5.7 verified: false - task: type: text-generation dataset: type: nuprl/MultiPL-E name: MultiPL-HumanEval (Swift) metrics: - name: pass@1 (T=0.2) type: pass@1 value: 17.68 verified: false - task: type: text-generation dataset: type: nuprl/MultiPL-E name: MultiPL-HumanEval (TypeScript) metrics: - name: pass@1 (T=0.2) type: pass@1 value: 25 verified: false language: - en
---

# Refact-1_6B-fim GGUF Models

## **Choosing the Right Model Format**

Selecting the correct model format depends on your **hardware capabilities** and **memory constraints**.

### **BF16 (Brain Float 16) – Use if BF16 acceleration is available**
- A 16-bit floating-point format designed for **faster computation** while retaining good precision.
- Provides **similar dynamic range** as FP32 but with **lower memory usage**.
- Recommended if your hardware supports **BF16 acceleration** (check your device’s specs).
- Ideal for **high-performance inference** with **reduced memory footprint** compared to FP32.

📌 **Use BF16 if:**
✔ Your hardware has native **BF16 support** (e.g., newer GPUs, TPUs).
✔ You want **higher precision** while saving memory.
✔ You plan to **requantize** the model into another format.

📌 **Avoid BF16 if:**
❌ Your hardware does **not** support BF16 (it may fall back to FP32 and run slower).
❌ You need compatibility with older devices that lack BF16 optimization.

---

### **F16 (Float 16) – More widely supported than BF16**
- A 16-bit floating-point format with **high precision** but a narrower range of values than BF16.
- Works on most devices with **FP16 acceleration support** (including many GPUs and some CPUs).
- The narrower range can clip very large or very small values, but it is generally sufficient for inference.

📌 **Use F16 if:**
✔ Your hardware supports **FP16** but **not BF16**.
✔ You need a **balance between speed, memory usage, and accuracy**.
✔ You are running on a **GPU** or another device optimized for FP16 computations.

📌 **Avoid F16 if:**
❌ Your device lacks **native FP16 support** (it may run slower than expected).
❌ You have memory limitations.

---

### **Quantized Models (Q4_K, Q6_K, Q8, etc.) – For CPU & Low-VRAM Inference**
Quantization reduces model size and memory usage while maintaining as much accuracy as possible.
- **Lower-bit models (Q4_K)** → **Best for minimal memory usage**, may have lower precision.
- **Higher-bit models (Q6_K, Q8_0)** → **Better accuracy**, requires more memory.

📌 **Use Quantized Models if:**
✔ You are running inference on a **CPU** and need an optimized model.
✔ Your device has **low VRAM** and cannot load full-precision models.
✔ You want to reduce **memory footprint** while keeping reasonable accuracy.

📌 **Avoid Quantized Models if:**
❌ You need **maximum accuracy** (full-precision models are better for this).
❌ Your hardware has enough VRAM for higher-precision formats (BF16/F16).

---

### **Very Low-Bit Quantization (IQ3_XS, IQ3_S, IQ3_M, Q4_K, Q4_0)**
These models are optimized for **extreme memory efficiency**, making them ideal for **low-power devices** or **large-scale deployments** where memory is a critical constraint.

- **IQ3_XS**: Ultra-low-bit quantization (3-bit) with **extreme memory efficiency**.
  - **Use case**: Best for **ultra-low-memory devices** where even Q4_K is too large.
  - **Trade-off**: Lower accuracy compared to higher-bit quantizations.

- **IQ3_S**: Small block size for **maximum memory efficiency**.
  - **Use case**: Best for **low-memory devices** where **IQ3_XS** is too aggressive.

- **IQ3_M**: Medium block size for better accuracy than **IQ3_S**.
  - **Use case**: Suitable for **low-memory devices** where **IQ3_S** is too limiting.

- **Q4_K**: 4-bit quantization with **block-wise optimization** for better accuracy.
  - **Use case**: Best for **low-memory devices** where **Q6_K** is too large.

- **Q4_0**: Pure 4-bit quantization, optimized for **ARM devices**.
  - **Use case**: Best for **ARM-based devices** or **low-memory environments**.

---

### **Summary Table: Model Format Selection**

| Model Format | Precision | Memory Usage | Device Requirements | Best Use Case |
|--------------|------------|---------------|----------------------|---------------|
| **BF16** | Highest | High | BF16-supported GPU/CPUs | High-speed inference with reduced memory |
| **F16** | High | High | FP16-supported devices | GPU inference when BF16 isn’t available |
| **Q4_K** | Medium Low | Low | CPU or Low-VRAM devices | Best for memory-constrained environments |
| **Q6_K** | Medium | Moderate | CPU with more memory | Better accuracy while still being quantized |
| **Q8_0** | High | Moderate | CPU or GPU with enough VRAM | Best accuracy among quantized models |
| **IQ3_XS** | Very Low | Very Low | Ultra-low-memory devices | Extreme memory efficiency and low accuracy |
| **Q4_0** | Low | Low | ARM or low-memory devices | llama.cpp can optimize for ARM devices |

---

## **Included Files & Details**

### `Refact-1_6B-fim-bf16.gguf`
- Model weights preserved in **BF16**.
- Use this if you want to **requantize** the model into a different format.
- Best if your device supports **BF16 acceleration**.

### `Refact-1_6B-fim-f16.gguf`
- Model weights stored in **F16**.
- Use if your device supports **FP16**, especially if BF16 is not available.

### `Refact-1_6B-fim-bf16-q8_0.gguf`
- **Output & embeddings** remain in **BF16**.
- All other layers quantized to **Q8_0**.
- Use if your device supports **BF16** and you want a quantized version.

### `Refact-1_6B-fim-f16-q8_0.gguf`
- **Output & embeddings** remain in **F16**.
- All other layers quantized to **Q8_0**.

### `Refact-1_6B-fim-q4_k.gguf`
- **Output & embeddings** quantized to **Q8_0**.
- All other layers quantized to **Q4_K**.
- Good for **CPU inference** with limited memory.

### `Refact-1_6B-fim-q4_k_s.gguf`
- Smallest **Q4_K** variant, using less memory at the cost of accuracy.
- Best for **very low-memory setups**.

### `Refact-1_6B-fim-q6_k.gguf`
- **Output & embeddings** quantized to **Q8_0**.
- All other layers quantized to **Q6_K**.

### `Refact-1_6B-fim-q8_0.gguf`
- Fully **Q8** quantized model for better accuracy.
- Requires **more memory** but offers higher precision.

### `Refact-1_6B-fim-iq3_xs.gguf`
- **IQ3_XS** quantization, optimized for **extreme memory efficiency**.
- Best for **ultra-low-memory devices**.

### `Refact-1_6B-fim-iq3_m.gguf`
- **IQ3_M** quantization, offering a **medium block size** for better accuracy.
- Suitable for **low-memory devices**.

### `Refact-1_6B-fim-q4_0.gguf`
- Pure **Q4_0** quantization, optimized for **ARM devices**.
- Best for **low-memory environments**.
- Prefer IQ4_NL for better accuracy.

# 🚀 If you find these models useful

Please click like ❤. Also, I’d really appreciate it if you could test my Network Monitor Assistant at 👉 [Network Monitor Assistant](https://freenetworkmonitor.click/dashboard).

💬 Click the **chat icon** (bottom right of the main and dashboard pages). Choose an LLM; toggle between the LLM types TurboLLM -> FreeLLM -> TestLLM.

### What I'm Testing

I'm experimenting with **function calling** against my network monitoring service, using small open-source models. The question I'm interested in is "How small can it go and still function?"

🟡 **TestLLM** – Runs the current testing model using llama.cpp on 6 threads of a CPU VM. It should take about 15s to load. Inference is quite slow and it only processes one user prompt at a time; I'm still working on scaling. If you're curious, I'd be happy to share how it works!

### The other Available AI Assistants

🟢 **TurboLLM** – Uses **gpt-4o-mini**. Fast! Note: tokens are limited since OpenAI models are pricey, but you can [Login](https://freenetworkmonitor.click) or [Download](https://freenetworkmonitor.click/download) the Free Network Monitor agent to get more tokens. Alternatively, use FreeLLM.

🔵 **FreeLLM** – Runs **open-source Hugging Face models**. Medium speed (unlimited, subject to Hugging Face API availability).

![image/png](https://cdn-uploads.huggingface.co/production/uploads/643a9dd0c5f633a7fa7e804a/HkB0QYV0BbmB3ktMugbZy.png)

# Refact-1.6B

Finally, the model we started training with our [blog post](https://refact.ai/blog/2023/applying-recent-innovations-to-train-model/) is ready 🎉

After fine-tuning on generated data, it beats Replit 3b, Stability Code 3b and many other models. It almost beats StarCoder, a model ten times its size!

Model                 | Size          | HumanEval pass@1   | HumanEval pass@10  |
----------------------|---------------|--------------------|--------------------|
DeciCoder-1b          | 1b            | 19.1%              |                    |
Refact-1.6-fim        | 1.6b          | 32.0%              | 53.0%              |
StableCode            | 3b            | 20.2%              | 33.8%              |
ReplitCode v1         | 3b            | 21.9%              |                    |
CodeGen2.5-multi      | 7b            | 28.4%              | 47.5%              |
CodeLlama             | 7b            | 33.5%              | 59.6%              |
StarCoder             | 15b           | 33.6%              |                    |

It is likely the best model for practical code completion in your IDE, because it's smart and fast!

You can start using it right now by downloading the [Refact plugin](https://refact.ai/).
You can host the model yourself, too, using the [open source docker container](https://github.com/smallcloudai/refact).

And it's multi-language (see MultiPL-HumanEval and other metrics below) and it works as a chat (see the section below).

# It Works As a Chat

The primary application of this model is code completion (infill) in multiple programming languages.
But it works as a chat quite well.

HumanEval results using instruction following (chat) format, against models specialized for chat only:

Model                  | Size   | pass@1   | pass@10  |
-----------------------|--------|----------|----------|
Refact-1.6-fim         | 1.6b   | 38.4%    | 55.6%    |
StableCode-instruct    | 3b     | 26.9%    | 36.2%    |
OctoGeeX               | 6b     | 44.7%    |          |
CodeLlama-instruct     | 7b     | 34.8%    | 64.3%    |
CodeGen2.5-instruct    | 7b     | 36.2%    | 60.87%   |
CodeLlama-instruct     | 13b    | 42.7%    | 71.6%    |
StarChat-β             | 15b    | 33.5%    |          |
OctoCoder              | 15b    | 46.2%    |          |

# Example

Fill-in-the-middle uses special tokens to identify the prefix/middle/suffix part of the input and output:

```python
# pip install -q transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "smallcloudai/Refact-1_6B-fim"
device = "cuda"  # for GPU usage or "cpu" for CPU usage

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True).to(device)

# FIM special tokens (assumed StarCoder-style names) delimit the prefix and the
# suffix; the model generates the missing middle after <fim_middle>.
prompt = '<fim_prefix>def print_hello_world():\n    """<fim_suffix>\n    print("Hello world!")<fim_middle>'

inputs = tokenizer.encode(prompt, return_tensors="pt").to(device)
# temperature only takes effect when sampling is enabled
outputs = model.generate(inputs, max_length=100, do_sample=True, temperature=0.2)

print("-"*80)
print(tokenizer.decode(outputs[0]))
```

# Chat Format

The same model works as chat (experimental).

```python
prompt_template = "SYSTEM {system}\n" \
                  "USER {query}\n" \
                  "ASSISTANT"
prompt = prompt_template.format(system="You are a programming assistant",
                                query="How do I sort a list in Python?")
```

# Architecture

As described in more detail in the blog post, we used:

- [ALiBi](https://arxiv.org/abs/2108.12409) based attention
- [LayerNorm](https://arxiv.org/abs/1607.06450v1) instead of [RMSNorm](https://arxiv.org/pdf/1910.07467.pdf)
- [Multi Query Attention](https://arxiv.org/abs/1911.02150)

We also used the LiON optimizer, flash attention, and early dropout. It's not so innovative that you can't run it, in fact you can -- see the example above.

# Pretraining

For the base model, we used our own dataset that contains code with permissive licenses only, and open text datasets.
Filtering is the key to the success of this model:

- We only used text in English
- Only topics related to computer science
- Applied heavy deduplication

The text to code proportion was 50:50, and the model was trained for 1.2T tokens.

We don't release the base model, because its Fill-in-the-Middle (FIM) capability likes to repeat itself too much, so its practical use is limited. But if you still want it, write us a message on Discord.

# Finetuning

We tested our hypothesis that chat data should boost base model performance in FIM and regular left-to-right code completion. We found that filling just 15% of the finetune dataset with open [code](https://huggingface.co/datasets/bigcode/commitpackft) [instruction-following](https://huggingface.co/datasets/rombodawg/2XUNCENSORED_MegaCodeTraining188k) datasets, which we filtered for quality, improves almost all metrics.

Additionally, to improve FIM, we observed common failure modes, and prepared a synthetic dataset based on [The Stack dedup v1.1](https://huggingface.co/datasets/bigcode/the-stack-dedup) to address them.

There is a distribution shift between typical code on the internet, and the code you write in your IDE.
The former is likely finished, so the model tries to come up with a suggestion that makes the code complete. The code in your IDE, by contrast, is likely half-written as you work on it, and there is no single addition that can repair it fully.

In practice, the model needs to have a tendency to stop after a couple of lines are added, and sometimes to write nothing at all. We found that just giving it empty completions, single-line completions, and multiline completions that end with a smaller text indent or at least a newline makes it much more usable. This data made up the remaining 85% of the finetune dataset.

The final model is the result of several attempts to make it work as well as possible for code completion, and to perform well on a wide range of metrics. The best attempt took 40B tokens.

# Limitations and Bias

The Refact-1.6B model was trained on text in English. It has seen many more languages in code comments, but its performance on non-English languages is lower.

# Model Stats

- **Architecture:** LLAMA-like model with multi-query attention
- **Objectives:** Fill-in-the-Middle, Chat
- **Tokens context:** 4096
- **Pretraining tokens:** 1.2T
- **Finetuning tokens:** 40B
- **Precision:** bfloat16
- **GPUs:** 64 NVIDIA A5000
- **Training time:** 28 days

# License

The model is licensed under the BigScience OpenRAIL-M v1 license agreement.

# Citation

If you are using this model, please give a link to this page.
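
As a small companion to the Example and Chat Format sections, the sketch below builds both kinds of prompt as plain strings, so the formatting can be checked without loading the model. The helper names (`build_fim_prompt`, `build_chat_prompt`, `extract_middle`) are hypothetical, and the `<fim_prefix>` / `<fim_suffix>` / `<fim_middle>` token names are an assumption based on the StarCoder-style convention; it is worth checking the tokenizer's special tokens before relying on them. The chat helper follows the SYSTEM/USER/ASSISTANT template exactly as written in the Chat Format section.

```python
# Hypothetical prompt helpers; the FIM token names are assumed, not confirmed here.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Wrap the code before and after the cursor in FIM special tokens."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


def build_chat_prompt(system: str, query: str) -> str:
    """Fill in the SYSTEM/USER/ASSISTANT template from the Chat Format section."""
    return f"SYSTEM {system}\nUSER {query}\nASSISTANT"


def extract_middle(decoded: str) -> str:
    """Return the text after the middle token, or the input unchanged if absent."""
    _, separator, middle = decoded.partition(FIM_MIDDLE)
    return middle if separator else decoded


if __name__ == "__main__":
    fim_prompt = build_fim_prompt(
        prefix='def print_hello_world():\n    """',
        suffix='\n    print("Hello world!")',
    )
    print(fim_prompt)
    print(build_chat_prompt("You are a programming assistant",
                            "How do I sort a list in Python?"))
```

The string returned by `build_fim_prompt` can be passed to `tokenizer.encode` in place of a hand-written prompt, and `extract_middle` can be applied to the decoded output to keep only the generated infill.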