Mungert committed on · Commit e2edd48 · verified · 1 Parent(s): 7b33e53

Upload README.md with huggingface_hub

README.md ADDED
@@ -0,0 +1,882 @@
1
+ ---
2
+ pipeline_tag: text-generation
3
+ inference: true
4
+ widget:
5
+ - text: 'def print_hello_world():'
6
+ example_title: Hello world
7
+ group: Python
8
+ license: bigscience-openrail-m
9
+ pretrain-datasets:
10
+ - books
11
+ - arxiv
12
+ - c4
13
+ - falcon-refinedweb
14
+ - wiki
15
+ - github-issues
16
+ - stack_markdown
17
+ - self-made dataset of permissive github code
18
+ datasets:
19
+ - bigcode/the-stack-dedup
20
+ - rombodawg/2XUNCENSORED_MegaCodeTraining188k
21
+ - bigcode/commitpackft
22
+ metrics:
23
+ - code_eval
24
+ library_name: transformers
25
+ tags:
26
+ - code
27
+ model-index:
28
+ - name: Refact-1.6B
29
+ results:
30
+ - task:
31
+ type: text-generation
32
+ dataset:
33
+ type: openai_humaneval
34
+ name: HumanEval
35
+ metrics:
36
+ - name: pass@1 (T=0.01)
37
+ type: pass@1
38
+ value: 32.0
39
+ verified: false
40
+ - name: pass@1 (T=0.2)
41
+ type: pass@1
42
+ value: 31.5
43
+ verified: false
44
+ - name: pass@10 (T=0.8)
45
+ type: pass@10
46
+ value: 53.0
47
+ verified: false
48
+ - name: pass@100 (T=0.8)
49
+ type: pass@100
50
+ value: 76.9
51
+ verified: false
52
+ - task:
53
+ type: text-generation
54
+ dataset:
55
+ type: bigcode/humanevalpack
56
+ name: HumanEvalSynthesize Python
57
+ metrics:
58
+ - name: pass@1 (T=0.2)
59
+ type: pass@1
60
+ value: 35.8
61
+ verified: false
62
+ - task:
63
+ type: text-generation
64
+ dataset:
65
+ type: bigcode/humanevalpack
66
+ name: HumanEvalSynthesize JavaScript
67
+ metrics:
68
+ - name: pass@1 (T=0.2)
69
+ type: pass@1
70
+ value: 31.6
71
+ verified: false
72
+ - task:
73
+ type: text-generation
74
+ dataset:
75
+ type: bigcode/humanevalpack
76
+ name: HumanEvalSynthesize Java
77
+ metrics:
78
+ - name: pass@1 (T=0.2)
79
+ type: pass@1
80
+ value: 29.1
81
+ verified: false
82
+ - task:
83
+ type: text-generation
84
+ dataset:
85
+ type: bigcode/humanevalpack
86
+ name: HumanEvalSynthesize Go
87
+ metrics:
88
+ - name: pass@1 (T=0.2)
89
+ type: pass@1
90
+ value: -1
91
+ verified: false
92
+ - task:
93
+ type: text-generation
94
+ dataset:
95
+ type: bigcode/humanevalpack
96
+ name: HumanEvalSynthesize C++
97
+ metrics:
98
+ - name: pass@1 (T=0.2)
99
+ type: pass@1
100
+ value: 26.3
101
+ verified: false
102
+ - task:
103
+ type: text-generation
104
+ dataset:
105
+ type: bigcode/humanevalpack
106
+ name: HumanEvalSynthesize Rust
107
+ metrics:
108
+ - name: pass@1 (T=0.2)
109
+ type: pass@1
110
+ value: -1
111
+ verified: false
112
+ - task:
113
+ type: text-generation
114
+ dataset:
115
+ type: bigcode/humanevalpack
116
+ name: HumanEvalSynthesize Average
117
+ metrics:
118
+ - name: pass@1 (T=0.2)
119
+ type: pass@1
120
+ value: -1
121
+ verified: false
122
+
123
+
124
+
125
+
126
+
127
+ - task:
128
+ type: text-generation
129
+ dataset:
130
+ type: bigcode/humanevalpack
131
+ name: HumanEvalFixTests Python
132
+ metrics:
133
+ - name: pass@1 (T=0.2)
134
+ type: pass@1
135
+ value: 18.38
136
+ verified: false
137
+ - task:
138
+ type: text-generation
139
+ dataset:
140
+ type: bigcode/humanevalpack
141
+ name: HumanEvalFixTests JavaScript
142
+ metrics:
143
+ - name: pass@1 (T=0.2)
144
+ type: pass@1
145
+ value: 12.28
146
+ verified: false
147
+ - task:
148
+ type: text-generation
149
+ dataset:
150
+ type: bigcode/humanevalpack
151
+ name: HumanEvalFixTests Java
152
+ metrics:
153
+ - name: pass@1 (T=0.2)
154
+ type: pass@1
155
+ value: 15.12
156
+ verified: false
157
+ - task:
158
+ type: text-generation
159
+ dataset:
160
+ type: bigcode/humanevalpack
161
+ name: HumanEvalFixTests Go
162
+ metrics:
163
+ - name: pass@1 (T=0.2)
164
+ type: pass@1
165
+ value: -1
166
+ verified: false
167
+ - task:
168
+ type: text-generation
169
+ dataset:
170
+ type: bigcode/humanevalpack
171
+ name: HumanEvalFixTests C++
172
+ metrics:
173
+ - name: pass@1 (T=0.2)
174
+ type: pass@1
175
+ value: 13.17
176
+ verified: false
177
+ - task:
178
+ type: text-generation
179
+ dataset:
180
+ type: bigcode/humanevalpack
181
+ name: HumanEvalFixTests Rust
182
+ metrics:
183
+ - name: pass@1 (T=0.2)
184
+ type: pass@1
185
+ value: 2.8
186
+ verified: false
187
+ - task:
188
+ type: text-generation
189
+ dataset:
190
+ type: bigcode/humanevalpack
191
+ name: HumanEvalFixTests Average
192
+ metrics:
193
+ - name: pass@1 (T=0.2)
194
+ type: pass@1
195
+ value: -1
196
+ verified: false
197
+
198
+
199
+
200
+
201
+
202
+
203
+ - task:
204
+ type: text-generation
205
+ dataset:
206
+ type: bigcode/humanevalpack
207
+ name: HumanEvalFixDocs Python
208
+ metrics:
209
+ - name: pass@1 (T=0.2)
210
+ type: pass@1
211
+ value: 26.92
212
+ verified: false
213
+ - task:
214
+ type: text-generation
215
+ dataset:
216
+ type: bigcode/humanevalpack
217
+ name: HumanEvalFixDocs JavaScript
218
+ metrics:
219
+ - name: pass@1 (T=0.2)
220
+ type: pass@1
221
+ value: 26.85
222
+ verified: false
223
+ - task:
224
+ type: text-generation
225
+ dataset:
226
+ type: bigcode/humanevalpack
227
+ name: HumanEvalFixDocs Java
228
+ metrics:
229
+ - name: pass@1 (T=0.2)
230
+ type: pass@1
231
+ value: 30.76
232
+ verified: false
233
+ - task:
234
+ type: text-generation
235
+ dataset:
236
+ type: bigcode/humanevalpack
237
+ name: HumanEvalFixDocs Go
238
+ metrics:
239
+ - name: pass@1 (T=0.2)
240
+ type: pass@1
241
+ value: -1
242
+ verified: false
243
+ - task:
244
+ type: text-generation
245
+ dataset:
246
+ type: bigcode/humanevalpack
247
+ name: HumanEvalFixDocs C++
248
+ metrics:
249
+ - name: pass@1 (T=0.2)
250
+ type: pass@1
251
+ value: 25.94
252
+ verified: false
253
+ - task:
254
+ type: text-generation
255
+ dataset:
256
+ type: bigcode/humanevalpack
257
+ name: HumanEvalFixDocs Rust
258
+ metrics:
259
+ - name: pass@1 (T=0.2)
260
+ type: pass@1
261
+ value: 8.44
262
+ verified: false
263
+ - task:
264
+ type: text-generation
265
+ dataset:
266
+ type: bigcode/humanevalpack
267
+ name: HumanEvalFixDocs Average
268
+ metrics:
269
+ - name: pass@1 (T=0.2)
270
+ type: pass@1
271
+ value: -1
272
+ verified: false
273
+
274
+
275
+
276
+
277
+ - task:
278
+ type: text-generation
279
+ dataset:
280
+ type: bigcode/humanevalpack
281
+ name: HumanEvalExplain Python
282
+ metrics:
283
+ - name: pass@1 (T=0.2)
284
+ type: pass@1
285
+ value: 26.46
286
+ verified: false
287
+ - task:
288
+ type: text-generation
289
+ dataset:
290
+ type: bigcode/humanevalpack
291
+ name: HumanEvalExplain JavaScript
292
+ metrics:
293
+ - name: pass@1 (T=0.2)
294
+ type: pass@1
295
+ value: 17.86
296
+ verified: false
297
+ - task:
298
+ type: text-generation
299
+ dataset:
300
+ type: bigcode/humanevalpack
301
+ name: HumanEvalExplain Java
302
+ metrics:
303
+ - name: pass@1 (T=0.2)
304
+ type: pass@1
305
+ value: 20.94
306
+ verified: false
307
+ - task:
308
+ type: text-generation
309
+ dataset:
310
+ type: bigcode/humanevalpack
311
+ name: HumanEvalExplain Go
312
+ metrics:
313
+ - name: pass@1 (T=0.2)
314
+ type: pass@1
315
+ value: -1
316
+ verified: false
317
+ - task:
318
+ type: text-generation
319
+ dataset:
320
+ type: bigcode/humanevalpack
321
+ name: HumanEvalExplain C++
322
+ metrics:
323
+ - name: pass@1 (T=0.2)
324
+ type: pass@1
325
+ value: 18.78
326
+ verified: false
327
+ - task:
328
+ type: text-generation
329
+ dataset:
330
+ type: bigcode/humanevalpack
331
+ name: HumanEvalExplain Rust
332
+ metrics:
333
+ - name: pass@1 (T=0.2)
334
+ type: pass@1
335
+ value: -1
336
+ verified: false
337
+ - task:
338
+ type: text-generation
339
+ dataset:
340
+ type: bigcode/humanevalpack
341
+ name: HumanEvalExplain Average
342
+ metrics:
343
+ - name: pass@1 (T=0.2)
344
+ type: pass@1
345
+ value: -1
346
+ verified: false
347
+
348
+
349
+ - task:
350
+ type: text-generation
351
+ dataset:
352
+ type: mbpp
353
+ name: MBPP
354
+ metrics:
355
+ - name: pass@1 (T=0.01)
356
+ type: pass@1
357
+ value: 31.15
358
+ verified: false
359
+ - task:
360
+ type: text-generation
361
+ dataset:
362
+ type: ds1000
363
+ name: DS-1000 (Overall Completion)
364
+ metrics:
365
+ - name: pass@1 (T=0.2)
366
+ type: pass@1
367
+ value: 10.1
368
+ verified: false
369
+ - task:
370
+ type: text-generation
371
+ dataset:
372
+ type: nuprl/MultiPL-E
373
+ name: MultiPL-HumanEval (C++)
374
+ metrics:
375
+ - name: pass@1 (T=0.2)
376
+ type: pass@1
377
+ value: 21.61
378
+ verified: false
379
+ - task:
380
+ type: text-generation
381
+ dataset:
382
+ type: nuprl/MultiPL-E
383
+ name: MultiPL-HumanEval (C#)
384
+ metrics:
385
+ - name: pass@1 (T=0.2)
386
+ type: pass@1
387
+ value: 13.91
388
+ verified: false
389
+ - task:
390
+ type: text-generation
391
+ dataset:
392
+ type: nuprl/MultiPL-E
393
+ name: MultiPL-HumanEval (D)
394
+ metrics:
395
+ - name: pass@1 (T=0.2)
396
+ type: pass@1
397
+ value: 9.5
398
+ verified: false
399
+ - task:
400
+ type: text-generation
401
+ dataset:
402
+ type: nuprl/MultiPL-E
403
+ name: MultiPL-HumanEval (Go)
404
+ metrics:
405
+ - name: pass@1 (T=0.2)
406
+ type: pass@1
407
+ value: 53.57
408
+ verified: false
409
+ - task:
410
+ type: text-generation
411
+ dataset:
412
+ type: nuprl/MultiPL-E
413
+ name: MultiPL-HumanEval (Java)
414
+ metrics:
415
+ - name: pass@1 (T=0.2)
416
+ type: pass@1
417
+ value: 21.58
418
+ verified: false
419
+ - task:
420
+ type: text-generation
421
+ dataset:
422
+ type: nuprl/MultiPL-E
423
+ name: MultiPL-HumanEval (Julia)
424
+ metrics:
425
+ - name: pass@1 (T=0.2)
426
+ type: pass@1
427
+ value: 13.75
428
+ verified: false
429
+ - task:
430
+ type: text-generation
431
+ dataset:
432
+ type: nuprl/MultiPL-E
433
+ name: MultiPL-HumanEval (JavaScript)
434
+ metrics:
435
+ - name: pass@1 (T=0.2)
436
+ type: pass@1
437
+ value: 26.88
438
+ verified: false
439
+ - task:
440
+ type: text-generation
441
+ dataset:
442
+ type: nuprl/MultiPL-E
443
+ name: MultiPL-HumanEval (Lua)
444
+ metrics:
445
+ - name: pass@1 (T=0.2)
446
+ type: pass@1
447
+ value: 15.26
448
+ verified: false
449
+ - task:
450
+ type: text-generation
451
+ dataset:
452
+ type: nuprl/MultiPL-E
453
+ name: MultiPL-HumanEval (PHP)
454
+ metrics:
455
+ - name: pass@1 (T=0.2)
456
+ type: pass@1
457
+ value: 23.04
458
+ verified: false
459
+ - task:
460
+ type: text-generation
461
+ dataset:
462
+ type: nuprl/MultiPL-E
463
+ name: MultiPL-HumanEval (Perl)
464
+ metrics:
465
+ - name: pass@1 (T=0.2)
466
+ type: pass@1
467
+ value: 12.1
468
+ verified: false
469
+ - task:
470
+ type: text-generation
471
+ dataset:
472
+ type: nuprl/MultiPL-E
473
+ name: MultiPL-HumanEval (Python)
474
+ metrics:
475
+ - name: pass@1 (T=0.2)
476
+ type: pass@1
477
+ value: 29.6
478
+ verified: false
479
+ - task:
480
+ type: text-generation
481
+ dataset:
482
+ type: nuprl/MultiPL-E
483
+ name: MultiPL-HumanEval (R)
484
+ metrics:
485
+ - name: pass@1 (T=0.2)
486
+ type: pass@1
487
+ value: 13.77
488
+ verified: false
489
+ - task:
490
+ type: text-generation
491
+ dataset:
492
+ type: nuprl/MultiPL-E
493
+ name: MultiPL-HumanEval (Ruby)
494
+ metrics:
495
+ - name: pass@1 (T=0.2)
496
+ type: pass@1
497
+ value: 12.68
498
+ verified: false
499
+ - task:
500
+ type: text-generation
501
+ dataset:
502
+ type: nuprl/MultiPL-E
503
+ name: MultiPL-HumanEval (Racket)
504
+ metrics:
505
+ - name: pass@1 (T=0.2)
506
+ type: pass@1
507
+ value: 4.29
508
+ verified: false
509
+ - task:
510
+ type: text-generation
511
+ dataset:
512
+ type: nuprl/MultiPL-E
513
+ name: MultiPL-HumanEval (Rust)
514
+ metrics:
515
+ - name: pass@1 (T=0.2)
516
+ type: pass@1
517
+ value: 19.54
518
+ verified: false
519
+ - task:
520
+ type: text-generation
521
+ dataset:
522
+ type: nuprl/MultiPL-E
523
+ name: MultiPL-HumanEval (Scala)
524
+ metrics:
525
+ - name: pass@1 (T=0.2)
526
+ type: pass@1
527
+ value: 18.33
528
+ verified: false
529
+ - task:
530
+ type: text-generation
531
+ dataset:
532
+ type: nuprl/MultiPL-E
533
+ name: MultiPL-HumanEval (Bash)
534
+ metrics:
535
+ - name: pass@1 (T=0.2)
536
+ type: pass@1
537
+ value: 5.7
538
+ verified: false
539
+ - task:
540
+ type: text-generation
541
+ dataset:
542
+ type: nuprl/MultiPL-E
543
+ name: MultiPL-HumanEval (Swift)
544
+ metrics:
545
+ - name: pass@1 (T=0.2)
546
+ type: pass@1
547
+ value: 17.68
548
+ verified: false
549
+ - task:
550
+ type: text-generation
551
+ dataset:
552
+ type: nuprl/MultiPL-E
553
+ name: MultiPL-HumanEval (TypeScript)
554
+ metrics:
555
+ - name: pass@1 (T=0.2)
556
+ type: pass@1
557
+ value: 25
558
+ verified: false
559
+
560
+ language:
561
+ - en
562
+ ---
563
+
564
+ # <span style="color: #7FFF7F;">Refact-1_6B-fim GGUF Models</span>
565
+
566
+ ## **Choosing the Right Model Format**
567
+
568
+ Selecting the correct model format depends on your **hardware capabilities** and **memory constraints**.
569
+
570
+ ### **BF16 (Brain Float 16) – Use if BF16 acceleration is available**
571
+ - A 16-bit floating-point format designed for **faster computation** while retaining good precision.
572
+ - Provides **similar dynamic range** as FP32 but with **lower memory usage**.
573
+ - Recommended if your hardware supports **BF16 acceleration** (check your device’s specs).
574
+ - Ideal for **high-performance inference** with **reduced memory footprint** compared to FP32.
575
+
576
+ 📌 **Use BF16 if:**
577
+ ✔ Your hardware has native **BF16 support** (e.g., newer GPUs, TPUs).
578
+ ✔ You want **higher precision** while saving memory.
579
+ ✔ You plan to **requantize** the model into another format.
580
+
581
+ 📌 **Avoid BF16 if:**
582
+ ❌ Your hardware does **not** support BF16 (it may fall back to FP32 and run slower).
583
+ ❌ You need compatibility with older devices that lack BF16 optimization.
584
+
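+ A quick, optional way to decide between the BF16 and F16 files is to query your hardware. This is a minimal sketch, assuming PyTorch is installed and using its BF16 capability check; any framework with a similar query works:
+ 
+ ```python
+ # Hypothetical helper: pick a GGUF filename based on BF16 support.
+ # File names match the "Included Files & Details" section below.
+ import torch
+ 
+ def pick_gguf_file() -> str:
+     # True on GPUs with native BF16 support (e.g., NVIDIA Ampere and newer).
+     if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
+         return "Refact-1_6B-fim-bf16.gguf"
+     return "Refact-1_6B-fim-f16.gguf"
+ 
+ print(pick_gguf_file())
+ ```
+ 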
585
+ ---
586
+
587
+ ### **F16 (Float 16) – More widely supported than BF16**
588
+ - A 16-bit floating-point format with **high precision**, but a smaller range of values than BF16.
589
+ - Works on most devices with **FP16 acceleration support** (including many GPUs and some CPUs).
590
+ - Slightly lower numerical precision than BF16 but generally sufficient for inference.
591
+
592
+ 📌 **Use F16 if:**
593
+ ✔ Your hardware supports **FP16** but **not BF16**.
594
+ ✔ You need a **balance between speed, memory usage, and accuracy**.
595
+ ✔ You are running on a **GPU** or another device optimized for FP16 computations.
596
+
597
+ 📌 **Avoid F16 if:**
598
+ ❌ Your device lacks **native FP16 support** (it may run slower than expected).
599
+ ❌ You have memory limitations.
600
+
601
+ ---
602
+
603
+ ### **Quantized Models (Q4_K, Q6_K, Q8, etc.) – For CPU & Low-VRAM Inference**
604
+ Quantization reduces model size and memory usage while maintaining as much accuracy as possible.
605
+ - **Lower-bit models (Q4_K)** → **Best for minimal memory usage**, may have lower precision.
606
+ - **Higher-bit models (Q6_K, Q8_0)** → **Better accuracy**, requires more memory.
607
+
608
+ 📌 **Use Quantized Models if:**
609
+ ✔ You are running inference on a **CPU** and need an optimized model.
610
+ ✔ Your device has **low VRAM** and cannot load full-precision models.
611
+ ✔ You want to reduce **memory footprint** while keeping reasonable accuracy.
612
+
613
+ 📌 **Avoid Quantized Models if:**
614
+ ❌ You need **maximum accuracy** (full-precision models are better for this).
615
+ ❌ Your hardware has enough VRAM for higher-precision formats (BF16/F16).
616
+
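+ For a rough sense of scale, here is a back-of-the-envelope estimate of weight storage for a ~1.6B-parameter model. The bits-per-weight figures are approximations for these quantization schemes and exclude runtime overhead such as the KV cache:
+ 
+ ```python
+ # Rough, assumption-laden size estimate: parameters * bits-per-weight / 8.
+ params = 1.6e9
+ for name, bits in [("F16/BF16", 16), ("Q8_0", 8.5), ("Q6_K", 6.6), ("Q4_K", 4.5)]:
+     gib = params * bits / 8 / 2**30
+     print(f"{name:9s} ~{gib:.1f} GiB")
+ ```
+ 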
617
+ ---
618
+
619
+ ### **Very Low-Bit Quantization (IQ3_XS, IQ3_S, IQ3_M, Q4_K, Q4_0)**
620
+ These models are optimized for **extreme memory efficiency**, making them ideal for **low-power devices** or **large-scale deployments** where memory is a critical constraint.
621
+
622
+ - **IQ3_XS**: Ultra-low-bit quantization (3-bit) with **extreme memory efficiency**.
623
+ - **Use case**: Best for **ultra-low-memory devices** where even Q4_K is too large.
624
+ - **Trade-off**: Lower accuracy compared to higher-bit quantizations.
625
+
626
+ - **IQ3_S**: Small block size for **maximum memory efficiency**.
627
+ - **Use case**: Best for **low-memory devices** where **IQ3_XS** is too aggressive.
628
+
629
+ - **IQ3_M**: Medium block size for better accuracy than **IQ3_S**.
630
+ - **Use case**: Suitable for **low-memory devices** where **IQ3_S** is too limiting.
631
+
632
+ - **Q4_K**: 4-bit quantization with **block-wise optimization** for better accuracy.
633
+ - **Use case**: Best for **low-memory devices** where **Q6_K** is too large.
634
+
635
+ - **Q4_0**: Pure 4-bit quantization, optimized for **ARM devices**.
636
+ - **Use case**: Best for **ARM-based devices** or **low-memory environments**.
637
+
638
+ ---
639
+
640
+ ### **Summary Table: Model Format Selection**
641
+
642
+ | Model Format | Precision | Memory Usage | Device Requirements | Best Use Case |
643
+ |--------------|------------|---------------|----------------------|---------------|
644
+ | **BF16** | Highest | High | BF16-supported GPU/CPUs | High-speed inference with reduced memory |
645
+ | **F16** | High | High | FP16-supported devices | GPU inference when BF16 isn’t available |
646
+ | **Q4_K** | Medium Low | Low | CPU or Low-VRAM devices | Best for memory-constrained environments |
647
+ | **Q6_K** | Medium | Moderate | CPU with more memory | Better accuracy while still being quantized |
648
+ | **Q8_0** | High | Moderate | CPU or GPU with enough VRAM | Best accuracy among quantized models |
649
+ | **IQ3_XS** | Very Low | Very Low | Ultra-low-memory devices | Extreme memory efficiency and low accuracy |
650
+ | **Q4_0** | Low | Low | ARM or low-memory devices | llama.cpp can optimize for ARM devices |
651
+
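+ As a minimal sketch of CPU inference with one of the quantized files, assuming `llama-cpp-python` is installed and the Q4_K file has already been downloaded (the local path is illustrative):
+ 
+ ```python
+ # pip install llama-cpp-python
+ from llama_cpp import Llama
+ 
+ llm = Llama(
+     model_path="./Refact-1_6B-fim-q4_k.gguf",  # adjust to wherever you saved the file
+     n_ctx=4096,   # the model card lists a 4096-token context
+     n_threads=8,  # tune to your CPU
+ )
+ 
+ out = llm("def print_hello_world():", max_tokens=64, temperature=0.2)
+ print(out["choices"][0]["text"])
+ ```
+ 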
652
+ ---
653
+
654
+ ## **Included Files & Details**
655
+
656
+ ### `Refact-1_6B-fim-bf16.gguf`
657
+ - Model weights preserved in **BF16**.
658
+ - Use this if you want to **requantize** the model into a different format.
659
+ - Best if your device supports **BF16 acceleration**.
660
+
661
+ ### `Refact-1_6B-fim-f16.gguf`
662
+ - Model weights stored in **F16**.
663
+ - Use if your device supports **FP16**, especially if BF16 is not available.
664
+
665
+ ### `Refact-1_6B-fim-bf16-q8_0.gguf`
666
+ - **Output & embeddings** remain in **BF16**.
667
+ - All other layers quantized to **Q8_0**.
668
+ - Use if your device supports **BF16** and you want a quantized version.
669
+
670
+ ### `Refact-1_6B-fim-f16-q8_0.gguf`
671
+ - **Output & embeddings** remain in **F16**.
672
+ - All other layers quantized to **Q8_0**.
673
+
674
+ ### `Refact-1_6B-fim-q4_k.gguf`
675
+ - **Output & embeddings** quantized to **Q8_0**.
676
+ - All other layers quantized to **Q4_K**.
677
+ - Good for **CPU inference** with limited memory.
678
+
679
+ ### `Refact-1_6B-fim-q4_k_s.gguf`
680
+ - Smallest **Q4_K** variant, using less memory at the cost of accuracy.
681
+ - Best for **very low-memory setups**.
682
+
683
+ ### `Refact-1_6B-fim-q6_k.gguf`
684
+ - **Output & embeddings** quantized to **Q8_0**.
685
+ - All other layers quantized to **Q6_K**.
686
+
687
+ ### `Refact-1_6B-fim-q8_0.gguf`
688
+ - Fully **Q8** quantized model for better accuracy.
689
+ - Requires **more memory** but offers higher precision.
690
+
691
+ ### `Refact-1_6B-fim-iq3_xs.gguf`
692
+ - **IQ3_XS** quantization, optimized for **extreme memory efficiency**.
693
+ - Best for **ultra-low-memory devices**.
694
+
695
+ ### `Refact-1_6B-fim-iq3_m.gguf`
696
+ - **IQ3_M** quantization, offering a **medium block size** for better accuracy.
697
+ - Suitable for **low-memory devices**.
698
+
699
+ ### `Refact-1_6B-fim-q4_0.gguf`
700
+ - Pure **Q4_0** quantization, optimized for **ARM devices**.
701
+ - Best for **low-memory environments**.
702
+ - Prefer IQ4_NL for better accuracy.
703
+
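+ To fetch just one of these files instead of cloning the whole repository, `hf_hub_download` from `huggingface_hub` works. The `repo_id` below is an assumption; point it at the repository you are actually viewing:
+ 
+ ```python
+ # pip install huggingface_hub
+ from huggingface_hub import hf_hub_download
+ 
+ path = hf_hub_download(
+     repo_id="Mungert/Refact-1_6B-fim-GGUF",   # assumed repo id -- replace if it differs
+     filename="Refact-1_6B-fim-q4_k.gguf",
+ )
+ print(path)  # local cache path of the downloaded GGUF file
+ ```
+ 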
704
+ # <span id="testllm" style="color: #7F7FFF;">🚀 If you find these models useful</span>
705
+
706
+ Please click like ❤. Also, I'd really appreciate it if you could test my Network Monitor Assistant at 👉 [Network Monitor Assistant](https://freenetworkmonitor.click/dashboard).
707
+
708
+ 💬 Click the **chat icon** (bottom right of the main and dashboard pages), choose an LLM, and toggle between the LLM types TurboLLM -> FreeLLM -> TestLLM.
709
+
710
+ ### What I'm Testing
711
+
712
+ I'm experimenting with **function calling** against my network monitoring service, using small open-source models. I'm focused on the question: how small can a model go and still function?
713
+
714
+ 🟡 **TestLLM** – Runs the current testing model using llama.cpp on 6 threads of a CPU VM (it should take about 15s to load; inference is quite slow and it only processes one user prompt at a time, so I'm still working on scaling). If you're curious, I'd be happy to share how it works!
715
+ 
716
+ ### The Other Available AI Assistants
717
+ 
718
+ 🟢 **TurboLLM** – Uses **gpt-4o-mini**. Fast! Note: tokens are limited since OpenAI models are pricey, but you can [Login](https://freenetworkmonitor.click) or [Download](https://freenetworkmonitor.click/download) the Free Network Monitor agent to get more tokens; alternatively, use the FreeLLM.
719
+ 
720
+ 🔵 **FreeLLM** – Runs **open-source Hugging Face models** at medium speed (unlimited, subject to Hugging Face API availability).
721
+
722
+
723
+
724
+
725
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/643a9dd0c5f633a7fa7e804a/HkB0QYV0BbmB3ktMugbZy.png)
726
+
727
+
728
+ # Refact-1.6B
729
+
730
+ Finally, the model we started training with our [blog post](https://refact.ai/blog/2023/applying-recent-innovations-to-train-model/) is ready 🎉
731
+
732
+ After fine-tuning on generated data, it beats Replit 3b, Stability Code 3b, and many other models. It almost beats
733
+ StarCoder, which is ten times its size!
734
+
735
+
736
+ Model | Size | HumanEval pass@1 | HumanEval pass@10 |
737
+ ----------------------|---------------|--------------------|--------------------|
738
+ DeciCoder-1b | 1b | 19.1% | |
739
+ <b>Refact-1.6-fim</b> | <b>1.6b</b> | <b>32.0%</b> | <b>53.0%</b> |
740
+ StableCode | 3b | 20.2% | 33.8% |
741
+ ReplitCode v1 | 3b | 21.9% | |
742
+ CodeGen2.5-multi | 7b | 28.4% | 47.5% |
743
+ CodeLlama | 7b | 33.5% | 59.6% |
744
+ StarCoder | 15b | 33.6% | |
745
+
746
+ It's likely the best model for practical code completion in your IDE, because it's smart and fast!
747
+ You can start using it right now by downloading the
748
+ [Refact plugin](https://refact.ai/). You can host the model yourself, too, using the
749
+ [open source docker container](https://github.com/smallcloudai/refact).
750
+
751
+ And it's multi-language (see MultiPL-HumanEval and other metrics below) and it works as a chat (see the section below).
752
+
753
+ # It Works As a Chat
754
+
755
+ The primary application of this model is code completion (infill) in multiple programming languages.
756
+ But it works as a chat quite well.
757
+
758
+ HumanEval results using instruction following (chat) format, against models specialized for chat only:
759
+
760
+ Model | Size | pass@1 | pass@10 |
761
+ -----------------------|--------|----------|----------|
762
+ <b>Refact-1.6-fim</b> | 1.6b | 38.4% | 55.6% |
763
+ StableCode-instruct | 3b | 26.9% | 36.2% |
764
+ OctoGeeX | 6b | 44.7% | |
765
+ CodeLlama-instruct | 7b | 34.8% | 64.3% |
766
+ CodeGen2.5-instruct | 7b | 36.2% | 60.87% |
767
+ CodeLlama-instruct | 13b | 42.7% | 71.6% |
768
+ StarChat-β | 15b | 33.5% | |
769
+ OctoCoder | 15b | 46.2% | |
770
+
771
+
772
+ # Example
773
+
774
+ Fill-in-the-middle uses special tokens to identify the prefix/middle/suffix part of the input and output:
775
+
776
+ ```python
777
+ # pip install -q transformers
778
+ from transformers import AutoModelForCausalLM, AutoTokenizer
779
+
780
+ checkpoint = "smallcloudai/Refact-1_6B-fim"
781
+ device = "cuda" # for GPU usage or "cpu" for CPU usage
782
+
783
+ tokenizer = AutoTokenizer.from_pretrained(checkpoint)
784
+ model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True).to(device)
785
+
786
+ prompt = '<fim_prefix>def print_hello_world():\n """<fim_suffix>\n print("Hello world!")<fim_middle>'
787
+
788
+ inputs = tokenizer.encode(prompt, return_tensors="pt").to(device)
789
+ outputs = model.generate(inputs, max_length=100, temperature=0.2)
790
+ print("-"*80)
791
+ print(tokenizer.decode(outputs[0]))
792
+ ```
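+ 
+ For the GGUF builds in this repo, the same fill-in-the-middle prompt can be run through llama.cpp bindings. A minimal sketch with `llama-cpp-python` (the model path is an assumption):
+ 
+ ```python
+ from llama_cpp import Llama
+ 
+ llm = Llama(model_path="./Refact-1_6B-fim-q4_k.gguf", n_ctx=4096)
+ 
+ # Same FIM special tokens as in the transformers example above.
+ fim_prompt = '<fim_prefix>def print_hello_world():\n    """<fim_suffix>\n    print("Hello world!")<fim_middle>'
+ out = llm(fim_prompt, max_tokens=64, temperature=0.2)
+ print(out["choices"][0]["text"])
+ ```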
793
+
794
+ # Chat Format
795
+
796
+ The same model works as chat (experimental).
797
+
798
+ ```python
799
+ prompt_template = "<empty_output>SYSTEM {system}\n" \
800
+ "<empty_output>USER {query}\n" \
801
+ "<empty_output>ASSISTANT"
802
+ prompt = prompt_template.format(system="You are a programming assistant",
803
+ query="How do I sort a list in Python?")
804
+ ```
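+ 
+ The snippet above only builds the prompt; to actually generate a reply, you can reuse the `model`, `tokenizer`, and `device` from the fill-in-the-middle example, roughly like this (the sampling settings are illustrative):
+ 
+ ```python
+ # Continues from the fill-in-the-middle example above (model, tokenizer, device).
+ inputs = tokenizer.encode(prompt, return_tensors="pt").to(device)
+ outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.2)
+ print(tokenizer.decode(outputs[0]))
+ ```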
805
+
806
+ # Architecture
807
+
808
+ As described in more detail in the blog post, we used:
809
+
810
+ - [ALiBi](https://arxiv.org/abs/2108.12409) based attention
811
+ - [LayerNorm](https://arxiv.org/abs/1607.06450v1) instead of [RMSNorm](https://arxiv.org/pdf/1910.07467.pdf)
812
+ - [Multi Query Attention](https://arxiv.org/abs/1911.02150)
813
+
814
+ We also used LiON, flash attention, and early dropout. None of this is so exotic that you can't run the model yourself; in fact you can -- see the Example section above and the small attention-bias sketch below.
815
+
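+ To make the ALiBi part concrete, here is a tiny illustrative sketch (not the model's actual implementation) of how ALiBi-style linear position biases can be added to attention logits:
+ 
+ ```python
+ import torch
+ 
+ def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
+     """Illustrative ALiBi bias: each head gets a fixed slope times the key-query distance."""
+     # Geometric slopes, as in the ALiBi paper, assuming n_heads is a power of two.
+     slopes = torch.tensor([2 ** (-8.0 * (i + 1) / n_heads) for i in range(n_heads)])
+     pos = torch.arange(seq_len)
+     distance = pos[None, :] - pos[:, None]          # distance[i, j] = j - i
+     # Shape (n_heads, seq_len, seq_len); added to attention logits before the softmax.
+     # Only the causal part (j <= i, where the bias is <= 0) matters in practice.
+     return slopes[:, None, None] * distance[None, :, :]
+ 
+ print(alibi_bias(n_heads=4, seq_len=8).shape)  # torch.Size([4, 8, 8])
+ ```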
816
+
817
+ # Pretraining
818
+
819
+ For the base model, we used our own dataset that contains code with permissive licenses only, and open text datasets.
820
+ Filtering is the key to the success of this model:
821
+
822
+ - We only used text in English
823
+ - Only topics related to computer science
824
+ - Applied heavy deduplication
825
+
826
+ The text-to-code proportion was 50:50, and the model was trained for 1.2T tokens.
827
+
828
+ We don't release the base model, because its Fill-in-the-Middle (FIM) capability likes to repeat itself too much, so
829
+ its practical use is limited. But if you still want it, write us a message on Discord.
830
+
831
+
832
+ # Finetuning
833
+
834
+ We tested our hypothesis that chat data should boost base model performance in FIM and
835
+ regular left-to-right code completion. We found that just 15% of open
836
+ [code](https://huggingface.co/datasets/bigcode/commitpackft)
837
+ [instruction-following](https://huggingface.co/datasets/rombodawg/2XUNCENSORED_MegaCodeTraining188k) datasets,
838
+ which we filtered for quality, improves almost all metrics.
839
+
840
+ Additionally, to improve FIM, we observed common failure modes, and prepared a synthetic dataset based on
841
+ [The Stack dedup v1.1](https://huggingface.co/datasets/bigcode/the-stack-dedup) to address them.
842
+
843
+ There is a distribution shift between typical code on the internet and the code you write in your IDE.
844
+ The former is likely finished, so the model tries to come up with a suggestion that completes the code.
845
+ As you work in your IDE, your code is likely half-written, and there is no single addition that can repair it
846
+ fully.
847
+
848
+ In practice, the model needs a tendency to stop after a couple of lines are added, and sometimes to write
849
+ nothing at all. We found that giving it empty completions, single-line completions, and multiline
850
+ completions that end with a smaller text indent or at least a newline makes it much more usable (see the illustrative sketch below). This data
851
+ was used as the remaining 85% of the finetune dataset.
852
+
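+ As an illustration of the idea rather than the actual pipeline, a hypothetical helper that turns a source file into FIM training strings biased toward empty or short middles might look like this:
+ 
+ ```python
+ import random
+ 
+ FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"
+ 
+ def make_fim_example(code: str, rng: random.Random) -> str:
+     """Hypothetical sketch: sample a short (possibly empty) middle span from `code`."""
+     lines = code.splitlines(keepends=True)
+     start = rng.randrange(len(lines) + 1)
+     # Bias toward empty or single-line middles, with occasional multi-line spans.
+     middle_len = rng.choice([0, 0, 1, 1, rng.randrange(0, 4)])
+     end = min(start + middle_len, len(lines))
+     prefix = "".join(lines[:start])
+     middle = "".join(lines[start:end])
+     suffix = "".join(lines[end:])
+     return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"
+ 
+ rng = random.Random(0)
+ print(make_fim_example("def add(a, b):\n    return a + b\n", rng))
+ ```
+ 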
853
+ The final model is the result of several attempts to make it work as well as possible for code completion,
854
+ and to perform well on a wide range of metrics. The best attempt took 40B tokens.
855
+
856
+
857
+ # Limitations and Bias
858
+
859
+ The Refact-1.6B model was trained on text in English, but it has seen many more languages in
860
+ code comments. Its performance on non-English languages is certainly lower.
861
+
862
+
863
+ # Model Stats
864
+
865
+ - **Architecture:** LLAMA-like model with multi-query attention
866
+ - **Objectives:** Fill-in-the-Middle, Chat
867
+ - **Tokens context:** 4096
868
+ - **Pretraining tokens:** 1.2T
869
+ - **Finetuning tokens:** 40B
870
+ - **Precision:** bfloat16
871
+ - **GPUs:** 64 NVIDIA A5000
872
+ - **Training time:** 28 days
873
+
874
+
875
+ # License
876
+
877
+ The model is licensed under the BigScience OpenRAIL-M v1 license agreement.
878
+
879
+
880
+ # Citation
881
+
882
+ If you are using this model, please give a link to this page.