---
license: apache-2.0
language:
- en
base_model:
- microsoft/Phi-4-mini-instruct
---

### Phi-3 Large Model and AWQ Quantization Principle

#### 1. Introduction to the Phi-3 Large Model 🤖

Phi-3 is a lightweight large language model (LLM) released by Microsoft. It is positioned as a **high-performance, low-resource-consumption instruction-tuned model**, suited to edge devices and lightweight deployment scenarios. Its design goal is to retain strong language understanding and generation capability while reducing computational cost through architectural optimization (such as a smaller parameter count and an improved attention mechanism), enabling fast inference and a low memory footprint.

- **Core Features**:
  - Optimized for instruction-following tasks, performing well in conversation, question answering, and text generation 🗣️❓✍️.
  - Offered in several parameter sizes (such as the Mini version) to balance performance and deployment cost.
  - Supports multiple languages and contextual understanding, adapting to diverse application needs 🌐🧠.

#### 2. AWQ Quantization Principle (AutoAWQ) 📉

AWQ (**Activation-aware Weight Quantization**) is an automated weight quantization technique that converts high-precision neural network weights (such as 16-bit or 32-bit floating point) into low-precision representations (such as 4-bit integers). It significantly reduces model size and computational load with almost no loss of model quality.

- **Core Advantages**:
  - **Model Size Compression**: 4-bit quantization shrinks the weights to roughly 1/4 to 1/8 of their original size (relative to FP16/FP32), reducing storage and transmission costs 💾📡.
  - **Inference Acceleration**: Low-precision computation (such as INT4) benefits from optimized GPU matrix kernels (such as GEMM kernels), typically speeding up inference by 2-4x ⚡.
  - **Performance Retention**: Group-wise quantization and adaptive calibration minimize quantization error, keeping accuracy close to the full-precision model 📊.
- **Key Technologies** (see the quantization sketch below):
  - **Group-wise Quantization**: Weights are grouped per layer or per channel, and the quantization parameters (scale factor, zero point) are computed independently for each group to balance precision and compression rate 🧮.
  - **Automatic Calibration**: Quantization parameters are optimized automatically from a small calibration set (roughly 100-1000 samples), removing the need for manual tuning 🛠️.
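#### 3. Producing an AWQ-Quantized Checkpoint (Sketch)

For reference, this is a minimal sketch of how a quantized checkpoint like the one used below can be produced with AutoAWQ. The source model name `microsoft/Phi-3-mini-4k-instruct` and the config values (`w_bit=4`, `q_group_size=128`, GEMM kernels) are illustrative assumptions, not taken from this repository; the output directory matches the `quant_path` used in the inference code that follows.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Illustrative source checkpoint and output path (assumptions, not part of this repo)
model_path = "microsoft/Phi-3-mini-4k-instruct"
quant_path = "Phi-3-mini-4k-instruct-awq"

# 4-bit weights, group size 128: one (scale, zero point) pair per group of 128 weights
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the full-precision model and its tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize: AutoAWQ runs calibration samples through the model to choose
# per-group scales, then packs the weights into 4-bit format
model.quantize(tokenizer, quant_config=quant_config)

# Save the quantized weights and tokenizer for later inference
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```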
### Complete Inference Code 📝

The following code runs inference with the AWQ-quantized Phi-3 model, covering prompt template construction, streaming output, and the generation configuration:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, TextStreamer
from awq.utils.utils import get_best_device

# ----------------------
# 1. Device and Model Configuration
# ----------------------
device = get_best_device()                 # Automatically select the best device (GPU/CPU)
quant_path = "Phi-3-mini-4k-instruct-awq"  # Path of the AWQ-quantized model

# Load the quantized model and tokenizer
model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True)  # Fuse layers to speed up inference
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)

# Create a streaming outputter (prints generated text in real time)
streamer = TextStreamer(
    tokenizer,
    skip_prompt=True,          # Do not echo the input prompt
    skip_special_tokens=True   # Skip special tokens (such as end-of-sequence markers)
)

# ----------------------
# 2. Prompt Template Definition
# ----------------------
prompt_template = """\
<|system|>
<|user|>
{prompt}
<|assistant|>"""  # Instruction template format expected by Phi-3

# Example question: a classic geography puzzle
prompt = "You're standing on the surface of the Earth. " \
         "You walk one mile south, one mile west and one mile north. " \
         "You end up exactly where you started. Where are you?"

# ----------------------
# 3. Input Processing and Generation
# ----------------------
# Convert the prompt into model input tensors
tokens = tokenizer(
    prompt_template.format(prompt=prompt),  # Fill the prompt template
    return_tensors="pt"                     # Return PyTorch tensors
).input_ids.to(device)                      # Move to the target device (GPU/CPU)

# Generate text (key parameters explained below)
generation_output = model.generate(
    tokens,
    streamer=streamer,        # Enable streaming output
    max_new_tokens=512,       # Limit the number of newly generated tokens
    do_sample=True,           # Enable sampling so temperature/top_p take effect
    temperature=0.7,          # Randomness of the output (lower = more deterministic)
    top_p=0.9,                # Nucleus sampling threshold, used together with temperature
    repetition_penalty=1.2    # Penalize repeated content to avoid generation loops
)
```

### Key Explanations of the Code 📌

1. **Model Loading**:
   - `from_quantized` loads the AWQ-quantized model directly, and `fuse_layers=True` fuses layers in the computational graph to improve inference speed ⚙️.
2. **Prompt Template**:
   - Follows the Phi-3 chat format (with `<|user|>` and `<|assistant|>` markers) and must strictly match the instruction template structure used during training 📄.
3. **Generation Parameters**:
   - `max_new_tokens` bounds only the newly generated tokens (unlike `max_seq_len`/`max_length`-style limits that include the prompt), so the output budget is not affected by the input length 📏.
   - `temperature` and `top_p` control output diversity and suit open-ended generation (such as creative writing); for deterministic output (such as factual question answering), disable sampling with `do_sample=False`, as shown in the sketch below 🎨🔢.

With the code above, the Phi-3 model can run efficiently in resource-constrained environments, and AWQ quantization delivers low-cost, high-speed text generation 🚀.
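For reference, here is a minimal sketch of the deterministic variant mentioned above. It reuses the `model`, `tokenizer`, `streamer`, and `tokens` objects from the inference code; the `max_new_tokens` value and the variable names are illustrative.

```python
# Deterministic (greedy) decoding: sampling disabled, so temperature/top_p are not used
deterministic_output = model.generate(
    tokens,
    streamer=streamer,       # Still stream the answer token by token
    max_new_tokens=256,      # Shorter budget for a short factual answer (illustrative value)
    do_sample=False,         # Greedy decoding: always pick the highest-probability token
    repetition_penalty=1.2   # Keep the loop-avoidance penalty from the sampling setup
)

# Decode the full sequence (prompt + answer) if the text is needed beyond streaming
answer = tokenizer.decode(deterministic_output[0], skip_special_tokens=True)
print(answer)
```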