A Guide to Running Qwen 3 Locally with Ollama and vLLM
The field of artificial intelligence is witnessing an unprecedented acceleration, with Large Language Models (LLMs) becoming increasingly sophisticated. The release of Qwen 3 in April 2025 marks another significant milestone, offering a suite of models that demonstrate state-of-the-art performance in reasoning, coding, and multilingual tasks. Notably, the Qwen team has released several models from this family as open weights under the Apache 2.0 license, democratizing access to powerful AI tools.
While cloud-based APIs provide convenient access, the ability to run these advanced models locally on your own hardware offers compelling advantages in terms of privacy, cost, customization, and offline operation. This guide explores the Qwen 3 model family and provides practical instructions for running them locally using two popular frameworks: Ollama, known for its simplicity, and vLLM, optimized for high-performance serving.
Unpacking Qwen 3: Architecture, Capabilities, and Performance
Qwen 3 represents a significant architectural and training paradigm shift from its predecessors. It introduces a diverse set of models built using both dense and sparse Mixture-of-Experts (MoE) techniques, alongside innovative features designed to enhance reasoning and usability.
A Dual Architecture Strategy: Dense and Sparse Models
Qwen 3 offers two distinct architectural pathways:
Dense Models: These models, ranging from 0.6B to 32B parameters (`Qwen3-0.6B`, `1.7B`, `4B`, `8B`, `14B`, `32B`), utilize a traditional Transformer architecture where all parameters are engaged during inference. Key characteristics include:
- Scalability: Models vary in depth (layers) and width (attention heads) to provide different performance tiers.
- Efficiency Techniques: Grouped-Query Attention (GQA) is implemented across the board to reduce the computational overhead of the attention mechanism, especially relevant for long context lengths. Smaller models also employ tied word embeddings. (A minimal GQA sketch follows this list.)
- Context Windows: Models up to 4B parameters support a 32K token context window, while the 8B, 14B, and 32B variants handle an impressive 128K tokens, allowing them to process and generate much longer sequences of text.
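To make the efficiency gain concrete, here is a minimal, illustrative sketch in PyTorch (a toy, not Qwen 3's actual implementation) of grouped-query attention: several query heads share each key/value head, so the KV cache shrinks by the Q-to-KV ratio. The 32 / 8 head split mirrors several entries in the table that follows.

```python
import torch

# Toy grouped-query attention shapes (illustrative only, not Qwen 3's real code).
# With 32 query heads sharing 8 KV heads, the KV cache is 4x smaller than
# full multi-head attention at the same hidden size.
batch, seq_len, head_dim = 1, 1024, 128
n_q_heads, n_kv_heads = 32, 8          # matches the 32 / 8 split in the table below
group = n_q_heads // n_kv_heads        # 4 query heads per KV head

q = torch.randn(batch, n_q_heads, seq_len, head_dim)
k = torch.randn(batch, n_kv_heads, seq_len, head_dim)
v = torch.randn(batch, n_kv_heads, seq_len, head_dim)

# Expand each KV head so it is shared by its group of query heads.
k_shared = k.repeat_interleave(group, dim=1)   # (1, 32, seq, dim)
v_shared = v.repeat_interleave(group, dim=1)

attn = torch.softmax(q @ k_shared.transpose(-2, -1) / head_dim**0.5, dim=-1)
out = attn @ v_shared
print(out.shape)                               # torch.Size([1, 32, 1024, 128])
print(f"KV cache reduction: {n_q_heads / n_kv_heads:.0f}x")
```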
Model | Layers | Attention Heads (Q / KV) | Tie Embeddings | Max Context |
---|---|---|---|---|
Qwen3-0.6B | 28 | 16 / 8 | Yes | 32K |
Qwen3-1.7B | 28 | 16 / 8 | Yes | 32K |
Qwen3-4B | 36 | 32 / 8 | Yes | 32K |
Qwen3-8B | 36 | 32 / 8 | No | 128K |
Qwen3-14B | 40 | 40 / 8 | No | 128K |
Qwen3-32B | 64 | 64 / 8 | No | 128K |

Mixture-of-Experts (MoE) Models: Qwen 3 features two MoE models (`Qwen3-30B-A3B`, `Qwen3-235B-A22B`). These leverage sparsity for computational efficiency.
- Selective Computation: Each MoE layer contains numerous 'expert' sub-networks (128 in Qwen 3). During inference, only a small fraction of these experts (8 per token) are activated by a routing mechanism (see the toy routing sketch after the table below).
- Performance vs. Cost: This sparse activation allows MoE models to achieve performance characteristic of their large total parameter count (30B or 235B) while incurring inference costs closer to their smaller activated parameter count (3B or 22B, respectively). This drastically reduces computational requirements compared to a dense model of equivalent total size.
- Large Context: Both MoE models support the extended 128K context length.
Model | Layers | Attention Heads (Q / KV) | # Experts (Total / Activated) | Max Context |
---|---|---|---|---|
Qwen3-30B-A3B | 48 | 32 / 4 | 128 / 8 | 128K |
Qwen3-235B-A22B | 94 | 64 / 4 | 128 / 8 | 128K |
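The following toy sketch (an illustration under simplified assumptions, not the Qwen 3 routing code) shows the core idea behind sparse activation: a router scores all 128 experts for each token, but only the top 8 are actually executed.

```python
import torch

# Toy top-k expert routing (illustrative only): 128 experts, 8 active per token,
# mirroring the Qwen 3 MoE configuration in the table above.
n_experts, top_k, hidden = 128, 8, 64
token = torch.randn(hidden)                       # one token's hidden state
router = torch.nn.Linear(hidden, n_experts)       # routing network
experts = [torch.nn.Linear(hidden, hidden) for _ in range(n_experts)]

logits = router(token)
weights, chosen = torch.topk(torch.softmax(logits, dim=-1), top_k)
weights = weights / weights.sum()                 # renormalise over the chosen experts

# Only the 8 selected experts run; the remaining 120 cost nothing for this token.
output = sum(w * experts[i](token) for w, i in zip(weights, chosen.tolist()))
print(f"Activated {top_k} of {n_experts} experts; output shape: {tuple(output.shape)}")
```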
Core Technical Innovations
Beyond the base architecture, Qwen 3 introduces key features:
Hybrid Thinking Modes: A novel capability allowing fine-grained control over the model's inference process.
- Thinking Mode (Default): The model performs internal step-by-step reasoning (generating a Chain-of-Thought) before producing the final answer. This enhances performance on complex tasks requiring deliberation. Specific framework integrations can expose or utilize this internal thought process.
- Non-Thinking Mode: The model provides a direct, faster response, suitable for simpler queries where latency is critical.
Users can toggle between these modes, potentially dynamically within a conversation (using `/think` and `/no_think` tags where supported), offering a powerful way to manage the trade-off between reasoning depth and computational cost/speed.
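As a hedged illustration of this soft switch, the snippet below sends the same question to a local OpenAI-compatible endpoint twice, once normally and once with `/no_think` appended. The base URL and model name are placeholders for whichever server you set up later in this guide, and whether the tag is honoured depends on the model build and framework version.

```python
from openai import OpenAI

# Assumes a local OpenAI-compatible server (here: Ollama's default port);
# adjust base_url and model for vLLM or another setup.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

question = "What is 17 * 23?"

# Default behaviour: the model may reason step by step before answering.
thoughtful = client.chat.completions.create(
    model="qwen3:8b",
    messages=[{"role": "user", "content": question}],
)

# Soft switch: appending /no_think requests a direct, low-latency answer
# (supported in recent Qwen 3 builds; behaviour can vary by framework).
direct = client.chat.completions.create(
    model="qwen3:8b",
    messages=[{"role": "user", "content": question + " /no_think"}],
)

print(thoughtful.choices[0].message.content)
print(direct.choices[0].message.content)
```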
Expansive Multilingualism: Pre-trained on data covering 119 languages and dialects, Qwen 3 models possess strong cross-lingual capabilities, making them highly versatile for global applications.
Advanced Training Regimen:
- Foundation: Pre-training utilized trillions of tokens, including specific stages focused on long-context data to enable the large context windows.
- Post-training Pipeline: A multi-stage process refined the models: (1) Supervised Fine-Tuning (SFT) on long Chain-of-Thought data for reasoning, (2) Reinforcement Learning (RL) focused on enhancing these reasoning skills, (3) Fusion stage to integrate non-thinking fast responses with the reasoning model, and (4) General RL across diverse tasks for overall instruction following, safety, and agentic behavior alignment.
Benchmark Standing
Qwen 3 models demonstrate highly competitive results:
- The flagship `Qwen3-235B-A22B` rivals other top-tier models (like DeepSeek-R1, Grok-3, Gemini-2.5-Pro) on major benchmarks.
- The `Qwen3-30B-A3B` MoE model significantly outperforms previous dense models of similar size, proving the efficiency of the MoE approach.
- Qwen 3 dense models generally match or surpass the performance of larger Qwen 2.5 models (e.g., `Qwen3-4B` ≈ `Qwen2.5-72B-Instruct` capabilities), especially in STEM and coding.
- The MoE architecture provides performance comparable to much larger dense models while using only a fraction of the active parameters during inference.
The Case for Local Execution
Running models like Qwen 3 on your own hardware offers distinct advantages:
- Data Sovereignty: Keep your prompts and data entirely on your system, essential for privacy, confidentiality, and sensitive information.
- Predictable Costs: Avoid potentially escalating API fees. The main costs are hardware and electricity.
- Offline Operation: Functionality independent of internet connectivity (post-download).
- Deep Customization: Enables fine-tuning on specific datasets for tailored behavior.
- Reduced Latency: Eliminate network round-trips for faster, more interactive experiences.
- Unfettered Exploration: Experiment freely without usage limits or quotas.
Path 1: Running Qwen 3 with Ollama (Focus: Simplicity)
Ollama excels at making local LLM execution incredibly straightforward. It's ideal for developers, enthusiasts, and users with moderately powerful hardware (including Apple Silicon).
1. Get Ollama: Download and install Ollama from ollama.com for your OS (macOS, Linux, Windows).
2. Run Qwen 3 Models: Ollama manages model downloads automatically. Use the `ollama run` command with the appropriate model tag. Available tags can be found on ollama.com/library/qwen3.
```bash
# Example: Run the 8B parameter Qwen 3 model
ollama run qwen3:8b

# Example: Run the 30B MoE model
ollama run qwen3:30b-a3b
```
This command downloads the model if necessary and starts an interactive command-line chat.
3. Interact: Chat directly in the terminal after running the command. Ollama also starts a background API server (usually http://localhost:11434) compatible with the OpenAI API standard, allowing programmatic interaction.
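For example, a minimal programmatic call against the local server might look like the sketch below. It uses Ollama's native /api/chat route; the OpenAI-compatible /v1/chat/completions route works equally well with standard OpenAI clients. The model tag is assumed to be one you have already pulled.

```python
import requests

# Minimal call to the local Ollama server via its native chat endpoint.
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "qwen3:8b",
        "messages": [{"role": "user", "content": "Summarise Grouped-Query Attention in one sentence."}],
        "stream": False,
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["message"]["content"])
```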
4. Hardware:
- RAM: Crucial. Even small models need several GB. Larger models (8B+) require 16GB, 32GB, or even 64GB+, depending on quantization (a rough sizing sketch follows this list).
- GPU (VRAM): Highly recommended for performance. Ollama leverages NVIDIA GPUs (via CUDA) and Apple Silicon (via Metal). Sufficient VRAM allows the entire model to run on the GPU, drastically speeding up responses.
- CPU: Fallback option, but significantly slower.
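As a rough sizing aid, weight memory is approximately the parameter count multiplied by the bytes per parameter at a given quantization level. The helper below is a hypothetical back-of-the-envelope estimator, not an Ollama utility; it ignores KV cache and runtime overhead, and the parameter counts and per-parameter sizes are approximate.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Ignores KV cache, activations, and runtime overhead, so treat results
# as lower bounds when sizing RAM/VRAM. Values are approximate.
BYTES_PER_PARAM = {"fp16": 2.0, "q8_0": 1.0, "q4_K_M": 0.5}

def estimate_gb(params_billion: float, quant: str) -> float:
    return params_billion * 1e9 * BYTES_PER_PARAM[quant] / 1024**3

for model, size in [("qwen3:8b", 8.2), ("qwen3:32b", 32.8)]:
    for quant in ("fp16", "q4_K_M"):
        print(f"{model} @ {quant}: ~{estimate_gb(size, quant):.1f} GB of weights")
```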
Ollama Use Cases: Quick experiments, local development, personal chat assistants, learning LLM concepts, running small-to-medium models effectively on consumer hardware.
Path 2: Running Qwen 3 with vLLM (Focus: Performance)
vLLM is a serving library optimized for high throughput and low latency, incorporating techniques like PagedAttention. It's suited for building applications, handling concurrent requests, and deploying large models, often requiring more powerful hardware (especially NVIDIA GPUs).
1. Get vLLM: Install using pip, preferably in a virtual environment. Ensure you have compatible NVIDIA drivers and the CUDA toolkit installed.
```bash
pip install -U vllm
```
2. Serve Qwen 3 Models: Use the `vllm serve` command. vLLM offers Day 0 support for Qwen 3, but careful configuration is needed, especially for reasoning modes.
```bash
# Example: Serve the Qwen3-30B MoE model enabling reasoning
vllm serve Qwen/Qwen3-30B-A3B \
  --enable-reasoning \
  --reasoning-parser deepseek_r1

# Example: Serve the large 235B MoE model using FP8 quantization and 4 GPUs
vllm serve Qwen/Qwen3-235B-A22B-FP8 \
  --enable-reasoning \
  --reasoning-parser deepseek_r1 \
  --tensor-parallel-size 4
```
Key Arguments:
- `Qwen/Qwen3-...`: The model identifier (often from Hugging Face). `FP8` indicates 8-bit quantization to save memory.
- `--enable-reasoning`: Essential flag to activate Qwen 3's hybrid thinking capabilities within vLLM.
- `--reasoning-parser deepseek_r1`: Crucial for Qwen 3. Tells vLLM how to interpret the model's thinking output format. Per the Qwen 3 release info, use `deepseek_r1` for vLLM (SGLang uses a different `qwen3` parser).
- `--tensor-parallel-size N`: Distributes the model across `N` GPUs, necessary for models too large for a single GPU's memory.
3. Interact: The `vllm serve` command starts an OpenAI-compatible API server (default: http://localhost:8000). Interact using standard clients:
- curl (the `model` field must match the served model):
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-30B-A3B",
    "messages": [{"role": "user", "content": "Explain Grouped-Query Attention."}],
    "max_tokens": 200
  }'
```
- Python OpenAI Client:
```python
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="no-key-needed")
response = client.chat.completions.create(
model="Qwen/Qwen3-30B-A3B", # Match the served model
messages=[{"role": "user", "content": "Write Python code to list files in a directory."}],
max_tokens=150
)
print(response.choices[0].message.content)
```
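When the server is launched with the reasoning flags shown above, vLLM's OpenAI-compatible responses can expose the parsed thinking separately from the final answer. The sketch below assumes the `reasoning_content` field described in vLLM's reasoning-outputs documentation; verify the exact field name against your installed version.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="no-key-needed")

response = client.chat.completions.create(
    model="Qwen/Qwen3-30B-A3B",  # Match the served model
    messages=[{"role": "user", "content": "Is 9.11 larger than 9.9? Think it through."}],
    max_tokens=500,
)

message = response.choices[0].message
# With --enable-reasoning / --reasoning-parser set, the parsed thinking is
# typically returned alongside the answer; fall back gracefully if absent.
print("Thinking:", getattr(message, "reasoning_content", None))
print("Answer:", message.content)
```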
4. Hardware: Requires capable NVIDIA GPUs. Larger models or high throughput demands necessitate significant VRAM and potentially multiple GPUs for tensor parallelism.
vLLM Use Cases: Building LLM-powered applications, serving models with high concurrency demands, production deployments, running extremely large models efficiently using multi-GPU setups.
Ollama vs. vLLM: Selecting Your Framework
Feature | Ollama | vLLM |
---|---|---|
Primary Goal | Ease of Use, Accessibility | High Performance, Throughput, Scalability |
Setup | Minimal, integrated downloads | Requires Python env, CUDA setup, more configuration |
Performance | Good (esp. Apple Silicon/NVIDIA) | Excellent, optimized (PagedAttention) |
Hardware | More forgiving, good CPU fallback | Primarily targets NVIDIA GPUs, multi-GPU focus |
Large Models | Can run large models (if RAM/VRAM fits) | Better suited via tensor parallelism |
Use Cases | Dev, testing, personal use, learning | Apps, production serving, high concurrency |
Qwen 3 Reasoning | May handle automatically/via template | Requires explicit flags (`--enable-reasoning`, `--reasoning-parser`) |
Tapping into Qwen 3's Advanced Features Locally
With Qwen 3 running locally via either framework:
- Hybrid Thinking Control: When using the API, investigate how to pass parameters (like `enable_thinking` if using a direct Hugging Face style interaction, though API servers might abstract this) or use the `/think` and `/no_think` tags in your prompts within multi-turn chat requests to guide the model's reasoning process. Check the specific documentation for your chosen framework (Ollama or vLLM client libraries) on how they expose this control.
- Agentic Frameworks: Integrate your local endpoint (e.g., http://localhost:11434/v1 for Ollama, http://localhost:8000/v1 for vLLM) into agentic frameworks like Qwen-Agent by configuring the `model_server` or equivalent setting. This allows you to build complex workflows leveraging Qwen 3's tool-use capabilities running entirely on your hardware.
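As a hedged sketch of that integration (class and configuration names follow Qwen-Agent's published examples; check the project's README for the current interface), the snippet below points an Assistant agent at a local Ollama endpoint:

```python
from qwen_agent.agents import Assistant

# LLM config pointing Qwen-Agent at a local OpenAI-compatible endpoint
# (swap base URL and model for vLLM: http://localhost:8000/v1, Qwen/Qwen3-30B-A3B).
llm_cfg = {
    "model": "qwen3:8b",
    "model_server": "http://localhost:11434/v1",
    "api_key": "EMPTY",
}

# 'code_interpreter' is one of Qwen-Agent's bundled tools; any registered tool works here.
agent = Assistant(llm=llm_cfg, function_list=["code_interpreter"])

messages = [{"role": "user", "content": "Use Python to compute the 20th Fibonacci number."}]
responses = []
for responses in agent.run(messages=messages):
    pass  # the generator streams the growing list of response messages
print(responses[-1])
```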
Conclusion
Qwen 3 offers a compelling suite of open-weight models that push the performance envelope. Tools like Ollama and vLLM bridge the gap between these powerful models and practical local execution. Ollama provides an unparalleled entry point for ease of use, while vLLM delivers the raw performance needed for demanding applications. By choosing the right tool for your needs and hardware, you can harness the capabilities of state-of-the-art AI like Qwen 3 directly on your own machine, unlocking new possibilities for private, cost-effective, and customized AI development.