Transformers Streaming Output
Published March 15, 2025
Introduction
With the advancement of AI-driven chatbots, interactive learning has become more engaging. In this blog, we will explore how to build a chatbot with streaming output using Python, Gradio, and a Qwen-based language model.
Prerequisites
Before we start, ensure you have the following installed:
```bash
pip install gradio transformers torch accelerate bitsandbytes
```

accelerate is required for `device_map="auto"`, and bitsandbytes is required to load the pre-quantized 4-bit checkpoint used below.
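To quickly verify the installation (and check whether a GPU is visible), you can run a short sanity check. Nothing here is specific to this tutorial; it only prints whatever versions your environment happens to have:

```python
import torch
import transformers
import gradio

print("transformers:", transformers.__version__)
print("gradio:", gradio.__version__)
# bitsandbytes 4-bit loading generally expects a CUDA GPU
print("CUDA available:", torch.cuda.is_available())
```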
Code Implementation
```python
import gradio as gr  # Gradio for the web chat interface
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer  # Model, tokenizer, and streaming helper
from threading import Thread  # Run generation concurrently so output can stream
import time  # Optional delay to smooth the streamed output

model_name = "unsloth/DeepSeek-R1-Distill-Qwen-1.5B-unsloth-bnb-4bit"  # Model name or local path

# Load the pre-trained model with automatic dtype selection and device mapping
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# Load the tokenizer associated with the model
tokenizer = AutoTokenizer.from_pretrained(model_name)


def QwenChat(message, history):
    # Construct the messages list from the system prompt, chat history, and new user message
    messages = [
        {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    ]
    messages.extend(history)  # history already uses {"role": ..., "content": ...} dicts
    messages.append({"role": "user", "content": message})

    # Apply the chat template to format the messages for the model
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    # Set up the streamer that yields decoded tokens as they are generated
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

    # Tokenize the prompt and move it to the model's device
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

    # Generation arguments: cap the new tokens and attach the streamer
    generation_args = {
        "max_new_tokens": 512,
        "streamer": streamer,
        **model_inputs,
    }

    # Run generation in a separate thread so this function can stream output while it runs
    thread = Thread(
        target=model.generate,
        kwargs=generation_args,
    )
    thread.start()

    # Accumulate and yield text as tokens arrive from the streamer
    acc_text = ""
    for text_token in streamer:
        time.sleep(0.01)  # Small delay to make the stream feel smoother
        acc_text += text_token
        yield acc_text

    # Ensure the generation thread completes
    thread.join()


# Create a Gradio chat interface backed by QwenChat; type="messages" uses role/content dicts
demo = gr.ChatInterface(fn=QwenChat, type="messages")

# Launch the interface, listening on all network interfaces
demo.launch(server_name="0.0.0.0")
```
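If you want to see the streaming mechanism in isolation, the same `TextIteratorStreamer` pattern works in a plain console script. This is a minimal sketch that reuses the `model` and `tokenizer` loaded above and prints chunks as they arrive instead of yielding them to Gradio:

```python
from threading import Thread
from transformers import TextIteratorStreamer

# Build a one-turn prompt with the model's chat template
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Say hello in three languages."}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer([prompt], return_tensors="pt").to(model.device)

# Generate in a background thread and print each decoded chunk immediately
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
thread = Thread(target=model.generate, kwargs={"max_new_tokens": 128, "streamer": streamer, **inputs})
thread.start()

for chunk in streamer:
    print(chunk, end="", flush=True)
thread.join()
print()
```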
Features of This AI Tutor
- Real-time responses: text appears token by token as the model generates it.
- Interactive learning: users can practice conversations with an AI tutor.
- Customizable: modify the system prompt to tailor the teaching style (see the sketch after this list).
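For example, changing the teaching style is a one-line edit to the system message inside `QwenChat`. The prompt text below is just an illustration; substitute whatever persona you want:

```python
# Hypothetical replacement for the system message in QwenChat:
# only the "content" string changes, the rest of the function stays the same.
messages = [
    {
        "role": "system",
        "content": (
            "You are a patient English tutor. Correct the user's grammar, "
            "explain each correction briefly, and keep your answers short."
        ),
    },
]
```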
How It Works
- The user enters a message.
- The system constructs a chat-template prompt that includes the system prompt and the previous conversation (see the sketch after this list).
- The model processes the input in a background thread and streams the response token by token.
- The response appears gradually, simulating a natural conversation.
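To make the second step concrete, here is roughly what the `messages` list looks like after one exchange. The turn contents are invented for illustration; with `type="messages"`, Gradio already stores history as role/content dictionaries, so `QwenChat` can extend it directly:

```python
# Illustrative state of `messages` after one prior exchange
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},                    # from history
    {"role": "assistant", "content": "Hi! How can I help?"},  # from history
    {"role": "user", "content": "Teach me a new word."},      # the new message
]

# apply_chat_template(..., tokenize=False, add_generation_prompt=True) then
# serializes this list into a single prompt string wrapped in the model's
# chat special tokens, ending where the assistant's reply should begin.
```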
Conclusion
This approach offers an **engaging way to learn using AI**. By integrating streaming output, students can experience dynamic, realistic interactions rather than static responses.