Today we're making the biggest release in smolagents so far: šš² š²š»š®šÆš¹š² šš¶šš¶š¼š» šŗš¼š±š²š¹š, ššµš¶š°šµ š®š¹š¹š¼šš šš¼š šš¼ šÆšš¶š¹š± š½š¼šš²šæš³šš¹ šš²šÆ šÆšæš¼ššš¶š»š“ š®š“š²š»šš! š„³
Our agents can now casually open up a web browser and navigate it by scrolling, clicking elements on the page, and going back, just like a user would.
The demo below shows Claude-3.5-Sonnet browsing GitHub for the task: "Find how many commits the author of the current top trending repo made over the last year." Hi @mlabonne!
Go try it out, it's the most cracked agentic stuff I've seen in a while š¤Æ (well, along with OpenAI's Operator, which beat us by one day)
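For the curious, here is a minimal sketch of how such a browsing agent can be wired up with smolagents plus helium for browser control. The tool set, the screenshot callback, and the model id below are illustrative assumptions, not the exact code behind the demo.

```python
import helium
from io import BytesIO
from PIL import Image
from smolagents import CodeAgent, LiteLLMModel, tool

@tool
def click_element(text: str) -> str:
    """Click the on-screen element whose visible text matches `text`.

    Args:
        text: Visible text of the link or button to click.
    """
    helium.click(text)
    return f"Clicked '{text}'"

@tool
def scroll_page(pixels: int) -> str:
    """Scroll the current page down.

    Args:
        pixels: Number of pixels to scroll by.
    """
    helium.scroll_down(pixels)
    return f"Scrolled down {pixels}px"

@tool
def go_back() -> str:
    """Navigate back to the previous page."""
    helium.get_driver().back()
    return "Went back one page"

def attach_screenshot(memory_step, agent):
    # After each step, give the vision model a screenshot of the page as a visual observation.
    png = helium.get_driver().get_screenshot_as_png()
    memory_step.observations_images = [Image.open(BytesIO(png))]

helium.start_chrome("https://github.com/trending", headless=False)
agent = CodeAgent(
    tools=[click_element, scroll_page, go_back],
    model=LiteLLMModel(model_id="anthropic/claude-3-5-sonnet-20241022"),  # assumed model id
    step_callbacks=[attach_screenshot],
)
agent.run("Find how many commits the author of the current top trending repo made over the last year.")
```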
smolagents can see š„
we just shipped vision support to smolagents š¤
agentic computers FTW
you can now:
š» let the agent fetch images dynamically (e.g. an agentic web browser)
š pass images at agent initialization (e.g. chatting with documents, filling forms automatically, etc.) with only a few lines of code changed! š¤Æ
you can use transformers models locally (like Qwen2-VL) OR plug in your favorite multimodal inference provider (gpt-4o, Anthropic & co) š¤
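Here is a minimal sketch of what passing images at agent initialization can look like; the model ids, the `images` argument to `run`, and the file name are assumptions based on the public smolagents docs, not the release's exact example.

```python
from PIL import Image
from smolagents import CodeAgent, OpenAIServerModel

# Local alternative: from smolagents import TransformersModel
# model = TransformersModel(model_id="Qwen/Qwen2-VL-7B-Instruct")
model = OpenAIServerModel(model_id="gpt-4o")  # any multimodal provider works

agent = CodeAgent(tools=[], model=model)

# Pass images at the start of the run, e.g. a scanned form to fill in.
document = Image.open("form_to_fill.png")  # hypothetical local file
answer = agent.run(
    "Read this form and list the fields that still need to be filled in.",
    images=[document],
)
print(answer)
```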
š Multimodal
- MiniCPM-o 2.6 is a new SOTA any-to-any model by OpenBMB (vision, speech and text!)
- VideoChat-Flash-Qwen2.5 is a new family of video multimodal models by OpenGVLab that comes in 2B & 7B sizes and 224 & 448 resolutions
- ByteDance released a larger SA2VA that comes in at 26B parameters
- Dataset: VRC-Bench is a new diverse benchmark for multimodal LLM reasoning performance
š¬ LLMs
- MiniMax-Text-01 is a new huge language model (456B total, 45.9B active params) by MiniMaxAI with a context length of 4M tokens š¤Æ
- Dataset: Sky-T1-data-17k is a diverse dataset used to train Sky-T1-32B
- kyutai released Helium-1-Preview-2B, a new small multilingual LM
- Wayfarer-12B is a new LLM that can write D&D adventures š§š»āāļø
- ReaderLM-v2 is a new HTML parsing model by Jina AI
- Dria released Dria-Agent-a-3B, a new agentic coding model (Pythonic function calling) based on Qwen2.5 Coder
- Unsloth released faster, more memory-efficient versions of Phi-4 and Llama 3.3
š¼ļø Vision
- MatchAnything is a new foundation model for image matching
- FitDiT is a high-fidelity virtual try-on (VTON) model based on the DiT architecture
š£ļø Audio
- OuteTTS-0.3-1B is a new multilingual text-to-speech model with voice cloning and emotion control capabilities
š Retrieval
- lightblue released LB-reranker-0.5B-v1.0, a new reranker based on Qwen2.5 that can handle 95+ languages
- cde-small-v2 is a new SOTA small retrieval model by @jxm
Combining smolagents with Anthropic's best practices simplifies building powerful AI agents:
1. Code-Based Agents: Write actions as Python code, reducing steps by 30%.
2. Prompt Chaining: Break tasks into sequential subtasks with validation gates.
3. Routing: Classify inputs and direct them to specialized handlers.
4. Fallback: Handle tasks even if classification fails.
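As an illustration, here is a hedged sketch of the routing + fallback pattern with smolagents: a tiny router agent classifies the input, a specialized agent handles it, and a generalist agent catches anything the classifier misses. The category split and prompts are made up for the example.

```python
from smolagents import CodeAgent, HfApiModel

model = HfApiModel()  # default Hugging Face Inference API model; swap in any smolagents model

# A small router agent plus specialized handlers and a generalist fallback.
router = CodeAgent(tools=[], model=model)
handlers = {
    "math": CodeAgent(tools=[], model=model),
    "web": CodeAgent(tools=[], model=model),   # would get browsing tools in practice
    "fallback": CodeAgent(tools=[], model=model),
}

def route(task: str) -> str:
    """Routing: classify the input, then hand it to a specialized handler."""
    category = str(router.run(
        f"Classify this task as 'math' or 'web'. Answer with exactly one word.\nTask: {task}"
    )).strip().lower()
    # Fallback: if classification produced anything unexpected, use the generalist agent.
    return handlers.get(category, handlers["fallback"]).run(task)

print(route("What is 17% of 243?"))
```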
We outperform Llama 70B with Llama 3B on hard math by scaling test-time compute š„
How? By combining step-wise reward models with tree search algorithms :)
We show that smol models can match or exceed the performance of their much larger siblings when given enough "time to think"
We're open sourcing the full recipe and sharing a detailed blog post.
In our blog post we cover:
š Compute-optimal scaling: How we implemented DeepMind's recipe to boost the mathematical capabilities of open models at test-time.
š Diverse Verifier Tree Search (DVTS): An unpublished extension we developed to the verifier-guided tree search technique. This simple yet effective method improves diversity and delivers better performance, particularly at large test-time compute budgets.
š§ Search and Learn: A lightweight toolkit for implementing search strategies with LLMs, built for speed with vLLM.
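To make the idea concrete, here is a hedged sketch of the simplest of these strategies, best-of-N with a verifier: sample many candidate solutions from a small model, score each with a step-wise (process) reward model, and keep the highest-scoring one. The model name and the `score_solution` helper are placeholders, not the recipe's actual code.

```python
from vllm import LLM, SamplingParams

# Small policy model (the blog post scales Llama 3B at test time); checkpoint name assumed.
llm = LLM(model="meta-llama/Llama-3.2-3B-Instruct")
params = SamplingParams(n=16, temperature=0.8, max_tokens=1024)  # sample N candidates

def score_solution(problem: str, solution: str) -> float:
    """Placeholder verifier. A real setup would run a process reward model over
    each reasoning step and aggregate the step scores (e.g. product or last step)."""
    return -abs(len(solution) - 500)  # dummy heuristic so the sketch runs end to end

problem = "What is the sum of the first 100 positive integers? Think step by step."
candidates = llm.generate([problem], params)[0].outputs  # N sampled solutions

# Best-of-N: keep the candidate the verifier scores highest.
# Weighted best-of-N instead sums scores over candidates sharing the same final answer.
best = max(candidates, key=lambda c: score_solution(problem, c.text))
print(best.text)
```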
For anyone looking to boost their LLM fine-tuning and alignment skills this December: we're running a free and open course called smol course. It's not big like Li Yin and @mlabonne's, it's just smol.
š· It focuses on practical use cases, so if you're working on something, bring it along.
šÆāāļø It's peer reviewed and open so you can discuss and get feedback.
š¤ If you're already a smol pro, feel free to drop a star or issue.
Part 1 starts now, and it's on instruction tuning!
Let's go! We are releasing SmolVLM, a smol 2B VLM built for on-device inference that outperforms all models at similar GPU RAM usage and token throughput.
- SmolVLM generates tokens 7.5 to 16 times faster than Qwen2-VL! š¤Æ
- Other models at this size crash a laptop, but SmolVLM comfortably generates 17 tokens/sec on a MacBook! š
- SmolVLM can be fine-tuned on a Google Colab! Or process millions of documents with a consumer GPU!
- SmolVLM even outperforms larger models in video benchmarks, despite not even being trained on videos!
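For reference, a minimal inference sketch following the standard transformers vision-to-sequence pattern; the checkpoint name `HuggingFaceTB/SmolVLM-Instruct`, the prompt format, and the image file are assumptions to be double-checked against the model card.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "HuggingFaceTB/SmolVLM-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")

image = Image.open("receipt.png")  # any local image
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "What is the total amount on this receipt?"},
    ],
}]

# Build the chat prompt, run generation, and decode the answer.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to("cuda")
generated = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```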
This demo highlights when a person touches an object. For instance, it is useful to know if someone is touching a wall, a vase or a door. It works for multiple people too!