Ed Addario PRO

eaddario

AI & ML interests

None yet

Recent Activity

updated a model about 11 hours ago
eaddario/Dolphin3.0-R1-Mistral-24B-GGUF

Organizations

None yet

eaddario's activity

replied to wolfram's post 1 day ago

Nicely done!

Purely out of curiosity, what kind of rig (spec) are you using to run 600B-param models locally?

reacted to wolfram's post with 🤗 1 day ago
Finally finished my extensive **Qwen 3 evaluations** across a range of formats and quantisations, focusing on **MMLU-Pro** (Computer Science).

A few take-aways stood out - especially for those interested in local deployment and performance trade-offs:

1️⃣ **Qwen3-235B-A22B** (via Fireworks API) tops the table at **83.66%** with ~55 tok/s.
2️⃣ But the **30B-A3B Unsloth** quant delivered **82.20%** while running locally at ~45 tok/s and with zero API spend.
3️⃣ The same Unsloth build is ~5x faster than Qwen's **Qwen3-32B**, which scores **82.20%** as well yet crawls at <10 tok/s.
4️⃣ On Apple silicon, the **30B MLX** port hits **79.51%** while sustaining ~64 tok/s - arguably today's best speed/quality trade-off for Mac setups.
5️⃣ The **0.6B** micro-model races above 180 tok/s but tops out at **37.56%** - that's why it's not even on the graph (50% performance cut-off).

All local runs were done with LM Studio on an M4 MacBook Pro, using Qwen's official recommended settings.

**Conclusion:** Quantised 30B models now get you ~98% of frontier-class accuracy - at a fraction of the latency, cost, and energy. For most local RAG or agent workloads, they're not just good enough - they're the new default.

Well done, Qwen - you really whipped the llama's ass! And to OpenAI: for your upcoming open model, please make it MoE, with toggleable reasoning, and release it in many sizes. *This* is the future!
reacted to juhoinkinen's post with 🤝 1 day ago
We (@osma, @MonaLehtinen & me, i.e. the Annif team at the National Library of Finland) recently took part in the LLMs4Subjects challenge at the SemEval-2025 workshop. The task was to use large language models (LLMs) to generate good-quality subject indexing for bibliographic records, i.e. titles and abstracts.

We are glad to report that our system performed well; it was ranked

🥇 1st in the category where the full vocabulary was used
🥈 2nd in the smaller vocabulary category
🏅 4th in the qualitative evaluations.

14 participating teams developed their own solutions for generating subject headings and the output of each system was assessed using both quantitative and qualitative evaluations. Research papers about most of the systems are going to be published around the time of the workshop in late July, and many pre-prints are already available.

We applied Annif together with several LLMs, which we used to preprocess the data sets: translating the GND vocabulary terms into English, translating bibliographic records into English and German as required, and generating additional synthetic training data. After the preprocessing, we used the traditional machine learning algorithms in Annif, as well as the experimental XTransformer algorithm that is based on language models. We also combined the subject suggestions generated from the English and German records in a novel way.

More information can be found in our system description preprint: Annif at SemEval-2025 Task 5: Traditional XMTC augmented by LLMs (2504.19675)

See also the task description preprint: SemEval-2025 Task 5: LLMs4Subjects -- LLM-based Automated Subject Tagging for a National Technical Library's Open-Access Catalog (2504.07199)

The Annif models trained for this task are available here: NatLibFi/Annif-LLMs4Subjects-data
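For readers who want to try the published models, here is a rough sketch of driving Annif from the command line (the project ID below is a placeholder; the actual IDs depend on how the projects in NatLibFi/Annif-LLMs4Subjects-data are configured):

```bash
# Sketch only: install Annif and ask a trained project for subject suggestions.
pip install annif

# List the project IDs available in your local Annif configuration
annif list-projects

# Feed a title + abstract on stdin; "gnd-en" is a hypothetical project ID
cat title_and_abstract.txt | annif suggest gnd-en
```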
reacted to RiverZ's post with 🤗 3 days ago
🔥 We're thrilled to share some exciting news about ICEdit! Currently, the ICEdit app (RiverZ/ICEdit) has soared to second place on the Hugging Face Spaces weekly trending list, just behind Qwen3. What's more, it also holds second position on the overall Spaces trending list. This achievement wouldn't have been possible without your incredible support and love. A huge thank you to each and every one of you ❤!

🎉 The ICEdit community has been incredibly active, and we've seen a plethora of amazing ComfyUI workflows being shared. For instance, with the help of ComfyUI-nunchaku, you can run ICEdit locally with just 4GB of VRAM. This makes it much more accessible for those with limited hardware resources.

🎇 If you're interested in the detailed information, please head over to our repository. We highly encourage you to give these workflows a try and explore the creative possibilities that ICEdit offers.

Github Repo: https://github.com/River-Zhang/ICEdit
Hugging Face Space: RiverZ/ICEdit
reacted to Xenova's post with 👀🔥 11 days ago
reacted to John6666's post with 👍 12 days ago
If your Space stops working after restarting, as has mainly been happening over the last 5 days (https://discuss.huggingface.co/t/my-space-suddenly-went-offline-the-cpu-cannot-restart/151121/22), try some of the following (an illustrative requirements.txt is sketched after this list).
1. Add pydantic==2.10.6 to requirements.txt or upgrade Gradio to the latest version.
2. Upgrade PyTorch to 2.2.0 or later (torch>=2.2.0 for Zero GPU space).
3. Pin Transformers to 4.49.0 or earlier (transformers<=4.49.0 for Spaces using Transformers or Diffusers).
4. Pin huggingface_hub to an older version (huggingface_hub==0.25.2) if an error like "cached_download is not available" occurs or inference does not work properly.
5. In Docker Spaces, specifying WORKDIR in the Dockerfile may cause the application to fail to start with error 137 (https://discuss.huggingface.co/t/error-code-137-cache-error/152177).
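For convenience, the version pins above can be combined into a single illustrative requirements.txt (versions are taken verbatim from items 1-4; adjust to your Space's actual stack):

```
# Illustrative requirements.txt combining the pins from items 1-4 above
pydantic==2.10.6
torch>=2.2.0
transformers<=4.49.0
huggingface_hub==0.25.2
gradio  # or pin the latest Gradio release explicitly
```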

About pydantic==2.10.6:
https://discuss.huggingface.co/t/error-no-api-found/146226
https://discuss.huggingface.co/t/internal-server-error-bool-not-iterable/149494

Edit:
Zero GPU space has been upgraded from A100 to H200.
This is likely the reason why older versions of PyTorch are no longer supported.
In fact, an error message to that effect was displayed.
zero-gpu-explorers/README#163
reacted to danielhanchen's post with 🔥❤️ 13 days ago
🦥 Introducing Unsloth Dynamic v2.0 GGUFs!
Our v2.0 quants set new benchmarks on 5-shot MMLU and KL Divergence, meaning you can now run & fine-tune quantized LLMs while preserving as much accuracy as possible.

Llama 4: unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF
DeepSeek-R1: unsloth/DeepSeek-R1-GGUF-UD
Gemma 3: unsloth/gemma-3-27b-it-GGUF

We made selective layer quantization much smarter. Instead of modifying only a subset of layers, we now dynamically quantize all layers, so each layer can use a different bit width. Our dynamic method can now be applied to all LLM architectures, not just MoEs.

Blog with Details: https://docs.unsloth.ai/basics/dynamic-v2.0

All our future GGUF uploads will leverage Dynamic v2.0 and our hand-curated 300K–1.5M token calibration dataset to improve conversational chat performance.

For accurate benchmarking, we built an evaluation framework to match the reported 5-shot MMLU scores of Llama 4 and Gemma 3. This allowed apples-to-apples comparisons between full-precision, Dynamic v2.0, QAT, and standard iMatrix quants.

Dynamic v2.0 aims to minimize the performance gap between full-precision models and their quantized counterparts.
posted an update 13 days ago
Until recently, watt-ai/watt-tool-70B was the best-performing model on the Berkeley Function-Calling Leaderboard (https://gorilla.cs.berkeley.edu/leaderboard.html), which evaluates LLMs' ability to call functions (tools) accurately. The top spot now belongs to Salesforce/Llama-xLAM-2-70b-fc-r, and by quite a wide margin!

Layer-wise quantized versions for both models are available at eaddario/Llama-xLAM-2-8b-fc-r-GGUF and eaddario/Watt-Tool-8B-GGUF
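If you want to try them locally, here is a quick sketch using huggingface-cli (the --include glob is an assumption about the GGUF file naming; check the repo's file list first):

```bash
# Sketch: pull one of the layer-wise quantized GGUF builds locally.
# The "*Q4_K_M*" pattern is a guess at the file names in the repo.
pip install -U "huggingface_hub[cli]"
huggingface-cli download eaddario/Watt-Tool-8B-GGUF \
  --include "*Q4_K_M*" --local-dir ./models
```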
reacted to samihalawa's post with 👀 16 days ago
SkyReels-V2 INFINITE VIDEO🔥♾️🎬 UNLIMITED duration video generation model by Skywork.

> "It's finally here: an open-source model that achieves what we've all been waiting for - infinite-length videos." 😮

Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought (2504.05599)

Model: Skywork/SkyReels-V2-T2V-14B-720P

✨ 1.3B & 14B
✨ Generates infinite-length videos using Diffusion Forcing, combining diffusion models with autoregressive methods
posted an update 21 days ago
Tensor-wise (TWQ) and Layer-wise quantization (LWQ) now available in llama.cpp!

As of version b5125, users can perform TWQ, whereby a whole tensor is quantized at a specific level, or LWQ, by choosing specific layers per tensor type.

The new --tensor-type option enables llama-quantize to apply user-defined quant levels to any combination of allowed tensors (i.e. tensors with 2 or more dimensions) and layer numbers, with support for regex patterns.

For example, to TWQ the Attention Value tensor you would use --tensor-type attn_v=q6_k, and to perform LWQ you would use something like --tensor-type "\.([0-9]|1[01257]|31)\.attn_v=q4_k" (a complete invocation is sketched below).
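For context, a complete invocation might look like the sketch below (file names are placeholders; the --tensor-type regex is the one from this post, and Q4_K_M is the base quant level used in the comparison further down):

```bash
# Sketch: layer-wise quantization with llama-quantize (b5125 or later).
# model-f16.gguf and imatrix.dat are placeholder file names.
./llama-quantize --imatrix imatrix.dat \
    --tensor-type "\.([0-9]|1[01257]|31)\.attn_v=q4_k" \
    model-f16.gguf model-LWQ-Q4_K_M.gguf Q4_K_M
```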

In the next few days/weeks I'll update the models in my HF repo (and will add some others), but eaddario/DeepSeek-R1-Distill-Llama-8B-GGUF and eaddario/DeepSeek-R1-Distill-Qwen-7B-GGUF have already been LWQed.

For reference, compared to the naive Q4_K_M model, the LWQ Qwen-7B is almost 11% smaller (4.68GB vs 4.18GB) with only a 0.35% penalty on PPL!

I'll update the https://medium.com/@eaddario/squeezing-tensor-bits-the-quest-for-smaller-llms-86b23bd052ca post to explain the process in detail, but in the meantime the following links will provide some background:

- Changes to llama-quantize: https://github.com/ggml-org/llama.cpp/pull/12511
- TWQ & LWQ tests: https://github.com/ggml-org/llama.cpp/discussions/12741
- Modified llama-imatrix (not yet merged) used to generate imatrix statistics to guide the TWQ and LWQ process: https://github.com/ggml-org/llama.cpp/pull/12718
reacted to bartowski's post with 👍 24 days ago
Access requests enabled for latest GLM models

While a fix is being implemented (https://github.com/ggml-org/llama.cpp/pull/12957), I want to leave the models up for visibility and continued discussion, but I also want to prevent accidental downloads of known-broken models (even though there are settings that could fix them at runtime for now).

With this goal, I've enabled access requests. I don't really want your data, so I'm sorry that I don't think there's a way around that? But that's what I'm gonna do for now, and I'll remove the gate when a fix is up and verified and I have a chance to re-convert and quantize!

Hope you don't mind in the meantime :D
replied to onekq's post about 1 month ago
reacted to danielhanchen's post with ❤️ about 1 month ago
reacted to clem's post with 🤗❤️🔥 about 1 month ago
Before 2020, most of the AI field was open and collaborative. For me, that was the key factor that accelerated scientific progress and made the impossible possible—just look at the “T” in ChatGPT, which comes from the Transformer architecture openly shared by Google.

Then came the myth that AI was too dangerous to share, and companies started optimizing for short-term revenue. That led many major AI labs and researchers to stop sharing and collaborating.

With OAI and sama now saying they're willing to share open weights again, we have a real chance to return to a golden age of AI progress and democratization—powered by openness and collaboration, in the US and around the world.

This is incredibly exciting. Let’s go, open science and open-source AI!
reacted to danielhanchen's post with 🔥❤️ about 1 month ago