Types 8-15, continued in the comments:
8. Adaptive attention -> https://huggingface.co/papers/1612.01887
Dynamically adjusts its attention behavior: when or whether to use attention, and how broad the attention should be.
9. Scaled Dot-Product attention -> https://huggingface.co/papers/2404.16629
Attention scores are computed as the dot product between a query vector and a key vector, divided by the square root of the key dimension before applying softmax (see the sketch after this list).
10. Additive attention -> https://huggingface.co/papers/1409.0473
Computes attention scores with a small feed-forward network that combines the query and key vectors.
11. Global attention -> https://huggingface.co/papers/1508.04025
A form of soft attention that considers all positions in the input sequence.
12. Local attention -> https://huggingface.co/papers/1508.04025
A compromise between hard and soft attention: the model attends only to a restricted window of inputs at each step.
13. Sparse attention -> https://huggingface.co/papers/1602.02068
Applies structured patterns that limit which positions each token can attend to.
14. Hierarchical attention -> https://www.cs.cmu.edu/~./hovy/papers/16HLT-hierarchical-attention-networks.pdf
The model first applies attention at the word level to build sentence representations, then applies attention at the sentence level to determine which sentences matter for the document representation.
15. Temporal attention -> https://huggingface.co/papers/1502.08029
Designed for time-series or sequential data, letting the model focus on particular time steps or segments.
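To make the scaled dot-product formula above concrete, here is a minimal PyTorch sketch of type 9, with an optional local-window mask that also illustrates the local and sparse variants (types 12-13). The function name, tensor shapes, and the window parameter are illustrative assumptions, not code from any of the linked papers.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, window=None):
    # q, k, v: (seq_len, dim) tensors.
    scores = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))   # (seq_len, seq_len)
    if window is not None:
        # Local/sparse variant: each position may only attend within +/- `window` steps.
        idx = torch.arange(q.size(-2))
        mask = (idx[:, None] - idx[None, :]).abs() > window
        scores = scores.masked_fill(mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)   # soft attention: weights sum to 1 per query
    return weights @ v

q = k = v = torch.randn(6, 16)
out_global = scaled_dot_product_attention(q, k, v)             # global: all positions visible
out_local = scaled_dot_product_attention(q, k, v, window=2)    # local: window of 2 around each position
print(out_global.shape, out_local.shape)                       # torch.Size([6, 16]) twice
```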
Ksenia Se (Kseniase)
AI & ML interests: None yet

Recent Activity

replied to their post · 6 minutes ago
15 types of attention mechanisms
Attention mechanisms allow models to dynamically focus on specific parts of their input when performing tasks. In our recent article, we discussed Multi-Head Latent Attention (MLA) in detail, and now it's time to summarize the other existing types of attention.
Here is a list of 15 types of attention mechanisms used in AI models:
1. Soft attention (Deterministic attention) -> https://huggingface.co/papers/1409.0473
Assigns a continuous weight distribution over all parts of the input. It produces a weighted sum of the input using attention weights that sum to 1.
2. Hard attention (Stochastic attention) -> https://huggingface.co/papers/1508.04025
Makes a discrete selection of some part of the input to focus on at each step, rather than attending to everything.
3. Self-attention -> https://huggingface.co/papers/1706.03762
Each element in the sequence "looks" at other elements and "decides" how much to borrow from each of them for its new representation.
4. Cross-Attention (Encoder-Decoder attention) -> https://huggingface.co/papers/2104.08771
The queries come from one sequence and the keys/values come from another sequence. It allows a model to combine information from two different sources.
5. Multi-Head Attention (MHA) -> https://huggingface.co/papers/1706.03762
Multiple attention “heads” are run in parallel. The model computes several attention distributions (heads), each with its own set of learned projections of queries, keys, and values (a minimal sketch follows the list).
6. Multi-Head Latent Attention (MLA) -> https://huggingface.co/papers/2405.04434
Extends MHA by compressing the keys and values into a shared low-rank latent vector, which shrinks the KV cache at inference while keeping quality comparable to standard MHA.
7. Memory-Based attention -> https://huggingface.co/papers/1503.08895
Involves an external memory and uses attention to read from and write to this memory.
See other types in the comments 👇
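As referenced in item 5, here is a minimal sketch of multi-head attention built from scaled dot-product attention; passing a second sequence as keys/values turns the same module into cross-attention (item 4). This is an illustrative simplification under assumed dimensions, not the exact implementation from "Attention Is All You Need".

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, dim=64, num_heads=4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x, context=None):
        # Self-attention if no context is given; cross-attention otherwise.
        context = x if context is None else context
        B, Tq, D = x.shape
        Tk = context.shape[1]
        # Project, then split the model dimension into separate heads: (B, heads, T, head_dim).
        q = self.q_proj(x).view(B, Tq, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(context).view(B, Tk, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(context).view(B, Tk, self.num_heads, self.head_dim).transpose(1, 2)
        # Scaled dot-product attention within each head.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)
        out = torch.softmax(scores, dim=-1) @ v
        # Concatenate the heads back together and mix them.
        out = out.transpose(1, 2).reshape(B, Tq, D)
        return self.out_proj(out)

mha = MultiHeadAttention()
x, ctx = torch.randn(2, 10, 64), torch.randn(2, 7, 64)
print(mha(x).shape)       # self-attention:  torch.Size([2, 10, 64])
print(mha(x, ctx).shape)  # cross-attention: torch.Size([2, 10, 64])
```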
posted an update · 6 minutes ago
15 types of attention mechanisms
upvoted an article · 3 days ago
How to Reduce Memory Use in Reasoning Models
Kseniase's activity

replied to their post · 6 minutes ago

posted an update · 6 minutes ago
15 types of attention mechanisms

upvoted an article · 3 days ago
How to Reduce Memory Use in Reasoning Models
published an article · 3 days ago
How to Reduce Memory Use in Reasoning Models
upvoted an article · 5 days ago
🌁#91: We are failing in AI literacy
published an article · 6 days ago
🌁#91: We are failing in AI literacy
published an article · 6 days ago
🌁#90: Why AI’s Reasoning Tests Keep Failing Us

upvoted an article · 6 days ago
🦸🏻#13: Action! How AI Agents Execute Tasks with UI and API Tools
published an article · 6 days ago
🦸🏻#13: Action! How AI Agents Execute Tasks with UI and API Tools
reacted to clem's post with 👍 · 7 days ago
I was chatting with @peakji, one of the cofounders of Manu AI, who told me he was on Hugging Face (very cool!).
He shared an interesting insight: agentic capabilities might be more of an alignment problem than a foundational capability issue. Similar to the difference between GPT-3 and InstructGPT, some open-source foundation models are simply trained to 'answer everything in one response regardless of the complexity of the question' - after all, that's the user preference in chatbot use cases. Just a bit of post-training on agentic trajectories can make an immediate and dramatic difference.
As a thank you to the community, he shared 100 invite codes, first-come first-served - just use “HUGGINGFACE” to get access!

upvoted an article · 7 days ago
🦸🏻#12: How Do Agents Learn from Their Own Mistakes? The Role of Reflection in AI
published an article · 7 days ago
🦸🏻#12: How Do Agents Learn from Their Own Mistakes? The Role of Reflection in AI
replied to their post · 7 days ago
To study how diffusion models work, check out our post about useful courses and resources on diffusion models -> https://www.turingpost.com/p/6-sources-to-study-diffusion-models

posted an update · 7 days ago
5 New implementations of Diffusion Models
Diffusion models are widely used for image and video generation but remain underexplored in text generation, where autoregressive models (ARMs) dominate. Unlike ARMs, which produce tokens sequentially, diffusion models iteratively refine noise through denoising steps, offering greater flexibility and speed.
Recent advancements show a shift toward using diffusion models in place of, or alongside, ARMs. Researchers also combine strengths from both methods and integrate autoregressive concepts into diffusion.
Here are 5 new implementations of diffusion models:
1. Mercury family of diffusion LLMs (dLLMs) by Inception Labs -> https://www.inceptionlabs.ai/news
It applies diffusion to text and code data, enabling sequence generation 10x faster than today's top LLMs. Mercury Coder, now available, can run at over 1,000 tokens/sec on NVIDIA H100s.
2. Diffusion of Thoughts (DoT) -> Diffusion of Thoughts: Chain-of-Thought Reasoning in Diffusion Language Models (2402.07754)
Integrates diffusion models with Chain-of-Thought. DoT allows reasoning steps to diffuse gradually over time. This flexibility enables balancing between reasoning quality and computational cost.
3. LLaDA -> Large Language Diffusion Models (2502.09992)
Shows diffusion models' potential to replace ARMs. Trained with pre-training and SFT, LLaDA masks tokens, predicts them via a Transformer, and optimizes a likelihood bound (a toy sketch of this masked-denoising idea follows the list). LLaDA matches key LLM skills and surpasses GPT-4o in reversal poetry.
4. LanDiff -> The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation (2503.04606)
This hybrid text-to-video model combines autoregressive and diffusion paradigms, introducing a semantic tokenizer, an LM for token generation, and a streaming diffusion model. LanDiff outperforms models like Sora.
5. General Interpolating Discrete Diffusion (GIDD) -> Generalized Interpolating Discrete Diffusion (2503.04482)
A flexible noising process with a novel diffusion ELBO enables combining masking and uniform noise, allowing diffusion models to correct their own mistakes, something ARMs struggle with.
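As mentioned in item 3, here is a toy, purely illustrative sketch of the masked-denoising idea behind text diffusion models such as LLaDA: start from a fully masked sequence and unmask the most confident positions over several denoising steps. The predict function is a hypothetical stand-in for a trained Transformer denoiser, not a real model and not the actual LLaDA algorithm.

```python
import random

VOCAB = ["the", "cat", "sat", "on", "a", "mat"]
MASK = "<mask>"

def predict(tokens):
    # Hypothetical denoiser: returns a (token, confidence) guess for every masked slot.
    return {i: (random.choice(VOCAB), random.random())
            for i, t in enumerate(tokens) if t == MASK}

def generate(length=8, steps=4):
    tokens = [MASK] * length
    for _ in range(steps):
        guesses = predict(tokens)
        if not guesses:
            break
        # Unmask the most confident half of the remaining positions this step,
        # then re-predict the rest in the next denoising iteration.
        keep = sorted(guesses, key=lambda i: guesses[i][1], reverse=True)
        for i in keep[: max(1, len(keep) // 2)]:
            tokens[i] = guesses[i][0]
    # Fill any masks still left after the final step.
    for i, (tok, _) in predict(tokens).items():
        tokens[i] = tok
    return tokens

print(generate())
```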

upvoted an article · 9 days ago
Everything You Need to Know about Knowledge Distillation
published an article · 10 days ago
Everything You Need to Know about Knowledge Distillation
upvoted an article · 13 days ago
🌁#90: Why AI’s Reasoning Tests Keep Failing Us
upvoted a paper · 13 days ago