Types 8-15, continued in the comments:
8. Adaptive attention -> https://huggingface.co/papers/1612.01887
Dynamically adjusts its attention behavior: when or whether to use attention, and how broad the attention should be.
9. Scaled Dot-Product attention -> https://huggingface.co/papers/2404.16629
Attention scores are computed as the dot product between a query vector and a key vector, divided by the square root of the key dimension before applying softmax (see the sketch after this list).
10. Additive attention -> https://huggingface.co/papers/1409.0473
Computes attention scores with a small feed-forward network that combines the query and key vectors.
11. Global attention -> https://huggingface.co/papers/1508.04025
A form of soft attention that considers all positions in the input sequence.
12. Local attention -> https://huggingface.co/papers/1508.04025
A compromise between hard and soft attention: the model attends only to a restricted window of inputs at each step.
13. Sparse attention -> https://huggingface.co/papers/1602.02068
Applies structured patterns that limit which positions each token can attend to.
14. Hierarchical attention -> https://www.cs.cmu.edu/~./hovy/papers/16HLT-hierarchical-attention-networks.pdf
The model first applies attention at the word level to build sentence representations, then applies attention at the sentence level to determine which sentences matter for the document representation.
15. Temporal attention -> https://huggingface.co/papers/1502.08029
Designed for time-series or sequential data, letting the model focus on particular time steps or segments.
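To make the scaled dot-product formula above concrete, here is a minimal PyTorch sketch of type 9, with an optional local-window mask that also illustrates the local and sparse variants (types 12-13). The function name, tensor shapes, and the window parameter are illustrative assumptions, not code from any of the linked papers.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, window=None):
    # q, k, v: (seq_len, dim) tensors.
    scores = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))   # (seq_len, seq_len)
    if window is not None:
        # Local/sparse variant: each position may only attend within +/- `window` steps.
        idx = torch.arange(q.size(-2))
        mask = (idx[:, None] - idx[None, :]).abs() > window
        scores = scores.masked_fill(mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)   # soft attention: weights sum to 1 per query
    return weights @ v

q = k = v = torch.randn(6, 16)
out_global = scaled_dot_product_attention(q, k, v)             # global: all positions visible
out_local = scaled_dot_product_attention(q, k, v, window=2)    # local: window of 2 around each position
print(out_global.shape, out_local.shape)                       # torch.Size([6, 16]) twice
```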
Ksenia Se (Kseniase)
AI & ML interests: None yet

Recent Activity

replied to their post · 6 minutes ago
15 types of attention mechanisms
Attention mechanisms allow models to dynamically focus on specific parts of their input when performing tasks. In our recent article, we discussed Multi-Head Latent Attention (MLA) in detail, and now it's time to summarize the other existing types of attention.
Here is a list of 15 types of attention mechanisms used in AI models:
1. Soft attention (Deterministic attention) -> https://huggingface.co/papers/1409.0473
Assigns a continuous weight distribution over all parts of the input. It produces a weighted sum of the input using attention weights that sum to 1.
2. Hard attention (Stochastic attention) -> https://huggingface.co/papers/1508.04025
Makes a discrete selection of some part of the input to focus on at each step, rather than attending to everything.
3. Self-attention -> https://huggingface.co/papers/1706.03762
Each element in the sequence "looks" at other elements and "decides" how much to borrow from each of them for its new representation.
4. Cross-Attention (Encoder-Decoder attention) -> https://huggingface.co/papers/2104.08771
The queries come from one sequence and the keys/values come from another sequence. It allows a model to combine information from two different sources.
5. Multi-Head Attention (MHA) -> https://huggingface.co/papers/1706.03762
Multiple attention “heads” are run in parallel. The model computes several attention distributions (heads), each with its own set of learned projections of queries, keys, and values (a minimal sketch follows the list).
6. Multi-Head Latent Attention (MLA) -> https://huggingface.co/papers/2405.04434
Extends MHA by compressing the keys and values into a shared low-rank latent vector, which shrinks the KV cache at inference while keeping quality comparable to standard MHA.
7. Memory-Based attention -> https://huggingface.co/papers/1503.08895
Involves an external memory and uses attention to read from and write to this memory.
See other types in the comments 👇
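As referenced in item 5, here is a minimal sketch of multi-head attention built from scaled dot-product attention; passing a second sequence as keys/values turns the same module into cross-attention (item 4). This is an illustrative simplification under assumed dimensions, not the exact implementation from "Attention Is All You Need".

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, dim=64, num_heads=4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x, context=None):
        # Self-attention if no context is given; cross-attention otherwise.
        context = x if context is None else context
        B, Tq, D = x.shape
        Tk = context.shape[1]
        # Project, then split the model dimension into separate heads: (B, heads, T, head_dim).
        q = self.q_proj(x).view(B, Tq, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(context).view(B, Tk, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(context).view(B, Tk, self.num_heads, self.head_dim).transpose(1, 2)
        # Scaled dot-product attention within each head.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)
        out = torch.softmax(scores, dim=-1) @ v
        # Concatenate the heads back together and mix them.
        out = out.transpose(1, 2).reshape(B, Tq, D)
        return self.out_proj(out)

mha = MultiHeadAttention()
x, ctx = torch.randn(2, 10, 64), torch.randn(2, 7, 64)
print(mha(x).shape)       # self-attention:  torch.Size([2, 10, 64])
print(mha(x, ctx).shape)  # cross-attention: torch.Size([2, 10, 64])
```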
posted an update · 6 minutes ago
15 types of attention mechanisms
upvoted an article · 3 days ago
How to Reduce Memory Use in Reasoning Models
Kseniase's activity

replied to their post · 6 minutes ago

posted an update · 6 minutes ago
15 types of attention mechanisms

upvoted an article · 3 days ago
How to Reduce Memory Use in Reasoning Models
published an article · 3 days ago
How to Reduce Memory Use in Reasoning Models
upvoted an article · 5 days ago
🌁#91: We are failing in AI literacy
published an article · 6 days ago
🌁#91: We are failing in AI literacy
published an article · 6 days ago
🌁#90: Why AI’s Reasoning Tests Keep Failing Us

upvoted an article · 6 days ago
🦸🏻#13: Action! How AI Agents Execute Tasks with UI and API Tools
published an article · 6 days ago
🦸🏻#13: Action! How AI Agents Execute Tasks with UI and API Tools
reacted to clem's post with 👍 · 7 days ago
I was chatting with @peakji, one of the cofounders of Manu AI, who told me he was on Hugging Face (very cool!).
He shared an interesting insight: agentic capabilities might be more of an alignment problem than a foundational capability issue. Similar to the difference between GPT-3 and InstructGPT, some open-source foundation models are simply trained to 'answer everything in one response regardless of the complexity of the question' - after all, that's the user preference in chatbot use cases. Just a bit of post-training on agentic trajectories can make an immediate and dramatic difference.
As a thank you to the community, he shared 100 invite codes, first-come first-served - just use “HUGGINGFACE” to get access!

upvoted an article · 7 days ago
🦸🏻#12: How Do Agents Learn from Their Own Mistakes? The Role of Reflection in AI
published an article · 7 days ago
🦸🏻#12: How Do Agents Learn from Their Own Mistakes? The Role of Reflection in AI
replied to their post · 7 days ago
To study how diffusion models work, check out our post about useful courses and resources on diffusion models -> https://www.turingpost.com/p/6-sources-to-study-diffusion-models

posted an update · 7 days ago
5 New implementations of Diffusion Models
Diffusion models are widely used for image and video generation but remain underexplored in text generation, where autoregressive models (ARMs) dominate. Unlike ARMs, which produce tokens sequentially, diffusion models iteratively refine noise through denoising steps, offering greater flexibility and speed.
Recent advancements show a shift toward using diffusion models in place of, or alongside, ARMs. Researchers also combine strengths from both methods and integrate autoregressive concepts into diffusion.
Here are 5 new implementations of diffusion models:
1. Mercury family of diffusion LLMs (dLLMs) by Inception Labs -> https://www.inceptionlabs.ai/news
It applies diffusion to text and code data, enabling sequence generation 10x faster than today's top LLMs. Mercury Coder, now available, can run at over 1,000 tokens/sec on NVIDIA H100s.
2. Diffusion of Thoughts (DoT) -> Diffusion of Thoughts: Chain-of-Thought Reasoning in Diffusion Language Models (2402.07754)
Integrates diffusion models with Chain-of-Thought. DoT allows reasoning steps to diffuse gradually over time. This flexibility enables balancing between reasoning quality and computational cost.
3. LLaDA -> Large Language Diffusion Models (2502.09992)
Shows diffusion models' potential to replace ARMs. Trained with pre-training and SFT, LLaDA masks tokens, predicts them via a Transformer, and optimizes a likelihood bound (a toy sketch of this masked-denoising idea follows the list). LLaDA matches key LLM skills and surpasses GPT-4o in reversal poetry.
4. LanDiff -> The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation (2503.04606)
This hybrid text-to-video model combines autoregressive and diffusion paradigms, introducing a semantic tokenizer, an LM for token generation, and a streaming diffusion model. LanDiff outperforms models like Sora.
5. General Interpolating Discrete Diffusion (GIDD) -> Generalized Interpolating Discrete Diffusion (2503.04482)
A flexible noising process with a novel diffusion ELBO enables combining masking and uniform noise, allowing diffusion models to correct their own mistakes, something ARMs struggle with.
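As mentioned in item 3, here is a toy, purely illustrative sketch of the masked-denoising idea behind text diffusion models such as LLaDA: start from a fully masked sequence and unmask the most confident positions over several denoising steps. The predict function is a hypothetical stand-in for a trained Transformer denoiser, not a real model and not the actual LLaDA algorithm.

```python
import random

VOCAB = ["the", "cat", "sat", "on", "a", "mat"]
MASK = "<mask>"

def predict(tokens):
    # Hypothetical denoiser: returns a (token, confidence) guess for every masked slot.
    return {i: (random.choice(VOCAB), random.random())
            for i, t in enumerate(tokens) if t == MASK}

def generate(length=8, steps=4):
    tokens = [MASK] * length
    for _ in range(steps):
        guesses = predict(tokens)
        if not guesses:
            break
        # Unmask the most confident half of the remaining positions this step,
        # then re-predict the rest in the next denoising iteration.
        keep = sorted(guesses, key=lambda i: guesses[i][1], reverse=True)
        for i in keep[: max(1, len(keep) // 2)]:
            tokens[i] = guesses[i][0]
    # Fill any masks still left after the final step.
    for i, (tok, _) in predict(tokens).items():
        tokens[i] = tok
    return tokens

print(generate())
```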

upvoted an article · 9 days ago
Everything You Need to Know about Knowledge Distillation
published an article · 10 days ago
Everything You Need to Know about Knowledge Distillation
upvoted an article · 13 days ago
🌁#90: Why AI’s Reasoning Tests Keep Failing Us
upvoted a paper · 13 days ago