Ksenia Se

Kseniase

AI & ML interests

None yet


Organizations

Turing Post, Journalists on Hugging Face, Social Post Explorers, Hugging Face Discord Community, Sandbox

Kseniase's activity

replied to their post 6 minutes ago
  8. Adaptive attention -> https://huggingface.co/papers/1612.01887
    Dynamically adjusts its attention behavior: when or whether to use attention, and how broad the attention should be.

  9. Scaled Dot-Product attention -> https://huggingface.co/papers/2404.16629
    Attention scores are computed as the dot product between a query vector and a key vector, divided by the square root of the key dimension before applying softmax (see the sketch after this list).

  10. Additive attention -> https://huggingface.co/papers/1409.0473
    Computes attention scores with a small feed-forward network that combines the query and key vectors.

  11. Global attention -> https://huggingface.co/papers/1508.04025
    A form of soft attention that considers all positions in the input sequence.

  12. Local attention -> https://huggingface.co/papers/1508.04025
    A compromise between hard and soft attention: the model only attends to a restricted window of positions at each step.

  13. Sparse attention -> https://huggingface.co/papers/1602.02068
    Applies sparsity patterns that limit which positions each token can attend to.

  14. Hierarchical attention -> https://www.cs.cmu.edu/~./hovy/papers/16HLT-hierarchical-attention-networks.pdf
    The model first applies attention at the word level to build sentence representations, then applies attention at the sentence level to decide which sentences matter for the document representation.

  15. Temporal attention -> https://huggingface.co/papers/1502.08029
    Designed for time-series or sequential data, allowing a model to focus on particular time steps or segments.
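To make the scaled dot-product, global, local, and sparse variants above concrete, here is a minimal NumPy sketch. The function name, mask, and sizes are illustrative assumptions, not taken from any of the linked papers.

```python
import numpy as np

def attention(Q, K, V, mask=None):
    """Scaled dot-product attention; an optional mask gives local/sparse variants."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # query-key similarities, scaled by sqrt(d_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)       # masked-out positions get ~zero weight
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: weights per query sum to 1
    return weights @ V                              # weighted sum of the values

rng = np.random.default_rng(0)
n, d = 6, 8
Q = K = V = rng.normal(size=(n, d))                 # self-attention over a single toy sequence

# Global (soft) attention: every position may attend to every position
global_out = attention(Q, K, V)

# Local attention: each position only sees a window of +/-1 neighbours
idx = np.arange(n)
local_mask = np.abs(idx[:, None] - idx[None, :]) <= 1
local_out = attention(Q, K, V, mask=local_mask)

print(global_out.shape, local_out.shape)            # (6, 8) (6, 8)
```

Sparse attention follows the same pattern as the local example: only the mask changes.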

posted an update 6 minutes ago
15 types of attention mechanisms

Attention mechanisms allow models to dynamically focus on specific parts of their input when performing tasks. In our recent article, we discussed Multi-Head Latent Attention (MLA) in detail, and now it's time to summarize the other existing types of attention.

Here is a list of 15 types of attention mechanisms used in AI models:

1. Soft attention (Deterministic attention) -> Neural Machine Translation by Jointly Learning to Align and Translate (1409.0473)
Assigns a continuous weight distribution over all parts of the input. It produces a weighted sum of the input using attention weights that sum to 1.

2. Hard attention (Stochastic attention) -> Effective Approaches to Attention-based Neural Machine Translation (1508.04025)
Makes a discrete selection of some part of the input to focus on at each step, rather than attending to everything.

3. Self-attention -> Attention Is All You Need (1706.03762)
Each element in the sequence "looks" at other elements and "decides" how much to borrow from each of them for its new representation.

4. Cross-Attention (Encoder-Decoder attention) -> Cross-Attention is All You Need: Adapting Pretrained Transformers for Machine Translation (2104.08771)
The queries come from one sequence and the keys/values come from another sequence. It allows a model to combine information from two different sources.

5. Multi-Head Attention (MHA) -> Attention Is All You Need (1706.03762)
Multiple attention “heads” run in parallel. The model computes several attention distributions, each with its own set of learned projections of queries, keys, and values (a minimal sketch follows this list).

6. Multi-Head Latent Attention (MLA) -> DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (2405.04434)
Extends MHA by jointly compressing the keys and values into a low-dimensional latent vector, which sharply reduces the KV cache at inference time while keeping quality comparable to standard MHA.

7. Memory-Based attention -> End-To-End Memory Networks (1503.08895)
Involves an external memory and uses attention to read from and write to this memory.

See other types in the comments 👇
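Since self-attention, cross-attention, and MHA (items 3-5) differ mainly in where the queries, keys, and values come from and in how many heads are used, here is a minimal NumPy sketch. The weights are random and all names and sizes are illustrative assumptions, so treat it as a shape-level illustration rather than any paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x_q, x_kv, Wq, Wk, Wv, Wo, n_heads):
    """Multi-head attention: self-attention when x_q is x_kv, cross-attention otherwise."""
    q, k, v = x_q @ Wq, x_kv @ Wk, x_kv @ Wv              # learned projections
    d_head = q.shape[-1] // n_heads
    # Split the model dimension into n_heads independent heads
    split = lambda t: t.reshape(t.shape[0], n_heads, d_head).transpose(1, 0, 2)
    q, k, v = split(q), split(k), split(v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # per-head scaled dot-product
    out = softmax(scores) @ v                              # per-head weighted sum of values
    # Concatenate the heads and mix them with the output projection
    out = out.transpose(1, 0, 2).reshape(x_q.shape[0], -1)
    return out @ Wo

rng = np.random.default_rng(0)
d_model, n_heads = 16, 4
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))
src = rng.normal(size=(6, d_model))   # e.g. encoder states
tgt = rng.normal(size=(3, d_model))   # e.g. decoder states

self_attn  = multi_head_attention(tgt, tgt, Wq, Wk, Wv, Wo, n_heads)  # Q, K, V from the same sequence
cross_attn = multi_head_attention(tgt, src, Wq, Wk, Wv, Wo, n_heads)  # K, V from the other sequence
print(self_attn.shape, cross_attn.shape)  # (3, 16) (3, 16)
```

The only difference between the self-attention and cross-attention calls is whether the keys and values are projected from the same sequence as the queries.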
upvoted an article 3 days ago

How to Reduce Memory Use in Reasoning Models

By Kseniase and 1 other
published an article 3 days ago

How to Reduce Memory Use in Reasoning Models

By Kseniase and 1 other
upvoted an article 5 days ago
published an article 6 days ago
published an article 6 days ago

🌁#90: Why AI’s Reasoning Tests Keep Failing Us

By Kseniase
reacted to their post with 🔥🧠 6 days ago
5 New implementations of Diffusion Models

Diffusion models are widely used for image and video generation but remain underexplored in text generation, where autoregressive models (ARMs) dominate. Unlike ARMs, which produce tokens sequentially, diffusion models iteratively refine noise through denoising steps, offering greater flexibility and speed.
Recent advancements show a shift toward using diffusion models in place of, or alongside, ARMs. Researchers also combine strengths from both methods and integrate autoregressive concepts into diffusion. A toy denoising-loop sketch follows the list below.

Here are 5 new implementations of diffusion models:

1. Mercury family of diffusion LLMs (dLLMs) by Inception Labs -> https://www.inceptionlabs.ai/news
It applies diffusion to text and code data, enabling sequence generation 10x faster than today's top LLMs. Mercury Coder, available now, runs at over 1,000 tokens/sec on NVIDIA H100s.

2. Diffusion of Thoughts (DoT) -> Diffusion of Thoughts: Chain-of-Thought Reasoning in Diffusion Language Models (2402.07754)
Integrates diffusion models with Chain-of-Thought. DoT allows reasoning steps to diffuse gradually over time. This flexibility enables balancing between reasoning quality and computational cost.

3. LLaDA -> Large Language Diffusion Models (2502.09992)
Shows diffusion models' potential in replacing ARMs. Trained with pre-training and SFT, LLaDA masks tokens, predicts them via a Transformer, and optimizes a likelihood bound. LLaDA matches key LLM skills, and surpasses GPT-4o in reversal poetry.

4. LanDiff -> The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation (2503.04606)
This hybrid text-to-video model combines autoregressive and diffusion paradigms, introducing a semantic tokenizer, an LM for token generation, and a streaming diffusion model. LanDiff outperforms models like Sora.

5. General Interpolating Discrete Diffusion (GIDD) -> Generalized Interpolating Discrete Diffusion (2503.04482)
A flexible noising process with a novel diffusion ELBO enables combining masking and uniform noise, allowing diffusion models to correct their own mistakes, something ARMs struggle to do.
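As a rough illustration of "iteratively refining noise through denoising steps" for text, here is a toy NumPy masked-denoising loop. The denoiser is a random stand-in, and every name and number is an assumption made for illustration, not the method of any model listed above.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, mask_id, seq_len, n_steps = 50, -1, 12, 4

def toy_denoiser(tokens):
    """Stand-in for a Transformer denoiser: returns random per-position logits."""
    return rng.normal(size=(len(tokens), vocab))

tokens = np.full(seq_len, mask_id)              # pure "noise": every position masked
for step in range(n_steps):
    logits = toy_denoiser(tokens)
    confidence = logits.max(axis=-1)
    confidence[tokens != mask_id] = -np.inf     # already-denoised positions stay fixed
    # Unmask the most confident positions this step (in parallel, not left-to-right)
    k = int(np.ceil((tokens == mask_id).sum() / (n_steps - step)))
    for pos in np.argsort(-confidence)[:k]:
        tokens[pos] = logits[pos].argmax()
    print(f"step {step}: {tokens}")
```

Unlike an autoregressive decoder, each denoising step here fills in several positions in parallel, anywhere in the sequence.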
upvoted an article 6 days ago

🦸🏻#13: Action! How AI Agents Execute Tasks with UI and API Tools

By Kseniase
published an article 6 days ago

🦸🏻#13: Action! How AI Agents Execute Tasks with UI and API Tools

By Kseniase
reacted to clem's post with 👍 7 days ago
I was chatting with @peakji, one of the cofounders of Manus AI, who told me he was on Hugging Face (very cool!).

He shared an interesting insight which is that agentic capabilities might be more of an alignment problem rather than a foundational capability issue. Similar to the difference between GPT-3 and InstructGPT, some open-source foundation models are simply trained to 'answer everything in one response regardless of the complexity of the question' - after all, that's the user preference in chatbot use cases. Just a bit of post-training on agentic trajectories can make an immediate and dramatic difference.

As a thank-you to the community, he shared 100 invite codes, first-come first-served; just use “HUGGINGFACE” to get access!
upvoted an article 7 days ago

🦸🏻#12: How Do Agents Learn from Their Own Mistakes? The Role of Reflection in AI

By Kseniase
published an article 7 days ago

🦸🏻#12: How Do Agents Learn from Their Own Mistakes? The Role of Reflection in AI

By Kseniase
replied to their post 7 days ago
posted an update 7 days ago
upvoted an article 9 days ago

Everything You Need to Know about Knowledge Distillation

By Kseniase and 1 other
published an article 10 days ago

Everything You Need to Know about Knowledge Distillation

By Kseniase and 1 other
upvoted an article 13 days ago

🌁#90: Why AI’s Reasoning Tests Keep Failing Us

By Kseniase