What Happens If the Prompt Exceeds 8,192 Tokens? And What Is the Difference Between the Input Limit and the Context Length Limit?

#36
by averyyu99 - opened

Dear community members, I found that the maximum token limit for a prompt is 8,192 tokens. What happens if I provide a prompt longer than this limit? Will the prompt be automatically truncated, with only the first 8,192 tokens being processed? I tested this and didn't encounter any errors, so I'm wondering how the model handles prompts that exceed the limit.

Also, I'm curious about the difference between the input limit and the context length limit. Since Llama 3.3 has a context length of 128K tokens, does that mean we can use iterative prompting strategies to process longer texts effectively? If so, how does the model handle prompts that exceed the input limit within a single request?
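To make the second question concrete, this is roughly the kind of iterative / chunked prompting I have in mind. It is only a sketch: the 6,000-token chunk size is an arbitrary value below the assumed prompt limit, and `process_chunk` is a placeholder for whatever call actually hits the model.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.3-70B-Instruct")

def chunk_text(text, max_tokens=6000):
    """Yield pieces of `text` that each fit within `max_tokens` tokens."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    for start in range(0, len(ids), max_tokens):
        yield tokenizer.decode(ids[start:start + max_tokens])

def process_chunk(chunk: str) -> str:
    # Placeholder: replace with the actual model call (API request, pipeline, ...).
    return chunk[:100]

def process_long_document(text: str) -> str:
    # Each chunk is sent as its own prompt; the partial results are then
    # combined (e.g. summarize each chunk, then summarize the summaries).
    return "\n".join(process_chunk(chunk) for chunk in chunk_text(text))
```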

Any help or explanation is appreciated! Thanks : )

Please note that the maximum input prompt length is the same as the context window length, and for this model the context window is 131,072 tokens (~128K), as defined here. As such, the maximum token limit for a prompt is about 128K, not 8,192.

However, a very long prompt will consume more GPU memory, since the KV cache grows with the sequence length.
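If it helps, you can confirm the configured limit directly from the model config. A minimal sketch, assuming the transformers library and access to the gated checkpoint:

```python
from transformers import AutoConfig, AutoTokenizer

model_id = "meta-llama/Llama-3.3-70B-Instruct"

config = AutoConfig.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# max_position_embeddings is the context window the weights were configured for.
print(config.max_position_embeddings)  # expected: 131072 (~128K)

# model_max_length is the tokenizer-side default used when truncation is requested.
print(tokenizer.model_max_length)
```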

Best regards,

Shuyue
Dec. 18th, 2024


averyyu99 changed discussion status to closed
averyyu99 changed discussion status to open

This page (https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) says the context length for Llama 3.3 is 128K, but when I try to send anything larger than around 2K tokens, it gets truncated.
I asked Llama 3.3 why it truncates anything longer than 2,048 tokens when it says its context length is 131K, and this is how it responded :) :
"You are correct that I have a context length of 131k tokens, but I sometimes still truncate prompts to 2048 tokens due to various technical limitations and implementation details.

There are several reasons for this:

  1. Legacy code: Some parts of my architecture and training data are based on older models with smaller context lengths. As a result, some of my internal mechanisms may still default to the 2048 token limit.
  2. Tokenization: The way I tokenize text can affect the effective context length. While I can process up to 131k tokens, the actual number of characters or words that can fit within that limit depends on the specific tokenization scheme used.
  3. Implementation details: My model is a complex system with many interacting components. In some cases, these components may not be fully optimized for the larger context length, leading to truncation."

So basically it is advertised as a model with a context length of 131K tokens, but it cannot really be used for anything longer than 2,048 tokens?
That would be very misleading!
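
For anyone debugging the same thing: a 2,048-token cut-off is often a default max_length in the client or serving layer (hosted endpoints can impose their own input limits) rather than a property of the weights themselves. Here is a minimal sketch for checking whether truncation happens on the client side, assuming local use of the transformers tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.3-70B-Instruct")

long_prompt = "word " * 10_000  # roughly 10K tokens of filler text

# Without truncation the full prompt is kept, however long it is.
full = tokenizer(long_prompt, truncation=False)["input_ids"]

# With an explicit limit, only the first `max_length` tokens survive.
cut = tokenizer(long_prompt, truncation=True, max_length=2048)["input_ids"]

print(len(full), len(cut))  # e.g. ~10001 vs 2048
```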
