Shorter reasoning?

#2 - opened by ThijsL202

Do you have any tips for shorter reasoning? When response tokens are set to 512 it sometimes ends in shorter responses, but it doesn't really matter what length I pick: set it to 4096 and it uses all 4096 tokens it can generate purely on reasoning, which is definitely not ideal for roleplaying, and it takes a long time.

Using koboldcpp + SillyTavern text completion

So far, this works great

Context template: Command R
Instruct template: Llama 3 Instruct

System prompt: CREATIVE SIMPLE [reasoning on]:

Enable deep thinking subroutine. You are an AI assistant developed by a worldwide community of AI experts.

Your primary directive is to provide highly creative, well-reasoned, structured, and extensively detailed responses.

Formatting Requirements:

1. Always structure your replies using: <think>{reasoning}</think>{answer}
2. The <think></think> block should contain at least six reasoning steps when applicable.
3. If the answer requires minimal thought, the <think></think> block may be left empty.
4. The user does not see the <think></think> section. Any information critical to the response must be included in the answer.
5. If you notice that you have engaged in circular reasoning or repetition, immediately terminate {reasoning} with a </think> and proceed to the {answer}.

Response Guidelines:

1. Detailed and Structured: Use rich Markdown formatting for clarity and readability.
2. Creative and Logical Approach: Your explanations should reflect the depth and precision of the greatest creative minds first.
3. Prioritize Reasoning: Always reason through the problem first, unless the answer is trivial.
4. Concise yet Complete: Ensure responses are informative, yet to the point without unnecessary elaboration.
5. Maintain a professional, intelligent, and analytical tone in all interactions.
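
For what it's worth, it's the front end, not the model, that hides the block (rule 4 above). Here's a minimal sketch of the kind of stripping SillyTavern's reasoning parsing does; the helper name and details are my own illustration:

```python
import re

def split_reasoning(raw: str) -> tuple[str, str]:
    """Split a <think>-formatted completion into (reasoning, answer)."""
    m = re.search(r"<think>(.*?)</think>", raw, flags=re.DOTALL)
    if not m:
        # Model skipped the block or never closed it: show everything.
        return "", raw.strip()
    return m.group(1).strip(), raw[m.end():].strip()

reasoning, answer = split_reasoning("<think>step 1 ... step 6</think>She smiles.")
print(answer)  # -> "She smiles."
```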

Temp: 0.6-1.2
Top K: 0
Top P: 1
Min P: 0.05
Rep Pen: 1.07 (range: 0)
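
For reference, this is roughly how those settings map onto a raw request against koboldcpp's KoboldAI-compatible /api/v1/generate endpoint (field names as I understand that API, so double-check against your koboldcpp version; URL and port are the defaults):

```python
import requests

payload = {
    "prompt": "...full instruct-formatted chat prompt here...",
    "max_length": 512,   # response tokens
    "temperature": 0.8,  # somewhere in the 0.6-1.2 range above
    "top_k": 0,
    "top_p": 1.0,
    "min_p": 0.05,
    "rep_pen": 1.07,
    "rep_pen_range": 0,
}
r = requests.post("http://localhost:5001/api/v1/generate", json=payload)
print(r.json()["results"][0]["text"])
```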

But then the output after </think> is just way too short: ~150 tokens used out of 512.

Hey,

RE: first comment.
I hear what you are trying to do here; the issue is that the model needs some type of guidance on output length.

In "response guidelines" ; add :

  1. Limit response length to 150-300 words.
     OR
  2. Your response should be vivid, expansive and detailed.

(Don't use "tokens"; it won't work correctly, as a token can be a partial word or a full word.)
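
To see why word counts work where token counts don't, here's a quick illustration (using OpenAI's tiktoken purely as a stand-in tokenizer; the exact splits differ per model, but the mismatch is universal):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for word in ["cat", "roleplaying", "extraordinarily"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{word!r}: {len(ids)} token(s) -> {pieces}")
# Longer or rarer words split into several tokens, so "limit to N tokens"
# gives the model no word count it can actually follow.
```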

Then try a few tests.
NOTE: Temp MAY have an effect here, so try 5 tests at temp 0.6 and 5 at temp 1.2 to check this issue (a rough harness is sketched below).
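
Something like this makes that comparison quick (same hypothetical endpoint and field names as the earlier sketch; only temperature changes between runs):

```python
import requests

URL = "http://localhost:5001/api/v1/generate"  # default koboldcpp port
PROMPT = "...your full formatted prompt here..."

for temp in (0.6, 1.2):
    for trial in range(5):
        r = requests.post(URL, json={
            "prompt": PROMPT,
            "max_length": 512,
            "temperature": temp,
            "min_p": 0.05,
        })
        text = r.json()["results"][0]["text"]
        answer = text.split("</think>", 1)[-1]  # keep only the reply part
        print(f"temp={temp}, trial={trial + 1}: {len(answer.split())} words")
```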

Another option (this relates to "#6" above, option "2"):
Look at the actual output: does it need more description? More info? Something else?
If so, alter the prompt, IE: "vividly describe XYZ in detail". This will force the AI to use more tokens.
This is akin to a "prose directive".

The issue is that the model will default to a "default prose style", which may be too sparse, and the result is low token output.
I hope I am explaining that correctly.

SIDE NOTE:
Even when you direct / set up output length, it will be variable.
IE:

If the output max is 4096 tokens and I "tell" the AI to output only 1000 words (after the think block), the range can be 600-1200 words... sometimes a lot longer.
Temp can be a factor, sometimes a really big factor, but the prompt request matters too.

ADDED:
Try Top K = 40 to 100
and Rep Pen Range = 64 to 512.
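
In payload terms (again assuming the KoboldAI-compatible field names from the sketches above), that would be:

```python
payload = {
    # ...other fields as in the earlier sketch...
    "top_k": 40,           # sweep 40 up to 100
    "rep_pen": 1.07,
    "rep_pen_range": 256,  # sweep 64 up to 512 (was 0 above)
}
```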
