Shorter reasoning?
Do you have any tips for getting shorter reasoning? With response tokens set to 512, generation sometimes ends in shorter responses, but it doesn't really seem to matter what length I pick: set to 4096, the model uses all 4096 tokens it can generate purely on reasoning, which is definitely not ideal for roleplaying, and it takes a long time.
Using koboldcpp + SillyTavern text completion
So far, this works great:
Context template: Command R
Instruct template: Llama 3 Instruct
System prompt: CREATIVE SIMPLE [reasoning on]:
Enable deep thinking subroutine. You are an AI assistant developed by a worldwide community of AI experts.
Your primary directive is to provide highly creative, well-reasoned, structured, and extensively detailed responses.
Formatting Requirements:
1. Always structure your replies using: <think>{reasoning}</think>{answer}
2. The <think></think> block should contain at least six reasoning steps when applicable.
3. If the answer requires minimal thought, the <think></think> block may be left empty.
4. The user does not see the <think></think> section. Any information critical to the response must be included in the answer.
5. If you notice that you have engaged in circular reasoning or repetition, immediately terminate {reasoning} with a </think> and proceed to the {answer}.
Response Guidelines:
1. Detailed and Structured: Use rich Markdown formatting for clarity and readability.
2. Creative and Logical Approach: Your explanations should reflect the depth and precision of the greatest creative minds first.
3. Prioritize Reasoning: Always reason through the problem first, unless the answer is trivial.
4. Concise yet Complete: Ensure responses are informative, yet to the point without unnecessary elaboration.
5. Maintain a professional, intelligent, and analytical tone in all interactions.
Temp: 0.6-1.2
Top K: 0
Top P: 1
Min P: 0.05
Rep Pen: 1.07 (range: 0)
but then the output after </think> is just way too short: only ~150 tokens used of 512.
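For reference, here is roughly what these settings look like as a raw call, with the <think> block stripped the way the frontend hides it. This is a minimal sketch assuming koboldcpp's standard KoboldAI-compatible /api/v1/generate endpoint on its default port; the URL and prompt are placeholders:

```python
import re
import requests

# Assumed local koboldcpp instance (default port); adjust to your setup.
API_URL = "http://localhost:5001/api/v1/generate"

payload = {
    "prompt": "...",      # full formatted prompt from the frontend
    "max_length": 512,    # response token budget
    "temperature": 0.6,   # 0.6-1.2 per the settings above
    "top_k": 0,
    "top_p": 1.0,
    "min_p": 0.05,
    "rep_pen": 1.07,
    "rep_pen_range": 0,
}

result = requests.post(API_URL, json=payload, timeout=600).json()
text = result["results"][0]["text"]

# The <think>...</think> block is hidden from the user, so whatever the
# reasoning consumes comes straight out of the visible answer's budget.
answer = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
print(answer)
```

With reasoning eating most of the 512 tokens, that final `answer` is the short ~150-token remainder.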
Hey;
RE: 1st comment.
I hear what you are trying to do here; the issue is that the model needs some type of guidance on output length.
In "response guidelines" ; add :
- Limit response length to 150-300 words.
OR - Your response should be vivid, expansive and detailed.
(don't use "tokens"; it doesn't work correctly, as a token can be a partial word or a full word)
Then try a few tests.
NOTE: temp MAY have an effect here, so try 5 tests at temp 0.6 and 5 at temp 1.2 to check this issue.
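If you want to automate that check, a small sketch along these lines (same assumed /api/v1/generate endpoint as above; URL and prompt are placeholders) runs 5 generations at each temp and logs the length of the visible text after </think>:

```python
import re
import requests

API_URL = "http://localhost:5001/api/v1/generate"  # assumed local koboldcpp

def visible_words(prompt: str, temperature: float) -> int:
    """Generate once, strip the <think> block, return the visible word count."""
    payload = {
        "prompt": prompt,
        "max_length": 512,
        "temperature": temperature,
        "top_k": 0,
        "top_p": 1.0,
        "min_p": 0.05,
        "rep_pen": 1.07,
    }
    text = requests.post(API_URL, json=payload, timeout=600).json()["results"][0]["text"]
    answer = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    return len(answer.split())

prompt = "..."  # your full formatted chat prompt
for temp in (0.6, 1.2):
    lengths = [visible_words(prompt, temp) for _ in range(5)]
    print(f"temp {temp}: visible words per run = {lengths}")
```

If the two sets of lengths differ a lot, temp is part of your problem.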
Another option (this relates to "#6" above, option "2"):
Look at the actual output: does it need more description? More info? Something else?
If so, alter the prompt -> IE: "vividly describe XYZ in detail" -> this will force the AI to use more tokens.
This is akin to a "prose directive".
The issue is that the model defaults to a "default prose style", which may be too sparse, and the result is low token output.
I hope I am explaining that correctly.
SIDE NOTE:
Even when you direct / set up the output length, it will be variable.
IE:
If the output max is 4096 tokens and I "tell" the AI to only output 1000 words (after the think block), the range can be 600-1200 words... sometimes a lot longer.
Temp can be a factor (sometimes a really big factor), but the prompt request matters too.
ADDED:
Try Top K = 40 to 100
and Rep Pen Range = 64 to 512.
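In payload terms from the sketches above, those land in the same request body (the values here are just mid-range picks for illustration, not recommendations):

```python
# Assumed koboldcpp /api/v1/generate fields, as in the earlier sketches;
# merge these into the generation payload.
extra_samplers = {
    "top_k": 64,           # try values in the 40-100 range
    "rep_pen_range": 256,  # try values in the 64-512 range
}
```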