---
base_model:
- Qwen/Qwen2.5-3B-Instruct
pipeline_tag: text-generation
library_name: transformers
---

# AbleCredit Reasoner R0 Qwen 2.5 3B Instruct

## Introduction

This model is trained with DeepSeek-R1-style reinforcement learning (GRPO) on Qwen 2.5 3B Instruct as the base model. It is primarily intended for research into applying small LLMs trained with GRPO/RL to domains such as finance and credit underwriting.

### Model Description

- **Fine-tuned by:** AbleCredit (LightBees Technologies Private Limited, Bengaluru, India)
- **License:** We have retained the original Qwen research license. Note that this license does not allow commercial use.
- **Fine-tuned from model:** Qwen/Qwen2.5-3B-Instruct

## How to Get Started with the Model

Use with a standard Hugging Face `transformers` setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "AbleCredit/AbleCredit-R0-Qwen-2.5-3B-Instruct"  # or local path to model

system_prompt = {
    "role": "system",
    "content": (
        "You are a helpful assistant. User asks a question the assistant answers it.\n"
        "The assistant first thinks about reasoning process in mind and then provides the user with the answer."
    ),
}

# Pre-fill the start of the assistant turn so generation continues from it.
suffix_prompt = {
    "role": "assistant",
    "content": "Let me solve this step by step.\n",
}

prompt_msgs = [
    system_prompt,
    {"role": "user", "content": "What is 15 times 3 ?"},
    suffix_prompt,
]

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = tokenizer.apply_chat_template(
    prompt_msgs,
    tokenize=False,
    continue_final_message=True,
    add_generation_prompt=False,
)

# Tokenize the prompt and move it to the model's device.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

print("\nGenerating response...\n")
outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    temperature=0.5,
    min_p=0.01,
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("\nResponse:\n", response)
```

## Training Details

### Training Data

Trained on open-source logical-reasoning datasets and a proprietary finance dataset created by AbleCredit.com.

### Training Procedure

Trained with DeepSeek-style reinforcement learning, using GRPO with rule-based rewards (an illustrative sketch of such a reward function appears at the end of this card).

## Evaluation

- The model achieves a score of ~67% on the GSM8K benchmark in a **zero-shot** setting (see the benchmarking script for details; an illustrative evaluation sketch appears at the end of this card).

## Model Card Contact

[Contact Harshad Saykhedkar via LinkedIn](https://www.linkedin.com/in/harshadss/)
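
## Appendix: Illustrative Sketches

The training procedure above relies on rule-based rewards rather than a learned reward model. As an illustration, below is a minimal sketch of what such a reward function might look like. The `<think>`/`<answer>` tag convention, the helper names (`format_reward`, `correctness_reward`), and the weightings are assumptions borrowed from common DeepSeek-R1-style setups, not confirmed details of this model's training.

```python
import re

# Hypothetical rule-based reward for GRPO-style RL (illustrative only).
# Assumes completions wrap reasoning and answers in <think>/<answer> tags,
# a common DeepSeek-R1-style convention; the actual rules used to train
# this model may differ.
FORMAT_RE = re.compile(r"<think>.*?</think>\s*<answer>.*?</answer>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def format_reward(completion: str) -> float:
    """Reward completions that follow the expected tag structure."""
    return 1.0 if FORMAT_RE.search(completion) else 0.0

def correctness_reward(completion: str, gold_answer: str) -> float:
    """Reward completions whose extracted answer matches the reference."""
    match = ANSWER_RE.search(completion)
    if match is None:
        return 0.0
    return 2.0 if match.group(1).strip() == gold_answer.strip() else 0.0

def total_reward(completion: str, gold_answer: str) -> float:
    # Correctness is weighted above formatting; the weights are illustrative.
    return format_reward(completion) + correctness_reward(completion, gold_answer)
```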
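
For the GSM8K result, a zero-shot evaluation along the lines of the benchmarking script might look like the sketch below. This is not the official script: the answer-extraction regexes are naive assumptions, and `generate_answer` is a hypothetical wrapper around the generation code in the usage example above. GSM8K reference solutions do end with a `#### <number>` line, which the gold-answer extraction relies on.

```python
import re
from datasets import load_dataset

# Minimal zero-shot GSM8K accuracy sketch (illustrative; not the official
# benchmarking script). GSM8K gold answers end with "#### <number>".
GOLD_RE = re.compile(r"####\s*(-?[\d,\.]+)")

def extract_gold(solution: str) -> str:
    return GOLD_RE.search(solution).group(1).replace(",", "")

def extract_pred(output: str) -> str | None:
    """Naive extraction: take the last number in the model output."""
    nums = re.findall(r"-?\d[\d,\.]*", output)
    return nums[-1].rstrip(".").replace(",", "") if nums else None

def evaluate(generate_answer, n_samples: int = 200) -> float:
    """`generate_answer` is assumed to wrap the generation code above:
    it takes a question string and returns the decoded model output."""
    dataset = load_dataset("openai/gsm8k", "main", split="test")
    correct = 0
    for example in dataset.select(range(n_samples)):
        pred = extract_pred(generate_answer(example["question"]))
        if pred is not None and pred == extract_gold(example["answer"]):
            correct += 1
    return correct / n_samples
```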