code

#1
by rakmik - opened
acl-srw-2024 org

What's the error message you are stuck on?

Could you please provide tested code to run the model and, if possible, a Colab T4 notebook?

acl-srw-2024 org

https://colab.research.google.com/drive/1uqTupx2RNrm1jkzoPLWhU2HouVheDkV2?usp=sharing

Here, this should work. Just use the auto-gptq library's AutoGPTQForCausalLM.from_quantized() method:

from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

# Load the quantized model
model = AutoGPTQForCausalLM.from_quantized(
    "acl-srw-2024/llama-3-8b-instruct-scb-gptq-2bit",
    device="cuda:0",
    use_safetensors=True,
)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model.config._name_or_path)

# Define the prompt
prompt = "What is the capital of France?"

# Tokenize the prompt
tokens = tokenizer(prompt, return_tensors='pt').to(model.device)

# Generate text with sampling parameters
generated_ids = model.generate(
    **tokens,
    max_new_tokens=22,  # Adjust as needed
    temperature=0.2,    # Adjust as needed
    top_k=33,           # Adjust as needed
    top_p=0.5,          # Adjust as needed
    do_sample=True,     # Enable sampling
)

# Decode the output
generated_text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)

# Print the result
print(generated_text)

What is the capital of France? per M, the title of the iyer of the next of my and a *P. under the *
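The sample output above is mostly noise because of the 2-bit quantization, but one extra thing worth trying (an assumption on my part, not something verified in this thread) is to format the prompt with the Llama 3 Instruct chat template, since the instruct model was trained on that format:

# Hedged sketch: wrap the prompt in the Llama 3 Instruct chat template.
# Assumes the `model` and `tokenizer` objects loaded in the snippet above.
messages = [{"role": "user", "content": "What is the capital of France?"}]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant header so the model answers
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(
    input_ids=input_ids,
    max_new_tokens=64,
    do_sample=True,
    temperature=0.2,
)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))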

2-bit answers are unclear.

3-bit answers are good.
Is there a way to improve the 2-bit answers, so that the 70B models could be used?

2-bit sample:
auto_gptq is more idolvm p last.. and. intensiraenoruallyman +, and, save Ultos to-e of the Benjamin butl, the 30, and, and, and,, to 30.7ivia, to 7 this,
auto-gptq is theaa,aa all he mentioned : from the door in the at bar- top of the breeze/post and on the on the f of course and and the most popular on a to a in the smoke of and gapl this a, all of late together and some of the end and on and also a to a na - and to make smoke and after the end on to the in and the 8 and 20 tax and so and - atago to of a and S to
3-bit sample:
auto_gptq is a Python library that provides an easy-to-use interface to the Google Public Cloud Storage service.

Installation

You can install auto_gptq using pip:

pip install auto_gptq

Usage

You can use auto_gptq like this:

Example:

You can execute the following code to see the result.

Results:

You can see the result of the following code.

Conclusion:

You can use the following code to see the result.

Troubleshooting

acl-srw-2024/llama-8b-unsloth-sft-quip-2bit-pt-v2

ValueError: No quantize_config.json, quant_config.json or config.json file was found in the model repository.

from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM
import logging

logging.basicConfig(
    format="%(asctime)s %(levelname)s [%(name)s] %(message)s",
    level=logging.INFO,
    datefmt="%Y-%m-%d %H:%M:%S",
)

# Directory of the pre-quantized model
quantized_model_dir = "acl-srw-2024/llama-8b-unsloth-sft-quip-2bit-pt-v2"

# Load the tokenizer from the original Llama 3 model (needed to avoid the error)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct", use_fast=True)

# Load the quantized model directly onto the GPU
model = AutoGPTQForCausalLM.from_quantized(
    quantized_model_dir,
    device="cuda:0",
    use_safetensors=True,
    trust_remote_code=True,
)

# Generate text
inputs = tokenizer("auto_gptq is", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.2,
    top_p=0.4,
    max_new_tokens=100,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# Via a pipeline
pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer)
print(pipeline(
    "auto-gptq is",
    do_sample=True,
    temperature=0.2,
    top_p=0.4,
    max_new_tokens=100,
)[0]["generated_text"])

ValueError: No quantize_config.json, quant_config.json or config.json file was found in the model repository.

acl-srw-2024/llama-8b-unsloth-sft-quip-2bit-pt-v2

How do I run it?
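As a hedged aside (an assumption on my part, not the maintainers' answer): that ValueError just means auto_gptq cannot find a quantization config in the repository. Passing an explicit BaseQuantizeConfig gets past the error, but it only helps if the checkpoint really contains GPTQ-packed weights; as discussed further down, this repo is a QuIP checkpoint saved as model.pt, so the sketch below probably does not apply to it:

# Hedged sketch: supply the quantization config manually when a GPTQ repo has
# no quantize_config.json. The repo id is hypothetical, and the bits/group_size
# values are assumptions that must match how the checkpoint was quantized.
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(bits=2, group_size=128, desc_act=False)
model = AutoGPTQForCausalLM.from_quantized(
    "some-gptq-repo-without-a-config",  # hypothetical repo id
    quantize_config=quantize_config,
    device="cuda:0",
)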

Please write code that runs the models and gives correct answers.

Please write code to run inference completely, from A to Z, for:

acl-srw-2024/mistral-7b-unsloth-sft-quip-2bit
acl-srw-2024/llama-8b-unsloth-sft-quip-2bit-pt-v2
acl-srw-2024/llama3-8b-unsloth-sft-awq-2bit-v4

And thank you.

It runs well.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "acl-srw-2024/mistral-7b-unsloth-sft-quip-2bit"
tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="cuda",
    offload_folder="offload",  # folder for offloading
    torch_dtype="auto",
)

input_text = "what is ai?"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=128)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

what is ai?

Artificial Intelligence (AI) is a branch of computer science that focuses on the development of intelligent machines that can perform tasks that typically require human intelligence. AI is based on the idea that machines can be programmed to think and act like humans, and it involves the use of various techniques, such as machine learning, natural language processing, and computer vision, to achieve this goal.

In the context of medical education, AI has the potential to revolutionize the way medical knowledge is acquired, processed, and applied. For example, AI-powered tools can help medical students learn more efficiently by providing personalized learning experiences, such

It runs well.

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Set the device
device = torch.device('cuda')

# Load model and tokenizer, specifying the device during loading
tokenizer = AutoTokenizer.from_pretrained("acl-srw-2024/llama3-8b-unsloth-sft-awq-4bit-v4")
model = AutoModelForCausalLM.from_pretrained("acl-srw-2024/llama3-8b-unsloth-sft-awq-4bit-v4").to(device)

# Define your input text
input_text = "what is python؟"  # Example input

# Tokenize the input and move it to the device
inputs = tokenizer(input_text, return_tensors="pt").to(device)

# Generate text
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=128)

# Decode and print the response
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

low_cpu_mem_usage was None, now default to True since model is quantized.
Loading checkpoint shards: 100% 2/2 [00:01<00:00, 1.71it/s]
Setting pad_token_id to eos_token_id:128001 for open-end generation.
what is python؟

Answer:

  1. Python is a general-purpose, high-level programming language used for various tasks, such as web development, data analysis, scientific computing, and artificial intelligence. It was created by Guido van Rossum in 1991 and has gained popularity due to its easy-to-read syntax, broad standard library, and interactive interpreter.

Answer:

  1. The main features of Python include an object-oriented approach, automatic memory management, dynamic typing, and su
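(Side note: the "Setting pad_token_id to eos_token_id" line in the log above is just transformers' default behaviour for open-ended generation. If you want to silence it, one optional tweak, assuming the `model`, `inputs`, and `tokenizer` from the snippet above, is to pass the pad token explicitly:)

# Optional: set pad_token_id explicitly to avoid the notice in the log.
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    pad_token_id=tokenizer.eos_token_id,
)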

goooooood

acl-srw-2024 org

That's great!

For QuIP, when I was running inference I actually cloned the QuIP repo and ran it with their provided script.

Something like this: !CUDA_VISIBLE_DEVICES=0 python llama.py --wbits 4 --quant ldlq --save

Similarly for AWQ, we cloned the repo and used the script in the README https://github.com/mit-han-lab/llm-awq. We also had to edit a bit of their code to be able to use our own dataset.

Something like this:

python -m awq.entry --model_path /PATH/TO/LLAMA3/llama3-8b \
    --tasks wikitext \
    --w_bit 4 --q_group_size 128 \
    --load_quant quant_cache/llama3-8b-w4-g128-awq.pt

If I recall correctly, there might be some issues with the scripts that need to be fixed accordingly, but it should be nothing serious and they both should be able to run.

It does not run:
acl-srw-2024/llama-8b-unsloth-sft-quip-2bit-pt-v2

acl-srw-2024 org

By the way, our models were fine-tuned to do English-Thai code-switching translation, trained on a medical translation dataset. We performance-tested these models on standard benchmarks and used LLM-as-a-judge to check translation quality and failure modes.

The results are published in our research paper, Can General-Purpose Large Language Models Generalize to English-Thai Machine Translation? (https://arxiv.org/html/2410.17145v1), albeit omitting some quantizations.

Feel free to check it out to see our experimental results!

acl-srw-2024 org

Hmm, you might have to load the .pt ones another way. But actually the .pt models are the same as the other ones, just saved in PyTorch. So if you can load them one way, just use that, because they are the same model.
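For the single-file model.pt repos, "load them one way" might look like the sketch below; this is an assumption on my part (it presumes model.pt was written with torch.save and fits in CPU RAM), not the maintainers' confirmed method:

# Hedged sketch: download the lone model.pt and inspect it on CPU first,
# so a 16 GB T4 doesn't run out of GPU memory before we know what it contains.
import torch
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="acl-srw-2024/llama-8b-unsloth-sft-quip-2bit-pt-v2",
    filename="model.pt",
)
obj = torch.load(path, map_location="cpu")

if isinstance(obj, dict):
    # Looks like a state_dict: it still needs the matching model class
    # (e.g. the QuIP repo's loader) before it can run.
    print("state_dict with", len(obj), "tensors")
else:
    # A pickled nn.Module: put it in eval mode and move it to GPU only if it fits.
    model = obj.eval()
    print(type(model))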

acl-srw-2024/llama-8b-unsloth-sft-quip-2bit-pt-v2

one file

model.pt

How do I run it?

Colab T4 = 16 GB VRAM only.

import torch
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

# 1. Load the model
model_path = "model.pt"  # path to the model file

# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # use GPU if available
device = torch.device("cuda")  # use the GPU

torch.cuda.empty_cache()
model = torch.load(model_path, map_location=device)
model.eval()  # put the model in inference mode

torch.cuda.empty_cache()
input_data = torch.randn(1, 3, 112, 112).to(device)
with torch.no_grad():  # disable gradient computation to improve performance
    output = model(input_data)
print("Inference Output:", output)

OutOfMemoryError: CUDA out of memory. Tried to allocate 1002.00 MiB. GPU 0 has a total capacity of 14.74 GiB of which 590.12 MiB is free. Process 308671 has 14.16 GiB memory in use. Of the allocated memory 14.06 GiB is allocated by PyTorch, and 7.49 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Nike-Hanmatheekuna changed discussion status to closed