kishizaki-sci/Llama-3.1-405B-Instruct-AWQ-4bit-JP-EN

model information

Llama-3.1-405B-Instruct quantized to 4 bits using AutoAWQ. The calibration data used during quantization contains both Japanese and English text.
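
For reference, a minimal sketch of how such a quantization might be produced with AutoAWQ is shown below. The base-model path, the quant_config values, and the calib_texts placeholder are illustrative assumptions, not the exact settings used for this model.

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

base_model = "meta-llama/Llama-3.1-405B-Instruct"
quant_path = "Llama-3.1-405B-Instruct-AWQ-4bit-JP-EN"

# Typical AutoAWQ 4-bit settings; the group size and kernel version actually
# used for this model are not stated in the card.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(base_model)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# calib_texts: a list of Japanese and English strings (see "calibration data" below).
calib_texts = ["..."]

model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_texts)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)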

usage

vLLM

from vllm import LLM, SamplingParams
llm = LLM(
    model="kishizaki-sci/Llama-3.1-405B-Instruct-AWQ-4bit-JP-EN",
    tensor_parallel_size=4,
    gpu_memory_utilization=0.97,
    quantization="awq"
)
tokenizer = llm.get_tokenizer()
messages = [
    {"role": "system", "content": "ใ‚ใชใŸใฏๆ—ฅๆœฌ่ชžใงๅฟœ็ญ”ใ™ใ‚‹AIใƒใƒฃใƒƒใƒˆใƒœใƒƒใƒˆใงใ™ใ€‚ใƒฆใƒผใ‚ถใ‚’ใ‚ตใƒใƒผใƒˆใ—ใฆใใ ใ•ใ„ใ€‚"},
    {"role": "user", "content": "plotly.graph_objectsใ‚’ไฝฟใฃใฆๆ•ฃๅธƒๅ›ณใ‚’ไฝœใ‚‹ใ‚ตใƒณใƒ—ใƒซใ‚ณใƒผใƒ‰ใ‚’ๆ›ธใ„ใฆใใ ใ•ใ„ใ€‚"},
]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
sampling_params = SamplingParams(
    temperature=0.6,
    top_p=0.9,
    max_tokens=1024
)
outputs = llm.generate(prompt, sampling_params)
print(outputs[0].outputs[0].text)
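
Multiple conversations can also be batched into a single generate call. The sketch below reuses the llm, tokenizer, and sampling_params objects from the snippet above; the example prompts are illustrative.

# Sketch: batch several conversations in one call.
conversations = [
    [{"role": "user", "content": "Summarize AWQ quantization in one paragraph."}],
    [{"role": "user", "content": "日本語で自己紹介をしてください。"}],
]
prompts = [
    tokenizer.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
    for msgs in conversations
]
outputs = llm.generate(prompts, sampling_params)
for out in outputs:
    print(out.outputs[0].text)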

Please refer to this notebook for an example of running the model on an instance equipped with four H100 (94GB) GPUs.

calibration data

ไปฅไธ‹ใฎใƒ‡ใƒผใ‚ฟใ‚ปใƒƒใƒˆใ‹ใ‚‰512ๅ€‹ใฎใƒ‡ใƒผใ‚ฟ๏ผŒใƒ—ใƒญใƒณใƒ—ใƒˆใ‚’ๆŠฝๅ‡บใ€‚1ใคใฎใƒ‡ใƒผใ‚ฟใฎใƒˆใƒผใ‚ฏใƒณๆ•ฐใฏๆœ€ๅคง350ๅˆถ้™ใ€‚
Extract 512 data points and prompts from the following dataset. The maximum token limit per data point is 350.

  • TFMC/imatrix-dataset-for-japanese-llm
  • meta-math/MetaMathQA
  • m-a-p/CodeFeedback-Filtered-Instruction
  • kunishou/databricks-dolly-15k-ja
  • ใใฎไป–ๆ—ฅๆœฌ่ชž็‰ˆใƒป่‹ฑ่ชž็‰ˆใฎwikipedia่จ˜ไบ‹ใ‹ใ‚‰ไฝœๆˆใ—ใŸใ‚ชใƒชใ‚ธใƒŠใƒซใƒ‡ใƒผใ‚ฟ๏ผŒๆœ‰ๅฎณใƒ—ใƒญใƒณใƒ—ใƒˆๅ›ž้ฟใฎใŸใ‚ใฎใ‚ชใƒชใ‚ธใƒŠใƒซใƒ‡ใƒผใ‚ฟใ‚’ไฝฟ็”จใ€‚ Original data created from Japanese and English Wikipedia articles, as well as original data for avoiding harmful prompts, is used.

License

The MIT License applies to this model. However, please also comply with the Llama 3.1 Community License Agreement that applies to the base model being quantized.
