Description

This model is a quantized version of the original model "meta-llama/Meta-Llama-3-8B", quantized with torchao's autoquant API. It contains both the model weights and a cache of the intermediate torch.compile artifacts, so compilation can be warmed up from the cache instead of starting from scratch.
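
For context, a cache like this can be captured with torch.compiler.save_cache_artifacts() after running the compiled model. A minimal sketch, not necessarily the exact script used to produce this repo, and assuming at least one forward pass through the compiled model has already happened:

import pickle
import torch

# collect the compiler caches accumulated during the compiled run;
# this returns a (bytes, CacheInfo) tuple
artifacts = torch.compiler.save_cache_artifacts()

# pickling the whole tuple matches the loading code in the Usage section below
with open("compile_artifacts.pt2", "wb") as f:
    pickle.dump(artifacts, f)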

Quantization Details

  • Quantization Type: autoquant
  • min_sqnr: 20 (the minimum signal-to-quantized-noise ratio a quantization method must reach to be selected for a layer)
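
For reference, a checkpoint like this can be produced roughly as follows. This is a minimal sketch, not necessarily the author's exact script, and it assumes a recent torchao release where autoquant accepts min_sqnr:

import torch
import torchao
from transformers import AutoModelForCausalLM

# load the base model in bf16 on GPU
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", torch_dtype=torch.bfloat16, device_map="cuda"
)

# wrap with torch.compile and let autoquant choose a quantization per layer,
# rejecting candidates whose accuracy falls below the min_sqnr threshold
model = torchao.autoquant(torch.compile(model), min_sqnr=20)

# a representative forward pass triggers shape recording and kernel benchmarking
example = torch.randint(0, 10_000, (1, 64), device="cuda")
model(example)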

Usage

You can use this model in your applications by loading it directly from the Hugging Face Hub:

from huggingface_hub import hf_hub_download
from transformers import AutoModel
import pickle
import torch

# load the quantized weights
model = AutoModel.from_pretrained("jerryzh168/llama3-8b-autoquant")

# download the pickled (artifact_bytes, cache_info) tuple and
# pre-populate torch.compile's caches from it
hf_hub_download(repo_id="jerryzh168/llama3-8b-autoquant", filename="compile_artifacts.pt2", local_dir="/tmp/")

with open("/tmp/compile_artifacts.pt2", "rb") as f:
    artifacts = pickle.load(f)

artifact_bytes, cache_info = artifacts
torch.compiler.load_cache_artifacts(artifact_bytes)
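
With the caches pre-populated, compiling the model should mostly hit the cache rather than recompile from scratch. A minimal sketch of a follow-up forward pass; the tokenizer repo and prompt here are illustrative assumptions, not taken from this card:

import torch
from transformers import AutoTokenizer

# tokenizer from the base model (assumption: this repo may not ship its own)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
inputs = tokenizer("The capital of France is", return_tensors="pt")

# compilation reuses the preloaded artifacts where possible
compiled_model = torch.compile(model)
outputs = compiled_model(**inputs)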