# Description

This model is a quantized version of `meta-llama/Meta-Llama-3-8B`, quantized with torchao's autoquant API. It contains both the model weights and a compilation artifacts cache that records the intermediate compilation artifacts.
# Quantization Details
- Quantization Type: autoquant
- min_sqnr: 20
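`min_sqnr` is the minimum signal-to-quantization-noise ratio, in dB, that autoquant requires a candidate quantization to meet relative to the unquantized output. As a rough illustration of what that metric measures (a pure-Python sketch, not torchao's internal implementation):

```python
import math

def sqnr_db(reference, dequantized):
    """Signal-to-quantization-noise ratio in dB between a reference
    tensor (flattened to a list) and its dequantized approximation."""
    signal_power = sum(x * x for x in reference)
    noise_power = sum((x - y) ** 2 for x, y in zip(reference, dequantized))
    return 10 * math.log10(signal_power / noise_power)

# Toy values: a small quantization error yields an SQNR well above 20 dB
reference = [1.0, -2.0, 3.0, -4.0]
dequantized = [1.01, -1.99, 3.02, -3.98]
print(round(sqnr_db(reference, dequantized), 1))  # ~44.8 dB
```

A higher threshold keeps the quantized model closer to the original at the cost of ruling out more aggressive (faster/smaller) quantization options.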
# Usage

You can use this model in your applications by loading it directly from the Hugging Face Hub:

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("jerryzh168/llama3-8b-autoquant")
```
To restore the compilation cache, download the artifacts file and load it with `torch.compiler.load_cache_artifacts`:

```python
import pickle

import torch
from huggingface_hub import hf_hub_download

hf_hub_download(repo_id="jerryzh168/llama3-8b-autoquant", filename="compile_artifacts.pt2", local_dir="/tmp/")
with open("/tmp/compile_artifacts.pt2", "rb") as f:
    artifacts = pickle.load(f)
artifact_bytes, cache_info = artifacts
torch.compiler.load_cache_artifacts(artifact_bytes)
```
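As the snippet above shows, the artifacts file is a pickled two-element tuple of `(artifact_bytes, cache_info)`. A minimal self-contained sketch of that on-disk layout, using placeholder values rather than the real blob produced by `torch.compiler.save_cache_artifacts`:

```python
import io
import pickle

# Placeholder stand-ins: in the real file, artifact_bytes is the opaque
# bytes blob returned by torch.compiler.save_cache_artifacts() and
# cache_info is its accompanying metadata object.
artifact_bytes = b"compiled-graph-blob"
cache_info = {"entries": 1}

# Write the same two-element tuple layout the repo's file uses...
buf = io.BytesIO()
pickle.dump((artifact_bytes, cache_info), buf)

# ...and read it back the way the usage snippet does.
buf.seek(0)
loaded_bytes, loaded_info = pickle.load(buf)
assert loaded_bytes == artifact_bytes and loaded_info == cache_info
```

Only `artifact_bytes` is passed to `torch.compiler.load_cache_artifacts`; `cache_info` describes what the cache contains.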