NOTICE

Currently benchmarking this; I'm not sure how accurate it is yet. I'll be updating this.

Update: Still testing, but this seems to be pretty close to where it should be. I might be able to improve it by 1-2%.

snowflake2_m_uint8

This is a slightly modified version of the uint8 quantized ONNX model from https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v2.0

I have added a linear quantization node before the sentence_embedding output so that the model directly outputs a 768-dimensional uint8 tensor.
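
For reference, here is a minimal sketch of how a QuantizeLinear node can be appended to the graph with the onnx Python API. The file names, the internal tensor name sentence_embedding_fp32, and the scale/zero-point values are placeholders, not the exact script used for this model:

import numpy as np
import onnx
from onnx import helper, numpy_helper

model = onnx.load("model_uint8.onnx")  # the unmodified quantized model

# quantization parameters derived from the observed output range (see below)
model.graph.initializer.extend([
    numpy_helper.from_array(np.array(0.0021961, dtype=np.float32), "emb_scale"),
    numpy_helper.from_array(np.array(114, dtype=np.uint8), "emb_zero_point"),
])

# rename the existing FP32 output so the new uint8 output can take its name
for node in model.graph.node:
    for i, name in enumerate(node.output):
        if name == "sentence_embedding":
            node.output[i] = "sentence_embedding_fp32"

model.graph.node.append(helper.make_node(
    "QuantizeLinear",
    inputs=["sentence_embedding_fp32", "emb_scale", "emb_zero_point"],
    outputs=["sentence_embedding"],
    name="quantize_sentence_embedding",
))

# the declared graph output type also has to change to UINT8
for out in model.graph.output:
    if out.name == "sentence_embedding":
        out.type.tensor_type.elem_type = onnx.TensorProto.UINT8

onnx.save(model, "snowflake2_m_uint8.onnx")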

This makes the output directly compatible with the Qdrant uint8 datatype for collections; see the example below.
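
For example, with recent versions of qdrant-client, a collection that stores these vectors natively as uint8 can be created like this (the URL and collection name are placeholders):

from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")
client.create_collection(
    collection_name="snowflake_uint8",
    vectors_config=models.VectorParams(
        size=768,
        distance=models.Distance.COSINE,
        datatype=models.Datatype.UINT8,  # store vector elements as uint8
    ),
)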

No benchmarks yet, but in my limited testing the output matches the FP32 output of the uint8 quantized ONNX model as closely as uint8 precision allows.

Quantization method

I ran every token through the unmodified uint8 ONNX model and logged the highest and lowest FP32 values seen in the output tensor.
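
This is roughly what that calibration pass looks like. A minimal sketch, assuming the unmodified model file is named model_uint8.onnx and that sentence_embedding is the second output (check session.get_outputs() to be sure):

import numpy as np
import onnxruntime as rt
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained(
    "Snowflake/snowflake-arctic-embed-m-v2.0"
)
session = rt.InferenceSession(
    "model_uint8.onnx", providers=["CPUExecutionProvider"]  # unmodified model
)

lo, hi = np.inf, -np.inf
for token_id in range(tokenizer.vocab_size):
    # wrap the single token with the model's usual special tokens
    ids = tokenizer.build_inputs_with_special_tokens([token_id])
    out = session.run(None, {
        "input_ids": np.array([ids], dtype=np.int64),
        "attention_mask": np.ones((1, len(ids)), dtype=np.int64),
    })[1]  # FP32 sentence_embedding (output index is an assumption)
    lo = min(lo, float(out.min()))
    hi = max(hi, float(out.max()))
print(lo, hi)  # observed: roughly -0.25 and 0.31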

That range is approximately -0.25 to 0.31. I set the scale and zero point from that range and quantize accordingly, directly in this ONNX model.
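
Concretely, the affine quantization parameters derived from that range work out to roughly the following (a sketch; the exact values baked into this model's graph may differ slightly):

# linear/affine quantization: q = clamp(round(x / scale) + zero_point, 0, 255)
lo, hi = -0.25, 0.31
scale = (hi - lo) / 255          # ~0.0021961
zero_point = round(-lo / scale)  # ~114

def quantize(x):
    return max(0, min(255, round(x / scale) + zero_point))

print(quantize(lo), quantize(0.0), quantize(hi))  # 0 114 255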

Here's what the graph of the original output looks like:

[image: original model graph]

Here's what the new graph in this model looks like:

[image: modified model graph]

Example inference code

import numpy as np
import onnxruntime as rt
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained(
    "snowflake2_m_uint8" # path to the folder for this model goes here
)
session = rt.InferenceSession(
    "snowflake2_m_uint8.onnx", providers=["CPUExecutionProvider"]
)
example_text = "text you want to get an embedding vector for goes here"
enc = tokenizer(example_text, return_tensors="np")
embeddings = session.run(
    None,
    {
        "input_ids": enc["input_ids"].astype(np.int64),
        "attention_mask": enc["attention_mask"].astype(np.int64),
    },
)
e = embeddings[1][0]  # the sentence_embedding output: a uint8 array of size 768
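
If you need approximate FP32 values back (e.g. to sanity-check against the unmodified model), you can dequantize with the same parameters. The scale and zero point below are assumptions derived from the range stated above, not values read out of the graph:

import numpy as np

scale = (0.31 - (-0.25)) / 255  # ~0.0021961
zero_point = 114                # round(0.25 / scale)
e_fp32 = (e.astype(np.float32) - zero_point) * scale  # approximate FP32 embedding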