pip install -U optimum[neural-compressor] intel-extension-for-transformers
from transformers import AutoModel, AutoTokenizer
from neural_compressor.config import PostTrainingQuantConfig
from optimum.intel import INCQuantizer


def quantize(model_name: str, output_path: str, calibration_set: "datasets.Dataset"):
    model = AutoModel.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Tokenize the calibration texts to a fixed length, so static quantization
    # calibrates on the same shapes used at inference time.
    def preprocess_function(examples):
        return tokenizer(examples["text"], padding="max_length", max_length=512, truncation=True)

    vectorized_ds = calibration_set.map(preprocess_function, num_proc=10)
    vectorized_ds = vectorized_ds.remove_columns(["text"])

    # Post-training static quantization with the IPEX backend.
    quantizer = INCQuantizer.from_pretrained(model)
    quantization_config = PostTrainingQuantConfig(approach="static", backend="ipex", domain="nlp")
    quantizer.quantize(
        quantization_config=quantization_config,
        calibration_dataset=vectorized_ds,
        save_directory=output_path,
        batch_size=1,
    )
    tokenizer.save_pretrained(output_path)
The calibration dataset used here is available at https://huggingface.co/datasets/allenai/qasper.
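As a minimal sketch of how the quantize() helper above might be called: the snippet below samples a small calibration set from qasper, exposes it through a "text" column, and writes the INT8 model to a local directory. The "abstract" field name, the sample size, and the output path are assumptions; any representative text corpus placed in a "text" column will do.

from datasets import load_dataset

# Build a small calibration set with a "text" column (field name assumed from the
# qasper dataset card; adjust if the schema differs).
raw_ds = load_dataset("allenai/qasper", split="train")
calibration_set = (
    raw_ds.select(range(100))                        # a few hundred samples is usually enough
          .map(lambda ex: {"text": ex["abstract"]})  # expose the field quantize() expects
          .select_columns(["text"])
)

quantize(
    model_name="BAAI/bge-small-en-v1.5",
    output_path="bge-small-en-v1.5-rag-int8-static",
    calibration_set=calibration_set,
)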
import torch
from optimum.intel import IPEXModel
from transformers import AutoTokenizer

# Load the published INT8 statically quantized checkpoint via the IPEX backend.
model = IPEXModel.from_pretrained("Intel/bge-small-en-v1.5-rag-int8-static")
tokenizer = AutoTokenizer.from_pretrained("Intel/bge-small-en-v1.5-rag-int8-static")

sentences = ["Example query about RAG", "Example passage to embed"]  # example inputs
inputs = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    # Use the [CLS] token representation as the sentence embedding.
    embeddings = outputs[0][:, 0]
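For retrieval, BGE-style embeddings are typically L2-normalized and compared with cosine similarity. The short sketch below shows that scoring step; treating the first sentence as the query and the rest as passages is an assumption about how the example inputs are laid out.

import torch.nn.functional as F

# L2-normalize so dot products equal cosine similarity.
embeddings = F.normalize(embeddings, p=2, dim=1)
query, passages = embeddings[:1], embeddings[1:]
scores = (query @ passages.T).squeeze(0)       # similarity of the query to each passage
ranking = scores.argsort(descending=True)      # best-matching passages first
print(scores.tolist(), ranking.tolist())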
As the results above show, the quantized model delivers substantially lower latency and higher throughput. In the next post we will introduce a related tool that helps manage the RAG pipeline more efficiently.
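For reference, latency and throughput numbers like these can be measured with a simple timing loop such as the rough sketch below; the batch size, sequence length, and iteration counts are illustrative, not the settings behind the results above.

import time
import torch

def benchmark(model, tokenizer, batch_size=1, seq_len=512, n_iters=100):
    # Synthetic batch at a fixed sequence length, mirroring the calibration setup.
    dummy = tokenizer(
        ["hello world"] * batch_size,
        padding="max_length", max_length=seq_len, truncation=True, return_tensors="pt",
    )
    with torch.no_grad():
        for _ in range(10):                      # warm-up runs
            model(**dummy)
        start = time.perf_counter()
        for _ in range(n_iters):
            model(**dummy)
        elapsed = time.perf_counter() - start
    latency_ms = elapsed / n_iters * 1000
    throughput = batch_size * n_iters / elapsed
    return latency_ms, throughput

latency, throughput = benchmark(model, tokenizer)
print(f"latency: {latency:.2f} ms/batch, throughput: {throughput:.1f} samples/s")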