Choose a quantization strategy: this may involve static quantization of the weights, activation outputs, etc. (performed after the model has finished training), or dynamic quantization (performed at inference time).
Determine the quantization parameters: for each element being quantized, choose the quantization range and precision that best represent its value distribution.
Apply quantization: convert the model's parameters and activations from floating point to integers of the chosen precision.
Quantization-aware training (optional): to compensate for the error quantization introduces, the model is sometimes trained further so that its weights adjust to minimize the impact of quantization.
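As a concrete illustration of the dynamic-quantization strategy above, PyTorch offers a one-call API that converts a model's linear layers to int8 weights while quantizing activations on the fly at inference time. A minimal sketch (the toy two-layer model here is made up for the example):

```python
import torch
import torch.nn as nn

# A toy network standing in for a real model (hypothetical, for illustration)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

# Dynamic quantization: weights are stored as int8, activations are
# quantized on the fly during each forward pass
qmodel = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# The quantized model is a drop-in replacement for inference
out = qmodel(torch.randn(1, 8))
```

Note that this happens entirely post-training: no retraining is needed, which is what distinguishes it from quantization-aware training.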
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GPT2Model

model_name = "./models/gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# fix dtype post quantization to "pretend" to be fp32
def get_float32_dtype(self):
    return torch.float32

GPT2Model.dtype = property(get_float32_dtype)

# This will return the memory footprint of the current model in bytes
model.get_memory_footprint()  # 510342192 => 0.47GB
r_min = min(x_1, x_2, ..., x_n)
r_max = max(x_1, x_2, ..., x_n)
s = (r_max - r_min) / (q_max - q_min)
z = q_min - r_min / s
q = round(x / s + z)
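The formulas can be checked with a small worked example. The range below is made up for illustration, chosen so that the arithmetic comes out exact:

```python
# real-value range (example values chosen so the arithmetic is exact)
r_min, r_max = -1.0, 0.9921875
q_min, q_max = 0, 255          # uint8 target range

s = (r_max - r_min) / (q_max - q_min)  # scale = 1/128 = 0.0078125
z = q_min - r_min / s                  # zero-point = 128.0

# quantize a sample value x = 0.5
q = round(0.5 / s + z)                 # round(64 + 128) = 192
```

So a real value of 0.5 maps to the integer 192, and the pair (s, z) is all that is needed to map it back.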
def quantize(t):
    # obtain range of values in the tensor to map between 0 and 255
    min_val, max_val = t.min().item(), t.max().item()

    # determine the "zero-point", or value in the tensor to map to 0
    scale = (max_val - min_val) / 255
    zero_point = min_val

    # quantize and clamp to ensure we're in [0, 255]
    t_quant = (t - zero_point) / scale
    t_quant = torch.clamp(t_quant, min=0, max=255)

    # keep track of scale and zero_point for reversing quantization
    state = (scale, zero_point)

    # cast to uint8 and return
    t_quant = t_quant.type(torch.uint8)
    return t_quant, state
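The `quantize_model` helper used below is not shown in this excerpt; a plausible sketch is to apply `quantize` to every parameter in place and collect each tensor's `(scale, zero_point)` state for later dequantization (the fence repeats `quantize` so it is self-contained):

```python
import torch

def quantize(t):
    # per-tensor affine mapping of t onto the uint8 range [0, 255]
    min_val, max_val = t.min().item(), t.max().item()
    scale = (max_val - min_val) / 255
    zero_point = min_val
    t_quant = torch.clamp((t - zero_point) / scale, min=0, max=255)
    return t_quant.type(torch.uint8), (scale, zero_point)

def quantize_model(model):
    # replace each parameter with its uint8 version, keeping the states
    # needed to reverse the mapping at inference time
    states = {}
    for name, param in model.named_parameters():
        param.requires_grad = False
        q, state = quantize(param.data)
        param.data = q
        states[name] = state
    return states
```

Gradients must be disabled before swapping in integer data, since autograd only supports floating-point tensors.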
states = quantize_model(model)
model.get_memory_footprint()  # 137022768 => 0.12GB
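To actually run inference, the uint8 weights must be mapped back to floating point. A minimal `dequantize` sketch (the function name is an assumption, not from the excerpt) with a round-trip check of the error bound:

```python
import torch

def dequantize(t_quant, state):
    # invert the affine mapping: x ≈ q * scale + zero_point
    scale, zero_point = state
    return t_quant.to(torch.float32) * scale + zero_point

# hypothetical round-trip on a random tensor
t = torch.randn(64)
min_val, max_val = t.min().item(), t.max().item()
scale = (max_val - min_val) / 255
t_q = torch.clamp((t - min_val) / scale, 0, 255).round().type(torch.uint8)
t_back = dequantize(t_q, (scale, min_val))

# with rounding, each element is off by at most half a quantization step
print("max reconstruction error:", (t - t_back).abs().max().item())
```

This is why a smaller range (r_max - r_min) or a wider integer range directly translates into lower quantization error: both shrink the scale s.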