GLM-4V-9B supports Chinese-English bilingual multi-turn dialogue at a high resolution of 1120 * 1120. In multimodal evaluations covering comprehensive Chinese/English ability, perceptual reasoning, text recognition, and chart understanding, GLM-4V-9B outperforms GPT-4-turbo-2024-04-09, Gemini 1.0 Pro, Qwen-VL-Max, and Claude 3 Opus.
Tech Report:
Abstract
This report primarily focuses on the GLM-4 language series, which includes GLM-4, GLM-4-Air, and GLM-4-9B.
GLM-4 models are pre-trained on ten trillion (10T) tokens, mostly in Chinese and English, along with a smaller corpus covering 24 other languages, and are aligned primarily for Chinese and English usage.
The high-quality alignment is achieved via a multi-stage post-training process, which involves supervised fine-tuning and learning from human feedback.
1. Introduction
The GPT-3.5 series improves on GPT-3 by combining instruction tuning, supervised fine-tuning (SFT), and/or reinforcement learning from human feedback (RLHF).
GLM (General Language Model) is pretrained with an autoregressive blank-infilling objective and can be fine-tuned on various natural language understanding and generation tasks.
GLM: General Language Model Pretraining with Autoregressive Blank Infilling
GLM-4's instruction-following capabilities at both the prompt and instruction levels are approximately as effective as GPT-4-Turbo in both English and Chinese.
GLM-4 outperforms GPT-4 and matches GPT-4-Turbo across eight dimensions in AlignBench
For long-context tasks, the GLM-4 (128K) model matches the performance of GPT-4 Turbo and Claude 3 Opus as measured by LongBench-Chat.
GLM-4-9B
Pre-trained on a multilingual corpus of nearly 10T tokens.
context length of 8192 (8K)
post-trained with the same pipeline and data used for GLM-4 (0520).
With less training compute, it outperforms Llama-3-8B and supports all the functionality of All Tools in GLM-4.
GLM-4-9B-Chat-1M is also provided, with a context length of 1 million (1M) tokens (about 2 million Chinese characters).
Pre-Training Data
The pre-training data consists of multilingual (mostly English and Chinese) documents from a mixture of different sources.
The data processing pipeline consists of three stages: deduplication, filtering, and tokenization.
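The report's notes here do not spell out the concrete implementation of these stages; the sketch below illustrates the first two with exact content-hash deduplication and two simple quality heuristics. The thresholds (`min_chars`, `max_symbol_ratio`) and the heuristics themselves are illustrative assumptions, not the rules used for GLM-4.

```python
import hashlib

def exact_dedup(docs):
    """Drop byte-identical documents via SHA-256 content hashing."""
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

def heuristic_filter(docs, min_chars=200, max_symbol_ratio=0.3):
    """Keep documents that pass simple quality heuristics (length, symbol ratio).
    Both thresholds are illustrative, not the report's actual rules."""
    kept = []
    for doc in docs:
        if len(doc) < min_chars:
            continue
        symbols = sum(1 for c in doc if not (c.isalnum() or c.isspace()))
        if symbols / len(doc) > max_symbol_ratio:
            continue
        kept.append(doc)
    return kept

raw_docs = ["a toy web document about general language models. " * 10] * 2 + ["spam!!! ###"]
corpus = heuristic_filter(exact_dedup(raw_docs))
print(len(raw_docs), "->", len(corpus))  # 3 -> 1: duplicate dropped, short noisy doc filtered
```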
A byte-level byte pair encoding (BPE) algorithm is used to learn the Chinese and multilingual tokens separately; these are then merged with the tokens of the cl100k_base tokenizer in tiktoken into a unified vocabulary with a size of 150,000.
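A rough Python sketch of this two-step recipe using the Hugging Face `tokenizers` library and `tiktoken`. The toy corpus and the intermediate vocabulary size are assumptions, and the final merge is only indicated conceptually; the exact procedure that produces the unified 150k vocabulary is not detailed in these notes.

```python
import tiktoken
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Toy multilingual corpus; the real input is the (mostly Chinese/English) pre-training data.
multilingual_corpus = ["通用语言模型 General Language Model", "字节级 BPE 分词 byte-level BPE"] * 200

# Step 1: learn byte-level BPE tokens on the Chinese / multilingual text.
bpe = Tokenizer(models.BPE())
bpe.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
trainer = trainers.BpeTrainer(
    vocab_size=50_000,  # illustrative; the real target tops the unified vocab up to 150k
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
    special_tokens=["<|endoftext|>"],
)
bpe.train_from_iterator(multilingual_corpus, trainer)

# Step 2: the reference vocabulary to merge with -- cl100k_base from tiktoken.
cl100k = tiktoken.get_encoding("cl100k_base")

# Step 3 (conceptual): union the newly learned tokens with cl100k_base's tokens and
# re-rank the merges so the unified vocabulary ends up at ~150,000 entries.
print(len(bpe.get_vocab()), cl100k.n_vocab)  # sizes of the two vocabularies to be merged
```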
data quality and diversity are crucial for building effective LLMs
Despite the lessons and insights gained, we have so far not identified fundamental principles that could guide the processes of data collection, cleaning, and selection.
Architecture
The GLM family of LLMs is built on the Transformer architecture.
**No Bias Except QKV**: To increase training speed, we have removed all bias terms with the exception of the biases in the Query, Key, and Value (QKV) of the attention layers. In doing so, we observed a slight improvement in length extrapolation.
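A minimal PyTorch illustration of this choice, assuming hypothetical layer shapes (not GLM-4's released dimensions): bias is enabled only on the Q/K/V projections.

```python
import torch.nn as nn

hidden = 4096  # illustrative hidden size, not GLM-4's actual configuration

# Bias kept only on the Query/Key/Value projections of attention.
q_proj = nn.Linear(hidden, hidden, bias=True)
k_proj = nn.Linear(hidden, hidden, bias=True)
v_proj = nn.Linear(hidden, hidden, bias=True)

# Every other linear layer (attention output, FFN projections, ...) drops its bias.
o_proj = nn.Linear(hidden, hidden, bias=False)
```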
**RMSNorm and SwiGLU**: These replace LayerNorm and ReLU, respectively.
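A compact PyTorch sketch of both components with illustrative dimensions: RMSNorm drops LayerNorm's mean-centering and bias, and SwiGLU gates the FFN with SiLU instead of applying ReLU.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Rescale by root-mean-square only; no mean subtraction or bias (unlike LayerNorm)."""
    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class SwiGLU(nn.Module):
    """SiLU-gated FFN replacing the ReLU MLP."""
    def __init__(self, dim: int, ffn_dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, ffn_dim, bias=False)
        self.up = nn.Linear(dim, ffn_dim, bias=False)
        self.down = nn.Linear(ffn_dim, dim, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

x = torch.randn(2, 16, 1024)
y = SwiGLU(1024, ffn_dim=2730)(RMSNorm(1024)(x))  # illustrative widths
print(y.shape)  # torch.Size([2, 16, 1024])
```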
**Rotary Positional Embeddings (RoPE)**: We have extended RoPE to a two-dimensional form to accommodate the 2D positional encoding in GLM.
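The exact 2D formulation is not reproduced in these notes; one common reading is to split each head's channels into two halves and rotate each half by one of GLM's two position ids (position in the corrupted input, and intra-span position). The PyTorch sketch below follows that reading and should be taken as an assumption, not GLM-4's verified implementation.

```python
import torch

def rope_1d(x: torch.Tensor, pos: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Standard RoPE over the last dimension of x, using integer positions `pos`."""
    d = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
    angles = pos[..., None].float() * inv_freq          # (seq, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(x: torch.Tensor, pos_a: torch.Tensor, pos_b: torch.Tensor) -> torch.Tensor:
    """Assumed 2D variant: first half of channels rotated by the first position id,
    second half by the second position id."""
    half = x.shape[-1] // 2
    return torch.cat([rope_1d(x[..., :half], pos_a), rope_1d(x[..., half:], pos_b)], dim=-1)

q = torch.randn(1, 8, 64)                 # (batch, seq, head_dim)
pos_a = torch.arange(8)                   # id 1: position in the corrupted input
pos_b = torch.zeros(8, dtype=torch.long)  # id 2: intra-span position (0 outside masked spans)
print(rope_2d(q, pos_a, pos_b).shape)     # torch.Size([1, 8, 64])
```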
**Group Query Attention (GQA)**: GQA replaces Multi-Head Attention (MHA) to cut down the KV cache size during inference. Since GQA uses fewer parameters than MHA, the FFN parameter count is increased to keep the model size the same, i.e., d_ffn is set to 10/3 of the hidden size.
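A compact sketch of the GQA computation and the 10/3 FFN sizing, using PyTorch's scaled_dot_product_attention; the head counts and hidden size here are illustrative, not GLM-4's released configuration.

```python
import torch
import torch.nn.functional as F

hidden, n_q_heads, n_kv_heads, head_dim = 4096, 32, 2, 128
d_ffn = int(hidden * 10 / 3)   # FFN widened to 10/3 * hidden to offset GQA's smaller KV projections
group = n_q_heads // n_kv_heads

q = torch.randn(1, 16, n_q_heads, head_dim).transpose(1, 2)   # (B, Hq, T, D)
k = torch.randn(1, 16, n_kv_heads, head_dim).transpose(1, 2)  # (B, Hkv, T, D): KV cache 16x smaller than MHA here
v = torch.randn(1, 16, n_kv_heads, head_dim).transpose(1, 2)

# Each group of `group` query heads shares one K/V head.
k = k.repeat_interleave(group, dim=1)
v = v.repeat_interleave(group, dim=1)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # (B, Hq, T, D)
print(out.shape, d_ffn)
```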
The context length of our models was extended from 2K (ChatGLM) to 32K (ChatGLM2 and ChatGLM3), and further to 128K and 1M (GLM-4).