Abstract
Integrating large language models (LLMs) into healthcare presents potential but faces challenges. Directly pre-training LLMs for domains like medicine is resource-heavy and sometimes unfeasible. Sole reliance on Supervised Fine-tuning (SFT) can result in overconfident predictions and may not tap into domain-specific insights. Addressing these challenges, we present a multi-stage training method combining Domain-specific Continued Pre-training (DCPT), SFT, and Direct Preference Optimization (DPO). A notable contribution of our study is the introduction of a 3 GB Chinese Medicine (ChiMed) dataset, encompassing medical question answering, plain texts, knowledge graphs, and dialogues, segmented into three training stages. The medical LLM trained with our pipeline, Qilin-Med, exhibits significant performance boosts. In the CPT and SFT phases, it achieves 38.4% and 40.0% accuracy on CMExam, surpassing Baichuan-7B's 33.5%. In the DPO phase, on the Huatuo-26M test set, it scores 16.66 in BLEU-1 and 27.44 in ROUGE-1, outperforming the SFT model's 12.69 and 24.21. This highlights the strength of our training approach in refining LLMs for medical applications.
1 Introduction
Incorporating LLMs such as GPT-4 (OpenAI, 2023) and its open-source counterpart LLaMA (Touvron et al., 2023b) into healthcare and biomedicine marks a significant shift with broad implications. These models show promise to enhance the efficiency and effectiveness of clinical and research operations, potentially revolutionizing patient care (Yang et al., 2023b; Karabacak and Margetis, 2023). They offer diverse downstream healthcare applications, from automating medical coding (Tu et al., 2022; Suvirat et al., 2023) to analyzing unstructured data for predictive insights (Jiang et al., 2023; Wornow et al., 2023; Hua et al., 2023; Wu et al., 2023), and from decision support (Qiu et al., 2023; Cheng et al., 2023; Chiesa-Estomba et al., 2023) to patient engagement improvement (Seth et al., 2023).
While the advantages of LLMs in healthcare are compelling, these models still have considerable room for improvement, given that medical and healthcare tasks represent some of the most challenging domains of natural language processing (NLP) (Hendrycks et al., 2021; Gu et al., 2021) and that the stakes of medical AI are exceptionally high, as errors can directly affect patient outcomes (Thirunavukarasu et al., 2023; Gu et al., 2021). One major limitation of current medical LLMs is their reliance on SFT alone during the training phase. While SFT is essential for acquiring domain-specific knowledge, it often results in limited knowledge infusion and can lead to overconfident generalizations if not curated meticulously (Luo et al., 2023). Reinforcement learning from human feedback (RLHF) is a popular method to counteract some of SFT's limitations, but it is complex and demands rigorous hyperparameter tuning. Consequently, current LLMs may be ill-equipped to handle the nuanced dynamics integral to actual medical consultations.
In response to these challenges, our study introduces Qilin-Med, an advanced Chinese medical LLM built upon a robust pipeline that integrates DCPT, SFT, and DPO. This comprehensive approach allows Qilin-Med to harness the power of expansive medical datasets, effectively transforming a general-purpose foundation model like Baichuan (Yang et al., 2023a) into a specialized medical expert proficient in understanding complex medical texts and capable of handling intricate medical tasks. In addition, we curated a unique dataset, ChiMed, which consists of sub-datasets corresponding to each of these three training stages to ensure a balanced and comprehensive injection of medical knowledge into the LLM.
The contributions of this study can be summarized as follows:
1. Construction of the ChiMed dataset, which encompasses sub-datasets for the DCPT, SFT, and DPO training stages, offering a holistic source for medical knowledge integration.
2. Implementation of a multi-stage knowledge injection pipeline and development of a Chinese medical LLM named Qilin-Med, effectively improving general-domain models on medical text understanding, instruction following, and preference alignment.
3. Empirical validation of our method across multiple datasets, including CMExam (Liu et al., 2023), C-Eval (Huang et al., 2023), and Huatuo-26M (Li et al., 2023a), setting new benchmarks in the realm of medical LLMs.
2 Related Work
2.1 Large Language Models
LLMs' effectiveness relies on large-scale pre-training, such as on datasets like CommonCrawl, Wiki, and Books (Zhao et al., 2023; Touvron et al., 2023a). They typically use next-token prediction as a key training objective to understand context and predict the next word (Zhao et al., 2023; Touvron et al., 2023a). This training objective has been widely used in existing LLMs, e.g., GPT-series models (OpenAI, 2023; Brown et al., 2020), PaLM (Chowdhery et al., 2022), LLaMA (Touvron et al., 2023a), LLaMA-2 (Touvron et al., 2023b), Alpaca (Taori et al., 2023), Vicuna (Chiang et al., 2023), and ChatGLM (Zeng et al., 2022a; Du et al., 2022).
2.2 Large Language Models in Healthcare
Healthcare-oriented LLMs have gained research attention, but current medical LLMs are typically either trained entirely from scratch, incurring high costs, time, and environmental impact, or fine-tuned from general-purpose LLMs. As an alternative, SFT methods have been introduced to adapt general LLMs to medical contexts. For example, Xiong et al. (2023) and Li et al. (2023b) proposed to fine-tune ChatGLM and LLaMA on physician-patient conversations to obtain DoctorGLM and ChatDoctor, respectively; MedAlpaca (Han et al., 2023) is fine-tuned on Alpaca with over 160,000 medical question-answering pairs generated from various medical corpora. BianQue (Yirong et al., 2023) incorporated multi-turn doctor Q&A datasets to perform a Chain of Questioning; Clinical Camel (Toma et al., 2023) simultaneously incorporated physician-patient conversations, clinical articles, and medical Q&A pairs to fine-tune the LLaMA-2 model. Additionally, instruction prompt tuning has been proposed to improve medical LLMs by aligning them to the medical domain. For example, Med-PaLM (Singhal et al., 2023a) and Med-PaLM-2 (Singhal et al., 2023b) had qualified clinicians construct the instruction data used to fine-tune PaLM. Huatuo (Wang et al., 2023a) and ChatGLM-Med (Wang et al., 2023b) constructed knowledge-based instruction data from knowledge graphs to inject medical knowledge into LLMs, thus improving downstream performance. Among existing medical LLMs, Huatuo (Wang et al., 2023a), ChatGLM-Med (Wang et al., 2023b), DoctorGLM (Xiong et al., 2023), and BianQue (Yirong et al., 2023) stand out as Chinese medical LLMs, which are especially valuable given the language inequality within the current NLP field (Bird, 2020; Zeng et al., 2022b).
A concurrent study (Yang et al., 2023c) also employed a multi-stage training approach to enhance a medical language model called Zhongjing. However, to align medical LLM outputs with human preferences, Zhongjing adopted RLHF, which requires labeling by medical experts and demands rigorous hyperparameter tuning. Our approach adopts DPO instead, which achieves the same goal automatically and efficiently. We also benchmarked medical LLM performance on a broader set of medical applications, as opposed to Zhongjing's sole focus on doctor-patient dialogues. In addition, we introduce a large-scale medical dataset, ChiMed, which incorporates a diverse set of data types (QA, plain texts, knowledge graphs, and dialogues) for each step of the proposed training strategy.
3 Method
Fig. 1 presents our three-fold pipeline with DCPT (Sec. 3.1), SFT (Sec. 3.2), and DPO (Sec. 3.3).
3.1 Domain-specific Continued Pre-training
General-purpose LLMs struggle with medical texts due to their specialized language and styles. Therefore, we started by further pre-training Baichuan, a Chinese foundation model, to strengthen its understanding of fundamental medical knowledge. As a first step, we constructed a medical pre-training dataset called ChiMed-CPT by integrating existing datasets with new data crawled from the internet.
Figure 1: The construction pipeline of Qilin-Med.
3.1.1 Pre-training Dataset Construction
Medical Data Collection. We collected four types of medical data: Question Answering, plain (i.e., unstructured) text, knowledge graph, and dialogue. The Question Answering subset contains three publicly available datasets: Huatuo-26M-encyclopedias (Li et al., 2023a), Huatuo-26M-medical_knowledge (Li et al., 2023a), and CMExam (Liu et al., 2023). Among them, Huatuo-26M-encyclopedias was curated from plain texts in Chinese Wikipedia and the Qianwen Health website; Huatuo-26M-medical_knowledge was curated from three knowledge graphs: CPubMed-KG (Qingcai Chen), 39Health-KG (Chen, 2018), and Xywy-KG (Bai, 2019); CMExam was sourced from the Chinese National Medical Licensing Examination. The plain text subset contains the MedQA-textbooks dataset (Jin et al., 2020), derived from textual data in Chinese medical textbooks. The knowledge graph subset contains data we extracted from CPubMed-KG, 39Health-KG, and Xywy-KG. To ensure the knowledge graph is comprehensive, we aggregated various features related to each disease entity, such as causation, symptoms, and recommended drugs. For the medical dialogue subset, we compiled a dataset named CMD, which comprises over 392K multi-turn medical dialogues sourced from various medical websites and covers 196 subspecialties. Furthermore, we incorporated resources from Chinese-medical-dialogue-data (Toyhom, 2019) and Medical-Dialogue-System (Chen et al., 2020). Finally, following the deduplication method proposed by Lee et al. (2022), we deduplicated the dataset, yielding ChiMed-CPT, totaling 3.0 GB of data; a simplified sketch of this step appears after Table 1.
Statistics of the dataset are summarized in Table 1.
Table 1: Statistics of ChiMed-CPT.
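To make the deduplication step concrete, the sketch below illustrates document-level exact-match deduplication in Python. Note that this is a simplification: the method of Lee et al. (2022) that we follow additionally removes repeated substrings with suffix arrays and near-duplicates with MinHash, which this sketch omits.

```python
import hashlib

def deduplicate(docs):
    """Drop byte-identical documents, keeping the first occurrence of each."""
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```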
3.1.2 Training Objective
We used a self-supervised objective, next-token prediction, for domain-specific continued pre-training. Given N sequences partitioned from ChiMed-CPT, where each sequence $x^{(i)} = (x^{(i)}_1, \dots, x^{(i)}_T)$ contains T tokens, we defined the loss function as the sum of the negative log probabilities of the next token given the previous tokens in the sequence:

$$\mathcal{L}_{\mathrm{DCPT}}(\theta) = -\sum_{i=1}^{N}\sum_{t=1}^{T}\log p_{\theta}\left(x^{(i)}_{t}\mid x^{(i)}_{<t}\right),$$
where θ denotes the model parameters. In what follows, we refer to the model obtained via DCPT as the "Medical Foundation Model," as it exhibits precise parsing capability and a fine understanding of medical texts.
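For illustration, the following is a minimal PyTorch sketch of this next-token prediction loss; the tensor shapes are assumptions about how ChiMed-CPT sequences would be batched.

```python
import torch.nn.functional as F

def next_token_loss(logits, input_ids):
    """Sum of -log p(x_t | x_<t) over a batch of token sequences.

    logits:    (batch, T, vocab) -- model outputs
    input_ids: (batch, T)        -- token ids of the same sequences
    """
    # Position t predicts token t+1: shift logits left and labels right.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = input_ids[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        reduction="sum",  # the objective sums over tokens and sequences
    )
```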
3.2 Supervised Fine-Tuning
While proficient in medical text comprehension, the medical foundation model can fall short in specific medical tasks due to a lack of task adherence. Frequent pre-training is also impractical due to resource constraints. In response, we conducted SFT on the model using a carefully curated dataset to improve its interpretive and responsive capabilities.
3.2.1 Instruction Dataset Construction
We constructed ChiMed-SFT (statistics shown in Table 2), which consists of general- and medical-domain single-turn and multi-turn instructions (i.e., prompts) along with their ground-truth responses. General-domain instructions aim to enhance the LLM's understanding and generation capabilities for instructions, while medical-domain instructions focus on answering medical questions, simulating doctor-patient consultations, and explaining medical queries. The responses for the general-domain instructions were primarily generated by ChatGPT, while medical-domain instructions and expected responses were both drawn from real doctor-patient diagnostic dialogues collected from medical websites. To ensure stability in supervised fine-tuning, we standardized instructions from diverse sources within ChiMed-SFT into a uniform format, as sketched below.
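The snippet below sketches what such standardization might look like. The field names and the 问/答 ("Q/A") template are hypothetical, as the paper does not publish its exact schema.

```python
def to_uniform_format(record):
    """Flatten one (possibly multi-turn) source record into a prompt/response pair."""
    prompt = ""
    for user_msg, assistant_msg in record.get("history", []):
        prompt += f"问:{user_msg}\n答:{assistant_msg}\n"  # earlier turns as context
    prompt += f"问:{record['instruction']}\n答:"          # current instruction
    return {"prompt": prompt, "response": record["response"]}
```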
Table 2: Statistics of ChiMed-SFT.
表2:ChiMed-SFT的统计数据。
3.2.2 Training Objective
Considering each prompt $x^{(i)}$ and its corresponding response $y^{(i)}$ from ChiMed-SFT, the loss function of the SFT stage can be defined as follows:

$$\mathcal{L}_{\mathrm{SFT}}(\theta) = -\sum_{i=1}^{N}\sum_{t=1}^{|y^{(i)}|}\log p_{\theta}\left(y^{(i)}_{t}\mid x^{(i)},\, y^{(i)}_{<t}\right),$$

where N denotes the total number of training instances and θ denotes the model parameters.
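A minimal sketch of this loss in PyTorch follows. Masking prompt tokens out of the loss, so that only response tokens are scored, is a common convention that we assume here; the paper does not state it explicitly.

```python
import torch.nn.functional as F

IGNORE_INDEX = -100  # positions with this label are skipped by cross_entropy

def sft_loss(logits, input_ids, prompt_lens):
    """SFT loss: score only the response tokens y, conditioned on prompt x.

    Each row of input_ids is [prompt tokens | response tokens], and
    prompt_lens[i] is the prompt length of sample i.
    """
    labels = input_ids.clone()
    for i, plen in enumerate(prompt_lens):
        labels[i, :plen] = IGNORE_INDEX  # exclude prompt tokens from the loss
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=IGNORE_INDEX,
        reduction="sum",
    )
```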
We term the fine-tuned model the "Medical Chat Model," as it is capable of executing specific medical tasks via instructions or dialogues while staying updated with the latest medical knowledge without significant additional resources.
3.3 Direct Preference Optimization
SFT encourages some responses but does not prevent undesirable ones, such as those containing missing or inaccurate medical information. A popular solution is RLHF, which uses reward models trained on response rankings to guide LLM training. However, it is complex and unstable, requiring extensive hyperparameter tuning.
To improve stability, we used DPO (Rafailov et al., 2023) to align the medical chat model's output with human preferences. DPO is simpler and more effective than RLHF, as it does not need explicit reward modeling or reinforcement learning.
3.3.1 Preference Dataset Construction
We built ChiMed-DPO (statistics shown in Table 3) from two publicly available preference datasets: (1) Zhongjing_rlhf (Yang et al., 2023c), which comprises 20,000 samples (10,000 in-distribution and 10,000 out-of-distribution) annotated by medical postgraduates and doctors, and (2) MedicalGPT (Xu, 2023), which contains 4,000 samples from Chinese-medical-dialogue-data, with preferred responses from doctors and rejected ones from the BenTsao (Wang et al., 2023a) model.
Each training sample in ChiMed-DPO is a triplet consisting of a prompt, a preferred response, and a rejected response.
3.3.2 Training Objective
To enhance model performance, our primary goals were to compute the log probabilities of preferred and rejected responses under the current model, and then to fine-tune the model parameters so as to raise the likelihood of preferred responses while lowering that of rejected ones. This optimization is guided by the following loss function (Rafailov et al., 2023):

$$\mathcal{L}_{\mathrm{DPO}}(\theta) = -\mathbb{E}_{(x,\, y_w,\, y_l)\sim\mathcal{D}}\left[\log\sigma\left(\beta\log\frac{\pi_{\theta}(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta\log\frac{\pi_{\theta}(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right],$$

where $y_w$ and $y_l$ denote the preferred and rejected responses to prompt $x$, $\pi_{\theta}$ is the policy being trained, $\pi_{\mathrm{ref}}$ is the frozen reference (SFT) model, $\sigma$ is the sigmoid function, and $\beta$ controls the strength of the preference margin.
Through this process, responses generated by Qilin-Med better align with human preferences while avoiding unfavored ones, thus improving the quality and safety of medical dialogues.
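The loss above translates directly into a few lines of PyTorch. In this sketch, the sequence log-probabilities are assumed to be precomputed, and β = 0.1 is an illustrative default; the paper does not report its value.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss (Rafailov et al., 2023) from precomputed log pi(y|x).

    Each argument is a tensor of per-sample sequence log-probabilities of
    the preferred (chosen) or rejected response under the trained policy
    or the frozen reference (SFT) model. beta=0.1 is an assumed default.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the implicit reward of preferred responses above rejected ones.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```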
4 Experiments
4.1 Evaluation Datasets, Metrics, and Baselines
4.1.1 Evaluation Datasets
We evaluated Qilin-Med on scenarios such as medical knowledge question answering and dialogue, using the following datasets:
1. CMExam (Liu et al., 2023), a standardized medical exam and practice question dataset. It contains over 60,000 multiple-choice questions and provides question explanations.
2. C-Eval (Huang et al., 2023), a comprehensive Chinese evaluation suite designed to assess the advanced knowledge and reasoning abilities of LLMs. It contains 13,948 multiple-choice exam questions across 52 diverse disciplines, including three medical sub-disciplines: Clinical Medicine, Basic Medicine, and Physician.
3. Huatuo-26M (Li et al., 2023a), a Chinese medical dataset that consists of over 26 million medical question-answer pairs, covering topics including diseases, symptoms, treatments, and drug information.
4.1.2 Metrics
We assessed model performance on multiple-choice questions using accuracy and weighted F1 score, metrics commonly employed in information retrieval and question-answering tasks. For medical dialogue tasks, BLEU (Papineni et al., 2002) and ROUGE (Lin and Hovy, 2003) were used to evaluate the discrepancy between model-generated responses and the ground truth.
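For reference, minimal implementations of the two dialogue metrics at the unigram level are sketched below. Character-level tokenization for Chinese is an assumption; the paper does not specify its tokenization scheme.

```python
from collections import Counter
import math

def bleu1(candidate, reference):
    """Unigram BLEU with brevity penalty, over character tokens."""
    cand, ref = list(candidate), list(reference)
    overlap = sum((Counter(cand) & Counter(ref)).values())  # clipped counts
    precision = overlap / max(len(cand), 1)
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * precision

def rouge1_f(candidate, reference):
    """ROUGE-1 F-measure over character tokens."""
    cand, ref = Counter(list(candidate)), Counter(list(reference))
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    p = overlap / sum(cand.values())
    r = overlap / sum(ref.values())
    return 2 * p * r / (p + r)
```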
4.1.3 Baselines
We used Baichuan-7B (Yang et al., 2023a) as the base model. Baichuan-7B is an open-source, large-scale pre-trained language model built on the Transformer architecture. It has 7 billion parameters, is trained on approximately 1.2 trillion tokens, and supports both Chinese and English with a context window length of 4096.
For baselines, we evaluated LLMs in both general scenarios and the medical domain across various tasks. For CMExam, we reported the performance of ChatGLM-6B, LLaMA (Touvron et al., 2023a), Vicuna (Chiang et al., 2023), Alpaca (Taori et al., 2023), Huatuo (Wang et al., 2023a), and DoctorGLM (Xiong et al., 2023) on both the prediction and reasoning tasks. For C-Eval, we evaluated the performance of ChatGLM (Du et al., 2022), Chinese-LLaMA2 (Cui et al., 2023), and Chinese-Alpaca (Cui et al., 2023) on the prediction task. Since CMExam has a standardized training set, we also reported the performance of LLaMA, Alpaca, and Vicuna on CMExam after SFT. Additionally, we evaluated models such as T5 (Raffel et al., 2020) and GPT2 (Radford et al., 2019) on the test set of Huatuo-26M. However, since Huatuo-26M is not fully open-sourced, we were unable to run SFT with this dataset.
4.2 Implementation Details
For DCPT, Baichuan-7B was trained on eight A100 80G GPUs with the following settings: a batch size of 1 per GPU, three epochs, a 2e-4 learning rate, a 0.05 warmup ratio, 0.01 weight decay, and a block size of 1024.
Table 3: Statistics of ChiMed-DPO.
Table 4: C-Eval results.
For SFT, A100 80G GPUs were used with a batch size of 64 per GPU. The Qilin-Med settings were: a 2e-5 learning rate, a 0.05 warmup ratio, 0.05 weight decay, and max_source_length and max_target_length both at 256. We accelerated training using DeepSpeed ZeRO-2 (Ren et al., 2021). We adopted the LoRA technique (Hu et al., 2021), a parameter-efficient fine-tuning method, with lora_rank set at 8, lora_alpha at 32, and lora_dropout at 0.05 for enhanced performance.
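With the Hugging Face peft library, the stated LoRA hyperparameters map onto a configuration like the following sketch. The `target_modules` value is an assumption (Baichuan-7B packs its attention projections into a single `W_pack` layer); the paper does not list the adapted modules.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType

# Hyperparameters from the paper's SFT stage; target_modules is an assumption.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,              # lora_rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["W_pack"],
)

model = AutoModelForCausalLM.from_pretrained(
    "baichuan-inc/Baichuan-7B", trust_remote_code=True
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```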
For DPO, four RTX 3090 GPUs were used with a batch size of 8 per GPU. The settings were: a 2e-5 learning rate, a 0.05 warmup ratio, 0.05 weight decay, and both max_source_length and max_target_length at 256. The LoRA technique was again applied, with lora_rank set at 8, lora_alpha at 16, and lora_dropout at 0.05.
For the CMExam assessment, we used OpenAI's GPT-3.5-turbo, GPT-4-0314, and models such as LLaMA, Alpaca, and Vicuna, each with 7B parameters. ChatGLM was tested using its 6B-parameter version and tuned with P-Tuning V2 (Liu et al., 2021), using a prefix token length of 128 and a learning rate of 0.02 for SFT. For other models, including LLaMA, Alpaca, Vicuna, and Huatuo, we used the LoRA technique (Hu et al., 2021) with a rank of 8, an alpha of 16, and a 0.05 dropout rate.
During the Huatuo-26M evaluation, we compared the performance of T5 and GPT2. Both models were set with maximum question and answer lengths of 256 and 512, respectively. We used the original 12-layer Chinese GPT2.
In the C-Eval phase, all models were evaluated using few-shot prompting. We opted for five shots and employed a greedy decoding strategy for answer prediction, as sketched below.
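The sketch below shows one way to implement 5-shot prompting with greedy decoding via the Hugging Face generate API; the prompt template and one-token answer extraction are illustrative assumptions.

```python
import torch

def predict_answer(model, tokenizer, exemplars, question):
    """5-shot prompting with greedy decoding.

    The exact C-Eval prompt template is not given in the paper; the
    "题目/答案" layout here is illustrative.
    """
    shots = "\n\n".join(
        f"题目:{ex['question']}\n答案:{ex['answer']}" for ex in exemplars[:5]
    )
    prompt = f"{shots}\n\n题目:{question}\n答案:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        # do_sample=False makes generate() pick the argmax token (greedy).
        out = model.generate(**inputs, max_new_tokens=1, do_sample=False)
    # Decode only the newly generated token (the predicted option letter).
    return tokenizer.decode(out[0, inputs["input_ids"].shape[1]:])
```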
4.3 Results and Discussion
On C-Eval. Table 4 summarizes the online evaluation results on the C-Eval benchmark. Among the five general LLMs compared in the upper part of the table, Baichuan-7B achieved the highest scores both on average and in the three medical subjects (namely Clinical Medicine, Physician, and Basic Medicine), outperforming the other models in instruction following as well as medical understanding. Specifically, Baichuan-7B achieved an accuracy of 45.1% in Basic Medicine, significantly surpassing ChatGLM-6B, which scored only 36.6%. After the Domain-specific Continued Pre-training and Supervised Fine-tuning stages, the model enhanced its proficiency in medical knowledge and comprehension, better equipping it to address questions within medical domains. Notably, our Qilin models show a great performance boost compared to Zhongjing-LLaMA. However, a decline in general language capabilities was noted, with average accuracy on C-Eval dropping from 42.8% to 40.1%. This decline suggests that while the model's medical expertise grew, its broader linguistic abilities suffered due to its increased focus on the medical field.
On CMExam. Table 5 displays the evaluation outcomes on the CMExam benchmark. ChatGLM and Vicuna performed well in explanation generation, reflecting enhanced comprehension of medical knowledge and dialogue skills. Of the two, Vicuna had a weaker answer prediction accuracy at 5%, while ChatGLM reached 26%. After fine-tuning with CMExam's training data (i.e., LLaMA-CMExam, Alpaca-CMExam, and Vicuna-CMExam), we noted marked improvements on both tasks. Following Domain-specific Continued Pre-training and Supervised Fine-tuning using our data, our proposed Qilin-Med-7B-CPT and Qilin-Med-7B-SFT outperformed the models fine-tuned on CMExam. This indicates our framework's efficacy in enriching LLMs with medical knowledge and bolstering their problem-solving capabilities in the medical domain.
Table 5: CMExam results.
Table 6: Huatuo-26M results.
On Huatuo-26M. Table 6 shows the evaluation results on Huatuo-26M. Among the three baseline methods (namely T5, GPT2, and Baichuan-7B), Baichuan-7B achieved the highest scores on most metrics, while T5 exhibited poor medical dialogue performance. Qilin-Med-7B-CPT outperformed Baichuan-7B in terms of BLEU-1 and ROUGE-1, proving that DCPT effectively injects medical-related knowledge into the model. Comparing Qilin-Med-7B-CPT and Qilin-Med-7B-SFT (10.63 vs. 12.69 in terms of BLEU-1), we see that SFT further strengthens the model's medical knowledge and instruction-compliance capabilities. Finally, Qilin-Med-7B-DPO achieved higher scores on all metrics than Qilin-Med-7B-SFT, showing that DPO efficiently helps align the medical chat model's output with human preferences and encourages the model to generate more preferred outputs.
4.4 Case Study
We examine the model outputs for the Medical Dialogue and Medical Question Answering tasks using examples from Huatuo-26M and CMExam. As shown in Figure 2, Baichuan-7B's response appears detached from the conversation's context, often leading to unnatural sentence transitions and run-on sentences in the Chinese generation. The incorporation of the CPT and SFT stages significantly refines Baichuan-7B's medical acumen, leading to more relevant and informed responses, a trend further evident in Figure 3. However, certain responses still exhibited run-on sentences, highlighting the need for further refinement. Notably, outputs from Qilin-Med-7B-DPO stand out, aligning closely with human expectations in both accuracy and context. This emphasizes the pivotal role and efficacy of the DPO stage in enhancing model outputs, while also addressing the aforementioned linguistic challenges.
5 Limitations
The introduction of Qilin-Med, trained on the ChiMed dataset, marks a significant advancement in medical LLMs. However, several limitations should be acknowledged. The ChiMed dataset, while comprehensive, primarily focuses on Chinese medical knowledge, potentially limiting the model's global applicability. The multi-stage training pipeline, including the DPO stage, might introduce biases based on the preferences of the human evaluators involved. Furthermore, while metrics like BLEU and ROUGE provide insights into the model's performance, they might not capture the complete picture, especially in nuanced medical scenarios. Future work should consider a more diverse set of evaluation metrics, including human evaluations, to ensure a holistic understanding of Qilin-Med's capabilities.
Figure 2: A case on the Huatuo-26M dialogue dataset.
6 Ethics and Societal Impacts
We did not recruit any human research participants for this study. To prepare the data, the information was anonymized in accordance with the regulations set by the Health Insurance Portability and Accountability Act (HIPAA), ensuring that protected health information was de-identified. The creation and utilization of the ChiMed dataset adhered to stringent ethical standards, ensuring the authenticity and accuracy of the medical knowledge it encapsulates. However, it is crucial to emphasize that Qilin-Med and ChiMed are intended for research and academic purposes. Commercial exploitation or any use that deviates from this primary objective is strictly discouraged. Researchers and practitioners are urged to respect these guidelines, ensuring the ethical and responsible use of Qilin-Med and the associated dataset. The development of Qilin-Med aims to enhance the capabilities of LLMs in the medical domain. However, it is paramount to understand that Qilin-Med is not a replacement for human medical expertise. It should not be used for direct patient diagnosis or as a standalone tool for medical decision-making. Any conclusions or insights derived from Qilin-Med should be contextualized, considering the specific focus of ChiMed and the inherent limitations of LLMs. The primary intent behind Qilin-Med is to aid research, and its use should be confined to this scope to prevent potential misuse.
7 Conclusion & Future Work
This study introduces a multi-stage training approach, a large-scale Chinese medicine dataset, ChiMed, and Qilin-Med, a cutting-edge Chinese medical language model. It demonstrates the potential of domain-specific training in healthcare, with implications for patient care, clinical decisions, and medical research. Qilin-Med's refined outputs, especially after the DPO stage, enable more accurate and context-aware medical dialogues, heralding a new era of AI-driven medical insights and interventions.
Figure 3: A case on the CMExam dataset.