StyleChat: Teaching Models to Converse in Styles They Have Never Seen

Published: 2024-05-05 07:26:05 · Author: 大语言模型论文跟踪

Release date: March 17, 2024

LLM Applications, Dialogue Systems, Natural Language Generation

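Abstract

Large language models (LLMs) have drawn wide attention for their strong performance on generation tasks, and stylized dialogue generation is a key step toward building intelligent, engaging dialogue assistants. Because LLMs are data-driven and inherit the biases of their data, however, they often underperform on specific tasks, stylized dialogue generation in particular, since supervised data for such tasks is extremely scarce. Several prompt-based methods have been proposed for specific tasks, but they still fall short in complex real-world scenarios that span a wide range of dialogue styles. To address this, the study leverages the generative power of LLMs to build StyleEval, a stylized dialogue dataset covering 38 styles that was carefully constructed and rigorously filtered by human annotators. On top of this dataset, the authors propose the StyleChat framework, which combines a recitation-augmented memory strategy with a multi-task style learning strategy to improve generalization. To validate StyleChat, they set up a comprehensive benchmark with both a generation task and a choice task, designed to test whether the trained model truly understands and has mastered different styles and user preferences. Experimental results show that StyleChat outperforms all baselines and helps overcome the limitations of LLMs in style transfer.

Overview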

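Stylized dialogue generation matters for intelligent dialogue applications in the LLM era. Progress on this task is often limited, however, by the lack of supervised data that ties a context and a response to a desired style. Parallel corpora are especially hard to collect for styles that are abstract, multi-layered, or dynamically derived. Existing work usually relies on pseudo-data built with back-translation, which yields low-quality data that fails to capture the variability and complexity of individual language styles. As a result, generated dialogue tends to be overly standardized, lacking personality and diversity. In particular, when LLMs encounter domain data or new styles unseen during pre-training, their ability to generalize to complex instructions drops significantly.

To address these challenges, the authors use LLMs, combined with statistical and linguistic perspectives, to build a large dataset called StyleEval. It contains stylized dialogues paired with style profiles, which helps in creating customized dialogue agents.

The dataset contains 38 styles and 24,728 dialogues. Starting from well-known styles collected across various genres, the authors use GPT-4 to generate a statistical-level style profile for each, consisting of a description and examples. They then extract a linguistic-level style profile from these examples based on linguistic knowledge. After initial preprocessing, annotators are invited to assess the quality of the dialogues. A further aim is to strengthen the LLM's style generalization without harming its overall capabilities.

Using an LLM directly, however, runs into problems when the model must generalize to new styles. To solve this, the authors propose the StyleChat framework, which introduces a chain of style thoughts: through a recitation-augmented memory strategy, the model generates the style profile before it responds.

This memory strategy has two stages: recite-then-respond during training and recall-then-respond during inference. The approach also encourages StyleChat to learn how to derive profiles for unseen styles, which improves generalization. In addition, multi-task style learning further strengthens this derivation ability, using a style transfer dataset to activate more of the model's style-related capabilities.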

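Main contributions:

• Built StyleEval, a large-scale, high-quality dataset for stylized dialogue generation, containing 24,728 parallel stylized dialogue turns across 38 styles.
• Proposed a recitation-augmented memory strategy for stylized dialogue generation, which encourages StyleChat to learn to derive profiles for unseen styles and so generalize better.
• Ran extensive experiments with various LLMs on StyleEval, showing that the StyleChat framework significantly outperforms all baselines on stylized dialogue generation.

Data Construction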

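1. Design principles: The authors first set two main goals: (1) efficiently aligning LLMs to specific styles, and (2) activating the style-related capabilities of LLMs for better generalization. To that end, they define style from statistical and linguistic perspectives and deliberately design the data distribution for two style-centric tasks: stylized dialogue generation and text style transfer.
2. Statistical-level style profile: Following conventional deep learning practice, the authors define style from a statistical, data-driven perspective. They use GPT-4 as a style agent to generate a comprehensive description and representative sentences for each style. To keep these examples relevant and accurate, a post-selection stage has human annotators pick the sentences that best capture the style's core characteristics. To define a style precisely yet efficiently, each style is limited to four example sentences.
3. Linguistic-level style profile: Beyond the statistical view, the authors also examine style from a linguistic perspective. They argue that representing a style only through a large collection of sentences is insufficient: it gives no explicit guidance on how to produce stylized sentences, and it is costly when generalizing to new styles. They therefore revisit the notion of style from a linguistic perspective and combine it with the statistical-level profile, aiming to define style more efficiently and accurately and to improve generalization. Concretely, they decompose style into four attributes (diction, syntax, figures of speech, and rhetorical purpose) and build the style profile on these attributes.
4. Multi-task dataset for activating style: This part outlines how StyleEval was developed. The dataset includes two style-centric tasks: stylized dialogue generation and text style transfer. Using the multi-level style profiles, the authors prompt GPT-4 with each style's profile to build pairs of context and stylized response, which serve as training data in a multi-turn dialogue setting. To maximize generalization potential, they curate a collection of 3,532 examples for the main styles and 400 examples for the 23 less common styles. For text style transfer, they use GPT-4 to obtain transfer instances between every pair of the four main styles, 600 pairs in total. Although the data volume is limited, Section 4.2 of the paper shows that multi-task learning improves generalization, especially for previously unseen styles.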
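The two-level style profile described in items 2 and 3 can be pictured as a small record. The sketch below is only an illustration: the class name `StyleProfile`, its field names, and the text layout produced by `render` are assumptions, not the authors' data format. The four-sentence cap and the four linguistic attributes come from the description above.

```python
from dataclasses import dataclass, field
from typing import List

# The article states that each style is limited to four example sentences.
MAX_EXAMPLES = 4


@dataclass
class StyleProfile:
    """Hypothetical container for one style's two-level profile."""

    name: str
    description: str  # statistical level: free-text description
    examples: List[str] = field(default_factory=list)  # statistical level
    diction: str = ""  # linguistic level: word choice
    syntax: str = ""  # linguistic level: sentence structure
    figures_of_speech: str = ""  # linguistic level: literary devices
    rhetorical_purpose: str = ""  # linguistic level: intent

    def __post_init__(self) -> None:
        if len(self.examples) > MAX_EXAMPLES:
            raise ValueError(f"at most {MAX_EXAMPLES} example sentences per style")

    def render(self) -> str:
        """Serialize the profile as plain text (this layout is an assumption)."""
        lines = [
            f"# Style: {self.name}",
            f"# Description: {self.description}",
            "# Examples:",
            *[f"- {sentence}" for sentence in self.examples],
            f"# Diction: {self.diction}",
            f"# Syntax: {self.syntax}",
            f"# Figures of Speech: {self.figures_of_speech}",
            f"# Rhetorical Purpose: {self.rhetorical_purpose}",
        ]
        return "\n".join(lines)
```

Keeping both levels in one record makes it easy to render the same profile either into a prompt or into a training target.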

[Figure: overview of the StyleChat framework]

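The figure above gives an overview of the StyleChat framework, which is designed to strengthen stylized dialogue generation in LLMs. It focuses on how the model can use and generate style profiles effectively during both training and inference.

During training, StyleChat guides the model through a two-stage process. First, the model is instructed to recite the style profile; this memorization step builds an understanding of the style, including its linguistic features and rhetorical purpose. Second, the model generates a response conditioned on the recited profile, which requires it to keep the style consistent while producing dialogue.

During inference, the model must recall or derive the style profile and then generate a response based on it. For seen styles, the model recalls the profile from its parametric memory; for unseen styles, it has to derive a profile using what it has learned and its reasoning ability. This process teaches the model to work through chains of style thoughts, so it generalizes and adapts better to new styles.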
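A minimal sketch of how recite-then-respond could be turned into a training pair, and how a generation could be split at inference time. The instruction wording and the `# Response in {Style} style` marker follow the prompts listed in the Prompt section of this article; the function names, the dictionary keys, and the exact whitespace are assumptions.

```python
from typing import Dict, Tuple

# Instruction text taken from the article's training prompt.
TRAIN_INSTRUCTION = (
    "Respond in {style} style. Let's think step by step. "
    "First, describe the style. Then, generate example sentences in this style. "
    "After that, observe the linguistic pattern of this style. "
    "Finally, output the stylized response."
)


def build_training_example(
    context: str, style: str, style_profile: str, response: str
) -> Dict[str, str]:
    """Training: the target recites the profile first, then gives the response."""
    prompt = f"# Context\n{context}\n\n# Task\n" + TRAIN_INSTRUCTION.format(style=style)
    target = f"{style_profile}\n\n# Response in {style} style\n{response}"
    return {"input": prompt, "output": target}


def split_generation(generated: str, style: str) -> Tuple[str, str]:
    """Inference: separate the recalled or derived profile from the final response."""
    marker = f"# Response in {style} style"
    profile, found, response = generated.partition(marker)
    if not found:
        # The model skipped the marker; treat the whole output as the response.
        return "", generated.strip()
    return profile.strip(), response.strip()
```

Because the profile is part of the target rather than the input, the same prompt works for seen styles (recall) and unseen styles (derive).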

Results

[Figure: StyleChat performance on multi-turn stylized dialogue generation]

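The figure above shows how StyleChat performs on multi-turn stylized dialogue generation across a range of styles, including Sci-Fi, Politeness, ArXiv, Shakespearean, Humor, Recipe, Holmes, Poems, Romance, Questionnaire, and Diary. These styles represent different dialogue scenarios and linguistic features: Sci-Fi may involve futuristic language and concepts, Politeness may involve formal and courteous expressions, and Shakespearean may involve archaic English and poetic rhetoric.

Performance is measured with three metrics: Relevance, Coherence, and Style. Relevance measures how well the response aligns with the context, Coherence assesses how well the context and response form a coherent body of information, and Style assesses how well the response reflects the required style.

In Figure 4, StyleChat scores highly on every evaluation dimension, indicating that it can generate dialogue that fits the context while keeping the specified style. This suggests that StyleChat understands and applies different dialogue styles well and stays consistent across turns, which is essential for building intelligent, engaging dialogue agents. On some styles StyleChat even outperforms baselines such as ChatGPT, which is further evidence of its effectiveness on stylized dialogue generation.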

Prompt

Prompt for ChatGPT to construct the style profile

# Task 

- Describe the given text style in several sentences.

# Style 

- {Style}
# Description

Prompt for generating examples in statistical-level style profile

# Task 

- Generate 4 most representative and diverse sentences in the given style.

# Style 

- Name: {Style} 

- Description: {Description}

# Output Format 

- Place each sentence on a new line without any numbers or additional formatting.
# Generation

Prompt for extracting linguistic-level style profile

# Task 

- Observe style attributes of given sentences from the following 4 perspectives.

- Diction: Explore the choice of words, their connotations, and levels of formality.

- Syntax: Examine the arrangement of words and phrases, sentence structures, and the use of punctuation.

- Figures of Speech: Identify and discuss any literary devices or figures of speech like metaphors, similes, personification, etc.

- Rhetorical Purpose: Analyze the intent behind the sentences, the persuasive techniques if any, and the overall message or purpose they aim to convey.

# Rules 

- DO NOT give each sentence an observation. Only output 1 observation in all.

- DO NOT use phrases or words in sentences as examples in observation. Only list observations without justifying.

# Output Format of Observations 

⟨Diction⟩ [Observations of Diction]

⟨Syntax⟩ [Observations of Syntax]

⟨Figures of Speech⟩ [Observations of Figures of Speech]

⟨Rhetorical Purpose⟩ [Observations of Rhetorical Purpose]

# Sentences 

{Examples} 
# Observations
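The output format requested by this prompt is regular enough to parse mechanically. The sketch below is one possible parser, assuming the model really does emit the four `⟨…⟩` tags shown above; the function name `parse_observations` and the choice to raise on a missing attribute are illustrative, not from the paper. Whitespace inside the angle brackets is tolerated.

```python
import re
from typing import Dict

ATTRIBUTES = ("Diction", "Syntax", "Figures of Speech", "Rhetorical Purpose")

# Match "⟨ Name ⟩ body", where the body runs until the next "⟨" or end of text.
_PATTERN = re.compile(
    r"⟨\s*(" + "|".join(re.escape(a) for a in ATTRIBUTES) + r")\s*⟩\s*(.*?)(?=⟨|\Z)",
    re.S,
)


def parse_observations(text: str) -> Dict[str, str]:
    """Split a model's observation output into the four linguistic attributes."""
    found = {name: body.strip() for name, body in _PATTERN.findall(text)}
    missing = [a for a in ATTRIBUTES if a not in found]
    if missing:
        raise ValueError(f"missing attributes: {missing}")
    return found
```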

Prompt for generating labels for Stylized Dialogue Generation

# Task

- Generate response in {Style} style.

# Style Description

- {Description}

# Observations from Linguistic Perspective

- Diction: …

- Syntax: …

- Figures of Speech: …

- Rhetorical Purpose: …

# Sample Sentences in {Style} style

{Examples}

# Rules

- Only output the stylized response without any explanation.

# Context

{Context}
# Response in {Style} style in one short sentence.

Prompt for generating labels for Text Style Transfer

# Task

- Style Transfer. Transfer the following sentence from {Style1} style to {Style2} style.

# Sentence

…
# Transferred Sentence

Prompt for training in Stylized Dialogue Generation

# Context

{Context}

# Task
Respond in {Style} style. Let’s think step by step. First, describe the style. Then, generate example sentences in this style. After that, observe the linguistic pattern of this style. Finally, output the stylized response.

Prompt for training in Text Style Transfer

Transfer the following sentence from {Style1} style into {Style2} style.

# Sentence

{Sentence}
# Transferred Sentence

Prompt for using GPT4 to evaluate responses

# Task

- You will be provided with one {Style} style response for a given context.

- Your task is to rate the stylized response in terms of relevance, coherence, and style.

- Please refer to the criteria while reviewing.

# Evaluation Criteria

Relevance (1-5): How well does the response align with the given context and reference?

- 1: Irrelevant. The response has no connection to the provided context or reference.

- 2: Slightly Relevant. The response somewhat touches upon the context but misses its core essence.

- 3: Moderately Relevant. The response connects to the context but may include unrelated or unnecessary information.

- 4: Mostly Relevant. The response mostly corresponds with the context, with a few unrelated points.

- 5: Highly Relevant. The response fully matches and adheres to the context and reference.

Coherence (1-5): How well do the context and response form a coherent body of information?

- 1: Incoherent. The response lacks structure and organization, making it hard to connect it to the context and form a coherent body of information.

- 2: Slightly Coherent. The response shows basic structure, but there are significant organizational flaws and alignment issues with the context.

- 3: Moderately Coherent. The response is structured and mostly organized, but there may be elements that don’t align well with the context or parts that lack clarity.

- 4: Mostly Coherent. The response is well-structured and organized with only minor deviations from the context or small clarity issues.

- 5: Highly Coherent. The response is excellently structured and organized, aligning seamlessly with the context to present a unified and clear body of information.

Style (1-5): How well does the response reflect {Style} style?

- 1: No Style. The response does not display any traces of the specified style.

- 2: Slight Style. The response marginally captures the style, but largely appears neutral or generic.

- 3: Moderate Style. The response showcases elements of the style, but there are portions that deviate from it.

- 4: Strong Style. The response is predominantly in line with the intended style, with occasional inconsistencies.

- 5: Pure Style. The response perfectly mirrors the intended style, capturing all its nuances and tones.

# Context

{Context}

# Response to Rate

{Response}
# Evaluation (scores ONLY, json format)
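Since the evaluator is asked for scores only, in JSON, the replies can be validated and averaged with a few lines. This is a sketch under assumptions: the key names (`relevance`, `coherence`, `style`, matched case-insensitively) are a guess at what the evaluator returns, and `parse_scores` and `average_scores` are hypothetical helpers. The 1 to 5 range comes from the criteria above.

```python
import json
from statistics import mean
from typing import Dict, List

METRICS = ("relevance", "coherence", "style")


def parse_scores(raw: str) -> Dict[str, int]:
    """Parse one JSON reply and check every metric is an integer from 1 to 5."""
    data = json.loads(raw)
    # Key casing is an assumption, so compare keys case-insensitively.
    lowered = {str(key).lower(): value for key, value in data.items()}
    scores = {}
    for metric in METRICS:
        if metric not in lowered:
            raise ValueError(f"missing score: {metric}")
        value = int(lowered[metric])
        if not 1 <= value <= 5:
            raise ValueError(f"{metric} score out of range: {value}")
        scores[metric] = value
    return scores


def average_scores(all_scores: List[Dict[str, int]]) -> Dict[str, float]:
    """Average each metric over a non-empty list of parsed replies."""
    return {m: mean(s[m] for s in all_scores) for m in METRICS}
```

Rejecting out-of-range or missing scores early keeps a malformed evaluator reply from silently skewing the averages.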

Prompt for Multiple Choice Questions

Multiple choice: Which response is suitable for the given context and is in {Style} style?

# Context:

{Context}

Choices:

(A) …

(B) …

(C) …

(D) …
Output the answer without explanation. Let’s think step by step. First, describe the style. Then, generate example sentences in this style. After that, observe the linguistic pattern of this style. Finally, output the best choice without explanation.
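For the choice task, the model's output has to be reduced to a single letter before accuracy can be computed. The sketch below assumes the model writes its final answer in the same `(A)` to `(D)` form as the choices, and it takes the last such letter because the chain of style thoughts comes before the answer. Both the convention and the helper names are assumptions, not the paper's evaluation code.

```python
import re
from typing import List, Optional

_CHOICE = re.compile(r"\(([A-D])\)")


def extract_choice(model_output: str) -> Optional[str]:
    """Return the last '(X)' letter in the output, or None if there is none."""
    matches = _CHOICE.findall(model_output)
    return matches[-1] if matches else None


def accuracy(outputs: List[str], gold: List[str]) -> float:
    """Fraction of outputs whose extracted letter equals the gold letter."""
    if len(outputs) != len(gold):
        raise ValueError("outputs and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(extract_choice(o) == g for o, g in zip(outputs, gold))
    return correct / len(gold)
```

An output with no recognizable letter counts as wrong rather than being dropped, so the denominator stays the full question set.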

Prompt for Input and Output in Ablation Study

w/o Profile input

# Context

{Context}

# Task
Respond in {Style} style.

w/o Profile output

# Response in {Style} style
{Response}

w/o Recite input

# Context

{Context}

{Style Profile}

# Task
Respond in {Style} style.

w/o Recite output

# Response in {Style} style
{Response}

w/ Recite input

# Context

{Context}

# Task
Respond in {Style} style. Let’s think step by step. First, describe the style. Then, generate example sentences in this style. After that, observe the linguistic pattern of this style. Finally, output the stylized response.

w/ Recite output

{Style Profile}

# Response in {Style} style
{Response}
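The three ablation settings differ only in where the style profile appears: nowhere, in the input, or in the output. The sketch below builds an input and output pair for each setting using the templates listed above. The function name, the variant labels as strings, and the exact whitespace are assumptions.

```python
from typing import Tuple

# Chain-of-style-thoughts suffix, taken from the "w/ Recite" input template.
COT_SUFFIX = (
    " Let's think step by step. First, describe the style. "
    "Then, generate example sentences in this style. "
    "After that, observe the linguistic pattern of this style. "
    "Finally, output the stylized response."
)


def build_ablation_pair(
    variant: str, context: str, style: str, profile: str, response: str
) -> Tuple[str, str]:
    """Return (input, output) text for one ablation variant."""
    task = f"Respond in {style} style."
    answer = f"# Response in {style} style\n{response}"
    if variant == "w/o Profile":
        # The profile is absent from both input and output.
        return f"# Context\n{context}\n\n# Task\n{task}", answer
    if variant == "w/o Recite":
        # The profile is given in the input; the model only responds.
        return f"# Context\n{context}\n\n{profile}\n\n# Task\n{task}", answer
    if variant == "w/ Recite":
        # The profile moves to the output; the model recites it before responding.
        return (
            f"# Context\n{context}\n\n# Task\n{task}{COT_SUFFIX}",
            f"{profile}\n\n{answer}",
        )
    raise ValueError(f"unknown variant: {variant}")
```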


[Arxiv](https://arxiv.org/abs/2403.11786)
