How to Distill Mathematical Reasoning Data from DeepSeek-R1 with CAMEL: A Step-by-Step Guide
Published: 2025-02-05 05:30:57 · Views: 1564 · Source: CAMEL AI
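Editor's recommendation

Explore the mathematical reasoning abilities of DeepSeek-R1 in depth with the help of the CAMEL framework.

Key points:
1. What sets the DeepSeek-R1 reasoning model apart, and where it applies
2. Detailed steps for extracting mathematical reasoning data with the CAMEL framework
3. The workflow for integrating with and sharing datasets on Hugging Face

Yang Fangxian (杨芳贤)
Founder of 53AI / Tencent Cloud Most Valuable Professional (TVP)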

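DeepSeek R1, a leading reasoning model, has drawn wide attention recently for its strong mathematical reasoning and efficient logical processing. It handles problems ranging from basic arithmetic to difficult math questions, which makes it a capable source of reasoning data for developers.

With the CAMEL framework, we can use long chain-of-thought (Long CoT) generation to capture the detailed reasoning behind each math problem and distill high-quality mathematical reasoning data from DeepSeek R1. Finally, we upload the dataset to Hugging Face so the community can share and reuse it for research on mathematical reasoning.


This tutorial walks through, step by step, how to use the CAMEL framework to distill DeepSeek R1's mathematical reasoning into a reusable dataset.


What this tutorial covers:
  • The CAMEL framework: a multi-agent framework that can generate synthetic data and simulate multi-agent role-playing scenarios.
  • The data distillation pipeline: a systematic way to extract and refine high-quality reasoning datasets, including the detailed thought process, from models such as DeepSeek R1.
  • Hugging Face integration: a simple workflow for uploading and sharing the distilled dataset on Hugging Face.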


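    Using our synthetic data generation tools, CAMEL-AI has built three high-quality datasets, now published on Hugging Face:
    • AMC AIME STaR dataset
      Contains 4000 challenging math problems with solutions, including the iterative improvement history of each solution, which shows how the answer was refined step by step.
      View the dataset: https://huggingface.co/datasets/camel-ai/amc_aime_star
    • AMC AIME distilled dataset
      Contains 4000 challenging math problems with solutions, each with a clear step-by-step explanation.
      View the dataset: https://huggingface.co/datasets/camel-ai/amc_aime_distilled
    • GSM8K distilled dataset
      Contains 7000 high-quality, linguistically diverse grade-school math word problems with solutions, each with a detailed step-by-step explanation.
      View the dataset: https://huggingface.co/datasets/camel-ai/gsm8k_distilled

Whether you want to explore how AI solves complex problems or dig deeper into mathematical reasoning, these datasets are a useful resource.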

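Steps for generating a mathematical reasoning dataset with the CAMEL data distillation pipeline
Preparation
1. Install dependencies
First, install the required Python libraries by running the following commands: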
pip install "git+https://github.com/camel-ai/camel.git@4210cb0849f3f13d6a46fefeb9e2c3e791c158cb#egg=camel-ai"
pip install datasets
pip install rouge

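2. Set the API keys

Set SILICONFLOW_API_KEY or DEEPSEEK_API_KEY. These keys are used to call the model and distill mathematical reasoning data together with its thought process.

⭐ Tip: you can also choose another model provider, such as Fireworks or Together AI.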

from getpass import getpass
import os
SILICONFLOW_API_KEY = getpass('Enter your SILICONFLOW_API_KEY: ')
os.environ["SILICONFLOW_API_KEY"] = SILICONFLOW_API_KEY
DEEPSEEK_API_KEY = getpass('Enter your DEEPSEEK_API_KEY: ')
os.environ["DEEPSEEK_API_KEY"] = DEEPSEEK_API_KEY
# To make DeepSeek R1 respond with its thought process content, set the following environment variable
os.environ["GET_REASONING_CONTENT"]="True"
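The prompts above ask for both keys every time the cell runs. As a minimal sketch, assuming you would rather reuse a key that is already exported in your shell, a small helper can prompt only when the variable is missing. The name `ensure_env_key` and its `prompt` parameter are hypothetical and not part of CAMEL.

```python
import os
from getpass import getpass
from typing import Callable


def ensure_env_key(name: str, prompt: Callable[[str], str] = getpass) -> str:
    """Return the value of env var `name`, prompting only if it is unset or empty.

    `ensure_env_key` is a hypothetical helper for this tutorial, not a CAMEL API.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        # Fall back to an interactive prompt and remember the answer
        value = prompt(f"Enter your {name}: ").strip()
        os.environ[name] = value
    return value
```

Calling `ensure_env_key("SILICONFLOW_API_KEY")` would then behave like the `getpass` lines above, but skip the prompt when the key is already present.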

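3. Download data from Hugging Face

We start by preparing the raw math data from Hugging Face. Each item consists mainly of two parts: the problem and the answer. The steps below use the GSM8K dataset as an example.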
# Number of problems to download from GSM8K on Hugging Face
NUMBER_OF_PROBLEMS = 10

import json
import re
import uuid

from datasets import load_dataset


def download_gsm8k_dataset():
    try:
        # Load the dataset using the datasets library
        dataset = load_dataset("openai/gsm8k", "main")

        # Take the first NUMBER_OF_PROBLEMS items from the train split
        data = dataset['train'].select(range(NUMBER_OF_PROBLEMS))

        # Convert to the desired format
        formatted_data = []
        for item in data:
            # Extract the final answer from the solution
            solution = item['answer']
            if solution:
                # GSM8K solutions typically end with "#### number"
                match = re.search(r'####\s*(\d+)', solution)
                if match:
                    number = match.group(1)
                    # Replace the "#### number" with "\boxed{number}"
                    solution = re.sub(
                        r'####\s*\d+', f'\\\\boxed{{{number}}}', solution
                    )

            formatted_item = {
                "id": str(uuid.uuid4()),  # GSM8K doesn't provide IDs
                "problem": item['question'],
                "type": "openai/gsm8k",  # All problems are from GSM8K
                "solution": solution,  # Use the modified solution with \boxed
            }
            formatted_data.append(formatted_item)

        # Save to a file
        output = formatted_data
        output_file = "downloaded_gsm8k_10.json"
        with open(output_file, "w") as f:
            json.dump(output, f, indent=2)

        print(f"Successfully downloaded and saved GSM8K dataset to {output_file}")
    except Exception as e:
        print(f"Error downloading GSM8K dataset: {e}")


if __name__ == "__main__":
    download_gsm8k_dataset()
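
The conversion from GSM8K's `#### number` marker to `\boxed{number}` is the one non-obvious step in the function above, so here it is in isolation. This is a sketch under the assumption that final answers may also carry a sign, thousands separators, or a decimal part; `to_boxed` and the sample string are made up for illustration.

```python
import re

# Matches the GSM8K final-answer marker, e.g. "#### 72", "#### -5" or "#### 1,000".
FINAL_ANSWER = re.compile(r"####\s*(-?[\d,]+(?:\.\d+)?)")


def to_boxed(solution: str) -> str:
    """Replace every '#### <number>' marker with '\\boxed{<number>}'."""
    # A function replacement sidesteps the backslash-escaping rules
    # that apply to string replacement templates in re.sub.
    return FINAL_ANSWER.sub(lambda m: "\\boxed{" + m.group(1) + "}", solution)


sample = "48 + 24 = 72\n#### 72"
print(to_boxed(sample))
```

Using a callable as the replacement keeps the backslash in `\boxed` literal, which is why the quadruple backslash from the f-string version is not needed here.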

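With some sample data in the target format in hand, we can start distilling mathematical reasoning data that includes the detailed thought process.

Distilling mathematical reasoning data with the thought process (Long CoT data)

1. Import the required libraries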

import nest_asyncio
nest_asyncio.apply()

import json
import os
import time

from camel.agents import ChatAgent
from camel.datagen import STaRPipeline
from camel.models import ModelFactory
from camel.types import ModelPlatformType, ModelType

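2. Set up the reasoning model and the evaluation model

Because DeepSeek's own API service is currently unstable, we call the DeepSeek R1 model through SiliconFlow. CAMEL's model manager automatically switches between the configured models depending on whether requests succeed.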

# Set DeepSeek R1 served by siliconflow as reason model 1
reason_model_1 = ModelFactory.create(
    model_platform=ModelPlatformType.OPENAI_COMPATIBLE_MODEL,
    model_type="deepseek-ai/DeepSeek-R1",
    api_key=os.environ["SILICONFLOW_API_KEY"],
    url="https://api.siliconflow.cn/v1",
    model_config_dict={"max_tokens": 4096},  # Configure max_tokens carefully
)

# Set DeepSeek R1 served by deepseek cloud as reason model 2
reason_model_2 = ModelFactory.create(
    model_platform=ModelPlatformType.DEEPSEEK,
    model_type=ModelType.DEEPSEEK_REASONER,
)
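
The automatic switching between models happens inside CAMEL. Purely to illustrate the idea, and not as CAMEL's actual implementation, a fallback over several backends can be sketched in plain Python. `call_with_fallback` and the two stand-in backends are hypothetical.

```python
from typing import Callable, Sequence


def call_with_fallback(backends: Sequence[Callable[[str], str]], prompt: str) -> str:
    """Try each backend in order and return the first successful response."""
    errors = []
    for backend in backends:
        try:
            return backend(prompt)
        except Exception as exc:  # a real client would catch narrower error types
            errors.append(f"{getattr(backend, '__name__', backend)}: {exc}")
    raise RuntimeError("All backends failed: " + "; ".join(errors))


def flaky_backend(prompt: str) -> str:
    # Stand-in for a provider that is temporarily unavailable
    raise TimeoutError("service unavailable")


def stable_backend(prompt: str) -> str:
    # Stand-in for a provider that answers normally
    return f"answer to: {prompt}"


print(call_with_fallback([flaky_backend, stable_backend], "1 + 1 = ?"))
```

Listing the less reliable provider second, as the tutorial does with the DeepSeek cloud model, means it is only tried when the first one fails in this scheme.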

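3. Run CAMEL's self-improve data generation module

Before running, note a few key parameters:

  • problems_path: path to the original math problems.

  • output_path: path where the generated data is saved.

  • max_iterations: maximum number of iterations, which controls the depth of data generation.

  • rationalization: whether to feed the correct answer in as a reference when generating the reasoning.


Notes:

  • Some optional settings are commented out in the code; enable them as needed.
  • The generated data can be used directly for training or further analysis.

When the run finishes, the generated mathematical reasoning dataset will be at output_path.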
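
Before starting a long run, it can help to confirm that the input file has the shape produced by the download step. This is a minimal sketch, assuming each problem is a dict with the `id`, `problem`, `type`, and `solution` keys written earlier; `find_invalid_problems` is a hypothetical helper, not part of CAMEL, and the pipeline itself may require fewer fields.

```python
from typing import Any, Dict, List, Tuple

# Keys written by the download step earlier in this tutorial
REQUIRED_KEYS = ("id", "problem", "type", "solution")


def find_invalid_problems(problems: List[Dict[str, Any]]) -> List[Tuple[int, List[str]]]:
    """Return (index, missing_or_empty_keys) for every malformed problem."""
    invalid = []
    for index, problem in enumerate(problems):
        missing = [key for key in REQUIRED_KEYS if not problem.get(key)]
        if missing:
            invalid.append((index, missing))
    return invalid


problems_demo = [
    {"id": "a1", "problem": "2 + 2 = ?", "type": "openai/gsm8k", "solution": "\\boxed{4}"},
    {"id": "a2", "problem": "", "type": "openai/gsm8k"},
]
print(find_invalid_problems(problems_demo))  # [(1, ['problem', 'solution'])]
```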

start_time = time.time()
problems_path = "downloaded_gsm8k_10.json"
output_path = "generated_data.json"

# Load problems from JSON file
with open(problems_path, 'r') as f:
    problems = json.load(f)

# Initialize agent
reason_agent_system_message = """Answer my question and give your
final answer within \\boxed{}."""

evaluate_agent_system_message = """You are a highly critical teacher who
evaluates the student's answers with a meticulous and demanding approach.
"""


# Set up reason agent
reason_agent = ChatAgent(
    system_message=reason_agent_system_message,
    model=[reason_model_1, reason_model_2],  # Add models to the list; you can also switch to other models
)

# # Set up evaluate agent(optional)
# evaluate_agent = ChatAgent(
#     system_message=evaluate_agent_system_message
# )

# # Initialize reward model (optional)
# reward_model = NemotronRewardModel(
#     model_type=ModelType.NVIDIA_NEMOTRON_340B_REWARD,
#     url="https://integrate.api.nvidia.com/v1",
#     api_key=os.environ.get("NVIDIA_API_KEY"),
# )

# # Set score thresholds for different dimensions (optional)
# score_threshold = {
#     "correctness": 1.0,
#     "clarity": 0.0,
#     "completeness": 0.0,
# }
# # Or use a single threshold for all dimensions:
# score_threshold = 0.9


# Create and run pipeline
pipeline = STaRPipeline(
    reason_agent=reason_agent,
    problems=problems,  # Pass problems list directly
    output_path=output_path,
    max_iterations=0,
    batch_size=100,  # Size of batch to process the data (optional)
    # evaluate_agent=evaluate_agent, # To use evaluate agent(optional)
    # score_threshold=score_threshold, # Score thresholds for agent evaluation (optional)
    # reward_model=reward_model,  # To use a reward model (optional)
)

print("Start generation! May take some time, please wait..")

results = pipeline.generate(rationalization=False)

end_time = time.time()
execution_time = end_time - start_time

print(f"\nProcessed {len(results)} problems")
print(f"Results saved to: {output_path}")
print(f"Total execution time: {execution_time:.2f} seconds")

Use the following code to view the generated CoT data:

with open('generated_data.json', 'r') as f:
    data = json.load(f)
    print(json.dumps(data, indent=2))
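
Printing the whole file works for ten problems but is hard to scan. Here is a sketch for pulling the last `\boxed{...}` answer out of each trace, assuming the output file has a top-level `traces` list whose items carry `problem` and `final_trace` strings (the fields read by the upload code later in this tutorial). `last_boxed` is hypothetical and handles only non-nested braces; the demo data is invented.

```python
import re
from typing import Optional

# Matches \boxed{...} where the contents contain no nested braces
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def last_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in `text`, or None."""
    matches = BOXED.findall(text)
    return matches[-1].strip() if matches else None


demo = {
    "traces": [
        {"problem": "2 + 2 = ?", "final_trace": "First guess \\boxed{5}. Corrected: \\boxed{4}"},
        {"problem": "3 * 3 = ?", "final_trace": "The answer is nine."},
    ]
}
for trace in demo["traces"]:
    print(trace["problem"], "->", last_boxed(trace["final_trace"]))
```

Taking the last match rather than the first matters for long reasoning traces, where intermediate guesses may also be boxed before the final answer.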

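Upload the data to Hugging Face

The steps are:

  • Load the generated data: load the generated dataset from the local file.
  • Convert to the Hugging Face format: convert the data to the Hugging Face Dataset format.
  • Generate the dataset card: create a card with the dataset description, tags, and license information.
  • Log in to Hugging Face: authenticate with your API token.
  • Upload the dataset: upload the dataset and its card to Hugging Face.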
# Import necessary modules and classes
from camel.datahubs.huggingface import HuggingFaceDatasetManager  # Manages interactions with Hugging Face datasets
from camel.datahubs.models import Record  # Represents a single record in the dataset
from datetime import datetime  # Handles date and time operations
import json  # For reading JSON files

def load_star_output(file_path):
    r"""Load and parse the star output JSON file.

    Args:
        file_path (str): Path to the star_output.json file.

    Returns:
        list: List of traces from the JSON file.
    """

    with open(file_path, 'r') as f:
        data = json.load(f)
    return data['traces']

# Main function: Upload dataset to Hugging Face
def upload_to_huggingface(transformed_data, username, dataset_name=None):
    r"""Uploads transformed data to the Hugging Face dataset platform.

    Args:
        transformed_data (list): Transformed data, typically a list of dictionaries.
        username (str): Hugging Face username.
        dataset_name (str, optional): Custom dataset name.

    Returns:
        str: URL of the uploaded dataset.
    """

    # Initialize HuggingFaceDatasetManager to interact with Hugging Face datasets
    manager = HuggingFaceDatasetManager()

    # Generate or validate the dataset name
    dataset_name = generate_or_validate_dataset_name(username, dataset_name)

    # Create the dataset on Hugging Face and get the dataset URL
    dataset_url = create_dataset(manager, dataset_name)

    # Create a dataset card to add metadata
    create_dataset_card(manager, dataset_name, username)

    # Convert the transformed data into a list of Record objects
    records = create_records(transformed_data)

    # Add the Record objects to the dataset
    add_records_to_dataset(manager, dataset_name, records)

    # Return the dataset URL
    return dataset_url

# Generate or validate the dataset name
def generate_or_validate_dataset_name(username, dataset_name):
    r"""Generates a default dataset name or validates and formats a user-provided name.

    Args:
        username (str): Hugging Face username.
        dataset_name (str, optional): User-provided custom dataset name.

    Returns:
        str: Formatted dataset name.
    """

    if dataset_name is None:
        # If no dataset name is provided, generate a default name with the username and current date
        current_date = datetime.now().strftime("%Y%m%d")
        dataset_name = f"star_traces_{current_date}"

    # Format the dataset name to include the username
    return f"{username}/{dataset_name}"

# Create a dataset on Hugging Face
def create_dataset(manager, dataset_name):
    r"""Creates a new dataset on Hugging Face and returns the dataset URL.

    Args:
        manager (HuggingFaceDatasetManager): Instance of HuggingFaceDatasetManager.
        dataset_name (str): Name of the dataset.

    Returns:
        str: URL of the created dataset.
    """

    dataset_url = manager.create_dataset(dataset_name)
    return dataset_url

# Create a dataset card with metadata
def create_dataset_card(manager, dataset_name, username):
    r"""Creates a dataset card to add metadata

    Args:
        manager (HuggingFaceDatasetManager): Instance of HuggingFaceDatasetManager.
        dataset_name (str): Name of the dataset.
        username (str): Hugging Face username.
    """

    manager.create_dataset_card(
        dataset_name=dataset_name,
        description="A dataset containing mathematical problem-solving traces with step-by-step solutions and improvement history. Each record includes a mathematical problem, its final solution, and the iterative improvement process.",
        license="mit",  # Using lowercase 'mit' as required by HuggingFace
        tags=["math", "problem-solving", "step-by-step", "traces"],
        authors=[username],
        language=["en"],
        task_categories=["text-generation"],
        content="This dataset contains mathematical problem-solving traces generated using the CAMEL framework. Each entry includes:\n\n"
                "- A mathematical problem statement\n"
                "- A detailed step-by-step solution\n"
    )

# Convert transformed data into Record objects
def create_records(transformed_data):
    r"""Converts transformed data into a list of Record objects.

    Args:
        transformed_data (list): List of trace dictionaries from star_output.json.

    Returns:
        list: List of Record objects.
    """

    records = []
    for trace in transformed_data:
        record = Record(
            source_type=trace['type'],
            problem=trace['problem'],
            solution=trace['final_trace'],
        )
        records.append(record)
    return records

# Add Record objects to the dataset
def add_records_to_dataset(manager, dataset_name, records):
    r"""Adds a list of Record objects to the dataset.

    Args:
        manager (HuggingFaceDatasetManager): Instance of HuggingFaceDatasetManager.
        dataset_name (str): Name of the dataset.
        records (list): List of Record objects.
    """

    manager.add_records(dataset_name, records)
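
Before anything is pushed, the field mapping used in `create_records` can be checked offline. This is a sketch that builds plain dicts with the same three fields instead of `Record` objects, so it runs without CAMEL or network access; `preview_record_fields` and the demo trace are hypothetical.

```python
from typing import Any, Dict, List

# Record field -> key in each trace, mirroring create_records above
FIELD_MAP = {"source_type": "type", "problem": "problem", "solution": "final_trace"}


def preview_record_fields(traces: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Build plain dicts with the same field mapping as create_records.

    Raises KeyError naming the trace index if a required key is missing.
    """
    rows = []
    for index, trace in enumerate(traces):
        missing = [src for src in FIELD_MAP.values() if src not in trace]
        if missing:
            raise KeyError(f"trace {index} is missing keys: {missing}")
        rows.append({dst: trace[src] for dst, src in FIELD_MAP.items()})
    return rows


demo_traces = [
    {"type": "openai/gsm8k", "problem": "2 + 2 = ?", "final_trace": "\\boxed{4}"},
]
print(preview_record_fields(demo_traces))
```

Failing early with the index of the offending trace is cheaper than discovering a missing `final_trace` halfway through an upload.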

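Configure the Hugging Face access token and upload the dataset

Go to https://huggingface.co/settings/tokens/new?tokenType=write to create a Hugging Face API token, and make sure it has write access to repositories.
Then create a new dataset on Hugging Face: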
# Get HuggingFace token and username
HUGGING_FACE_TOKEN = getpass('Enter your HUGGING_FACE_TOKEN: ')
os.environ["HUGGING_FACE_TOKEN"] = HUGGING_FACE_TOKEN
username = input("Enter your HuggingFace username: ")
dataset_name = input("Enter your dataset name:")

# Load the star output data
current_dir = os.getcwd()
star_output_path = os.path.join(current_dir, './generated_data.json')
traces = load_star_output(star_output_path)

# Upload the data to HuggingFace
dataset_url = upload_to_huggingface(traces, username, dataset_name)
print(f"\nDataset uploaded successfully!")
print(f"You can view your dataset at: {dataset_url}")
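[Figure: preview of the final uploaded dataset]

Highlights of this tutorial:

This tutorial showed how to use CAMEL's synthetic data generation module with the DeepSeek R1 model to generate mathematical reasoning data, and how to upload the resulting dataset to Hugging Face.

  • High-quality synthetic data generation: the pipeline distills mathematical reasoning datasets that include detailed step-by-step solutions.
  • Public datasets: the AMC AIME STaR, AMC AIME distilled, and GSM8K distilled datasets have been released, covering diverse problems and reasoning solutions across a range of math topics.
  • Hugging Face integration: datasets can be shared and accessed on Hugging Face, which supports collaborative research and development.
  • Customizable and scalable: the pipeline supports parallel processing, customizable agents, and reward models, so it can be used for large-scale data generation.

We hope you will try the CAMEL framework and explore more applications with it. (The link to CAMEL's GitHub repository is at the end of this article; a star is always appreciated.)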


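CAMEL WeChat group


To join the CAMEL WeChat group, add the official CAMEL WeChat account CamelAIOrg. A staff member will accept your friend request and invite you to the group.



Join CAMEL Community


www.camel-ai.org


github.com/camel-ai/camel


https://discord.com/invite/CNcNpquyDc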



