
How to Distill Math Reasoning Data from DeepSeek-R1 with CAMEL: A Step-by-Step Guide

Published: 2025-02-05 05:30:57

Author: CAMEL AI


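DeepSeek R1 is a widely discussed reasoning model that has recently drawn broad attention for its strong mathematical reasoning and efficient logical processing. It handles both basic arithmetic and difficult math problems, giving developers strong computational support.

Combined with the CAMEL framework, we can use long chain-of-thought (Long CoT) reasoning to capture the detailed reasoning process behind each math problem and distill high-quality math reasoning data from DeepSeek R1. Finally, we upload the resulting dataset to Hugging Face so that the community can share and use it for further research on mathematical reasoning.

This tutorial walks you through using the CAMEL framework to extract DeepSeek R1's mathematical reasoning ability and turn it into a reusable dataset. Let's get started!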

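In this tutorial, you will explore:

CAMEL framework: a powerful multi-agent framework that can generate synthetic data and simulate multi-agent role-playing scenarios, supporting more advanced AI applications.
Data distillation pipeline: a systematic approach to extracting and refining high-quality reasoning datasets, including detailed thought processes, from models such as DeepSeek R1.
Hugging Face integration: a convenient workflow for uploading and sharing the distilled dataset on the Hugging Face platform.

Using our synthetic data generation tools, CAMEL-AI has built three high-quality datasets, now available on Hugging Face: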

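AMC AIME STaR dataset
Contains 4,000 challenging math problems with solutions, including the iterative improvement history of each solution to show how an answer is refined step by step.
View the dataset: https://huggingface.co/datasets/camel-ai/amc_aime_star
AMC AIME distilled dataset
Contains 4,000 challenging math problems with solutions, each with a clear step-by-step explanation.
View the dataset: https://huggingface.co/datasets/camel-ai/amc_aime_distilled
GSM8K distilled dataset
Contains 7,000 high-quality, linguistically diverse grade-school math word problems with solutions, each with a detailed step-by-step explanation.
View the dataset: https://huggingface.co/datasets/camel-ai/gsm8k_distilled

Whether you want to explore how AI solves complex problems or dig deeper into mathematical reasoning, these datasets are a great resource!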

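Step by step: generating a math reasoning dataset with the CAMEL data distillation pipeline

Preparation

1. Install dependencies

First, install the required Python libraries by running the following commands from the command line: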

pip install "git+https://github.com/camel-ai/camel.git@4210cb0849f3f13d6a46fefeb9e2c3e791c158cb#egg=camel-ai"
pip install datasets
pip install rouge

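2. Set the API keys

Set SILICONFLOW_API_KEY or DEEPSEEK_API_KEY. These keys are used to call the model that distills math reasoning data together with its thought process.

Tip: you can also choose another model provider, such as Fireworks or Together AI.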

from getpass import getpass
import os

SILICONFLOW_API_KEY = getpass('Enter your SILICONFLOW_API_KEY: ')
os.environ["SILICONFLOW_API_KEY"] = SILICONFLOW_API_KEY

DEEPSEEK_API_KEY = getpass('Enter your DEEPSEEK_API_KEY: ')
os.environ["DEEPSEEK_API_KEY"] = DEEPSEEK_API_KEY

# To make DeepSeek R1 respond with its thought process content, set the following environment variable
os.environ["GET_REASONING_CONTENT"]="True"
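Since the text asks for SILICONFLOW_API_KEY or DEEPSEEK_API_KEY, it can help to confirm which keys are actually set before any API call is made. The helper below is a minimal sketch for that purpose; the function name `available_providers` is hypothetical and not part of CAMEL.

```python
import os

def available_providers(env=os.environ):
    """Return the provider key names that are set to a non-empty value."""
    candidates = ("SILICONFLOW_API_KEY", "DEEPSEEK_API_KEY")
    return [name for name in candidates if env.get(name, "").strip()]

# Example with an explicit mapping instead of the real environment
print(available_providers({"SILICONFLOW_API_KEY": "sk-demo", "DEEPSEEK_API_KEY": ""}))
```

Calling `available_providers()` with no argument inspects the real environment; an empty result means neither key has been configured yet.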

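3. Download data from Hugging Face

We start by preparing raw math data from Hugging Face. Each item consists mainly of two parts: a problem and an answer. Below we use the GSM8K dataset as an example and walk through the steps in detail.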

# Set the number of problems to download from GSM8K in huggingface
NUMBER_OF_PROBLEMS=10

import json
import re
from pathlib import Path
import uuid
from datasets import load_dataset

def download_gsm8k_dataset():
    try:
        # Load the dataset using the datasets library
        dataset = load_dataset("openai/gsm8k", "main")

        # Get the items from train split
        data = dataset['train'].select(range(NUMBER_OF_PROBLEMS))

        # Convert to the desired format
        formatted_data = []
        for item in data:
            # Extract the final answer from the solution
            solution = item['answer']
            if solution:
                # GSM8K solutions typically end with "#### number"
                match = re.search(r'####\s*(\d+)', solution)
                if match:
                    number = match.group(1)
                    # Replace the "#### number" with "\boxed{number}"
                    solution = re.sub(
                        r'####\s*\d+', f'\\\\boxed{{{number}}}', solution
                    )

            formatted_item = {
                "id": str(uuid.uuid4()),  # GSM8K doesn't provide IDs
                "problem": item['question'],
                "type": "openai/gsm8k",  # All problems are from GSM8K
                "solution": solution,  # Use the modified solution with \boxed
            }
            formatted_data.append(formatted_item)

        # Save to a file
        output = formatted_data
        output_file = "downloaded_gsm8k_10.json"
        with open(output_file, "w") as f:
            json.dump(output, f, indent=2)

        print(f"Successfully downloaded and saved GSM8K dataset to {output_file}")
    except Exception as e:
        print(f"Error downloading GSM8K dataset: {e}")

if __name__ == "__main__":
    download_gsm8k_dataset()

Now that we have sample data in the target format, let's start distilling math reasoning data with detailed thought processes!
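To make the target format concrete, the sketch below applies the same "#### number" to "\boxed{number}" rewrite to a single solution string. The helper name `to_boxed` and the sample text are hypothetical, chosen only for illustration; like the download script, it only handles plain non-negative integers after the marker.

```python
import re

def to_boxed(solution: str) -> str:
    """Rewrite a '#### <integer>' marker as '\\boxed{<integer>}'."""
    match = re.search(r"####\s*(\d+)", solution)
    if not match:
        return solution  # leave solutions without the marker untouched
    boxed = "\\boxed{" + match.group(1) + "}"
    # A function replacement avoids backslash-escape processing in re.sub
    return re.sub(r"####\s*\d+", lambda _: boxed, solution)

sample = "She sells 9 eggs at $2 each, so 9 * 2 = 18.\n#### 18"  # hypothetical example
print(to_boxed(sample))
```

Passing a callable as the replacement is a design choice: it sidesteps the doubled backslashes that a plain replacement string needs.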

Distilling math reasoning data with thought processes (Long CoT data)

1. Import the required libraries

import nest_asyncio
nest_asyncio.apply()

import json
import os
import time

from camel.agents import ChatAgent
from camel.datagen import STaRPipeline
from camel.models import ModelFactory
from camel.types import ModelPlatformType, ModelType

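2. Set up the reasoning model and the evaluation model

Because DeepSeek's API service is currently unstable, we will call the DeepSeek R1 model through SiliconFlow. CAMEL's model manager automatically switches between models depending on whether requests succeed.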

# Set DeepSeek R1 served by siliconflow as reason model 1
reason_model_1 = ModelFactory.create(
    model_platform=ModelPlatformType.OPENAI_COMPATIBLE_MODEL,
    model_type="deepseek-ai/DeepSeek-R1",
    api_key=os.environ["SILICONFLOW_API_KEY"],
    url="https://api.siliconflow.cn/v1",
    model_config_dict={"max_tokens": 4096},  # Configure max_tokens carefully
)

# Set DeepSeek R1 served by deepseek cloud as reason model 2
reason_model_2 = ModelFactory.create(
    model_platform=ModelPlatformType.DEEPSEEK,
    model_type=ModelType.DEEPSEEK_REASONER,
)
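The failover idea described above, trying one provider and moving to the next when a request fails, can be illustrated in plain Python. This is only a conceptual sketch and not CAMEL's actual model manager; `call_with_fallback` and the two stand-in backends are hypothetical names.

```python
from typing import Callable, Sequence

def call_with_fallback(backends: Sequence[Callable[[str], str]], prompt: str) -> str:
    """Try each backend in order and return the first successful response."""
    errors = []
    for backend in backends:
        try:
            return backend(prompt)
        except Exception as exc:  # a real client would catch narrower errors
            errors.append(exc)
    raise RuntimeError(f"All {len(backends)} backends failed: {errors}")

# Hypothetical stand-ins for two providers serving the same model
def flaky_backend(prompt: str) -> str:
    raise TimeoutError("service unavailable")

def stable_backend(prompt: str) -> str:
    return f"answer to: {prompt}"

print(call_with_fallback([flaky_backend, stable_backend], "2 + 2 = ?"))
```

Here the first backend always fails, so the response comes from the second one; if every backend fails, the collected errors are raised together.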

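3. Run CAMEL's Self-Improve data generation module

Before running, pay attention to a few key parameters:

problems_path: path to the raw math problems.
output_path: path where the generated data is saved.
max_iterations: maximum number of iterations, which controls the depth of data generation.
rationalization: whether to include the correct answer as a reference when generating the reasoning process.

Notes:

Some optional configuration code is commented out; enable it as needed.
The generated data can be used directly for training or further analysis.

When the run finishes, you will find the generated math reasoning dataset at output_path!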

start_time = time.time()
problems_path = "downloaded_gsm8k_10.json"
output_path = "generated_data.json"

# Load problems from JSON file
with open(problems_path, 'r') as f:
    problems = json.load(f)

# Initialize agent
reason_agent_system_message = """Answer my question and give your
final answer within \\boxed{}."""
evaluate_agent_system_message = """You are a highly critical teacher who
evaluates the student's answers with a meticulous and demanding approach.
"""

# Set up reason agent
reason_agent = ChatAgent(
    system_message=reason_agent_system_message,
    model=[reason_model_1, reason_model_2], # add models to the list, You can also switch to other models
)

# # Set up evaluate agent(optional)
# evaluate_agent = ChatAgent(
#     system_message=evaluate_agent_system_message
# )

# # Initialize reward model (optional)
# reward_model = NemotronRewardModel(
#     model_type=ModelType.NVIDIA_NEMOTRON_340B_REWARD,
#     url="https://integrate.api.nvidia.com/v1",
#     api_key=os.environ.get("NVIDIA_API_KEY"),
# )

# # Set score thresholds for different dimensions (optional)
# score_threshold = {
#     "correctness": 1.0,
#     "clarity": 0.0,
#     "completeness": 0.0,
# }
# # Or use a single threshold for all dimensions:
# score_threshold = 0.9


# Create and run pipeline
pipeline = STaRPipeline(
    reason_agent=reason_agent,
    problems=problems,  # Pass problems list directly
    output_path=output_path,
    max_iterations=0,
    batch_size=100, # Size of batch to process the data (optional)
    # evaluate_agent=evaluate_agent, # To use evaluate agent(optional)
    # score_threshold=score_threshold, # Score thresholds for agent evaluation (optional)
    # reward_model=reward_model,  # To use a reward model (optional)
)

print("Start generation! May take some time, please wait..")

results = pipeline.generate(rationalization=False)

end_time = time.time()
execution_time = end_time - start_time

print(f"\nProcessed {len(results)} problems")
print(f"Results saved to: {output_path}")
print(f"Total execution time: {execution_time:.2f} seconds")
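To make the roles of max_iterations and score_threshold more concrete, here is a conceptual sketch of a generate, evaluate, improve loop. It is not the STaRPipeline implementation: the function `self_improve`, the stub agents, and the `improvement_history` key are all hypothetical and exist only to show how an iteration cap and a score threshold can interact.

```python
def self_improve(problem, generate, evaluate, improve,
                 max_iterations=0, score_threshold=0.9):
    """Generate a trace, then refine it until every score meets the threshold
    or max_iterations refinement rounds have been used."""
    trace = generate(problem)
    history = []
    for _ in range(max_iterations):
        scores = evaluate(problem, trace)
        history.append({"trace": trace, "scores": scores})
        if all(score >= score_threshold for score in scores.values()):
            break
        trace = improve(problem, trace, scores)
    return {"problem": problem, "final_trace": trace, "improvement_history": history}

# Hypothetical stubs standing in for the reasoning and evaluation agents
def generate(problem):
    return "draft"

def evaluate(problem, trace):
    return {"correctness": 1.0 if trace == "draft v2" else 0.5}

def improve(problem, trace, scores):
    return trace + " v2"

print(self_improve("1 + 1 = ?", generate, evaluate, improve, max_iterations=0)["final_trace"])
print(self_improve("1 + 1 = ?", generate, evaluate, improve, max_iterations=3)["final_trace"])
```

In this sketch, max_iterations=0 returns the first generated trace without any evaluation, which mirrors the pure distillation setting used in this tutorial, while a larger value lets the evaluator trigger refinements.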

View the generated CoT data with the following code:

with open('generated_data.json', 'r') as f:
    data = json.load(f)
    print(json.dumps(data, indent=2))
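Beyond printing the raw JSON, a quick sanity check is to pull the final boxed answer out of each trace. The sketch below assumes the output holds a `traces` list whose items store the reasoning as a string under `final_trace` (the keys the upload step reads); the sample data and the helper `last_boxed_answer` are hypothetical, and the pattern only handles boxed content without nested braces.

```python
import re

def last_boxed_answer(text: str):
    """Return the content of the last simple \\boxed{...} in text, or None."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

# Hypothetical data shaped like the 'traces' list in the generated file
sample = {
    "traces": [
        {"problem": "What is 9 * 2?", "final_trace": "9 * 2 = 18, so \\boxed{18}"},
        {"problem": "What is 3 + 4?", "final_trace": "The sum is seven."},
    ]
}

for trace in sample["traces"]:
    print(trace["problem"], "->", last_boxed_answer(trace["final_trace"]))
```

Traces for which the helper returns None did not follow the system prompt's boxed-answer instruction and may be worth inspecting by hand.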

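Uploading the data to Hugging Face

The steps are:

Load the generated data: load the generated dataset from a local file.
Convert to Hugging Face format: convert the data into the Hugging Face Dataset format.
Generate a dataset card: create a card with the dataset description, tags, and license information.
Log in to Hugging Face: log in to your Hugging Face account with an API token.
Upload the dataset: upload the dataset and its card to the Hugging Face platform.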

# Import necessary modules and classes
from camel.datahubs.huggingface import HuggingFaceDatasetManager  # Manages interactions with Hugging Face datasets
from camel.datahubs.models import Record  # Represents a single record in the dataset
from datetime import datetime  # Handles date and time operations
import json  # For reading JSON files

def load_star_output(file_path):
    r"""Load and parse the star output JSON file.

    Args:
        file_path (str): Path to the star_output.json file.

    Returns:
        list: List of traces from the JSON file.
    """
    with open(file_path, 'r') as f:
        data = json.load(f)
    return data['traces']

# Main function: Upload dataset to Hugging Face
def upload_to_huggingface(transformed_data, username, dataset_name=None):
    r"""Uploads transformed data to the Hugging Face dataset platform.

    Args:
        transformed_data (list): Transformed data, typically a list of dictionaries.
        username (str): Hugging Face username.
        dataset_name (str, optional): Custom dataset name.

    Returns:
        str: URL of the uploaded dataset.
    """
    # Initialize HuggingFaceDatasetManager to interact with Hugging Face datasets
    manager = HuggingFaceDatasetManager()

    # Generate or validate the dataset name
    dataset_name = generate_or_validate_dataset_name(username, dataset_name)

    # Create the dataset on Hugging Face and get the dataset URL
    dataset_url = create_dataset(manager, dataset_name)

    # Create a dataset card to add metadata
    create_dataset_card(manager, dataset_name, username)

    # Convert the transformed data into a list of Record objects
    records = create_records(transformed_data)

    # Add the Record objects to the dataset
    add_records_to_dataset(manager, dataset_name, records)

    # Return the dataset URL
    return dataset_url

# Generate or validate the dataset name
def generate_or_validate_dataset_name(username, dataset_name):
    r"""Generates a default dataset name or validates and formats a user-provided name.

    Args:
        username (str): Hugging Face username.
        dataset_name (str, optional): User-provided custom dataset name.

    Returns:
        str: Formatted dataset name.
    """
    if dataset_name is None:
        # If no dataset name is provided, generate a default name with the username and current date
        current_date = datetime.now().strftime("%Y%m%d")
        dataset_name = f"star_traces_{current_date}"

    # Format the dataset name to include the username
    return f"{username}/{dataset_name}"

# Create a dataset on Hugging Face
def create_dataset(manager, dataset_name):
    r"""Creates a new dataset on Hugging Face and returns the dataset URL.

    Args:
        manager (HuggingFaceDatasetManager): Instance of HuggingFaceDatasetManager.
        dataset_name (str): Name of the dataset.

    Returns:
        str: URL of the created dataset.
    """
    dataset_url = manager.create_dataset(dataset_name)
    return dataset_url

# Create a dataset card with metadata
def create_dataset_card(manager, dataset_name, username):
    r"""Creates a dataset card to add metadata

    Args:
        manager (HuggingFaceDatasetManager): Instance of HuggingFaceDatasetManager.
        dataset_name (str): Name of the dataset.
        username (str): Hugging Face username.
    """
    manager.create_dataset_card(
        dataset_name=dataset_name,
        description="A dataset containing mathematical problem-solving traces with step-by-step solutions and improvement history. Each record includes a mathematical problem, its final solution, and the iterative improvement process.",
        license="mit",  # Using lowercase 'mit' as required by HuggingFace
        tags=["math", "problem-solving", "step-by-step", "traces"],
        authors=[username],
        language=["en"],
        task_categories=["text-generation"],
        content="This dataset contains mathematical problem-solving traces generated using the CAMEL framework. Each entry includes:\n\n"
                "- A mathematical problem statement\n"
                "- A detailed step-by-step solution\n"
    )

# Convert transformed data into Record objects
def create_records(transformed_data):
    r"""Converts transformed data into a list of Record objects.

    Args:
        transformed_data (list): List of trace dictionaries from star_output.json.

    Returns:
        list: List of Record objects.
    """
    records = []
    for trace in transformed_data:
        record = Record(
            source_type=trace['type'],
            problem=trace['problem'],
            solution=trace['final_trace'],
        )
        records.append(record)
    return records

# Add Record objects to the dataset
def add_records_to_dataset(manager, dataset_name, records):
    r"""Adds a list of Record objects to the dataset.

    Args:
        manager (HuggingFaceDatasetManager): Instance of HuggingFaceDatasetManager.
        dataset_name (str): Name of the dataset.
        records (list): List of Record objects.
    """
    manager.add_records(dataset_name, records)
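Because `create_records` indexes each trace by `type`, `problem`, and `final_trace`, a trace missing any of these keys would raise a KeyError partway through the upload. A small pre-check such as the sketch below can separate usable traces first; `split_valid_traces` is a hypothetical helper, not part of CAMEL.

```python
REQUIRED_KEYS = ("type", "problem", "final_trace")

def split_valid_traces(traces):
    """Split traces into those that have every required key and those that do not."""
    valid, rejected = [], []
    for trace in traces:
        missing = [key for key in REQUIRED_KEYS if key not in trace]
        (rejected if missing else valid).append(trace)
    return valid, rejected

# Hypothetical traces: the second one lacks 'final_trace'
demo = [
    {"type": "openai/gsm8k", "problem": "What is 9 * 2?", "final_trace": "\\boxed{18}"},
    {"type": "openai/gsm8k", "problem": "What is 3 + 4?"},
]
valid, rejected = split_valid_traces(demo)
print(len(valid), "valid,", len(rejected), "rejected")
```

Only the `valid` list would then be passed on to the upload function.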

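Configure the Hugging Face access token and upload the dataset

Go to https://huggingface.co/settings/tokens/new?tokenType=write to get a Hugging Face API token, and make sure write access to repositories is enabled.

Next, create a new dataset on Hugging Face: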

# Get HuggingFace token and username
HUGGING_FACE_TOKEN = getpass('Enter your HUGGING_FACE_TOKEN: ')
os.environ["HUGGING_FACE_TOKEN"] = HUGGING_FACE_TOKEN
username = input("Enter your HuggingFace username: ")
dataset_name = input("Enter your dataset name:")

# Load the star output data
current_dir = os.getcwd()
star_output_path = os.path.join(current_dir, './generated_data.json')
traces = load_star_output(star_output_path)

# Upload the data to HuggingFace
dataset_url = upload_to_huggingface(traces, username, dataset_name)
print("\nDataset uploaded successfully!")
print(f"You can view your dataset at: {dataset_url}")

Preview of the final uploaded data
