
CrewAI Agents + Crawl4AI: An Innovative Approach to Automated Data Crawling and Parsing

Published: 2024-10-08 08:52:14

Author: 大模型之路


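In today's data-driven world, acquiring and analyzing information is critical for businesses and research institutions. Traditional data collection methods, however, tend to be slow and labor-intensive. Combining Crawl4AI with CrewAI agents offers an automated approach to crawling and parsing data that addresses this problem.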

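Crawl4AI is a free, open-source web crawling and data extraction tool designed for AI agents. It automates tedious scraping tasks, which makes data acquisition considerably more efficient. CrewAI is a framework that coordinates multiple AI agents to complete complex data processing tasks (see "Multi-Agent Architecture: CrewAI Explained"). Used together, Crawl4AI and CrewAI agents can form an efficient data processing system that automates the whole chain from scraping to analysis.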

2. Core Technology of Crawl4AI

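2.1 Open Source and Free

Crawl4AI's biggest advantage is that it is open source and free to use. Any developer can use its features without paying license fees, which lowers the barrier to data scraping and processing and lets more companies and individuals build their own data collection systems.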

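2.2 AI-Driven

Crawl4AI uses AI to identify and parse web page elements automatically, which improves the accuracy and efficiency of data extraction. Traditional crawlers usually require hand-written extraction rules, whereas Crawl4AI can interpret the page structure and extract the required information without them (for example, by having an LLM extract fields according to an instruction and a schema, as the code in section 4.5 does). This reduces human error and makes complex pages much easier to handle.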

2.3 Structured Output

Crawl4AI converts the extracted data into structured formats such as JSON and Markdown, which makes the data easier to read and simplifies downstream analysis and mining.
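
As a rough illustration of what "structured output" means here, the sketch below renders one extracted record both as JSON and as a Markdown table using only the Python standard library. The sample record and the helper `to_markdown_table` are hypothetical and are not part of Crawl4AI's API.

```python
import json

# Hypothetical record, shaped like the pricing example in section 4.5.
records = [
    {
        "llm_model_name": "GPT-4",
        "input_fee": "US$10.00 / 1M tokens",
        "output_fee": "US$30.00 / 1M tokens",
    },
]


def to_markdown_table(rows):
    """Render a list of flat dicts as a Markdown table (illustrative helper)."""
    if not rows:
        return ""
    headers = list(rows[0].keys())
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(row.get(h, "")) for h in headers) + " |")
    return "\n".join(lines)


as_json = json.dumps(records, indent=2)      # machine-friendly form
as_markdown = to_markdown_table(records)     # human-friendly form

if __name__ == "__main__":
    print(as_json)
    print(as_markdown)
```

The JSON form suits programmatic processing, while the Markdown form is convenient for feeding readable text to an LLM or a report.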

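2.4 Additional Features

Crawl4AI also supports a range of advanced features, including scroll-triggered loading, multi-URL crawling, media tag extraction, metadata extraction, and screenshot capture. These features let Crawl4AI cope with complex pages and cover a wide variety of scraping needs.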

3. Roles and Functions of CrewAI Agents

CrewAI coordinates multiple agents to complete complex data processing tasks. In the Crawl4AI + CrewAI solution, three agents each handle one stage of the pipeline.

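3.1 Web Scraper

The Web Scraper is the agent responsible for data collection. It uses Crawl4AI to scrape data from the specified websites and passes the raw data to the downstream agents. To keep the data complete and accurate, this agent is configured as an expert in web scraping and structured data extraction.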

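3.2 Data Cleaner

The Data Cleaner cleans and organizes the raw data. Scraped data often contains redundant, incorrect, or inconsistent entries, so the Data Cleaner applies a set of rules to remove the unusable records and convert the rest into a uniform format. This step is essential for the analysis that follows.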

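3.3 Data Analyzer

The Data Analyzer is the agent responsible for analysis. It receives the cleaned data and applies statistical and machine learning methods to extract useful information and insights. It can help users uncover hidden patterns, trends, and correlations in the data, which in turn supports decision-making.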
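
The hand-off between the three agents can be pictured as a sequential pipeline. The following is a plain-Python sketch of that flow only. It does not use CrewAI, and the `Stage` class, `run_pipeline`, and the stand-in functions are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Stage:
    """One agent role in the pipeline (illustrative, not a CrewAI class)."""
    role: str
    run: Callable[[Any], Any]


def run_pipeline(stages: List[Stage], initial_input: Any) -> Any:
    """Pass each stage's output to the next stage, in order."""
    data = initial_input
    for stage in stages:
        data = stage.run(data)
    return data


# Stand-in behaviours; a real system would call Crawl4AI and an LLM here.
def scrape(urls):
    return [{"url": u, "raw": f"  content from {u}  "} for u in urls]


def clean(rows):
    return [{"url": r["url"], "text": r["raw"].strip()} for r in rows]


def analyze(rows):
    return {"pages": len(rows), "total_chars": sum(len(r["text"]) for r in rows)}


stages = [
    Stage("Web Scraper", scrape),
    Stage("Data Cleaner", clean),
    Stage("Data Analyzer", analyze),
]

report = run_pipeline(stages, ["https://example.com/a", "https://example.com/b"])

if __name__ == "__main__":
    print(report)
```

Each stage only needs to agree with its neighbour on the shape of the data it hands over, which is the same contract the agents' `expected_output` descriptions express in the configuration in section 4.5.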

4. Implementation Steps

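4.1 Installation and Configuration

First, install the dependencies that Crawl4AI and CrewAI agents require. This typically means a Python environment, the pip package manager, and the relevant libraries. Next, configure the CrewAI agents to suit your needs: define each agent's role, its tasks, and how the agents interact.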
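
Before running anything, it can help to confirm that the packages are importable and the API key is set. The sketch below is one way to do that with the standard library. The package list mirrors the imports used in section 4.5 and `OPENAI_API_KEY` is the variable that code reads; the `check_environment` helper itself is a hypothetical name.

```python
import importlib.util
import os

# Import names taken from the code in section 4.5.
REQUIRED_PACKAGES = ("crawl4ai", "pydantic", "praisonai_tools")
REQUIRED_ENV_VARS = ("OPENAI_API_KEY",)


def check_environment(packages=REQUIRED_PACKAGES, env_vars=REQUIRED_ENV_VARS, environ=None):
    """Return (missing_packages, missing_env_vars) without importing anything."""
    environ = os.environ if environ is None else environ
    missing_packages = [p for p in packages if importlib.util.find_spec(p) is None]
    missing_env = [v for v in env_vars if not environ.get(v)]
    return missing_packages, missing_env


if __name__ == "__main__":
    missing_packages, missing_env = check_environment()
    if missing_packages:
        print("Missing packages:", ", ".join(missing_packages))
    if missing_env:
        print("Missing environment variables:", ", ".join(missing_env))
    if not missing_packages and not missing_env:
        print("Environment looks ready.")
```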

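4.2 Writing the Crawler Script

Write the crawler script using the API that Crawl4AI provides. This involves creating a WebCrawler instance, loading the necessary models, specifying the target URL, and setting an extraction strategy. The extraction strategy can use an LLM (large language model) to define the extraction rules and output template, which improves the accuracy and efficiency of scraping.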

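4.3 Data Cleaning and Organization

Pass the raw scraped data to the Data Cleaner agent. This step covers removing duplicates, correcting erroneous values, and converting data formats. The Data Cleaner agent can perform these tasks automatically from predefined rules or a model, which reduces the need for manual intervention.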
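
A rule-based version of this cleaning step might look like the sketch below. It assumes records shaped like the `ModelFee` schema in section 4.5 and fee strings in the form `US$10.00 / 1M tokens`. The helper names, the output field names, and the "Model-X" row are hypothetical.

```python
import re


def normalize_fee(fee: str):
    """Extract the numeric dollar amount from a string like 'US$10.00 / 1M tokens'."""
    match = re.search(r"\$\s*([0-9]+(?:\.[0-9]+)?)", fee)
    return float(match.group(1)) if match else None


def clean_records(raw_records):
    """Drop duplicates and unparseable rows; convert fee strings to floats."""
    seen = set()
    cleaned = []
    for rec in raw_records:
        name = rec.get("llm_model_name", "").strip()
        input_fee = normalize_fee(rec.get("input_fee", ""))
        output_fee = normalize_fee(rec.get("output_fee", ""))
        if not name or input_fee is None or output_fee is None:
            continue  # skip incomplete or malformed rows
        key = name.lower()
        if key in seen:
            continue  # skip duplicates (case-insensitive on model name)
        seen.add(key)
        cleaned.append({
            "llm_model_name": name,
            "input_fee_usd": input_fee,
            "output_fee_usd": output_fee,
        })
    return cleaned


raw = [
    {"llm_model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"},
    {"llm_model_name": " gpt-4 ", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"},
    {"llm_model_name": "Model-X", "input_fee": "n/a", "output_fee": "US$1.50 / 1M tokens"},
]
cleaned = clean_records(raw)

if __name__ == "__main__":
    print(cleaned)
```

In the agent setup described here, an LLM-backed Data Cleaner would do this work from a natural-language task description; deterministic rules like these are an alternative when the input format is predictable.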

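4.4 Data Analysis and Insights

Finally, pass the cleaned data to the Data Analyzer agent for deeper analysis. The Data Analyzer agent can apply a variety of statistical and machine learning methods to surface information hidden in the data, such as trends, patterns, and correlations. These results can support business decisions, for example when refining a product or planning a marketing strategy.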
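
As a minimal sketch of the kind of summary this step could produce, the code below computes a few simple statistics over cleaned pricing records. The records are made-up placeholder values, not real prices, and the field names and `summarize` function are assumptions for illustration.

```python
from statistics import mean

# Placeholder data; the numbers are invented for the example.
cleaned = [
    {"llm_model_name": "model-a", "input_fee_usd": 10.0, "output_fee_usd": 30.0},
    {"llm_model_name": "model-b", "input_fee_usd": 3.0, "output_fee_usd": 15.0},
    {"llm_model_name": "model-c", "input_fee_usd": 0.5, "output_fee_usd": 1.5},
]


def summarize(records):
    """Compute simple descriptive statistics over pricing records."""
    cheapest = min(records, key=lambda r: r["input_fee_usd"])
    return {
        "model_count": len(records),
        "mean_input_fee_usd": mean(r["input_fee_usd"] for r in records),
        "cheapest_input_model": cheapest["llm_model_name"],
        # How much more output tokens cost than input tokens, per model.
        "output_to_input_ratio": {
            r["llm_model_name"]: r["output_fee_usd"] / r["input_fee_usd"]
            for r in records
        },
    }


summary = summarize(cleaned)

if __name__ == "__main__":
    print(summary)
```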

4.5 Core Code

# tools.py
import os

from crawl4ai import WebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field
from praisonai_tools import BaseTool


class ModelFee(BaseModel):
    llm_model_name: str = Field(..., description="Name of the model.")
    input_fee: str = Field(..., description="Fee for input token for the model.")
    output_fee: str = Field(..., description="Fee for output token for the model.")


class ModelFeeTool(BaseTool):
    name: str = "ModelFeeTool"
    description: str = "Extracts model fees for input and output tokens from the given pricing page."

    def _run(self, url: str):
        # Create the crawler and load the models it needs.
        crawler = WebCrawler()
        crawler.warmup()

        result = crawler.run(
            url=url,
            word_count_threshold=1,
            extraction_strategy=LLMExtractionStrategy(
                provider="openai/gpt-4o",
                api_token=os.getenv('OPENAI_API_KEY'),
                schema=ModelFee.schema(),
                extraction_type="schema",
                # The example object uses the same keys as the ModelFee schema.
                instruction=(
                    "From the crawled content, extract all mentioned model names along with "
                    "their fees for input and output tokens. Do not miss any models in the "
                    "entire content. One extracted model JSON format should look like this: "
                    '{"llm_model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", '
                    '"output_fee": "US$30.00 / 1M tokens"}.'
                ),
            ),
            bypass_cache=True,
        )
        return result.extracted_content


if __name__ == "__main__":
    # Test the ModelFeeTool
    tool = ModelFeeTool()
    url = "https://www.openai.com/pricing"
    result = tool.run(url)
    print(result)
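
The tool above returns `result.extracted_content`. Assuming that value is a JSON string holding a list of objects that follow the `ModelFee` schema (an assumption about the output shape, not something the code above guarantees), it could be parsed into typed records along these lines. `ParsedFee` and `parse_extracted_content` are hypothetical names, and a dataclass is used instead of Pydantic only to keep the sketch dependency-free.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class ParsedFee:
    llm_model_name: str
    input_fee: str
    output_fee: str


def parse_extracted_content(extracted_content: str) -> List[ParsedFee]:
    """Parse a JSON string into ParsedFee records, skipping malformed items."""
    items = json.loads(extracted_content)
    if isinstance(items, dict):
        items = [items]  # tolerate a single object instead of a list
    fees = []
    for item in items:
        try:
            fees.append(ParsedFee(
                llm_model_name=str(item["llm_model_name"]),
                input_fee=str(item["input_fee"]),
                output_fee=str(item["output_fee"]),
            ))
        except (KeyError, TypeError):
            continue  # item does not match the expected schema
    return fees


# Hand-written sample input; the second item is deliberately incomplete.
sample = (
    '[{"llm_model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", '
    '"output_fee": "US$30.00 / 1M tokens"}, {"llm_model_name": "incomplete"}]'
)
fees = parse_extracted_content(sample)

if __name__ == "__main__":
    for fee in fees:
        print(fee)
```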

framework: crewai
topic: extract model pricing from websites
roles:
  web_scraper:
    backstory: An expert in web scraping with a deep understanding of extracting structured
      data from online sources. https://openai.com/api/pricing/ https://www.anthropic.com/pricing
      https://cohere.com/pricing
    goal: Gather model pricing data from various websites
    role: Web Scraper
    tasks:
      scrape_model_pricing:
        description: Scrape model pricing information from the provided list of websites.
        expected_output: Raw HTML or JSON containing model pricing data.
    tools:
    - 'ModelFeeTool'
  data_cleaner:
    backstory: Specialist in data cleaning, ensuring that all collected data is accurate
      and properly formatted.
    goal: Clean and organize the scraped pricing data
    role: Data Cleaner
    tasks:
      clean_pricing_data:
        description: Process the raw scraped data to remove any duplicates and inconsistencies,
          and convert it into a structured format.
        expected_output: Cleaned and organized JSON or CSV file with model pricing
          data.
    tools:
    - ''
  data_analyzer:
    backstory: Data analysis expert focused on deriving actionable insights from structured
      data.
    goal: Analyze the cleaned pricing data to extract insights
    role: Data Analyzer
    tasks:
      analyze_pricing_data:
        description: Analyze the cleaned data to extract trends, patterns, and insights
          on model pricing.
        expected_output: Detailed report summarizing model pricing trends and insights.
    tools:
    - ''

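Combining Crawl4AI with CrewAI agents provides a powerful solution for automated data crawling and parsing. By pairing Crawl4AI's AI-driven extraction with CrewAI's agent system, companies and developers can build complex data processing pipelines with relatively little effort and get more value out of their data.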