我要投稿

AI爬虫利器：轻松转换网页数据，赋能AI模型训练与应用的强大助手！

发布日期：2024-06-02 08:08:03 浏览次数： 2679

作者：AI技能

微信搜一搜，关注“AI技能”

项目概述

背景：随着信息时代的快速发展，从网站上高效抓取和处理数据的需求急剧增加。Firecrawl 由 Mendable.ai 和 Firecrawl 社区共同开发，专门用于将网站内容转换为 LLM（大型语言模型）可直接使用的 Markdown 格式，解决了数据采集和转换的效率问题。
目标：主要目标是提供一个简单、高效的方式，使开发者和研究人员能够快速从网站抓取和转换数据，支持Markdown和结构化数据转换，以便进行数据分析和机器学习应用。

主要功能：

全面网站爬取：即使无需站点地图，也能自动爬取所有可访问子页面。
高级数据采集：即使网站使用 JavaScript 渲染内容，Firecrawl 也能有效采集数据。
格式化输出：返回干净、格式良好的 Markdown 文档，直接适用于 LLM 应用。
并行处理：抓取过程支持并行处理，能够快速返回结果。
智能缓存：会缓存内容，除非有新内容出现，否则无需等待全面搜索。
LLM 提取：利用大型语言模型快速完成网页数据的智能提取。

安装与配置

前提条件

Python 3.7或更高版本
Node.js（可选，用于Node SDK）
API密钥：需在Firecrawl网站注册以获取

安装步骤

克隆仓库：执行命令 git clone https://github.com/mendable/firecrawl.git
安装依赖：Python 用户运行 pip install firecrawl-py，Node.js 用户运行 npm install @mendable/firecrawl-js
运行项目：根据语言环境设置适当的环境变量或直接在代码中配置API密钥，然后开始调用API或运行SDK函数。

配置说明

需要在项目中配置API密钥，可以将其设置为环境变量 FIRECRAWL_API_KEY 或在使用SDK时传递。

使用指南

爬取（Crawling）

使用爬取功能，可以对指定的URL及其所有可访问的子页面进行爬取。此操作提交一个爬取任务，并返回一个任务ID，用于检查爬取状态。

启动爬取任务

 
curl -X POST https://api.firecrawl.dev/v0/crawl \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_API_KEY' \
-d '{
"url": "https://mendable.ai"
}'

成功调用后返回的 jobId：

 
{ "jobId": "1234-5678-9101" }

检查爬取任务状态

 
curl -X GET https://api.firecrawl.dev/v0/crawl/status/1234-5678-9101 \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_API_KEY'

返回爬取任务的状态：

 
{
"status": "completed",
"current": 22,
"total": 22,
"data": [
{
"content": "Raw Content ",
"markdown": "# Markdown Content",
"provider": "web-scraper",
"metadata": {
"title": "Mendable | AI for CX and Sales",
"description": "AI for CX and Sales",
"language": null,
"sourceURL": "https://www.mendable.ai/"
}
}
]
}

抓取（Scraping）

使用抓取功能，可以获取单个URL的内容。

抓取单个URL

 
curl -X POST https://api.firecrawl.dev/v0/scrape \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_API_KEY' \
-d '{
"url": "https://mendable.ai"
}'

返回抓取的数据：

 
{
"success": true,
"data": {
"content": "Raw Content ",
"markdown": "# Markdown Content",
"provider": "web-scraper",
"metadata": {
"title": "Mendable | AI for CX and Sales",
"description": "AI for CX and Sales",
"language": null,
"sourceURL": "https://www.mendable.ai/"
}
}
}

搜索（Search）

使用搜索功能，可以根据查询词进行网页搜索，获取最相关的结果，并抓取每个页面返回Markdown格式的内容。

进行搜索

 
curl -X POST https://api.firecrawl.dev/v0/search \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_API_KEY' \
-d '{
"query": "firecrawl",
"pageOptions": {
"fetchPageContent": true // false for a fast serp api
}
}'

返回搜索和抓取结果：

 
{
"success": true,
"data": [
{
"url": "https://mendable.ai",
"markdown": "# Markdown Content",
"provider": "web-scraper",
"metadata": {
"title": "Mendable | AI for CX and Sales",
"description": "AI for CX and Sales",
"language": null,
"sourceURL": "https://www.mendable.ai/"
}
}
]
}

智能提取（Intelligent Extraction）

利用LLM技术从抓取的页面中提取结构化数据。

从URL提取结构化数据

 
curl -X POST https://api.firecrawl.dev/v0/scrape \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_API_KEY' \
-d '{
"url": "https://www.mendable.ai/",
"extractorOptions": {
"mode": "llm-extraction",
"extractionPrompt": "Based on the information on the page, extract the information from the schema.",
"extractionSchema": {
"type": "object",
"properties": {
"company_mission": {"type": "string"},
"supports_sso": {"type": "boolean"},
"is_open_source": {"type":

"boolean"}, "is_in_yc": {"type": "boolean"} }, "required": ["company_mission", "supports_sso", "is_open_source", "is_in_yc"] } } }'

 
返回提取的结构化数据：
```json
{
"success": true,
"data": {
"content": "Raw Content",
"metadata": {
"title": "Mendable",
"description": "Mendable allows you to easily build AI chat applications. Ingest, customize, then deploy with one line of code anywhere you want. Brought to you by SideGuide",
"robots": "follow, index",
"ogTitle": "Mendable",
"ogDescription": "Mendable allows you to easily build AI chat applications. Ingest, customize, then deploy with one line of code anywhere you want. Brought to you by SideGuide",
"ogUrl": "https://mendable.ai/",
"ogImage": "https://mendable.ai/mendable_new_og1.png",
"ogLocaleAlternate": [],
"ogSiteName": "Mendable",
"sourceURL": "https://mendable.ai/"
},
"llm_extraction": {
"company_mission": "Train a secure AI on your technical resources that answers customer and employee questions so your team doesn't have to",
"supports_sso": true,
"is_open_source": false,
"is_in_yc": true
}
}
}