How large AI models are reshaping crawler technology and improving data-processing efficiency. Key topics:
1. The transformation of, and challenges for, crawler engineering in the AI era
2. A comprehensive introduction to the open-source crawler framework Crawl4AI
3. Crawl4AI's core features, with a code walkthrough
In the era of large AI models, crawler engineers need to keep pace with the wave of intelligence, seize the opportunity these models present, and use AI to build new crawlers and define a new crawling paradigm. Handing data parsing, cleaning, and formatting to a large model can greatly improve processing efficiency.
This post introduces an open-source LLM crawler framework: Crawl4AI, a full-featured, high-performance web crawling tool that is particularly well suited to scenarios involving large volumes of web pages that need intelligent analysis.
Crawler engineers spend an enormous amount of time locating elements and parsing out data. Teams hire rooms full of XPath and CSS rule writers just to keep up with data processing across hundreds or thousands of web pages.
First, one point must be made clear: processing crawled data and obtaining page source are two different problems, and AI cannot fetch the source of every website for us. Why not? With today's heightened awareness of data security, many sites deploy anti-crawling and risk-control measures, and a model cannot directly fight those defenses.
So what can AI actually help with? Using its reasoning and agent capabilities, AI can drive automation tools to open simpler sites, and it can extract structured data such as tables and lists from page source, process it, and emit clean output.
Crawl4AI is an open-source web crawling and data extraction tool designed for large language models (LLMs), built to simplify scraping and extracting web data. Through asynchronous operation, efficient data processing, and intelligent extraction strategies, it gives developers a powerful, flexible tool that copes with the complexity and dynamism of modern web pages. Beyond traditional crawling features, Crawl4AI integrates AI techniques, which makes it perform well on large-scale data and dynamic content.
Crawl4AI's core goal is to provide an efficient, flexible, and easy-to-integrate crawling tool, especially for use alongside large language models and AI applications. Its main strengths, all demonstrated below, include asynchronous crawling, clean markdown output, LLM-driven structured extraction, JavaScript execution for dynamic pages, and robust error handling.
Crawl4AI's code is cleanly structured, and its modular design makes it easy to maintain and extend. The following sections walk through its main features and their code.
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # AsyncWebCrawler manages the browser lifecycle as an async context manager
    async with AsyncWebCrawler() as crawler:
        # arun() fetches and renders a single page, returning a result object
        result = await crawler.arun(url="http://zhaomeng.net")
        # the page content converted to clean markdown
        print(result.markdown)

asyncio.run(main())
Code walkthrough: AsyncWebCrawler is used as an async context manager, so the underlying browser is started and shut down automatically; arun() crawls one URL, and the returned result exposes the page content as clean markdown via result.markdown.
Crawl4AI offers several data extraction strategies, including traditional CSS/XPath-based approaches and LLM-based intelligent extraction (a CSS-based sketch follows the LLM walkthrough below). Here is an example using the LLM extraction strategy:
import asyncio
import os
from pydantic import BaseModel
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy

INSTRUCTION_TO_LLM = "Extract all rows from the main table as objects with 'CASNo','purity','MF','MW','SMILES','size', 'price' ,'stock' from the content."

# Pydantic model describing one row of the product table
class Product(BaseModel):
    CASNo: str
    size: str
    price: str
    stock: str
    purity: str
    MF: str
    MW: str
    SMILES: str

apikey = os.getenv("DEEPSEEK_API_KEY")  # set your DeepSeek API key in the environment

llm_strategy = LLMExtractionStrategy(
    provider="deepseek/deepseek-chat",        # LLM backend used for extraction
    api_token=apikey,
    schema=Product.model_json_schema(),       # target JSON schema
    extraction_type="schema",
    instruction=INSTRUCTION_TO_LLM,
    chunk_token_threshold=1000,               # split long pages into ~1000-token chunks
    overlap_rate=0.0,
    apply_chunking=True,
    input_format="markdown",                  # feed the LLM markdown, not raw HTML
    extra_args={"temperature": 0.0, "max_tokens": 800},
)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.chemshuttle.com/building-blocks/amino-acids/fmoc-r-3-amino-4-4-nitrophenyl-butyric-acid.html",
            extraction_strategy=llm_strategy,
        )
        print(result.extracted_content)

asyncio.run(main())
Walkthrough: the Pydantic Product model defines the target row schema; LLMExtractionStrategy splits the page's markdown into roughly 1000-token chunks, sends each chunk to deepseek-chat along with the instruction, and merges the model's JSON output into result.extracted_content.
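For pages with a stable layout, similar structured output can often be produced without an LLM call at all. A minimal sketch using Crawl4AI's schema-driven JsonCssExtractionStrategy; the selectors here are hypothetical placeholders, not taken from the target site:

from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# CSS-based extraction: no LLM call, so it is fast and free,
# but it breaks whenever the page layout changes.
css_schema = {
    "name": "products",
    "baseSelector": "table.pricing tr",  # hypothetical row selector
    "fields": [
        {"name": "size", "selector": "td.size", "type": "text"},
        {"name": "price", "selector": "td.price", "type": "text"},
        {"name": "stock", "selector": "td.stock", "type": "text"},
    ],
}
css_strategy = JsonCssExtractionStrategy(css_schema)
# pass css_strategy as extraction_strategy, exactly like llm_strategy above

The trade-off is the one described at the top of this post: CSS rules are cheap to run but expensive to maintain across hundreds of sites, while LLM extraction costs tokens but adapts to layout changes.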
Crawl4AI can handle content loaded dynamically via JavaScript. Here is an example that configures the crawler to execute JavaScript:
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://example.com",
        # scroll to the bottom to trigger lazy loading
        js_code="window.scrollTo(0, document.body.scrollHeight);",
        # block until this element exists before capturing the page
        wait_for="document.querySelector('.content-loaded')"
    )
    print(result.markdown)
Walkthrough: js_code runs inside the page after it loads (here, scrolling to the bottom so lazy-loaded content is fetched), and wait_for delays capture until the given condition is satisfied, so the markdown reflects the fully rendered page.
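In newer Crawl4AI releases these options can also be passed via CrawlerRunConfig, and wait_for accepts a "css:" prefix for plain selector waits. A minimal sketch of that style, reusing the same hypothetical .content-loaded marker:

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

config = CrawlerRunConfig(
    # scripts run in order after page load
    js_code=["window.scrollTo(0, document.body.scrollHeight);"],
    # wait until an element matching this CSS selector appears
    wait_for="css:.content-loaded",
)

async def fetch(url: str) -> str:
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)
        return result.markdown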
Crawl4AI implements a thorough error-handling mechanism, so crawls stay stable when the network is flaky or page structures change. Here is an error-handling example:
try:
    result = await crawler.arun(url="https://example.com")
except Exception as e:
    print(f"An error occurred: {e}")
Walkthrough: arun() raises an exception on unrecoverable failures; in addition, the returned result carries success and error_message fields, which the full example below checks for per-page errors.
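On top of this, transient network failures are usually worth retrying. A minimal sketch of a retry wrapper with exponential backoff; retries and base_delay are illustrative values of our own, not Crawl4AI settings:

import asyncio
from crawl4ai import AsyncWebCrawler

async def crawl_with_retry(crawler: AsyncWebCrawler, url: str,
                           retries: int = 3, base_delay: float = 1.0):
    """Retry a crawl a few times, doubling the wait between attempts."""
    for attempt in range(retries):
        try:
            result = await crawler.arun(url=url)
            if result.success:
                return result
        except Exception as e:
            print(f"Attempt {attempt + 1} failed: {e}")
        await asyncio.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise RuntimeError(f"Giving up on {url} after {retries} attempts")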
Background:
The task is to collect product listings from chemistry and biopharma industry sites, including each product's price, pack size, purity, and related fields.
Install Ollama: https://ollama.com/
Deploy deepseek-r1 locally:
ollama run deepseek-r1:14b
DeepSeek platform: https://platform.deepseek.com/usage
Register an api_key there.
pip install crawl4ai
playwright install
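Rather than hardcoding the key into the script, it is safer to read it from the environment. A minimal sketch; DEEPSEEK_API_KEY is our own variable name, not something Crawl4AI requires:

import os

# run `export DEEPSEEK_API_KEY=sk-...` in your shell first
apikey = os.getenv("DEEPSEEK_API_KEY")
if not apikey:
    raise RuntimeError("DEEPSEEK_API_KEY is not set")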
The analysis above is the same as for any ordinary crawling requirement; plugging in an LLM does not change how the requirements are analyzed, so proceed as you normally would.
# Pydantic model for one product row
class Product(BaseModel):
    CASNo: str
    size: str
    price: str
    stock: str
    purity: str
    MF: str
    MW: str
    SMILES: str

# LLM extraction strategy backed by deepseek-chat
llm_strategy = LLMExtractionStrategy(
    provider="deepseek/deepseek-chat",
    api_token="api-key",  # replace with your own DeepSeek API key
    schema=Product.model_json_schema(),
    extraction_type="schema",
    instruction=INSTRUCTION_TO_LLM,
    chunk_token_threshold=1000,
    overlap_rate=0.0,
    apply_chunking=True,
    input_format="markdown",
    extra_args={"temperature": 0.0, "max_tokens": 800},
)

# Per-run crawl settings
crawl_config = CrawlerRunConfig(
    extraction_strategy=llm_strategy,
    cache_mode=CacheMode.BYPASS,      # always fetch fresh, skip the cache
    process_iframes=False,
    remove_overlay_elements=True,     # strip popups/overlays before extraction
    exclude_external_links=True,
)

browser_cfg = BrowserConfig(headless=True, verbose=True)

async with AsyncWebCrawler(config=browser_cfg) as crawler:
    try:
        result = await crawler.arun(url=URL_TO_SCRAPE, config=crawl_config)
        if result.success:
            data = json.loads(result.extracted_content)
            print("Extracted items:", data)
            llm_strategy.show_usage()  # print LLM token usage for this run
        else:
            print("Error:", result.error_message)
    except Exception:
        traceback.print_exc()
Output:
Extracted items: [
  {'CASNo': '269398-78-9', 'size': '1g', 'price': '$150.00', 'stock': 'Typically in stock', 'purity': '95%', 'MF': 'C25H22N2O6', 'MW': '446.459', 'SMILES': 'OC(=O)C[C@@H](CC1=CC=C(C=C1)[N+]([O-])=O)NC(=O)OCC1C2=CC=CC=C2C2=C1C=CC=C2', 'error': False},
  {'CASNo': '269398-78-9', 'size': '5g', 'price': '$450.00', 'stock': 'Typically in stock', 'purity': '95%', 'MF': 'C25H22N2O6', 'MW': '446.459', 'SMILES': 'OC(=O)C[C@@H](CC1=CC=C(C=C1)[N+]([O-])=O)NC(=O)OCC1C2=CC=CC=C2C2=C1C=CC=C2', 'error': False},
  {'CASNo': '269398-78-9', 'size': '10g', 'price': 'Inquire', 'stock': 'Inquire', 'purity': '95%', 'MF': 'C25H22N2O6', 'MW': '446.459', 'SMILES': 'OC(=O)C[C@@H](CC1=CC=C(C=C1)[N+]([O-])=O)NC(=O)OCC1C2=CC=CC=C2C2=C1C=CC=C2', 'error': False},
  {'CASNo': '269398-78-9', 'size': '100g', 'price': '$6980.00', 'stock': 'Inquire', 'purity': '95%', 'MF': 'C25H22N2O6', 'MW': '446.459', 'SMILES': 'OC(=O)C[C@@H](CC1=CC=C(C=C1)[N+]([O-])=O)NC(=O)OCC1C2=CC=CC=C2C2=C1C=CC=C2', 'error': False}
]
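Once the rows come back as JSON like this, downstream handling is plain Python. A minimal sketch that loads the extracted list into a pandas DataFrame and writes it to CSV, assuming the result object from the run above; pandas is our choice here, not a Crawl4AI dependency:

import json
import pandas as pd

# result.extracted_content is a JSON array of product rows
rows = json.loads(result.extracted_content)
df = pd.DataFrame(rows)
# drop the per-chunk 'error' flag the strategy adds to each object
df = df.drop(columns=["error"], errors="ignore")
df.to_csv("products.csv", index=False)
print(df[["size", "price", "stock"]])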
import asyncio
import json
import os
import traceback
from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel

# URL_TO_SCRAPE = "https://nstchemicals.com/product/s-pro-xylane-cas-868156-46-1/"
# INSTRUCTION_TO_LLM = "Extract all rows from the main table as objects with 'specs', 'price' from the content."
URL_TO_SCRAPE = "https://www.chemshuttle.com/building-blocks/amino-acids/fmoc-r-3-amino-4-4-nitrophenyl-butyric-acid.html"
INSTRUCTION_TO_LLM = "Extract all rows from the main table as objects with 'CASNo','purity','MF','MW','SMILES','size', 'price' ,'stock' from the content."

# Pydantic model for one product row; doubles as the extraction schema
class Product(BaseModel):
    CASNo: str
    size: str
    price: str
    stock: str
    purity: str
    MF: str
    MW: str
    SMILES: str

async def main():
    llm_strategy = LLMExtractionStrategy(
        provider="deepseek/deepseek-chat",
        api_token=os.getenv("DEEPSEEK_API_KEY"),  # your DeepSeek API key
        schema=Product.model_json_schema(),
        extraction_type="schema",
        instruction=INSTRUCTION_TO_LLM,
        chunk_token_threshold=1000,
        overlap_rate=0.0,
        apply_chunking=True,
        input_format="markdown",
        extra_args={"temperature": 0.0, "max_tokens": 800},
    )
    crawl_config = CrawlerRunConfig(
        extraction_strategy=llm_strategy,
        cache_mode=CacheMode.BYPASS,
        process_iframes=False,
        remove_overlay_elements=True,
        exclude_external_links=True,
    )
    browser_cfg = BrowserConfig(headless=True, verbose=True)
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        try:
            result = await crawler.arun(url=URL_TO_SCRAPE, config=crawl_config)
            if result.success:
                data = json.loads(result.extracted_content)
                print("Extracted items:", data)
                llm_strategy.show_usage()  # report LLM token usage
            else:
                print("Error:", result.error_message)
        except Exception:
            traceback.print_exc()

if __name__ == "__main__":
    asyncio.run(main())
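To scale this from one product page to a whole catalog, AsyncWebCrawler also provides arun_many, which crawls a list of URLs concurrently. A minimal sketch, assuming browser_cfg and crawl_config are built exactly as in main() above and the URL list is illustrative:

async def crawl_catalog(urls):
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        # arun_many fans the URLs out concurrently, one result per URL
        results = await crawler.arun_many(urls=urls, config=crawl_config)
        for r in results:
            if r.success:
                print(r.url, "->", r.extracted_content)
            else:
                print(r.url, "failed:", r.error_message)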