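Install ScrapeGraphAI with pip:
pip install scrapegraphai
Install Playwright, which is used for JavaScript-based scraping:
playwright install
It is recommended to install the library in a virtual environment to avoid conflicts with other libraries.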
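SmartScraperGraph: a single-page scraper that only needs a user prompt and an input source.
SearchGraph: a multi-page scraper that extracts information from the top n search results of a search engine.
SpeechGraph: a single-page scraper that extracts information from a website and generates an audio file.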
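SmartScraperGraph with a local model:
Make sure Ollama is installed and download the model with the ollama pull command.
The example code shows how to create a SmartScraperGraph instance and run it to get a list of projects with their descriptions.
SearchGraph with mixed models:
Groq serves as the LLM and Ollama as the embedding model.
The example code shows how to create a SearchGraph instance and run it to get a list of traditional recipes from Chioggia.
SpeechGraph with OpenAI:
Only the OpenAI API key and the model name need to be passed.
The example code shows how to create a SpeechGraph instance and run it to generate an audio file summarizing the projects.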
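The output of SmartScraperGraph is a list of projects with their descriptions.
The output of SearchGraph is a list of recipes.
The output of SpeechGraph is an audio file with a summary of the projects on the page.
An OpenAI API key must be set before use.
The documentation and reference pages can be found on ScrapeGraphAI's official pages.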
The reference page for ScrapeGraphAI is available on its official PyPI page.
pip install scrapegraphai
You will also need to install Playwright for JavaScript-based scraping:
playwright install
Note: it is recommended to install the library in a virtual environment to avoid conflicts with other libraries.
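As one way to follow the note above, a virtual environment can be created from Python itself with the standard-library venv module. This is only a minimal sketch: the helper name create_env and the directory name in the usage note are illustrative choices, not part of ScrapeGraphAI.

```python
import venv
from pathlib import Path


def create_env(path, with_pip=True):
    """Create a virtual environment at `path` and return its pyvenv.cfg path.

    `create_env` is a hypothetical helper; only `venv.create` comes from
    the Python standard library.
    """
    venv.create(path, with_pip=with_pip)
    return Path(path) / "pyvenv.cfg"
```

Calling create_env(".venv") and then activating the environment keeps the pip install scrapegraphai and playwright install steps isolated from the system-wide packages.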
Follow the procedure at the following link to set up your OpenAI API key: link.
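Once a key exists, a common pattern is to keep it out of the source code and read it from an environment variable. The sketch below assumes the key is stored in OPENAI_API_KEY (the variable name conventionally used with OpenAI tooling); the helper load_api_key is hypothetical and not part of ScrapeGraphAI.

```python
import os


def load_api_key(var_name="OPENAI_API_KEY", env=None):
    """Return the API key stored in `var_name`, or raise a clear error.

    `env` defaults to the process environment; passing a plain dict
    makes the helper easy to exercise without touching real settings.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return key
```

The returned string can then be passed wherever the configuration expects the API key, so the secret never has to be committed alongside the script.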
The documentation for ScrapeGraphAI can be found here.
Also check out the Docusaurus documentation.
There are three main scraping pipelines that can be used to extract information from a website (or local file):
SmartScraperGraph: single-page scraper that only needs a user prompt and an input source;
SearchGraph: multi-page scraper that extracts information from the top n search results of a search engine;
SpeechGraph: single-page scraper that extracts information from a website and generates an audio file.
It is possible to use different LLMs through APIs, such as OpenAI, Groq, Azure and Gemini, or local models using Ollama.
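To illustrate the difference between a hosted API and a local model, the sketch below builds the LLM part of a configuration as a plain dictionary. Everything in it is an assumption made for illustration: the helper build_llm_config is hypothetical, and the key names (model, api_key, base_url, temperature), the provider/model naming and the example values should be checked against the ScrapeGraphAI documentation.

```python
def build_llm_config(model, api_key=None, base_url=None, temperature=0):
    """Build an `llm` config section (hypothetical helper, assumed key names).

    Hosted APIs are assumed to need `api_key`; a local Ollama server is
    assumed to need `base_url` instead.
    """
    if api_key is None and base_url is None:
        raise ValueError("pass api_key for a hosted API or base_url for a local server")
    config = {"model": model, "temperature": temperature}
    if api_key is not None:
        config["api_key"] = api_key
    if base_url is not None:
        config["base_url"] = base_url
    return config


# Example values only: the model names and the URL are placeholders.
hosted = build_llm_config("groq/gemma-7b-it", api_key="GROQ_API_KEY")
local = build_llm_config("ollama/mistral", base_url="http://localhost:11434")
```

Keeping the provider-specific parts in one place makes it easier to switch between a hosted and a local model without touching the rest of the pipeline.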
Remember to have Ollama installed and download the models using the ollama pull command.
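from scrapegraphai.graphs import SmartScraperGraph

# Configuration for a local Ollama setup. The model names and the URL are
# example values; use the models you downloaded with ollama pull.
graph_config = {
    "llm": {
        "model": "ollama/mistral",
        "temperature": 0,
        "format": "json",  # Ollama needs the output format set explicitly
        "base_url": "http://localhost:11434",  # URL of the local Ollama server
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "base_url": "http://localhost:11434",
    },
    "verbose": True,
}

# Create the scraper from a prompt, an input source and the configuration.
smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the projects with their descriptions",
    source="https://example.com/projects",  # placeholder: URL of the page to scrape
    config=graph_config,
)

# Run the pipeline and print the extracted data.
result = smart_scraper_graph.run()
print(result)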
The output will be a list of projects with their descriptions like the following:
{'projects': [{'title': 'Rotary Pendulum RL', 'description': 'Open Source project aimed at controlling a real life rotary pendulum using RL algorithms'}, {'title': 'DQN Implementation from scratch', 'description': 'Developed a Deep Q-Network algorithm to train a simple and double pendulum'}, ...]}
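Since the result is an ordinary Python dictionary, it can be post-processed with standard tools. The sketch below uses the two sample entries shown above and a hypothetical helper, format_projects, to turn them into readable lines. Because the structure is produced by an LLM and may vary between runs, the helper reads every key defensively.

```python
# Sample data copied from the output shown above.
result = {
    "projects": [
        {
            "title": "Rotary Pendulum RL",
            "description": "Open Source project aimed at controlling a real "
                           "life rotary pendulum using RL algorithms",
        },
        {
            "title": "DQN Implementation from scratch",
            "description": "Developed a Deep Q-Network algorithm to train a "
                           "simple and double pendulum",
        },
    ]
}


def format_projects(result):
    """Return one "- title: description" line per project in `result`."""
    lines = []
    for project in result.get("projects", []):
        title = project.get("title", "(untitled)")
        description = project.get("description", "")
        lines.append(f"- {title}: {description}")
    return lines


for line in format_projects(result):
    print(line)
```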
We use Groq for the LLM and Ollama for the embeddings.
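from scrapegraphai.graphs import SearchGraph

# Groq provides the LLM and a local Ollama server provides the embeddings.
# The model names, key and URL are example values.
graph_config = {
    "llm": {
        "model": "groq/gemma-7b-it",
        "api_key": "GROQ_API_KEY",  # placeholder: your Groq API key
        "temperature": 0,
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "base_url": "http://localhost:11434",  # URL of the local Ollama server
    },
    "max_results": 5,
}

# Create the search pipeline; it needs only a prompt and the configuration.
search_graph = SearchGraph(
    prompt="List me all the traditional recipes from Chioggia",
    config=graph_config,
)

# Run the pipeline and print the extracted data.
result = search_graph.run()
print(result)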
The output will be a list of recipes like the following:
{'recipes': [{'name': 'Sarde in Saòre'}, {'name': 'Bigoli in salsa'}, {'name': 'Seppie in umido'}, {'name': 'Moleche frite'}, {'name': 'Risotto alla pescatora'}, {'name': 'Broeto'}, {'name': 'Bibarasse in Cassopipa'}, {'name': 'Risi e bisi'}, {'name': 'Smegiassa Ciosota'}]}
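A result like this can be stored for later use. The sketch below writes it to a JSON file with the standard json module. It passes ensure_ascii=False so that names such as 'Sarde in Saòre' stay readable in the file rather than being escaped. The helper names and the file name are illustrative only.

```python
import json
from pathlib import Path


def save_result(result, path):
    """Write `result` to `path` as UTF-8 JSON and return the written text."""
    text = json.dumps(result, ensure_ascii=False, indent=2)
    Path(path).write_text(text, encoding="utf-8")
    return text


def load_result(path):
    """Read a result previously written by `save_result`."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# A subset of the sample output shown above.
recipes = {"recipes": [{"name": "Sarde in Saòre"}, {"name": "Bigoli in salsa"}]}
```

Calling save_result(recipes, "recipes.json") followed by load_result("recipes.json") gives back the same dictionary.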
You just need to pass the OpenAI API key and the model name.
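from scrapegraphai.graphs import SpeechGraph

# Both the LLM and the text-to-speech model use OpenAI. The keys, model
# names, voice and output path are example values.
graph_config = {
    "llm": {
        "api_key": "OPENAI_API_KEY",  # placeholder: your OpenAI API key
        "model": "gpt-3.5-turbo",
    },
    "tts_model": {
        "api_key": "OPENAI_API_KEY",
        "model": "tts-1",
        "voice": "alloy",
    },
    "output_path": "audio_summary.mp3",
}

# Create the speech pipeline from a prompt, an input source and the configuration.
speech_graph = SpeechGraph(
    prompt="Make a detailed audio summary of the projects.",
    source="https://example.com/projects",  # placeholder: URL of the page to scrape
    config=graph_config,
)

# Run the pipeline and print the returned result.
result = speech_graph.run()
print(result)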
The output will be an audio file with the summary of the projects on the page.