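Install ScrapeGraphAI with pip:
pip install scrapegraphai
Install Playwright, which is used for JavaScript-based scraping:
playwright install
Installing the library in a virtual environment is recommended to avoid conflicts with other libraries.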
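SmartScraperGraph: a single-page scraper that needs only a user prompt and an input source.
SearchGraph: a multi-page scraper that extracts information from the top n search results of a search engine.
SpeechGraph: a single-page scraper that extracts information from a website and generates an audio file.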
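SmartScraperGraph with a local model:
Make sure Ollama is installed and download the model with the ollama pull command.
The sample code shows how to create a SmartScraperGraph instance and run it to get a list of projects with their descriptions.
SearchGraph with mixed models:
Groq serves as the LLM and Ollama provides the embedding model.
The sample code shows how to create a SearchGraph instance and run it to get a list of traditional recipes from Chioggia.
SpeechGraph with OpenAI:
Only the OpenAI API key and the model name need to be passed.
The sample code shows how to create a SpeechGraph instance and run it to generate an audio file summarizing the projects.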
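The output of SmartScraperGraph is a list of projects with their descriptions.
The output of SearchGraph is a list of recipes.
The output of SpeechGraph is an audio file with a summary of the projects on the page.
An OpenAI API key must be set before use.
Documentation and reference pages are available on the official ScrapeGraphAI pages.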
The reference page for ScrapeGraphAI is available on PyPI.
pip install scrapegraphai
You will also need to install Playwright for JavaScript-based scraping:
playwright install
Note: it is recommended to install the library in a virtual environment to avoid conflicts with other libraries.
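Before installing, a script can confirm that it is actually running inside a virtual environment. The following is a minimal sketch using only the standard library; the function name `in_virtual_env` is an illustrative choice and not part of ScrapeGraphAI.

```python
import sys


def in_virtual_env() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside an environment created by the `venv` module, sys.prefix points at
    # the environment while sys.base_prefix points at the base installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    if in_virtual_env():
        print("Virtual environment detected:", sys.prefix)
    else:
        print("No virtual environment detected; consider `python -m venv .venv`.")
```

This check covers environments created with `venv` or `virtualenv`; other isolation tools may need a different test.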
Follow the procedure at the following link to set up your OpenAI API key: link.
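Once a key exists, one common way to keep it out of source code is to read it from an environment variable. The sketch below assumes the variable is named `OPENAI_API_KEY`, which is a widespread convention rather than something this page specifies; `get_api_key` is a hypothetical helper.

```python
import os


def get_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Read an API key from the environment, failing loudly if it is missing."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Environment variable {var_name} is not set")
    return key
```

The returned string can then be placed in the configuration dictionary passed to a graph, instead of hard-coding the key.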
The documentation for ScrapeGraphAI can be found here.
Also check out the Docusaurus documentation.
There are three main scraping pipelines that can be used to extract information from a website (or local file):
SmartScraperGraph: single-page scraper that only needs a user prompt and an input source;
SearchGraph: multi-page scraper that extracts information from the top n search results of a search engine;
SpeechGraph: single-page scraper that extracts information from a website and generates an audio file.
It is possible to use different LLMs through APIs, such as OpenAI, Groq, Azure and Gemini, or local models using Ollama.
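Switching providers mostly comes down to changing the configuration dictionary handed to a graph. The sketch below shows one way to build such dictionaries; `make_graph_config` is a hypothetical helper, and the key names (`llm`, `embeddings`, `model`) and the model strings are assumptions based on the layout used in the library's published examples, so check them against the documentation for your installed version.

```python
from typing import Any, Dict, Optional


def make_graph_config(
    llm_model: str,
    embeddings_model: Optional[str] = None,
    **llm_options: Any,
) -> Dict[str, Any]:
    """Build a config dict with an "llm" section and an optional "embeddings" section."""
    config: Dict[str, Any] = {"llm": {"model": llm_model, **llm_options}}
    if embeddings_model is not None:
        config["embeddings"] = {"model": embeddings_model}
    return config


# A local model served by Ollama (model names are examples).
local_config = make_graph_config(
    "ollama/mistral",
    embeddings_model="ollama/nomic-embed-text",
    temperature=0,
)

# A hosted model reached through an API; the key is a placeholder.
hosted_config = make_graph_config("groq/gemma-7b-it", api_key="YOUR_GROQ_API_KEY")
```

Keeping the provider-specific details in one place makes it easier to try the same prompt against a local and a hosted model.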
Remember to have Ollama installed and download the models using the ollama pull command.
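from scrapegraphai.graphs import SmartScraperGraph

# Model names and URLs are example values; adjust them to your own setup.
graph_config = {
    "llm": {
        "model": "ollama/mistral",
        "temperature": 0,
        "format": "json",  # Ollama needs the output format set explicitly
        "base_url": "http://localhost:11434",  # URL of the local Ollama server
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "base_url": "http://localhost:11434",
    },
    "verbose": True,
}

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the projects with their descriptions",
    source="https://perinim.github.io/projects",
    config=graph_config,
)

result = smart_scraper_graph.run()
print(result)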
The output will be a list of projects with their descriptions like the following:
{'projects': [{'title': 'Rotary Pendulum RL', 'description': 'Open Source project aimed at controlling a real life rotary pendulum using RL algorithms'}, {'title': 'DQN Implementation from scratch', 'description': 'Developed a Deep Q-Network algorithm to train a simple and double pendulum'}, ...]}
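Because the result is a plain Python dictionary, it can be post-processed with ordinary code. The sketch below uses a sample shaped like the output above, truncated to two entries; `project_titles` is a hypothetical helper, and the defensive lookups assume that an LLM-generated result may occasionally omit a field.

```python
from typing import Dict, List

# Sample shaped like the output shown above (first two entries only).
result = {
    "projects": [
        {
            "title": "Rotary Pendulum RL",
            "description": "Open Source project aimed at controlling a real life rotary pendulum using RL algorithms",
        },
        {
            "title": "DQN Implementation from scratch",
            "description": "Developed a Deep Q-Network algorithm to train a simple and double pendulum",
        },
    ]
}


def project_titles(data: Dict[str, List[Dict[str, str]]]) -> List[str]:
    """Collect project titles, skipping entries that lack one."""
    return [p["title"] for p in data.get("projects", []) if p.get("title")]


for title in project_titles(result):
    print(title)
```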
We use Groq for the LLM and Ollama for the embeddings.
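from scrapegraphai.graphs import SearchGraph

# Model names, the API key and the result limit are example values.
graph_config = {
    "llm": {
        "model": "groq/gemma-7b-it",
        "api_key": "GROQ_API_KEY",  # replace with your Groq API key
        "temperature": 0,
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "base_url": "http://localhost:11434",  # URL of the local Ollama server
    },
    "max_results": 5,
}

search_graph = SearchGraph(
    prompt="List me all the traditional recipes from Chioggia",
    config=graph_config,
)

result = search_graph.run()
print(result)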
The output will be a list of recipes like the following:
{'recipes': [{'name': 'Sarde in Saòre'}, {'name': 'Bigoli in salsa'}, {'name': 'Seppie in umido'}, {'name': 'Moleche frite'}, {'name': 'Risotto alla pescatora'}, {'name': 'Broeto'}, {'name': 'Bibarasse in Cassopipa'}, {'name': 'Risi e bisi'}, {'name': 'Smegiassa Ciosota'}]}
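Results like this often contain accented characters (for example "Saòre"), so it helps to store them as UTF-8 JSON without escaping. A minimal sketch with the standard library follows; `save_result` and the file name are illustrative, and the sample holds only the first two recipes from the output above.

```python
import json
import tempfile
from pathlib import Path

# Sample shaped like the output shown above (first two entries only).
result = {"recipes": [{"name": "Sarde in Saòre"}, {"name": "Bigoli in salsa"}]}


def save_result(data: dict, path: Path) -> None:
    """Write the result as UTF-8 JSON, keeping accented characters readable."""
    text = json.dumps(data, ensure_ascii=False, indent=2)
    Path(path).write_text(text, encoding="utf-8")


# Demonstrate with a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "recipes.json"
    save_result(result, out_file)
    print(out_file.read_text(encoding="utf-8"))
```

With `ensure_ascii=False` the file contains "Saòre" literally instead of a `\u00f2` escape, which keeps the saved data easy to read and diff.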
You just need to pass the OpenAI API key and the model name.
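from scrapegraphai.graphs import SpeechGraph

# Model names, voice, API key and output path are example values.
graph_config = {
    "llm": {
        "api_key": "OPENAI_API_KEY",  # replace with your OpenAI API key
        "model": "gpt-3.5-turbo",
    },
    "tts_model": {
        "api_key": "OPENAI_API_KEY",
        "model": "tts-1",
        "voice": "alloy",
    },
    "output_path": "audio_summary.mp3",
}

speech_graph = SpeechGraph(
    prompt="Make a detailed audio summary of the projects.",
    source="https://perinim.github.io/projects/",
    config=graph_config,
)

result = speech_graph.run()
print(result)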
The output will be an audio file with the summary of the projects on the page.