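Install ScrapeGraphAI with pip:
pip install scrapegraphai
Install Playwright, which is used for JavaScript-based scraping:
playwright install
Installing the library in a virtual environment is recommended to avoid conflicts with other libraries.
SmartScraperGraph: single-page scraper that only needs a user prompt and an input source.
SearchGraph: multi-page scraper that extracts information from the top n search results of a search engine.
SpeechGraph: single-page scraper that extracts information from a website and generates an audio file.
SmartScraperGraph with a local model:
Make sure Ollama is installed and download the models with the ollama pull command.
The example code creates a SmartScraperGraph instance and runs it to get a list of projects with their descriptions.
SearchGraph with mixed models:
Groq serves as the LLM and Ollama as the embedding model.
The example code creates a SearchGraph instance and runs it to get a list of traditional recipes from Chioggia.
SpeechGraph with OpenAI:
Only the OpenAI API key and the model name need to be passed.
The example code creates a SpeechGraph instance and runs it to generate an audio file summarizing the projects.
The output of SmartScraperGraph is a list of projects with their descriptions.
The output of SearchGraph is a list of recipes.
The output of SpeechGraph is an audio file summarizing the projects on the page.
An OpenAI API key must be set up before use.
The documentation and reference pages are available on ScrapeGraphAI's official pages.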
The reference page for ScrapeGraphAI is available on the official PyPI page.
pip install scrapegraphai
You will also need to install Playwright for JavaScript-based scraping:
playwright install
Note: it is recommended to install the library in a virtual environment to avoid conflicts with other libraries.
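To confirm that the interpreter you are about to install into really belongs to a virtual environment, a quick check like the following can help. This is a minimal sketch using only the standard library; the function name in_virtual_environment is made up for illustration and is not part of ScrapeGraphAI.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Environments created with the stdlib venv module make sys.prefix
    # differ from sys.base_prefix.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    # Older releases of the third-party virtualenv tool set sys.real_prefix instead.
    return sys.prefix != base_prefix or hasattr(sys, "real_prefix")


if __name__ == "__main__":
    print("virtual environment active:", in_virtual_environment())
```

If the check prints False, the pip install above would go into the system-wide or user-wide interpreter.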
Follow the procedure at the following link to set up your OpenAI API key: link.
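Once you have a key, one common way to keep it out of your source code is to read it from an environment variable. The sketch below assumes the key is stored in a variable named OPENAI_API_KEY; the helper names load_api_key and build_openai_llm_config are hypothetical, and the dictionary shape simply mirrors the "llm" entry used in the SpeechGraph example in this document.

```python
import os


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in the given environment variable."""
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        # Fail early with a readable message instead of sending an empty key.
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return api_key


def build_openai_llm_config(api_key: str, model: str = "gpt-3.5-turbo") -> dict:
    """Assemble the "llm" part of a graph config (hypothetical helper)."""
    return {"llm": {"api_key": api_key, "model": model}}


# Typical use, after exporting OPENAI_API_KEY in your shell:
# graph_config = build_openai_llm_config(load_api_key())
```

Failing fast when the variable is missing tends to be easier to debug than an authentication error raised later in the middle of a scraping run.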
The documentation for ScrapeGraphAI can be found here.
Also check out the Docusaurus documentation.
There are three main scraping pipelines that can be used to extract information from a website (or local file):
SmartScraperGraph: single-page scraper that only needs a user prompt and an input source;
SearchGraph: multi-page scraper that extracts information from the top n search results of a search engine;
SpeechGraph: single-page scraper that extracts information from a website and generates an audio file.
It is possible to use different LLMs through APIs, such as OpenAI, Groq, Azure and Gemini, or local models using Ollama.
Remember to have Ollama installed and download the models using the ollama pull command.
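Before running a graph against a local model, it can be useful to verify that the Ollama server is reachable and to see which models have already been pulled. The following is a rough sketch, assuming the server listens on http://localhost:11434 (the URL used in the configs in this document) and exposes the /api/tags endpoint of Ollama's HTTP API for listing local models; list_ollama_models is a hypothetical helper, not part of ScrapeGraphAI.

```python
import json
import urllib.request
from typing import List, Optional


def list_ollama_models(base_url: str = "http://localhost:11434",
                       timeout: float = 2.0) -> Optional[List[str]]:
    """Return the names of locally pulled models, or None if the server is unreachable."""
    url = base_url.rstrip("/") + "/api/tags"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            payload = json.load(response)
    except (OSError, ValueError):
        # OSError covers refused connections, timeouts and HTTP errors;
        # ValueError covers a response body that is not valid JSON.
        return None
    return [model.get("name", "") for model in payload.get("models", [])]


if __name__ == "__main__":
    models = list_ollama_models()
    if models is None:
        print("Ollama does not seem to be running")
    else:
        print("Pulled models:", models)
```

If a model referenced in your config is missing from the list, pull it with ollama pull before running the graph.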
from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "model": "ollama/mistral",
        "temperature": 0,
        "format": "json",  # Ollama needs the format to be specified explicitly
        "base_url": "http://localhost:11434",  # set Ollama URL
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "base_url": "http://localhost:11434",  # set Ollama URL
    },
    "verbose": True,
}

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the projects with their descriptions",
    # also accepts a string with the already downloaded HTML code
    source="https://perinim.github.io/projects",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)
The output will be a list of projects with their descriptions like the following:
{'projects': [{'title': 'Rotary Pendulum RL', 'description': 'Open Source project aimed at controlling a real life rotary pendulum using RL algorithms'}, {'title': 'DQN Implementation from scratch', 'description': 'Developed a Deep Q-Network algorithm to train a simple and double pendulum'}, ...]}
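Since the result is a plain Python dictionary, it can be post-processed with ordinary code. The sketch below turns it into one line per project, assuming the result follows the shape of the sample above (a "projects" list whose items carry "title" and "description" keys). The model's output is driven by the prompt, so the keys may differ in practice, which is why the code falls back to defaults; format_projects is a hypothetical helper, and the sample data contains only the two entries shown above.

```python
from typing import List


def format_projects(result: dict) -> List[str]:
    """Turn a result shaped like {'projects': [...]} into printable lines."""
    lines = []
    for project in result.get("projects", []):
        # Use defaults so one missing key does not abort the whole loop.
        title = project.get("title", "(untitled)")
        description = project.get("description", "")
        lines.append(f"{title}: {description}")
    return lines


# Sample data copied from the truncated output shown above.
sample_result = {
    "projects": [
        {
            "title": "Rotary Pendulum RL",
            "description": "Open Source project aimed at controlling a real life "
                           "rotary pendulum using RL algorithms",
        },
        {
            "title": "DQN Implementation from scratch",
            "description": "Developed a Deep Q-Network algorithm to train a "
                           "simple and double pendulum",
        },
    ]
}

for line in format_projects(sample_result):
    print(line)
```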
We use Groq for the LLM and Ollama for the embeddings.
from scrapegraphai.graphs import SearchGraph

# Define the configuration for the graph
graph_config = {
    "llm": {
        "model": "groq/gemma-7b-it",
        "api_key": "GROQ_API_KEY",  # placeholder: replace with your Groq API key
        "temperature": 0
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "base_url": "http://localhost:11434",  # set ollama URL arbitrarily
    },
    "max_results": 5,
}

# Create the SearchGraph instance
search_graph = SearchGraph(
    prompt="List me all the traditional recipes from Chioggia",
    config=graph_config
)

# Run the graph
result = search_graph.run()
print(result)
The output will be a list of recipes like the following:
{'recipes': [{'name': 'Sarde in Saòre'}, {'name': 'Bigoli in salsa'}, {'name': 'Seppie in umido'}, {'name': 'Moleche frite'}, {'name': 'Risotto alla pescatora'}, {'name': 'Broeto'}, {'name': 'Bibarasse in Cassopipa'}, {'name': 'Risi e bisi'}, {'name': 'Smegiassa Ciosota'}]}
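If you want to keep a result like this for later use, writing it to a JSON file is one straightforward option. The sketch below is an illustration rather than a ScrapeGraphAI feature: save_result and load_result are hypothetical helpers, and the sample holds only three of the names shown above. Passing ensure_ascii=False keeps accented names such as 'Sarde in Saòre' readable in the file instead of escaping them.

```python
import json
from pathlib import Path


def save_result(result: dict, path: str) -> None:
    """Write a result dictionary to a UTF-8 encoded JSON file."""
    # ensure_ascii=False stores non-ASCII characters as-is rather than as \uXXXX escapes.
    text = json.dumps(result, ensure_ascii=False, indent=2)
    Path(path).write_text(text, encoding="utf-8")


def load_result(path: str) -> dict:
    """Read a result dictionary back from a JSON file."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# A shortened copy of the sample output shown above.
sample_recipes = {
    "recipes": [
        {"name": "Sarde in Saòre"},
        {"name": "Bigoli in salsa"},
        {"name": "Seppie in umido"},
    ]
}

# Typical use:
# save_result(sample_recipes, "recipes.json")
# names = [recipe["name"] for recipe in load_result("recipes.json")["recipes"]]
```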
You just need to pass the OpenAI API key and the model name.
from scrapegraphai.graphs import SpeechGraph

graph_config = {
    "llm": {
        "api_key": "OPENAI_API_KEY",  # placeholder: replace with your OpenAI API key
        "model": "gpt-3.5-turbo",
    },
    "tts_model": {
        "api_key": "OPENAI_API_KEY",  # placeholder: replace with your OpenAI API key
        "model": "tts-1",
        "voice": "alloy"
    },
    "output_path": "audio_summary.mp3",
}

# ************************************************
# Create the SpeechGraph instance and run it
# ************************************************
speech_graph = SpeechGraph(
    prompt="Make a detailed audio summary of the projects.",
    source="https://perinim.github.io/projects/",
    config=graph_config,
)

result = speech_graph.run()
print(result)
The output will be an audio file with the summary of the projects on the page.