
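AI Web Crawler: Batch-Downloading Every Link on a Web Page

Published: 2024-07-18 12:13:40 Views: 2248

The page looks like this and contains multiple links:

Find the a tags among them: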

<a …>产品优势</a>
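To make the target of the extraction concrete, here is a minimal sketch of pulling the title and href attributes out of a tags that sit inside the div with class="rno-learning-path-wrap". It uses only Python's built-in html.parser (not BeautifulSoup) so that it runs without third-party packages. The class name AnchorCollector, the sample markup, and the example href paths are all hypothetical; the real page's attributes may differ.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (title, href) pairs from <a> tags inside one target div."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0   # > 0 while we are inside the target div
        self.links = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "div":
            classes = (attr_map.get("class") or "").split()
            # Enter the target div, or track a nested div inside it.
            if self.depth or self.target_class in classes:
                self.depth += 1
        elif tag == "a" and self.depth:
            title = attr_map.get("title")
            href = attr_map.get("href")
            if title and href:   # skip anchors missing either attribute
                self.links.append((title, href))

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1


# Hypothetical markup for illustration only.
sample_html = """
<a href="/outside" title="Outside">Outside</a>
<div class="rno-learning-path-wrap">
  <div><a href="/document/product/1093/example-a" title="产品优势">产品优势</a></div>
  <a href="/document/product/1093/example-b">No title attribute</a>
</div>
<a href="/after" title="After">After</a>
"""

collector = AnchorCollector("rno-learning-path-wrap")
collector.feed(sample_html)
for title, href in collector.links:
    print(title, href)
```

Only the anchor that is inside the target div and carries both attributes is kept; anchors outside the div, or without a title, are skipped.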

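Enter the following prompt in DeepSeek:

You are a Python programming expert. Write a Python script that scrapes a web page. The specific tasks are:

Parse the web page: https://cloud.tencent.com/document/product/1093

Locate the div tag with class="rno-learning-path-wrap";

Then locate all a tags inside that div tag. Extract each title attribute value to use as the file name and each href attribute value to use as the download address, download the page, and save it to the folder: F:\aivideo\腾讯云语音识别

Notes:

Print a status message to the screen at every step;

After each page is downloaded, pause for a random 3 to 6 seconds;

Set the request headers: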

Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7
Accept-Encoding: gzip, deflate, br, zstd
Accept-Language: zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6
Cache-Control: max-age=0
Priority: u=0, i
Referer: https://cloud.tencent.com/product/asr?from_column=20421&from=20421
Sec-Ch-Ua: " Not A;Brand";v="99", "Chromium";v="127", "Google Chrome";v="127"
Sec-Ch-Ua-Mobile: ?0
Sec-Ch-Ua-Platform: "Windows"
Sec-Fetch-Dest: document
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: same-origin
Sec-Fetch-User: ?1
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36
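Header lists copied from the browser's developer tools can arrive either as "Name: value" on one line or with the name and the value on separate lines, and some values may be missing. The following is a minimal sketch of a helper that turns such pasted text into a dict suitable for the headers argument of requests.get. The function name parse_raw_headers and the sample text are hypothetical and not part of the original script.

```python
import re

# A header line is a token followed by ":" and then either nothing or
# whitespace plus a value. "https://..." does not match, so URLs on a
# value line are not mistaken for header names.
HEADER_LINE = re.compile(r"^([A-Za-z0-9-]+):(?:\s+(.*))?$")


def parse_raw_headers(raw):
    """Parse pasted header text into a {name: value} dict."""
    headers = {}
    pending = None  # header name still waiting for its value line
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        match = HEADER_LINE.match(line)
        if match:
            name, value = match.group(1), match.group(2) or ""
            headers[name] = value
            pending = name if not value else None
        elif pending is not None:
            headers[pending] = line
            pending = None
    return headers


# Hypothetical pasted text mixing both layouts, with one missing value.
raw_text = """
Referer:

https://cloud.tencent.com/product/asr

Sec-Ch-Ua-Mobile:

Sec-Ch-Ua-Platform:

"Windows"

Cache-Control: max-age=0
"""

parsed = parse_raw_headers(raw_text)
for name, value in parsed.items():
    print(f"{name!r}: {value!r}")
```

A header whose value was lost in copying comes back as an empty string, which makes it easy to spot and fill in by hand before sending the request.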

Source code generated by DeepSeek:

import os
import random
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# Request headers
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
    # requests decodes gzip and deflate by itself; br and zstd are only decoded
    # when the optional brotli / zstandard packages are installed, so they are
    # not advertised here.
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6",
    "Cache-Control": "max-age=0",
    "Priority": "u=0, i",
    "Referer": "https://cloud.tencent.com/product/asr?from_column=20421&from=20421",
    "Sec-Ch-Ua": '" Not A;Brand";v="99", "Chromium";v="127", "Google Chrome";v="127"',
    "Sec-Ch-Ua-Mobile": "?0",
    "Sec-Ch-Ua-Platform": '"Windows"',
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "same-origin",
    "Sec-Fetch-User": "?1",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36"
}

# Create the output folder if it does not exist yet
output_folder = 'F:\\aivideo\\腾讯云语音识别'
os.makedirs(output_folder, exist_ok=True)

# Request the page
url = "https://cloud.tencent.com/document/product/1093"
print(f"Requesting {url}")
response = requests.get(url, headers=headers, timeout=30)
response.raise_for_status()
response.encoding = 'utf-8'

# Parse the page
soup = BeautifulSoup(response.text, 'html.parser')

# Locate the div tag with class="rno-learning-path-wrap"
div_tag = soup.find('div', class_='rno-learning-path-wrap')
if div_tag is None:
    raise SystemExit('No div with class="rno-learning-path-wrap" was found')

# Locate all a tags inside the div tag
a_tags = div_tag.find_all('a')
print(f"Found {len(a_tags)} a tags")

# Use the title attribute as the file name and the href attribute as the download address
for a_tag in a_tags:
    title = a_tag.get('title')
    href = a_tag.get('href')
    if title and href:
        # Resolve relative links against the page URL
        href = urljoin(url, href)
        # Print progress to the screen
        print(f"Found link: {title} - {href}")
        # Download the page
        try:
            page = requests.get(href, headers=headers, timeout=30)
            page.raise_for_status()
            page.encoding = 'utf-8'
            file_path = os.path.join(output_folder, f"{title}.html")
            with open(file_path, 'w', encoding='utf-8') as file:
                file.write(page.text)
            print(f"Downloaded {href} to {file_path}")
        except Exception as e:
            print(f"Failed to download {href}: {e}")
        # Pause for a random 3 to 6 seconds
        time.sleep(random.uniform(3, 6))
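Because the script uses each link's title attribute directly as the file name, a title containing characters such as / or : would make the save step fail on Windows. The following is a minimal sketch of a helper that could be applied to title before file_path is built. The name safe_filename, the replacement character, and the 100-character limit are assumptions for illustration, not part of the generated script.

```python
import re

# Characters Windows rejects in file names, plus ASCII control characters.
INVALID_CHARS = re.compile(r'[\\/:*?"<>|\x00-\x1f]')


def safe_filename(title, extension=".html", max_length=100):
    """Return a file name derived from title that is safe to create on Windows."""
    cleaned = INVALID_CHARS.sub("_", title)
    # Windows also dislikes names that end in a dot or a space.
    cleaned = cleaned.strip().rstrip(". ")
    # Fall back to a placeholder when nothing usable is left.
    cleaned = cleaned[:max_length] or "untitled"
    return cleaned + extension


print(safe_filename("API 概览/录音文件识别"))
print(safe_filename("a:b?c"))
print(safe_filename("  ...  "))
```

In the script above, this would mean building the path with os.path.join(output_folder, safe_filename(title)) instead of f"{title}.html". Two different titles can still map to the same cleaned name, so a real run might also need a check for existing files.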
