The page looks like this and contains multiple links:
Locate one of the a tags on it:
<a hotrep="doc.overview.modules.path.0.0.1" href="https://cloud.tencent.com/document/product/1093/35681" title="产品优势">
产品优势
</a>
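The two attributes that matter in this tag are title and href. As a minimal sketch of pulling them out, the snippet below parses the tag shown above with Python's standard-library HTMLParser (chosen here only so the example runs without extra packages). The class name LinkCollector is hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (title, href) pairs from every <a> tag that has both attributes."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        title, href = attr_map.get("title"), attr_map.get("href")
        if title and href:
            self.links.append((title, href))


# The a tag shown above, copied verbatim
snippet = '''<a hotrep="doc.overview.modules.path.0.0.1" href="https://cloud.tencent.com/document/product/1093/35681" title="产品优势">
产品优势
</a>'''

collector = LinkCollector()
collector.feed(snippet)
print(collector.links)
```

Each pair gives a file name (the title) and a download address (the href), which is what the prompt asks the script to extract.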
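Enter the following prompt in DeepSeek:
You are a Python programming expert. Write a Python script that scrapes a Baidu search page. The specific tasks are as follows:
Parse the page: https://cloud.tencent.com/document/product/1093
Locate the div tag with class="rno-learning-path-wrap";
Then locate all a tags inside that div, take each title attribute value as the page's file name and each href attribute value as the page's download address, download the page, and save it to the folder: F:\aivideo\腾讯云语音识别
Notes:
Print progress information to the screen at every step
After each page is downloaded, pause for a random 3-6 seconds;
Set the request headers: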
Accept:
text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7
Accept-Encoding:
gzip, deflate, br, zstd
Accept-Language:
zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6
Cache-Control:
max-age=0
Priority:
u=0, i
Referer:
https://cloud.tencent.com/product/asr?from_column=20421&from=20421
Sec-Ch-Ua:
" Not A;Brand";v="99", "Chromium";v="127", "Google Chrome";v="127"
Sec-Ch-Ua-Mobile:
?0
Sec-Ch-Ua-Platform:
"Windows"
Sec-Fetch-Dest:
document
Sec-Fetch-Mode:
navigate
Sec-Fetch-Site:
same-origin
Sec-Fetch-User:
?1
Upgrade-Insecure-Requests:
1
User-Agent:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36
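The header list in the prompt alternates a name line ending in a colon with a value line, which is how such lists often look when pasted from a browser's developer tools. If you would rather build the headers dict yourself than have the model retype it, a minimal sketch could look like this. The function name parse_header_block is hypothetical, and the sample text is a three-header excerpt of the list above.

```python
def parse_header_block(text):
    """Turn alternating 'Name:' / value lines into a headers dict."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    headers = {}
    # Walk the lines two at a time: a name line ending in ':' and then its value.
    for name_line, value in zip(lines[0::2], lines[1::2]):
        if not name_line.endswith(":"):
            raise ValueError(f"expected a header name ending in ':', got {name_line!r}")
        headers[name_line[:-1]] = value
    return headers


raw = '''
Accept-Language:
zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6
Cache-Control:
max-age=0
Sec-Ch-Ua-Platform:
"Windows"
'''

headers = parse_header_block(raw)
print(headers)
```

The resulting dict can be passed directly as the headers argument of requests.get.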
Source code generated by DeepSeek:
import os
import random
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# Request headers
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
    # br and zstd are left out: requests only decodes them when the optional
    # brotli / zstandard packages are installed, otherwise the body is unreadable
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6",
    "Cache-Control": "max-age=0",
    "Priority": "u=0, i",
    "Referer": "https://cloud.tencent.com/product/asr?from_column=20421&from=20421",
    "Sec-Ch-Ua": '" Not A;Brand";v="99", "Chromium";v="127", "Google Chrome";v="127"',
    "Sec-Ch-Ua-Mobile": "?0",
    "Sec-Ch-Ua-Platform": '"Windows"',
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "same-origin",
    "Sec-Fetch-User": "?1",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36"
}

# Create the output folder if it does not exist yet
output_folder = 'F:\\aivideo\\腾讯云语音识别'
os.makedirs(output_folder, exist_ok=True)

# Request the index page
url = "https://cloud.tencent.com/document/product/1093"
print(f"Requesting {url}")
response = requests.get(url, headers=headers, timeout=30)
response.raise_for_status()
response.encoding = 'utf-8'

# Parse the page
soup = BeautifulSoup(response.text, 'html.parser')

# Locate the div tag with class="rno-learning-path-wrap"
div_tag = soup.find('div', class_='rno-learning-path-wrap')
if div_tag is None:
    raise SystemExit('No div with class "rno-learning-path-wrap" found on the page')

# Locate all a tags inside the div
a_tags = div_tag.find_all('a')
print(f"Found {len(a_tags)} a tags")

# Use the title attribute as the file name and the href attribute as the download address
for a_tag in a_tags:
    title = a_tag.get('title')
    href = a_tag.get('href')
    if title and href:
        # Resolve relative links against the index page URL
        href = urljoin(url, href)
        # Print progress to the screen
        print(f"Found link: {title} - {href}")
        # Download the page
        try:
            page = requests.get(href, headers=headers, timeout=30)
            page.raise_for_status()
            page.encoding = 'utf-8'
            file_path = os.path.join(output_folder, f"{title}.html")
            with open(file_path, 'w', encoding='utf-8') as file:
                file.write(page.text)
            print(f"Downloaded {href} to {file_path}")
        except (requests.RequestException, OSError) as e:
            print(f"Failed to download {href}: {e}")
        # Pause for a random 3-6 seconds
        time.sleep(random.uniform(3, 6))
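The script uses each title value directly as a file name. On Windows, a title containing characters such as / or : would make open() fail, so it may be worth cleaning the title first. The helper below is a minimal sketch of that idea. The name safe_filename and the "untitled" fallback are hypothetical, and the second sample title is made up for illustration.

```python
import re

# Characters Windows does not allow in file names, plus ASCII control characters
_INVALID_CHARS = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def safe_filename(title, fallback="untitled"):
    """Replace characters that are invalid in Windows file names with '_'."""
    # Windows also rejects names ending in a space or a dot, so trim those too.
    cleaned = _INVALID_CHARS.sub("_", title).strip(" .")
    return cleaned or fallback


print(safe_filename("产品优势"))
print(safe_filename("a/b:c"))
```

In the script, the file path would then be built as os.path.join(output_folder, safe_filename(title) + ".html").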