BeautifulSoup implementation
Create the folders for the images in advance.
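The folders can also be created from code instead of by hand. A minimal sketch — `wang2` and `wangzhe` are the folder names the two scripts below write into:

```python
import os

# Create the output folders up front; exist_ok=True makes this safe to re-run.
# "wang2" and "wangzhe" are the folders the download scripts write into.
for folder in ("wang2", "wangzhe"):
    os.makedirs(folder, exist_ok=True)
```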
```python
from bs4 import BeautifulSoup
import requests
import time

start_time = time.time()
url = "https://pvp.qq.com/web201605/herolist.shtml"
session = requests.session()
response = session.get(url)
print(f'Status code: {response.status_code}')
if response.status_code == 200:
    print("Server connection OK")

# The page is GBK-encoded, so decode before handing it to the parser
soup = BeautifulSoup(response.content.decode('gbk'), 'html.parser')
items = soup.find_all('ul', class_='herolist clearfix')
items2 = items[0].find_all('li')
for item in items2:
    img_url = item.find('img')['src']   # protocol-relative URL
    name = item.find('a').text          # hero name, used as the file name
    new_url = "https:" + img_url
    url_content = session.get(new_url).content
    with open("./wang2/" + name + ".jpg", "wb") as f:
        f.write(url_content)
    print(f"{name} downloaded")
end_time = time.time()
print(f"All downloads finished in {end_time - start_time} s")
```
```
...
墨子 downloaded
赵云 downloaded
小乔 downloaded
廉颇 downloaded
All downloads finished in 57.759543895721436 s
```
PyQuery implementation
```python
import requests
from pyquery import PyQuery as pq
import time

start_time = time.time()
url = "https://pvp.qq.com/web201605/herolist.shtml"
session = requests.session()
response = session.get(url)
print(f'Status code: {response.status_code}')
if response.status_code == 200:
    print("Server connection OK")

content = response.content.decode('gbk')
doc = pq(content)
items = doc('.herolist>li')
items = items.items()   # yields one PyQuery object per <li>
for item in items:
    url = item.find("img").attr("src")
    new_url = "https:" + url
    name = item.find("a").text()
    url_content = session.get(new_url).content
    with open('./wangzhe/' + name + '.jpg', 'wb') as f:
        f.write(url_content)
    print(f'{name} downloaded')
end_time = time.time()
print(f'All downloads finished in {end_time - start_time} s')
```
```
...
赵云 downloaded
小乔 downloaded
廉颇 downloaded
All downloads finished in 65.03721570968628 s
```
re: not feasible (for now)
```python
import requests
import re
import time

start_time = time.time()
url = "https://pvp.qq.com/web201605/herolist.shtml"
session = requests.session()
response = session.get(url)
print(f'status_code: {response.status_code}')
if response.status_code == 200:
    print("ok")

# Grab the body of every <li>, then try to pull the <img> tag out of each
r = r'<li>(.*?)</li>'
items = re.findall(r, response.content.decode('gbk'), re.DOTALL)
for item in items:
    print(item)
    r1 = r'<img*>'
    item2 = re.findall(r1, item, re.DOTALL)
    print(item2)   # empty: see the output below
```
The result this returns:
```
...
<a href="herodetail/'+this.ename+'.shtml" target="_blank"><img src="'+imgurl+this.ename+'.jpg" width="91" alt="'+this.cname+'">'+this.cname+'</a>
[]
<a href="herodetail/' + data[f].id_name + '.shtml" target="_blank"><img src="' + _imgurl + _ename + '.jpg" width="91" height="91" alt="' + _cname + '">' + _cname + '</a>
[]
<a href="herodetail/' + dataList[j].id_name + '.shtml" target="_blank"><img src="' + imgurl + dataList[j].ename + '.jpg" width="91px" alt="' + dataList[j].cname + '">' + dataList[j].cname + '</a>
[]
```
It looks like this would need the page's JavaScript to be evaluated, so the re approach fails here, while the two methods above work.
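To see why, here is a sketch run against one of the template strings above. The `<img src=` pattern is my own for illustration, not from the script: even a pattern that does match only captures the JS placeholder, never a real URL.

```python
import re

# One <li> body returned by the findall above is a JS template, not rendered HTML:
template = ('<a href="herodetail/\'+this.ename+\'.shtml" target="_blank">'
            '<img src="\'+imgurl+this.ename+\'.jpg" width="91" '
            'alt="\'+this.cname+\'">\'+this.cname+\'</a>')

# Even a working pattern only captures the template placeholder --
# the real src is assembled by the page's JavaScript at runtime.
src = re.findall(r'<img src="(.*?)"', template)
print(src)
```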
Result screenshot

Summary of the key calls
```python
# soup:
soup = BeautifulSoup(response.content.decode('gbk'), 'html.parser')
img_url = soup.find_all('ul', class_='herolist clearfix')
name = item.find('a').text        # .text is a property

# pq:
doc = pq(content)
items = doc('.herolist>li')
url = item.find("img").attr("src")
name = item.find("a").text()      # .text() is a method call
```
Postscript
Just killing time; I'll take on something harder next.