Web Scraping

Setting Up a Scraping Service

Environment Setup

Install requests

```shell
pip install requests
```

Page-Scraping Workflow

Import the library

```python
import requests
```

Request a page

```python
res = requests.get("https://baidu.com")
res.status_code  # 200
```
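Checking the status code by hand works, but requests can also raise on failure via `raise_for_status()`. A minimal offline sketch, using a hand-built `Response` object purely for illustration (in real use `requests.get()` returns it already populated):

```python
import requests

# requests exposes named status codes for readable comparisons
assert requests.codes.ok == 200

# raise_for_status() turns 4xx/5xx responses into HTTPError;
# a hand-built Response stands in for a real server reply here
r = requests.models.Response()
r.status_code = 404
try:
    r.raise_for_status()
except requests.HTTPError as err:
    print("request failed:", err)
```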

Print the raw page text

```python
# res.encoding = "utf-8"
res.encoding = res.apparent_encoding
print(res.text)
```

Run the file

```shell
python3 spider.py
```

Full script (spider.py):

```python
import requests

target = "https://baidu.com"

res = requests.get(target)
print(res.status_code)

res.encoding = res.apparent_encoding
print(res.text)
```

Requests API

Methods

  • requests.get() fetches a page / sends a GET request
  • requests.head() fetches only the response headers
  • requests.post() sends a POST request

Response data

  • .status_code HTTP status code
  • .encoding encoding taken from the response headers, or ISO-8859-1 by default
  • .apparent_encoding encoding inferred automatically from the response body
  • .content the response body as raw bytes
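The difference between `.content` (bytes) and `.text` (decoded str) can be shown offline with a hand-built `Response`; `_content` is an internal field, set here only so the demo needs no network:

```python
import requests

# In real use requests.get() fills these fields from the reply
r = requests.models.Response()
r._content = "百度一下".encode("utf-8")  # raw body bytes, as received off the wire

print(type(r.content))        # the raw bytes

r.encoding = "ISO-8859-1"     # the fallback when headers carry no charset
mojibake = r.text             # UTF-8 bytes decoded as Latin-1: garbled

r.encoding = "utf-8"          # the right charset; apparent_encoding tries to guess it
print(r.text)                 # 百度一下
```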

Passing URL parameters

```python
res = requests.get(url, params={"wd": "xixi"})
```
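requests URL-encodes the params dict for you; a prepared request shows the final URL without sending anything over the network:

```python
import requests

# .prepare() builds the final request, including the encoded query string
req = requests.Request("GET", "https://www.baidu.com/s", params={"wd": "xixi"}).prepare()
print(req.url)  # https://www.baidu.com/s?wd=xixi
```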

Parsing the Page

Using Beautiful Soup

Initial parsing

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(res.text, "html.parser")
print(soup.prettify())
```
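A small inline document can stand in for `res.text` to try the parser without a network request:

```python
from bs4 import BeautifulSoup

# Inline markup stands in for a fetched page
html = '<html><body><a class="mnav" href="https://news.baidu.com">新闻</a></body></html>'
soup = BeautifulSoup(html, "html.parser")
pretty = soup.prettify()
print(pretty)
```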

Getting specific tags

```python
for i in soup.find_all('a'):
    print(i)
```

Getting specific content

```python
# Get the text and href attribute of <a> tags with class="mnav"
for i in soup.find_all('a', attrs={"class": "mnav"}):
    print(i.string + ' ' + i.attrs['href'])
```
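The same filtering works on any HTML; a self-contained sketch with inline markup:

```python
from bs4 import BeautifulSoup

html = '''
<a class="mnav" href="https://news.baidu.com">新闻</a>
<a class="mnav" href="https://map.baidu.com">地图</a>
<a class="other" href="https://example.com">其他</a>
'''
soup = BeautifulSoup(html, "html.parser")
# only the class="mnav" links are kept
links = [str(a.string) + ' ' + a.attrs['href']
         for a in soup.find_all('a', attrs={"class": "mnav"})]
print(links)
```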

Downloading images

```python
domIndex = 0
for dom in soup.find_all('a', attrs={"class": "card-img-hover"}):
    domIndex = domIndex + 1
    # the src attribute is already an absolute URL
    print(dom.img.attrs['src'] + ' | ' + dom.img.attrs['title'])

    img = requests.get(dom.img.attrs['src']).content
    with open('./source/zcool_' + str(domIndex) + '.jpg', 'wb') as file:
        file.write(img)
```
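One fragile spot in the loop above: `open()` fails if `./source` does not exist yet. A small helper (`save_image` is a hypothetical name) that creates the directory first and takes the already-downloaded bytes as input:

```python
import os

def save_image(data, index, out_dir="./source"):
    """Write downloaded image bytes to <out_dir>/zcool_<index>.jpg."""
    os.makedirs(out_dir, exist_ok=True)  # the bare open() crashes if the dir is missing
    path = os.path.join(out_dir, "zcool_" + str(index) + ".jpg")
    with open(path, "wb") as f:
        f.write(data)
    return path

# demo with a few fake JPEG bytes instead of a real download
p = save_image(b"\xff\xd8\xff", 1, out_dir="/tmp/zcool_demo")
print(p)  # /tmp/zcool_demo/zcool_1.jpg
```

In the loop, the body would come from `requests.get(dom.img.attrs['src']).content` as before.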

Example 1

Fetching the article list

```python
import requests
from bs4 import BeautifulSoup

target = "https://cloud.tencent.com/developer/articles/0"

res = requests.get(target)
print(res.status_code)
res.encoding = "utf-8"

print('-----------------------------------------------------------')

soup = BeautifulSoup(res.text, "html.parser")
# print(soup.prettify())
for dom in soup.find_all('div', attrs={"class": "com-3-article-panel"}):
    print('https://cloud.tencent.com' +
          dom.a.attrs['href'] + ' | ' + dom.h3.string)

print('-----------------------------------------------------------')
```

Output

```
200
-----------------------------------------------------------
https://cloud.tencent.com/developer/article/1006156 | 王者荣耀高并发背后的故事
https://cloud.tencent.com/developer/article/1004425 | MySQL 内核深度优化
https://cloud.tencent.com/developer/article/1004390 | 微信支付商户系统架构背后的故事
https://cloud.tencent.com/developer/article/1020337 | 包学会之浅入浅出Vue.js:开学篇
https://cloud.tencent.com/developer/article/1199082 | 给大家推荐几个常用谷歌浏览器插件(不用翻墙也能使用谷歌搜索了)
https://cloud.tencent.com/developer/article/1004881 | 你大概走了假敏捷:认真说说敏捷的实现和问题(手绘版)
https://cloud.tencent.com/developer/article/1151441 | 世界杯直播背后看不见的战斗:腾讯云极速高清技术部署实录
https://cloud.tencent.com/developer/article/1005607 | 整合 Django + Vue.js 框架快速搭建web项目
https://cloud.tencent.com/developer/article/1004377 | Memcached 与 Redis 实现的对比
https://cloud.tencent.com/developer/article/1158911 | 100行代码搞定短视频App,终于可以和美女合唱了。
https://cloud.tencent.com/developer/article/1004370 | 存储总量达 20T 的 MySQL 实例,如何完成迁移?
https://cloud.tencent.com/developer/article/1004383 | RabbitMQ进程结构分析与性能调优
https://cloud.tencent.com/developer/article/1004409 | 浅析海量用户的分布式系统设计(1)
https://cloud.tencent.com/developer/article/1333180 | 【教程】如何在腾讯云安装宝塔面板
https://cloud.tencent.com/developer/article/1020338 | 包学会之浅入浅出Vue.js:升学篇
https://cloud.tencent.com/developer/article/1004424 | 如何使用私有网络部署全球同服游戏服务
https://cloud.tencent.com/developer/article/1004374 | 腾讯云分布式高可靠消息队列 CMQ 架构
https://cloud.tencent.com/developer/article/1004367 | MySQL 数据库设计总结
https://cloud.tencent.com/developer/article/1004363 | HBase跨版本数据迁移总结
https://cloud.tencent.com/developer/article/1004423 | 基于用户画像大数据的电商防刷架构
-----------------------------------------------------------
```

Example 2

Downloading cover images

```python
import requests
from bs4 import BeautifulSoup

target = "https://www.zcool.com.cn/"

res = requests.get(target)
print('status_code:', res.status_code)
res.encoding = "utf-8"

print('-----------------------------------------------------------')

soup = BeautifulSoup(res.text, "html.parser")
# print(soup.prettify())
domIndex = 0
for dom in soup.find_all('a', attrs={"class": "card-img-hover"}):
    domIndex = domIndex + 1
    # the src attribute is already an absolute URL
    print(dom.img.attrs['src'] + ' | ' + dom.img.attrs['title'])

    img = requests.get(dom.img.attrs['src']).content
    with open('./source/zcool_' + str(domIndex) + '.jpg', 'wb') as file:
        file.write(img)

print('-----------------------------------------------------------')
```

BeautifulSoup Page Parsing

Selectors

CSS selectors

```python
doms = soup.select('a[href]')
```
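select() returns only the elements matching the CSS selector; a self-contained check:

```python
from bs4 import BeautifulSoup

html = '<a href="https://example.com">link</a><a name="top">no href</a>'
soup = BeautifulSoup(html, "html.parser")
doms = soup.select('a[href]')   # only <a> tags that actually carry an href
print(len(doms))  # 1
```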

Custom filter functions

```python
def a_has_href(tag):
    return tag.has_attr('href')

# pass the function to find_all; it keeps tags for which it returns True
doms = soup.find_all(a_has_href)
```

Multiple selectors

```python
soup.find_all(['a', 'base'])
```

Regex matching

Match at any position

```python
import re

re.search('xixi', 'www.xixi.com')
```
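Unlike re.match, re.search matches anywhere in the string, and find_all also accepts compiled patterns as attribute filters:

```python
import re
from bs4 import BeautifulSoup

# search matches anywhere; match only at the start of the string
assert re.search('xixi', 'www.xixi.com').span() == (4, 8)

# find_all can filter an attribute value with a compiled pattern
html = '<a href="https://www.xixi.com">x</a><a href="https://example.com">y</a>'
soup = BeautifulSoup(html, "html.parser")
links = soup.find_all('a', href=re.compile('xixi'))
print(len(links))  # 1
```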