Python Web Crawling

My crawler is itching to get going.

Hands-on Web Scraping with the Requests Library

Installing the Requests library

In a terminal, run pip install requests
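To confirm the install succeeded, importing the library and printing its version is enough (a quick check, not part of the original tutorial):

import requests
print(requests.__version__)  # prints the installed Requests version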

A general code framework for fetching web pages

import requests

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raise HTTPError if the status code is not 200
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return "request failed"

if __name__ == "__main__":
    url = "http://www.baidu.com"
    print(getHTMLText(url))
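One design note: returning a plain string on failure makes errors indistinguishable from page content that happens to match it. A minimal variant (my sketch, not the course's framework) returns None instead and narrows the bare except to Requests' own exception class:

import requests

def get_html_text(url):
    # Sketch: same framework, but errors yield None instead of a string,
    # and only Requests-related exceptions are swallowed.
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return None

html = get_html_text("http://www.baidu.com")
print(len(html) if html is not None else "request failed")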

Simulating a browser when sending HTTP requests to a server

Some sites inspect the request headers and refuse access to crawlers, so we change the User-Agent field to Mozilla/5.0, a generic browser token.

import requests

url = "http://ip138.com"
try:
    kv = {'user-agent': 'Mozilla/5.0'}
    r = requests.get(url, headers=kv)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[1000:2000])
except:
    print("scraping failed")

Submitting Keywords to Baidu/360 Search

Automatically submit a keyword to a search engine and fetch the results.

Baidu search code

import requests

keyword = "gkdoe"
try:
    kv = {'wd': keyword}
    r = requests.get("https://www.baidu.com/s", params=kv)
    print(r.request.url)  # the final URL with the encoded query string
    r.raise_for_status()
    print(len(r.text))
except:
    print("scraping failed")

360 search code

import requests

keyword = "gkdoe"
try:
    kv = {'q': keyword}
    r = requests.get("https://www.so.com/s", params=kv)
    print(r.request.url)  # the final URL with the encoded query string
    r.raise_for_status()
    print(len(r.text))
except:
    print("scraping failed")

Scraping and Saving Images from the Web

Image scraping code

import requests
import os

url = "http://image.nationalgeographic.com.cn/2017/0402/20170402065331835.jpeg"
root = "E://pictures//"  # save into the pictures folder on drive E
path = root + url.split('/')[-1]  # file name taken from the last URL segment
try:
    if not os.path.exists(root):
        os.mkdir(root)
    if not os.path.exists(path):
        r = requests.get(url)
        with open(path, 'wb') as f:  # the with block closes the file automatically
            f.write(r.content)
        print("file saved")
    else:
        print("file already exists")
except:
    print("scraping failed")

Automatic Lookup of an IP Address's Location

import requests

url = "http://ip138.com/ips138.asp?ip="
try:
    r = requests.get(url + '49.221.17.13')
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[-500:])  # print the last 500 characters
except:
    print("scraping failed")

The Beautiful Soup Library

Installing Beautiful Soup

In a terminal, run pip install beautifulsoup4

Parsing with two lines of code

from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>data</p>', 'html.parser')

The first argument is the HTML to be parsed by BeautifulSoup; the placeholder <p>data</p> stands in for real page content here.
The second argument names the parser to use; html.parser is the one built into Python's standard library.
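A short sketch of what the parsed soup object then exposes (standard Beautiful Soup accessors):

from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>data</p>', 'html.parser')
print(soup.p)           # the first <p> tag: <p>data</p>
print(soup.p.string)    # its text content: data
print(soup.prettify())  # the document re-serialized with indentation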