Using Python's Basic Crawler Libraries

Python crawlers

Published: 2019-04-13


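1. The urllib library

Python 3 merges the urllib and urllib2 libraries of Python 2 into a single urllib library, which is split into four modules.

request: the module for sending HTTP requests.

error: the exception-handling module. If a request fails, the exception can be caught and handled.

parse: utilities for working with URLs, such as splitting, parsing and joining.

robotparser: reads a site's robots.txt file and determines which pages may be crawled and which may not.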
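As a rough illustration of the parse and robotparser modules described above, the sketch below splits a URL, joins a relative link, and checks two URLs against robots.txt rules. The example.com URLs and the rules themselves are made up for the demonstration, and the rules are passed in directly rather than downloaded, so nothing here touches the network.

```python
from urllib.parse import urlparse, urljoin
from urllib.robotparser import RobotFileParser

# Split a URL into its components (the URL is invented for illustration).
parts = urlparse('http://www.example.com/index.html;user?id=5#comment')
print(parts.scheme, parts.netloc, parts.path, parts.params, parts.query, parts.fragment)

# Resolve a relative link against a base URL.
joined = urljoin('http://www.example.com/docs/', 'faq.html')
print(joined)

# Feed robots.txt rules in as lines instead of fetching them from a site.
rules = RobotFileParser()
rules.parse([
    'User-agent: *',
    'Disallow: /private/',
])
blocked = rules.can_fetch('*', 'http://www.example.com/private/data.html')
allowed = rules.can_fetch('*', 'http://www.example.com/index.html')
print(blocked, allowed)
```

For a real site, RobotFileParser would normally be pointed at the robots.txt URL with set_url() and loaded with read() before calling can_fetch().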

2. The request module: sending requests

The request module simulates a browser sending HTTP requests, and also handles authentication, redirects, cookies and more.

2.1 urlopen()

import urllib.request

response=urllib.request.urlopen('http://www.baidu.com')
print(response.read().decode('utf-8'))

print(type(response))
print(response.getheaders())
print(response.status)
print(response.getheader('Server'))

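The code above sends an HTTP request to the site and stores the response in response, which is an HTTPResponse object. It provides methods such as read(), readinto(), getheader(name), getheaders() and fileno(), and attributes such as msg, version and status. type(response) shows the type of the object: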

<class 'http.client.HTTPResponse'>

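Note: response.getheaders() returns all of the response headers, while response.getheader('name') returns the value of the single named header. The second method has no trailing s.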
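To try the difference between getheader() and getheaders() without depending on an external site, one option is to start a throwaway HTTP server on the loopback address. The following is only a sketch: HelloHandler and its fixed "hello" body are hypothetical names invented for this example, and it assumes no proxy is configured for 127.0.0.1.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a short UTF-8 body."""

    def do_GET(self):
        body = 'hello'.encode('utf-8')
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain; charset=utf-8')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


# Port 0 asks the operating system for any free port.
server = HTTPServer(('127.0.0.1', 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = 'http://127.0.0.1:%d/' % server.server_port
with urllib.request.urlopen(url) as response:
    status = response.status
    content_type = response.getheader('Content-Type')  # one header, looked up by name
    all_headers = response.getheaders()                # list of (name, value) tuples
    text = response.read().decode('utf-8')

server.shutdown()
server.server_close()
print(status, content_type, text)
print(all_headers)
```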

Function signature:

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

Parameters:

data is the payload sent with a POST request, for example form fields. It must be passed as bytes.

print(urllib.parse.urlencode({'name':'qin'}))

urllib.parse.urlencode converts a dictionary into a query string. The output here is

name=qin

To pass the result as data, convert it to bytes:

data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
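The encoding step can be checked on its own, without sending anything. The sketch below uses sample field values chosen for illustration and also decodes the string back with parse_qs to show the round trip.

```python
from urllib.parse import urlencode, parse_qs

form = {'word': 'hello', 'name': 'qin'}   # sample fields, chosen for illustration
query = urlencode(form)                    # 'word=hello&name=qin'
data = query.encode('utf-8')               # bytes, ready to pass as data=
decoded = parse_qs(query)                  # back to a dict of lists

spaced = urlencode({'q': 'hello world'})   # spaces are encoded as '+'

print(query)
print(data)
print(decoded)
print(spaced)
```

query.encode('utf-8') and bytes(query, encoding='utf8') produce the same bytes; which one to use is a matter of taste.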

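The timeout parameter:

Sets the timeout in seconds. It applies to HTTP, HTTPS and FTP requests. If it is omitted, the global default timeout is used.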

import socket
import urllib.request
import urllib.error

try:
    response=urllib.request.urlopen('****',timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason,socket.timeout):
        print('TIME OUT')
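One detail worth knowing: as far as I can tell from CPython's implementation, a timeout while connecting or sending is wrapped in URLError, but a timeout while waiting for the response can surface as a bare socket.timeout. The sketch below handles both cases. fetch_or_none is a hypothetical helper, and the silent listening socket is just a stand-in for a server that never answers; it assumes no proxy is configured for 127.0.0.1.

```python
import socket
import urllib.error
import urllib.request


def fetch_or_none(url, timeout):
    """Return the response body, or None if the request timed out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError as e:
        # Timeouts while connecting or sending arrive wrapped in URLError.
        if isinstance(e.reason, socket.timeout):
            return None
        raise
    except socket.timeout:
        # Timeouts while waiting for the response may be raised directly.
        return None


# A listening socket that accepts connections but never replies.
silent = socket.socket()
silent.bind(('127.0.0.1', 0))
silent.listen(1)
url = 'http://127.0.0.1:%d/' % silent.getsockname()[1]

result = fetch_or_none(url, timeout=0.5)
silent.close()
print(result)
```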

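Other parameters:

context specifies the SSL settings and must be an ssl.SSLContext object. cafile and capath specify a CA certificate file and a directory of CA certificates for HTTPS requests. They have been deprecated since Python 3.6 in favor of context and were removed in Python 3.13.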
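A minimal sketch of preparing a context for urlopen follows. The CA file name and the https://www.python.org URL in the comments are assumptions added for illustration; the lines that need a real file or network access are left commented out.

```python
import ssl
import urllib.request

# The default context verifies the server certificate and the host name.
context = ssl.create_default_context()
print(context.verify_mode == ssl.CERT_REQUIRED, context.check_hostname)

# To trust a specific CA bundle instead, pass its path (hypothetical file name):
# context = ssl.create_default_context(cafile='my-ca.pem')

# The context is then handed to urlopen (not executed here, needs network access):
# response = urllib.request.urlopen('https://www.python.org', context=context)
```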

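2.2 Sending requests with a Request object

urlopen() can handle the most basic requests, but it is not enough to build a complete one. To add headers and other information, use the more powerful Request class.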

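The Request class:

class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

url is required. data is optional and must be a byte stream (bytes); a dictionary has to be encoded first with urlencode() from the urllib.parse module. The third parameter, headers, is a dictionary of request headers. origin_req_host is the host name or IP address of the requester. The sixth parameter, method, is the HTTP method, such as GET or POST.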
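A Request object can be built and inspected without sending it, which makes it easy to see how data and method interact. This is only a sketch: the 'demo-agent' header value is a placeholder invented for the example.

```python
from urllib import parse, request

url = 'http://httpbin.org/post'

# Without data, the request defaults to GET.
plain = request.Request(url)
print(plain.get_method())            # GET

# With data (bytes) and no explicit method, it becomes POST.
payload = parse.urlencode({'name': 'qinjian'}).encode('utf-8')
posted = request.Request(url, data=payload, headers={'User-Agent': 'demo-agent'})
print(posted.get_method())           # POST
print(posted.data)                   # b'name=qinjian'

# An explicit method overrides the default.
put = request.Request(url, data=payload, method='PUT')
print(put.get_method())              # PUT

print(posted.full_url, posted.host)
```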

Example:

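from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}

# Form fields to send; named form_data to avoid shadowing the built-in dict
form_data = {
    'name': 'qinjian'
}

# urlencode() builds the query string; bytes() turns it into the byte stream that data requires
data = bytes(parse.urlencode(form_data), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

Headers can also be added with the add_header() method: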

req = request.Request(url=url, data=data, method='POST')
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')
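When reading a header back from a Request, note that the name is stored in capitalized form, so 'User-Agent' is kept as 'User-agent'. The sketch below shows the lookup; 'demo-agent' is a placeholder value, and nothing is sent over the network.

```python
from urllib import request

req = request.Request('http://httpbin.org/post', method='POST')
req.add_header('User-Agent', 'demo-agent')

# Header names are stored via str.capitalize(), so look them up as 'User-agent'.
print(req.has_header('User-agent'))
print(req.has_header('User-Agent'))
print(req.get_header('User-agent'))
print(req.header_items())
```

This only affects lookups on the Request object itself; HTTP header names are case-insensitive on the wire.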