
[Python] The four modules of the urllib library

Date: 08-16

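[Preface]

It has been a good while since I last wrote any Python. Today I worked through the urllib library fairly thoroughly, and honestly, writing Python still feels the most comfortable. Of course, requests, Beautiful Soup, and regular expressions are also unavoidable when it comes to processing scraped articles.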

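The urllib library

I: The four modules of urllib

1: request

The HTTP request module, used to simulate sending requests. Just as you would type a URL into a browser and press Enter, you pass a URL and any extra parameters to the library function and it carries out the same process.

2: error

The exception module. It defines the exceptions raised by the request module, so failed requests can be caught and handled.

3: parse

A utility module that provides many URL-handling functions, such as splitting, parsing, and joining.

4: robotparser

Mainly used to parse a site's robots.txt file and decide which pages may be crawled and which may not. It is rarely used.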

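1: urlopen()

"""
Purpose: use urllib.request to simulate a browser sending a request.
         Goal: fetch the source code of a web page.
"""
# Import the request module of urllib
import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(response.read().decode('utf-8'))  # read the response body and decode it as UTF-8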
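The object returned by urlopen() also carries the response headers, so the charset does not have to be hard-coded. A minimal sketch, assuming a data: URL as an offline stand-in for a real page; the URL and the UTF-8 fallback are illustrative choices, not part of the original example:

```python
import urllib.request

# Assumption: a data: URL stands in for a real web page so the sketch runs offline.
url = 'data:text/plain;charset=utf-8,hello%20urllib'

with urllib.request.urlopen(url) as response:
    # Prefer the charset declared in the Content-Type header; fall back to UTF-8.
    charset = response.headers.get_content_charset() or 'utf-8'
    text = response.read().decode(charset)

print(charset, text)
```

With a real http:// or https:// URL the same two lines inside the with block apply unchanged.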

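2: The data parameter

"""
Purpose: the data parameter of urlopen()
"""
import urllib.request  # request module
import urllib.parse  # URL utility module of urllib

# Send one form field: word=hello. urlopen() expects data as bytes, so:
#   1. urllib.parse.urlencode() converts the parameter dict into a string;
#   2. bytes(..., encoding='utf-8') converts that string into a byte stream.
# When data is supplied, the request is sent as POST instead of GET.
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf-8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print('\n', response.read())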
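The two conversion steps can be checked without sending anything. A small sketch, reusing the same field and httpbin URL; building a Request object does not touch the network:

```python
from urllib import parse, request

# urlencode() returns a str; the request body must be bytes, hence encode().
body = parse.urlencode({'word': 'hello'}).encode('utf-8')

# Constructing the Request performs no network I/O.
req = request.Request('http://httpbin.org/post', data=body)

print(body)              # b'word=hello'
print(req.get_method())  # POST, because data is not None
```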

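3: The timeout parameter

"""
Purpose: the timeout parameter of urlopen() (in seconds)
"""
import socket  # needed to identify the timeout exception
import urllib.error
import urllib.request

try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    # A timeout shows up as a URLError whose reason is socket.timeout
    if isinstance(e.reason, socket.timeout):
        print('Request timed out')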
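A timeout can surface either wrapped in a URLError (while connecting) or as a bare socket.timeout (for example while reading the body). A hedged sketch of a helper that recognizes both; is_timeout is a hypothetical name introduced here, not part of urllib:

```python
import socket
from urllib.error import URLError


def is_timeout(exc):
    """Return True if exc represents a timeout (hypothetical helper)."""
    # Bare timeout. Since Python 3.10, socket.timeout is an alias of TimeoutError.
    if isinstance(exc, socket.timeout):
        return True
    # Timeout wrapped by urllib: the original exception is stored in .reason.
    return isinstance(exc, URLError) and isinstance(exc.reason, socket.timeout)


print(is_timeout(URLError(socket.timeout('timed out'))))  # True
print(is_timeout(URLError('connection refused')))         # False
```

In a real script it would be called inside `except (URLError, socket.timeout) as e:`.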

II: The request.Request class

1: Basic usage

"""
Purpose: the Request class
class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
"""
"""
import urllib.request
request = urllib.request.Request('http://httpbin.org/get')  # build the request object
response = urllib.request.urlopen(request)  # urlopen() also accepts a Request object
print(response.read().decode('utf-8'))

"""
from urllib import request, parse  # request module and URL utilities

url = "http://httpbin.org/post"
headers = {
    # Pretend to be the Chrome browser
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.12 Mobile Safari/537.36',
    'Host': 'httpbin.org'
}
form = {
    'name': 'Germey'
}

data = bytes(parse.urlencode(form), encoding='utf-8')
# Alternative: build the Request without headers and add them afterwards with add_header()
# req = request.Request(url=url, data=data, method='POST')
# req.add_header('User-Agent', 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.12 Mobile Safari/537.36')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

Result:

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "name": "Germey"
  }, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Connection": "close", 
    "Content-Length": "11", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.12 Mobile Safari/537.36"
  }, 
  "json": null, 
  "origin": "182.245.65.138", 
  "url": "http://httpbin.org/post"
}
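The add_header() route can be inspected before anything is sent. A minimal sketch, assuming a shortened placeholder User-Agent value; note that the header name is spelled with a hyphen, and that Request stores header names in capitalized form:

```python
from urllib import parse, request

data = parse.urlencode({'name': 'Germey'}).encode('utf-8')
req = request.Request('http://httpbin.org/post', data=data, method='POST')

# Placeholder value (an assumption); the header name is "User-Agent", with a hyphen.
req.add_header('User-Agent', 'Mozilla/5.0 (example)')

# Header names are stored capitalized, so the lookup key is "User-agent".
print(req.header_items())
print(req.get_header('User-agent'))
```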

2: Advanced usage

1: Authentication

Some pages require authentication: opening them prompts for a username and password.

"""
Purpose: HTTP basic authentication
"""
# HTTPPasswordMgrWithDefaultRealm: maintains a table of usernames and passwords
# HTTPBasicAuthHandler: handles the authentication, using that password manager
# build_opener(): builds an Opener that supplies the credentials when sending requests
from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener
from urllib.error import URLError  # exception raised when the request fails

username = 'username'
password = 'password'
url = 'http://localhost:5000/'  # a URL that requires authentication

p = HTTPPasswordMgrWithDefaultRealm()
p.add_password(None, url, username, password)
auth_handler = HTTPBasicAuthHandler(p)  # the handler takes the password manager as its argument
opener = build_opener(auth_handler)

try:
    result = opener.open(url)  # open the URL through the opener
    html = result.read().decode('utf-8')
    print(html)
except URLError as e:
    print(e.reason)
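Which credentials the password manager would hand to the handler can be checked directly, without a server. A sketch using the same placeholder username, password, and localhost URL; the realm name and the example.com URL are made-up values for illustration:

```python
from urllib.request import HTTPPasswordMgrWithDefaultRealm

manager = HTTPPasswordMgrWithDefaultRealm()
# realm=None registers the credentials as the default for any realm under this URL.
manager.add_password(None, 'http://localhost:5000/', 'username', 'password')

# Any realm and any path below the registered URL matches.
match = manager.find_user_password('Any Realm', 'http://localhost:5000/admin')
# A different host has no credentials registered.
miss = manager.find_user_password('Any Realm', 'http://example.com/')

print(match)  # ('username', 'password')
print(miss)   # (None, None)
```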

2: Proxies

"""
Purpose: use a proxy server
1: ProxyHandler takes a dict: key = protocol type, value = proxy URL (several proxies can be added)
2: Build an opener from the handler with build_opener()
3: Send the request
"""
from urllib.error import URLError
from urllib.request import ProxyHandler, build_opener


proxy_handler = ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'http://127.0.0.1:9743'
})
# Build an opener with build_opener()
opener = build_opener(proxy_handler)
try:
    response = opener.open('https://www.baidu.com')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)

3: Cookies

"""
Purpose:
   Get a website's cookies
   1: Create a CookieJar object
   2: Build a handler with urllib.request.HTTPCookieProcessor(cookie)
   3: Build an opener with build_opener()
   4: Call open()

   Improved version: V2.0
   Replace CookieJar with MozillaCookieJar, which is needed when writing cookies to a file.
   It can load and save cookies in the cookie file format used by Mozilla browsers.

   Improved version: V3.0
   Save format: libwww-perl (LWP) cookie file
   To save in LWP format, just declare: cookie = http.cookiejar.LWPCookieJar(filename)

"""
import http.cookiejar, urllib.request

"""
cookie = http.cookiejar.CookieJar()  # 1: create the CookieJar object
handler = urllib.request.HTTPCookieProcessor(cookie)  # 2
opener = urllib.request.build_opener(handler)  # 3

response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name + "=" + item.value)
Result:
    BAIDUID=77B6920A1FCACD2B94C2905DD2B83C90:FG=1
    BIDUPSID=77B6920A1FCACD2B94C2905DD2B83C90
    H_PS_PSSID=1445_21110_22074
    PSTM=1537189285
    BDSVRTM=0
    BD_HOME=0
    delPer=0

"""

"""
# V2.0
filename = 'cookies.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)

response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

"""

"""
# V3.0
filename = 'cookies.txt'
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)

response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

"""
# Load and reuse the cookies saved in the V3.0 (LWP) format
cookie = http.cookiejar.LWPCookieJar()
# Use load() to read the local cookies file
cookie.load('cookies.txt', ignore_discard=True, ignore_expires=True)
# Build a handler and an opener that send the loaded cookies
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)

response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))  # print the source of the Baidu home page
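The save/load round trip can be tried without contacting any site. A hedged sketch: the cookie is built by hand (name, value, and domain are made-up values, and make_cookie is a hypothetical helper), saved in LWP format to a temporary file, then loaded into a fresh jar:

```python
import http.cookiejar
import os
import tempfile
import time


def make_cookie(name, value, domain):
    """Build a simple cookie by hand (hypothetical helper for this sketch)."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path='/', path_specified=True,
        secure=False, expires=int(time.time()) + 3600, discard=False,
        comment=None, comment_url=None, rest={})


with tempfile.TemporaryDirectory() as tmp:
    filename = os.path.join(tmp, 'cookies.txt')

    # Save: the same call as in V3.0, but with a hand-made cookie.
    jar = http.cookiejar.LWPCookieJar(filename)
    jar.set_cookie(make_cookie('session', 'abc123', 'example.com'))
    jar.save(ignore_discard=True, ignore_expires=True)

    # Load into a fresh jar, as in the "load and reuse" step.
    restored = http.cookiejar.LWPCookieJar()
    restored.load(filename, ignore_discard=True, ignore_expires=True)
    loaded = {c.name: c.value for c in restored}

print(loaded)
```

A file written by LWPCookieJar has to be read back with LWPCookieJar; MozillaCookieJar uses a different file format.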

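III: Exception handling

1: URLError

"""
Purpose: catch URLError, the base exception raised by the request module
"""
from urllib import request, error

try:
    response = request.urlopen('http://jiajiknag.com/index.html')
except error.URLError as e:
    print('Request failed:', e.reason)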

2: HTTPError

"""
Purpose:
        HTTPError is the subclass
        URLError is the parent class
        Catch the subclass error first, then the parent class error
"""
from urllib import request, error

try:
    response = request.urlopen('http://jiajiknag.com/index.html')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request successful')
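Why the order of the except clauses matters can be shown offline by raising the exceptions by hand. A minimal sketch; describe is a hypothetical helper, and the URL, status code, and messages are made-up values:

```python
from email.message import Message
from urllib.error import HTTPError, URLError


def describe(exc):
    """Raise exc and report which except clause caught it (hypothetical helper)."""
    try:
        raise exc
    except HTTPError as e:   # subclass first: has code, reason and headers
        return f'HTTP error {e.code}: {e.reason}'
    except URLError as e:    # parent class: only reason
        return f'URL error: {e.reason}'


print(issubclass(HTTPError, URLError))  # True
print(describe(HTTPError('http://example.com/x', 404, 'Not Found', Message(), None)))
print(describe(URLError('no route to host')))
```

If the URLError clause came first, it would also catch every HTTPError and the status code would never be reported.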

IV: Parsing URLs

1: urlparse()

"""
Purpose: split a URL into its components
"""
from urllib.parse import urlparse  # identifies and splits the parts of a URL

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(type(result), result)

Result:

<class 'urllib.parse.ParseResult'> ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

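urlparse() splits the URL into six parts:

whatever precedes :// is the scheme, i.e. the protocol;

what follows, up to the first /, is the netloc, i.e. the domain name;

after that comes the path, i.e. the access path;

after the semicolon ; come the params;

after the question mark ? comes the query, commonly used in GET-style URLs;

after the hash sign # comes the fragment (anchor), used to jump directly to a position inside the page.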
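The six parts follow the general layout scheme://netloc/path;params?query#fragment. A short sketch showing that the result is a named tuple, so each part can be read by name, by index, or by unpacking (the URL is an illustrative example):

```python
from urllib.parse import urlparse

# General layout: scheme://netloc/path;params?query#fragment
result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')

# ParseResult is a named tuple: fields are available by name and by index.
print(result.scheme, result[0])   # http http
print(result.netloc, result[1])   # www.baidu.com www.baidu.com

# It can also be unpacked like any 6-tuple.
scheme, netloc, path, params, query, fragment = result
print(path, params, query, fragment)  # /index.html user id=5 comment
```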

Other arguments of urlparse()

1: urlstring (required; the URL to parse)

2: scheme (the default scheme, used only when the URL itself has none)

3: allow_fragments (whether to parse the fragment separately)

"""
Purpose: the scheme and allow_fragments arguments of urlparse()
"""
"""
from urllib.parse import urlparse  # identifies and splits the parts of a URL
result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(type(result), result)
"""
"""
scheme: the default scheme
from urllib.parse import urlparse  # identifies and splits the parts of a URL
# The URL already carries a scheme (http), so the default given here is ignored
result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', scheme='https')
# print(type(result), result)
print(result)
Result: ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')
"""

# allow_fragments: whether to parse the fragment separately. If it is set to False, the fragment
# is not split off: it is parsed as part of the path, params, or query, and the fragment field is empty.
from urllib.parse import urlparse  # identifies and splits the parts of a URL

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', allow_fragments=False)
# print(type(result), result)
print(result)
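The effect of both arguments is easier to see side by side. A small sketch with illustrative URLs: the first call omits the scheme from the URL so the default applies, the other two disable fragment parsing:

```python
from urllib.parse import urlparse

# No scheme in the URL, so the default is used. Without a leading "//" the
# host is not recognized as netloc and ends up in path.
no_scheme = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(no_scheme.scheme, repr(no_scheme.netloc), no_scheme.path)

# allow_fragments=False: "#comment" stays attached to the query...
in_query = urlparse('http://www.baidu.com/index.html;user?id=5#comment',
                    allow_fragments=False)
print(in_query.query, repr(in_query.fragment))

# ...or to the path when there is no query or params.
in_path = urlparse('http://www.baidu.com/index.html#comment', allow_fragments=False)
print(in_path.path, repr(in_path.fragment))
```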

2: urlunparse()

"""
Purpose: urlparse() has a counterpart, urlunparse(). It takes an iterable
         whose length must be exactly 6; otherwise it raises an error about
         too few or too many values.
"""
from urllib.parse import urlunparse

data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))

Result:

http://www.baidu.com/index.html;user?a=6#comment
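A brief sketch of the two properties described above: urlunparse() reverses urlparse(), and an iterable of the wrong length is rejected. The exception is caught here as ValueError, which is what the tuple unpacking inside CPython raises; treat the exact message as an implementation detail:

```python
from urllib.parse import urlparse, urlunparse

url = 'http://www.baidu.com/index.html;user?a=6#comment'

# Round trip: parsing and un-parsing gives back the same URL.
rebuilt = urlunparse(urlparse(url))
print(rebuilt == url)  # True

# Only 5 components instead of 6: rejected.
try:
    urlunparse(['http', 'www.baidu.com', 'index.html', 'a=6', 'comment'])
    wrong_length_rejected = False
except ValueError as exc:
    wrong_length_rejected = True
    print('rejected:', exc)
```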

3: urlsplit()

"""
Purpose: very similar to urlparse(), except that it does not parse params
         separately, so it returns only 5 components.
"""
from urllib.parse import urlsplit

result = urlsplit('http://www.baidu.com/index.html;user?id=5#comment')
# The return value is a SplitResult, which is also a tuple type: its values can be
# read either by attribute or by index.
print(result)
print(result.scheme, result[0])
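The difference from urlparse() shows up in the path field. A minimal comparison on the same illustrative URL:

```python
from urllib.parse import urlparse, urlsplit

url = 'http://www.baidu.com/index.html;user?id=5#comment'

split = urlsplit(url)    # 5 fields: params stay inside path
parsed = urlparse(url)   # 6 fields: params split off

print(len(split), split.path)                   # 5 /index.html;user
print(len(parsed), parsed.path, parsed.params)  # 6 /index.html user
```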

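4: urlunsplit()

"""
Purpose:
     It likewise assembles the parts of a URL into a complete URL. The argument is
     also an iterable, such as a list or tuple; the only difference is that its
     length must be 5.
"""
from urllib.parse import urlunsplit

data = ['http', 'www.baidu.com', 'index.html', 'a=6', 'comment']
print(urlunsplit(data))

Result: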

http://www.baidu.com/index.html?a=6#comment

5: urljoin()

"""
Purpose:
     Another way to build a URL is urljoin(). We pass a base_url (the base URL)
     as the first argument and the new URL as the second argument.
     The function analyzes the scheme, netloc, and path of base_url, fills in whatever
     the new URL is missing, and returns the result.
"""
from urllib.parse import urljoin

print(urljoin('http://www.baidu.com', 'FAQ.html'))
print(urljoin('http://www.baidu.com', 'https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html?question=2'))
print(urljoin('http://www.baidu.com?wd=abc', 'https://cuiqingcai.com/index.php'))
print(urljoin('http://www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com#comment', '?category=2'))

Result:

http://www.baidu.com/FAQ.html
https://cuiqingcai.com/FAQ.html
https://cuiqingcai.com/FAQ.html
https://cuiqingcai.com/FAQ.html?question=2
https://cuiqingcai.com/index.php
http://www.baidu.com?category=2#comment
www.baidu.com?category=2#comment
www.baidu.com?category=2
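When crawling, the second argument is usually a relative link taken from a page. A short sketch of how such links resolve against a base URL; the paths and the cdn.example.com host are made-up values:

```python
from urllib.parse import urljoin

base = 'http://www.baidu.com/a/b.html'

sibling = urljoin(base, 'c.html')                   # relative to the base directory
parent = urljoin(base, '../c.html')                 # ".." climbs one level
rooted = urljoin(base, '/c.html')                   # leading "/" replaces the whole path
other_host = urljoin(base, '//cdn.example.com/x.js')  # only the scheme is inherited

print(sibling)     # http://www.baidu.com/a/c.html
print(parent)      # http://www.baidu.com/c.html
print(rooted)      # http://www.baidu.com/c.html
print(other_host)  # http://cdn.example.com/x.js
```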

6: urlencode()

"""
Purpose: declare a dict that holds the parameters, then call urlencode() to
serialize it into GET request parameters.
"""
from urllib.parse import urlencode

params = {
    'name': 'germey',
    'age': 22
}

base_url = 'http://www.baidu.com?'  # the base URL
url = base_url + urlencode(params)
print(url)

Result:

http://www.baidu.com?name=germey&age=22
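urlencode() also takes care of escaping. A small sketch with illustrative parameter names and values, showing spaces, non-ASCII text, and list values (which need doseq=True to be expanded into repeated keys):

```python
from urllib.parse import urlencode

# Spaces become "+", non-ASCII text is percent-encoded as UTF-8.
escaped = urlencode({'q': 'python urllib', 'wd': '中文'})
print(escaped)   # q=python+urllib&wd=%E4%B8%AD%E6%96%87

# doseq=True expands a list into one key=value pair per element.
repeated = urlencode({'tag': ['a', 'b']}, doseq=True)
print(repeated)  # tag=a&tag=b
```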

7: parse_qs()

"""
Purpose: parse_qs() converts a query string back into a dict, for example:
"""
from urllib.parse import parse_qs

query = 'name=jiajikang&age=20'
print(query)
print(parse_qs(query))

Result:

name=jiajikang&age=20
{'name': ['jiajikang'], 'age': ['20']}

8: parse_qsl()

"""
Purpose: parse_qsl() converts the parameters into a list of tuples.
"""
from urllib.parse import parse_qsl

query = 'name=jiajiknag&age=22'
print(parse_qsl(query))

Result:

[('name', 'jiajiknag'), ('age', '22')]
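parse_qs() returns lists because a key may occur more than once, and both functions drop empty values unless told otherwise. A brief sketch with made-up query strings:

```python
from urllib.parse import parse_qs, parse_qsl

# Repeated keys are collected into one list.
multi = parse_qs('tag=a&tag=b&name=x')
print(multi)    # {'tag': ['a', 'b'], 'name': ['x']}

# Blank values are dropped by default; keep_blank_values=True keeps them.
dropped = parse_qs('a=&b=1')
kept = parse_qs('a=&b=1', keep_blank_values=True)
print(dropped)  # {'b': ['1']}
print(kept)     # {'a': [''], 'b': ['1']}

# For a flat dict with single values, wrap parse_qsl() in dict().
flat = dict(parse_qsl('name=x&age=22'))
print(flat)     # {'name': 'x', 'age': '22'}
```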

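9: quote()

"""
Purpose:
      Converts text into URL-encoded form. When a URL carries Chinese parameters,
      the text can sometimes come out garbled; this function converts the Chinese
      characters to URL encoding.
"""
from urllib.parse import quote

keyword = '贾继康'
url = 'https://www.baidu.com/s?wd=' + quote(keyword)
print(url)

Result: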

https://www.baidu.com/s?wd=%E8%B4%BE%E7%BB%A7%E5%BA%B7

10: unquote()

"""
Purpose: URL-decodes a string
"""
from urllib.parse import unquote

url = 'https://www.baidu.com/s?wd=%E8%B4%BE%E7%BB%A7%E5%BA%B7'
print(unquote(url))

Result:

https://www.baidu.com/s?wd=贾继康
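By default quote() leaves "/" unescaped, which matters when the text is a query value rather than a path. A short sketch of the safe parameter and of quote_plus(), using made-up strings:

```python
from urllib.parse import quote, quote_plus, unquote

path_style = quote('a b/c')            # "/" is in the default safe set
value_style = quote('a b/c', safe='')  # escape "/" as well
form_style = quote_plus('a b/c')       # space becomes "+", "/" is escaped

print(path_style)   # a%20b/c
print(value_style)  # a%20b%2Fc
print(form_style)   # a+b%2Fc

# unquote() reverses quote().
print(unquote(value_style))  # a b/c
```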

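V: Working with the Robots protocol (the robotparser module of urllib)

1: The Robots protocol

The Robots protocol, also called the crawler protocol or robot protocol, is formally known as the Robots Exclusion Protocol. It tells crawlers and search engines which pages may be fetched and which may not. It usually takes the form of a text file named robots.txt placed in the root directory of the website. When a search crawler visits a site, it first checks whether robots.txt exists in the site's root directory. If it does, the crawler crawls within the scope defined there. If the file is not found, the crawler visits every page that is directly accessible.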

2: Crawler names

Crawler name    Name          Website
BaiduSpider     Baidu         www.baidu.com
Googlebot       Google        www.google.com
360Spider       360 Search    www.so.com
YodaoBot        Youdao        www.youdao.com
ia_archiver     Alexa         www.alexa.cn

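3: robotparser (checking whether a page may be crawled)

Once we understand the Robots protocol, we can use the robotparser module to parse robots.txt. The module provides a class, RobotFileParser, which uses a site's robots.txt file to decide whether a given crawler is allowed to fetch a given page.

The class is simple to use: just pass the robots.txt URL to the constructor. Its declaration is:

urllib.robotparser.RobotFileParser(url='')

The URL can also be omitted at construction time (it defaults to an empty string) and set afterwards with set_url().

The most commonly used methods of this class are listed below.

1: set_url(): sets the URL of the robots.txt file. If the URL was already passed in when the RobotFileParser object was created, there is no need to call this method.

2: read(): fetches the robots.txt file and parses it. Note that this method performs the actual fetching and parsing; if it is not called, every subsequent check returns False, so be sure to call it. It returns nothing, but it does perform the read.

3: parse(): parses robots.txt content. Its argument is the lines of a robots.txt file, which it analyzes according to the robots.txt syntax rules.

4: can_fetch(): takes two arguments, the User-agent and the URL to fetch, and returns True or False depending on whether that crawler may fetch the URL.

5: mtime(): returns the time robots.txt was last fetched and parsed. This matters for long-running crawlers, which may need to re-fetch the latest robots.txt periodically.

6: modified(): also useful for long-running crawlers; it sets the time robots.txt was last fetched and parsed to the current time.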

"""
Purpose: check whether a page may be crawled
          1: create a RobotFileParser() object
          2: set the robots.txt URL with set_url()
"""

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('http://www.jianshu.com/robots.txt')
rp.read()
print(rp.can_fetch('*', 'http://www.jianshu.com/p/b67554025d7d'))
print(rp.can_fetch('*', 'http://www.jianshu.com/search?q=python&page=1&type=collections'))

Result:

False
False
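The parse() method makes it possible to try rules without fetching anything. A minimal sketch, assuming a made-up robots.txt body and example.com URLs; it also shows that a parser which has neither read nor parsed anything answers False:

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration (an assumption, not taken from a real site).
rules = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())  # parse() takes the lines of a robots.txt file

allowed = rp.can_fetch('*', 'http://example.com/public/page.html')
blocked = rp.can_fetch('*', 'http://example.com/private/data.html')
print(allowed, blocked)  # True False

# Without read() or parse(), every check returns False.
unread = RobotFileParser()
print(unread.can_fetch('*', 'http://example.com/public/page.html'))  # False

# parse() also records the time of the check, which mtime() reports.
print(rp.mtime() > 0)  # True
```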
