Common Exceptions in Python requests

Date: 08-22

1. Connect timeout

The server does not answer the connection attempt within the specified time, so requests.exceptions.ConnectTimeout is raised.

requests.get('http://github.com', timeout=0.001) 

# Raises:
requests.exceptions.ConnectTimeout: HTTPConnectionPool(host='github.com', port=80): Max retries exceeded with url: / (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f1b16da75f8>, 'Connection to github.com timed out. (connect timeout=0.001)'))

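2. Connect and read timeouts

When the connect and read timeouts are given separately and the server accepts the connection but sends no data within the read timeout, requests.exceptions.ReadTimeout is raised. (If the connection itself cannot be established in time, the exception is requests.exceptions.ConnectTimeout, as in section 1.)

timeout=([connect timeout], [read timeout])
Connect: the client connects to the server so that it can send the HTTP request.
Read: the time the client waits for the server to send data, in practice the wait before the first byte arrives.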

requests.get('http://github.com', timeout=(6.05, 0.01)) 

# Raises:
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='github.com', port=80): Read timed out. (read timeout=0.01)
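
The two timeouts can be told apart in code by catching the two exception classes separately. The following is a minimal sketch, not a definitive implementation: the function name fetch_status, its return labels, and the injectable get parameter are assumptions made for illustration and are not part of requests.

```python
import requests


def fetch_status(url, connect_timeout=6.05, read_timeout=27.05, get=requests.get):
    """Return (label, detail) describing how the request ended.

    `get` is injectable (hypothetical design choice) so the function can be
    exercised without touching the network.
    """
    try:
        response = get(url, timeout=(connect_timeout, read_timeout))
    except requests.exceptions.ConnectTimeout:
        # The connection was not established within connect_timeout.
        return ("connect-timeout", connect_timeout)
    except requests.exceptions.ReadTimeout:
        # Connected, but the server sent no data within read_timeout.
        return ("read-timeout", read_timeout)
    return ("ok", response.status_code)
```

Calling fetch_status('http://github.com', 6.05, 0.01) would be expected to end in the read-timeout branch, matching the error shown above.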

3. Unknown server

The host name cannot be resolved, so requests.exceptions.ConnectionError is raised.

requests.get('http://github.comasf', timeout=(6.05, 27.05)) 

# Raises:
requests.exceptions.ConnectionError: HTTPConnectionPool(host='github.comasf', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f75826665f8>: Failed to establish a new connection: [Errno -2] Name or service not known',))

requests.exceptions.ConnectionError: HTTPSConnectionPool Max retries exceeded

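Causes:

The number of HTTP connections exceeds the maximum allowed. Connections are keep-alive by default, so the server ends up holding too many open connections and cannot accept new ones.
The IP has been blocked.
Requests are sent too quickly.

Fixes:

Turn off persistent connections by sending the header 'Connection': 'close'. Setting requests.adapters.DEFAULT_RETRIES = 5 is also commonly suggested, but HTTPAdapter picks up that default when the module is imported, so assigning it later is unlikely to change anything; the documented way to retry is to mount an HTTPAdapter(max_retries=...) on a Session.
If requests are sent too quickly, add a pause with time.sleep.
Use proxy IPs.
Requests sometimes fails to fetch the page or gets back a blank page; on a timeout, retry a few times inside a try...except block.
Upgrade requests: pip install --upgrade requests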
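
The retry and 'Connection': 'close' suggestions can be combined on one Session. This is a hedged sketch: make_session and its default values are hypothetical, and the list of status codes to retry is an assumption, not something taken from the article.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_session(total_retries=5, backoff=0.5):
    """Build a Session that retries failed requests with a growing delay."""
    retry = Retry(
        total=total_retries,
        backoff_factor=backoff,                       # pause grows between attempts
        status_forcelist=(429, 500, 502, 503, 504),   # assumed set of retryable statuses
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    # Ask the server to close the connection after each response.
    session.headers["Connection"] = "close"
    return session
```

A call such as make_session().get(url, timeout=(6.05, 27.05)) should still be wrapped in try...except requests.exceptions.ConnectionError, because the exception is raised once the retries are used up.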

HTTPSConnectionPool error when crawling through IP proxies with requests

import time
import random
import requests


USER_AGENTS = [
    "Mozilla/5.0 (iPad; U; CPU OS 4_2_1 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0b13pre) Gecko/20110307 Firefox/4.0b13pre",
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
    "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10"
]

headers = {
   "User-Agent": ""
}
# Rotate the User-Agent strings above to get past anti-crawler checks
s = requests.session()
s.keep_alive = False
requests.adapters.DEFAULT_RETRIES = 10


url = "https://baike.baidu.com/item/人工智能/9180?fromtitle=AI&fromid=25417&fr=aladdin"
for i in range(10):
    proxys = {
        # new_ips holds proxy addresses that were loaded earlier; that code is omitted here
        "https": "http://" + new_ips[i],
        "http": "http://" + new_ips[i]
    }
    headers['User-Agent'] = random.choice(USER_AGENTS)
    print(proxys)
    print(headers['User-Agent'])
    req = requests.get(url, headers=headers, verify = False, proxies = proxys, timeout = 20).content.decode('utf-8')
    print(req)
    
    time.sleep(5)

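With the code above I confirmed that the problem was not caused by unusable IPs.

Later I saw on Zhihu that someone, when passing the proxy dict to proxies, had written the "https" and "http" entries entirely in upper case. I tried it and the error did go away. Be aware that requests looks up proxies by the lower-case URL scheme, so upper-case keys most likely match nothing, and the request then bypasses the proxies in the dict rather than going through them.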

for i in range(10):
    proxys = {
        "HTTPS": "HTTP://"+ new_ips[i],
        "HTTP": "HTTP://" + new_ips[i]
        # Everything is upper case here!
    }
    headers['User-Agent'] = random.choice(USER_AGENTS)
    print(proxys)
    print(headers['User-Agent'])
    req = requests.get(url, headers=headers, verify = False, proxies = proxys, timeout = 20).content.decode('utf-8')
    print(req)
    
    time.sleep(5)

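A few pitfalls I ran into tonight:

In the proxies dict, every value starts with http:// regardless of whether its key is for HTTP or HTTPS; only the key says https.
If the site being crawled is served over https://, I had to pass verify=False to requests. Note that this turns off TLS certificate verification, so it is a workaround for certificate errors (for example behind an intercepting proxy) rather than something every HTTPS request needs.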
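
To see which entry of a proxies dict requests would pick for a given URL, without sending anything, the helper requests.utils.select_proxy can be called directly. This is a small sketch; the target URL and the proxy address are placeholders, not values from the article.

```python
from requests.utils import select_proxy

url = "https://example.com/"      # placeholder target
proxy = "http://10.0.0.1:8080"    # placeholder proxy address

lower = {"http": proxy, "https": proxy}
upper = {"HTTP": proxy, "HTTPS": proxy}

# The lookup uses the lower-case scheme of the URL as the key.
chosen_lower = select_proxy(url, lower)
chosen_upper = select_proxy(url, upper)

print(chosen_lower)   # http://10.0.0.1:8080
print(chosen_upper)   # None
```

A result of None means no entry in the dict applies to that URL, so the request would not use any proxy from that dict.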

4. Cannot connect to the proxy

The proxy server refuses the connection, or its port refuses connections or is not open; requests.exceptions.ProxyError is raised.

requests.get('http://github.com', timeout=(6.05, 27.05), proxies={"http": "192.168.10.1:800"}) 

# Raises:
requests.exceptions.ProxyError: HTTPConnectionPool(host='192.168.10.1', port=800): Max retries exceeded with url: http://github.com/ (Caused by ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fce3438c6d8>: Failed to establish a new connection: [Errno 111] Connection refused',)))

requests.exceptions.ProxyError: HTTPConnectionPool

Check the proxy settings

Use this command to see the proxies currently set on the machine:

$ env | grep -i proxy

Result:

https_proxy=127.0.0.1:8888
http_proxy=127.0.0.1:8888
socks_proxy=
ftp_proxy=

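Solution

The fix suggested online is almost always the same:

export http_proxy=''
export https_proxy=''

That did not work for me. A Google search turned up the following fix:

Edit the .bashrc file in the user's home directory

View .bashrc:

$ cat ~/.bashrc
# The last four lines are: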
export https_proxy='127.0.0.1:8888'
export http_proxy='127.0.0.1:8888'
export socks_proxy=''
export ftp_proxy=''

Edit .bashrc

$ vi ~/.bashrc
# Change the lines to the following
export https_proxy=''
export http_proxy=''
export socks_proxy=''
export ftp_proxy=''
# Save and quit

# Run the following command to apply the new .bashrc
$ source ~/.bashrc

Reportedly this also works (it disables the proxy):

session = requests.Session() 
session.trust_env = False 
response = session.get('http://ff2.pw') 


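Another option is the NO_PROXY environment variable:

import os

# Set os.environ['NO_PROXY'] to the domain of the target URL
os.environ['NO_PROXY'] = 'stackoverflow.com'

# To list several domains, separate them with commas
os.environ['NO_PROXY'] = 'stackoverflow.com,baidu.com'

NO_PROXY names the domains that should not go through the proxy.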
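
Whether a NO_PROXY value covers a given URL can be checked without making a request. The sketch below leans on requests.utils.should_bypass_proxies, a helper that requests uses internally; the wrapper name bypasses_proxy is hypothetical, and the list is passed explicitly instead of being read from the environment.

```python
from requests.utils import should_bypass_proxies

# Same comma-separated format as the NO_PROXY environment variable.
NO_PROXY = "stackoverflow.com,baidu.com"


def bypasses_proxy(url, no_proxy=NO_PROXY):
    """Return True if requests would skip the proxy for this URL."""
    return should_bypass_proxies(url, no_proxy=no_proxy)


print(bypasses_proxy("https://stackoverflow.com/questions"))
print(bypasses_proxy("http://www.baidu.com/"))
```

Both URLs above match an entry in the list by suffix, so both calls report that the proxy is skipped. For hosts not in the list the answer can also depend on system proxy settings.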

5. Timeout connecting to the proxy

The proxy server does not respond; requests.exceptions.ConnectTimeout is raised.

requests.get('http://github.com', timeout=(6.05, 27.05), proxies={"http": "10.200.123.123:800"}) 

# Raises:
requests.exceptions.ConnectTimeout: HTTPConnectionPool(host='10.200.123.123', port=800): Max retries exceeded with url: http://github.com/ (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7fa8896cc6d8>, 'Connection to 10.200.123.123 timed out. (connect timeout=6.05)'))

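6. Proxy read timeout

The connection to the proxy succeeded and the proxy forwarded the request to the target site, but the proxy timed out reading from the target site.

Even if the proxy itself responds quickly, a timeout at the target site is still reported against the proxy server.

Assuming the proxy is usable, timeout covers connecting to and reading from the proxy server; the client does not see whether the proxy's own connection and read to the target succeeded.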

requests.get('http://github.com', timeout=(2, 0.01), proxies={"http": "192.168.10.1:800"}) 

# Raises:
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='192.168.10.1:800', port=1080): Read timed out. (read timeout=0.5)

7. Network environment problems

Possibly caused by a dropped network connection; requests.exceptions.ConnectionError is raised.

requests.get('http://github.com', timeout=(6.05, 27.05)) 

# Raises:
requests.exceptions.ConnectionError: HTTPConnectionPool(host='github.com', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fc8c17675f8>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution',))

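8. Notes from the official documentation

You can tell requests to stop waiting for a response after a given number of seconds with the timeout parameter. Nearly all production code should use this parameter. Without it, your program may hang indefinitely: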
 
>>> requests.get('http://github.com', timeout=0.001)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
requests.exceptions.Timeout: HTTPConnectionPool(host='github.com', port=80): Request timed out. (timeout=0.001)
 
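timeout is not a time limit on the entire response download; rather, an exception is raised if the server has not answered within timeout seconds (more precisely, if no bytes have been received on the underlying socket for timeout seconds).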
 
 
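- In the event of a network problem (e.g. DNS failure, refused connection), Requests raises a requests.exceptions.ConnectionError exception.
- If the HTTP request returned an unsuccessful status code, Response.raise_for_status() raises an HTTPError exception.
- If a request times out, a Timeout exception is raised.
- If a request exceeds the configured maximum number of redirects, a TooManyRedirects exception is raised.
- All exceptions that Requests explicitly raises inherit from requests.exceptions.RequestException.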
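
Because the exceptions listed above form a hierarchy, handlers should test the most specific class first. The following is a minimal sketch of that ordering; the function classify and its labels are made up for illustration.

```python
from requests import exceptions as rexc


def classify(exc):
    """Map an exception to a short label, most specific class first."""
    if isinstance(exc, rexc.ConnectTimeout):
        # ConnectTimeout is both a ConnectionError and a Timeout.
        return "connect timeout"
    if isinstance(exc, rexc.ReadTimeout):
        return "read timeout"
    if isinstance(exc, rexc.Timeout):
        return "timeout"
    if isinstance(exc, rexc.ProxyError):
        # ProxyError is a ConnectionError, so it must be checked before it.
        return "proxy error"
    if isinstance(exc, rexc.ConnectionError):
        return "connection error"
    if isinstance(exc, rexc.HTTPError):
        return "bad HTTP status"
    if isinstance(exc, rexc.TooManyRedirects):
        return "too many redirects"
    if isinstance(exc, rexc.RequestException):
        return "other requests error"
    return "not a requests error"
```

In a crawler, a single except requests.exceptions.RequestException as exc: clause followed by classify(exc) would cover every case from sections 1 to 7.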