scrapy.Request(url[, callback=None, method='GET', headers=None, body=None,
               cookies=None, meta=None, encoding='utf-8', priority=0,
               dont_filter=False])
Note: in documentation, parameters shown inside square brackets are generally optional.
The most commonly used scrapy.Request parameters are:
callback: specifies which parse function the response for this url is handed to
meta: used to pass data between different parse functions; by default meta also carries some information of its own, such as the download delay and the request depth
dont_filter: tells Scrapy's deduplication not to filter out this url. Scrapy deduplicates urls by default, so this is important for urls that need to be requested repeatedly. A usage sketch follows this list.
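As an illustration, here is a minimal sketch of how these three parameters are typically used inside a spider (ExampleSpider, parse_detail and the URLs are made up for this example):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com/list']

    def parse(self, response):
        yield scrapy.Request(
            'https://example.com/detail/1',
            callback=self.parse_detail,   # this response is handled by parse_detail
            meta={'category': 'books'},   # data to hand over to parse_detail
            dont_filter=True,             # do not let the dedup filter drop this url
        )

    def parse_detail(self, response):
        # the meta dict set on the Request comes back on the response
        print(response.meta['category'])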
Commonly used Response attributes:
url: the URL of the HTTP response, str
status: the HTTP status code, int
headers: the HTTP response headers, a dict-like object that can be accessed with the get or getlist methods
body: the HTTP response body, bytes
text: the HTTP response body as text, str
response.text = response.body.decode(response.encoding)
encoding: the encoding of the HTTP response body
meta: the same as response.request.meta; when constructing a Request, information for the response-handling callback can be passed in through the meta parameter, and the callback then retrieves it through response.meta
selector: a Selector object used to extract data from the Response; mainly the handling of xpath/css results, discussed in more detail below
xpath(query): extract information from the page with an XPath query
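A minimal sketch (inside a hypothetical parse callback) of how these attributes are accessed:

def parse(self, response):
    print(response.url)                           # str, URL of the response
    print(response.status)                        # int, HTTP status code
    print(response.headers.get('Content-Type'))   # dict-like, supports get/getlist
    print(response.encoding)                      # encoding of the body
    print(response.body[:100])                    # bytes
    print(response.text[:100])                    # str, body decoded with response.encoding
    print(response.meta)                          # the same dict as response.request.meta
    print(response.xpath('//title/text()').get()) # extract data through the built-in selector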
1. Create the crawl spider: scrapy genspider -t crawl <spider name> <allowed domain>
2. Specify start_urls; URLs will be extracted from the corresponding responses according to rules
3. Complete rules by adding Rule objects
For example:
Rule(LinkExtractor(allow=r'/web/site8/tab5248/into\d+\.html*'),
     callback='parse_item'),
The LinkExtractor argument is the link extractor; it supports several ways of describing which links to extract, the most common being regular expressions (allow) and XPath (restrict_xpaths).
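As a quick way to check what a LinkExtractor actually matches, extract_links can be called directly. A sketch, assuming a page has already been fetched (for instance in scrapy shell, so that response exists); the allow pattern here is only an illustration:

from scrapy.linkextractors import LinkExtractor

le = LinkExtractor(allow=r'/people/\w+/')
for link in le.extract_links(response):
    print(link.url, link.text)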
This example crawls part of the personal profile information of the users who commented on a certain movie on Douban.
The spider code is as follows:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class DoubanSpider(CrawlSpider):
    name = 'douban'
    # 'douban.com' covers both movie.douban.com (comment pages) and www.douban.com
    # (profile pages); otherwise the offsite filter drops the pagination links
    allowed_domains = ['douban.com']
    start_urls = ['https://movie.douban.com/subject/26363254/comments?status=P']

    rules = (
        # follow links to the commenters' profile pages and parse them
        Rule(LinkExtractor(allow=r'https://www.douban.com/people/.*?/'), callback='parse_item'),
        # follow the "next page" link of the comment list
        Rule(LinkExtractor(restrict_xpaths='//*[@id="paginator"]/a[3]'), follow=True),
    )

    def parse_item(self, response):
        item = {}
        # drop the last three characters of the extracted user name
        item['name'] = response.xpath('//*[@id="db-usr-profile"]/div[2]/ul/li[1]/a/text()').extract_first()[:-3]
        item['region'] = response.xpath('//*[@id="profile"]/div/div[2]/div[1]/div/a/text()').extract_first()
        # drop the last two characters of the join-date text
        item['joinTime'] = response.xpath('//*[@id="profile"]/div/div[2]/div[1]/div/div/text()[2]').extract_first()[:-2]
        item['register'] = response.xpath('//*[@id="intro_display"]/text()').extract_first()
        print(item)
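Assuming the spider lives in a normal Scrapy project, it is started from the project directory with:

scrapy crawl douban

and prints one item dict for every commenter profile it reaches.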
Downloader middleware is Scrapy's hook for modifying Requests and Responses during a crawl and for extending Scrapy's functionality. For example:
adding certain headers to a request before it is downloaded;
decompressing or otherwise post-processing the response after the request has completed.
Middlewares are configured as key-value pairs under DOWNLOADER_MIDDLEWARES in settings.py: the key is the middleware to enable and the value is a number representing its priority; the lower the value, the higher the priority.
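A sketch of such a configuration in settings.py, assuming a custom middleware class RandomProxyMiddleware in the project's middlewares.py (both names are hypothetical); setting a built-in middleware to None disables it:

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RandomProxyMiddleware': 543,                  # hypothetical custom middleware
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,  # disable a built-in middleware
}

These values are merged with Scrapy's built-in downloader middlewares, whose default order values are defined in DOWNLOADER_MIDDLEWARES_BASE: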
DOWNLOADER_MIDDLEWARES_BASE = {
'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 400,
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': 560,
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
}
A downloader middleware mainly implements two methods.
1. process_request(request, spider)
Called for every Request that passes through the downloader middleware; middlewares with higher priority are called first. The method should return one of the following: None, a Response object, a Request object, or raise IgnoreRequest (see the sketch after this list):
Return None: Scrapy continues executing the corresponding methods of the other middlewares;
Return a Response object: Scrapy does not call the process_request methods of the other middlewares and does not perform the download; the Response object is returned directly;
Return a Request object: Scrapy does not call the process_request() methods of the other middlewares; the new Request is placed in the scheduler to be downloaded later;
Raise IgnoreRequest: the process_exception() methods of the installed middlewares are called; if none of them handle the exception, the Request's errback is called; if it is still unhandled, the request is ignored and not logged.
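A minimal sketch (BlockListMiddleware and the blocked URL are made up for illustration) of returning None versus returning a Response from process_request:

from scrapy.http import HtmlResponse

class BlockListMiddleware:
    BLOCKED = ('https://example.com/ads',)

    def process_request(self, request, spider):
        if request.url.startswith(self.BLOCKED):
            # returning a Response skips the remaining process_request methods and the
            # download itself; the Response goes straight into the process_response chain
            return HtmlResponse(url=request.url, body=b'blocked', encoding='utf-8')
        # returning None lets the request continue through the other middlewares
        return None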
2. process_response(request, response, spider)
Called for every Response that passes through the downloader middleware; middlewares with higher priority are called later, the opposite of process_request(). The method returns one of the following: a Response object, a Request object, or raises IgnoreRequest (see the sketch after this list).
Return a Response object: Scrapy continues calling the process_response methods of the other middlewares;
Return a Request object: the middleware chain stops and the Request is placed in the scheduler to be downloaded again;
Raise IgnoreRequest: the Request's errback is called; if the exception is not handled there, the request is ignored and not logged.
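A minimal sketch (the class name is made up) of the two possible return values of process_response:

class RetryOn503Middleware:
    def process_response(self, request, response, spider):
        if response.status == 503:
            # returning a Request stops the middleware chain and sends the request
            # back to the scheduler for another download attempt
            return request.replace(dont_filter=True)
        # returning the Response lets the remaining middlewares and, finally,
        # the spider callback continue to process it
        return response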
Downloader middlewares are most commonly used to set a random User-Agent and a random proxy IP, for example:
# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
import random
from scrapy import signals
# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter
class test:
    # a pool of User-Agent strings to choose from at random
    USER_AGENTS = [
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
        "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
        "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
        "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
        "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    ]

    def process_request(self, request, spider):
        # pick a random User-Agent for every outgoing request
        ua = random.choice(self.USER_AGENTS)
        request.headers['User-Agent'] = ua
        # route the request through a proxy (example address; replace with a working proxy)
        request.meta['proxy'] = 'http://117.86.88.64:3617'
        # returning None (implicitly) lets the request continue through the remaining middlewares

    def process_response(self, request, response, spider):
        # print the User-Agent actually used, to verify the rotation works
        print(request.headers['User-Agent'])
        return response
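To put this middleware into effect it still has to be enabled in settings.py; the module path below assumes the default project layout (middlewares.py inside a project package named myproject), so adjust it to the real project name:

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.test': 543,  # hypothetical path: <project package>.middlewares.<class name>
}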