Web Crawling: Advanced Scrapy (Using CrawlSpider Pagination Rules)

Date: 08-16

Scrapy Shell

The Scrapy shell is an interactive console in which we can try out and debug scraping code without starting a spider.

Starting the Scrapy Shell

scrapy shell "https://hr.tencent.com/position.php?&start=0#a"

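Selectors

Scrapy Selectors have built-in support for XPath and CSS selector expressions.
A Selector has four basic methods, of which xpath() is the most commonly used:
	xpath(): takes an XPath expression and returns a selector list of all nodes matching it
	extract(): serializes the matched nodes to Unicode strings and returns them as a list; extract_first()/get() return only the first match
	css(): takes a CSS expression and returns a selector list of all nodes matching it; the syntax is the same as in BeautifulSoup4
	re(): extracts data using the given regular expression and returns a list of Unicode strings

# Using xpath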
response.xpath('//title')
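
The difference between "all matches" and "first match" can be tried without Scrapy installed. The sketch below is only an analogy: it swaps in the standard library's xml.etree.ElementTree, which supports a limited XPath subset, and the sample page is hypothetical. In the Scrapy shell the corresponding calls would be response.xpath(...).getall(), response.xpath(...).get() and response.xpath(...).re(...).

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (not taken from any real site).
PAGE = """
<html>
  <head><title>Job list</title></head>
  <body>
    <ul class="jobs">
      <li><a href="/position.php?id=1">Engineer</a></li>
      <li><a href="/position.php?id=2">Designer</a></li>
    </ul>
  </body>
</html>
"""

root = ET.fromstring(PAGE.strip())

# Like xpath(...).getall(): every matching node, collected into a list.
titles = [a.text for a in root.findall(".//ul[@class='jobs']/li/a")]

# Like xpath(...).get(): only the first match, or None when nothing matches.
first = root.find(".//ul[@class='jobs']/li/a")
first_title = first.text if first is not None else None

# Like re(): apply a regular expression to the extracted strings.
hrefs = [a.get("href") for a in root.findall(".//a")]
ids = [i for h in hrefs for i in re.findall(r"id=(\d+)", h)]

print(titles, first_title, ids)
```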

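The Spider Class

The Spider class defines how a site (or a group of sites) is crawled, including the crawling actions (for example, whether to follow links) and how to extract structured data (items) from page content. In other words, a Spider is where you define the crawling behavior and the parsing logic for a particular page or set of pages.

scrapy.Spider is the most basic spider class; every spider you write must inherit from it.

The main functions, in the order they are called:
__init__():
	initializes the spider name and the start_urls list
start_requests():
	calls make_requests_from_url() (in the older Scrapy source shown below) to build Request objects, which are handed to Scrapy to download and come back as responses
parse(self, response):
	parses the response and returns Items or Requests (which need a callback).
	Items are passed to the item pipeline for persistence; Requests are downloaded by Scrapy and handled by the specified callback
	(parse() by default). This loop continues until all data has been processed.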
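
The call order above can be modelled in a few lines of plain Python. This is a toy sketch, not Scrapy's engine: the Request class, MiniSpider, the crawl() loop and the in-memory FAKE_SITE are all hypothetical stand-ins, and the real downloader is asynchronous.

```python
from collections import deque

# Hypothetical in-memory "site": URL -> (page title, links found on that page).
FAKE_SITE = {
    "page1": ("First", ["page2"]),
    "page2": ("Second", ["page3"]),
    "page3": ("Third", []),
}


class Request:
    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


class MiniSpider:
    start_urls = ["page1"]

    def start_requests(self):
        # One Request per start URL, with parse() as the callback.
        for url in self.start_urls:
            yield Request(url, self.parse)

    def parse(self, response):
        title, links = response
        yield {"title": title}               # an item, destined for the pipeline
        for link in links:
            yield Request(link, self.parse)  # a follow-up request


def crawl(spider):
    items, seen = [], set()
    queue = deque(spider.start_requests())
    while queue:                             # loop until no requests are left
        request = queue.popleft()
        if request.url in seen:
            continue
        seen.add(request.url)
        response = FAKE_SITE[request.url]    # stands in for the downloader
        for result in request.callback(response):
            if isinstance(result, Request):
                queue.append(result)
            else:
                items.append(result)
    return items


items = crawl(MiniSpider())
print(items)
```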

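Spider Class Source Reference

# Base class for all spiders; user-defined spiders must inherit from this class
class Spider(object_ref):
    # A string that defines the name of this spider.
    # The name is how Scrapy locates (and instantiates) the spider, so it must be unique.
    # name is the most important spider attribute, and it is required.
    # A common practice is to name the spider after the site's domain (without the TLD). For example, a spider that crawls mywebsite.com would usually be called mywebsite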
    name = None

    # Initialization: set the spider name and start_urls
    def __init__(self, name=None, **kwargs):
        if name is not None:
            self.name = name
        # A spider without a name raises an error and goes no further
        elif not getattr(self, 'name', None):
            raise ValueError("%s must have a name" % type(self).__name__)

        # Python objects store their attributes in the built-in __dict__ member
        self.__dict__.update(kwargs)

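        # List of URLs. When no particular URLs are specified, the spider starts crawling from this list, so the first pages fetched will be those listed here. Subsequent URLs are extracted from the data retrieved.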
        if not hasattr(self, 'start_urls'):
            self.start_urls = []

    # Write a message to Scrapy's log
    def log(self, message, level=log.DEBUG, **kw):
        log.msg(message, spider=self, level=level, **kw)

    # Bind the crawler to this spider; asserts that no crawler has been bound yet
    def set_crawler(self, crawler):
        assert not hasattr(self, '_crawler'), "Spider already bounded to %s" % crawler
        self._crawler = crawler

    @property
    def crawler(self):
        assert hasattr(self, '_crawler'), "Spider not bounded to any crawler"
        return self._crawler

    @property
    def settings(self):
        return self.crawler.settings

    # Reads the URLs in start_urls and generates a Request object for each, which Scrapy downloads and returns as a Response
    # This method is called only once
    def start_requests(self):
        for url in self.start_urls:
            yield self.make_requests_from_url(url)

    # Called from start_requests(); this is the function that actually builds each Request.
    # The Request's default callback is parse(), and the HTTP method is GET
    def make_requests_from_url(self, url):
        return Request(url, dont_filter=True)

    # Default Request callback; processes the returned response
    # and generates Items or Request objects. Subclasses must implement this
    def parse(self, response):
        raise NotImplementedError

    @classmethod
    def handles_request(cls, request):
        return url_is_from_spider(request.url, cls)

    def __str__(self):
        return "<%s %r at 0x%0x>" % (type(self).__name__, self.name, id(self))

    __repr__ = __str__

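Main Attributes and Methods

name

A string that defines the spider's name. It must be unique.

allowed_domains

An optional list of the domains the spider is allowed to crawl.

start_urls

A tuple or list of initial URLs. When no particular URLs are specified, the spider starts crawling from this list.

start_requests(self)

This method must return an iterable containing the first Requests the spider will crawl (the default implementation builds them from the URLs in start_urls).

It is called when the spider is opened for crawling and no particular URLs have been specified.

parse(self, response)

The default Request callback, used when a request does not specify one. It processes the response returned for the page and generates Items or Request objects.

log(self, message[, level, component])

Records a log message using scrapy.log.msg(). This is the API of the older Scrapy version shown above; newer versions expose self.logger instead.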

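CrawlSpider (Key Topic)

CrawlSpider is a subclass of Spider. The Spider class is designed to crawl only the pages in the start_urls list, whereas CrawlSpider defines rules that provide a convenient mechanism for following links, which makes it better suited to extracting links from crawled pages and continuing the crawl from them.

CrawlSpider Class Source Reference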

class CrawlSpider(Spider):
    rules = ()
    def __init__(self, *a, **kw):
        super(CrawlSpider, self).__init__(*a, **kw)
        self._compile_rules()

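    # parse() is called first to handle the response objects returned for start_urls
    # parse() passes each response to _parse_response() with parse_start_url() as the callback
    # and with the follow flag set to True
    # parse() returns items and the followed Request objects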
    def parse(self, response):
        return self._parse_response(response, self.parse_start_url, cb_kwargs={}, follow=True)

    # Handles the responses for start_urls; override it to parse them
    def parse_start_url(self, response):
        return []

    def process_results(self, response, results):
        return results

    # Extracts from the response the links that match any user-defined rule and returns them as Request objects
    def _requests_to_follow(self, response):
        if not isinstance(response, HtmlResponse):
            return
        seen = set()
        # Extract all links in the response; a link is valid if it passes any one of the rules
        for n, rule in enumerate(self._rules):
            links = [l for l in rule.link_extractor.extract_links(response) if l not in seen]
            # Process the extracted links with the user-specified process_links
            if links and rule.process_links:
                links = rule.process_links(links)
            # Add each link to the seen set and generate a Request for it with _response_downloaded() as the callback
            for link in links:
                seen.add(link)
                # Build the Request; _response_downloaded() later looks up the rule stored in meta and applies that rule's callback
                r = Request(url=link.url, callback=self._response_downloaded)
                r.meta.update(rule=n, link_text=link.text)
                # Call process_request() on each Request. It defaults to identity, i.e. it returns the Request unchanged.
                yield rule.process_request(r)

    # Handles responses for links extracted by a rule and returns items and requests
    def _response_downloaded(self, response):
        rule = self._rules[response.meta['rule']]
        return self._parse_response(response, rule.callback, rule.cb_kwargs, rule.follow)

    # Parses the response with the given callback and returns Request or Item objects
    def _parse_response(self, response, callback, cb_kwargs, follow=True):
        # First check whether a callback is set (it may be a rule's parsing function or parse_start_url())
        # If a callback is set, use it to process the response first,
        # then pass the result to process_results(), which returns the results as cb_res
        if callback:
            # When called from parse(), the callback is parse_start_url()
            # When called for a rule, the callback is the rule's callback, which typically yields Items
            cb_res = callback(response, **cb_kwargs) or ()
            cb_res = self.process_results(response, cb_res)
            for requests_or_item in iterate_spider_output(cb_res):
                yield requests_or_item

        # If following is enabled, use the defined Rules to extract and return the Request objects
        if follow and self._follow_links:
            # Yield each Request object
            for request_or_item in self._requests_to_follow(response):
                yield request_or_item

    def _compile_rules(self):
        def get_method(method):
            if callable(method):
                return method
            elif isinstance(method, basestring):
                return getattr(self, method, None)

        self._rules = [copy.copy(r) for r in self.rules]
        for rule in self._rules:
            rule.callback = get_method(rule.callback)
            rule.process_links = get_method(rule.process_links)
            rule.process_request = get_method(rule.process_request)

    def set_crawler(self, crawler):
        super(CrawlSpider, self).set_crawler(crawler)
        self._follow_links = crawler.settings.getbool('CRAWLSPIDER_FOLLOW_LINKS', True)
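
The get_method() helper inside _compile_rules() is what lets a Rule name its callback as a string such as 'parse_item'. A minimal sketch of that lookup follows; MiniCrawlSpider and resolve() are hypothetical names, and str replaces the Python 2 basestring used in the source above.

```python
class MiniCrawlSpider:
    def parse_item(self, response):
        return {"parsed": response}

    def resolve(self, method):
        # Same idea as get_method(): accept a callable or the name of a method.
        if callable(method):
            return method
        if isinstance(method, str):
            return getattr(self, method, None)
        return None


spider = MiniCrawlSpider()
by_name = spider.resolve("parse_item")        # looked up with getattr
by_ref = spider.resolve(spider.parse_item)    # already callable, returned as-is
missing = spider.resolve("no_such_method")    # unknown names resolve to None

print(by_name("page"), by_ref("page"), missing)
```

Note that an unknown name resolves to None without raising, so a misspelled callback name in a rule would be silently treated as "no callback" in this sketch.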

LinkExtractors

The purpose of LinkExtractors is to extract links. Each LinkExtractor has a single public method, extract_links(), which takes a Response object and returns a list of scrapy.link.Link objects.
scrapy.linkextractors.LinkExtractor(
    allow = (),
    deny = (),
    allow_domains = (),
    deny_domains = (),
    deny_extensions = None,
    restrict_xpaths = (),
    tags = ('a','area'),
    attrs = ('href',),
    canonicalize = True,
    unique = True,
    process_value = None
)

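Main parameters:
	allow: URLs matching the regular expression(s) given here are extracted; if empty, everything matches.
	deny: URLs matching this regular expression (or list of regular expressions) are never extracted.
	allow_domains: the domains from which links will be extracted.
	deny_domains: the domains from which links will never be extracted.
	restrict_xpaths: XPath expressions that restrict where links are extracted from; works together with allow to filter links.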
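
How these filters combine can be sketched with the standard library alone. The function below is a simplified model written for this article, not Scrapy's implementation: filter_links() and its helper are hypothetical, restrict_xpaths is left out, and example.com is a made-up extra domain. Patterns are applied with re.search, so they may match anywhere in the URL.

```python
import re
from urllib.parse import urlparse


def filter_links(urls, allow=(), deny=(), allow_domains=(), deny_domains=()):
    """Simplified model of LinkExtractor's URL filtering (not Scrapy's code)."""

    def in_domains(host, domains):
        return any(host == d or host.endswith("." + d) for d in domains)

    kept = []
    for url in urls:
        host = urlparse(url).hostname or ""
        if allow and not any(re.search(p, url) for p in allow):
            continue  # allow is set, but no pattern matches
        if any(re.search(p, url) for p in deny):
            continue  # a deny match always excludes the URL
        if allow_domains and not in_domains(host, allow_domains):
            continue
        if in_domains(host, deny_domains):
            continue
        kept.append(url)
    return kept


urls = [
    "http://roll.news.sina.com.cn/news/gnxw/gdxw1/index_2.shtml",
    "http://roll.news.sina.com.cn/news/gnxw/gdxw1/index_3.shtml",
    "http://roll.news.sina.com.cn/about.html",
    "http://example.com/index_9.shtml",
]
pages = filter_links(
    urls,
    allow=(r"index_\d+\.shtml",),
    deny=(r"index_2\.shtml",),
    allow_domains=("sina.com.cn",),
)
print(pages)
```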

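rules

rules contains one or more Rule objects; each Rule defines a specific behavior for crawling the site. If multiple rules match the same link, the first one, in the order they are defined in this attribute, is used.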
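
The "first rule wins" behavior follows from the seen set in _requests_to_follow(). A reduced sketch, assuming each rule is just a name plus an allow pattern (the RULES list and assign_rules() are hypothetical):

```python
import re

# Hypothetical rules as (name, allow-pattern) pairs. Order matters.
RULES = [
    ("pagination", r"index_\d+\.shtml"),
    ("any_shtml", r"\.shtml$"),
]


def assign_rules(urls, rules):
    """Map each URL to the first rule that extracts it (simplified model)."""
    seen = set()
    assigned = {}
    for name, pattern in rules:
        # Links already claimed by an earlier rule are skipped.
        links = [u for u in urls if re.search(pattern, u) and u not in seen]
        for url in links:
            seen.add(url)
            assigned[url] = name
    return assigned


urls = ["/news/index_2.shtml", "/news/article.shtml", "/news/logo.png"]
assigned = assign_rules(urls, RULES)
print(assigned)
```

Here /news/index_2.shtml matches both patterns but is handled only by the earlier pagination rule.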

scrapy.spiders.Rule(
        link_extractor, 
        callback = None, 
        cb_kwargs = None, 
        follow = None, 
        process_links = None, 
        process_request = None
)

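link_extractor: a Link Extractor object that defines which links to extract.

callback: the function called for each link extracted by link_extractor, given as a callable or as the name of a spider method. The callback receives a response as its first argument.
	Note: when writing crawl rules, avoid using parse as the callback. CrawlSpider uses the parse method to
	implement its own logic, so if you override parse, the CrawlSpider will no longer work.

follow: a boolean that specifies whether links should be followed from each response extracted with this rule. If callback is None, follow defaults to True; otherwise it defaults to False.

process_links: specifies a function of this spider that is called with each list of links obtained from link_extractor. It is mainly used for filtering.

process_request: specifies a function of this spider that is called for every request extracted by this rule. (Used to filter requests.)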
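
The default for follow is easy to get wrong, so here it is as a tiny function. effective_follow() is a hypothetical helper that only mirrors the rule stated above; it is not part of Scrapy.

```python
def effective_follow(callback=None, follow=None):
    # An explicit follow value wins; otherwise follow only when no callback is set.
    if follow is not None:
        return follow
    return callback is None


print(effective_follow())                                     # no callback
print(effective_follow(callback="parse_item"))                # callback only
print(effective_follow(callback="parse_item", follow=True))   # explicit follow
```

In practice this means a rule that has a callback and should also keep discovering pagination links from the pages it parses needs follow=True set explicitly.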

Example:

import time
import scrapy
from ..items import XinlangItem	
# Import CrawlSpider: it can crawl pages continuously (convenient for pagination)
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


# Using CrawlSpider
class XinlangnewsSpider(CrawlSpider):   # must inherit from CrawlSpider
    name = 'xinlangnews'
    allowed_domains = ['sina.com.cn']
    start_urls = ['http://roll.news.sina.com.cn/news/gnxw/gdxw1/index_1.shtml']

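    # Define rules: the link extraction rules
    rules = [
        Rule(
            # Link extraction rule
            LinkExtractor(
                allow=(r'index_\d+\.shtml',),  # which href values to match; regular expressions are supported (raw string keeps the backslashes intact)
                restrict_xpaths=('//div[@class="pagebox"]',),  # restricts the region links are extracted from; XPath is supported
                # deny=('index_2.shtml',)  # do not crawl page 2
            ),
            callback='parse_item',  # extracted links are requested automatically, and parse_item is called with each response
            #follow=True,  # whether to follow: keep extracting new links from the pages reached through extracted links
        )
    ]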
        )
    ]

    def parse_item(self, response):
        # Using xpath
        time.sleep(1)  # crude throttle; note that this blocks Scrapy's event loop
        news_list = response.xpath('//ul[@class="list_009"]/li')
        for news in news_list:
            # News title
            news_title = news.xpath('./a/text()').get()
            # News time
            news_time = news.xpath('./span/text()').get()
            print(news_title,news_time)
            # item
            item = XinlangItem()
            item['newstitle'] = news_title
            item['newstime'] = news_time
            yield item


# In settings.py

BOT_NAME = 'xinlang'

SPIDER_MODULES = ['xinlang.spiders']
NEWSPIDER_MODULE = 'xinlang.spiders'

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'xinlang.pipelines.XinlangPipeline': 300,
}
ROBOTSTXT_OBEY = False
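
The spider above slows itself down with time.sleep(1) inside the callback. A common alternative is to let Scrapy do the throttling through settings. The lines below are a suggested addition to settings.py, with illustrative values that are assumptions rather than part of the original project:

```python
# settings.py (additions; the values are illustrative assumptions)
DOWNLOAD_DELAY = 1            # seconds to wait between requests to the same site
AUTOTHROTTLE_ENABLED = True   # let Scrapy adapt the delay to server latency
```

With a download delay configured, the sleep call in parse_item is no longer needed.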


# In items.py
import scrapy

class XinlangItem(scrapy.Item):
    newstitle = scrapy.Field()
    newstime = scrapy.Field()


# In pipelines.py
from itemadapter import ItemAdapter
import pymysql

class XinlangPipeline:

    # Called when the spider starts
    def open_spider(self, spider):
        print('start')
        # Connect to MySQL
        self.db = pymysql.connect(
            host='IP address', port=3306,
            user='root', password='password',
            database='spider88', charset='utf8'
        )
        self.cursor = self.db.cursor()

    # Called when the spider finishes
    def close_spider(self, spider):
        print('done')
        self.cursor.close()
        self.db.close()

    def process_item(self, item, spider):
        print(spider.name)  #

        #print(f'---- {item} ----')

        news_title = item['newstitle']
        news_time = item['newstime']

        # SQL statement; the %s placeholders let pymysql escape the values safely
        sql = 'insert into xinlangnews(newstitle, newtime) values(%s, %s)'

        # Execute and commit
        self.cursor.execute(sql, (news_title, news_time))
        self.db.commit()

        return item
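
Passing values as query parameters, rather than formatting them into the SQL string, matters as soon as a news title contains a quote character. The sketch below shows the idea with the standard library's sqlite3 as a stand-in for MySQL, so it runs without a database server. The table layout and sample values are assumptions, and sqlite3 writes placeholders as ? where pymysql uses %s.

```python
import sqlite3

# In-memory database standing in for the MySQL server (assumed schema).
db = sqlite3.connect(":memory:")
db.execute("create table xinlangnews (newstitle text, newstime text)")

# A title containing double quotes, which would break a hand-formatted SQL string.
title, when = 'He said "hello"', "08-16 10:00"

# Placeholders keep the data separate from the SQL text.
db.execute(
    "insert into xinlangnews (newstitle, newstime) values (?, ?)",
    (title, when),
)
db.commit()

rows = db.execute("select newstitle, newstime from xinlangnews").fetchall()
print(rows)
db.close()
```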