
Python web scraping: downloading a complete novel and images

Date: 06-28

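I have worked with scrapers before, but it has been a long time since I wrote any scraping code. Recently I needed to collect some images and txt files, so I went through it again step by step.

This article starts from a simple framework and covers scraping images and scraping txt separately: how to request a URL, dump the HTML, analyze the data, match with regular expressions, do file I/O, and finally scrape the data, plus some basic ways to avoid tripping a site's security system.

Beginners can follow along and do the analysis in step with the article; if you only care about the code, jump straight to the complete listing at the end of each part.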

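Environment

Python 3 + PyCharm

Install the third-party library requests (urllib3 comes along as one of its dependencies). Both can be installed directly from PyCharm, and the download rarely runs into network blocking. The urllib.request module used below is part of the standard library and needs no installation.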

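1. Scraping images

Image page: https://page.om.qq.com/page/OGosuxbOv7M6LKD__rB7FeSA0

Open this address in a browser and press F12 to inspect the source. You can see that each image has an address inside an iframe. Match these addresses with a regular expression, then download whatever each address points to.

1.1 Step 1: request the page and get the HTML

Import the libraries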

import urllib.request
import requests
import re
import os
import time

Create a picture folder to hold the output

if not os.path.exists('picture'):
    os.mkdir('picture')

Request the address with the get function from requests and read the HTML

html = 'https://page.om.qq.com/page/OGosuxbOv7M6LKD__rB7FeSA0'
url = requests.get(html)
htmlcode=url.content

Dump the HTML to a local file

pagefile = open('picture/picture.txt','wb')
pagefile.write(htmlcode)
pagefile.close()

You can now open the file in the picture folder of the project and look at the HTML structure more comfortably.

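1.2 Analyze the HTML and match with a regular expression

The rule is quite obvious: each link starts with preview-src=" and ends with ">, and the complete link sits in between.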

# Non-greedy (.+?) so that two links on the same line are not merged into one match
reg = r'preview-src="(.+?)">'
reg_img = re.compile(reg)
imglist = reg_img.findall(htmlcode.decode())

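Pay attention to decoding here: the HTML obtained through content is bytes, but findall with a str pattern only works on str. Calling decode() without arguments assumes UTF-8.

At this point you can print the list to check that every image link was captured.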
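To check the pattern without hitting the site, it can be run against a small hand-written string. The sketch below uses made-up example.com links (an assumption, not the real page) and also shows why the capture group should be non-greedy: a greedy (.+) swallows everything up to the last "> on the line.

```python
import re

# Hypothetical snippet: two images on one line, as minified HTML often has.
sample_html = (
    '<img preview-src="https://example.com/1.png">'
    '<img preview-src="https://example.com/2.png">'
)

# Greedy: .+ runs to the LAST "> on the line, so both links merge into one match.
greedy = re.findall(r'preview-src="(.+)">', sample_html)

# Non-greedy: .+? stops at the FIRST ">, giving one match per image.
lazy = re.findall(r'preview-src="(.+?)">', sample_html)

print(len(greedy))  # 1
print(lazy)         # ['https://example.com/1.png', 'https://example.com/2.png']
```

If every link sits on its own line, both patterns give the same result, because . does not cross line breaks by default.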

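1.3 Download the images behind the links

The urlretrieve function in the urllib.request module downloads a link and saves it to a given path.

Loop over the address list and download each one. To keep a local network hiccup or a site problem from aborting the whole run, wrap the download in a try.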

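flag = 1
for img in imglist:
    try:
        urllib.request.urlretrieve(img,'picture/%d.png'%flag)
        print('Image %d downloaded'%flag)
    except Exception:
        # Exception rather than a bare except, so Ctrl+C can still stop the loop
        print('Image %d failed to download, please try again later'%flag)
    flag += 1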
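A variation on the same loop, as a rough sketch rather than the code I actually ran: it catches only OSError (the network errors raised by both urllib and requests derive from it) and returns the links that failed so they can be retried. The names download_all and fake_fetch are made up for this example, and fake_fetch stands in for a real download so the sketch runs offline.

```python
import os
import tempfile


def download_all(urls, fetch, out_dir):
    """Save each URL's bytes as 1.png, 2.png, ... and return the URLs that failed."""
    os.makedirs(out_dir, exist_ok=True)
    failed = []
    for index, url in enumerate(urls, start=1):
        try:
            data = fetch(url)
        except OSError as exc:
            # Network errors from urllib and requests both derive from OSError.
            print('Image %d failed: %s' % (index, exc))
            failed.append(url)
            continue
        with open(os.path.join(out_dir, '%d.png' % index), 'wb') as image_file:
            image_file.write(data)
        print('Image %d downloaded' % index)
    return failed


def fake_fetch(url):
    """Stand-in for a real download (hypothetical), so the sketch needs no network."""
    if 'broken' in url:
        raise OSError('simulated network error')
    return b'fake image bytes'


demo_dir = tempfile.mkdtemp()
demo_failed = download_all(
    ['https://example.com/a.png', 'https://example.com/broken.png'],
    fake_fetch,
    demo_dir,
)
print(demo_failed)
```

In real use, fetch could be something like lambda u: requests.get(u, timeout=10).content, and the returned list can be passed back in for a second attempt.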

1.4 Complete code for scraping the images

import urllib.request
import requests
import re
import os
import time

if not os.path.exists('picture'):
    os.mkdir('picture')
# Source page for the images
html = 'https://page.om.qq.com/page/OGosuxbOv7M6LKD__rB7FeSA0'

# Fetch the page HTML
url = requests.get(html)
htmlcode=url.content

# Dump the HTML to a txt file
pagefile = open('picture/picture.txt','wb')
pagefile.write(htmlcode)
pagefile.close()

# Non-greedy (.+?) so that two links on the same line are not merged into one match
reg = r'preview-src="(.+?)">'
reg_img = re.compile(reg)
imglist = reg_img.findall(htmlcode.decode())
#print(imglist)

time.sleep(1)
flag = 1
for img in imglist:
    #print(img)
    try:
        urllib.request.urlretrieve(img,'picture/%d.png'%flag)
        print('Image %d downloaded'%flag)
    except Exception:
        # Exception rather than a bare except, so Ctrl+C can still stop the loop
        print('Image %d failed to download, please try again later'%flag)
    flag += 1

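The code that dumps the HTML to a txt file can be commented out. I just have a habit of analyzing HTML in Notepad++, where it is easier to spot the pattern for the regular expression.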

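2. Scraping a complete novel

The framework for scraping txt is similar, but pulling down an entire novel involves quite a few extra details, and you need to disguise the scraper a little and play cat and mouse with the site's security system.

Novel link: https://www.ddxs.cn/book/195261/

Opening any chapter shows that the text is stored directly in the HTML.

First open the novel's chapter list page and extract each chapter's address from that HTML, then pull the body text out of each chapter's HTML, and finally tidy up the formatting and write it out.

Import the libraries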

import os
import requests
import re
import time
import random

2.1 A helper function to fetch a page's HTML

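def get_htmlcode(url):
    try:
        urlhtml = requests.get(url)
        htmlcode = urlhtml.content
        # Uncomment to dump the HTML to a file while debugging
        # path = open('Txt/html.txt','wb')
        # path.write(htmlcode)
        # path.close()
    except requests.exceptions.RequestException:
        # Any network-level failure is reported to the caller as the string 'error'
        htmlcode = 'error'
    return htmlcode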

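The parameter is a URL, and the function returns that page's HTML as bytes (or the string 'error' if the request fails).

I commented out the block that dumps the HTML to a file. While debugging you can uncomment it to help with the analysis.

2.2 A function to process the chapter list HTML

Looking at the page, there are two blocks, 最新章节 (latest chapters) and 章节目录 (chapter catalog), with very similar structure. Matching everything that starts with <a href="/ and ends with " title= would also pick up the addresses under the latest chapters, which duplicate entries in the catalog. We only need the data under the catalog, so the HTML has to be trimmed first: use 章节目录 as the dividing line, and every address matched after it is one we want.

Look up the text '章节目录', which returns its index, then slice from that index so only the part after '章节目录' remains. Running the regular expression on the trimmed HTML then yields the address of every chapter.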

def get_chapter_list(htmlcode):
    htmldecode = htmlcode.decode()
    a = htmldecode.index('章节目录')
    htmldecode_final = htmldecode[a:]
    reg = r'<a href="(.+?)" title="八荒凤鸣'
    reg_msg = re.compile(reg)
    chapter_list = reg_msg.findall(htmldecode_final)
    return chapter_list
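The slicing idea is easy to try on a small string. The sketch below uses a made-up fragment in the same shape as the list page (the real markup may differ), and the hypothetical helper chapter_links uses find instead of index so a missing marker gives an empty list rather than a ValueError.

```python
import re

# Hypothetical fragment: chapter 3 appears under both headings.
sample_html = (
    '<h2>最新章节</h2>'
    '<a href="/book/195261/3.html" title="八荒凤鸣 第3章">第3章</a>'
    '<h2>章节目录</h2>'
    '<a href="/book/195261/1.html" title="八荒凤鸣 第1章">第1章</a>'
    '<a href="/book/195261/2.html" title="八荒凤鸣 第2章">第2章</a>'
    '<a href="/book/195261/3.html" title="八荒凤鸣 第3章">第3章</a>'
)


def chapter_links(html_text, marker='章节目录'):
    """Return the chapter hrefs that appear after the catalog marker."""
    start = html_text.find(marker)
    if start == -1:
        # Marker not found: the page layout is not what we expected.
        return []
    return re.findall(r'<a href="(.+?)" title="八荒凤鸣', html_text[start:])


all_links = re.findall(r'<a href="(.+?)" title="八荒凤鸣', sample_html)
catalog_links = chapter_links(sample_html)

print(len(all_links))   # 4, chapter 3 is counted twice
print(catalog_links)    # ['/book/195261/1.html', '/book/195261/2.html', '/book/195261/3.html']
```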

What we get here is not yet a complete address: the site prefix 'https://www.ddxs.cn' is missing. Keep this in mind in the later code and add it back.
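Besides concatenating the prefix by hand, the standard library's urllib.parse.urljoin can build the full address from the list page URL. A minimal sketch, with example relative paths that are assumed rather than taken from the real page:

```python
from urllib.parse import urljoin

list_page = 'https://www.ddxs.cn/book/195261/'

# Hypothetical relative addresses in the shape the regex returns.
relative_links = ['/book/195261/1.html', '/book/195261/2.html']

# A path starting with / replaces the whole path of the base URL.
full_links = [urljoin(list_page, link) for link in relative_links]
print(full_links)
```

urljoin also copes with links that are already absolute, which plain string concatenation would break.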

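2.3 Functions to scrape the chapter title and the body text

Section 2.2 produced the list of all chapter addresses. Fetching any one chapter with the function from 2.1 shows that every chapter page follows the same HTML pattern, so it is enough to loop over the address list.

This again takes two steps, because the title and the body text need different regular expressions.

Match the chapter title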

# Get the chapter title
def get_subject(htmlcode):
    reg = '<h1>(.+?)</h1>'
    reg_msg = re.compile(reg)
    subject = reg_msg.findall(htmlcode.decode())
    return subject

Match the body text

# Get the body text
def get_novel(htmlcode):
    reg = 'class="content"><p>(.+?)</p></article>'
    reg_msg = re.compile(reg)
    novel = reg_msg.findall(htmlcode.decode())
    # print(novel)
    return novel

Print the scraped title and body separately here to verify that the regular expressions behave as intended.
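One way to do that check offline is to run both patterns on a hand-written chapter. The fragment below is invented for illustration and only mimics the structure described above. It also shows a detail worth knowing: . does not match a line break unless re.S (re.DOTALL) is passed, so a body that spans several lines would otherwise come back empty.

```python
import re

# Hypothetical chapter page reduced to the parts the patterns look at.
sample_page = (
    '<h1>Chapter 1</h1>'
    '<article class="content"><p>First paragraph.</p><p>Second paragraph.</p></article>'
)

title_pattern = '<h1>(.+?)</h1>'
body_pattern = 'class="content"><p>(.+?)</p></article>'

titles = re.findall(title_pattern, sample_page)
bodies = re.findall(body_pattern, sample_page)
print(titles)  # ['Chapter 1']
print(bodies)  # ['First paragraph.</p><p>Second paragraph.']

# Same page, but with a line break between the paragraphs.
multiline_page = sample_page.replace('</p><p>', '</p>\n<p>')
without_flag = re.findall(body_pattern, multiline_page)
with_flag = re.findall(body_pattern, multiline_page, re.S)
print(without_flag)  # []
print(with_flag)     # ['First paragraph.</p>\n<p>Second paragraph.']
```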

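2.4 Tidy up the body text

The body text scraped in 2.3 is one long line with many </p><p> pairs mixed in. These mark the paragraph boundaries: replace </p> with a line break and <p> with an indent, and the layout matches what we are used to reading.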

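# Adjust the body text format
def novel_format(novel):
    # novel is the list returned by findall; join it rather than calling str(),
    # which would leave the list brackets and quotes in the text
    novel = ''.join(novel).replace('</p>','\n')
    novel_Format = novel.replace('<p>','   ')
    return novel_Format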
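To see the effect on a concrete value, here is a small sketch with a made-up findall result. The helper format_chapter is a name I am introducing for the example; it also indents the first paragraph, whose opening <p> was already consumed by the regular expression.

```python
def format_chapter(matches, indent='   '):
    """Join the findall() results and turn paragraph tags into line breaks and indents."""
    text = ''.join(matches)
    text = text.replace('</p>', '\n').replace('<p>', indent)
    # The regex already consumed the first <p>, so indent the first paragraph by hand.
    return indent + text


# Hypothetical result of the body regex for a two-paragraph chapter.
sample_matches = ['First paragraph.</p><p>Second paragraph.']

formatted = format_chapter(sample_matches)
print(formatted)

# For comparison: str() on a list keeps the brackets and quotes.
print(str(sample_matches))
```

The second print shows why the list should be joined instead of passed through str(): the brackets and quotes would end up in the txt file.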

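2.5 Write to txt

The last step is to write the scraped chapter titles and body text to a txt file. Since the functions are all in place, this step is much easier, but the order in which the code runs still needs care.

When reading and writing the txt file, mind the open mode: open it in append mode.

You can also adjust the output yourself, for example by adding a separator or two blank lines after each chapter, to make the layout clearer.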
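As a minimal sketch of this step (the helper name append_chapter, the separator, and the demo file are my own choices for the example), each chapter is appended with mode 'a', and the encoding is set to UTF-8 explicitly so Chinese text is written the same way on every platform:

```python
import os
import tempfile


def append_chapter(path, title, body, separator='-' * 20):
    """Append one chapter, followed by a separator line, to the text file."""
    # 'a' appends instead of overwriting; UTF-8 is set explicitly for Chinese text.
    with open(path, 'a', encoding='utf-8') as novel_file:
        novel_file.write(title + '\n\n')
        novel_file.write(body + '\n\n')
        novel_file.write(separator + '\n\n')


# Demo in a temporary folder so nothing in the project is touched.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
append_chapter(demo_path, 'Chapter 1', '   First paragraph.')
append_chapter(demo_path, 'Chapter 2', '   Another paragraph.')

with open(demo_path, encoding='utf-8') as novel_file:
    demo_text = novel_file.read()
print(demo_text)
```

Using with closes the file even if a write fails. Keep in mind that append mode also means a second run adds the novel again, so delete the old file before re-running.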

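2.6 Avoid being flagged as malicious traffic

Most people run into the problem of the site's backend flagging their scraper as malicious traffic.

I recently talked this over with a colleague who works on backends. Their main criterion for malicious traffic is many consecutive requests from the same IP within a given unit of time.

My solution is quite crude: trade performance for safety and force a wait between requests.

At first the wait can be a fixed value, say 1 second. But as the volume of scraped pages grows, the backend will see through that trick as well.

So I settled on a random wait between 0.5 s and 1.5 s. With this countermeasure, and within a limited scraping volume and range of sites, I have not been caught yet. That may be because the data I scrape is not very important, and most of the sites host pirated content anyway.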
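The random wait can be wrapped in a small function. This is only a sketch: polite_pause is a name made up here, and the sleep parameter exists so the demo can record the delays instead of actually waiting.

```python
import random
import time


def polite_pause(low=0.5, high=1.5, sleep=time.sleep):
    """Wait a random time between low and high seconds and return the delay used."""
    delay = round(random.uniform(low, high), 2)
    sleep(delay)
    return delay


# Demo: record the delays instead of sleeping, so this finishes instantly.
recorded = []
delays = [polite_pause(sleep=recorded.append) for _ in range(5)]
print(delays)
```

In the scraper itself, calling polite_pause() with no arguments after each chapter gives the 0.5 s to 1.5 s wait described above.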

2.7 Complete code

import os
import requests
import re
import time
import random

# Create the output folder
if not os.path.exists('Txt'):
    os.mkdir('Txt')

# Fetch a page's HTML
def get_htmlcode(url):
    try:
        urlhtml = requests.get(url)
        htmlcode = urlhtml.content
        # Uncomment to dump the HTML to a file while debugging
        # path = open('Txt/html.txt','wb')
        # path.write(htmlcode)
        # path.close()
    except requests.exceptions.RequestException:
        # Any network-level failure is reported to the caller as the string 'error'
        htmlcode = 'error'
    return htmlcode

# Get the list of chapter addresses (without the https://www.ddxs.cn prefix)
def get_chapter_list(htmlcode):
    htmldecode = htmlcode.decode()
    a = htmldecode.index('章节目录')
    htmldecode_final = htmldecode[a:]
    reg = r'<a href="(.+?)" title="八荒凤鸣'
    reg_msg = re.compile(reg)
    chapter_list = reg_msg.findall(htmldecode_final)
    return chapter_list

# Get the chapter title
def get_subject(htmlcode):
    reg = '<h1>(.+?)</h1>'
    reg_msg = re.compile(reg)
    subject = reg_msg.findall(htmlcode.decode())
    return subject

# Get the body text
def get_novel(htmlcode):
    reg = 'class="content"><p>(.+?)</p></article>'
    reg_msg = re.compile(reg)
    novel = reg_msg.findall(htmlcode.decode())
    # print(novel)
    return novel

# Adjust the body text format
def novel_format(novel):
    # novel is the list returned by findall; join it rather than calling str(),
    # which would leave the list brackets and quotes in the text
    novel = ''.join(novel).replace('</p>','\n')
    novel_Format = novel.replace('<p>','   ')
    return novel_Format

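url_list = 'https://www.ddxs.cn/book/195261/'

# Entry point: fetch the chapter list, then each chapter in turn
if __name__ =='__main__':
    # HTML of the novel's chapter list page
    Htmlcode_url_list = get_htmlcode(url_list)
    if Htmlcode_url_list == 'error':
        print('~~~~~~~~~~~~~The page failed to load, please try again later~~~~~~~~~~~~~')
    else:
        # Match the addresses of all chapters in the page source
        Chapter_list = get_chapter_list(Htmlcode_url_list)
        flag = 1
        # Append mode; explicit UTF-8 so the Chinese text is written the same way on every platform
        path = open('Txt/八荒凤鸣.txt', 'a', encoding='utf-8')
        path.write('~~~~~~八荒凤鸣~~~~~~')
        path.write('\n\n\n\n')
        # Process each chapter address in turn
        for a in Chapter_list:
            # Add the site prefix to the chapter address
            url_chapter = 'https://www.ddxs.cn'+a
            try:
                # Fetch the HTML of this chapter
                Htmlcode = get_htmlcode(url_chapter)
                # Extract the chapter title from the HTML
                Subject = get_subject(Htmlcode)
                # Extract the body text from the HTML and tidy its format
                Novel = get_novel(Htmlcode)
                Novel_Format = novel_format(Novel)
                # Write to the final txt; join the title list so no brackets are written
                path.write(''.join(Subject) + '\n')
                path.write(str(Novel_Format))
                path.write('\n\n')
                # Progress message
                print('八荒凤鸣: chapter %d written'%flag)
                # Rest a random 0.5 to 1.5 seconds to avoid being banned by the site
                time.sleep(round(random.uniform(0.5,1.5),2))
            except Exception:
                print('Chapter %d failed, network error, please try again later'%flag)
            # Count every chapter, failed or not, so the numbers in the messages stay aligned
            flag += 1
        path.close()
        print('Novel scraping finished')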