Python crawler: scraping 360 Images (unstructured data)

Date: 03-29

Approach: first build the URLs of the JSON data packets, then extract the image links from the responses.

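Domain: image.so.com
Capturing the requests
1. 360 Images loads its data dynamically.
2. Click the 清新美女 category in the image categories --> ctrl + shift + i --> Network --> XHR --> scroll down so that more images load, and capture the requests.
3. Find the request that carries the images and look at its Query String Parameters:
  1. ch:beauty
  2. t1:595
  3. src:banner_beauty
  4. sn:90 # this value keeps changing, following the pattern 0 30 60 90 …
  5. listtype:new
  6. temp:1
    Apart from sn, the parameters stay the same within the 清新美女 category,
    so rewrite the Request URL of the JSON data (found under Headers) with a {} placeholder for sn:
    https://image.so.com/zjl?ch=beauty&t1=595&src=banner_beauty&gid=fc8d902e79ef72ded26b6b625909538b&sn={}&listtype=new&temp=1
4. Get the image links from the JSON data that the backend sends to the frontend.
  1. Open the Request URL in a new browser tab; the image link sits under the key qhimg_url: "xxx"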
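
The two steps above (building the URLs, then pulling the links out of the JSON) can be tried outside Scrapy with the standard library only. This is a minimal sketch: `build_urls` and `extract_links` are hypothetical helper names, and the sample payload is hand-written to mimic the `list` / `qhimg_url` / `title` keys used in this article. The real response is assumed to have this shape and carries more fields.

```python
import json

# URL template copied from the capture above; {} is the sn offset
URL_TEMPLATE = (
    'https://image.so.com/zjl?ch=beauty&t1=595&src=banner_beauty'
    '&gid=fc8d902e79ef72ded26b6b625909538b&sn={}&listtype=new&temp=1'
)


def build_urls(pages, step=30):
    """Return one URL per page, with sn = 0, step, 2*step, ..."""
    return [URL_TEMPLATE.format(page * step) for page in range(pages)]


def extract_links(json_text):
    """Return (title, link) pairs from a JSON packet shaped like the one above."""
    data = json.loads(json_text)
    # .get() tolerates a packet without a 'list' key (assumption: this may happen)
    return [(img['title'], img['qhimg_url']) for img in data.get('list', [])]


# Hand-written sample payload, not real site data
SAMPLE = json.dumps({
    'list': [
        {'title': 'sample-1', 'qhimg_url': 'https://example.com/a.jpg'},
        {'title': 'sample-2', 'qhimg_url': 'https://example.com/b.png'},
    ]
})

if __name__ == '__main__':
    for url in build_urls(5):
        print(url)
    print(extract_links(SAMPLE))
```

Calling `build_urls(5)` gives the same five offsets (0, 30, 60, 90, 120) that the spider requests.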
Creating the project
scrapy startproject So
cd So
scrapy genspider so image.so.com
In items.py

import scrapy

# Define the data structure: image link + image title
class SoItem(scrapy.Item):
    # Image link
    img_link = scrapy.Field()
    # Image title, used to name the downloaded image file
    img_title = scrapy.Field()

In the spider file

import scrapy
import json
from ..items import SoItem

class SoSpider(scrapy.Spider):
    name = 'so'
    allowed_domains = ['image.so.com']

    # URL of the JSON data packet that carries the images; {} is the sn offset
    url = 'https://image.so.com/zjl?ch=beauty&t1=595&src=banner_beauty&gid=fc8d902e79ef72ded26b6b625909538b&sn={}&listtype=new&temp=1'

    # Override scrapy.Spider.start_requests, because several URLs are handed to the scheduler at once
    def start_requests(self):
        # Build the URLs of 5 data packets (sn = 0, 30, 60, 90, 120)
        for sn in range(0, 121, 30):
            url = self.url.format(sn)

            # Hand the request to the scheduler
            yield scrapy.Request(url=url, callback=self.parse)

    # Parse callback: extract the image links
    def parse(self, response):
        # Decode the JSON response body
        html = json.loads(response.text)
        for img in html['list']:  # html['list'] holds the info of every image
            # Create a fresh item (imported from items.py) for each image,
            # so that yielded items do not share one mutable object
            item = SoItem()
            item['img_link'] = img['qhimg_url']  # link of a single image
            item['img_title'] = img['title']

            # Pass the item on to the pipeline
            yield item
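
One detail worth checking in a parse loop like this is where the item is created. Below is a minimal, Scrapy-free sketch that uses plain dicts in place of `SoItem` (the generator names and the sample data are hypothetical). If a single object is created before the loop and yielded repeatedly, every consumer that keeps a reference ends up seeing the values of the last image, whereas creating the object inside the loop gives each image its own item.

```python
def shared_item(images):
    """Yield the same dict object on every iteration (mutated in place)."""
    item = {}
    for img in images:
        item['img_link'] = img['qhimg_url']
        yield item


def fresh_item(images):
    """Yield a new dict for every image."""
    for img in images:
        item = {}
        item['img_link'] = img['qhimg_url']
        yield item


# Hand-written sample data
images = [{'qhimg_url': 'a.jpg'}, {'qhimg_url': 'b.jpg'}]

shared = list(shared_item(images))
fresh = list(fresh_item(images))

print([i['img_link'] for i in shared])  # ['b.jpg', 'b.jpg']
print([i['img_link'] for i in fresh])   # ['a.jpg', 'b.jpg']
```

Because an image pipeline holds on to an item while its download is in progress, creating the item inside the loop is generally the safer choice in a spider.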

In pipelines.py

import scrapy
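# Import the image pipeline class provided by Scrapy; it implements image downloading.
# It wraps the steps you would otherwise write in the spider: enqueue the image link with the
# scheduler, set a callback, read response.body, and save it to disk with "with open".
from scrapy.pipelines.images import ImagesPipeline

class SoPipeline(ImagesPipeline):
    # Override ImagesPipeline.get_media_requests
    # This method hands a scrapy.Request for the image link to the scheduler
    def get_media_requests(self, item, info):
        yield scrapy.Request(
            url=item['img_link'],
            meta={'name': item['img_title']}  # pass the image title to file_path() via meta
        )
        # Downloading and saving the image from here on is handled by the base class

    # By default the file is named after the SHA-1 hash of its URL and stored in a "full" folder.
    # To name the image after its title and drop the "full" folder, override file_path().
    # (item is passed as a keyword argument by recent Scrapy versions.)
    def file_path(self, request, response=None, info=None, *, item=None):
        # Arguments passed to scrapy.Request() are available as attributes of the request object
        name = request.meta['name']
        # Build the file name; the part after the last dot of the URL is the image format (jpg, png, ...)
        filename = name + '.' + request.url.split('.')[-1]
        return filename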

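# Pipeline class that prints each item.
# It has its own name so that it does not replace the image pipeline defined above;
# add it to ITEM_PIPELINES to enable it.
class SoPrintPipeline(object):
    def process_item(self, item, spider):
        print(item['img_link'])
        return item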
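
The file name built in `file_path` above uses the title verbatim and takes whatever follows the last dot of the URL as the extension. A slightly more defensive variant is sketched below, assuming that titles may contain characters that are not valid in file names and that URLs may carry a query string. `safe_filename` is a hypothetical helper, not part of Scrapy, and the example URLs are made up.

```python
import os
import re
from urllib.parse import urlparse


def safe_filename(title, url, default_ext='.jpg'):
    """Build '<title><ext>' from an image title and its URL."""
    # Take the extension from the URL path only, ignoring any ?query part
    ext = os.path.splitext(urlparse(url).path)[1] or default_ext
    # Replace path separators, whitespace and other unsafe characters with '_'
    name = re.sub(r'[\\/:*?"<>|\s]+', '_', title).strip('_') or 'untitled'
    return name + ext


print(safe_filename('spring / flowers', 'https://example.com/p/abc.png?size=large'))
print(safe_filename('a:b', 'https://example.com/p/noext'))
```

Inside `file_path` it could be called as `safe_filename(request.meta['name'], request.url)`.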

In settings.py

ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 0.1
DEFAULT_REQUEST_HEADERS = {
    'Accept': '...',
    'Accept-Language': '..',
    'User-Agent': '...'
}
ITEM_PIPELINES = {
    'So.pipelines.SoPipeline': 300,
}

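# Directory where the images (unstructured data) are saved
# (ImagesPipeline requires the Pillow package to be installed)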
IMAGES_STORE = '/home/ubuntu/360_images/'

Create run.py in the project root directory

from scrapy import cmdline

cmdline.execute('scrapy crawl so'.split())

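Further notes

If you later need to download files rather than images, Scrapy's pipelines also wrap the download-and-save logic for you. The steps are:

In pipelines.py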

from scrapy.pipelines.files import FilesPipeline

class SoPipeline(FilesPipeline):
    # The methods to override are the same as for images
    ...

In settings.py

# Directory where the files are saved
FILES_STORE = '/home/...'
