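Special note:
The 鼠绘 (ishuhui) API was changed some time ago, so the code below no longer works.
I happened to be learning scrapy recently, so I rewrote the crawler. The project is on GitHub: https://github.com/TurboWay/ishuhui
1. Motivation:
I love reading the One Piece manga, and the 鼠绘 (ishuhui) Chinese translation is without doubt the best and the fastest to update. Because of copyright pressure, the earlier One Piece chapters can no longer be read on the official ishuhui site. The key point, though, is that I found the API still worked, so I wrote a crawler that downloads every One Piece chapter translated by ishuhui. I am sharing the source code for fellow fans who need it. Please avoid crawling during peak hours; after all, we all love ishuhui.
2. How to use:
If you have a Python environment installed (with the requests and Pillow packages that the script imports), copy the source code into a .py file and run it.
3. The code: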
# -*- coding: utf-8 -*-
import ast
import logging
import os
import shutil
import sys
import time
from io import BytesIO

import requests
from PIL import Image

logger = logging.getLogger('log')
logger.setLevel(logging.DEBUG)

# log format
formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')

# console log
ch = logging.StreamHandler()
ch.setLevel(logging.DEBUG)
ch.setFormatter(formatter)
logger.addHandler(ch)


def get_url(url):
    """Request a JSON endpoint and return its "data" field, or None on failure."""
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko)'
                             ' Chrome/62.0.3202.75 Safari/537.36'}
    response = requests.get(url=url, headers=headers, timeout=5)
    js = response.json()
    if js["errNo"] == 0:
        return js["data"]
    logger.warning("Request failed: {0}".format(js))
    return None


# Replace characters that are not allowed in file names
def clean(text):
    kws = ['/', '\\', ':', '*', '"', '<', '>', '|', '?']
    for kw in kws:
        text = text.replace(kw, '.')
    return text


# Create a folder; if istruncate is True, delete any existing folder first
def makefile(path, istruncate):
    if os.path.exists(path) and istruncate:
        shutil.rmtree(path)
    if not os.path.exists(path):
        os.mkdir(path)


# Download one image; return True on success, False on failure
def save_pic(img_src, picname):
    try:
        response = requests.get(img_src, timeout=30)
        response.raise_for_status()
        image = Image.open(BytesIO(response.content))
        image = image.convert('RGB')
        image.save(picname)
        logger.info("{0} downloaded".format(picname))
        return True
    except Exception as e:
        logger.warning("{0} download failed: {1}".format(picname, e))
        return False


# Save an image: one attempt plus up to `retries` retries
def resave_pic(img_src, picname, retries=6):
    for _ in range(1 + retries):
        if save_pic(img_src, picname):
            return True
    return False


# Download the post with ID `nextid` and return the ID of the previous post
def get_data(path, nextid):
    url = 'http://hhzapi.ishuhui.com/cartoon/post/ver/76906890/id/{0}.json'.format(nextid)
    data = get_url(url)
    if not data:
        return None
    server = 'http://pic04.ishuhui.com/'
    source, post_id, title = data['source'], data['id'], data['title']
    book, number = data['book_text'], data['number']
    # content_img is a string holding a dict literal (file name -> image path);
    # ast.literal_eval parses it without executing arbitrary code, unlike eval
    content_img = ast.literal_eval(data['content_img']) if data['content_img'] else {}
    if source == 1:  # translated by ishuhui
        makefile(os.path.join(path, book), False)
        title = clean(title)
        filepath = os.path.join(path, book, '{0} 第 {1} 话 {2}'.format(book, number, title))
        makefile(filepath, True)  # create the chapter folder
        if content_img:  # download the images
            for img, imgurl in content_img.items():
                imgurl = server + imgurl.replace('/upload/', '')
                picname = os.path.join(filepath, img)
                resave_pic(imgurl, picname)
            logger.info("ID:{2} chapter {0} {1} finished".format(number, title, post_id))
    prev = data['prev']
    if prev:
        return prev['id']
    elif nextid == 900:  # the post with ID 900 has no "prev" link, so jump to 899
        return 899
    return None


if __name__ == "__main__":
    path = sys.path[0]
    nextid = get_data(path, 10881)
    while nextid:
        nextid = get_data(path, nextid)
        time.sleep(3)
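The traversal at the bottom of the script (start at the newest post ID, then keep following each post's `prev` link) can be pulled apart from the network code so that it can be tried offline. The following is only a rough sketch under assumptions: `walk_prev`, `fetch` and `fallback` are hypothetical names that do not appear in the script, and the in-memory `fake_posts` dictionary merely stands in for real API responses.

```python
from typing import Callable, Dict, Iterator, Optional


def walk_prev(fetch: Callable[[int], Optional[dict]],
              start_id: int,
              fallback: Optional[Dict[int, int]] = None) -> Iterator[int]:
    """Yield post IDs from start_id backwards by following 'prev' links.

    fetch(post_id) is assumed to return a dict shaped like the API's
    'data' field (with an optional 'prev': {'id': ...}), or None on failure.
    fallback maps an ID whose 'prev' link is missing to the ID to try next
    (the script hard-codes 900 -> 899 for this purpose).
    """
    fallback = fallback or {}
    seen = set()
    current: Optional[int] = start_id
    while current is not None and current not in seen:
        seen.add(current)  # guard against accidental cycles
        yield current
        data = fetch(current)
        prev = data.get('prev') if data else None
        if prev:
            current = prev['id']
        else:
            current = fallback.get(current)


# Offline demonstration with a fake, in-memory "API" (made-up data).
fake_posts = {
    902: {'prev': {'id': 900}},
    900: {'prev': None},  # broken link, like ID 900 in the script
    899: {'prev': {'id': 898}},
    898: {'prev': None},  # genuinely the first post
}
visited = list(walk_prev(fake_posts.get, 902, fallback={900: 899}))
print(visited)
```

In a real run, `fetch` would wrap the HTTP request and a pause such as `time.sleep(3)` would sit between iterations, as in the script.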
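4. Results:
From chapter 598 "2年后" to chapter 908 "世界會議開幕": 309 chapters in total, 3.22 GB. Chapters 680 and 681 are missing; I scanned the API for them and could not find them either.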
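The chapter count can be cross-checked from the names of the downloaded folders. The sketch below assumes the folders follow the `<book> 第 <number> 话 <title>` pattern used by the script. `missing_chapters` is a hypothetical helper, and the folder list (including the book name 海贼王) is synthetic, not real output.

```python
import re
from typing import Iterable, List

# Matches the chapter number inside names like "海贼王 第 598 话 2年后"
CHAPTER_RE = re.compile(r'第\s*(\d+)\s*话')


def missing_chapters(folder_names: Iterable[str], first: int, last: int) -> List[int]:
    """Return the chapter numbers in [first, last] with no matching folder."""
    found = set()
    for name in folder_names:
        match = CHAPTER_RE.search(name)
        if match:
            found.add(int(match.group(1)))
    return [n for n in range(first, last + 1) if n not in found]


# Synthetic folder names covering 598..908 except the two missing chapters.
folders = ['海贼王 第 {0} 话 title'.format(n)
           for n in range(598, 909) if n not in (680, 681)]
print(len(folders))
print(missing_chapters(folders, 598, 908))
```

For a real check, `folders` could come from `os.listdir` on the book folder the script creates.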