ScrapingHub Crawler Practice – TaterLi's Personal Blog

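Crawling costs time, effort, and IP addresses, and many VPS providers do not allow crawlers at all. ScrapingHub, however, is a platform built for crawling: it is compatible with scrapy scripts, and its free tier includes 1 Scrapy Cloud unit.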

1 Scrapy Cloud unit = 1 GB of RAM + 2.5 GB of disk space + 1 CPU + 1 running job

The platform runs the code on Python 2, but since much of Python 3's syntax is compatible with Python 2, this is not much of a problem.

There is no need to write process_item, because the items are exported to JSON automatically. You only have to write the spider and the Items class.

I put together a rough example; the spider file is below.

If you are not yet familiar with scrapy, I suggest working through this first: https://scrapy-chs.readthedocs.io/zh_CN/0.24/intro/tutorial.html

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request

from Fiction.items import FictionItem


class BooksSpider(scrapy.Spider):
    name = 'Books'
    allowed_domains = ['www.69shu.com']
    start_urls = [
        'https://www.69shu.com/top/1.htm',
        'https://www.69shu.com/top/2.htm',
    ]

    # Collect the URL of every book on the ranking page
    def parse(self, response):
        book_urls = response.xpath('//*[@id="content"]/div[1]/div[1]/div/div[2]/div/div[2]/h4/a/@href').extract()
        for book_url in book_urls:
            # urljoin leaves absolute URLs untouched and resolves relative ones
            yield Request(response.urljoin(book_url), callback=self.parse_read)

    # Follow the "read now" button URL to reach the chapter list
    def parse_read(self, response):
        read_url_slice = response.xpath('//html/body/div[2]/div[4]/div[2]')
        read_url = read_url_slice.xpath('a/@href').extract_first()
        # Skip pages where the button is missing instead of raising IndexError
        if read_url:
            yield Request(response.urljoin(read_url), callback=self.parse_chapter)

    # Collect the URL of every chapter of the novel
    def parse_chapter(self, response):
        chapter_urls = response.xpath('/html/body/div[2]/div[4]/ul[1]/li/a/@href').extract()
        for chapter_url in chapter_urls:
            yield Request(response.urljoin(chapter_url), callback=self.parse_content)

    # Extract the novel name, the chapter name and the chapter content
    def parse_content(self, response):
        # Novel name
        name = response.xpath('/html/body/div[2]/div[2]/div[1]/a[3]/text()').extract_first()
        # Chapter name
        chapter_name = response.xpath('/html/body/div[2]/table/tbody/tr/td/h1/text()').extract_first()
        # Chapter content: text() yields a list of fragments
        chapter_content = response.xpath('/html/body/div[2]/table/tbody/tr/td/div[1]/text()').extract()
        # str.join returns a new string, so the result has to be assigned
        chapter_content_all = ''.join(chapter_content)

        item = FictionItem()
        item['name'] = name
        item['chapter_name'] = chapter_name
        item['chapter_content'] = chapter_content_all
        yield item
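A side note on the chapter content: a text() XPath returns a list of string fragments, usually padded with whitespace and blank entries. The helper below is a hypothetical sketch (the name join_text_nodes is mine, not part of scrapy) of one way to collapse such a list into a single string before storing it in the item.

```python
def join_text_nodes(fragments):
    """Strip each fragment, drop the empty ones, and join the rest with newlines."""
    cleaned = (fragment.strip() for fragment in fragments)
    # str.join returns a new string; it does not modify anything in place.
    return '\n'.join(line for line in cleaned if line)
```

Whether to join with newlines or with an empty string depends on how the site splits its paragraphs, so treat the separator as something to adjust.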

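The spider class is pretty much self-explanatory. Crawl a small portion locally first; if that works, deploy it to the platform by following the prompts from the root directory of the scrapy project.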

After that, tell it to RUN, and then just wait for the results.

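The crawler works away diligently on its own, so you can go and do whatever else you need to. Because the crawl is concurrent, the data comes back out of order; sorting and the like are not the crawler's job.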
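Putting the results back in order can be done afterwards on your own machine. The following is only a sketch under a few assumptions: the exported file is a JSON array of items with the name and chapter_name fields used in the spider, and each chapter title contains an Arabic chapter number (for example "第12章"). Titles that use Chinese numerals would need a different key function. The function names and the file path are hypothetical.

```python
import io
import json
import re
from collections import defaultdict


def chapter_number(chapter_name):
    """Return the first run of digits in the title as an int, or None."""
    match = re.search(r'\d+', chapter_name or '')
    return int(match.group()) if match else None


def group_and_sort(items):
    """Group items by novel name and order each group by chapter number."""
    books = defaultdict(list)
    for item in items:
        books[item['name']].append(item)
    for chapters in books.values():
        # Titles without a number sort last; the sort is stable, so they
        # keep the relative order they had in the export.
        chapters.sort(key=lambda it: (chapter_number(it['chapter_name']) is None,
                                      chapter_number(it['chapter_name']) or 0))
    return dict(books)


def load_items(path):
    """Load a JSON array of items exported from the crawl."""
    with io.open(path, encoding='utf-8') as handle:
        return json.load(handle)
```

A typical use would be `books = group_and_sort(load_items('items.json'))`, where items.json stands for wherever you saved the download.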

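The crawler, hard at work... On a free account, a crawl is limited to 1 hour of run time; if you got the crawler through the student pack, it can run for longer than 1 hour...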

Job finished: