2023 Data Collection and Fusion Technology: Practical Assignment 3

Posted 2023-10-26 23:51:39 · Author: 无餍
  • Assignment ①:
    • Requirement: pick a website and crawl all of its images, e.g., China Weather Network (http://www.weather.com.cn). Implement both single-threaded and multi-threaded crawling with the Scrapy framework.

      – Be sure to limit the crawl: cap the total number of pages (last two digits of the student ID), the total number of downloaded images (last three digits), and so on.

    • Output: print the downloaded URLs to the console, store the downloaded images in an images subfolder, and provide screenshots.
    • Gitee folder link: 实践作业三/demo1 · Rookie-LJX/2023级数据采集与融合技术 - 码云 - 开源中国 (gitee.com)
    • Code (crawling running-shoe images from Dangdang; the last three digits of my student ID are 121, so 121 images are crawled):
      • Spider code:
        import scrapy
        from demo1.items import Demo1Item


        class ShoesSpider(scrapy.Spider):
            name = "shoes"
            allowed_domains = ["category.dangdang.com"]
            start_urls = ["https://category.dangdang.com/pg1-cid4002385.html"]
            base_url = "https://category.dangdang.com/pg"
            page = 1
            count = 0

            def parse(self, response):
                li_list = response.xpath('//ul[@id="component_47"]/li')
                id = 0
                for li in li_list:
                    if self.count >= 121:  # cap at 121 images (last three digits of the student ID)
                        return
                    id += 1
                    # lazy-loaded images keep the real URL in @data-original; fall back to @src
                    src = li.xpath('./a/img/@data-original').extract_first()
                    if not src:
                        src = li.xpath('./a/img/@src').extract_first()
                    if src.startswith("//"):  # protocol-relative URL: prepend a scheme
                        src = "http:" + src

                    yield Demo1Item(src=src, id=id, page=self.page)
                    self.count += 1
                if self.page <= 2:  # move on to the next listing page while under the page limit
                    self.page += 1
                    url = self.base_url + str(self.page) + '-cid4002385.html'
                    yield scrapy.Request(url=url, callback=self.parse)
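      • As an aside, Scrapy's built-in closespider extension can enforce the same cap without a manual counter; a minimal sketch (the value mirrors this assignment's 121-image limit):
        # settings.py: stop the spider once 121 items have been scraped
        CLOSESPIDER_ITEMCOUNT = 121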
      • Items code:
        # Define here the models for your scraped items
        #
        # See documentation in:
        # https://docs.scrapy.org/en/latest/topics/items.html

        import scrapy


        class Demo1Item(scrapy.Item):
            # one field per value carried to the pipeline
            page = scrapy.Field()
            id = scrapy.Field()
            src = scrapy.Field()
      • Pipelines code:
        # Define your item pipelines here
        #
        # Don't forget to add your pipeline to the ITEM_PIPELINES setting
        # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


        # useful for handling different item types with a single interface
        from itemadapter import ItemAdapter
        import os
        import urllib.request


        class Demo1Pipeline:
            def process_item(self, item, spider):
                url = item.get('src')
                os.makedirs('./images', exist_ok=True)  # urlretrieve fails if the folder is missing
                filename = './images/' + str(item.get('page')) + '_' + str(item.get('id')) + '.jpg'
                urllib.request.urlretrieve(url=url, filename=filename)
                return item
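      • For reference, Scrapy also ships an ImagesPipeline that handles downloading, deduplication and retries out of the box; a sketch of enabling it instead of the hand-rolled pipeline (standard Scrapy settings; note it requires Pillow, and the URL field must hold a list, so the spider would yield src=[src]):
        # settings.py: built-in image pipeline as an alternative to urlretrieve
        ITEM_PIPELINES = {
            "scrapy.pipelines.images.ImagesPipeline": 300,
        }
        IMAGES_STORE = "./images"    # download target folder
        IMAGES_URLS_FIELD = "src"    # item field holding the list of image URLs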
      • Single-threaded versus multi-threaded crawling is switched by the concurrency parameters in settings.py:
        DOWNLOAD_DELAY = 3
        # The download delay setting will honor only one of:
        CONCURRENT_REQUESTS_PER_DOMAIN = 16      # concurrent requests per domain
        CONCURRENT_REQUESTS_PER_IP = 16          # concurrent requests per IP
        # with both values set to 1, the crawl runs one request at a time (single-threaded)
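      • For the single-threaded run, a minimal settings sketch (standard Scrapy settings; the global CONCURRENT_REQUESTS also has to drop to 1, since its default is 16):
        # settings.py: one request in flight at any time
        CONCURRENT_REQUESTS = 1
        CONCURRENT_REQUESTS_PER_DOMAIN = 1
        CONCURRENT_REQUESTS_PER_IP = 1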
    • Run results: (screenshots omitted)

      Reflections: this assignment crawls running-shoe product images from Dangdang, capped at 121 images and 21 pages. It gave me a much better grasp of the Scrapy framework and of how Item and Pipeline classes work together. In the parse function, an XPath expression extracts the data: here, all li tags under the ul tag with id "component_47" are selected, and the loop pulls the needed fields from each li. Two cases matter when extracting the image URL: if the data-original attribute is absent (the image was not lazy-loaded), the img tag's src attribute is used instead; and if the URL starts with "//", it is a protocol-relative URL, so "http:" has to be prepended. The extracted fields are wrapped in a Demo1Item and handed back with yield, which tells Scrapy how to process them. The loop also handles paging: once the 121st image is reached the crawl stops; otherwise the page counter is incremented and the next page's URL is built. That URL is passed to a scrapy.Request with parse as the callback, so Scrapy keeps calling parse until all pages are crawled.
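      A note on the URL fix-up: Scrapy responses provide urljoin, which resolves protocol-relative (and relative) URLs in one call; a small sketch of how the branch above could be collapsed (same behavior for this site):
        # inside parse(): src may be '//img.../x.jpg' or missing
        src = (li.xpath('./a/img/@data-original').extract_first()
               or li.xpath('./a/img/@src').extract_first())
        if src:
            src = response.urljoin(src)  # fills in the scheme/host from the response URL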

 

  • Assignment ②
    • Requirement: become proficient with serialized output of Item and Pipeline data in Scrapy; crawl stock information using the Scrapy framework + XPath + MySQL storage.
    • Candidate site: Eastmoney: https://www.eastmoney.com/
    • Output: MySQL database storage, formatted as below.
    • Column headers use English names designed by the students themselves, e.g., id for 序号, bStockNo for 股票代码……; a schema sketch in that style follows the table below.
      | 序号 | 股票代码 | 股票名称 | 最新报价 | 涨跌幅 | 涨跌额 | 成交量 | 成交额 | 振幅 | 最高 | 最低 | 今开 | 昨收 |
      |------|----------|----------|----------|--------|--------|---------|--------|-------|------|-------|-------|-------|
      | 1    | 688093   | N世华    | 28.47    | 62.22% | 10.92  | 26.13万 | 7.6亿  | 22.34 | 32.0 | 28.08 | 30.20 | 17.55 |
      | 2……  |          |          |          |        |        |         |        |       |      |       |       |       |
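      Following the naming hint above, a schema sketch with English column names; only id and bStockNo come from the assignment text, while the remaining names are hypothetical examples:
        # sketch: English-named table schema (names other than id/bStockNo are made up)
        CREATE_SQL = '''CREATE TABLE IF NOT EXISTS stocks (
            id INTEGER PRIMARY KEY,    -- 序号
            bStockNo TEXT,             -- 股票代码
            bStockName TEXT,           -- 股票名称
            fLatestPrice REAL,         -- 最新报价
            fChangeRate REAL,          -- 涨跌幅
            fChangeAmount REAL,        -- 涨跌额
            fVolume TEXT,              -- 成交量
            fTurnover TEXT,            -- 成交额
            fAmplitude REAL,           -- 振幅
            fHigh REAL,                -- 最高
            fLow REAL,                 -- 最低
            fOpen REAL,                -- 今开
            fPrevClose REAL            -- 昨收
        )'''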


    • Gitee folder link: 实践作业三/demo2 · Rookie-LJX/2023级数据采集与融合技术 - 码云 - 开源中国 (gitee.com)
    • Code:
      • Spider code:
        import scrapy
        import json
        from demo2.items import Demo2Item


        class StockSpider(scrapy.Spider):
            name = "stock"
            allowed_domains = ["quote.eastmoney.com"]
            # Eastmoney's list API returns JSONP: jQuery...({...});
            start_urls = ['http://47.push2.eastmoney.com/api/qt/clist/get?cb=jQuery11240080126732179717_1697270562788&pn=1&pz=20&po=1&np=1&ut=bd1d9ddb04089700cf9c27f6f7426281&fltt=2&invt=2&wbp2u=|0|0|0|web&fid=f3&fs=m:0+t:6,m:0+t:80,m:1+t:2,m:1+t:23,m:0+t:81+s:2048&fields=f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f12,f13,f14,f15,f16,f17,f18,f20,f21,f23,f24,f25,f22,f11,f62,f128,f136,f115,f152&_=1697270562789']

            def parse(self, response):
                # strip the JSONP wrapper to get the bare JSON payload
                content = response.text
                content = content.split('(')[1].split(')')[0]
                json_data = json.loads(content)
                data = json_data['data']['diff']
                for obj in data:
                    # the API names every field fNN; map them onto readable item fields
                    item = Demo2Item()
                    item['code'] = obj['f12']        # 股票代码
                    item['name'] = obj['f14']        # 股票名称
                    item['quotation'] = obj['f2']    # 最新报价
                    item['percentage'] = obj['f3']   # 涨跌幅
                    item['amount'] = obj['f4']       # 涨跌额
                    item['volume'] = obj['f5']       # 成交量
                    item['turnover'] = obj['f6']     # 成交额
                    item['volatility'] = obj['f7']   # 振幅
                    item['highest'] = obj['f15']     # 最高
                    item['lowest'] = obj['f16']      # 最低
                    item['open'] = obj['f17']        # 今开
                    item['close'] = obj['f18']       # 昨收
                    yield item
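      • The split('(') trick breaks if the JSON payload ever contains a parenthesis; a slightly more defensive sketch using only the standard library:
        import re

        def strip_jsonp(text):
            # capture everything between the callback's outermost parentheses
            match = re.search(r'^[^(]*\((.*)\)\s*;?\s*$', text, re.S)
            return match.group(1) if match else text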
      • Items code:
        # Define here the models for your scraped items
        #
        # See documentation in:
        # https://docs.scrapy.org/en/latest/topics/items.html

        import scrapy


        class Demo2Item(scrapy.Item):
            # one field per column stored by the pipeline
            code = scrapy.Field()
            name = scrapy.Field()
            quotation = scrapy.Field()
            percentage = scrapy.Field()
            amount = scrapy.Field()
            volume = scrapy.Field()
            turnover = scrapy.Field()
            volatility = scrapy.Field()
            highest = scrapy.Field()
            lowest = scrapy.Field()
            open = scrapy.Field()
            close = scrapy.Field()
      • Pipelines code:
        # Define your item pipelines here
        #
        # Don't forget to add your pipeline to the ITEM_PIPELINES setting
        # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


        # useful for handling different item types with a single interface
        from itemadapter import ItemAdapter
        import sqlite3


        class Demo2Pipeline:
            def open_spider(self, spider):
                # SQLite stands in for MySQL here; the SQL itself carries over
                self.con = sqlite3.connect("沪深京A股.db")
                self.cursor = self.con.cursor()
                self.cursor.execute('''CREATE TABLE IF NOT EXISTS stocks (
                    id INTEGER PRIMARY KEY,
                    代码 TEXT,
                    名称 TEXT,
                    最新价 REAL,
                    涨跌幅 REAL,
                    涨跌额 REAL,
                    成交量 REAL,
                    成交额 REAL,
                    振幅 REAL,
                    最高 REAL,
                    最低 REAL,
                    今开 REAL,
                    昨收 REAL
                )''')

            def process_item(self, item, spider):
                self.cursor.execute('''INSERT INTO stocks(
                    代码, 名称, 最新价, 涨跌幅, 涨跌额, 成交量, 成交额, 振幅, 最高, 最低, 今开, 昨收
                ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)''', (
                    item['code'],
                    item['name'],
                    item['quotation'],
                    item['percentage'],
                    item['amount'],
                    item['volume'],
                    item['turnover'],
                    item['volatility'],
                    item['highest'],
                    item['lowest'],
                    item['open'],
                    item['close'],
                ))
                return item

            def close_spider(self, spider):
                self.con.commit()
                self.con.close()
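      • The assignment calls for MySQL; the pipeline above substitutes SQLite for convenience. A minimal MySQL variant might look like the sketch below (assumes a local MySQL server and the pymysql package; host/user/password/database are placeholders):
        import pymysql


        class Demo2MySQLPipeline:
            def open_spider(self, spider):
                # placeholder credentials: adjust to the local setup
                self.con = pymysql.connect(host="localhost", user="root",
                                           password="******", database="stockdb",
                                           charset="utf8mb4")
                self.cursor = self.con.cursor()

            def process_item(self, item, spider):
                # pymysql uses %s placeholders instead of sqlite3's ?
                self.cursor.execute(
                    "INSERT INTO stocks(代码, 名称, 最新价, 涨跌幅, 涨跌额, 成交量, 成交额, "
                    "振幅, 最高, 最低, 今开, 昨收) VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)",
                    (item['code'], item['name'], item['quotation'], item['percentage'],
                     item['amount'], item['volume'], item['turnover'], item['volatility'],
                     item['highest'], item['lowest'], item['open'], item['close']))
                self.con.commit()
                return item

            def close_spider(self, spider):
                self.con.close()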
      • Settings code:
         # Scrapy settings for demo2 project
        #
        # For simplicity, this file contains only settings considered important or
        # commonly used. You can find more settings consulting the documentation:
        #
        #     https://docs.scrapy.org/en/latest/topics/settings.html
        #     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
        #     https://docs.scrapy.org/en/latest/topics/spider-middleware.html
        
        BOT_NAME = "demo2"
        
        SPIDER_MODULES = ["demo2.spiders"]
        NEWSPIDER_MODULE = "demo2.spiders"
        
        # Crawl responsibly by identifying yourself (and your website) on the user-agent
        # USER_AGENT = "demo2 (+http://www.yourdomain.com)"
        
        # Obey robots.txt rules
        # ROBOTSTXT_OBEY = True
        
        # Configure maximum concurrent requests performed by Scrapy (default: 16)
        # CONCURRENT_REQUESTS = 32
        
        # Configure a delay for requests for the same website (default: 0)
        # See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
        # See also autothrottle settings and docs
        # DOWNLOAD_DELAY = 3
        # The download delay setting will honor only one of:
        # CONCURRENT_REQUESTS_PER_DOMAIN = 16
        # CONCURRENT_REQUESTS_PER_IP = 16
        
        # Disable cookies (enabled by default)
        # COOKIES_ENABLED = False
        
        # Disable Telnet Console (enabled by default)
        # TELNETCONSOLE_ENABLED = False
        
        # Override the default request headers:
        # DEFAULT_REQUEST_HEADERS = {
        #
        #     'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36'
        # }
        
        # Enable or disable spider middlewares
        # See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
        # SPIDER_MIDDLEWARES = {
        #    "demo2.middlewares.Demo2SpiderMiddleware": 543,
        # }
        
        # Enable or disable downloader middlewares
        # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
        # DOWNLOADER_MIDDLEWARES = {
        #    "demo2.middlewares.Demo2DownloaderMiddleware": 543,
        # }
        
        # Enable or disable extensions
        # See https://docs.scrapy.org/en/latest/topics/extensions.html
        # EXTENSIONS = {
        #    "scrapy.extensions.telnet.TelnetConsole": None,
        # }
        
        # Configure item pipelines
        # See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
        ITEM_PIPELINES = {
           "demo2.pipelines.Demo2Pipeline": 300,
        }
        
        # Enable and configure the AutoThrottle extension (disabled by default)
        # See https://docs.scrapy.org/en/latest/topics/autothrottle.html
        # AUTOTHROTTLE_ENABLED = True
        # The initial download delay
        # AUTOTHROTTLE_START_DELAY = 5
        # The maximum download delay to be set in case of high latencies
        # AUTOTHROTTLE_MAX_DELAY = 60
        # The average number of requests Scrapy should be sending in parallel to
        # each remote server
        # AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
        # Enable showing throttling stats for every response received:
        # AUTOTHROTTLE_DEBUG = False
        
        # Enable and configure HTTP caching (disabled by default)
        # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
        # HTTPCACHE_ENABLED = True
        # HTTPCACHE_EXPIRATION_SECS = 0
        # HTTPCACHE_DIR = "httpcache"
        # HTTPCACHE_IGNORE_HTTP_CODES = []
        # HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"
        
        # Set settings whose default value is deprecated to a future-proof value
        REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
        TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
        FEED_EXPORT_ENCODING = "utf-8"
    • Run results: (screenshot omitted)

    • Reflections: similar to earlier assignments, just redone with Scrapy, so overall it was not difficult, mostly typing. I am now more fluent with the database operations, and my line of thought while coding was clearer.

 

  • Assignment ③:
    • Requirement: become proficient with serialized output of Item and Pipeline data in Scrapy; crawl foreign-exchange data using the Scrapy framework + XPath + MySQL storage.
    • Candidate site: Bank of China: https://www.boc.cn/sourcedb/whpj/
    • Output:
    • Gitee folder link: 实践作业三/demo3 · Rookie-LJX/2023级数据采集与融合技术 - 码云 - 开源中国 (gitee.com)

        | Currency     | TBP    | CBP    | TSP    | CSP    | Time     |
        |--------------|--------|--------|--------|--------|----------|
        | 阿联酋迪拉姆 | 198.58 | 192.31 | 199.98 | 206.59 | 11:27:14 |
    • Code:
      • Spider code:
        import scrapy
        from demo3.items import Demo3Item


        class MoneySpider(scrapy.Spider):
            name = "money"
            # allowed_domains takes bare domain names, not full URLs
            allowed_domains = ["www.boc.cn"]
            start_urls = ["https://www.boc.cn/sourcedb/whpj/index_1.html"]

            def parse(self, response):
                # match the raw HTML: the tbody tag shown in browser devtools
                # is inserted by the browser and is absent from the page source
                li_list = response.xpath('//div[2]/table[@align="left"]//tr')
                for li in li_list[1:]:  # skip the header row
                    item = Demo3Item()
                    item['Currency'] = li.xpath('./td[1]/text()').extract_first()
                    item['TBP'] = li.xpath('./td[2]/text()').extract_first()
                    item['CBP'] = li.xpath('./td[3]/text()').extract_first()
                    item['TSP'] = li.xpath('./td[4]/text()').extract_first()
                    item['CSP'] = li.xpath('./td[5]/text()').extract_first()
                    item['Time'] = li.xpath('./td[8]/text()').extract_first()
                    yield item
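      • The spider above fetches only index_1.html. If more pages were wanted, the site's numbering (index_1.html, index_2.html, ...) suggests a follow-up request at the end of parse; a sketch under that assumption, with an arbitrary illustrative cap:
        # hypothetical pagination, appended at the end of parse()
        self.page = getattr(self, 'page', 1)
        if self.page < 5:  # arbitrary cap for illustration
            self.page += 1
            next_url = f"https://www.boc.cn/sourcedb/whpj/index_{self.page}.html"
            yield scrapy.Request(url=next_url, callback=self.parse)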
      • Items code:
        # Define here the models for your scraped items
        #
        # See documentation in:
        # https://docs.scrapy.org/en/latest/topics/items.html

        import scrapy


        class Demo3Item(scrapy.Item):
            # one field per column stored by the pipeline
            Currency = scrapy.Field()
            TBP = scrapy.Field()
            CBP = scrapy.Field()
            TSP = scrapy.Field()
            CSP = scrapy.Field()
            Time = scrapy.Field()
      • Pipelines code:
        # Define your item pipelines here
        #
        # Don't forget to add your pipeline to the ITEM_PIPELINES setting
        # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


        # useful for handling different item types with a single interface
        from itemadapter import ItemAdapter
        import sqlite3


        class Demo3Pipeline:
            def open_spider(self, spider):
                # SQLite stands in for MySQL here as well
                self.con = sqlite3.connect("外汇牌价.db")
                self.cursor = self.con.cursor()
                self.cursor.execute('''CREATE TABLE IF NOT EXISTS money (
                    currency TEXT,
                    TBP REAL,
                    CBP REAL,
                    TSP REAL,
                    CSP REAL,
                    Time TEXT
                )''')

            def process_item(self, item, spider):
                self.cursor.execute('''INSERT INTO money(
                    currency, TBP, CBP, TSP, CSP, Time
                ) VALUES (?, ?, ?, ?, ?, ?)''', (
                    item['Currency'],
                    item['TBP'],
                    item['CBP'],
                    item['TSP'],
                    item['CSP'],
                    item['Time'],
                ))
                return item

            def close_spider(self, spider):
                self.con.commit()
                self.con.close()
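      • Some cells on the site are empty (hence the NULL values noted in the reflections below); a small hypothetical helper that process_item could apply so numeric strings are stored as numbers and blanks become NULL:
        def to_real(value):
            # hypothetical helper: '' or None map to NULL, numeric strings to float
            try:
                return float(value) if value not in (None, '') else None
            except (TypeError, ValueError):
                return None

        # usage sketch: to_real(item['TBP']) in place of item['TBP'] in the INSERT tuple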
      • Settings code:
         # Scrapy settings for demo3 project
        #
        # For simplicity, this file contains only settings considered important or
        # commonly used. You can find more settings consulting the documentation:
        #
        #     https://docs.scrapy.org/en/latest/topics/settings.html
        #     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
        #     https://docs.scrapy.org/en/latest/topics/spider-middleware.html
        
        BOT_NAME = "demo3"
        
        SPIDER_MODULES = ["demo3.spiders"]
        NEWSPIDER_MODULE = "demo3.spiders"
        
        
        # Crawl responsibly by identifying yourself (and your website) on the user-agent
        #USER_AGENT = "demo3 (+http://www.yourdomain.com)"
        
        # Obey robots.txt rules
        # ROBOTSTXT_OBEY = True
        
        # Configure maximum concurrent requests performed by Scrapy (default: 16)
        #CONCURRENT_REQUESTS = 32
        
        # Configure a delay for requests for the same website (default: 0)
        # See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
        # See also autothrottle settings and docs
        #DOWNLOAD_DELAY = 3
        # The download delay setting will honor only one of:
        #CONCURRENT_REQUESTS_PER_DOMAIN = 16
        #CONCURRENT_REQUESTS_PER_IP = 16
        
        # Disable cookies (enabled by default)
        #COOKIES_ENABLED = False
        
        # Disable Telnet Console (enabled by default)
        #TELNETCONSOLE_ENABLED = False
        
        # Override the default request headers:
        #DEFAULT_REQUEST_HEADERS = {
        #    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        #    "Accept-Language": "en",
        #}
        
        # Enable or disable spider middlewares
        # See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
        #SPIDER_MIDDLEWARES = {
        #    "demo3.middlewares.Demo3SpiderMiddleware": 543,
        #}
        
        # Enable or disable downloader middlewares
        # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
        #DOWNLOADER_MIDDLEWARES = {
        #    "demo3.middlewares.Demo3DownloaderMiddleware": 543,
        #}
        
        # Enable or disable extensions
        # See https://docs.scrapy.org/en/latest/topics/extensions.html
        #EXTENSIONS = {
        #    "scrapy.extensions.telnet.TelnetConsole": None,
        #}
        
        # Configure item pipelines
        # See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
        ITEM_PIPELINES = {
           "demo3.pipelines.Demo3Pipeline": 300,
        }
        
        # Enable and configure the AutoThrottle extension (disabled by default)
        # See https://docs.scrapy.org/en/latest/topics/autothrottle.html
        #AUTOTHROTTLE_ENABLED = True
        # The initial download delay
        #AUTOTHROTTLE_START_DELAY = 5
        # The maximum download delay to be set in case of high latencies
        #AUTOTHROTTLE_MAX_DELAY = 60
        # The average number of requests Scrapy should be sending in parallel to
        # each remote server
        #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
        # Enable showing throttling stats for every response received:
        #AUTOTHROTTLE_DEBUG = False
        
        # Enable and configure HTTP caching (disabled by default)
        # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
        #HTTPCACHE_ENABLED = True
        #HTTPCACHE_EXPIRATION_SECS = 0
        #HTTPCACHE_DIR = "httpcache"
        #HTTPCACHE_IGNORE_HTTP_CODES = []
        #HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"
        
        # Set settings whose default value is deprecated to a future-proof value
        REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
        TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
        FEED_EXPORT_ENCODING = "utf-8"
    • Run results: (screenshot omitted)

    • Reflections: this one was crawled with XPath + Scrapy. Overall it was not hard, but I had forgotten that XPath must match the raw HTML, where the tbody tag that browsers display does not exist, so my selector matched nothing and I was stuck for a long time (see the sketch below). It taught me to pay closer attention to details. Note: the NULL values in the results mean the corresponding cells on the site are empty; they are not failed scrapes.
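      To make the tbody pitfall concrete: an XPath copied from browser devtools includes the tbody that the browser inserts, so it matches nothing against the server's HTML; dropping it (or bridging with //) fixes the match. A sketch:
        # copied from devtools: matches nothing against the raw page source
        rows = response.xpath('//table[@align="left"]/tbody/tr')
        # works: skip the browser-inserted tbody with the descendant axis
        rows = response.xpath('//table[@align="left"]//tr')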