- Assignment ①:
- Requirements: Pick a website and crawl all of the images on it, e.g. the China Weather Network (http://www.weather.com.cn). Use the Scrapy framework to implement the crawl in both a single-threaded and a multi-threaded way.
- Be sure to limit the crawl, e.g. cap the total number of pages (last 2 digits of the student ID) and the total number of downloaded images (last 3 digits).
- Output: print the downloaded URLs to the console, store the downloaded images in the images subfolder, and provide a screenshot.
- Gitee folder link: 实践作业三/demo1 · Rookie-LJX/2023级数据采集与融合技术 - 码云 - 开源中国 (gitee.com)
- Code: (images are crawled from Dangdang's running-shoe category; the last three digits of my student ID are 121, so 121 images are crawled)
- Spider code:
```python
import scrapy
from demo1.items import Demo1Item


class ShoesSpider(scrapy.Spider):
    name = "shoes"
    allowed_domains = ["category.dangdang.com"]
    start_urls = ["https://category.dangdang.com/pg1-cid4002385.html"]
    base_url = "https://category.dangdang.com/pg"
    page = 1
    count = 0

    def parse(self, response):
        li_list = response.xpath('//ul[@id="component_47"]/li')
        id = 0
        for li in li_list:
            id += 1
            # lazy-loaded images keep the real URL in data-original
            src = li.xpath('./a/img/@data-original').extract_first()
            # name = li.xpath('./a/img/@alt').extract_first()
            if not src:
                # fall back to the plain src attribute
                src = li.xpath('./a/img/@src').extract_first()
            if src.startswith("//"):
                # protocol-relative URL: prepend the scheme
                src = "http:" + src
            images = Demo1Item(src=src, id=id, page=self.page)
            yield images
            self.count += 1
            if self.count >= 121:
                # stop at 121 images (last three digits of the student ID)
                break
        if self.page <= 2:
            self.page += 1
            url = self.base_url + str(self.page) + '-cid4002385.html'
            yield scrapy.Request(url=url, callback=self.parse)
```
- Items code:
```python
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class Demo1Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    page = scrapy.Field()
    id = scrapy.Field()
    src = scrapy.Field()
```
- Pipelines code:
```python
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import urllib.request


class Demo1Pipeline:
    def process_item(self, item, spider):
        url = item.get('src')
        # note: the ./images directory must already exist; urlretrieve will not create it
        filename = './images/' + str(item.get('page')) + '_' + str(item.get('id')) + '.jpg'
        urllib.request.urlretrieve(url=url, filename=filename)
        return item
```
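As an aside, Scrapy ships a built-in ImagesPipeline that downloads through the framework's own downloader instead of blocking on urllib inside process_item. A minimal sketch of how this project might use it instead (Demo1ImagesPipeline is a hypothetical name, not part of my submission; it needs Pillow installed, IMAGES_STORE = "./images", and an ITEM_PIPELINES entry in settings.py):

```python
import scrapy
from scrapy.pipelines.images import ImagesPipeline


class Demo1ImagesPipeline(ImagesPipeline):  # hypothetical alternative pipeline
    def get_media_requests(self, item, info):
        # hand the image URL to Scrapy's downloader instead of urllib
        yield scrapy.Request(item['src'])

    def file_path(self, request, response=None, info=None, *, item=None):
        # keep the page_id.jpg naming used by Demo1Pipeline above
        return f"{item['page']}_{item['id']}.jpg"
```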
- Switching between single-threaded and multi-threaded crawling: which mode runs is controlled through the parameter values in settings.py (a sketch of both configurations follows the snippet):
```python
DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
CONCURRENT_REQUESTS_PER_DOMAIN = 16  # concurrent requests per domain
CONCURRENT_REQUESTS_PER_IP = 16      # concurrent requests per IP
# when both of the values above are set to 1, the crawl is single-threaded
```
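Strictly speaking, Scrapy runs on a single-process asynchronous reactor rather than OS threads, so "single-threaded vs. multi-threaded" here really means sequential vs. concurrent requests. A minimal sketch of the two configurations (the commented values are Scrapy's documented defaults):

```python
# settings.py, "single-threaded": at most one request in flight at a time
CONCURRENT_REQUESTS = 1
CONCURRENT_REQUESTS_PER_DOMAIN = 1
CONCURRENT_REQUESTS_PER_IP = 1
DOWNLOAD_DELAY = 3  # pause between consecutive requests

# settings.py, "multi-threaded": allow concurrent requests
# (defaults: CONCURRENT_REQUESTS = 16, CONCURRENT_REQUESTS_PER_DOMAIN = 8)
# CONCURRENT_REQUESTS = 16
# CONCURRENT_REQUESTS_PER_DOMAIN = 16
```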
- Run results:
- Takeaways: This assignment crawled running-shoe product images from Dangdang, capped at 121 images with the total page count limited to 21. It gave me a solid understanding of the Scrapy framework and of how Item and Pipeline classes are used. In the parse function, XPath expressions extract the data from the page: all li tags under the ul with id "component_47" are selected, then each li is traversed and the needed fields are pulled out. Extracting the image URL needs care in two cases: when the data-original attribute (where lazy-loaded images keep the real URL) is missing, the img tag's plain src attribute is used instead; and when the URL starts with "//", it is protocol-relative, so "http:" must be prepended. The extracted fields are wrapped in a Demo1Item and handed back with yield, which is how Scrapy knows to process them. Paging also has to be handled: once the 121st item has been yielded the loop stops; otherwise the page counter is incremented and the next page's URL is passed to scrapy.Request with parse as the callback, so Scrapy keeps running parse until all pages are crawled.
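A declarative alternative for those caps (not what the code above does, which counts manually): Scrapy's built-in CloseSpider extension can stop the crawl from settings alone. A minimal sketch:

```python
# settings.py: let the CloseSpider extension enforce the assignment limits
CLOSESPIDER_ITEMCOUNT = 121  # stop after 121 scraped items (last 3 digits of the student ID)
CLOSESPIDER_PAGECOUNT = 21   # stop after 21 downloaded responses (last 2 digits)
```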
- Assignment ②
- Requirements: Become proficient with the serialized output of Item and Pipeline data in Scrapy; crawl stock information using the Scrapy framework + XPath + MySQL storage technical route.
- Candidate site: Eastmoney (东方财富网): https://www.eastmoney.com/
- Output: MySQL database storage and output in the format below:
- Column headers use English names designed by each student, e.g. id for 序号, bStockNo for 股票代码, …… (see the schema sketch after the table below).
| 序号 | 股票代码 | 股票名称 | 最新报价 | 涨跌幅 | 涨跌额 | 成交量 | 成交额 | 振幅 | 最高 | 最低 | 今开 | 昨收 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 688093 | N世华 | 28.47 | 62.22% | 10.92 | 26.13万 | 7.6亿 | 22.34 | 32.0 | 28.08 | 30.20 | 17.55 |
| 2…… | | | | | | | | | | | | |
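Following the naming hint above (id for 序号, bStockNo for 股票代码), one possible English-named schema is sketched below. Every column name beyond those two examples is my own invention in the same style, and sqlite3 stands in for MySQL to match the pipeline code later in this post:

```python
import sqlite3

# hypothetical English-named schema; only id and bStockNo come from the assignment text
con = sqlite3.connect("stocks_en.db")  # hypothetical database file
con.execute('''CREATE TABLE IF NOT EXISTS stocks (
    id INTEGER PRIMARY KEY,  -- 序号
    bStockNo TEXT,           -- 股票代码
    bStockName TEXT,         -- 股票名称
    bLatestPrice REAL,       -- 最新报价
    bChangePct REAL,         -- 涨跌幅
    bChangeAmt REAL,         -- 涨跌额
    bVolume TEXT,            -- 成交量
    bTurnover TEXT,          -- 成交额
    bAmplitude REAL,         -- 振幅
    bHigh REAL,              -- 最高
    bLow REAL,               -- 最低
    bOpen REAL,              -- 今开
    bPrevClose REAL          -- 昨收
)''')
con.close()
```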
- Gitee folder link: 实践作业三/demo2 · Rookie-LJX/2023级数据采集与融合技术 - 码云 - 开源中国 (gitee.com)
- Code:
- Spider code:
```python
import scrapy
import json
from demo2.items import Demo2Item


class StockSpider(scrapy.Spider):
    name = "stock"
    allowed_domains = ["quote.eastmoney.com"]
    start_urls = ['http://47.push2.eastmoney.com/api/qt/clist/get?cb=jQuery11240080126732179717_1697270562788&pn=1&pz=20&po=1&np=1&ut=bd1d9ddb04089700cf9c27f6f7426281&fltt=2&invt=2&wbp2u=|0|0|0|web&fid=f3&fs=m:0+t:6,m:0+t:80,m:1+t:2,m:1+t:23,m:0+t:81+s:2048&fields=f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f12,f13,f14,f15,f16,f17,f18,f20,f21,f23,f24,f25,f22,f11,f62,f128,f136,f115,f152&_=1697270562789']

    def parse(self, response):
        # the API returns JSONP: jQuery...({...}); strip the callback wrapper
        # (naive split: assumes no ')' occurs inside the JSON payload)
        content = response.text
        content = content.split('(')[1].split(')')[0]
        json_data = json.loads(content)
        data = json_data['data']['diff']
        for obj in data:
            item = Demo2Item()
            item['code'] = obj['f12']       # stock code
            item['name'] = obj['f14']       # stock name
            item['quotation'] = obj['f2']   # latest price
            item['percentage'] = obj['f3']  # change percentage
            item['amount'] = obj['f4']      # change amount
            item['volume'] = obj['f5']      # trading volume
            item['turnover'] = obj['f6']    # turnover
            item['volatility'] = obj['f7']  # amplitude
            item['highest'] = obj['f15']    # day high
            item['lowest'] = obj['f16']     # day low
            item['open'] = obj['f17']       # today's open
            item['close'] = obj['f18']      # previous close
            # print(item)
            yield item
```
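The start URL requests only the first page of the list API (pn=1, 20 rows via pz=20). If more pages were needed, one might generate one request per value of pn. A rough sketch, under the assumption that the endpoint accepts the same query string for later pages (page_count is a made-up cap):

```python
# hypothetical pagination helper: the Eastmoney list API pages via
# pn=<page number> and pz=<rows per page>
def page_urls(first_page_url: str, page_count: int) -> list[str]:
    """Build one request URL per page by rewriting the pn= parameter."""
    return [first_page_url.replace("pn=1", f"pn={p}")
            for p in range(1, page_count + 1)]

# usage sketch inside the spider:
#   def start_requests(self):
#       for url in page_urls(self.start_urls[0], page_count=5):
#           yield scrapy.Request(url, callback=self.parse)
```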
- Items code:
```python
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class Demo2Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    code = scrapy.Field()
    name = scrapy.Field()
    quotation = scrapy.Field()
    percentage = scrapy.Field()
    amount = scrapy.Field()
    volume = scrapy.Field()
    turnover = scrapy.Field()
    volatility = scrapy.Field()
    highest = scrapy.Field()
    lowest = scrapy.Field()
    open = scrapy.Field()
    close = scrapy.Field()
```
- Pipelines code:
```python
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import sqlite3


class Demo2Pipeline:
    def open_spider(self, spider):
        self.con = sqlite3.connect("沪深京A股.db")
        self.cursor = self.con.cursor()
        self.cursor.execute('''CREATE TABLE IF NOT EXISTS stocks (
            id INTEGER PRIMARY KEY,
            代码 TEXT, 名称 TEXT, 最新价 REAL, 涨跌幅 REAL, 涨跌额 REAL,
            成交量 REAL, 成交额 REAL, 振幅 REAL,
            最高 REAL, 最低 REAL, 今收 REAL, 昨收 REAL
        )''')

    def process_item(self, item, spider):
        self.cursor.execute('''INSERT INTO stocks(
            代码, 名称, 最新价, 涨跌幅, 涨跌额, 成交量, 成交额, 振幅, 最高, 最低, 今收, 昨收
        ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)''', (
            item['code'], item['name'], item['quotation'], item['percentage'],
            item['amount'], item['volume'], item['turnover'], item['volatility'],
            item['highest'], item['lowest'], item['open'], item['close']
        ))
        return item

    def close_spider(self, spider):
        # all rows are committed once, when the spider closes
        self.con.commit()
        self.con.close()
```
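To verify what actually landed in the database after a run, a few lines of standalone sqlite3 are enough. A minimal sketch, assuming 沪深京A股.db sits in the current working directory:

```python
import sqlite3

# inspect the first few rows written by Demo2Pipeline
con = sqlite3.connect("沪深京A股.db")
for row in con.execute("SELECT id, 代码, 名称, 最新价 FROM stocks LIMIT 5"):
    print(row)  # e.g. (1, '688093', 'N世华', 28.47)
con.close()
```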
- Settings code:
```python
# Scrapy settings for demo2 project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = "demo2"

SPIDER_MODULES = ["demo2.spiders"]
NEWSPIDER_MODULE = "demo2.spiders"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = "demo2 (+http://www.yourdomain.com)"

# Obey robots.txt rules
# ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
# CONCURRENT_REQUESTS_PER_DOMAIN = 16
# CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
# COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
# TELNETCONSOLE_ENABLED = False

# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
#     # 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36'
# }

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
# SPIDER_MIDDLEWARES = {
#     "demo2.middlewares.Demo2SpiderMiddleware": 543,
# }

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
#     "demo2.middlewares.Demo2DownloaderMiddleware": 543,
# }

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
# EXTENSIONS = {
#     "scrapy.extensions.telnet.TelnetConsole": None,
# }

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    "demo2.pipelines.Demo2Pipeline": 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
# AUTOTHROTTLE_ENABLED = True
# The initial download delay
# AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
# AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
# AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
# HTTPCACHE_ENABLED = True
# HTTPCACHE_EXPIRATION_SECS = 0
# HTTPCACHE_DIR = "httpcache"
# HTTPCACHE_IGNORE_HTTP_CODES = []
# HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"

# Set settings whose default value is deprecated to a future-proof value
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
```
- Run results:
- Takeaways: Similar to earlier assignments, just redone with Scrapy. Overall it was not difficult, mostly a matter of typing out the code. I am now noticeably more comfortable working with the database, and my train of thought while writing the code was also clearer.
- Assignment ③:
- Requirements: Become proficient with the serialized output of Item and Pipeline data in Scrapy; crawl foreign-exchange data using the Scrapy framework + XPath + MySQL storage technical route.
- Candidate site: Bank of China (中国银行网): https://www.boc.cn/sourcedb/whpj/
- Output:

| Currency | TBP | CBP | TSP | CSP | Time |
| --- | --- | --- | --- | --- | --- |
| 阿联酋迪拉姆 | 198.58 | 192.31 | 199.98 | 206.59 | 11:27:14 |

- Gitee folder link: 实践作业三/demo3 · Rookie-LJX/2023级数据采集与融合技术 - 码云 - 开源中国 (gitee.com)
- Code:
- Spider code:
```python
import scrapy
from demo3.items import Demo3Item


class MoneySpider(scrapy.Spider):
    name = "money"
    # allowed_domains should hold domains, not full URLs
    allowed_domains = ["www.boc.cn"]
    start_urls = ["https://www.boc.cn/sourcedb/whpj/index_1.html"]

    def parse(self, response):
        # table location in the page: /html/body/div/div[5]/div[1]/div[2]
        li_list = response.xpath('//div[2]/table[@align="left"]//tr')
        # print(response.xpath('//div/table[@align="left"]//tr/td[1]/text()').extract())
        for li in li_list[1:]:  # skip the header row
            item = Demo3Item()
            item['Currency'] = li.xpath('./td[1]/text()').extract_first()
            item['TBP'] = li.xpath('./td[2]/text()').extract_first()
            item['CBP'] = li.xpath('./td[3]/text()').extract_first()
            item['TSP'] = li.xpath('./td[4]/text()').extract_first()
            item['CSP'] = li.xpath('./td[5]/text()').extract_first()
            item['Time'] = li.xpath('./td[8]/text()').extract_first()
            yield item
```
- Items code:
```python
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class Demo3Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    Currency = scrapy.Field()
    TBP = scrapy.Field()
    CBP = scrapy.Field()
    TSP = scrapy.Field()
    CSP = scrapy.Field()
    Time = scrapy.Field()
```
- Pipelines code:
```python
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import sqlite3


class Demo3Pipeline:
    def open_spider(self, spider):
        self.con = sqlite3.connect("外汇牌价.db")
        self.cursor = self.con.cursor()
        self.cursor.execute('''CREATE TABLE IF NOT EXISTS money (
            currency TEXT,
            TBP REAL,
            CBP REAL,
            TSP REAL,
            CSP REAL,
            Time TEXT
        )''')

    def process_item(self, item, spider):
        self.cursor.execute('''INSERT INTO money(
            currency, TBP, CBP, TSP, CSP, Time
        ) VALUES (?, ?, ?, ?, ?, ?)''', (
            item['Currency'], item['TBP'], item['CBP'],
            item['TSP'], item['CSP'], item['Time']
        ))
        return item

    def close_spider(self, spider):
        self.con.commit()
        self.con.close()
```
- Settings code:
```python
# Scrapy settings for demo3 project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = "demo3"

SPIDER_MODULES = ["demo3.spiders"]
NEWSPIDER_MODULE = "demo3.spiders"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = "demo3 (+http://www.yourdomain.com)"

# Obey robots.txt rules
# ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
#    "Accept-Language": "en",
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    "demo3.middlewares.Demo3SpiderMiddleware": 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    "demo3.middlewares.Demo3DownloaderMiddleware": 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    "scrapy.extensions.telnet.TelnetConsole": None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    "demo3.pipelines.Demo3Pipeline": 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = "httpcache"
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"

# Set settings whose default value is deprecated to a future-proof value
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
```
- Run results:
- Takeaways: This task was done with XPath + Scrapy. Overall it was not hard, but I forgot that XPath cannot rely on the tbody tag: browsers insert tbody into the DOM even when the raw HTML served by the site does not contain it, so a path copied from the browser matches nothing. That mistake cost me a long time and taught me to pay closer attention to such details in the future. Note: NULL in the results means the value is empty on the website itself, not that the crawl failed.
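A tiny reproduction of that tbody pitfall, using a hand-written HTML string (made up for illustration) and Scrapy's Selector:

```python
from scrapy.selector import Selector

# raw HTML as a server might send it: no <tbody> anywhere,
# even though browser devtools would show one in the DOM
html = "<table><tr><td>198.58</td></tr></table>"
sel = Selector(text=html)

print(sel.xpath("//table/tbody/tr/td/text()").get())  # None: the devtools path fails
print(sel.xpath("//table//tr/td/text()").get())       # '198.58': // skips the missing tbody
```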