2023 Data Collection and Fusion Technology: Practical Assignment 3

Posted 2023-10-26 23:51:39 · Author: 无餍
  • Assignment ①:
    • Requirement: pick a website and crawl all of its images, e.g., China Weather Network (http://www.weather.com.cn). Implement both single-threaded and multi-threaded crawling with the Scrapy framework.

      – Be sure to limit the crawl: cap the total number of pages (last two digits of the student ID), the total number of downloaded images (last three digits), and so on.

    • Output: print the downloaded URLs to the console, store the downloaded images in an images subfolder, and provide screenshots.
    • Gitee folder link: 实践作业三/demo1 · Rookie-LJX/2023级数据采集与融合技术 - 码云 - 开源中国 (gitee.com)
    • Code (crawling running-shoe images from Dangdang; the last three digits of my student ID are 121, so 121 images are crawled):
      • Spider code:
        import scrapy
        from demo1.items import Demo1Item


        class ShoesSpider(scrapy.Spider):
            name = "shoes"
            allowed_domains = ["category.dangdang.com"]
            start_urls = ["https://category.dangdang.com/pg1-cid4002385.html"]
            base_url = "https://category.dangdang.com/pg"
            page = 1
            count = 0

            def parse(self, response):
                li_list = response.xpath('//ul[@id="component_47"]/li')
                id = 0
                for li in li_list:
                    if self.count >= 121:  # cap at 121 images (last three digits of the student ID)
                        return
                    id += 1
                    # lazy-loaded images keep the real URL in @data-original; fall back to @src
                    src = li.xpath('./a/img/@data-original').extract_first()
                    if not src:
                        src = li.xpath('./a/img/@src').extract_first()
                    if src.startswith("//"):  # protocol-relative URL: prepend a scheme
                        src = "http:" + src

                    yield Demo1Item(src=src, id=id, page=self.page)
                    self.count += 1
                if self.page <= 2:  # move on to the next listing page while under the page limit
                    self.page += 1
                    url = self.base_url + str(self.page) + '-cid4002385.html'
                    yield scrapy.Request(url=url, callback=self.parse)
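      • As an aside, Scrapy's built-in closespider extension can enforce the same cap without a manual counter; a minimal sketch (the value mirrors this assignment's 121-image limit):
        # settings.py: stop the spider once 121 items have been scraped
        CLOSESPIDER_ITEMCOUNT = 121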
      • Items code:
        # Define here the models for your scraped items
        #
        # See documentation in:
        # https://docs.scrapy.org/en/latest/topics/items.html

        import scrapy


        class Demo1Item(scrapy.Item):
            # one field per value carried to the pipeline
            page = scrapy.Field()
            id = scrapy.Field()
            src = scrapy.Field()
      • Pipelines code:
        # Define your item pipelines here
        #
        # Don't forget to add your pipeline to the ITEM_PIPELINES setting
        # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


        # useful for handling different item types with a single interface
        from itemadapter import ItemAdapter
        import os
        import urllib.request


        class Demo1Pipeline:
            def process_item(self, item, spider):
                url = item.get('src')
                os.makedirs('./images', exist_ok=True)  # urlretrieve fails if the folder is missing
                filename = './images/' + str(item.get('page')) + '_' + str(item.get('id')) + '.jpg'
                urllib.request.urlretrieve(url=url, filename=filename)
                return item
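      • For reference, Scrapy also ships an ImagesPipeline that handles downloading, deduplication and retries out of the box; a sketch of enabling it instead of the hand-rolled pipeline (standard Scrapy settings; note it requires Pillow, and the URL field must hold a list, so the spider would yield src=[src]):
        # settings.py: built-in image pipeline as an alternative to urlretrieve
        ITEM_PIPELINES = {
            "scrapy.pipelines.images.ImagesPipeline": 300,
        }
        IMAGES_STORE = "./images"    # download target folder
        IMAGES_URLS_FIELD = "src"    # item field holding the list of image URLs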
      • Single-threaded versus multi-threaded crawling is switched by the concurrency parameters in settings.py:
        DOWNLOAD_DELAY = 3
        # The download delay setting will honor only one of:
        CONCURRENT_REQUESTS_PER_DOMAIN = 16      # concurrent requests per domain
        CONCURRENT_REQUESTS_PER_IP = 16          # concurrent requests per IP
        # with both values set to 1, the crawl runs one request at a time (single-threaded)
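      • For the single-threaded run, a minimal settings sketch (standard Scrapy settings; the global CONCURRENT_REQUESTS also has to drop to 1, since its default is 16):
        # settings.py: one request in flight at any time
        CONCURRENT_REQUESTS = 1
        CONCURRENT_REQUESTS_PER_DOMAIN = 1
        CONCURRENT_REQUESTS_PER_IP = 1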
    • Run results: (screenshots omitted)

      Reflections: this assignment crawls running-shoe product images from Dangdang, capped at 121 images and 21 pages. It gave me a much better grasp of the Scrapy framework and of how Item and Pipeline classes work together. In the parse function, an XPath expression extracts the data: here, all li tags under the ul tag with id "component_47" are selected, and the loop pulls the needed fields from each li. Two cases matter when extracting the image URL: if the data-original attribute is absent (the image was not lazy-loaded), the img tag's src attribute is used instead; and if the URL starts with "//", it is a protocol-relative URL, so "http:" has to be prepended. The extracted fields are wrapped in a Demo1Item and handed back with yield, which tells Scrapy how to process them. The loop also handles paging: once the 121st image is reached the crawl stops; otherwise the page counter is incremented and the next page's URL is built. That URL is passed to a scrapy.Request with parse as the callback, so Scrapy keeps calling parse until all pages are crawled.
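      A note on the URL fix-up: Scrapy responses provide urljoin, which resolves protocol-relative (and relative) URLs in one call; a small sketch of how the branch above could be collapsed (same behavior for this site):
        # inside parse(): src may be '//img.../x.jpg' or missing
        src = (li.xpath('./a/img/@data-original').extract_first()
               or li.xpath('./a/img/@src').extract_first())
        if src:
            src = response.urljoin(src)  # fills in the scheme/host from the response URL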

 

  • Assignment ②
    • Requirement: become proficient with serialized output of Item and Pipeline data in Scrapy; crawl stock information using the Scrapy framework + XPath + MySQL storage.
    • Candidate site: Eastmoney: https://www.eastmoney.com/
    • Output: MySQL database storage, formatted as below.
    • Column headers use English names designed by the students themselves, e.g., id for 序号, bStockNo for 股票代码……; a schema sketch in that style follows the table below.
      | 序号 | 股票代码 | 股票名称 | 最新报价 | 涨跌幅 | 涨跌额 | 成交量 | 成交额 | 振幅 | 最高 | 最低 | 今开 | 昨收 |
      |------|----------|----------|----------|--------|--------|---------|--------|-------|------|-------|-------|-------|
      | 1    | 688093   | N世华    | 28.47    | 62.22% | 10.92  | 26.13万 | 7.6亿  | 22.34 | 32.0 | 28.08 | 30.20 | 17.55 |
      | 2……  |          |          |          |        |        |         |        |       |      |       |       |       |
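      Following the naming hint above, a schema sketch with English column names; only id and bStockNo come from the assignment text, while the remaining names are hypothetical examples:
        # sketch: English-named table schema (names other than id/bStockNo are made up)
        CREATE_SQL = '''CREATE TABLE IF NOT EXISTS stocks (
            id INTEGER PRIMARY KEY,    -- 序号
            bStockNo TEXT,             -- 股票代码
            bStockName TEXT,           -- 股票名称
            fLatestPrice REAL,         -- 最新报价
            fChangeRate REAL,          -- 涨跌幅
            fChangeAmount REAL,        -- 涨跌额
            fVolume TEXT,              -- 成交量
            fTurnover TEXT,            -- 成交额
            fAmplitude REAL,           -- 振幅
            fHigh REAL,                -- 最高
            fLow REAL,                 -- 最低
            fOpen REAL,                -- 今开
            fPrevClose REAL            -- 昨收
        )'''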


    • Gitee folder link: 实践作业三/demo2 · Rookie-LJX/2023级数据采集与融合技术 - 码云 - 开源中国 (gitee.com)
    • Code:
      • Spider code:
        import scrapy
        import json
        from demo2.items import Demo2Item


        class StockSpider(scrapy.Spider):
            name = "stock"
            allowed_domains = ["quote.eastmoney.com"]
            # Eastmoney's list API returns JSONP: jQuery...({...});
            start_urls = ['http://47.push2.eastmoney.com/api/qt/clist/get?cb=jQuery11240080126732179717_1697270562788&pn=1&pz=20&po=1&np=1&ut=bd1d9ddb04089700cf9c27f6f7426281&fltt=2&invt=2&wbp2u=|0|0|0|web&fid=f3&fs=m:0+t:6,m:0+t:80,m:1+t:2,m:1+t:23,m:0+t:81+s:2048&fields=f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f12,f13,f14,f15,f16,f17,f18,f20,f21,f23,f24,f25,f22,f11,f62,f128,f136,f115,f152&_=1697270562789']

            def parse(self, response):
                # strip the JSONP wrapper to get the bare JSON payload
                content = response.text
                content = content.split('(')[1].split(')')[0]
                json_data = json.loads(content)
                data = json_data['data']['diff']
                for obj in data:
                    # the API names every field fNN; map them onto readable item fields
                    item = Demo2Item()
                    item['code'] = obj['f12']        # 股票代码
                    item['name'] = obj['f14']        # 股票名称
                    item['quotation'] = obj['f2']    # 最新报价
                    item['percentage'] = obj['f3']   # 涨跌幅
                    item['amount'] = obj['f4']       # 涨跌额
                    item['volume'] = obj['f5']       # 成交量
                    item['turnover'] = obj['f6']     # 成交额
                    item['volatility'] = obj['f7']   # 振幅
                    item['highest'] = obj['f15']     # 最高
                    item['lowest'] = obj['f16']      # 最低
                    item['open'] = obj['f17']        # 今开
                    item['close'] = obj['f18']       # 昨收
                    yield item
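      • The split('(') trick breaks if the JSON payload ever contains a parenthesis; a slightly more defensive sketch using only the standard library:
        import re

        def strip_jsonp(text):
            # capture everything between the callback's outermost parentheses
            match = re.search(r'^[^(]*\((.*)\)\s*;?\s*$', text, re.S)
            return match.group(1) if match else text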
      • Items code:
        # Define here the models for your scraped items
        #
        # See documentation in:
        # https://docs.scrapy.org/en/latest/topics/items.html

        import scrapy


        class Demo2Item(scrapy.Item):
            # one field per column stored by the pipeline
            code = scrapy.Field()
            name = scrapy.Field()
            quotation = scrapy.Field()
            percentage = scrapy.Field()
            amount = scrapy.Field()
            volume = scrapy.Field()
            turnover = scrapy.Field()
            volatility = scrapy.Field()
            highest = scrapy.Field()
            lowest = scrapy.Field()
            open = scrapy.Field()
            close = scrapy.Field()
      • Pipelines code:
        # Define your item pipelines here
        #
        # Don't forget to add your pipeline to the ITEM_PIPELINES setting
        # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


        # useful for handling different item types with a single interface
        from itemadapter import ItemAdapter
        import sqlite3


        class Demo2Pipeline:
            def open_spider(self, spider):
                # SQLite stands in for MySQL here; the SQL itself carries over
                self.con = sqlite3.connect("沪深京A股.db")
                self.cursor = self.con.cursor()
                self.cursor.execute('''CREATE TABLE IF NOT EXISTS stocks (
                    id INTEGER PRIMARY KEY,
                    代码 TEXT,
                    名称 TEXT,
                    最新价 REAL,
                    涨跌幅 REAL,
                    涨跌额 REAL,
                    成交量 REAL,
                    成交额 REAL,
                    振幅 REAL,
                    最高 REAL,
                    最低 REAL,
                    今开 REAL,
                    昨收 REAL
                )''')

            def process_item(self, item, spider):
                self.cursor.execute('''INSERT INTO stocks(
                    代码, 名称, 最新价, 涨跌幅, 涨跌额, 成交量, 成交额, 振幅, 最高, 最低, 今开, 昨收
                ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)''', (
                    item['code'],
                    item['name'],
                    item['quotation'],
                    item['percentage'],
                    item['amount'],
                    item['volume'],
                    item['turnover'],
                    item['volatility'],
                    item['highest'],
                    item['lowest'],
                    item['open'],
                    item['close'],
                ))
                return item

            def close_spider(self, spider):
                self.con.commit()
                self.con.close()
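      • The assignment calls for MySQL; the pipeline above substitutes SQLite for convenience. A minimal MySQL variant might look like the sketch below (assumes a local MySQL server and the pymysql package; host/user/password/database are placeholders):
        import pymysql


        class Demo2MySQLPipeline:
            def open_spider(self, spider):
                # placeholder credentials: adjust to the local setup
                self.con = pymysql.connect(host="localhost", user="root",
                                           password="******", database="stockdb",
                                           charset="utf8mb4")
                self.cursor = self.con.cursor()

            def process_item(self, item, spider):
                # pymysql uses %s placeholders instead of sqlite3's ?
                self.cursor.execute(
                    "INSERT INTO stocks(代码, 名称, 最新价, 涨跌幅, 涨跌额, 成交量, 成交额, "
                    "振幅, 最高, 最低, 今开, 昨收) VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)",
                    (item['code'], item['name'], item['quotation'], item['percentage'],
                     item['amount'], item['volume'], item['turnover'], item['volatility'],
                     item['highest'], item['lowest'], item['open'], item['close']))
                self.con.commit()
                return item

            def close_spider(self, spider):
                self.con.close()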
      • Settings code:
         # Scrapy settings for demo2 project
        #
        # For simplicity, this file contains only settings considered important or
        # commonly used. You can find more settings consulting the documentation:
        #
        #     https://docs.scrapy.org/en/latest/topics/settings.html
        #     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
        #     https://docs.scrapy.org/en/latest/topics/spider-middleware.html
        
        BOT_NAME = "demo2"
        
        SPIDER_MODULES = ["demo2.spiders"]
        NEWSPIDER_MODULE = "demo2.spiders"
        
        # Crawl responsibly by identifying yourself (and your website) on the user-agent
        # USER_AGENT = "demo2 (+http://www.yourdomain.com)"
        
        # Obey robots.txt rules
        # ROBOTSTXT_OBEY = True
        
        # Configure maximum concurrent requests performed by Scrapy (default: 16)
        # CONCURRENT_REQUESTS = 32
        
        # Configure a delay for requests for the same website (default: 0)
        # See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
        # See also autothrottle settings and docs
        # DOWNLOAD_DELAY = 3
        # The download delay setting will honor only one of:
        # CONCURRENT_REQUESTS_PER_DOMAIN = 16
        # CONCURRENT_REQUESTS_PER_IP = 16
        
        # Disable cookies (enabled by default)
        # COOKIES_ENABLED = False
        
        # Disable Telnet Console (enabled by default)
        # TELNETCONSOLE_ENABLED = False
        
        # Override the default request headers:
        # DEFAULT_REQUEST_HEADERS = {
        #
        #     'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36'
        # }
        
        # Enable or disable spider middlewares
        # See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
        # SPIDER_MIDDLEWARES = {
        #    "demo2.middlewares.Demo2SpiderMiddleware": 543,
        # }
        
        # Enable or disable downloader middlewares
        # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
        # DOWNLOADER_MIDDLEWARES = {
        #    "demo2.middlewares.Demo2DownloaderMiddleware": 543,
        # }
        
        # Enable or disable extensions
        # See https://docs.scrapy.org/en/latest/topics/extensions.html
        # EXTENSIONS = {
        #    "scrapy.extensions.telnet.TelnetConsole": None,
        # }
        
        # Configure item pipelines
        # See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
        ITEM_PIPELINES = {
           "demo2.pipelines.Demo2Pipeline": 300,
        }
        
        # Enable and configure the AutoThrottle extension (disabled by default)
        # See https://docs.scrapy.org/en/latest/topics/autothrottle.html
        # AUTOTHROTTLE_ENABLED = True
        # The initial download delay
        # AUTOTHROTTLE_START_DELAY = 5
        # The maximum download delay to be set in case of high latencies
        # AUTOTHROTTLE_MAX_DELAY = 60
        # The average number of requests Scrapy should be sending in parallel to
        # each remote server
        # AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
        # Enable showing throttling stats for every response received:
        # AUTOTHROTTLE_DEBUG = False
        
        # Enable and configure HTTP caching (disabled by default)
        # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
        # HTTPCACHE_ENABLED = True
        # HTTPCACHE_EXPIRATION_SECS = 0
        # HTTPCACHE_DIR = "httpcache"
        # HTTPCACHE_IGNORE_HTTP_CODES = []
        # HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"
        
        # Set settings whose default value is deprecated to a future-proof value
        REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
        TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
        FEED_EXPORT_ENCODING = "utf-8"
    • Run results: (screenshot omitted)

    • Reflections: similar to earlier assignments, just redone with Scrapy, so overall it was not difficult, mostly typing. I am now more fluent with the database operations, and my line of thought while coding was clearer.

 

  • Assignment ③:
    • Requirement: become proficient with serialized output of Item and Pipeline data in Scrapy; crawl foreign-exchange data using the Scrapy framework + XPath + MySQL storage.
    • Candidate site: Bank of China: https://www.boc.cn/sourcedb/whpj/
    • Output:
    • Gitee folder link: 实践作业三/demo3 · Rookie-LJX/2023级数据采集与融合技术 - 码云 - 开源中国 (gitee.com)

        | Currency     | TBP    | CBP    | TSP    | CSP    | Time     |
        |--------------|--------|--------|--------|--------|----------|
        | 阿联酋迪拉姆 | 198.58 | 192.31 | 199.98 | 206.59 | 11:27:14 |
    • Code:
      • Spider code:
        import scrapy
        from demo3.items import Demo3Item


        class MoneySpider(scrapy.Spider):
            name = "money"
            # allowed_domains takes bare domain names, not full URLs
            allowed_domains = ["www.boc.cn"]
            start_urls = ["https://www.boc.cn/sourcedb/whpj/index_1.html"]

            def parse(self, response):
                # match the raw HTML: the tbody tag shown in browser devtools
                # is inserted by the browser and is absent from the page source
                li_list = response.xpath('//div[2]/table[@align="left"]//tr')
                for li in li_list[1:]:  # skip the header row
                    item = Demo3Item()
                    item['Currency'] = li.xpath('./td[1]/text()').extract_first()
                    item['TBP'] = li.xpath('./td[2]/text()').extract_first()
                    item['CBP'] = li.xpath('./td[3]/text()').extract_first()
                    item['TSP'] = li.xpath('./td[4]/text()').extract_first()
                    item['CSP'] = li.xpath('./td[5]/text()').extract_first()
                    item['Time'] = li.xpath('./td[8]/text()').extract_first()
                    yield item
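      • The spider above fetches only index_1.html. If more pages were wanted, the site's numbering (index_1.html, index_2.html, ...) suggests a follow-up request at the end of parse; a sketch under that assumption, with an arbitrary illustrative cap:
        # hypothetical pagination, appended at the end of parse()
        self.page = getattr(self, 'page', 1)
        if self.page < 5:  # arbitrary cap for illustration
            self.page += 1
            next_url = f"https://www.boc.cn/sourcedb/whpj/index_{self.page}.html"
            yield scrapy.Request(url=next_url, callback=self.parse)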
      • Items code:
        # Define here the models for your scraped items
        #
        # See documentation in:
        # https://docs.scrapy.org/en/latest/topics/items.html

        import scrapy


        class Demo3Item(scrapy.Item):
            # one field per column stored by the pipeline
            Currency = scrapy.Field()
            TBP = scrapy.Field()
            CBP = scrapy.Field()
            TSP = scrapy.Field()
            CSP = scrapy.Field()
            Time = scrapy.Field()
      • Pipelines code:
        # Define your item pipelines here
        #
        # Don't forget to add your pipeline to the ITEM_PIPELINES setting
        # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


        # useful for handling different item types with a single interface
        from itemadapter import ItemAdapter
        import sqlite3


        class Demo3Pipeline:
            def open_spider(self, spider):
                # SQLite stands in for MySQL here as well
                self.con = sqlite3.connect("外汇牌价.db")
                self.cursor = self.con.cursor()
                self.cursor.execute('''CREATE TABLE IF NOT EXISTS money (
                    currency TEXT,
                    TBP REAL,
                    CBP REAL,
                    TSP REAL,
                    CSP REAL,
                    Time TEXT
                )''')

            def process_item(self, item, spider):
                self.cursor.execute('''INSERT INTO money(
                    currency, TBP, CBP, TSP, CSP, Time
                ) VALUES (?, ?, ?, ?, ?, ?)''', (
                    item['Currency'],
                    item['TBP'],
                    item['CBP'],
                    item['TSP'],
                    item['CSP'],
                    item['Time'],
                ))
                return item

            def close_spider(self, spider):
                self.con.commit()
                self.con.close()
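      • Some cells on the site are empty (hence the NULL values noted in the reflections below); a small hypothetical helper that process_item could apply so numeric strings are stored as numbers and blanks become NULL:
        def to_real(value):
            # hypothetical helper: '' or None map to NULL, numeric strings to float
            try:
                return float(value) if value not in (None, '') else None
            except (TypeError, ValueError):
                return None

        # usage sketch: to_real(item['TBP']) in place of item['TBP'] in the INSERT tuple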
      • Settings code:
         # Scrapy settings for demo3 project
        #
        # For simplicity, this file contains only settings considered important or
        # commonly used. You can find more settings consulting the documentation:
        #
        #     https://docs.scrapy.org/en/latest/topics/settings.html
        #     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
        #     https://docs.scrapy.org/en/latest/topics/spider-middleware.html
        
        BOT_NAME = "demo3"
        
        SPIDER_MODULES = ["demo3.spiders"]
        NEWSPIDER_MODULE = "demo3.spiders"
        
        
        # Crawl responsibly by identifying yourself (and your website) on the user-agent
        #USER_AGENT = "demo3 (+http://www.yourdomain.com)"
        
        # Obey robots.txt rules
        # ROBOTSTXT_OBEY = True
        
        # Configure maximum concurrent requests performed by Scrapy (default: 16)
        #CONCURRENT_REQUESTS = 32
        
        # Configure a delay for requests for the same website (default: 0)
        # See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
        # See also autothrottle settings and docs
        #DOWNLOAD_DELAY = 3
        # The download delay setting will honor only one of:
        #CONCURRENT_REQUESTS_PER_DOMAIN = 16
        #CONCURRENT_REQUESTS_PER_IP = 16
        
        # Disable cookies (enabled by default)
        #COOKIES_ENABLED = False
        
        # Disable Telnet Console (enabled by default)
        #TELNETCONSOLE_ENABLED = False
        
        # Override the default request headers:
        #DEFAULT_REQUEST_HEADERS = {
        #    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        #    "Accept-Language": "en",
        #}
        
        # Enable or disable spider middlewares
        # See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
        #SPIDER_MIDDLEWARES = {
        #    "demo3.middlewares.Demo3SpiderMiddleware": 543,
        #}
        
        # Enable or disable downloader middlewares
        # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
        #DOWNLOADER_MIDDLEWARES = {
        #    "demo3.middlewares.Demo3DownloaderMiddleware": 543,
        #}
        
        # Enable or disable extensions
        # See https://docs.scrapy.org/en/latest/topics/extensions.html
        #EXTENSIONS = {
        #    "scrapy.extensions.telnet.TelnetConsole": None,
        #}
        
        # Configure item pipelines
        # See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
        ITEM_PIPELINES = {
           "demo3.pipelines.Demo3Pipeline": 300,
        }
        
        # Enable and configure the AutoThrottle extension (disabled by default)
        # See https://docs.scrapy.org/en/latest/topics/autothrottle.html
        #AUTOTHROTTLE_ENABLED = True
        # The initial download delay
        #AUTOTHROTTLE_START_DELAY = 5
        # The maximum download delay to be set in case of high latencies
        #AUTOTHROTTLE_MAX_DELAY = 60
        # The average number of requests Scrapy should be sending in parallel to
        # each remote server
        #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
        # Enable showing throttling stats for every response received:
        #AUTOTHROTTLE_DEBUG = False
        
        # Enable and configure HTTP caching (disabled by default)
        # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
        #HTTPCACHE_ENABLED = True
        #HTTPCACHE_EXPIRATION_SECS = 0
        #HTTPCACHE_DIR = "httpcache"
        #HTTPCACHE_IGNORE_HTTP_CODES = []
        #HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"
        
        # Set settings whose default value is deprecated to a future-proof value
        REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
        TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
        FEED_EXPORT_ENCODING = "utf-8"
    • Run results: (screenshot omitted)

    • Reflections: this one was crawled with XPath + Scrapy. Overall it was not hard, but I had forgotten that XPath must match the raw HTML, where the tbody tag that browsers display does not exist, so my selector matched nothing and I was stuck for a long time (see the sketch below). It taught me to pay closer attention to details. Note: the NULL values in the results mean the corresponding cells on the site are empty; they are not failed scrapes.
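      To make the tbody pitfall concrete: an XPath copied from browser devtools includes the tbody that the browser inserts, so it matches nothing against the server's HTML; dropping it (or bridging with //) fixes the match. A sketch:
        # copied from devtools: matches nothing against the raw page source
        rows = response.xpath('//table[@align="left"]/tbody/tr')
        # works: skip the browser-inserted tbody with the descendant axis
        rows = response.xpath('//table[@align="left"]//tr')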