Data Acquisition and Fusion Technology Practice: Assignment 3

Posted 2023-10-21 11:52:11 | Author: 陈子阳

Code is hosted on Gitee: repository link

Assignment 3

  • Task 1:

  • Requirement: Pick a website and crawl all of its images, e.g. the China Weather Network (中国气象网). Use the Scrapy framework to implement both single-threaded and multi-threaded crawling.

Task 1 code folder: 作业一

  • Below is the pic.py spider code:
import scrapy


class PicSpider(scrapy.Spider):
    name = "pic"
    # No allowed_domains: the image files live on several hosts, so we do not restrict domains
    start_urls = ["http://www.weather.com.cn/"]

    def parse(self, response):
        # Grab every <img src> on the page and make relative URLs absolute
        img_address = [response.urljoin(src)
                       for src in response.xpath("//img/@src").getall()]
        for img in img_address:
            print(img)
        # ImagesPipeline picks the URLs up from the "image_urls" key
        yield {
            "image_urls": img_address
        }
  • You also need these imports in pipelines.py (Request is used by ImgPipeline below):
from scrapy import Request
from scrapy.pipelines.images import ImagesPipeline
  • and register ImgPipeline in settings.py:
ITEM_PIPELINES = {
    "homework01.pipelines.ImgPipeline": 300,
}
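  • Note that ImagesPipeline stays disabled unless settings.py also defines IMAGES_STORE, the directory where downloads are saved. A minimal sketch (the path is just an example):
IMAGES_STORE = "./images"  # downloaded files end up under this directory, in a full/ subfolder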
  • Below is the ImgPipeline class:
class ImgPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # "image_urls" holds a list, so yield one download request per URL
        for url in item.get("image_urls", []):
            yield Request(url)
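  • By default, ImagesPipeline names each file after a SHA1 hash of its URL. If readable filenames are preferred, its file_path method can be overridden. This is a hypothetical variant, not part of the homework code:
import os
from urllib.parse import urlparse

from scrapy.pipelines.images import ImagesPipeline


class NamedImgPipeline(ImagesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        # Keep the basename from the image URL instead of the default hash
        return "full/" + os.path.basename(urlparse(request.url).path)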
  • Below is a screenshot of running the spider:
    (image)

  • And the saved pictures in the image folder:
    (image)

  • For Scrapy's multi-threaded (concurrent) crawling, I verified the approach in the official Scrapy documentation:

(image)

  • As it turns out, we only need to set the CONCURRENT_REQUESTS option in settings.py to the desired level of concurrency, which is very convenient: there is no need to create Thread objects ourselves. A sketch follows.
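  • For reference, a minimal settings.py sketch (the values are arbitrary examples, not taken from the original post):
CONCURRENT_REQUESTS = 32             # total concurrent requests (Scrapy's default is 16)
CONCURRENT_REQUESTS_PER_DOMAIN = 16  # optional per-domain cap (default is 8)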
  • Reflections:

Crawling and downloading images with Scrapy was new territory for me, so I felt somewhat at a loss during this task: I wasn't familiar with configuring the pipeline, and I kept getting one part set up while forgetting another. I clearly need more practice until using ImagesPipeline becomes second nature.

  • Task 2:

  • Requirement: Become proficient with the serialized output of Item and Pipeline data in Scrapy; crawl stock information using the Scrapy framework + XPath + MySQL storage pipeline.

  • Candidate site: Eastmoney (东方财富网)

Task 2 code folder: 作业二

  • Below is the stock.py spider code:
import json

import scrapy
from scrapy.http import Request

from ..items import Homework01Item


class StockSpider(scrapy.Spider):
    name = "stock"
    allowed_domains = ["eastmoney.com"]
    start_urls = ["http://38.push2.eastmoney.com/api/qt/clist/get?cb=jQuery112406848566904145428_1697696179672&pn=1&pz=20&po=1&np=1&ut=bd1d9ddb04089700cf9c27f6f7426281&fltt=2&invt=2&wbp2u=|0|0|0|web&fid=f3&fs=m:0+t:6,m:0+t:80,m:1+t:2,m:1+t:23,m:0+t:81+s:2048&fields=f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f12,f13,f14,f15,f16,f17,f18,f20,f21,f23,f24,f25,f22,f11,f62,f128,f136,f115,f152&_=1697696179673"]

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url)

    def parse(self, response):
        '''
        Field codes in the API response (plus a serial number we add ourselves):
        stock code: f12, stock name: f14, latest price: f2, change percent: f3,
        change amount: f4, volume: f5, turnover: f6, amplitude: f7,
        high: f15, low: f16, open: f17, previous close: f18
        '''
        # The body is JSONP, i.e. jQuery...({...}); keep only the JSON between the outer parentheses
        start_index = response.text.find('(') + 1
        end_index = response.text.rfind(')')
        json_obj = json.loads(response.text[start_index:end_index])
        # data.diff holds the list of per-stock records
        data = json_obj['data']['diff']
        fields = ['f12', 'f14', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f15', 'f16', 'f17', 'f18']
        for count, record in enumerate(data):
            # e.g. [1, '301348', '蓝箭电子', 50.53, 20.0, 8.42, 172116, 815409272.27, 22.56, 50.53, 41.03, 41.04, 42.11]
            row = [count] + [record[n] for n in fields]
            stock = Homework01Item()
            stock['id'] = str(row[0])
            stock['number'] = str(row[1])
            stock['name'] = str(row[2])
            stock['new_price'] = str(row[3])
            stock['up_down_precent'] = str(row[4])
            stock['up_down_num'] = str(row[5])
            stock['turnover'] = str(row[6])
            stock['Transaction_volume'] = str(row[7])
            stock['vibration'] = str(row[8])
            stock['maxx'] = str(row[9])
            stock['minn'] = str(row[10])
            stock['today'] = str(row[11])
            stock['yesterday'] = str(row[12])
            yield stock
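  • The API URL paginates through its pn (page number) and pz (page size) query parameters, so more pages could be fetched by rewriting pn in start_requests. A sketch of how that might look in StockSpider (the range of pages 1 to 5 is an arbitrary choice, not part of the original code):
import re

# Inside StockSpider, replacing the single-page start_requests above
def start_requests(self):
    for page in range(1, 6):
        url = re.sub(r"pn=\d+", f"pn={page}", self.start_urls[0])
        yield Request(url, callback=self.parse)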
  • The items.py code is as follows:
import scrapy


class Homework01Item(scrapy.Item):
    # define the fields for your item here like:
    id = scrapy.Field()
    number = scrapy.Field()
    name = scrapy.Field()
    new_price = scrapy.Field()
    up_down_precent = scrapy.Field()
    up_down_num = scrapy.Field()
    turnover = scrapy.Field()
    Transaction_volume = scrapy.Field()
    vibration = scrapy.Field()
    maxx = scrapy.Field()
    minn = scrapy.Field()
    today = scrapy.Field()
    yesterday = scrapy.Field()
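  • Since the assignment stresses serialized output, it is worth noting that these Item fields also drive Scrapy's feed exports (e.g. scrapy crawl stock -o stocks.json), and each Field accepts a serializer. A sketch of an alternative definition (hypothetical; the homework stores everything as strings instead):
class Homework01Item(scrapy.Item):
    # ... other fields as above ...
    new_price = scrapy.Field(serializer=float)  # feed exporters would then emit prices as numbers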
  • Below is the pipelines.py code:
import pymysql


class Homework01Pipeline:
    def open_spider(self, spider):
        # Open one MySQL connection when the spider starts
        self.client = pymysql.connect(host="localhost", port=3306, user="root",
                                      password="123456", db="homework1", charset="utf8")
        self.cursor = self.client.cursor()

    def process_item(self, item, spider):
        args = [
            item.get("id"),
            item.get("number"),
            item.get("name"),
            item.get("new_price"),
            item.get("up_down_precent"),
            item.get("up_down_num"),
            item.get("turnover"),
            item.get("Transaction_volume"),
            item.get("vibration"),
            item.get("maxx"),
            item.get("minn"),
            item.get("today"),
            item.get("yesterday"),
        ]
        # Parameterized insert: 13 placeholders matching the 13 columns of stock_copy1
        sql = "insert into stock_copy1 values(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"
        self.cursor.execute(sql, args)
        self.client.commit()
        return item

    def close_spider(self, spider):
        # Close the cursor before the connection
        self.cursor.close()
        self.client.close()
  • Also remember to register the pipeline in settings.py:
ITEM_PIPELINES = {
    "homework01.pipelines.Homework01Pipeline": 300,
}
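  • The stock_copy1 table has to exist before the spider runs. Its exact definition is not shown in the post; a plausible sketch, with every column as VARCHAR since the pipeline stores strings (the column types are assumptions):
import pymysql

ddl = """CREATE TABLE IF NOT EXISTS stock_copy1 (
    id VARCHAR(16), number VARCHAR(16), name VARCHAR(32),
    new_price VARCHAR(16), up_down_precent VARCHAR(16), up_down_num VARCHAR(16),
    turnover VARCHAR(32), Transaction_volume VARCHAR(32), vibration VARCHAR(16),
    maxx VARCHAR(16), minn VARCHAR(16), today VARCHAR(16), yesterday VARCHAR(16)
)"""
client = pymysql.connect(host="localhost", port=3306, user="root",
                         password="123456", db="homework1", charset="utf8")
with client.cursor() as cursor:
    cursor.execute(ddl)
client.commit()
client.close()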
  • Checking the MySQL database with the Navicat GUI:
    (image)
  • Reflections:

This task was really about consolidating XPath, but Scrapy by itself only fetches static content (unless it is combined with Selenium, which I plan to look into later), so in the end I processed the API's JSON response with the json package instead. The exercise again reinforced how, once the data is crawled, to design the Item and Pipeline classes so the results land in the database.

  • Task 3:

  • Requirement: Become proficient with the serialized output of Item and Pipeline data in Scrapy; crawl foreign-exchange rate information using the Scrapy framework + XPath + MySQL storage pipeline.

  • Candidate site: Bank of China exchange-rate page: https://www.boc.cn/sourcedb/whpj/

Task 3 code folder: 作业三

  • Below is the main spider code, currency.py:
import scrapy

from ..items import CurrencyItem


class CurrencySpider(scrapy.Spider):
    name = "currency"
    allowed_domains = ["boc.cn"]
    start_urls = ["https://www.boc.cn/sourcedb/whpj/"]

    def parse(self, response):
        # Select every <tr> element except the header row
        rows = response.xpath("//tr[position()>1]")
        '''
        Table header columns on the page (translated):
        Currency Name | Buying Rate (forex) | Buying Rate (cash) |
        Selling Rate (forex) | Selling Rate (cash) | BOC Conversion Rate |
        Release Date | Release Time
        '''
        for row in rows:
            # Read each <td> individually so that an empty cell yields None
            # instead of shifting the remaining columns (see the reflections below)
            currencyname = row.xpath("./td[1]//text()").get()
            hui_in = row.xpath("./td[2]//text()").get()
            chao_in = row.xpath("./td[3]//text()").get()
            hui_out = row.xpath("./td[4]//text()").get()
            chao_out = row.xpath("./td[5]//text()").get()
            zhonghang = row.xpath("./td[6]//text()").get()
            date = row.xpath("./td[7]//text()").get()
            time = row.xpath("./td[8]//text()").get()
            currency = CurrencyItem()
            currency['currencyname'] = str(currencyname)
            currency['hui_in'] = str(hui_in)
            currency['chao_in'] = str(chao_in)
            currency['hui_out'] = str(hui_out)
            currency['chao_out'] = str(chao_out)
            currency['zhonghang'] = str(zhonghang)
            currency['date'] = str(date)
            currency['time'] = str(time)
            yield currency
  • Below is the items.py file:
class CurrencyItem(scrapy.Item):
    currencyname = scrapy.Field()
    hui_in = scrapy.Field()
    chao_in = scrapy.Field()
    hui_out = scrapy.Field()
    chao_out = scrapy.Field()
    zhonghang = scrapy.Field()
    date = scrapy.Field()
    time = scrapy.Field()
  • Below is the pipelines.py file:
import pymysql


class CurrencyPipeline:
    def open_spider(self, spider):
        self.client = pymysql.connect(host="localhost", port=3306, user="root",
                                      password="123456", db="homework1", charset="utf8")
        self.cursor = self.client.cursor()

    def process_item(self, item, spider):
        args = [
            item.get("currencyname"),
            item.get("hui_in"),
            item.get("chao_in"),
            item.get("hui_out"),
            item.get("chao_out"),
            item.get("zhonghang"),
            item.get("date"),
            item.get("time"),
        ]
        # Parameterized insert: 8 placeholders matching the 8 columns of the currency table
        sql = "insert into currency values(%s,%s,%s,%s,%s,%s,%s,%s)"
        self.cursor.execute(sql, args)
        self.client.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.client.close()
  • Also remember to register the pipeline in settings.py:
ITEM_PIPELINES = {
    "homework01.pipelines.CurrencyPipeline": 300,
}
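  • One caveat: Homework01Pipeline and CurrencyPipeline live in the same homework01 project, and every pipeline registered in ITEM_PIPELINES sees items from every spider. A common way to keep them apart (a sketch, not from the original post) is to guard on the spider name at the top of process_item:
# Inside CurrencyPipeline
def process_item(self, item, spider):
    if spider.name != "currency":
        return item          # let items from other spiders pass through untouched
    ...                      # then run the insert logic shown above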
  • Checking the MySQL database with the Navicat GUI:
    (image)
  • Reflections:

This task is quite similar to the second one, but at first I grabbed all the <td> contents of a row with getall() into a single list. Writing that to MySQL raised an out-of-index error, and printing the lists revealed that some rows have incomplete statistics, so missing cells simply disappear from the list. The fix is to match each cell with its own XPath expression: an absent cell then still produces a value (None), every row keeps a fixed number of elements, and the inserts succeed. A small sketch of the difference follows.
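  • To illustrate the pitfall with a self-contained sketch (the row is made up; the empty second cell stands in for a missing statistic):
from scrapy import Selector

html = "<table><tr><td>USD</td><td></td><td>7.10</td></tr></table>"
row = Selector(text=html).xpath("//tr")[0]

# getall() flattens only the text nodes that exist, so the empty cell simply vanishes
print(row.xpath("./td//text()").getall())                            # ['USD', '7.10']

# Per-cell .get() keeps the columns aligned, returning None for the empty cell
print([row.xpath(f"./td[{i}]//text()").get() for i in range(1, 4)])  # ['USD', None, '7.10']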