2023 Data Collection and Fusion Technology Practice: Assignment 1

Posted 2023-09-21 17:21:47  Author: 酱酱酱酱江

Assignment ①:

Requirement: use the requests and BeautifulSoup library methods to crawl the given URL (http://www.shanghairanking.cn/rankings/bcur/2020) and print the scraped university ranking information to the screen.
Output format:
Rank    School Name    Province/City    Type    Total Score
1    清华大学    北京    综合    852.5
2......

Experiment


import requests
from bs4 import BeautifulSoup
import bs4

uinfo = []
url = "https://www.shanghairanking.cn/rankings/bcur/2020"
res = requests.get(url)
res.encoding = 'utf-8'                    # the page is UTF-8; 'text/html' is a MIME type, not an encoding
html = res.text
soup = BeautifulSoup(html, "lxml")
for tr in soup.find('tbody').children:                # each <tr> in the ranking table is one school
    if isinstance(tr, bs4.element.Tag):               # skip the whitespace NavigableString nodes
        a = tr('a')                                   # the school name sits inside an <a> tag
        tds = tr('td')
        uinfo.append([tds[0].text.strip(), a[0].string.strip(), tds[2].text.strip(),
                      tds[3].text.strip(), tds[4].text.strip()])
tplt = "{0:^10}\t{1:^10}\t{2:^12}\t{3:^12}\t{4:^10}"
print(tplt.format("排名", "学校名称", "省份", "学校类型", "总分"))
for i in range(30):
    print(tplt.format(uinfo[i][0], uinfo[i][1],uinfo[i][2], uinfo[i][3], uinfo[i][4]))
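
Chinese characters render as double-width in most terminals, so the ^10 alignment above tends to drift. A common tweak, added here as my own suggestion rather than part of the assignment code, is to pad the Chinese columns with the full-width space chr(12288) passed in as an extra format argument; the row-printing loop can take the same extra argument.

# pad the Chinese columns with the full-width space so they line up in the terminal
tplt = "{0:^10}\t{1:{5}^10}\t{2:{5}^6}\t{3:{5}^6}\t{4:^10}"
print(tplt.format("排名", "学校名称", "省份", "学校类型", "总分", chr(12288)))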


Reflections

By sending requests with the requests library and parsing the returned HTML with BeautifulSoup, we can extract and process web data very conveniently. Crawling makes it quick to collect the data we need, such as the university ranking information here.
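
As a small extension of this idea, the request step can be made more defensive. The sketch below is my own addition, assuming a 10-second timeout is acceptable; raise_for_status() and apparent_encoding are standard requests features.

import requests

url = "https://www.shanghairanking.cn/rankings/bcur/2020"
try:
    res = requests.get(url, timeout=10)     # a timeout keeps the script from hanging
    res.raise_for_status()                  # raise on 4xx/5xx responses
    res.encoding = res.apparent_encoding    # guess the charset from the page body
    html = res.text
except requests.RequestException as exc:
    print("request failed:", exc)
    html = ""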

Assignment ②

Experiment

Code 1: crawling Dangdang

import requests
import urllib.parse
from bs4 import BeautifulSoup

print("输入页数:")
page=input()
url = 'http://search.dangdang.com/?key=%CA%E9%B0%FC&act=input&'
data = {
    'page_index': page,
}
data = urllib.parse.urlencode(data)
url = url + data
response = requests.get(url)
response.encoding = 'gb2312'   # Dangdang pages are GB2312/GBK encoded
content = response.text
soup = BeautifulSoup(content, 'lxml')
name = soup.select('p[class="name"] > a[title]')    # product titles
price = soup.select('span[class="price_n"]')        # current prices
for i in range(60):                                  # Dangdang lists 60 items per results page
    print(str(i) + "\t" + name[i].text + "\t" + price[i].text)
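
The key=%CA%E9%B0%FC part of the URL is just the keyword 书包 percent-encoded as GBK. Rather than hard-coding the bytes, the same value can be generated with urllib.parse.quote; this is only a sketch of that idea, assuming Dangdang expects GBK-encoded query keys.

import urllib.parse

keyword = urllib.parse.quote('书包', encoding='gbk')               # -> '%CA%E9%B0%FC'
params = urllib.parse.urlencode({'act': 'input', 'page_index': 1})
url = 'http://search.dangdang.com/?key=' + keyword + '&' + params
print(url)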

Code 2: crawling JD

import requests
import urllib.parse
from bs4 import BeautifulSoup

page = int(input())
url = 'https://search.jd.com/Search?keyword=%E4%B9%A6%E5%8C%85&qrst=1&wq=%E4%B9%A6%E5%8C%85&stock=1&pvid=92cd471d7ad04c26bf539da0881c3cd9&isList=0&'
data = {
    'page': page * 2 + 1,   # JD's page parameter advances by 2 per visible results page
}
headers = {   # User-Agent plus a logged-in Cookie copied from the browser; JD search requires login
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36',
    'Cookie': '__jdu=2112503369; areaId=16; ipLoc-djd=16-1303-0-0; shshshfpa=0c30a2b4-09e7-b60c-4448-ae7b8574600a-1695036268; shshshfpx=0c30a2b4-09e7-b60c-4448-ae7b8574600a-1695036268; rkv=1.0; mba_muid=2112503369; wlfstk_smdl=u8d7k3nnkhl69q6wx4l2fdk7sc7zvbup; _pst=t1235232; logintype=wx; unick=t1235232; pin=t1235232; npin=t1235232; thor=880DB8B476A1C203007F9AD33DB97F3781270D1A201163005718C83CDA8CEA7D5DD9AAFF7814C2FB14ABF26C3ABC0C40A8F5AE12FD8976BB5C2FD424719966195B1EE8C23C2A43940FD14C1FCD561CB331B1D9A53120FD56791C00E3EB6CE7D9731A490C63C22660E452B1ABF92E99C98941E5BA4DBBDB15E5389A80F681E6189B91F8E87E620C5685574C8C6FDF3EDD; flash=2_tvGki6ONru5zjl5vcZqnhv96BdS4CzyjHdu9T4s9Muas4OkoM_uqr4N8u8wPcmV258dOmpZ9M8PoL1LhVWxcUzwBmEzDOpiehC8-ZTrKNAq*; _tp=UMiYqkYi3p0bTG59GXOwoQ%3D%3D; pinId=oWM3qE-svHi1ffsfn98I-w; qrsc=3; jsavif=1; jsavif=1; 3AB9D23F7A4B3CSS=jdd03V5JDX6UC4BR3NEA5XLU2HQROHPM52HRBZNV22U5IZRMRJWQVMXYVLEV5I2XWSL4L3APOBM3UDO56LBT2N4KWV4OQUMAAAAMKVB43U6YAAAAACMGYBEMNKIVTHMX; unpl=JF8EALBnNSttCB5QA0kGTxEZGAoGWw4KS0cFaWRRVAoPTFwDEwUaGxV7XlVdXxRKHx9vbhRXXVNIXA4aCysSEXteXVdZDEsWC2tXVgQFDQ8VXURJQlZAFDNVCV9dSRZRZjJWBFtdT1xWSAYYRRMfDlAKDlhCR1FpMjVkXlh7VAQrARsSE09cVlxaAHsWM2hXNWRbXE9QDR8yGiIRex8AAl0NThEDaCoGVF1bT1UHGQUTIhF7Xg; __jdv=76161171|baidu-pinzhuan|t_288551095_baidupinzhuan|cpc|0f3d30c8dba7459bb52f2eb5eba8ac7d_0_ad57c5e28bfc4b31a772e1ff69796084|1695043751360; xapieid=jdd03V5JDX6UC4BR3NEA5XLU2HQROHPM52HRBZNV22U5IZRMRJWQVMXYVLEV5I2XWSL4L3APOBM3UDO56LBT2N4KWV4OQUMAAAAMKVB43U6YAAAAACMGYBEMNKIVTHMX; shshshsID=01f9cd0a9048c5743d0441a80845040c_2_1695043760884; __jda=122270672.2112503369.1695036258.1695036259.1695043667.2; __jdb=122270672.3.2112503369|2.1695043667; __jdc=122270672; shshshfpb=AAoE0e6iKEjCitAnntgxESK57hXRgChaVA2JoQwAAAAA; 3AB9D23F7A4B3C9B=V5JDX6UC4BR3NEA5XLU2HQROHPM52HRBZNV22U5IZRMRJWQVMXYVLEV5I2XWSL4L3APOBM3UDO56LBT2N4KWV4OQUM'}
data = urllib.parse.urlencode(data)
url = url + data
response = requests.get(url=url, headers=headers)
response.encoding = 'utf-8'
content = response.text
soup = BeautifulSoup(content, 'lxml')
name = soup.select('div[class="p-name p-name-type-2"] em')   # product titles
price = soup.select('div[class="p-price"] i')                 # prices
for i in range(30):                                            # 30 items load per JD page before scrolling
    print(str(i+1)+"\t"+name[i].text + "\t" + price[i].text)
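
Both print loops above assume a fixed number of results (60 and 30); if a page returns fewer items, the index lookup raises IndexError. A defensive variant of the last loop, reusing the name and price lists defined just above, pairs the two with zip() and stops at the shorter list.

# zip() stops at the shorter list, so a short page cannot raise IndexError
for i, (n, p) in enumerate(zip(name, price), start=1):
    print(str(i) + "\t" + n.text.strip() + "\t" + p.text.strip())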

Reflections

The requests library is very similar to urllib.request, but requests is more convenient to use, like an automatic versus a manual transmission.
JD requires login before search results can be viewed, so a Cookie has to be included in the headers.
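
One way to avoid pasting a long login cookie into the source is to read it from an environment variable. This is only a sketch; JD_COOKIE is a name I made up, and it has to be set in the shell before running the script.

import os
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36',
    'Cookie': os.environ.get('JD_COOKIE', ''),   # hypothetical env var holding the login cookie
}
response = requests.get('https://search.jd.com/Search?keyword=%E4%B9%A6%E5%8C%85',
                        headers=headers, timeout=10)
print(response.status_code)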

Assignment ③:

Requirement: crawl a given page (https://xcb.fzu.edu.cn/info/1071/4481.htm) or a page of your choice and download all files in JPEG and JPG format.
Output: save all JPEG and JPG files from the chosen page into a single folder.

Experiment

import urllib.request
import urllib.parse
from bs4 import BeautifulSoup
import os

url = 'https://xcb.fzu.edu.cn/info/1071/4481.htm'
request = urllib.request.Request(url=url)
response = urllib.request.urlopen(request)
content = response.read().decode('utf-8')
soup = BeautifulSoup(content, 'lxml')
name = soup.select('p[class="vsbcontent_img"] img')   # images inside the article body
path = "picture"
os.mkdir(path)
page = 0
for i in name:
    img_url = 'https://xcb.fzu.edu.cn' + str(i.get("src"))   # src is a site-relative path
    page = page + 1
    urllib.request.urlretrieve(img_url, os.path.join(path, str(page) + '.jpeg'))

Reflections

Possibly because of the website itself, downloading the images was very slow. At first I had put page = 0 inside the for loop, so every image got the same name and only one file was actually saved.
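
For reference, here is a sketch of the same download with those pitfalls avoided: enumerate() supplies the counter so it cannot be reset inside the loop, makedirs(exist_ok=True) tolerates an existing folder, and urljoin() plus splitext() keep each file's own URL and extension. Only the page URL and the CSS selector come from the assignment code; the rest is my own arrangement.

import os
import urllib.parse
import urllib.request
from bs4 import BeautifulSoup

page_url = 'https://xcb.fzu.edu.cn/info/1071/4481.htm'
html = urllib.request.urlopen(page_url).read().decode('utf-8')
soup = BeautifulSoup(html, 'lxml')

os.makedirs('picture', exist_ok=True)               # no error if the folder already exists
imgs = soup.select('p[class="vsbcontent_img"] img')
for count, img in enumerate(imgs, start=1):         # the counter comes from enumerate()
    img_url = urllib.parse.urljoin(page_url, img.get('src'))
    ext = os.path.splitext(img_url)[1] or '.jpg'    # keep the original extension (.jpg/.jpeg)
    urllib.request.urlretrieve(img_url, os.path.join('picture', str(count) + ext))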