2023 Data Collection and Fusion Technology Practice: Assignment 1

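Task 1:
Requirement: Use the requests and BeautifulSoup libraries to crawl the given URL (http://www.shanghairanking.cn/rankings/bcur/2020) and print the scraped university ranking information to the screen.
Output:

Rank	School name	Province/City	School type	Total score

1	清华大学	北京	综合	852.5

The code is as follows: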

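from bs4 import BeautifulSoup
import urllib.request

ls = []  # collected rows; each row is a list of cell strings

def collect(url, number):
	# Fetch the page and decode the response body (UTF-8 by default)
	response = urllib.request.urlopen(url)
	html = response.read().decode()
	soup = BeautifulSoup(html, "lxml")
	tbody = soup.find("tbody")
	if tbody is None:
		return  # the returned HTML contains no ranking table
	# Collect one list of cell texts per table row, stopping after `number` rows
	for count, row in enumerate(tbody.find_all("tr"), start=1):
		ls.append([td.text.strip() for td in row.find_all("td")])
		if count == number:
			break

def printf():
	print("Rank\tSchool name\tProvince/City\tSchool type\tTotal score\n")
	for x in ls:
		# The name cell may also hold the English name and tags, so keep only
		# the first whitespace-separated token instead of a fixed 4-character slice
		print(x[0]+" "+(x[1].split() or [""])[0]+" "+x[2]+" "+x[3]+" "+x[4]+"\n")

if __name__ == "__main__":
	number = int(input("Last two digits of your student ID: "))
	url = "http://www.shanghairanking.cn/rankings/bcur/2020"
	collect(url, number)
	printf()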

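The output is as follows:

Reflections:
This experiment showed me how useful Python web crawlers can be and deepened my understanding of bs4. I also learned some of the simpler parts of urllib.request and the BeautifulSoup library, including how to use find, BeautifulSoup, and request, which I found very rewarding.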
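
The task statement names requests, whereas the listing above fetches the page with urllib.request. The following is a minimal sketch of the same extraction built on requests, not the author's method. The helper names fetch_html and parse_rankings and the SAMPLE markup are assumptions made for illustration, and the second sample row is a placeholder rather than real ranking data. Parsing is kept separate from fetching so it can be tried on an inline string without network access.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page with requests and return its text (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    # Use the encoding detected from the body instead of the header default
    response.encoding = response.apparent_encoding
    return response.text


def parse_rankings(html, limit):
    """Return up to `limit` table rows, each as a list of cell texts."""
    soup = BeautifulSoup(html, "html.parser")
    tbody = soup.find("tbody")  # find returns the first match or None
    if tbody is None:
        return []
    rows = []
    for tr in tbody.find_all("tr")[:limit]:  # find_all returns every match
        # split() followed by join collapses the whitespace inside each cell
        rows.append([" ".join(td.get_text().split()) for td in tr.find_all("td")])
    return rows


# Made-up markup standing in for the real page; the second row is a placeholder
SAMPLE = """
<table><tbody>
<tr><td>1</td><td>清华大学 Tsinghua University</td><td>北京</td><td>综合</td><td>852.5</td></tr>
<tr><td>2</td><td>示例大学 Example University</td><td>示例</td><td>综合</td><td>0.0</td></tr>
</tbody></table>
"""

for rank, name, province, kind, score in parse_rankings(SAMPLE, limit=1):
    # Keep only the first token of the name cell, i.e. the Chinese name
    print(rank, name.split()[0], province, kind, score)
```

For the live page, parse_rankings(fetch_html(url), number) would take the place of the inline sample, assuming the server still returns the table in its HTML.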

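Task 2:

Requirement: Use the requests and re libraries to build a targeted price-comparison crawler for an online store of your choice. Search the store for the keyword "书包" (schoolbag) and scrape the product names and prices from the results page.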
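
The keyword has to be percent-encoded before it can go into a search URL. The sketch below shows one way to do that with the standard library, assuming the store expects the query string in GBK rather than UTF-8, which is what the key parameter in the search URL used for this task suggests. The URL layout is copied from that URL and is otherwise an assumption.

```python
from urllib.parse import quote, unquote

keyword = "书包"

# Percent-encode the keyword from its GBK bytes instead of the UTF-8 default
encoded = quote(keyword, encoding="gbk")
print(encoded)

# Assumed URL layout, copied from the search URL used in this task
url = "http://search.dangdang.com/?key=" + encoded + "&act=input"
print(url)

# Decoding with the same codec recovers the original keyword
print(unquote(encoded, encoding="gbk"))
```
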
The code is as follows:

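import requests
import re

if __name__ == "__main__":
	# The key parameter is the GBK percent-encoding of the keyword "书包"
	url = "http://search.dangdang.com/?key=%CA%E9%B0%FC&act=input"
	headers = {
		'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2045.36'
	}
	response = requests.get(url=url, headers=headers)
	# Use the encoding detected from the body in case the charset is not declared
	response.encoding = response.apparent_encoding
	html = response.text
	# Capture the product title and the price that follows the yen entity
	ex = '<p class="name" name="title" ><a title="(.*?)"'
	ey = '<span class="price_n">&yen;(.*?)</span>'
	names = re.findall(ex, html, re.S)
	prices = re.findall(ey, html, re.S)
	print("No. Price Product name")
	# zip stops at the shorter list, so a missing price cannot raise IndexError
	for i, (name, price) in enumerate(zip(names, prices), start=1):
		print(str(i)+" "+str(price)+" "+name)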

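The output is as follows:

Reflections:
I learned how to extract the content I need with regular expressions. I also tried selecting the same content with soup, which helped me understand how the two approaches are alike and how they differ. Regular expressions work well when the text before and after the target is distinctive, because one pattern can then pull out many items that share the same structure.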
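
To make the comparison concrete, here is a small sketch that extracts the same fields both ways. The SAMPLE fragment, its product names, and its prices are made up to imitate the markup that the two patterns in the listing above target; they are not taken from the real page. The function names are also assumptions for illustration.

```python
import re
from bs4 import BeautifulSoup

# Made-up fragment imitating the markup matched by the two patterns
SAMPLE = '''
<li><p class="name" name="title" ><a title="Backpack A" href="#">Backpack A</a></p>
<p class="price"><span class="price_n">&yen;59.00</span></p></li>
<li><p class="name" name="title" ><a title="Backpack B" href="#">Backpack B</a></p>
<p class="price"><span class="price_n">&yen;128.50</span></p></li>
'''


def extract_with_re(html):
    """Rely on the literal text surrounding each value."""
    names = re.findall(r'<p class="name" name="title" ><a title="(.*?)"', html, re.S)
    prices = re.findall(r'<span class="price_n">&yen;(.*?)</span>', html, re.S)
    return list(zip(names, prices))


def extract_with_soup(html):
    """Rely on the tag structure and class names instead."""
    soup = BeautifulSoup(html, "html.parser")
    names = []
    for p in soup.find_all("p", class_="name"):
        a = p.find("a")
        if a is not None and a.has_attr("title"):
            names.append(a["title"])
    # The parser decodes &yen; to the yen sign (U+00A5), so strip that character
    prices = [s.get_text().lstrip("\u00a5")
              for s in soup.find_all("span", class_="price_n")]
    return list(zip(names, prices))


print(extract_with_re(SAMPLE))
print(extract_with_soup(SAMPLE))
```

The regex version breaks if the site changes attribute order or spacing, while the parser version only depends on the tag and class names. Both assume names and prices appear in matching order.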

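Task 3:
Requirement: Download all JPEG and JPG files from a given web page (https://xcb.fzu.edu.cn/info/1071/4481.htm) or from a page of your choice.
Output: Save all JPEG and JPG files from the chosen page into a single folder.
The code is as follows: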

import os
import requests
from bs4 import BeautifulSoup

class ImageDownloader:
	def __init__(self, url, save_path='downloaded_images'):
		self.url = url
		self.save_path = save_path

	def fetch_content(self):
		response = requests.get(self.url)
		response.raise_for_status()
		return response.content

	def extract_images(self, content):
		soup = BeautifulSoup(content, 'html.parser')
		return soup.find_all('img')

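	def download_image(self, img_url):
		# Compare the extension case-insensitively and ignore any query string
		if img_url.split('?')[0].lower().endswith(('.jpeg', '.jpg')):
			# urljoin resolves relative paths and leaves absolute URLs unchanged
			img_url = requests.compat.urljoin(self.url, img_url)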

			response = requests.get(img_url, stream=True)
			response.raise_for_status()

			# Clean up the filename to remove query parameters
			clean_filename = os.path.basename(img_url.split('?')[0])

			with open(os.path.join(self.save_path, clean_filename), 'wb') as f:
				for chunk in response.iter_content(8192):
					f.write(chunk)

	def ensure_save_path_exists(self):
		# exist_ok avoids an error when the folder is already there
		os.makedirs(self.save_path, exist_ok=True)

	def run(self):
		self.ensure_save_path_exists()
		content = self.fetch_content()
		img_tags = self.extract_images(content)
		for img in img_tags:
			# Skip <img> tags that have no src attribute
			src = img.get('src')
			if src:
				self.download_image(src)
		print("Download complete!")

if __name__ == "__main__":
	URL = 'https://xcb.fzu.edu.cn/info/1071/4481.htm'
	downloader = ImageDownloader(URL)
	downloader.run()

The result is as follows:

Reflections:
This task made me more fluent with requests and taught me how to download images with a crawler.
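
The part of image crawling that is easiest to get wrong is deciding which src values are JPEG files and turning them into absolute URLs. Below is a minimal standard-library sketch of that step, under the assumption that a JPEG is identified by its path extension alone. The function name resolve_jpeg and the sample src values are hypothetical and do not come from the real page.

```python
import posixpath
from urllib.parse import urljoin, urlsplit

JPEG_EXTENSIONS = (".jpg", ".jpeg")


def resolve_jpeg(page_url, src):
    """Return (absolute_url, filename) for a JPEG src, or None otherwise."""
    absolute = urljoin(page_url, src)
    # urlsplit separates the path from any query string or fragment
    path = urlsplit(absolute).path
    if not path.lower().endswith(JPEG_EXTENSIONS):
        return None
    return absolute, posixpath.basename(path)


page = "https://xcb.fzu.edu.cn/info/1071/4481.htm"
# Hypothetical src values, not taken from the real page
samples = ["/images/a.JPG?v=2", "../pic/b.jpeg", "logo.png", "https://example.com/c.jpg"]
for src in samples:
    print(src, "->", resolve_jpeg(page, src))
```

Checking the extension on the parsed path rather than on the raw string keeps URLs with query parameters or upper-case extensions from being skipped.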