Today we will use Python, a SQLite database, and the crontab tool to deploy a scraper to a dedicated server, where it will scrape and store data on a schedule.
Writing the scraper code
First we write the scraper. The requests and beautifulsoup4 packages fetch and parse the pages, and pandas then presents the parsed data as a DataFrame.
```python
import datetime

import requests
from bs4 import BeautifulSoup
import pandas as pd


def get_price_ranks():
    # One timestamp shared by all 200 rows of this scrape
    current_dt = datetime.datetime.now().strftime("%Y-%m-%d %X")
    current_dts = [current_dt for _ in range(200)]
    stock_types = ["tse", "otc"]
    # NOTE: the URL as given contains no "{}" placeholder, so .format(st)
    # leaves it unchanged and both requests go to the same address.
    price_rank_urls = ["http://jshk.com.cn/".format(st) for st in stock_types]
    tickers = []
    stocks = []
    prices = []
    volumes = []
    mkt_values = []
    ttl_steps = 10 * 100  # 100 rows per page, 10 cells per row
    each_step = 10        # stride from one row to the same column in the next row
    for pr_url in price_rank_urls:
        r = requests.get(pr_url)
        soup = BeautifulSoup(r.text, "html.parser")
        # Each ".name a" link text is "<ticker> <stock name>"
        names = [i.text.split() for i in soup.select(".name a")]
        tickers += [n[0] for n in names]
        stocks += [n[1] for n in names]
        # Look up the flat list of cells once instead of once per value
        cells = soup.find_all("td")[2].find_all("td")
        prices += [float(cells[i].text) for i in range(5, 5 + ttl_steps, each_step)]
        volumes += [int(cells[i].text.replace(",", "")) for i in range(11, 11 + ttl_steps, each_step)]
        # Market value is scaled by 10^8
        mkt_values += [float(cells[i].text) * 100000000 for i in range(12, 12 + ttl_steps, each_step)]
    # "上市" = listed (tse), "上櫃" = over-the-counter (otc)
    types = ["上市" for _ in range(100)] + ["上櫃" for _ in range(100)]
    ky_registered = ["KY" in st for st in stocks]
    df = pd.DataFrame({
        "scrapingTime": current_dts,
        "type": types,
        "kyRegistered": ky_registered,
        "ticker": tickers,
        "stock": stocks,
        "price": prices,
        "volume": volumes,
        "mktValue": mkt_values,
    })
    return df


price_ranks = get_price_ranks()
print(price_ranks.shape)
```
The output is:
## (200, 8)
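The scraper picks each column out of one flat list of table cells by stepping through it with a fixed stride (each_step). The sketch below illustrates that indexing on a small hand-written list. The cell values, cells_per_row, and n_rows are hypothetical stand-ins, not data from the real page.

```python
# Hypothetical flat list of cell texts, in document order:
# row 1 (name, price, volume), then row 2 (name, price, volume).
cells = ["AAA", "10.5", "1,200", "BBB", "20.0", "3,400"]

cells_per_row = 3                # plays the role of each_step
n_rows = 2
total = cells_per_row * n_rows   # plays the role of ttl_steps

# Column k of every row sits at indices k, k + cells_per_row, k + 2 * cells_per_row, ...
names = [cells[i] for i in range(0, total, cells_per_row)]
prices = [float(cells[i]) for i in range(1, 1 + total, cells_per_row)]
volumes = [int(cells[i].replace(",", "")) for i in range(2, 2 + total, cells_per_row)]

print(names)    # ['AAA', 'BBB']
print(prices)   # [10.5, 20.0]
print(volumes)  # [1200, 3400]
```

This approach depends on every row having exactly the same number of cells. If the page layout changes, the offsets (5, 11, and 12 in the scraper) would need to be rechecked.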
Next, we use pandas to look at the first and last few rows:
price_ranks.head()
price_ranks.tail()
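As a rough illustration of what these two calls return, the sketch below builds a placeholder 200-row frame and inspects both ends. The frame demo and its values are made up for the example and are not scraped data.

```python
import pandas as pd

# Placeholder frame with 200 rows; the values are illustrative only.
demo = pd.DataFrame({
    "ticker": [f"T{i:03d}" for i in range(200)],
    "price": [float(i) for i in range(200)],
})

first_rows = demo.head()   # first 5 rows by default
last_rows = demo.tail(3)   # an explicit count can be passed: here the last 3 rows

print(first_rows)
print(last_rows)
```

Checking both ends of the frame is a quick way to confirm that the first and the last group of rows were parsed into the expected columns.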
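Next, we deploy the scraper to the server.
Choosing a server and configuring its environment are outside the scope of this lesson; the focus here is on setting up the scheduled task.
First we modify the code so that the results are stored in SQLite: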
```python
import datetime
import sqlite3

import requests
from bs4 import BeautifulSoup
import pandas as pd


def get_price_ranks():
    # One timestamp shared by all 200 rows of this scrape
    current_dt = datetime.datetime.now().strftime("%Y-%m-%d %X")
    current_dts = [current_dt for _ in range(200)]
    stock_types = ["tse", "otc"]
    # NOTE: the URL as given contains no "{}" placeholder, so .format(st)
    # leaves it unchanged and both requests go to the same address.
    price_rank_urls = ["http://jshk.com.cn/".format(st) for st in stock_types]
    tickers = []
    stocks = []
    prices = []
    volumes = []
    mkt_values = []
    ttl_steps = 10 * 100  # 100 rows per page, 10 cells per row
    each_step = 10        # stride from one row to the same column in the next row
    for pr_url in price_rank_urls:
        r = requests.get(pr_url)
        soup = BeautifulSoup(r.text, "html.parser")
        # Each ".name a" link text is "<ticker> <stock name>"
        names = [i.text.split() for i in soup.select(".name a")]
        tickers += [n[0] for n in names]
        stocks += [n[1] for n in names]
        # Look up the flat list of cells once instead of once per value
        cells = soup.find_all("td")[2].find_all("td")
        prices += [float(cells[i].text) for i in range(5, 5 + ttl_steps, each_step)]
        volumes += [int(cells[i].text.replace(",", "")) for i in range(11, 11 + ttl_steps, each_step)]
        # Market value is scaled by 10^8
        mkt_values += [float(cells[i].text) * 100000000 for i in range(12, 12 + ttl_steps, each_step)]
    # "上市" = listed (tse), "上櫃" = over-the-counter (otc)
    types = ["上市" for _ in range(100)] + ["上櫃" for _ in range(100)]
    ky_registered = ["KY" in st for st in stocks]
    df = pd.DataFrame({
        "scrapingTime": current_dts,
        "type": types,
        "kyRegistered": ky_registered,
        "ticker": tickers,
        "stock": stocks,
        "price": prices,
        "volume": volumes,
        "mktValue": mkt_values,
    })
    return df


price_ranks = get_price_ranks()
# Database file path as given in the original
conn = sqlite3.connect('/home/ubuntu/jshk.com.cn')
# Append this run's rows; the table is created on the first run
price_ranks.to_sql("price_ranks", conn, if_exists="append", index=False)
conn.close()
```
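Because the script runs repeatedly, if_exists="append" matters: each run adds its rows to the same table instead of replacing it. The following is a minimal sketch of that behaviour, assuming an in-memory database and two small placeholder batches (batch_1 and batch_2, with made-up values) in place of real scrapes.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")  # in-memory database, used only for this demo

# Two placeholder batches standing in for two scheduled runs.
batch_1 = pd.DataFrame({
    "scrapingTime": ["2020-01-01 09:30:00"] * 2,
    "ticker": ["T001", "T002"],
    "price": [10.0, 20.0],
})
batch_2 = batch_1.assign(scrapingTime="2020-01-01 10:30:00")

for batch in (batch_1, batch_2):
    # Creates the table on the first call and adds rows on later calls
    batch.to_sql("price_ranks", conn, if_exists="append", index=False)

# Count how many rows each run contributed
rows_per_run = pd.read_sql(
    "SELECT scrapingTime, COUNT(*) AS n FROM price_ranks "
    "GROUP BY scrapingTime ORDER BY scrapingTime",
    conn,
)
print(rows_per_run)
conn.close()
```

A query like this against the real database file is one way to confirm, after a day of scheduled runs, that every run wrote the expected number of rows.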
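To run the script on a schedule, we use the Linux crontab command.
Suppose we want it to run once an hour between 9:30 and 16:30 every day.
First save the script as price_rank_scraper.py,
then add the following line to the crontab file (opened with crontab -e):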
30 9-16 * * * /home/ubuntu/miniconda3/bin/python /home/ubuntu/price_rank_scraper.py
With that, the scheduled scraper is in place.
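To make the schedule concrete, the short sketch below expands the first two fields of the crontab entry 30 9-16 * * * into the clock times at which the job fires each day. It only mirrors how those two fields are read. It is not a cron parser, and the variable names are illustrative.

```python
# Fields of "30 9-16 * * *":
#   minute = 30, hour = 9-16, day of month = *, month = *, day of week = *
minute = 30
hours = range(9, 17)  # cron's 9-16 range includes both endpoints

run_times = [f"{hour:02d}:{minute:02d}" for hour in hours]
print(run_times)
# ['09:30', '10:30', '11:30', '12:30', '13:30', '14:30', '15:30', '16:30']
```

So the scraper runs eight times a day, and each run appends its own batch of rows to the price_ranks table.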