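Technical approach: requests-bs4-re
Choosing candidate data sites:
Selection criteria: the stock information must be present statically in the HTML page, not generated by JavaScript, and the site must not restrict crawling through the Robots protocol
Selection methods: browser developer tools (F12), viewing the page source, and so on
Selection mindset: do not get stuck on one particular site; look for several information sources and try them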
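The Robots protocol criterion can also be checked in code. Below is a minimal sketch using the standard library's urllib.robotparser. The helper name is_allowed, the sample rules, the user agent string and the example.com URLs are all illustrative assumptions, not taken from any site discussed in this article.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, url, user_agent='my-crawler'):
    """Return True if the given robots.txt rules permit fetching url.

    robots_lines is the robots.txt content split into lines (hypothetical input).
    """
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


# Made-up rules: everything is allowed except the /private/ directory.
sample_robots = [
    'User-agent: *',
    'Disallow: /private/',
]

print(is_allowed(sample_robots, 'http://example.com/stock/600000'))
print(is_allowed(sample_robots, 'http://example.com/private/data'))
```

For a real site, the same parser can load the rules itself with set_url() followed by read(), instead of being handed the lines.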
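The data sites were settled as follows:
Getting the stock list:
Getting individual stock information:
Step 1: get the stock list from Eastmoney (东方财富网)
Step 2: for each entry in the stock list, get the individual stock information from Baidu Stocks (百度股票)
Step 3: store the results in a file
Individual stock information is maintained as key-value pairs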
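Maintaining the stock information as key-value pairs can be sketched as follows. This is only an illustration under assumptions: the helper parse_stock_info and the sample text are hypothetical, and the field names symbol, nameCN and latestPrice are the ones the script in this article searches for, which a real page may or may not contain in this form.

```python
import json
import re

# One pattern per field; each captures only the value part.
FIELDS = {
    'symbol': r'"symbol":"(\d{6})"',
    'nameCN': r'"nameCN":"(.*?)"',
    'latestPrice': r'"latestPrice":([\d.]+)',
}


def parse_stock_info(html):
    """Return a dict holding the first match of every field found in html."""
    info = {}
    for key, pattern in FIELDS.items():
        match = re.search(pattern, html)
        if match:
            info[key] = match.group(1)
    return info


# Made-up page fragment standing in for a downloaded stock page.
sample_html = '{"symbol":"600000","nameCN":"Example Bank","latestPrice":10.25}'
info = parse_stock_info(sample_html)

# One JSON object per line keeps the key-value structure in the output file.
line = json.dumps(info, ensure_ascii=False)
print(line)
```

A dict makes missing fields explicit (the key is simply absent), whereas a positional list silently shifts when one field is not found.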
import re
import traceback

import requests
from bs4 import BeautifulSoup


def getHTMLText(url):
    """Fetch a page and return its text, or '' if the request fails."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        print('Error!')
        return ''  # an empty string lets callers skip this page


def getStockList(lst, stockURL):
    """Append the six-digit codes found in 'sh......' links on the list page to lst."""
    html = getHTMLText(stockURL)
    soup = BeautifulSoup(html, 'html.parser')
    for a in soup.find_all('a'):
        try:
            href = a.attrs['href']
            code = re.findall(r'sh\d{6}', href)[0].replace('sh', '')
        except (KeyError, IndexError):
            continue  # no href attribute, or the href holds no stock code
        # Skip a code that repeats the previous one.
        if lst and lst[-1] == code:
            continue
        lst.append(code)


def getStockInfo(lst, stockURL, fpath):
    """Fetch each stock page and append code, name and latest price to fpath."""
    regex_symbol = r'"symbol":"\d{6}"'
    regex_name = r'"nameCN":".*?"'
    regex_latestPrice = r'"latestPrice":[\d.]*'
    tmpl = '{0:^10}{1:{3}^6}{2:^8}\n'
    count = 0
    for stock in lst:
        url = stockURL + stock
        html = getHTMLText(url)
        try:
            if html == '':
                continue
            stockInfo = []
            for match in re.finditer(regex_symbol, html):
                stockInfo.append(match.group(0).replace('"symbol":', ''))
            for match in re.finditer(regex_name, html):
                stockInfo.append(match.group(0).replace('"nameCN":', ''))
            for match in re.finditer(regex_latestPrice, html):
                stockInfo.append(match.group(0).replace('"latestPrice":', ''))
            with open(fpath, 'a', encoding='utf-8') as f:
                if count == 0:
                    # Header row (code, stock name, latest price). It is written
                    # before the first record so that record is not lost.
                    # chr(12288) is the full-width space used to pad Chinese text.
                    f.write(tmpl.format('代码', '股票名称', '最新价', chr(12288)))
                f.write(tmpl.format(stockInfo[0].replace('"', ''),
                                    stockInfo[1].replace('"', ''),
                                    stockInfo[2], chr(12288)))
            count += 1
        except Exception:
            traceback.print_exc()  # print the full exception details
            continue


def main():
    stock_list_html = 'http://app.finance.ifeng.com/list/stock.php?t=ha&f=chg_pct&o=desc&p=1'
    stock_info_url = 'https://www.laohu8.com/stock/'
    output_file = 'E://Users/Yang SiCheng/PycharmProjects/Graduation_Project/StockList.txt'
    slist = []
    getStockList(slist, stock_list_html)
    getStockInfo(slist, stock_info_url, output_file)
    # print(slist)


if __name__ == '__main__':
    main()
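The code-extraction step in getStockList can be tried without any network access. The sketch below isolates that logic. The helper extract_codes and the example.com links are hypothetical, made up only to show how the regular expression and the consecutive-duplicate check behave.

```python
import re

# 'sh' followed by six digits; the group captures only the digits.
SH_CODE = re.compile(r'sh(\d{6})')


def extract_codes(hrefs):
    """Return the six-digit codes found in hrefs, skipping consecutive repeats."""
    codes = []
    for href in hrefs:
        match = SH_CODE.search(href)
        if match is None:
            continue  # this link does not point at a stock page
        code = match.group(1)
        if codes and codes[-1] == code:
            continue  # same code as the previous link
        codes.append(code)
    return codes


# Made-up links standing in for the href values of the <a> tags on a list page.
sample = [
    'http://example.com/sh600000.html',
    'http://example.com/sh600000.html',
    'http://example.com/about',
    'http://example.com/sh601398.html',
]
print(extract_codes(sample))  # ['600000', '601398']
```

Note that only adjacent repeats are dropped. A code that reappears later in the page would be added again, which matches the behaviour of the script.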
In the end, a txt file is produced at the target path.
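The program spends most of its run time waiting, so how can the user experience be improved?
First, speed it up. r.apparent_encoding has to analyse the page content to guess the encoding, so when the encoding is already known, assign it directly in getHTMLText:
code = 'utf-8'
r.encoding = code
Second, print the current progress while the stock information is being written: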
with open(fpath, 'a', encoding='utf-8') as f:
    tmpl = '{0:^10}{1:{3}^6}{2:^8}\n'
    if count == 0:
        # Header row: code, stock name, latest price.
        f.write(tmpl.format('代码', '股票名称', '最新价', chr(12288)))
    # '\r' moves the cursor back to the start of the line, so each new
    # percentage ("当前进度" means "current progress") overwrites the last one.
    print('\r当前进度:{:.2f}%'.format(count*100/len(lst)), end='')
    f.write(tmpl.format(str(stockInfo[0]).replace('"', ''),
                        str(stockInfo[1]).replace('"', ''),
                        str(stockInfo[2]), chr(12288)))
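The in-place progress display relies on the carriage return '\r' together with end=''. A small standalone sketch follows. The helper progress_text and the five-item loop are hypothetical and only simulate the crawl.

```python
import time


def progress_text(done, total):
    """Format the completed fraction as a percentage with two decimals."""
    if total <= 0:
        return 'Progress: 0.00%'  # avoid dividing by zero on an empty list
    return 'Progress: {:.2f}%'.format(done * 100 / total)


total = 5  # stands in for len(lst)
for done in range(1, total + 1):
    time.sleep(0.1)  # stands in for fetching one stock page
    # '\r' returns to the line start; end='' keeps the cursor on the same line.
    print('\r' + progress_text(done, total), end='', flush=True)
print()  # finish with a newline once the loop is done
```

Passing the number of items already finished (rather than the index of the current one) lets the display reach 100% on the last item.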
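Using the requests-bs4-re approach, the stock information was crawled and stored, and a dynamically updating progress display shows how far the crawl has advanced.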
Reposted from: http://sgvrn.baihongyu.com/