
Scraping Biquge novels with Python

2021-04-26 14:32:03  Views: 293  Source: Internet

Tags: xpath, filePath, Python, title, tree, page, scraping, url, Biquge



# -*- coding:utf-8 -*-

# https://www.biquge.info/wanjiexiaoshuo/ Biquge full-novel scraper
import time
import requests
import os
import random
from lxml import etree
import webbrowser

header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36 Edg/89.0.774.77"
}
# Characters not allowed in file names: # / \ : * ? " < > |
noName = ['#', '/', '\\', ':', '*', '?', '"', '<', '>', '|']
filePath = './保存小说'  # output directory ("saved novels")
def strZ(_str):  # replace special characters with spaces
    ret = ''
    for _ in _str:
        if _ in noName:
            ret += " "
        else:
            ret += _
    return ret
def main():
    webbrowser.open('https://www.biquwx.la/')
    if not os.path.exists(filePath):
        os.mkdir(filePath)
    print('1. Scrape a specific novel')
    print('2. Scrape the whole site')
    if input('Which scraping mode? ') == '1':
        appintDown()
    else:
        allDown()
    input("Press Enter to exit")
def appintDown():  # scrape a specific novel; assumes the URL is a valid novel index page
    page_url = input('Enter the URL of the novel to scrape (e.g. https://www.biquwx.la/10_10240/): ')
    page = requests.get(url=page_url, headers=header)
    if page.status_code == 200:  # only scrape on a successful response
        page.encoding = 'utf-8'
        page_tree = etree.HTML(page.text)
        page_title = page_tree.xpath('//div[@id="info"]/h1/text()')[0]
        # Sanitize the title so it is safe to use as a directory name
        _filePath = filePath + '/' + strZ(page_title)
        if not os.path.exists(_filePath):
            os.mkdir(_filePath)
        page_dl_list = page_tree.xpath('//div[@class="box_con"]/div[@id="list"]/dl/dd')
        for _ in page_dl_list:
            _page_url = page_url + _.xpath('./a/@href')[0]
            # Save each chapter inside the novel's own directory
            _page_title = _filePath + '/' + strZ(_.xpath('./a/@title')[0]) + '.txt'
            _page = requests.get(_page_url, headers=header)
            if _page.status_code == 200:
                _page.encoding = 'utf-8'
                _tree = etree.HTML(_page.text)
                _page_content = _tree.xpath('//div[@id="content"]/text()')
                fileContent = ''
                for line in _page_content:
                    fileContent += line + '\n'
                with open(_page_title, 'w', encoding='utf-8') as fp:
                    fp.write(fileContent)
                print('%s downloaded successfully' % (_page_title))
            time.sleep(random.uniform(0.05, 0.2))  # short random pause between requests
def allDown():  # scrape every novel on the site
    url = 'https://www.biquge.info/wanjiexiaoshuo/'  # listing of novels
    page = requests.get(url=url, headers=header)
    if page.status_code == 200:  # only scrape on a successful response
        page.encoding = 'utf-8'
        tree = etree.HTML(page.text)
        page_last = tree.xpath('//div[@class="pagelink"]/a[@class="last"]/text()')[0]
        # Walk the listing pages; + 1 so the last page is included
        for page_i in range(1, int(page_last) + 1):
            url = 'https://www.biquge.info/wanjiexiaoshuo/' + str(page_i)
            page = requests.get(url=url, headers=header)
            if page.status_code == 200:
                page.encoding = 'utf-8'
                tree = etree.HTML(page.text)
                li_list = tree.xpath('//div[@class="novelslistss"]/ul/li')
                for li in li_list:
                    page_url = li.xpath('./span[@class="s2"]/a/@href')[0]  # link to the novel's index page
                    page_title = strZ(li.xpath('./span[@class="s2"]/a/text()')[0])
                    page = requests.get(url=page_url, headers=header)
                    if page.status_code == 200:
                        page.encoding = 'utf-8'
                        page_tree = etree.HTML(page.text)
                        _filePath = filePath + '/' + page_title
                        if not os.path.exists(_filePath):
                            os.mkdir(_filePath)
                        page_dl_list = page_tree.xpath('//div[@class="box_con"]/div[@id="list"]/dl/dd')
                        for _ in page_dl_list:
                            _page_url = page_url + _.xpath('./a/@href')[0]
                            # Save each chapter inside the novel's own directory
                            _page_title = _filePath + '/' + strZ(_.xpath('./a/@title')[0]) + '.txt'
                            _page = requests.get(_page_url, headers=header)
                            if _page.status_code == 200:
                                _page.encoding = 'utf-8'
                                _tree = etree.HTML(_page.text)
                                _page_content = _tree.xpath('//div[@id="content"]/text()')
                                fileContent = ''
                                for line in _page_content:
                                    fileContent += line + '\n'
                                with open(_page_title, 'w', encoding='utf-8') as fp:
                                    fp.write(fileContent)
                                print('%s downloaded successfully' % (_page_title))
                            time.sleep(random.uniform(0.05, 0.2))  # short random pause between requests
if __name__ == '__main__':
    main()
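
The script above builds file names by looping over characters and builds chapter URLs by string concatenation. As a rough sketch only, and not the original author's code, the same two steps could be written with the standard library. The names sanitize_name and chapter_url, and the chapter file name 5170374.html, are hypothetical and used purely for illustration. The character set assumes the same forbidden characters as noName, including '*'.

```python
from urllib.parse import urljoin

# Assumed to match the noName list in the script: # / \ : * ? " < > |
FORBIDDEN = '#/\\:*?"<>|'
_TABLE = str.maketrans({ch: " " for ch in FORBIDDEN})


def sanitize_name(name: str) -> str:
    """Replace characters that are unsafe in file names with spaces."""
    return name.translate(_TABLE)


def chapter_url(index_url: str, href: str) -> str:
    """Resolve a chapter link against the novel's index page URL."""
    return urljoin(index_url, href)


if __name__ == "__main__":
    print(sanitize_name("Chapter 1: What?"))
    # Hypothetical chapter links, relative and root-relative
    print(chapter_url("https://www.biquwx.la/10_10240/", "5170374.html"))
    print(chapter_url("https://www.biquwx.la/10_10240/", "/10_10240/5170374.html"))
```

urljoin handles both relative and root-relative href values, whereas plain concatenation only works when the href is relative and the index URL ends with a slash.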

Source: https://blog.csdn.net/qq_48123164/article/details/116154465
