How to scrape deeply nested links with Python BeautifulSoup

2019-11-20 21:56:27 Views: 297 Source: Internet

Tags: beautifulsoup web-scraping html-parsing html python

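I am trying to build a spider/web crawler for academic purposes that grabs text from academic publications and appends relevant links to a URL stack. I am trying to crawl one website, PubMed, but I can't seem to grab the links I need. Here is my code with a sample page; this page should be representative of the others in their database: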

 from bs4 import BeautifulSoup
 import requests

 website = 'http://www.ncbi.nlm.nih.gov/pubmed/?term=mtap+prmt'
 r = requests.get(website)
 # name the parser explicitly so the result does not depend on what is installed
 soup = BeautifulSoup(r.content, 'html.parser')

I have broken the HTML tree traversal into several variables for readability, so that each line fits within one screen width.

 key_text = soup.find('div', {'class':'grid'}).find('div',{'class':'col twelve_col nomargin shadow'}).find('form',{'id':'EntrezForm'})
 side_column = key_text.find('div', {'xmlns:xi':'http://www.w3.org/2001/XInclude'}).find('div', {'class':'supplemental col three_col last'})
 side_links = side_column.find('div').findAll('div')[1].find('div', {'id':'disc_col'}).findAll('div')[1]

 for link in side_links:
     print(link)

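If you look at the HTML source with Chrome's Inspect Element, "side_links" should contain several further nested divs that hold the links. However, the code above produces the following error: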

 Traceback (most recent call last):
 File "C:/Users/ballbag/Copy/web_scraping/google_search.py", line 22, in <module>
 side_links = side_column.find('div').findAll('div')[1].find('div',      {'id':'disc_col'}).findAll('div')[1]
 IndexError: list index out of range

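If you go to the URL, there is a column on the right called "Related links" that contains the URLs I want to scrape, but I can't seem to get to them. There is a declaration on the div I am trying to get into (presumably the xmlns:xi attribute in the code above), and I suspect it has something to do with this. Can anyone help me grab these links? I would really appreciate any pointers.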

Solution:

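The problem is that the side bar is loaded by an additional asynchronous request, so its links are not in the HTML of the main page.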
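
One way to check this is to look at what is actually present in the HTML that requests receives. The sketch below uses a small hand-written HTML string (hypothetical, standing in for the raw PubMed response; the real markup may differ) in which the disc_col div holds only a placeholder link and none of the nested divs that the browser inspector shows once the asynchronous request has completed. The href value is also made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the raw HTML returned by requests.get();
# the real PubMed markup may differ. Before any JavaScript runs, the
# side bar holds only a placeholder link.
RAW_HTML = """
<div id="disc_col">
  <a class="disc_col_ph" href="/pubmed/sidebar-placeholder">Loading ...</a>
</div>
"""

soup = BeautifulSoup(RAW_HTML, "html.parser")
disc_col = soup.find("div", {"id": "disc_col"})

# find_all() returns an empty result when nothing matches, so indexing
# it with [1] raises IndexError, as in the question.
nested_divs = disc_col.find_all("div")
print(len(nested_divs))

# The placeholder link, however, is present in the static HTML.
placeholders = soup.select("div#disc_col a.disc_col_ph")
print(placeholders[0]["href"])
```

If the count is zero while the placeholder link is found, the nested content is being filled in after page load, and the placeholder's href is the thing to follow.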

The idea here is to:

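>use requests.Session to maintain a web-scraping session
>parse out the URL that is used to fetch the side bar
>follow that link and get the links from the div with class="portlet_content".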

Code:

from urllib.parse import urljoin  # Python 3; in Python 2 this lived in `urlparse`

from bs4 import BeautifulSoup
import requests


base_url = 'http://www.ncbi.nlm.nih.gov'
website = 'http://www.ncbi.nlm.nih.gov/pubmed/?term=mtap+prmt'

# parse the main page and grab the link to the side bar
session = requests.Session()
soup = BeautifulSoup(session.get(website).content, 'html.parser')

url = urljoin(base_url, soup.select('div#disc_col a.disc_col_ph')[0]['href'])

# parse the side bar and print each link's text with its absolute URL
soup = BeautifulSoup(session.get(url).content, 'html.parser')

for a in soup.select('div.portlet_content ul li.brieflinkpopper a'):
    print(a.text, urljoin(base_url, a.get('href')))

Prints:

The metabolite 5'-methylthioadenosine signals through the adenosine receptor A2B in melanoma. http://www.ncbi.nlm.nih.gov/pubmed/25087184
Down-regulation of methylthioadenosine phosphorylase (MTAP) induces progression of hepatocellular carcinoma via accumulation of 5'-deoxy-5'-methylthioadenosine (MTA). http://www.ncbi.nlm.nih.gov/pubmed/21356366
Quantitative analysis of 5'-deoxy-5'-methylthioadenosine in melanoma cells by liquid chromatography-stable isotope ratio tandem mass spectrometry. http://www.ncbi.nlm.nih.gov/pubmed/18996776
...
Cited in PMC http://www.ncbi.nlm.nih.gov/pmc/articles/pmid/23265702/citedby/?tool=pubmed
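
Since the question mentions appending the links to a URL stack, here is a minimal sketch of that last step. It is not part of the original answer: `extract_links`, `push_new`, and the sample HTML fragment are hypothetical names and data introduced for illustration, and the fragment is shaped only to match the selector used above, not copied from PubMed.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "http://www.ncbi.nlm.nih.gov"

# Hypothetical side bar fragment, shaped to match the selector used above.
SIDEBAR_HTML = """
<div class="portlet_content">
  <ul>
    <li class="brieflinkpopper"><a href="/pubmed/25087184">First article</a></li>
    <li class="brieflinkpopper"><a href="/pubmed/21356366">Second article</a></li>
    <li class="brieflinkpopper"><a href="/pubmed/25087184">First article</a></li>
  </ul>
</div>
"""


def extract_links(html, base_url=BASE_URL):
    """Return absolute URLs for the related links in a side bar fragment."""
    soup = BeautifulSoup(html, "html.parser")
    anchors = soup.select("div.portlet_content ul li.brieflinkpopper a")
    return [urljoin(base_url, a["href"]) for a in anchors if a.get("href")]


def push_new(stack, seen, urls):
    """Push URLs not seen before onto the stack, preserving their order."""
    for url in urls:
        if url not in seen:
            seen.add(url)
            stack.append(url)


url_stack, seen_urls = [], set()
push_new(url_stack, seen_urls, extract_links(SIDEBAR_HTML))
print(url_stack)
```

Keeping a separate set of seen URLs means a link that appears on several pages is queued only once, which matters for a crawler that follows related-article links back and forth.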

Source: https://codeday.me/bug/20191120/2047233.html
