How to Get a Web Page's Source Code More Easily with Python: A Selenium Page Source Tutorial

2023-02-04 19:08:46 Source: Internet

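When auditing a website, you often need to retrieve the source code of its pages, and doing so is an important part of test automation with Python. In this article, the iCode9 editor explores how to get the page source with Selenium WebDriver and demonstrates how Selenium can fetch an XML page source from Python. Let's take a look.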

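Retrieving the page source of a website under scrutiny is a day-to-day task for most test automation engineers. Analyzing the page source helps eliminate bugs found during regular website testing, functional testing, or security testing drills. In an extensively complex application testing process, automation test scripts can be written so that, if an error is detected in the program, they automatically:
Save the source code of that particular page.
Notify the person responsible for the page URL.
Extract the HTML source of a specific element or code block and delegate it to the responsible owner, if the error occurred in one particular, independent HTML web element or code block.
This is an easy way to track down and fix logical and syntactic errors in front-end code. In this article, we first go over the terminology involved and then explore how to get the page source in Selenium WebDriver using Python.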
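The first two automated steps listed above (save the page source, report the page URL) could be sketched roughly as follows. This is a minimal illustration, not a fixed recipe: the function name save_page_source_on_error, the check callback, and the out_dir folder are hypothetical names chosen for this example. The only thing assumed about driver is that it exposes page_source and current_url, as Selenium WebDriver instances do.

```python
from datetime import datetime, timezone
from pathlib import Path


def save_page_source_on_error(driver, check, out_dir="failed_pages"):
    """Run check(driver); if it raises, save driver.page_source and re-raise.

    `check` and `out_dir` are hypothetical names used only in this sketch.
    """
    try:
        return check(driver)
    except Exception:
        folder = Path(out_dir)
        folder.mkdir(parents=True, exist_ok=True)
        # A UTC timestamp keeps file names unique across failures.
        stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
        target = folder / f"page_source_{stamp}.html"
        target.write_text(driver.page_source, encoding="utf-8")
        # Stand-in for a real notification (mail, chat, issue tracker).
        print(f"Check failed on {driver.current_url}; source saved to {target}")
        raise
```

A test would wrap each assertion in a small function and pass it as check, so the page source is captured at the moment the assertion fails.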
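What Is an HTML Page Source?
In non-technical terms, it is a set of instructions that tells a browser how to display information on the screen in an aesthetically pleasing way. Each browser interprets these instructions in its own way to render the screen for the client. The instructions are usually written in HyperText Markup Language (HTML), Cascading Style Sheets (CSS), and JavaScript.
The entire set of HTML instructions that makes up a web page is called the page source, the HTML source, or simply the source code. A website's source code is the collection of the source code of its individual web pages.
Here is an example of the source code for a basic page with a title, a form, an image, and a submit button.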
    <!DOCTYPE html>
    <html>
    <head>
        <title>Page Source Example - LambdaTest</title>
    </head>
    <body>
        <h2>Debug selenium testing results : LambdaTest</h2>
        <img loading="lazy" src="https://cdn.lambdatest.com/assetsnew/images/debug-selenium-testing-results.jpg" alt="debug selenium testing" width="550" height="500"><br><br>
        <form action="/">
            <label for="debug">Do you debug test results using LambdaTest?</label><br>
            <input type="text" id="debug" name="debug" value="Of-course!"><br>
            <br>
            <input type="submit" value="Submit">
        </form>
        <br><br>
        <button type="button" onclick="alert('Page Source Example : LambdaTest!')">Click Me!</button>
    </body>
    </html>

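What Is an HTML Web Element?
The simplest way to describe an HTML web element is: "any HTML tag that makes up the HTML page source is a web element." It can be a block of HTML code, a standalone HTML tag such as <br>, a media object on the web page (an image, audio, or video), a JS function, or a JSON object wrapped in <script> </script> tags.
In the example above, <title> is an HTML web element, and the children of the body tag, such as <img> and <button>, are HTML web elements too.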
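To make the idea of "every tag is a web element" concrete, here is a small sketch that lists the tags found in a page source. It uses only Python's standard html.parser module and needs no browser. The sample markup is a shortened version of the example page, and the names TagCollector and list_tags are made up for this illustration.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect the name of every start tag, in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def list_tags(html):
    """Return the tag names that appear in `html`, in document order."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


# Shortened version of the example page (assumption: trimmed for brevity).
SAMPLE = """<!DOCTYPE html>
<html>
<head><title>Page Source Example - LambdaTest</title></head>
<body>
<h2>Debug selenium testing results : LambdaTest</h2>
<form action="/">
<input type="text" id="debug" name="debug" value="Of-course!"><br>
<input type="submit" value="Submit">
</form>
<button type="button">Click Me!</button>
</body>
</html>"""

print(list_tags(SAMPLE))
```

Each name in the printed list corresponds to one web element that a Selenium locator could target.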
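How to Get the Page Source in Selenium WebDriver Using Python
Selenium WebDriver is a robust automation testing tool that gives automation test engineers a diverse set of ready-to-use APIs. To get the page source, the Selenium Python bindings provide a driver property called page_source, which returns the HTML source of the URL currently active in the browser.
Alternatively, we can load the page source with the get function of Python's requests library. Another way is to execute JavaScript through the driver method execute_script and have Selenium WebDriver return the page source in Python. An approach that is not recommended is to use XPath in tandem with a "view-source:" URL. Let's explore examples of these four ways to get the page source in Selenium WebDriver using Python.
All four examples use a small sample web page hosted on GitHub. The page was created to demonstrate drag-and-drop testing in Selenium Python using LambdaTest.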
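Get the HTML Page Source Using driver.page_source
We fetch pynishant.github.io with ChromeDriver and save its content to a file named page_source.html. The file name can be anything you choose. Next, we read the file's content and print it to the terminal before closing the driver:
    from selenium import webdriver

    driver = webdriver.Chrome()
    driver.maximize_window()
    driver.get("https://pynishant.github.io/")

    # page_source holds the HTML of the currently loaded page
    pageSource = driver.page_source

    # Save the source to a file
    with open("page_source.html", "w", encoding="utf-8") as fileToWrite:
        fileToWrite.write(pageSource)

    # Read the file back and print it
    with open("page_source.html", "r", encoding="utf-8") as fileToRead:
        print(fileToRead.read())

    driver.quit()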

On successful execution of the script above, the terminal prints the page source.
Get the HTML Page Source Using driver.execute_script
In the previous example, comment out (or replace) the driver.page_source line and add the line below. driver.execute_script is the Selenium Python WebDriver API for executing JS in a Selenium environment. Here, we execute a JS script that returns the inner HTML of the body element.
    # pageSource = driver.page_source
    pageSource = driver.execute_script("return document.body.innerHTML;")

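As the output shows, this returns only the innerHTML of the body element. Unlike the previous output, we do not get the whole page source. To get the entire document, we execute document.documentElement.outerHTML. The execute_script line now looks like this:
    pageSource = driver.execute_script("return document.documentElement.outerHTML;")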

This gives us the same page source we got using driver.page_source.
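Get the Page Source Using Python's requests Library
This method has nothing to do with Selenium; it is a purely Pythonic way to get a web page's source. Here, we use Python's requests library to make a GET request to the URL, save the response (that is, the page source) to an HTML file, and print it on the terminal.
Here is the script:
    import requests

    url = 'https://pynishant.github.io/'
    pythonResponse = requests.get(url)

    # Save the response body (the page source) to a file
    with open("py_source.html", "w", encoding="utf-8") as fileToWrite:
        fileToWrite.write(pythonResponse.text)

    # Read the file back and print it
    with open("py_source.html", "r", encoding="utf-8") as fileToRead:
        print(fileToRead.read())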

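This method can be used to quickly store a web page's source code without loading the page in a Selenium-controlled browser. Similarly, we can use Python's urllib library to get the HTML page source.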
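For the urllib variant mentioned above, a minimal sketch might look like the following. The function name fetch_page_source and the 10-second default timeout are assumptions made for this example. Falling back to UTF-8 when the server declares no charset is also a simplification.

```python
from urllib.request import urlopen


def fetch_page_source(url, timeout=10):
    """Return the body served at `url` as text, without starting a browser."""
    with urlopen(url, timeout=timeout) as response:
        # Use the charset declared in the response headers, else assume UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling fetch_page_source("https://pynishant.github.io/") would return the HTML as the server sends it. As with requests, no JavaScript is executed, so the result can differ from driver.page_source on pages that build their content in the browser.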
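Get the HTML Page Source Using a "view-source" URL
This is rarely needed, but during manual testing you can prefix the target URL with view-source: and load it in a browser window to view the source code and save it.
Programmatically, to take a screenshot of the source code in Python Selenium (if needed), you can load the page with:
    driver.get("view-source:https://pynishant.github.io/")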

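Get the HTML Page Source in Selenium Python WebDriver Using XPath
The fourth way to make Selenium WebDriver get the page source is to use XPath. Here, instead of using page_source or executing JavaScript, we identify the source element, that is <html>, and extract it. Comment out the previous page source fetching logic and replace it with the following:
    from selenium.webdriver.common.by import By

    # pageSource = driver.page_source
    # find_element_by_xpath was removed in Selenium 4.3; use find_element with By.XPATH
    pageSource = driver.find_element(By.XPATH, "//*").get_attribute("outerHTML")

In the script above, we use the driver method find_element with the By.XPATH strategy to locate the web page's HTML element. We pass the source node "//*", whose first match is the document's root element, and get its outer HTML, which is the document itself. The output looks the same as the one we got earlier using driver.page_source.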
How to Retrieve the HTML Source of a WebElement in Selenium
To get the HTML source of a WebElement in Selenium WebDriver, we can use the get_attribute method of the Selenium Python WebDriver. First, we grab the HTML WebElement using a driver locator method such as find_element with By.XPATH or By.CSS_SELECTOR. Next, we apply get_attribute() to the grabbed element to get its HTML source.
Suppose that, from pynishant.github.io, we want to grab and print the source code of the div with the ID "div1". The code looks like this:
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    driver.maximize_window()
    driver.get("https://pynishant.github.io/")

    # Locate the div by its ID and read its full markup
    elementSource = driver.find_element(By.ID, "div1").get_attribute("outerHTML")
    print(elementSource)

    driver.quit()

The script prints the outer HTML of that div.
Similarly, to get the children, or innerHTML, of a WebElement:
    driver.find_element(By.ID, "some_id_or_selector").get_attribute("innerHTML")

There is another way to do this and get the same result:
    elementSource = driver.find_element(By.ID, "id_selector_as_per_requirement")
    driver.execute_script("return arguments[0].innerHTML;", elementSource)

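How to Retrieve JSON Data from an HTML Page Source in Python Selenium WebDriver
Modern applications are built with multiple APIs, and these APIs often change the content of HTML elements dynamically. JSON objects have emerged as an alternative to XML response types. So a professional Selenium Python tester needs to handle JSON objects, especially those embedded in <script> HTML tags. Python provides a built-in json library for working with JSON objects.
To demonstrate, we load "https://www.cntraveller.in/" in the Selenium driver and look at the SEO schema contained in <script type="application/ld+json"> </script> to verify that the logo URL is included in the JSON schema. By the way, in case you are wondering, this "SEO schema" helps web pages rank on Google. It has nothing to do with code logic or testing; we use it only for the demonstration.
We will use LambdaTest for this demo: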
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    import json
    import re

    # Replace the placeholders with your own LambdaTest credentials
    username = "your_username_on_lambdaTest"
    accessToken = "your lambdaTest access token"
    gridUrl = "hub.lambdatest.com/wd/hub"

    desired_cap = {
        'platform': "win10",
        'browserName': "chrome",
        'version': "71.0",
        "resolution": "1024x768",
        "name": "LambdaTest json object test ",
        "build": "LambdaTest json object test",
        "network": True,
        "video": True,
        "visual": True,
        "console": True,
    }

    url = "https://" + username + ":" + accessToken + "@" + gridUrl
    print("Initiating remote driver on platform: " + desired_cap["platform"] + " browser: " + desired_cap["browserName"] + " version: " + desired_cap["version"])

    # desired_capabilities is the Selenium 3 style argument;
    # recent Selenium 4 releases expect an options object instead
    driver = webdriver.Remote(
        desired_capabilities=desired_cap,
        command_executor=url
    )
    # driver = webdriver.Chrome()
    driver.maximize_window()
    driver.get("https://www.cntraveller.in/")

    # Locate the JSON <script> that mentions a logo and read its text
    jsonSource = driver.find_element(By.XPATH, "//script[contains(text(),'logo') and contains(@type, 'json')]").get_attribute('text')
    # Remove semicolons so the string parses as JSON
    jsonSource = re.sub(";", "", jsonSource)
    jsonSource = json.loads(jsonSource)

    if "logo" in jsonSource:
        print("\n logoURL : " + str(jsonSource["logo"]))
    else:
        print("JSON Schema has no logo url.")

    try:
        if "telephone" in jsonSource:
            print(jsonSource["telephone"])
        else:
            print("No Telephone - here is the source code :\n")
            print(driver.find_element(By.XPATH, "//script[contains(text(),'logo') and contains(@type, 'json')]").get_attribute('outerHTML'))
    except Exception as e:
        print(e)

    driver.quit()

The output contains the logoURL and the WebElement source.
Code Breakdown
The following lines import the required libraries: Selenium WebDriver and its By locator class, plus Python's json and re libraries for handling JSON objects and using regular expressions:
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    import json
    import re

Next, we configure the script to run on LambdaTest's cloud. It took me less than thirty seconds to get started (perhaps because I had prior experience with the platform), but even a first-timer should need less than a minute. Register on the LambdaTest website, log in with Google, and click "Profile" to copy your username and access token:
    username = "your_username_on_lambdaTest"
    accessToken = "your lambdaTest access token"
    gridUrl = "hub.lambdatest.com/wd/hub"

    desired_cap = {
        'platform': "win10",
        'browserName': "chrome",
        'version': "71.0",
        "resolution": "1024x768",
        "name": "LambdaTest json object test ",
        "build": "LambdaTest json object test",
        "network": True,
        "video": True,
        "visual": True,
        "console": True,
    }

    url = "https://" + username + ":" + accessToken + "@" + gridUrl

We launch the driver, maximize the window, and load the cntraveller home page with the following lines:
    driver = webdriver.Remote(
        desired_capabilities=desired_cap,
        command_executor=url
    )
    # driver = webdriver.Chrome()
    driver.maximize_window()
    driver.get("https://www.cntraveller.in/")

Now, we use an XPath locator to find the script element that contains the JSON object, and remove the unnecessary semicolons so that the string loads correctly as JSON:
    jsonSource = driver.find_element(By.XPATH, "//script[contains(text(),'logo') and contains(@type, 'json')]").get_attribute('text')
    jsonSource = re.sub(";", "", jsonSource)
    jsonSource = json.loads(jsonSource)

Then, we check whether the logo URL is present. If it is, we print it:
    if "logo" in jsonSource:
        print("\n logoURL : " + str(jsonSource["logo"]))
    else:
        print("JSON Schema has no logo url.")

We also check whether the telephone detail is present. If it is not, we print the source code of the WebElement:
    try:
        if "telephone" in jsonSource:
            print(jsonSource["telephone"])
        else:
            print("No Telephone - here is the source code :\n")
            print(driver.find_element(By.XPATH, "//script[contains(text(),'logo') and contains(@type, 'json')]").get_attribute('outerHTML'))
    except Exception as e:
        print(e)

Finally, we quit the driver:
    driver.quit()
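As a complement to the XPath lookup above, the same JSON data could also be pulled out of a page source string that has already been saved, with no browser involved. The sketch below uses only Python's standard library. The names JsonLdCollector and extract_json_ld are hypothetical. It also strips only a trailing semicolon rather than every semicolon, which is a different choice from the script above.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self._chunks = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._inside = True
            self._chunks = []

    def handle_data(self, data):
        if self._inside:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._inside:
            self.blocks.append("".join(self._chunks))
            self._inside = False


def extract_json_ld(html):
    """Return the parsed JSON-LD objects found in `html`; skip invalid blocks."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    parsed = []
    for block in collector.blocks:
        try:
            # Strip surrounding whitespace and a stray trailing semicolon only.
            parsed.append(json.loads(block.strip().rstrip(";")))
        except json.JSONDecodeError:
            continue
    return parsed
```

With a live session, extract_json_ld(driver.page_source) would return a list of parsed objects that can be checked for keys such as "logo" or "telephone".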
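How to Get the Page Source as XML in Selenium WebDriver
If you are loading a website that is rendered as XML, you may want to save the XML response. Here is a working solution for making Selenium get the XML page source:
    driver.execute_script('return document.getElementById("webkit-xml-viewer-source-xml").innerHTML')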
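The element ID in the call above belongs to the XML viewer that Chrome renders for raw XML responses, so it may be absent in other browsers or on pages that are not shown in that viewer. A defensive sketch, assuming that falling back to driver.page_source is acceptable in that case, could look like this. The names get_xml_source and parse_xml_source are hypothetical. The only thing assumed about driver is that it exposes execute_script and page_source, as Selenium WebDriver instances do.

```python
import xml.etree.ElementTree as ET

# Returns the viewer's XML text, or null when the viewer element is absent.
XML_VIEWER_SCRIPT = (
    "var node = document.getElementById('webkit-xml-viewer-source-xml');"
    "return node ? node.innerHTML : null;"
)


def get_xml_source(driver):
    """Return the XML text from the browser's XML viewer, else page_source."""
    source = driver.execute_script(XML_VIEWER_SCRIPT)
    return source if source is not None else driver.page_source


def parse_xml_source(driver):
    """Parse the retrieved XML into an ElementTree root element."""
    return ET.fromstring(get_xml_source(driver))
```

The parsed root can then be queried with the usual ElementTree methods, such as find and iter, instead of working on the raw string.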

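Conclusion
You can use any of the methods demonstrated above and leverage the agility and scalability of the LambdaTest Selenium Grid cloud to automate your testing process. It lets you run test cases on more than 3,000 browsers, operating systems, and their versions. You can also integrate the automated testing flow with modern CI/CD tools and follow continuous testing best practices.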

Tags: Python crawler, selenium, get web page source code, automated testing
