Reading a PDF with Python and exporting it to TXT or HTML

2022-09-08 18:01:38

Article link: https://www.cnblogs.com/tujia/p/16670374.html

1. Install pdfminer.six

pip install pdfminer.six
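
To confirm that the package actually landed in the current environment, a quick check along these lines may help. This is only a sketch: `installed_version` is a hypothetical helper name, not part of pdfminer.six, and it relies on the standard-library `importlib.metadata` module (Python 3.8 or later).

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "pdfminer.six" is the distribution name used with pip above.
    found = installed_version("pdfminer.six")
    print(found if found else "pdfminer.six is not installed")
```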

2. Read the PDF from Python code

from io import StringIO
from pdfminer.layout import LAParams
from pdfminer.high_level import extract_text_to_fp


output_string = StringIO()

with open('test.pdf', 'rb') as fin:
    # Export as plain text
    # extract_text_to_fp(fin, output_string)
    # Export as HTML
    extract_text_to_fp(fin, output_string, laparams=LAParams(), output_type='html', codec=None)


with open('test.html', 'w', encoding='utf-8') as f:
    f.write(output_string.getvalue().strip())

Official documentation:

https://pdfminersix.readthedocs.io/en/latest/tutorial/highlevel.html

https://pdfminersix.readthedocs.io/en/latest/reference/highlevel.html
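
The snippet above can be wrapped in a small function so that the same call serves both output formats. The following is a minimal sketch under a few assumptions: `default_output_path` and `convert_pdf` are hypothetical names introduced here, the output file is written next to the input PDF, and `LAParams()` is passed for both formats (the commented-out plain-text call in the snippet uses no layout parameters).

```python
from io import StringIO
from pathlib import Path


def default_output_path(pdf_path, output_type="text"):
    """Derive an output file name next to the PDF: .txt for text, .html for html."""
    suffix = {"text": ".txt", "html": ".html"}[output_type]
    return Path(pdf_path).with_suffix(suffix)


def convert_pdf(pdf_path, output_type="text"):
    """Convert one PDF to text or HTML and return the path that was written."""
    # Imported here so the path helper above works even without pdfminer.six.
    from pdfminer.high_level import extract_text_to_fp
    from pdfminer.layout import LAParams

    out_path = default_output_path(pdf_path, output_type)
    buffer = StringIO()
    with open(pdf_path, "rb") as fin:
        # codec=None, as in the original call, so HTML output can be
        # written to a text buffer instead of a binary one.
        extract_text_to_fp(fin, buffer, laparams=LAParams(),
                           output_type=output_type, codec=None)
    out_path.write_text(buffer.getvalue().strip(), encoding="utf-8")
    return out_path
```

With this in place, `convert_pdf('test.pdf', 'html')` would write `test.html`, and `convert_pdf('test.pdf')` would write `test.txt`.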

3. Read the PDF with the command-line script

https://pdfminersix.readthedocs.io/en/latest/tutorial/commandline.html

https://pdfminersix.readthedocs.io/en/latest/reference/commandline.html

Details are omitted here; see the two pages linked above.
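
As a rough illustration of the command-line route, pdfminer.six installs a `pdf2txt.py` script that takes `-t` for the output type and `-o` for the output file. The sketch below drives it from Python; `build_pdf2txt_command` and `run_pdf2txt` are hypothetical helper names, and it assumes `pdf2txt.py` is on the PATH and directly executable, which may not hold on every platform.

```python
import subprocess


def build_pdf2txt_command(pdf_path, out_path, output_type="html"):
    """Build the argument list for pdfminer.six's pdf2txt.py script."""
    # -t selects the output type, -o names the output file.
    return ["pdf2txt.py", "-t", output_type, "-o", out_path, pdf_path]


def run_pdf2txt(pdf_path, out_path, output_type="html"):
    """Run pdf2txt.py; raises CalledProcessError if the script exits non-zero."""
    subprocess.run(build_pdf2txt_command(pdf_path, out_path, output_type), check=True)
```

The equivalent shell command for the default arguments would be `pdf2txt.py -t html -o test.html test.pdf`.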

End.