Reading PDF files with Python

Date: 2020-08-22

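Introduction to pdfplumber

pdfplumber is a library for working with the contents of PDF files. It can look up detailed information about each text character, rectangle, and line, and it can also extract tables and debug the extraction visually.

Documentation: https://github.com/jsvine/pdfplumber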

Installing pdfplumber

Install it directly with pip. At the command line, enter:

pip install pdfplumber
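
To confirm that the install worked, one option is to ask Python which version it sees. The sketch below uses only the standard library's importlib.metadata; the helper name installed_version is made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints the version string if pdfplumber is installed, otherwise None.
print(installed_version("pdfplumber"))
```

A result of None suggests that pip installed into a different Python environment than the one running the script.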

Visual debugging additionally requires ImageMagick.
pdfplumber GitHub: https://github.com/jsvine/pdfplumber
ImageMagick install guide:
http://docs.wand-py.org/en/latest/guide/install.html#install-imagemagick-windows
(The official site does not offer 6.x there; 6.x binaries: https://imagemagick.org/download/binaries/)

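(Note: after I installed ImageMagick, it raised errors in use. A reference I found online indicated that the 6.x release should be installed and that 7.x fails, hence the 6.x link above.)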

If to_image raises DelegateException when writing an image, install the 32-bit build of GhostScript. (Note: be sure to download the 32-bit version, even if both Windows and Python are 64-bit.)
GhostScript: https://www.ghostscript.com/download/gsdnld.html

Basic usage

import pdfplumber
with pdfplumber.open("path/file.pdf") as pdf:
    first_page = pdf.pages[0]  # get the first page
    print(first_page.chars[0])  # dictionary describing the first character on that page

The pdfplumber.PDF class exposes two main properties, .metadata and .pages.
metadata is a dictionary of information about the PDF.
pages is a list with one entry per page.

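Each pdfplumber.Page object has several main properties.
page_number: the page number
width: the page width
height: the page height
.objects/.chars/.lines/.rects: each of these properties is a list of dictionaries, one per object on the page, describing lines, characters, rectangles and so on, including their positions.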
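
The properties listed above can be gathered into a quick per-page summary. This is a minimal sketch: summarize_page is a hypothetical helper, and fake_page is a stand-in with the same attribute names so that the snippet runs without a PDF on disk.

```python
from types import SimpleNamespace


def summarize_page(page):
    """Collect the basic attributes of a pdfplumber page into a plain dict."""
    return {
        "page_number": page.page_number,
        "size": (page.width, page.height),
        "n_chars": len(page.chars),
        "n_lines": len(page.lines),
        "n_rects": len(page.rects),
    }


# Stand-in object with made-up values; a real page comes from pdf.pages.
fake_page = SimpleNamespace(
    page_number=1, width=595, height=842,
    chars=[{"text": "A", "x0": 72, "top": 72}], lines=[], rects=[],
)
print(summarize_page(fake_page))
```

With a real file, the same helper would be called as summarize_page(pdf.pages[0]) inside the with pdfplumber.open(...) block shown earlier.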

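Common methods

extract_text() extracts the text of the page, collating all of the page's character objects into a single string
extract_words() returns all words on the page along with their associated information
extract_tables() extracts the tables on the page
to_image() is used for visual debugging and returns an instance of the PageImage class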
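
Each word returned by extract_words() is a dictionary that carries the text plus its position (x0, x1, top, bottom). As a rough sketch of what that position data allows, the hypothetical helper below keeps only the words inside a given box; the sample words are hand-written with made-up coordinates.

```python
def words_in_region(words, bbox):
    """Keep the words whose boxes lie fully inside bbox = (x0, top, x1, bottom)."""
    x0, top, x1, bottom = bbox
    return [
        w for w in words
        if w["x0"] >= x0 and w["x1"] <= x1 and w["top"] >= top and w["bottom"] <= bottom
    ]


# Hand-written sample in the shape extract_words() returns (values are made up).
sample_words = [
    {"text": "Title", "x0": 72, "x1": 110, "top": 40, "bottom": 52},
    {"text": "Body", "x0": 72, "x1": 100, "top": 300, "bottom": 312},
]
header = words_in_region(sample_words, (0, 0, 595, 100))
print([w["text"] for w in header])
```

pdfplumber can also restrict a page to a region before extraction via page.crop and page.within_bbox, which is usually the more direct route; the filter here is only meant to show how the word dictionaries are laid out.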

Common parameters

table_settings

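Table extraction settings

By default, extract_tables uses the page's vertical and horizontal lines (or rectangle edges) as cell separators. The method is highly customizable through the table_settings argument. The available settings and their defaults: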

{
    "vertical_strategy": "lines", 
    "horizontal_strategy": "lines",
    "explicit_vertical_lines": [],
    "explicit_horizontal_lines": [],
    "snap_tolerance": 3,
    "join_tolerance": 3,
    "edge_min_length": 3,
    "min_words_vertical": 3,
    "min_words_horizontal": 1,
    "keep_blank_chars": False,
    "text_tolerance": 3,
    "text_x_tolerance": None,
    "text_y_tolerance": None,
    "intersection_tolerance": 3,
    "intersection_x_tolerance": None,
    "intersection_y_tolerance": None,
}
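
Only the keys that differ from the defaults need to be passed in table_settings. Since a misspelled key is easy to miss, a small guard like the following can help. It is a sketch only: checked_settings and KNOWN_KEYS are made-up names, and the key list is simply copied from the defaults above, so it may not match every pdfplumber version.

```python
# Key names copied from the defaults listed above.
KNOWN_KEYS = {
    "vertical_strategy", "horizontal_strategy",
    "explicit_vertical_lines", "explicit_horizontal_lines",
    "snap_tolerance", "join_tolerance", "edge_min_length",
    "min_words_vertical", "min_words_horizontal", "keep_blank_chars",
    "text_tolerance", "text_x_tolerance", "text_y_tolerance",
    "intersection_tolerance", "intersection_x_tolerance", "intersection_y_tolerance",
}


def checked_settings(**overrides):
    """Return the overrides as a dict, rejecting keys that are not in KNOWN_KEYS."""
    unknown = sorted(set(overrides) - KNOWN_KEYS)
    if unknown:
        raise ValueError(f"unknown table_settings keys: {unknown}")
    return dict(overrides)


settings = checked_settings(vertical_strategy="text", snap_tolerance=5)
print(settings)
```

The resulting dict would then be passed as page.extract_tables(table_settings=settings).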

Table extraction strategies

Options for vertical_strategy and horizontal_strategy

`"lines"`	Use the page's graphical lines — including the sides of rectangle objects — as the borders of potential table-cells.
`"lines_strict"`	Use the page's graphical lines — but not the sides of rectangle objects — as the borders of potential table-cells.
`"text"`	For `vertical_strategy`: Deduce the (imaginary) lines that connect the left, right, or center of words on the page, and use those lines as the borders of potential table-cells. For `horizontal_strategy`, the same but using the tops of words.
`"explicit"`	Only use the lines explicitly defined in `explicit_vertical_lines` / `explicit_horizontal_lines`.
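
The "explicit" strategy is handy when a table has no drawn borders but the column and row positions are known. A minimal sketch follows; explicit_grid_settings is a hypothetical helper, and the coordinates are made up rather than measured from a real PDF.

```python
def explicit_grid_settings(column_xs, row_ys):
    """Build table_settings that use only the given x and y coordinates as cell borders."""
    return {
        "vertical_strategy": "explicit",
        "horizontal_strategy": "explicit",
        "explicit_vertical_lines": sorted(column_xs),
        "explicit_horizontal_lines": sorted(row_ys),
    }


# Made-up coordinates for a grid with two columns and two rows.
settings = explicit_grid_settings([72, 300, 523], [100, 130, 160])
print(settings["explicit_vertical_lines"])
```

In practice the coordinates would come from inspecting the page, for example from the x0 and top values of its words, and the dict would be passed as page.extract_tables(table_settings=settings).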

Examples

Extracting text

import pdfplumber

with pdfplumber.open("E:\\600aaa_2.pdf") as pdf:
    page_count = len(pdf.pages)
    print(page_count)  # number of pages
    for page in pdf.pages:
        print('---------- Page [%d] ----------' % page.page_number)
        # Get all text on the current page, including text inside tables
        print(page.extract_text())
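
To keep the text rather than print it, the per-page strings can be joined into one string. The sketch below makes one assumption worth flagging: depending on the pdfplumber version, a page without extractable text may give None instead of an empty string, so the code guards against that. collect_text is a hypothetical helper, and the fake pages only imitate the two members it uses.

```python
from types import SimpleNamespace


def collect_text(pages):
    """Join the text of all pages, each preceded by a page marker line."""
    parts = []
    for page in pages:
        text = page.extract_text() or ""  # treat a page without text as empty
        parts.append(f"--- page {page.page_number} ---\n{text}")
    return "\n".join(parts)


# Stand-ins for pdf.pages so the sketch runs without a PDF file.
fake_pages = [
    SimpleNamespace(page_number=1, extract_text=lambda: "hello"),
    SimpleNamespace(page_number=2, extract_text=lambda: None),
]
print(collect_text(fake_pages))
```

With a real document, collect_text(pdf.pages) would be called inside the with block and the result written to a file.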

Extracting tables

import re

import pdfplumber

with pdfplumber.open("E:\\600aaa_1.pdf") as pdf:
    page_count = len(pdf.pages)
    print(page_count)  # number of pages
    for page in pdf.pages:
        print('---------- Page [%d] ----------' % page.page_number)

        for pdf_table in page.extract_tables(table_settings={
                "vertical_strategy": "text",
                "horizontal_strategy": "lines",
                "intersection_tolerance": 20}):  # tolerance for treating edges as intersecting when cells are built

            # print(pdf_table)
            for row in pdf_table:
                # Strip carriage returns, newlines, and other whitespace from each cell
                print([re.sub(r'\s+', '', cell) if cell is not None else None for cell in row])
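
Each extracted table is a list of rows, and each row is a list of cell strings (or None for empty cells). If the first row holds the column names, the rows can be turned into dictionaries. That is an assumption about the table layout, not something pdfplumber guarantees. The helper table_to_records and the sample data are made up for illustration; unlike the loop above, it maps empty cells to "" rather than keeping None.

```python
import re


def table_to_records(table):
    """Turn a table (list of rows) into dicts keyed by the cleaned header row."""
    def clean(cell):
        # Empty cells come back as None; strip all whitespace from the rest.
        return "" if cell is None else re.sub(r"\s+", "", cell)

    if not table:
        return []
    header = [clean(c) for c in table[0]]
    return [dict(zip(header, (clean(c) for c in row))) for row in table[1:]]


# Hand-written sample in the shape extract_tables() yields for one table.
sample_table = [["Name", "Amo\nunt"], ["A", "1 000"], ["B", None]]
print(table_to_records(sample_table))
```

The same rows also load into pandas with pd.DataFrame(table[1:], columns=table[0]) if a DataFrame is preferred.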

Partly based on: https://blog.csdn.net/Elaine_jm/article/details/84841233