【Question title】: PDFBox - Removing invisible text (by clip/filling paths issue)
【Posted】: 2018-06-03 03:17:20
【Question description】:

Sample PDF link: click here. In it you can see that many of the labels on the left are cut off (because of some clipping instructions).

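When I use PDFTextStripper, it prints all the text, including text that is actually clipped/hidden in the sample PDF file. I have tried the solution described here, but it is even worse, because a lot of text at the top plus some text at the beginning of each row gets deleted. Is there any other way to use PDFBox to show only the visible characters and skip everything that is overlapped? Or is there perhaps some other tool that returns only the visible text? Thanks in advance.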

【Question discussion】:

    Tags: java pdf pdfbox


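    【Solution 1】:

    The reason the PDFVisibleTextStripper from this answer (referenced by the OP) does not work here is that the calculation of the end of each character's baseline in the overridden processTextPosition does not take page rotation into account. If you change that method to test only the start of each character's baseline and ignore the end, however, it works quite well for the document at hand: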

    @Override
    protected void processTextPosition(TextPosition text) {
        // Start of the character's baseline: the text matrix applied to the origin.
        Matrix textMatrix = text.getTextMatrix();
        Vector start = textMatrix.transform(new Vector(0, 0));
    
        // Keep the character only if there is no clip path or the start lies inside it.
        PDGraphicsState gs = getGraphicsState();
        Area area = gs.getCurrentClippingPath();
        if (area == null || area.contains(start.getX(), start.getY()))
            super.processTextPosition(text);
    }
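
To illustrate why an end-of-baseline test can go wrong on rotated text, here is a minimal sketch that uses only java.awt.geom and no PDFBox classes. It assumes the end point is computed by simply adding the character width to the x coordinate of the start; that is an assumption made for illustration, not a quote of the stripper's code. The clip rectangle, the coordinates, the width and the class name BaselineRotationDemo are all hypothetical.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Area;
import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;

public class BaselineRotationDemo {
    // Hypothetical clip region: a narrow, tall strip in page coordinates.
    static Area clip() {
        return new Area(new Rectangle2D.Double(0, 0, 100, 500));
    }

    // End of the baseline computed as if text always ran left to right (assumed naive variant).
    static Point2D naiveEnd(Point2D start, double width) {
        return new Point2D.Double(start.getX() + width, start.getY());
    }

    // End of the baseline computed along the actual, rotated writing direction.
    static Point2D rotatedEnd(Point2D start, double width, double angleRadians) {
        AffineTransform rotation = AffineTransform.getRotateInstance(angleRadians);
        Point2D offset = rotation.deltaTransform(new Point2D.Double(width, 0), null);
        return new Point2D.Double(start.getX() + offset.getX(), start.getY() + offset.getY());
    }

    public static void main(String[] args) {
        Area clip = clip();
        Point2D start = new Point2D.Double(90, 100); // baseline start near the right edge of the clip
        double width = 20;                           // hypothetical character width
        double quarterTurn = Math.PI / 2;            // text written bottom to top

        System.out.println("start inside clip:       " + clip.contains(start));
        System.out.println("naive end inside clip:   " + clip.contains(naiveEnd(start, width)));
        System.out.println("rotated end inside clip: " + clip.contains(rotatedEnd(start, width, quarterTurn)));
    }
}
```

In this setup the naive end point lands outside the clip strip although the rotated glyph lies entirely inside it, so a test requiring both points would drop a visible character. Testing only the start point avoids that, at the cost of keeping characters that are only partially visible.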
    

    The result of text extraction with this processTextPosition override (with SortByPosition set to true) is:

    Profit & Loss 12 Month Recap
    Property: 8151 W. 183rd Street
    Monthly recap 05/01/16 - 04/30/17  (cash basis)
    MAY 16 JUN 16 JUL 16 AUG 16 SEP 16 OCT 16 NOV 16 DEC 16 JAN 17 FEB 17 MAR 17 APR 17 TOTAL
    INCOME
        4000 RENTAL INCOME
            4001 Base Rent 343,002.59 38,045.11 38,045.11 38,045.11 66,081.36 122,153.86 66,081.36 38,045.11 0.00 76,090.22 38,598.49 66,634.74 930,823.06
            4004 Prepaid Rent Inco -165,742.50 0.00 0.00 0.00 0.00 0.00 0.00 0.00 38,045.11 -38,045.11 0.00 0.00 -165,742.50
            4000 Total RENTAL INC 177,260.09 38,045.11 38,045.11 38,045.11 66,081.36 122,153.86 66,081.36 38,045.11 38,045.11 38,045.11 38,598.49 66,634.74 765,080.56
        4200 INCOME CHARGEB
            4205 Property Tax Reco 4,250.00 4,250.00 4,250.00 4,250.00 4,250.00 4,250.00 4,250.00 4,250.00 4,250.00 4,250.00 3,696.62 4,250.00 50,446.62
            4210 CAM Recoveries 4,750.00 4,750.00 4,750.00 4,750.00 4,750.00 4,750.00 4,750.00 4,750.00 4,750.00 4,750.00 4,750.00 4,750.00 57,000.00
            4200 Total INCOME CH 9,000.00 9,000.00 9,000.00 9,000.00 9,000.00 9,000.00 9,000.00 9,000.00 9,000.00 9,000.00 8,446.62 9,000.00 107,446.62
        4600 OTHER INCOME
            4610 Late / NSF Fees 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1,394.72 3,828.61 0.00 0.00 0.00 5,223.33
            4600 Total OTHER INC 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1,394.72 3,828.61 0.00 0.00 0.00 5,223.33
    TOTAL INCOME 186,260.09 47,045.11 47,045.11 47,045.11 75,081.36 131,153.86 75,081.36 48,439.83 50,873.72 47,045.11 47,045.11 75,634.74 877,750.51
    EXPENSE
        6000 PROFESSIONAL FE
            6010 Professional Fees 0.00 0.00 0.00 2,500.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2,500.00
            6020 Legal Fees 0.00 0.00 0.00 4,592.71 0.00 1,466.33 1,703.35 2,006.00 0.00 685.96 4,368.50 0.00 14,822.85
            6000 Total PROFESSIO 0.00 0.00 0.00 7,092.71 0.00 1,466.33 1,703.35 2,006.00 0.00 685.96 4,368.50 0.00 17,322.85
        6100 UTILITIES
            6105 Water & Sewer 0.00 0.00 0.00 21.21 0.00 0.00 25.81 0.00 0.00 31.91 0.00 0.00 78.93
            6110 Electricity 1,000.91 358.23 390.43 350.71 353.69 0.00 666.39 381.97 486.85 449.62 480.21 486.81 5,405.82
            6125 Trash Removal 229.54 231.34 232.56 232.78 231.66 240.94 240.94 241.40 241.40 518.97 259.18 0.00 2,900.71
            6100 Total UTILITIES 1,230.45 589.57 622.99 604.70 585.35 240.94 933.14 623.37 728.25 1,000.50 739.39 486.81 8,385.46
        6200 REPAIR & MAINTEN
            6210 Field & Grounds - 3,094.00 0.00 0.00 2,313.84 1,009.50 0.00 1,439.58 1,302.75 600.00 0.00 0.00 1,909.73 11,669.40
            6211 Irrigation / Sprinkle 0.00 0.00 0.00 0.00 0.00 1,121.08 350.00 0.00 0.00 0.00 0.00 0.00 1,471.08
            6215 Landscape / Lawn 565.71 565.71 565.71 565.71 565.71 565.71 1,165.71 0.00 0.00 0.00 0.00 495.00 5,054.97
            6220 Sanitary Sewers 0.00 0.00 0.00 950.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 950.00
            6221 Storm Drains 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2,500.00 0.00 2,500.00
            6223 Snow Removal 1,365.00 3,440.00 0.00 0.00 0.00 0.00 0.00 1,350.00 4,440.00 4,106.00 790.00 2,340.00 17,831.00
            6228 Ceiling Tiles 0.00 0.00 0.00 0.00 53.30 0.00 0.00 0.00 0.00 0.00 0.00 0.00 53.30
            6231 Building - General 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 634.65 634.65
            6233 Roof / Flashing 1,840.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 764.00 0.00 2,604.00
            6234 Electrical Repairs 0.00 0.00 0.00 395.00 0.00 0.00 960.00 90.00 0.00 0.00 0.00 0.00 1,445.00
            6236 Plumbing Repairs 0.00 0.00 3,316.59 0.00 2,315.95 0.00 930.00 812.17 0.00 0.00 0.00 0.00 7,374.71
            6237 Fire & Life Safety 0.00 0.00 0.00 0.00 0.00 150.00 0.00 0.00 660.00 0.00 0.00 1,550.00 2,360.00
            6238 Lighting Supplies 0.00 0.00 0.00 0.00 0.00 0.00 875.00 193.05 0.00 0.00 0.00 0.00 1,068.05
    Profit & Loss 12 Month Recap          05/02/17 11:13 AM Page 1 of rentmanager.com - property management systems   rev.12.180
    MAY 16 JUN 16 JUL 16 AUG 16 SEP 16 OCT 16 NOV 16 DEC 16 JAN 17 FEB 17 MAR 17 APR 17 TOTAL
            6240 Lock & Key 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 14.59 0.00 0.00 0.00 14.59
            6242 HVAC Expense 4,375.00 0.00 1,370.00 2,043.25 0.00 0.00 0.00 415.00 1,326.00 1,835.00 0.00 0.00 11,364.25
            6251 Pest Control 0.00 71.07 0.00 71.07 0.00 0.00 71.07 71.07 0.00 71.07 0.00 71.07 426.42
            6200 Total REPAIR & M 11,239.71 4,076.78 5,252.30 6,338.87 3,944.46 1,836.79 5,791.36 4,234.04 7,040.59 6,012.07 4,054.00 7,000.45 66,821.42
        6300 JANITORIAL
            6310 Janitorial Services 1,935.00 1,935.00 1,935.00 1,935.00 1,935.00 0.00 3,870.00 1,935.00 1,935.00 1,935.00 1,995.00 1,995.00 23,340.00
            6320 Janitorial Supplies 79.74 260.01 79.74 90.84 113.14 0.00 170.58 0.00 365.61 90.84 0.00 153.01 1,403.51
            6300 Total JANITORIAL 2,014.74 2,195.01 2,014.74 2,025.84 2,048.14 0.00 4,040.58 1,935.00 2,300.61 2,025.84 1,995.00 2,148.01 24,743.51
        6400 PAYROLL
            6410 P/R Salaries - Offi 2,167.72 2,190.43 2,213.14 2,213.14 1,512.40 2,342.28 2,224.93 2,107.58 2,107.58 2,107.58 2,190.78 2,344.16 25,721.72
            6412 P/R Taxes - Office 179.87 167.56 169.30 169.30 115.70 179.18 170.21 161.23 238.16 231.10 199.89 196.42 2,177.92
            6420 Employee Insuran 76.06 76.14 76.22 199.23 104.30 161.06 152.29 137.91 139.14 139.14 143.91 175.02 1,580.42
            6421 Employee Benefit 3.54 2.40 87.37 141.59 35.59 114.13 111.50 110.15 89.47 107.81 114.80 49.60 967.95
            6423 Workers Compens 42.50 42.94 37.74 32.10 21.93 33.96 32.26 30.56 30.56 30.56 31.76 33.98 400.85
            6400 Total PAYROLL 2,469.69 2,479.47 2,583.77 2,755.36 1,789.92 2,830.61 2,691.19 2,547.43 2,604.91 2,616.19 2,681.14 2,799.18 30,848.86
        6500 TAXES INSURANCE
            6510 Real Estate Tax E 69,570.07 0.00 0.00 0.00 0.00 69,570.07 0.00 0.00 0.00 0.00 0.00 0.00 139,140.14
            6520 Insurance Expens 2,078.00 2,704.50 0.00 2,704.50 0.00 0.00 2,704.50 0.00 0.00 2,704.50 0.00 0.00 12,896.00
            6500 Total TAXES INSU 71,648.07 2,704.50 0.00 2,704.50 0.00 69,570.07 2,704.50 0.00 0.00 2,704.50 0.00 0.00 152,036.14
        6600 Property Manageme 9,575.44 8,381.70 2,117.03 2,117.03 2,117.03 3,378.66 5,901.92 3,378.66 2,179.79 2,000.00 3,829.06 2,117.03 47,093.35
        6650 Receiver Fees 6,625.00 6,125.00 0.00 0.00 6,875.00 0.00 7,062.50 8,375.00 0.00 0.00 8,875.00 0.00 43,937.50
        6700 GENERAL & ADMIN
            6710 PM / Work Order S 95.00 95.00 95.00 95.00 95.00 95.00 95.00 95.00 95.00 95.00 95.00 95.00 1,140.00
            6720 Postage / Messen 63.58 0.00 7.59 9.64 20.63 5.98 6.99 0.00 17.38 7.21 14.36 10.98 164.34
            6725 Office Supplies 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 148.88 148.88
            6735 Office Equipment 0.00 0.00 0.00 0.00 0.00 0.00 0.00 218.40 0.00 0.00 0.00 0.00 218.40
            6740 Telephone 21.33 0.00 11.54 15.00 21.12 8.76 9.77 0.00 13.19 11.96 3.14 7.88 123.69
            6760 Auto Mileage & Ex 100.44 0.00 68.75 140.24 104.14 61.29 142.59 29.00 56.04 0.00 23.14 0.00 725.63
            6770 Leasing & Maint. O 0.00 0.00 0.00 0.00 0.00 0.00 75.00 0.00 0.00 0.00 0.00 0.00 75.00
            6780 Bank Fees 129.45 0.00 0.00 105.91 87.62 0.00 53.61 0.00 120.92 56.46 77.49 79.74 711.20
            6700 Total GENERAL & 409.80 95.00 182.88 365.79 328.51 171.03 382.96 342.40 302.53 170.63 213.13 342.48 3,307.14
    TOTAL EXPENSE 105,212.90 26,647.03 12,773.71 24,004.80 17,688.41 79,494.43 31,211.50 23,441.90 15,156.68 17,215.69 26,755.22 14,893.96 394,496.23
    NOI 81,047.19 20,398.08 34,271.40 23,040.31 57,392.95 51,659.43 43,869.86 24,997.93 35,717.04 29,829.42 20,289.89 60,740.78 483,254.28
    N/O EXPENSE
        7100 NON-OPERATING E
            7110 Lease Commissio 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 33,203.00 0.00 0.00 0.00 33,203.00
            7130 Professional Fees 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1,276.00 0.00 0.00 1,276.00
            7100 Total NON-OPER 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 33,203.00 1,276.00 0.00 0.00 34,479.00
    TOTAL N/O EXPENSE 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 33,203.00 1,276.00 0.00 0.00 34,479.00
    NET INCOME 81,047.19 20,398.08 34,271.40 23,040.31 57,392.95 51,659.43 43,869.86 24,997.93 2,514.04 28,553.42 20,289.89 60,740.78 448,775.28
    Profit & Loss 12 Month Recap          05/02/17 11:13 AM Page 2 of rentmanager.com - property management systems   rev.12.180
    

    At first glance, the only visible text missing is the total page count in the footers of the two pages.


    As the OP remarked in a comment,

    It seems the same thing should be applied in deleteCharsInPath()

    Indeed, deleteCharsInPath should also be changed to:

    void deleteCharsInPath() {
        for (List<TextPosition> list : charactersByArticle) {
            List<TextPosition> toRemove = new ArrayList<>();
            for (TextPosition text : list) {
                Matrix textMatrix = text.getTextMatrix();
                Vector start = textMatrix.transform(new Vector(0, 0));
                if (linePath.contains(start.getX(), start.getY())) {
                    toRemove.add(text);
                }
            }
            if (toRemove.size() != 0) {
                System.out.println("Removed " + toRemove.size() + " TextPosition objects as they are being covered.");
                list.removeAll(toRemove);
            }
        }
    }
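
The same start-point test can be tried out in isolation. The following is a simplified sketch, not the stripper's code: plain Point2D values stand in for the baseline starts of TextPosition objects, a Path2D stands in for linePath, and the class name CoveredStartFilter and all coordinates are made up for illustration.

```java
import java.awt.geom.Path2D;
import java.awt.geom.Point2D;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CoveredStartFilter {
    // Removes every baseline start covered by the filled path and returns how many were removed.
    static int removeCovered(List<Point2D> starts, Path2D filledPath) {
        int before = starts.size();
        starts.removeIf(p -> filledPath.contains(p.getX(), p.getY()));
        return before - starts.size();
    }

    // Hypothetical filled rectangle from (0, 0) to (50, 50), built like a PDF path.
    static Path2D square() {
        Path2D path = new Path2D.Double();
        path.moveTo(0, 0);
        path.lineTo(50, 0);
        path.lineTo(50, 50);
        path.lineTo(0, 50);
        path.closePath();
        return path;
    }

    public static void main(String[] args) {
        List<Point2D> starts = new ArrayList<>(Arrays.asList(
                new Point2D.Double(10, 10),   // covered by the fill
                new Point2D.Double(60, 10),   // right of the fill
                new Point2D.Double(25, 49))); // covered by the fill

        int removed = removeCovered(starts, square());
        System.out.println("Removed " + removed + " covered start points, " + starts.size() + " left.");
    }
}
```

Using removeIf avoids the temporary toRemove list; whether that fits the real stripper depends on how charactersByArticle is used elsewhere in the class.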
    

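    The OP provided another document for which even the PDFVisibleTextStripper as corrected above fails to identify the visible characters correctly.

    The reason is another normalization done by PDFBox text stripping: it moves the origin to the lower left corner of the crop box.

    Patching the PDFVisibleTextStripper methods to add the lower left crop box coordinates back in makes the visible text extract nicely.

    Overriding processPage allows us to read the lower left crop box coordinates: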

    float lowerLeftX = 0;
    float lowerLeftY = 0;
    
    @Override
    public void processPage(PDPage page) throws IOException {
        PDRectangle pageSize = page.getCropBox();
    
        lowerLeftX = pageSize.getLowerLeftX();
        lowerLeftY = pageSize.getLowerLeftY();
    
        super.processPage(page);
    }
    

    processTextPosition and deleteCharsInPath need to take these values into account:

    @Override
    protected void processTextPosition(TextPosition text) {
        Matrix textMatrix = text.getTextMatrix();
        Vector start = textMatrix.transform(new Vector(0, 0));
    
        PDGraphicsState gs = getGraphicsState();
        Area area = gs.getCurrentClippingPath();
        if (area == null || area.contains(lowerLeftX + start.getX(), lowerLeftY + start.getY()))
            super.processTextPosition(text);
    }
    
    [...]
    
    void deleteCharsInPath() {
        for (List<TextPosition> list : charactersByArticle) {
            List<TextPosition> toRemove = new ArrayList<>();
            for (TextPosition text : list) {
                Matrix textMatrix = text.getTextMatrix();
                Vector start = textMatrix.transform(new Vector(0, 0));
                if (linePath.contains(lowerLeftX + start.getX(), lowerLeftY + start.getY())) {
                    toRemove.add(text);
                }
            }
            if (toRemove.size() != 0) {
                System.out.println("Removed " + toRemove.size() + " TextPosition objects as they are being covered.");
                list.removeAll(toRemove);
            }
        }
    }
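
The effect of the crop box correction can be seen in a small stand-alone sketch that uses only java.awt.geom. It assumes, as described above, that text coordinates are relative to the lower left corner of the crop box while the clip path is in page coordinates. The crop box origin, the clip rectangle, the glyph position and the class name CropBoxOffsetDemo are hypothetical values chosen for illustration.

```java
import java.awt.geom.Area;
import java.awt.geom.Rectangle2D;

public class CropBoxOffsetDemo {
    // Tests a crop-box-relative point against a clip area given in page coordinates.
    static boolean isVisible(Area clip, double lowerLeftX, double lowerLeftY, double x, double y) {
        return clip.contains(lowerLeftX + x, lowerLeftY + y);
    }

    public static void main(String[] args) {
        // Hypothetical page whose crop box starts at (50, 100) instead of (0, 0).
        double lowerLeftX = 50;
        double lowerLeftY = 100;

        // Clip area in page coordinates, covering x 50..250 and y 100..300.
        Area clip = new Area(new Rectangle2D.Double(50, 100, 200, 200));

        // A glyph 10 units right of and above the crop box corner, in normalized coordinates.
        double x = 10;
        double y = 10;

        System.out.println("without correction: " + clip.contains(x, y));
        System.out.println("with correction:    " + isVisible(clip, lowerLeftX, lowerLeftY, x, y));
    }
}
```

Without the offset the point is compared in the wrong coordinate system and falls outside the clip area, which matches the symptom of visible text being dropped.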
    

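    Now the extraction result for the new file is fine, too. ;)

    【Discussion】:

    • Thank you very much for the quick response, it works fine. It seems the same thing should be applied in deleteCharsInPath(), which handles fills.
    • By the way, for some reason the footer does not even match the "area.contains(start.getX(), start.getY())" condition. In this case that is fine, it gets skipped, but it is interesting why. For the example at this link, the condition fails for a lot of text at the top. Might it be necessary to add handling of some more instructions to the PDFTextStripper subclass?
    • I have not analyzed test2.pdf yet. One problem in this context, though, is that the PDFBox PDFTextStripper normalizes coordinates in a number of ways, and the additions of the PDFVisibleTextStripper most likely do not follow suit in every respect. I would not be surprised if that were the source of these problems...
    • I took a look. Indeed, the crop box in use does not have the origin at its lower left corner. That means my path processing does not yet emulate another "normalization", cf. this answer. I will try to fix that when I find time.
    • That would be great, thanks a lot! At least I now have a point to look at.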