【Question title】: Apache POI Word to HTML conversion - word boundaries
【Posted】: 2016-10-21 21:15:39
【Description】:

I am using the code below to convert a Word document to an HTML file:

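    // Imports for XDocReport 1.x; newer releases use the fr.opensagres.poi.xwpf.converter.* packages.
    import java.io.*;
    import java.util.Map;
    import org.apache.poi.xwpf.converter.core.FileImageExtractor;
    import org.apache.poi.xwpf.converter.core.FileURIResolver;
    import org.apache.poi.xwpf.converter.xhtml.XHTMLConverter;
    import org.apache.poi.xwpf.converter.xhtml.XHTMLOptions;
    import org.apache.poi.xwpf.usermodel.XWPFDocument;

    public Map convert(String wordDocPath, String htmlPath,
            Map conversionParams)
    {
        log.info("Converting word file " + wordDocPath);
        // The backslash must be escaped; "C:\temp" would contain a tab character.
        String workingFolder = "C:\\temp";
        File workingFolderFile = new File(workingFolder);

        // try-with-resources closes both streams even if the conversion fails
        try (InputStream fis = new FileInputStream(wordDocPath);
             OutputStream out = new FileOutputStream(new File(htmlPath)))
        {
            XWPFDocument document = new XWPFDocument(fis);
            XHTMLOptions options = XHTMLOptions.create().URIResolver(new FileURIResolver(workingFolderFile));
            // extracted images are written to the working folder
            options.setExtractor(new FileImageExtractor(workingFolderFile));
            XHTMLConverter.getInstance().convert(document, out, options);

            log.info("Converted to HTML file " + htmlPath);
        }
        catch (Exception e)
        {
            log.error("Exception :" + e.getMessage(), e);
        }
        return null; // this snippet builds no result map
    }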

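The code generates the HTML output correctly.

I need to put some parameters in the document, such as [[AGENT_NAME]], which I will later replace in code using a regular expression. But Apache POI does not treat this pattern as a single word: it sometimes splits it into "[[", "AGENT_NAME" and "]]" and inserts styled tags between the pieces. As a result I cannot write a regular expression that finds and replaces the parameters.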
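To make the failure concrete, here is a minimal sketch in plain Java with no POI involved. The two HTML strings are hypothetical stand-ins for converter output, and `PlaceholderDemo` and `fill` are illustrative names rather than part of any library. It shows a regex replacement succeeding when the placeholder is contiguous and silently doing nothing when tags sit between the pieces.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlaceholderDemo {
    // Matches [[NAME]] where NAME consists of word characters.
    private static final Pattern PARAM = Pattern.compile("\\[\\[(\\w+)\\]\\]");

    // Replaces every [[NAME]] that has a value in 'values'; unknown names are left untouched.
    static String fill(String html, Map<String, String> values) {
        Matcher m = PARAM.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = values.getOrDefault(m.group(1), m.group());
            // quoteReplacement keeps '$' and '\' in the value from being read as group references
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> values = Map.of("AGENT_NAME", "Jane Doe");

        // Hypothetical output when the placeholder sits in a single run
        String contiguous = "<p><span>Dear [[AGENT_NAME]],</span></p>";
        // Hypothetical output when Word stored the placeholder as three runs
        String split = "<p><span>Dear [[</span><span class=\"s1\">AGENT_NAME</span><span>]],</span></p>";

        System.out.println(fill(contiguous, values)); // placeholder replaced
        System.out.println(fill(split, values));      // unchanged: the tags break the match
    }
}
```

Loosening the regex to tolerate arbitrary tags between the brackets is possible but fragile, which is why fixing the runs before conversion tends to be the cleaner route.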

How does Apache POI decide word boundaries? Is there a way to control this?

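【Comments】:

  • Apache POI doesn't decide the word boundaries; Microsoft Word chooses them when the original file is produced...
  • Can you explain in more detail? Any link would help. Are there special characters that form part of a word boundary?
  • After debugging the code (XWPFDocument.paragraphs) and reading the Office Open XML reference at officeopenxml.com/WPparagraph.php, I understand that MS Word can split text into runs anywhere in the document. It can even split plain continuous text that contains no special characters (like AGENTNAME). But can we control this behavior? How can we make a piece of text be treated as a single run?
  • You'd have to ask Microsoft for a "proper" answer. Typically, highlighting the text you want kept together in Word, explicitly formatting it with a different style, and then formatting it back causes Word to put that text into its own run.
  • I tried highlighting and formatting (italic), but the text is still split.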
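The point about runs in the comments can be illustrated without POI. The sketch below parses a hand-written WordprocessingML paragraph with the JDK's DOM parser and lists the text of each `w:t` element. The XML fragment is an assumption written for illustration, not taken from a real document, and `RunDump` and `runTexts` are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class RunDump {
    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the text of every w:t element in document order.
    static List<String> runTexts(String paragraphXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for getElementsByTagNameNS
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(paragraphXml.getBytes(StandardCharsets.UTF_8)));
            NodeList texts = doc.getElementsByTagNameNS(W_NS, "t");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < texts.getLength(); i++) {
                result.add(texts.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the fragment", e);
        }
    }

    public static void main(String[] args) {
        // Hand-written fragment: one paragraph whose placeholder is spread over three runs,
        // the middle one carrying its own formatting (bold)
        String xml =
              "<w:p xmlns:w=\"" + W_NS + "\">"
            + "<w:r><w:t xml:space=\"preserve\">Dear [[</w:t></w:r>"
            + "<w:r><w:rPr><w:b/></w:rPr><w:t>AGENT_NAME</w:t></w:r>"
            + "<w:r><w:t>]],</w:t></w:r>"
            + "</w:p>";
        List<String> runs = runTexts(xml);
        System.out.println(runs);                  // the separate pieces
        System.out.println(String.join("", runs)); // the text as Word displays it
    }
}
```

Word shows the concatenation of the runs, so the split is invisible in the editor and only surfaces when a converter emits one element per run.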

Tags: java html apache-poi openxml docx


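【Solution 1】:

After all that effort, I finally decided to write code that parses the Word document and merges the split runs. Here is the code; I hope it helps others.

Note: the pattern I use is ${pattern}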

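    import java.util.List;
    import org.apache.poi.xwpf.usermodel.XWPFDocument;
    import org.apache.poi.xwpf.usermodel.XWPFParagraph;
    import org.apache.poi.xwpf.usermodel.XWPFRun;

    // Merges runs so that every ${...} pattern ends up inside a single run.
    // Only body paragraphs are scanned (not tables, headers or footers),
    // and only the first text element of each run is read.
    void mergeSplittedPatterns(XWPFDocument document)
    {
        List<XWPFParagraph> paragraphs = document.getParagraphs();

        for (XWPFParagraph paragraph : paragraphs)
        {
            List<XWPFRun> runs = paragraph.getRuns();

            int firstCharRun = -1;   // index of the run that holds '$'
            int closingCharRun = -1; // index of the run that holds '}'
            boolean firstCharFound = false;             // '$' seen
            boolean secondCharFoundImmediately = false; // '{' seen right after '$'
            boolean closingCharFound = false;
            boolean gotoNextRun = true;

            boolean scan = (runs != null && runs.size() > 0);
            int index = 0;

            while (scan)
            {
                gotoNextRun = true;
                XWPFRun run = runs.get(index);
                String runText = run.getText(0);
                if (runText != null)
                {
                    for (int i = 0; i < runText.length(); i++)
                    {
                        char character = runText.charAt(i);

                        if (secondCharFoundImmediately)
                        {
                            closingCharFound = (character == '}');
                            if (closingCharFound)
                            {
                                closingCharRun = index;

                                if (firstCharRun == closingCharRun)
                                {
                                    // the pattern already sits in one run; nothing to merge
                                    firstCharFound = secondCharFoundImmediately = closingCharFound = false;
                                    continue;
                                }
                                else
                                {
                                    // concatenate the text of all runs the pattern spans
                                    StringBuilder mergedText = new StringBuilder();
                                    for (int j = firstCharRun; j <= closingCharRun; j++)
                                    {
                                        String part = runs.get(j).getText(0);
                                        if (part != null) // a run without text contributes nothing
                                        {
                                            mergedText.append(part);
                                        }
                                    }
                                    runs.get(firstCharRun).setText(mergedText.toString(), 0);

                                    // remove the now redundant runs, last to first so indexes stay valid
                                    for (int j = closingCharRun; j > firstCharRun; j--)
                                    {
                                        paragraph.removeRun(j);
                                    }
                                    // rescan the merged run; it may contain further patterns
                                    firstCharFound = secondCharFoundImmediately = closingCharFound = gotoNextRun = false;
                                    index = firstCharRun;
                                    break;
                                }
                            }
                        }
                        else if (firstCharFound)
                        {
                            secondCharFoundImmediately = (character == '{');
                            if (!secondCharFoundImmediately)
                            {
                                // '$' was not followed by '{'; start over
                                firstCharFound = secondCharFoundImmediately = closingCharFound = false;
                            }
                        }
                        else if (character == '$')
                        {
                            firstCharFound = true;
                            firstCharRun = index;
                        }
                    }
                }

                if (gotoNextRun)
                {
                    index++;
                }

                if (index >= runs.size())
                {
                    scan = false;
                }
            }
        }
    }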
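The merging idea can be tried out on plain strings before touching a real document. The sketch below is a simplified greedy variant, not a translation of the code above: it keeps appending run texts while the collected text ends with a lone `$` or with a `${` that has no closing `}`. `RunMerger`, `mergeSplitPlaceholders` and `endsInsidePlaceholder` are hypothetical names, and the list of strings stands in for the `getText(0)` values of a paragraph's runs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RunMerger {
    // Merges neighbouring texts so that each ${...} ends up inside one element.
    static List<String> mergeSplitPlaceholders(List<String> runTexts) {
        List<String> merged = new ArrayList<>();
        String pending = null; // text collected while a placeholder is still open
        for (String text : runTexts) {
            String candidate = (pending == null) ? text : pending + text;
            if (endsInsidePlaceholder(candidate)) {
                pending = candidate;   // keep collecting
            } else {
                merged.add(candidate); // placeholder closed, or none present
                pending = null;
            }
        }
        if (pending != null) {
            merged.add(pending); // never closed: keep the text anyway
        }
        return merged;
    }

    // True when the text ends with a lone '$' or with a "${" that has no closing '}' yet.
    static boolean endsInsidePlaceholder(String text) {
        int open = text.lastIndexOf("${");
        if (open >= 0 && text.indexOf('}', open) < 0) {
            return true;
        }
        return text.endsWith("$");
    }

    public static void main(String[] args) {
        List<String> runs = Arrays.asList("Dear $", "{AGENT", "_NAME", "}, welcome");
        System.out.println(mergeSplitPlaceholders(runs));
    }
}
```

Unlike the answer's code, this variant also merges a run that merely ends in `$` with its successor, and an unclosed `${` swallows the rest of the paragraph. In both approaches the merged text takes the formatting of the first run involved, since the later runs are removed.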
