Parse 'ul' and 'ol' tags: answers

【Question title】: Parse 'ul' and 'ol' tags
【Posted】: 2018-10-23 21:13:36
【Question】:

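I have to handle deeply nested ul, ol, and li tags, and I need to produce the same view that a browser renders. I want to reproduce the following example in a PDF file: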

 text = "
<body>
    <ol>
        <li>One</li>
        <li>Two

            <ol>
                <li>Inner One</li>
                <li>inner Two

                    <ul>
                        <li>hey

                            <ol>
                                <li>hiiiiiiiii</li>
                                <li>why</li>
                                <li>hiiiiiiiii</li>
                            </ol>
                        </li>
                        <li>aniket </li>
                    </li>
                </ul>
                <li>sup </li>
                <li>there </li>
            </ol>
            <li>hey </li>
            <li>Three</li>
        </li>
    </ol>
    <ol>
        <li>Introduction</li>
        <ol>
            <li>Introduction</li>
        </ol>
        <li>Description</li>
        <li>Observation</li>
        <li>Results</li>
        <li>Summary</li>
    </ol>
    <ul>
        <li>Introduction</li>
        <li>Description

            <ul>
                <li>Observation

                    <ul>
                        <li>Results

                            <ul>
                                <li>Summary</li>
                            </ul>
                        </li>
                    </ul>
                </li>
            </ul>
        </li>
        <li>Overview</li>
    </ul>
</body>"

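I have to use Prawn for this task, but Prawn does not support HTML tags. So I came up with a solution using Nokogiri: I parse the markup and later strip the tags with gsub. I wrote the solution below for part of the content above, but the problem is that the ul and ol nesting can vary.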

require 'nokogiri'

RULES = {
  ol: {
    1 => ->(index) { "#{index + 1}. " },
    2 => ->(index) { "#{}" },
    3 => ->(index) { "#{}" },
    4 => ->(index) { "#{}" }
  },
  ul: {
    1 => ->(_) { "\u2022 " },
    2 => ->(_) { "" },
    3 => ->(_) { "" },
    4 => ->(_) { "" },
  }
}

def ol_rule(group, deepness: 1)
  group.search('> li').each_with_index do |item, i|
    prefix = RULES[:ol][deepness].call(i)
    item.prepend_child(prefix)
    descend(item, deepness + 1)
  end
end

def ul_rule(group, deepness: 1)
  group.search('> li').each_with_index do |item, i|
    prefix = RULES[:ul][deepness].call(i)
    item.prepend_child(prefix)
    descend(item, deepness + 1)
  end
end

def descend(item, deepness)
  item.search('> ol').each do |ol|
    ol_rule(ol, deepness: deepness)
  end
  item.search('> ul').each do |ul|
    ul_rule(ul, deepness: deepness)
  end
end

doc = Nokogiri::HTML.fragment(text)

doc.search('ol').each do |group|
  ol_rule(group, deepness: 1)
end

doc.search('ul').each do |group|
  ul_rule(group, deepness: 1)
end


puts doc.inner_text

Output:

1. One
2. Two

1. Inner One
2. inner Two

• hey

1. hiiiiiiiii
2. why
3. hiiiiiiiii


• aniket 


3. sup 
4. there 

3. hey 
4. Three



1. Introduction

1. Introduction

2. Description
3. Observation
4. Results
5. Summary



• Introduction
• Description

• Observation

• Results

• Summary






• Overview

Questions

1) How do I handle the spacing (indentation) when working with ul and ol tags?
2) How do I handle deep nesting when an li contains a ul or an ol?

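【Comments】:

Is this a homework question about recursion? It seems like a fine question, but it is an odd real-world problem.
Not a homework question. This is a problem I am facing at work.

Tags: ruby-on-rails ruby algorithm ruby-on-rails-4 nokogiri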

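【Solution 1】:

First, to handle the spacing, I used a hack in the lambda calls: a space rule that returns one more space for each nesting level. I also use the add_previous_sibling method provided by Nokogiri to insert that spacing before each item. Finally, Prawn does not keep the leading spaces when we handle ul and ol tags, so I used this gsub: gsub(/^([^\S\r\n]+)/m) { |m| "\xC2\xA0" * m.size }. You can read more at the link.

Note: Nokogiri does not preserve the intended structure of invalid HTML, so always provide valid HTML.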

require 'nokogiri'
require 'prawn'

RULES = {
  ol: {
    1 => ->(index) { "#{index + 1}. " },
    2 => ->(index) { "#{}" },
    3 => ->(index) { "#{}" },
    4 => ->(index) { "#{}" }
  },
  ul: {
    1 => ->(_) { "\u2022 " },
    2 => ->(_) { "" },
    3 => ->(_) { "" },
    4 => ->(_) { "" },
  },
  space: {
    1 => ->(index) { " "  },
    2 => ->(index) { "  " },
    3 => ->(index) { "   " },
    4 => ->(index) { "    " },
  }
}

def ol_rule(group, deepness: 1)
  group.search('> li').each_with_index do |item, i|
    prefix = RULES[:ol][deepness].call(i)
    space = RULES[:space][deepness].call(i)
    item.add_previous_sibling(space)
    item.prepend_child(prefix)
    descend(item, deepness + 1)
  end
end

def ul_rule(group, deepness: 1)
  group.search('> li').each_with_index do |item, i|
    space = RULES[:space][deepness].call(i)
    prefix = RULES[:ul][deepness].call(i)
    item.add_previous_sibling(space)
    item.prepend_child(prefix)
    descend(item, deepness + 1)
  end
end

def descend(item, deepness)
  item.search('> ol').each do |ol|
    ol_rule(ol, deepness: deepness)
  end
  item.search('> ul').each do |ul|
    ul_rule(ul, deepness: deepness)
  end
end

doc = Nokogiri::HTML.parse(text)

doc.search('ol').each do |group|
  ol_rule(group, deepness: 1)
end

doc.search('ul').each do |group|
  ul_rule(group, deepness: 1)
end

Prawn::Document.generate("hello.pdf") do
  # puts doc.inner_text
  html = doc.at('body').children.to_html
  html = html.gsub(/^([^\S\r\n]+)/m) { |m| "\xC2\xA0" * m.size } # leading whitespace -> non-breaking spaces
  html = html.gsub(%r{</?(?:ul|ol|li)>}, "")                     # strip the list tags
  text html.gsub("\\n", "").gsub(/\n+/, "\n")                    # drop literal \n sequences, collapse blank lines
end
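
The leading-whitespace substitution from this answer can be tried on its own, without Prawn or Nokogiri. The sketch below is only an illustration: the helper name preserve_indent and the sample string are made up for the example, and "\u00A0" is the same non-breaking space character as the "\xC2\xA0" bytes used above.

```ruby
NBSP = "\u00A0" # non-breaking space

# Hypothetical helper: replace each run of leading spaces/tabs on a line
# with the same number of non-breaking spaces.
def preserve_indent(str)
  str.gsub(/^[^\S\r\n]+/) { |m| NBSP * m.size }
end

sample = "1. One\n  a. Inner\n    i. Deep\n"
puts preserve_indent(sample).inspect
```

The character class [^\S\r\n] means "whitespace, but not a line break", so only the indentation is touched and the line structure stays as it was.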

【Comments】:

【Solution 2】:

I came up with a solution that handles multiple indentation levels, with a configurable numbering rule for each level:

require 'nokogiri'
ROMANS = %w[i ii iii iv v vi vii viii ix]

RULES = {
  ol: {
    1 => ->(index) { "#{index + 1}. " },
    2 => ->(index) { "#{('a'..'z').to_a[index]}. " },
    3 => ->(index) { "#{ROMANS.to_a[index]}. " },
    4 => ->(index) { "#{ROMANS.to_a[index].upcase}. " }
  },
  ul: {
    1 => ->(_) { "\u2022 " },
    2 => ->(_) { "\u25E6 " },
    3 => ->(_) { "* " },
    4 => ->(_) { "- " },
  }
}

def ol_rule(group, deepness: 1)
  group.search('> li').each_with_index do |item, i|
    prefix = RULES[:ol][deepness].call(i)
    item.prepend_child(prefix)
    descend(item, deepness + 1)
  end
end

def ul_rule(group, deepness: 1)
  group.search('> li').each_with_index do |item, i|
    prefix = RULES[:ul][deepness].call(i)
    item.prepend_child(prefix)
    descend(item, deepness + 1)
  end
end

def descend(item, deepness)
  item.search('> ol').each do |ol|
    ol_rule(ol, deepness: deepness)
  end
  item.search('> ul').each do |ul|
    ul_rule(ul, deepness: deepness)
  end
end

doc = Nokogiri::HTML.fragment(text)

doc.search('ol:root').each do |group|
  ol_rule(group, deepness: 1)
end

doc.search('ul:root').each do |group|
  ul_rule(group, deepness: 1)
end

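You can then strip the tags or use doc.inner_text, depending on your environment.

Two caveats, though:

Your entry selector has to be chosen carefully. I used your snippet verbatim, without a root element, so I had to use ul:root/ol:root. Perhaps "body > ol" works for your case too. Another option is to select every ol/ul but, instead of walking each one, start only from those that have no list as a parent.
Using your example verbatim, Nokogiri does not cope well with the last two list items of the first ol group ("hey", "Three"): after parsing with Nokogiri, those elements have already "left" their ol tree and been placed in the root tree.

Current output: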

  1. One
  2. Two
      a. Inner One
      b. inner Two
        ◦ hey
        ◦ hey
      3. hey
      4. hey
  hey
  Three

  1. Introduction
    a. Introduction
  2. Description
  3. Observation
  4. Results
  5. Summary

  • Introduction
  • Description
      ◦ Observation
          * Results
              - Summary
  • Overview

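【Comments】:

The whole content will be inside body. But for the first two, Inner One and inner Two, it should give numbers rather than letters. Also, will it work with any other structure of ul and ol? Finally, from where do we print the whole data?
Then you can change the code to pass separate depth parameters, ol_deepness and ul_deepness, and increment each one only when descending into a group of the same kind. I used doc.inner_text to extract the text, but that leaves some line breaks in between. Sorry, I am out of time for now.
My code above uses the Nokogiri::HTML.fragment method plus the ul:root selector. If your structure is different and you use the full Nokogiri::HTML.parse() method, you need to adjust the root selector, e.g. change doc.search('ol:root') to doc.search('body > ol'). I can only work with the example you provided.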

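【Solution 3】:

When you are inside an ol, li, or ul element, you have to check recursively for ol, li, and ul children. If there are none, return what has been found so far as a substructure; if there are, call the same function on the new node and add its return value to the current structure.

You perform a different action for each node, wherever it is, depending on its type, and the function then repackages everything automatically.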
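
The recursion described in this answer can be illustrated without any parser. In the sketch below the tree is modelled as nested Ruby arrays of the form [tag, *children] rather than Nokogiri nodes; that model, the method name render, and the two-space INDENT are assumptions made for the example. Each call returns an array of lines, and the caller adds that return value to its own result.

```ruby
INDENT = '  '

# Returns the text lines for +node+ and everything below it.
# A node is [tag, *children]; an :li holds its text first, then any sublists.
def render(node, depth: 0, marker: '')
  tag, *children = node
  case tag
  when :ol # ordered list: number the items
    children.each_with_index.flat_map { |li, i| render(li, depth: depth, marker: "#{i + 1}. ") }
  when :ul # unordered list: bullet the items
    children.flat_map { |li| render(li, depth: depth, marker: "\u2022 ") }
  when :li # item: its own line, then the lines of its sublists one level deeper
    text, *sublists = children
    ["#{INDENT * depth}#{marker}#{text}"] +
      sublists.flat_map { |list| render(list, depth: depth + 1) }
  else
    raise ArgumentError, "unexpected tag: #{tag.inspect}"
  end
end

tree = [:ol,
        [:li, 'One'],
        [:li, 'Two',
         [:ul,
          [:li, 'hey', [:ol, [:li, 'why']]],
          [:li, 'aniket']]],
        [:li, 'Three']]

puts render(tree)
```

Because the indentation is derived from the recursion depth, it follows the nesting however ol and ul are mixed.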

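【Comments】:

@AniketShivamTiwari Sorry, I was on the phone. An idea: wouldn't it be easier to select every li with a CSS selector and then check whether its parent is an ordered or an unordered list? Also, I just noticed that I cannot see why your code does not behave the way you want.
@AniketShivamTiwari When I run your code and do puts content.text, I get this. Isn't that what you want? I cannot understand why your solution does not match your expectations.
I wrote the code above for sample ul and li tags. It does not work for the example I gave.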