Parsing text file to get a selection of text

【Question title】: Parsing text file to get a selection of text
【Posted】: 2018-07-31 20:25:38
【Question】:

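I have a text file from which I want to grab a section of text and put it into two arrays, one for the ingredients and the other for the directions.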

For the ingredients, I can do something like the following, but I can't guarantee that it catches everything.

ingredients = []
list.each_line do |l|
  ingredients << l if l =~ /\d\s?\w.*/
end

Here is the block of text:

635860
581543
2011-03-21T13:50:10Z

Image:black bean soup.jpg|right|Mexican Black Bean Soup

== Ingredients ==
1lb black beans
2 tbsp extra-virgin olive oil
2 onions, large, diced
6 cloves garlic, minced
1 cup tomato, peeled, seeded, and chopped (fresh or canned)
1 sprig epazote, fresh or dried (optional)
1 tbsp chipotle pepper|chipotle chiles, canned, chopped (or ¼ tsp cayenne)
1 tsp cumin, ground
1 tsp coriander seed|coriander, ground
2 tsp salt

== Directions ==
Soak the black beans for 2 hours and drain.
In a deep pot, heat the olive oil over medium heat.
Add the onions and cook about 5 minutes.
Until translucent.
Add the black beans|beans, garlic, and 6 cups cold water.
Bring to a boil, skimming any foam that rises to the surface.
Reduce to a simmer.
In an hour or when the black beans|beans are soft, add the tomato, epazote, chipotle chile peppers|chile, cumin, coriander, and salt.
Continue cooking until the black beans|beans start to break down and the broth begins to thicken.
Taste for seasoning and add salt and pepper if needed.
If you’re serving this soup immediately, you may want to thicken it by puréeing a cup or two of the black beans|beans in a blender or food processor and then recombining them with the rest of the soup.
The soup will thicken on its own if refrigerated overnight.

Category:Black bean Recipes
Category:Chile pepper Recipes
Category:Chipotle pepper Recipes
Category:Epazote Recipes
Category:Mexican Soups
Category:Tomato Recipes
bx0ztz9xbf8qr9z4gwkad26u6q3hly3

【Comments】:

This all works fine until someone has "half a pound of butter" or "two whole eggs" in a recipe.
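A couple of tips: instead of the Perl-style =~ operator, use .match(...), which makes it clearer what is happening. Second, use /\A\s*\d\s*\w.*/ to anchor the pattern to the start of the line while allowing for a few leading spaces that may be there for no reason.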
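To illustrate the tip above, here is a minimal sketch of the anchored pattern. The constant INGREDIENT_LINE and the helper ingredient_line? are hypothetical names made up for this example, and it uses match? (Ruby 2.4+) rather than match because only a true/false answer is needed.

```ruby
# Hypothetical helper: true when a line looks like an ingredient, i.e.
# optional leading whitespace, then a digit, then a word character.
INGREDIENT_LINE = /\A\s*\d\s*\w.*/

def ingredient_line?(line)
  line.match?(INGREDIENT_LINE)
end

sample = [
  "1lb black beans\n",
  "  2 tsp salt\n",
  "Add the onions and cook about 5 minutes.\n",
  "half a pound of butter\n"
]

sample.each do |line|
  label = ingredient_line?(line) ? 'match' : 'skip'
  puts "#{label}: #{line.strip}"
end
```

The unanchored /\d\s?\w.*/ from the question would also match the "5 minutes" in the third line; the anchored version skips it. Anchoring alone does not help with "half a pound of butter", and the ID and timestamp lines at the top of the sample (for example 635860) still start with a digit, so they would still match.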
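The way I usually approach this kind of problem is to collect a corpus of data full of all sorts of ugly edge cases and then write a parser that handles them. Unit tests help here, especially the kind where the expected output can be defined in a standard format such as JSON or YAML.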
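As a rough sketch of that corpus-driven idea, without any test framework: the parser parse_ingredients, the CORPUS data, and the helper failing_cases below are all hypothetical and only stand in for whatever parser and edge cases you actually have. In practice the YAML would probably live in its own file.

```ruby
require 'yaml'

# Hypothetical parser under test: keeps lines that start with a digit.
def parse_ingredients(text)
  text.each_line.map(&:strip).select { |line| line.match?(/\A\d\s*\w/) }
end

# Hypothetical corpus: each case pairs raw input with the expected output.
CORPUS = YAML.safe_load(<<~'YAML')
  - name: numeric quantities
    input: |
      1lb black beans
      2 tsp salt
    expected:
      - 1lb black beans
      - 2 tsp salt
  - name: spelled-out quantity
    input: |
      half a pound of butter
    expected:
      - half a pound of butter
YAML

# Names of the cases whose parsed output differs from the expectation.
def failing_cases(corpus)
  corpus.reject { |c| parse_ingredients(c['input']) == c['expected'] }
        .map { |c| c['name'] }
end

failing_cases(CORPUS).each { |name| puts "FAIL: #{name}" }
```

With this naive parser the "spelled-out quantity" case is reported as failing, which is the point of the corpus: every ugly input you come across becomes another entry, and the parser is adjusted until the list of failures is empty.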

Tags: ruby regex parsing

【Solution 1】:

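What I would do here is, rather than trying to match on data you probably don't control, match on the data it looks like you do control. Specifically, it seems to me that == Ingredients ==, == Directions == and Category:Tomato Recipes are probably part of the file format rather than something the user typed in. So I would split the text whenever you see a line like that: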

sections = list.each_line.slice_before do |line|
  line.match?(/\A(==|[a-zA-Z]+:)/)
end.entries

Then you can use assoc to look up the data in each group:

puts sections.assoc("== Ingredients ==\n")
puts '---'
puts sections.assoc("== Directions ==\n")

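This still has some flaws (if a user enters something like Note: Preheat oven first as part of the directions, the text will be split at that point because the line is taken for metadata), but it should be a big step forward and can be tweaked from here.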
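Putting the two steps together, here is a minimal end-to-end sketch that produces the two arrays the question asks for. The RECIPE sample is a shortened stand-in for the real file, and split_sections and section_body are hypothetical helper names, not part of any library.

```ruby
# Hypothetical, shortened sample in the same shape as the question's text.
RECIPE = <<~TEXT
  635860
  2011-03-21T13:50:10Z

  Image:black bean soup.jpg|right|Mexican Black Bean Soup

  == Ingredients ==
  1lb black beans
  2 tsp salt

  == Directions ==
  Soak the black beans for 2 hours and drain.
  Reduce to a simmer.

  Category:Mexican Soups
TEXT

# Start a new group at every line that looks like format markup
# ("== ... ==" or "Word:"), as in the answer above.
def split_sections(text)
  text.each_line.slice_before { |line| line.match?(/\A(==|[a-zA-Z]+:)/) }.to_a
end

# Body lines of the section with the given heading, without the heading
# itself or blank lines; returns [] when the heading is not present.
def section_body(sections, heading)
  section = sections.assoc("#{heading}\n") || []
  section.drop(1).map(&:strip).reject(&:empty?)
end

sections    = split_sections(RECIPE)
ingredients = section_body(sections, '== Ingredients ==')
directions  = section_body(sections, '== Directions ==')

p ingredients # => ["1lb black beans", "2 tsp salt"]
p directions  # => ["Soak the black beans for 2 hours and drain.", "Reduce to a simmer."]
```

assoc works here because each group is an array whose first element is the heading line, including its trailing newline. The heading lookup is exact, so a file with extra spaces around the heading text would need a looser comparison.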

【Comments】:

Thank you, that helped me a lot.