【Posted】: 2011-08-30 00:47:48
【Question】:
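I've imported a plain-text version of a PDF using a Python script, but it has a bunch of junk artifacts I don't care about.
The only whitespace I care about is (1) single spaces, and (2) double \n's.
Single spaces, for obvious reasons, between word boundaries. Double \n's, to separate paragraphs.
The junk whitespace it contains looks like this:
[\ \n\t]+ all jumbled together
This leads me to another problem: sometimes paragraphs are separated by
[\n][\s]+[\n]
I'm not experienced enough with regular expressions to make it ignore the inner whitespace between the two \n's. As an amateur RegExer, my problem is that \s includes \n.
If it didn't, I think this would be a very easy problem to solve.
All other whitespace is irrelevant, and nothing I've tried has really worked.
Any suggestions would be greatly appreciated.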
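One possible way to approach the whitespace problem described above, as a minimal Python sketch rather than a definitive fix. The function name `normalize_whitespace` is made up for illustration, and the sketch assumes that a lone \n inside a paragraph is just a PDF line wrap that should become a single space. Because \s already matches \n, the idea is to split on \n\s*\n first, so each junk-filled paragraph break is consumed as one unit, and only then collapse the remaining whitespace runs inside each paragraph.

```python
import re

def normalize_whitespace(text: str) -> str:
    """Keep only single spaces and double-newline paragraph breaks.

    Assumption (not stated in the question): a single newline inside a
    paragraph is a line wrap and should be turned into a space.
    """
    # A paragraph break is a newline, any amount of whitespace (which may
    # itself contain spaces, tabs, or more newlines), then another newline.
    paragraphs = re.split(r'\n\s*\n', text)
    # Inside each paragraph, collapse every whitespace run to one space.
    cleaned = [re.sub(r'\s+', ' ', p).strip() for p in paragraphs]
    # Drop paragraphs that were nothing but whitespace, then rejoin with
    # exactly one blank line between paragraphs.
    return '\n\n'.join(p for p in cleaned if p)
```

For example, `normalize_whitespace("a  \t b\nc\n \t \n\n d")` treats the run `\n \t \n\n` as a single paragraph break and collapses everything else.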
Sample text
Summary: The Department of Environment in Bangladesh seized 265 sacks of poultry feed
tainted with tannery waste and various chemicals.
Synthesis/Analysis: The Department of Environment seized the tainted poultry feed on
28 March from a house in the city of Adabar located in Dhaka province. Workers were
found in the house, which was used as an illegal factory, producing the tainted feed. The
Bangladesh Environment Conservation Act allowed for a case to be filed against the
factory’s manager, Mahmud Hossain, and the owner, who was not named.
It was reported that the Department of Environment had also closed three other factories
in Hazaribag a month prior to this instance for the same charges. The Bangladesh Council of
Scientific and Industrial Research found that samples from the feed taken from these
factories had “dangerous levels of chromium…” The news report also stated that “poultry
6
and eggs became poisonous” from consuming the tainted feed, which would also cause
health concerns for consumers.
This just leads me to more fixes... all the page numbers and the random double \n's have to be removed.
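For the page numbers, a rough Python sketch follows, under a strong assumption that is not in the question: any line consisting only of digits (like the lone "6" in the sample) is a page number. The helper name `strip_page_numbers` is hypothetical. Deleting the whole line, including its newline, lets the wrapped sentence on either side of it join back up, so this step probably belongs before any whitespace collapsing.

```python
import re

# Assumption: a line holding nothing but digits (plus optional spaces or
# tabs) is a page number. A real numeric-only content line would be lost too.
PAGE_NUMBER_LINE = re.compile(r'^[ \t]*\d+[ \t]*(?:\n|\Z)', re.MULTILINE)

def strip_page_numbers(text: str) -> str:
    """Delete digit-only lines together with their trailing newline."""
    return PAGE_NUMBER_LINE.sub('', text)
```

Lines that merely start with a number, such as "28 March from a house...", are left alone because the pattern requires the digits to be the only content on the line.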
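【Comments】:
-
Look at the number of regex questions you've asked. Don't you think it would be worth your (and our) time if you started learning them seriously? It's not hard if you work through the O'Reilly regex book.
-
Thanks for the comment, Mike. I'm glad RegEx comes easily to you.
-
Mike's comment is right. If you have to ask that many questions, maybe you should study the subject instead of continuing to ask people to do things for you. I like the Regular Expressions Cookbook myself.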
Tags: php html regex preg-replace