Extracting text information using RapidMiner: answers

【Question title】: Extracting text information using rapidminer
【Posted】: 2012-11-06 10:49:25
【Question description】:

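I have a list of text data from which I would like to extract certain parts. I am currently using regular expressions to extract the data I want, but this is getting very complicated because each record is slightly different. Is there any way to use RapidMiner to "learn" a regular expression from a few typical examples?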

For example, for each of the following records, I would like to extract the values 24 and 18 into two new attributes:

word 24 on line 18
Wrd 24 of Ln 18
Line 18, Word 24
Word 24 comes after word 22 on line 18 (not line 19)

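I have watched all the text processing videos, but none of them shows how to do this kind of thing, and I don't really know where to start. Can anyone suggest an approach other than writing the regular expressions by hand?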

【Comments】:

Tags: text full-text-search text-processing rapidminer

【Solution 1】:

The TXR language has a straightforward way to express pattern-matching variants without cryptic regular expressions:

Here is your data file:

$ cat 13249396.dat 
word 24 on line 18
Wrd 24 of Ln 18
Line 18, Word 24
Word 24 comes after word 22 on line 18 (not line 19)

Here is the txr script:

@(collect)
@  (some)
word @wd on line @ln
@  (or)
Wrd @wd of Ln @ln
@  (or)
Line @ln, Word @wd
@  (or)
Word @wd comes after word @nil on line @ln (@(skip)
@  (end)
@(end)
@(output)
@  (repeat)
@wd:@ln
@  (end)
@(end)

Test run:

$ txr 13249396.txr 13249396.dat
24:18
24:18
24:18
24:18

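The script was developed by taking the cases from the sample file and replacing the variable parts with special syntax: @wd and @ln capture the word and line numbers, @nil matches a value without keeping it, and @(skip) ignores the rest of the line.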
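
For readers without TXR, the same "copy a sample line and replace the variable parts" idea can be sketched in Python. This is only an illustration, not TXR and not a RapidMiner feature: the `{wd}`, `{ln}` and `{skip}` placeholders, the `TEMPLATES` list, and the `compile_template` and `extract` helpers are all made up for this example, and it assumes the values to extract are always runs of digits. Regular expressions are still used underneath, but they are generated from the templates rather than written by hand.

```python
import re

# Hypothetical templates, one per sample record (not part of the original answer).
# {wd} and {ln} mark the numbers to capture; {skip} marks a number to ignore.
TEMPLATES = [
    "word {wd} on line {ln}",
    "Wrd {wd} of Ln {ln}",
    "Line {ln}, Word {wd}",
    "Word {wd} comes after word {skip} on line {ln} (",
]


def compile_template(template):
    """Escape the literal text and turn each placeholder into a digit pattern."""
    pattern = ""
    for part in re.split(r"(\{\w+\})", template):
        if part in ("{wd}", "{ln}"):
            # Named group, e.g. (?P<wd>\d+)
            pattern += r"(?P<%s>\d+)" % part[1:-1]
        elif part == "{skip}":
            # Matched but not captured
            pattern += r"\d+"
        else:
            pattern += re.escape(part)
    return re.compile(pattern)


PATTERNS = [compile_template(t) for t in TEMPLATES]


def extract(record):
    """Return (word, line) as strings from the first matching template, else None."""
    for pattern in PATTERNS:
        # match() anchors at the start only, so trailing text is ignored.
        m = pattern.match(record)
        if m:
            return m.group("wd"), m.group("ln")
    return None


if __name__ == "__main__":
    records = [
        "word 24 on line 18",
        "Wrd 24 of Ln 18",
        "Line 18, Word 24",
        "Word 24 comes after word 22 on line 18 (not line 19)",
    ]
    for record in records:
        result = extract(record)
        if result is not None:
            print("%s:%s" % result)
```

As in the TXR script, supporting a new record format means adding one more template copied from a sample line. Matching here is case-sensitive and tries the templates in order, so the first template that fits a record wins.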

【Discussion】: