【Posted】: 2015-04-06 23:02:25
【Question】:
I am creating documents (see this) that contain an arbitrary number of characters (human characters/voices), marked up like so:
<span class="sam" title="This is Sam speaking">
<span class="higbie" title="This is Calvin Higbie speaking">
<span class="ballou" title="This is Mr. Ballou speaking">
For some context, here is a snippet from one of the documents:
<p><span class="others" title="This is 'an elderly pilgrim' speaking">"Jack, do you see that range of mountains over yonder that bounds the Jordan valley? The mountains of Moab, Jack! Think of it, my
boy--the actual mountains of Moab--renowned in Scripture history!
We are actually standing face to face with those illustrious crags
and peaks--and for all we know" [dropping his voice impressively],
"our eyes may be resting at this very moment upon the spot WHERE
LIES THE MYSTERIOUS GRAVE OF MOSES! Think of it, Jack!"</span></p>
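Once a document is finished, I want to generate a distinct list for this markup pattern. IOW, I want to examine every piece of HTML that follows the pattern, but return only one instance for each distinct person/speaker. I don't want 400 of these: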
<span class="sam" title="This is Sam speaking">
...(just one).
In pseudo-SQL terms, what I want is something like this:
SELECT DISTINCT SOMETHING FROM FILE WHERE SLIDING_WINDOW_OF_TEXT STARTSWITH("<span class=\"") AND SLIDING_WINDOW_OF_TEXT ENDSWITH(" speaking\">")
I don't know whether this is best attacked with a regular expression, or whether there is something like "LinqToText", or something else entirely...
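To make the pseudo-SQL above concrete, here is a minimal sketch of the regex route. It assumes Python, and it assumes every speaker tag has exactly the shape shown earlier (`class` first, then a `title` ending in " speaking"). The names `SPEAKER_TAG` and `distinct_speaker_tags` are made up for illustration; this is not a definitive solution.

```python
import re

# Assumption: tags always look like <span class="..." title="... speaking">,
# with the attributes in this order and double-quoted values.
SPEAKER_TAG = re.compile(r'<span class="[^"]*" title="[^"]* speaking">')


def distinct_speaker_tags(html):
    """Return each distinct speaker tag once, in order of first appearance."""
    # findall returns every match; dict.fromkeys drops the repeats while
    # keeping insertion order (guaranteed for dicts since Python 3.7).
    return list(dict.fromkeys(SPEAKER_TAG.findall(html)))


if __name__ == "__main__":
    sample = (
        '<p><span class="sam" title="This is Sam speaking">"Hello."</span></p>'
        '<p><span class="ballou" title="This is Mr. Ballou speaking">"Hi."</span></p>'
        '<p><span class="sam" title="This is Sam speaking">"Again."</span></p>'
    )
    for tag in distinct_speaker_tags(sample):
        print(tag)
```

A pattern like this would likely miss tags whose attributes are reordered or single-quoted, so an HTML parser may be the more robust choice if the markup is not strictly uniform.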
【Discussion】: