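Assuming you are working with fragments of HTML (rather than a complete document), you can write a regex that matches most well-formed, innermost, non-nested elements, then apply it repeatedly, from the inside out, to remove all tagged material and leave behind the desired untagged material between the tags. Here is a regex (in commented PHP/PCRE 'x' syntax) that matches most empty and non-empty, non-nested, non-shorttag HTML elements.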
$re_html = '%# Match non-nested, non-shorttag HTML empty and non-empty elements.
    <                    # Opening tag opening "<" delimiter.
    (\w+)\b              # $1: Tag name.
    (?:                  # Non-capture group for optional attribute(s).
      \s+                # Attributes must be separated by whitespace.
      [\w\-.:]+          # Attribute name is required for attr=value pair.
      (?:                # Non-capture group for optional attribute value.
        \s*=\s*          # Name and value separated by "=" and optional ws.
        (?:              # Non-capture group for attrib value alternatives.
          "[^"]*"        # Double quoted string.
        | \'[^\']*\'     # Single quoted string.
        | [\w\-.:]+\b    # Non-quoted attrib value can be A-Z0-9-._:
        )                # End of attribute value alternatives.
      )?                 # Attribute value is optional.
    )*                   # Allow zero or more attribute=value pairs
    \s*                  # Whitespace is allowed before closing delimiter.
    (?:                  # This element is either empty or has close tag.
      />                 # Is either an empty tag having no contents,
    | >                  # or has both opening and closing tags.
      (                  # $2: Tag contents.
        [^<]*            # Everything up to next tag. (normal*)
        (?:              # We found a tag (open or close).
          (?!</?\1\b) <  # Not us? Match the "<". (special)
          [^<]*          # More of everything up to next tag. (normal*)
        )*               # Unroll-the-loop. (special normal*)*
      )                  # End $2. Tag contents.
      </\1\s*>           # Closing tag.
    )                    # End of empty/non-empty alternatives.
    %x';
Here is the same regex in JavaScript syntax:
var re_html = /<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[\w\-.:]+\b))?)*\s*(?:\/>|>([^<]*(?:(?!<\/?\1\b)<[^<]*)*)<\/\1\s*>)/;
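As a quick illustration of what the two capture groups hold, here is a minimal sketch using exec(). The sample strings and the variable names m and e are made up for this example; the regex is the one above, repeated so the snippet runs on its own.

```javascript
// Same regex as above, repeated so this snippet is self-contained.
var re_html = /<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[\w\-.:]+\b))?)*\s*(?:\/>|>([^<]*(?:(?!<\/?\1\b)<[^<]*)*)<\/\1\s*>)/;

// Hypothetical sample: a non-empty element containing a different nested tag.
var m = re_html.exec('<a href="x.html">link <i>text</i></a>');
console.log(m[1]); // tag name: a
console.log(m[2]); // contents: link <i>text</i>

// Hypothetical sample: an empty element, so group 2 does not participate.
var e = re_html.exec('<img src="a.png" />');
console.log(e[1]); // img
console.log(e[2]); // undefined
```

Note that the nested <i> element is swallowed as part of the <a> contents: "non-nested" here only means that an element may not contain another element with the same tag name.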
The following JavaScript function strips HTML elements, leaving behind the desired text between the tags:
// Strip HTML elements.
function strip_html_elements(text) {
    // Match non-nested, non-shorttag HTML empty and non-empty elements.
    var re = /<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[\w\-.:]+\b))?)*\s*(?:\/>|>([^<]*(?:(?!<\/?\1\b)<[^<]*)*)<\/\1\s*>)/g;
    // Loop removing innermost HTML elements from inside out.
    while (text.search(re) !== -1) {
        text = text.replace(re, '');
    }
    return text;
}
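A short usage sketch follows. The input strings are invented for illustration, and the function is repeated so the snippet runs on its own. Each element is removed together with its contents, so only the text outside the elements survives.

```javascript
// Same function as above, repeated so this snippet is self-contained.
function strip_html_elements(text) {
    var re = /<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[\w\-.:]+\b))?)*\s*(?:\/>|>([^<]*(?:(?!<\/?\1\b)<[^<]*)*)<\/\1\s*>)/g;
    while (text.search(re) !== -1) {
        text = text.replace(re, '');
    }
    return text;
}

// Hypothetical inputs, made up for this example.
var simple = strip_html_elements('keep <b>drop</b> this');
// A <div> nested inside another <div> takes two passes of the loop:
// the inner one is removed first, then the outer one.
var nested = strip_html_elements('a<div>b<div>c</div>d</div>e');
var empty = strip_html_elements('x<br/>y');

console.log(JSON.stringify(simple)); // "keep  this" (both surrounding spaces remain)
console.log(JSON.stringify(nested)); // "ae"
console.log(JSON.stringify(empty));  // "xy"
```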
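This regex solution is not a proper parser and handles only simple HTML fragments that contain nothing but HTML elements. It does not (and cannot) correctly handle more complex markup such as comments, CDATA sections, and doctype declarations. It also does not remove elements that lack their optional closing tags (for example, <p> and <li> elements).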
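These limitations are easy to reproduce. In the sketch below (inputs invented for illustration, function repeated so the snippet is self-contained), an unclosed <p> is left in place, and a comment is not treated as a unit, so the markup inside it is stripped while the comment delimiters stay.

```javascript
// Same function as above, repeated so this snippet is self-contained.
function strip_html_elements(text) {
    var re = /<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[\w\-.:]+\b))?)*\s*(?:\/>|>([^<]*(?:(?!<\/?\1\b)<[^<]*)*)<\/\1\s*>)/g;
    while (text.search(re) !== -1) {
        text = text.replace(re, '');
    }
    return text;
}

// Hypothetical input: the first <p> has no closing tag, so it never matches.
var unclosed = strip_html_elements('<p>one<p>two</p>');
// Hypothetical input: the comment itself is ignored by the regex,
// but the <b> element inside it is still removed.
var comment = strip_html_elements('a<!-- <b>x</b> -->c');

console.log(unclosed); // <p>one
console.log(comment);  // a<!--  -->c
```

For input like this, a real HTML parser is the safer choice.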