crawler4j detects lines between the <script> </script> tags as text

【Question title】: crawler4j detects lines between the <script> </script> tags as text
【Posted】: 2019-12-26 07:22:38
【Question description】:

 <html>
 <head>
  
 </head>      
 <body> 
  <div style="width: 100%;"> This question already
  </div> 
  <div id="player"> hi crawler4j </div> 
  <script>
	player = new Clappr.Player({source: "http://123.30.215.65/hls/4545780bfa790819/5/3/d836ad614748cdab11c9df291254cf836f21144da20bf08142455a8735b328ca/dnR2MQ==_m.m3u8",
			parentId: '#player',
			width: '100%', height: "100%",
		    hideMediaControl: true,
		    autoPlay: true
					        });	
	</script>   
 </body>
</html>


For the example page above, I do the following:

HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
String body = htmlParseData.getHtml();

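crawler4j detects the lines between the <script> </script> tags as text. I want to remove everything between the <script> </script> tags from the body variable and then call getText(). Could you help me?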

I want it to print:

This question already

hi crawler4j

【Question comments】:

Tags: web-crawler html-parsing crawler4j

【Solution 1】:

crawler4j's HtmlParseData does not contain the full DOM tree of the fetched HTML page. Instead, the HtmlParseData object holds the raw HTML as a plain String.

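If you want to remove the content between the <script> tags, you can either

use a regular expression to remove it, as described on this Stack Overflow post, or
use JSoup (which is already a dependency of crawler4j) to parse the DOM tree and remove the <script> tags from the resulting tree.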
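
The first option can be sketched with nothing but the Java standard library. This is a minimal illustration, not code from the linked post: the class name ScriptStripper and the methods stripScripts and visibleText are hypothetical, and the patterns assume reasonably well-formed markup (a script body that itself contains the literal text "</script>" inside a string would end the match early).

```java
import java.util.regex.Pattern;

// Hypothetical helper: removes <script> elements from an HTML string with a regex.
final class ScriptStripper {
    // Matches a whole <script ...>...</script> element, case-insensitively,
    // with DOTALL so that '.' also spans line breaks inside the script body.
    private static final Pattern SCRIPT = Pattern.compile(
            "<script\\b[^>]*>.*?</script\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Matches any remaining tag; a crude stand-in for real HTML parsing.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Returns the HTML with every <script> element (tags and body) removed.
    static String stripScripts(String html) {
        return SCRIPT.matcher(html).replaceAll("");
    }

    // Removes scripts, then replaces the remaining tags with spaces
    // and collapses runs of whitespace into single spaces.
    static String visibleText(String html) {
        String withoutScripts = stripScripts(html);
        String withoutTags = TAG.matcher(withoutScripts).replaceAll(" ");
        return withoutTags.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // In a crawler4j visit() method this string would come from
        // htmlParseData.getHtml(); here it is a shortened copy of the question's page.
        String html = "<html><body><div> This question already </div>"
                + "<div id=\"player\"> hi crawler4j </div>"
                + "<script>\nplayer = new Clappr.Player({autoPlay: true});\n</script>"
                + "</body></html>";
        System.out.println(visibleText(html));
    }
}
```

Regular expressions are brittle on real-world HTML, so the second option is usually the safer choice: with JSoup, calling Jsoup.parse(html), then select("script").remove() on the resulting document, then body().text() gives the same kind of result from a real parse tree.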

【Comments】: