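Help with this content extraction + Beautiful Soup: answers

【Question title】: Help in this content extraction + beautiful soup
【Posted】: 2011-07-14 20:25:03
【Question】:

I am trying to extract data from a website whose markup has this format: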

<div id=storytextp class=storytextp align=center style='padding:10px;'> 
<div id=storytext class=storytext> 
<div class='a2a_kit a2a_default_style' style='float:right;margin-left:10px;border:none;'> 
..... extra stuff
</div>  **Main Content**
</div>
</div>

Note that Main Content can contain other tags, but I want the whole content as a string.

So this is what I did:

_divTag = data.find( "div" , id = "storytext" )
innerdiv = _divTag.find( "div" ) # find the first div tag
innerdiv.contents[0].replaceWith("") # replace with null

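My expectation was that _divTag would then hold only the main content, but this does not work. Can anyone tell me what mistake I am making and how I should extract the main content?

【Comments on the question】:

Tags: python beautifulsoup

【Answer 1】: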

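Just use _divTag.contents[2].

Your formatting may have misled you: this text does not belong to the innermost div tag (innerdiv.text, innerdiv.contents, or innerdiv.findChildren() would show you that). It is a direct child of the storytext div, placed after the inner div closes.

If you indent your original HTML, things become clearer: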

<div id=storytextp class=storytextp align=center style='padding:10px;'> 
  <div id=storytext class=storytext> 
    <div class='a2a_kit a2a_default_style' style='float:right;margin-left:10px;border:none;'> 
      ..... extra stuff
    </div>  **Main Content**
  </div>
</div>
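
To make the indexing concrete, here is a minimal sketch that lists the direct children of the storytext div. It assumes the bs4 package and the built-in html.parser; the thread does not say which BeautifulSoup version was in use, and other parsers or slightly different whitespace in the real page could shift the index.

```python
from bs4 import BeautifulSoup, NavigableString

# Sample markup copied from the question (assumed representative of the real page).
html = """<div id=storytextp class=storytextp align=center style='padding:10px;'>
<div id=storytext class=storytext>
<div class='a2a_kit a2a_default_style' style='float:right;margin-left:10px;border:none;'>
..... extra stuff
</div>  **Main Content**
</div>
</div>"""

soup = BeautifulSoup(html, "html.parser")
story = soup.find("div", id="storytext")

# Show each direct child: whitespace text, the inner div, then the main text.
for index, child in enumerate(story.contents):
    kind = "text" if isinstance(child, NavigableString) else f"<{child.name}>"
    print(index, kind, repr(str(child)[:30]))

# The main content is the text node that follows the inner div.
main_text = story.contents[2].strip()
print(main_text)
```

Index 0 is the newline after the opening tag, index 1 is the inner div, and index 2 is the text the question is after.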

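(PS: I am not clear on what your innerdiv.contents[0].replaceWith("") was meant to do. Suppress the attributes? The newline? In any case, the BS philosophy is not to edit the parse tree but simply to ignore the 99.9% you do not care about. See the BeautifulSoup documentation for details.)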
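
For what it is worth, a small sketch of what that line actually touches, again assuming bs4 with html.parser and the sample markup from the question. innerdiv.contents[0] is the first child inside the inner div, so replacing it leaves the div itself in place. If editing the tree is really wanted, extract() on the inner div is one way to remove it; this is shown only as an alternative, not as the recommended approach.

```python
from bs4 import BeautifulSoup

# Sample markup from the question, trimmed to the relevant div (an assumption).
html = """<div id=storytext class=storytext>
<div class='a2a_kit a2a_default_style' style='float:right;margin-left:10px;border:none;'>
..... extra stuff
</div>  **Main Content**
</div>"""

soup = BeautifulSoup(html, "html.parser")
story = soup.find("div", id="storytext")
innerdiv = story.find("div")

# The first child of the inner div is its own text, not the div tag.
first_child = innerdiv.contents[0]
print(repr(first_child))

# Removing the inner div itself leaves only the main content behind.
innerdiv.extract()
remaining_text = story.get_text().strip()
print(remaining_text)
```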

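【Discussion】:

Hey, but it so happens that my main content is made up of more tags, something like .... .. When I do what you suggested, it prints only the first paragraph tag of the main content, and the rest after it does not come through. What I was trying to do was replace the first inner div tag with an empty string, so that everything after it could be reached through contents[0].
Could you edit the sample text to show where the multiple lines are? In that case it should be _divTag.contents[2:].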
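
A minimal sketch of the contents[2:] suggestion when the main content spans several tags. The paragraph markup below is hypothetical, since the thread never shows the real multi-tag content, and bs4 with html.parser is assumed as before.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the main content consists of several <p> tags.
html = (
    "<div id=storytext class=storytext>\n"
    "<div class='a2a_kit a2a_default_style'>..... extra stuff</div>"
    "<p>First paragraph</p>\n"
    "<p>Second <b>paragraph</b></p>\n"
    "</div>"
)

soup = BeautifulSoup(html, "html.parser")
story = soup.find("div", id="storytext")

# Everything after the leading whitespace node and the inner div.
rest = story.contents[2:]

# Keep the markup: str() works for both tags and text nodes.
main_html = "".join(str(node) for node in rest).strip()

# Or keep only the text: text nodes are str subclasses, tags have get_text().
main_text = "".join(
    node if isinstance(node, str) else node.get_text() for node in rest
).strip()

print(main_html)
print(main_text)
```

The slice starts at 2 only because of this particular layout; iterating over the inner div's next_siblings would avoid hard-coding the index.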