【Question title】:Python: how to extract the text after br?
【Posted】:2015-12-09 16:05:03
【Question description】:

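I am using Python 2.7.8 and am a little surprised: I get all of the text except the text that follows the last <br> in each paragraph. My HTML page looks like this: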

<html>
<body>
<div class="entry-content" >
<p>Here is a listing of C interview questions on “Variable Names” along with answers, explanations and/or solutions:
</p>

<p>Which of the following is not a valid C variable name?<br>
a) int number;<br>
b) float rate;<br>
c) int variable_count;<br>
d) int $main;</p>   <!--not getting-->

<p> more </p>

<p>Which of the following is true for variable names in C?<br>
a) They can contain alphanumeric characters as well as special characters<br>
b) It is not an error to declare a variable to be one of the keywords(like goto, static)<br>
c) Variable names cannot start with a digit<br>
d) Variable can be of any length</p> <!--not getting -->!

</div>
</body>
</html>

And my code:

import urllib2
from urllib2 import Request
from bs4 import BeautifulSoup, NavigableString, Tag

url = "http://www.sanfoundry.com/c-programming-questions-answers-variable-names-1/"
#url="http://www.sanfoundry.com/c-programming-questions-answers-variable-names-2/"
req = Request(url)
resp = urllib2.urlopen(req)
htmls = resp.read()
soup = BeautifulSoup(htmls)
for br in soup.findAll('br'):
    next = br.nextSibling
    # skip <br> tags that are not directly followed by a text node
    if not (next and isinstance(next,NavigableString)):
        continue
    next2 = next.nextSibling
    # only print when that text node is itself followed by another <br>
    if next2 and isinstance(next2,Tag) and next2.name == 'br':
        text = str(next).strip()
        if text:
            print "Found:", next.encode('utf-8')

Output:

Found: 
a) int number;
Found: 
b) float rate;
Found: 
c) int variable_count;

Found: 
a) They can contain alphanumeric characters as well as special characters
Found: 
b) It is not an error to declare a variable to be one of the keywords(like goto, static)
Found: 
c) Variable names cannot start with a digit

But I do not get the last piece of text in each question, for example:

 d) int $main
    and 
 d) Variable can be of any length  

which come after the last <br>.

The output I want to get:

Found: 
a) int number;
Found: 
b) float rate;
Found: 
c) int variable_count;
Found:
d) int $main

Found: 
a) They can contain alphanumeric characters as well as special characters
Found: 
b) It is not an error to declare a variable to be one of the keywords(like goto, static)
Found: 
c) Variable names cannot start with a digit
d) Variable can be of any length

【Question comments】:

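  • Add more print statements. When you hit continue, print what you are skipping. Add an else branch to the if statement and print what you are skipping there too.
  • OK, I am trying that...
  • Why are you still using the old approach instead of the one I suggested here?..
  • Well, I ran into some problems with it because my real code is much larger. Your suggestion solved my previous question, but here I face the same situation with your solution as well.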

Tags: python html beautifulsoup html-parsing


【Solution 1】:

You can use Requests instead of urllib2 and parse the page with the html module of lxml.

from lxml import html
import requests

#request page
page=requests.get("http://www.sanfoundry.com/c-programming-questions-answers-variable-names-1/")

#get content in html format
page_content=html.fromstring(page.content)

#recover all text from <p> elements
items=page_content.xpath('//p/text()')

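The code above returns a list of all the text nodes found directly inside <p> elements in the document.
With that list, you can simply index into it to print the entries you want.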
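
As a rough illustration of picking entries out of that list, the sketch below filters the text nodes down to the ones that look like answer options. It is written for Python 3, and the `items` list here is a hand-written stand-in for what `xpath('//p/text()')` might return for one question (an assumption, not captured output). The names `OPTION` and `answer_options` are hypothetical helpers, not part of lxml.

```python
import re

# Hand-written stand-in for the xpath('//p/text()') result of one question.
# This is an assumption for illustration, not output captured from the page.
items = [
    "Which of the following is not a valid C variable name?",
    "\na) int number;",
    "\nb) float rate;",
    "\nc) int variable_count;",
    "\nd) int $main;",
]

# An answer option starts with a lowercase letter, then ")" and whitespace.
OPTION = re.compile(r"^[a-z]\)\s")


def answer_options(text_nodes):
    """Return the stripped text nodes that look like 'a) ...' options."""
    stripped = (node.strip() for node in text_nodes)
    return [text for text in stripped if OPTION.match(text)]


options = answer_options(items)
for option in options:
    print("Found:", option)
```

Filtering by pattern rather than by fixed index may hold up better if the page gains or loses paragraphs, though the pattern itself is only a guess at how the options are written.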

【Comments】:

    【Solution 2】:

    This happens because BeautifulSoup forces the text into valid XML by closing the <br> tags just before the </p>. The prettified version makes this clear:

    <p>
     Which of the following is not a valid C variable name?
     <br>
      a) int number;
      <br>
       b) float rate;
       <br>
        c) int variable_count;
        <br>
         d) int $main;
        </br>
       </br>
      </br>
     </br>
    </p>
    

    So the text d) int $main; is not a sibling of the last <br> tag; it is the text contained in that tag.

    The code could be (here):

    ...
    soup = BeautifulSoup(htmls)
    for br in soup.findAll('br'):
        if len(br.contents) > 0:  # avoid errors if a tag is correctly closed as <br/>
            print 'Found', br.contents[0]
    

    It gives the expected output:

    Found 
    a) int number;
    Found 
    b) float rate;
    Found 
    c) int variable_count;
    Found 
    d) int $main;
    Found 
    a) They can contain alphanumeric characters as well as special characters
    Found 
    b) It is not an error to declare a variable to be one of the keywords(like goto, static)
    Found 
    c) Variable names cannot start with a digit
    Found 
    d) Variable can be of any length
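
Whether the text ends up as a sibling or as a child of <br> may depend on which parser builds the tree, so a sketch that does not rely on the tree shape at all can be useful for comparison. The example below uses the Python 3 standard-library `html.parser` instead of BeautifulSoup (a swapped-in technique, not the method of this answer). The class name `BrTextCollector` and the `SAMPLE` string are hypothetical, and the sample is a shortened copy of the markup from the question.

```python
from html.parser import HTMLParser


class BrTextCollector(HTMLParser):
    """Collect every text segment that follows a <br> inside a <p>."""

    def __init__(self):
        super().__init__()
        self.in_p = False      # currently inside a <p> element
        self.after_br = False  # a <br> was already seen in this <p>
        self.buffer = []       # text pieces of the segment being read
        self.segments = []     # finished, non-empty segments

    def _flush(self):
        # Keep the buffered text only if it came after a <br>.
        text = "".join(self.buffer).strip()
        if self.after_br and text:
            self.segments.append(text)
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # covers a previous <p> that was never closed
            self.in_p, self.after_br = True, False
        elif tag == "br" and self.in_p:
            self._flush()
            self.after_br = True

    def handle_endtag(self, tag):
        if tag == "p" and self.in_p:
            self._flush()  # this is what captures the last option
            self.in_p = self.after_br = False

    def handle_data(self, data):
        if self.in_p:
            self.buffer.append(data)


SAMPLE = """
<p>Which of the following is not a valid C variable name?<br>
a) int number;<br>
b) float rate;<br>
c) int variable_count;<br>
d) int $main;</p>   <!--not getting-->
<p> more </p>
"""

collector = BrTextCollector()
collector.feed(SAMPLE)
collector.close()
for segment in collector.segments:
    print("Found:", segment)
```

The segment after the last <br> is flushed when the closing </p> arrives, which is the case the sibling-based loop in the question skips.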
    

    【Comments】:

    • I get this: IndexError: list index out of range
    • @user3440716: It is hard to say without your real input. I think it is caused by br.contents[0]. My last edit should fix it.