【Question title】: How to find spans with a specific class containing specific text using beautiful soup and re?
【Posted】: 2013-04-21 08:28:15
【Question】:

How can I find all spans that have the class 'blue' and contain text in the format:

04/18/13 7:29pm

So the span text could be:

04/18/13 7:29pm

or:

Posted on 04/18/13 7:29pm

In terms of constructing the logic to do this, this is what I have so far:

new_content = original_content.find_all('span', {'class' : 'blue'}) # using beautiful soup's find_all
pattern = re.compile('<span class=\"blue\">[data in the format 04/18/13 7:29pm]</span>') # using re
for _ in new_content:
    result = re.findall(pattern, _)
    print result

I have been referring to https://stackoverflow.com/a/7732827 and https://stackoverflow.com/a/12229134 to try to work out a way to do this, but the above is all I have so far.

Edit:

To clarify the scenario, there are spans like:

<span class="blue">here is a lot of text that i don't need</span>

<span class="blue">this is the span i need because it contains 04/18/13 7:29pm</span>

Note that I only need 04/18/13 7:29pm and not the rest of the content.

Edit 2:

I have also tried:

pattern = re.compile('<span class="blue">.*?(\d\d/\d\d/\d\d \d\d?:\d\d\w\w)</span>')
for _ in new_content:
    result = re.findall(pattern, _)
    print result

and got the error:

'TypeError: expected string or buffer'

【Comments】:

    Tags: python regex beautifulsoup


    【Solution 1】:
    import re
    from bs4 import BeautifulSoup
    
    html_doc = """
    <html>
    <body>
    <span class="blue">here is a lot of text that i don't need</span>
    <span class="blue">this is the span i need because it contains 04/18/13 7:29pm</span>
    <span class="blue">04/19/13 7:30pm</span>
    <span class="blue">Posted on 04/20/13 10:31pm</span>
    </body>
    </html>
    """
    
    # parse the html
    soup = BeautifulSoup(html_doc, 'html.parser')
    
    # find a list of all span elements
    spans = soup.find_all('span', {'class' : 'blue'})
    
    # create a list of lines corresponding to element texts
    lines = [span.get_text() for span in spans]
    
    # collect the dates from the list of lines using regex matching groups
    found_dates = []
    for line in lines:
        m = re.search(r'(\d{2}/\d{2}/\d{2} \d+:\d+[ap]m)', line)
        if m:
            found_dates.append(m.group(1))
    
    # print the dates we collected
    for date in found_dates:
        print(date)
    

    Output:

    04/18/13 7:29pm
    04/19/13 7:30pm
    04/20/13 10:31pm
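
As a variation on the approach above, Beautiful Soup can also do the text filtering itself. The following is a minimal sketch, assuming a bs4 version that accepts the `string=` argument (older releases call it `text=`) and assuming each span holds a single text node; the `red` span and the name `DATE_RE` are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

html_doc = """
<span class="blue">here is a lot of text that i don't need</span>
<span class="blue">this is the span i need because it contains 04/18/13 7:29pm</span>
<span class="red">04/19/13 7:30pm</span>
<span class="blue">Posted on 04/20/13 10:31pm</span>
"""

# Hypothetical name; the pattern follows the date format from the question.
DATE_RE = re.compile(r'\d{2}/\d{2}/\d{2}\s+\d{1,2}:\d{2}[ap]m')

soup = BeautifulSoup(html_doc, 'html.parser')

# string= is tested against each tag's .string, so this only finds spans
# whose content is a single text node (no nested tags).
matching_spans = soup.find_all('span', class_='blue', string=DATE_RE)

# Pull just the date out of each matching span.
dates = [DATE_RE.search(span.get_text()).group(0) for span in matching_spans]
print(dates)
```

If the spans may contain nested markup, looping over `get_text()` as in the answer is the safer route, because `.string` is `None` for tags with more than one child.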
    

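    【Comments】:

    • I can run the exact code above successfully, but it did not work in my implementation. I think this was because the original source has an &nbsp; between the date and the time, e.g. 04/18/13&nbsp;7:29pm. For reference, I added .replace("&nbsp;"," ") to the original 'urlopen read object' and it worked. Many thanks (and thanks to all responders!).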
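
The `&nbsp;` problem described in this comment can also be handled in the pattern instead of by editing the raw HTML. This is only a sketch under some assumptions: it uses the standard library's `html.unescape` to stand in for the entity decoding an HTML parser performs, and the names `STRICT` and `TOLERANT` are hypothetical. In Python 3, `\s` in a `str` pattern matches the non-breaking space (`\xa0`) that `&nbsp;` decodes to.

```python
import html
import re

raw = '<span class="blue">Posted on 04/18/13&nbsp;7:29pm</span>'

# Decoding the entity turns &nbsp; into the character '\xa0'.
decoded = html.unescape(raw)

# A literal space in the pattern does not match '\xa0' ...
STRICT = re.compile(r'\d{2}/\d{2}/\d{2} \d+:\d+[ap]m')
strict_match = STRICT.search(decoded)

# ... but \s+ does, because '\xa0' counts as Unicode whitespace.
TOLERANT = re.compile(r'(\d{2}/\d{2}/\d{2})\s+(\d+:\d+[ap]m)')
m = TOLERANT.search(decoded)

# Re-join the two captured parts with an ordinary space.
normalized = ' '.join(m.groups())
print(normalized)
```

Capturing the date and the time separately makes it easy to emit a normalized string regardless of which whitespace character appeared in the source.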
    【Solution 2】:

    Here is a flexible regex you can use:

    "(\d\d?/\d\d?/\d\d\d?\d?\s*\d\d?:\d\d[apAP][mM])"
    

    Example:

    >>> import re
    >>> from bs4 import BeautifulSoup
    >>> html = """
    <html>
    <body>
    <span class="blue">here is a lot of text that i don't need</span>
    <span class="blue">this is the span i need because it contains 04/18/13 7:29pm</span>
    <span class="blue">04/19/13 7:30pm</span>
    <span class="blue">04/18/13 7:29pm</span>
    <span class="blue">Posted on 15/18/2013 10:00AM</span>
    <span class="blue">Posted on 04/20/13 10:31pm</span>
    <span class="blue">Posted on 4/1/2013 17:09aM</span>
    </body>
    </html>
    """
    >>> soup = BeautifulSoup(html, 'html.parser')
    >>> lines = [i.get_text() for i in soup.find_all('span', {'class' : 'blue'})]
    >>> ok = [m.group(1)
    ...       for line in lines
    ...         for m in (re.search(r'(\d\d?/\d\d?/\d\d\d?\d?\s*\d\d?:\d\d[apAP][mM])', line),)
    ...           if m]
    >>> ok
    ['04/18/13 7:29pm', '04/19/13 7:30pm', '04/18/13 7:29pm', '15/18/2013 10:00AM', '04/20/13 10:31pm', '4/1/2013 17:09aM']
    >>> for i in ok:
    ...     print(i)
    ...
    04/18/13 7:29pm
    04/19/13 7:30pm
    04/18/13 7:29pm
    15/18/2013 10:00AM
    04/20/13 10:31pm
    4/1/2013 17:09aM
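
Note that the flexible pattern accepts strings such as 15/18/2013 10:00AM and 4/1/2013 17:09aM, which have the right shape but are not valid 12-hour timestamps. If that matters, one option is to run each match through `datetime.strptime` afterwards. The sketch below assumes month/day/year order and an English (or C) locale for `%p`; `FORMATS` and `parse_timestamp` are hypothetical names, not part of the answer.

```python
from datetime import datetime

# Matches produced by the flexible regex in the example above.
candidates = [
    '04/18/13 7:29pm',
    '04/19/13 7:30pm',
    '04/18/13 7:29pm',
    '15/18/2013 10:00AM',
    '04/20/13 10:31pm',
    '4/1/2013 17:09aM',
]

# Assumed layouts: two-digit year first, then four-digit year.
FORMATS = ('%m/%d/%y %I:%M%p', '%m/%d/%Y %I:%M%p')


def parse_timestamp(text):
    """Return a datetime if text fits one of FORMATS, otherwise None."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None


# Keep only the candidates that are real calendar dates and 12-hour times.
valid = [c for c in candidates if parse_timestamp(c) is not None]
print(valid)
```

Whether rejecting these strings is desirable depends on the data; a site that uses day/month order would need different format strings.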
    

    【Comments】:

      【Solution 3】:

      This pattern seems to do what you need:

      >>> pattern = re.compile(r'<span class="blue">.*?(\d\d/\d\d/\d\d \d\d?:\d\d\w\w)</span>')
      >>> pattern.match('<span class="blue">here is a lot of text that i dont need</span>')
      >>> pattern.match('<span class="blue">this is the span i need because it contains 04/18/13 7:29pm</span>').groups()
      ('04/18/13 7:29pm',)
      

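      【Comments】:

      • I am not sure how to implement this. I have posted the code I tried, based on your suggestion, in the original post (see Edit 2).
      • @user1063287 Try changing your third line to result = pattern.match(_).groups(). re.findall expects a string (like the one you used when you called re.compile earlier), and instead you are giving it an already compiled regex. Essentially, you are trying to compile your pattern twice.
      • It sounds like _ is not a string yet; you need to extract the actual string from the _ variable before you can use a regex on it. I assume you can call something like _.string. Try some print statements such as print _ and print dir(_) to work out what kind of object you are dealing with.
      • @user1063287 Corey's answer gives you more thorough instructions on how to do this; the method you need to call on _ is get_text(). But he provides a more complete answer :)
      • The AttributeError you are getting comes from the cases where the regex does not match the string, so it returns None. That causes the code to call None.groups(), which does not exist. Corey's code accounts for this with his if m: line, which is why I pointed you to his code. Hope this helps!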
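
The two errors discussed in these comments can be reproduced and avoided in a few lines. This is a sketch rather than anyone's actual code from the thread: it reuses the asker's variable name `new_content`, drops the `<span>` wrapper from the pattern because `get_text()` returns only the text, and introduces `tag_error` and `results` purely for illustration.

```python
import re
from bs4 import BeautifulSoup

html_doc = """
<span class="blue">here is a lot of text that i don't need</span>
<span class="blue">this is the span i need because it contains 04/18/13 7:29pm</span>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
new_content = soup.find_all('span', {'class': 'blue'})

pattern = re.compile(r'(\d\d/\d\d/\d\d \d\d?:\d\d\w\w)')

# Passing a Tag object (not a string) to re is what raises the TypeError.
tag_error = None
try:
    re.findall(pattern, new_content[0])
except TypeError as exc:
    tag_error = exc

results = []
for span in new_content:
    text = span.get_text()       # convert the Tag to a plain string first
    m = pattern.search(text)
    if m is None:                # no date in this span: skip it instead of
        continue                 # calling .groups() on None (AttributeError)
    results.append(m.group(1))

print(results)
```

Using `search` rather than `match` lets the date appear anywhere in the span text, which covers both the bare date and the "Posted on ..." form from the question.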