【Question title】: How to download in-text images with Beautiful Soup
【Posted】: 2018-08-29 14:20:49
【Question】:

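I'm trying to write a website scraper in Python using Beautiful Soup and requests. I can easily collect all the text I want, but some of the text I'm trying to download contains important inline images. I want to replace each image with its title and add it to a string that I can parse later, but I don't know how to do that.

Here is a sample of the kind of HTML I'm trying to parse: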

    <td colspan="3"><b>"Assemble under Siegfried!"</b> 
        <a href="/wiki/index.php/File:Continuous.png" class="image" title="CONT"><img alt="CONT" src="/wiki/images/thumb/7/78/Continuous.png/14px-Continuous.png" width="14" height="17" srcset="/wiki/images/thumb/7/78/Continuous.png/21px-Continuous.png 1.5x, /wiki/images/7/78/Continuous.png 2x">
        </a> This unit gains +10 attack for each 
        <a href="/wiki/index.php/File:Black.png" class="image" title="Black"><img alt="Black" src="/wiki/images/thumb/7/71/Black.png/15px-Black.png" width="15" height="15" srcset="/wiki/images/thumb/7/71/Black.png/23px-Black.png 1.5x, /wiki/images/thumb/7/71/Black.png/30px-Black.png 2x">
        </a> and 
        <a href="/wiki/index.php/File:White.png" class="image" title="White"><img alt="White" src="/wiki/images/thumb/8/80/White.png/15px-White.png" width="15" height="15" srcset="/wiki/images/thumb/8/80/White.png/23px-White.png 1.5x, /wiki/images/thumb/8/80/White.png/30px-White.png 2x">
        </a> ally besides this unit.
    </td>

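From this HTML, I want to extract:

"Assemble under Siegfried! CONT This unit gains +10 attack for each Black and White ally besides this unit."

The plain get_text() method does not include the images' titles, and that is the problem.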
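
To make the problem concrete, here is a minimal sketch on a shortened copy of the markup above. The variable names are illustrative and `html.parser` is only one possible parser choice. The titles live in the `title` and `alt` attributes, which `get_text()` does not read:

```python
from bs4 import BeautifulSoup

# Shortened, illustrative copy of the question's markup (one image link only).
sample = (
    '<td colspan="3"><b>"Assemble under Siegfried!"</b> '
    '<a href="/wiki/index.php/File:Black.png" class="image" title="Black">'
    '<img alt="Black" src="/wiki/images/thumb/7/71/Black.png/15px-Black.png"></a>'
    ' ally besides this unit.</td>'
)

cell = BeautifulSoup(sample, "html.parser").find("td")
plain_text = cell.get_text()

# get_text() concatenates text nodes only; attribute values are skipped.
print("Black" in plain_text)    # False
print(cell.find("a")["title"])  # Black
```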

【Question comments】:

    Tags: python html web-scraping beautifulsoup python-requests


    【Solution 1】:

    Oh... I see what you need.

    Try this:

    html_data = """ <td colspan="3"><b>"Assemble under Siegfried!"</b> 
        <a href="/wiki/index.php/File:Continuous.png" class="image" title="CONT"><img alt="CONT" src="/wiki/images/thumb/7/78/Continuous.png/14px-Continuous.png" width="14" height="17" srcset="/wiki/images/thumb/7/78/Continuous.png/21px-Continuous.png 1.5x, /wiki/images/7/78/Continuous.png 2x">
        </a> This unit gains +10 attack for each 
        <a href="/wiki/index.php/File:Black.png" class="image" title="Black"><img alt="Black" src="/wiki/images/thumb/7/71/Black.png/15px-Black.png" width="15" height="15" srcset="/wiki/images/thumb/7/71/Black.png/23px-Black.png 1.5x, /wiki/images/thumb/7/71/Black.png/30px-Black.png 2x">
        </a> and 
        <a href="/wiki/index.php/File:White.png" class="image" title="White"><img alt="White" src="/wiki/images/thumb/8/80/White.png/15px-White.png" width="15" height="15" srcset="/wiki/images/thumb/8/80/White.png/23px-White.png 1.5x, /wiki/images/thumb/8/80/White.png/30px-White.png 2x">
        </a> ally besides this unit.
    </td>"""
    from bs4 import BeautifulSoup
    html = BeautifulSoup(html_data, "html.parser")
    
    texts = [html.find("b").get_text()]
    for a in html.find_all("a"):
        texts.append(a.attrs.get("title"))
        texts.append(a.next_element.next_element.next_element.strip())
    print(" ".join(texts))
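
The chain of three `next_element` hops in the snippet above can be hard to follow. A minimal sketch, assuming a shortened copy of one link from the question and the `html.parser` backend, shows what each hop lands on (the names `hop1` to `hop3` are illustrative):

```python
from bs4 import BeautifulSoup

# Shortened, illustrative copy of one link from the question's markup.
snippet = (
    '<td><a href="/wiki/index.php/File:Continuous.png" class="image" title="CONT">'
    '<img alt="CONT" src="/wiki/images/thumb/7/78/Continuous.png/14px-Continuous.png">\n'
    '    </a> This unit gains +10 attack for each </td>'
)
a = BeautifulSoup(snippet, "html.parser").find("a")

hop1 = a.next_element     # the <img> tag inside the link
hop2 = hop1.next_element  # the whitespace between <img> and </a>
hop3 = hop2.next_element  # the text that follows </a>

print(hop1.name)           # img
print(repr(hop3.strip()))  # 'This unit gains +10 attack for each'

# next_sibling reaches the same text node without counting hops.
print(a.next_sibling == hop3)  # True
```

The three-hop chain relies on there being whitespace between `<img>` and `</a>`. If that whitespace is missing, `a.next_sibling` is likely the sturdier choice.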
    

    I'm not sure this is exactly what you want, but what you need are the tags' attributes.

    Example:

    from bs4 import BeautifulSoup

    html = BeautifulSoup(html_data, "html.parser")
    for a in html.find_all("a"):
        print(a.attrs.get("title"))
    

    Output:

    CONT
    Black
    White
    

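    If you want to download the images:

    import os
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    cdn_url = "http://some.com/"  # root URL of the site that serves the static content
    html = BeautifulSoup(html_data, "html.parser")
    for img in html.find_all("img"):
        img_url = urljoin(cdn_url, img.attrs.get("src"))
        img_response = requests.get(img_url)
        # save the image content to a file named after the last URL segment
        with open(os.path.basename(img_url), "wb") as f:
            f.write(img_response.content)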
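
The `src` values in the question are root-relative, so they have to be resolved against the site's address before they can be fetched. A small sketch of what `urljoin` produces, keeping `http://some.com/` as the placeholder root (the `Some_page` URL is hypothetical):

```python
from urllib.parse import urljoin

cdn_url = "http://some.com/"  # placeholder root URL, not a real endpoint
src = "/wiki/images/thumb/7/71/Black.png/15px-Black.png"

full_url = urljoin(cdn_url, src)
print(full_url)
# http://some.com/wiki/images/thumb/7/71/Black.png/15px-Black.png

# A root-relative src (leading "/") replaces any path on the base URL,
# so joining against a page URL gives the same result.
from_page = urljoin("http://some.com/wiki/index.php/Some_page", src)
print(from_page == full_url)  # True
```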
    

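    【Comments】:

      【Solution 2】:

      The output you expect from the HTML above is not easy to get (at least for me). However, I've tried an approach that gets you the exact output you want.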

      from bs4 import BeautifulSoup
      
      content="""
      <td colspan="3"><b>"Assemble under Siegfried!"</b> 
          <a href="/wiki/index.php/File:Continuous.png" class="image" title="CONT"><img alt="CONT" src="/wiki/images/thumb/7/78/Continuous.png/14px-Continuous.png" width="14" height="17" srcset="/wiki/images/thumb/7/78/Continuous.png/21px-Continuous.png 1.5x, /wiki/images/7/78/Continuous.png 2x">
          </a> This unit gains +10 attack for each 
          <a href="/wiki/index.php/File:Black.png" class="image" title="Black"><img alt="Black" src="/wiki/images/thumb/7/71/Black.png/15px-Black.png" width="15" height="15" srcset="/wiki/images/thumb/7/71/Black.png/23px-Black.png 1.5x, /wiki/images/thumb/7/71/Black.png/30px-Black.png 2x">
          </a> and 
          <a href="/wiki/index.php/File:White.png" class="image" title="White"><img alt="White" src="/wiki/images/thumb/8/80/White.png/15px-White.png" width="15" height="15" srcset="/wiki/images/thumb/8/80/White.png/23px-White.png 1.5x, /wiki/images/thumb/8/80/White.png/30px-White.png 2x">
          </a> ally besides this unit.
      </td>
      """
      soup = BeautifulSoup(content,"lxml")
      part1 = soup.select_one("td > b").text.strip('"')
      part2 = ' '.join(''.join([''.join([item['title'], item.next_sibling]) for item in soup.select("td a")]).split())
      print("{} {}".format(part1,part2))
      

      Output:

      Assemble under Siegfried! CONT This unit gains +10 attack for each Black and White ally besides this unit.
      

      Let's not do it this way again.
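
The nested joins in `part2` are dense, so here is an unrolled sketch of the same idea: pair each link's `title` with the text node that follows it, then collapse the whitespace. It assumes a shortened copy of the markup and swaps in `html.parser` for `lxml`. The names `pieces` and `result` are illustrative.

```python
from bs4 import BeautifulSoup

# Shortened, illustrative copy of the markup (srcset, width and height dropped).
content = """
<td colspan="3"><b>"Assemble under Siegfried!"</b>
    <a href="/wiki/index.php/File:Continuous.png" class="image" title="CONT"><img alt="CONT" src="/wiki/images/thumb/7/78/Continuous.png/14px-Continuous.png">
    </a> This unit gains +10 attack for each
    <a href="/wiki/index.php/File:Black.png" class="image" title="Black"><img alt="Black" src="/wiki/images/thumb/7/71/Black.png/15px-Black.png">
    </a> and
    <a href="/wiki/index.php/File:White.png" class="image" title="White"><img alt="White" src="/wiki/images/thumb/8/80/White.png/15px-White.png">
    </a> ally besides this unit.
</td>
"""
soup = BeautifulSoup(content, "html.parser")

# The bold heading, without its surrounding quotation marks.
part1 = soup.select_one("td > b").text.strip('"')

pieces = []
for link in soup.select("td a"):
    pieces.append(link["title"])                # the image's title
    pieces.append(str(link.next_sibling or ""))  # the text right after </a>

# split() + join() collapses newlines and runs of spaces into single spaces.
part2 = " ".join("".join(pieces).split())
result = "{} {}".format(part1, part2)
print(result)
```

This depends on whitespace separating each text node from the next title. Markup without that whitespace would fuse a title onto the preceding word.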

      【Comments】:

        【Solution 3】:

        Another way is to iterate over the contents of the td tag. I find this a bit easier to understand.

        html = '''<td colspan="3"><b>"Assemble under Siegfried!"</b> 
            <a href="/wiki/index.php/File:Continuous.png" class="image" title="CONT"><img alt="CONT" src="/wiki/images/thumb/7/78/Continuous.png/14px-Continuous.png" width="14" height="17" srcset="/wiki/images/thumb/7/78/Continuous.png/21px-Continuous.png 1.5x, /wiki/images/7/78/Continuous.png 2x">
            </a> This unit gains +10 attack for each 
            <a href="/wiki/index.php/File:Black.png" class="image" title="Black"><img alt="Black" src="/wiki/images/thumb/7/71/Black.png/15px-Black.png" width="15" height="15" srcset="/wiki/images/thumb/7/71/Black.png/23px-Black.png 1.5x, /wiki/images/thumb/7/71/Black.png/30px-Black.png 2x">
            </a> and 
            <a href="/wiki/index.php/File:White.png" class="image" title="White"><img alt="White" src="/wiki/images/thumb/8/80/White.png/15px-White.png" width="15" height="15" srcset="/wiki/images/thumb/8/80/White.png/23px-White.png 1.5x, /wiki/images/thumb/8/80/White.png/30px-White.png 2x">
            </a> ally besides this unit.
        </td>'''
        
        from bs4 import BeautifulSoup

        soup = BeautifulSoup(html, 'lxml')
        final_text = []
        
        for content in soup.find('td').contents:
            if content.name == 'a':
                final_text.append(content['title'])
            elif content.name == 'b':
                final_text.append(content.text.strip())
            else:
                final_text.append(content.strip())
        
        print(' '.join(final_text))
        

        Output:

        "Assemble under Siegfried!"  CONT This unit gains +10 attack for each Black and White ally besides this unit.
        

        Or, as a one-liner:

        final_text = ' '.join((x['title'] if x.name == 'a' else (x.text.strip() if x.name == 'b' else x.strip())) for x in soup.find('td').contents)
        print(final_text)
        

        Or, better still, wrap it in a function named along the lines of get_text() to get the text of the td tag:

        def get_modified_text(td):
            return ' '.join((x['title'] if x.name == 'a' else (x.text.strip() if x.name == 'b' else x.strip())) for x in td.contents)
        
        soup = BeautifulSoup(html, 'lxml')
        print(get_modified_text(soup.find('td')))
        # "Assemble under Siegfried!"  CONT This unit gains +10 attack for each Black and White ally besides this unit.
        

        Note: if you don't want the quotation marks " around the first text, just use .strip('"').
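
The function above handles only `a` and `b` children, and the empty text node between `</b>` and the first link is what produces the double space in the output. A possible generalisation, offered as a sketch rather than a definitive version, recurses into any nested tag and normalises whitespace at the end. The function name `text_with_image_titles` is hypothetical, and `html.parser` is assumed instead of `lxml`.

```python
from bs4 import BeautifulSoup, NavigableString

def text_with_image_titles(node):
    """Return node's text, with each image (or image link) replaced by its title."""
    pieces = []
    for child in node.children:
        if isinstance(child, NavigableString):
            pieces.append(str(child))                      # plain text node
        elif child.name == "a" and child.find("img") is not None:
            img = child.find("img")
            # Prefer the link's title, fall back to the image's alt text.
            pieces.append(child.get("title") or img.get("alt", ""))
        elif child.name == "img":
            pieces.append(child.get("alt", ""))            # bare image
        else:
            pieces.append(text_with_image_titles(child))   # any other tag
    # Collapse all runs of whitespace into single spaces.
    return " ".join(" ".join(pieces).split())

# Shortened, illustrative copy of the markup (srcset, width and height dropped).
html = """<td colspan="3"><b>"Assemble under Siegfried!"</b>
    <a href="/wiki/index.php/File:Continuous.png" class="image" title="CONT"><img alt="CONT" src="/wiki/images/thumb/7/78/Continuous.png/14px-Continuous.png">
    </a> This unit gains +10 attack for each
    <a href="/wiki/index.php/File:Black.png" class="image" title="Black"><img alt="Black" src="/wiki/images/thumb/7/71/Black.png/15px-Black.png">
    </a> and
    <a href="/wiki/index.php/File:White.png" class="image" title="White"><img alt="White" src="/wiki/images/thumb/8/80/White.png/15px-White.png">
    </a> ally besides this unit.
</td>"""

soup = BeautifulSoup(html, "html.parser")
final_text = text_with_image_titles(soup.find("td"))
print(final_text)
```

One trade-off: joining the pieces with a space can insert a space where the markup had none (for example between `foo` and `<b>bar</b>`), so this suits prose like the sample more than tightly packed inline markup.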

        【Comments】:

        • @ElleryBuntel, if you need an explanation of any line of code, feel free to ask.