【Question title】: Scraping the top ten stories of a website using Beautiful Soup
【Posted】: 2020-04-16 13:18:41
【Question】:

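I am trying to scrape the site http://edition.cnn.com/EVENTS/1996/year.in.review/ and get the top 10 stories. Below is my attempt so far; I'd like to know whether there is a simpler way that I'm overlooking. I'm also trying to find a way to remove the blank line between each print, since I don't know why there is a gap between the titles.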

import requests
from bs4 import BeautifulSoup
import lxml

html = """
<HTML>

<HEAD>
    <TITLE>Top Ten Stories From 1996</TITLE>
</HEAD>

<BODY BGCOLOR="#FFFFCC" LINK="#162323" ALINK="#FFFFCE" VLINK="#162323">

<CENTER>
<P><BR>

<TABLE BORDER="0" CELLPADDING="0" CELLSPACING="0">
    <TR>
        <TD><IMG SRC="logos.gif" WIDTH="112" HEIGHT="60" ALIGN="TOP"></TD>
        <TD><IMG SRC="banner.gif" WIDTH="360" HEIGHT="60" ALIGN="TOP"></TD>
    </TR>
</TABLE>
</P>
</CENTER>


<BLOCKQUOTE>
    <CENTER>

    <TABLE BORDER="0" CELLPADDING="2">
    <TR>
        <TD WIDTH="90" VALIGN="TOP" ROWSPAN="11">
            <P ALIGN="RIGHT"><B><TT>What were the biggest stories of the year?</TT></B><BR>
            <BR>
            <FONT SIZE="2">It's a question journalists like to ask themselves at the end of every
            year. Now you can join in the process. Here are our selections for the top ten news
            stories of 1996.<BR>
            <BR>
            Disagree with our choices? Then tell us what stories you think were most compelling
            in the poll below.</FONT>
        </TD>
        <TD WIDTH="4" ROWSPAN="11"></TD>
        <TD VALIGN="MIDDLE" ROWSPAN="11"><IMG SRC="generic/dot.gif" WIDTH="1" HEIGHT="250" ALIGN="MIDDLE"></TD>
        <TD WIDTH="10" ROWSPAN="11"></TD>
        <TD COLSPAN="4" VALIGN=TOP>
            <P ALIGN="CENTER"><IMG SRC="generic/topten.gif" WIDTH="263" HEIGHT="24" ALIGN="MIDDLE" VSPACE="5">
        </TD>
    </TR>
    <TR>
        <TD><A HREF="topten/israel/israel.index.html" TARGET=_top><IMG SRC="generic/1.gif" WIDTH="17" HEIGHT="17" ALIGN="MIDDLE" BORDER=0></A></TD>
        <TD><A HREF="topten/israel/israel.index.html" TARGET=_top><B>Israel</B> elects <B>Netanyahu</A></B></TD>
    </TR>
    <TR>
        <TD><A HREF="topten/twa/twa.index.html" TARGET=_top><IMG SRC="generic/2.gif" WIDTH="17" HEIGHT="17" ALIGN="MIDDLE" BORDER=0></A></TD>
        <TD><A HREF="topten/twa/twa.index.html" TARGET=_top>Crash of TWA Flight 800</A></TD>
    </TR>
    <TR>
        <TD><A HREF="topten/yeltsin/yeltsin.index.html" TARGET=_top><IMG SRC="generic/3.gif" WIDTH="17" HEIGHT="17" ALIGN="MIDDLE" BORDER=0></A></TD>
        <TD><A HREF="topten/yeltsin/yeltsin.index.html" TARGET=_top><B>Russia</B> elects <B>Yeltsin</B></A></TD>
    </TR>
    <TR>
        <TD><A HREF="topten/clinton/clinton.index.html" TARGET=_top><IMG SRC="generic/4.gif" WIDTH="17" HEIGHT="17" ALIGN="MIDDLE" BORDER=0></A></TD>
        <TD><A HREF="topten/clinton/clinton.index.html" TARGET=_top><B>U.S</B>. elects <B>Clinton</B></A></TD>
    </TR>
    <TR>
        <TD><A HREF="topten/hutu/hutu.index.html" TARGET=_top><IMG SRC="generic/5.gif" WIDTH="17" HEIGHT="17" ALIGN="MIDDLE" BORDER=0></A></TD>
        <TD><A HREF="topten/hutu/hutu.index.html" TARGET=_top><B>Hutu-Tutsi</B> conflict in central Africa</A></TD>
    </TR>
    <TR>
        <TD><A HREF="topten/bosnia/bosnia.index.html" TARGET=_top><IMG SRC="generic/6.gif" WIDTH="17" HEIGHT="17" ALIGN="MIDDLE" BORDER=0></A></TD>
        <TD><A HREF="topten/bosnia/bosnia.index.html" TARGET=_top>Peace, elections in <B>Bosnia</B></A></TD>
    </TR>
    <TR>
        <TD><A HREF="topten/saudi/saudi.index.html" TARGET=_top><IMG SRC="generic/7.gif" WIDTH="17" HEIGHT="17" ALIGN="MIDDLE" BORDER=0></A></TD>
        <TD><A HREF="topten/saudi/saudi.index.html" TARGET=_top><B>U.S</B>. base bombed in <B>Saudi Arabia</B></A></TD>
    </TR>
    <TR>
        <TD><A HREF="topten/olympics/olympics.index.html" TARGET=_top><IMG SRC="generic/8.gif" WIDTH="17" HEIGHT="17" ALIGN="MIDDLE" BORDER=0></A></TD>
        <TD><A HREF="topten/olympics/olympics.index.html" TARGET=_top>Centennial <B>Olympic</B> Games</A></TD>
    </TR>
    <TR>
        <TD><A HREF="topten/aids/aids.index.html" TARGET=_top><IMG SRC="generic/9.gif" WIDTH="17" HEIGHT="17" ALIGN="MIDDLE" BORDER=0></A></TD>
        <TD><A HREF="topten/aids/aids.index.html" TARGET=_top>Advances against <B>AIDS</B></A></TD>
    </TR>
    <TR>
        <TD><A HREF="topten/unabomb/unabomb.index.html" TARGET=_top><IMG SRC="generic/10.gif" WIDTH="17" HEIGHT="17" ALIGN="MIDDLE" BORDER=0></A></TD>
        <TD><A HREF="topten/unabomb/unabomb.index.html" TARGET=_top><B>Unabomb</B> suspect <B>Ted Kaczynski</B> arrested</A></TD>
    </TR>
    </TABLE>
<BR clear = "all">


    <TABLE WIDTH=300>
    <TR>
    <TD>
    <CENTER><A HREF="topten/poll.html" TARGET=_top><IMG SRC="poll.gif" WIDTH="120" HEIGHT="60" ALIGN="MIDDLE" BORDER="0"></CENTER></A>
    </TD>
    <TD>
    <CENTER><A HREF="http://www-cgi.cnn.com/cgi-bin/quiz/yir_main/go.pl/main" TARGET=_top><IMG SRC="quiz.gif" WIDTH="120" HEIGHT="60" ALIGN="MIDDLE" BORDER="0"></CENTER></A>
    </TD>
    </TR>
    <TR><TD COLSPAN=2><CENTER><A TARGET=_top HREF="http://www-cgi.cnn.com/cgi-bin/poll/heavypoll.pl?slug=9612%2Fyir_top_10">The top 10 stories according to our users</A></CENTER></TD></TR>
    </TABLE>


    <IMG SRC="generic/dot.gif" WIDTH="450" HEIGHT="1" ALIGN="MIDDLE"><BR>

    <BR><IMG SRC="generic/dot.gif" WIDTH="450" HEIGHT="1" ALIGN="MIDDLE"><BR>
    <BR>

    <CENTER>
    <A HREF="http://pathfinder.com/time/bestof1996/index.html" TARGET=_top>
    T I M E: The Best of 1996</A>
    <BR clear = "all"><BR>
        <A HREF="http://pathfinder.com/@@qsdFOQcA62PJWEWu/time/moy/index.html" TARGET=_top>
    T I M E: Man of the Year</A>
    <BR clear = "all"><BR>
    <A HREF="http://pathfinder.com/time/1996/" TARGET=_top>
    <IMG SRC="time.gif" WIDTH="540" HEIGHT="50" ALIGN="MIDDLE" BORDER="0"></A>
    <BR clear = "all"><BR><BR>
    <IMG SRC="generic/dot.gif" WIDTH="450" HEIGHT="1" ALIGN="MIDDLE">
    </CENTER>
    <BR clear = "all">

    <TABLE BORDER="0" CELLPADDING="0" CELLSPACING="0" WIDTH="63%">
    <TR>
        <TD WIDTH="100%">
            <P><B><TT>What makes a </TT></B><FONT SIZE="5"><TT><B>big</B></TT></FONT><TT><B>
            story </B></TT><FONT SIZE="5"><TT><B>BIG?</B></TT></FONT>


            <BLOCKQUOTE>
            <P>It depends on your criteria, of course, and your perspective. That's why we offered
            a poll to find out what you think.</P>
            <P>For our list, we polled producers throughout the CNN/Pathfinder family of networks
            and publications, and weighed such criteria as a story's long-term implications,
            geopolitical significance, user interest, amount of coverage, and old-fashioned newsworthiness.
            All these things help make a &quot;big&quot; story big.</P>

            <P>By no means do we think our lists are the final word. Even our polls among CNN
            producers turned up a wide variety of responses. The process is meant to encourage
            you to reconsider the stories that dominated the media during the past year and determine
            for yourself which were mere sensations and which were truly significant.
            </BLOCKQUOTE>
        </TD>
    </TR>
    </TABLE>

<BR CLEAR=ALL>
<BR>
<CENTER>
<BR CLEAR=ALL>
<BR>
<IMG SRC="generic/dot.gif" WIDTH="450" HEIGHT="1" ALIGN="MIDDLE">
<BR CLEAR=ALL><BR>

<TABLE WIDTH=300><TR VALIGN=CENTER>
<TD ALIGN=CENTER><IMG SRC="what_you_think.gif" ALT="What you think" WIDTH="60" HEIGHT="59" BORDER="0"></TD>
<TD><STRONG><A NAME="_top" HREF="/feedback/index.html">Tell us what you think</A></STRONG><BR><BR>
<STRONG><A NAME="_top" HREF="/feedback/comments.html">You said it...</A></STRONG></TD>
</TR></TABLE>

<BR CLEAR=ALL>
<BR>
<IMG SRC="generic/dot.gif" WIDTH="450" HEIGHT="1" ALIGN="MIDDLE">
<BR CLEAR=ALL><BR>
</CENTER>

<CENTER><A HREF="generic/credits.index.html" TARGET=_top><TT><B>C R E D I T S</B></TT></A></CENTER>

<BR CLEAR=ALL>
<BR>
<IMG SRC="generic/dot.gif" WIDTH="450" HEIGHT="1" ALIGN="MIDDLE">
<BR CLEAR=ALL><BR>
<CENTER><A HREF="#TOP"><TT><B>Back to top</B></TT></A></CENTER>
<BR CLEAR=ALL><BR>
<FONT SIZE=-1><P>&#169; 1996 Cable News Network, Inc.<BR>
All Rights Reserved.</FONT>
<H6><A HREF="http://cnn.com/interactive_legal.html" target=_top>Terms</A> under which this
     service is provided to you.</H6>
</CENTER>
</CENTER>
    </BLOCKQUOTE>


</BODY>

</HTML>
"""


soup = BeautifulSoup(html, "lxml")
td_list = soup.find_all('td')
count = 0
for link in td_list:
    if count == 20:
        pass
    elif link.a is not None:
        print(link.text.strip())
        count += 1

Output:

Israel elects Netanyahu

Crash of TWA Flight 800

Russia elects Yeltsin

U.S. elects Clinton

Hutu-Tutsi conflict in central Africa

Peace, elections in Bosnia

U.S. base bombed in Saudi Arabia

Centennial Olympic Games

Advances against AIDS

Unabomb suspect Ted Kaczynski arrested

【Discussion】:

    Tags: python html python-3.x web-scraping beautifulsoup


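    【Solution 1】:

    Well, I used `re` as a shortcut: it selects every `a` tag whose `href` value starts with `topten`. You can also do it a different way, for example with a CSS selector: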

    for item in soup.select("a[href^=topten]"):
    

    Then I take all the text inside each tag, strip every fragment with `strip=True`, and pass a single-space `separator` so that the fragments are not glued together.

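    import requests
    from bs4 import BeautifulSoup
    import re
    
    
    def main(url):
        r = requests.get(url)
        soup = BeautifulSoup(r.content, 'html.parser')
        # Match only <a> tags whose href starts with "topten".
        for item in soup.find_all("a", href=re.compile("^topten")):
            # Strip each text fragment and join the fragments with one space.
            text = item.get_text(strip=True, separator=" ")
            # Image-only links have no text; skip them to avoid blank lines.
            if text:
                print(text)
    
    
    main("http://edition.cnn.com/EVENTS/1996/year.in.review/main.html")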
    

    Output:

    Israel elects Netanyahu
    Crash of TWA Flight 800
    Russia elects Yeltsin
    U.S . elects Clinton
    Hutu-Tutsi conflict in central Africa
    Peace, elections in Bosnia
    U.S . base bombed in Saudi Arabia
    Centennial Olympic Games
    Advances against AIDS
    Unabomb suspect Ted Kaczynski arrested
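
The CSS-selector variant mentioned above can be sketched offline as follows. This is only a minimal illustration, not the answerer's exact code: `SAMPLE_HTML` is a shortened excerpt of the table from the question and `top_stories` is a hypothetical helper name. It also shows where the blank lines in the question's output most likely come from: each row has a second link that wraps only an image, so its text is empty.

```python
from bs4 import BeautifulSoup

# Shortened excerpt of the table in the question (assumed representative).
SAMPLE_HTML = """
<TABLE>
  <TR>
    <TD><A HREF="topten/twa/twa.index.html"><IMG SRC="generic/2.gif"></A></TD>
    <TD><A HREF="topten/twa/twa.index.html">Crash of TWA Flight 800</A></TD>
  </TR>
  <TR>
    <TD><A HREF="topten/yeltsin/yeltsin.index.html"><IMG SRC="generic/3.gif"></A></TD>
    <TD><A HREF="topten/yeltsin/yeltsin.index.html"><B>Russia</B> elects <B>Yeltsin</B></A></TD>
  </TR>
  <TR><TD><A HREF="topten/poll.html"><IMG SRC="poll.gif"></A></TD></TR>
</TABLE>
"""


def top_stories(html):
    """Return the text of every link whose href starts with 'topten'."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for link in soup.select('a[href^="topten"]'):
        text = link.get_text(strip=True, separator=" ")
        # Links that wrap only an <IMG> yield an empty string; skip them.
        if text:
            titles.append(text)
    return titles


print("\n".join(top_stories(SAMPLE_HTML)))
```

Returning a list instead of printing inside the loop makes it easy to slice the result (for example `[:10]`) if the page ever contains more matching links than expected.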
    

    【Comments】:

    • Can you tell me why you used a re expression and item.get_text?
    • What does strip do, and how does the separator work?
    • @MahdeenSky I have updated my answer; you can also avoid using regex.