Find on beautiful soup in loop returns TypeError

【Question title】: Find on beautiful soup in loop returns TypeError
【Posted】: 2012-07-30 03:47:18
【Question】:

I'm trying to scrape a table on an ajax page with Beautiful Soup and print it out in table form using the TextTable library.

import BeautifulSoup
import urllib
import urllib2
import getpass
import cookielib
import texttable

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)

...

def show_queue():
    url = 'https://www.animenfo.com/radio/nowplaying.php'
    values = {'ajax' : 'true', 'mod' : 'queue'}
    data = urllib.urlencode(values)
    f = opener.open(url, data)
    soup = BeautifulSoup.BeautifulSoup(f)
    stable = soup.find('table')
    table = texttable.Texttable()
    header = stable.findAll('th')
    header_text = []
    for th in header:
        header_append = th.find(text=True)
        header.append(header_append)
    table.header(header_text)
    rows = stable.find('tr')
    for tr in rows:
        cells = []
        cols = tr.find('td')
        for td in cols:
            cells_append = td.find(text=True)
            cells.append(cells_append)
        table.add_row(cells)
    s = table.draw
    print s

...

The URL of the HTML I'm trying to scrape is shown in the code, but here is a sample of it:

<table cellspacing="0" cellpadding="0">
    <tbody>
        <tr>
                        <th>Artist - Title</th>
            <th>Album</th>
            <th>Album Type</th>
            <th>Series</th>
            <th>Duration</th>
            <th>Type of Play</th>
            <th>
                <span title="...">Time to play</span>
            </th>
                    </tr>
                <tr>
                        <td class="row1">
                <a href="..." class="songinfo">Song 1</a>
            </td>
            <td class="row1">
                <a href="..." class="album_link">Album 1</a>
            </td>
            <td class="row1">...</td>
            <td class="row1">

            </td>
            <td class="row1" style="text-align: center">
                5:43
            </td>
            <td class="row1" style="padding-left: 5px; text-align: center">
                                    S.A.M.
                            </td>
            <td class="row1" style="text-align: center">
                ~0:00:00
            </td>
                    </tr>
                <tr>
                        <td class="row2">
                <a href="..." class="songinfo">Song2</a>
            </td>
            <td class="row2">
                <a href="..." class="album_link">Album 2</a>
            </td>
            <td class="row2">...</td>
            <td class="row2">

            </td>
            <td class="row2" style="text-align: center">
                6:16
            </td>
            <td class="row2" style="padding-left: 5px; text-align: center">
                                    S.A.M.
                            </td>
            <td class="row2" style="text-align: center">
                ~0:05:43
            </td>
                    </tr>
                <tr>
                        <td class="row1">
                <a href="..." class="songinfo">Song 3</a>
            </td>
            <td class="row1">
                <a href="..." class="album_link">Album 3</a>
            </td>
            <td class="row1">...</td>
            <td class="row1">

            </td>
            <td class="row1" style="text-align: center">
                4:13
            </td>
            <td class="row1" style="padding-left: 5px; text-align: center">
                                    S.A.M.
                            </td>
            <td class="row1" style="text-align: center">
                ~0:11:59
            </td>
                    </tr>
                <tr>
                        <td class="row2">
                <a href="..." class="songinfo">Song 4</a>
            </td>
            <td class="row2">
                <a href="..." class="album_link">Album 4</a>
            </td>
            <td class="row2">...</td>
            <td class="row2">

            </td>
            <td class="row2" style="text-align: center">
                5:34
            </td>
            <td class="row2" style="padding-left: 5px; text-align: center">
                                    S.A.M.
                            </td>
            <td class="row2" style="text-align: center">
                ~0:16:12
            </td>
                    </tr>
                <tr>
                        <td class="row1"><a href="..." class="songinfo">Song 5</a>

            </td>
            <td class="row1">
                <a href="..." class="album_link">Album 5</a>
            </td>
            <td class="row1">...</td>
            <td class="row1"></td>
            <td class="row1" style="text-align: center">
                4:23
            </td>
            <td class="row1" style="padding-left: 5px; text-align: center">
                                    S.A.M.
                            </td>
            <td class="row1" style="text-align: center">
                ~0:21:46
            </td>
                    </tr>
                <tr>
            <td style="height: 5px;">
        </td></tr>
        <tr>
            <td class="row2" style="font-style: italic; text-align: center;" colspan="5">There are x songs in the queue with a total length of x:y:z.</td>
        </tr>
    </tbody>
</table>

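Whenever I run this function, it aborts with TypeError: find() takes no keyword arguments on the line header_append = th.find(text=True). I'm a bit stumped, because I seem to be doing what the code examples show, and it seems like it should work, but it doesn't.

In short, how do I fix the code so that there is no TypeError, and what am I doing wrong?

Edit: the articles and documentation I referred to while writing the script:

【Comments】:

Could you provide links to the relevant documentation?
Are you sure you are using the same version as the documentation? I can't find text=True anywhere in the Beautiful Soup 4 docs. They only show examples with a string or a compiled regular expression...
I have provided a solution. It would be easiest for others to follow if you showed the edited data again, because the error comes from the structure of the data and the way your code tries to parse it.

Tags: python web-scraping beautifulsoup html-table html-parsing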

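【Solution 1】:

Basic problem

The parser is behaving correctly. You are simply using the same expression to parse different kinds of elements.

Modified code

Here is a snippet that focuses only on returning the scraped lists. Once you have the lists, you can easily format them as a text table: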

import BeautifulSoup

def get_queue(data):
    # Args:
    #   data: string, contains the html to be scraped
    # Returns:
    #   (headers, cells, footer): a list, a list of rows, and a string
    soup = BeautifulSoup.BeautifulSoup(data)
    stable = soup.find('table')

    # findAll returns every matching tag; find returns only the first one
    header = stable.findAll('th')
    headers = [th.text for th in header]

    cells = []
    rows = stable.findAll('tr')
    for tr in rows[1:-2]:
        # Process the body of the table (skip header, spacer and summary rows)
        row = []
        tds = tr.findAll('td')
        # The first two cells wrap their text in a link
        row.append(tds[0].find('a').text)
        row.append(tds[1].find('a').text)
        row.extend([td.text for td in tds[2:]])
        cells.append(row)

    footer = rows[-1].find('td').text
    return headers, cells, footer
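
The slices in this function depend on the layout of the sample table in the question: one header row, the song rows, a spacer row, and a summary row. As a minimal sketch of what each slice selects, the placeholder strings below stand in for the <tr> tags (they are illustrative names, not real parser output):

```python
# Placeholder strings stand in for the <tr> tags of the sample table.
rows = ["header", "song 1", "song 2", "song 3", "song 4", "song 5",
        "spacer", "summary"]

body = rows[1:-2]   # drops the header row and the last two rows
footer = rows[-1]   # the summary row at the very end

print(body)
print(footer)
```

If the page ever drops the spacer row or adds another trailing row, these indices would need adjusting.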

Output

With headers, cells and footer in hand, you can now feed them to a texttable formatting function:

import texttable
def show_table(headers, cells, footer):
    retval = ''
    table = texttable.Texttable()
    table.header(headers)
    for cell in cells:
        table.add_row(cell)
    retval = table.draw()
    return retval + '\n' + footer

# data: the HTML string shown in the question
headers, cells, footer = get_queue(data)
print show_table(headers, cells, footer)

+----------+----------+----------+----------+----------+----------+----------+
| Artist - |  Album   |  Album   |  Series  | Duration | Type of  | Time to  |
|  Title   |          |   Type   |          |          |   Play   |   play   |
+==========+==========+==========+==========+==========+==========+==========+
| Song 1   | Album 1  | ...      |          | 5:43     | S.A.M.   | ~0:00:00 |
+----------+----------+----------+----------+----------+----------+----------+
| Song2    | Album 2  | ...      |          | 6:16     | S.A.M.   | ~0:05:43 |
+----------+----------+----------+----------+----------+----------+----------+
| Song 3   | Album 3  | ...      |          | 4:13     | S.A.M.   | ~0:11:59 |
+----------+----------+----------+----------+----------+----------+----------+
| Song 4   | Album 4  | ...      |          | 5:34     | S.A.M.   | ~0:16:12 |
+----------+----------+----------+----------+----------+----------+----------+
| Song 5   | Album 5  | ...      |          | 4:23     | S.A.M.   | ~0:21:46 |
+----------+----------+----------+----------+----------+----------+----------+
There are x songs in the queue with a total length of x:y:z.

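【Discussion】:

Thanks, your solution was helpful.

【Solution 2】:

The reason you get the error TypeError: find() takes no keyword arguments is that you are actually calling find() on a string.

String find

find is a Python string method, and it does not accept keyword arguments. Example: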

>>> 'hello'.find('l')
2
>>> 'hello'.find('l', foo='bar')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: find() takes no keyword arguments

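Beautiful Soup find

Beautiful Soup's Tag also has a find method, which is the one you meant to use.

Bottom line

At some point in your code, you ended up calling the string find where you meant to call the Tag one.

Python uses duck typing, which can lead to confusion in cases like this.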
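
One place where this appears to happen in the question's code is the header loop: it appends each extracted text to header, the list being iterated, rather than to header_text. Once the real tags are exhausted, the loop reaches the appended strings and calls find(text=True) on them. The sketch below reproduces that mechanism without Beautiful Soup. FakeTag is a hypothetical stand-in, not the library's Tag class, and it returns a plain str where Beautiful Soup would return a string subclass.

```python
class FakeTag(object):
    """Hypothetical stand-in for a Beautiful Soup Tag (not the real class)."""

    def __init__(self, text):
        self._text = text

    def find(self, text=None):
        # Loosely mimics Tag.find(text=True): hand back the tag's text as a string.
        return self._text


def buggy_headers(tags):
    header = list(tags)
    header_text = []
    for th in header:
        header_append = th.find(text=True)
        header.append(header_append)  # bug: grows the list being iterated
    return header_text


def fixed_headers(tags):
    header_text = []
    for th in tags:
        header_text.append(th.find(text=True))  # collect into the separate list
    return header_text


tags = [FakeTag("Album"), FakeTag("Series")]

try:
    buggy_headers(tags)
    error = None
except TypeError as exc:
    # Raised on the third pass, when th is the string "Album" and str.find
    # is called with a keyword argument.
    error = exc

print(type(error).__name__)
print(fixed_headers(tags))
```

Appending to header_text instead keeps the iterated list unchanged, so every th in the loop remains a tag.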

【Discussion】: