Error when using Beautiful Soup in Python: answers

【Question title】: Error when using Beautiful Soup in Python
【Posted】: 2015-04-20 02:45:50
【Question description】:

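My code works well. However, for certain data it raises an error. The problematic data is: T turns 10 this month. To honor the anniversary and the forthcoming T@10 issue, this series looks back at the most memorable stories from the magazine's first decade.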

The reported error is:

Traceback (most recent call last):
  File "/Users/mas/Documents/workspace/DeepLearning/BagOfWords.py", line 41, in <module>
    clean_train_reviews.append(" ".join(KaggleWord2VecUtility.review_to_wordlist(train["Snippet"][i], True)))
  File "/Users/mas/Documents/workspace/DeepLearning/KaggleWord2VecUtility.py", line 22, in review_to_wordlist
    review_text = BeautifulSoup(review).get_text()
  File "/Library/Python/2.7/site-packages/bs4/__init__.py", line 162, in __init__
    elif len(markup)

Code:

    def deprecated_argument(old_name, new_name):
        if old_name in kwargs:
            warnings.warn(
                'The "%s" argument to the BeautifulSoup constructor '
                'has been renamed to "%s."' % (old_name, new_name))
            value = kwargs[old_name]
            del kwargs[old_name]
            return value
        return None

    parse_only = parse_only or deprecated_argument(
        "parseOnlyThese", "parse_only")

    from_encoding = from_encoding or deprecated_argument(
        "fromEncoding", "from_encoding")

    if len(kwargs) > 0:
        arg = kwargs.keys().pop()
        raise TypeError(
            "__init__() got an unexpected keyword argument '%s'" % arg)

    if builder is None:
        if isinstance(features, basestring):
            features = [features]
        if features is None or len(features) == 0:
            features = self.DEFAULT_BUILDER_FEATURES
        builder_class = builder_registry.lookup(*features)
        if builder_class is None:
            raise FeatureNotFound(
                "Couldn't find a tree builder with the features you "
                "requested: %s. Do you need to install a parser library?"
                % ",".join(features))
        builder = builder_class()
    self.builder = builder
    self.is_xml = builder.is_xml
    self.builder.soup = self

    self.parse_only = parse_only

    if hasattr(markup, 'read'):        # It's a file-type object.
        markup = markup.read()
    elif len(markup) <= 256:
        # Print out warnings for a couple beginner problems
        # involving passing non-markup to Beautiful Soup.
        # Beautiful Soup will still parse the input as markup,
        # just in case that's what the user really wants.
        if (isinstance(markup, unicode)
            and not os.path.supports_unicode_filenames):
            possible_filename = markup.encode("utf8")
        else:
            possible_filename = markup
        is_file = False
        try:
            is_file = os.path.exists(possible_filename)
        except Exception, e:
            # This is almost certainly a problem involving
            # characters not valid in filenames on this
            # system. Just let it go.
            pass
        if is_file:
            warnings.warn(
                '"%s" looks like a filename, not markup. You should probably open this file and pass the filehandle into Beautiful Soup.' % markup)
        if markup[:5] == "http:" or markup[:6] == "https:":
            # TODO: This is ugly but I couldn't get it to work in
            # Python 3 otherwise.
            if ((isinstance(markup, bytes) and not b' ' in markup)
                or (isinstance(markup, unicode) and not u' ' in markup)):
                warnings.warn(
                    '"%s" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client to get the document behind the URL, and feed that document to Beautiful Soup.' % markup)

    for (self.markup, self.original_encoding, self.declared_html_encoding,
     self.contains_replacement_characters) in (
        self.builder.prepare_markup(markup, from_encoding)):
        self.reset()
        try:
            self._feed()
            break
        except ParserRejectedMarkup:
            pass

    # Clear out the markup and remove the builder's circular
    # reference to this object.
    self.markup = None
    self.builder.soup = None

This is my main code:

import os
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from KaggleWord2VecUtility import KaggleWord2VecUtility
import pandas as pd
import numpy as np

if __name__ == '__main__':
    train = pd.read_csv(os.path.join(os.path.dirname(__file__), 'data', 'NYTimesBlogTrain.csv'), header=0)
    test = pd.read_csv(os.path.join(os.path.dirname(__file__), 'data', 'NYTimesBlogTest.csv'), header=0)

    print 'A sample Abstract is:'
    print train["Abstract"][2838]

    print 'A sample Snippet is:'
    print train["Snippet"][2838]
    #raw_input("Press Enter to continue...")


    #print 'Download text data sets. If you already have NLTK datasets downloaded, just close the Python download window...'
    #nltk.download()  # Download text data sets, including stop words

    # Initialize an empty list to hold the clean reviews
    clean_train_reviews = []

    # Loop over each review; create an index i that goes from 0 to the length
    # of the movie review list
    print len(train["Snippet"])
    print "Cleaning and parsing the training set abstracts...\n"
    for i in xrange( 0, 3000):
        clean_train_reviews.append(" ".join(KaggleWord2VecUtility.review_to_wordlist(train["Snippet"][i], True)))
        if not train["Snippet"][i]:
            print i  
#

【Question discussion】:

Tags: python python-2.7 beautifulsoup

【Solution 1】:

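For me, this problem was caused by some samples in the review column that contain no data at all. You can fix this by setting the samples with no review text to a blank string: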

train = train.fillna(" ")
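
A minimal sketch of why this helps, assuming pandas on Python 3 (the question itself runs Python 2.7). read_csv represents an empty cell as NaN, which is a float rather than a string, and fillna(" ") replaces it with a blank string. The in-memory CSV below is made up for illustration; only the Snippet and Abstract column names come from the question.

```python
import io

import pandas as pd

# Hypothetical two-row CSV; the second row has an empty Snippet cell.
csv_text = "Snippet,Abstract\nsome text,first\n,second\n"

train = pd.read_csv(io.StringIO(csv_text), header=0)

# The empty cell is read as NaN, which is a float, not a string.
missing_before = train["Snippet"][1]
print(type(missing_before))

# Replace every missing value with a blank string, as the answer suggests.
train = train.fillna(" ")
missing_after = train["Snippet"][1]
print(repr(missing_after))
```

Note that fillna returns a new DataFrame by default, so the result has to be assigned back as shown.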

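【Discussion】:

【Solution 2】:

Without seeing the full context (for example, the value of review passed to the constructor), it is hard to be sure, but perhaps your KaggleWord2VecUtility method splits on the @ symbol and/or on numbers, so that a token gets passed in as a float rather than a string/unicode object? The exception indicates that markup is unexpectedly a float, whereas __init__ expects a string or unicode object: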

def __init__(self, markup="", features=None, builder=None,
             parse_only=None, from_encoding=None, **kwargs):
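
To illustrate the point: the constructor code quoted in the question calls len(markup), and len() is not defined for floats. The sketch below reproduces that failure without Beautiful Soup, in Python 3 syntax. check_markup is a hypothetical stand-in for the length check, not the library's own code.

```python
def check_markup(markup):
    """Hypothetical stand-in for the `len(markup) <= 256` check in the question."""
    return len(markup) <= 256


# A missing value such as NaN is a float, so len() raises TypeError.
try:
    check_markup(float("nan"))
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)

# A real string passes the same check without error.
short_string_ok = check_markup("T turns 10 this month.")
print(short_string_ok)
```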

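【Discussion】:

This is the value of my review: T turns 10 this month. To honor the anniversary and the forthcoming T@10 issue, this series looks back at some of the most memorable stories from the magazine's first decade. @randalv
Which version of Beautiful Soup? get_text did not resolve for me on 3.2.1, and I cannot reproduce the error with bs4 (4.3.2) if I pass review in as a string with that exact value.
I am using bs4 @randalv
Can you be more specific? Can you reproduce it if you create a new file containing only the following: from bs4 import BeautifulSoup review = "T turns 10 this month. To honor the anniversary and the forthcoming T@10 issue, this series looks back at some of the most memorable stories from the magazine's first decade." review_text = BeautifulSoup(review).get_text() print review_text (I thought I could format code in comments, but I guess not. I could edit the original post, but hopefully you can read this.)
Assuming the same Python environment (and assuming you import the module the same way), the value must be getting modified before it is passed into the constructor.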
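
One way to check whether every value is still a string before it reaches the constructor is to scan the data first. This is a sketch in Python 3 syntax. find_non_strings and the sample list are hypothetical, with a NaN standing in for a value that is not text.

```python
def find_non_strings(values):
    """Return (index, value) pairs for entries that are not strings."""
    return [(i, v) for i, v in enumerate(values) if not isinstance(v, str)]


# Hypothetical sample: the second entry is NaN, i.e. a float.
snippets = ["T turns 10 this month.", float("nan"), "another snippet"]

bad_rows = find_non_strings(snippets)
for index, value in bad_rows:
    print(index, type(value).__name__)
```

With the data from the question, the same function could be called on train["Snippet"] to list the rows that would make the constructor fail.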