【Question title】: Catching an exception and imputing some value with pandas apply() function?
【Posted】: 2017-04-12 13:28:17
【Question】:

I wrapped the process of extracting text from a URL into a function:

def text(link):
    article = Article(link)
    article.download()
    article =  article.parse()
    return article

I plan to apply this function to a pandas column:

df['text'] = df['links'].apply(text)

However, some of the links in the links column are broken (i.e. HTTPError: HTTP Error 404: Not Found). So my question is: how can I put NaN in for the broken URLs and move past them? I tried:

from newspaper import Article
import numpy as np
import requests

def text(link):
    article = Article(link)
    try:
        article.download()
        article = article.parse()
    except requests.exceptions.HTTPError:
        return np.nan
    return article

df['text'] = df['links'].apply(text)

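However, I don't know whether the exception can be handled inside apply() so that the cells whose links are broken end up with a NaN value.

Update

I tried handling it with ArticleException as follows:

df: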

title   Link
Inside tiny tubes, water turns solid when it should be boiling  http://news.mit.edu/2016/carbon-nanotubes-water-solid-boiling-1128
Four MIT students named 2017 Marshall Scholars  http://news.mit.edu/2016/four-mit-students-marshall-scholars-11282
Saharan dust in the wind    http://news.mit.edu/2016/saharan-dust-monsoons-11231
The science of friction on graphene http://news.mit.edu/2016/sliding-flexible-graphene-surfaces-1123

In:

import numpy as np
from newspaper import Article, ArticleException
import requests

def text_extractor2(link):
    article = Article(link)
    try:
        article.download()
    except ArticleException:
        article = article.parse()
        return np.nan
    return article

df['text'] = df['Link'].apply(text_extractor2)
df

Out:

    title   Link    text
0   Inside tiny tubes, water turns solid when it s...   http://news.mit.edu/2016/carbon-nanotubes-wate...   <newspaper.article.Article object at 0x10c8a0320>
1   Four MIT students named 2017 Marshall Scholars  http://news.mit.edu/2016/four-mit-students-mar...   <newspaper.article.Article object at 0x1070df0f0>
2   Saharan dust in the wind    http://news.mit.edu/2016/saharan-dust-monsoons...   <newspaper.article.Article object at 0x107b035c0>
3   The science of friction on graphene     http://news.mit.edu/2016/sliding-flexible-grap...   <newspaper.article.Article object at 0x10c8bf8d0>

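【Comments】:

  • By broken, do you mean the link points to an invalid URL? Have you tried returning numpy.nan when the link is invalid?
  • @PyNoob Sorry, what I meant was: HTTPError: HTTP Error 404: Not Found. Thanks for the help!

Tags: python python-3.x pandas exception-handling python-requests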


【Solution 1】:

As I understand it, you want the rows that correspond to broken links to have NaN in the text column. If you haven't already, first add the numpy import:

import numpy as np

I assume the exception being thrown is HTTPError, and I'll use NumPy for the NaN value:

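from newspaper import Article
# Assumption: the "HTTP Error 404: Not Found" message is the format used by
# urllib's HTTPError; swap the import if the traceback names another class.
from urllib.error import HTTPError

def text(link):
    article = Article(link)

    try:
        article.download()
    except HTTPError:
        # Broken link: leave the cell empty
        return np.nan

    # parse() fills in the article's fields; the extracted body is in .text
    article.parse()
    return article.text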

Then use pandas apply:

df['text'] = df['links'].apply(text)

The text column should then contain missing values for the broken links and the article text for the valid ones.
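
To try the pattern without network access, here is a minimal sketch. `fetch_text` and `FAKE_PAGES` are hypothetical stand-ins for the newspaper download/parse step, and the HTTPError is raised by hand; only the try/except-inside-apply structure mirrors the answer.

```python
import numpy as np
import pandas as pd
from urllib.error import HTTPError

# Hypothetical stand-in for real pages: maps URL -> article text.
FAKE_PAGES = {"http://example.com/ok": "Some article text"}

def fetch_text(link):
    """Hypothetical fetcher: raises a 404 HTTPError for unknown links."""
    if link not in FAKE_PAGES:
        raise HTTPError(link, 404, "Not Found", None, None)
    return FAKE_PAGES[link]

def text(link):
    # The exception is handled per element, so apply() never sees it
    try:
        return fetch_text(link)
    except HTTPError:
        return np.nan

df = pd.DataFrame({"links": ["http://example.com/ok",
                             "http://example.com/broken"]})
df["text"] = df["links"].apply(text)
print(df)
```

Because the function itself returns NaN, apply() needs no special handling: it simply collects whatever each call returns.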


Without newspaper, you can change your function to catch the exception on ur.urlopen(url).read(), e.g.

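import urllib.request as ur  # assumption: `ur` is this alias
from bs4 import BeautifulSoup
from nltk.tokenize import sent_tokenize  # assumption: NLTK's tokenizer

def text_extractor(url):
    try:
        html = ur.urlopen(url).read()
    except ur.HTTPError:
        return np.nan

    soup = BeautifulSoup(html, 'lxml')
    # Drop script and style elements so only visible text remains
    for script in soup(["script", "style"]):
        script.extract()

    # Collapse the remaining text into one whitespace-normalized string
    text = soup.get_text()
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    text = ' '.join(chunk for chunk in chunks if chunk)
    sentences = ', '.join(sent_tokenize(text.strip('\'"')))
    return sentences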
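
The whitespace cleanup inside the function above can be tried on its own. This is a small sketch on a hand-written string; `normalize_whitespace` is a hypothetical helper name, not part of any library.

```python
def normalize_whitespace(text):
    """Strip each line, split on double spaces, and join the non-empty pieces."""
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    return " ".join(chunk for chunk in chunks if chunk)

raw = "  Title  \n\n  First sentence.    Second sentence.  \n"
cleaned = normalize_whitespace(raw)
print(cleaned)  # Title First sentence. Second sentence.
```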

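【Comments】:

  • Right now I can't seem to reproduce the situation with the newspaper package (not even the broken-link case). Do you have an example of a broken link and the traceback?
  • Thanks for the help! I tried it, and again I got: HTTPError: HTTP Error 404: Not Found
  • The second one does work. However, I'm also interested in doing it with newspaper... Which solution do you think is better?
  • I think it would be nice to get it working with newspaper, but that's up to you. Maybe you need to specify where the exception is defined, e.g. instead of HTTPError, use requests.exceptions.HTTPError (assuming requests has already been imported).
  • Do you have the full traceback for that HTTPError, e.g. the line where it occurs?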
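
One way to find out which HTTPError class to catch is to print the exception's fully qualified name. The sketch below raises urllib's HTTPError by hand purely for illustration (it is not produced by newspaper), and `qualified_name` is a hypothetical helper. urllib's class renders as "HTTP Error 404: Not Found", the same format as the message quoted in the question, which suggests but does not prove that it is the class involved.

```python
import urllib.error

def qualified_name(exc):
    """Return 'module.ClassName' for an exception instance."""
    cls = type(exc)
    return f"{cls.__module__}.{cls.__qualname__}"

try:
    # Hand-built error for illustration only
    raise urllib.error.HTTPError(
        "http://example.com/missing", 404, "Not Found", None, None
    )
except Exception as exc:  # broad on purpose: we only want to inspect it
    name = qualified_name(exc)
    message = str(exc)

print(name)     # urllib.error.HTTPError
print(message)  # HTTP Error 404: Not Found
```

Running the same inspection inside a temporary `except Exception` around the real download call would show which module the exception comes from, and the `except` clause can then be narrowed to that class.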