[Posted]: 2017-04-12 13:28:17
[Question]:
I wrapped the process of extracting text from a URL in a function:
def text(link):
    article = Article(link)
    article.download()
    article = article.parse()
    return article
I want to apply this function to a pandas column:
df['text'] = df['links'].apply(text)
However, some of the links in the links column are broken (i.e. HTTPError: HTTP Error 404: Not Found). So my question is: how can I put NaN in for the broken URLs and move past them? I tried:
from newspaper import Article
import numpy as np
import requests

def text(link):
    article = Article(link)
    try:
        article.download()
        article = article.parse()
    except requests.exceptions.HTTPError:
        return np.nan
    return article

df['text'] = df['links'].apply(text)
However, I don't know whether apply() can be handled in a way that puts NaN into the cells whose links are broken.
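To illustrate the mechanism being asked about: apply() calls the function once per cell and stores whatever it returns, so returning NaN from an except branch is enough, provided the except clause names the exception that is actually raised. A minimal sketch under stated assumptions: `nan_on_error` and `fake_text` are hypothetical names, and `fake_text` stands in for the real extractor so the example runs without network access. The message format "HTTP Error 404: Not Found" may come from urllib rather than requests, which is an assumption worth checking before choosing the exception class.

```python
import math


def nan_on_error(func, exceptions=(Exception,)):
    """Wrap func so that the listed exceptions yield NaN instead of propagating."""
    def wrapper(value):
        try:
            return func(value)
        except exceptions:
            return math.nan
    return wrapper


def fake_text(link):
    # Hypothetical stand-in for the real extractor: fails for one known URL.
    if link.endswith("/missing"):
        raise IOError("HTTP Error 404: Not Found")
    return "text of " + link


links = ["http://example.com/a", "http://example.com/missing"]
safe_text = nan_on_error(fake_text, (IOError,))
results = [safe_text(link) for link in links]
```

With pandas, the same wrapper could be passed directly, e.g. df['text'] = df['links'].apply(nan_on_error(text)); a float NaN returned this way is treated as a missing value in the column.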
Update
I tried handling ArticleException as follows:
df:
title                                                            Link
Inside tiny tubes, water turns solid when it should be boiling   http://news.mit.edu/2016/carbon-nanotubes-water-solid-boiling-1128
Four MIT students named 2017 Marshall Scholars                   http://news.mit.edu/2016/four-mit-students-marshall-scholars-11282
Saharan dust in the wind                                         http://news.mit.edu/2016/saharan-dust-monsoons-11231
The science of friction on graphene                              http://news.mit.edu/2016/sliding-flexible-graphene-surfaces-1123
In:
import numpy as np
from newspaper import Article, ArticleException
import requests

def text_extractor2(link):
    article = Article(link)
    try:
        article.download()
    except ArticleException:
        article = article.parse()
        return np.nan
    return article

df['text'] = df['Link'].apply(text_extractor2)
df
Out:
   title                                                Link                                                text
0  Inside tiny tubes, water turns solid when it s...   http://news.mit.edu/2016/carbon-nanotubes-wate...   <newspaper.article.Article object at 0x10c8a0320>
1  Four MIT students named 2017 Marshall Scholars      http://news.mit.edu/2016/four-mit-students-mar...   <newspaper.article.Article object at 0x1070df0f0>
2  Saharan dust in the wind                            http://news.mit.edu/2016/saharan-dust-monsoons...   <newspaper.article.Article object at 0x107b035c0>
3  The science of friction on graphene                 http://news.mit.edu/2016/sliding-flexible-grap...   <newspaper.article.Article object at 0x10c8bf8d0>
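One possible reading of this output: the cells hold Article objects because the function returns the Article itself rather than its extracted text, and parse() only runs inside the except branch. A minimal sketch of an alternative, assuming (the post does not confirm this) that parse() fills the object in place and the body is then available as article.text. The `extract_text` name, the `article_cls` parameter, and the `FakeArticle` stub are hypothetical, added only so the sketch runs without network access; catching Exception is deliberately broad and could be narrowed to ArticleException.

```python
import math


def extract_text(link, article_cls=None, errors=(Exception,)):
    """Return the article body for link, or NaN if download or parse fails."""
    if article_cls is None:
        # Imported lazily so the sketch can be exercised with a stand-in class.
        from newspaper import Article as article_cls
    article = article_cls(link)
    try:
        article.download()
        article.parse()  # assumed to populate the object rather than return it
    except errors:
        return math.nan
    return article.text  # the extracted text, not the Article object


class FakeArticle:
    # Hypothetical stub mimicking the three members used above.
    def __init__(self, url):
        self.url = url
        self.text = ""

    def download(self):
        if "broken" in self.url:
            raise RuntimeError("HTTP Error 404: Not Found")

    def parse(self):
        self.text = "body of " + self.url


good = extract_text("http://example.com/ok", article_cls=FakeArticle)
bad = extract_text("http://example.com/broken", article_cls=FakeArticle)
```

Applied to the frame in the post, this would look like df['text'] = df['Link'].apply(extract_text), leaving NaN in rows whose link could not be fetched or parsed.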
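[Comments]:
- By "broken", do you mean that the link points to an invalid URL? If the link is invalid, have you tried returning numpy.nan?
- @PyNoob Sorry, what I meant was: HTTPError: HTTP Error 404: Not Found. Thanks for your help!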
Tags: python python-3.x pandas exception-handling python-requests