How to make a pandas loop faster: scraping text from a URL

【Question title】: How to make pandas loop faster: scraping the text from url
【Posted】: 2020-07-28 08:53:22
【Question】:

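I am trying to scrape the text of the articles on my website. I have a "for" loop that does this, but it runs slowly. Is there a faster way to do it? I have read about Pandas Built-In-Loop, vectorization, and NumPy vectorization, but I could not work out how to apply them to my code.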

import requests
from bs4 import BeautifulSoup


def scarp_text(df):
    session = requests.Session()

    for j in range(0, len(df)):
        try:
            url = df['url'][j]  # takes the URL of an article from the 'url' column
            req = session.get(url)
            soup = BeautifulSoup(req.text, 'lxml')
        except Exception as e:
            print(e)
            continue  # no page to parse for this row, move on to the next one

        paragraph_tags = soup.find_all('p')
        if paragraph_tags == []:
            paragraph_tags = soup.find_all('p', itemprop='articleBody')

        # Putting together all text from HTML p tags
        article = ''
        for p in paragraph_tags:
            article = article + ' ' + p.get_text()
            article = " ".join(article.split())

        # put the collected text into the corresponding cell;
        # .loc avoids the chained assignment df['article_text'][j] = ...
        df.loc[j, 'article_text'] = article

    return df

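【Comments】:

Have you timed your inner loop to see where the bottleneck is?
For the whole loop: 54.2 ns ± 0.684 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
Which lines do you think I need to check?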
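
One way to answer the "which lines" question is to time each stage of the loop separately. The sketch below is only an illustration: `timed`, `stage_totals`, `fake_download` and `fake_extract` are hypothetical names, and the two `fake_*` functions stand in for `session.get(url)` and the BeautifulSoup parsing so that the example runs without a network connection. A figure of 54.2 ns per loop looks far too small for code that performs an HTTP request, so it is worth checking what was actually being timed.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical helper: accumulates wall-clock seconds per named stage.
stage_totals = defaultdict(float)


@contextmanager
def timed(stage):
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_totals[stage] += time.perf_counter() - start


def fake_download(url):
    # Stand-in for session.get(url).text; sleeps to mimic network latency.
    time.sleep(0.01)
    return "<p>Hello</p> <p>world</p>"


def fake_extract(html):
    # Stand-in for the BeautifulSoup parsing and paragraph-joining step.
    return " ".join(html.replace("<p>", " ").replace("</p>", " ").split())


for url in ["https://example.com/a", "https://example.com/b"]:
    with timed("download"):
        html = fake_download(url)
    with timed("extract"):
        text = fake_extract(html)

# Print the accumulated time of each stage.
for stage, seconds in stage_totals.items():
    print(f"{stage}: {seconds:.4f} s")
```

In the real function, the same `with timed(...)` blocks would wrap `session.get(url)`, the `BeautifulSoup(...)` call, and the paragraph loop, which shows whether the time goes into the network, the parsing, or the string handling.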

Tags: python pandas performance for-loop web-scraping

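【Solution 1】:

You have 2 for loops, and the innermost loop is usually the best place to start. The plus operator is inefficient for string concatenation, because every `+` builds a new string. `str.join` is a better choice, and it also accepts a generator as input.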

article = " ".join(p.get_text() for p in paragraph_tags)

article = " ".join(article.split())
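
A quick way to check that the two versions produce the same text, and to compare their speed, is sketched below. It is a minimal sketch under stated assumptions: `paragraph_texts` is a made-up list of strings standing in for `[p.get_text() for p in paragraph_tags]`, and `concat_in_loop` and `join_once` are hypothetical names for the original inner loop and the `str.join` version. Timings depend on the machine and on how many paragraphs a page has, so no expected numbers are given.

```python
import timeit

# Hypothetical stand-in for [p.get_text() for p in paragraph_tags].
paragraph_texts = ["  First   paragraph. ", "Second\nparagraph.", "Third."] * 200


def concat_in_loop(texts):
    # The original approach: concatenate and re-normalize on every iteration.
    article = ''
    for text in texts:
        article = article + ' ' + text
        article = " ".join(article.split())
    return article


def join_once(texts):
    # The suggested approach: join everything once, then normalize whitespace once.
    article = " ".join(texts)
    return " ".join(article.split())


# Both versions should yield exactly the same cleaned-up text.
assert concat_in_loop(paragraph_texts) == join_once(paragraph_texts)

print("loop:", timeit.timeit(lambda: concat_in_loop(paragraph_texts), number=20))
print("join:", timeit.timeit(lambda: join_once(paragraph_texts), number=20))
```

Note that this only speeds up the string handling. If the downloads dominate the runtime, the gain from this change alone will be small.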

【Discussion】: