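[Posted]: 2019-06-21 14:01:31
[Question]:
I am iterating over M dataframes, each of which has a column containing N URLs. For each URL, I extract the paragraph text, then apply the standard text-analysis cleaning before computing a "sentiment" score.
Which of these would be more efficient:
Continue as is (compute the scores inside the URL for loop itself)
Extract all the text from the URLs first, then iterate over the list/column of texts separately?
Or does it make no difference?
Currently the computation runs inside the loop itself. Each DF has roughly 15,000 to 20,000 URLs, so the whole run takes a very long time!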
import pandas as pd
import requests
from bs4 import BeautifulSoup

# DFs are stored on a website
# I extract links to each .csv file and store it as a list in "df_links"
for link in df_links:
    cleaned_articles = []
    df = pd.read_csv(link, sep="\t", header=None)
    # Conduct df cleaning
    # URLs for articles to scrape are stored in 1 column, which I iterate over as...
    for url in df['article_url']:
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        para_text = [text.get_text() for text in soup.findAll('p')]
        text = " ".join(para_text)
        words = text.split()
        if len(words) > 500:
            # Conduct Text Cleaning & Scores Computations
            # Cleaned text stored as a variable "clean_text"
            cleaned_articles.append(clean_text)
        else:
            # Keep the list the same length as df so the column assignment below works
            cleaned_articles.append(None)
    df['article_text'] = cleaned_articles
    df.to_csv('file_name.csv')
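For comparison, the second option (fetch everything first, then score) could be sketched roughly as below. This is only an illustration under assumptions, not a measured answer: `fetch_all`, `score_all`, `fetch`, `score`, and `max_workers` are hypothetical names that do not appear in the code above, and the `fetch` callable stands in for the `requests.get` plus `BeautifulSoup` step. The one practical thing splitting the phases allows is running the downloads concurrently, which may help if the network requests are where most of the time goes; that would need to be timed to confirm.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=16):
    """Phase 1 (I/O-bound): download every page concurrently, keeping input order."""

    def safe_fetch(url):
        try:
            return fetch(url)
        except Exception:
            # A failed download becomes None so results stay aligned with urls.
            return None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input urls.
        return list(pool.map(safe_fetch, urls))


def score_all(texts, score, min_words=500):
    """Phase 2: score only texts longer than min_words; everything else gets None."""
    results = []
    for text in texts:
        if text is not None and len(text.split()) > min_words:
            results.append(score(text))
        else:
            results.append(None)
    return results


if __name__ == "__main__":
    # Stand-ins (assumptions): a dict lookup replaces the real HTTP fetch,
    # and counting a word replaces the real sentiment scoring.
    fake_pages = {"a": "good " * 600, "b": "short page"}
    texts = fetch_all(["a", "b", "missing"], fake_pages.__getitem__)
    print(score_all(texts, lambda t: t.count("good")))
```

In the real script, `fetch` would wrap the `requests.get(url)` and paragraph-extraction lines, and `score` would wrap the cleaning and scoring step. Both functions return one entry per URL, so the result can be assigned directly to a dataframe column.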
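[Comments]:
- Perhaps you should upvote and accept the answers to your previous questions. Otherwise, nobody is likely to invest much time in an answer.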
Tags: python performance for-loop