【Posted】: 2017-12-06 00:14:23
【Question】:
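I'm currently trying to export a Twitter .html page, and I built this web scraper with BeautifulSoup. The OUTPUT.csv file is really messy right now, so here are my questions (the current .py file is below):
What steps can I take to clean up the code? My output CSV has the tweets, but they are really messy and separated by commas. Is there a way to separate them with new lines instead? Also, in my cleanup() function, how can I extract only the part of the tweet that says "Bank Of America: Growth Is Back – Bank of America Corporation" (which I have surrounded with asterisks)?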
"<div class=""js-tweet-text-container"">
<p class=""TweetTextSize js-tweet-text tweet-text"" data-aria-label-
part=""0"" lang=""en"">*****Bank Of America: Growth Is Back – Bank of
America Corporation***** (<strong>NYSE:BAC</strong>) <a class=""twitter-
timeline-link u-hidden"" data-expanded-url=""https://good-
stockinvest.com/2017/11/29/bank-of-america-growth-is-back-bank-of-
america-corporation-nysebac/"" dir=""ltr""
href="" rel=""nofollow noopener""
target=""_blank"" title=""https://good-stockinvest.com/2017/11/29/bank-
of-america-growth-is-back-bank-of-america-corporation-nysebac/""><span
class=""tco-ellipsis""></span><span class=""invisible"">https://</span>
<span class=""js-display-url"">good-
stockinvest.com/2017/11/29/ban</span><span class=""invisible"">k-of-
america-growth-is-back-bank-of-america-corporation-nysebac/</span><span
class=""tco-ellipsis""><span class=""invisible""> </span>…</span></a>
</p>
</div>"
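One possible way to pull out only the headline part of a tweet like the one above is to read the text that comes before the first child tag of the tweet's `<p>`. The following is a minimal sketch, not a definitive solution. It assumes Python 3, a shortened copy of the sample markup with ordinary (not CSV-doubled) quotes and without the asterisks, and a hypothetical helper name `headline`. It also assumes the headline always precedes the first tag, so a tweet that starts with a link or mention would yield an empty string.

```python
from bs4 import BeautifulSoup, NavigableString

# Shortened stand-in for the tweet markup shown above (assumption: same structure).
SAMPLE = """
<div class="js-tweet-text-container">
<p class="TweetTextSize js-tweet-text tweet-text" lang="en">Bank Of America: Growth Is Back – Bank of America Corporation (<strong>NYSE:BAC</strong>) <a class="twitter-timeline-link u-hidden" href=""><span class="js-display-url">good-stockinvest.com/2017/11/29/ban</span></a>
</p>
</div>
"""


def headline(container):
    """Return the text that precedes the first child tag of the tweet's <p>."""
    p = container.find("p", class_="tweet-text")
    if p is None:
        return ""
    parts = []
    for child in p.children:
        if not isinstance(child, NavigableString):
            break  # stop at the first tag (<strong>, <a>, ...)
        parts.append(str(child))
    # Drop surrounding whitespace and the dangling "(" left before <strong>.
    return "".join(parts).strip().rstrip("(").strip()


soup = BeautifulSoup(SAMPLE, "html.parser")
for div in soup.find_all("div", class_="js-tweet-text-container"):
    print(headline(div))
```

If the full tweet text is wanted instead, calling `get_text()` on the `<p>` element would return it including the ticker and the link text.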
Here is my code:
from bs4 import BeautifulSoup
import csv

new = csv.writer(open("OUTPUT", "w"))
new.writerow(["Tweets:"])
new.writerow([ ]) # allowing for a simple space

data = open("bac.html", "r").read()
soup = BeautifulSoup(data, "html.parser")
tweets = soup.find_all('div', class_="js-tweet-text-container")

def writetweets():
    for tweet in tweets:
        new.writerow(tweets)
        new.writerow([ ])
    print "writetweets - open OUTPUT.csv for the tweet divs"

def cleanup():
    print "cleanup - nothing here for now"

def tests():
    print "tests - nothing here for now"

def demo():
    writetweets()
    cleanup()
    tests()

if __name__ == '__main__':
    demo()
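A likely cause of the comma-separated output is that `new.writerow(tweets)` passes the whole list to `writerow`, which writes every element as a separate field of one row. Writing one single-element row per tweet puts each tweet on its own line. The sketch below illustrates this with the standard library only. It assumes Python 3, a hypothetical helper name `write_one_per_row`, and made-up placeholder strings standing in for the extracted tweet text.

```python
import csv
import io


def write_one_per_row(texts, fileobj):
    """Write a header, then each text as its own single-column row."""
    writer = csv.writer(fileobj)
    writer.writerow(["Tweets:"])
    for text in texts:
        # Collapse runs of whitespace (including newlines) so a tweet
        # occupies exactly one physical line in the output.
        writer.writerow([" ".join(text.split())])


# Placeholder strings (assumption); real input would be the extracted tweet text.
texts = ["first tweet\nwith a line break", "second tweet, with a comma"]
buf = io.StringIO()
write_one_per_row(texts, buf)
print(buf.getvalue())
```

With a real file, the same function could be given a handle opened as `open("OUTPUT.csv", "w", newline="", encoding="utf-8")`, and `texts` could be built from `tweet.get_text()` for each element of `tweets`. Fields that contain a comma are still quoted by the `csv` module, which is expected CSV behavior.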
【Comments】:
Tags: python csv twitter beautifulsoup code-cleanup