Python Web Scraper + Cleanup

【Question title】: Python Web Scraper + Cleanup
【Posted】: 2017-12-06 00:14:23
【Question】:

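So I'm currently trying to export a Twitter .html page, and I built this web scraper using BeautifulSoup. The OUTPUT.csv file is really messy right now, so here are my questions (the current .py file is below):

What steps can I take to clean up the code? My output csv has the tweets, but they are really messy and separated by commas. Is there any way to separate them with new lines instead? Also, how can I extract only the part of the tweet that says "Bank Of America: Growth Is Back – Bank of America Corporation" (which I surrounded with stars) in my cleanup() function?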

"<div class=""js-tweet-text-container"">
<p class=""TweetTextSize js-tweet-text tweet-text"" data-aria-label-
part=""0"" lang=""en"">*****Bank Of America: Growth Is Back – Bank of 
America Corporation***** (<strong>NYSE:BAC</strong>) <a class=""twitter-
timeline-link u-hidden"" data-expanded-url=""https://good-
stockinvest.com/2017/11/29/bank-of-america-growth-is-back-bank-of-
america-corporation-nysebac/"" dir=""ltr"" 
href="" rel=""nofollow noopener"" 
target=""_blank"" title=""https://good-stockinvest.com/2017/11/29/bank-
of-america-growth-is-back-bank-of-america-corporation-nysebac/""><span 
class=""tco-ellipsis""></span><span class=""invisible"">https://</span>
<span class=""js-display-url"">good-
stockinvest.com/2017/11/29/ban</span><span class=""invisible"">k-of-
america-growth-is-back-bank-of-america-corporation-nysebac/</span><span 
class=""tco-ellipsis""><span class=""invisible""> </span>…</span></a>
</p>
</div>"

Here is my code:

from bs4 import BeautifulSoup
import csv


new = csv.writer(open("OUTPUT", "w"))
new.writerow(["Tweets:"])
new.writerow([ ])       # allowing for a simple space

data = open("bac.html", "r").read()
soup = BeautifulSoup(data, "html.parser")

tweets = soup.find_all('div', class_="js-tweet-text-container")

def writetweets():
    for tweet in tweets:
        new.writerow(tweets)
        new.writerow([ ])   
    print "writetweets - open OUTPUT.csv for the tweet divs"

def cleanup():
    print "cleanup - nothing here for now"

def tests():
    print "tests - nothing here for now"

def demo():
    writetweets()
    cleanup()
    tests()

if __name__ == '__main__':
    demo()

【Comments】:

Tags: python csv twitter beautifulsoup code-cleanup

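【Solution 1】:

A quick fix would be to use the split() function to grab only the text between the asterisks. Is every tweet you get wrapped in asterisks, or only this particular one?

Another solution is to search further into the tags so that you end up with a "cleaner" string, i.e. call find_all again on each element of tweets.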
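
As a rough illustration of the split() idea, here is a minimal sketch. The sample string and the helper name between_markers are hypothetical, and the approach only works when the text really contains the markers, which the asker added by hand.

```python
# Hypothetical sample string: the asterisks were added by hand in the question,
# so real tweets will usually not contain them.
raw = "*****Bank Of America: Growth Is Back – Bank of America Corporation***** (NYSE:BAC)"


def between_markers(text, marker="*****"):
    """Return the text between the first two markers, or None if absent."""
    parts = text.split(marker)
    # A string containing two markers splits into at least three parts.
    if len(parts) < 3:
        return None
    return parts[1].strip()


headline = between_markers(raw)
print(headline)
```

Returning None for unmarked text makes it easy to skip tweets that do not follow the pattern instead of raising an IndexError.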

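【Discussion】:

I only added the asterisks around the text I want! Unfortunately, none of the tweets are actually between asterisks. When I try to search more specific tags ('div', class_='TweetTextSize js-tweet-text tweet-text'), nothing gets written to the CSV file.
Oh OK, my bad. That writes nothing to your csv file because you need to search for the p tag, not div. Try soup.findAll('p', class_="TweetTextSize js-tweet-text tweet-text"). So, to sum up, try a for loop: for tag in soup.findAll('p', class_="TweetTextSize js-tweet-text tweet-text"), and inside that loop try printing tag.get_text(). See what you get and adjust your findAll() accordingly. You may also have to play around with the quotes in class_.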
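
The loop described in this comment could look roughly like the sketch below. It assumes Python 3 with bs4 installed; the HTML snippet is a shortened, hypothetical stand-in for one tweet from bac.html, and removing the a element is an extra cleanup step that the comment does not mention.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical stand-in for one tweet from bac.html.
html = """
<div class="js-tweet-text-container">
<p class="TweetTextSize js-tweet-text tweet-text" lang="en">Bank Of America: Growth Is Back
(<strong>NYSE:BAC</strong>) <a class="twitter-timeline-link" href="#">good-stockinvest.com/...</a></p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

texts = []
# class_ matches any single CSS class, so "tweet-text" is enough here.
for tag in soup.find_all("p", class_="tweet-text"):
    # Drop the link element so its URL text does not end up in the output.
    for link in tag.find_all("a"):
        link.decompose()
    # Collapse runs of whitespace and newlines into single spaces.
    texts.append(" ".join(tag.get_text().split()))

print(texts)
```

Matching on the single class tweet-text avoids depending on the exact order and spacing of the full class attribute.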

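【Solution 2】:

First, you have a few mistakes: you use for to iterate over tweets, but inside the loop you write tweets (the whole list) instead of tweet.

Also, if you want the output line by line rather than as comma-separated values, you can switch from csv to plain file writes:

with open(file_name, 'w') as file_output:
    for tweet in tweets:
        file_output.write(tweet.get_text() + '\n')

That way each tweet gets its own line. You could also use:

file_output = open(file_name, 'w')
for tweet in tweets:
    file_output.write(tweet.get_text() + '\n')
file_output.close()

It's up to you.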
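
If staying with the csv module is preferred, the loop bug described in this answer can be fixed by writing one single-element row per tweet. The sketch below assumes Python 3; the list tweet_texts is hypothetical sample data standing in for the text of each tweet, and an in-memory buffer replaces the real output file.

```python
import csv
import io

# Hypothetical tweet texts standing in for the text extracted from each tweet.
tweet_texts = ["First tweet, with a comma", "Second tweet"]

# In-memory stand-in for open("OUTPUT.csv", "w", newline="").
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Tweets:"])
for text in tweet_texts:
    # One single-element list per call produces one row per tweet.
    writer.writerow([text])

print(buffer.getvalue())
```

The writer quotes any field that contains a comma, so a tweet with commas still occupies a single cell on its own row.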

【Discussion】:

【Solution 3】:

Based on the previous answers, but with some help on the cleanup:

from bs4 import BeautifulSoup
import csv


data = open("bac.html", "r").read()
soup = BeautifulSoup(data, "html.parser")

#tweets = soup.find_all('div', class_="js-tweet-text-container")
tweets = soup.find_all("div", {"class": "js-tweet-text-container"})

def writetweets():
    with open("OUTPUT.txt", "w") as new:
        new.write("Tweets:\r\n")
        for tweet in tweets:
            new.write(tweet.getText() + "\r\n")
    print "writetweets - open OUTPUT.txt for the tweet divs"

def cleanup():
    print "cleanup - nothing here for now"

def tests():
    print "tests - nothing here for now"

def demo():
    writetweets()
    cleanup()
    tests()

if __name__ == '__main__':
    demo()

I get:

In [29]: tweet.getText()

Out[29]: '*****Bank Of America: Growth Is Back – Bank of America Corporation***** (NYSE:BAC) https://good-stockinvest.com/2017/11/29/bank-of-america-growth-is-back-bank-of-america-corporation-nysebac/ ...'

【Discussion】:

When I try to run it, I get: Traceback (most recent call last): File "scraper.py", line 30, in <module> demo() File "scraper.py", line 25, in demo writetweets() File "scraper.py", line 15, in writetweets new.write(tweet.getText() + "\r\n") UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 33: ordinal not in range(128)
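
This error is typical of Python 2, where a file opened with the built-in open() implicitly encodes unicode text as ASCII, and the en dash (u'\u2013') in the tweet has no ASCII form. One possible fix, shown here as a sketch rather than the answerer's own method, is to open the output file with an explicit encoding through io.open, which works on both Python 2 and Python 3. The sample text and the temporary path are hypothetical.

```python
import io
import os
import tempfile

# Hypothetical text containing the en dash (u"\u2013") named in the traceback.
text = u"Growth Is Back \u2013 Bank of America Corporation"

# Temporary location used only for this sketch; the script writes OUTPUT.txt.
path = os.path.join(tempfile.mkdtemp(), "OUTPUT.txt")

# io.open accepts an explicit encoding on both Python 2 and Python 3.
with io.open(path, "w", encoding="utf-8") as out:
    out.write(u"Tweets:\n")
    out.write(text + u"\n")

# Read the file back to confirm the en dash survived the round trip.
with io.open(path, "r", encoding="utf-8") as check:
    lines = check.read().splitlines()

print(len(lines))
```

In writetweets() the same idea would mean replacing open("OUTPUT.txt", "w") with io.open("OUTPUT.txt", "w", encoding="utf-8") and writing unicode strings.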