Fix encoding error with loop in BeautifulSoup4?

【Question title】: Fix encoding error with loop in BeautifulSoup4?
【Posted】: 2016-01-28 02:15:23
【Question】:

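This is a follow-up to "Focusing in on specific results while scraping Twitter with Python and Beautiful Soup 4?" and "Using Python to Scrape Nested Divs and Spans in Twitter?".

I am not using the Twitter API because it does not search tweets by hashtag this far back.

Edit: The error described here only occurs on Windows 7. The code runs as expected on Linux, as bernie reported (see the comments below), and I was able to run it on OSX 10.10.2 without encoding errors.

The encoding error occurs when I try to loop the code that scrapes the tweet content.

The first snippet scrapes only the first tweet and gets everything in the <p> tag, as expected.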

amessagetext = soup('p', {'class': 'TweetTextSize  js-tweet-text tweet-text'})
amessage = amessagetext[0]

However, when I try to use a loop to scrape all the tweets with this second snippet,

messagetexts = soup('p', {'class': 'TweetTextSize  js-tweet-text tweet-text'})  
messages = [messagetext for messagetext in messagetexts]

I get the well-known cp437.py encoding error.

File "C:\Anaconda3\lib\encodings\cp437.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2014' in position 4052: character maps to <undefined>

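So why does the code for the first tweet scrape successfully, while multiple tweets cause an encoding problem? Is it simply because the first tweet happens not to contain the problematic characters? I have tried scraping the first tweet in several different searches successfully, so I am not sure that is the cause.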
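
One way to check that guess is to test each string against the console's code page directly. The sketch below is not part of the original script: the helper name `unencodable_chars` and both sample strings are hypothetical, and it assumes the console encoding is cp437, as the traceback suggests. It shows that encoding only fails for text that contains a character cp437 cannot represent, such as the em dash U+2014.

```python
def unencodable_chars(text, encoding="cp437"):
    """Return the characters in `text` that `encoding` cannot represent."""
    bad = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            bad.append(ch)
    return bad


# Hypothetical tweet texts, not taken from the real search results.
first_tweet = "Breaking news from Bangkok"
later_tweet = "Bangkok \u2014 latest updates"

# Plain ASCII text encodes cleanly, so printing it on a cp437 console works.
print(unencodable_chars(first_tweet))

# ascii() keeps the report itself printable on any console.
print([ascii(ch) for ch in unencodable_chars(later_tweet)])

# The same failure the traceback shows, without Twitter or BeautifulSoup.
try:
    later_tweet.encode("cp437")
except UnicodeEncodeError as exc:
    print("encode failed:", exc.reason)
```

If this reasoning holds, the error comes from print() encoding the text for the Windows console rather than from the BeautifulSoup loop itself, which would also explain why the same code runs cleanly in a UTF-8 terminal on Linux or OSX.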

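How can I fix this? I have read some posts and book sections about this, and I understand why it happens, but I am not sure how to correct it in the BeautifulSoup code.

Here is the full code for reference.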

from bs4 import BeautifulSoup
import requests
import sys
import csv #Will be exporting to csv

url = 'https://twitter.com/search?q=%23bangkokbombing%20since%3A2015-08-10%20until%3A2015-09-30&src=typd&lang=en'
headers = {'User-Agent': 'Mozilla/5.0'} # (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
r = requests.get(url, headers=headers)
data = r.text.encode('utf-8')
soup = BeautifulSoup(data, "html.parser")

names = soup('strong', {'class': 'fullname js-action-profile-name show-popup-with-id'})
usernames = [name.contents[0] for name in names]

handles = soup('span', {'class': 'username js-action-profile-name'})
userhandles = [handle.contents[1].contents[0] for handle in handles]   
athandles = [('@')+abhandle for abhandle in userhandles]

links = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
urls = [link["href"] for link in links]
fullurls = [('http://www.twitter.com')+permalink for permalink in urls] 

timestamps = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
datetime = [timestamp["title"] for timestamp in timestamps]

messagetexts = soup('p', {'class': 'TweetTextSize  js-tweet-text tweet-text'})  
messages = [messagetext for messagetext in messagetexts] 

amessagetext = soup('p', {'class': 'TweetTextSize  js-tweet-text tweet-text'})
amessage = amessagetext[0]

retweets = soup('button', {'class': 'ProfileTweet-actionButtonUndo js-actionButton js-actionRetweet'})
retweetcounts = [retweet.contents[3].contents[1].contents[1].string for retweet in retweets]

favorites = soup('button', {'class': 'ProfileTweet-actionButtonUndo u-linkClean js-actionButton js-actionFavorite'})
favcounts = [favorite.contents[3].contents[1].contents[1].string for favorite in favorites]

print (usernames, "\n", "\n", athandles, "\n", "\n", fullurls, "\n", "\n", datetime, "\n", "\n",retweetcounts, "\n", "\n", favcounts, "\n", "\n", amessage, "\n", "\n", messages)

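【Comments】:

Great question. Upvoted. FWIW I can't reproduce the error on Linux (ubuntu 14.04).
Thanks for the info. I'm on Windows 7 using Console 2, in case that matters.
This may help: BeautifulSoup4 stripped_strings gives me byte objects?
Thanks for the tip. I'll look into stripped_strings. After reading bernie's comment, I tried it on my laptop running OSX 10.10.2 and got the output I was looking for, with no encoding errors. I'll leave the question open, though, because I'm interested in a fix for Windows. The original post has been edited to include this information.
I just tried this again on Windows 7 and still can't reproduce the error. Interesting. Hopefully someone else can come along and reproduce it.

Tags: python twitter web-scraping beautifulsoup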

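【Solution 1】:

I solved this to my own satisfaction by removing the print statements I was using for error checking and by adding encoding="utf-8" to the two with open commands, which specifies the encoding of both the HTML file being scraped and the CSV output file.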

from bs4 import BeautifulSoup
import requests
import sys
import csv
import re
from datetime import datetime
from pytz import timezone

url = input("Enter the name of the file to be scraped:")
with open(url, encoding="utf-8") as infile:
    soup = BeautifulSoup(infile, "html.parser")

#url = 'https://twitter.com/search?q=%23bangkokbombing%20since%3A2015-08-10%20until%3A2015-09-30&src=typd&lang=en'
#headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
#r = requests.get(url, headers=headers)
#data = r.text.encode('utf-8')
#soup = BeautifulSoup(data, "html.parser")

names = soup('strong', {'class': 'fullname js-action-profile-name show-popup-with-id'})
usernames = [name.contents for name in names]

handles = soup('span', {'class': 'username js-action-profile-name'})
userhandles = [handle.contents[1].contents[0] for handle in handles]  
athandles = [('@')+abhandle for abhandle in userhandles]

links = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
urls = [link["href"] for link in links]
fullurls = [permalink for permalink in urls]

timestamps = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
datetime = [timestamp["title"] for timestamp in timestamps]

messagetexts = soup('p', {'class': 'TweetTextSize  js-tweet-text tweet-text'}) 
messages = [messagetext for messagetext in messagetexts]  

retweets = soup('button', {'class': 'ProfileTweet-actionButtonUndo js-actionButton js-actionRetweet'})
retweetcounts = [retweet.contents[3].contents[1].contents[1].string for retweet in retweets]

favorites = soup('button', {'class': 'ProfileTweet-actionButtonUndo u-linkClean js-actionButton js-actionFavorite'})
favcounts = [favorite.contents[3].contents[1].contents[1].string for favorite in favorites]

images = soup('div', {'class': 'content'})
imagelinks = [src.contents[5].img if len(src.contents) > 5 else "No image" for src in images]

#print (usernames, "\n", "\n", athandles, "\n", "\n", fullurls, "\n", "\n", datetime, "\n", "\n",retweetcounts, "\n", "\n", favcounts, "\n", "\n", messages, "\n", "\n", imagelinks)

rows = zip(usernames,athandles,fullurls,datetime,retweetcounts,favcounts,messages,imagelinks)

rownew = list(rows)

#print (rownew)

newfile = input("Enter a filename for the table:") + ".csv"

with open(newfile, 'w', encoding='utf-8', newline='') as f:  # newline='' avoids blank rows on Windows
    writer = csv.writer(f, delimiter=",")
    writer.writerow(['Usernames', 'Handles', 'Urls', 'Timestamp', 'Retweets', 'Favorites', 'Message', 'Image Link'])
    for row in rownew:
        writer.writerow(row)
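
The two parts of this fix can be tried in isolation. The following is a minimal sketch, not the script above: the `console_safe` helper, the sample row, and the temporary file name are all hypothetical. It writes a row containing an em dash to a CSV file opened with encoding="utf-8", reads it back unchanged, and shows one possible way to keep debugging output on a cp437 console by escaping characters the console cannot display.

```python
import csv
import os
import sys
import tempfile


def console_safe(text, encoding=None):
    """Replace characters the console cannot show with backslash escapes."""
    encoding = encoding or sys.stdout.encoding or "ascii"
    return text.encode(encoding, errors="backslashreplace").decode(encoding)


# Hypothetical row standing in for one scraped tweet.
rows = [["@example", "Bangkok \u2014 latest updates"]]

# Writing with an explicit UTF-8 encoding never touches the console code page.
path = os.path.join(tempfile.mkdtemp(), "tweets.csv")
with open(path, "w", encoding="utf-8", newline="") as f:
    csv.writer(f).writerows(rows)

# Reading back with the same encoding restores the em dash intact.
with open(path, encoding="utf-8", newline="") as f:
    restored = list(csv.reader(f))

# Escaped form of the message, printable even when stdout is cp437.
print(console_safe(restored[0][1], "cp437"))
```

Passing "cp437" explicitly only simulates the Windows console here; leaving the argument out makes the helper fall back to whatever encoding sys.stdout reports.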

【Comments】: