【Question title】: Fix encoding error with loop in BeautifulSoup4?
【Posted】: 2016-01-28 02:15:23
【Question】:

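This is a follow-up to "Focusing in on specific results while scraping Twitter with Python and Beautiful Soup 4?" and "Using Python to Scrape Nested Divs and Spans in Twitter?".

I am not using the Twitter API because it does not return tweets by hashtag this far back.

Edit: The error described here occurs only on Windows 7. The code runs as expected on Linux, as bernie reports in the comments below, and I was able to run it on OSX 10.10.2 without any encoding errors.

The encoding error occurs when I try to loop the code that scrapes the tweet content.

The first snippet scrapes only the first tweet and gets everything in the <p> tag, as expected.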

amessagetext = soup('p', {'class': 'TweetTextSize  js-tweet-text tweet-text'})
amessage = amessagetext[0]

However, when I try to use a loop to scrape all of the tweets with this second snippet,

messagetexts = soup('p', {'class': 'TweetTextSize  js-tweet-text tweet-text'})  
messages = [messagetext for messagetext in messagetexts] 

I get the well-known cp437.py encoding error.

File "C:\Anaconda3\lib\encodings\cp437.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2014' in     position 4052: character maps to <undefined>

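So why does the code for the first tweet scrape successfully, while multiple tweets cause an encoding problem? Is it simply because the first tweet happens not to contain the problematic character? I have successfully scraped the first tweet on several different searches, so I'm not sure that is the cause.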
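
One way to test that hypothesis without touching the console is to check whether a given string can be encoded with the codec named in the traceback. This is only a sketch: the sample strings and the `can_encode` helper are made up for illustration, and it assumes the console really is using code page 437, as the traceback suggests.

```python
# Hypothetical sample strings; only the second contains U+2014 (em dash),
# the character named in the traceback.
plain = "Bangkok bombing update"
with_dash = "Bangkok \u2014 bombing update"


def can_encode(text, encoding="cp437"):
    """Return True if text can be encoded with the given codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print(can_encode(plain))      # True
print(can_encode(with_dash))  # False
```

If every tweet in `messages` passes such a check except the ones containing characters like U+2014, the failure would come from printing to the console rather than from the scraping itself.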

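How can I fix this? I have read some posts and book sections on the subject, and I understand why it happens, but I'm not sure how to correct it within the BeautifulSoup code.

Here is the full code for reference.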

from bs4 import BeautifulSoup
import requests
import sys
import csv #Will be exporting to csv

url = 'https://twitter.com/search?q=%23bangkokbombing%20since%3A2015-08-10%20until%3A2015-09-30&src=typd&lang=en'
headers = {'User-Agent': 'Mozilla/5.0'} # (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
r = requests.get(url, headers=headers)
data = r.text.encode('utf-8')
soup = BeautifulSoup(data, "html.parser")

names = soup('strong', {'class': 'fullname js-action-profile-name show-popup-with-id'})
usernames = [name.contents[0] for name in names]

handles = soup('span', {'class': 'username js-action-profile-name'})
userhandles = [handle.contents[1].contents[0] for handle in handles]   
athandles = [('@')+abhandle for abhandle in userhandles]

links = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
urls = [link["href"] for link in links]
fullurls = [('http://www.twitter.com')+permalink for permalink in urls] 

timestamps = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
datetime = [timestamp["title"] for timestamp in timestamps]

messagetexts = soup('p', {'class': 'TweetTextSize  js-tweet-text tweet-text'})  
messages = [messagetext for messagetext in messagetexts] 

amessagetext = soup('p', {'class': 'TweetTextSize  js-tweet-text tweet-text'})
amessage = amessagetext[0]

retweets = soup('button', {'class': 'ProfileTweet-actionButtonUndo js-actionButton js-actionRetweet'})
retweetcounts = [retweet.contents[3].contents[1].contents[1].string for retweet in retweets]

favorites = soup('button', {'class': 'ProfileTweet-actionButtonUndo u-linkClean js-actionButton js-actionFavorite'})
favcounts = [favorite.contents[3].contents[1].contents[1].string for favorite in favorites]

print (usernames, "\n", "\n", athandles, "\n", "\n", fullurls, "\n", "\n", datetime, "\n", "\n",retweetcounts, "\n", "\n", favcounts, "\n", "\n", amessage, "\n", "\n", messages)

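【Comments】:

  • Great question, upvoted. FWIW, I can't reproduce the error on Linux (Ubuntu 14.04).
  • Thanks for the info. I'm on Windows 7 using Console 2, in case that matters.
  • Thanks for the tip. I'll look into stripped_strings. After reading bernie's comment, I tried it on my laptop running OSX 10.10.2 and got the output I was looking for, with no encoding error. I'll leave the question open, though, since I'm interested in a fix for Windows. The original post has been edited to include this information.
  • I just tried this again on Windows 7 and still can't reproduce the error. Interesting. Hopefully someone else can come along and replicate it.

Tags: python twitter web-scraping beautifulsoup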


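【Solution 1】:

I have solved this to my own satisfaction by eliminating the print statements I was using for error checking, and by specifying the encoding of both the scraped HTML file and the CSV output file, adding encoding="utf-8" to the two with open commands.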
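
The file-writing half of that fix can be seen in isolation in the small sketch below. The rows, column names, and temporary file path are hypothetical and only stand in for the scraped data; the `newline=""` argument is not part of the fix described above, but it is what the csv module documentation recommends when opening files for `csv.writer`.

```python
import csv
import os
import tempfile

# Hypothetical rows; the second message contains U+2014 (em dash).
rows = [
    ("alice", "plain message"),
    ("bob", "message \u2014 with a dash"),
]

path = os.path.join(tempfile.mkdtemp(), "tweets.csv")

# encoding="utf-8" can represent any Unicode character, so the write does not
# depend on the Windows default code page.
with open(path, "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Handle", "Message"])
    writer.writerows(rows)

# Read the file back with the same encoding to confirm nothing was lost.
with open(path, encoding="utf-8", newline="") as f:
    read_back = [tuple(row) for row in csv.reader(f)][1:]

print(read_back == rows)  # True
```

The full script I ended up with follows.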

from bs4 import BeautifulSoup
import requests
import sys
import csv
import re
from datetime import datetime
from pytz import timezone

url = input("Enter the name of the file to be scraped:")
with open(url, encoding="utf-8") as infile:
    soup = BeautifulSoup(infile, "html.parser")

#url = 'https://twitter.com/search?q=%23bangkokbombing%20since%3A2015-08-10%20until%3A2015-09-30&src=typd&lang=en'
#headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
#r = requests.get(url, headers=headers)
#data = r.text.encode('utf-8')
#soup = BeautifulSoup(data, "html.parser")

names = soup('strong', {'class': 'fullname js-action-profile-name show-popup-with-id'})
usernames = [name.contents for name in names]

handles = soup('span', {'class': 'username js-action-profile-name'})
userhandles = [handle.contents[1].contents[0] for handle in handles]  
athandles = [('@')+abhandle for abhandle in userhandles]

links = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
urls = [link["href"] for link in links]
fullurls = [permalink for permalink in urls]

timestamps = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
datetime = [timestamp["title"] for timestamp in timestamps]

messagetexts = soup('p', {'class': 'TweetTextSize  js-tweet-text tweet-text'}) 
messages = [messagetext for messagetext in messagetexts]  

retweets = soup('button', {'class': 'ProfileTweet-actionButtonUndo js-actionButton js-actionRetweet'})
retweetcounts = [retweet.contents[3].contents[1].contents[1].string for retweet in retweets]

favorites = soup('button', {'class': 'ProfileTweet-actionButtonUndo u-linkClean js-actionButton js-actionFavorite'})
favcounts = [favorite.contents[3].contents[1].contents[1].string for favorite in favorites]

images = soup('div', {'class': 'content'})
imagelinks = [src.contents[5].img if len(src.contents) > 5 else "No image" for src in images]

#print (usernames, "\n", "\n", athandles, "\n", "\n", fullurls, "\n", "\n", datetime, "\n", "\n",retweetcounts, "\n", "\n", favcounts, "\n", "\n", messages, "\n", "\n", imagelinks)

rows = zip(usernames,athandles,fullurls,datetime,retweetcounts,favcounts,messages,imagelinks)

rownew = list(rows)

#print (rownew)

newfile = input("Enter a filename for the table:") + ".csv"

with open(newfile, 'w', encoding='utf-8', newline='') as f:  # newline='' avoids blank rows between records on Windows
    writer = csv.writer(f, delimiter=",")
    writer.writerow(['Usernames', 'Handles', 'Urls', 'Timestamp', 'Retweets', 'Favorites', 'Message', 'Image Link'])
    for row in rownew:
        writer.writerow(row)
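
If the print statements are still wanted for debugging, one possible alternative to removing them is to replace whatever the console encoding cannot represent before printing. The following is a rough sketch rather than part of the fix above: `safe_print` is a hypothetical helper, and the in-memory stream merely simulates a cp437 console so the behavior can be checked on any platform.

```python
import io
import sys


def safe_print(text, stream=None):
    """Print text, replacing characters the stream's encoding cannot represent."""
    if stream is None:
        stream = sys.stdout
    encoding = getattr(stream, "encoding", None) or "utf-8"
    # Round-trip through the target encoding so unencodable characters become "?".
    printable = str(text).encode(encoding, errors="replace").decode(encoding)
    print(printable, file=stream)


# Simulate a cp437 console with an in-memory stream (an assumption for this demo).
fake_console = io.TextIOWrapper(io.BytesIO(), encoding="cp437", newline="\n")
safe_print("before \u2014 after", stream=fake_console)
fake_console.flush()
console_bytes = fake_console.buffer.getvalue()
print(console_bytes)  # b'before ? after\n'
```

The trade-off is that the console output becomes lossy (the em dash turns into "?"), while the data written to the UTF-8 CSV file stays intact.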

【Comments】:
