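When writing a web scraper, Chinese characters often come out garbled when the results are saved. For example, here is a scraper run against my own blog: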
    import json
    import requests
    from bs4 import BeautifulSoup

    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headers = {'User-Agent': user_agent}
    r = requests.get('https://www.cnblogs.com/yue-qian/', headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')

    text = []
    for zx in soup.find_all('div', class_="c_b_p_desc"):
        text.append(zx.text)

    with open("xyz.txt", 'w') as fp:
        json.dump(text, fp=fp, indent=4)
A partial screenshot of the result:
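As shown above, the Chinese text is no longer readable once the scraped results are written to JSON. Two things are involved. First, json.dump defaults to ensure_ascii=True, so every non-ASCII character is written as a \uXXXX escape sequence. Second, Python 2 uses ASCII as its default encoding out of the box, so writing unescaped Chinese text to a file opened without an explicit encoding fails.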
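To make the effect of ensure_ascii concrete, here is a minimal sketch using json.dumps on a made-up sample string (the sample is an assumption for illustration, not data from the blog). It is written for Python 3 and only shows the escaping behaviour, not the scraper itself.

```python
import json

# Hypothetical sample data standing in for the scraped summaries.
sample = ["中文"]

# Default behaviour: non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(sample)

# With ensure_ascii=False the characters are kept as they are.
readable = json.dumps(sample, ensure_ascii=False)

print(escaped)
print(readable)

# Both forms are valid JSON and decode back to the same data.
assert json.loads(escaped) == json.loads(readable) == sample
```

Note that the escaped form is not corrupted: any JSON parser restores the original characters. It is simply hard for a person to read in a text editor.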
Make the following changes: set the default encoding to UTF-8 and pass ensure_ascii=False to json.dump:
    import json
    import requests
    from bs4 import BeautifulSoup
    import sys

    # Python 2 only: reload(sys) restores setdefaultencoding, which is
    # removed from sys at startup. Neither call is available in Python 3.
    reload(sys)
    sys.setdefaultencoding("utf-8")

    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headers = {'User-Agent': user_agent}
    r = requests.get('https://www.cnblogs.com/yue-qian/', headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')

    text = []
    for zx in soup.find_all('div', class_="c_b_p_desc"):
        text.append(zx.text)

    with open("xyz.txt", 'w') as fp:
        # ensure_ascii=False writes the Chinese characters themselves
        # instead of \uXXXX escapes.
        json.dump(text, fp=fp, ensure_ascii=False, indent=4)
The result is as follows:
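The reload(sys) / sys.setdefaultencoding approach only exists in Python 2. On Python 3, a rough equivalent is to open the output file with an explicit encoding and keep ensure_ascii=False. The sketch below assumes Python 3; the helper names save_as_json and load_json are hypothetical, and the scraping step is left out so the example runs without network access.

```python
import json
import os
import tempfile


def save_as_json(items, path):
    """Write items to path as UTF-8 JSON, keeping non-ASCII text readable."""
    # An explicit encoding avoids depending on the platform default.
    with open(path, "w", encoding="utf-8") as fp:
        json.dump(items, fp, ensure_ascii=False, indent=4)


def load_json(path):
    """Read back a JSON file written by save_as_json."""
    with open(path, encoding="utf-8") as fp:
        return json.load(fp)


if __name__ == "__main__":
    # Hypothetical sample standing in for the scraped `text` list.
    demo_path = os.path.join(tempfile.mkdtemp(), "xyz.txt")
    save_as_json(["中文摘要"], demo_path)
    with open(demo_path, encoding="utf-8") as fp:
        print(fp.read())
```

In the scraper, the `text` list built in the loop would be passed in as `save_as_json(text, "xyz.txt")`.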