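When I scrape a web page with Python while imitating a browser (sending a browser User-Agent header), the Chinese characters I get back come out garbled. How can I fix this?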

import codecs
import re
import urllib2

import BeautifulSoup

url = "http://newhouse.hfhouse.com/"
req = urllib2.Request(url, headers={"User-Agent": "Mozilla/5.0 (Windows NT 6.1; rv:24.0) Gecko/20100101 Firefox/24.0"})
reqHtml = urllib2.urlopen(req).read()
#print reqHtml
songtasteHtmlEncoding = 'utf-8'
soup = BeautifulSoup.BeautifulStoneSoup(reqHtml, fromEncoding=songtasteHtmlEncoding)
#print soup
re_h = re.compile(r'</?\w+[^>]*>')  # matches any opening or closing tag
finda = soup.findAll('a', {"class": "area_list"})
s = len(finda)
i = 0
while i < s:
    # strip the markup and keep only the link text
    quyuz = re_h.sub('', str(finda[i])).strip()
    try:
        quyu = quyuz.decode('utf-8').encode('gbk')
    except UnicodeError:
        # retry after dropping a leading UTF-8 byte order mark
        if quyuz[:3] == codecs.BOM_UTF8:
            quyu = quyuz[3:]
            print quyu.decode("utf-8").encode('gbk')
    #quyu = quyu.decode('utf-8').encode('gbk')
    #number = int(filter(str.isdigit, quyuz))
    #dir2 = make_dir(dir1,quyu)
    value = finda[i]['val']
    houseid = finda[i]['href']
    print houseid, value, quyu
    i += 1  # move on to the next link

It always fails with UnicodeEncodeError: 'gbk' codec can't encode character u'\xe7' in position 0: illegal multibyte sequence. The encoding declared in the page's head is utf-8. What should I do?
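
One possible reading of that message, offered as a guess rather than a diagnosis: u'\xe7' is what the first byte of many UTF-8 encoded Chinese characters turns into when the bytes are decoded as Latin-1 somewhere along the way, and GBK has no mapping for that character. The sketch below reproduces the symptom on a made-up sample string. `raw`, `can_encode_gbk` and `repair_latin1_mojibake` are hypothetical names for illustration, and the Latin-1 step is an assumption about what happened (a Windows-1252 decode would need different handling).

```python
# -*- coding: utf-8 -*-
# Sketch only: `raw` stands in for the bytes read from the HTTP response.
raw = u"瑶海区".encode("utf-8")   # hypothetical sample text, as UTF-8 bytes

# Assumed failure mode: the UTF-8 bytes are decoded as Latin-1, so each byte
# becomes one character and the text now starts with u'\xe7'.
mojibake = raw.decode("latin-1")


def can_encode_gbk(text):
    """Return True if every character in `text` has a GBK mapping."""
    try:
        text.encode("gbk")
        return True
    except UnicodeEncodeError:
        return False


def repair_latin1_mojibake(text):
    """Undo a mistaken Latin-1 decode of UTF-8 bytes (assumes that is the cause)."""
    return text.encode("latin-1").decode("utf-8")


# Decode once, using the encoding the page declares, then encode for output.
text = raw.decode("utf-8")
# "replace" substitutes '?' for any character GBK cannot represent
# instead of raising UnicodeEncodeError.
gbk_bytes = text.encode("gbk", "replace")
```

If the assumption holds, `can_encode_gbk(mojibake)` is false while `can_encode_gbk(text)` is true, so the thing to check in the scraper is where the response bytes are first turned into text, not the final `encode('gbk')` call.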