mindmac

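To match Chinese characters in fetched web page content, the general approach is:

1. Read the Content-Type field of the HTTP response headers to find the page's character encoding;

2. Decode the page content to Unicode using that encoding;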

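3. Match with the regular expression [\u2e80-\u9fff], which runs from the CJK Radicals Supplement block through CJK Unified Ideographs (the range also includes Japanese kana; use [\u4e00-\u9fff] for Han ideographs only);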

4. Encode the matched strings as UTF-8.
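
Step 1 is the part most likely to fail in practice, because a server may send a Content-Type value with no charset parameter at all. The following is a minimal sketch in Python 3 (an assumption; the demo in this post is Python 2) that parses the header with the standard library. The helper name charset_from_content_type and the UTF-8 default are illustrative choices, not part of the original post.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset parameter of a Content-Type value, or default."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lower-cases the value and returns None
    # when the header carries no charset parameter.
    return msg.get_content_charset() or default


print(charset_from_content_type("text/html; charset=GBK"))  # gbk
print(charset_from_content_type("text/html"))               # utf-8
```

Falling back to a default keeps the later decode step from being handed a string such as "text/html" as if it were an encoding name.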

Demo:

    # coding=utf-8
    import re
    import urllib2

    if __name__ == '__main__':
        try:
            url = 'https://play.google.com/store/apps/category/TRANSPORTATION/collection/topselling_free?start=48&num=24'
            req = urllib2.Request(url)
            res = urllib2.urlopen(req)
            # read the charset from the Content-Type header,
            # falling back to UTF-8 when none is declared
            content_type = res.headers.get('content-type', '')
            if 'charset=' in content_type:
                encoding = content_type.split('charset=')[-1]
            else:
                encoding = 'utf-8'
            # read the HTTP response body
            data = res.read()
            res.close()
            # decode the raw bytes to unicode
            data = unicode(data, encoding)
            # match runs of Chinese characters
            matches = re.findall(ur'[\u2e80-\u9fff]+', data)
            for item in matches:
                # encode each match as UTF-8 before printing
                print item.encode('utf-8')
        except Exception as e:
            print e
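
The demo targets Python 2 (urllib2, unicode(), the print statement). For readers on Python 3, the same steps might look like the sketch below. The function names find_chinese and fetch_chinese are hypothetical, and the UTF-8 fallback is an assumption. Step 4 has no counterpart here because Python 3 strings are already Unicode and print handles the output encoding.

```python
import re
import urllib.request

# Same range as the demo: CJK Radicals Supplement through CJK Unified Ideographs.
CJK_RUN = re.compile(r"[\u2e80-\u9fff]+")


def find_chinese(raw, encoding="utf-8"):
    """Decode raw bytes and return every run of characters in the CJK range."""
    text = raw.decode(encoding, errors="replace")
    return CJK_RUN.findall(text)


def fetch_chinese(url):
    """Fetch url and return the CJK runs in its body (performs network I/O)."""
    with urllib.request.urlopen(url) as res:
        # get_content_charset() returns None when no charset is declared
        encoding = res.headers.get_content_charset() or "utf-8"
        return find_chinese(res.read(), encoding)


if __name__ == "__main__":
    # Offline check: GBK-encoded bytes standing in for a fetched page.
    sample = "<p>交通运输 apps, 免费</p>".encode("gbk")
    print(find_chinese(sample, "gbk"))  # ['交通运输', '免费']
```

Keeping the decode-and-match step separate from the network call makes it possible to try the regular expression on sample bytes without fetching anything.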
