[Posted]: 2015-05-27 03:58:06
[Question]:
Is there a way to convert a document with inconsistent encodings to UTF-8?
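My project involves reading text from MS SQL 2000 (usually text or varchar columns), "cleaning" the text (stripping style attributes and wrapping some parts in divs), and inserting the "cleaned" records into a MySQL table.
I often find text like this:
Important roads include King Faisal Highway on the northwest side of the city, Al Fatih Highway on the east side, and Sh Isa Bin Salman Highway on the south shore. Across the water on the nearby island of المحرق (Muharraq), highways 20 and 21 circle the airport.
But after processing I get ??? instead.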
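For illustration only, one common way literal question marks appear is when text is encoded into a character set that cannot represent it, with replacement enabled. The sketch below shows that mechanism in isolation. It is an assumption about what may be happening somewhere in the pipeline, not a diagnosis of this database, and the `latin-1` codec is chosen purely as an example.

```python
# Hypothetical illustration: the Arabic word from the sample text, written
# with escapes so the snippet does not depend on the source file's encoding.
text = "\u0627\u0644\u0645\u062d\u0631\u0642 (Muharraq)"

# A legacy single-byte codec cannot represent Arabic letters. With
# errors="replace", each unsupported character becomes a literal "?".
lossy = text.encode("latin-1", errors="replace")
print(lossy)

# UTF-8 can represent every character, so a round trip loses nothing.
round_trip = text.encode("utf-8").decode("utf-8")
print(round_trip == text)
```

Once a character has been replaced by "?" in this way, the original cannot be recovered from the stored value, so the step that performs the lossy conversion is the one to look for.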
My code:
# -*- coding: UTF-8 -*-
from bs4 import BeautifulSoup as B_S, UnicodeDammit as U_D
import pymysql as db
import time

def mod_content():
    conn = db.connect( host='192.168.0.131', port=3306, user='USER', passwd='PASS', db='GRW', charset='utf8' )
    c = conn.cursor()
    sql = "SELECT city_id,nid,html_content,notes FROM content_city WHERE nid = 13 AND city_id = 182 ORDER BY city_id"
    c.execute( sql )
    for rec in c:
        contents = rec[2]
        contents = U_D.detwingle( contents )
        soup = B_S( contents )
        rs = soup.find_all( 'div', { 'class':'node_content' } )
        for r in rs:
            '''
            do clean up stuff
            '''
        contents = soup.prettify( formatter='html' ) # B_S function
        contents = ' '.join( contents.split() )
        ##### writing to a txt file here, but would want to do a MySQL INSERT
        raw = open( 'raw_182_mod.txt', 'a', 4 ) # a - append r - read w - write (writes over)
        raw.write( contents )
        raw.close()
    print( 'mod_content Complete' )

mod_content()
Is there a way to convert everything to UTF-8?
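One thing that may be worth ruling out, stated as an assumption rather than a confirmed cause: in Python 3, `open()` in text mode uses a platform-dependent default encoding unless `encoding` is given. A minimal sketch of appending with an explicit UTF-8 encoding follows. The helper names `append_utf8` and `read_utf8` and the temporary file path are hypothetical and are not part of the code above.

```python
import os
import tempfile

def append_utf8(path, text):
    """Append text to path, always encoding it as UTF-8."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(text)

def read_utf8(path):
    """Read the whole file back, decoding it as UTF-8."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()

# Hypothetical sample and path, used only to show the round trip.
sample = "\u0627\u0644\u0645\u062d\u0631\u0642 (Muharraq)"
demo_path = os.path.join(tempfile.mkdtemp(), "raw_demo.txt")

append_utf8(demo_path, sample)
print(read_utf8(demo_path) == sample)
```

This only guarantees that the file itself is UTF-8. If the string already contains literal question marks when it is read from the database, the loss happened earlier and the file encoding will not restore it.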
Update 3/24: According to this post (How to make unicode string with python3), Python 2's unicode is str() in Python 3. contents = str( contents, 'utf-8' ) gives me a TypeError, while contents = contents.decode( 'utf-8' ) gives me AttributeError: 'str' object has no attribute 'decode'. So how do I incorporate this into my workflow?
def mod_content():
    conn = db.connect( host='192.168.0.131', port=3306, user='wtp', passwd='wtp', db='GRW', charset='utf8' )
    c = conn.cursor()
    sql = "SELECT city_id,nid,html_content,notes FROM content_city WHERE nid = 13 AND city_id = 182 ORDER BY city_id"
    c.execute( sql )
    print( 'type(c) is', type( c ) ) ## type(c) is <class 'pymysql.cursors.Cursor'>
    for rec in c:
        contents = rec[2]
        print( 'type(contents) is', type( contents ) ) ## type(contents) is <class 'str'>
        #print( contents ) ## this gives me ?????
        #contents = U_D.detwingle( contents )
        #contents = str( contents, 'utf-8' ) ## TypeError: decoding str is not supported
        soup = B_S( contents )
        print( 'type(soup) is', type( soup ) ) ## type(soup) is <class 'bs4.BeautifulSoup'>
        rs = soup.find_all( 'div', { 'class':'node_content' } )
        for r in rs:
            '''
            do clean up stuff
            '''
        #contents = str( contents, 'utf-8' ) ## TypeError: decoding str is not supported
        contents = soup.prettify( formatter='html' ) # B_S function
        contents = ' '.join( contents.split() )
        print( 'type(contents) AFTER prettify is', type( contents ) ) ## type(contents) AFTER prettify is <class 'str'>
        raw = open( 'raw_182_mod.txt', 'a', 4 ) # a - append r - read w - write (writes over)
        raw.write( contents )
        raw.close()
    print( 'mod_content Complete' )

mod_content()
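The two errors above are consistent with `contents` already being a Python 3 `str`: only `bytes` objects have `.decode()`, and `str(value, 'utf-8')` only accepts bytes-like input. A minimal sketch of a guard that decodes only when needed is shown below, together with `ascii()` as a way to check whether a string really contains question marks or whether the console simply cannot display it. The helper name `to_text` is hypothetical.

```python
def to_text(value, encoding="utf-8"):
    """Return a str: decode bytes-like input, pass str through unchanged."""
    if isinstance(value, (bytes, bytearray)):
        return value.decode(encoding)
    return value

# Hypothetical sample values standing in for a database column.
already_text = "\u0627\u0644\u0645\u062d\u0631\u0642"
raw_bytes = already_text.encode("utf-8")

print(to_text(raw_bytes) == already_text)     # bytes are decoded
print(to_text(already_text) is already_text)  # str is left alone

# ascii() escapes non-ASCII characters, so the output does not depend on
# what the console can render. Real Arabic text shows \u escapes, while
# text that was already damaged shows plain question marks.
print(ascii(already_text))
print(ascii("??????"))
```

If `ascii( contents )` on the value read from MySQL shows plain `?` characters, the text was most likely damaged before or during storage, and no conversion in Python will bring it back.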
[Comments]:
Tags: python-3.x encoding utf-8 beautifulsoup