【Title】: Converting inconsistent encoding to utf-8 Python 3.4 BS 4.3
【Posted】: 2015-05-27 03:58:06
【Question】:

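Is there a way to convert a document with inconsistent encoding to utf-8?

My project involves reading text from MS SQL 2000 (usually text or varchar columns), "cleaning" it (stripping style attributes, wrapping some parts in divs), and inserting the cleaned records into a MySQL table.

I often come across text like this:

Important roads include King Faisal Highway on the northwestern side of the city, Al Fatih Highway on the eastern side, and Sh Isa Bin Salman Highway along the southern shore. Across the water on the nearby island of المحرق (Muharraq), highways 20 and 21 encircle the airport.

but I get ??? after processing.

My code: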

# -*- coding: UTF-8 -*-
from bs4 import BeautifulSoup as B_S, UnicodeDammit as U_D
import pymysql as db
import time

def mod_content():
    conn = db.connect( host='192.168.0.131', port=3306, user='USER', passwd='PASS', db='GRW', charset='utf8' )
    c = conn.cursor()
    sql = "SELECT city_id,nid,html_content,notes FROM content_city WHERE nid = 13 AND city_id = 182 ORDER BY city_id"
    c.execute( sql )
    for rec in c:
        contents = rec[2]
        contents = U_D.detwingle( contents )

        soup = B_S( contents )
        rs = soup.find_all( 'div', { 'class':'node_content' } )
        for r in rs:
            '''
            do clean up stuff
            '''
        contents = soup.prettify( formatter='html' ) # B_S function
        contents = ' '.join( contents.split() )
        ##### writing to a txt file here, but would want to do a MySQL INSERT
        raw = open( 'raw_182_mod.txt', 'a', 4 ) # a - append r - read w - write (writes over)
        raw.write( contents )
        raw.close()

    print( 'mod_content Complete' )

mod_content()

Is there a way to convert everything to utf-8?

Update 3/24: According to this post (How to make unicode string with python3), Python 2's unicode is str() in Python 3. contents = str( contents, 'utf-8' ) gives me a TypeError, and contents = contents.decode( 'utf-8' ) gives me AttributeError: 'str' object has no attribute 'decode'. So how do I fit this into my workflow?

def mod_content():
    conn = db.connect( host='192.168.0.131', port=3306, user='wtp', passwd='wtp', db='GRW', charset='utf8' )
    c = conn.cursor()
    sql = "SELECT city_id,nid,html_content,notes FROM content_city WHERE nid = 13 AND city_id = 182 ORDER BY city_id"
    c.execute( sql )
    print( 'type(c) is', type( c ) ) ## type(c) is <class 'pymysql.cursors.Cursor'>
    for rec in c:
        contents = rec[2]
        print( 'type(contents) is', type( contents ) ) ## type(contents) is <class 'str'>
        #print( contents ) ## this gives me ?????
        #contents = U_D.detwingle( contents )
        #contents = str( contents, 'utf-8' ) ## TypeError: decoding str is not supported


        soup = B_S( contents )
        print( 'type(soup) is', type( soup ) ) ## type(soup) is <class 'bs4.BeautifulSoup'>
        rs = soup.find_all( 'div', { 'class':'node_content' } )
        for r in rs:
            '''
            do clean up stuff
            '''
        #contents = str( contents, 'utf-8' ) ## TypeError: decoding str is not supported
        contents = soup.prettify( formatter='html' ) # B_S function
        contents = ' '.join( contents.split() )
        print( 'type(contents) AFTER prettify is', type( contents ) ) ## type(contents) AFTER prettify is <class 'str'>
        raw = open( 'raw_182_mod.txt', 'a', 4 ) # a - append r - read w - write (writes over)
        raw.write( contents )
        raw.close()

    print( 'mod_content Complete' )

mod_content()
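
Both errors in the update come from the same rule: in Python 3, `str` is already decoded text, and only `bytes` can be decoded. Since pymysql returned `str` here, there is nothing left to decode at that point. The sketch below illustrates the rule with a hypothetical helper, `to_text` (not part of pymysql or Beautiful Soup), which decodes only when it is handed bytes. The Windows-1252 fallback is an assumption about the legacy data, not something the question confirms.

```python
def to_text(value, encodings=("utf-8", "cp1252")):
    """Return value as str, decoding only if it is bytes.

    Hypothetical helper: the list of fallback encodings is an assumption.
    """
    if isinstance(value, str):
        # Already text; calling .decode() on it would raise AttributeError
        return value
    for enc in encodings:
        try:
            return value.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: keep going and mark undecodable bytes with U+FFFD
    return value.decode(encodings[0], errors="replace")


text = "island of \u0627\u0644\u0645\u062d\u0631\u0642 (Muharraq)"
raw = text.encode("utf-8")  # bytes, as a driver without charset handling might return

print(type(raw).__name__, type(to_text(raw)).__name__)
print(to_text(raw) == text)   # bytes are decoded
print(to_text(text) is text)  # str passes through untouched
```

If the value is already a `str` and still shows `?` characters, the damage most likely happened before Python received it (for example when the data was stored or transferred), and no decode call on the Python side can recover it.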

【Comments】:

    Tags: python-3.x encoding utf-8 beautifulsoup


    【Answer 1】:

    Update 3/31: This is how I solved the problem. If there is a better way, please let me know.

    # -*- coding: UTF-8 -*-
    
    from bs4 import BeautifulSoup as B_S
    import pymysql as db
    import time
    
    def mod_content():
    
        conn = db.connect( host='192.xxx.x.xxx', port=3306, user='USER', passwd='PASSWORD', db='GRW', charset='utf8' ) ## declare charset
        c = conn.cursor()
        sql = "SELECT city_id,nid,html_content,notes FROM content_city WHERE nid = 13 AND city_id = 182 ORDER BY city_id"
        c.execute( sql )
    
        for rec in c.fetchall():
            contents = rec[2]
    
            temp = B_S( contents)
            soup = temp.body
    
            allDivs = soup.find_all( 'div', { 'class':'picright' } )
            for div in allDivs:
                print( str( div )[ :80 ] )
                '''
                do clean up stuff
                '''
    
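            # now, output the data. Encoding to ASCII makes bs4 write non-ASCII
            # characters as numeric character references (e.g. &#1575;), so the
            # decoded str below contains only ASCII characters
            contents = soup.encode( 'ascii' )
            content_2str = contents.decode( 'utf-8' )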
            content_2str = content_2str.replace( "'", "&#39;" ) ## single quotes replaced
            content_2str = ' '.join( content_2str.split() ) ## removes extra spaces and line breaks - now compacted  
    
    
            ## I can now print it to file or update MySQL
            if updateSQL == 'yes':
    
                # let the driver quote the values instead of concatenating them into the SQL
                sql = "UPDATE content_city SET html_content = %s, notes = %s " \
                      "WHERE city_id = %s AND nid = %s"
    
                c.execute( sql, ( content_2str, notes_2str, recID, nid ) )
                conn.commit()
    
    
            if printToFile == 'yes':
    
                file2 = tempRoot + NIDs[ key ]+'_MOD.html'
                mod = open( file2, 'a',4 ) 
    
                mod.write( '\n' + str( nid ) + '\n' + str( recID ) + '\n' + \
                           content_2str + '\n' + notes_2str + '\n\n' )
                time.sleep(1)
                mod.close()
    
    
    
        print( 'mod_content Complete' )
    
    mod_content()
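
As an alternative to encoding to ASCII, the text can stay a `str` and be written with an explicit file encoding. The sketch below assumes the `???` came from an output step that used a non-UTF-8 default encoding, which the question does not confirm. The `append_utf8` helper and the temporary output path are hypothetical.

```python
import os
import tempfile


def append_utf8(path, text):
    """Append text to path, always encoding it as UTF-8.

    Without encoding=..., open() falls back to the platform default,
    which may not be able to represent Arabic characters.
    """
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(text)


sample = "island of \u0627\u0644\u0645\u062d\u0631\u0642 (Muharraq)"

# Hypothetical output location, for illustration only
path = os.path.join(tempfile.mkdtemp(), "raw_182_mod.txt")
append_utf8(path, sample + "\n")

with open(path, encoding="utf-8") as fh:
    round_trip = fh.read()

print(round_trip == sample + "\n")

# For comparison: ASCII-only output with numeric character references,
# similar in effect to encoding the soup to 'ascii'
ascii_refs = sample.encode("ascii", errors="xmlcharrefreplace").decode("ascii")
print(ascii_refs)
```

Writing real UTF-8 keeps the stored text readable and searchable, while the character-reference form only displays correctly once a browser renders it as HTML. Which one fits depends on how the MySQL column is consumed.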
    

    【Comments】:

      【Answer 2】:

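      str1 = "Important roads include King Faisal Highway on the northwestern side of the city, Al Fatih Highway on the eastern side, and Sh Isa Bin Salman Highway along the southern shore. Across the water on the nearby island of المحرق (Muharraq), highways 20 and 21 encircle the airport."

      To convert the text to unicode (Python 2), use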

       unicode(str1,"utf-8")
      
      u'Important roads include King Faisal Highway on the northwestern side of the city, Al Fatih Highway on the eastern side, and Sh Isa Bin Salman Highway along the southern shore. Across the water on the nearby island of \u0627\u0644\u0645\u062d\u0631\u0642 (Muharraq), highways 20 and 21 encircle the airport.'
      

      To remove non-ASCII characters from the string, use

      import unicodedata
      unicodedata.normalize('NFKD', unicode(str1,"utf-8")).encode('ascii','ignore')
      
      'Important roads include King Faisal Highway on the northwestern side of the city, Al Fatih Highway on the eastern side, and Sh Isa Bin Salman Highway along the southern shore. Across the water on the nearby island of  (Muharraq), highways 20 and 21 encircle the airport.'
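
The snippets above are Python 2: `unicode()` does not exist in Python 3, where a string literal is already Unicode text. A rough Python 3 equivalent might look like the following sketch. `strip_non_ascii` is a hypothetical name, and the shortened sample string is used only for illustration.

```python
import unicodedata


def strip_non_ascii(text):
    """Decompose accented characters, then drop anything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


str1 = "the nearby island of \u0627\u0644\u0645\u062d\u0631\u0642 (Muharraq)"

# Decoding applies only in the bytes -> str direction
as_bytes = str1.encode("utf-8")
assert as_bytes.decode("utf-8") == str1

print(strip_non_ascii(str1))         # the Arabic name is dropped entirely
print(strip_non_ascii("caf\u00e9"))  # accented Latin letters lose their accent
```

Note that this discards the Arabic text rather than preserving it, so it only helps when ASCII-only output is acceptable.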
      

      【讨论】:

      • Thanks. Your reply helped me better understand the Python 3.4 str type, but your solution gives me errors in 3.4. Please see the update above.