【Question title】: pd.read_html changed number formatting
【Posted】: 2021-09-16 17:47:14
【Question】:

I cannot get 1,2,3,4,5,6 from the CCCCCCC column: pd.read_html changes the value to 123456. My expected result should keep 1,2,3,4,5,6 unchanged.

HTML code

html = """<html>
<body>
<div id="MMMMMMMM" class="MMMMMMMMMMM" style="">
        <table class="OOOOOOOO" style="">
            <thead>
                <tr class="PPPPPPPPPP">
                    <td colspan="3" style="font-size:14px;font-weight:bold;" class="QQQQQQQQQQ">AAAAAAA</td>
                </tr>
                <tr class="RRRRRRRRRR">
                    <td>BBBBBB</td>
                    <td>CCCCCCC</td>
                    <td>AAAAAAA</td>
                </tr>
            </thead>
            <tbody>
                    <tr class="SSSSSSSS">
                        <td rowspan="1">DDDDDD</td>
                        <td class="L_LLLL67">1,2,3,4,5,6</td>
                        <td class="L_LLLL67 f_tar">1234.56</td>
                    </tr>
                    <tr class="">
                        <td rowspan="3">EEEEEEEEE</td>
                        <td class="L_LLLL67">1,2,3,4,5,6</td>
                        <td class="L_LLLL67 f_tar">1234.56</td>
                    </tr>
                        <tr class="">
                            <td class="L_LLLL67">1,2,3,4,5,6</td>
                            <td class="L_LLLL67 f_tar">1234.56</td>
                        </tr>
                        <tr class="">
                            <td class="L_LLLL67">1,2,3,4,5,6</td>
                            <td class="L_LLLL67 f_tar">1234.56</td>
                        </tr>
                    <tr class="">
                        <td rowspan="1">FFFFFFFFF</td>
                        <td class="L_LLLL67">1,2,3,4,5,6</td>
                        <td class="L_LLLL67 f_tar">1234.56</td>
                    </tr>
                    <tr class="TTTTTT">
                        <td rowspan="1">GGGGGGGGG</td>
                        <td class="L_LLLL67">1,2,3,4,5,6</td>
                        <td class="L_LLLL67 f_tar">1234.56</td>
                    </tr>
                    <tr class="">
                        <td rowspan="1">HHHHHHHHH</td>
                        <td class="L_LLLL67">1,2,3,4,5,6</td>
                        <td class="L_LLLL67 f_tar">1234.56</td>
                    </tr>
                    <tr class="TTTTTTT">
                        <td rowspan="1">IIIIIIIIII</td>
                        <td class="L_LLLL67">1,2,3,4,5,6</td>
                        <td class="L_LLLL67 f_tar">1234.56</td>
                    </tr>
                    <tr class="">
                        <td rowspan="1">JJJJJJJJ</td>
                        <td class="L_LLLL67">1,2,3,4,5,6</td>
                        <td class="L_LLLL67 f_tar">1234.56</td>
                    </tr>
                    <tr class="TTTTT">
                        <td rowspan="2">KKKKKKKK</td>
                        <td class="L_LLLL67">1/2/3/4/5/6</td>
                        <td class="L_LLLL67 f_tar">1234.56</td>
                    </tr>
                        <tr class="TTTTTT">
                            <td class="L_LLLL67">1/2/3/4/5/6</td>
                            <td class="L_LLLL67 f_tar">1234.56</td>
                        </tr>
            </tbody>
        </table>
</div>
</body>
</html>"""

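Python code

from io import StringIO

from bs4 import BeautifulSoup
import pandas as pd

soup = BeautifulSoup(html, 'html.parser')
# Select the <div> that wraps the table
table = soup.find('div', attrs={'id': 'MMMMMMMM'})
# Wrap the markup in StringIO: newer pandas versions deprecate passing a literal HTML string
df_list = pd.read_html(StringIO(str(table)), header=1)
df_list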

Actual output

 [        BBBBBB      CCCCCCC  AAAAAAA
 0       DDDDDD       123456  1234.56
 1    EEEEEEEEE       123456  1234.56
 2    EEEEEEEEE       123456  1234.56
 3    EEEEEEEEE       123456  1234.56
 4    FFFFFFFFF       123456  1234.56
 5    GGGGGGGGG       123456  1234.56
 6    HHHHHHHHH       123456  1234.56
 7   IIIIIIIIII       123456  1234.56
 8     JJJJJJJJ       123456  1234.56
 9     KKKKKKKK  1/2/3/4/5/6  1234.56
 10    KKKKKKKK  1/2/3/4/5/6  1234.56]

Expected output

 [        BBBBBB      CCCCCCC  AAAAAAA
 0       DDDDDD  1,2,3,4,5,6  1234.56
 1    EEEEEEEEE  1,2,3,4,5,6  1234.56
 2    EEEEEEEEE  1,2,3,4,5,6  1234.56
 3    EEEEEEEEE  1,2,3,4,5,6  1234.56
 4    FFFFFFFFF  1,2,3,4,5,6  1234.56
 5    GGGGGGGGG  1,2,3,4,5,6  1234.56
 6    HHHHHHHHH  1,2,3,4,5,6  1234.56
 7   IIIIIIIIII  1,2,3,4,5,6  1234.56
 8     JJJJJJJJ  1,2,3,4,5,6  1234.56
 9     KKKKKKKK  1/2/3/4/5/6  1234.56
 10    KKKKKKKK  1/2/3/4/5/6  1234.56]
 

【Comments】:

    Tags: python pandas list dataframe beautifulsoup


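    【Solution 1】:

    By default, pd.read_html treats ',' as a thousands separator (thousands=',') and strips it from number-like cells, which is why 1,2,3,4,5,6 becomes 123456. Pass thousands=None to turn this off:

    from io import StringIO

    from bs4 import BeautifulSoup
    import pandas as pd

    soup = BeautifulSoup(html, 'html.parser')
    # Select the <div> that wraps the table
    table = soup.find('div', attrs={'id': 'MMMMMMMM'})
    # thousands=None keeps ',' in cell text instead of stripping it
    df_list = pd.read_html(StringIO(str(table)), header=1, thousands=None)
    df_list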
    
    Output:
    [        BBBBBB      CCCCCCC  AAAAAAA
     0       DDDDDD  1,2,3,4,5,6  1234.56
     1    EEEEEEEEE  1,2,3,4,5,6  1234.56
     2    EEEEEEEEE  1,2,3,4,5,6  1234.56
     3    EEEEEEEEE  1,2,3,4,5,6  1234.56
     4    FFFFFFFFF  1,2,3,4,5,6  1234.56
     5    GGGGGGGGG  1,2,3,4,5,6  1234.56
     6    HHHHHHHHH  1,2,3,4,5,6  1234.56
     7   IIIIIIIIII  1,2,3,4,5,6  1234.56
     8     JJJJJJJJ  1,2,3,4,5,6  1234.56
     9     KKKKKKKK  1/2/3/4/5/6  1234.56
     10    KKKKKKKK  1/2/3/4/5/6  1234.56]
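
One trade-off worth noting is that thousands=None switches off comma stripping for every column, not just the one that needs it. The sketch below is a minimal illustration on a reduced table, not part of the original answer. The table itself, the column names (name, codes, total) and the "total" column with genuine thousands separators are assumptions made up for the example. It compares the default behaviour with thousands=None and then converts only the truly numeric column by hand.

```python
from io import StringIO

import pandas as pd

# Hypothetical table reduced from the question. The "total" column is an
# added assumption: it holds values that really do use ',' for thousands.
SAMPLE_HTML = """
<table>
  <thead>
    <tr><th>name</th><th>codes</th><th>total</th></tr>
  </thead>
  <tbody>
    <tr><td>DDDDDD</td><td>1,2,3,4,5,6</td><td>1,234</td></tr>
    <tr><td>KKKKKKKK</td><td>1/2/3/4/5/6</td><td>5,678</td></tr>
  </tbody>
</table>
"""

# Default behaviour: thousands=',' strips commas from number-like cells,
# so the comma-separated codes lose their separators.
default_df = pd.read_html(StringIO(SAMPLE_HTML))[0]

# thousands=None disables the stripping for all columns, so "total"
# is read as text as well.
raw_df = pd.read_html(StringIO(SAMPLE_HTML), thousands=None)[0]

# Convert only the column that is genuinely numeric: drop the commas,
# then parse the remaining digits.
raw_df["total"] = pd.to_numeric(
    raw_df["total"].str.replace(",", "", regex=False)
)

print(default_df["codes"].astype(str).tolist())
print(raw_df["codes"].tolist())
print(raw_df["total"].tolist())
```

With this approach the codes column keeps its original text, while columns that should be numbers are converted explicitly, one at a time.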
    
