【Question title】: How to ignore UTF-8 encoding when using BeautifulSoup to scrape web data
【Posted】: 2021-10-20 20:44:22
【Question】:

I am using BeautifulSoup to scrape playerprofiler.com. However, the data is UTF-8 encoded and I cannot handle it. Whenever I print the data, I get this error:

UnicodeEncodeError: 'charmap' codec can't encode character '\u2605' in position 184621: character maps to <undefined>

I can use

print(stats_page.encode("utf-8"))

but after that the result is no longer usable if I want to run a command to scrape the data, such as

column_headers_row = stats_page.findAll('tr')

How can I fetch the data from the website, search the table rows, and process the data?

Here is the main block of code:

import pandas as pd 
import numpy as np 
from bs4 import BeautifulSoup
import requests

r = requests.get("https://www.playerprofiler.com/nfl/george-kittle").text

stats_page = BeautifulSoup(r, 'lxml')

column_headers_row = stats_page.findAll('tr')

print(column_headers_row)

Thanks for your help!

【Comments】:

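  • Which line of your code raises the error?
  • Depending on your IDE/terminal configuration, print cannot display every Unicode character. You can still process the text; the problem lies in the IDE/terminal configuration, not in BeautifulSoup. In fact, your code runs fine on Windows 10 64-bit with Python 3.8 64-bit in Windows cmd.exe. If you use an IDE that supports UTF-8 encoding, or configure your terminal for UTF-8, there should be no problem.
  • Works fine for me too (macOS 11.5.2 and Python 3.9.6). Apart from the code not being robust at all, I can't see a problem.
  • Thank you all for the replies. As @MarkTolonen said, I needed to enable UTF-8 encoding support in my IDE. Doing so solved the problem.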
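
The point made in the comments (printing fails, parsing does not) can be reproduced without any network access. The sketch below is only an illustration and assumes the console uses cp1252, the usual Windows 'charmap' codec; the sample string is made up. It shows two possible workarounds: switching stdout to UTF-8 with `sys.stdout.reconfigure` (Python 3.7+), or replacing unencodable characters only at print time so the parsed data itself stays untouched.

```python
import sys

# U+2605 (BLACK STAR) is the character named in the traceback.
text = "Rating: \u2605"  # made-up sample string

# Reproduce the failure without touching the terminal: cp1252 has no
# mapping for U+2605, so encoding raises UnicodeEncodeError.
try:
    text.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# Option 1: make stdout itself UTF-8 (available on Python 3.7+).
# The guard covers environments where stdout has been replaced.
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")

# Option 2: keep the console encoding and substitute "?" for anything
# it cannot represent. Only the printed copy is changed.
safe_text = text.encode("cp1252", errors="replace").decode("cp1252")
print(safe_text)
```

Setting the environment variable `PYTHONIOENCODING=utf-8` before starting Python has a similar effect to option 1. Note that `stats_page.encode("utf-8")` returns a `bytes` object, which is why `findAll` is no longer available on the result; keep calling `findAll` on the soup and encode only what is printed.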

Tags: python web-scraping beautifulsoup utf-8


【Solution 1】:

Try adding this line of code: locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
next to the line r = requests.get("https://www.playerprofiler.com/nfl/george-kittle").text
You must also import locale.
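
A minimal sketch of how this suggestion could be tried safely. The helper name `try_set_utf8_locale` is hypothetical, and the locale name 'en_US.UTF-8' may not be installed on every system, so the call is wrapped in a `try`. The script also reports the encodings Python is using, which makes it possible to check whether the locale change had any effect on printing.

```python
import locale
import sys


def try_set_utf8_locale(name="en_US.UTF-8"):
    """Try to switch the process locale; return True on success.

    locale.setlocale raises locale.Error when the named locale is not
    available on the system, so that case is reported as False.
    """
    try:
        locale.setlocale(locale.LC_ALL, name)
        return True
    except locale.Error:
        return False


changed = try_set_utf8_locale()

# Collect the values that decide how print() encodes text.
report = {
    "locale_changed": changed,
    "preferred_encoding": locale.getpreferredencoding(False),
    "stdout_encoding": getattr(sys.stdout, "encoding", None),
}
print(report)
```

If `stdout_encoding` still shows a non-UTF-8 codec after the call, the terminal or IDE setting discussed in the comments is the more likely place to fix the error.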

【Comments】:

    【Solution 2】:

    Let pandas parse the tables. read_html returns a list of DataFrames. Just pick the DataFrame you want by its index and go from there:

    import pandas as pd
    
    url = 'https://www.playerprofiler.com/nfl/george-kittle'
    df = pd.read_html(url)
    

    If for some reason the code above does not work, try:

    import pandas as pd
    import requests
    
    url = 'https://www.playerprofiler.com/nfl/george-kittle'
    html = requests.get(url).text
    df = pd.read_html(html)
    

    Output:

    print(df)
    [   Year Year  ...  Fantasy Points Per Game FPts/G
    0       2020  ...                       15.6 (#3)
    1       2019  ...                       15.9 (#1)
    2       2018  ...                         16 (#3)
    3       2017  ...                       7.1 (#21)
    
    [4 rows x 9 columns],   Snap Share Snap Share  ... Target Share Tgt Rate
    0                 87.4%  ...        24.1% (9.8 rz)
    1                    #4  ...                    #4
    
    [2 rows x 7 columns],   Air Yards Air Yards  ... Target Rate Tgt Rate
    0      460 (57.5 p/g)  ...                29.2%
    1                 #22  ...                  #27
    
    [2 rows x 7 columns],   Receptions Receptions  ... Fantasy Points Per Game Fantasy PTS/G
    0            48 (6 p/g)  ...                                  15.6
    1                   #15  ...                                    #3
    
    [2 rows x 7 columns],   Yards Per Reception YPR  ... True Catch Rate True Catch Rate
    0                    13.2  ...                           85.7%
    1                      #6  ...                             #21
    
    [2 rows x 7 columns],   Target Premium Tgt Prem  ... Contested Catch Rate Contested Catch %
    0                   13.7%  ...                          80% (10 tgts)
    1                      #8  ...                                     #1
    
    [2 rows x 7 columns],   Production Premium Prod Premium  ... Fantasy Points Per Target Fantasy Pts/Tgt
    0                            16.1  ...                                      1.99
    1                              #3  ...                                        #9
    
    [2 rows x 7 columns],   Snap Share Snap Share  ... Target Share Tgt Rate
    0                   89%  ...       28.2% (26.2 rz)
    1                    #5  ...                    #1
    
    [2 rows x 7 columns],   Air Yards Air Yards  ... Target Rate Tgt Rate
    0      623 (44.5 p/g)  ...                39.1%
    1                 #12  ...                  #11
    
    [2 rows x 7 columns],   Receptions Receptions  ... Fantasy Points Per Game Fantasy PTS/G
    0          85 (6.1 p/g)  ...                                  15.9
    1                    #4  ...                                    #1
    
    [2 rows x 7 columns],   Yards Per Reception YPR  ... True Catch Rate True Catch Rate
    0                    12.4  ...                           87.6%
    1                      #9  ...                              #9
    
    [2 rows x 7 columns],   Target Premium Tgt Prem  ... Contested Catch Rate Contested Catch %
    0                    1.5%  ...                        53.8% (13 tgts)
    1                     #18  ...                                     #6
    
    [2 rows x 7 columns],   Production Premium Prod Premium  ... Fantasy Points Per Target Fantasy Pts/Tgt
    0                            10.2  ...                                      2.08
    1                              #6  ...                                        #8
    
    [2 rows x 7 columns],   Snap Share Snap Share  ... Target Share Tgt Rate
    0                 94.2%  ...         26.4% (26 rz)
    1                    #3  ...                    #2
    
    [2 rows x 7 columns],   Air Yards Air Yards  ... Target Rate Tgt Rate
    0     1049 (65.6 p/g)  ...                34.2%
    1                  #4  ...                  #21
    
    [2 rows x 7 columns],   Receptions Receptions  ... Fantasy Points Per Game Fantasy PTS/G
    0          88 (5.5 p/g)  ...                                    16
    1                    #3  ...                                    #3
    
    [2 rows x 7 columns],   Yards Per Reception YPR  ... True Catch Rate True Catch Rate
    0                    15.6  ...                           82.2%
    1                      #3  ...                             #25
    
    [2 rows x 7 columns],   Target Premium Tgt Prem  ... Contested Catch Rate Contested Catch %
    0                   21.8%  ...                        29.4% (17 tgts)
    1                      #6  ...                                    #27
    
    [2 rows x 7 columns],   Production Premium Prod Premium  ... Fantasy Points Per Target Fantasy Pts/Tgt
    0                             6.3  ...                                       1.9
    1                              #7  ...                                       #13
    
    [2 rows x 7 columns],   Snap Share Snap Share  ... Target Share Tgt Rate
    0                 60.6%  ...           11% (18 rz)
    1                   #36  ...                   #27
    
    [2 rows x 7 columns],   Air Yards Air Yards  ... Target Rate Tgt Rate
    0      486 (32.4 p/g)  ...                  20%
    1                 #23  ...                  #88
    
    [2 rows x 7 columns],   Receptions Receptions  ... Fantasy Points Per Game Fantasy PTS/G
    0          43 (2.9 p/g)  ...                                   7.1
    1                   #18  ...                                   #21
    
    [2 rows x 7 columns],   Yards Per Reception YPR  ... True Catch Rate True Catch Rate
    0                      12  ...                           82.7%
    1                     #13  ...                             #18
    
    [2 rows x 7 columns],   Target Premium Tgt Prem  ... Contested Catch Rate Contested Catch %
    0                    1.8%  ...                        45.5% (11 tgts)
    1                     #16  ...                                    #20
    
    [2 rows x 7 columns],   Production Premium Prod Premium  ... Fantasy Points Per Target Fantasy Pts/Tgt
    0                            -3.6  ...                                      1.69
    1                             #15  ...                                       #16
    
    [2 rows x 7 columns],    Week Wk  ... Fantasy Points Fantasy Points
    0        1  ...                     9.3 (#17)
    1        4  ...                     40.1 (#1)
    2        5  ...                     8.4 (#16)
    3        6  ...                     23.9 (#2)
    4        7  ...                    10.5 (#13)
    5        8  ...                     5.9 (#21)
    6       16  ...                    13.2 (#13)
    7       17  ...                     13.8 (#6)
    
    [8 rows x 9 columns],     Week Wk  ... Fantasy Points Fantasy Points
    0         1  ...                    13.4 (##9)
    1         2  ...                    8.4 (##12)
    2         3  ...                   11.7 (##11)
    3         5  ...                    20.8 (##1)
    4         6  ...                    18.3 (##3)
    5         7  ...                    6.8 (##18)
    6         8  ...                    14.6 (##6)
    7         9  ...                   19.9  (##3)
    8        12  ...                    24.9 (##2)
    9        13  ...                    3.4 (##33)
    10       14  ...                    18.7 (##4)
    11       15  ...                    26.4 (##1)
    12       16  ...                    18.9 (##8)
    13       17  ...                    16.3 (##5)
    
    [14 rows x 9 columns],     Week Wk  ... Fantasy Points Fantasy Points
    0         1  ...                    14.0 (##6)
    1         2  ...                    4.2 (##34)
    2         3  ...                    12.9 (##7)
    3         4  ...                    24.5 (##2)
    4         5  ...                    13.3 (##9)
    5         6  ...                    7.0 (##21)
    6         7  ...                    20.8 (##2)
    7         8  ...                   10.7 (##14)
    8         9  ...                    20.8 (##4)
    9        10  ...                    17.3 (##4)
    10       12  ...                   11.8 (##12)
    11       13  ...                    13.0 (##7)
    12       14  ...                    34.0 (##1)
    13       15  ...                    8.1 (##12)
    14       16  ...                    14.4 (##9)
    15       17  ...                    29.9 (##2)
    
    [16 rows x 9 columns],     Week Wk  ... Fantasy Points Fantasy Points
    0         1  ...                    7.7 (##16)
    1         2  ...                    3.3 (##36)
    2         3  ...                    1.8 (##41)
    3         4  ...                    5.5 (##29)
    4         5  ...                    21.3 (##2)
    5         6  ...                    8.6 (##18)
    6         7  ...                    2.6 (##39)
    7         8  ...                    4.2 (##24)
    8         9  ...                    5.7 (##24)
    9        12  ...                    2.4 (##38)
    10       13  ...                    4.0 (##31)
    11       14  ...                   3.0  (##30)
    12       15  ...                    9.2 (##17)
    13       16  ...                    13.2 (##7)
    14       17  ...                    14.0 (##2)
    
    [15 rows x 9 columns],   School School  ... Special Teams Yards Spc Tm Share
    0   Iowa (2013)  ...                                0
    1   Iowa (2014)  ...                                0
    2   Iowa (2015)  ...                                0
    3  Iowa  (2016)  ...                                0
    
    [4 rows x 9 columns]]
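
Picking one DataFrame out of the list and cleaning a column might look like the sketch below. It does not contact the site: the inline HTML is a hand-written stand-in that reuses a few values from the output above, and the new column names `points` and `rank` are assumptions for illustration. The HTML is wrapped in `io.StringIO` because recent pandas versions prefer a file-like object over a literal HTML string.

```python
import io

import pandas as pd

# Hand-written stand-in for the downloaded page (two small tables).
html = """
<table>
  <thead><tr><th>Year</th><th>FPts/G</th></tr></thead>
  <tbody>
    <tr><td>2020</td><td>15.6 (#3)</td></tr>
    <tr><td>2019</td><td>15.9 (#1)</td></tr>
  </tbody>
</table>
<table>
  <thead><tr><th>Week</th><th>Fantasy Points</th></tr></thead>
  <tbody>
    <tr><td>1</td><td>9.3 (#17)</td></tr>
  </tbody>
</table>
"""

# read_html returns one DataFrame per <table> element, in page order.
tables = pd.read_html(io.StringIO(html))

# Select the table of interest by its position in the list.
season = tables[0]

# Split cells such as "15.6 (#3)" into a numeric value and a rank.
parts = season["FPts/G"].str.extract(r"^\s*([\d.]+)\s*\(#(\d+)\)")
season["points"] = parts[0].astype(float)
season["rank"] = parts[1].astype(int)

print(season)
```

On the real page the index of the wanted table has to be found by inspecting the list, for example by printing `len(tables)` and the first rows of each entry.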
    

    【Comments】:

    • @sfaisal Glad it worked. Please accept the solution.