【Question title】: Python BeautifulSoup: create and combine lists and remove redundancies like \n
【Posted】: 2020-04-23 10:55:30
【Question description】:

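How do I combine the complete lists into a DataFrame? When I print it, it seems to print only the first record, and the output also includes \n and other redundancies such as ' characters.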

    import requests
    from requests_html import HTML, HTMLSession
    from bs4 import BeautifulSoup
    import pandas as pd
    import csv
    import json

    url = 'https://lehighsports.com/sports/mens-soccer/schedule/2018'
    lehigh = requests.get(url).text
    soup = BeautifulSoup(lehigh,'lxml')

    for opp in soup.find_all('div',class_="sidearm-schedule-game-opponent-text"):
        opp_list = []
        opp_list.append(opp.text)
     #   print(opp_list)

    for conf in soup.find_all('div',class_="sidearm-schedule-game-conference-conference"):
        conf_list = []
        conf_list.append(conf.text)
    #    print(conf_list)

    dict = {'opponent':[opp_list],'conference':[conf_list]}
    df = pd.DataFrame(dict)
    print(df)

【Question discussion】:

Tags: python-3.x beautifulsoup python-requests removing-whitespace


【Solution 1】:

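You reset opp_list and conf_list to [] on every iteration, so each list only ever holds one item. Initialize each list once, before its loop. Also, you don't need the extra brackets when creating the dictionary; pass the lists directly: {'opponent': opp_list, 'conference': conf_list}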
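
To make the difference concrete, here is a minimal sketch of both patterns in plain Python. The `names` list is made-up sample data standing in for the scraped div texts, and `buggy`, `fixed`, and `wrapped` are illustrative names, not part of the original script.

```python
# Hypothetical sample data standing in for the scraped <div> texts.
names = ["at UConn", "vs Drexel", "at Rider"]

# Buggy pattern: the list is recreated on every pass, so earlier items are lost.
buggy = []
for name in names:
    buggy = []
    buggy.append(name)

# Fixed pattern: create the list once, before the loop.
fixed = []
for name in names:
    fixed.append(name)

# Wrapping a list in extra brackets gives a one-element list of lists,
# which pandas would treat as a single row holding the whole list.
wrapped = [fixed]

print(buggy)         # ['at Rider']
print(fixed)         # ['at UConn', 'vs Drexel', 'at Rider']
print(len(wrapped))  # 1
```

Note that the buggy loop keeps only the last item it saw, since every earlier append goes into a list that is then thrown away.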

To remove the whitespace, you can use the .get_text() method with its strip=True and separator= parameters.
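
As a small offline illustration of those two parameters, the sketch below parses a made-up HTML fragment (the markup and the `opponent-text` class name are assumptions for the demo, not the real page) and uses the built-in `html.parser` so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the real page's markup may differ.
html = """
<div class="opponent-text">
    <span>at</span>
    <a href="#">UConn</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="opponent-text")

raw = div.text                                    # keeps newlines and indentation
joined = div.get_text(strip=True)                 # pieces stripped, then glued together
clean = div.get_text(strip=True, separator=" ")   # pieces stripped, joined with a space

print(repr(raw))
print(joined)  # atUConn
print(clean)   # at UConn
```

With strip=True each text piece is stripped and empty pieces are dropped, so the separator decides whether neighbouring pieces run together or stay readable.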

For example, the full script becomes:

    import requests
    import pandas as pd
    from bs4 import BeautifulSoup
    
    url = 'https://lehighsports.com/sports/mens-soccer/schedule/2018'
    lehigh = requests.get(url).text
    soup = BeautifulSoup(lehigh,'lxml')
    
    opp_list = []
    for opp in soup.find_all('div',class_="sidearm-schedule-game-opponent-text"):
        opp_list.append(opp.get_text(strip=True, separator=' '))
    
    conf_list = []
    for conf in soup.find_all('div',class_="sidearm-schedule-game-conference-conference"):
        conf_list.append(conf.get_text(strip=True))
    
    # Use a name other than `dict` so the built-in type is not shadowed.
    data = {'opponent': opp_list, 'conference': conf_list}
    df = pd.DataFrame(data)
    print(df)
    

Prints:

                             opponent       conference
    0                        at UConn                 
    1                       vs Drexel                 
    2            at George Washington                 
    3                   at St. John's                 
    4                   vs Binghamton                 
    5                        at Rider                 
    6                         vs Penn                 
    7                         at Army  Patriot League*
    8                      vs Cornell                 
    9                     at Boston U  Patriot League*
    10                 vs #20 Colgate  Patriot League*
    11                        vs Navy  Patriot League*
    12                   at Lafayette  Patriot League*
    13                   at Dartmouth                 
    14                    vs American  Patriot League*
    15                    at Bucknell  Patriot League*
    16                at Loyola (Md.)  Patriot League*
    17     vs Holy Cross Senior Night  Patriot League*
    18  vs No. 3 Colgate (Semifinals)                 
    

【Discussion】:
