【Question title】: Matching rows from a table
【Posted】: 2016-02-18 04:11:15
【Question】:

I am scraping this Wikipedia page:

https://en.wikipedia.org/wiki/List_of_shopping_malls_in_the_South_Florida_metropolitan_area

and extracting the data from the table like this:

Location = response.xpath('//*[@id="mw-content-text"]/table/tr/td[2]/a/text()').extract()

Name = response.xpath('//*[@id="mw-content-text"]/table/tr/td[1]/a/text()').extract()

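Once I have them, the plan is to add these lists to a DataFrame. The problem is that I end up with:

len(Name)
40

len(Location)
47

This happens because some rows in the Location column contain several elements, for example "Coconut Grove, Miami", where I get more than one element for a single row.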
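
The mismatch described above can be reproduced on a small, made-up table. The sketch below uses Python's standard-library `xml.etree.ElementTree` in place of Scrapy's `response.xpath`, an assumption made only so that it runs without a network request or a Scrapy project. The HTML snippet is illustrative and is not the actual markup of the Wikipedia page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the Wikipedia table.
HTML = """
<table>
  <tr><th>Name</th><th>Location</th></tr>
  <tr><td><a>Aventura Mall</a></td><td><a>Aventura</a></td></tr>
  <tr><td><a>CocoWalk</a></td><td><a>Coconut Grove</a>, <a>Miami</a></td></tr>
</table>
"""

table = ET.fromstring(HTML)

# Column-wise selection, analogous to .../tr/td[N]/a/text() in the question:
# every <a> in the column becomes its own list element.
names = [a.text for a in table.findall("./tr/td[1]/a")]
locations = [a.text for a in table.findall("./tr/td[2]/a")]

print(len(names), names)
print(len(locations), locations)
```

Because the second data row has two links in its Location cell, the column-wise query returns one more location than there are names, and the two lists can no longer be paired by index.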

【Comments】:

    Tags: python pandas scrapy


    【Solution 1】:

    You can use read_html and select the first df from the list of dfs it returns:

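    import pandas as pd

    # read_html returns a list of DataFrames; [0] selects the first table on the page
    df = pd.read_html('https://en.wikipedia.org/wiki/List_of_shopping_malls_in_the_South_Florida_metropolitan_area',
                       header=0)[0]
    print(df)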
    
                                      Name                           Location
    0                        Aventura Mall                           Aventura
    1                    Bal Harbour Shops                        Bal Harbour
    2                  Bayside Marketplace                     Downtown Miami
    3                   Boynton Beach Mall                      Boynton Beach
    4                            CityPlace                    West Palm Beach
    5                             CocoWalk               Coconut Grove, Miami
    6                         Coral Square                      Coral Springs
    7                        Dadeland Mall                            Kendall
    8                         Dolphin Mall                         Sweetwater
    9              Downtown at the Gardens                 Palm Beach Gardens
    10                           The Falls                            Kendall
    11          Galeria International Mall                     Downtown Miami
    12     The Galleria at Fort Lauderdale                    Fort Lauderdale
    13                    The Gardens Mall                 Palm Beach Gardens
    14          The Grand Doubletree Shops                     Downtown Miami
    15                 Las Olas Riverfront                    Fort Lauderdale
    16                      Las Olas Shops                    Fort Lauderdale
    17                   Lincoln Road Mall                        Miami Beach
    18           Loehmann's Fashion Island                           Aventura
    19                Mall of the Americas                              Miami
    20            The Mall at 163rd Street                  North Miami Beach
    21        The Mall at Wellington Green                         Wellington
    22            Miami International Mall                              Doral
    23                 Miracle Marketplace                              Miami
    24              Metrofare Shops & Cafe  Government Center, Downtown Miami
    25                 Pembroke Lakes Mall                     Pembroke Pines
    26                 Pompano Citi Centre                      Pompano Beach
    27                      Sawgrass Mills                            Sunrise
    28                   Seminole Paradise                          Hollywood
    29          The Shops at Fontainebleau                        Miami Beach
    30  The Shops at Mary Brickell Village                    Brickell, Miami
    31          The Shops at Midtown Miami                      Midtown Miami
    32       The Shops at Pembroke Gardens                     Pembroke Pines
    33           The Shops at Sunset Place                        South Miami
    34                      Southland Mall                         Cutler Bay
    35           Town Center at Boca Raton                         Boca Raton
    36      The Village at Gulfstream Park                   Hallandale Beach
    37             Village of Merrick Park                       Coral Gables
    38                   Westfield Broward                         Plantation
    39                       Westland Mall                            Hialeah
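
To try the same approach without fetching the page, `read_html` can be pointed at an in-memory HTML string. This is a minimal sketch that assumes pandas is installed together with an HTML parser it can use (such as lxml); the table below is a made-up two-row sample, not the real page content.

```python
from io import StringIO

import pandas as pd

# Hypothetical, simplified stand-in for the Wikipedia table.
HTML = """
<table class="wikitable">
  <tr><th>Name</th><th>Location</th></tr>
  <tr><td><a>Aventura Mall</a></td><td><a>Aventura</a></td></tr>
  <tr><td><a>CocoWalk</a></td><td><a>Coconut Grove</a>, <a>Miami</a></td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per <table> it finds.
tables = pd.read_html(StringIO(HTML), header=0)
df = tables[0]

print(df.shape)
print(df)
```

Each table cell becomes one DataFrame cell regardless of how many links it contains, so the Name and Location columns always have the same length.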
    

    【Comments】:

      【Solution 2】:

      You just need the right XPath:

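      # Select only the data rows (rows without <th> header cells)
      rows = response.xpath('//table[@class="wikitable"]//tr[not(./th)]')
      for row in rows:
          # Join all text nodes in a cell so a multi-link cell stays one string
          print(''.join(row.xpath('.//td[1]//text()').extract()), ' | ', ''.join(row.xpath('.//td[2]//text()').extract()))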
      
      Aventura Mall  |  Aventura
      Bal Harbour Shops  |  Bal Harbour
      Bayside Marketplace  |  Downtown Miami
      Boynton Beach Mall  |  Boynton Beach
      CityPlace  |  West Palm Beach
      CocoWalk  |  Coconut Grove, Miami
      Coral Square  |  Coral Springs
      Dadeland Mall  |  Kendall
      Dolphin Mall  |  Sweetwater
      Downtown at the Gardens  |  Palm Beach Gardens
      The Falls  |  Kendall
      Galeria International Mall  |  Downtown Miami
      The Galleria at Fort Lauderdale  |  Fort Lauderdale
      The Gardens Mall  |  Palm Beach Gardens
      The Grand Doubletree Shops  |  Downtown Miami
      Las Olas Riverfront  |  Fort Lauderdale
      Las Olas Shops  |  Fort Lauderdale
      Lincoln Road Mall  |  Miami Beach
      Loehmann's Fashion Island  |  Aventura
      Mall of the Americas  |  Miami
      The Mall at 163rd Street  |  North Miami Beach
      The Mall at Wellington Green  |  Wellington
      Miami International Mall  |  Doral
      Miracle Marketplace  |  Miami
      Metrofare Shops & Cafe  |  Government Center, Downtown Miami
      Pembroke Lakes Mall  |  Pembroke Pines
      Pompano Citi Centre  |  Pompano Beach
      Sawgrass Mills  |  Sunrise
      Seminole Paradise  |  Hollywood
      The Shops at Fontainebleau  |  Miami Beach
      The Shops at Mary Brickell Village  |  Brickell, Miami
      The Shops at Midtown Miami  |  Midtown Miami
      The Shops at Pembroke Gardens  |  Pembroke Pines
      The Shops at Sunset Place  |  South Miami
      Southland Mall  |  Cutler Bay
      Town Center at Boca Raton  |  Boca Raton
      The Village at Gulfstream Park  |  Hallandale Beach
      Village of Merrick Park  |  Coral Gables
      Westfield Broward  |  Plantation
      Westland Mall  |  Hialeah
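
The row-by-row idea can be sketched offline as well. The example below swaps Scrapy's selectors for the standard-library `xml.etree.ElementTree` (an assumption made so it runs without a Scrapy response) and uses a made-up sample table. It iterates over rows first and joins all text inside each cell, so every row yields exactly one name and one location.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the Wikipedia table.
HTML = """
<table class="wikitable">
  <tr><th>Name</th><th>Location</th></tr>
  <tr><td><a>Aventura Mall</a></td><td><a>Aventura</a></td></tr>
  <tr><td><a>CocoWalk</a></td><td><a>Coconut Grove</a>, <a>Miami</a></td></tr>
</table>
"""

table = ET.fromstring(HTML)

pairs = []
for row in table.findall("./tr"):
    cells = row.findall("./td")
    if len(cells) < 2:
        continue  # header row: it only has <th> cells
    # itertext() walks every text node in the cell, including link text
    # and the ", " between links, so the cell collapses to one string.
    name = "".join(cells[0].itertext()).strip()
    location = "".join(cells[1].itertext()).strip()
    pairs.append((name, location))

for name, location in pairs:
    print(name, "|", location)
```

A list of `(name, location)` tuples like `pairs` can then be passed to `pandas.DataFrame(pairs, columns=["Name", "Location"])` if a DataFrame is the end goal.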
      

      【Comments】:

        【Solution 3】:

        If you want the two words to be treated as one, you can do a string replace on each entry to replace the comma with an empty string:

        location = [loc.replace(',', '') for loc in location]
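
As a small illustration of what this replacement does, the sketch below applies it to made-up sample values. It assumes each location is already a single string that still contains its comma; it does not merge entries that were returned as separate list elements.

```python
# Hypothetical sample values; each location is already one string.
location = ["Aventura", "Coconut Grove, Miami", "Brickell, Miami"]

# Remove the commas; the number of entries is unchanged.
cleaned = [loc.replace(",", "") for loc in location]

print(cleaned)
```

If the list comes from a column-wise XPath such as the one in the question, the link texts have likely been split into separate elements already, so this step alone would not change the list length.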
        

        【Comments】:
