【Question title】: Extracting table content from HTML with Python
【Posted】: 2020-04-17 04:16:11
【Question】:

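I am new to Python. I want to scrape the ISO codes, together with the list of US states, from this Wikipedia page: https://en.wikipedia.org/wiki/ISO_3166-2:US

Desired output:

mapState={'Alabama': 'US-AL', 'Alaska': 'US-AK',.....,'Wyoming':'US-WY'}

This is the code I tried: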

import requests
from bs4 import BeautifulSoup
def crawl_wiki():
    url = 'https://en.wikipedia.org/wiki/ISO_3166-2:US'
    source_code = requests.get(url)
    plain_text = source_code.text
    print(plain_text)

crawl_wiki()

This gives me the text of the page, but I don't know how to build the dictionary of states from it in code. Please help me find a solution.

【Comments】:

    Tags: python python-3.x web-scraping beautifulsoup urllib2


    【Solution 1】:
    import pandas as pd
    
    # read_html returns one DataFrame per <table>; here the first one holds the codes
    df = pd.read_html(
        "https://en.wikipedia.org/wiki/ISO_3166-2:US")[0]
    # Index by subdivision name, then map each name to its code
    newd = df.set_index('Subdivision name (en)')['Code'].to_dict()
    print(newd)
    

    Output:

    {'Alabama': 'US-AL', 'Alaska': 'US-AK', 'Arizona': 'US-AZ', 'Arkansas': 'US-AR', 'California': 'US-CA', 'Colorado': 'US-CO', 'Connecticut': 'US-CT', 'Delaware': 'US-DE', 'Florida': 'US-FL', 'Georgia': 'US-GA', 'Hawaii': 'US-HI', 'Idaho': 'US-ID', 'Illinois': 'US-IL', 'Indiana': 'US-IN', 'Iowa': 'US-IA', 'Kansas': 'US-KS', 'Kentucky': 'US-KY', 'Louisiana': 'US-LA', 'Maine': 'US-ME', 'Maryland': 'US-MD', 'Massachusetts': 'US-MA', 'Michigan': 'US-MI', 'Minnesota': 'US-MN', 'Mississippi': 'US-MS', 'Missouri': 'US-MO', 'Montana': 'US-MT', 'Nebraska': 'US-NE', 'Nevada': 'US-NV', 'New Hampshire': 'US-NH', 'New Jersey': 'US-NJ', 'New Mexico': 'US-NM', 'New York': 'US-NY', 'North Carolina': 'US-NC', 'North Dakota': 'US-ND', 'Ohio': 'US-OH', 'Oklahoma': 'US-OK', 'Oregon': 'US-OR', 'Pennsylvania': 'US-PA', 'Rhode Island': 'US-RI', 'South Carolina': 'US-SC', 'South Dakota': 'US-SD', 'Tennessee': 'US-TN', 'Texas': 'US-TX', 'Utah': 'US-UT', 'Vermont': 'US-VT', 'Virginia': 'US-VA', 'Washington': 'US-WA', 'West Virginia': 'US-WV', 'Wisconsin': 'US-WI', 'Wyoming': 'US-WY', 'District of Columbia': 'US-DC', 'American Samoa': 'US-AS', 'Guam': 'US-GU', 'Northern Mariana Islands': 'US-MP', 'Puerto Rico': 'US-PR', 'United States Minor Outlying Islands': 'US-UM', 'Virgin Islands, U.S.': 'US-VI'}
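The output above also includes the District of Columbia and the outlying areas, while the dictionary requested in the question stops at the 50 states. Below is a minimal sketch of narrowing it down. It assumes the table also carries a 'Subdivision category' column whose value is 'State' for states. The column name appears in Solution 3; the category values and the sample rows are assumptions standing in for the downloaded table.

```python
import pandas as pd

# Illustrative stand-in for pd.read_html(...)[0]; rows and category values are assumed
df = pd.DataFrame({
    "Code": ["US-AL", "US-AK", "US-DC", "US-GU"],
    "Subdivision name (en)": ["Alabama", "Alaska", "District of Columbia", "Guam"],
    "Subdivision category": ["State", "State", "District", "Outlying area"],
})

# Keep only the rows categorised as states
states = df[df["Subdivision category"] == "State"]

# Map subdivision name -> ISO code
map_state = states.set_index("Subdivision name (en)")["Code"].to_dict()
print(map_state)
```

With the real page, the first line would be replaced by the `pd.read_html` call, and the filter value should be checked against what the column actually contains.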
    

    【Comments】:

      【Solution 2】:

      Try this:

      import bs4
      import requests
      
      response = requests.get('https://en.wikipedia.org/wiki/ISO_3166-2:US')
      html = response.content.decode('utf-8')
      
      soup = bs4.BeautifulSoup(html, "lxml")
      # These selectors depend on the page layout (the table being the 11th child)
      code_list = soup.select("#mw-content-text > div > table:nth-child(11) > tbody > tr > td:nth-child(1) > span")
      name_list = soup.select("#mw-content-text > div > table:nth-child(11) > tbody > tr > td:nth-child(2) > a")
      
      # Target: mapState={'Alabama': 'US-AL', 'Alaska': 'US-AK',.....,'Wyoming':'US-WY'}
      mapState = {}
      
      # Pair each name with its code: the name is the key, the code is the value
      for code, name in zip(code_list, name_list):
          mapState[name.string] = code.string
      
      print(mapState)
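Positional selectors such as `table:nth-child(11)` stop matching as soon as the page layout shifts. A hedged alternative is to walk the rows of the table and read the first two cells of each row. The sketch below runs on a small inline HTML fragment rather than the live page; the fragment, the `wikitable` class, and the helper name `table_to_map` are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the structure of the subdivision table
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Code</th><th>Subdivision name (en)</th><th>Subdivision category</th></tr>
  <tr><td><span>US-AL</span></td><td><a href="#">Alabama</a></td><td>State</td></tr>
  <tr><td><span>US-AK</span></td><td><a href="#">Alaska</a></td><td>State</td></tr>
</table>
"""


def table_to_map(html):
    """Return {name: code} from the first table with class 'wikitable'."""
    soup = BeautifulSoup(html, "html.parser")  # built-in parser, no lxml needed
    table = soup.find("table", class_="wikitable")
    result = {}
    if table is None:
        return result
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) < 2:  # header rows contain only <th> cells
            continue
        code = cells[0].get_text(strip=True)
        name = cells[1].get_text(strip=True)
        result[name] = code
    return result


print(table_to_map(SAMPLE_HTML))
```

To use it on the real page, pass `requests.get(url).text` to `table_to_map`, and confirm first that the table of interest is the first one with that class.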
      

      【Comments】:

        【Solution 3】:

        Here is a solution using SimplifiedDoc, which is similar to BeautifulSoup:

        import requests
        from simplified_scrapy.simplified_doc import SimplifiedDoc 
        url = 'https://en.wikipedia.org/wiki/ISO_3166-2:US'
        response = requests.get(url)
        doc = SimplifiedDoc(response.text,start='Subdivision category',end='</table>')
        datas = [tr.tds for tr in doc.trs]
        mapState = {}
        for tds in datas:
          mapState[tds[1].a.text]=tds[0].text
        

        【Comments】:

          【Solution 4】:

          Try pandas read_html:

          https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.read_html.html

          Then convert the pandas DataFrame to a dict.

          Example:

          import pandas as pd
          
          df = pd.read_html("https://en.wikipedia.org/wiki/ISO_3166-2:US")[0].to_dict()
          print(df)
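Note that calling `to_dict()` on the whole DataFrame gives a nested dictionary keyed by column and then by row index, not the name-to-code mapping the question asks for. Below is a small sketch of the difference, using a two-row stand-in for the downloaded table (the sample rows are illustrative).

```python
import pandas as pd

# Illustrative stand-in for the table returned by pd.read_html(...)[0]
df = pd.DataFrame({
    "Code": ["US-AL", "US-AK"],
    "Subdivision name (en)": ["Alabama", "Alaska"],
})

# Default orientation: {column: {row_index: value}}
by_column = df.to_dict()
print(by_column)

# One dict per row, which is easier to reshape
records = df.to_dict(orient="records")

# Build the name -> code mapping from the row records
map_state = {row["Subdivision name (en)"]: row["Code"] for row in records}
print(map_state)
```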
          

          【Comments】:
