【Question title】: Fetching content from HTML and writing it to a CSV in a specific format
【Posted】: 2017-12-10 00:59:27
【Question】:

I have HTML like this:

<!-- Snippet snippets/search_result_text.html end -->
</h2>





      <p class="filter-list">


          <span class="facet">Organisations:</span>

            <span class="filtered pill">**Reserve Bank of Australia**
              <a href="/dataset?groups=business" class="remove" title="Remove"><i class="icon-remove"></i></a>
            </span>



          <span class="facet">Groups:</span>

            <span class="filtered pill">**Business Support and Regulation**
              <a href="/dataset?organization=reservebankofaustralia" class="remove" title="Remove"><i class="icon-remove"></i></a>
            </span>


      </p>



</form>




<!-- Snippet snippets/search_form.html end -->




<!-- Snippet snippets/search_package_list.html start -->



        <ul class="dataset-list unstyled">






<!-- Snippet snippets/package_item.html start -->






<li class="dataset-item">

    <div class="dataset-content">
      <h3 class="dataset-heading">



        <a href="/dataset/banks-assets">**Banks – Assets**</a>




      </h3>


        <div>These data are derived from returns submitted to the Australian Prudential Regulation Authority (APRA) by banks authorised under the Banking Act 1959. APRA assumed...</div>

    </div>

      <ul class="dataset-resources unstyled">

          <li>

            <a href="/dataset/banks-assets" class="label" data-format="xls">XLS</a>

          </li>

      </ul>


</li>
<!-- Snippet snippets/package_item.html end -->





<!-- Snippet snippets/package_item.html start -->






<li class="dataset-item">

    <div class="dataset-content">
      <h3 class="dataset-heading">



        <a href="/dataset/consolidated-exposures-immediate-and-ultimate-risk-basis">**Consolidated Exposures – Immediate and Ultimate Risk Basis**</a>




      </h3>


        <div>In March 2003, banks and selected Registered Financial Corporations (RFCs) began reporting their international assets, liabilities and country exposures to APRA in ARF/RRF 231...</div>

    </div>

      <ul class="dataset-resources unstyled">

          <li>

            <a href="/dataset/consolidated-exposures-immediate-and-ultimate-risk-basis" class="label" data-format="xls">XLS</a>

          </li>

      </ul>


</li>
<!-- Snippet snippets/package_item.html end -->

I want to extract the data shown in bold above and write it to a CSV in a specific format, for example:

Group                               Organisation              Title              
Business Support and Regulation    Reserve Bank of Australia   Banks-Assets
Business Support and Regulation    Reserve Bank of Australia   Consolidated Exposures – Immediate and Ultimate Risk Basis

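and so on. I have Python code, but it produces two separate files.

import urllib.request

import pandas as pd
from bs4 import BeautifulSoup
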
webpage_urls = ["https://data.gov.au/dataset?q=&groups=business&sort=extras_harvest_portal+asc%2C+score+desc%2C+metadata_modified+desc&_organization_limit=0&organization=reservebankofaustralia&_groups_limit=0",
                "https://data.gov.au/dataset?q=&groups=business&sort=extras_harvest_portal+asc%2C+score+desc%2C+metadata_modified+desc&_organization_limit=0&organization=department-of-finance&_groups_limit=0",
                "https://data.gov.au/dataset?q=&groups=business&sort=extras_harvest_portal+asc%2C+score+desc%2C+metadata_modified+desc&_organization_limit=0&organization=departmentofagriculturefisheriesandforestry&_groups_limit=0",
                "https://data.gov.au/dataset?organization=department-of-communications&q=&groups=business&sort=extras_harvest_portal+asc%2C+score+desc%2C+metadata_modified+desc&_organization_limit=0&_groups_limit=0",
                "https://data.gov.au/dataset?organization=ip-australia&q=&groups=business&sort=extras_harvest_portal+asc%2C+score+desc%2C+metadata_modified+desc&_organization_limit=0&_groups_limit=0",
                "https://data.gov.au/dataset?q=&organization=australiancommunicationsandmediaauthority&groups=business&sort=extras_harvest_portal+asc%2C+score+desc%2C+metadata_modified+desc&_organization_limit=0&_groups_limit=0",
                "https://data.gov.au/dataset?q=&organization=www-mitchellshirecouncil-vic-gov-au&groups=business&sort=extras_harvest_portal+asc%2C+score+desc%2C+metadata_modified+desc&_organization_limit=0&_groups_limit=0",
                "https://data.gov.au/dataset?q=&groups=business&sort=extras_harvest_portal+asc%2C+score+desc%2C+metadata_modified+desc&_organization_limit=0&organization=digital-transformation-agency&_groups_limit=0"]
# fetching data from all urls
data = []
dfs = []

for i in webpage_urls:
    wiki2 = i
    page= urllib.request.urlopen(wiki2)
    soup = BeautifulSoup(page)

    lobbying = {}
    data2 = soup.find_all('h3', class_="dataset-heading")
    for element in data2:
        lobbying[element.a.get_text()] = {}
    data2[0].a["href"]
    prefix = "https://data.gov.au"
    for element in data2:
        print()
        lobbying[element.a.get_text()]["link"] = prefix + element.a["href"]
        #print(lobbying)
        df = pd.DataFrame.from_dict(lobbying, orient='index').rename_axis('Titles').reset_index()
        dfs.append(df)
df = pd.concat(dfs, ignore_index=True)
df1 = df.drop_duplicates(subset = 'Titles')
print (df1)
df1.to_csv('D:/output2.csv')

for i in webpage_urls:
    wiki2 = i
    page= urllib.request.urlopen(wiki2)
    soup = BeautifulSoup(page)

    # fetching organisations
    data3 = soup.find_all('li', class_="nav-item active")
    lobbying1 = []
    for element in data3:
        lobbying1.append(element.span.get_text())
        data.append(lobbying1)



df_ = pd.DataFrame(data, columns = ['Organisations', 'Groups'])
df2 = df_.drop_duplicates(subset = 'Organisations')
with pd.option_context('display.max_rows', 999):
    print (df2)
df2.to_csv('D:/output_new.csv')

The links are given above. Please help me get the required format in a single CSV with three columns.

【Question comments】:

    Tags: python csv pandas beautifulsoup


    【Solution 1】:

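    I modified the original solution. It is better to loop only once and build one large DataFrame that holds all the data, then select the columns for each new DataFrame with a subset such as [['col1','col2']].

    You can also remove the numbers in parentheses with str.replace().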
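As a small illustration of the `str.replace` step, here is a sketch on made-up sample values (the counts `(12)` and `(340)` are invented for the example, not taken from the site). It assumes the scraped text carries a dataset count in parentheses, and it passes `regex=True` explicitly because newer pandas versions treat the pattern as a literal string by default.

```python
import pandas as pd

# Hypothetical sample values; real pages may format the counts differently.
s = pd.Series(["Reserve Bank of Australia (12)",
               "Business Support and Regulation (340)"])

# Remove a parenthesised number together with any whitespace before it.
cleaned = s.str.replace(r"\s*\(\d+\)", "", regex=True)
print(cleaned.tolist())
```

Including `\s*` in the pattern also drops the space in front of the count, so no trailing whitespace is left in the cleaned values.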

    import urllib.request

    import pandas as pd
    from bs4 import BeautifulSoup

    prefix = "https://data.gov.au"
    dfs = []

    # webpage_urls is the list of URLs from the question
    for url in webpage_urls:
        page = urllib.request.urlopen(url)
        soup = BeautifulSoup(page, "lxml")

        # always only 2 active li, so select the first by [0] and the second by [1]
        active = soup.find_all('li', class_="nav-item active")
        org = active[0].span.get_text()
        groups = active[1].span.get_text()

        lobbying = {}
        for element in soup.find_all('h3', class_="dataset-heading"):
            lobbying[element.a.get_text()] = {"link": prefix + element.a["href"],
                                              "Organisation": org,
                                              "Group": groups}

        # build one DataFrame per page, after all of its titles are collected
        df = pd.DataFrame.from_dict(lobbying, orient='index') \
               .rename_axis('Titles').reset_index()
        dfs.append(df)

    df = pd.concat(dfs, ignore_index=True)
    df1 = df.drop_duplicates(subset='Titles').reset_index(drop=True)

    # remove the counts in parentheses; newer pandas needs regex=True here
    df1['Organisation'] = df1['Organisation'].str.replace(r'\(\d+\)', '', regex=True)
    df1['Group'] = df1['Group'].str.replace(r'\(\d+\)', '', regex=True)


    print (df1.head())
                                                  Titles             Organisation  \
    0                                     Banks – Assets  Reserve Bank of Aus...    
    1  Consolidated Exposures – Immediate and Ultimat...  Reserve Bank of Aus...    
    2  Foreign Exchange Transactions and Holdings of ...  Reserve Bank of Aus...    
    3  Finance Companies and General Financiers – Sel...  Reserve Bank of Aus...    
    4                   Liabilities and Assets – Monthly  Reserve Bank of Aus...    
    
                                                    link                    Group  
    0           https://data.gov.au/dataset/banks-assets  Business Support an...   
    1  https://data.gov.au/dataset/consolidated-expos...  Business Support an...   
    2  https://data.gov.au/dataset/foreign-exchange-t...  Business Support an...   
    3  https://data.gov.au/dataset/finance-companies-...  Business Support an...   
    4  https://data.gov.au/dataset/liabilities-and-as...  Business Support an...   
    

    df2 = df1[['Titles', 'link']]
    print (df2.head())
                                                  Titles  \
    0                                     Banks – Assets   
    1  Consolidated Exposures – Immediate and Ultimat...   
    2  Foreign Exchange Transactions and Holdings of ...   
    3  Finance Companies and General Financiers – Sel...   
    4                   Liabilities and Assets – Monthly   
    
                                                    link  
    0           https://data.gov.au/dataset/banks-assets  
    1  https://data.gov.au/dataset/consolidated-expos...  
    2  https://data.gov.au/dataset/foreign-exchange-t...  
    3  https://data.gov.au/dataset/finance-companies-...  
    4  https://data.gov.au/dataset/liabilities-and-as...  
    

    df3 = df1[['Group','Organisation','Titles']]
    print (df3.head())
                         Group             Organisation  \
    0  Business Support an...   Reserve Bank of Aus...    
    1  Business Support an...   Reserve Bank of Aus...    
    2  Business Support an...   Reserve Bank of Aus...    
    3  Business Support an...   Reserve Bank of Aus...    
    4  Business Support an...   Reserve Bank of Aus...    
    
                                                  Titles  
    0                                     Banks – Assets  
    1  Consolidated Exposures – Immediate and Ultimat...  
    2  Foreign Exchange Transactions and Holdings of ...  
    3  Finance Companies and General Financiers – Sel...  
    4                   Liabilities and Assets – Monthly  
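The question asks for a single CSV with three columns, so a possible last step is to write the three-column selection out with `to_csv`. The sketch below uses a small stand-in for `df1` built from two titles in the question; the renamed `Title` column and the in-memory buffer are choices made for this example only.

```python
import io

import pandas as pd

# Stand-in for df1 from the answer, using two titles from the question.
df1 = pd.DataFrame({
    "Titles": ["Banks – Assets",
               "Consolidated Exposures – Immediate and Ultimate Risk Basis"],
    "link": ["https://data.gov.au/dataset/banks-assets",
             "https://data.gov.au/dataset/consolidated-exposures-immediate-and-ultimate-risk-basis"],
    "Organisation": ["Reserve Bank of Australia"] * 2,
    "Group": ["Business Support and Regulation"] * 2,
})

# Reorder and rename to the three columns requested in the question.
out = df1[["Group", "Organisation", "Titles"]].rename(columns={"Titles": "Title"})

# index=False leaves out the numeric row index. A buffer keeps this sketch
# free of side effects; pass a file path instead to write to disk.
buffer = io.StringIO()
out.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Without `index=False`, the file gets an extra unnamed first column holding the row numbers, which is what happens in the question's `to_csv` calls.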
    

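    【Comments】:

    • Thank you very much for improving the code. It finally gives me what I wanted.
    • BeautifulSoup is hard for me, so there may be a better solution. But I think the main idea is to loop only when necessary, because it is a very slow operation.
    • OK, yes, the loop was the problem. I was looping over many URLs; it took about an hour and was still running. So yes, this is slow in Python.
    • The output I get is "Business Support an..." and "Reserve Bank of Aus...". I want to print the full text instead of "..." for all of them. I replaced the two lines that assign org and groups in the code above with "org = soup.find_all('a', {'class':'nav-item active'})[0].get('title')" and "groups = soup.find_all('a', {'class':'nav-item active'})[1].get('title')". Running that on its own gives the error: list index out of range. What should I use to extract the full text?
    • @Arti123 - can you create a new question? I found an answer, but it is a bit complicated. I think you can use your comment as the body and also add a link to this question. Thanks.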
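One possible way to get the untruncated names, sketched here and not taken from the thread, is to read them from the `filter-list` block shown in the question's HTML, where each `facet` label is followed by a `filtered pill` span containing the full text. The HTML string below is a trimmed copy of that block with the bold markers removed; whether every result page contains it is an assumption. The last lines show why the attempt in the comment probably failed: `find_all` returns an empty list when no tag matches, so indexing it with `[0]` raises "list index out of range".

```python
from bs4 import BeautifulSoup

# Trimmed copy of the filter block from the question (bold markers removed).
html = """
<p class="filter-list">
  <span class="facet">Organisations:</span>
  <span class="filtered pill">Reserve Bank of Australia
    <a href="/dataset?groups=business" class="remove" title="Remove"><i class="icon-remove"></i></a>
  </span>
  <span class="facet">Groups:</span>
  <span class="filtered pill">Business Support and Regulation
    <a href="/dataset?organization=reservebankofaustralia" class="remove" title="Remove"><i class="icon-remove"></i></a>
  </span>
</p>
"""

soup = BeautifulSoup(html, "html.parser")

filters = {}
for facet in soup.find_all("span", class_="facet"):
    # The value is the next sibling span that carries the "filtered" class.
    pill = facet.find_next_sibling("span", class_="filtered")
    if pill is not None:
        label = facet.get_text(strip=True).rstrip(":")
        # The nested <a>/<i> tags hold no text, so only the name remains.
        filters[label] = pill.get_text(strip=True)

print(filters)

# No <a> tag has the class "nav-item active" here, so the result is empty
# and indexing it with [0] would raise IndexError.
missing = soup.find_all("a", {"class": "nav-item active"})
print(len(missing))
```

Checking the length of the result before indexing, as in the final two lines, avoids the exception on pages where the expected element is absent.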