【Question title】: Problem using CSV to query a website, input isn't right
【Posted】: 2021-08-31 22:34:28
【Question】:

I'm fairly new to Python programming, but I have a column of terms that I'd like to search for on a website. The code is as follows:

import requests
import pandas as pd
from bs4 import BeautifulSoup as BS

col_list = ['Molecular Formula'] #this is a column title in my csv file
Chem = pd.read_csv('single.csv', usecols=col_list)
res = requests.get('https://hmdb.ca/unearth/q?utf8=✓&query='+ Chem +'&searcher=metabolites&button=')
html_page = res.content
soup = BS(html_page, 'html.parser')
body = soup.find_all('div', attrs={'class':'hit-name'})

for div in body:
    print(div.text)

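I'd like to use the column values to fill in the "Chem" term in the search. If I just set Chem = "some specific chemical", it works fine. As written, I get the following error: No connection adapters were found for 'Molecular Formula\n0 https://hmdb.ca/unearth/q?utf8=✓&query=C10H7NO...\n1 https://hmdb.ca/unearth/q?utf8=✓&query=C11N12O...'. Maybe this has to do with the numbers pandas adds to each row? Any help is appreciated!

【Comments】:

    Tags: python-3.x pandas csv web-scraping beautifulsoup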


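    【Solution 1】:

    You can use a for loop to iterate over the values in the "Molecular Formula" column, sending one request per value. For example: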

    import requests
    import pandas as pd
    from bs4 import BeautifulSoup as BS
    
    col_list = ["Molecular Formula"]  # this is a column title in my csv file
    Chem = pd.read_csv("data.csv", usecols=col_list)
    
    for c in Chem["Molecular Formula"]:
        res = requests.get(
            "https://hmdb.ca/unearth/q?utf8=✓&query="
            + c
            + "&searcher=metabolites&button="
        )
        html_page = res.content
        soup = BS(html_page, "html.parser")
        body = soup.find_all("div", attrs={"class": "hit-name"})
    
        for div in body:
            print(div.text)
        print("-" * 80)
    

    Prints:

    Succinylcholine
    2-Ethyl-4,5-dimethylthiazole
    Water
    --------------------------------------------------------------------------------
    Licoricesaponin C2
    Illudin C2
    Eremopetasitenin C2
    Cinncassiol C2
    Gladiatoside C2
    Prostaglandin-c2
    Capsicoside C2
    Schidigerasaponin C2
    Ganoderic acid C2
    Ginsenoside C
    Diethyl sulfide
    Mangiferin
    4-Nitrophenol
    L-Acetylcarnitine
    Malonic acid
    11-trans-Leukotriene C4
    (-)-Epigallocatechin
    Tryptophan 2-C-mannoside
    --------------------------------------------------------------------------------
    

    Contents of data.csv:

    Molecular Formula
    H2O
    C2
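
    As background on why the original code failed: adding a string to a DataFrame is applied element by element, so the expression in the question produces a whole DataFrame of URLs rather than a single URL string. The error message quotes the text form of that DataFrame, which begins with the column name and not with "http". The sketch below illustrates this without touching the network; the in-memory DataFrame and the names base, tail, combined and urls are stand-ins chosen for illustration, not part of the original code.

```python
import pandas as pd

# Hypothetical in-memory stand-in for the CSV file used in the question.
chem = pd.DataFrame({"Molecular Formula": ["H2O", "C2"]})

base = "https://hmdb.ca/unearth/q?utf8=✓&query="
tail = "&searcher=metabolites&button="

# String + DataFrame is element-wise: the result is still a DataFrame,
# which is not something requests.get() can use as a URL.
combined = base + chem + tail
print(type(combined).__name__)

# Iterating over the column yields plain strings, one URL per formula.
urls = [base + formula + tail for formula in chem["Molecular Formula"]]
for url in urls:
    print(url)
```

    Each entry in urls is an ordinary string, which is what the loop in the answer passes to requests.get().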
    

    Edit: to save the results to a CSV:

    import requests
    import pandas as pd
    from bs4 import BeautifulSoup as BS
    
    col_list = ["Molecular Formula"]  # this is a column title in my csv file
    Chem = pd.read_csv("data.csv", usecols=col_list)
    
    all_data = []
    for c in Chem["Molecular Formula"]:
        print(f"Getting {c=}")
        res = requests.get(
            "https://hmdb.ca/unearth/q?utf8=✓&query="
            + c
            + "&searcher=metabolites&button="
        )
        html_page = res.content
        soup = BS(html_page, "html.parser")
        body = soup.find_all("div", attrs={"class": "hit-name"})
    
        for div in body:
            all_data.append([c, div.text])
    
    df = pd.DataFrame(all_data, columns=["Molecular Formula", "Value"])
    print(df)
    df.to_csv("result.csv", index=False)
    

    Prints:

    Getting c='H2O'
    Getting c='C2'
       Molecular Formula                         Value
    0                H2O               Succinylcholine
    1                H2O  2-Ethyl-4,5-dimethylthiazole
    2                H2O                         Water
    3                 C2            Licoricesaponin C2
    4                 C2                    Illudin C2
    5                 C2           Eremopetasitenin C2
    6                 C2                Cinncassiol C2
    7                 C2               Gladiatoside C2
    8                 C2              Prostaglandin-c2
    9                 C2                Capsicoside C2
    10                C2          Schidigerasaponin C2
    11                C2             Ganoderic acid C2
    12                C2                 Ginsenoside C
    13                C2               Diethyl sulfide
    14                C2                    Mangiferin
    15                C2                 4-Nitrophenol
    16                C2             L-Acetylcarnitine
    17                C2                  Malonic acid
    18                C2       11-trans-Leukotriene C4
    19                C2          (-)-Epigallocatechin
    20                C2      Tryptophan 2-C-mannoside
    

    and saves result.csv.
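
    The code above builds the URL by plain string concatenation, which is fine for simple formulas such as H2O. If a value contains characters with a special meaning in URLs (for example "+" or a space), it may be safer to percent-encode the query string. A minimal sketch using only the standard library follows; the helper name build_search_url and the constant BASE_URL are hypothetical, and it assumes the site accepts the percent-encoded form of the same parameters.

```python
from urllib.parse import urlencode

# Same endpoint as in the answer, without the query string.
BASE_URL = "https://hmdb.ca/unearth/q"


def build_search_url(formula: str) -> str:
    """Return the search URL for one formula, with parameters percent-encoded."""
    params = {
        "utf8": "✓",
        "query": formula,
        "searcher": "metabolites",
        "button": "",
    }
    return BASE_URL + "?" + urlencode(params)


print(build_search_url("H2O"))
print(build_search_url("H3O+"))  # "+" is encoded as %2B
```

    requests.get() also accepts the same dictionary through its params argument, e.g. requests.get(BASE_URL, params=params), in which case requests does the encoding itself.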

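    【Comments】:

    • Wow! Thank you so much, it works like a dream! It may be a reach, but any suggestions on getting the printed output into a new csv?
    • Thank you so much! I've been working on this for days. I'll probably need to search for this data on a different site based on the results collected in the new csv (I need to cross-reference my results). Is it possible to write the csv so that H2O appears once with its three values on the same row, instead of repeating H2O three times with a matching column next to it? If that's too hard, don't worry; what you've done has already saved so much time and I can work with it.
    • @CharlieP If I understand you correctly, you probably want df.groupby("Molecular Formula", as_index=False).agg(", ".join) (this groups the dataframe by "Molecular Formula" and joins the "Value" entries with ", ").
    • Thanks again for spending so much time on this.
    • @CharlieP If this answer solved your problem, please consider accepting it as the answer to your question by clicking the check mark on the left; this lets others know the question has been resolved. (For more information, see What does it mean to accept an answer?)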
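
    The groupby suggestion from the comments can be sketched on a few rows taken from the output above. This is only an illustration under stated assumptions: the separator "; " is a substitution made here (some names, such as 2-Ethyl-4,5-dimethylthiazole, already contain commas), and sort=False is added to keep the formulas in their order of first appearance.

```python
import pandas as pd

# A few rows in the same shape as the result DataFrame built in the answer.
df = pd.DataFrame(
    [
        ["H2O", "Succinylcholine"],
        ["H2O", "2-Ethyl-4,5-dimethylthiazole"],
        ["H2O", "Water"],
        ["C2", "Illudin C2"],
        ["C2", "Mangiferin"],
    ],
    columns=["Molecular Formula", "Value"],
)

# One row per formula; all values for that formula joined into a single string.
grouped = df.groupby("Molecular Formula", as_index=False, sort=False).agg("; ".join)
print(grouped)

# index=False leaves the pandas row numbers out of the file.
csv_text = grouped.to_csv(index=False)
print(csv_text)
```

    To recover the individual names later, the "Value" cell can be split on the same separator.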