【Question title】: BeautifulSoup for multiple URLs with different templates
【Posted】: 2022-06-10 20:25:10
【Question】:

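I want to scrape multiple URLs that use two different HTML templates. I can scrape each template on its own without any problem, but I run into trouble when I try to combine the two scrapers. Here is my code: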

import requests
from bs4 import BeautifulSoup
import pandas as pd

page_url1 = 'https://www.vet.upenn.edu/research/centers-laboratories/research-laboratory/research-laboratory/anguera-laboratory'
page_url2 = 'https://www.med.upenn.edu/apps/faculty/index.php/g20001100/p8866286'
page_url_lst = {'url': [page_url1, page_url2], 'template': [1,2]}
page_url_df = pd.DataFrame(page_url_lst)

data = []
if page_url_df['template'] == 1:
    for url in page_url_df['url']:
        r = requests.get(url)
        soup = BeautifulSoup(r.text, 'lxml')
        for e in soup.select('#tabs-publications em'):
            data.append({
                'author':e.previous.get_text(strip=True)[:-1],
                'title':e.get_text(strip=True),
                'journal':e.next_sibling.get_text(strip=True),
                'source': url
            })
else:
    for url_2 in page_url_df['url']:
        r_2 = requests.get(url_2)
        soup_2 = BeautifulSoup(r_2.text, 'lxml')
        for a in soup_2.find_all('span',{'class':'fac_citation'}):
            data.append({
                'author':a.find('b').get_text(),
                'title':a.find('i').get_text(strip=True),
                'journal':a.find('i').next_sibling.get_text(strip=True),
                'source': url_2
            })

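The intended logic: if the 'template' column is 1, extract the data with the first template; otherwise, extract it with the second. However, the code raises this error: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Thanks in advance!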
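
As background on the error message: `page_url_df['template'] == 1` compares every row at once and yields a boolean Series, and `if` cannot reduce a Series with several elements to a single True or False. The sketch below reproduces this on a small stand-in dataframe (the example.com URLs are placeholders, not the URLs from the question) and shows that a per-row comparison is unambiguous.

```python
import pandas as pd

# Stand-in for page_url_df; the URLs are placeholders (assumption).
df = pd.DataFrame(
    {
        "url": ["https://example.com/a", "https://example.com/b"],
        "template": [1, 2],
    }
)

# Element-wise comparison: one boolean per row, not a single boolean.
mask = df["template"] == 1
print(mask.tolist())

# Using the whole Series as an `if` condition raises ValueError.
error_message = None
try:
    if mask:
        pass
except ValueError as exc:
    error_message = str(exc)
    print("ValueError:", error_message)

# Comparing one scalar per row gives an ordinary bool each time.
is_first_template = [template == 1 for template in df["template"].to_list()]
print(is_first_template)
```

This is why both answers below move the template check inside a loop over rows.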

【Comments】:

    Tags: python pandas beautifulsoup


    【Solution 1】:

    If I understand correctly, you want to build a new dataframe from page_url_df:

    import requests
    import pandas as pd
    from bs4 import BeautifulSoup
    
    
    page_url1 = "https://www.vet.upenn.edu/research/centers-laboratories/research-laboratory/research-laboratory/anguera-laboratory"
    page_url2 = (
        "https://www.med.upenn.edu/apps/faculty/index.php/g20001100/p8866286"
    )
    page_url_lst = {"url": [page_url1, page_url2], "template": [1, 2]}
    page_url_df = pd.DataFrame(page_url_lst)
    
    
    def get_template_1(url):
        data = []
        soup = BeautifulSoup(requests.get(url).content, "lxml")
        for e in soup.select("#tabs-publications em"):
            data.append(
                {
                    "author": e.previous.get_text(strip=True)[:-1],
                    "title": e.get_text(strip=True),
                    "journal": e.next_sibling.get_text(strip=True),
                    "source": url,
                }
            )
        return data
    
    
    def get_template_2(url):
        data = []
        soup = BeautifulSoup(requests.get(url).text, "lxml")
        for a in soup.find_all("span", {"class": "fac_citation"}):
            data.append(
                {
                    "author": a.find("b").get_text(),
                    "title": a.find("i").get_text(strip=True),
                    "journal": a.find("i").next_sibling.get_text(strip=True),
                    "source": url,
                }
            )
        return data
    
    
    all_data = []
    for _, row in page_url_df.iterrows():
        print("Getting", row["url"])
        if row["template"] == 1:
            all_data.extend(get_template_1(row["url"]))
        elif row["template"] == 2:
            all_data.extend(get_template_2(row["url"]))
    
    
    df_out = pd.DataFrame(all_data)
    
    # print sample data
    print(df_out.head().to_markdown())
    

    Prints:

    author title journal source
    0 Hantsoo Liisa, Kornfield Sara, Anguera Montserrat C, Epperson C Neill Inflammation: A Proposed Intermediary Between Maternal Stress and Offspring Neuropsychiatric Risk. [PMID30314641] Biological psychiatry 85(2): 97-106, Jan 2019. https://www.vet.upenn.edu/research/centers-laboratories/research-laboratory/research-laboratory/anguera-laboratory
    1 Sierra Isabel, Anguera Montserrat C Enjoy the silence: X-chromosome inactivation diversity in somatic cells.[PMID31108425] Current opinion in genetics & development 55: 26-31, May 2019. https://www.vet.upenn.edu/research/centers-laboratories/research-laboratory/research-laboratory/anguera-laboratory
    2 Syrett Camille M, Anguera Montserrat C When the balance is broken: X-linked gene dosage from two X chromosomes and female-biased autoimmunity. [PMID31125996] Journal of leukocyte biology May 2019. https://www.vet.upenn.edu/research/centers-laboratories/research-laboratory/research-laboratory/anguera-laboratory
    3 Kotzin Jonathan J, Iseka Fany, Wright Jasmine, Basavappa Megha G, Clark Megan L, Ali Mohammed-Alkhatim, Abdel-Hakeem Mohamed S, Robertson Tanner F, Mowel Walter K, Joannas Leonel, Neal Vanessa D, Spencer Sean P, Syrett Camille M, Anguera Montserrat C, Williams Adam, Wherry E John, Henao-Mejia Jorge The long noncoding RNA regulates CD8 T cells in response to viral infection.[PMID31138702] Proceedings of the National Academy of Sciences of the United States of America 116(24): 11916-11925, Jun 2019. https://www.vet.upenn.edu/research/centers-laboratories/research-laboratory/research-laboratory/anguera-laboratory
    4 Syrett Camille M, Paneru Bam, Sandoval-Heglund Donavon, Wang Jianle, Banerjee Sarmistha, Sindhava Vishal, Behrens Edward M, Atchison Michael, Anguera Montserrat C Altered X-chromosome inactivation in T cells may promote sex-biased autoimmune diseases. [PMID30944248 JCI insight 4(7), Apr 2019. https://www.vet.upenn.edu/research/centers-laboratories/research-laboratory/research-laboratory/anguera-laboratory
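
One fragile spot in the template functions is that `find()` returns `None` when a tag is missing, so a citation without `<b>` or `<i>` would raise `AttributeError`. The following is a minimal sketch of a more defensive version of the template 2 extraction. The HTML snippet, the example.com URL, and the helper name `parse_citation` are assumptions made for illustration; the real page markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet imitating template 2; the second citation has no <b>/<i>.
HTML = """
<span class="fac_citation"><b>Poe E, Doe J</b>: <i>A sample title.</i> Sample Journal 2(2), 2020.</span>
<span class="fac_citation">Citation without markup.</span>
"""


def parse_citation(span, url):
    """Extract one citation, using None for any piece that is missing."""
    author = span.find("b")
    title = span.find("i")
    # The journal text is the node right after <i>, if there is one.
    journal = title.next_sibling if title is not None else None
    return {
        "author": author.get_text(strip=True) if author is not None else None,
        "title": title.get_text(strip=True) if title is not None else None,
        "journal": str(journal).strip() if journal is not None else None,
        "source": url,
    }


soup = BeautifulSoup(HTML, "html.parser")
rows = [
    parse_citation(span, "https://example.com/faculty")
    for span in soup.find_all("span", {"class": "fac_citation"})
]
for row in rows:
    print(row)
```

Rows with `None` values can then be filtered or inspected in the resulting dataframe instead of stopping the whole scrape.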

    【Comments】:

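      【Solution 2】:

      You need an iterable in the outer loop. One way is to build a sequence of (template, url) tuples from the existing dataframe columns and loop over it. You can then simplify the conditional logic inside the loop.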

      import requests
      from bs4 import BeautifulSoup
      import pandas as pd
      
      page_url1 = "https://www.vet.upenn.edu/research/centers-laboratories/research-laboratory/research-laboratory/anguera-laboratory"
      page_url2 = "https://www.med.upenn.edu/apps/faculty/index.php/g20001100/p8866286"
      page_url_lst = {"url": [page_url1, page_url2], "template": [1, 2]}
      page_url_df = pd.DataFrame(page_url_lst)
      
      data = []
      
      with requests.Session() as s:
          for template, url in zip(
              page_url_df["template"].to_list(), page_url_df["url"].to_list()
          ):
              r = s.get(url)
              soup = BeautifulSoup(r.text, "lxml")
      
              if template == 1:
                  for e in soup.select("#tabs-publications em"):
                      data.append(
                          {
                              "author": e.previous.get_text(strip=True)[:-1],
                              "title": e.get_text(strip=True),
                              "journal": e.next_sibling.get_text(strip=True),
                              "source": url,
                          }
                      )
              else:
                  for a in soup.find_all("span", {"class": "fac_citation"}):
                      data.append(
                          {
                              "author": a.find("b").get_text(),
                              "title": a.find("i").get_text(strip=True),
                              "journal": a.find("i").next_sibling.get_text(strip=True),
                              "source": url,
                          }
                      )
      print(data)
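
Note that the `else` branch above treats every template value other than 1 as template 2. If more templates are expected, one possible variation is a dictionary that maps each template id to its parser and fails loudly on unknown ids. The sketch below uses placeholder parsers and placeholder pages (all names, URLs, and HTML strings here are assumptions); in practice the parsers would contain the selector logic shown above.

```python
def parse_template_1(html, url):
    # Placeholder: a real version would apply the "#tabs-publications em" selector.
    return [{"template": 1, "source": url}]


def parse_template_2(html, url):
    # Placeholder: a real version would search for span.fac_citation elements.
    return [{"template": 2, "source": url}]


# Template id -> parser function. Adding a template means adding one entry.
PARSERS = {1: parse_template_1, 2: parse_template_2}


def parse_page(template, html, url):
    """Look up the parser for this template and run it."""
    try:
        parser = PARSERS[template]
    except KeyError:
        raise ValueError(f"no parser registered for template {template!r}") from None
    return parser(html, url)


# Stand-in for zip(templates, downloaded pages, urls); no network access needed.
pages = [
    (1, "<html></html>", "https://example.com/a"),
    (2, "<html></html>", "https://example.com/b"),
]

data = []
for template, html, url in pages:
    data.extend(parse_page(template, html, url))
print(data)
```

Keeping the parsers as separate functions also makes each template testable against saved HTML without hitting the live sites.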
      

      【Comments】:
