【Question title】: Separating the information in the output of my scraping code (beautifulsoup + python)
【Posted】: 2020-03-08 19:39:12
【Question description】:

The profile I am scraping is https://lawyers.justia.com/lawyer/robin-d-gross-39828 . My code prints the Education entries and the Professional Associations entries together under the same label. How can I separate the two?

# soup is the BeautifulSoup object already parsed from the profile page
for item in soup.find_all("dl", {"class": "description-list list-with-badges"}):
    names = item.find_all("span", attrs={"itemprop": "name"})
    if names:
        print("Education:", item.get_text(strip=True, separator="|").split("|"))

The output is:

Education: ['Santa Clara University School of Law', 'J.D. ', '  Law', '1998', 'Honors:', 'Awarded "Certificate in High Technology Law"', 'Activities:', 'Editor, Santa Clara Computer & High Technology Law Journal;  Editor-in-Chief, The Advocate, Santa Clara University Law School Newspaper.']
Education: ['Michigan State University, James Madison College', 'B.A. ', '  Political Philosophy', '1995', 'Honors:', 'Overseas Study Program in Caribbean and South America, Summer Semester 1994Vice-President, MSU Adventure Club']
Education: ['Michigan State University, James Madison College', 'B.A. ', '  International Relations', '1995']
Education: ['California State Bar', '# 200701', 'Member', 'Current']
Education: ['California Bar Association', 'Member', 'Current']
Education: ['San Francisco Bar Association', 'Member', 'Current']
Education: ['American Bar Association', 'Member', 'Current']
Education: ['Internet Corporation for Assigned Names and Numbers (ICANN) - Noncommercial Stakeholders Group', 'Executive Committee', '2010', '- Current']
Education: ['Executive Committee of FreeMuse', 'Member', '2009', '-', '2016']
Education: ['Public Interest Registry - Advisory Council', 'Member', '2012', '-', '2014']

【Comments】:

    Tags: python web beautifulsoup screen-scraping


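    【Solution 1】:

    You are selecting your items with "class": "description-list list-with-badges". If you look at the page source, you will see that the items under both Education and Professional Associations carry these classes, so the selector cannot tell them apart.

    If you want to capture them separately, you can use the itemtype attribute. Its value is http://schema.org/CollegeOrUniversity for Education and http://schema.org/Organization for Professional Associations.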
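The itemtype idea above can be sketched roughly as follows. This is a minimal illustration, not the answerer's own code: `SAMPLE_HTML` is hypothetical, simplified markup, and it assumes the itemtype attribute sits on the `dl` element or on one of its descendants. The real page may place it elsewhere, so check the page source before relying on this.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for the real profile page.
SAMPLE_HTML = """
<dl class="description-list list-with-badges" itemscope itemtype="http://schema.org/CollegeOrUniversity">
  <dt><span itemprop="name">Santa Clara University School of Law</span></dt>
  <dd>J.D.</dd><dd>1998</dd>
</dl>
<dl class="description-list list-with-badges" itemscope itemtype="http://schema.org/Organization">
  <dt><span itemprop="name">California State Bar</span></dt>
  <dd>Member</dd><dd>Current</dd>
</dl>
"""

# Map each itemtype value named in the answer to the label to print.
LABELS = {
    "http://schema.org/CollegeOrUniversity": "Education",
    "http://schema.org/Organization": "Professional Associations",
}


def group_by_itemtype(html):
    """Return {label: [list of text pieces per entry]} keyed by itemtype."""
    soup = BeautifulSoup(html, "html.parser")
    grouped = {label: [] for label in LABELS.values()}
    for item in soup.find_all("dl", {"class": "description-list list-with-badges"}):
        # Assumption: itemtype is on the <dl> itself or on a descendant.
        holder = item if item.has_attr("itemtype") else item.find(attrs={"itemtype": True})
        label = LABELS.get(holder["itemtype"]) if holder else None
        if label:
            grouped[label].append(item.get_text(strip=True, separator="|").split("|"))
    return grouped


result = group_by_itemtype(SAMPLE_HTML)
for label, rows in result.items():
    for row in rows:
        print(f"{label}:", row)
```

Grouping into a dictionary rather than printing inside the loop keeps the two categories available for later processing, for example writing them to separate columns.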

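    【Discussion】:

    • No problem! If this solved your issue, don't forget to accept the answer: stackoverflow.com/help/someone-answers
    • Awesome, I didn't know that, but ty.
    • I am using the same idea you suggested to try to get the Awards information, but it looks like there is no unique attribute there, unlike Professional Associations and Education. Do you know what I could do in this case?
    • You can first find the Awards div with a text-based search, then use .parent to get all the information inside that div's container.
    • I tried this: for item in soup.findAll("div",{"class":"heading-3 block-title iconed-heading font-w-bold"}): j=item.find_parent('div') print("AwardS:",item.get_text(strip=True, separator= '|').split('|'))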
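The text-based search suggested in the comments might look something like the sketch below. Note that the attempt quoted above stores the parent in `j` but then prints the text of `item` (the heading), so only the heading is shown. Here the text is read from the parent instead. `SAMPLE_HTML`, the `section_text` helper, and the award entries are hypothetical; the sketch assumes the heading div holds only the section title and that its nearest parent div wraps the whole section.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: each section is a wrapper div with a heading div inside.
SAMPLE_HTML = """
<div class="block">
  <div class="heading-3 block-title iconed-heading font-w-bold">Awards</div>
  <dl><dt>Example Award</dt><dd>Example Organization</dd><dd>2015</dd></dl>
</div>
<div class="block">
  <div class="heading-3 block-title iconed-heading font-w-bold">Publications</div>
  <dl><dt>Example Article</dt></dl>
</div>
"""


def section_text(html, heading):
    """Find the div whose text is `heading` and return its parent's text pieces."""
    soup = BeautifulSoup(html, "html.parser")
    # Match the heading text exactly, ignoring surrounding whitespace.
    pattern = re.compile(rf"^\s*{re.escape(heading)}\s*$")
    title = soup.find("div", string=pattern)
    if title is None:
        return []
    # Read from the enclosing container, not from the heading itself.
    container = title.find_parent("div")
    return container.get_text(strip=True, separator="|").split("|")


awards = section_text(SAMPLE_HTML, "Awards")
print("Awards:", awards[1:])  # skip the first piece, which is the heading text
```

Searching by heading text avoids depending on the shared class names, which appear on every section heading.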