【Question title】:Python/Beautiful Soup: extracting all the text from <li> tags between an h2 and an h3 tag
【Posted】:2019-09-02 07:54:39
【Question】:

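What I'm trying to do: this website has three lists of food additives, and I'm trying to extract them into three separate lists. They sit in <ul><li> tags between <h2> and <h3> tags. I want to find the first h2, extract all the <li>s below it into a list, and, when the next heading tag (h3) is reached, start a new list, extract all the <li>s below that, and carry on the same way for the third list.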

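What I've already tried: I did some reading and found a question very similar to mine: "BeautifulSoup: How do I extract all the <li>s from a list of <ul>s that contains some nested <ul>s?" I tried to apply the logic of that answer, but it didn't work for me.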

Before I start building the lists, I'm running print statements to see what the output is.

import urllib.request as request
import bs4 as bs

sauce = request.urlopen("https://www.foodadditivesworld.com/articles/banned-food-additives.html").read()
soup = bs.BeautifulSoup(sauce, 'lxml')


firstH2 = soup.find('h2') # Start here
# print(firstH2.text)
# print(firstH2.findNextSiblings())
uls = []
for sib in firstH2.findNextSiblings():
#     print(child.name)
    if sib.name=='h3':
        print(sib)
        break
    elif sib.name == 'div':
        print(sib.text)
        continue
        for c in sib.descendants:
            if c.name=='li':
                print (c)

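What happens: the code mostly does what I want, but it should break the first time it runs into an h3 tag, and it doesn't; it continues on to the second h3 tag before stopping. Why does it miss the first occurrence?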
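Without the page's markup it isn't possible to say for certain why the first h3 was skipped. One thing worth checking, though, is that `findNextSiblings()` only returns nodes that share the h2's parent, so an h3 nested inside a wrapper element would never show up in that loop. The sketch below sidesteps sibling relationships by walking the document in order and starting a new list at every heading. It uses the standard library's `html.parser` rather than BeautifulSoup, and both the `HeadingListParser` class and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class HeadingListParser(HTMLParser):
    """Group <li> text under the most recent <h2>/<h3>, in document order."""

    HEADINGS = {"h2", "h3"}

    def __init__(self):
        super().__init__()
        self.groups = []      # [[heading_text, [li_text, ...]], ...]
        self._capture = None  # "heading", "li", or None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._capture, self._buffer = "heading", []
        elif tag == "li" and self.groups:
            # <li>s seen before the first heading (e.g. a menu) are ignored.
            self._capture, self._buffer = "li", []

    def handle_data(self, data):
        if self._capture:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        text = "".join(self._buffer).strip()
        if tag in self.HEADINGS and self._capture == "heading":
            self.groups.append([text, []])   # a heading starts a new list
            self._capture = None
        elif tag == "li" and self._capture == "li":
            self.groups[-1][1].append(text)  # item goes under the latest heading
            self._capture = None


# Hypothetical markup: the first <h3> sits inside a wrapper <div>,
# so it is not a sibling of the <h2>.
SAMPLE_HTML = """
<ul><li>menu item</li></ul>
<h2>First list</h2>
<div><ul><li>a</li><li>b</li></ul>
  <h3>Second list</h3>
  <ul><li>c</li></ul>
</div>
<h3>Third list</h3>
<ul><li>d</li><li>e</li></ul>
"""

parser = HeadingListParser()
parser.feed(SAMPLE_HTML)
parser.close()
groups = parser.groups
for heading, items in groups:
    print(heading, items)
```

Because the grouping depends only on the order in which tags appear, it gives the same result whether the headings are siblings of the lists or buried in wrapper elements. It does not handle nested <li> elements.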

【Question comments】:

    Tags: python web-scraping beautifulsoup


    【Answer 1】:

    You can scrape the heading (h2, h3) and ul tags, then use itertools.groupby:

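    import requests, itertools, re
    from bs4 import BeautifulSoup as soup
    d = soup(requests.get('https://www.foodadditivesworld.com/articles/banned-food-additives.html').text, 'html.parser')
    # Collect every h2/h3/ul in document order; drop the first match (the page's drop-down menu <ul>)
    _, *data = [[i.name, i] for i in d.find_all(re.compile('h2|h3|ul'))]
    # Split into alternating runs: heading tags (key True), then the <ul> tags that follow them (key False)
    new_data = [[a, list(b)] for a, b in itertools.groupby(data, key=lambda x:x[0] == 'h2' or x[0] == 'h3')]
    # Pair the first heading of each heading run with the <li> texts of the <ul> run right after it
    new_result = [[new_data[i][-1][0][-1].text, [c.text for b in new_data[i+1][-1] for c in b[-1].find_all('li')]] for i in range(0, len(new_data), 2)]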
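To see what the `groupby` step is doing, here is a small sketch that uses plain strings in place of the parsed tags; the `tags` list is invented for illustration. `itertools.groupby` only groups consecutive items that share a key, so a boolean "is this a heading?" key yields alternating heading runs and list runs.

```python
import itertools

# Stand-ins for the [tag_name, tag] pairs in document order (made-up data).
tags = [["h2", "US"], ["ul", "list 1"], ["h2", "UK"], ["ul", "list 2"], ["ul", "list 3"]]

# Each run is [is_heading, [consecutive items sharing that key]].
runs = [[is_heading, list(group)]
        for is_heading, group in itertools.groupby(tags, key=lambda t: t[0] in ("h2", "h3"))]

# Even positions hold heading runs, odd positions hold the <ul> runs that follow them.
paired = [[runs[i][1][0][1], [t[1] for t in runs[i + 1][1]]]
          for i in range(0, len(runs), 2)]
print(paired)
```

The pairing by index assumes the sequence starts with a heading and that every heading run is followed by at least one ul; if the page ends on a heading, `runs[i + 1]` raises an `IndexError`.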
    

    Output:

    [['Banned Food Additives in US', ['Calamus extract', 'Calamus oil', 'Calcium cyclamate', 'Chlorofluorocarbons', 'cinnamyl anthranilate', 'Cobaltous chloride', 'Cobalt sulfate', 'Coumarin', 'Cyclamate', 'Diethyl pyrocarbonatec', 'Dulcin', 'Fd&c green no. 1', 'Fd&c green no. 2', 'Fd&c red no. 3, aluminum lake', 'CFd&c red no. 3, calcium lake', 'Fd&c red no. 1', 'Fd&c red no. 2', 'Fd&c red no. 4', 'Fd&c violet no. 1', 'Magnesium cyclamate', 'Nordihydroguaiaretic acid', 'Potassium cyclamate', 'P-4000', 'Safrole', 'Sodium cyclamate', 'Thiourea']], ['UK Food Additives Banned in Australia and New Zealand', ['E131 Patent Blue V', 'E154 Brown FK', 'E161g Canthaxanthin', 'E180 Litholrubine BK']], ['Preservatives', ['E214 Â\xa0 Ethyl p-hydroxybenzoate', 'E215 Â\xa0 Sodium ethyl p-hydroxybenzoate', 'E219 Â\xa0 Sodium methyl p-hydroxybenzoate', 'E226 Â\xa0 Calcium sulphite', 'E227 Â\xa0 Calcium hydrogen sulphite', 'E230 Â\xa0 Biphenyl; diphenyl', 'E231 Â\xa0 Orthophenyl phenol', 'E232 Â\xa0 Sodium orthophenyl phenol', 'E239 Â\xa0 Hexamethylene tetramine', 'E284 Â\xa0 Boric acid', 'E285 Â\xa0 Sodium tetraborate; borax', 'E356 Â\xa0 Sodium adipate antioxidant']], ['Stabilisers, Thickeners and Gelling Agents Emulsifiers', ['E417 Â\xa0 Tara gum', 'E425 Â\xa0 Konjac', 'E426 Â\xa0 Soybean hemicellulose', 'E226 Â\xa0 Calcium sulphite', 'E432 Polyoxyethylene sorbitan monolaurate; Polysorbate 20', 'E434 Polyoxyethylene sorbitan monopalmitate; Polysorbate 40', 'E459 Â\xa0 Beta-cyclodextrin', 'E462 Â\xa0 Ethyl cellulose', 'E468 Â\xa0 Crosslinked sodium carboxy methyl cellulose', 'E472d Â\xa0 Tartaric acid esters of mono- and diglycerides of fatty acids', 'E474 Â\xa0 Sucroglycerides', 'E483 Â\xa0 Stearyl tartrate', 'E493 Â\xa0 Sorbitan monolaurate', 'E494 Â\xa0 Sorbitan monooleate', 'E495 Â\xa0 Sorbitan monopalmitate', 'E513 Â\xa0 Sulphuric acid', 'E517 Â\xa0 Ammonium sulphate', 'E520 Â\xa0 Aluminium sulphate', 'E521 Â\xa0 Aluminium sodium sulphate', 'E522 Â\xa0 Aluminium potassium 
sulphate', 'E523 Â\xa0 Aluminium ammonium sulphate', 'E524 Â\xa0 Sodium hydroxide', 'E525 Â\xa0 Potassium hydroxide', 'E527 Â\xa0 Ammonium hydroxide', 'E528 Â\xa0 Magnesium hydroxide', 'E538 Â\xa0 Calcium ferrocyanide', 'E553a Â\xa0 (i) Magnesium silicate', 'E553b Â\xa0 Talc E574 Â\xa0 Gluconic acid', 'E576 Â\xa0 Sodium gluconate', 'E585 Â\xa0 Ferrous lactate', 'E626 Â\xa0 Guanylic acid', 'E628 Â\xa0 Dipotassium guanylate', 'E629 Â\xa0 Calcium guanylate', 'E630 Â\xa0 lnosinic acid', 'E632 Â\xa0 Dipotassium inosinate', 'E633 Â\xa0 Calcium inosinate', "E634 Â\xa0 Calcium 5'-ribonucleotides", 'E650 Â\xa0 Zinc acetate', 'E900 Â\xa0 Dimethylpolysiloxane', 'E902 Â\xa0 Candelilla wax', 'E905 Â\xa0 Microcrystalline wax', 'E912 Â\xa0 Montan acid esters', 'E927b Â\xa0 Carbamide', 'E938 Â\xa0 Argon', 'E939 Â\xa0 Helium', 'E948 Â\xa0 Oxygen', 'E949 Â\xa0 Hydrogen', 'E959 Â\xa0 Neohesperidine DC', 'E962 Â\xa0 Salt of aspartame-acesulfame', 'E999 Â\xa0 Quillaia extract', 'E1103 Â\xa0 Invertase', 'E1202 Â\xa0 Polyvinylpolypyrrolidone', 'E1204 Â\xa0 Pullulan', 'E1451 Â\xa0 Acetylated oxidised starch', 'E1452 Â\xa0 Starch aluminium Octenyl succinate', 'Annatto ExtractM', 'Anthocyanins', 'Lake Allura Red', 'Lake Amaranth', 'Solvent Black 5', 'Solvent Black 7', 'Pigment Fast Yellow G', 'Pigment Green B', 'FD&C; Blue No.2 ', 'FD&C; Blue No.1 ', 'Beverages ', 'Confectionery ', 'Anticaking Agents ', 'Color Retention Agents ']]]
    

    Printing the result:

    print('\n\n'.join('  {}\n{}'.format(a, '\n'.join(f'\t-{i}' for i in b)) for a, b in new_result))
    

    Output:

    Banned Food Additives in US
    -Calamus extract
    -Calamus oil
    -Calcium cyclamate
    -Chlorofluorocarbons
    -cinnamyl anthranilate
    -Cobaltous chloride
    -Cobalt sulfate
    -Coumarin
    -Cyclamate
    -Diethyl pyrocarbonatec
    -Dulcin
    -Fd&c green no. 1
    -Fd&c green no. 2
    -Fd&c red no. 3, aluminum lake
    -CFd&c red no. 3, calcium lake
    -Fd&c red no. 1
    -Fd&c red no. 2
    -Fd&c red no. 4
    -Fd&c violet no. 1
    -Magnesium cyclamate
    -Nordihydroguaiaretic acid
    -Potassium cyclamate
    -P-4000
    -Safrole
    -Sodium cyclamate
    -Thiourea
    
    UK Food Additives Banned in Australia and New Zealand
    -E131 Patent Blue V
    -E154 Brown FK
    -E161g Canthaxanthin
    -E180 Litholrubine BK
    
    Preservatives
    -E214   Ethyl p-hydroxybenzoate
    -E215   Sodium ethyl p-hydroxybenzoate
    -E219   Sodium methyl p-hydroxybenzoate
    -E226   Calcium sulphite
    -E227   Calcium hydrogen sulphite
    -E230   Biphenyl; diphenyl
    -E231   Orthophenyl phenol
    -E232   Sodium orthophenyl phenol
    -E239   Hexamethylene tetramine
    -E284   Boric acid
    -E285   Sodium tetraborate; borax
    -E356   Sodium adipate antioxidant
    
    Stabilisers, Thickeners and Gelling Agents Emulsifiers
    -E417   Tara gum
    -E425   Konjac
    -E426   Soybean hemicellulose
    -E226   Calcium sulphite
    -E432 Polyoxyethylene sorbitan monolaurate; Polysorbate 20
    -E434 Polyoxyethylene sorbitan monopalmitate; Polysorbate 40
    -E459   Beta-cyclodextrin
    -E462   Ethyl cellulose
    -E468   Crosslinked sodium carboxy methyl cellulose
    -E472d   Tartaric acid esters of mono- and diglycerides of fatty acids
    -E474   Sucroglycerides
    -E483   Stearyl tartrate
    -E493   Sorbitan monolaurate
    -E494   Sorbitan monooleate
    -E495   Sorbitan monopalmitate
    -E513   Sulphuric acid
    -E517   Ammonium sulphate
    -E520   Aluminium sulphate
    -E521   Aluminium sodium sulphate
    -E522   Aluminium potassium sulphate
    -E523   Aluminium ammonium sulphate
    -E524   Sodium hydroxide
    -E525   Potassium hydroxide
    -E527   Ammonium hydroxide
    -E528   Magnesium hydroxide
    -E538   Calcium ferrocyanide
    -E553a   (i) Magnesium silicate
    -E553b   Talc E574   Gluconic acid
    -E576   Sodium gluconate
    -E585   Ferrous lactate
    -E626   Guanylic acid
    -E628   Dipotassium guanylate
    -E629   Calcium guanylate
    -E630   lnosinic acid
    -E632   Dipotassium inosinate
    -E633   Calcium inosinate
    -E634   Calcium 5'-ribonucleotides
    -E650   Zinc acetate
    -E900   Dimethylpolysiloxane
    -E902   Candelilla wax
    -E905   Microcrystalline wax
    -E912   Montan acid esters
    -E927b   Carbamide
    -E938   Argon
    -E939   Helium
    -E948   Oxygen
    -E949   Hydrogen
    -E959   Neohesperidine DC
    -E962   Salt of aspartame-acesulfame
    -E999   Quillaia extract
    -E1103   Invertase
    -E1202   Polyvinylpolypyrrolidone
    -E1204   Pullulan
    -E1451   Acetylated oxidised starch
    -E1452   Starch aluminium Octenyl succinate
    -Annatto ExtractM
    -Anthocyanins
    -Lake Allura Red
    -Lake Amaranth
    -Solvent Black 5
    -Solvent Black 7
    -Pigment Fast Yellow G
    -Pigment Green B
    -FD&C; Blue No.2 
    -FD&C; Blue No.1 
    -Beverages 
    -Confectionery 
    -Anticaking Agents 
    -Color Retention Agents 
    

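    【Comments】:

    • That's some very concise code! It will take me a good while to understand it. Thanks for the answer.
    • @Jimmy9zz Glad to help!
    • Could you please clarify: why use "_, *data" as the variable?
    • @Jimmy9zz _ is a throwaway variable; in this case it's a clean way to ignore the first result of [[i.name, i] for i in d.find_all(re.compile('h2|h3|ul'))]. That first result is actually the ul content of the drop-down menu, so it isn't needed.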