【Question title】: Unable to separate certain fields from each container of ingredients
【Posted】: 2021-05-15 14:51:57
【Question】:

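I'm trying to separate three fields (name, unit and measure) from the ingredient containers of a webpage. I use BeautifulSoup to parse the ingredient containers and then the re module to split unit from measure. This is the portion I'm interested in getting the three fields from.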

This is what I've tried so far:

import re
import requests
from bs4 import BeautifulSoup

link = 'https://www.delicious.com.au/recipes/gnocchi-walnut-rosemary-pecorino-pesto/1b0defa9-53c8-4e9c-8c93-fb96a5348b31?r=recipes/gallery/opvo6a3l'

def get_content(s,link):
    r = s.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    for item in soup.select("ul.ingredient > li"):
        ingr_container = item.get_text(strip=True)
        ingr_unit_container = re.search(r"[\d.⁄a-z]+",ingr_container).group(0)
        ingr_name = re.sub(ingr_unit_container,"",ingr_container).strip()
        ingr_unit = re.sub(r"[a-z]+","",ingr_unit_container).strip()
        ingr_measure = re.sub(r"[\d.⁄]+","",ingr_unit_container).strip()
        yield ingr_name,ingr_unit,ingr_measure

if __name__ == '__main__':
    with requests.Session() as s:
        s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'
        for item in get_content(s,link):
            print(item)

The ingredient containers look like this:

500g potato gnocchi
2 tbs extra virgin olive oil
Finely grated zest and juice of 1 lemon
1⁄2 bunch basil, leaves picked
1 tbs finely chopped rosemary, plus fried rosemary leaves to serve
2 garlic cloves, crushed
50g grated pecorino, (or parmesan) plus extra to serve
50g roasted and chopped walnuts, plus extra to serve
100ml extra virgin olive oil

Current output the script produces from the containers above:

('potato gnocchi', '500', 'g')
('tbs extra virgin olive oil', '2', '')
('F grated zest and juice of 1 lemon', '', 'inely')
('bunch basil, leaves picked', '1⁄2', '')
('tbs finely chopped rosemary, plus fried rosemary leaves to serve', '1', '')
('garlic cloves, crushed', '2', '')
('grated pecorino, (or parmesan) plus extra to serve', '50', 'g')
('roasted and chopped walnuts, plus extra to serve', '50', 'g')
('extra virgin olive oil', '100', 'ml')

Expected output:

('potato gnocchi', '500', 'g')
('extra virgin olive oil', '2', 'tbs')
('Finely grated zest and juice of', '1', 'lemon')
('basil, leaves picked', '1⁄2', 'bunch')
('finely chopped rosemary, plus fried rosemary leaves to serve', '1', 'tbs')
('cloves, crushed', '2', 'garlic')
('grated pecorino, (or parmesan) plus extra to serve', '50', 'g')
('roasted and chopped walnuts, plus extra to serve', '50', 'g')
('extra virgin olive oil', '100', 'ml')
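
The mismatch can be reproduced without fetching the page. The snippet below is a minimal sketch that runs the question's first pattern against three of the ingredient strings listed above (the `samples` list is copied by hand, not scraped). It suggests two causes: the character class has no whitespace, so a quantity separated from its measure by a space is cut off after the number, and it mixes digits with lowercase letters, so on a line that begins with a word the first match is the lowercase tail of that word rather than the number.

```python
import re

# Pattern copied from the question: digits, ".", the fraction slash and
# lowercase letters all share one character class.
pattern = r"[\d.⁄a-z]+"

# Hand-copied from the ingredient list above (assumption: the scraped text is identical).
samples = [
    "500g potato gnocchi",
    "2 tbs extra virgin olive oil",
    "Finely grated zest and juice of 1 lemon",
]

# re.search returns the leftmost match, which is not always the quantity.
first_matches = [re.search(pattern, s).group(0) for s in samples]

for s, m in zip(samples, first_matches):
    print(repr(m), "<-", s)
```

Note also that `re.sub(ingr_unit_container, "", ingr_container)` treats the matched text as a regular expression and removes every occurrence of it, so a plain string replacement limited to one occurrence is probably closer to the intent.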

【Comments】:

    Tags: python python-3.x regex web-scraping


    【Solution 1】:

    I'm far from a regex expert. However, I found that the following implementation works:

    import re
    import requests
    from bs4 import BeautifulSoup
    
    link = 'https://www.delicious.com.au/recipes/gnocchi-walnut-rosemary-pecorino-pesto/1b0defa9-53c8-4e9c-8c93-fb96a5348b31?r=recipes/gallery/opvo6a3l'
    
    def get_content(s,link):
        r = s.get(link)
        soup = BeautifulSoup(r.text,"lxml")
        for item in soup.select("ul.ingredient > li"):
            ingr_container = item.get_text(strip=True)
            unit_container = re.search(r'[\d.⁄]+\s*?[a-zA-Z]+\s*?',ingr_container).group(0)
            ingr_name = ingr_container.replace(unit_container,"").strip()
            ingr_unit = re.search(r'[\d.⁄]+',unit_container).group(0)
            ingr_measure = unit_container.replace(ingr_unit,"").strip()
            yield ingr_name,ingr_unit,ingr_measure
    
    if __name__ == '__main__':
        with requests.Session() as s:
            s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'
            for item in get_content(s,link):
                print(item)
    

    Output:

    ('potato gnocchi', '500', 'g')
    ('extra virgin olive oil', '2', 'tbs')
    ('Finely grated zest and juice of', '1', 'lemon')
    ('basil, leaves picked', '1⁄2', 'bunch')
    ('finely chopped rosemary, plus fried rosemary leaves to serve', '1', 'tbs')
    ('cloves, crushed', '2', 'garlic')
    ('grated pecorino, (or parmesan) plus extra to serve', '50', 'g')
    ('roasted and chopped walnuts, plus extra to serve', '50', 'g')
    ('extra virgin olive oil', '100', 'ml')
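
The same split can be tried offline with one pattern and named groups. This is a sketch only, not the answerer's code: `split_ingredient` and `QTY_MEASURE` are hypothetical names, the sample strings are copied from the question, and returning empty strings when a line has no number is an assumption about the desired behaviour (the code above would raise `AttributeError` there, because `re.search` returns `None`).

```python
import re

# Hypothetical helper, not part of the answer above.
# qty:     digits with an optional ".", "/" or "⁄" part (500, 1.5, 1⁄2)
# measure: the first alphabetic word after the quantity, with or without a space
QTY_MEASURE = re.compile(r"(?P<qty>\d+(?:[./⁄]\d+)?)\s*(?P<measure>[A-Za-z]+)")

def split_ingredient(text):
    """Return (name, quantity, measure); the last two are '' if no number is found."""
    m = QTY_MEASURE.search(text)
    if m is None:
        # Assumption: keep the whole line as the name when there is no quantity.
        return text.strip(), "", ""
    # Cut out only the matched span, then collapse the leftover whitespace.
    name = " ".join((text[:m.start()] + text[m.end():]).split())
    return name, m.group("qty"), m.group("measure")

if __name__ == "__main__":
    for line in [
        "500g potato gnocchi",
        "Finely grated zest and juice of 1 lemon",
        "1⁄2 bunch basil, leaves picked",
    ]:
        print(split_ingredient(line))
```

Slicing by `m.start()` and `m.end()` removes exactly one span, whereas `str.replace` removes every occurrence of the matched text in the line.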
    

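    【Discussion】:

      【Solution 2】:

      One solution could be to search the text for digits, which give the measure. It gets a little tricky because sometimes the unit is attached to the measure and sometimes there is a space between them. But you can handle that with conditions (there may also be a regex solution):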

      import re
      import requests
      from bs4 import BeautifulSoup
      
      link = 'https://www.delicious.com.au/recipes/gnocchi-walnut-rosemary-pecorino-pesto/1b0defa9-53c8-4e9c-8c93-fb96a5348b31?r=recipes/gallery/opvo6a3l'
      
      def get_content(s,link):
          r = s.get(link)
          soup = BeautifulSoup(r.text,"lxml")
          for item in soup.select("ul.ingredient > li"):
              ingr_container = item.get_text(strip=True).split()
      
              for index, string in enumerate(ingr_container):
                  if re.search(r'\d', string): #check for digits, or parts, that contain digits
                      if not string.isdecimal(): #check if digits and characters are mixed
                          if not string.isalnum(): #check if it's a fraction (e.g. 1⁄2); the slash is not alphanumeric
                              ingr_measure = string
                              ingr_unit = ingr_container[index+1]     
                              to_remove = [index, index+1] #at this index (indices) the unit and measure is set   
                              break           
      
                          else: #split digit and characters
                              for i, char in enumerate(string):
                                  if char.isalpha():
                                      ingr_measure = string[:i]
                                      ingr_unit = string[i:]
                                      to_remove = [index, index]  
                                      break
                              break
                      else:
                          ingr_measure = string
                          ingr_unit = ingr_container[index+1]
                          to_remove = [index, index+1]
                          break
      
              ingr_name = ' '.join(ingr_container[:to_remove[0]] + ingr_container[to_remove[1]+1:]) #ingr_name is the whole ingr_container without measure and unit
      
              yield ingr_name, ingr_measure, ingr_unit
      
      
      if __name__ == '__main__':
          with requests.Session() as s:
              s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'
              for item in get_content(s,link):
                  print(item)
      

      Output:

      ('potato gnocchi', '500', 'g')
      ('extra virgin olive oil', '2', 'tbs')
      ('Finely grated zest and juice of', '1', 'lemon')
      ('basil, leaves picked', '1⁄2', 'bunch')
      ('finely chopped rosemary, plus fried rosemary leaves to serve', '1', 'tbs')
      ('cloves, crushed', '2', 'garlic')
      ('grated pecorino, (or parmesan) plus extra to serve', '50', 'g')
      ('roasted and chopped walnuts, plus extra to serve', '50', 'g')
      ('extra virgin olive oil', '100', 'ml')
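
The token-based idea can also be written as a small pure function, which makes it testable without a network request. This is a sketch under assumptions, not the code above: `split_tokens` is a hypothetical name, it only considers tokens that start with a digit, and it returns empty strings when no such token exists (in the loop above, `ingr_measure` and `to_remove` would be undefined or stale in that case).

```python
def split_tokens(text):
    """Return (name, measure, unit) in the same order as the answer above."""
    tokens = text.split()
    for i, tok in enumerate(tokens):
        if not tok[0].isdigit():
            continue  # not a quantity token
        # Index of the first letter inside the token, or None for "2" or "1⁄2".
        cut = next((j for j, ch in enumerate(tok) if ch.isalpha()), None)
        if cut is not None:
            # Measure and unit are glued together, e.g. "500g".
            measure, unit = tok[:cut], tok[cut:]
            rest = tokens[:i] + tokens[i + 1:]
        elif i + 1 < len(tokens):
            # The unit is the following token, e.g. "2 tbs" or "1⁄2 bunch".
            measure, unit = tok, tokens[i + 1]
            rest = tokens[:i] + tokens[i + 2:]
        else:
            # Assumption: a trailing number has no unit.
            measure, unit = tok, ""
            rest = tokens[:i]
        return " ".join(rest), measure, unit
    # Assumption: no quantity found, so the whole line is the name.
    return text.strip(), "", ""

if __name__ == "__main__":
    for line in ["500g potato gnocchi", "2 garlic cloves, crushed"]:
        print(split_tokens(line))
```

Punctuation stuck to a glued unit (for example a hypothetical `50g,`) would be kept as part of the unit, so real input may need extra cleanup.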
      

      【Discussion】:
