【Posted】: 2021-05-15 14:51:57
【Question】:
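I'm trying to separate three fields (name, unit, and measure) out of certain ingredient containers on a webpage. I use BeautifulSoup to parse the ingredient containers and then the re module to split out unit and measure. This is the portion I'm interested in getting the three fields from.
This is what I've tried so far: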
import re
import requests
from bs4 import BeautifulSoup

link = 'https://www.delicious.com.au/recipes/gnocchi-walnut-rosemary-pecorino-pesto/1b0defa9-53c8-4e9c-8c93-fb96a5348b31?r=recipes/gallery/opvo6a3l'

def get_content(s,link):
    r = s.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    for item in soup.select("ul.ingredient > li"):
        ingr_container = item.get_text(strip=True)
        ingr_unit_container = re.search(r"[\d.⁄a-z]+",ingr_container).group(0)
        ingr_name = re.sub(ingr_unit_container,"",ingr_container).strip()
        ingr_unit = re.sub(r"[a-z]+","",ingr_unit_container).strip()
        ingr_measure = re.sub(r"[\d.⁄]+","",ingr_unit_container).strip()
        yield ingr_name,ingr_unit,ingr_measure

if __name__ == '__main__':
    with requests.Session() as s:
        s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'
        for item in get_content(s,link):
            print(item)
The ingredient containers look like this:
500g potato gnocchi
2 tbs extra virgin olive oil
Finely grated zest and juice of 1 lemon
1⁄2 bunch basil, leaves picked
1 tbs finely chopped rosemary, plus fried rosemary leaves to serve
2 garlic cloves, crushed
50g grated pecorino, (or parmesan) plus extra to serve
50g roasted and chopped walnuts, plus extra to serve
100ml extra virgin olive oil
Current output the script produces from the containers above:
('potato gnocchi', '500', 'g')
('tbs extra virgin olive oil', '2', '')
('F grated zest and juice of 1 lemon', '', 'inely')
('bunch basil, leaves picked', '1⁄2', '')
('tbs finely chopped rosemary, plus fried rosemary leaves to serve', '1', '')
('garlic cloves, crushed', '2', '')
('grated pecorino, (or parmesan) plus extra to serve', '50', 'g')
('roasted and chopped walnuts, plus extra to serve', '50', 'g')
('extra virgin olive oil', '100', 'ml')
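The odd rows seem to follow from `re.search` returning only the first run of characters in `[\d.⁄a-z]+`: a space ends the run, and an uppercase letter is not in the class at all. A minimal sketch that reproduces this on three of the lines above, copied in as plain strings (the helper name `first_token` is made up for illustration and is not part of the script):

```python
import re

# Same character class as in the script: digits, dot, fraction slash, lowercase letters.
TOKEN = re.compile(r"[\d.⁄a-z]+")

def first_token(text):
    """Return the first run of characters matched by the class, as re.search does."""
    return TOKEN.search(text).group(0)

if __name__ == "__main__":
    # The measure is glued to the number, so both land in one token.
    print(repr(first_token("500g potato gnocchi")))            # '500g'
    # The space after "2" ends the run, so "tbs" is never captured.
    print(repr(first_token("2 tbs extra virgin olive oil")))   # '2'
    # "F" is uppercase and skipped, so the first run is "inely".
    print(repr(first_token("Finely grated zest and juice of 1 lemon")))  # 'inely'
```

So the measure only comes out right when it is written without a space after the number.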
Expected output:
('potato gnocchi', '500', 'g')
('extra virgin olive oil', '2', 'tbs')
('Finely grated zest and juice of', '1', 'lemon')
('basil, leaves picked', '1⁄2', 'bunch')
('finely chopped rosemary, plus fried rosemary leaves to serve', '1', 'tbs')
('cloves, crushed', '2', 'garlic')
('grated pecorino, (or parmesan) plus extra to serve', '50', 'g')
('roasted and chopped walnuts, plus extra to serve', '50', 'g')
('extra virgin olive oil', '100', 'ml')
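To pin down the rule behind the expected output, here is a rough sketch, not a tested solution against the live page. It assumes the unit is the first number in the line (digits, dots, fraction slash), the measure is the word right after it with or without a space, and the name is everything else. The names `PATTERN` and `split_ingredient` are hypothetical, and the input lines are hard-coded from the list above rather than scraped:

```python
import re

# unit: a digit followed by more digits, dots, or fraction slashes (500, 1⁄2)
# measure: the letters that follow, with or without a space (g, tbs, lemon)
PATTERN = re.compile(r"(?P<unit>\d[\d.⁄]*)\s*(?P<measure>[A-Za-z]+)")

def split_ingredient(text):
    """Return (name, unit, measure), or None when the line has no number."""
    m = PATTERN.search(text)
    if m is None:
        return None
    # Whatever surrounds the match is treated as the name.
    rest = text[:m.start()] + " " + text[m.end():]
    name = " ".join(rest.split())  # collapse the whitespace left by the cut
    return name, m.group("unit"), m.group("measure")

if __name__ == "__main__":
    lines = [
        "500g potato gnocchi",
        "2 tbs extra virgin olive oil",
        "Finely grated zest and juice of 1 lemon",
        "1⁄2 bunch basil, leaves picked",
        "2 garlic cloves, crushed",
    ]
    for line in lines:
        print(split_ingredient(line))
```

On the hard-coded lines this gives the tuples listed above, for example `('cloves, crushed', '2', 'garlic')`. Lines without any number return `None`, and text pulled with `get_text(strip=True)` may be spaced differently from the list shown here.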
【Discussion】:
Tags: python python-3.x regex web-scraping