Python HTML parsing with Beautiful Soup and filtering stop words: answers

【Question title】: Python HTML parsing with Beautiful Soup and filtering stop words
【Posted】: 2011-08-03 12:57:30
【Question】:

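I am parsing specific information from a website into a file. Right now my program looks at a web page, finds the right HTML tag, and parses out the right content. Now I want to filter these "results" further.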

For example, on the site: http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx

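I am parsing the ingredients, which are located in the <div class="ingredients"> tag. The parser does that job nicely, but I want to process these results further.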

When I run this parser, it removes numbers, symbols, commas, and slashes (\ or /) but leaves all the text. When I run it on the website, I get results like:

cup olive oil
cup chicken broth
cloves garlic minced
tablespoon paprika

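Now I want to process this further by removing stop words such as "cup", "cloves", "minced", "tablespoon", and so on. How do I do this? The code is written in Python, which I am not very good at; I am only using this parser to collect information that I could enter by hand, but I would rather not.

Any help on how to do this in detail would be appreciated! My code is below.

Code: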

import urllib2
import BeautifulSoup

def main():
    url = "http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx"
    data = urllib2.urlopen(url).read()
    bs = BeautifulSoup.BeautifulSoup(data)

    ingreds = bs.find('div', {'class': 'ingredients'})
    ingreds = [s.getText().strip('123456789.,/\ ') for s in ingreds.findAll('li')]

    fname = 'PorkRecipe.txt'
    with open(fname, 'w') as outf:
        outf.write('\n'.join(ingreds))

if __name__=="__main__":
    main()

【Question comments】:

Tags: python html parsing beautifulsoup

【Solution 1】:

import urllib2
import BeautifulSoup
import string

badwords = set([
    'cup','cups',
    'clove','cloves',
    'tsp','teaspoon','teaspoons',
    'tbsp','tablespoon','tablespoons',
    'minced'
])

def cleanIngred(s):
    # remove leading and trailing whitespace
    s = s.strip()
    # strip digits and punctuation from both ends of the string (not the middle)
    s = s.strip(string.digits + string.punctuation)
    # remove unwanted words
    return ' '.join(word for word in s.split() if not word in badwords)

def main():
    url = "http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx"
    data = urllib2.urlopen(url).read()
    bs = BeautifulSoup.BeautifulSoup(data)

    ingreds = bs.find('div', {'class': 'ingredients'})
    ingreds = [cleanIngred(s.getText()) for s in ingreds.findAll('li')]

    fname = 'PorkRecipe.txt'
    with open(fname, 'w') as outf:
        outf.write('\n'.join(ingreds))

if __name__=="__main__":
    main()

Result:

olive oil
chicken broth
garlic,
paprika
garlic powder
poultry seasoning
dried oregano
dried basil
thick cut boneless pork chops
salt and pepper to taste

I don't know why it leaves the comma in there; s.strip(string.punctuation) should have taken care of that.

【Comments】:

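Hey, this works! I don't know why it let the comma in either, but thanks for your help. I'm not very familiar with Python and have only been using it for about 2 weeks. So you set badwords as the stop words, then split the line and keep only the words that are not in badwords?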
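That reading matches what the generator expression in cleanIngred does. A minimal sketch of just the filtering step, assuming Python 3 and a shortened stop-word set; the helper name drop_badwords is made up for illustration and is not part of the answer:

```python
# Shortened stop-word set, for illustration only.
badwords = {'cup', 'cups', 'clove', 'cloves', 'minced'}

def drop_badwords(line):
    # split() breaks the line on whitespace; the generator keeps
    # only the words that are not in the stop-word set, and
    # ' '.join() glues the survivors back together.
    return ' '.join(word for word in line.split() if word not in badwords)

print(drop_badwords('cup olive oil'))         # olive oil
print(drop_badwords('cloves garlic minced'))  # garlic
```

The comparison is an exact, case-sensitive match on whole words, so a word with punctuation attached (such as "minced," with a trailing comma) would not be recognized as a stop word.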
strip only removes characters from the beginning or end of a string; when "garlic, minced" goes through strip, the comma is in the middle of it.
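One possible way to get rid of the interior comma, following from the point about strip, is to strip each word separately instead of the whole line. This is only a sketch, assuming Python 3; clean_ingred and STRIP_CHARS are illustrative names, not the answerer's code:

```python
import string

badwords = {
    'cup', 'cups', 'clove', 'cloves',
    'tsp', 'teaspoon', 'teaspoons',
    'tbsp', 'tablespoon', 'tablespoons',
    'minced',
}

STRIP_CHARS = string.digits + string.punctuation

def clean_ingred(s):
    # Strip digits and punctuation from both ends of every word,
    # so a comma attached to a word inside the line is removed too.
    words = (word.strip(STRIP_CHARS) for word in s.split())
    # Drop stop words and any word that stripped down to nothing (e.g. "1/4").
    return ' '.join(w for w in words if w and w not in badwords)

# strip() on the whole line only touches its two ends:
print(repr('4 cloves garlic, minced'.strip(STRIP_CHARS)))  # ' cloves garlic, minced'

# Stripping word by word removes the comma as well:
print(clean_ingred('4 cloves garlic, minced'))  # garlic
print(clean_ingred('1/4 cup olive oil'))        # olive oil
```

Punctuation inside a word (a hyphen in the middle, for example) is still left alone, since each word is only stripped at its ends.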