【Question title】:How to turn bs4.element.Tag into JSON dictionary?
【Posted】:2019-02-05 23:55:29
【Question description】:

I'm using Beautiful Soup 4 to scrape HTML pages for recipes, and the application/ld+json script contains the following:

['\r\n{\r\n  "@context": "https://schema.org/",\r\n  "@type": "Recipe",\r\n  "name": "The College Boy",\r\n  "url": "https://www.bodybuilding.com/recipes/the-college-boy",\r\n  "author": {\r\n    "@type": "Person",\r\n    "name": "Matt Biss"\r\n  },\r\n  "image": [\r\n    "https://www.bodybuilding.com/images/2018/august/crockpot-4b-header-960x540.jpg",\r\n            "https://www.bodybuilding.com/images/2018/august/crockpot-4b-square-600x600.jpg"\r\n      ],\r\n  "datePublished": "2018-08-27 00:00:00.0",\r\n  "publisher": {\r\n    "@type": "Organization",\r\n    "name": "Bodybuilding.com",\r\n    "logo": {\r\n      "@type": "ImageObject",\r\n      "url": "https://www.bodybuilding.com/images/icons/bb-logo-clean.png",\r\n      "width": 666,\r\n      "height": 422\r\n    }\r\n  },\r\n  "description": "I call this the "College Boy" because of its simple preparation. No chopping, dicing, slicing, or any real work is needed. You need only be able to use a can opener and get the top off the jar, and several hours later you will end up with some high-quality belly stuffing.",\r\n  "prepTime": "PT10M",\r\n  "cookTime": "PT420M",\r\n  "totalTime": "PT430M",\r\n  "recipeYield": "4 servings",\r\n  "recipeCuisine": "American",\r\n  "keywords": "Crockpot",\r\n  "nutrition": {\r\n    "@type": "NutritionInformation",\r\n            "calories": "607 calories",\r\n                "carbohydrateContent": "23 g",\r\n                "proteinContent": "70 g",\r\n                "fatContent": "26 g",\r\n        "servingSize": "4 servings"\r\n  },\r\n  "recipeIngredient": [\r\n                        "4 piece chicken breast",                    "1 16 oz can black beans, drained and rinsed",                    "1 15 oz can corn",                    "8 oz cream cheese"              ],\r\n  "recipeInstructions": [\r\n          {\r\n        "@type": "HowToStep",\r\n        "text": "Place chicken breasts in the Crock-Pot. 
They can still be frozen if that is your style."\r\n      },          {\r\n        "@type": "HowToStep",\r\n        "text": "Drain cans of black beans and corn and add them into the cauldron."\r\n      },          {\r\n        "@type": "HowToStep",\r\n        "text": "Top it with your salsa, stir it up, and let it go!"\r\n      },          {\r\n        "@type": "HowToStep",\r\n        "text": "Slow cook for 7-8 hours on low, or 4-5 hours on high."\r\n      },          {\r\n        "@type": "HowToStep",\r\n        "text": "Save cream cheese until the food is nearly done; let it melt on top prior to serving."\r\n      }      ]\r\n}\r\n']

There are a lot of \r\n sequences and extra spacing. How can I clean this up into a dictionary so that I can access keys such as carbohydrateContent and recipeIngredient?

【Comments】:

Tags: python html json web-scraping beautifulsoup


【Solution 1】:

Use ast.literal_eval.

For example:

import re
import ast

l = ['\r\n{\r\n  "@context": "https://schema.org/",\r\n  "@type": "Recipe",\r\n  "name": "The College Boy",\r\n  "url": "https://www.bodybuilding.com/recipes/the-college-boy",\r\n  "author": {\r\n    "@type": "Person",\r\n    "name": "Matt Biss"\r\n  },\r\n  "image": [\r\n    "https://www.bodybuilding.com/images/2018/august/crockpot-4b-header-960x540.jpg",\r\n            "https://www.bodybuilding.com/images/2018/august/crockpot-4b-square-600x600.jpg"\r\n      ],\r\n  "datePublished": "2018-08-27 00:00:00.0",\r\n  "publisher": {\r\n    "@type": "Organization",\r\n    "name": "Bodybuilding.com",\r\n    "logo": {\r\n      "@type": "ImageObject",\r\n      "url": "https://www.bodybuilding.com/images/icons/bb-logo-clean.png",\r\n      "width": 666,\r\n      "height": 422\r\n    }\r\n  },\r\n  "description": "I call this the "College Boy" because of its simple preparation. No chopping, dicing, slicing, or any real work is needed. You need only be able to use a can opener and get the top off the jar, and several hours later you will end up with some high-quality belly stuffing.",\r\n  "prepTime": "PT10M",\r\n  "cookTime": "PT420M",\r\n  "totalTime": "PT430M",\r\n  "recipeYield": "4 servings",\r\n  "recipeCuisine": "American",\r\n  "keywords": "Crockpot",\r\n  "nutrition": {\r\n    "@type": "NutritionInformation",\r\n            "calories": "607 calories",\r\n                "carbohydrateContent": "23 g",\r\n                "proteinContent": "70 g",\r\n                "fatContent": "26 g",\r\n        "servingSize": "4 servings"\r\n  },\r\n  "recipeIngredient": [\r\n                        "4 piece chicken breast",                    "1 16 oz can black beans, drained and rinsed",                    "1 15 oz can corn",                    "8 oz cream cheese"              ],\r\n  "recipeInstructions": [\r\n          {\r\n        "@type": "HowToStep",\r\n        "text": "Place chicken breasts in the Crock-Pot. 
They can still be frozen if that is your style."\r\n      },          {\r\n        "@type": "HowToStep",\r\n        "text": "Drain cans of black beans and corn and add them into the cauldron."\r\n      },          {\r\n        "@type": "HowToStep",\r\n        "text": "Top it with your salsa, stir it up, and let it go!"\r\n      },          {\r\n        "@type": "HowToStep",\r\n        "text": "Slow cook for 7-8 hours on low, or 4-5 hours on high."\r\n      },          {\r\n        "@type": "HowToStep",\r\n        "text": "Save cream cheese until the food is nearly done; let it melt on top prior to serving."\r\n      }      ]\r\n}\r\n']

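for i in l:
    # Swap the outer double quotes of each "key": "value" pair for single quotes,
    # then let ast.literal_eval parse the result as a Python dict literal.
    print(ast.literal_eval(re.sub(r'(:\s*\"(.*)\")', r":'\2'", i)))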
  • Note that I am using a regex to replace the outer double quotes with single quotes, because some of your values contain nested double quotes. For example: 'description': "I call this the "College Boy" because of its simple preparation. No chopping, dicing, slicing, or any real work is needed. You need only be able to use a can opener and get the top off the jar, and several hours later you will end up with some high-quality belly stuffing."
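The single-quote swap above would break on any value that contains an apostrophe, and ast.literal_eval does not understand JSON's true, false, or null. As a rough alternative sketch, the nested quotes can be backslash-escaped so that json.loads accepts the text. This assumes the script text has already been pulled out of the tag as a plain string (for example via the tag's .string attribute) and that every "key": "value" pair sits on its own line, as it does in the question. The names PAIR, escape_inner_quotes, and load_ld_json are hypothetical helpers, and raw is a shortened stand-in for the real data.

```python
import json
import re

# Matches one `"key": "value"` pair that fills a single line, with an optional
# trailing comma. The greedy (.*) runs to the last quote on the line, so any
# quotes nested inside the value land in group 2.
PAIR = re.compile(r'^(\s*"[^"]+":\s*")(.*)("\s*,?\s*)$')


def escape_inner_quotes(text):
    """Backslash-escape double quotes nested inside single-line string values."""
    fixed = []
    for line in text.splitlines():
        m = PAIR.match(line)
        if m:
            # Escape only quotes that are not already preceded by a backslash.
            value = re.sub(r'(?<!\\)"', r'\\"', m.group(2))
            line = m.group(1) + value + m.group(3)
        fixed.append(line)
    return "\n".join(fixed)


def load_ld_json(text):
    """Try strict JSON first; repair nested quotes only if parsing fails."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return json.loads(escape_inner_quotes(text))


# Shortened, hypothetical stand-in for the script text in the question.
raw = (
    '\r\n{\r\n'
    '  "name": "The College Boy",\r\n'
    '  "description": "I call this the "College Boy" because of its simple preparation.",\r\n'
    '  "nutrition": {\r\n    "carbohydrateContent": "23 g"\r\n  },\r\n'
    '  "recipeIngredient": [\r\n    "4 piece chicken breast",    "8 oz cream cheese"\r\n  ]\r\n'
    '}\r\n'
)

recipe = load_ld_json(raw)
print(recipe["nutrition"]["carbohydrateContent"])
print(recipe["recipeIngredient"])
```

Trying json.loads first means well-formed pages are parsed untouched, and the line-based repair only runs when the strict parse fails. The repair is a heuristic: it would not help if a broken value spanned several lines.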

【Comments】:

    【Solution 2】:

    Welcome to the community.

    Use strip() when extracting the name/url values from the HTML to drop the unwanted characters.

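    # output is the raw text extracted from the HTML; strip(chars) only removes
    # the given characters from the start and end of the string, not the middle
    name = output.strip("\r")
    url = output.strip("\n")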
    

    Then use them in a dict/JSON.
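As a small illustration of the strip() idea, the sketch below assumes output holds the whole script text as one string. The sample value is a shortened, hypothetical stand-in for the real page. Calling strip() with no arguments removes "\r" and "\n" together, and json.loads already skips whitespace around and between JSON tokens, so the stripped and unstripped text parse to the same dictionary.

```python
import json

# Hypothetical stand-in for the text pulled out of the script tag.
output = (
    '\r\n{\r\n'
    '  "name": "The College Boy",\r\n'
    '  "url": "https://www.bodybuilding.com/recipes/the-college-boy"\r\n'
    '}\r\n'
)

# strip() with no arguments removes all leading and trailing whitespace,
# including both "\r" and "\n"; whitespace inside the text is left alone.
cleaned = output.strip()

data = json.loads(cleaned)
name = data["name"]
url = data["url"]

# json.loads ignores whitespace around and between tokens on its own,
# so parsing the unstripped text gives the same result.
same = json.loads(output)

print(name)
print(url)
print(data == same)
```

This only works when the text is valid JSON. A value with unescaped nested double quotes, like the description in the question, would still need repairing before json.loads can parse it.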

    【Comments】:
