【Question title】: Python regex: re.findall method throwing list index out of range error
【Posted】: 2021-12-28 15:37:23
【Question】:

I am learning web scraping with Python regular expressions and practicing with the following script (source), but when I run it, it throws IndexError: list index out of range

import re
import json
import requests

url = 'https://www.att.com/buy/phones/'
html_text = requests.get(url).text

data = json.loads(re.findall(r'__NEXT_DATA__ = (.*?});', html_text)[0])
print(json.dumps(data['props']['pageProps']['deviceList'], indent=4))

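【Comments】:

  • Your expression is not being found: __NEXT_DATA.... in html_text brings back an empty list.
  • Try soup.select_one("[id='__NEXT_DATA__']").get_text(strip=True). There are three occurrences of __NEXT_DATA__ on that page, and your pattern fails to find the right one.
  • findall is not throwing that error. As Cameron said, your expression is not found, so findall returns an empty list. How do you expect to get [0] from an empty list? It is important to understand what an error traceback means and how to isolate the problem on a line that contains several expressions. How to debug small programs.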
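The point made in the comments can be reproduced offline. The sketch below is only an illustration: `sample_html` and `legacy_html` are made-up stand-ins for the page (the real markup may differ), and `first_json_match` is a hypothetical helper, not part of `re` or `json`. It shows that `re.findall` returns an empty list when the pattern is absent, and that checking the match before indexing avoids the IndexError.

```python
import json
import re

# The pattern from the question: expects an assignment "__NEXT_DATA__ = {...};"
pattern = r'__NEXT_DATA__ = (.*?});'

# Made-up stand-ins for the page source (assumptions, not the real AT&T markup).
sample_html = '<script id="__NEXT_DATA__" type="application/json">{"props": {}}</script>'
legacy_html = 'window.__NEXT_DATA__ = {"props": {"pageProps": {"deviceList": []}}};'

# The assignment form is absent from sample_html, so findall returns an empty
# list; indexing it with [0] is what raises IndexError.
matches = re.findall(pattern, sample_html)
print(matches)


def first_json_match(pattern, text):
    """Return the first match parsed as JSON, or None when nothing matches."""
    found = re.search(pattern, text, flags=re.DOTALL)
    if found is None:
        return None
    return json.loads(found.group(1))


print(first_json_match(pattern, sample_html))  # nothing to parse
print(first_json_match(pattern, legacy_html))  # parsed dict
```

Returning None (or raising an error with a clear message) makes the failing step visible, instead of hiding it inside a single line that chains `findall`, `[0]` and `json.loads`.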

Tags: python json web-scraping


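【Answer 1】:

The problem you are facing is a direct result of how the web changes. Websites are not static, so a solution from 2019 may no longer work. Rather than using a custom regular expression to find the JSON, I suggest using Beautiful Soup (bs4) for a more robust script.

The code below should give you what you want: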

import json
import requests
from bs4 import BeautifulSoup

url = 'https://www.att.com/buy/phones/'
html_text = requests.get(url).text
soup = BeautifulSoup(html_text)
data = json.loads(soup.find('script', id='__NEXT_DATA__').text)
print(json.dumps(data['props']['initialReduxState']['solr']['deviceList'], indent=4))  

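Code explanation

The requests library fetches the raw HTML text from the given URL, and we parse it with bs4. Because no parser is named, bs4 picks the best one installed (lxml when it is available). We then use the find function to locate the script whose id is '__NEXT_DATA__' and read the JSON text inside it. Finally, we parse that text with the json library and read 'deviceList' from its new location. For more on bs4, see the documentation at https://www.crummy.com/software/BeautifulSoup/bs4/doc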
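For readers who want to try the "find the script by id, then parse its JSON" step without network access or bs4, here is a minimal sketch that swaps in the standard library's `html.parser.HTMLParser`. It is not the answer's method. `ScriptById`, `extract_script_json` and `sample_html` are hypothetical names and data made up for illustration; only the key path is taken from the answer's code.

```python
import json
from html.parser import HTMLParser


class ScriptById(HTMLParser):
    """Collect the text of the first <script> whose id equals target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.found = False    # True once a matching <script> has been seen
        self._inside = False  # True while we are inside that script
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and not self.found and dict(attrs).get("id") == self.target_id:
            self.found = True
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


def extract_script_json(html_text, script_id="__NEXT_DATA__"):
    """Return the parsed JSON of the matching script, or None if there is none."""
    parser = ScriptById(script_id)
    parser.feed(html_text)
    parser.close()
    if not parser.found:
        return None
    return json.loads("".join(parser.chunks))


# Made-up page fragment (an assumption); the key path mirrors the answer's code.
sample_html = """
<html><body>
<script>var other = 1;</script>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"initialReduxState": {"solr": {"deviceList": [{"model": "iPhone 13"}]}}}}
</script>
</body></html>
"""

data = extract_script_json(sample_html)
devices = data["props"]["initialReduxState"]["solr"]["deviceList"]
print(devices[0]["model"])
```

Selecting the element by its id avoids depending on how the JSON is embedded in the page text, which is what broke the regex in the question.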

First entry of the long JSON output

{
        "firstNet": "notApplicable",
        "productFamily": "Phn13",
        "comingSoon": false,
        "skuId": "sku2360531",
        "brand": "Apple",
        "displayContentItems": [],
        "deviceGroup": "network",
        "starRatings": 4.5962,
        "numOfStarReviews": 2959,
        "mobileImageUrl": [
            "/idpassets/global/devices/phones/apple/apple-iphone-13/defaultimage/pink-hero-zoom.png?imwidth=219"
        ],
        "largeImageURL": "//www.att.com/catalog/en/skus/images/apple-iphone%2013-pink-450x350.png",
        "model": "iPhone 13",
        "productName": "Apple iPhone 13",
        "billCode": "6164D",
        "name": "jared",
        "PDPPageURL": [
            "/buy/phones/apple-iphone-13-128gb-pink.html"
        ],
        "prepaid": "",
        "productURL": "//www.att.com/cellphones/iphone/apple-iphone-13.html#sku=sku2360531",
        "condition": "New",
        "productId": "prod10340592",
        "htmlColor": "#FADDD7",
        "isPrepaid": false,
        "isRefurbished": false,
        "isPreOwned": false,
        "isPrePreOrderable": false,
        "type": "Device",
        "color": "Pink",
        "FinalPriceIRU": 22.23,
        "FinalPriceCRU": 22.23,
        "FinalPlanType": "monthly",
        "FinalPrice": 22.23,
        "FinalnextUpCharge": [
            0
        ],
        "FinalIRUnextUpCharge": [
            0
        ],
        "FinalCRUnextUpCharge": [
            0
        ],
        "FinalCommitmentTerm": "NE36MNUP",
        "FinalCommitmentTermCRU": "NE36MNUP",
        "FinalCommitmentTermIRU": "NE36MNUP",
        "FinalBasePriceCRU": 22.23,
        "FinalBasePriceIRU": 22.23,
        "FinalPlanTypeCRU": "monthly",
        "FinalPlanTypeIRU": "monthly",
        "FinalBasePrice": 22.23,
        "FinalTermLength": 36,
        "FinalTermLengthIRU": 36,
        "FinalTermLengthCRU": 36,
        "consumerOfferDescription": "$0 w/Trade",
        "cruOfferDescription": "$0 w/Trade",
        "iruOfferDescription": "$0 w/Trade",
        "consumerOfferDescriptionAL": "$0 w/Trade",
        "consumerOfferDescriptionUP": "$0 w/Trade",
        "iruOfferDescriptionAL": "$0 w/Trade",
        "iruOfferDescriptionUP": "$0 w/Trade",
        "cruOfferDescriptionAL": "$0 w/Trade",
        "cruOfferDescriptionUP": "$0 w/Trade",
        "allProductIds": [
            "prod10340592",
            "prod10340591",
            "prod10340593"
        ],
        "allSkuIds": [
            "sku2360531",
            "sku2360535",
            "sku2360534",
            "sku2360527",
            "sku2360528",
            "sku2360530",
            "sku2360529",
            "sku2360537",
            "sku2360526",
            "sku2360536",
            "sku2360533",
            "sku10940263",
            "sku10940264",
            "sku10940268",
            "sku10940269"
        ],
        "allBillCodes": [
            "6164D",
            "6166D",
            "6162D",
            "6165D",
            "6163D",
            "6169D",
            "6171D",
            "6167D",
            "6170D",
            "6168D",
            "6174D",
            "6176D",
            "6172D",
            "6175D",
            "6173D"
        ],
        "tradeInLegalModalPath": "/idpassets/fragment/legal/prod/legalcontent/wireless/offers/19900012/19900012_offertray_lm.cmsfeed.js",
        "tradeInLegalText": "Req\u2019s elig. unlimited (speed restr\u2019s apply) & trade-in. Price after 36 mo. credits. Other terms apply. ",
        "tradeInShortLegalLinkLabel": "See offer details",
        "tradeInPromoReference": "19900012",
        "tradeInMonthlyPromoPrice": "0",
        "tradeInLegalModalPathCRU": "/idpassets/fragment/legal/prod/legalcontent/wireless/offers/19900012/19900012_offertray_lm.cmsfeed.js",
        "tradeInLegalTextCRU": "Req\u2019s elig. unlimited (speed restr\u2019s apply) & trade-in. Price after 36 mo. credits. Other terms apply. ",
        "tradeInShortLegalLinkLabelCRU": "See offer details",
        "tradeInPromoReferenceCRU": "19900012",
        "tradeInMonthlyPromoPriceCRU": "0"
    }
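Since 'deviceList' moved from `props.pageProps` (in the question's script) to `props.initialReduxState.solr` (in this answer), a lookup that does not hard-code the path may survive the next layout change. The sketch below is one possible approach, not something the answer proposes. `find_key` is a hypothetical helper, and the two sample dictionaries are trimmed-down assumptions modeled on the two paths above.

```python
def find_key(node, key):
    """Depth-first search for the first value stored under `key`.

    Walks nested dicts and lists. Returns None when the key is absent, so a
    value that is literally None cannot be told apart from "not found".
    """
    if isinstance(node, dict):
        if key in node:
            return node[key]
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        result = find_key(child, key)
        if result is not None:
            return result
    return None


# Trimmed-down stand-ins for the old and new page data (assumptions).
old_layout = {"props": {"pageProps": {"deviceList": [{"model": "iPhone 13"}]}}}
new_layout = {"props": {"initialReduxState": {"solr": {"deviceList": [{"model": "iPhone 13"}]}}}}

print(find_key(old_layout, "deviceList"))
print(find_key(new_layout, "deviceList"))
print(find_key(new_layout, "missingKey"))
```

The trade-off is precision: if the same key name appears in several places, this returns whichever one the traversal reaches first, so an explicit path is still safer when the structure is known.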

【Comments】:

  • Thanks, but I get `raise JSONDecodeError("Expecting value", s, err.value) from None json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)` from your solution. Please take a look.
  • json is part of the Python standard library (docs.python.org/3/library/json.html), so I would suggest that the difference in the code's behavior comes from your Python version. I tested the code snippet on Python 3.7.12.
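The message "Expecting value: line 1 column 1 (char 0)" is what `json.loads` reports when its input is empty or does not start with a JSON value, so it can help to inspect the extracted text before parsing it. The sketch below illustrates that check. `load_json_or_explain` is a hypothetical helper, and the thread does not establish why the text was empty or non-JSON in the commenter's case.

```python
import json


def load_json_or_explain(raw_text):
    """Parse raw_text as JSON, raising a more descriptive error on failure."""
    if raw_text is None or not raw_text.strip():
        raise ValueError("no JSON text was extracted (empty string)")
    try:
        return json.loads(raw_text)
    except json.JSONDecodeError as err:
        preview = raw_text.strip()[:60]
        raise ValueError(f"text is not JSON at char {err.pos}: {preview!r}") from err


# An empty string and an HTML fragment both fail at position 0 in json.loads.
for bad_input in ("", "<html>blocked</html>"):
    try:
        json.loads(bad_input)
    except json.JSONDecodeError as err:
        print(repr(bad_input), "->", err)

print(load_json_or_explain('{"props": {}}'))
```

Printing the first characters of the extracted text (or of the whole response) usually shows quickly whether the script tag was empty or the server returned a different page from the one expected.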