【Question title】: python requests splitting certain data mismatch
【Posted】: 2020-01-14 07:03:05
【Question】:

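I am trying to scrape data from a website, but for some URLs the make and model come out wrong.

For a Honda Civic I get:

make = honda

model = civic

For a Land Rover I get:

make = land

model = rover

when it should be:

make = landrover model = rangerover

Here is what I tried: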

scala.txt:

https://www.redbook.com.au/cars/details/2019-honda-civic-50-years-edition-auto-my19/SPOT-ITM-524208
https://www.redbook.com.au/cars/details/2019-holden-astra-rs-black-edition-bk-auto-my19/SPOT-ITM-524534
http://www.redbook.com.au/cars/research/used/details/2014-land-rover-range-rover-evoque-ed4-pure-tech-manual-my15/SPOT-ITM-410126
http://www.redbook.com.au/cars/research/used/details/2014-land-rover-range-rover-evoque-sd4-pure-tech-auto-4x4-my15/SPOT-ITM-410136

from lxml import html
import requests


cars = []
with open('scala.txt') as f:

    urls = f.read().splitlines()
for url in urls: 

    car_data={}
    headers = {'User-Agent':'Mozilla/5.0'}
    page = (requests.get(url, headers=headers))
    tree = html.fromstring(page.content)

    car_data['url']=url
    if tree.xpath('//h1[@class="details-title"]/text()')[0]:
        full_car_name = tree.xpath('//h1[@class="details-title"]/text()')[0]
        car_data['naming'] = full_car_name
        print(full_car_name)
    car_data['id'] = url.split("SPOT-ITM-")[1].replace("/", "")
    car_data['year'] = full_car_name.split(" ")[0]
    car_data['make'] = full_car_name.split(" ")[1]
    car_data['model']= full_car_name.split(" ")[2]
    cars.append(car_data)

The first two URLs are fine, but from the third URL on the make and model span more than one word each.

Output:

{'id': '524208',
  'make': 'Honda',
  'model': 'Civic',
  'naming': '2019 Honda Civic 50 Years Edition Auto MY19',
  'url': 'https://www.redbook.com.au/cars/details/2019-honda-civic-50-years-edition-auto-my19/SPOT-ITM-524208',
  'year': '2019'}


{'id': '410136',
  'make': 'Land',
  'model': 'Rover',
  'naming': '2014 Land Rover Range Rover Evoque SD4 Pure Tech Auto 4x4 MY15',
  'url': 'http://www.redbook.com.au/cars/research/used/details/2014-land-rover-range-rover-evoque-sd4-pure-tech-auto-4x4-my15/SPOT-ITM-410136',
  'year': '2014'}

For the Land Rover, make should be Land Rover and model should be Range Rover.

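【Comments】:

  • Please edit your post to include the full traceback.
  • @buran Added the traceback.
  • The third link has no image. You need to handle that case, for example with try/except.
  • @buran If I run 100 URLs I get index out of range, but if I run a short list I don't. Why?
  • You get index out of range when you try to parse an element that does not exist. In this case you try to get the image href, but the third URL has no image. You run the same risk if the year, make, or model is missing. Whether you hit the error on a given batch of URLs depends on what those pages contain. If you are lucky, everything you look for is there, but that is not always the case.

Tags: python selenium beautifulsoup python-requests lxml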


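【Solution 1】:

Try using try/except. Some pages have no img element, so when your code takes image_url from index [0] there is nothing there. You are effectively asking for the first element of an empty list.

Skeleton of a try/except: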

try:
    <code to do something>
    <code>
    <more code>
    ...
except:
    <code to do something if the try fails/throws errors>
    ...
    ...

The same applies to the image:

...    

car_data={}
headers = {'User-Agent':'Mozilla/5.0'}
page = (requests.get(url, headers=headers))
tree = html.fromstring(page.content)

try:
    img_urls = tree.xpath('//div[@class="r-module"]/div[@class="csn-results"]/div[@class="content"]/a[@class="item"]//div[@class="photos"]//img/@src')
    # [0] raises IndexError when the page has no matching image.
    img_url = str(tree.xpath('//ul/li/a/img/@src')[0])
except IndexError:
    img_url = 'N/A'

    ...
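As an alternative to wrapping every lookup in try/except, the empty-list case can be checked directly. The sketch below is only an illustration: first_or_default is a hypothetical helper name, and the two lists stand in for what tree.xpath(...) would return on a page with and without an image.

```python
def first_or_default(results, default="N/A"):
    """Return the first item of an XPath result list, or `default` if the list is empty."""
    return results[0] if results else default


# Stand-ins for tree.xpath('//ul/li/a/img/@src') on two different pages (made-up values).
with_image = ["https://example.com/car.jpg"]
without_image = []

print(first_or_default(with_image))
print(first_or_default(without_image))
```

The same helper can be reused for the title, year, or any other field whose element may be missing.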

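Here is some more help with fixing your JSON key/value pairs. You get these results because you split on whitespace. In the page text the name is land rover range rover, not landrover rangerover. So once the year is set aside, the split gives ['land', 'rover', 'range', 'rover', ...], and you take the first two of those words, that is 'land' and 'rover'.

If the text were 'landrover rangerover', you would get what you want: the split would give ['landrover', 'rangerover'], so taking the first two words would work as intended.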
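The whitespace problem can be reproduced without any network access. The sketch below first shows the naive split, then one possible workaround: checking the name against a list of multi-word makes. KNOWN_MULTIWORD_MAKES and split_name are hypothetical names, and the list is an assumed, incomplete sample rather than anything provided by the site. Because the title alone does not say where the model ends, the function returns everything after the make as one string.

```python
# Assumed, incomplete sample of makes that contain a space.
KNOWN_MULTIWORD_MAKES = ("Land Rover", "Alfa Romeo", "Aston Martin")


def split_name(full_name):
    """Split 'YEAR MAKE REST' into (year, make, rest), keeping multi-word makes whole."""
    year, rest = full_name.split(" ", 1)
    for make in KNOWN_MULTIWORD_MAKES:
        if rest.startswith(make + " "):
            return year, make, rest[len(make) + 1:]
    # Fall back to treating the first word as the make.
    make, _, remainder = rest.partition(" ")
    return year, make, remainder


title = "2014 Land Rover Range Rover Evoque SD4 Pure Tech Auto 4x4 MY15"
print(title.split(" ")[1:3])   # the naive approach from the question
print(split_name(title))
```

A lookup list like this has to be maintained by hand, which is one reason reading the make and model from structured data on the page is usually more robust.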

import requests
from bs4 import BeautifulSoup as bs
import re
import json


cars = []
with open('scala.txt') as f:

    urls = f.read().splitlines()


headers = {'User-Agent':'Mozilla/5.0'}

for url in urls:
    page = requests.get(url, headers=headers)
    soup = bs(page.content, 'html.parser')

    # The page embeds its metadata as JSON in a script that assigns CsnInsights.metaData.
    script = soup.find('script', text=re.compile("CsnInsights.metaData"))
    # Keep what follows the assignment and drop the trailing semicolon before parsing.
    jsonData = json.loads(script.text.split('CsnInsights.metaData = ')[-1].rsplit(';',1)[0])

    # make and model come straight from the JSON, so multi-word names stay intact.
    make = jsonData['make']
    model = jsonData['model']
    car_id = jsonData['networkid'].rsplit('-',1)[-1]

    # The heading text starts with the year; the rest is the display name.
    heading = soup.find('div', class_='heading').text
    year, naming = heading.split(' ',1)

    car_data = {'id':car_id,
                'make':make,
                'model':model,
                'naming':naming,
                'url':url,
                'year':year}

    cars.append(car_data)
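The string handling that pulls the JSON out of the script text can be tried in isolation. The sketch below is hedged: parse_metadata is a hypothetical helper, and the sample string is made up to resemble the assignment the code above looks for, not copied from the real page. It returns None instead of raising when the marker is absent or the payload is not valid JSON.

```python
import json


def parse_metadata(script_text, marker="CsnInsights.metaData = "):
    """Return the JSON object assigned after `marker`, or None if it cannot be parsed."""
    if marker not in script_text:
        return None
    # Keep what follows the marker, then drop the trailing semicolon and anything after it.
    payload = script_text.split(marker)[-1].rsplit(";", 1)[0]
    try:
        return json.loads(payload)
    except json.JSONDecodeError:
        return None


# Made-up sample resembling the embedded script; the real page may differ.
sample = ('var CsnInsights = {}; CsnInsights.metaData = '
          '{"make": "Land Rover", "model": "Range Rover Evoque", '
          '"networkid": "SPOT-ITM-410126"};')

meta = parse_metadata(sample)
print(meta["make"], "|", meta["model"], "|", meta["networkid"].rsplit("-", 1)[-1])
```

Checking for None before indexing lets the loop skip pages whose script is missing, instead of stopping on the first one.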

Output:

print(json.dumps(cars, indent=4))

[
    {
        "id": "524208",
        "make": "Honda",
        "model": "Civic",
        "naming": "Honda Civic VTi-S Auto MY19",
        "url": "https://www.redbook.com.au/cars/details/2019-honda-civic-50-years-edition-auto-my19/SPOT-ITM-524208",
        "year": "2019"
    },
    {
        "id": "524534",
        "make": "Holden",
        "model": "Astra",
        "naming": "Holden Astra RS BK Auto MY19",
        "url": "https://www.redbook.com.au/cars/details/2019-holden-astra-rs-black-edition-bk-auto-my19/SPOT-ITM-524534",
        "year": "2019"
    },
    {
        "id": "410126",
        "make": "Land Rover",
        "model": "Range Rover Evoque",
        "naming": "Land Rover Range Rover Evoque SD4 Pure Manual 4x4 MY14",
        "url": "http://www.redbook.com.au/cars/research/used/details/2014-land-rover-range-rover-evoque-ed4-pure-tech-manual-my15/SPOT-ITM-410126",
        "year": "2014"
    },
    {
        "id": "410136",
        "make": "Land Rover",
        "model": "Range Rover Evoque",
        "naming": "Land Rover Range Rover Evoque SD4 Pure Tech Manual 4x4 MY15",
        "url": "http://www.redbook.com.au/cars/research/used/details/2014-land-rover-range-rover-evoque-sd4-pure-tech-auto-4x4-my15/SPOT-ITM-410136",
        "year": "2014"
    }
]

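【Discussion】:

  • How do I write multiple exceptions? Can you help me? Say, two more except cases and two more XPaths. Show me the skeleton.
  • The skeleton is the same as the example I gave above.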
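For the question about multiple exceptions, here is a minimal sketch of two common shapes: one try with several except clauses, and one try/except per field so that a missing image does not discard the title. The function names and input lists are hypothetical stand-ins for XPath results.

```python
def parse_year(titles):
    """One try block with two except clauses, one per failure mode."""
    try:
        return int(titles[0].split(" ")[0])
    except IndexError:
        # The XPath matched nothing, so the list is empty.
        return None
    except ValueError:
        # The first word of the title is not a number.
        return None


def extract_fields(titles, images):
    """One try/except per field, so each lookup fails independently."""
    record = {}
    try:
        record["title"] = titles[0]
    except IndexError:
        record["title"] = "N/A"
    try:
        record["img_url"] = str(images[0])
    except IndexError:
        record["img_url"] = "N/A"
    return record


print(parse_year(["2014 Land Rover Range Rover Evoque"]))
print(parse_year([]))
print(extract_fields(["2014 Land Rover Range Rover Evoque"], []))
```

Naming the exception type, rather than using a bare except, keeps unrelated errors such as a typo in a variable name visible.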