【问题标题】:How to QuerySelect using a recursive json object?如何使用递归 json 对象查询选择?
【发布时间】:2020-05-09 20:43:10
【问题描述】:

我正在构建一个网络爬虫。我有一个选择器的 json 对象,我需要对其进行迭代,以便抓取页面上的每个值并捕获数据。

我将如何创建一个 lodash 函数以递归方式遍历每个属性并执行标准函数以根据选择器获取元素内部文本值?

let startingJson = {

    address:"tbody > tr > td:nth-of-type(4) > p",
    comps: [{
        address: "tbody > tr:nth-of-type(1) > td:nth-of-type(1) > p",
        link: "tbody > tr:nth-of-type(1) > td:nth-of-type(1) > p > a",
        dateLastSold: "tbody > tr:nth-of-type(1) > td:nth-of-type(2) > p",
        value: "tbody > tr:nth-of-type(1) > td:nth-of-type(3) > p"
    },
    {
        address: "tbody > tr:nth-of-type(2) > td:nth-of-type(1) > p",
        link: "tbody > tr:nth-of-type(2) > td:nth-of-type(1) > p > a",
        dateLastSold: "tbody > tr:nth-of-type(2) > td:nth-of-type(2) > p",
        value: "tbody > tr:nth-of-type(2) > td:nth-of-type(3) > p"
    }]
}


let finalJsonExample = {

    address:"123 Main Street",
    comps: [{
        address: "234 Main Street",
        link: "abc.com",
        dateLastSold: "10/20/19",
        value: "100000"
    },
    {
        address: "345 Main Street",
        link: "def.com",
        dateLastSold: "10/21/19",
        value: "110000"
    }]
}

【问题讨论】:

  • 请向我们展示您的尝试。 SO 不是免费的代码编写服务。此处的目标是帮助您在尝试无法正常工作时修复自己的尝试……而不是为您完成所有工作

标签: javascript arrays lodash


【解决方案1】:

您只需要使用选择器递归地导航该对象,同时在您到达其中一个叶子时从页面中抓取元素,您可以分辨出来,因为它们的类型是string

为此,您可以使用https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/mapArray.prototype.reduce()Document.querySelector()

const startingJson = {
  address:"tbody > tr > td:nth-of-type(4) > p",
  bar: '#baz',
  comps: [{
    address: "tbody > tr:nth-of-type(1) > td:nth-of-type(1) > p",
    link: "tbody > tr:nth-of-type(1) > td:nth-of-type(1) > p > a",
    dateLastSold: "tbody > tr:nth-of-type(1) > td:nth-of-type(2) > p",
    value: "tbody > tr:nth-of-type(1) > td:nth-of-type(3) > p",
    qux: '#qux',
  }, {
    address: "tbody > tr:nth-of-type(2) > td:nth-of-type(1) > p",
    link: "tbody > tr:nth-of-type(2) > td:nth-of-type(1) > p > a",
    dateLastSold: "tbody > tr:nth-of-type(2) > td:nth-of-type(2) > p",
    value: "tbody > tr:nth-of-type(2) > td:nth-of-type(3) > p"
  }],
  list: [
    '.foo p',
    '.bar span',
  ],
};

function getElements(selectors) {
  if (typeof selectors === 'string') {
    // If `selectors` is a string, that's a selector we can use to grab an
    // element from the page. You might want to use document.querySelectorAll
    // instead, in case more than one element is returned, and handle that
    // appropriately:
    const element = document.querySelector(selectors);
    
    // Extract the `textContent` from that element, if any:
    return element ? (element.textContent || '') : null;
  }
  
  if (Array.isArray(selectors)) {
    // If we are passed an array, we need to recursively explore all the children
    // while filling in an array of the same size that we are going to return:
    return selectors.map(selector => getElements(selector));
  }
  
  if (typeof selectors === 'object') {
    // Lastly, if we are passed an object, we need to explore all its entries
    // and return an object with the same keys but with the values resolved using
    // this same function recursively:
    return Object.entries(selectors).reduce((elements, [key, value]) => {
      elements[key] = getElements(value);
      
      return elements;
    }, {}); 
  }
  
  return null;
}

console.log(getElements(startingJson));
.as-console-wrapper {
  max-height: 100% !important;
}
<div class="foo"><p>Foo<p></div>
<div class="bar"><span>Bar</span></div>
<div id="baz">Baz</div>
<div id="qux">Qux</div>

【讨论】:

    【解决方案2】:

    您可以使用 lodash 的 _.transform() 创建一个通用的遍历函数,它同时处理对象和数组。在对象上调用 traverse 时,传递一个函数 (fn) 来处理非对象值。

    在这种情况下,您可以在值上使用Document.querySelector(),然后从节点中提取文本内容。

    const recursiveTransform = (fn, val) => _.transform(val, (acc, v, key) => {  
      acc[key] = _.isObject(v) ? recursiveTransform(fn, v) : fn(v);
    });
    
    const startingJson = {"address":"tbody > tr > td:nth-of-type(4) > p","bar":"#baz","comps":[{"address":"tbody > tr:nth-of-type(1) > td:nth-of-type(1) > p","link":"tbody > tr:nth-of-type(1) > td:nth-of-type(1) > p > a","dateLastSold":"tbody > tr:nth-of-type(1) > td:nth-of-type(2) > p","value":"tbody > tr:nth-of-type(1) > td:nth-of-type(3) > p","qux":"#qux"},{"address":"tbody > tr:nth-of-type(2) > td:nth-of-type(1) > p","link":"tbody > tr:nth-of-type(2) > td:nth-of-type(1) > p > a","dateLastSold":"tbody > tr:nth-of-type(2) > td:nth-of-type(2) > p","value":"tbody > tr:nth-of-type(2) > td:nth-of-type(3) > p"}],"list":[".foo p",".bar span"]};
    
    const result = recursiveTransform(q => {
      const r = document.querySelector(q);
    
      return r && r.textContent;
    }, startingJson);
    
    console.log(result);
    .as-console-wrapper {
      max-height: 100% !important;
    }
    <script src="https://cdnjs.cloudflare.com/ajax/libs/lodash.js/4.17.15/lodash.js"></script>
    <div class="foo"><p>Foo<p></div>
    <div class="bar"><span>Bar</span></div>
    <div id="baz">Baz</div>
    <div id="qux">Qux</div>

    【讨论】:

      猜你喜欢
      • 1970-01-01
      • 1970-01-01
      • 1970-01-01
      • 2018-09-07
      • 2020-09-04
      • 1970-01-01
      • 1970-01-01
      • 2013-11-04
      • 1970-01-01
      相关资源
      最近更新 更多