【Question title】: Web scraping: Expand/contract bounding box depending on results
【Posted】: 2016-02-15 01:18:23
【Question description】:

A client wants to know where their competitor's stores are located, so I am being quasi-evil and scraping the competitor's website.

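The server accepts a bounding box (i.e. lower-left and upper-right coordinates) as parameters and returns the locations found within that box. This part works fine: given a bounding box, I can successfully retrieve the store locations.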

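The problem is that only the first 10 locations within the bounding box are returned, so in densely populated areas a 10-degree bounding box contains far more locations than the 10 that come back.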

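I could always use a smaller bounding box, but I am trying to avoid hitting the server unnecessarily while still making sure that all stores are returned.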

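So I need a way to reduce the size of the search rectangle when 10 stores are found (because there may be more than 10), search recursively with the smaller rectangle size, and then revert to the larger rectangle for the next grid cell.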

I have written a function that retrieves the stores from the server given a bounding box:

stores = checkForStores(bounding_box)
if len(stores) >= 10:
  # There are too many stores. Search again with a smaller bounding box
  pass
else:
  # Everything is good - process these stores
  pass

But I am struggling with how to set the appropriate bounding box for the checkForStores function.

I have tried setting up the main grid cells with for loops over latitude and longitude:

cellsize = 10
for minLat in range(-40, -10, cellsize):
    for minLng in range(110, 150, cellsize):
        maxLat = minLat + cellsize
        maxLng = minLng + cellsize

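...but I don't know how to continue the search with a smaller bounding box if 10 stores are found. I also tried using a while loop, but I couldn't get either approach to work.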

Thanks for any advice or pointers on where to start.

【Question discussion】:

    Tags: python for-loop recursion bounding-box


    【Solution 1】:

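    Here is how you can do it using recursion. The code should be self-explanatory, but this is how it works: given a bounding box, it checks the number of stores in it; if that number is greater than or equal to 10, it divides the box into smaller ones and calls itself with each new bounding box. It keeps doing this until a box returns fewer than 10 stores, at which point the stores found are simply appended to a list.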

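    Note: because recursion is used, the maximum recursion depth could in theory be exceeded. In your case that is unlikely: even if you pass in a bounding box of 40,000 x 40,000 km, it takes only about 15 steps to reach a bounding box of roughly 1 x 1 km with cell_axis_reduction_factor=2: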

    In [1]: import math
    
    In [2]: math.log(40000, 2)
    Out[2]: 15.287712379549449
    

    In any case, if that does happen, you can try increasing cell_axis_reduction_factor.
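
To make the effect of the reduction factor concrete, here is a minimal sketch that counts how many splits are needed before a cell side drops to a target length. The helper name `splits_needed` is hypothetical and not part of the answer's code. Note that 15 halvings of 40,000 km leave a side of about 1.22 km (the "roughly 1 x 1 km" figure above), so getting to 1 km or less takes a 16th.

```python
def splits_needed(start_km, target_km, factor=2):
    """Count how many times a side of start_km must be divided by
    factor before it is no longer than target_km.

    Hypothetical helper for illustration only.
    """
    size = float(start_km)
    splits = 0
    while size > target_km:
        size /= factor
        splits += 1
    return splits


# Compare a few reduction factors for a 40,000 km side and a 1 km target.
for factor in (2, 4, 10):
    print(factor, splits_needed(40000, 1, factor))
```

A larger factor reaches small cells in fewer levels, but each split then issues `factor**2` requests, so it trades recursion depth for more server calls per level.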

    Another note: according to PEP 8, function names in Python should be lowercase with underscores, so I renamed the checkForStores function to check_for_stores.

    import random

    # Save visited boxes. For debugging purposes only.
    visited_boxes = []
    
    
    def check_for_stores(bounding_box):
        """Mock of the real `check_for_stores` function: returns a
        random-length list of placeholder "stores".
        """
        count = random.randint(1, 12)
        print('Found {} stores for bounding box {}.'.format(count, bounding_box))
        visited_boxes.append(bounding_box)
        return ['store'] * count
    
    
    def split_bounding_box(bounding_box, cell_axis_reduction_factor=2):
        """Returns a generator of bounding boxes obtained by splitting
        the parent `bounding_box`.
    
        :param bounding_box: tuple of two coordinate tuples, the lower-left
              and upper-right corners,
              e.g. ((0, 5.2), (20.5, 14.0))
        :param cell_axis_reduction_factor: number of parts each axis is
                                           divided into, so the generator
                                           yields
                                           `cell_axis_reduction_factor`**2 boxes
        :return: generator of bounding box coordinates
    
        """
        box_lc, box_rc = bounding_box
        box_lc_x, box_lc_y = box_lc
        box_rc_x, box_rc_y = box_rc
    
        # float() keeps the division exact for integer inputs on Python 2
        cell_width = (box_rc_x - box_lc_x) / float(cell_axis_reduction_factor)
        cell_height = (box_rc_y - box_lc_y) / float(cell_axis_reduction_factor)
    
        for x_factor in range(cell_axis_reduction_factor):
            lc_x = box_lc_x + cell_width * x_factor
            rc_x = lc_x + cell_width
    
            for y_factor in range(cell_axis_reduction_factor):
                lc_y = box_lc_y + cell_height * y_factor
                rc_y = lc_y + cell_height
    
                yield ((lc_x, lc_y), (rc_x, rc_y))
    
    
    def get_stores_in_box(bounding_box, result=None):
        """Returns the list of stores found in the provided `bounding_box`.
    
        If 10 or more stores are found in `bounding_box`, the result may be
        truncated, so the box is split into smaller boxes and each of them
        is checked recursively.
    
        :param bounding_box: tuple of two coordinate tuples, the lower-left
              and upper-right corners,
              e.g. ((0, 5.2), (20.5, 14.0))
        :param result: list that found stores are appended to;
                       used for recursive calls
        :return: list with found stores
    
        """
        if result is None:
            result = []
    
        print('Checking for stores...')
        stores = check_for_stores(bounding_box)
        if len(stores) >= 10:
            print('Stores number is more than or equal 10. Splitting bounding box...')
            for splitted_box_coords in split_bounding_box(bounding_box):
                get_stores_in_box(splitted_box_coords, result)
        else:
            print('Stores number is less than 10. Saving results.')
            result.extend(stores)
    
        return result
    
    
    stores = get_stores_in_box(((0, 1), (30, 20)))
    print('Found {} stores in total'.format(len(stores)))
    print('Visited boxes: ')
    print(visited_boxes)
    

    Here is an example of the output:

    Checking for stores...
    Found 10 stores for bounding box ((0, 1), (30, 20)).
    Stores number is more than or equal 10. Splitting bounding box...
    Checking for stores...
    Found 4 stores for bounding box ((0.0, 1.0), (15.0, 10.5)).
    Stores number is less than 10. Saving results.
    Checking for stores...
    Found 4 stores for bounding box ((0.0, 10.5), (15.0, 20.0)).
    Stores number is less than 10. Saving results.
    Checking for stores...
    Found 10 stores for bounding box ((15.0, 1.0), (30.0, 10.5)).
    Stores number is more than or equal 10. Splitting bounding box...
    Checking for stores...
    Found 1 stores for bounding box ((15.0, 1.0), (22.5, 5.75)).
    Stores number is less than 10. Saving results.
    Checking for stores...
    Found 9 stores for bounding box ((15.0, 5.75), (22.5, 10.5)).
    Stores number is less than 10. Saving results.
    Checking for stores...
    Found 4 stores for bounding box ((22.5, 1.0), (30.0, 5.75)).
    Stores number is less than 10. Saving results.
    Checking for stores...
    Found 1 stores for bounding box ((22.5, 5.75), (30.0, 10.5)).
    Stores number is less than 10. Saving results.
    Checking for stores...
    Found 6 stores for bounding box ((15.0, 10.5), (30.0, 20.0)).
    Stores number is less than 10. Saving results.
    Found 29 stores in total
    Visited boxes: 
    [
    ((0, 1), (30, 20)), 
    ((0.0, 1.0), (15.0, 10.5)), 
    ((0.0, 10.5), (15.0, 20.0)), 
    ((15.0, 1.0), (30.0, 10.5)), 
    ((15.0, 1.0), (22.5, 5.75)), 
    ((15.0, 5.75), (22.5, 10.5)), 
    ((22.5, 1.0), (30.0, 5.75)), 
    ((22.5, 5.75), (30.0, 10.5)), 
    ((15.0, 10.5), (30.0, 20.0))
    ]
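
If the recursion depth mentioned in the note is a concern, the same subdivision can be driven by an explicit queue instead of recursive calls. The following is only a sketch under stated assumptions: `get_stores_iteratively`, the `fetch` callback (standing in for `check_for_stores`), `limit`, and `min_size` are hypothetical names introduced here. The `min_size` guard is also an addition: without some lower bound, 10 or more stores at the same coordinates would be split forever.

```python
from collections import deque


def split_bounding_box(bounding_box, factor=2):
    """Split a box into factor**2 equal sub-boxes (same idea as the helper above)."""
    (lc_x, lc_y), (rc_x, rc_y) = bounding_box
    width = (rc_x - lc_x) / float(factor)
    height = (rc_y - lc_y) / float(factor)
    for i in range(factor):
        for j in range(factor):
            x, y = lc_x + width * i, lc_y + height * j
            yield ((x, y), (x + width, y + height))


def get_stores_iteratively(bounding_box, fetch, limit=10, min_size=0.001):
    """Collect stores without recursion.

    fetch:    callable taking a bounding box and returning a list of stores
              (assumed to behave like check_for_stores).
    limit:    the server's result cap; hitting it means the list may be truncated.
    min_size: hypothetical smallest box side that is still split further.
    """
    result = []
    pending = deque([bounding_box])  # boxes still waiting to be queried
    while pending:
        box = pending.popleft()
        stores = fetch(box)
        (lc_x, lc_y), (rc_x, rc_y) = box
        too_small = (rc_x - lc_x) <= min_size or (rc_y - lc_y) <= min_size
        if len(stores) >= limit and not too_small:
            # Possibly truncated: query the sub-boxes instead of keeping this list.
            pending.extend(split_bounding_box(box))
        else:
            # Either complete, or the box is too small to split further;
            # in the latter case the list may still be truncated.
            result.extend(stores)
    return result
```

It is called the same way as the recursive version, for example `get_stores_iteratively(((0, 1), (30, 20)), check_for_stores)`. Whether a store lying exactly on a shared edge is returned by both neighbouring boxes depends on the server, so removing duplicates from the result may still be necessary.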
    

    【Discussion】:

    • This is awesome - thanks for taking the time to write it up with such a good explanation.