How to get all URLs on a specific Google search in Python

【Question title】: How to get all urls on a specific google search python
【Posted】: 2018-03-08 14:48:27
【Question】:

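So, I'm trying to write a program that gets all the URLs on a Google web search results page and returns them as a list, in the order they appear on the page. For example, if the top URL on the results page for "random" is https://www.random.org/, then the first item in the returned list should be "https://www.random.org/", because it is the first link in the page source when you search Google for "random". I'm using the urllib (urllib.request) and re modules because I don't really know how to use Beautiful Soup or lxml, but if you can do it with Beautiful Soup and/or lxml, that would be fine too. Here is my code so far: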

import urllib.request
import re

def find(start,end):

    urls = []

    with open('data.txt', 'r') as myFile:
        pass  # TODO: append every URL found between the start and end strings in data.txt

    # Returns every URL found between the start and end parameters in data.txt

    return urls


def parse(query):

    # Creates the url with the query

    url = 'https://www.google.com/search?q=' + query

    # Gets past Google's attempt to block parsing

    headers = {}
    headers['User-Agent'] = "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17"

    # Fetches data

    req = urllib.request.Request(url, headers = headers)
    resp = urllib.request.urlopen(req)
    respData = resp.read()

    # Saves the source code in a txt file

    saveFile = open('data.txt','w')
    saveFile.write(str(respData))
    saveFile.close()

    # Finds the urls and returns them

    newUrl = find('<h3 class="r"><a href="','"')
    return newUrl

print(parse("random"))

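Problem: My problem is getting the find() function to work. I don't know how to extract the URLs from the source code saved in data.txt (also held in the variable respData). I'd like it to be efficient, so I'm considering regular expressions, but I'm not sure how to pull a URL out of the source given where it starts (the class bit, which is one parameter of find) and where it ends (the double quote, which is the other parameter).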

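Simplified question: given some text data, how would you build a list of every piece of text that lies between two strings start and end in data? How would you make that efficient when data is large, and then apply it to the find() function in my original code?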
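One possible way to approach the simplified question is sketched below with the re module. It is only an illustration: the name find_between, the limit parameter, and the sample markup are assumptions and not part of the original code, and it works on a string in memory rather than on data.txt. The delimiters are escaped so they are matched literally, and the non-greedy group stops at the first end string after each start string.

```python
import re
from itertools import islice


def find_between(data, start, end, limit=None):
    """Return the text found between each `start` ... `end` pair in `data`.

    `limit` is a hypothetical extra parameter (not in the original code)
    that stops after the first N matches.
    """
    # re.escape makes the delimiters literal; (.*?) is non-greedy, so each
    # match stops at the first `end` that follows `start`.
    pattern = re.compile(re.escape(start) + r"(.*?)" + re.escape(end), re.DOTALL)
    # finditer yields matches lazily, so islice can stop early on large input.
    matches = (m.group(1) for m in pattern.finditer(data))
    return list(islice(matches, limit))


# Made-up sample markup in the shape the question describes.
sample = (
    '<h3 class="r"><a href="https://www.random.org/">RANDOM.ORG</a></h3>'
    '<h3 class="r"><a href="https://example.com/">Example</a></h3>'
)
urls = find_between(sample, '<h3 class="r"><a href="', '"')
print(urls)
```

Passing limit=10 would return only the first ten matches, which fits the case where the first 10 URLs are enough. Whether the start string still appears in the page Google actually serves is something to check against the saved source.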

Note: I'm on Python 3.6.3, so I'm using urllib.request rather than urllib2. If it would take too long to get every URL on the Google results page, the first 10 URLs would be fine.

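【Comments on the question】:

What kind of problem are you facing?
I'm not sure how to get the URLs out of the source code. My problem is implementing find(); if you read the comment next to pass, you can see what I want it to do.
I'll make that clearer in the post now.
So you want to know how to read a file line by line and add it to a list?
No, I want to return a list of all the URLs, given some data and the strings that mark where each URL starts and ends. So the output of print(parse("random")) would be a list like [random.org, blabla.com] and so on.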

Tags: python python-3.x parsing

【Solution 1】:

With Beautiful Soup, right after the urlopen call:

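from bs4 import BeautifulSoup
# code snip

resp = urllib.request.urlopen(req)
soup = BeautifulSoup(resp, 'html.parser')  # name the parser explicitly

# In the question's markup, class "r" is on the <h3> and the <a> sits inside it
for x in soup.select('h3.r a'):
    print(x.get('href'))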

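I haven't tested this, but this is how you search with Beautiful Soup.

As a side note, parsing HTML with regex alone can be tricky. It is better to let Beautiful Soup 4 or Scrapy handle the parsing for you.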
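To illustrate letting a parser do the work instead of a regex, here is a minimal sketch that uses the standard library's html.parser rather than Beautiful Soup or Scrapy, a swap made only so the example runs without extra installs. The class name ResultLinkParser, the helper extract_result_links, and the sample markup are assumptions for illustration. The `<h3 class="r">` structure is taken from the question and may not match what Google serves today.

```python
from html.parser import HTMLParser


class ResultLinkParser(HTMLParser):
    """Collect href values of <a> tags nested inside <h3 class="r">."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._in_result_heading = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # The class attribute may hold several names, so split before testing.
        if tag == "h3" and "r" in (attrs.get("class") or "").split():
            self._in_result_heading = True
        elif tag == "a" and self._in_result_heading and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "h3":
            self._in_result_heading = False


def extract_result_links(html):
    """Hypothetical helper: return result links in document order."""
    parser = ResultLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up sample markup in the shape the question describes.
sample = (
    '<a href="https://ignored.example/">nav</a>'
    '<h3 class="r"><a href="https://www.random.org/">RANDOM.ORG</a></h3>'
    '<h3 class="r other"><a href="https://example.com/">Example</a></h3>'
)
links = extract_result_links(sample)
print(links)
```

Unlike a fixed start string, this approach does not depend on attribute order or exact spacing inside the tags, which is the kind of variation that tends to break regex-only extraction.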

【Comments】:

Also, you need a colon at the end of the for loop.
I want it to find