Getting text from a web page in Python

[Question title]: Getting text from a web page in Python
[Posted]: 2020-03-25 15:41:07
[Question description]:

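I am trying to get URLs from a web page.

I have tried wget, urllib, and lynx (which returns the most organized result). The tricky part is that the URLs are written as text on the page, and long ones are cut off with three dots (for example, exampppppppppppppple.com is shown as exampleppp...). To see the full URL you have to click the entry's ID, which opens a new window where the URL is written out in full, again as text. I managed to get the URLs, but I don't know how to go to that other page and get the full URL text when it is truncated, and I'm not sure whether wget -r would work in my case (since the URLs are plain text).

Here is what I wrote: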

import os

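def get_urls():
    # Dump the page with lynx, drop the site's own links, keep lines that contain URLs.
    # The command is split into adjacent string literals so the Python source stays valid.
    os.system(
        "lynx -dump https://www.example.com/ "
        "| grep -v https://ww.example.com/* | grep https* | grep http* "
        "| cut -f5- -d' ' > urls.txt"
    )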

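With grep -v https://ww.example.com/* I exclude all of the site's own links, because I only want the URLs that are written out in the page. I also tried -listonly, but that only lists the page's own links.

Output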

http://www.another-example... 
https://example1.com
https://www.example.com

[Question discussion]:

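If the page does not rely on JavaScript to render its HTML dynamically, take a look at requests ( 2.python-requests.org/en/master ) together with an HTML parser such as beautifulsoup (pypi.org/project/beautifulsoup4), or look at requests_html (pypi.org/project/requests-html). Play with it, and if your first attempt breaks, post it (here or in another question) and we can help.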
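For the truncated entries described in the question, one possible approach is to parse the listing page, find the rows whose text ends in three dots, and collect the link behind each row's ID so the detail page can be fetched next. The sketch below uses only the standard library (html.parser) instead of the packages named above. The table markup, the /entry/N paths and the class name TruncatedEntryFinder are invented for illustration, since the real page structure is not shown.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class TruncatedEntryFinder(HTMLParser):
    """Collect detail-page links for table rows whose text ends in '...'."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.in_row = False
        self.row_href = None   # first link seen in the current row (assumed to be the ID link)
        self.row_text = []     # non-empty text fragments seen in the current row
        self.detail_urls = []  # absolute URLs of the detail pages to visit

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.in_row, self.row_href, self.row_text = True, None, []
        elif tag == "a" and self.in_row and self.row_href is None:
            self.row_href = dict(attrs).get("href")

    def handle_data(self, data):
        if self.in_row and data.strip():
            self.row_text.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "tr" and self.in_row:
            # Accept three ASCII dots or a single ellipsis character.
            truncated = any(t.endswith(("...", "\u2026")) for t in self.row_text)
            if truncated and self.row_href:
                self.detail_urls.append(urljoin(self.base_url, self.row_href))
            self.in_row = False


# Hypothetical listing markup, invented for this example.
SAMPLE_HTML = """
<table>
  <tr><td><a href="/entry/1">1</a></td><td>http://www.another-example...</td></tr>
  <tr><td><a href="/entry/2">2</a></td><td>https://example1.com</td></tr>
</table>
"""

finder = TruncatedEntryFinder("https://www.example.com/")
finder.feed(SAMPLE_HTML)
print(finder.detail_urls)
```

Each URL in detail_urls could then be downloaded (for example with urllib.request.urlopen or requests.get) and parsed in a similar way to read the full URL text from the detail page.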

Tags: python url wget lynx

[Solution 1]:

Mid-2020 update

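If I understand the task correctly, the goal is to get the list of URLs embedded in a web page that differ from the base URL of the page itself. So if the page is https://example.com, list every URL that is not under 'example.com/..'.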
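As a rough illustration of that filter, here is a minimal sketch that compares host names rather than searching for the site name anywhere in the URL. The helper name is_external and the sample links are made up for this example, and treating subdomains as part of the same site is an assumption.

```python
from urllib.parse import urlsplit


def is_external(url, site):
    """Return True if the URL's host is neither `site` nor a subdomain of it."""
    host = (urlsplit(url).hostname or "").lower()
    site = site.lower()
    return bool(host) and host != site and not host.endswith("." + site)


# Made-up sample links for illustration.
links = [
    "https://example.com/questions",
    "https://www.example.com/",
    "https://example1.com",
    "https://other.org/?ref=example.com",
]
print([u for u in links if is_external(u, "example.com")])
```

Comparing host names keeps a link such as https://other.org/?ref=example.com, which a plain substring search for the site name would discard.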

Using the external Lynx program

Calling Lynx with subprocess.run (Python 3.7 and later)

# subprocess.run exists since Python 3.5, but capture_output needs Python 3.7
import subprocess

site = "stackoverflow.com"
siteurl = "https://" + site

# set encoding so that the result is str rather than bytes
# set a timeout in case the URL does not exist or does not respond
try:
    result = subprocess.run(
        ["lynx", "-listonly", "-dump", siteurl],
        capture_output=True,
        encoding='utf-8',
        timeout=3,
    )
    result.check_returncode()
except subprocess.TimeoutExpired as err:
    print("[Error] ", err)
    exit(err.timeout)
except subprocess.CalledProcessError as err:
    print("[Error] ", err.stderr)
    exit(err.returncode)
except Exception as err:
    print(err)
    # not every exception carries errno, so fall back to exit code 1
    exit(getattr(err, "errno", None) or 1)

resultlist = result.stdout.splitlines()

for item in resultlist:
    item = item.strip()

    urlindicator = "://"
    if item.find(urlindicator) > 0:
        # example split line: ["1.", "https://example.com"]
        item_url = item.split()[1] 
        if item_url.find(site) == -1:
            print(item_url)

Calling Lynx with subprocess.check_output (before Python 3.5)

# Before Python 3.5 there is no subprocess.run; timeout needs Python 3.3+
import subprocess

site = "stackoverflow.com"
siteurl = "https://" + site

# universal_newlines=True returns str rather than bytes
# (the encoding= argument only exists from Python 3.6)
# set a timeout in case the URL does not exist or does not respond
try:
    result = subprocess.check_output(
        ["lynx", "-listonly", "-dump", siteurl],
        stderr=subprocess.PIPE,
        universal_newlines=True,
        timeout=2
    )
except subprocess.TimeoutExpired as err:
    print("[Error] ", err)
    exit(err.timeout)
except subprocess.CalledProcessError as err:
    # err.stderr only exists from Python 3.5, so print the exception itself
    print("[Error] ", err)
    exit(err.returncode)
except Exception as err:
    print(err)
    # not every exception carries errno, so fall back to exit code 1
    exit(getattr(err, "errno", None) or 1)

resultlist = result.splitlines()

for item in resultlist:
    item = item.strip()

    urlindicator = "://"
    if item.find(urlindicator) > 0:
        # example split line: ["1.", "https://example.com"]
        item_url = item.split()[1] 
        if item_url.find(site) == -1:
            print(item_url)

The output of these examples should look something like this:

head list
https://stackexchange.com/sites
https://stackoverflow.blog/
https://www.g2.com/products/stack-overflow-for-teams/
https://www.g2.com/products/stack-overflow-for-teams/
https://www.fastcompany.com/most-innovative-companies/2019/sectors/enterprise
https://stackoverflowbusiness.com/
...
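The listing above contains a repeated entry. If duplicates are unwanted, the parsing loop could be wrapped in a function that also remembers which URLs it has already seen. The following is only a sketch: the function name urls_from_lynx_listing is made up, and the sample text is hand-written in the shape that lynx -listonly -dump is assumed to print, not captured from a real run.

```python
def urls_from_lynx_listing(dump_text, site):
    """Extract unique off-site URLs from `lynx -listonly -dump` style text."""
    seen = set()
    urls = []
    for line in dump_text.splitlines():
        parts = line.split()
        # Link lines look like "1. https://example.com"; skip everything else.
        if len(parts) < 2 or "://" not in parts[1]:
            continue
        url = parts[1]
        # Same substring test as the loops above, plus a duplicate check.
        if site in url or url in seen:
            continue
        seen.add(url)
        urls.append(url)
    return urls


# Hand-written sample, not real lynx output.
SAMPLE_DUMP = """
References

   1. https://stackoverflow.com/
   2. https://stackexchange.com/sites
   3. https://www.g2.com/products/stack-overflow-for-teams/
   4. https://www.g2.com/products/stack-overflow-for-teams/
"""

print(urls_from_lynx_listing(SAMPLE_DUMP, "stackoverflow.com"))
```

Because the function takes text rather than running lynx itself, it can be fed result.stdout from the subprocess.run version or result from the check_output version.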

[Discussion]: