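Mid-2020 update
If I understand the task correctly, the goal is to get the list of URLs embedded in a web page that differ from the page's own base URL. So if the page is https://example.com, list every URL that is not of the form 'example.com/..'.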
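As a minimal sketch of that filtering rule, assuming the URLs have already been collected into a list, each URL's host can be compared against the site's host. The helper name `external_urls` and the sample URLs are hypothetical and only serve as illustration. Treating subdomains as part of the site is also an assumption.

```python
from urllib.parse import urlsplit


def external_urls(urls, site):
    """Return the URLs whose host is neither `site` nor a subdomain of it."""
    site = site.lower()
    result = []
    for url in urls:
        # hostname is None for strings without a network location
        host = (urlsplit(url).hostname or "").lower()
        if host == site or host.endswith("." + site):
            continue  # same site or one of its subdomains
        result.append(url)
    return result


# Hypothetical sample data
sample = [
    "https://example.com/about",
    "https://www.example.com/contact",
    "https://example.org/",
    "https://other.test/?ref=example.com",
]
print(external_urls(sample, "example.com"))
```

Comparing hosts rather than searching the whole URL string keeps links such as the last sample entry, where the site name only appears in the query string.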
Using the external Lynx program
Calling Lynx, Python 3.5 and later
# subprocess.run() exists since Python 3.5;
# encoding= needs 3.6+ and capture_output= needs 3.7+
import subprocess
import sys

site = "stackoverflow.com"
siteurl = "https://" + site

# set encoding so that the result is str rather than bytes
# set timeout for the case of a non-existent url
try:
    result = subprocess.run(
        ["lynx", "-listonly", "-dump", siteurl],
        capture_output=True,
        encoding='utf-8',
        timeout=3,
    )
    result.check_returncode()
except subprocess.TimeoutExpired as err:
    print("[Error] ", err)
    sys.exit(err.timeout)
except subprocess.CalledProcessError as err:
    print("[Error] ", err.stderr)
    sys.exit(err.returncode)
except OSError as err:
    # e.g. FileNotFoundError when lynx is not installed
    print(err)
    sys.exit(err.errno or 1)

resultlist = result.stdout.splitlines()
for item in resultlist:
    item = item.strip()
    urlindicator = "://"
    if item.find(urlindicator) > 0:
        # example split line: ["1.", "https://example.com"]
        item_url = item.split()[1]
        if item_url.find(site) == -1:
            print(item_url)
Calling Lynx, pre-Python 3.5
# Pre-Python 3.5 (timeout= needs Python 3.3+)
import subprocess
import sys

site = "stackoverflow.com"
siteurl = "https://" + site

# universal_newlines=True so that the result is str rather than bytes
# (encoding= only exists from Python 3.6)
# set timeout for the case of a non-existent url
try:
    result = subprocess.check_output(
        ["lynx", "-listonly", "-dump", siteurl],
        # merge stderr into the output, because
        # CalledProcessError.stderr only exists from Python 3.5
        stderr=subprocess.STDOUT,
        universal_newlines=True,
        timeout=2
    )
except subprocess.TimeoutExpired as err:
    print("[Error] ", err)
    sys.exit(err.timeout)
except subprocess.CalledProcessError as err:
    print("[Error] ", err.output)
    sys.exit(err.returncode)
except OSError as err:
    # e.g. FileNotFoundError when lynx is not installed
    print(err)
    sys.exit(err.errno or 1)

resultlist = result.splitlines()
for item in resultlist:
    item = item.strip()
    urlindicator = "://"
    if item.find(urlindicator) > 0:
        # example split line: ["1.", "https://example.com"]
        item_url = item.split()[1]
        if item_url.find(site) == -1:
            print(item_url)
The output of these examples should look like this:
head list
https://stackexchange.com/sites
https://stackoverflow.blog/
https://www.g2.com/products/stack-overflow-for-teams/
https://www.g2.com/products/stack-overflow-for-teams/
https://www.fastcompany.com/most-innovative-companies/2019/sectors/enterprise
https://stackoverflowbusiness.com/
...
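The parsing loop shared by both Lynx examples can also be tried without running Lynx at all. The following is only a sketch: the function name `parse_lynx_links` and the sample text are hypothetical, and the sample merely imitates the "number, then URL" line layout that the code comments above assume for the dump.

```python
def parse_lynx_links(dump_text):
    """Extract the URLs from text laid out like `lynx -listonly -dump` output."""
    urls = []
    for line in dump_text.splitlines():
        parts = line.split()
        # Link lines split into ["1.", "https://example.com"];
        # headers and blank lines do not match and are skipped.
        if len(parts) >= 2 and "://" in parts[1]:
            urls.append(parts[1])
    return urls


# Hypothetical sample text imitating the assumed dump layout
sample_dump = """References

   1. https://stackoverflow.com/
   2. https://stackexchange.com/sites
   3. https://stackoverflow.blog/
"""

site = "stackoverflow.com"
for url in parse_lynx_links(sample_dump):
    # same substring test as in the examples above
    if site not in url:
        print(url)
```

Separating the parsing from the subprocess call makes it possible to check the filter against saved output before pointing it at a live site.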