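【Question title】: Getting text from a web page in Python
【Posted】: 2020-03-25 15:41:07
【Question】:

I am trying to get URLs from a web page.

I tried wget, urllib and lynx (lynx returns the best-organized results). The tricky part is that the URLs are written as plain text on the page, and long ones are truncated with three dots (for example, exampppppppppppppple.com is displayed as exampleppp...). To see the full URL you have to click the entry's ID, which opens a new window where the URL is written out in full, again as text. I managed to fetch the URLs, but I don't know how to go to that other page and get the full URL text when it is truncated, and I'm not sure whether wget -r would work in my case, since the URLs are text.

This is what I wrote: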

import os

def get_urls():
    # Adjacent string literals are joined into one shell command
    os.system(
        "lynx -dump https://www.example.com/ "
        "| grep -v https://ww.example.com/* | grep https* | grep http* "
        "| cut -f5- -d' ' > urls.txt"
    )
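  • In this line, grep -v https://ww.example.com/* excludes all of the website's own links, because I only want the other URLs written on the site. I also tried -listonly, but that only lists the page's URLs.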

Output

http://www.another-example... 
https://example1.com
https://www.example.com

【Comments】:

Tags: python url wget lynx


【Solution 1】:

Update, mid-2020

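If I understand the task correctly, it is to get the list of URLs embedded in a web page that differ from the base URL of the page itself. So if the page is https://example.com, list every URL that is not under 'example.com/..'.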
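
The filtering rule described above can be sketched with the standard library's urllib.parse, comparing host names rather than searching for a substring. This is only an illustration: the function name is_external and the sample URLs are assumptions, not part of the original answer.

```python
from urllib.parse import urlsplit


def is_external(url, base_host):
    """Return True if url points to a host outside base_host.

    Subdomains of base_host (e.g. www.example.com) count as internal.
    Strings without a host name are treated as external.
    """
    host = (urlsplit(url).hostname or "").lower()
    base_host = base_host.lower()
    return not (host == base_host or host.endswith("." + base_host))


# Hypothetical sample data
urls = [
    "https://example.com/about",
    "https://www.example.com/",
    "https://example1.com",
    "http://www.another-example.org/page",
]
external = [u for u in urls if is_external(u, "example.com")]
print(external)
```

Comparing host names avoids false matches when the site name merely appears somewhere in another URL's path or query string.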

Using the external Lynx program

Calling Lynx, Python 3.7 and later
# Since Python 3.7 (subprocess.run is 3.5+, encoding is 3.6+, capture_output is 3.7+)
import subprocess
import sys

site = "stackoverflow.com"
siteurl = "https://" + site

# set encoding so that result is in strings rather than bytes
# set timeout for the case of a nonexistent url
try:
    result = subprocess.run(
        ["lynx", "-listonly", "-dump", siteurl],
        capture_output=True,
        encoding='utf-8',
        timeout=3,
    )
    result.check_returncode()
except subprocess.TimeoutExpired as err:
    print("[Error] ", err)
    sys.exit(1)
except subprocess.CalledProcessError as err:
    print("[Error] ", err.stderr)
    sys.exit(err.returncode)
except Exception as err:
    # e.g. FileNotFoundError when lynx is not installed
    print(err)
    sys.exit(1)

resultlist = result.stdout.splitlines()
urlindicator = "://"

for item in resultlist:
    parts = item.split()
    # example split line: ["1.", "https://example.com"]
    if len(parts) > 1 and urlindicator in parts[1]:
        item_url = parts[1]
        if site not in item_url:
            print(item_url)
Calling Lynx, pre-Python 3.5
# Pre-Python 3.5 (the timeout parameter of check_output needs 3.3 or later)
import subprocess
import sys

site = "stackoverflow.com"
siteurl = "https://" + site

# universal_newlines=True returns str rather than bytes (encoding= needs 3.6+)
# set timeout for the case of a nonexistent url
try:
    result = subprocess.check_output(
        ["lynx", "-listonly", "-dump", siteurl],
        stderr=subprocess.STDOUT,
        universal_newlines=True,
        timeout=2
    )
except subprocess.TimeoutExpired as err:
    print("[Error] ", err)
    sys.exit(1)
except subprocess.CalledProcessError as err:
    # err.stderr only exists from 3.5 on; with stderr=STDOUT the messages are in err.output
    print("[Error] ", err.output)
    sys.exit(err.returncode)
except Exception as err:
    # e.g. FileNotFoundError when lynx is not installed
    print(err)
    sys.exit(1)

resultlist = result.splitlines()
urlindicator = "://"

for item in resultlist:
    parts = item.split()
    # example split line: ["1.", "https://example.com"]
    if len(parts) > 1 and urlindicator in parts[1]:
        item_url = parts[1]
        if site not in item_url:
            print(item_url)

The output of these examples should look like this:

head list
https://stackexchange.com/sites
https://stackoverflow.blog/
https://www.g2.com/products/stack-overflow-for-teams/
https://www.g2.com/products/stack-overflow-for-teams/
https://www.fastcompany.com/most-innovative-companies/2019/sectors/enterprise
https://stackoverflowbusiness.com/
...
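
The sample output above lists one URL twice. As a minimal sketch, the parsing step can be pulled into a function that also drops duplicates while keeping the original order. The name extract_urls and the sample dump text are hypothetical; the dump only imitates the numbered "N. URL" lines that the code above expects from lynx -listonly -dump.

```python
def extract_urls(dump_text, site):
    """Collect unique URLs from numbered-list text, skipping those containing site."""
    seen = set()
    urls = []
    for line in dump_text.splitlines():
        parts = line.split()
        # Expect lines such as "1. https://example.com"
        if len(parts) < 2 or "://" not in parts[1]:
            continue
        url = parts[1]
        if site in url or url in seen:
            continue
        seen.add(url)
        urls.append(url)
    return urls


# Hypothetical dump text for illustration
dump = """
References

   1. https://stackoverflow.com/questions
   2. https://stackexchange.com/sites
   3. https://www.g2.com/products/stack-overflow-for-teams/
   4. https://www.g2.com/products/stack-overflow-for-teams/
   5. https://stackoverflow.blog/
"""
print(extract_urls(dump, "stackoverflow.com"))
```

Keeping the parsing in a pure function makes it possible to check the filtering logic without calling Lynx or touching the network.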

【Comments】:
