Finding a Substring in a Link: Answers

【Question title】: Finding a Substring in a Link
【Posted】: 2020-04-24 01:12:08
【Question description】:

In my Python function I pass in a URL, search the page at that URL for PDF files, and then download those files. In most cases it works perfectly.

def get_pdfs(my_url):
    html = urllib2.urlopen(my_url).read()
    html_page = BeautifulSoup(html)
    current_link = ''
    links = []
    for link in html_page.find_all('a'):
        current_link = link.get('href')
        if current_link.endswith('pdf'):
            print(current_link)
            links.append(my_url + current_link)
    #print(links)

    for link in links:
        #urlretrieve(link)
        wget.download(link)


get_pdfs('https://grader.eecs.jacobs-university.de/courses/320241/2019_2/')

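However, when I try to use my function on one particular course website, my current_link is

/courses/320241/2019_2/lectures/lecture_7_8.pdf

whereas I expected the duplicated prefix to be detected automatically, so that the link would be just

lectures/lecture_7_8.pdf

and the original my_url I pass to the function is

https://grader.eecs.jacobs-university.de/courses/320241/2019_2/

Because I concatenate the two and part of the link is repeated, the resulting URL is wrong and the downloaded file is corrupted. How can I check whether current_link repeats any part of my_url and, if it does, remove that part before downloading?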

【Question discussion】:

Tags: python string parsing web-scraping beautifulsoup

【Solution 1】:

Just update the code to use urljoin from urllib.parse:

from urllib.parse import urljoin
from urllib.request import urlopen

import wget
from bs4 import BeautifulSoup

def get_pdfs(my_url):
    html = urlopen(my_url).read()
    html_page = BeautifulSoup(html, 'html.parser')
    links = []
    # href=True skips <a> tags that have no href attribute
    for link in html_page.find_all('a', href=True):
        current_link = link.get('href')
        if current_link.endswith('pdf'):
            print(current_link)
            # urljoin resolves relative and root-relative links against my_url
            links.append(urljoin(my_url, current_link))

    for link in links:
        wget.download(link)
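
To see why urljoin fixes the duplication, here is a minimal sketch that uses only the standard library and the URLs from the question. The external link on example.com is a made-up value for illustration; nothing is downloaded.

```python
from urllib.parse import urljoin

my_url = "https://grader.eecs.jacobs-university.de/courses/320241/2019_2/"

# Three forms an href can take on a page.
root_relative = "/courses/320241/2019_2/lectures/lecture_7_8.pdf"
page_relative = "lectures/lecture_7_8.pdf"
absolute = "https://example.com/files/notes.pdf"  # hypothetical external link

# Plain concatenation repeats the path, which is the bug from the question.
print(my_url + root_relative)

# urljoin resolves each form against the base URL instead.
for href in (root_relative, page_relative, absolute):
    print(urljoin(my_url, href))
```

Both the root-relative and the page-relative form resolve to the same lecture URL, and the absolute link is returned unchanged.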

A simplified approach: .select('a[href$=pdf]') selects every link whose href ends with pdf:

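from urllib.parse import urljoin
from urllib.request import urlopen

import wget
from bs4 import BeautifulSoup

def get_pdfs(my_url):
    html = urlopen(my_url).read()
    html_page = BeautifulSoup(html, 'html.parser')
    # a[href$=pdf] matches every <a> whose href ends with "pdf"
    for link in html_page.select('a[href$=pdf]'):
        wget.download(urljoin(my_url, link.get('href')))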

【Discussion】:

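You can also evaluate what the current_link value is.
- If your current link starts with http:// or https://, it is already a full URL (typically pointing to an external server). You can use the URL as-is.
- If your link starts with /, it is an absolute path on your current server (i.e. base_url). You can use f'{base_url}/{current_link}'
- If the link starts with anything else, you need to build the relative path as: f'{base_url}/{course_path}/{current_link}'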
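
The three cases above can be sketched by hand as follows. This is only an illustration: resolve_by_hand is a hypothetical helper, base_url is assumed to mean scheme plus host, and the sketch ignores forms such as "../" segments, protocol-relative links ("//host/..."), and query strings, all of which urljoin handles.

```python
from urllib.parse import urlsplit

def resolve_by_hand(my_url, current_link):
    """Hypothetical helper: resolve an href using the three cases above."""
    parts = urlsplit(my_url)
    base_url = f"{parts.scheme}://{parts.netloc}"  # scheme and host only
    if current_link.startswith(("http://", "https://")):
        return current_link  # already a full URL
    if current_link.startswith("/"):
        return base_url + current_link  # path from the server root
    # Anything else is relative to the directory of the page.
    directory = parts.path.rsplit("/", 1)[0]
    return f"{base_url}{directory}/{current_link}"

my_url = "https://grader.eecs.jacobs-university.de/courses/320241/2019_2/"
print(resolve_by_hand(my_url, "/courses/320241/2019_2/lectures/lecture_7_8.pdf"))
print(resolve_by_hand(my_url, "lectures/lecture_7_8.pdf"))
```

Note that the user only supplies my_url here; the course path is taken from the URL itself rather than entered separately.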
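@Byob I'm actually trying to build a function where the user only enters the link, not the course path. I need a way to search in my_url and see whether part of that string matches current_link. If it does, I want to remove the matching part and concatenate the rest to open the link.
For example, if I use wget to download https://grader.eecs.jacobs-university.de/courses/320241/2019_2/lectures/lecture_7_8.pdf, it works perfectly. So I need some kind of string parsing technique.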
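
A pure string approach of the kind asked for here could look like the sketch below. join_without_overlap is a hypothetical helper that finds the longest prefix of current_link that my_url already ends with and drops it before concatenating. It is shown only to illustrate the idea: short coincidental overlaps can produce a wrong URL, so it is less robust than resolving the link as a URL.

```python
def join_without_overlap(my_url, current_link):
    """Hypothetical helper: drop the part of current_link that my_url already ends with."""
    # Try the longest possible overlap first and shrink until one matches.
    for size in range(min(len(my_url), len(current_link)), 0, -1):
        if my_url.endswith(current_link[:size]):
            return my_url + current_link[size:]
    return my_url + current_link  # no overlap at all

my_url = "https://grader.eecs.jacobs-university.de/courses/320241/2019_2/"
print(join_without_overlap(my_url, "/courses/320241/2019_2/lectures/lecture_7_8.pdf"))
print(join_without_overlap(my_url, "lectures/lecture_7_8.pdf"))
```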
Its HTML looks like this: <tr> <td>2019-09-26, 27</td> <td><a href="/courses/320241/2019_2/lectures/lecture_7_8.pdf">Lecture 7, 8</a></td> </tr>
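
For a quick offline check against that snippet, the sketch below swaps BeautifulSoup for the standard library's html.parser so it runs without extra packages. PdfLinkParser is a hypothetical class written for this illustration, not part of the thread's code.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class PdfLinkParser(HTMLParser):
    """Hypothetical helper: collect absolute URLs of <a> tags whose href ends in 'pdf'."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip anchors without an href, keep only links ending in "pdf".
        if href and href.endswith("pdf"):
            self.links.append(urljoin(self.base_url, href))

html = ('<tr> <td>2019-09-26, 27</td> <td><a href="/courses/320241/2019_2/'
        'lectures/lecture_7_8.pdf">Lecture 7, 8</a></td> </tr>')
parser = PdfLinkParser("https://grader.eecs.jacobs-university.de/courses/320241/2019_2/")
parser.feed(html)
print(parser.links)
```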
@x89 Check the updated answer. With urllib.parse you can use urljoin, which will work for you.