【Question title】: Exclude external links and only scrape internal links in Python using BeautifulSoup
【Posted】: 2020-10-14 02:11:00
【Question】:

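Website to get the links from: http://hindi-movies-songs.com/films/index-previous-listen.html

I want a list of every link on the page together with the .mp3 links nested inside them (approx. 20K links). For example, the parent link http://hindi-movies-songs.com/films/index-previous-listen.html contains 14 links, each of which contains more links, and so on. To make it clear:

First link in the parent: http://hindi-movies-songs.com/films/index-listen-20131118.html
Link 1.1.1: http://hindi-films-songs.com/main/roberto-48.html

I need all the links under 1.1.1 and its siblings, so the crawl has to go 3 levels deep. The problem is that every page ends with a link to the home page, which must not be crawled. How can I exclude it at every level?

My code: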

import requests
from urllib.parse import urlparse, urljoin
from bs4 import BeautifulSoup
import colorama

# init the colorama module
colorama.init()

GREEN = colorama.Fore.GREEN
GRAY = colorama.Fore.LIGHTBLACK_EX
RESET = colorama.Fore.RESET

# initialize the set of links (unique links)
internal_urls = set()


total_urls_visited = 0


def is_valid(url):
    """
    Checks whether `url` is a valid URL.
    """
    parsed = urlparse(url)
    return bool(parsed.netloc) and bool(parsed.scheme)


def get_all_website_links(url):
    """
    Returns all URLs found on `url` that belong to the same website
    """
    # all URLs of `url`
    urls = set()
    # domain name of the URL without the protocol
    domain_name = urlparse(url).netloc
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    for a_tag in soup.findAll("a"):
        href = a_tag.attrs.get("href")
        if href == "" or href is None:
            # href empty tag
            continue
        # join the URL if it's relative (not absolute link)
        href = urljoin(url, href)
        parsed_href = urlparse(href)
        # remove URL GET parameters, URL fragments, etc.
        href = parsed_href.scheme + "://" + parsed_href.netloc + parsed_href.path
        if not is_valid(href):
            # not a valid URL
            continue
        if href in internal_urls:
            # already in the set
            continue
        print(f"{GREEN}[*] Internal link: {href}{RESET}")
        urls.add(href)
        internal_urls.add(href)
    return urls


def crawl(url, max_urls=50):
    """
    Crawls a web page and extracts all links.
    You'll find all links in the `internal_urls` global set.
    params:
        max_urls (int): number of max urls to crawl, default is 50.
    """
    global total_urls_visited
    total_urls_visited += 1
    links = get_all_website_links(url)
    for link in links:
        if total_urls_visited > max_urls:
            break
        crawl(link, max_urls=max_urls)


if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser(description="Link Extractor Tool with Python")
    parser.add_argument("url", help="The URL to extract links from.")
    parser.add_argument("-m", "--max-urls", help="Number of max URLs to crawl, default is 50.", default=50, type=int)
    
    args = parser.parse_args()
    url = args.url
    max_urls = args.max_urls

    crawl(url, max_urls=max_urls)

    print("[+] Total Internal links:", len(internal_urls))
    print("[+] Total URLs:", len(internal_urls))

    domain_name = urlparse(url).netloc

    # save the internal links to a file
    with open(f"{domain_name}_internal_links.txt", "w") as f:
        for internal_link in internal_urls:
            print(internal_link.strip(), file=f)

Adding href != "http://hindi-movies-songs.com/index.html" to the empty-href check did not help.

Is there any solution?

【Comments】:

    Tags: python web-scraping beautifulsoup scripting web-crawler


    【Solution 1】:

    href not in ["http://hindi-movies-songs.com/index.html"] worked for me

    import requests
    from bs4 import BeautifulSoup

    url = "http://hindi-movies-songs.com/films/index-previous-listen.html"
    # links that must not be followed (the home page linked at the end of every page)
    excluded = ["http://hindi-movies-songs.com/index.html"]
    # the "lxml" parser requires the lxml package to be installed
    soup = BeautifulSoup(requests.get(url).content, "lxml")
    for a_tag in soup.find_all("a"):
        href = a_tag.get("href")  # None when the <a> tag has no href attribute
        if href and href not in excluded:
            print(href)
    

    The output is:

    http://hindi-movies-songs.com/films/index-listen-20131118.html
    http://hindi-movies-songs.com/films/index-listen-20121231.html
    http://hindi-movies-songs.com/films/index-listen-20120327.html
    http://hindi-movies-songs.com/films/index-listen-20110831.html
    http://hindi-movies-songs.com/films/index-listen-20101215.html
    http://hindi-movies-songs.com/films/index-listen-20100404.html
    http://hindi-movies-songs.com/films/index-listen-20091201.html
    http://hindi-movies-songs.com/films/index-listen-20090611.html
    http://hindi-movies-songs.com/films/index-listen-20090105.html
    http://hindi-movies-songs.com/films/index-listen-20080523.html
    http://hindi-movies-songs.com/films/index-batch4.html
    http://hindi-movies-songs.com/films/index-batch3.html
    http://hindi-movies-songs.com/films/indexbatch2.html
    http://hindi-movies-songs.com/films/index11to25.html
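
A plain string comparison only matches when the href on the page is written exactly like the excluded URL. The sketch below is one possible way to make the check more robust: it resolves each href against the page URL, drops the query string and fragment, and only then applies the host and exclusion tests. The names `EXCLUDED`, `keep_link` and `allowed_hosts` are illustrative assumptions, not part of the original code. Note that the level-3 example in the question lives on a different host (hindi-films-songs.com), so the set of allowed hosts is left to the caller.

```python
from urllib.parse import urljoin, urlparse

# Assumption: the home-page link named in the question is the only URL to skip.
EXCLUDED = {"http://hindi-movies-songs.com/index.html"}


def keep_link(base_url, href, allowed_hosts, excluded=EXCLUDED):
    """Return the normalized URL if it should be crawled, otherwise None."""
    if not href:
        return None  # missing or empty href attribute
    # Resolve relative links against the page they were found on.
    parsed = urlparse(urljoin(base_url, href))
    if parsed.scheme not in ("http", "https"):
        return None  # e.g. mailto: or javascript: links
    if parsed.netloc not in allowed_hosts:
        return None  # external link
    # Rebuild the URL without query string and fragment before comparing.
    url = f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
    if url in excluded:
        return None  # explicitly excluded page
    return url
```

In the loop above, `keep_link(url, a_tag.get("href"), {"hindi-movies-songs.com"})` could replace the direct `not in` comparison; links for which it returns `None` would simply be skipped.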
    

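    【Discussion】:

    • Also, using exclude = "hindi-movies-songs.com/index.html" with the condition "href != exclude" works fine too.
    • The result you sent me is only the first level. I need to go 2 more levels into each link and keep crawling. And every page ends with 1 (identical) link that must not be crawled.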
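
For the multi-level crawl requested in the last comment, a depth-limited breadth-first traversal is one option. The following is a minimal sketch under stated assumptions, not a tested solution for this particular site: `crawl`, `get_links`, `allowed_hosts` and `excluded` are hypothetical names, and `get_links(url)` stands for any caller-supplied function that returns the raw href values found on a page.

```python
from collections import deque
from urllib.parse import urljoin, urlparse


def crawl(start_url, get_links, max_depth=3, excluded=frozenset(), allowed_hosts=None):
    """Collect links reachable within `max_depth` levels of `start_url`.

    `get_links(url)` is a caller-supplied (hypothetical) function returning
    the raw href values found on `url`.
    """
    if allowed_hosts is None:
        # Default: stay on the host of the start page.
        allowed_hosts = {urlparse(start_url).netloc}
    seen = {start_url}
    queue = deque([(start_url, 0)])  # (url, depth) pairs
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # links at the depth limit are recorded but not fetched
        for href in get_links(url):
            if not href:
                continue
            parsed = urlparse(urljoin(url, href))
            # Normalize: drop query string and fragment.
            link = f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
            if parsed.netloc not in allowed_hosts or link in excluded or link in seen:
                continue
            seen.add(link)
            # Record .mp3 links but do not fetch them as pages.
            if not link.lower().endswith(".mp3"):
                queue.append((link, depth + 1))
    seen.discard(start_url)
    return seen
```

With requests and BeautifulSoup, `get_links` could be as small as `lambda u: [a.get("href") for a in BeautifulSoup(requests.get(u).content, "html.parser").find_all("a")]`, and `excluded` would contain the home-page URL. Because the exclusion test runs for every link at every level, the home page is skipped no matter which page links to it.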