Mocking and testing a function which returns text from a URL

【Question title】: Mocking and testing a function which returns text from a URL
【Posted】: 2020-12-16 10:02:22
【Question】:

I have a function that takes a URL and returns the text found at that URL.

def extract_raw_text_from_url(url, set_parser='lxml'):

    try:
        req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})  # Set user agent as Mozilla. Otherwise: Error 403
        source = urlopen(req).read()  # Return source code

        parser = set_parser
        soup = bs.BeautifulSoup(source, parser)  # create beautiful soup object

        text = soup.get_text()  # get text of websites

    except (ValueError): # ToDo: Why urllib.error.URLError is unknown? I want to include it in exception! Works in Colab!
        text = []

    return text

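How do I test this function properly? I think it is bad practice to make a real request every time the tests run, so mocking the result seems like a good idea.

Any idea how to do that? I am using pytest, but I am still a beginner.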

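【Comments】:

Can you access this URL directly from the server?
Well, yes. If I plug in a URL, I can reach it and make a real request. But I think it is best practice to write tests that do not need an internet connection to work.

Tags: python testing beautifulsoup mocking pytest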

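【Solution 1】:

I think it depends on what you want to test. If you want to test the request itself, you should perform a real request every time (the web page may change from one day to the next, and a live request takes that into account).

If you want to test the parsing of a given HTML input, I think you can download the HTML page and put it in an assets folder (or similar) inside your tests. Then you can try using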

url = "assets/mywebpage1.html"
with open(url, 'r') as f:
    source = f.read()
    #...
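Since the question asks about mocking, here is a minimal sketch of how a saved page could be fed to the original function without touching the network. It replaces `urlopen` with a `unittest.mock.Mock` whose `read()` returns fixed bytes. Several things are assumptions here: the `import bs4 as bs` line (the question only shows the alias `bs`), the inline `SAVED_HTML` bytes standing in for a file such as assets/mywebpage1.html, the test names, and the use of `html.parser` so the test does not depend on lxml being installed.

```python
from unittest.mock import Mock, patch
from urllib.request import Request, urlopen

import bs4 as bs  # assumed import; the question only shows the alias `bs`


def extract_raw_text_from_url(url, set_parser='lxml'):
    # Same logic as the function in the question
    try:
        req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        source = urlopen(req).read()
        soup = bs.BeautifulSoup(source, set_parser)
        text = soup.get_text()
    except ValueError:
        text = []
    return text


# Stand-in for a saved page; it could equally be read from assets/mywebpage1.html
SAVED_HTML = b"<html><body><h1>Title</h1><p>Hello world</p></body></html>"


def test_extract_raw_text_with_mocked_urlopen():
    # urlopen(req) returns an object whose .read() yields the saved bytes
    fake_urlopen = Mock()
    fake_urlopen.return_value.read.return_value = SAVED_HTML

    # Swap the name `urlopen` in the namespace the function looks it up in.
    # With the function in its own module, patch("your_module.urlopen")
    # (hypothetical module name) is the more common spelling.
    with patch.dict(extract_raw_text_from_url.__globals__,
                    {"urlopen": fake_urlopen}):
        text = extract_raw_text_from_url("https://example.com",
                                         set_parser="html.parser")

    fake_urlopen.assert_called_once()
    assert "Hello world" in text


def test_invalid_url_returns_empty_list():
    # Request raises ValueError for a URL without a scheme,
    # so no network access happens on this path
    assert extract_raw_text_from_url("not-a-url") == []
```

The second test covers the `except ValueError` branch: `Request` rejects a URL with no scheme before `urlopen` is ever called.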

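Edit: I think two approaches are possible:

Split the two operations into two separate functions and test only parse_content_from_html(source), where source is obtained inside the test routine as in the example above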

from urllib.request import Request, urlopen
import bs4 as bs  # assumed import; the code only shows the alias `bs`


def extract_raw_text_from_url(url, set_parser='lxml'):
    try:
        req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        source = urlopen(req).read()  # Return source code
        text = parse_content_from_html(source, set_parser)
    except ValueError:
        text = []

    return text


def parse_content_from_html(source, set_parser='lxml'):
    # The parser must be passed in; set_parser is not otherwise visible here
    soup = bs.BeautifulSoup(source, set_parser)  # create beautiful soup object
    text = soup.get_text()  # get text of websites
    return text
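A test for the parsing half alone might look like the sketch below. It assumes the parser name is passed to parse_content_from_html as a second argument, assumes `import bs4 as bs`, and uses `html.parser` so that lxml is not required. The inline HTML strings and the test name are made up for illustration; a page read from the assets folder would be passed in the same way.

```python
import bs4 as bs  # assumed import; the post only shows the alias `bs`


def parse_content_from_html(source, set_parser='lxml'):
    # Parser name passed explicitly (an assumption of this sketch)
    soup = bs.BeautifulSoup(source, set_parser)
    return soup.get_text()


def test_parse_content_from_html():
    # (input HTML, expected text) pairs
    cases = [
        ("<p>Hello</p>", "Hello"),
        ("<div><h1>A</h1><p>B</p></div>", "AB"),
        ("<p>Hello <b>world</b></p>", "Hello world"),
    ]
    for source, expected in cases:
        result = parse_content_from_html(source, set_parser="html.parser")
        assert result == expected
```

Because the function never touches the network, the test needs no mocking at all. That is the main benefit of splitting the two operations.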

Use a flag to distinguish loading a local HTML file from loading a remote page. You can then call extract_raw_text_from_url("assets/mywebpage1.html", ..., local=True)

def extract_raw_text_from_url(url, set_parser='lxml', local=False):

    try:
        if local:
            with open(url, 'r') as f:
                source = f.read()
        else:
            req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})  # Set user agent as Mozilla. Otherwise: Error 403
            source = urlopen(req).read()  # Return source code

        parser = set_parser
        soup = bs.BeautifulSoup(source, parser)  # create beautiful soup object

        text = soup.get_text()  # get text of websites

    except (ValueError): 
        text = []

    return text

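【Comments】:

Thank you very much. My goal is to test the parsing process. Saving the website in a folder and loading the HTML page is a good idea. However, I don't really know how to write a complete test for the function extract_raw_text_from_url. My problem is that at some point in my test function I have to call extract_raw_text_from_url in order to compare its result with the expected result. Maybe you can tell me where in my test I have to use your code example?
I added examples for your specific scenario; hopefully one of them works for you!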