How can I force Python to import independent copies of modules' dependencies? Answers

【Question title】: How can I force python to import independent copies of modules' dependencies?
【Posted】: 2020-01-09 00:03:57
【Question description】:

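I ran into the problem that pandas.DataFrame.to_html() depends on global state, as described in other questions. The solutions in those questions are somewhat hacky: they either modify the global option and then revert it, or store the DataFrame's contents elsewhere and re-insert them into the HTML after conversion. Those questions also hint at a more general one: is it possible to load a module and its dependencies independently?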

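There is an MWE below. When main.py runs, it imports mod1, which sets a pandas option. Next it imports mod2, which finds that pandas has already been loaded, so it reuses the existing pandas module object and resets the option. As a result, when main.py later calls mod1's functions, mod1 sees the option as mod2 left it. This means that mod1.bar(), which relies on DataFrame.to_html(), behaves as mod2 dictates. We can check in main.py that mod1.base.pd is mod2.base.pd (it returns True).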
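The sharing described above can be reproduced without pandas. The following is a minimal sketch, assuming a hypothetical stand-in module named fake_pd and a hypothetical helper load_client that plays the role of mod1/mod2; it only illustrates that a second import returns the cached module object from sys.modules.

```python
import sys
import types

# Hypothetical stand-in for pandas: a module object holding one global option.
fake_pd = types.ModuleType("fake_pd")
fake_pd.options = {"max_colwidth": 50}
sys.modules["fake_pd"] = fake_pd  # register it as if it had already been imported


def load_client(name, colwidth):
    """Simulate a module like mod1/mod2 that imports fake_pd and sets an option."""
    import fake_pd as pd  # resolved from sys.modules; no new copy is created
    pd.options["max_colwidth"] = colwidth
    client = types.ModuleType(name)
    client.pd = pd
    return client


mod1 = load_client("mod1", 8)
mod2 = load_client("mod2", 5)

same_object = mod1.pd is mod2.pd                 # both names refer to one module
seen_by_mod1 = mod1.pd.options["max_colwidth"]   # mod1 sees the value mod2 set
print(same_object, seen_by_mod1)
```

Because both clients hold a reference to the same module object, the last writer wins, which is the behaviour the MWE exhibits.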

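Can I specify in mod1 that I want to import a clean copy of base (and of its dependency pandas), so that mod1.base.pd is not mod2.base.pd? Nothing I have seen in importlib appears to allow this. The documentation for sys.modules hints at some arcane trickery, but I am not sure whether it covers this case: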

This [sys.modules] can be manipulated to force reloading of modules and other tricks.

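If this is not possible, what is the stylistically correct way to handle this situation? Moving the global assignment/set_option call into each function (bar() and baz()) would work, but that seems tedious and error-prone if many functions rely on to_html() and therefore on the global state. I could wrap DataFrame.to_html(*args) in a new function df_to_html(df, temporary_state, *args) that handles setting and resetting the module option, and simply call that function instead of to_html(), but this again becomes tedious if other functions also depend on the option.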
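The set-then-reset wrapper mentioned above is usually written as a context manager. Here is a minimal sketch, assuming a hypothetical OPTIONS dictionary and a toy render() function in place of pandas and to_html(); pandas itself offers pandas.option_context for the same purpose, which is not used here so that the example runs without third-party packages.

```python
from contextlib import contextmanager

# Hypothetical global option store standing in for a library's module-level settings.
OPTIONS = {"max_colwidth": 50}


@contextmanager
def temporary_option(key, value):
    """Set OPTIONS[key] for the duration of a with-block, then restore the old value."""
    previous = OPTIONS[key]
    OPTIONS[key] = value
    try:
        yield
    finally:
        OPTIONS[key] = previous  # restored even if the body raises


def render(text):
    """Toy renderer that depends on global state, like to_html() in the question."""
    width = OPTIONS["max_colwidth"]
    return text if len(text) <= width else text[: width - 3] + "..."


def bar():
    with temporary_option("max_colwidth", 8):
        return render("Looooooooooooong")


def baz():
    with temporary_option("max_colwidth", 5):
        return render("Texxxxxxxxxxxxxt")


result_bar = bar()
result_baz = baz()
print(result_bar, result_baz, OPTIONS["max_colwidth"])
```

Each function states the setting it needs at the point of use, and the global value is back at its default afterwards, so the order in which modules were imported no longer matters.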

base.py

import pandas as pd

def foo():
    return pd.DataFrame(
        [["Looooooooooooooooooooooooooooooooong", "a"],
         ["b", "Texxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxt"]],
        columns=['A', 'B'])

mod1.py

import base

base.pd.set_option('display.max_colwidth',8)

def bar():
    return base.foo().loc[:,['A']].to_html()

mod2.py

import base

base.pd.set_option('display.max_colwidth',5)

def baz():
    return base.foo().loc[:,['B']].to_html()

main.py

import mod1
import mod2

def qux():
    while input():
        print(mod1.bar())
        print(mod2.baz())

if __name__=='__main__':
    qux()

Output

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>A</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>L...</td>
    </tr>
    <tr>
      <th>1</th>
      <td>b</td>
    </tr>
  </tbody>
</table>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>B</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>a</td>
    </tr>
    <tr>
      <th>1</th>
      <td>T...</td>
    </tr>
  </tbody>
</table>

【Question comments】:

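Why not set the option in bar() and baz() right before they call to_html()? That is certainly cheaper than loading a separate copy of pandas (which, besides being expensive in memory and CPU, seems tricky to get completely right).
@cco I could. But mine is a contrived, minimal example: there may be many functions that depend on the global state of the imported module, and patching each one individually seems error-prone and probably not the best style.
The problem with forcing separate module instances is that you need to remove all of the dependencies from sys.modules, down to the level where the modules have no global state (if such a level exists). To me that seems more error-prone (and, again, more expensive) than setting the global state on the fly as needed. Global state is evil precisely because it leads to this kind of ugly code; either way, you are left with ugly, fragile code.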

Tags: python-3.x pandas dataframe global-variables

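【Solution 1】:

I have since learned the trickery I alluded to in my question, and yes, it is possible. mod1.py and mod2.py need to clear out of sys.modules anything that might be caching global state. In mod1.py and mod2.py, add the following before import base: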

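import sys

# Iterate over a snapshot of the keys: removing entries from sys.modules
# while iterating over it directly raises RuntimeError in Python 3.
for key in list(sys.modules):
    if 'pandas' in key:
        del sys.modules[key]
# Drop the cached copy of base too, if it has already been imported.
sys.modules.pop('base', None)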

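The mod1 and mod2 seen by main.py then each have their own copy of the pandas module imported by base. Since base.py binds pandas to the name pd, this can be verified by adding the following to main.py:

print(mod1.base.pd is mod2.base.pd)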

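If mod1 and mod2 were handed to you and you only later discover that their global state changes can conflict, you can do something similar with sys.modules in main.py. Note, however, that the global state is cached by every module that imports pandas, directly or through another module, which at this point means mod1, mod2, base, and pandas together with its subpackages and modules.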
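The sys.modules cleanup can also be wrapped in a helper that restores the previous cache entries afterwards, so the rest of the program is not affected. The following is only a sketch under stated assumptions: import_isolated and the throwaway modules demo_state and demo_base are hypothetical names invented for illustration, and the demo uses pure-Python modules. Extension modules do not always support being initialized more than once, so this may not carry over to pandas unchanged.

```python
import importlib
import os
import sys
import tempfile


def import_isolated(name, prefixes):
    """Import `name` with fresh copies of every module named in `prefixes`
    (or nested under one of them), then restore the previous sys.modules entries."""
    def matches(key):
        return any(key == p or key.startswith(p + ".") for p in prefixes)

    saved = {k: m for k, m in sys.modules.items() if matches(k)}
    for key in saved:
        del sys.modules[key]
    try:
        return importlib.import_module(name)
    finally:
        # Drop the fresh copies from the cache and put back whatever was there before.
        for key in [k for k in sys.modules if matches(k)]:
            del sys.modules[key]
        sys.modules.update(saved)


# Demo with two throwaway modules (hypothetical names) written to a temp directory.
tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, "demo_state.py"), "w") as f:
    f.write("options = {'max_colwidth': 50}\n")
with open(os.path.join(tmp, "demo_base.py"), "w") as f:
    f.write("import demo_state as st\n")
sys.path.insert(0, tmp)
importlib.invalidate_caches()

names = ["demo_base", "demo_state"]
base1 = import_isolated("demo_base", names)
base2 = import_isolated("demo_base", names)
base1.st.options["max_colwidth"] = 8
base2.st.options["max_colwidth"] = 5

independent = base1.st is not base2.st
widths = (base1.st.options["max_colwidth"], base2.st.options["max_colwidth"])
print(independent, widths)
```

One design caveat: the returned modules are no longer registered in sys.modules, so any import statement executed later inside them (for example inside a function body) would resolve through the shared cache again.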

【Comments】: