python: dictionary dilemma: how to properly index objects based on an attribute

【Question title】: python: dictionary dilemma: how to properly index objects based on an attribute
【Posted】: 2010-02-21 12:07:43
【Question】:

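First, an example:

Given a bunch of Person objects with various attributes (name, ssn, phone, email address, credit card number, etc.)

Now imagine the following simple website:

Uses a person's email address as the unique login name

Lets users edit their attributes (including their email address)

If this website has a large number of users, it makes sense to store the Person objects in a dictionary indexed by email address, for fast Person retrieval at login.

But when a Person's email address is edited, the dictionary key for that Person needs to change as well. This is slightly yucky.

I'm looking for suggestions on how to tackle the general problem:

Given a bunch of entities with a shared aspect. The aspect is used both for fast access to the entities and within each entity's functionality. Where should the aspect be placed:

within each entity (not good for fast access)
in the index only (not good for each entity's functionality)
both within each entity and as an index (duplicate data/reference)
somewhere else/somehow differently

The problem may be extended, for example, if we want to use several indexes on the data (ssn, credit card number, etc.). Eventually we may end up with a bunch of SQL tables.

I'm looking for something with the following properties (and more, if you can think of them):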

# create an index on the attribute of a class
# (SomeClass stands in for any class; "class" itself is a reserved word)
magical_index = magical_index_factory(SomeClass, SomeClass.attribute)
# create an object
obj = SomeClass()
# set the object's attribute
obj.attribute = value
# retrieve object using attribute as index
magical_index[value]
# change object attribute to new value
obj.attribute = new_value
# automagically object can be retrieved using new value of attribute
magical_index[new_value]
# become less materialistic: get rid of the objects in your life
del obj
# object is really gone
magical_index[new_value]  # raises KeyError: new_value

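I want the objects and the indexes to play nicely and seamlessly with each other.

Please suggest appropriate design patterns.

Note: the example above is just that, an example, used to portray the general problem. So please provide generic solutions (of course, you may choose to keep using the example when explaining your generic solution).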

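【Comments】:

First of all, why don't you simply use a relational database? A Python dictionary means that all of your "large number of users" are in memory all the time, slowing things down.
@S. Lott: With modern computers you can fit a few hundred megabytes of users into memory, which is a lot. So it can actually be faster than using a relational database.
@Otto Allmendinger: Absolutely correct. However, the wording of the question makes it sound like homework. I'm probing for the reason a database is not being used, since a database is the standard approach. While it is possible to do without one, people rarely do, and I don't see why anyone would attempt it, other than for homework, of course.
@Otto - Yes, unless you are using an in-memory database. That would let you keep the users in memory and still use proper SQL to access them. You are reinventing a wheel that already works well. Another consideration is thread safety and isolation. I usually bring large data sets into memory this way when they are read-only. If your people are changing, I would go back to a relational database or add a caching solution.
@S. Lott: Please see my note. I know that relational (and persistent) databases are a good solution for the example above. I am not concerned with websites, logins, or people. I am more interested in solutions to the general problem I posted: indexing objects by attribute, with objects and indexes playing well together.

Tags: python design-patterns data-structures dictionary indexing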

【Solution 1】:

Consider this.

class Person(object):
    def __init__(self, name, addr, email):  # further attributes elided
        self.observer = []  # observers to notify when an attribute changes
        self._name = name
        # ... etc. ...

    @property
    def name(self):
        return self._name

    @name.setter
    def name(self, value):
        self._name = value
        for observer in self.observer:
            observer.update(self)

    # ... etc. ... (same pattern for the other attributes)

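This observer attribute implements an Observable that notifies its Observers of updates. It is the list of observers that must be notified of changes.

Each attribute is wrapped in a property. Using descriptors would probably be better, because it avoids repeating the observer notification in every setter.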
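As a rough illustration of the descriptor idea, here is a minimal sketch. The names ObservedAttribute and ChangeLog are hypothetical and not part of the answer, and the sketch assumes Python 3.6+ (for `__set_name__`) and that each instance keeps its observers in a list called `observer`, as above.

```python
class ObservedAttribute:
    """Data descriptor (hypothetical name): stores a value per instance
    and notifies the instance's observers whenever it is assigned."""

    def __set_name__(self, owner, name):
        # Private slot on the instance where the real value lives.
        self.private_name = "_" + name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        return getattr(instance, self.private_name)

    def __set__(self, instance, value):
        setattr(instance, self.private_name, value)
        # The observer list may not exist yet if __init__ assigns early.
        for observer in getattr(instance, "observer", []):
            observer.update(instance)


class Person:
    # One line per attribute; the notification logic is written only once.
    name = ObservedAttribute()
    email = ObservedAttribute()

    def __init__(self, name, email):
        self.observer = []
        self.name = name
        self.email = email


class ChangeLog:
    """Hypothetical observer that simply records each change it is told about."""

    def __init__(self):
        self.seen = []

    def update(self, person):
        self.seen.append((person.name, person.email))


person = Person("John", "email1@dot.com")
log = ChangeLog()
person.observer.append(log)
person.email = "jon@dot.com"  # triggers log.update(person)
```

The setter logic that the property version repeats for every attribute lives in a single `__set__` here, which is the saving the paragraph above refers to.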

import collections

class PersonCollection(set):
    def __init__(self, *args, **kw):
        self.byName = collections.defaultdict(list)
        self.byEmail = collections.defaultdict(list)
        super(PersonCollection, self).__init__(*args, **kw)

    def add(self, person):
        super(PersonCollection, self).add(person)
        person.observer.append(self)
        self.byName[person.name].append(person)
        self.byEmail[person.email].append(person)

    def update(self, person):
        """This person changed.  Find them in old indexes and fix them."""
        for index, attr in ((self.byName, "name"), (self.byEmail, "email")):
            # Drop the person from whichever key still holds them.
            for key in list(index):
                if any(p is person for p in index[key]):
                    index[key] = [p for p in index[key] if p is not person]
                    if not index[key]:
                        del index[key]
            # Re-file the person under the current attribute value.
            index[getattr(person, attr)].append(person)

    # ... etc. ... for all methods of a collections.Set.

Use the abstract base classes in collections (collections.abc in Python 3) for more information on what must be implemented.

http://docs.python.org/library/collections.html#abcs-abstract-base-classes

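If you want "generic" indexing, then your collection can be parameterized with the names of the attributes, and you can use getattr to fetch those named attributes from the underlying objects.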

class GenericIndexedCollection(set):
    attributes_to_index = []  # List of attribute names

    def __init__(self, *args, **kw):
        self.indexes = dict((n, {}) for n in self.attributes_to_index)
        super(GenericIndexedCollection, self).__init__(*args, **kw)

    def add(self, person):
        super(GenericIndexedCollection, self).add(person)
        for i in self.indexes:
            # Assumed intent: map each attribute value to the object carrying it.
            self.indexes[i][getattr(person, i)] = person
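Note. To emulate a database properly, use sets rather than lists. Database tables are (in theory) sets. In practice they are unordered, and an index allows the database to reject duplicates. Some RDBMSs do not reject duplicate rows because, without an index, the check is too expensive.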
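To make the "an index allows the database to reject duplicates" point concrete, here is a minimal sketch of a unique index built on a plain dict. UniqueIndex is a hypothetical name introduced only for illustration, and the minimal Person class is an assumption; neither is part of the code above.

```python
class UniqueIndex:
    """Hypothetical unique index: one attribute value maps to exactly one object."""

    def __init__(self, attribute_name):
        self.attribute_name = attribute_name
        self._by_value = {}

    def add(self, obj):
        key = getattr(obj, self.attribute_name)
        # The dict lookup makes the duplicate check cheap, which is
        # what an index buys a database as well.
        if key in self._by_value and self._by_value[key] is not obj:
            raise ValueError("duplicate %s: %r" % (self.attribute_name, key))
        self._by_value[key] = obj

    def __getitem__(self, key):
        return self._by_value[key]


class Person:
    def __init__(self, name, email):
        self.name = name
        self.email = email


by_email = UniqueIndex("email")
by_email.add(Person("John", "email1@dot.com"))

try:
    by_email.add(Person("Jack", "email1@dot.com"))  # same email, different person
    rejected = False
except ValueError:
    rejected = True
```

Without the dict, detecting the second insert would require scanning every stored object, which is the cost the note attributes to unindexed tables.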

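【Comments】:

@bandana: Consider this. The code is not complete. However, there is a lot you can do to add that automatic behavior. (1) Each Person object can be bound to the collection that owns it. (2) All changes can go through the collection. #2 is the standard RDBMS approach. #1 is the standard ORM approach.
Updating the lookup when a person changes their email address would have to be handled either inside Person, which leads to very awkward coupling, or by client code, which is no better than using a plain dict. I'd say the simpler solution of having a separate dict key and a duplicated object attribute is still the way to go.
@Max S.: I'm not sure the Observer/Observable pattern amounts to "very awkward coupling".
My comment was written before you described going through the observer pattern, but I would still say that, for a Python design, this is awkward. A Person should be able to go into a container without being touched. Given Python's dynamic nature, I think having the Collection inject the observing code dynamically would save a lot of work.
@Max S. Agreed. I think descriptors are one way to inject the observer/observable into it.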

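【Solution 2】:

Well, another approach might be to implement the following:

Attr (the Attribute class in the code) is an abstraction of a "value". We need it because there is no "assignment overloading" in Python (a simple get/set paradigm is used as the cleanest alternative). Attr also acts as an "Observable".
AttrSet is an "Observer" of Attrs: it tracks their value changes while effectively acting as an Attr-to-whatever (person, in our case) dictionary.
create_with_attrs is a factory that produces what looks like a named tuple and forwards attribute access through the supplied Attrs, so that person.name = "Ivan" effectively becomes person.name_attr.set("Ivan") and makes the AttrSets observing this person's name rearrange their internals accordingly.

Code (tested):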

from collections import defaultdict

class Attribute(object):
    def __init__(self, value):
        super(Attribute, self).__init__()
        self._value = value
        self._notified_set = set()
    def set(self, value):
        old = self._value
        self._value = value
        for n_ch in self._notified_set:
            n_ch(old_value=old, new_value=value)
    def get(self):
        return self._value
    def add_notify_changed(self, notify_changed):
        self._notified_set.add(notify_changed)
    def remove_notify_changed(self, notify_changed):
        self._notified_set.remove(notify_changed)

class AttrSet(object):
    def __init__(self):
        super(AttrSet, self).__init__()
        self._attr_value_to_obj_set = defaultdict(set)
        self._obj_to_attr = {}
        self._attr_to_notify_changed = {}
    def add(self, attr, obj):
        self._obj_to_attr[obj] = attr
        self._add(attr.get(), obj)
        notify_changed = (lambda old_value, new_value:
                          self._notify_changed(obj, old_value, new_value))
        attr.add_notify_changed(notify_changed)
        self._attr_to_notify_changed[attr] = notify_changed
    def get(self, *attr_value_lst):
        attr_value_lst = attr_value_lst or self._attr_value_to_obj_set.keys()
        result = set()
        for attr_value in attr_value_lst:
            result.update(self._attr_value_to_obj_set[attr_value])
        return result
    def remove(self, obj):
        attr = self._obj_to_attr.pop(obj)
        self._remove(attr.get(), obj)
        notify_changed = self._attr_to_notify_changed.pop(attr)
        attr.remove_notify_changed(notify_changed)
    def __iter__(self):
        return iter(self.get())
    def _add(self, attr_value, obj):
        self._attr_value_to_obj_set[attr_value].add(obj)
    def _remove(self, attr_value, obj):
        obj_set = self._attr_value_to_obj_set[attr_value]
        obj_set.remove(obj)
        if not obj_set:
            self._attr_value_to_obj_set.pop(attr_value)
    def _notify_changed(self, obj, old_value, new_value):
        self._remove(old_value, obj)
        self._add(new_value, obj)

def create_with_attrs(**attr_name_to_attr):
    class Result(object):
        def __getattr__(self, attr_name):
            if attr_name in attr_name_to_attr.keys():
                return attr_name_to_attr[attr_name].get()
            else:
                raise AttributeError(attr_name)
        def __setattr__(self, attr_name, attr_value):
            if attr_name in attr_name_to_attr.keys():
                attr_name_to_attr[attr_name].set(attr_value)
            else:
                raise AttributeError(attr_name)
        def __str__(self):
            result = ""
            for attr_name in attr_name_to_attr:
                result += (attr_name + ": "
                           + str(attr_name_to_attr[attr_name].get())
                           + ", ")
            return result
    return Result()

With the prepared data

name_and_email_lst = [("John","email1@dot.com"),
                      ("John","email2@dot.com"),
                      ("Jack","email3@dot.com"),
                      ("Hack","email4@dot.com"),
                      ]

email = AttrSet()
name = AttrSet()

for name_str, email_str in name_and_email_lst:
    email_attr = Attribute(email_str)
    name_attr = Attribute(name_str)
    person = create_with_attrs(email=email_attr, name=name_attr)
    email.add(email_attr, person)
    name.add(name_attr, person)

def print_set(person_set):
    for person in person_set: print person
    print

The following sequence of pseudo-SQL snippets gives:

SELECT id FROM email

>>> print_set(email.get())
email: email3@dot.com, name: Jack,
email: email4@dot.com, name: Hack,
email: email2@dot.com, name: John,
email: email1@dot.com, name: John,

SELECT id FROM email WHERE email="email1@dot.com"

>>> print_set(email.get("email1@dot.com"))
email: email1@dot.com, name: John,

SELECT id FROM email WHERE email="email1@dot.com" OR email="email2@dot.com"

>>> print_set(email.get("email1@dot.com", "email2@dot.com"))
email: email1@dot.com, name: John,
email: email2@dot.com, name: John,

SELECT id FROM name WHERE name="John"

>>> print_set(name.get("John"))
email: email1@dot.com, name: John,
email: email2@dot.com, name: John,

SELECT id FROM name, email WHERE name="John" AND email="email1@dot.com"

>>> print_set(name.get("John").intersection(email.get("email1@dot.com")))
email: email1@dot.com, name: John,

UPDATE email, name SET email="jon@dot.com", name="Jon"

WHERE id IN

SELECT id FROM email WHERE email="email1@dot.com"

>>> person = email.get("email1@dot.com").pop()
>>> person.name = "Jon"; person.email = "jon@dot.com"
>>> print_set(email.get())
email: email3@dot.com, name: Jack,
email: email4@dot.com, name: Hack,
email: email2@dot.com, name: John,
email: jon@dot.com, name: Jon,

DELETE FROM email, name WHERE id=%s

SELECT id FROM email

>>> name.remove(person)
>>> email.remove(person)
>>> print_set(email.get())
email: email3@dot.com, name: Jack,
email: email4@dot.com, name: Hack,
email: email2@dot.com, name: John,

【Comments】: