ActiveState Powered by ActiveState

Recipe 259174: bag collection class


Implement Smalltalk's bag class (similar to MultiSets in C++ or bag's in Objective C or Haskell's Edison module).

Python
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
from operator import itemgetter
from heapq import nlargest

class bag(object):

    def __init__(self, iterable=()):
        self._data = {}
        self._len = 0
        self.update(iterable)

    def update(self, iterable):
        if isinstance(iterable, dict):
            for elem, n in iterable.iteritems():
                self[elem] += n
        else:
            for elem in iterable:
                self[elem] += 1 

    def __contains__(self, elem):
        return elem in self._data

    def __getitem__(self, elem):
        return self._data.get(elem, 0)

    def __setitem__(self, elem, n):
        self._len += n - self[elem]
        self._data[elem] = n
        if n == 0:
            del self._data[elem]

    def __delitem__(self, elem):
        self._len -= self[elem]
        del self._data[elem]

    def __len__(self):
        assert self._len == sum(self._data.itervalues())
        return self._len

    def __eq__(self, other):
        if not isinstance(other, bag):
            return False
        return self._data == other._data

    def __ne__(self, other):
        if not isinstance(other, bag):
            return True
        return self._data != other._data

    def __hash__(self):
        raise TypeError

    def __repr__(self):
        return 'bag(%r)' % self._data

    def copy(self):
        return self.__class__(self)

    __copy__ = copy # For the copy module

    def __deepcopy__(self, memo):
        from copy import deepcopy
        result = self.__class__()
        memo[id(self)] = result
        data = result._data
        result._data = deepcopy(self._data, memo)
        result._len = self._len
        return result

    def __getstate__(self):
        return self._data.copy(), self._len

    def __setstate__(self, data):
        self._data = data[0].copy()        
        self._len = data[1]

    def clear(self):
        self._data.clear()
        self._len = 0

    def __iter__(self):
        for elem, cnt in self._data.iteritems():
            for i in xrange(cnt):
                yield elem
                
    def iterunique(self):
        return self._data.iterkeys()

    def itercounts(self):
        return self._data.iteritems()     

    def mostcommon(self, n=None):
        if n is None:
            return sorted(self.itercounts(), key=itemgetter(1), reverse=True)
        it = enumerate(self.itercounts())
        nl = nlargest(n, ((cnt, i, elem) for (i, (elem, cnt)) in it))
        return [(elem, cnt) for cnt, i, elem in nl]

Discussion

I'm proposing this API for inclusion in a Py2.5 collections module. Please comment if you see any way the API can be improved.

Example: <pre>

>>> from collections import bag
>>> b = bag('banana')
>>> b.mostcommon()
[('a', 3), ('n', 2), ('b', 1)]
>>> list(b)
['a', 'a', 'a', 'b', 'n', 'n']
>>> set(b.iterunique())  # faster than set(b)
set(['a', 'b', 'n'])
>>> b['a']
3
>>> b['a'] += 4
>>> b['a']
7
>>> del b['a']
>>> b['a'] += 1
>>> b['a']
1
</pre>

Reposted after incorporating most of the suggestions below.

Comments

  1. 1. At 7:07 a.m. on 16 jan 2004, Matt R said:

    subscripting? Finding this recipe has just saved me an hour or so -- thanks!

    API: How about using subscripting to access the counts?

    >>> b = bag('banana')
    >>> b['a']
    3
    
  2. 2. At 4:16 p.m. on 17 jan 2004, Anonymous said:

    API. I agree with previous comment from Matt R and prefer subscripting interface so i can:

    >>> mybag = bag("abacab")
    
    
    
    >>> mybag['d'] += 1 #automatically adds 'a' to bag if doesn't exist
    
    
    
    >>> mybag['c'] -= 2 #clear (or throw exception) if less than 2 'c's in bag
    
    
    
    >>> mybag['a'] = 0  #removes all 'a' from bag
    
    
    
    >>> mybag['e']      #queries bag for the count of 'e'
    

    0

    This gets rid of count, add, remove methods.

    Since I inheret my bag from dict, I redefine pop() to both return and remove all of a given item of bag. Also since I have to redefine dict.__setitem__() anyway, I use that opportunity to validate the dict(non-zero integer) values before insertion. This also makes it easy for subclasses to redefine the policy of how the bag should behave (like in example #2 above). (Alternatively, one could pass in a policy function or "filter" to the bag constructor.)

    I also allow add/subtract arithmetic on bags:

    >>> bag("word counts from file one".split()) + bag("word counts from another file".split())
    

    bag({"word": 2, "counts": 2, "from": 2, "another": 1, "file": 2, "one": 1})

    Mike

  3. 3. At 7:02 p.m. on 18 jan 2004, Anonymous said:

    addendum. I should add that the suggested API allows me to easily create an weighted graph class as a dict of bags with a nice API for the graph:

    >>> g = graph()
    >>> g['here']['there']  #returns edge weight
    
    
    
    >>> g['here']['newthere'] = 1 #add new vertex 'newthere' with weight=1
    
  4. 4. At 12:11 a.m. on 29 nov 2004, Remy Blank said:

    Inequality operator. Shouldn't the __ne__() operator be:

    def __ne__(self, other):
            if not isinstance(other, bag):
                return True
            return self._data != other._data
    

    I.e. if the other object is not a bag, it is not equal to the bag?

  5. 5. At 11 a.m. on 29 nov 2004, Steven Bethard said:

    __iter__. Why does __iter__ iterate over (item, count) pairs as opposed to just items like dict? That is, why not try to parallel the current Python objects more?

  6. 6. At 11:03 a.m. on 29 nov 2004, Steven Bethard said:

    remove. I think maybe remove needs renamed. If I called the remove method on a bag, I would expect the count to be zeroed, not decremented. Maybe 'decrement' is a better name? I would also like a remove method that really does zero the count -- this is useful for pruning a bag by count... Of course, a 'prune' method that removes all items with less than or equal to a given count would be even better for my purposes...

  7. 7. At 11:06 a.m. on 29 nov 2004, Steven Bethard said:

    sortedbycount. I don't think sortedbycount should be a method. As your implementation shows, this functionality is already available with sorted(). I'm not sure that sortedbycount actually gains us much here...

  8. 8. At 7:03 a.m. on 30 nov 2004, Andreas Kloss said:

    Thinking in abstract vs. implementation. If you think of the bag as a key:count dict, then yes, "remove" should remove the whole key, and "decrement" could decrement until 0 (then remove).

    On the other hand, if you think of the bag as an unordered container for multiple possibly-equal things, then it is sensible for "remove" to remove just one element instead of all elements being equal to the one you happened to select.

    Case in point: Suppose you have a bag of many black and white marbles. You remove a white marble. Would you expect the bag to not contain any white marble? The method to remove all elements of an equality class should be aptly named "removeAll" or something to this effect.

    -- Andre

    Providing high quality math zealotry since 1977 ;)

  9. 9. At 11:03 p.m. on 30 nov 2004, Steven Bethard said:

    My typical use for a bag class would be to hold word counts, not marbles. ;) So my abstraction is keys with counts. Of course, it's a moot point since the remove method's been removed in favor of __getitem__ and __setitem__.

  10. 10. At 11:07 p.m. on 30 nov 2004, Steven Bethard said:

    mostcommon. Thanks, I like mostcommon a lot better!

  11. 11. At 2:55 p.m. on 2 dec 2004, Jim Jewett said:

    specialcase update with another bag. I would expect to able to write

    b1 = Bag()
    b2 = Bag()
    ...
    b2.update(b1)
    

    As written, this turns out to be a fairly inefficient way of doing it. If b1[k] == 20, it will take 20 separate calls to the iterator, the length adjuster, and the dictionary's __setitem__

  12. 12. At 4:06 p.m. on 16 feb 2005, Steven Bethard said:

    itercounts is confusing. I was looking at this again today because I need to use something like this right now, and it strikes me that itercounts is kind of confusing. For my current task, I want just the counts (i.e. the equivalent of _data.itervalues()). To me, itercounts sounds like it does this, not iterate through the (item, count) pairs.

    So I guess I have two suggestions: (1) Make itercounts iterate just through the counts, and (2) Rename itercounts to something like iteritems or iteritemcounts.

    Thanks again!

  13. 13. At 11:34 a.m. on 27 mar 2005, Martin v. Löwis said:

    Still need add and remove. Users coming from Smalltalk or Objective C, or users familiar with C++ multiset will expect that a collections.bag is a container that is capable of holding the same thing (say, a string) multiple times. Currently, there appears to be no .add method, which I think there should be. Also, .remove should remove a single item, not all copies of the same item. For convenience, .addAll might be useful, to implement a frequency count.

  14. 14. At 9:15 a.m. on 19 jun 2005, Jon Blunt said:

    Bags should do not have -ve quantities. If you decrement a value below 0 there is no logical meaning and you can break the validity of len(myBag).

    >>> x =bag.bag()
    
    
    
    >>> x['a']
    

    0

    >>> x['a']-=1
    
    
    
    >>> x['a']
    

    -1

    >>> len (x)
    

    Traceback (most recent call last):

    File "", line 1, in -toplevel-

    len (x)
    

    ValueError: __len__() should return >= 0

  15. 15. At 11:41 a.m. on 27 nov 2007, Dan Stromberg said:

    Union, intersection, and difference of two bags? This class would be more general if it supported union, intersection and difference of two bags:

    http://www.brpreiss.com/books/opus5/html/page398.html

    ...making it more truly like "sets with duplicates".

    I'll probably need a method that does a modified union, where b.minunion(c) would give a bag with elements e s.t. their repetition counts are max(|e| in b, |e| in c). But I'm not sure I should foist this weird operation on the rest of the python world. :)

Sign in to comment