9 """
10 This module contains a number of basic clustering algorithms. Clustering
11 describes the task of discovering groups of similar items with a large
12 collection. It is also describe as unsupervised machine learning, as the data
13 from which it learns is unannotated with class information, as is the case for
14 supervised learning. Annotated data is difficult and expensive to obtain in
15 the quantities required for the majority of supervised learning algorithms.
16 This problem, the knowledge acquisition bottleneck, is common to most natural
17 language processing tasks, thus fueling the need for quality unsupervised
18 approaches.

This module contains a k-means clusterer, E-M clusterer and a group average
agglomerative clusterer (GAAC). All these clusterers involve finding good
cluster groupings for a set of vectors in multi-dimensional space.

The K-means clusterer starts with k arbitrarily chosen means, then allocates
each vector to the cluster with the closest mean. It then recalculates the
means of each cluster as the centroid of the vectors in the cluster. This
process repeats until the cluster memberships stabilise. This is a
hill-climbing algorithm which may converge to a local maximum; hence the
clustering is often repeated with random initial means and the most commonly
occurring output means are chosen.
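
A single assignment-and-update iteration can be sketched as follows (an
illustrative sketch only, not this module's implementation; it assumes a
distance function such as euclidean_distance and that no cluster is left
empty)::
    def kmeans_iteration(vectors, means, distance):
        # allocate each vector to the cluster with the closest mean
        clusters = [[] for m in means]
        for vector in vectors:
            best = min((distance(vector, m), i) for i, m in enumerate(means))
            clusters[best[1]].append(vector)
        # recalculate each mean as the centroid of its cluster
        return [sum(c) / float(len(c)) for c in clusters]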

The GAAC clusterer starts with each of the M{N} vectors as singleton
clusters. It then iteratively merges the pair of clusters with the closest
centroids. This continues until there is only one cluster. The order of
merges gives rise to a dendogram - a tree with the earlier merges lower
than later merges. The membership of a given number of clusters M{c},
M{1 <= c <= N}, can be found by cutting the dendogram at depth M{c}.
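
The merge loop can be sketched as follows (illustrative only; centroid and
distance are hypothetical helpers, and dendogram is a Dendogram initialised
with the same items)::
    while len(clusters) > 1:
        # find the pair of clusters whose centroids are closest
        best = min((distance(centroid(a), centroid(b)), i, j)
                   for i, a in enumerate(clusters)
                   for j, b in enumerate(clusters) if i < j)
        d, i, j = best
        dendogram.merge(i, j)
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]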

The Gaussian EM clusterer models the vectors as being produced by a mixture
of k Gaussian sources. The parameters of these sources (prior probability,
mean and covariance matrix) are then found to maximise the likelihood of the
given data. This is done with the expectation maximisation algorithm. It
starts with k arbitrarily chosen means, priors and covariance matrices. It
then calculates the membership probabilities for each vector in each of the
clusters - this is the 'E' step. The cluster parameters are then updated in
the 'M' step using the maximum likelihood estimate from the cluster
membership probabilities. This process continues until the likelihood of the
data does not significantly increase.
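
The E step for a single vector can be sketched as follows (illustrative
only; gaussian is a hypothetical density function, not part of this
module)::
    def responsibilities(x, priors, means, covs):
        # unnormalised posterior probability of each Gaussian source
        likelihoods = [priors[j] * gaussian(x, means[j], covs[j])
                       for j in range(len(priors))]
        total = sum(likelihoods)
        # normalise to give the cluster membership probabilities
        return [lh / total for lh in likelihoods]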

They all extend the ClusterI interface, which defines common operations
available with each clusterer. These operations include:
 - cluster: clusters a sequence of vectors
 - classify: assigns a vector to a cluster
 - classification_probdist: gives the probability distribution over cluster
   memberships
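
For a hard clusterer such as k-means, classification_probdist places all
probability mass on a single cluster; for example (a sketch assuming the
clusterer built in the usage example below)::
    probdist = clusterer.classification_probdist(array([2, 2]))
    for name in clusterer.cluster_names():
        print name, probdist.prob(name)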

The existing clusterers also extend cluster.VectorSpace, an abstract class
which allows for singular value decomposition (SVD) and vector
normalisation. SVD is used to reduce the dimensionality of the vector space
in such a manner as to preserve as much of the variation as possible, by
reparameterising the axes in order of variability and discarding all bar the
first d dimensions. Normalisation ensures that vectors fall in the unit
hypersphere.
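
Schematically, the reduction to the first d dimensions (a condensed view of
the VectorSpace.cluster method below; the variable names here are
illustrative)::
    u, sigma, vt = linalg.svd(numpy.transpose(array(vectors)))
    S = sigma[:d] * numpy.identity(d, numpy.float64)
    T = u[:, :d]                        # kept to map new vectors later
    reduced = numpy.transpose(numpy.dot(S, vt[:d, :]))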

Usage example (see also demo())::
    vectors = [array(f) for f in [[3, 3], [1, 2], [4, 2], [4, 0]]]

    # initialise the clusterer (will also assign the vectors to clusters)
    clusterer = cluster.KMeans(2, euclidean_distance)
    clusterer.cluster(vectors, True)

    # classify a new vector
    print clusterer.classify(array([3, 3]))

Note that the vectors must use numpy array-like objects.
nltk_contrib.unimelb.tacohn.SparseArrays may be used for efficiency when
required.
"""

from nltk_lite.probability import DictionaryProbDist
import copy, numpy, math, random, sys, types
from numpy import array, linalg


class ClusterI:
    """
    Interface covering basic clustering functionality.
    """

    def cluster(self, vectors, assign_clusters=False):
        """
        Assigns the vectors to clusters, learning the clustering parameters
        from the data. Returns a cluster identifier for each vector.
        """
        raise AssertionError()

    def classify(self, token):
        """
        Classifies the token into a cluster, setting the token's CLUSTER
        parameter to that cluster identifier.
        """
        raise AssertionError()

    def likelihood(self, vector, label):
        """
        Returns the likelihood (a float) of the token having the
        corresponding cluster.
        """
        if self.classify(vector) == label:
            return 1.0
        else:
            return 0.0

    def classification_probdist(self, vector):
        """
        Classifies the token into a cluster, returning a probability
        distribution over the cluster identifiers.
        """
        likelihoods = {}
        total = 0.0
        for cluster in self.cluster_names():
            likelihoods[cluster] = self.likelihood(vector, cluster)
            total += likelihoods[cluster]
        for cluster in self.cluster_names():
            likelihoods[cluster] /= total
        return DictionaryProbDist(likelihoods)

    def num_clusters(self):
        """
        Returns the number of clusters.
        """
        raise AssertionError()

    def cluster_names(self):
        """
        Returns the names of the clusters.
        """
        return range(self.num_clusters())

    def cluster_name(self, index):
        """
        Returns the name of the cluster at index.
        """
        return index


class VectorSpace(ClusterI):
    """
    Abstract clusterer which takes tokens and maps them into a vector space.
    Optionally performs singular value decomposition to reduce the
    dimensionality.
    """

    def __init__(self, normalise=False, svd_dimensions=None):
        """
        @param normalise: should vectors be normalised to length 1
        @type normalise: boolean
        @param svd_dimensions: number of dimensions to use in reducing vector
                               dimensionality with SVD
        @type svd_dimensions: int
        """
        self._Tt = None
        self._should_normalise = normalise
        self._svd_dimensions = svd_dimensions

    def cluster(self, vectors, assign_clusters=False, trace=False):
        assert len(vectors) > 0

        # normalise the vectors
        if self._should_normalise:
            vectors = map(self._normalise, vectors)

        # use SVD to reduce the dimensionality
        if self._svd_dimensions and self._svd_dimensions < len(vectors[0]):
            u, d, vt = linalg.svd(numpy.transpose(array(vectors)))
            S = d[:self._svd_dimensions] * \
                numpy.identity(self._svd_dimensions, numpy.float64)
            T = u[:, :self._svd_dimensions]
            Dt = vt[:self._svd_dimensions, :]
            vectors = numpy.transpose(numpy.dot(S, Dt))
            self._Tt = numpy.transpose(T)

        # call abstract method to cluster the vectors
        self.cluster_vectorspace(vectors, trace)

        # assign the vectors to clusters
        if assign_clusters:
            return [self.classify(vector) for vector in vectors]

    def cluster_vectorspace(self, vectors, trace):
        """
        Finds the clusters using the given set of vectors.
        """
        raise AssertionError()

    def classify(self, vector):
        if self._should_normalise:
            vector = self._normalise(vector)
        if self._Tt is not None:
            vector = numpy.dot(self._Tt, vector)
        cluster = self.classify_vectorspace(vector)
        return self.cluster_name(cluster)

    def classify_vectorspace(self, vector):
        """
        Returns the index of the appropriate cluster for the vector.
        """
        raise AssertionError()

    def likelihood(self, vector, label):
        if self._should_normalise:
            vector = self._normalise(vector)
        if self._Tt is not None:
            vector = numpy.dot(self._Tt, vector)
        return self.likelihood_vectorspace(vector, label)

    def likelihood_vectorspace(self, vector, cluster):
        """
        Returns the likelihood of the vector belonging to the cluster.
        """
        predicted = self.classify_vectorspace(vector)
        if cluster == predicted:
            return 1.0
        else:
            return 0.0

    def vector(self, vector):
        """
        Returns the vector after normalisation and dimensionality reduction.
        """
        if self._should_normalise:
            vector = self._normalise(vector)
        if self._Tt is not None:
            vector = numpy.dot(self._Tt, vector)
        return vector

    def _normalise(self, vector):
        """
        Normalises the vector to unit length.
        """
        return vector / math.sqrt(numpy.dot(vector, vector))
243 """ Tree node of a dendogram. """
244
248
249 - def leaves(self, values=True):
259

    def groups(self, n):
        # cut the tree into n groups by repeatedly expanding the most
        # recent merge until enough top-level nodes are available
        queue = [(self._value, self)]
        while len(queue) < n:
            priority, node = queue.pop()
            if not node._children:
                queue.append((priority, node))
                break
            for child in node._children:
                if child._children:
                    queue.append((child._value, child))
                else:
                    queue.append((0, child))
            queue.sort()
        groups = []
        for priority, node in queue:
            groups.append(node.leaves())
        return groups
282 """
283 Represents a dendogram, a tree with a specified branching order. This
284 must be initialised with the leaf items, then iteratively call merge for
285 each branch. This class constructs a tree representing the order of calls
286 to the merge function.
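
    Example (with hypothetical items)::
        dendogram = Dendogram(['a', 'b', 'c', 'd'])
        dendogram.merge(0, 1)   # merge 'a' and 'b'
        dendogram.merge(1, 2)   # merge 'c' and 'd' (indices reflect the
                                # first merge)
        groups = dendogram.groups(2)  # [['a', 'b'], ['c', 'd']]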
    """

290 """
291 @param items: the items at the leaves of the dendogram
292 @type items: sequence of (any)
293 """
294 self._items = [_DendogramNode(item) for item in items]
295 self._original_items = copy.copy(self._items)
296 self._merge = 1

    def merge(self, *indices):
        """
        Merges nodes at the given indices in the dendogram. The merged node
        replaces the first node specified; all other nodes involved in the
        merge are removed.

        @param indices: indices of the items to merge (at least two)
        @type indices: seq of int
        """
        assert len(indices) >= 2
        node = _DendogramNode(self._merge, *[self._items[i] for i in indices])
        self._merge += 1
        self._items[indices[0]] = node
        # delete the merged items in descending index order, so that earlier
        # deletions do not shift the positions of those still to be deleted
        for i in sorted(indices[1:], reverse=True):
            del self._items[i]

    def groups(self, n):
        """
        Finds the n groups of items (leaves) reachable from a cut at depth n.
        @param n: number of groups
        @type n: int
        """
        if len(self._items) > 1:
            root = _DendogramNode(self._merge, *self._items)
        else:
            root = self._items[0]
        return root.groups(n)

    def show(self):
        """
        Print the dendogram in ASCII art to standard out.
        """

        # ASCII rendering characters
        JOIN, HLINK, VLINK = '+', '-', '|'

        # find the root (or create one from the remaining top-level nodes)
        if len(self._items) > 1:
            root = _DendogramNode(self._merge, *self._items)
        else:
            root = self._items[0]
        leaves = self._original_items

        # find the bottom row and the best cell width
        last_row = [str(leaf._value) for leaf in leaves]
        width = max(map(len, last_row)) + 1
        lhalf = width / 2
        rhalf = width - lhalf - 1

        # helpers to render a fixed-width cell and to write output
        def format(centre, left=' ', right=' '):
            return '%s%s%s' % (lhalf * left, centre, right * rhalf)
        def display(text):
            sys.stdout.write(text)

        # render the tree, one merge per iteration, latest merges first
        queue = [(root._value, root)]
        verticals = [format(' ') for leaf in leaves]
        while queue:
            priority, node = queue.pop()
            child_left_leaf = map(lambda c: c.leaves(False)[0], node._children)
            indices = map(leaves.index, child_left_leaf)
            if child_left_leaf:
                min_idx = min(indices)
                max_idx = max(indices)
            for i in range(len(leaves)):
                if leaves[i] in child_left_leaf:
                    if i == min_idx: display(format(JOIN, ' ', HLINK))
                    elif i == max_idx: display(format(JOIN, HLINK, ' '))
                    else: display(format(JOIN, HLINK, HLINK))
                    verticals[i] = format(VLINK)
                elif min_idx <= i <= max_idx:
                    display(format(HLINK, HLINK, HLINK))
                else:
                    display(verticals[i])
            display('\n')
            for child in node._children:
                if child._children:
                    queue.append((child._value, child))
            queue.sort()

            for vertical in verticals:
                display(vertical)
            display('\n')

        # finally, display the leaf values along the bottom row
        display(''.join(item.center(width) for item in last_row))
        display('\n')

    def __repr__(self):
        if len(self._items) > 1:
            root = _DendogramNode(self._merge, *self._items)
        else:
            root = self._items[0]
        leaves = root.leaves(False)
        return '<Dendogram with %d leaves>' % len(leaves)


from kmeans import *
from gaac import *
from em import *