The following paper by Motwani et al gives couple of algorithms for computing frequency counts exceeding a user specified threshold over data streams.

1. Sticky sampling.

2. Lossy counting.

http://www.vldb.org/conf/2002/S10P03.pdf

**Lossy Counting:**

bucket width w = ceil (1 / epsilon)

fe = frequency of element e in the steam, whose current count is N.

bcurrent = ceil (N / w)

We store the following (e, f, delta) in Data structure D. where e is element, f is freq and delta is maximum error in f.

1. Initially D is empty.

2. When a new item arrives, if its already in D, we increment its f by 1. Else, we create a new entry (e, f, bcurrent -1). We also prune D by deleting some entries at bucket boundaries (N congruent 0 mod w). Delete if f + delta <= bcurrent.

3. When user requests a list of items, with thresold s, we output entries in S, whose freq f is >= (s – epsilon)N.

****

The above paper led to other algorithms including the one by Eric Demaine et al.

http://erikdemaine.org/papers/NetworkStats_TR2002/paper.pdf

**Algorithm Frequent**

1. Initialize the counters to zero.

2. For each element in the stream:

a. If the current element is not monitored by any counter and some counter is zero, define the current element to be the monitored element of that counter.

b. If the current element is the monitored element of a counter, increment the counter. Otherwise decrement every counter.

****

The related “Britney Spears problem” which the above two papers solve is discussed in the below article.

http://www.americanscientist.org/issues/pub/the-britney-spears-problem/99999