Best method to find the most frequent number

Imran

I have a vector of integers, such as [1 3 6 7 6 3 3 4 9 10 ]

and I want to find the number that occurs most frequently. What is the
quickest method? My array is huge.

What I am doing is:

1. find out the maximum value N
2. loop through 1...N
3. count # times each occurred
4. output the most frequent one

Are there more efficient implementations available?
Thanks
 
Peter Karlsson

On 2006-03-14, Imran said:
I have a vector of integers, such as [1 3 6 7 6 3 3 4 9 10 ]
and I want to find out the number which occurs most frequently. What is the
quick method? My array size is huge. [snip]
Are there more efficient implementations available?
Thanks

Check out
http://www.jjj.de/fxt/demo/sort/
for some good implementations.

//Peter
 
Richard G. Riley

If the range of the integers is limited to something "reasonable" like
65535 then you could consider creating an array indexed by the
integer in question and incrementing the count. Without writing pure C
code completely, it might approximate to something like:

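/* numIntegers (the element count) is assumed known; countArray starts zeroed */
for (readIndex = 0; readIndex < numIntegers; readIndex++) {
    nextInt = inputIntegers[readIndex];
    countArray[nextInt]++;
}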

Very fast. You could keep track of the most common number in the loop
or do a quick scan at the end.

The creation of the count array and the loop details are in your hands...
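
A minimal sketch of the count-array idea, filling in the allocation and the final scan. The function name, the MAX_VALUE bound, and the choice to resolve ties in favour of the smallest value are assumptions made for illustration, not something stated in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_VALUE 65535u  /* assumed upper bound on the input values */

/* Returns the most frequent value in data[0..n-1]. Every element is
   assumed to lie in 0..MAX_VALUE. Ties go to the smallest value.
   Returns -1 if the count array cannot be allocated. */
long most_frequent_small_range(const unsigned *data, size_t n)
{
    size_t *counts;
    size_t i, best = 0;

    assert(data != NULL && n > 0);
    counts = calloc((size_t)MAX_VALUE + 1, sizeof *counts);
    if (counts == NULL)
        return -1;

    /* one pass over the input: count each value */
    for (i = 0; i < n; i++) {
        assert(data[i] <= MAX_VALUE);
        counts[data[i]]++;
    }

    /* quick scan of the counts at the end */
    for (i = 1; i <= MAX_VALUE; i++)
        if (counts[i] > counts[best])
            best = i;

    free(counts);
    return (long)best;
}
```

The cost is one pass over the n inputs plus one pass over the MAX_VALUE + 1 counters, so it only pays off when the value range is small relative to the memory available.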
 
Vladimir S. Oka

Imran said:
what I am doing is

1. find out the maximum value N

I guess you mean "find the size of the array"? I'd expect that to be
known upfront.

Imran said:
2. loop through 1...N
3. count # times each occurred
4. output the most frequent one

Are there more efficient implementations available?

Let's make this slightly C-specific:

You can use `qsort()` to sort the array. Once sorted, you need just one
pass to determine the most frequent number (and its frequency),
without the need to keep track of more than one count at a time
(well, two if you include the current maximum). What you've proposed above
would have required a separate count for every unique number you
encounter.
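
A sketch of the sort-then-scan approach, assuming the array may be reordered in place. The names mode_by_sorting and cmp_int are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Comparison callback for qsort(); written to avoid overflow of x - y. */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Sorts data in place, then finds the longest run of equal values.
   Stores the run length in *freq if freq is not NULL.
   On a tie, the smallest value wins. */
int mode_by_sorting(int *data, size_t n, size_t *freq)
{
    size_t i, run = 1, best_run = 1;
    int best;

    assert(data != NULL && n > 0);
    qsort(data, n, sizeof *data, cmp_int);

    best = data[0];
    for (i = 1; i < n; i++) {
        /* only the current run and the best run so far are tracked */
        run = (data[i] == data[i - 1]) ? run + 1 : 1;
        if (run > best_run) {
            best_run = run;
            best = data[i];
        }
    }
    if (freq != NULL)
        *freq = best_run;
    return best;
}
```

This needs no extra memory beyond what qsort() itself uses, at the price of the sort.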
 
Ben C

Reading between the lines a bit, you're making another vector of counts
1 .. N long. It might not need to be that long if not every number is
represented; so you could use a sort of hash table where the hashing
function is just n % n_buckets. This could save memory, but wouldn't
be any faster.
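
One possible shape for such a table, as a sketch only: open addressing with linear probing, with n % n_buckets as the hash. The struct layout, the function name, and the choice of 2n + 1 buckets are assumptions made here; that bucket count guarantees free slots but only saves memory when n is much smaller than the value range.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct slot {
    unsigned key;
    size_t count;   /* count == 0 marks an empty slot */
};

/* Stores the most frequent value in *mode (first to reach the top count
   wins a tie). Returns 0 on success, -1 on allocation failure. */
int mode_by_hashing(const unsigned *data, size_t n, unsigned *mode)
{
    size_t n_buckets, i, h, best = 0;
    struct slot *table;

    assert(data != NULL && mode != NULL && n > 0);
    n_buckets = 2 * n + 1;   /* more buckets than distinct values */
    table = calloc(n_buckets, sizeof *table);
    if (table == NULL)
        return -1;

    for (i = 0; i < n; i++) {
        h = data[i] % n_buckets;
        /* linear probing: skip slots taken by other keys */
        while (table[h].count != 0 && table[h].key != data[i])
            h = (h + 1) % n_buckets;
        table[h].key = data[i];
        table[h].count++;
        if (table[h].count > best) {
            best = table[h].count;
            *mode = data[i];
        }
    }
    free(table);
    return 0;
}
```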

But if every time you update one of the counts you keep track of the
maximum and minimum count so far, i.e.:

...

counts[number]++;
if (counts[number] > *max) max = counts + number;
if (counts[number] < *min) min = counts + number;

...

Then as soon as the number of elements left in the original vector is
less than *max - *min, you can stop counting and break out of the loop,
because none of them can get bigger than the max you've found.

Would this be worth it? Well, statistically, I would have thought quite
possibly, but you'd have to just try it and do some tests.
 
Richard G. Riley

The break condition is a nice optimization alright.

A little simpler might be (no need to keep pointers to maximum and
minimum integer locations and have overhead of pointer addition).

count = ++counts[number];
if (count >= max) {
    secondHighest = max;
    max = count;
}
/* numLeft is number left to process in vector. */
if (numLeft < max - secondHighest)
    break;
 
Ben C

Also what I posted was wrong: max - secondHighest is correct. max - min
is not.

The point at which you break out is not necessarily the point at which
you discover the "winning" max count though; you could have discovered
max and secondHighest on previous iterations, and break out because
numLeft has got small enough.

It means then you may need to get back from max to the corresponding
number, since you need to return the number not the count. That was why
I had max as a pointer. There are other ways of course, like keep the
corresponding number in a variable as you go along etc...
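
A sketch of the "keep the corresponding number in a variable" variant with the early exit. It is one reading of the idea rather than the code posted in the thread: here `second` is the best count among values other than the current leader, which is an assumption of this sketch. The names mode_early_exit, leader, second, and examined are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* counts must hold one zero-initialized entry per possible value.
   Returns the most frequent value; *examined (if not NULL) receives
   the number of elements that were actually read. */
unsigned mode_early_exit(const unsigned *data, size_t n,
                         size_t *counts, size_t *examined)
{
    unsigned leader;
    size_t max = 0, second = 0, i;

    assert(data != NULL && counts != NULL && n > 0);
    leader = data[0];
    for (i = 0; i < n; i++) {
        unsigned number = data[i];
        size_t count = ++counts[number];

        if (count > max) {
            if (number != leader) {   /* the lead changes hands */
                second = max;         /* old leader becomes runner-up */
                leader = number;
            }
            max = count;
        } else if (number != leader && count > second) {
            second = count;
        }
        /* n - i - 1 elements remain; stop once no rival can catch up */
        if (n - i - 1 < max - second)
            break;
    }
    if (examined != NULL)
        *examined = (i < n) ? i + 1 : n;
    return leader;
}
```

Whether the extra bookkeeping per element pays for itself depends on the data; the exit only fires early when one value pulls well ahead of the rest.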
 
Richard G. Riley

Ben C said:
The point at which you break out is not necessarily the point at which
you discover the "winning" max count though; you could have discovered
max and secondHighest on previous iterations, and break out because
numLeft has got small enough.

I thought I did that? See the line marked /**** here:

if(numLeft < max-secondHighest) /****
break;
 
Ben C

Richard G. Riley said:
I thought I did that?

You do break out in the right place, the question is, having broken out
what do you have? The frequency of the highest-frequency number (but not
the highest-frequency number itself-- because that's not necessarily
"number", which was all I was saying). How to get from the max count
back to the corresponding number is the question. That was why I was
using pointers. *max is the highest count, max - counts would then be
the highest-frequency number, assuming the counts array is just
one entry for each possible value of number.

But never mind, this is just details, and not really as complicated as
I'm making it sound, and there are plenty of other ways of doing it that
are just as good or better. It would be pretty obvious what to do when
one actually implemented it I think.
 
Richard G. Riley

Ben C said:
How to get from the max count back to the corresponding number is the
question. That was why I was using pointers.

You're absolutely right!

A one-liner inside the if block makes it all good:

mostCommonNumber=number;
 
Fred Kleinschmidt

The OP also has to consider that the distribution of values might be
multi-modal:
1 1 1 2 2 2 3 3 3 4 4 4 5 5 5 6 6 6
The above set has 6 equally populated modes.
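
If all modes are wanted, a second pass over the counts can collect them. A sketch, assuming a count array with one entry per possible value has already been filled; collect_modes and its parameters are names invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* counts[v] holds the frequency of value v for v in 0..n_values-1.
   Writes every value that has the highest count to modes[] (which must
   have room for n_values entries) and returns how many were written. */
size_t collect_modes(const size_t *counts, size_t n_values, unsigned *modes)
{
    size_t v, best = 0, n_modes = 0;

    assert(counts != NULL && modes != NULL && n_values > 0);

    /* first pass: find the highest count */
    for (v = 0; v < n_values; v++)
        if (counts[v] > best)
            best = counts[v];
    if (best == 0)
        return 0;   /* no data was counted at all */

    /* second pass: collect every value that reaches it */
    for (v = 0; v < n_values; v++)
        if (counts[v] == best)
            modes[n_modes++] = (unsigned)v;
    return n_modes;
}
```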
 
Rod Pemberton

Imran said:
I have a vector of integers, such as [1 3 6 7 6 3 3 4 9 10 ]

and I want to find out the number which occurs most frequently. What is the
quick method? My array size is huge.

Forget about sorting the data: the time it takes to sort will be much more
than the time it takes to count it.

Try to only loop through the data once.

Factors which heavily affect how this could be implemented are:
1) the range of the integers
2) the data source of the vector
3) the size of the vector

If the range of the integers is small, say zero to 127 (ASCII) for a text
analysis of William Shakespeare, the implementation is simple:
1) create a counter array of 128 unsigned long or unsigned long long's
2) iterate from the first to last element of the vector
3) increment the appropriate counter

If the range is very large, you may want to create multiple arrays to help
you keep track of the data to reduce memory usage:
1) one array of bits where each bit is a seen or not-seen indicator of
values in the vector
2) another array or binary tree which keeps track of the count for each of
the seen integers
3) you may need to normalize the data first, by say mapping one set of
numbers to another.
This could allow you to reduce the range of integers used:
1,6,30,32,60,200 to 1,2,3,4,5,6
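
One way to build such a mapping, as a sketch: sort a copy of the data, drop duplicates, and look each value up by binary search. This does pay for a sort, so it only illustrates the mapping itself; the function name normalize and the 1-based ranks are assumptions chosen to match the example above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Comparison callback shared by qsort() and bsearch(). */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Fills ranks[i] with the 1-based rank of data[i] among the distinct
   values. Returns the number of distinct values, or 0 if the scratch
   copy cannot be allocated. */
size_t normalize(const int *data, size_t n, size_t *ranks)
{
    int *sorted;
    size_t i, n_unique = 0;

    assert(data != NULL && ranks != NULL && n > 0);
    sorted = malloc(n * sizeof *sorted);
    if (sorted == NULL)
        return 0;
    for (i = 0; i < n; i++)
        sorted[i] = data[i];
    qsort(sorted, n, sizeof *sorted, cmp_int);

    /* drop duplicates in place, keeping the sorted order */
    for (i = 0; i < n; i++)
        if (i == 0 || sorted[i] != sorted[i - 1])
            sorted[n_unique++] = sorted[i];

    /* map each original value to its position among the distinct ones */
    for (i = 0; i < n; i++) {
        const int *p = bsearch(&data[i], sorted, n_unique,
                               sizeof *sorted, cmp_int);
        ranks[i] = (size_t)(p - sorted) + 1;
    }
    free(sorted);
    return n_unique;
}
```

The ranks can then index a counter array with only as many entries as there are distinct values.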

If the data source of the vector is say from some function, forget about
generating the vector. Just feed the data into your tabulation method.

If the "huge" vector is truly huge, it becomes an issue of finding a method
to balance the memory available for tabulation and the speed of execution.


Rod Pemberton
 
Jan-Hinnerk Dumjahn

Imran said:
I have a vector of integers, such as [1 3 6 7 6 3 3 4 9 10 ]

and I want to find out the number which occurs most frequently. What is the
quick method? My array size is huge.

There are two common "types" of complexity (of an algorithm):
- expected complexity
- worst-case complexity

For calculating expected complexity you need to know something about the
probabilities of different vectors.
Moreover, good expected complexity is not good enough for some applications
(e.g. real-time)

This leaves you with worst-case complexity.

Imran said:
what I am doing is

1. find out the maximum value N
2. loop through 1...N
3. count # times each occurred
4. output the most frequent one

In terms of worst-case complexity, none of the approaches is better than
yours.

I guess the best approach is some kind of sorting and then scanning through
once, like Vladimir suggested. However qsort() is not the best choice since
Quicksort has complexity O(n^2).

Faster sorting algorithms usually work better with linked lists than with
arrays, so you would need some extra memory ;-)

My suggestion would be to use a (height-balanced) binary-search-tree.
At the nodes you keep pairs of (value,frequency). For each array element
search the value in the tree. If it exists increment frequency; else add a
new node (value,1). After you are done with the array look for the node
with highest frequency.

This algorithm is especially fast, if you don't have many different values.

There should be BST implementations available in C.
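
A sketch of the (value, frequency) tree. To stay short it uses a plain, unbalanced binary search tree, so it does not have the worst-case guarantee of the height-balanced tree suggested above, and it tracks the best node while inserting instead of scanning the tree afterwards. All names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct node {
    int value;
    size_t frequency;
    struct node *left, *right;
};

/* Finds value in the tree, or adds a new node with frequency 0.
   Returns NULL only if allocation fails. */
static struct node *find_or_add(struct node **root, int value)
{
    struct node *p;

    while (*root != NULL) {
        if (value == (*root)->value)
            return *root;
        root = value < (*root)->value ? &(*root)->left : &(*root)->right;
    }
    p = malloc(sizeof *p);
    if (p != NULL) {
        p->value = value;
        p->frequency = 0;
        p->left = p->right = NULL;
        *root = p;
    }
    return p;
}

/* Recursive teardown; recursion depth equals the tree height. */
static void free_tree(struct node *t)
{
    if (t != NULL) {
        free_tree(t->left);
        free_tree(t->right);
        free(t);
    }
}

/* Stores the most frequent value in *mode.
   Returns 0 on success, -1 on allocation failure. */
int mode_by_tree(const int *data, size_t n, int *mode)
{
    struct node *root = NULL, *p;
    size_t i, best = 0;

    assert(data != NULL && mode != NULL && n > 0);
    for (i = 0; i < n; i++) {
        p = find_or_add(&root, data[i]);
        if (p == NULL) {
            free_tree(root);
            return -1;
        }
        if (++p->frequency > best) {
            best = p->frequency;
            *mode = p->value;
        }
    }
    free_tree(root);
    return 0;
}
```

Memory use grows with the number of distinct values rather than with the value range, which is the case where this approach is attractive.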

--------

If your integer type is small compared to the size of the array, sorting the
array with bucket sort could be even faster than using a BST.

Have fun
/Jan-Hinnerk
 
Keith Thompson

Jan-Hinnerk Dumjahn said:
I guess the best approach is some kind of sorting and then scanning through
once, like Vladimir suggested. However qsort() is not the best choice since
Quicksort has complexity O(n^2).

Quicksort has complexity O(n log n). I think pure Quicksort has
worst-case complexity O(n^2), but there's no requirement for qsort()
to be implemented as pure Quicksort.

Jan-Hinnerk Dumjahn said:
Faster sorting algorithms usually work better with linked lists than with
arrays, so you would need some extra memory ;-)

I don't think that's true. The fastest way to sort a linked list is
usually to copy it to an array and sort the array.
 
Jan-Hinnerk Dumjahn

Keith said:
Quicksort has complexity O(n log n). I think pure Quicksort has
worst-case complexity O(n^2), but there's no requirement for qsort()
to be implemented as pure Quicksort.

I have taken a quick look on Google but wasn't able to find anything
on Quicksort with O(n log n). However, I came across heapsort, which seems
to fit quite nicely. (Wish we had done more fun stuff like that at
university...)

qsort() need not have a poor implementation, but it could ;-)

Keith said:
I don't think that's true. The fastest way to sort a linked list is
usually to copy it to an array and sort the array.

The complexity should be the same for both (if you don't need random access). From a
theoretical approach I still believe that e.g. bucketsort would be slower
using arrays. However, if I consider real-world effects like processor
cache, using arrays could be a lot faster ;-(
 
Keith Thompson

Jan-Hinnerk Dumjahn said:
Complexity should be same for both (if you don't need random access). From a
theoretical approach I still believe that e.g. bucketsort would be slower
using arrays. However, if I consider real world effects like processor
cache using arrays could be a lot faster ;-(

Most decent (O(n log n)) sorting algorithms do require random access.
 
Jordan Abel

Keith Thompson said:
I don't think that's true. The fastest way to sort a linked list is
usually to copy it to an array and sort the array.

I'd argue that the fastest way to sort a linked list is to keep it
sorted in the first place.
 
pete

Jordan said:
I'd argue that the fastest way to sort a linked list is to keep it
sorted in the first place.

I find linked lists to be especially well suited to mergesort.


#include <stddef.h> /* for NULL */

struct list_node {
    struct list_node *next;
    void *data;
};

typedef struct list_node list_type;

list_type *list_sort(list_type *head,
    int (*compar)(const list_type *, const list_type *));
list_type *list_merge(list_type *head, list_type *tail,
    int (*compar)(const list_type *, const list_type *));

static long unsigned node_count(list_type *head);
static list_type *list_split(list_type *head, long unsigned count);
static list_type *node_sort (list_type *head, long unsigned count,
    int (*compar)(const list_type *, const list_type *));

/* Sorts the whole list and returns the new head. */
list_type *list_sort(list_type *head,
    int (*compar)(const list_type *, const list_type *))
{
    return node_sort(head, node_count(head), compar);
}

static long unsigned node_count(list_type *head)
{
    long unsigned count;

    for (count = 0; head != NULL; head = head -> next) {
        ++count;
    }
    return count;
}

/* Recursive mergesort on a list of known length. */
static list_type *node_sort(list_type *head, long unsigned count,
    int (*compar)(const list_type *, const list_type *))
{
    long unsigned half;
    list_type *tail;

    if (count > 1) {
        half = count / 2;
        tail = list_split(head, half);
        tail = node_sort(tail, count - half, compar);
        head = node_sort(head, half, compar);
        head = list_merge(head, tail, compar);
    }
    return head;
}

/* Cuts the list after count nodes (count >= 1); returns the second part. */
static list_type *list_split(list_type *head, long unsigned count)
{
    list_type *tail;

    while (--count != 0) {
        head = head -> next;
    }
    tail = head -> next;
    head -> next = NULL;
    return tail;
}

/* Merges two non-empty sorted lists; head wins ties, so the sort is stable. */
list_type *list_merge(list_type *head, list_type *tail,
    int (*compar)(const list_type *, const list_type *))
{
    list_type *list, *sorted, **node;

    node = compar(head, tail) > 0 ? &tail : &head;
    sorted = list = *node;
    *node = sorted -> next;
    while (*node != NULL) {
        node = compar(head, tail) > 0 ? &tail : &head;
        sorted -> next = *node;
        sorted = *node;
        *node = sorted -> next;
    }
    sorted -> next = head != NULL ? head : tail;
    return list;
}
 
Grumble

Jan-Hinnerk Dumjahn said:
I guess the best approach is some kind of sorting and then scanning
through once, like Vladimir suggested. However qsort() is not the
best choice since Quicksort has complexity O(n^2).

Who said qsort() _must_ implement Quicksort?
 
