median of large data set (from large file)


friend.05

I have a very big file (GBs).

I want to calculate the median of this large data set. Is there any
efficient way to calculate it?

I cannot read the file into an array as it will run out of memory
due to the large amount of data.

Thanks
 

ccc31807

I have a very big file (GBs).

I want to calculate the median of this large data set. Is there any
efficient way to calculate it?

I cannot read the file into an array as it will run out of memory
due to the large amount of data.

Thanks

As I recall, you can use a bucket sort to sort a list without reading
all of the elements into memory. Sort the list with a bucket sort,
counting the elements as you go, and then take the middle element.

CC
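
For concreteness, here is a minimal Perl sketch of that counting idea
(not code from the thread). It assumes one value per line in a
hypothetical file 'data.txt', and that the number of *distinct*
values, rather than the number of lines, is small enough to keep a
count hash in memory.

#!/usr/bin/perl
use strict;
use warnings;

# Count occurrences of each distinct value; only the counts live in RAM.
my %count;
my $n = 0;

open my $fh, '<', 'data.txt' or die "Cannot open data.txt: $!";
while (my $line = <$fh>) {
    chomp $line;
    $count{$line}++;
    $n++;
}
close $fh;

# Walk the distinct values in numeric order until we reach the
# rank of the middle element.
my $target = int(($n - 1) / 2);    # 0-based rank of the (lower) median
my $seen   = 0;
for my $value (sort { $a <=> $b } keys %count) {
    $seen += $count{$value};
    if ($seen > $target) {
        print "median: $value\n";
        last;
    }
}

The raw data is streamed once; only the per-value counts are held in
memory.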
 

smallpond

I have a very big file (GBs).

I want to calculate the median of this large data set. Is there any
efficient way to calculate it?

I cannot read the file into an array as it will run out of memory
due to the large amount of data.

Thanks

There is a method for calculating the median without
sorting that takes O(log N) passes through your 9-track
tapes, computing progressively tighter upper and lower bounds.
It's probably in Knuth, but I don't have a copy handy.
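
A speculative Perl sketch of that shrinking-bounds idea
(reconstructed, not taken from Knuth): bisect on the value range,
making one counting pass over the file per step, so each pass needs
only a few scalars of memory. It assumes one number per line in a
hypothetical file 'data.txt'.

#!/usr/bin/perl
use strict;
use warnings;

my $file = 'data.txt';

# Pass 1: element count and overall bounds.
my ($n, $lo, $hi) = (0);
open my $fh, '<', $file or die "Cannot open $file: $!";
while (<$fh>) {
    chomp;
    $n++;
    $lo = $_ if !defined $lo || $_ < $lo;
    $hi = $_ if !defined $hi || $_ > $hi;
}
close $fh;

my $k = int(($n - 1) / 2) + 1;        # 1-based rank of the (lower) median

# Invariant: the median is a data value in [$lo, $hi].
while ($lo < $hi) {
    my $mid = $lo + ($hi - $lo) / 2;
    $mid = $lo if $mid >= $hi;        # guard against rounding/overflow

    # One pass: how many values are <= $mid, and which data values
    # sit closest to $mid on either side?
    my ($le, $max_le, $min_gt) = (0, undef, undef);
    open my $pass, '<', $file or die "Cannot open $file: $!";
    while (<$pass>) {
        chomp;
        if ($_ <= $mid) {
            $le++;
            $max_le = $_ if !defined $max_le || $_ > $max_le;
        }
        else {
            $min_gt = $_ if !defined $min_gt || $_ < $min_gt;
        }
    }
    close $pass;

    if ($le >= $k) { $hi = $max_le }  # at least $k values are <= $mid
    else           { $lo = $min_gt }  # the median lies above $mid
}
print "median: $lo\n";

Each step moves one bound onto an actual data value, so the loop
terminates even with duplicates, and the number of passes is roughly
logarithmic in the spread of the data.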
 

Ilya Zakharevich

Median is MAJORLY tricky. (IIRC,) Contrary to a widespread belief,
there IS a linear-time, limited-memory algo (but it is MUCH more
delicate than the usual FFT/quicksort-like tricks).

Oops, of course it needs "limited" RANDOM-ACCESS memory; it still needs
a lot of sequential-access memory (read: disk).

Ilya
 

Ilya Zakharevich

Median is MAJORLY tricky. (IIRC,) Contrary to a widespread belief,
there IS a linear-time, limited-memory algo (but it is MUCH more
delicate than the usual FFT/quicksort-like tricks).

Yes, I remembered correctly. ;-) *This* time I even managed to
reconstruct the algo...
I would be pretty sure that the algo is already implemented in any
reasonable toolset. Did you search CPAN?

In fact, this algo makes sense ONLY if you know nothing about the
distribution of the data. E.g., if you know that your data is
`double-precision', then there are only 2**64 buckets; on
10-year-old hardware it is not a problem to decimate into about
2**24 buckets.

So `double-precision' data is guaranteed to be decimated in 3 rounds,
and most probably would be in 2 (the second round being very quick).

Hope this helps,
Ilya
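
As a rough illustration of that decimation idea (my reconstruction,
simplified to equal-width buckets over the observed range rather than
the raw bit pattern of a double): two histogram rounds plus a final
in-memory sort of the surviving bucket. It assumes one number per
line in a hypothetical file 'data.txt', and that no single bucket
ends up holding more values than will comfortably fit in memory.

#!/usr/bin/perl
use strict;
use warnings;

my $file    = 'data.txt';
my $buckets = 2**20;      # modest count; Perl arrays are heavier than C

# Round 1a: count elements and find the overall range.
my ($n, $min, $max) = (0);
open my $fh, '<', $file or die "Cannot open $file: $!";
while (<$fh>) {
    chomp;
    $n++;
    $min = $_ if !defined $min || $_ < $min;
    $max = $_ if !defined $max || $_ > $max;
}
close $fh;
my $width = ($max - $min) / $buckets || 1;   # avoid zero width if all equal

# Round 1b: histogram the whole file into the buckets.
my @count = (0) x $buckets;
open $fh, '<', $file or die "Cannot open $file: $!";
while (<$fh>) {
    chomp;
    my $b = int(($_ - $min) / $width);
    $b = $buckets - 1 if $b >= $buckets;     # clamp $max itself
    $count[$b]++;
}
close $fh;

# Find the bucket that contains the median's rank.
my $k = int(($n - 1) / 2) + 1;               # 1-based rank of the (lower) median
my ($below, $target) = (0, 0);
for my $b (0 .. $buckets - 1) {
    if ($below + $count[$b] >= $k) { $target = $b; last }
    $below += $count[$b];
}

# Round 2: keep only the values that land in the target bucket,
# sort just those in memory, and index to the median.
my @survivors;
open $fh, '<', $file or die "Cannot open $file: $!";
while (<$fh>) {
    chomp;
    my $b = int(($_ - $min) / $width);
    $b = $buckets - 1 if $b >= $buckets;
    push @survivors, $_ if $b == $target;
}
close $fh;
@survivors = sort { $a <=> $b } @survivors;
print "median: ", $survivors[$k - $below - 1], "\n";

With a known, bounded representation (the point about doubles above),
a couple of rounds like this are enough; for adversarial data you
would repeat the histogram step on the surviving range instead of
assuming one round of decimation suffices.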
 

Xho Jingleheimerschmidt

I have a very big file (GBs).

I want to calculate the median of this large data set. Is there any
efficient way to calculate it?

I cannot read the file into an array as it will run out of memory
due to the large amount of data.

I'd start by using the system's file sort utility to sort the file, and
see if that is good enough.

Xho
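
For what it's worth, a small Perl wrapper around that route, assuming
a numeric, one-value-per-line file named 'data.txt': sort(1) does an
external merge sort on disk, so memory stays bounded, and the median
is just the middle line of the sorted output.

#!/usr/bin/perl
use strict;
use warnings;

# Let the external sort(1) utility do the heavy lifting on disk.
system('sort', '-n', '-o', 'sorted.txt', 'data.txt') == 0
    or die "sort failed: $?";

# Pass 1 over the sorted file: how many lines are there?
my $n = 0;
open my $fh, '<', 'sorted.txt' or die "Cannot open sorted.txt: $!";
$n++ while <$fh>;
close $fh;

# Pass 2: stop at the middle line (lower median for even $n).
my $target  = int(($n - 1) / 2) + 1;   # 1-based line number
my $line_no = 0;
open $fh, '<', 'sorted.txt' or die "Cannot open sorted.txt: $!";
while (my $line = <$fh>) {
    next unless ++$line_no == $target;
    chomp $line;
    print "median: $line\n";
    last;
}
close $fh;

GNU sort keeps its memory use bounded by spilling to temporary files
(see its -S and -T options), so this is often the least code that
works.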
 
