most elegant way to split text file randomly into n parts?

Markus Dehmann

What is the most elegant way to split a text file randomly into n
parts? If the original file contains m lines, the n new files should
each contain about m/n lines, chosen uniformly at random.

Leaving the actual handling of files aside, here is a naive solution:

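my $n = 10;
foreach my $line (0..9999){
    my $r = rand;    # uniform in [0, 1)
    # Walk the bucket boundaries from the top; bucket $i covers
    # the range from $i/$n up to (but not including) ($i+1)/$n.
    for(my $i=$n-1; $i>=0; --$i){
        if($r >= $i/$n){
            print "Line $line: Print to file $i\n";
            last;
        }
    }
}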

Are there other ways to do it? This seems like a very typical perl
problem, so it would be interesting to see what solutions others come
up with!

-Markus
 
Jürgen Exner

Markus said:
What is the most elegant way to split a text file randomly into n
parts? If the original file contains m lines, the n new files should
each contain about m/n lines, chosen uniformly at random.

perldoc -q shuffle

and then simply write each chunk of m/n lines to an individual file.
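
A minimal sketch of the shuffle-then-chunk idea, assuming the whole file fits in memory. The helper name split_shuffled is hypothetical, and the choice to give the first m % n chunks one extra line is an assumption, not something the thread specifies. It uses shuffle from List::Util, the approach the FAQ entry points to.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Hypothetical helper: shuffle the lines, then cut the shuffled list
# into $n consecutive chunks whose sizes differ by at most one.
sub split_shuffled {
    my ($n, @lines) = @_;
    my @shuffled = shuffle @lines;
    my $base  = int(@shuffled / $n);   # minimum chunk size
    my $extra = @shuffled % $n;        # the first $extra chunks get one more line
    my @chunks;
    for my $i (0 .. $n - 1) {
        my $size = $base + ($i < $extra ? 1 : 0);
        push @chunks, [ splice @shuffled, 0, $size ];
    }
    return @chunks;
}

# Demo: 10 lines into 3 chunks gives sizes 4, 3 and 3.
my @chunks = split_shuffled(3, map { "line $_\n" } 1 .. 10);
printf "chunk %d has %d lines\n", $_, scalar @{ $chunks[$_] } for 0 .. $#chunks;
```

Each chunk could then be printed to its own file handle. For files too large to hold in memory, a per-line random choice avoids the shuffle entirely, at the cost of only approximately equal sizes.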

jue
 
John W. Krahn

Markus said:
What is the most elegant way to split a text file randomly into n
parts? If the original file contains m lines, the n new files should
each contain about m/n lines, chosen uniformly at random.

Leaving the actual handling of files aside here is a naive solution:

my $n = 10;
foreach my $line (0..9999){
my $r = rand;
for(my $i=$n; $i>=0; --$i){
if($r > $i/$n){
print "Line $line: Print to file $i\n";
last;
}
}
}

Are there other ways to do it? This seems like a very typical perl
problem, so it would be interesting to see what solutions others come
up with!

This may do what you want, or maybe not. :)

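my $n = 10;
# FILE is assumed to be an already opened input handle.
while ( <FILE> ) {
    # int rand $n picks 0 .. $n-1 uniformly, so $i runs from 1 to $n.
    my $i = 1 + int rand $n;
    print "Line $.: Print to file $i\n";
}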



John
 
Michele Dondi

Jürgen said:
perldoc -q shuffle

and then simply write each chunk of m/n lines to an individual file.

The (possibly) interesting part is the "*about* m/n lines" bit. I
would generate n random numbers and normalize them suitably. Of course
he might want to narrow the spread of the starting numbers in the
first place, e.g.:

my @nums = map 1+$alpha*rand, 1..$n;


Michele
 
Ted Zlatanov

MD> What is the most elegant way to split a text file randomly into n
MD> parts? If the original file contains m lines, the n new files should
MD> each contain about m/n lines, chosen uniformly at random.

MD> Leaving the actual handling of files aside here is a naive solution:

MD> my $n = 10;
MD> foreach my $line (0..9999){
MD> my $r = rand;
MD> for(my $i=$n; $i>=0; --$i){
MD> if($r > $i/$n){
MD> print "Line $line: Print to file $i\n";
MD> last;
MD> }
MD> }
MD> }

MD> Are there other ways to do it? This seems like a very typical perl
MD> problem, so it would be interesting to see what solutions others come
MD> up with!

Usually programmers don't like the word "about" in a specification, so
your requirements should be refined a bit.

Essentially you're trying to partition a set S of size M into N subsets
(S[0]...S[N-1]). You should define what to do with the leftover (M%N)
elements of S: assign them randomly or according to some specific rule?

Also, do you want to guarantee a minimum number of elements in each
subset?

If you assign leftovers randomly and don't require minimums, you're just
picking a random number between 1 and N for any input you're given:

while (<>)
{
    printf "Line %d goes to set %d\n", $. , rand(100)+1;
}

If you're not happy with the built-in random number generator you could
use something else instead of rand(). Note that rand(100) yields a
number from 0 up to (but not including) 100; after adding 1, %d
truncates it to an integer between 1 and 100, so N is 100 here.
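
To make the per-line random choice concrete, here is a rough sketch that actually routes lines to N output handles. The helper name scatter_lines is hypothetical, and the demo uses in-memory file handles purely so that it runs without touching the disk; with real files you would open one handle per part instead.

```perl
use strict;
use warnings;

# Hypothetical helper: send each input line to one of the given
# output handles, chosen uniformly at random.
sub scatter_lines {
    my ($in, @out) = @_;
    while (my $line = <$in>) {
        # rand @out is in [0, number of handles); int truncates it to an index.
        print { $out[ int rand @out ] } $line;
    }
}

# Demo with in-memory handles standing in for real files.
my $text = join '', map { "line $_\n" } 1 .. 20;
open my $in, '<', \$text or die "Cannot open in-memory input: $!";
my @buf = ('') x 3;
my @out = map { open my $fh, '>', \$buf[$_] or die "Cannot open buffer: $!"; $fh } 0 .. 2;
scatter_lines($in, @out);
close $_ for @out;
printf "part %d got %d lines\n", $_ + 1, ($buf[$_] =~ tr/\n//) for 0 .. 2;
```

This reads the input only once and never holds more than one line in memory, but the part sizes are only equal on average, which is one reading of "about m/n lines".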

Ted
 
Ted Zlatanov

MD> On Fri, 14 Dec 2007 05:27:48 GMT, "Jürgen Exner"

MD> The (possibly) interesting part is that about "*about* m/n lines". I
MD> would generate n random numbers and normalize them suitably. Of course
MD> he could like to narrow the standard deviation of the starting numbers
MD> in the first place e.g.:

MD> my @nums = map 1+$alpha*rand, 1..$n;

I think Perl's rand() is uniformly distributed so this should be OK
without the extra normalization, but as I mentioned in my other reply
the word "about" is too imprecise to bother trying to guess what the OP
really meant.

Ted
 
Michele Dondi

MD> my @nums = map 1+$alpha*rand, 1..$n;

Ted said:
I think Perl's rand() is uniformly distributed so this should be OK
without the extra normalization, but as I mentioned in my other reply
the word "about" is too imprecise to bother trying to guess what the OP
really meant.

By "normalize" I meant scaling @nums so that their sum equals the
total number of lines.
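
One possible way to do that scaling, as a sketch only. The helper name random_sizes is hypothetical, $alpha is the same assumed tuning knob as in the one-liner, and rounding the cumulative boundaries (rather than each size separately) is just one way to make the integer sizes add up exactly.

```perl
use strict;
use warnings;

# Hypothetical helper: turn $n random weights into $n integer part
# sizes that sum exactly to $m. $alpha controls the spread of the
# weights; 0 makes all weights equal.
sub random_sizes {
    my ($m, $n, $alpha) = @_;
    my @nums = map 1 + $alpha * rand(), 1 .. $n;
    my $sum = 0;
    $sum += $_ for @nums;

    # Round the cumulative boundaries instead of the individual sizes,
    # so rounding errors cannot accumulate and the total is exactly $m.
    my ($acc, $prev, @sizes) = (0, 0);
    for my $w (@nums) {
        $acc += $w;
        my $bound = int($m * ($acc / $sum) + 0.5);
        push @sizes, $bound - $prev;
        $prev = $bound;
    }
    return @sizes;
}

my @sizes = random_sizes(10_000, 10, 0.2);
print "sizes: @sizes\n";
```

The resulting sizes could then be used to cut a shuffled list of lines into parts of "about" m/n lines each, with a smaller $alpha keeping the parts closer to equal.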


Michele
 
