most elegant way to split text file randomly into n parts?

Markus Dehmann · Dec 14, 2007

What is the most elegant way to split a text file randomly into n
parts? If the original file contains m lines, the n new files should
each contain about m/n lines, chosen uniformly at random.

Leaving the actual handling of files aside, here is a naive solution:

my $n = 10;
foreach my $line (0..9999){
    my $r = rand;
    for(my $i=$n; $i>=0; --$i){
        if($r > $i/$n){
            print "Line $line: Print to file $i\n";
            last;
        }
    }
}

Are there other ways to do it? This seems like a very typical perl
problem, so it would be interesting to see what solutions others come
up with!

-Markus

Jürgen Exner · Dec 14, 2007

Markus said:
> What is the most elegant way to split a text file randomly into n
> parts? If the original file contains m lines, the n new files should
> each contain about m/n lines, chosen uniformly at random.

perldoc -q shuffle

and then simply write each chunk of m/n lines to an individual file.
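
A minimal sketch of that shuffle-then-chunk idea, assuming the whole file fits in memory and using List::Util's shuffle. The helper name split_random and the generated sample lines are made up for illustration; writing each chunk to its own file is left out, as in the original question:

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Shuffle the lines, then cut the shuffled list into $n consecutive chunks.
# split_random() is a hypothetical helper name, not something from the FAQ.
sub split_random {
    my ($n, @lines) = @_;
    my @shuffled = shuffle @lines;
    my $m        = scalar @shuffled;
    my $base     = int($m / $n);    # every chunk gets at least this many lines
    my $extra    = $m % $n;         # the first $extra chunks get one more
    my @chunks;
    for my $i (0 .. $n - 1) {
        my $size = $base + ($i < $extra ? 1 : 0);
        push @chunks, [ splice @shuffled, 0, $size ];
    }
    return @chunks;
}

my @lines  = map { sprintf "line %d\n", $_ } 1 .. 23;   # stand-in for file contents
my @chunks = split_random(5, @lines);
printf "chunk %d: %d lines\n", $_, scalar @{ $chunks[$_] } for 0 .. $#chunks;
```

Because the shuffle is uniform, each line is equally likely to end up in any chunk, and the chunk sizes differ by at most one.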

jue

John W. Krahn · Dec 14, 2007

Markus said:
> What is the most elegant way to split a text file randomly into n
> parts? If the original file contains m lines, the n new files should
> each contain about m/n lines, chosen uniformly at random.
>
> Leaving the actual handling of files aside, here is a naive solution:
>
> my $n = 10;
> foreach my $line (0..9999){
>     my $r = rand;
>     for(my $i=$n; $i>=0; --$i){
>         if($r > $i/$n){
>             print "Line $line: Print to file $i\n";
>             last;
>         }
>     }
> }
>
> Are there other ways to do it? This seems like a very typical perl
> problem, so it would be interesting to see what solutions others come
> up with!

This may do what you want, or maybe not.

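my $n = 10;
while ( <FILE> ) {
    # % truncates both operands to integers, so the divisor is in 1..$n
    # and $i lands in 1..$n, though not uniformly: low numbers are favored.
    my $i = 1 + $. % ( 1 + rand $n );
    print "Line $.: Print to file $i\n";
}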

John

Michele Dondi · Dec 14, 2007

Jürgen said:
> perldoc -q shuffle
>
> and then simply write each chunk of m/n lines to an individual file.

The (possibly) interesting part is the "*about* m/n lines"
requirement. I would generate n random numbers and normalize them
suitably. Of course, he might want to narrow the standard deviation of
the starting numbers in the first place, e.g.:

my @nums = map 1+$alpha*rand, 1..$n;

Michele

Ted Zlatanov · Dec 14, 2007

MD> What is the most elegant way to split a text file randomly into n
MD> parts? If the original file contains m lines, the n new files should
MD> each contain about m/n lines, chosen uniformly at random.

MD> Leaving the actual handling of files aside, here is a naive solution:

MD> my $n = 10;
MD> foreach my $line (0..9999){
MD>     my $r = rand;
MD>     for(my $i=$n; $i>=0; --$i){
MD>         if($r > $i/$n){
MD>             print "Line $line: Print to file $i\n";
MD>             last;
MD>         }
MD>     }
MD> }

MD> Are there other ways to do it? This seems like a very typical perl
MD> problem, so it would be interesting to see what solutions others come
MD> up with!

Usually programmers don't like the word "about" in a specification, so
your requirements should be refined a bit.

Essentially you're trying to partition a set S of size M into N subsets
(S[0]...S[N-1]). You should define what to do with the leftover (M%N)
elements of S: assign them randomly or according to some specific rule?

Also, do you want to guarantee a minimum number of elements in each
subset?
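
To make the leftover question concrete, here is a rough sketch of one possible rule: every subset gets int(M/N) elements, and the M%N leftovers go to randomly chosen subsets, one each. The name subset_sizes and the example values 23 and 5 are only illustrative assumptions:

```perl
use strict;
use warnings;
use List::Util qw(shuffle sum);

# Sizes for $n subsets of $m elements: every subset gets int($m / $n),
# and each of the $m % $n leftover elements goes to a different,
# randomly chosen subset. subset_sizes() is a hypothetical helper name.
sub subset_sizes {
    my ($m, $n) = @_;
    my @sizes = (int($m / $n)) x $n;
    # Pick $m % $n distinct subset indices at random (none if it divides evenly).
    my @lucky = (shuffle 0 .. $n - 1)[0 .. ($m % $n) - 1];
    $sizes[$_]++ for @lucky;
    return @sizes;
}

my @sizes = subset_sizes(23, 5);
print "sizes: @sizes (total ", sum(@sizes), ")\n";
```

Under this rule the minimum size of int(M/N) is guaranteed, and no subset is more than one element larger than another.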

If you assign leftovers randomly and don't require minimums, you're just
picking a random number between 1 and N for any input you're given
(N is 100 in this example):

while (<>)
{
    printf "Line %d goes to set %d\n", $. , rand(100)+1;
}

If you're not happy with the built-in random number generator you could
use something else instead of rand(). Note how I use rand(100) to get
random numbers from 0 up to (but not including) 100; %d truncates the
result to an integer, so adding 1 gives set numbers between 1 and 100.
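
A small sketch of the same per-line assignment with an explicit N, tallying how many lines each set receives so the "about m/n" behaviour can be seen. The values 10 and 10_000 are arbitrary example values taken from the original question, and the loop stands in for reading real input:

```perl
use strict;
use warnings;

my $n = 10;                      # number of sets (example value)
my %count;                       # set number => how many lines it received
for my $line (1 .. 10_000) {     # stand-in for reading lines with while (<>)
    my $set = 1 + int rand $n;   # uniform integer in 1..$n
    $count{$set}++;
}
printf "set %2d: %d lines\n", $_, $count{$_} for sort { $a <=> $b } keys %count;
```

Each set ends up with roughly 1000 lines, but the exact counts vary from run to run, and nothing stops a set from staying empty when the input is small.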

Ted

Ted Zlatanov · Dec 14, 2007

MD> On Fri, 14 Dec 2007 05:27:48 GMT, "Jürgen Exner"

MD> The (possibly) interesting part is the "*about* m/n lines"
MD> requirement. I would generate n random numbers and normalize them
MD> suitably. Of course, he might want to narrow the standard deviation of
MD> the starting numbers in the first place, e.g.:

MD> my @nums = map 1+$alpha*rand, 1..$n;

I think Perl's rand() is uniformly distributed so this should be OK
without the extra normalization, but as I mentioned in my other reply
the word "about" is too imprecise to bother trying to guess what the OP
really meant.

Ted

Michele Dondi · Dec 14, 2007

Ted said:
> MD> my @nums = map 1+$alpha*rand, 1..$n;
>
> I think Perl's rand() is uniformly distributed so this should be OK
> without the extra normalization, but as I mentioned in my other reply
> the word "about" is too imprecise to bother trying to guess what the OP
> really meant.

I meant to "normalize" in the sense of making the sum of @nums equal
to the total number of lines.
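
Roughly, as a sketch: scale the random numbers so they add up to the line count, round down, and hand out whatever the rounding lost. The name normalized_sizes, the rule of giving the remainder to the first few chunks, and the example values are all just assumptions for illustration:

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Turn $n random weights into integer chunk sizes that sum to exactly $m.
# normalized_sizes() is a hypothetical helper name; $alpha controls the
# spread of the weights (0 gives equal weights, larger gives more variation).
sub normalized_sizes {
    my ($m, $n, $alpha) = @_;
    my @nums  = map { 1 + $alpha * rand() } 1 .. $n;
    my $total = sum @nums;
    # Scale each weight to its share of $m lines, rounding down.
    my @sizes = map { int($m * $_ / $total) } @nums;
    # Rounding down loses fewer than $n lines; give one each to the first chunks.
    my $short = $m - sum @sizes;
    $sizes[$_]++ for 0 .. $short - 1;
    return @sizes;
}

my @sizes = normalized_sizes(10_000, 10, 0.5);
print "chunk sizes: @sizes\n";
```

The sizes can then be used to cut a shuffled list of lines into chunks of those lengths.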

Michele
