Using hashes to sort number sequences

M

Martin Foster

Hi.

A few months ago, I posted to comp.lang.perl for help with a script
that
looks at number sequences. I'm revisiting the problem but with some
extra
data.

I have two files: a.txt & b.txt

a.txt=
191_6_270328 T1 4 10 19 34 55 72 88 116 157 200 280 332 388 451 756 4
0 5 0 4 0 6 2 6 2 8 0
191_6_270328 T2 4 9 17 22 34 56 83 112 146 181 266 320 376 431 665 3 0
5 0 4 0 6 2 6 0 22 2
191_6_270328 T3 4 10 17 23 35 56 83 115 149 188 274 324 381 437 681 4
0 5 0 4 0 6 0 6 0 6 2
191_6_270328 T4 4 12 24 35 49 68 92 123 157 196 288 347 409 464 761 5
0 8 0 5 0 8 0 6 0 18 8
191_6_270328 T5 4 10 19 32 44 57 83 118 158 197 281 331 380 445 723 4
0 5 0 4 0 5 0 6 0 6 2
191_6_270328 T6 4 9 14 18 26 48 83 114 142 178 260 312 375 434 637 3 0
4 0 6 0 6 2 6 0 6 2191_6_270330 T1 4 10 20 38 61 82 110 149 187 228
357 408 465 552 890 4 0 5 0 4 0 6 2 6 0 8 0
191_6_270330 T2 4 9 19 31 47 71 97 121 166 222 331 410 491 559 788 3 0
5 0 4 0 6 0 8 0 8 2
191_6_270330 T3 4 10 18 28 45 67 93 125 161 210 337 404 470 541 762 4
0 5 0 4 0 6 0 6 0 6 0
191_6_270330 T4 4 12 24 35 53 82 114 149 189 227 335 419 490 546 890 5
0 8 2 5 0 8 2 6 0 24 0
191_6_270330 T5 4 10 19 34 48 65 95 136 180 218 332 397 455 536 810 4
0 5 0 4 0 5 0 6 0 6 2
191_6_270330 T6 4 10 21 36 51 67 95 139 186 233 360 441 523 596 843 3
0 6 0 6 0 8 0 6 0 8 0
191_6_270334 T1 4 10 19 33 54 76 101 137 178 219 336 406 462 529 832 4
0 5 0 4 0 6 0 6 2 8 0




b.txt=
191_6_9908682 T1 4 8 14 25 41 60 83 115 153 190 276 321 374 437 694 4
0 4 0 4 0 6 0 4 0 8 0
191_6_9908682 T2 4 10 19 30 44 64 92 122 155 198 285 338 394 446 739 4
0 5 0 4 0 6 0 8 0 8 2
191_6_9908682 T3 4 10 20 33 51 69 88 123 164 199 295 341 398 465 762 4
0 5 0 4 0 6 0 7 0 18 0
191_6_9908682 T4 4 10 20 36 56 79 104 130 158 190 285 339 401 473 788
4 0 6 0 4 0 6 0 7 0 12 0
191_6_9908682 T5 4 9 18 33 51 68 89 118 153 195 280 334 387 448 739 4
0 5 0 4 0 7 0 4 0 7 0
191_6_9908682 T6 4 9 19 33 54 76 98 126 159 198 279 330 393 463 777 4
0 6 0 4 0 7 0 4 0 7 0
191_6_9908690 T1 4 8 14 25 41 61 87 119 153 189 275 331 393 452 702 4
0 4 0 4 0 6 0 4 0 8 0
191_6_9908690 T2 4 10 19 31 49 73 101 131 162 197 293 349 409 472 778
4 0 5 0 4 0 6 0 8 0 8 2
191_6_9908690 T3 4 10 21 36 55 77 98 126 163 201 291 344 413 482 792 4
0 5 0 4 0 6 0 7 0 18 0
191_6_9908690 T4 4 10 19 32 50 74 98 122 151 193 303 358 421 492 754 4
0 6 2 4 0 6 2 6 0 7 0
191_6_9908690 T5 4 10 20 36 55 72 94 122 152 194 290 347 404 471 760 4
0 7 0 4 0 7 0 5 0 6 2
191_6_9908690 T6 4 10 22 36 52 74 100 126 158 201 297 363 429 488 784
4 0 6 0 4 0 6 0 6 0 10 2


Each file contains in the first column an identifier, I call it $name.
The 2nd column contains an entry T1 or T2 or T3 ... until T6.
After these two columns each row contains a number sequence.

What I would like to do is to read file a.txt, six lines at a time
(from T1 to T6)
and search for similar number sequences in file b.txt.
The number sequences in file b.txt must also be within each block of
six lines,
but they can be in any order.

my script looks like this so far:

#!/usr/bin/perl
# Perl script to compare to files with T6 CS & VS
use strict;
use warnings;

my $infile1 = "a.txt";
open INFILE1, $infile1 or die "Shit! Couldn't open file
$infile1: $!\n";
my $infile2 = "b.txt";
open INFILE2, $infile2 or die "Shit! Couldn't open file
$infile2: $!\n";

do {

my %a_list;
# six lines at a time
for(0 .. 5){
$_ = <INFILE1>;
my ( $name, $nums ) = /^(\S+\s\S+)\s(.*)/ or
die;
push @{$a_list{$nums}}, $name;
}

# DEBUG : print out contents of hash
while ( my ($key, $value) = each(%a_list) ) {
print "$value->[0] $key\n";
}

# now check this block of sequences in file b.txt
do {
# first make a copy of a_list, to which you
feed the six lines of b.txt
my %b_list = %a_list;
for(0 .. 5){
$_ = <INFILE2>;
my ( $name, $nums ) =
/^(\S+\s+\S+)\s+(.*)/;
push @{$b_list{$nums}}, $name;
}
# OK, quick DEBUG print new hash b_list
print "\n\nb_list hash:\n";
while ( my ($key, $value) = each(%b_list) ) {
print "$value->[0] $key\n";
}

# Now check for the similar keys
print "\n\nmatches in b_list\n";
for ( values %b_list){
print "\t$_->[0] ",scalar(@$_),"\n";
}

} while (<INFILE2>);
} while (<INFILE1>);


Unfortunately, I'm stuck now. I can't get the script to keep running
the inner loop (b_list) for each "block" of a_list ( 6 lines of
a.txt). I come to the
end of file b.txt and get errors such as:

Use of uninitialized value in hash element at compare_files.plx line
33, <INFILE2> line 37.

Could anyone please help me?

Also, the files a & b are in fact huge, with 100,000s of 6 line
blocks. If
anyone has any suggestions for a faster method that would be awesome.
I've
tried coding it in C, but after finding out about the hash/keys
feature in Perl,
which is fantastic for this stuff, I think Perl is the way to go.

Many thanks in advance,
Martin.
 
B

Bob Walton

Martin Foster wrote:

....
I have two files: a.txt & b.txt

a.txt=
191_6_270328 T1 4 10 19 34 55 72 88 116 157 200 280 332 388 451 756 4
0 5 0 4 0 6 2 6 2 8 0
191_6_270328 T2 4 9 17 22 34 56 83 112 146 181 266 320 376 431 665 3 0 ....
b.txt=
191_6_9908682 T1 4 8 14 25 41 60 83 115 153 190 276 321 374 437 694 4
0 4 0 4 0 6 0 4 0 8 0
191_6_9908682 T2 4 10 19 30 44 64 92 122 155 198 285 338 394 446 739 4
0 5 0 4 0 6 0 8 0 8 2 ....


Each file contains in the first column an identifier, I call it $name.
The 2nd column contains an entry T1 or T2 or T3 ... until T6.
After these two columns each row contains a number sequence.

What I would like to do is to read file a.txt, six lines at a time
(from T1 to T6)
and search for similar number sequences in file b.txt.
The number sequences in file b.txt must also be within each block of
six lines,
but they can be in any order.


Why don't you just sort (using the Unix or maybe even the Win32 sort
command) the two files, and then, using Perl, read and compare from the
two sorted files? Or maybe the -u switch on Unix's sort could give you
what you want in one go. Or maybe (if the data for matching lines is
all the same), after the sorts, use diff to do the compare, and just
process the output of diff with Perl? Or if there is something in the
data which indicates if it from INFILE1 versus INFILE2, the files could
be concatenated, sorted, and processed as one file (I don't think that
last method would have any advantages).

That sort (punny, huh?) of method will avoid reading your $infile2 many
hundreds of thousands of times, which will take almost forever.

BTW, you would need to either close and reopen the file in your inner
loop, or seek() it back to the beginning every time you go though the
outer loop. Also recognize that your while(<INFILE1>) and
while(<INFILE2>) constructions will read a record from the corresponding
file and place it into $_. You are discarding that data, so you are
really reading data 7 records at a time, discarding the first of each
chunk of 7.

HTH.


....
 
G

gnari

[ comparing multi-length records in 2 huge files]

you definitely want to sort the files.
then you only have to make one pass through each,
only keeping 2 records in memory at the same time.

if sorting is not an option, maybe you could make
one prelimininary pass through each file, storing
the file offset for the start of each record in
a hash keyed to the record identifier

gnari
 
M

Martin Foster

Bob Walton said:
Martin Foster wrote:

...


Why don't you just sort (using the Unix or maybe even the Win32 sort
command) the two files, and then, using Perl, read and compare from the
two sorted files? Or maybe the -u switch on Unix's sort could give you
what you want in one go. Or maybe (if the data for matching lines is
all the same), after the sorts, use diff to do the compare, and just
process the output of diff with Perl? Or if there is something in the
data which indicates if it from INFILE1 versus INFILE2, the files could
be concatenated, sorted, and processed as one file (I don't think that
last method would have any advantages).

I may need to tell you a little more about the data, I'm not sure a sort
would help me but maybe you have an idea.

Each $name tag is the name of a crystal structure. Each T1, T2, etc describes
an atom. For each structure there are six atoms. To identify if two crystal
structures are the same, one can compare the coordination sequences ( the number
sequences that follow the T1, T2, etc). For each structure all six sequences,
must completely match another six sequences of another structure, but they can
be in any order, ie T1, T2s may be called T3, T6 or whatever. The important
part is that each structure has six lines, which is why I want to read
them in separately. If I do a sort I will get matching lines of sequences
grouped together. For some structures, only one or two lines will match the
original structure and I will have to do careful counting throughout the
output to get what I want.
That sort (punny, huh?) of method will avoid reading your $infile2 many
hundreds of thousands of times, which will take almost forever.
Oh know, I hope not! My first attempt was to do this directly from the
MySQL database. I retrieved the data with queries for each structure.
That did take forever! So now I've put all the data into files.

I may need to rewrite the script to not reopen the b file again and again.
Maybe by passing in all the data to arrays and then shifting six lines of the
array into the hashes. I've got 512mb memory, how big can the arrays be?
I've got 29 columns and the a & b files have ~127,000 rows. I'm not sure.
BTW, you would need to either close and reopen the file in your inner
loop, or seek() it back to the beginning every time you go though the
outer loop.
I will try this. Thanks.
Also recognize that your while(<INFILE1>) and
while(<INFILE2>) constructions will read a record from the corresponding
file and place it into $_. You are discarding that data, so you are
really reading data 7 records at a time, discarding the first of each
chunk of 7.

I'm not quite sure what you mean by this. Do you suggest to use another
variable for $_ in the inner loop?
 
B

Ben Morrow

Quoth (e-mail address removed) (Martin Foster):
Each $name tag is the name of a crystal structure. Each T1, T2, etc describes
an atom. For each structure there are six atoms. To identify if two crystal
structures are the same, one can compare the coordination sequences ( the number
sequences that follow the T1, T2, etc). For each structure all six sequences,
must completely match another six sequences of another structure, but they can
be in any order, ie T1, T2s may be called T3, T6 or whatever. The important
part is that each structure has six lines, which is why I want to read
them in separately. If I do a sort I will get matching lines of sequences
grouped together. For some structures, only one or two lines will match the
original structure and I will have to do careful counting throughout the
output to get what I want.

Oh know, I hope not! My first attempt was to do this directly from the
MySQL database. I retrieved the data with queries for each structure.
That did take forever! So now I've put all the data into files.

I may need to rewrite the script to not reopen the b file again and again.
Maybe by passing in all the data to arrays and then shifting six lines of the
array into the hashes. I've got 512mb memory, how big can the arrays be?
I've got 29 columns and the a & b files have ~127,000 rows. I'm not sure.

Try something like this (untested):

#!/usr/bin/perl -l

use strict;
use warnings;
use Symbol;

# the (*) means that the sub takes one parameter, which will be interpreted
# as a filehandle
sub read_crystal (*) {
# this will find the correct FH in case this sub is called
# from a different package (string and bareword FH names are
# relative to the current package, with exceptions for STDIN etc.)
my $FH = Symbol::qualify_to_ref $_[0], caller;

my (@atoms, $crystal);
for (1..6) {
my $line = <$FH>;
defined $line or die "read error: $!";
$line or return;

my ($tmp, undef, $atom) = split ' ', $line, 3;
$crystal ||= $tmp;
$crystal eq $tmp or die "bad file format: $tmp ne $crystal";
push @atoms, $atom;
}

# this creates a canonical representation of a given crystal
# (i.e. it will be the same regardless of the order the atoms
# were given in the file)
my $canon = join '|', sort @atoms;

return ($crystal, $canon);
}

my %crystals;
{
# lexical FHs like this are closed automatically when they
# go out of scope
open my $B, '<', 'b.txt' or die "can't open b.txt: $!";

while (my ($crystal, $canon) = read_crystal $B) {
$crystals{$canon} = $crystal;
}
}
{
open my $A, '<', 'a.txt' or die "can't open a.txt: $!";

while (my ($crystal, $canon) = read_crystal $A) {
if ($crystals{$canon}) {
print "$crystal is the same as $crystals{$canon}";
}
}
}

__END__

If b.txt is too large, and you do run out of memory (or it is
unacceptably slow), you can speed things up by tying %crystals to a db
file:

use DB_File;

tie my %crystals, DB_File => 'b.db' or die "can't create b.db: $!";

This will also allow you to create the db from b.txt once and then check
several different files against it.

Ben
 
A

Anno Siegel

Martin Foster said:
Bob Walton said:
Martin Foster wrote:

...

[...]
Why don't you just sort (using the Unix or maybe even the Win32 sort
[...]

I may need to tell you a little more about the data, I'm not sure a sort
would help me but maybe you have an idea.

Each $name tag is the name of a crystal structure. Each T1, T2, etc describes
an atom. For each structure there are six atoms. To identify if two crystal
structures are the same, one can compare the coordination sequences ( the number
sequences that follow the T1, T2, etc). For each structure all six sequences,
must completely match another six sequences of another structure, but they can
be in any order, ie T1, T2s may be called T3, T6 or whatever. The important
part is that each structure has six lines, which is why I want to read
them in separately. If I do a sort I will get matching lines of sequences
grouped together. For some structures, only one or two lines will match the
original structure and I will have to do careful counting throughout the
output to get what I want.

If I get that right, there is a set of atoms (represented by sequences
of numbers), and a crystal (structure) is a sequence of six atoms. The
problem is to find the sequences that are permutations of each other.
If I got that entirely wrong, you can stop reading now.

Otherwise, the straightforward solution involves indeed sorting, but
not of the file as a whole, but of each set of six atoms. After sorting,
two permutations of the same atoms are equal (no matter how you sort).
This reduces the problem to finding the elements in a list that are
the same. Perl's standard solutions (involving a hash) apply.

In the actual case it may pay to re-encode the atoms with shorter
strings, which would save storage and might reduce sort time. I'm
not sure about the effect of key length on Perl's string sort. Uri?
How many different atoms are there? If they represent actual chemical
elements there can't be too many.

Before I go on further I'd like some feedback if this sounds plausible
at all.

Anno
 
A

Anno Siegel

Ben Morrow said:
Quoth (e-mail address removed) (Martin Foster):

[snip problem to get to the code]
Try something like this (untested):

#!/usr/bin/perl -l

use strict;
use warnings;
use Symbol;

# the (*) means that the sub takes one parameter, which will be interpreted
# as a filehandle
sub read_crystal (*) {
# this will find the correct FH in case this sub is called
# from a different package (string and bareword FH names are
# relative to the current package, with exceptions for STDIN etc.)
my $FH = Symbol::qualify_to_ref $_[0], caller;

my (@atoms, $crystal);
for (1..6) {
my $line = <$FH>;
defined $line or die "read error: $!";
$line or return;

This test won't work unless $line is chomp'ed. I'd append "...or return"
to the "split" line that comes next.
my ($tmp, undef, $atom) = split ' ', $line, 3;

[snip rest, nothing to improve there]

That's the algorithm I also recommended in another post to this thread,
presented in beautiful Perl. That program was a good read.

Anno
 
B

Ben Morrow

Quoth (e-mail address removed)-berlin.de (Anno Siegel):
This test won't work unless $line is chomp'ed. I'd append "...or return"
to the "split" line that comes next.

Arrgh, no, there's a more important bug... I was assuming <> returned
undef only on error, and false-but-defined on EOF. So, what I meant was:

undef $!;
my $line = <$FH>;
$! and die "read error: $!";
$line or return;

Ben
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,995
Messages
2,570,230
Members
46,819
Latest member
masterdaster

Latest Threads

Top