Martin Foster
Hi.
A few months ago, I posted to comp.lang.perl for help with a script that
looks at number sequences. I'm revisiting the problem, but with some extra
data.
I have two files: a.txt & b.txt
a.txt=
191_6_270328 T1 4 10 19 34 55 72 88 116 157 200 280 332 388 451 756 4
0 5 0 4 0 6 2 6 2 8 0
191_6_270328 T2 4 9 17 22 34 56 83 112 146 181 266 320 376 431 665 3 0
5 0 4 0 6 2 6 0 22 2
191_6_270328 T3 4 10 17 23 35 56 83 115 149 188 274 324 381 437 681 4
0 5 0 4 0 6 0 6 0 6 2
191_6_270328 T4 4 12 24 35 49 68 92 123 157 196 288 347 409 464 761 5
0 8 0 5 0 8 0 6 0 18 8
191_6_270328 T5 4 10 19 32 44 57 83 118 158 197 281 331 380 445 723 4
0 5 0 4 0 5 0 6 0 6 2
191_6_270328 T6 4 9 14 18 26 48 83 114 142 178 260 312 375 434 637 3 0
4 0 6 0 6 2 6 0 6 2
191_6_270330 T1 4 10 20 38 61 82 110 149 187 228 357 408 465 552 890 4
0 5 0 4 0 6 2 6 0 8 0
191_6_270330 T2 4 9 19 31 47 71 97 121 166 222 331 410 491 559 788 3 0
5 0 4 0 6 0 8 0 8 2
191_6_270330 T3 4 10 18 28 45 67 93 125 161 210 337 404 470 541 762 4
0 5 0 4 0 6 0 6 0 6 0
191_6_270330 T4 4 12 24 35 53 82 114 149 189 227 335 419 490 546 890 5
0 8 2 5 0 8 2 6 0 24 0
191_6_270330 T5 4 10 19 34 48 65 95 136 180 218 332 397 455 536 810 4
0 5 0 4 0 5 0 6 0 6 2
191_6_270330 T6 4 10 21 36 51 67 95 139 186 233 360 441 523 596 843 3
0 6 0 6 0 8 0 6 0 8 0
191_6_270334 T1 4 10 19 33 54 76 101 137 178 219 336 406 462 529 832 4
0 5 0 4 0 6 0 6 2 8 0
b.txt=
191_6_9908682 T1 4 8 14 25 41 60 83 115 153 190 276 321 374 437 694 4
0 4 0 4 0 6 0 4 0 8 0
191_6_9908682 T2 4 10 19 30 44 64 92 122 155 198 285 338 394 446 739 4
0 5 0 4 0 6 0 8 0 8 2
191_6_9908682 T3 4 10 20 33 51 69 88 123 164 199 295 341 398 465 762 4
0 5 0 4 0 6 0 7 0 18 0
191_6_9908682 T4 4 10 20 36 56 79 104 130 158 190 285 339 401 473 788
4 0 6 0 4 0 6 0 7 0 12 0
191_6_9908682 T5 4 9 18 33 51 68 89 118 153 195 280 334 387 448 739 4
0 5 0 4 0 7 0 4 0 7 0
191_6_9908682 T6 4 9 19 33 54 76 98 126 159 198 279 330 393 463 777 4
0 6 0 4 0 7 0 4 0 7 0
191_6_9908690 T1 4 8 14 25 41 61 87 119 153 189 275 331 393 452 702 4
0 4 0 4 0 6 0 4 0 8 0
191_6_9908690 T2 4 10 19 31 49 73 101 131 162 197 293 349 409 472 778
4 0 5 0 4 0 6 0 8 0 8 2
191_6_9908690 T3 4 10 21 36 55 77 98 126 163 201 291 344 413 482 792 4
0 5 0 4 0 6 0 7 0 18 0
191_6_9908690 T4 4 10 19 32 50 74 98 122 151 193 303 358 421 492 754 4
0 6 2 4 0 6 2 6 0 7 0
191_6_9908690 T5 4 10 20 36 55 72 94 122 152 194 290 347 404 471 760 4
0 7 0 4 0 7 0 5 0 6 2
191_6_9908690 T6 4 10 22 36 52 74 100 126 158 201 297 363 429 488 784
4 0 6 0 4 0 6 0 6 0 10 2
Each file contains an identifier in the first column, which I call $name.
The second column contains an entry from T1 to T6.
After these two columns, each row contains a number sequence.
What I would like to do is read a.txt six lines at a time (from T1 to T6)
and search for similar number sequences in b.txt.
The matching sequences in b.txt must also fall within a single block of
six lines, but they can be in any order.
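To make the intended matching concrete, here is a minimal sketch. It assumes that "similar" means the sequence text is identical, and that each record sits on one physical line. The helper name read_block, the in-memory sample data, and the block size of 2 (instead of 6) are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read up to $size records from a filehandle and
# return a hash ref mapping each number sequence to the names carrying it.
# Returns undef once the filehandle has nothing left.
sub read_block {
    my ($fh, $size) = @_;
    my %block;
    for (1 .. $size) {
        my $line = <$fh>;
        last unless defined $line;    # stop cleanly at end of file
        chomp $line;
        my ($name, $nums) = $line =~ /^(\S+\s+\S+)\s+(.*)/ or next;
        push @{ $block{$nums} }, $name;
    }
    return %block ? \%block : undef;
}

# Tiny in-memory stand-ins for a.txt and b.txt (made-up values).
my $a_data = "idA T1 1 2 3\nidA T2 4 5 6\n";
my $b_data = "idB T1 4 5 6\nidB T2 7 8 9\n";
open my $fh_a, '<', \$a_data or die "Cannot open in-memory a: $!";
open my $fh_b, '<', \$b_data or die "Cannot open in-memory b: $!";

my $block_a = read_block($fh_a, 2);
my $block_b = read_block($fh_b, 2);

# Sequences present in both blocks, regardless of their order.
my @shared = grep { exists $block_b->{$_} } sort keys %$block_a;
print "shared: $_ (@{ $block_a->{$_} } / @{ $block_b->{$_} })\n" for @shared;
```

With this sample data, only the sequence "4 5 6" is shared, carried by "idA T2" in one block and "idB T1" in the other.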
My script looks like this so far:
#!/usr/bin/perl
# Perl script to compare two files with T6 CS & VS
use strict;
use warnings;
my $infile1 = "a.txt";
open INFILE1, $infile1 or die "Shit! Couldn't open file $infile1: $!\n";
my $infile2 = "b.txt";
open INFILE2, $infile2 or die "Shit! Couldn't open file $infile2: $!\n";
do {
    my %a_list;
    # six lines at a time
    for (0 .. 5) {
        $_ = <INFILE1>;
        my ( $name, $nums ) = /^(\S+\s\S+)\s(.*)/ or die;
        push @{$a_list{$nums}}, $name;
    }
    # DEBUG : print out contents of hash
    while ( my ($key, $value) = each(%a_list) ) {
        print "$value->[0] $key\n";
    }
    # now check this block of sequences in file b.txt
    do {
        # first make a copy of a_list, to which you feed the six lines of b.txt
        my %b_list = %a_list;
        for (0 .. 5) {
            $_ = <INFILE2>;
            my ( $name, $nums ) = /^(\S+\s+\S+)\s+(.*)/;
            push @{$b_list{$nums}}, $name;
        }
        # OK, quick DEBUG print new hash b_list
        print "\n\nb_list hash:\n";
        while ( my ($key, $value) = each(%b_list) ) {
            print "$value->[0] $key\n";
        }
        # Now check for the similar keys
        print "\n\nmatches in b_list\n";
        for ( values %b_list ) {
            print "\t$_->[0] ", scalar(@$_), "\n";
        }
    } while (<INFILE2>);
} while (<INFILE1>);
Unfortunately, I'm stuck now. I can't get the script to keep running
the inner loop (b_list) for each "block" of a_list (6 lines of a.txt).
I come to the end of file b.txt and get errors such as:
Use of uninitialized value in hash element at compare_files.plx line 33, <INFILE2> line 37.
Could anyone please help me?
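One possible reading of the message, offered as an assumption rather than a confirmed diagnosis: a read past the end of a file returns undef, each while (<INFILE2>) test also reads and discards a line, and b.txt is never rewound before the next block of a.txt. The sketch below shows rewinding with seek and stopping on eof. The in-memory data and the block size of 2 are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up stand-in for b.txt: four records, read in blocks of two.
my $data = "r1\nr2\nr3\nr4\n";
open my $fh, '<', \$data or die "Cannot open in-memory data: $!";

my @seen;
for my $pass (1, 2) {              # stands in for two blocks of a.txt
    seek $fh, 0, 0 or die "Cannot rewind: $!";   # start b from the top again
    until (eof($fh)) {             # test for end of file without consuming a line
        my @block;
        for (1 .. 2) {
            my $line = <$fh>;
            last unless defined $line;   # guard against a short final block
            chomp $line;
            push @block, $line;
        }
        push @seen, "pass $pass: @block";
    }
}
print "$_\n" for @seen;
```

Each pass sees both blocks in full, because the loop condition uses eof rather than a read, and the handle is rewound at the start of every pass.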
Also, the files a & b are in fact huge, with 100,000s of 6-line blocks.
If anyone has any suggestions for a faster method, that would be awesome.
I've tried coding it in C, but after finding out about the hash/keys
feature in Perl, which is fantastic for this stuff, I think Perl is the
way to go.
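As a rough sketch of how the hash feature might avoid re-reading b.txt for every block of a.txt: index every sequence of b once, then look up each record of a. This assumes that "similar" means identical sequence text, that a block is identified by its first-column identifier (as in the samples above), that sequences do not repeat within a block, and that the index fits in memory. The data and the block size of 2 are made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up records in "name Tn sequence" form; block size 2 instead of 6.
my @a_lines = ("a1 T1 1 2 3", "a1 T2 4 5 6", "a2 T1 7 8 9", "a2 T2 1 1 1");
my @b_lines = ("b1 T1 4 5 6", "b1 T2 1 2 3", "b2 T1 7 8 9", "b2 T2 2 2 2");
my $block_size = 2;

# Pass 1: index every sequence in b by the identifier of its block.
my %b_index;    # sequence => { b block id => 1 }
for my $line (@b_lines) {
    my ($id, undef, $nums) = split ' ', $line, 3;
    $b_index{$nums}{$id} = 1;
}

# Pass 2: for each record in a, count hits per (a block, b block) pair.
# Note: a sequence repeated inside one block of a would be counted twice.
my %hits;       # a block id => { b block id => number of shared sequences }
for my $line (@a_lines) {
    my ($id, undef, $nums) = split ' ', $line, 3;
    next unless exists $b_index{$nums};
    $hits{$id}{$_}++ for keys %{ $b_index{$nums} };
}

# Report the pairs of blocks in which every sequence matched.
my @full;
for my $a_id (sort keys %hits) {
    for my $b_id (sort keys %{ $hits{$a_id} }) {
        push @full, "$a_id $b_id" if $hits{$a_id}{$b_id} == $block_size;
    }
}
print "full match: $_\n" for @full;
```

Here block a1 fully matches block b1 even though the two sequences appear in a different order, while a2 shares only one sequence with b2 and is not reported. Each file is read once, so the work grows with the number of lines rather than with the product of the two block counts.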
Many thanks in advance,
Martin.