First Commercial Perl Program

Tim McDaniel

I'm working on another project now. It's a project where I read a
line from one file (the file has multiple lines) and then check to
see if a certain field from a second file matches. ... However, the
datasets are HUGE (first file is 500m, the compare file is 1tb). Is
there a more efficient way to do this?

I would tend to think that the efficient way is to see whether you can
use a decent database for this, like MySQL. A decent database system
is designed to allow fast lookups, so a few lines of SQL may allow you
to avoid having to re-implement indexing on your own.
 
Xho Jingleheimerschmidt

I need to reread through this entire thread again, because there is some
learning here for me.

I'm working on another project now. It's a project where I read a line from
one file (the file has multiple lines) and then check to see if a certain
field from a second file matches. I already know how to do this with a nested
foreach reading in a line from the first file, and then a foreach (for) to
compare to all lines in the second file. That verbal explanation is how I
am doing it, and after much googling, it looks like everyone else does it that
way too. However, the datasets are HUGE (first file is 500m, the compare file
is 1tb). Is there a more efficient way to do this?

If the second file is sorted on the column needed to do the lookup, you
can do a binary search into it.

If the record length is not constant, you can still do a binary search
where you divide by offset rather than by record; you just have to be
sure to re-align to the line boundaries after seeking into the middle
of a line. If you are using ASCII or some other simple character set,
this is almost trivial; if you are using some other character set, it
might not be.

Xho
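
A minimal sketch of the offset-based binary search described above, not a tuned implementation. It assumes the big file is tab-separated, sorted bytewise on the key column (for example with LC_ALL=C sort), and uses a single-byte "\n" line terminator. The names find_sorted_line, line_at_or_after and key_of are made up for illustration.

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET SEEK_END);

# Return the first complete line that starts at byte offset >= $pos,
# or undef if there is none. This is the "re-align" step.
sub line_at_or_after {
    my ($fh, $pos) = @_;
    if ($pos > 0) {
        # Step back one byte and discard up to the next newline. If byte
        # $pos - 1 is itself a newline, this leaves us exactly at $pos.
        seek($fh, $pos - 1, SEEK_SET) or die "seek failed: $!";
        my $partial = readline($fh);
    }
    else {
        seek($fh, 0, SEEK_SET) or die "seek failed: $!";
    }
    return scalar readline($fh);
}

# Extract the key field (0-based column $col) from a tab-separated line.
sub key_of {
    my ($line, $col) = @_;
    chomp $line;
    my $key = (split /\t/, $line, -1)[$col];
    return defined $key ? $key : '';
}

# Find the first line whose key equals $want; undef if no line matches.
sub find_sorted_line {
    my ($fh, $want, $col) = @_;
    seek($fh, 0, SEEK_END) or die "seek failed: $!";
    my ($lo, $hi) = (0, tell($fh));

    # Search for the smallest offset whose following line has key >= $want.
    while ($lo < $hi) {
        my $mid  = int(($lo + $hi) / 2);
        my $line = line_at_or_after($fh, $mid);
        if (!defined $line or key_of($line, $col) ge $want) {
            $hi = $mid;
        }
        else {
            $lo = $mid + 1;
        }
    }

    my $line = line_at_or_after($fh, $lo);
    return undef unless defined $line and key_of($line, $col) eq $want;
    chomp $line;
    return $line;
}
```

Each lookup reads only a few dozen lines even in a very large file, at the cost of requiring the file to be sorted first.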
 
Reini Urban

I entered the professional perl programming world by being paid
(that's what I call professional, though the code may be far from it)
for a very small perl script. The user basically wanted a config file
which contained a username as the first line, a password as the second
line, and hostnames on the remaining lines.
e.g.
user
pass
127.0.0.1
127.0.0.2
Then I wrote the following script. It gathers the user, pass, and
hostlist, and then establishes an ssh connection to query a 'device'
and return the output in a file named after the host. Following is
that program:
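#!/usr/bin/perl
# Code by
# For
# dmon-1.6
use warnings;
use strict;
use Net::SSH::Perl;    # module names are case-sensitive

my $cfgfile = "./config";
# Low-precedence "or" so that die fires when open fails; with "||" the
# test binds to $cfgfile and the die can never trigger.
open CONFIG, "<", $cfgfile or die $!;
chomp(my @cfgdat = (<CONFIG>));
close CONFIG;
my $user   = shift(@cfgdat);    # line 1: username
my $pass   = shift(@cfgdat);    # line 2: password
my $extcmd = "ls -l";           # command to run on each host
my $stime  = 3;                 # seconds to sleep between polling rounds

# $stime is always defined, so this loop polls forever
while (defined $stime) {
    foreach (@cfgdat) {
        my $ssh = Net::SSH::Perl->new($_);
        $ssh->login($user, $pass);
        my ($stdout, $stderr, $exit) = $ssh->cmd($extcmd);
        # Append the output to a file named after the host
        open OUTFILE, ">>", $_ or die $!;
        if ($stdout) {
            print OUTFILE $stdout;
        }
        if ($stderr) {
            print OUTFILE $stderr;
        }
        close OUTFILE;
    }
    sleep $stime;
}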

I am just looking for critique. I have been a Unix admin for over 15
years and have used perl for one-off scripts, but I spent time
mastering O'Reilly's Learning Perl and Intermediate Perl (Mastering and
Advanced Perl are next) and am now looking to become solely a
commercial perl programmer. However, as I lack commercial experience,
I probably lack a 'standard' way of approaching things, or at least
don't know what experienced perl programmers know, which I'll learn as
a function of time. Either way, if you have time, let me know how I
could have done all this better, and maybe even point me to a source of
commercial perl programs I can look at to see how pros do it.
Ron

Overall this approach is nonsense, as you probably know as a sysadmin.

1. password in cleartext for ssh?
never do that. even if the customer is too stupid to understand that, you
should just refuse to do that and use ssh-copy-id instead.
if the target machine has ssh, copy your key over to it.
only for some antique router with telnet only have I seen plaintext
password attempts, but then at least store them encrypted.

2. stupid config format
there's an established format for this type of problem, which is
basically the ssh connection format.
user@hostname1
user@hostname2

3. why perl when a simple shell script is much simpler and shorter?
for h in `cat .config`; do
    ssh $h ls -l
done

oh my
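
Since the client asked for perl, here is a rough sketch of how points 1 and 2 could look in perl rather than shell. It assumes key-based login is already set up (for example via ssh-copy-id) and that the config holds one user@host per line. The helper names parse_targets and ssh_command are invented for illustration; the code only builds the argument list and leaves actually running it (e.g. with system) to the caller.

```perl
use strict;
use warnings;

# Parse "user@host" lines into [user, host] pairs, skipping blank
# lines and lines starting with "#" (the comment rule is an assumption).
sub parse_targets {
    my ($fh) = @_;
    my @targets;
    while (my $line = <$fh>) {
        $line =~ s/^\s+|\s+$//g;
        next if $line eq '' or $line =~ /^#/;
        my ($user, $host) = $line =~ /^([^\@\s]+)\@([^\@\s]+)$/
            or die "Malformed target on line $.: $line\n";
        push @targets, [$user, $host];
    }
    return @targets;
}

# Build the argument list for a non-interactive ssh call. BatchMode=yes
# makes ssh fail instead of prompting for a password.
sub ssh_command {
    my ($user, $host, @remote_cmd) = @_;
    return ('ssh', '-o', 'BatchMode=yes', '-l', $user, $host, @remote_cmd);
}
```

Passing the result to system as a list avoids going through a shell on the local side, so hostnames from the config are never interpreted as shell syntax.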
 
Martijn Lievaart

1. password in cleartext for ssh?
never do that. even if the customer is too stupid to understand that, you
should just refuse to do that and use ssh-copy-id instead.

And what does a private key stored in plaintext buy you over a password
stored in plaintext?

(Yes, there is an advantage, but it is not obvious. Do you know it?)

M4
 
Martijn Lievaart

It can be revoked, obviously. This means, as a corollary, that you
should never use the same private key on more than one machine.

A password can be revoked as well, but it is easier to use multiple keys
on one account and then revoke only one of them. True, hadn't thought of
that myself.

What I was thinking of was that ssh will insist on tight file permissions
on the private key, while a config file containing a password has a good
chance of ending up world readable.
As a matter of general security, a randomly-generated key is also not
subject to dictionary attacks, which a user-chosen password generally
will be. This is not relevant here, of course.

It might be, people do stupid things all the time. So yes, this is another
good point in favor of using keys over passwords.
(In general I don't consider ssh keys particularly secure, and would
rather use Kerberos, but that isn't usually possible across
authentication domains.)

I think all schemes have advantages and disadvantages and Kerberos is a
scheme with the disadvantage that it is much more difficult to set up
correctly. So unless you do the hard work once and then can plug into
your existing Kerberos infra from then on, I don't find Kerberos
particularly manageable. And unmanageable equals insecure.

M4
 
tbb!/fbr!

Overall this approach is nonsense, as you probably know as a sysadmin.

1. password in cleartext for ssh?
never do that. even if the customer is too stupid to understand that, you
should just refuse to do that and use ssh-copy-id instead.
if the target machine has ssh, copy your key over to it.
only for some antique router with telnet only have I seen plaintext
password attempts, but then at least store them encrypted.

2. stupid config format
there's an established format for this type of problem, which is
basically the ssh connection format.
user@hostname1
user@hostname2

3. why perl when a simple shell script is much simpler and shorter?
for h in `cat .config`; do
    ssh $h ls -l
done

oh my

1. This was a perl program for the client, and he defined the requirements. It had to be in perl. Yes, I'm a hardcore shell scripter/programmer as well, but I've given it up in favor of perl. He needed something to run identically on all machines, regardless of funky OS differences. He wanted that datafile, which contains the user, the pass, and all hosts to be hit. The script was his engine, so all he ever had to do to poll his various devices was to add a hostname or IP address. He didn't define security requirements. My guess is that he was running this from a local machine (his laptop maybe) and simply wanted something in perl (again, the client's requirement) so he could poll several hundred devices, which would probably be better in perl than in a shell script anyway. And it's completely modular, and I look forward to him asking me for additional functionality.

2. stupid config format? that doesn't even qualify for a reply.

3. perl because that's what the client wanted.

Thanks,
Ron
 
tbb!/fbr!

Here's a piece of code which compares lines in one file against lines in the other. I mentioned earlier in the thread it was something I was working on, but can't quite get it right via hashes, splits, working on the data:

foreach $line1 ( @lines1 )
{
    ( $symb, $company, $excess ) = split( /\t/, $line1, 3 );

    foreach $line2 ( @lines2 )
    {
        ( $date_y, $symbol_y, $company_y, $cap_y, $open_y, $low_y,
          $high_y, $close_y, $pe_ratio_y, $date_div_y, $dividend_y,
          $div_yield_y, $date_ex_div_y, $nav_y, $yield_y, $vol_y,
          $avg_vol_y ) = split( /\t/, $line2 );
        if ( $symb eq $symbol_y )
        {
            print "$symb\t$company\t$cap_y\t$open_y\t$low_y\t$high_y\t$close_y\t$vol_y\t$avg_vol_y\t$excess\n";
            print FILEOUT1 "$symb\t$company\t$cap_y\t$open_y\t$low_y\t$high_y\t$close_y\t$vol_y\t$avg_vol_y\t$excess\n";
        }
    }
}

It was mentioned to me that there are a couple of ways to do this. Can I get some input/help on making this piece of code a little more streamlined?

Ron
 
Tim Watts

tbb!/fbr! said:
Here's a piece of code which compares lines in one file against lines in
the other. I mentioned earlier in the thread it was something I was
working on, but can't quite get it right via hashes, splits, working on the
data:

foreach $line1 ( @lines1 )
{
( $symb, $company, $excess ) = split( /\t/, $line1, 3 );


foreach $line2 ( @lines2 ) #
{
( $date_y, $symbol_y, $company_y, $cap_y, $open_y, $low_y,
$high_y, $close_y, $pe_ratio_y, $date_div_y, $dividend_y,
$div_yield_y, $date_ex_div_y, $nav_y, $yield_y, $vol_y, $avg_vol_y
) = split(/\t/, $line2 );
if ( $symb eq $symbol_y )
{
print
"$symb\t$company\t$cap_y\t$open_y\t$low_y\t$high_y\t$close_y\t$vol_y\t$avg_vol_y\t$excess\n";
print FILEOUT1
"$symb\t$company\t$cap_y\t$open_y\t$low_y\t$high_y\t$close_y\t$vol_y\t$avg_vol_y\t$excess\n";
}
}

}

It was mentioned to me that there are a couple of ways to do this. Can I
get some input/help on making this piece of code a little more
streamlined?

Ron

Aside from Ben's excellent comments, what exactly isn't working?

I would put some debugging prints in (or run it through the perl debugger) -

1) print $symb, $company, $excess after the first foreach (add a next; to
skip the inner loop) and make sure you are happy that the data is being
"split" right.

2) Same for the inner loop


If either of those datasets has only one record for each unique SYMBOL, it
would be a good candidate for preloading into a hash and only looping on the
other dataset. If both files have multiple records per SYMBOL you could still
do this, but it would be a hash of arrays, so a little bit fiddlier.

HTH

Tim
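
A rough sketch of the hash-preload idea, assuming SYMBOL is unique in the smaller file and that the field positions match the split in the code posted above (symbol in column 1 of the big file, cap through close in columns 3 to 7, vol and avg_vol in columns 15 and 16, counting from 0). The names load_symbols and join_quotes are invented for illustration.

```perl
use strict;
use warnings;

# Build a lookup table from the smaller file: symbol => [company, excess].
# Assumes each symbol appears at most once in this file.
sub load_symbols {
    my ($fh) = @_;
    my %by_symbol;
    while (my $line = <$fh>) {
        chomp $line;
        my ($symb, $company, $excess) = split /\t/, $line, 3;
        $by_symbol{$symb} = [$company, $excess];
    }
    return \%by_symbol;
}

# Read the larger file once, line by line, and print one joined row
# for every line whose symbol is in the lookup table.
sub join_quotes {
    my ($fh, $by_symbol, $out) = @_;
    while (my $line = <$fh>) {
        chomp $line;
        my @f = split /\t/, $line, -1;
        next unless defined $f[1];
        my $known = $by_symbol->{ $f[1] } or next;
        my ($company, $excess) = @$known;
        print {$out} join("\t", $f[1], $company,
                          @f[3 .. 7, 15, 16], $excess), "\n";
    }
}
```

The big file is read once instead of once per line of the small file, and only the small file has to fit in memory. Note that the rows come out in the order of the big file, not the order of the small one as with the nested loops. For several records per symbol, the table values would become arrays of records (push @{ $by_symbol{$symb} }, [...]) with an inner loop over them.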
 
tbb!/fbr!

Aside from Ben's excellent comments, what exactly isn't working?

It works fine and does exactly what it was supposed to do. I was just looking for ways to streamline it. Additionally, as I am learning perl, a few folks mentioned loading one of the files into a hash (@lines1 is one file and @lines2 is another) and then looping through the other. Taking a line from the first file and then comparing it to every line in the second file means that the second file is scanned once for every line in the first file. I was just looking to learn a more efficient way of doing this. The first file is only 400k or so, but the other file is 4m. I'm playing with what Ben showed me, but I know that comparing against a hash is probably the best way to do this, and I'm not proficient enough at perl yet to do things in a sophisticated (heh) fashion, so I wrote it the way I did understand it, with the two foreach loops.

Ron
 
tbb!/fbr!

Quoth tbb!/fbr!:
Here's a piece of code which compares lines in one file against lines in
the other. I mentioned earlier in the thread it was something I was
working on, but can't quite get it right via hashes, splits, working on
the data:

You haven't explained what isn't working. AFAICS the code below should
do *something*; I can't tell whether or not it is what you want it to do.
foreach $line1 ( @lines1 )

foreach my $line1 ( @lines1 )

Are you using 'strict'?

You should give your variables more meaningful names than '@lines1'.
What does this file actually contain?

[I've rewrapped the code below since the lines were too long; since the
wraps were inside strings it no longer does what it did. Please wrap any
posted code to 76 columns: if necessary you can divide a long string up
with .]
{
( $symb, $company, $excess ) = split( /\t/, $line1, 3 );


foreach $line2 ( @lines2 ) #
{
( $date_y, $symbol_y, $company_y, $cap_y, $open_y, $low_y,
$high_y, $close_y, $pe_ratio_y, $date_div_y, $dividend_y, $div_yield_y,
$date_ex_div_y, $nav_y, $yield_y, $vol_y, $avg_vol_y ) = split(/\t/,
$line2 );
if ( $symb eq $symbol_y )
{
print "$symb\t$company\t$cap_y\t$open_y\t$low_y\t
$high_y\t$close_y\t$vol_y\t$avg_vol_y\t$excess\n";
print FILEOUT1 "$symb\t$company\t$cap_y\t$open_y\t

Don't use global bareword filehandles. Keep your filehandles in
variables: that is, instead of

open FILEOUT, ">", ... or die ...;

write

open my $FILEOUT, ">", ... or die ...;

and then use

print $FILEOUT ...

later on. Also, use a meaningful name for your filehandle variables.
$low_y\t$high_y\t$close_y\t$vol_y\t$avg_vol_y\t$excess\n";

You can make this a lot less messy by noticing that $cap_y--$close_y and
$vol_y--$avg_vol_y stay together, so you can treat them as one field.
Also, you are printing the same string twice, so put it in a variable
rather than writing it out all over again.

# create a pattern to match a single field, and a pattern to match a
# field plus a delimiter
my $f = qr/[^\t]*/;
my $fd = qr/$f\t/;

# extract the fields we want
my ($symbol, $data, $vols) = $line2 =~ m{
^ $fd ($f) \t $fd ($fd{4} $f) \t $fd{7} ($fd $f) $
}x;

# strip the trailing tab from the field values
s/\t$// for $symbol, $data;

if ($symbol eq $symb) {
    my $new = join "\t",
        $symb, $company, $data, $vols, $excess;
    # the original printed a trailing newline after $excess
    print "$new\n";
    print FILEOUT "$new\n";
}

An alternative would be something like this:

# outside the loop
my @in = qw/
date symb company
cap open low high close pe_ratio
date_div div div_yield date_ex_div
nav yield vol avg_vol
/;
my @out = qw/
cap open low high close vol avg_vol
/;

# inside
my %f;
# split the 17-field line ($line2), not the 3-field $line1
@f{@in} = split /\t/, $line2;

if ($f{symb} eq $symb) {
    my $new = join "\t",
        $symb, $company, @f{@out}, $excess;
    # print as above
}

which is likely to be more use if you want to do more processing of this
data later.

Ben

I apologize, but it is working. There is nothing wrong with the way it's written, other than that it's written by a novice perl programmer. You provided excellent help on how to streamline it. It works 100% the way it is; I was just looking for a better or more efficient way to write it.

Thanks,
Ron
 
tbb!/fbr!

Aside from Ben's excellent comments, what exactly isn't working?

I would put some debugging prints in (or run it through the perl debugger) -

1) print $symb, $company, $excess after the first foreach (add a next; to
skip the inner loop) and make sure you are happy that the data is being
"split" right.

2) Same for the inner loop


If either of those datasets has only one record for each unique SYMBOL, it
would be a good candidate for preloading into a hash and only looping on the
other dataset. If both files have multiple records per SYMBOL you could still
do this, but it would be a hash of arrays, so a little bit fiddlier.

HTH

Tim

As I mentioned to Ben, it works 100% as is. I was just looking for a better way to write it. It seems to me there must be some better way than reading all of @lines2 for each line in @lines1. And yes, SYMBOL is unique.

Thanks,
Ron
 
