Reading Mac / Unix / DOS text files


January Weiner

Hi, I'm sure this is a common problem:

I'd like my script to treat text files coming from various systems alike.
More specifically, I'd like to recognize ends of line as one of: \r, \l,
\r\l. Is there a more elegant way than doing the obvious?:

while (<IF>) {
    s/\r?\l?$//;   # is this correct anyway? will an end of line be
                   # recognized with a Mac file?
    # ...
}

I would expect that there is some weird variable out there (like $/)
that changes the behaviour of chomp to be more promiscuous.

The problem, of course, is that this cannot be set platform- or
script-wide. One file might contain DOS eols, another one would come from
a Mac.

j.

A. Sinan Unur

I'd like my script to treat text files coming from various systems alike.
More specifically, I'd like to recognize ends of line as one of: \r, \l,
\r\l. Is there a more elegant way than doing the obvious?:

You should use the codes for those characters rather than the escapes.
while(<IF>) {

I stared at this for a long time trying to figure out what

while(<IF>) {

meant. I guess IF is short for Input File?

Here, an appropriate amount of whitespace, not using bareword
filehandles, and using an appropriate variable name would have helped
immensely with readability.

while ( <$input> ) {
s/\r?\l?$// ; # is this correct anyway? will an end of line be
# recognized with a Mac file?


This information is readily available by doing a cursory Google search.
Are you that lazy?

s{ \012 | (?: \015\012? ) }{\n}x

should convert any line ending convention to the one supported by your
platform.
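Applied to a whole buffer rather than a per-line loop, that substitution needs a /g flag; a minimal self-contained sketch (using an in-memory string instead of a real file):

```perl
use strict;
use warnings;

# Normalize every EOL convention (LF, CRLF, lone CR) to "\n".
# The same s///gx works on a slurped file's contents.
my $text = "unix\012dos\015\012mac\015end";
$text =~ s{ \012 | (?: \015\012? ) }{\n}gx;
print $text;
```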
I would expect that there is some weird variable out there
(like the $/)


$/ is not a weird variable. It is documented in perldoc perlvar.

Sinan
 

thrill5

I don't know why you had to stare at "while (<IF>)" for anything longer
than about a tenth of a second. It's pretty obvious to me what the code
does. Whitespace and using barewords for file handles are a matter of
programming style. Just because that's not the way you do it does not
mean that it is incorrect or wrong.

Scott
 

Rick Scott

thrill5 said:
I don't know why you had to stare at "while (<IF>)" for anything
longer than about a tenth of second. Pretty obvious what the code
does to me.

Your filehandle `IF' collides with Perl's conditional `if' in the
mental hash-bucket of the programmers who have to read your code.
Given that it reduces the comprehensibility of your program and that
you could have used any number of more legible identifiers, I'd call
use of `IF' poor style, if not an outright error.

Whitespace and using barewords for file handles is a matter of
programming style. Just because that's not the way you do it does
not mean that it is incorrect or wrong.

On the contrary -- I would posit that any coding practice that makes
it easier to inadvertently introduce bugs into your program is a
poor one. By using a bareword filehandle, you're essentially using
a package variable (*IF). If some other piece of code in the same
package namespace touches that bareword while you're using it to read
from a file, your filehandle will get stomped and your code will break
without you even having changed it. That's why it's a bad idea.
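Rick's stomping scenario is easy to reproduce. Here is a contrived sketch (the subroutine name is made up, and it assumes a Unix-ish system with /dev/null) that clobbers a bareword handle mid-read:

```perl
use strict;
use warnings;

# Open a bareword filehandle, then let "someone else's code" reuse the
# same name: reopening the package glob *FH silently closes our handle.
open FH, '<', $0 or die "Can't open $0: $!";   # read this very script
my $first = <FH>;

sub unrelated_library_code {
    open FH, '<', '/dev/null' or die;           # same glob, new file
}
unrelated_library_code();

my $next = <FH>;   # now reads from /dev/null: immediate EOF
print defined $next ? "still reading the script\n" : "handle was stomped\n";
```

A lexical handle (`open my $fh, ...`) is immune, because nothing outside its scope can reach it.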

Pick up a copy of Damian Conway's "Perl Best Practices" -- he explains
why this and about a thousand other `harmless style preferences' aren't
harmless, aren't stylish, and definitely aren't preferable.




Rick
 

January Weiner

A. Sinan Unur said:
You should use the codes for those characters rather than the escapes.

Hmmm, OK, and why?
Here, an appropriate amount of whitespace, not using bareword
filehandles, and using an appropriate variable name would have helped
immensely with readability.
while ( <$input> ) {

Sorry. I learned the rudiments of Perl some ten years ago, when, as far
as I can remember, bareword filehandles were frequently found in the
code I learned Perl from. Thanks for the suggestions. I should have
written <STDIN> and everyone would be happy (unless, of course, modern
style recommends something instead of the bareword STDIN).
This information is readily available by doing a cursory Google search.
Are you that lazy?

I think that I am rather that stupid, because I did go through both the
FAQ and a dozen hits Google returned (not to mention the Perl
documentation on my system), but I found mostly references to
modifications of $INPUT_RECORD_SEPARATOR, which does not really do the
job for me. Do you think I am really so eager to expose myself to the
ruddy remarks of Perl gurus by asking a novice question? :)
s{ \012 | (?: \015\012? ) }{\n}x
should convert any line ending convention to the one supported by your
platform.

Thank you for giving me the answer nonetheless :)
$/ is not a weird variable. It is documented in perldoc perlvar.

Sorry. I know it is and I know the docs. Would you have been happier if I
had written "shorthand" instead of "weird"?

Thanks for your answer!

j.


January Weiner

Rick Scott said:
Your filehandle `IF' collides with Perl's conditional `if' in the
mental hash-bucket of the programmers who have to read your code.

Given that it reduces the comprehensibility of your program and that
you could have used any number of more legible identifiers, I'd call
use of `IF' poor style, if not an outright error.

Why an error? (Yeah, it can lead to errors, I agree -- but then, of
course, a colleague of mine says the same about using Perl.)
On the contrary -- I would posit that any coding practice that makes
it easier to inadvertently introduce bugs into your program is a
poor one. By using a bareword filehandle, you're essentially using
a package variable (*IF). If some other piece of code in the same
package namespace touches that bareword while you're using it to read
from a file, your filehandle will get stomped and your code will break
without you even having changed it. That's why it's a bad idea.

I think this depends a little on what you are using Perl for, and just
some basic common sense is sufficient to say when using shorthand is OK and
when it is a bad idea. I do use the <IF> construct in few-liners that do
not read more than one file. I do think it is important to define
variables if you are writing a larger piece of code. I do not understand
the whole stir about this issue.
Pick up a copy of Damian Conway's "Perl Best Practices" -- he explains
why this and about a thousand other `harmless style preferences' aren't
harmless, aren't stylish, and definitely aren't preferable.

So what is the problem with whitespace? Why is while(<STDIN>) more harmful
than while ( <STDIN> ) ?

j.


Tad McClellan

I'd like my script to treat text files coming from various systems alike.
More specifically, I'd like to recognize ends of line as one of: \r, \l,
\r\l.


In Perl, \l lowercases the following character.

I think you must have meant \n instead?

Furthermore, sometimes \n means CR rather than LF.

We better get our terminology precise if we are to avoid
confusing ourselves.

I will use "carriage return" (CR) and "linefeed" (LF) to
avoid further confusion.

Is there a more elegant way than doing the obvious?:


The obvious will not work, so I wouldn't characterize it as "obvious".

You need a "correct way" before exploring for a "more elegant way".

while(<IF>) {


Too late.

At this point, you have *already done* an operation that depends
on the definition of line-ending.

If the file is Mac-style and the program is running on *nix,
then the loop executes 1 time, and the entire file will be
in $_ already...

s/\r?\l?$// ; # is this correct anyway? will an end of line be
# recognized with a Mac file?


.... so this will delete the final CR but leave all the rest untouched.
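Tad's point can be demonstrated with an in-memory "Mac-style" file (a sketch; the \l typo is dropped from the OP's regex):

```perl
use strict;
use warnings;

# On Unix, $/ is "\012"; a CR-only file contains none, so the first
# readline returns the entire file and the loop body runs exactly once.
my $mac_data = "first\015second\015third\015";
open my $fh, '<', \$mac_data or die "open: $!";
my $iterations = 0;
while (<$fh>) {
    $iterations++;
    s/\r?$//;     # strips only the very last CR of the whole "line"
}
print "loop ran $iterations time(s)\n";
```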

I would expect that there is some weird variable out there (like $/)
that changes the behaviour of chomp to be more promiscuous.


You would be disappointed then. :)

The problem, of course, is that this cannot be set platform- or
script-wide. One file might contain DOS eols, another one would come from
a Mac.


Then you should "normalize" the data before doing any line-oriented
processing.

In other words, you must treat these "text files" as if they
were "binary" files. That is, use read() or sysread()
rather than readline().
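One way to follow that advice (a sketch, not Tad's code): read raw chunks and rewrite them to LF before any line-oriented processing, taking care with a CRLF pair split across a chunk boundary. The buffer size and subroutine name are arbitrary.

```perl
use strict;
use warnings;

# Copy $in to $out, rewriting CR, LF and CRLF as "\n". A chunk that
# ends in CR is held back one byte so a split CRLF pair is not
# converted into two newlines.
sub normalize_eol {
    my ($in, $out) = @_;
    my $carry = '';
    while (read($in, my $buf, 65536)) {
        $buf = $carry . $buf;
        $carry = ($buf =~ s/\015\z//) ? "\015" : '';
        $buf =~ s/\015\012?|\012/\n/g;
        print {$out} $buf;
    }
    print {$out} "\n" if $carry ne '';   # file ended in a lone CR
}

# Demo on in-memory handles:
my $mixed = "a\015\012b\015c\012d";
open my $src, '<', \$mixed     or die "open: $!";
open my $dst, '>', \my $result or die "open: $!";
normalize_eol($src, $dst);
close $dst;
print $result;
```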
 

A. Sinan Unur

Hmmm, OK, and why?

Because ...

it is easy to get confused (from perldoc perlre):

\l lowercase next char (think vi)

That is, \l is not linefeed.

In any case, these escapes could potentially mean different things on
different systems. Why not be very specific in what you really are
looking for?

I think that I am rather that stupid, because I did go through both,
FAQ and a dozen hits Google returned

http://www.google.com/search?q=perl+eol

http://www.google.com/search?q=newline

In any case, I should probably have put a smiley there, because I had not
intended it to come across that harshly.
Thanks for your answer!

You are welcome.

Sinan
 

January Weiner

it is easy to get confused (from perldoc perlre):
\l lowercase next char (think vi)
That is, \l is not linefeed.

:))) nice demonstration of the problem. But this

s/(?:\r\n?|\n)/

should work correctly? (except for the fact that one should use the codes)
In any case, these escapes could potentially mean different things on
different systems. Why not be very specific in what you really are
looking for?

Hmmmm, I assumed that I should rather use what Perl thinks is a linefeed
than the ASCII code I think it is. But this is really a minor issue.

OK. However, I was not looking for a solution with string substitution;
as you have seen (demonstrated by my faulty code snippet), I came up
with that one myself. I was rather thinking along the following lines:
isn't there a general way to tell Perl "Hey, treat all the text files
alike, wherever they come from: DOS, Mac or Unix"?

The point is: (i) I have written a handful of various scripts, some of
them quite large. All of them work on text files. Recently I have
discovered problems due to the fact that some of the files I work on
come from the DOS world. Now, I'd rather insert _one_ command or
variable assignment somewhere at the beginning of the script that would
change the behaviour of chomp than go through all that code and replace
each chomp with a substitution. (ii) A substitution takes orders of
magnitude more time:

:~ $ head -100000 /db/prodom/prodom.mul | (time perl -p -e 'chomp ;' > /dev/null ; )

real 0m0.157s
user 0m0.123s
sys 0m0.034s
:~ $ head -100000 /db/prodom/prodom.mul | (time perl -p -e 's{ \012 | (?: \015\012? ) }{\n}x ;' > /dev/null ; )

real 0m2.012s
user 0m1.990s
sys 0m0.024s

And, surprise, the files can be quite large:
:~ $ wc -l /db/prodom/prodom.mul
7900570 /db/prodom/prodom.mul

I simply thought there might be a better solution than to use
substitutions, like assigning $/ in a special way or using a module that
adds a layer to the file open() or redefines chomp. What do I know. I
thought that the problem was common enough to be addressed in a better way.
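For the record, the single-switch behaviour January is asking for does exist as an I/O layer on CPAN: the PerlIO::eol module (not in core, so this is only a sketch assuming it is installed) normalizes line endings at read time, so every chomp downstream keeps working unchanged:

```perl
use strict;
use warnings;
use PerlIO::eol;   # CPAN module, not in core

# The :eol(LF) layer converts CR, LF and CRLF to LF as data is read.
open my $fh, '<:raw:eol(LF)', 'test.mul' or die "Cannot open: $!";
while (<$fh>) {
    chomp;                     # plain chomp now works for any source
    print "line $.: $_\n";
}
```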

I think that I will find some way to determine the file type (possibly by
looking at the ending of the first line), redefine $/ and continue reading.
Some untested code follows:


#!/usr/bin/perl -w
use strict;
use warnings;

my $DFNTNTF = myopen("<test.mul");

die "Cannot open file: $!\n" unless $DFNTNTF;

while (<$DFNTNTF>) {
    chomp;
    print "line $.:$_\n";
}

close $DFNTNTF;

exit 0;

# open a file and set the input record separator
sub myopen {
    my $file_mode = shift;
    my $definitelynotif;

    open($definitelynotif, $file_mode) or return;
    my $line = <$definitelynotif>;

    if ($line =~ m/(\015\012|\012|\015)/) {
        $/ = $1;
    }

    seek $definitelynotif, 0, 0;
    return $definitelynotif;
}
In any case, I should probably have put a smiley there, because I had not
intended it to come across that harshly.

No offence taken.

Cheers,
January

A. Sinan Unur

:))) nice demonstration of the problem. But this

s/(?:\r\n?|\n)/

should work correctly?

Have you tried that on a DOS file on Unix? Take a look at it in a hex
editor.
OK. However, I was not looking for a solution with string
substitution, as you have seen (demonstrated on my faulty code
snippet) I came up with that one myself. I was rather thinking along
the following lines: isn't there a general way to tell Perl "Hey,
treat all the text files alike, wherever they come from: DOS, Mac or
Unix".

Open in binmode, don't use \n to match eol.
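Spelled out, Sinan's advice might look like this sketch (shown on an in-memory handle so it is self-contained):

```perl
use strict;
use warnings;

# binmode guarantees no platform CRLF translation, so the bytes we
# split on are exactly the bytes in the "file"; match eol as explicit
# octets rather than as "\n".
my $raw = "one\015\012two\015three\012four";
open my $fh, '<', \$raw or die "open: $!";
binmode $fh;
my $data = do { local $/; <$fh> };
my @lines = split /\015\012|\015|\012/, $data;
print scalar @lines, " lines\n";
```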

Sinan
 

Rick Scott

January Weiner said:
Why an error? (Yeah, it can lead to errors, I agree -- but then, of
course, a colleague of mine says the same about using Perl.)

Why cause a greater propensity to introduce errors than one has to?

I think this depends a little on what you are using Perl for, and
just some basic common sense is sufficient to say when using
shorthand is OK and when it is a bad idea. I do use the <IF>
construct in few-liners that do not read more than one file. I do
think it is important to define variables if you are writing a
larger piece of code. I do not understand the whole stir about
this issue.

Since there's a way to do things that lets you do everything you can
do with a bareword filehandle without incurring the disadvantages,
why not make a habit of using it? As of Perl 5.6, you can do this:

open my $FILE, '<', $filename or die "Can't open file: $!";

Then the filehandle is stored in the lexical variable $FILE (where it
can't be stomped on by someone else's code) instead of in the package
variable *FILE (where it can).

So what is the problem with whitespace? Why is while(<STDIN>) more
harmful than while ( <STDIN> ) ?

Actually, I'd go with the first of those two. As for whitespace in
general -- there aren't too many hard and fast rules; you just want to
make the best use of it to increase the readability of your code. To
take an extreme example,

LINE:
foreach my $line (@lines) {
    my ($registry, $cc, $type, $start, $length, $date, $status) =
        split qr{\|}, $line;

    next LINE unless $status;
    next LINE unless ($type eq 'ipv4');

    print $start;
}

is obviously much better than

LINE:foreach my $line(@lines){my($registry,$cc,$type,$start,$length,$date,$status)=split qr{\|},$line;next LINE unless $status;next LINE unless($type eq 'ipv4');print$start;}

or

LINE:
foreach
my
$line
(
@lines
)
{
my
($registry,
$cc,
$type,
$start,
$length,
$date,
$status)
=
split
qr{\|}
,
$line
;
....

or some other godawful thing.




Rick
 

Samwyse

Rick said:
Why cause a greater propensity to introduce errors than one has to?

One of the very first programs (non-Perl) that I had to maintain (as
opposed to create) was written by a joker who decided to use O as a
variable name. Distinguishing between
X = 0;
and
X = O;
was a source of much merriment for the rest of the staff. Fortunately
for his nose, he was quick at ducking whenever clenched fists were in
his vicinity.
 

Tad McClellan

January Weiner said:
Hmmm, OK, and why?


So you will *know* what character you will get.

print "\n";

outputs different characters when perl is run on different systems.

If you use the codes, everybody on every system sees what you
want them to see.
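A small illustration of that point: the values of the explicit octets are fixed by ASCII, while what "\n" maps to is the platform's choice.

```perl
use strict;
use warnings;

# \012 and \015 always name the LF and CR octets; "\n" is perl's
# logical newline (LF on Unix and Win32 perls, CR on old MacPerl).
printf "LF is octet %d, CR is octet %d\n", ord "\012", ord "\015";
printf "this perl's \"\\n\" is octet %d\n", ord "\n";
```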
 

January Weiner

Why cause a greater propensity to introduce errors than one has to?

Yeah, I guess I _could_ switch to Python, which is verrry orrrdentlich,
syntax imposes coding style, etc., etc... sorry. I use Python when
teaching. I use Perl when doing my bloody job. Seriously: Perl is
notorious for shorthands, quick dirty code snippets etc. I would not have
been programming in Perl if I was ana^W careful.
Since there's a way to do things that lets you do everything you can
do with a bareword filehandle without incurring the disadvantages,
why not make a habit of using it? As of Perl 5.6, you can do this:
open my $FILE, '<', $filename or die "Can't open file: $!";
Then the filehandle is stored in the lexical variable $FILE (where it
can't be stomped on by someone else's code) instead of in the package
variable *FILE (where it can).

Thanks for this explanation, but read what I have written above. Yeah, I
know that; I use it in larger projects. I don't care about that in
five-liners. Frankly, do you use "use strict; use warnings;" with
one-liners run with "perl -e"? Will you try to convince me that omitting
these is a serious danger and causes a greater propensity to introduce
errors? ...with "perl -e"? Well -- I agree. Of course it does -- so what?
Actually, I'd go with the first of these two. About whitespace in
general -- there's not necessarily too many hard and fast rules;
you just want to make best use of it to increase the readability of
your code. To take an extreme example,

(snip example)

Yes, I do agree with general formatting, but I was criticized for this
particular thing -- not putting spaces in while(<STDIN>). This:
LINE:
foreach my $line (@lines) {
    my ($registry, $cc, $type, $start, $length, $date, $status) =
        split qr{\|}, $line;
    next LINE unless $status;
    next LINE unless ($type eq 'ipv4');
    print $start;
}

... I would most probably write as

my @info;
for (@lines) {
    @info = split /\|/;   # info now has: registry, cc, type, start,
                          # length, date, status
    next unless ($info[-1] && $info[2] eq 'ipv4');
    print $info[3];
}

which probably is for you
some other godawful thing.

j.


January Weiner

Samwyse said:
One of the very first programs (non-Perl) that I had to maintain (as
opposed to create) was written by a joker who decided to use O as a
variable name. Distinguishing between
X = 0;
and
X = O;
was a source of much merriment for the rest of the staff. Fortunately
for his nose, he was quick at ducking whenever clenched fists were in
his vicinity.

Yeah. Lack of reasonable editors, or of the skills to use them (how do
you do s/\(\W\)O\(\W\)/\1BloodyStupidVariableName\2/g in Notepad?), is
always a problem.

j.


Lukas Mai

January Weiner said:
Rick Scott said:
[lexical filehandles vs. package/bareword fhs]

Thanks for these explanation, but read what I have written above. Yeah, I
know that, I use it in larger projects. I don't care about that in the
five liners. Frankly, do you use "use strict ; use warnings ;" with
one-liners run with "perl -e"? Will you try to convince me that omitting
these is a serious danger and causes a greater propensity to introduce
errors? ...with "perl -e"? Well -- I agree. Of course it does - so what?

I 'use warnings; use strict;' in every Perl script that's stored in a
file. I use perl -w(l)e for one-liners (except when
golfing/obfuscating). Sometimes I add -Mstrict to check quickly how
exactly a perl feature/bug works.

Just my 2¢, Lukas
 

Donald King

January said:
Hi, I'm sure this is a common problem:

I'd like my script to treat text files coming from various systems alike.
More specifically, I'd like to recognize ends of line as one of: \r, \l,
\r\l. Is there a more elegant way than doing the obvious?:

while (<IF>) {
    s/\r?\l?$//;   # is this correct anyway? will an end of line be
                   # recognized with a Mac file?
    # ...
}

I would expect that there is some weird variable out there (like $/)
that changes the behaviour of chomp to be more promiscuous.

The problem, of course, is that this cannot be set platform- or
script-wide. One file might contain DOS eols, another one would come from
a Mac.

j.

Short, short version:

binmode(IF);
my $whole_file = do { local $/; <IF> };
my @lines = split /(?:\r\n|\r|\n)/, $whole_file;
foreach (@lines) {
    ...
}

Since regexps check alternatives from left to right, that splits as
correctly as possible. If something horrid has happened, like inserting
the contents of a Mac file into a Unix file, you'll get some funny
behavior near the seams, of course, but for a file that's all one line
ending, it works great, and it even handles the common case of files
that mix Unix LFs with Windows CRLFs.

However, if you want to cut back on memory consumption (important for
files bigger than a few hundred KB or so) and your files have consistent
line endings, you might probe the end-of-line by sysreading the first
2KB or so, sysseek back to the start of the file, then locally set $/ to
the exact line ending that you probed.

Something like this might work:

use Fcntl ':seek';
...
binmode(IF);
local $/ = "\n";
while (1) {
    last if sysread(IF, my $peek, 2048) == 0;
    $/ = $1, last if $peek =~ /(\r\n|\r|\n)/;
}
sysseek(IF, 0, SEEK_SET);
while (<IF>) {
    ...
}
 

Donald King

Donald said:
Short, short version:

binmode(IF);
my $whole_file = do { local $/; <IF> };
my @lines = split /(?:\r\n|\r|\n)/, $whole_file;
foreach (@lines) {
    ...
}

Since regexps check alternatives from left to right, that splits as
correctly as possible. If something horrid has happened, like inserting
the contents of a Mac file into a Unix file, you'll get some funny
behavior near the seams, of course, but for a file that's all one line
ending, it works great, and it even handles the common case of files
that mix Unix LFs with Windows CRLFs.

However, if you want to cut back on memory consumption (important for
files bigger than a few hundred KB or so) and your files have consistent
line endings, you might probe the end-of-line by sysreading the first
2KB or so, sysseek back to the start of the file, then locally set $/ to
the exact line ending that you probed.

Something like this might work:

use Fcntl ':seek';
...
binmode(IF);
local $/ = "\n";
while (1) {
    last if sysread(IF, my $peek, 2048) == 0;
    $/ = $1, last if $peek =~ /(\r\n|\r|\n)/;
}
sysseek(IF, 0, SEEK_SET);
while (<IF>) {
    ...
}

Oh, and if your perl code is running or might run on any of a small
handful of screwy systems (IIRC, EBCDIC systems and pre-OSX Macs are the
main offenders), you might need to change \r => \x0D and \n => \x0A just
to be specific. (If you're on an EBCDIC system and handling an ASCII
file, though, you've got bigger problems.)
 

January Weiner

Hi,


(snip)
However, if you want to cut back on memory consumption (important for
files bigger than a few hundred KB or so) and your files have consistent
line endings, you might probe the end-of-line by sysreading the first
2KB or so, sysseek back to the start of the file, then locally set $/ to
the exact line ending that you probed.

Exactly. Sometimes I need to run my programs on files of gigabyte size.
Something like this might work:
use Fcntl ':seek';
...
binmode(IF);
local $/ = "\n";
while (1) {
    last if sysread(IF, my $peek, 2048) == 0;
    $/ = $1, last if $peek =~ /(\r\n|\r|\n)/;
}
sysseek(IF, 0, SEEK_SET);
while (<IF>) {
    ...
}

Thanks! This brings me much further. Actually, it would even be nice to
have a Perl module implementing the file(1) functionality... the above
subroutine plus a hacked magic file plus some clever string searching.

j.
