perl to python

  • Thread starter Olivier Scalbert

Josef Meile

There's definitely a sed available, possibly even in MinGW (I have it
on my system, but am not sure if it arrived with MinGW or something
else I installed). It's definitely available with Cygwin. One reason
to install it is that it's smaller than perl or python; another is
that it probably performs the task faster, since it isn't a general
purpose state machine;
Ok, if those two are true, then using it should be
considered for big files.
another is that it's 25% shorter to type than
perl and 50% shorter to type than python.
I don't think that shorter code is always the
most efficient. It is nicer, but you can't
guarantee that it is faster. For example, take a simple
sort algorithm implemented with two nested loops:
it is well known that you can use trees or other
strategies to achieve better results; however,
some of them are longer than the loop implementation.
 

Peter Hickman

Ville said:
It's funny, but somehow I can't really think of cases that a
specialized language would do better (ignoring the performance, which
is rarely a concern in sysadmin tasks) than Python with some
modules.

There is more to computer usage than sysadmin tasks, sed is an ideal
tool for processing large sets of large files (I have to handle small
files that are only 130 Mb in size, and I have around 140,000 of them).

Performance is not an issue you can ignore when you are handling large
amounts of data. Long may sed and awk live; you just have to make sure that
the O'Reillys are to hand because the syntax is a bugger.
 

Pete Forman

Jason Mobarak said:
> > "Olivier Scalbert said:
> > > What is the python way of doing this :
> > > perl -pi -e 's/string1/string2/' file
> >
> > I'm not sure what the -pi and -e switches do, but the rest is
> > fairly simple, although not as simple as the perl one-liner.
> > Just load the file into a string variable, and either use the
> > string .replace() method, or use a regex, depending on which is
> > appropriate. Then write it back out.
> > [...]
>
> More obfuscated:
>
> python -c '(lambda fp: fp.write(fp.seek(0) or
> "".join([L.replace("th","ht") for L in fp])))(file("foo","rw+"))'

For a less obfuscated approach, look at PyOne to run short python
scripts from a one-line command.

http://www.unixuser.org/~euske/pyone/
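
For reference: in the perl one-liner above, -e supplies the expression,
-p wraps it in a read-print loop over every input line, and -i edits the
file in place. A plain, unobfuscated Python sketch of the load, replace,
write-back approach described above might look like this (the filename
and patterns are placeholders, and it assumes the file fits in memory):

import re

f = open('file')
text = f.read()
f.close()

text = re.sub('string1', 'string2', text)  # or text.replace() for fixed strings

f = open('file', 'w')
f.write(text)
f.close()

It trades the one-liner's brevity for something you can still read a
week later.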
 

Kirk Job-Sluder

The things you usually do with the non-python tools are trivial, and
trivial things have the habit of being, well, trivial in Python too.

I've not found this to be the case due to Python's emphasis on being
explicit rather than implicit. My emulation of
"perl -pi -e" was about 24 lines in length. Even with the improvement
there is still 10 times as many statements where things can go wrong.

It is really hard to be more trivial than a complete program in one
command line.
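
For comparison, the standard library's fileinput module gets fairly
close to "perl -pi" behaviour in a few lines. The sketch below is not
the 24-line emulation mentioned above, just one possible shortened
shape; the filename and patterns are made up:

import fileinput, re

# With inplace=1, whatever is printed replaces the corresponding input
# line, and the original file is kept as somefile.bak.
for line in fileinput.input('somefile', inplace=1, backup='.bak'):
    print re.sub('string1', 'string2', line),
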
You can always implement modules to do the tasks you normally use sed
or awk for. I never saw much virtue in using the most specialized (or
crippled, if you wish) tool possible. Not even if it's "optimized" for
the thing. Actually, I tend to think that Python has to some extent
deprecated that part of the Unix tradition.

However, that raises its own host of problems such as how do you import
the needed modules on the command line? What do you do when that module is not
available? What do you do when you need additional functionality that
takes one line in awk but a major rewrite in python?

It's a matter of task efficiency. Why should I spend a half hour doing
in python something that takes 1 minute if you know the right sed, awk
or perl one-liner? There is a level of complexity where you are better
off using python. But why not use a one-liner when it is available?
And yes, I'm aware that I'm exposing myself to some serious flammage
from the "if it was good enough for my grandad, it's good enough for me"
*nix crowd. Emotional attachment to various cute little tools is
understandable, but sometimes it's good to take a fresh perspective
and just let go.

Write me a two-line script in python that reads a character delimited
file, and printf pretty-prints all of the records in a different order.

Sometimes, a utility that uses an implicit loop over every line of a
file is useful. That's not emotional attachment, it's plain common
sense.
 

Carl Banks

Kirk said:
Write me a two-line script in python that reads a character delimited
file, and printf pretty-prints all of the records in a different order.

How about one line (broken into three for clarity):

for line in __import__('sys').stdin:
    print ''.join([ x.rjust(10) for x in map(
        line.strip().split(',').__getitem__, [4,3,2,1,0]) ])

Believe it or not, I actually do stuff like this on the command line
once in awhile; to me, it's less effort to type this in than to
remember (read: look up) the details of awk syntax. I don't think I'm
typical in this regard, though.
 

Kirk Job-Sluder

Kirk said:
Write me a two-line script in python that reads a character delimited
file, and printf pretty-prints all of the records in a different order.

How about one line (broken into three for clarity):

for line in __import__('sys').stdin:
    print ''.join([ x.rjust(10) for x in map(
        line.strip().split(',').__getitem__, [4,3,2,1,0]) ])

Believe it or not, I actually do stuff like this on the command line
once in awhile; to me, it's less effort to type this in than to
remember (read: look up) the details of awk syntax. I don't think I'm
typical in this regard, though.

This looks like using the proverbial hammer to drive the screw.

I still find:
awk 'BEGIN {FS="\t"} {printf("pattern", $1,$4,$3,$2)}' file

to be more elegant and easier to debug. It does the required task in
two easy-to-remember statements.
 

Ville Vainio

Kirk> I've not found this to be the case due to Python's emphasis
Kirk> on being explicit rather than implicit. My emulation of
Kirk> "perl -pi -e" was about 24 lines in length. Even with the
Kirk> improvement there is still 10 times as many statements where
Kirk> things can go wrong.

That's when you create a module which does the implicit looping. Or a
python script that evals the passed expression string in the loop.
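
A minimal sketch of such an eval-in-the-loop script, for illustration
only (the name pyline.py and its interface are invented here, not an
existing tool):

import sys

# usage: python pyline.py 'expression involving L' < input
# The expression is evaluated once per line, with the line bound to L.
expr = sys.argv[1]
for L in sys.stdin:
    result = eval(expr)
    if result is not None:
        print result

For example, python pyline.py "L.strip().replace('foo', 'bar')" < data.txt
behaves roughly like sed 's/foo/bar/g' data.txt.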

Kirk> It is really hard to be more trivial than a complete program in one
Kirk> command line.

As has been stated elsewhere, you can do the trick on the command
line. The effort to create the required tools only needs to be paid
once.

However, many times it won't matter whether the whole program fits on
the command line. I always put the script into a file and then execute
it. I just prefer a real editor to command-history editing if
something goes wrong.

Kirk> It's a matter of task efficiency. Why should I spend a half
Kirk> hour doing in python something that takes 1 minute if you
Kirk> know the right sed, awk or perl one-liner? There is a level
Kirk> of complexity where you are better off using python. But
Kirk> why not use a one-liner when it is available?

I think one should just analyze the need, implement the requisite
module(s) and the script to invoke the stuff in modules. The needs
have the habit of repeating themselves, and having a bit more
structure in the solution will pay off.

Kirk> Write me a two-line script in python that reads a character
Kirk> delimited file, and printf pretty-prints all of the records
Kirk> in a different order.

(Already done)

Kirk> Sometimes, a utility that uses an implicit loop over every line of a
Kirk> file is useful. That's not emotional attachment, it's plain common
Kirk> sense.

The virtue of the implicitness is still arguable.
 

Ville Vainio

Pete> For a less obfuscated approach, look at PyOne to run short python
Pete> scripts from a one-line command.

Pete> http://www.unixuser.org/~euske/pyone/

Looks exactly like something I always wanted to implement, but found
that doing the script in a multi-line file is easier. It's great that
someone has got around to implementing something like this.

There should be a wiki entry for "quick and dirty python" (sounds
somehow... suspicious ;-), with awk/sed/one-liner workalikes.
 

Kirk Job-Sluder

Kirk> I've not found this to be the case due to Python's emphasis
Kirk> on being explicit rather than implicit. My emulation of
Kirk> "perl -pi -e" was about 24 lines in length. Even with the
Kirk> improvement there is still 10 times as many statements where
Kirk> things can go wrong.

That's when you create a module which does the implicit looping. Or a
python script that evals the passed expression string in the loop.

Except now you've just eliminated portability, one of the main arguments
for using python in the first place.

And here is the fundamental question. Why should I spend my time
writing a module in python to emulate another tool, when I can simply
use that other tool? Why should I, as a researcher who must process
large quantities of data, spend my time and my employer's money
reinventing the wheel?

Kirk> It is really hard to be more trivial than a complete program in one
Kirk> command line.

As has been stated elsewhere, you can do the trick on the command
line. The effort to create the required tools only needs to be paid
once.

One can do the trick on one command line in python. However, that
command line is an ugly, inelegant hack that eliminates the most
important advantage of python: clear, easy-to-understand code. In
addition, that example still required 8 python statements compared to
two in awk.

However, many times it won't matter whether the whole program fits on
the command line. I always put the script into a file and then execute
it. I just prefer a real editor to command-history editing if
something goes wrong.

Which is what I do as well. The question is, why should I write 8
python statements to perform a task that I can do in two using awk or
sed? Why should I spend 30 minutes writing, testing, and debugging a
python script that takes 5 minutes to write in awk or sed, taking
advantage of the implicit loops and record splitting?

I think one should just analyze the need, implement the requisite
module(s) and the script to invoke the stuff in modules. The needs
have the habit of repeating themselves, and having a bit more
structure in the solution will pay off.

I think you are missing a key step. You are starting off with a
solution (python scripts and modules) and letting it drive your
needs analysis. I don't get paid enough money to write pythonic
solutions to problems that have already been fixed using other tools.

The virtue of the implicitness is still arguable.

I'll be more specific about the challenge. Using only stock python with
no added modules, give me a script that pretty-prints a
character-delimited file using one variable assignment, and one function.

Here is the solution in awk:
BEGIN { FS="\t" }
{printf("%s %s %s %s", $4, $3, $2, $1)}
 

Roy Smith

Kirk Job-Sluder said:
And here is the fundamental question. Why should I spend my time
writing a module in python to emulate another tool, when I can simply
use that other tool? Why should I, as a researcher who must process
large quantities of data, spend my time and my employer's money
reinventing the wheel?

At the risk of veering this thread in yet another different direction,
anybody who does analysis of large amounts of data should take a look at
Gary Perlman's excellent, free, and generally under-appreciated |STAT
package.

http://www.acm.org/~perlman/stat/

It's been around in one version or another for something like 20 years.
It fills an interesting little niche that's part data manipulation and
part statistics.
Here is the solution in awk:
BEGIN { FS="\t" }
{printf("%s %s %s %s", $4, $3, $2, $1)}

In |STAT, that would be simply "colex 4 3 2 1".

There's nothing you can do in |STAT that you couldn't do with more
general purpose tools like awk, perl, python, etc, but |STAT often has a
quicker, simpler, easier way to do many common statistical tasks. A
good tool to have in your toolbox.

For example, one of the cool tools is "validata". You feed it a file
and it applies some heuristics trying to guess which data in it might be
invalid. For example, if a file looks like it's columns of numbers, and
the third column is all integers except for one entry which is a
floating point number, it'll guess that might be an error and flag it.
It's great when you're analyzing 5000 log files of 100,000 lines each
and one of them makes your script crash for no apparent reason.
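
The heuristic is simple enough to sketch in Python. The toy version
below is only an illustration of the idea, not Perlman's validata; the
whitespace-delimited input and the int/float/text classification are
assumptions made for the example:

import sys

def kind(value):
    # Classify a field as int, float or text.
    for convert, name in ((int, 'int'), (float, 'float')):
        try:
            convert(value)
            return name
        except ValueError:
            pass
    return 'text'

rows = [line.split() for line in sys.stdin if line.strip()]
width = max([len(row) for row in rows] + [0])
for col in range(width):
    kinds = [(i, kind(row[col])) for i, row in enumerate(rows) if len(row) > col]
    names = [k for i, k in kinds]
    majority = max(set(names), key=names.count)
    for i, k in kinds:
        if k != majority:
            print "line %d, column %d looks like %s but the column is mostly %s" \
                % (i + 1, col + 1, k, majority)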
 

Ralf Muschall

Kirk Job-Sluder said:
I still find:
awk 'BEGIN {FS="\t"} {printf("pattern", $1,$4,$3,$2)}' file
to be more elegant and easier to debug. It does the required task in
two easy-to remember statements.

It misses the "-i" thing. You have to wrap it:

NEWNAME=$(mktemp foo.XXXXXX)
mv file $NEWNAME
your_awk $NEWNAME > file
rm $NEWNAME           # unless there is a backup argument after the "-i"
mv $NEWNAME file.bak  # in which case, keep the backup instead

Ralf
 

Duncan Booth

I'll be more specific about the challenge. Using only stock python with
no added modules, give me a script that pretty-prints a
character-delimited file using one variable assignment, and one function.

Here is the solution in awk:
BEGIN { FS="\t" }
{printf("%s %s %s %s", $4, $3, $2, $1)}

One assignment statement and one function call is easy. Of course, you
could argue that more than one name gets rebound, but then that is also
true of the awk program:

import sys
for line in sys.stdin:
    line = line[:-1].split('\t')
    print "%s %s %s %s" % (line[3], line[2], line[1], line[0])

While I agree with you that using the appropriate tool is preferred over
using Python for everything, I don't really see much to choose between the
Python and awk versions here.
 

Ville Vainio

Kirk> And here is the fundamental question. Why should I spend my
Kirk> time writing a module in python to emulate another tool,
Kirk> when I can simply use that other tool? Why should I, as a

Perhaps you won't; but someone who isn't already proficient with the
tool may rest assured that learning the tool really isn't worth his
time. awk and sed fall into this category.

Kirk> researcher who must process large quantities of data, spend
Kirk> my time and my employer's money reinventing the wheel?

You are not reinventing the wheel, you are refactoring it :). I don't
think your employer minds you spending 15 extra minutes creating some
tool infrastructure, if it allows you to drop awk/sed dependency that
your co-workers then won't need to learn.

Kirk> I think you are missing a key step. You are starting off
Kirk> with a solution (python scripts and modules) and letting it
Kirk> drive your needs analysis. I don't get paid enough money to
Kirk> write pythonic solutions to problems that have already been
Kirk> fixed using other tools.

I find writing pythonic tools a relaxing diversion from my everyday
work (cranking out C++), so I don't really mind. As long as the time
spent is within the 5 minute to 1 hour range.

Kirk> I'll be more specific about the challenge. Using only stock
Kirk> python with no added modules, give me a script that
Kirk> pretty-prints a character-delimited file using one variable
Kirk> assignment, and one function.

Kirk> Here is the solution in awk:
Kirk> BEGIN { FS="\t" }
Kirk> {printf("%s %s %s %s", $4, $3, $2, $1)}


for line in open("file.txt"):
    fields = line.strip().split("\t")
    print "%s %s %s %s" % (fields[3], fields[2], fields[1], fields[0])

(untested code warning)

Time taken: 56 seconds, give or take. Roughly the same as I expect
writing your awk example took, and within the range I expect your
employer would afford ;-).

Technically it does two variable assignments, but I don't see the
problem (ditto with function calls - who cares?) Assignment is
conceptually cheap. It doesn't seem any less readable or elegant than
your awk example. I could have maybe lost a few seconds by using
shorter variable names.
 

Aahz

And here is the fundamental question. Why should I spend my time
writing a module in python to emulate another tool, when I can simply
use that other tool? Why should I, as a researcher who must process
large quantities of data, spend my time and my employer's money
reinventing the wheel?

Why should your employer pay for the time for all of its employees to
learn all of those other tools, when Python will do the job? I've used
sed and awk often enough to read other people's code some of the time,
but I certainly can't write them without a great deal of effort, and
modifying an existing example to do what I want might or might not be
easy -- no way of knowing in advance.
 

Daniel 'Dang' Griffith

Daniel 'Dang' Griffith said:
[on sed] One reason
to install it is that it's smaller than perl or python; another is
that it probably performs the task faster, since it isn't a general
purpose state machine;

FWIW, sed _is_ a state machine, although not really "general
purpose". It is a programming language with variables, loops
and conditionals, and I believe it is turing-complete. Most
of the time it is abused to perform simple search-and-replace
tasks, though. ;-)

I never used sed for anything but "stream editing", aka search and
replace. Well, if it's turing complete, my apologies to the sed
author(s). :)
--dang
 

David M. Cooke

At some point said:
Daniel 'Dang' Griffith said:
[on sed] One reason
to install it is that it's smaller than perl or python; another is
that it probably performs the task faster, since it isn't a general
purpose state machine;

FWIW, sed _is_ a state machine, although not really "general
purpose". It is a programming language with variables, loops
and conditionals, and I believe it is turing-complete. Most
of the time it is abused to perform simple search-and-replace
tasks, though. ;-)

I never used sed for anything but "stream editing", aka search and
replace. Well, if it's turing complete, my apologies to the sed
author(s). :)
--dang

There's a whole bunch of 'extreme' sed scripts at
http://sed.sourceforge.net/grabbag/scripts/

I like the dc.sed script there; it's an implementation of the UNIX
program 'dc', which is an arbitrary precision RPN calculator:
http://sed.sourceforge.net/grabbag/scripts/dc_overview.htm
Only for the truly brave.

A Turing machine, too:
http://sed.sourceforge.net/grabbag/scripts/turing.sed

And I notice they have a Python sed debugger:
http://sed.sourceforge.net/grabbag/scripts/sd.py.txt
 

Kirk Job-Sluder

At the risk of veering this thread in yet another different direction,
anybody who does analysis of large amounts of data should take a look at
Gary Perlman's excellent, free, and generally under-appreciated |STAT
package.

http://www.acm.org/~perlman/stat/

It's been around in one version or another for something like 20 years.
It fills an interesting little niche that's part data manipulation and
part statistics.


Thanks. I'll check it out.
 

Kirk Job-Sluder

Kirk> And here is the fundamental question. Why should I spend my
Kirk> time writing a module in python to emulate another tool,
Kirk> when I can simply use that other tool? Why should I, as a

Perhaps you won't; but someone who isn't already proficient with the
tool may rest assured that learning the tool really isn't worth his
time. awk and sed fall into this category.

Actually, I'm not convinced of the learning time argument. It takes
about 30 minutes training time to learn enough awk or sed to handle
90% of the cases where it is the better tool for the job. A good
understanding of regular expressions will do most of your work for you no
matter which language you use.

Kirk> researcher who must process large quantities of data, spend
Kirk> my time and my employer's money reinventing the wheel?

You are not reinventing the wheel, you are refactoring it :). I don't
think your employer minds you spending 15 extra minutes creating some
tool infrastructure, if it allows you to drop awk/sed dependency that
your co-workers then won't need to learn.

In which case, the perl version is more likely to win out on the
basis of standardization. IME, the time involved to create good
infrastructure that does not come back to bite you in the ass is
considerably more than 15 minutes. Think also that for every minute you
spend designing something to share, you need to spend between 5 and 20
documenting and training in the organization (and this is not including
maintenance and distribution).

The great thing is that the tool infrastructure already exists. Not
only does the tool infrastructure exist, but the training materials
already exist. Really, how hard is "perl -pi -e 's/foo/bar/'" to learn?

I find writing pythonic tools a relaxing diversion from my everyday
work (cranking out C++), so I don't really mind. As long as the time
spent is within the 5 minute to 1 hour range.

Well, there is another big difference. I'm a big fan of instant
gratification so the off-the-shelf tool that does the job in 10 seconds
is better than 5 minutes to 1 hour writing a pythonic tool. I have
re-written shell scripts in python just for kicks, but I don't have any
illusions that refactoring everything into python should be a
prerogative.
 

Aahz

Well, there is another big difference. I'm a big fan of instant
gratification so the off-the-shelf tool that does the job in 10 seconds
is better than 5 minutes to 1 hour writing a pythonic tool. I have
re-written shell scripts in python just for kicks, but I don't have any
illusions that refactoring everything into python should be a
prerogative.

If it takes you an hour to rewrite a ten-second job into a Pythonic
script, you don't know Python very well. That kinda counters your claim
of a shallow learning curve for the other programs.
 

Scott Schwartz

Duncan Booth said:
import sys
for line in sys.stdin:
    line = line[:-1].split('\t')
    print "%s %s %s %s" % (line[3], line[2], line[1], line[0])
While I agree with you that using the appropriate tool is preferred over
using Python for everything, I don't really see much to choose between the
Python and awk versions here.

1) Python throws an error if a line has fewer than four fields,
requiring more typing to get the same effect.

2) Python generators on stdin behave strangely. For one thing,
they're not properly line buffered, so you don't get any lines until
eof. But then, eof is handled wrongly, and the loop doesn't exit.

3) There is no efficient RS equivalent, in case you need to read
paragraphs.

The simpler example

import sys

for line in sys.stdin:
    print line

demonstrates the problem nicely.

$ python z
a
b
c
^D
a

b

c

foo
bar
baz
^D
foo

bar

baz
^D
^D
$

Explanations in the docs about buffering and readahead don't excuse
this poor result.

$ awk '{print}'
a
a
b
b
c
c
^D
$
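
For what it's worth, a common workaround for the buffering behaviour
shown above (under Python 2, where iterating over a file goes through a
hidden read-ahead buffer) is to drive readline() directly; a minimal
sketch:

import sys

# readline() is line buffered and returns '' at end of file, so this
# echoes each line as soon as it arrives, much like awk '{print}'.
for line in iter(sys.stdin.readline, ''):
    print line,

Whether that is simpler than awk's implicit loop is, of course, exactly
the question being argued in this thread.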
 
