Marshal Pipe

  • Thread starter Carlos J. Hernandez

Carlos J. Hernandez

I've just re-discovered pipes.
Using Linux bash... stuff like `grep zip.89433 addresses.csv | sort | head`
Bash pipes work very well for many problems, such as mass downloads and
data filtering.
But they're simplest to implement on line-by-line text data.
This is not a true limitation of pipe architectures.

You can implement data pipes with Marshal.
Within your class, you can define a puts method for the source's
$stdout:

def self.puts(data)
  data = Marshal.dump( data )
  # tell the sink how many bytes to read
  $stdout.print [data.length].pack('l')
  # then print out the data itself
  $stdout.print data
end

and then the sink reads from $stdin:

while header = $stdin.read(4) do
  length = header.unpack('l').shift  # bytes to read
  data = $stdin.read( length )       # marshaled dump from stdin
  data = Marshal.load( data )        # restored data structure
  # what you do here.........
end
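The two halves above can be exercised together over an in-process IO.pipe; this is a sketch for testing the framing only (the original writes to $stdout and reads from $stdin, and the helper names write_frame/read_frame are mine):

```ruby
# Length-prefixed framing, as in the source/sink above, exercised over
# an in-process pipe instead of $stdout/$stdin.
r, w = IO.pipe
r.binmode
w.binmode

def write_frame(io, obj)
  data = Marshal.dump(obj)
  io.print [data.length].pack('l')  # 4-byte native-endian length header
  io.print data
end

def read_frame(io)
  header = io.read(4) or return nil  # nil at end of stream
  Marshal.load(io.read(header.unpack('l').first))
end

write_frame(w, { 'zip' => 89433 })
write_frame(w, [1, 2.5, 'three'])
w.close

frames = []
while (obj = read_frame(r))
  frames << obj
end
frames  # [{"zip"=>89433}, [1, 2.5, "three"]]
```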

I don't think this is implemented in a standard way anywhere in Ruby (or
any other language), but it looks to me like a really, really good idea.

-Carlos
 

Eric Hodel

> I've just re-discovered pipes.
> Using Linux bash... stuff like `grep zip.89433 addresses.csv | sort | head`
> Bash pipes work very well for many problems, such as mass downloads and
> data filtering.
> But they're simplest to implement on line-by-line text data.
> This is not a true limitation of pipe architectures.
>
> You can implement data pipes with Marshal.
> Within your class, you can define a puts method for the source's $stdout:
> [...]
>
> and then the sink reads from $stdin:
>
> [...]
>
> I don't think this is implemented in a standard way anywhere in Ruby (or
> any other language), but it looks to me like a really, really good idea.

You've written the core of DRb, which is these data pipes expanded to
a multi-process, multi-machine distributed programming tool.
 

Carlos J. Hernandez

Eric, thanks for your comment.
I'll look again, but I don't think I saw in DRb the simplicity achieved
by bash as in:

cat source.txt | filter | sort > result.txt

I'm saying cat, filter, and sort could be ruby programs piping Marshal
data structures.
-Carlos
 

Robert Klemme

2008/1/8 said:
> Eric, thanks for your comment.
> I'll look again, but I don't think I saw in DRb the simplicity achieved
> by bash as in:
>
> cat source.txt | filter | sort > result.txt

That line makes you eligible for a "useless cat award".

> I'm saying cat, filter, and sort could be ruby programs piping Marshal
> data structures.

Your solution is still too complicated: you do not need the byte
transfer - in fact, it may be disadvantageous because you need the
full marshaled representation in memory before you can send it. This
is not very nice for streaming processing. Instead, simply directly
marshal data into the pipe:

$ ruby -e '10.times {|i| Marshal.dump(i, $stdout) }' | \
    ruby -e 'until $stdin.eof?; p Marshal.load($stdin) end'
0
1
2
3
4
5
6
7
8
9

The question is: how often do you actually need the processing power
of two processes? On a single core machine the code is probably as
efficient with a single Ruby process (probably using multiple threads),
and you do not need the piping complexity and marshaling overhead.
For tasks that involve IO, Ruby threads work pretty well. So, I'd be
interested to hear: what is the use case for your solution?
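The in-process alternative Robert alludes to can be sketched with a Queue feeding a worker thread; the doubling step is a made-up stand-in for whatever a pipe stage would do:

```ruby
# Queue and Thread are built in; no marshaling is needed because the
# objects stay inside one process.
queue = Queue.new
results = []

# Producer: plays the upstream pipe stage.
producer = Thread.new do
  10.times { |i| queue << i }
  queue << :eof  # sentinel instead of closing a pipe
end

# Consumer: plays the downstream stage.
consumer = Thread.new do
  while (item = queue.pop) != :eof
    results << item * 2  # stand-in "transform"
  end
end

[producer, consumer].each(&:join)
results  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```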

Kind regards

robert
 

Carlos J. Hernandez

Robert:
Thanks for your performance improvement suggestion.
I did not think of giving Marshal $stdout.
But the problem remains that I don't know ahead of time how many bytes
the Marshal data will have, and
I can no longer use "\n", the input line separator, as a record
separator.

As for general usefulness:
if you already have general-purpose cat, filter, transform, and sort
programs...
and just want to see the results of manipulating the contents of some
source file....
then just say

cat source.txt | transform | filter | sort > result.txt

I do this kind of stuff all the time; I just have not programmed that way
before.
I just started because the model is useful in my data downloads, where
I download history CSVs from Finance.Yahoo.com and, along the way to
appending to my data files,
I transform the data.
There is an impedance problem, though,
in having to flatten a data structure that contains floats,
integers, and dates
back to a CSV line every time you go through the pipe, and then restore
it back in the receiver.
Marshal solves this, except that "\n" can no longer be used as a record
separator.
Marshal is more efficient; that's why someone wrote it.

Lastly, computers will be multi-processing from here on...
Faster chips are finding their physical limits.

BTW, I have an implementation of Marshal Pipes, just as I described in
my opening email.
It works great.

-Carlos
 

Robert Klemme

2008/1/8 said:
> Robert:
> Thanks for your performance improvement suggestion.
> I did not think of giving Marshal $stdout.
> But the problem remains that I don't know ahead of time how many bytes
> the Marshal data will have

No, this is not a problem, because Marshal.load will take care of this
(as you can see from the command line example I posted).

> and I can no longer use "\n", the input line separator, as a record
> separator.

Not needed, as said before.

> As for general usefulness:
> if you already have general-purpose cat, filter, transform, and sort
> programs...
> and just want to see the results of manipulating the contents of some
> source file....
> then just say
>
> cat source.txt | transform | filter | sort > result.txt

... and get another "useless cat award". :)

> I do this kind of stuff all the time; I just have not programmed that way
> before.
> I just started because the model is useful in my data downloads, where
> I download history CSVs from Finance.Yahoo.com and, along the way to
> appending to my data files,
> I transform the data.
> There is an impedance problem, though,
> in having to flatten a data structure that contains floats,
> integers, and dates
> back to a CSV line every time you go through the pipe, and then restore
> it back in the receiver.
> Marshal solves this, except that "\n" can no longer be used as a record
> separator.

Marshal basically just hides the conversion and makes it faster. The
conversion is still there: you have a data structure (say, an array),
transform it into a sequence of bytes (either CSV or Marshal format),
send it through a pipe, transform the byte sequence back (either from CSV
or Marshal format) and get out the array again. That's why I say it's
more efficient not to use two processes but to do it in one Ruby process
most of the time (i.e. on a single core machine or with IO bound stuff).

> Marshal is more efficient; that's why someone wrote it.

Not only that. Marshal serves a slightly different purpose, namely
converting object graphs which can contain loops into a byte stream
and resurrecting that graph from the byte stream.

> Lastly, computers will be multi-processing from here on...
> Faster chips are finding their physical limits.

But OTOH Ruby will rather sooner than later use native threads, and a
multithreaded application is easier and in this particular case also
more efficient (unless you use tons of memory per processing step),
because you do not need the conversion for IPC. Do you actually
/need/ that processing power?

> BTW, I have an implementation of Marshal Pipes, just as I described in
> my opening email.
> It works great.

That's nice for you. But you proposed a general solution in your
original posting. At least that's what I picked up from your last
statements. With this (public!) discussion we are trying to find out
whether it *is* actually a good idea for the general audience. So far
I haven't been convinced that it is.

Kind regards

robert
 

Carlos Hernandez

Robert:

> ruby -e '10.times {|i| Marshal.dump(i, $stdout) }' | \
>     ruby -e 'until $stdin.eof?; p Marshal.load($stdin) end'

THANKS!!!
Did not recognize it at first read, because it's a bit cryptic.
-Carlos
 

ara howard

> I'll look again, but I don't think I saw in DRb the simplicity achieved
> by bash as in:
>
> cat source.txt | filter | sort > result.txt
>
> I'm saying cat, filter, and sort could be ruby programs piping Marshal
> data structures.

check out ruby queue (rq) - it uses that paradigm but, instead of
marshal'd data, it uses yaml which accomplishes the same goal without
giving up human readability. for instance one might do (simplified)

rq q query tag==foobar
---
jid: 1
tag: foobar
command: processing_stage_a input

so query is dumping a job object, as yaml. then you do

!! | rq q update priority=42 -

which is to say use the output of the last command, a ruby object, and
input that into the next command, which takes a job, or jobs, on stdin
when '-' is given, and update that job in the queue

you can also do things like

rq q query priority=42 tag=foobar | rq q resubmit -

etc.

the pattern is a good one - but i wouldn't touch marshal data over
yaml for the commandline with a ten foot pole: one slip and you'll
blast out chars that will hose the display or disconnect your ssh
session. also, yaml provides natural document separators so you can
embed more than one set in a stream separated by --- which allows for
chunking of huge output streams
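ara's separator point can be seen with plain stdlib YAML (a generic sketch, not rq's actual record format): each dump opens a new `---` document, and YAML.load_stream splits them back apart.

```ruby
require 'yaml'

# Writer side: dump several documents into one stream. Each dump begins
# with its own "---" separator, so the stream stays chunkable and
# human-readable.
jobs = [
  { 'jid' => 1, 'tag' => 'foobar', 'command' => 'processing_stage_a input' },
  { 'jid' => 2, 'tag' => 'foobar', 'command' => 'processing_stage_b input' },
]
stream = jobs.map { |job| YAML.dump(job) }.join

# Reader side: load_stream recovers one object per "---" document.
loaded = YAML.load_stream(stream)
loaded == jobs  # true
```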

food for thought.

kind regards.



a @ http://codeforpeople.com/
 

Carlos J. Hernandez

Ara:

Yaml is fine over an internet connection, where transmission time is high
compared to cpu time and
where human readability is a plus.
For my case, separate programs/processes on the same machine working
very closely,
as if a single program in a pipe architecture... Marshal is better.
In fact, if Marshal is a bit of a hybrid (I don't know the details), then
what I really want is pure binary, I think.

Anyways, for a bit more detail on my implementation,
taking out the specifics of my application and including Robert's
comments,
I now have:

class MarshalPipe
  def self.puts(data)
    Marshal.dump( data, $stdout )
  end

  def _pipe
    data = nil
    while data = Marshal.load($stdin) do
      pipe(data)
      break if $stdin.eof?
    end
  end
end

I don't know why this did not work:

until $stdin.eof do
  data = Marshal.load($stdin)
  pipe( data )
end
 

thefed

> You've written the core of DRb, which is these data pipes expanded
> to a multi-process, multi-machine distributed programming tool.

I'm really looking to get into DRb, but its DSL and stuff is a
little.... daunting... Is there a slightly toned-down wrapper for it,
or an alternative?
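For what it's worth, the core DRb API is just two calls, with no DSL involved; here is a minimal sketch (the `transform` front object is invented for illustration, and client and server are squeezed into one process):

```ruby
require 'drb/drb'

# Server side: expose a plain Ruby object. Any public method becomes
# remotely callable.
front = Object.new
def front.transform(x)
  x * 2
end
DRb.start_service('druby://localhost:0', front)  # port 0 = pick a free port

# Client side (normally a separate process): proxy the remote object by URI.
remote = DRbObject.new_with_uri(DRb.uri)
remote.transform(21)  # 42
```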
 

Robert Klemme

> Ara:
>
> Yaml is fine over an internet connection, where transmission time is high
> compared to cpu time and
> where human readability is a plus.
> For my case, separate programs/processes on the same machine working
> very closely,
> as if a single program in a pipe architecture... Marshal is better.
> In fact, if Marshal is a bit of a hybrid (I don't know the details), then
> what I really want is pure binary, I think.
>
> Anyways, for a bit more detail on my implementation,
> taking out the specifics of my application and including Robert's
> comments, I now have:
>
> class MarshalPipe
>   def self.puts(data)
>     Marshal.dump( data, $stdout )
>   end
>
>   def _pipe
>     data = nil
>     while data = Marshal.load($stdin) do
>       pipe(data)

What does #pipe do? Why don't you use a block for the processing of the
data? For a general (aka library) solution it would also be much better
to pass the IO as an argument, in case there are more pipes to work with.

>       break if $stdin.eof?
>     end
>   end
> end
>
> I don't know why this did not work:
>
> until $stdin.eof do
>   data = Marshal.load($stdin)
>   pipe( data )
> end

Probably because this is not the same as my code (hint: punctuation
matters).

Btw, I am still interested to learn the use case where your solution is
significantly better than an in-process solution with Threads and a Queue...

Regards

robert
 

Carlos J. Hernandez

On Wed, 9 Jan 2008 07:00:04 +0900, "Robert Klemme" said:
> What does #pipe do? Why don't you use a block for the processing of the
> data? For a general (aka library) solution it would also be much better
> to pass the IO as an argument, in case there are more pipes to work
> with.

Yep! Like a yield statement, you mean. I agree.

As for multiple pipe sources and your question of general usefulness...
(I read somewhere that lack of multiple IO is a known issue in UNIX pipes.)
I'm just thinking bash, shell scripting.
I don't mean to ignite a language war;
I just think Bash, Ruby, and C make a terrific team.
Also, setting up the pipes seems to be best done from the outside,
which makes this best fitted for shell scripting.

Anyways, the missing "?" was a typo.
The following, which as I read it should work:

until $stdin.eof? do
  data = Marshal.load($stdin)  # <= Error here
  pipe( data )
end

still gives the following error:

buffer already filled with text-mode content

$stdin.eof? is necessary though, as
a different error is triggered if Marshal tries to load on an EOF.
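That error message is raised when an IO's buffer already holds text-mode content by the time something switches the stream to binary mode (Marshal appears to do this internally); calling binmode up front, before any reads, avoids it. A round-trip sketch over IO.pipe rather than $stdin/$stdout:

```ruby
r, w = IO.pipe
r.binmode  # switch to binary mode before anything reads from the stream
w.binmode

# Source: dump several objects straight into the pipe, no length header.
3.times { |i| Marshal.dump([i, i.to_s], w) }
w.close

# Sink: Marshal.load finds each record's own boundary.
restored = []
restored << Marshal.load(r) until r.eof?
restored  # [[0, "0"], [1, "1"], [2, "2"]]
```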

-Carlos
 

Carlos Hernandez

class MarshalPipe
  def self.puts(data)
    Marshal.dump(data, $stdout)
  end

  def self.each
    data = nil
    begin
      while data = Marshal.load($stdin) do
        yield data
      end
    rescue EOFError
      # rudely ignore
    end
  end
end


I guess a class/module like the one above is as clean and simple as I can
get it. A mp2mp-type pipe would be..


require 'marshal_pipe'
MarshalPipe.each { |data|
  transformed_data = transform( data ) # <= do something
  MarshalPipe.puts transformed_data
}


A quick csv2mp could be...


require 'MarshalPipe.rb'
require 'csv'
$stdin.each { |line|
  data = []
  CSV.parse_line( line.strip ).each { |item|
    case item
    when /^-?\d+(\.\d+)?$/
      data.push( ($1) ? item.to_f : item.to_i )
      # maybe add date handling or any other data type...
    else
      # simple string
      data.push( item )
    end
  }
  MarshalPipe.puts data
}


and a mp2txt


require 'MarshalPipe.rb'
MarshalPipe.each { |data|
  puts data.join("\t")
}


A hastily written csv2mp needs cat (not knowing how to read files)...

cat source.csv | csv2mp | mp2mp | mp2txt > result.txt

But one could argue for making MarshalPipe a template for making pipes in
general.
That'd be more like what I'm actually using, except
without the much nicer MarshalPipe.puts and MarshalPipe.each.
 

Joel VanderWerf

Carlos Hernandez wrote:
> ...
> A quick csv2mp could be...
>
> require 'MarshalPipe.rb'
> require 'csv'
> $stdin.each { |line| ...
>
> A hastily written csv2mp needs cat (not knowing how to read files)...
>
> cat source.csv | csv2mp | mp2mp | mp2txt > result.txt

Use ARGF instead of $stdin, and you read files for free.
 

Carlos J. Hernandez

On Thu, 10 Jan 2008 06:16:35 +0900, "Joel VanderWerf" said:
> Use ARGF instead of $stdin, and you read files for free. ...
> vjoel : Joel VanderWerf : path berkeley edu : 510 665 3407


Cool!!! Thanks! So

csv2mp < source.txt | ....

the filehandle is in ARGF.
I guess I don't need to explain the use of pipes to a Berkeley man, home
of Berkeley Unix.
 

Joel VanderWerf

Carlos said:
> csv2mp < source.txt | ....
>
> the filehandle is in ARGF.

Or just this:

csv2mp source.txt | ....

For example:

$ cat test.txt
This is
a test
$ ruby -e 'puts ARGF.read' test.txt
This is
a test
 
