file.seek and unused bytes


Robert Klemme

2009/7/6 Greg Willits said:
We're working with large data sets for aggregation which takes a long
time to run, and second only to the ease and clarity of the top level
DSL, is the speed of the aggregation process itself so we can afford to
do more analysis.

Did you actually measure significant differences in time or are you
just assuming there is a significant impact because you write less and
have to do less processing?
 

Brian Candler

Robert said:
Errr, I am not sure I fully understand your approach. What you write
sounds like if you end up with a file containing multiple sections
which each have lines with identical length.

My reading of the description was that each "chunk" was a separate file,
and so all the records within that file had the same line length, but I
could be wrong.
 

Robert Klemme

2009/7/6 Brian Candler said:
My reading of the description was that each "chunk" was a separate file,
and so all the records within that file had the same line length, but I
could be wrong.

But he writes "the file" which makes me think that there is only one
output file. Also, the 4GB limit seems to indicate that we are
talking about a single large file.

Greg, help!

;-)

Cheers

robert


--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
 

Brian Candler

Brian said:
If so, it sounds to me like what you really want is cdb:
http://cr.yp.to/cdb.html

You emit key/value records of the form

+1,50:1->(50 byte record)
+1,70:2->(70 byte record)
+1,60:3->(60 byte record)
...
+2,500:10->(500 byte record)
... etc

then pipe it into cdbmake. The resulting file is built, in a single
pass, with a hash index, allowing you to jump to record with key 'x'
instantly.

I dug out some old CDB-building code; it's actually very simple with
ruby-cdb and no piping is required. Here's some code which takes a file
of tab-separated key-value pairs and builds a cdb:

def cdbmake(src, dest=src+".cdb", mode=0444, uid=0, gid=0)
  require 'cdb'
  File.open(src) do |f1|
    File.delete(src+".tmp") rescue nil           # clear out any stale temp file
    CDBMake.open(src+".tmp", mode) do |f2|
      f1.each_line do |line|
        line.chomp!
        next if line =~ /^#/ or line =~ /^\s*$/  # skip comments and blank lines
        if line =~ /^([^\t]*)(\t(.*))?$/         # key TAB value (value optional)
          f2.add($1, $3 || "")
        else
          puts "#{src}: Bad line #{line.inspect} ignored"
        end
      end
    end
  end
  File.rename(src+".tmp", dest)
  File.chown(uid, gid, dest)
rescue
  File.delete(src+".tmp") rescue nil
  raise
end

This used http://raa.ruby-lang.org/project/cdb/ which unfortunately is
not available as a gem, but it does bundle the cdb source for you.

I see there's now a cdb gem. To install this you need the cdb
development libraries already installed on your system. Under Ubuntu all
I needed was:

sudo apt-get install libcdb-dev
sudo gem install cdb

Strangely I found it installed all files mode 600. The fix:

find /usr/lib/ruby/gems/1.8/gems/cdb-0.0.2 -type f | xargs sudo chmod 644

Anyway, the README shows quite clearly how to use it.
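
If you prefer the pipe route instead, emitting cdbmake's input format from
Ruby is also trivial. A minimal sketch (the sample records are made up);
each record is "+klen,dlen:key->data" plus a newline, and the whole input
ends with an extra blank line:

def emit_cdb_records(io, records)
  records.each do |key, value|
    io << "+#{key.length},#{value.length}:#{key}->#{value}\n"
  end
  io << "\n"
end

emit_cdb_records($stdout, { "1" => "x" * 50, "2" => "y" * 70 })

Pipe the output into "cdbmake data.cdb data.tmp" and you get the same
hash-indexed file in one pass.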
 

Greg Willits

Eeek. Opened a can of worms!

It's 5:30 am where I am (I'm "still up" and not "up early"), so I'll hit
the highlights, and detail more later if for some reason there's an
interest.

-- application: data aggregation

-- I get data from county services and school districts. They have no
way to correlate and aggregate their data. That's what I am doing. I
match kids that are flagged in county services for whatever reason and
have to match them up with school records, health records, judicial
records where applicable, etc.

-- I get dozens of CSV files of various subjects. Every school
district's CSV file for any subject (attendance, grades, etc) has
similar content but not identical. Ditto for demographics from various
agencies.

-- before I can even attempt aggregation, I need to normalize all this
data so it can be indexed and analyzed, and to ensure the data is
cleaned and standardized for display.

-- there's a top layer DSL where I describe data sources and how to
transform any one particular raw data file into a normalized data file.
There's another set of descriptions to allow any one field to come from a
number of possible sources. So, something like birthcity might come from
data source X, but if not available, check data source Y, etc.

-- this process has been abstracted into a data aggregation core to do
the normalization, the indexing, and other tasks, with an
application-specific layer to handle the definition of transformations,
indexes that are needed, order of precedence, and the stitching of data
from various sources into records.

-- So, this particular step I've been talking about is where a raw CSV
file undergoes normalization by reorganizing the fields of each record
into a common structure for that given data topic; each field undergoes
some scrubbing (character case, packing phones, normalizing date
formats, translation of codes into strings, etc).

-- raw data files range from a handful of columns to a couple dozen
columns. From a few hundred rows to a couple million rows.

-- data is an array of hashes

-- by the time we get done normalizing a particular raw source, it can
hit the 4GB memory limit any one ruby fork has available to play with
(many forks run in parallel on multiple cores)

-- while most CSV files work out just fine reading into RAM,
transforming them 100% in RAM, and then writing in a simple single step,
many files do not.

-- so we have updated the process to load X records from the raw file,
transform X records in memory, then write X records to disk, and loop
until all records are done.

-- and we have to deal with indexes, duplicates, and other issues in
there as well

-- imagine 2,000,000 raw records from one file which get processed in
200,000 record chunks, but output back to another single file.

-- as I step through each chunk of 200,000 records, I can get the
longest length of that 200,000, and I can store that, but I can't know
what the longest length is for the next 200,000 that I haven't loaded
yet.

-- having processed and written the first 200,000 results to disk, and
then determining the length of the second 200,000, I'm not going to go
back and change the first 200,000 just to make them all the same.
There's no value in that at all.

-- So, when I get done with each 200,000 chunk, I now have a series of
integers which tells me the length of the records in each chunk of
200,000 rows.

-- the file has already been written, so again, I'm not going to go back
and move everything to insert this data at the top (which is where I
would put it if indeed every record was the same length)

-- so, I put this data at the end.

-- BTW, the row lengths are similar enough that the disk efficiency is
not an issue.

-- I read this last line using tail, and I strip off the leading empty
bytes (if any) as I described earlier

-- it's a couple of very simple calculations to convert any "index"
position into the exact byte seek position to find a specific record
(a rough sketch of this follows near the end of this post).

-- from this point on, the records are read as random access from disk
because otherwise I would need oodles of GB of data in RAM all at once
during the aggregation process.

-- is doing 200,000 or even 500,000 at a time in chunks really any
faster than doing them one at a time -- that I actually don't know yet,
I am just now finishing all the code that this "chunking" touches and
ensuring I get the same results I used to. The size of the chunks isn't as
important for speed as it is for memory management -- making sure I stay
within 4GB.

-- as for the speed issues, we've done a lot of profiling, and even
wrote a mini compiler to read our file transformation DSL and output a
stream of inline variable declarations and commands which gets included
as a module on the fly for each data source. That trick saved us from
parsing the DSL for each data row and literally shaved hours off the
total processing time. We attacked many other levels of optimization
while working to keep the code as readable as possible, because it's a
complicated layering of abstractions and processes.

-- I will look into cdb
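
-- to make that layout concrete, here's a rough sketch of the idea (purely
illustrative; the names, the padding, and the trailer format are stand-ins,
not the actual code):

CHUNK_SIZE = 200_000

# Write: pad each chunk's rows to that chunk's longest row, remember the
# per-chunk record length, and append those lengths as a trailer line.
def write_chunked(rows, path)
  record_lengths = []
  File.open(path, "w") do |out|
    rows.each_slice(CHUNK_SIZE) do |chunk|
      lines = chunk.map { |fields| fields.join(",") }  # stand-in for the real transform
      width = lines.map { |l| l.length }.max
      record_lengths << width + 1                      # +1 for the newline
      lines.each { |l| out << l.ljust(width) << "\n" }
    end
    out << record_lengths.join(",") << "\n"            # read back later with tail
  end
  record_lengths
end

# Read row N: a couple of small calculations, then one seek.
def read_row(path, row_number, record_lengths)
  set_indx     = row_number / CHUNK_SIZE
  set_sub_indx = row_number - set_indx * CHUNK_SIZE
  offset = (0...set_indx).inject(0) { |sum, i| sum + record_lengths[i] * CHUNK_SIZE }
  File.open(path, "r") do |f|
    f.seek(offset + set_sub_indx * record_lengths[set_indx])
    f.read(record_lengths[set_indx]).rstrip.split(",")
  end
end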

-- gw
 

Brian Candler

Greg said:
Eeek. Opened a can of worms! ...
-- by the time we get done normalizing a particular raw source, it can
hit the 4GB memory limit any one ruby fork has available to play with
(many forks run in parallel on multiple cores)

Ah, so this is a 4GB process limit, not a file limit.
-- imagine 2,000,000 raw records from one file which get processed in
200,000 record chunks, but output back to another single file.

-- as I step through each chunk of 200,000 records, I can get the
longest length of that 200,000, and I can store that, but I can't know
what the longest length is for the next 200,000 that I haven't loaded
yet.

Standard "external processing"; this is fine.
-- it's a couple of very simple calculations to convert any "index"
position into the exact byte seek position to find a specific record.

OK, so you want your output file to have a property which CSV files
don't, namely that you can jump to line N with a direct seek.
-- I will look into cdb

Here's another worm for your can: CouchDB :)

Output each normalised CSV row as a JSON document, then it will build
arbitrary indexes out of that, defined using Javascript (or Ruby or
...).

Unfortunately, building the indexes isn't as fast as it should be yet,
and I note you have a lot of hand-optimised code. But if you ever want
to be able to search your big files by content of a field, rather than
just jump to the N'th record, this might be your saviour.

Maybe mongodb would be even faster, but I've not played with it yet.

Regards,

Brian.
 

Robert Klemme

-- is doing 200,000 or even 500,000 at a time in chunks really any
faster than doing them one at a time -- that I actually don't know yet,
I am just now finishing all the code that this "chunking" touches and
ensuring I get the same results I used to. The size of the chunks isn't as
important for speed as it is for memory management -- making sure I stay
within 4GB.

From my experience doing one at a time is efficient enough. Of course
you then cannot do the line length math. But, for that I'd probably
consider doing something different: What about storing all file offsets
in an Array and writing it to a file "<orig>.idx" via Marshal? That way
you do not need fill bytes, your files are smaller and you can process
one line at a time. Reading a single entry is then easy: just slurp in
the index in one go and seek to index[record_index].

robert@fussel ~
$ time ruby19 -e 'h=(1..1_000_000).map {|i| i << 6};File.open("x","wb") {|io| Marshal.dump(h,io)}'

real 0m1.848s
user 0m1.483s
sys 0m0.249s

robert@fussel ~
$ ls -l x
-rw-r--r-- 1 robert Kein 5736837 Jul 6 20:01 x

Apart from that this will reduce memory usage of individual processes
and you might be able to better utilize your cores. Dumping via Marshal
is pretty fast and the memory overhead of that single index array is not
too big.

Alternatively you can write the index file while you are writing the
main data file. You just need to fix the number of bits you reserve for
each file offset. Then the read operation can be done via two seek
operations (first on the index, then on the data file) if you do not
cache the index.
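
Roughly like this (just a sketch; the 8-byte offset width and all the names
are arbitrary):

OFFSET_WIDTH = 8   # one "Q" (unsigned 64-bit) entry per record

# Write the data file and "<path>.idx" side by side, one fixed-width
# byte offset per record.
def write_with_index(records, path)
  File.open(path, "wb") do |data|
    File.open(path + ".idx", "wb") do |idx|
      records.each do |rec|
        idx << [data.pos].pack("Q")
        data << rec << "\n"
      end
    end
  end
end

# Read record N with two seeks and no in-memory index.
def read_record(path, n)
  offset = File.open(path + ".idx", "rb") do |idx|
    idx.seek(n * OFFSET_WIDTH)
    idx.read(OFFSET_WIDTH).unpack("Q").first
  end
  File.open(path, "rb") do |data|
    data.seek(offset)
    data.gets.chomp
  end
end

write_with_index(%w(foo bar baz), "records.dat")
puts read_record("records.dat", 2)   # baz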
-- as for the speed issues, we've done a lot of profiling, and even
wrote a mini compiler to read our file transformation DSL and output a
stream of inline variable declarations and commands which gets included
as a module on the fly for each data source. That trick saved us from
parsing the DSL for each data row and literally shaved hours off the
total processing time. We attacked many other levels of optimization
while working to keep the code as readable as possible, because it's a
complicated layering of abstractions and processes.

Did you consider making your DSL generate the code, or is that what you
are doing?

Kind regards

robert
 

Thomas Chust

2009/7/6 Greg Willits said:
[...]
-- So, this particular step I've been talking about is where a raw CSV
file undergoes normalization by reorganizing the fields of each record
into a common structure for that given data topic; each field undergoes
some scrubbing (character case, packing phones, normalizing date
formats, translation of codes into strings, etc).
[...]

Hello,

Reading this, I wonder why you don't dump the records into a database
at this point. For example SQLite3 would be lightweight, fast and
flexible. You could probably save a lot of effort not having to build
index structures manually and you could query your dataset using SQL
or the ORM of your choice afterwards...
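
For instance, a minimal sketch with the sqlite3-ruby bindings (the table
and column names are invented):

require 'sqlite3'

db = SQLite3::Database.new("normalized.db")
db.execute("CREATE TABLE IF NOT EXISTS students (last_name TEXT, birth_city TEXT)")

rows = [{ :last_name => "SMITH", :birth_city => "Sacramento" }]  # normalized records

db.transaction do
  rows.each do |r|
    db.execute("INSERT INTO students (last_name, birth_city) VALUES (?, ?)",
               [r[:last_name], r[:birth_city]])
  end
end

db.execute("CREATE INDEX IF NOT EXISTS idx_students_name ON students (last_name)")
p db.execute("SELECT * FROM students WHERE last_name = ?", ["SMITH"])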

cu,
Thomas
 

Greg Willits

Reading this, I wonder why you don't dump the records into a database
at this point. For example SQLite3 would be lightweight, fast and
flexible. You could probably save a lot of effort not having to build
index structures manually and you could query your dataset using SQL
or the ORM of your choice afterwards...

That would actually require a much larger rewrite than what I've done.
It wasn't too big a leap to change the APIs from "here's all the
records" to "here's some of the records." A few order-of-operations issues
had to be juggled, but overall it's not a major logical leap, nor a major
implementation leap. That probably goes for all the other DB options as
well.

At this point, the change from all-in-RAM to chunks-in-RAM has been a
relatively straightforward retrofit, and based on previous versions
which did use a DB, I think this is still faster or as fast. Another
option we will explore is 64-bit ruby (which I started another thread
for a few days ago).
What about storing all file offsets
in an Array and writing it to a file

That's a possibility that would be an easy retrofit and probably w/o
breaking the current interfaces. It would use a little more RAM, but
would eliminate calculating that offset each time during reads. That's
not a real pinch point, but worth an experiment just because it's a
simpler concept to recognize.
Did you consider making your DSL generate the code,
or is that what you are doing?

If I interpret the question correctly, that's what we're doing. I may be
using "DSL" a little loosely here though. It's more like an "active
config" language. I write setup files that are a mixture of k-v pairs
for certain things, and expressions for other things. All of that gets
processed at the start of every aggregation run which results in a
number of Ruby Module files being created which get included
dynamically.

Many thanks for all the thoughts & feedback.

-- gw
 

Robert Klemme

That's a possibility that would be an easy retrofit and probably w/o
breaking the current interfaces. It would use a little more RAM, but
would eliminate calculating that offset each time during reads. That's
not a real pinch point, but worth an experiment just because it's a
simpler concept to recognize.

I'd be interested to learn the outcome of that exercise.
If I interpret the question correctly, that's what we're doing. I may be
using "DSL" a little loosely here though. It's more like an "active
config" language. I write setup files that are a mixture of k-v pairs
for certain things, and expressions for other things. All of that gets
processed at the start of every aggregation run which results in a
number of Ruby Module files being created which get included
dynamically.

Only that you do not need to write this out to files; you can do it
in memory with eval. Anyway, that's just a detail. :)
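
Just to illustrate what I mean (everything here is made up):

# Build the generated transformation code as a string and eval it into an
# anonymous module instead of writing a .rb file to disk and including it.
code = <<-RUBY
  def normalize_row(row)
    { :last_name => row[0].strip.upcase,
      :phone     => row[1].gsub(/\\D/, "") }
  end
RUBY

transforms = Module.new
transforms.module_eval(code, "generated_transforms.rb", 1)

class RowProcessor; end
RowProcessor.send(:include, transforms)   # include is private, hence send

p RowProcessor.new.normalize_row([" smith ", "(916) 555-1234"])
# {:last_name=>"SMITH", :phone=>"9165551234"}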
Many thanks for all the thoughts & feedback.

You're welcome!

Kind regards

robert
 

Greg Willits

Robert said:
I'd be interested to learn the outcome of that exercise.


In my original system, I used the row lengths and the number of rows per
"set" to pre-calculate a lookup of offsets into the start of each set.
Upon receiving a request for record 623,456 I first had to determine
which set that would be in (4th set based on 200,000 per set). So I'd
use a floor calc to determine that. Then subtract that factor of rows
from the 623,456, then use the basic row_length multiplier. So, there
were a few calcs involved.

It looked like this:

set_indx = (row_number.to_f / @set_size.to_f).floor
set_sub_indx = row_number - (@set_size * set_indx)
record_start = @set_offsets[set_indx] + (set_sub_indx * @record_lengths[set_indx])

By saving the row-specific offsets as an array, that is simplified to:

record_start = @record_offsets[row_number]

This turned out to make the total process of fetching rows 15% faster.
(roughly 32 seconds vs 38 seconds for reading 1.4 million rows on my dev
system).

I already index the data itself using hashes (which work very, very
fast) for aggregation lookups, so this concept is quite parallel to
other ways the code works, and is a worthwhile change to make -- so
thanks for that idea.

-- gw
 

James Gray

In my original system, I used the row lengths and the number of rows per
"set" to pre-calculate a lookup of offsets into the start of each set.
Upon receiving a request for record 623,456 I first had to determine
which set that would be in (4th set based on 200,000 per set).
By saving the row-specific offsets as an array, that is simplified to:

record_start = @record_offsets[row_number]

This turned out to make the total process of fetching rows 15% faster.
(roughly 32 seconds vs 38 seconds for reading 1.4 million rows on my dev
system).
I already index the data itself using hashes (which work very, very
fast) for aggregation lookups, so this concept is quite parallel to
other ways the code works, and is a worthwhile change to make -- so
thanks for that idea.

The more I read in this thread, the more I feel it's crying out for a
database. I recognize that you disagree with this, I just disagree
with your disagreement. In a totally friendly way, of course. :)

Originally I too was thinking of the already suggested SQLite, but now
I'm thinking that's not right. My current feeling is that Redis would
be a terrific fit:

http://code.google.com/p/redis/

That's just my two bits. I promise to let the database argument lie
now.

James Edward Gray II
 

Joel VanderWerf

James said:

Does this install as a gem for you? I get this...

$ gem install redis
Successfully installed redis-0.0.1
1 gem installed
Installing ri documentation for redis-0.0.1...

...

and then rdoc fails with:
invalid argument: --format=html

and the gem dir redis-0.0.1 is empty, and

Even disabling rdoc:

$ gem install redis --no-rdoc --no-ri
Successfully installed redis-0.0.1
1 gem installed

results in an empty gem dir.
 

Ezra Zygmuntowicz

Does this install as a gem for you? I get this...

$ gem install redis
Successfully installed redis-0.0.1
1 gem installed
Installing ri documentation for redis-0.0.1...

...

and then rdoc fails with:
invalid argument: --format=html

and the gem dir redis-0.0.1 is empty, and

Even disabling rdoc:

$ gem install redis --no-rdoc --no-ri
Successfully installed redis-0.0.1
1 gem installed

results in an empty gem dir.



Get redis from github; as far as I know there is no redis gem on
rubyforge yet (I wrote the Ruby client).

git clone git://github.com/ezmobius/redis-rb.git

cd redis-rb

rake gem
sudo gem install pkg/redis***.gem

rake redis:install
rake redis:start


Cheers-

Ezra Zygmuntowicz
(e-mail address removed)
 

Joel VanderWerf

Ezra said:
Get redis from github; as far as I know there is no redis gem on
rubyforge yet (I wrote the Ruby client).

git clone git://github.com/ezmobius/redis-rb.git

cd redis-rb

rake gem
sudo gem install pkg/redis***.gem

rake redis:install
rake redis:start

Ok until this point:

$ rake redis:start
(in /home/vjoel/ruby/src/db/redis-rb)
Detach with Ctrl+\ Re-attach with rake redis:attach
rake aborted!
No such file or directory - dtach -A /tmp/redis.dtach redis-server
/etc/redis.conf

Is dtach something I need to install separately? (kubuntu 8.04 here)
 

Ezra Zygmuntowicz

Ok until this point:

$ rake redis:start
(in /home/vjoel/ruby/src/db/redis-rb)
Detach with Ctrl+\ Re-attach with rake redis:attach
rake aborted!
No such file or directory - dtach -A /tmp/redis.dtach redis-server /etc/redis.conf

Is dtach something I need to install separately? (kubuntu 8.04 here)

Yeah, sorry, dtach is another piece of software to install; you can get
it from darwin ports or your package manager. But that is just a
convenience wrapper for starting redis, you can start redis yourself
by just checking it out and running the ./redis-server command

Cheers-

Ezra Zygmuntowicz
(e-mail address removed)
 

Greg Willits

James said:
The more I read in this thread, the more I feel it's crying out for a
database. I recognize that you disagree with this, I just disagree
with your disagreement. In a totally friendly way, of course. :)

I think if you saw the whole thing you'd see why the db is more of a
hindrance than an asset. I've described a small piece of it, and not
the piece that has the most compelling reasons to avoid a db -- at least
a traditional db. Some of these other ones that have been brought up may be a
different story.

-- gw
 

Robert Klemme

2009/7/7 Greg Willits said:
I think if you saw the whole thing you'd see why the db is more of a
hindrance than an asset. I've described a small piece of it, and not
the piece that has the most compelling reasons to avoid a db -- at least
a traditional db. Some of these other ones that have been brought up may be a
different story.

I have to admit that I thought of a database as well. Keywords that
especially triggered this were: index, aggregation, huge, speed. You
make it really sound as if we are talking about a huge beast of which
we are getting just a tiny glimpse. :)

Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
 

Greg Willits

Robert said:
I have to admit that I thought of a database as well. Keywords that
especially triggered this were: index, aggregation, huge, speed. You
make it really sound as if we are talking about a huge beast of which
we are getting just a tiny glimpse. :)

Heh. No, it's not huge, certainly not by today's standards of huge.

The first version of this did everything by database, but we realized we
were using little of the DB's strengths, and that in-memory indexes
would be more efficient to both create and use. Most of our indexes
aren't normal indexing of single fields, but rather combinations of
pieces of various fields. Those are used to determine matches and
probabilities of matches between two data sets.

During aggregation, we do a lot of work with just the indexes
themselves, something that wasn't as effective when using a DB.
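
As a rough illustration of the kind of index I mean (the field names and
the key recipe here are made up, not the real ones):

# A key built from pieces of several fields; the hash maps each key to the
# row numbers in one data set that are worth scoring against the other.
def match_key(rec)
  [rec[:last_name][0, 4].upcase,
   rec[:birth_date].delete("-"),
   rec[:zip][0, 5]].join("|")
end

school = [{ :last_name => "Smith", :birth_date => "2001-04-07", :zip => "95814-1234" }]
county = [{ :last_name => "SMITH", :birth_date => "2001-04-07", :zip => "95814" }]

index = Hash.new { |h, k| h[k] = [] }
school.each_with_index { |rec, i| index[match_key(rec)] << i }

county.each do |rec|
  candidates = index[match_key(rec)]   # school rows to score for a probable match
  p candidates                         # [0]
end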

When we replaced the DB with random access text files for the data
itself, and replaced all the other tasks with our own techniques, the
system as a whole was faster, and more flexible to code new ideas for
matching and aggregation.

Now, whether something like redis or couchdb has value added compared to
a traditional db because they behave more like what we've done, I don't
know. Haven't looked into that yet, but it does look interesting.

Still, what we have works quite well, is entirely in Ruby, and requires
no other resources except to connect to the application's db for the
final data update. So there's some advantages to that simplicity as
well.

-- gw
 

Joel VanderWerf

Ezra said:
Yeah, sorry, dtach is another piece of software to install; you can get it
from darwin ports or your package manager. But that is just a
convenience wrapper for starting redis, you can start redis yourself by
just checking it out and running the ./redis-server command

Thanks, that works. Nice to see that little notes.rb sinatra app in
action, though it's got a few bugs.
 
