file.seek and unused bytes


Greg Willits

Ruby 1.8.6

(sorry this one takes some setup to explain the context of the question)

I'm using file.seek to create equal, fixed-length rows in a disk file.
The row data itself is not fixed length, but I am forcing the seek
position to be at fixed intervals for the start of each row. So, on disk
it might look something like this:

aaaaaaaaaaaaaaaaaaaaaaaaaX0000000000000
bbbbbbbbbbbbbbX000000000000000000000
ccccccccccccccccccccccccccccccccccccccccX00

I'm really only writing the "aaaa..." and "bbbb...." portions of the
rows with an EOL (an X here so it's easy to see).
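
In code, the write side looks roughly like this (a simplified sketch --
the row width, file name, and data are made up for illustration):

  ROW_WIDTH = 40  # hypothetical fixed interval for each row

  rows = ["a" * 25, "b" * 14, "c" * 33]   # made-up variable-length row data

  File.open("rows.dat", "wb") do |f|
    rows.each_with_index do |row, i|
      f.seek(i * ROW_WIDTH)     # force each row to start at a fixed offset
      f.write(row + 10.chr)     # write only the data plus an EOL; the rest is never written
    end
  end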

I have one operation step which uses tail to grab the last line of the
file. When I do that, I get something like this:

000000000000000000000cccccccccccccccccccccccccccccccccccccccc

which is the empty bytes past the EOL of the line "bbb..." plus the
valid data of the following line.

After some fiddling, it became apparent that I can test the value of each
byte against zero (if byte_data == 0) to know whether there's data in it
or not, so that I can trim off that leading set of zeros -- but I'm not
certain those empty bytes will always be zero.

And finally my question.....

So, my question is about those zeros. Does advancing the seek position
(do we still call that the "cursor"?) intentionally and proactively fill
the unused bytes with what apparently equates to zero? OR, am I just
getting lucky that my test files have used virgin disk space which yields
zeros, and the seek position just skips bytes which potentially would
contain garbage from previously used disk sectors?

Can I count on those unused bytes always being zero?

-- gw
 

Greg Willits

Greg said:
aaaaaaaaaaaaaaaaaaaaaaaaaX0000000000000
bbbbbbbbbbbbbbX000000000000000000000
ccccccccccccccccccccccccccccccccccccccccX00

Argh. Those should look equal length in a non-proportional font.

aaaaaaaaaaaaaaaaaaaaaaaaaX0000000000
bbbbbbbbbbbbbbX000000000000000000000
cccccccccccccccccccccccccccccccccX00

-- gw
 

Eleanor McHugh

Greg said:
So, my question is about those zeros. Does advancing the seek position
(do we still call that the "cursor"?) intentionally and proactively fill
the unused bytes with what apparently equates to zero? OR, am I just
getting lucky that my test files have used virgin disk space which yields
zeros, and the seek position just skips bytes which potentially would
contain garbage from previously used disk sectors?

Can I count on those unused bytes always being zero?

Unfortunately you're getting lucky. A seek adjusts the file pointer
but doesn't write anything to disk, so whilst your 'unused' bytes won't
change value as a result of writing data to the file unless you write
the full record, you can't rely on them having a value of zero if you
don't.

Also you have to consider that zero may itself be a valid data value
within a record :)


Ellie

Eleanor McHugh
Games With Brains
http://slides.games-with-brains.net
 

Robert Klemme

Eleanor said:
Unfortunately you're getting lucky. A seek adjusts the file pointer
but doesn't write anything to disk, so whilst your 'unused' bytes won't
change value as a result of writing data to the file unless you write
the full record, you can't rely on them having a value of zero if you
don't.

Actually it should not matter what those bytes are. Your record format
should make sure that you know exactly how long an individual record is
- as Ellie pointed out:
Also you have to consider that zero may itself be a valid data value
within a record :)

Oh, the details. :)

Here's another one: if your filesystem supports sparse files and the
holes are big enough (at least spanning more than one complete cluster
or whatever the smallest allocation unit of the filesystem is called)
those bytes might not really come from disk, in which case - when read -
they are usually zeroed. But I guess this is also implementation dependent.

One closing remark: using "tail" to look at a binary file is probably a
bad idea in itself. "tail" makes certain assumptions about what a line
is (it needs to in order to give you N last lines). Those assumptions
are usually incorrect when using binary files.

Kind regards

robert
 

Greg Willits

Eleanor said:
Unfortunately you're getting lucky.

Thanks, I figured as much (just thought I'd see if Ruby was any
different). I've gone ahead and filled empty positions using
string.ljust.
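
For what it's worth, the fill is just something like this (a minimal
sketch; ROW_WIDTH and the fill character are placeholders):

  ROW_WIDTH = 40   # hypothetical fixed slot width, as before

  row = "some row data"
  padded = (row + 10.chr).ljust(ROW_WIDTH, 0.chr)   # pad data + EOL out to the full slot
  padded.length                                     # => 40, so every byte of the slot gets written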

tail works fine. It's all text data with 10.chr EOLs, and yeah I would
know whether a 0 is a valid piece of data or not based on the file
format.

Thanks.
 

Brian Candler

Eleanor said:
Unfortunately you're getting lucky. A seek adjusts the file pointer
but doesn't write anything to disk so whilst your 'unused' bytes won't
be changing value as a result of writing data to the file unless you
write the full record, you can't rely on them not having a value other
than zero if you don't.

I don't believe that's the case today. If it were, then you would have a
very easy way to examine the contents of unused sectors on the disk -
which would allow you to see other people's deleted files, passwords
etc.

It was possible on old mainframe systems in the 80's though :)

But today, if you extend a file using seek, you should always read
zeros.
 

Greg Willits

Brian said:
I don't believe that's the case today. If it were, then you would have a
very easy way to examine the contents of unused sectors on the disk -
which would allow you to see other people's deleted files, passwords
etc.

It was possible on old mainframe systems in the 80's though :)

80's micros too with BASIC :p

Brian said:
But today, if you extend a file using seek, you should always read
zeros.

That makes a great deal of sense, and would be consistent with what I
was seeing. I was wondering why the values being returned were zeros
instead of nil or something else.

Either way, I know it's a better practice to pack the rows, but I had a
moment of laziness because I'm dealing with a couple million rows and
figured if there was some processing time to be saved, I'd take
advantage of it.

I would have experimented, but I don't know how to ensure that the
various file contents are in fact being written to the exact same disk
space.

-- gw
 

Gary Wright

Greg said:
So, my question is about those zeros. Does advancing the seek position
(do we still call that the "cursor"?) intentionally and proactively fill
the unused bytes with what apparently equates to zero?

I don't think anyone has answered this question directly, but on POSIX-
like file systems a seek past the end of the file and a subsequent
write will cause the intervening bytes (which have never been written)
to read as zeros. Whether those 'holes' occupy disk space or not is
implementation dependent.

Gary Wright
 

Greg Willits

Gary said:
I don't think anyone has answered this question directly but on POSIX-
like file systems a seek past the end of the file and a subsequent
write will cause the intervening bytes (which have never been written)
to read as zeros. Whether those 'holes' occupy disk space or not is
implementation dependent.


If indeed this is a fact (and it's consistent with my observation),
then I'd say it's worth taking advantage of. I can't find a definitive
reference to cite though (Pickaxe, The Ruby Way).

-- gw
 

Brian Candler

Greg said:
If in deed this is a fact (and it's consistent with my observation),
then I'd say it's worth taking advantage of. I can't find a definitive
reference to cite though (Pickaxe, The Ruby Way).

Well, those aren't POSIX references. But from "Advanced Programming in
the UNIX Environment" by the late great Richard Stevens, pub.
Addison-Wesley, p53:

"`lseek` only records the current file offset within the kernel - it
does not cause any I/O to take place. This offset is then used by the
next read or write operation.

The file's offset can get greater than the file's current size, in which
case the next `write` to the file will extend the file. This is referred
to as creating a hole in a file and is allowed. Any bytes in a file that
have not been written are read back as 0."
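
You can see this from Ruby directly (a quick sketch; the file name and
offset are arbitrary):

  # seek well past the end of a brand-new file, then write a single byte
  File.open("hole_test.dat", "wb") do |f|
    f.seek(100)       # offset is now beyond the end of the (empty) file
    f.write("X")      # this write extends the file, leaving a 100-byte hole
  end

  data = File.open("hole_test.dat", "rb") { |f| f.read }
  data.length                  # => 101
  data[0, 100].count(0.chr)    # => 100 -- the hole reads back as zero bytes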
 

Greg Willits

Brian said:
Well, those aren't POSIX references. But from "Advanced Programming in
the UNIX Environment" by the late great Richard Stevens, pub.
Addison-Wesley, p53:

"`lseek` only records the current file offset within the kernel - it
does not cause any I/O to take place. This offset is then used by the
next read or write operation.

The file's offset can get greater than the file's current size, in which
case the next `write` to the file will extend the file. This is referred
to as creating a hole in a file and is allowed. Any bytes in a file that
have not been written are read back as 0."


I see, you guys are saying it's an OS-level detail, not a Ruby-specific
detail.

It seems though that any hole in the file must be written to. Otherwise
the file format itself must keep track of every byte that it has written
to or not in order to have a write-nothing / read-as-zero capability.
This would seem to be very inefficient overhead.

Hmm... duh, I can bust out the hex editor and have a look.

<pause>

OK, well, empty bytes created by extending the filesize of a new file
are 0.chr not an ASCII zero character (well, at least according to the
hex editor app). That could simply be the absence of data from virgin
disk space. I suppose, that absence of data could be interpreted however
the app wants, so the hex editor says it is 0.chr and the POSIX code
says it is 48.chr.

Still though, since the file isn't being filled with the data that is
provided by the read-back, that still confuses me. How does the read
know to convert those particular NULL values into ASCII zeros vs a NULL
byte I write on purpose? And it still doesn't really confirm what would
happen when non-virgin disk space is being written to.

Hrrmmm. :-\

Thanks for the discussion so far.

-- gw
 

Brian Candler

Greg said:
It seems though that any hole in the file must be written to. Otherwise
the file format itself must keep track of every byte that it has written
to or not in order to have a write-nothing / read-as-zero capability.

Unless you seek over entire blocks, in which case the filesystem can
create a "sparse" file with entirely missing blocks (i.e. the disk usage
reported by du can be much less than the file size).

When you read any of these blocks, you will see all zero bytes.
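
A rough way to see that from Ruby (sketch only; block accounting is
platform dependent and File::Stat#blocks may be nil on some systems):

  File.open("sparse_test.dat", "wb") do |f|
    f.seek(10_000_000)   # skip ~10 MB without writing anything
    f.write("X")
  end

  st = File.stat("sparse_test.dat")
  st.size                               # logical size: 10000001 bytes
  st.blocks ? st.blocks * 512 : nil     # allocated bytes -- typically far smaller if the file is sparse
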
Greg said:
Hmm... duh, I can bust out the hex editor and have a look.

<pause>

OK, well, empty bytes created by extending the filesize of a new file
are 0.chr not an ASCII zero character (well, at least according to the
hex editor app). That could simply be the absence of data from virgin
disk space. I suppose, that absence of data could be interpreted however
the app wants, so the hex editor says it is 0.chr and the POSIX code
says it is 48.chr.

No, POSIX says it is a zero byte (character \0, \x00, byte value 0,
binary 00000000, ASCII NUL, however you want to think of it)
 

Greg Willits

Brian said:
Unless you seek over entire blocks, in which case the filesystem can
create a "sparse" file with entirely missing blocks (i.e. the disk usage
reported by du can be much less than the file size).
When you read any of these blocks, you will see all zero bytes.

OK. But the file system doesn't keep track of anything smaller than the
block, right? So, it's not keeping track of the misc individual holes
created by each extension of the seek (?).

No, POSIX says it is a zero byte (character \0, \x00, byte value 0,
binary 00000000, ASCII NUL, however you want to think of it)

Doh! My zeros are coming from a step in my process which includes
converting this particular data chunk to integers which I was
forgetting. And nil.to_i will generate a zero. So, my bad; that detail
is cleared up.
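
That is, the stray zeros were just this (trivial irb check):

  nil.to_i   # => 0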

The only thing I'm still not real clear on is....

- file X gets written to disk block 999 -- the data is a stream of 200
contiguous "A" characters

- file X gets deleted (which AFAIK only deletes the directory entry,
and does not null-out the file data unless the OS has been told to do
just that with a "secure delete" operation)

- file Y gets written to disk block 999 -- the data has holes in it
from extending the seek position

Generally, I wouldn't read in the holes, but I have this one little step
that does end up with some holes, and I know it. What I don't know is
what to expect in those holes: null values, or garbage "A" characters
left over from file X.

Logically I would expect garbage data, but the literal impact of
paragraphs quoted earlier from the Unix book above indicates I should
expect null values. I can't think of any tools I have that would enable
me to test this.

Because I don't know, I've gone ahead and packed the holes with a known
character. However, if I can avoid that I want to because it sucks up
some time I'd like to avoid in large files, but it's not super critical.

At this point I'm more curious than anything. I appreciate the dialog.

-- gw
 

Gary Wright

Greg said:
Generally, I wouldn't read in the holes, but I have this one little step
that does end up with some holes, and I know it. What I don't know is
what to expect in those holes: null values, or garbage "A" characters
left over from file X.

You should expect null bytes (at least on Posix-like file systems).
I'm not sure why you are doubting this.

From the Open Group Base Specification description of lseek:
The lseek() function shall allow the file offset to be set beyond
the end of the existing data in the file. If data is later written
at this point, subsequent reads of data in the gap shall return
bytes with the value 0 until data is actually written into the gap.

Gary Wright
 

Greg Willits

Gary said:
You should expect null bytes (at least on Posix-like file systems).
I'm not sure why you are doubting this.

I wasn't separating the read from the write. The spec talks about
reading zeros but doesn't talk about writing them. I wasn't trusting
that the nulls were getting written. I think I get it now that the read
is what matters. Whether a null/zero got written, or whether the gaps
are accounted for in some other way, is independent of the data the read
returns.

I still don't see where the nulls come from (if they're not being
written), but if the rules allow me to expect nulls/zeros, and those
gaps are being accounted for somewhere/somehow then that's what matters.

-- gw
 

Brian Candler

Greg said:
I still don't see where the nulls come from (if they're not being
written)

All disk I/O is done in terms of whole blocks (typically 1K).

Whenever the filesystem adds a new block to a file, instead of
reading the existing contents into the VFS cache it just zero-fills a
block in the VFS cache. A write to an offset then updates that block and
marks it 'dirty'. The entire block will then at some point get written
back to disk, including of course any of the zeros which were not
overwritten with user data.
 

Robert Klemme

Greg said:
Generally, I wouldn't read in the holes, but I have this one little step
that does end up with some holes, and I know it. What I don't know is
what to expect in those holes: null values, or garbage "A" characters
left over from file X.

Logically I would expect garbage data, but the literal impact of
paragraphs quoted earlier from the Unix book above indicates I should
expect null values. I can't think of any tools I have that would enable
me to test this.

I would not expect anything in those bytes, for the simple reason that
this reduces portability of your program. If anything, the whole
discussion has shown that apparently there are (or were) different
approaches to handling this (including returning old data, which should
not happen any more nowadays).
Because I don't know, I've gone ahead and packed the holes with a known
character. However, if I can avoid that I want to because it sucks up
some time I'd like to avoid in large files, but it's not super critical.

At this point I'm more curious than anything. I appreciate the dialog.

I stick to the point I made earlier: if you need particular data to be
present in the slack of your records, you need to make sure it's there.
Since your IO is done blockwise and you probably aligned your offsets
with block boundaries anyway, there should not be a noticeable difference
in IO. You probably need a bit more CPU time to generate that data, but
that's probably negligible in light of the disk IO overhead.

If you want to save yourself that effort you should probably make sure
that your record format allows for easy separation of the data and slack
area. There are various well established practices, for example
preceding the data area with a length indicator or terminating data with
a special marker byte.

My 0.02 EUR.

Kind regards

robert
 

Greg Willits

Brian said:
All disk I/O is done in terms of whole blocks (typically 1K).

Whenever the filesystem adds a new block to a file, instead of
reading the existing contents into the VFS cache it just zero-fills a
block in the VFS cache. A write to an offset then updates that block and
marks it 'dirty'. The entire block will then at some point get written
back to disk, including of course any of the zeros which were not
overwritten with user data.

Ah. That's what I was looking for. thanks.
 

Greg Willits

Robert said:
I would not expect anything in those bytes for the simple reason that
this reduces portability of your program.

Understood. In this case, I'm making a conscious decision to go with
whatever is faster. I've already written the code so that it is easy to
add back in the packing if it's ever needed.

We're working with large data sets for aggregation, which takes a long
time to run. Second only to the ease and clarity of the top-level DSL
is the speed of the aggregation process itself, so we can afford to do
more analysis.

Robert said:
If you want to save yourself that effort you should probably make sure
that your record format allows for easy separation of the data and slack
area. There are various well established practices, for example
preceding the data area with a length indicator or terminating data with
a special marker byte.

Yep, already done that. Where this 'holes' business comes in, is that to
stay below the 4GB limit, the data has to be processed and the file
written out in chunks. Each chunk may have a unique line length. So, we
find the longest line of the chunk, and write records at that interval
using seek. Each record terminates with a line feed.

Since we don't know the standard length of each chunk until processing
is done (and the file has already been started), a set of the lengths is
added to the end of the file instead of the beginning.

When reading data, the fastest way to get the last line, which has my
line lengths, is to use tail. This returns a string starting from the
last record's EOL marker to the EOF. This "line" has the potential
(likelihood) to include the empty bytes of the last record in front of
the actual data I want, because of how tail interprets "lines" between
EOL markers. I need to strip those empty bytes from the start of the
line before I get to the line lengths data.

Every other aspect of the file uses the common approach of lines with
#00 between fields and #10 at the end of the data, followed by zero or
more fill characters to make each row an equal length of bytes.
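
The read side boils down to something like this (a rough sketch; the
file name is a placeholder and it assumes nulls or spaces are the fill
characters):

  # grab everything after the final EOL, then strip the fill characters that
  # pad out the front of that last fixed-width slot
  last_line = `tail -n 1 data_chunks.dat`
  lengths_data = last_line.sub(/\A[\000 ]+/, "")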

-- gw
 

Brian Candler

Greg said:
Yep, already done that. Where this 'holes' business comes in, is that to
stay below the 4GB limit, the data has to be processed and the file
written out in chunks. Each chunk may have a unique line length. So, we
find the longest line of the chunk, and write records at that interval
using seek. Each record terminates with a line feed.

To me, this approach smells. For example, it could have *really* bad
disk usage if one record in your file is much larger than all the
others.

Is the reason for this fixed-space padding just so that you can jump
directly to record number N in the file, by calculating its offset?

If so, it sounds to me like what you really want is cdb:
http://cr.yp.to/cdb.html

You emit key/value records of the form

+1,50:1->(50 byte record)
+1,70:2->(70 byte record)
+1,60:3->(60 byte record)
...
+2,500:10->(500 byte record)
... etc

then pipe it into cdbmake. The resulting file is built, in a single
pass, with a hash index, allowing you to jump to record with key 'x'
instantly.
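
Generating that input from Ruby is trivial (a sketch; the records hash
and file names are made up):

  records = { "1" => "first record data", "2" => "second record data" }   # made-up sample data

  # cdbmake's input format: +klen,dlen:key->data per line, terminated by a blank line
  File.open("records.txt", "w") do |out|
    records.each do |key, data|
      out.write("+#{key.length},#{data.length}:#{key}->#{data}\n")
    end
    out.write("\n")
  end
  # then: cdbmake records.cdb records.tmp < records.txt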

There's a nice and simple ruby-cdb library available, which wraps djb's
cdb library.

Of course, with cdb you're not limited to integers as the key to locate
the records, nor do they have to be in sequence. Any unique key string
will do - consider it like an on-disk frozen Hash. (The key doesn't have
to be unique actually, but then when you search for key K you would ask
for all records matching this key)
 
