(e-mail address removed) says...
[ ... Warning: this post is long, and topicality is marginal at best: ]
> I'm not sure what you're considering a B-Tree here. I don't
> think that the OS does anything special with directories,
> however: the general philosophy is that if your directory
> structure is well organized, no individual directory will
> contain enough entries to make anything other than linear search
> worth the effort. Also, the main use of B-Trees is to keep
> sorted data; the OSes I know don't maintain the directory
> entries sorted, so a B-Tree wouldn't improve access in any way.
Windows NTFS does use B-trees for directories. Small directories (up to
something like 4K of total data) are stored entirely in the directory's
MFT entry as essentially a vector. Large directories (anything where the
directory entries won't fit entirely in the MFT) are stored as B-trees.
Most Unix file systems use hashing instead; ext2 and ext3, which are
quite common on Linux, are typical examples. There's a pretty decent
description of their hashing at:
http://www.nongnu.org/ext2-doc/ext2.html#INDEXED-DIRECTORY
Interestingly, though this uses hashing, it's not "flat" hashing -- the
hash is used as the key in (you've probably guessed by now) a balanced
tree -- pretty much a B-tree.
Though the decision as to when to use this indexing is made a bit
differently than it is in NTFS, the same basic idea applies: only large
directories are indexed in this fashion.
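To make the "hash used as a key in a balanced tree" idea concrete, here's a toy sketch in Python (just for illustration -- the class and function names are made up, and MD5 stands in for ext3's actual half-MD4-derived hash). A sorted array with binary search plays the role of one B-tree node:

```python
import bisect
import hashlib

def name_hash(name):
    # Toy stand-in for the filesystem's directory hash; ext3 really
    # uses a custom half-MD4 variant, not MD5.
    return int.from_bytes(hashlib.md5(name.encode()).digest()[:4], "big")

class HashedDirIndex:
    """Maps filename-hash -> data block, kept sorted by hash so a
    lookup can binary-search, much as a B-tree node does."""

    def __init__(self):
        self._keys = []    # sorted list of hashes
        self._blocks = []  # block number for each hash

    def add(self, name, block):
        h = name_hash(name)
        i = bisect.bisect_left(self._keys, h)
        self._keys.insert(i, h)
        self._blocks.insert(i, block)

    def lookup(self, name):
        h = name_hash(name)
        i = bisect.bisect_left(self._keys, h)
        if i < len(self._keys) and self._keys[i] == h:
            # Real code would read the block and verify the name there,
            # since two names can hash to the same value.
            return self._blocks[i]
        return None
```

The point is only that the tree is ordered by *hash*, not by name, so insertion order and lexical order are both irrelevant to lookup cost.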
There are certainly a few primitive file systems (e.g. FAT) that don't
support indexing directories at all, but at this point, I'd consider
them more the exception than the rule. Oddly, one of the primary uses of
FAT at this point is in things like cards in digital cameras -- which
typically organize your pictures as a single directory containing _lots_
of files (hundreds or thousands of files is quite common). IOW, one of
the most common uses of FAT also closely resembles its worst case.
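A quick back-of-envelope comparison shows why that worst case hurts. Assuming an unsorted FAT-style directory (expected scan of half the entries) versus an indexed directory with an assumed fanout of 128 entries per node:

```python
from math import ceil, log

def linear_avg_compares(n):
    # Unsorted, unindexed directory: on average we scan half of it.
    return n / 2

def btree_node_reads(n, fanout=128):
    # Height of a B-tree with the given fanout: one node per level.
    return max(1, ceil(log(n, fanout)))

for n in (100, 1000, 10000):
    print(n, linear_avg_compares(n), btree_node_reads(n))
```

For a 10,000-file directory that's 5,000 comparisons on average versus touching 2 index nodes -- which is roughly the gap you feel when a camera card gets full.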
[ ... ]
> In a relational database, it probably depends on the types of
> indexes, and the use which is made of them. I think modern
> databases dynamically tune the organization according to the
> actual accesses they see, which means that if sorted access
> isn't useful, they won't use a B-Tree. (Sorted access is useful
> not only when you request sorted output, but also when you have
> index criteria like "where x > a and x < b". On the other hand,
> a hash table is generally better for "where x = a", and off
> hand, I can't think of any "optimizing" structure for things
> like "where x like '...%'"---I've also noticed that when I use
> such requests, my access times go up significantly.)
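The range-vs-equality distinction in the quoted text is easy to demonstrate. A sketch (standing in a sorted Python list for a B-tree's leaf level, and a dict for a hash index):

```python
import bisect

# Sorted data, as a B-tree would keep it: ranges are cheap.
xs = sorted([3, 17, 42, 56, 99])

def range_query(a, b):
    # where x > a and x < b: two binary searches bracket the answer.
    lo = bisect.bisect_right(xs, a)
    hi = bisect.bisect_left(xs, b)
    return xs[lo:hi]

# Hash table: ideal for equality, but the hash scrambles ordering,
# so a range query would have to probe every possible value.
table = {x: True for x in xs}

def equals(a):
    # where x = a: one probe, regardless of table size.
    return a in table

print(range_query(10, 60))  # [17, 42, 56]
print(equals(42))           # True
```

That's the whole trade-off in miniature: the sorted structure answers both kinds of query, while the hash answers only one -- but answers it in constant time.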
If you didn't mind a huge index, you could create an index that pointed
to a particular record with the string starting from every possible
offset in a field. For example, given two records with keys of "abcd"
and "xyz", the index would look something like:
key     record number
abcd    0
bcd     0
cd      0
d       0
xyz     1
yz      1
z       1
Given its (almost inevitably large) size, you'd probably want to
organize that index as a B-tree.
Then, matching a search with a leading wildcard would be essentially
the same as matching one without it. Given the larger index size, you
might easily have an extra level or two in the tree, so access might be
somewhat slower, but not the orders of magnitude slower that we see
with current SQL servers.
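Here's a sketch of that idea in Python (function names are made up; a sorted list with binary search stands in for the B-tree). Note that matching a prefix against every suffix finds the pattern *anywhere* in the key, i.e. it answers '%cd%'; the strict ends-with case '%cd' would additionally require the suffix to equal the pattern exactly:

```python
import bisect

def build_suffix_index(records):
    # One entry per suffix of each key, as in the table above.
    entries = []
    for recno, key in enumerate(records):
        for off in range(len(key)):
            entries.append((key[off:], recno))
    entries.sort()  # a real implementation would keep this in a B-tree
    return entries

def search_leading_wildcard(index, pattern):
    # Find every record with some suffix starting with `pattern`,
    # using two binary searches to bracket the matching range.
    keys = [k for k, _ in index]
    lo = bisect.bisect_left(keys, pattern)
    hi = bisect.bisect_left(keys, pattern + "\uffff")
    return sorted({recno for _, recno in index[lo:hi]})

index = build_suffix_index(["abcd", "xyz"])
print(search_leading_wildcard(index, "cd"))  # [0]: "abcd" contains "cd"
```

The search itself is just an ordinary prefix lookup in a sorted structure -- exactly the operation B-trees are good at.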
Of course, on its own, this design only optimizes access in the case of
a leading wildcard -- but IME, that's a very common scenario. OTOH, with
only a small amount of extra data, this index could also be used for
embedded wildcards: you'd also store the offset into the string for each
entry. In this case, searching for something like 'ab%f' would consist
of searching for 'ab' with an offset of 0 and 'f' with an offset >= 2.
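Extending the previous sketch with the stored offset (again, names are hypothetical and this is only a candidate filter -- a real matcher would still verify each candidate against the full pattern, since 'ab%f' also requires the key to *end* at the 'f'):

```python
import bisect

def build_offset_suffix_index(records):
    # (suffix, offset, record number): the offset is the extra data
    # that makes embedded-wildcard filtering possible.
    entries = []
    for recno, key in enumerate(records):
        for off in range(len(key)):
            entries.append((key[off:], off, recno))
    entries.sort()
    return entries

def find(index, prefix):
    # All (offset, recno) pairs whose suffix starts with `prefix`.
    lo = bisect.bisect_left(index, (prefix, -1, -1))
    hi = bisect.bisect_left(index, (prefix + "\uffff", -1, -1))
    return {(off, recno) for _, off, recno in index[lo:hi]}

def match_embedded(index, head, tail):
    # 'head%tail': head must sit at offset 0, tail at an offset of at
    # least len(head).  For 'ab%f' that's 'ab' at 0 and 'f' at >= 2.
    heads = {rec for off, rec in find(index, head) if off == 0}
    tails = {rec for off, rec in find(index, tail) if off >= len(head)}
    return sorted(heads & tails)

index = build_offset_suffix_index(["abcdef", "abf", "xyzf"])
print(match_embedded(index, "ab", "f"))  # [0, 1]; record 2 fails the head test
```

Both lookups are still plain prefix searches on the same sorted index; the offsets are only consulted to filter the candidates.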
[ ... ]
> That used to be true, but I'm not sure now, with paging, etc.
> If you need random access to sorted data, using a B-Tree with
> each node filling an entire page (and page aligned) might be
> worth it (but you'd need one awfully big table to see the
> difference).
I wouldn't be surprised if it did pretty well -- like paging, the
performance of caching also tends to depend heavily upon locality of
reference, and locality is exactly what B-trees were designed to
maximize.
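The arithmetic behind page-sized nodes is worth spelling out. Assuming a 4 KB page and 8-byte keys and child pointers (both assumptions, of course -- real layouts vary):

```python
from math import ceil, log

PAGE = 4096      # assumed page size in bytes
KEY, PTR = 8, 8  # assumed key and child-pointer sizes

# Children per page-sized node: each child costs one key plus one pointer.
fanout = PAGE // (KEY + PTR)

def tree_height(n, fanout=fanout):
    # Pages touched to locate one of n keys.
    return max(1, ceil(log(n, fanout)))

for n in (10**6, 10**9):
    print(n, tree_height(n))  # 3 pages for a million keys, 4 for a billion
```

With a fanout around 256, even a billion keys need only about four page reads -- which is why "you'd need one awfully big table" before the win over a simpler structure becomes visible.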