Hi,
What is being measured? Access time for files that already exist?
Creation of new files? Scanning the directory structure for a list
of existing files?
At a prior gig, we used to split a couple hundred thousand
encyclopedia articles up into a 12/34/56.xxx sort of path format. It worked
adequately for our needs--our batch-oriented processing was expected
to run overnight anyway--but my impression was that as long as the
filename was known, accessing file 12/34/56.xxx seemed quick,
whereas directory scans to enumerate the existing filenames were
pretty slow.
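For concreteness, here's a quick Python sketch (made-up names, not the
code we actually ran) of the kind of path splitting I mean:

    import os

    def article_path(article_id, root="articles", ext="xxx"):
        # Zero-pad the id to six digits and split it as 12/34/56.xxx,
        # so no single directory level holds more than 100 entries.
        s = "%06d" % article_id
        return os.path.join(root, s[0:2], s[2:4], s[4:6] + "." + ext)

    def write_article(article_id, data):
        path = article_path(article_id)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(data)

Given a known id, the open() goes straight to the file; enumerating
what's already there means walking the whole tree, which squares with
the scans being the slow part.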
I wanted to put a rough lower bound on the performance of the
approach, just to decide if it's worth pursuing at all. (Kirk
obviously thinks it is, but I don't know if the class of applications
that interests him is anything like the class that interests me.) So I
measured creation of 1 KB files and re-writing of existing 1 KB files.
(The first case involves creating an inode and data pages; the second
involves touching an existing inode and creating data pages.) I didn't
test reads of the files (which would involve touching an inode to
update the last-access time) because I didn't feel like optimizing the
test by remounting my filesystem with that last-access (atime) update
turned off. The test box was a
medium-powered Linux workstation with a 2.6 kernel, a single SATA
drive, and ext3 filesystems.
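The test itself was nothing fancy, roughly along these lines (a
sketch, not the actual script; it writes into a flat directory for
brevity, but the nested layout above would slot in the same way):

    import os, time

    def write_files(n, dirname, payload=b"x" * 1024):
        # First pass over fresh names creates inodes and data pages;
        # a second pass over the same names is the re-write case.
        os.makedirs(dirname, exist_ok=True)
        start = time.time()
        for i in range(n):
            with open(os.path.join(dirname, "%06d" % i), "wb") as f:
                f.write(payload)
        return time.time() - start

    print("create:  %.2f s" % write_files(10000, "/tmp/filetest"))
    print("rewrite: %.2f s" % write_files(10000, "/tmp/filetest"))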
I'd expect under normal conditions to get maybe 15 megabytes per
second of disk-write bandwidth from this system, although with tuning
I could probably get a lot more. But again, I was going for a smell
test here. This whole approach is attractive because it's easy to
code, so I'd want to use it for light-duty applications on non-tuned
hardware. For an application with more stringent requirements, I'd
make a different tradeoff, spend more development time, and probably
use a different approach.
Anyway, I first tried it on /dev/shm. That worked really nicely for
10,000 files: about 0.9 seconds consistently to create new files and
0.6 seconds to re-write them. The same test with 100,000 files totally
destabilized the machine. I didn't want to reboot it, so I waited.
Fifteen minutes later it was back. But what a strange journey
that must have been. Obviously this approach doesn't make a lot of
sense on a shm device anyway, but I had to know.
With a disk-based filesystem, things were a lot better. For 10,000
files, about 1.6 seconds to create them and 1.2 to rewrite them. Those
numbers were consistent across many trials. Similar results at 30,000
and 60,000 files; the times just scaled upward. At 100,000 files things got
screwy. The create-time got variable, ranging from 5 seconds to almost
15 seconds from run to run. During all the runs, the machine didn't
appear to destabilize and remained responsive. Obviously processor
loads were very low. I didn't make a thorough study of page faults and
swapping activity though. But notice that the implied throughput is an
interestingly high fraction of my notional channel bandwidth of 15
megabytes/sec. And the journalling FS means that I don't even think
about the movement of the R/W head inside the disk drive anymore. (Of
course that may matter a great deal on Windows, but if I'm using
Windows then *everything* about the project costs vastly more anyway,
so who cares?)
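(To put a number on that fraction, taking the files as roughly 1 KB of
payload each: 10,000 files in 1.6 seconds is about 6 MB/s, and the
100,000-file runs at 5 to 15 seconds come out between roughly 7 and 20
MB/s, so the better runs are already brushing against the whole 15
MB/s figure.)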
So I'm claiming without further study (and without trying to explain
the results) that the lower bound on performance is in the region of
5000 writes a second. That's just at the fuzzy edge of being worth
doing. I usually think in terms of a "budget" for any operation that
must be repeated for a continuously-running server application. In
general, I want to be able to do a bare minimum of 1000 "useful
things" per second on a sustained basis, on untuned hardware. ("Useful
things" might be dynamic web-pages generated, or guaranteed-delivery
messages processed, etc.) So this approach uses nearly 20% of my
budget. It's a big number. (Just to show how I apply this kind of
analysis: I never worry about adding a SHA-1 hash calculation to any
critical code path, because I know I can do 250,000 of those per
second without breaking a sweat.)
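(Working the budget arithmetic out: 5000 writes a second is 200
microseconds per write; a budget of 1000 useful things a second gives
each one a full millisecond, so one file write per useful thing eats
200 of those 1000 microseconds, hence the nearly-20%. The SHA-1 figure
is the same kind of back-of-the-envelope number; a throwaway check
along these lines, with a made-up 1 KB payload, is all it takes to peg
the rate on a given box:

    import hashlib, time

    def sha1_rate(payload=b"x" * 1024, trials=200000):
        # Hash the same buffer repeatedly; report hashes per second.
        start = time.time()
        for _ in range(trials):
            hashlib.sha1(payload).digest()
        return trials / (time.time() - start)

    print("SHA-1 hashes/sec: %.0f" % sha1_rate())

)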