the fastest way to create a directory

George Mpouras · Jul 15, 2013

Create a directory with all upper directories if missing.
it uses the minimum possible disk access and checks.

Mkdir_recursive('/some/dir/d1/d2') or die;

sub Mkdir_recursive
{
return 1 if $_[0] eq '' || -d $_[0];
Mkdir_recursive( $_[0] =~/^(.*?)[\\\/][^\\\/]+$/ ) || return undef;
mkdir $_[0] || return undef
}

George Mpouras · Jul 15, 2013

ÎŸ "Henry Law" ÎÎ³ÏÎ±ÏˆÎµ ÏƒÏ„Î¿ Î¼Î®Î½Ï…Î¼Î±

Create a directory with all upper directories if missing.

Use File:

ath.

no it is slow

Charlton Wilbur · Jul 15, 2013

"GM" == George Mpouras
GM> ÎŸ "Henry Law" ÎÎ³ÏÎ±ÏˆÎµ ÏƒÏ„Î¿ Î¼Î®Î½Ï…Î¼Î±
GM>

GM> Use File:

ath.

GM> no it is slow

If you create so many directories that the execution time of File:

ath
is greater than the debugging time of roll-your-own code, Perl is
probably the wrong langauge for the task.

Charlton

Tim McDaniel · Jul 15, 2013

Create a directory with all upper directories if missing.

On Linux-like systems,
system("mkdir", "-p", $dir);
Unless you're creating millions of directories per day, doing it with
a known way is far faster and far more reliable than trying to code
your own.

George Mpouras · Jul 15, 2013

yes there are about thousands dirs. try to undestand the code. looks simple
but it is not

George Mpouras · Jul 15, 2013

I have found that usingn Perl with clever code you can be faster than the
usual c
of cource if you write the same code with C it will be faster but then, the
deadlines will pass !

Peter J. Holzer · Jul 15, 2013

ÎŸ "Henry Law" ÎÎ³ÏÎ±ÏˆÎµ ÏƒÏ„Î¿ Î¼Î®Î½Ï…Î¼Î±

Use File:ath.

no it is slow

On my system it is exactly as fast as your version, creating 2*70644
directories.

Here is the test program:

#!/usr/bin/perl
use warnings;
use strict;
use File:

ath qw(make_path);
use Time::HiRes qw(time);

$| = 1;

my $n = $ARGV[0];
{
my $t0 = time;
for my $i (0 .. $n) {
for my $j (0 .. $n) {
for my $k (0 .. $n) {
make_path("t1/$i/$j/$k");
}
}
printf("%3d %9.6f\n", $i, time - $t0);
}
}
{
my $t0 = time;
for my $i (0 .. $n) {
for my $j (0 .. $n) {
for my $k (0 .. $n) {
make_path("t2/$k/$j/$i");
}
}
printf("%3d %9.6f\n", $i, time - $t0);
}
}
__END__

(second test program uses your Mkdir_recursive instead of make_path)

With $n = 40, The first loop takes about 20 seconds, the second about 24
seconds for both programs. As expected, they are completely disk-bound.
strace shows that they are also performing essentially the same system
calls.

hp

Peter J. Holzer · Jul 15, 2013

yes there are about thousands dirs. try to undestand the code. looks
simple but it is not

What? It's trivial. It also has a bug: It doesn't perform silently under
use warnings. Ok, so maybe it isn't trivial after all ...

hp

Tim McDaniel · Jul 15, 2013

yes there are about thousands dirs. try to undestand the code. looks simple
but it is not

It *IS* simple. Microsecond efficiency is highly unlikely to be a
problem for you, so you should, in almost all cases, so with the
reliable solution.

Peter J. Holzer · Jul 16, 2013

Quoth George Mpouras said:
Quoth George Mpouras said:

Create a directory with all upper directories if missing.
it uses the minimum possible disk access and checks.

Mkdir_recursive('/some/dir/d1/d2') or die;

sub Mkdir_recursive
{
return 1 if $_[0] eq '' || -d $_[0];
Mkdir_recursive( $_[0] =~/^(.*?)[\\\/][^\\\/]+$/ ) || return undef;
mkdir $_[0] || return undef
}

Click to expand...

You'd be better off calling mkdir blind and keying off $! if it fails.
That way you save a stat in the case where the creation succeeds.

That shouldn't make a noticeable difference. If the stat does cause any
disk accesses, those would also have been caused by the mkdir, and if it
doesn't (i.e. everything is already in the cache) the time for the stat
calls is completely swamped by the mkdir's.

To my surprise the second loop of my test program seems actually to be
a bit faster with a blind mkdir, but the difference is less than the
variability, so I'd need more runs to see if the difference is
significant.

hp

George Mpouras · Jul 17, 2013

Please write the code you test

George Mpouras · Jul 17, 2013

Please write the code you test

Peter J. Holzer · Jul 17, 2013

Please write the code you test

Please quote relevant parts of the postings you are replying to.

hp

Rainer Weikusat · Jul 17, 2013

Peter J. Holzer said:
Quoth George Mpouras said:

Create a directory with all upper directories if missing.
it uses the minimum possible disk access and checks.

Mkdir_recursive('/some/dir/d1/d2') or die;

sub Mkdir_recursive
{
return 1 if $_[0] eq '' || -d $_[0];
Mkdir_recursive( $_[0] =~/^(.*?)[\\\/][^\\\/]+$/ ) || return undef;
mkdir $_[0] || return undef
}

Click to expand...

You'd be better off calling mkdir blind and keying off $! if it fails.
That way you save a stat in the case where the creation succeeds.

Click to expand...

That shouldn't make a noticeable difference. If the stat does cause any
disk accesses, those would also have been caused by the mkdir, and if it
doesn't (i.e. everything is already in the cache) the time for the stat
calls is completely swamped by the mkdir's.

Both stat and mkdir are system calls and 'one system call' is going to
be faster than 'two system calls'. 'In certain situations' (eg, for
the FFS and UFS filesystems or one of the Linux ext? mounted with the
dirsync option), mkdir will be executed synchronously and then, it is
going to take a 'long' time (possibly, an infinitely/ arbitrary long
time for storage devices expected to fail routinely during normal
operation where the error handling method is 'retry until successful
or powered down' aka SSDs). But all these ought to be regarded as
corner case and "everything's in the cache" the general one and in
this case, Ben's suggestion is sensible if it is expected that
directories a more often created than not created.

I don't think I'd use recursion for that because an iterative
equivalent is still fairly simple, example

------------------
use Errno qw(EEXIST ENOTDIR);

sub mkdir_p
{
my ($cur, $next, $remain);

$remain = $_[0];
while ($remain) {
($next, $remain) = $remain =~ /^(\/*[^\/]*)(.*)/;
$cur .= $next;

mkdir($cur) or do {
return if $! != EEXIST;
next if $remain;

-d($cur) or $! = ENOTDIR, return;
};
}

return 1;
}

mkdir_p($ARGV[0]) or die("$!");
--------------------

NB: I didn't perform any benchmarks on this. But it works with
pathnames a la /////b/c///d/e//g while the original doesn't (OTOH, it
doesn't support MS-DOSE [German for 'tin'] but that's not my problem

.

Peter J. Holzer · Jul 17, 2013

Peter J. Holzer said:
Peter J. Holzer said:

Quoth George Mpouras <[email protected]>:
sub Mkdir_recursive
{
return 1 if $_[0] eq '' || -d $_[0];
Mkdir_recursive( $_[0] =~/^(.*?)[\\\/][^\\\/]+$/ ) || return undef;
mkdir $_[0] || return undef
}

You'd be better off calling mkdir blind and keying off $! if it fails.
That way you save a stat in the case where the creation succeeds.

Click to expand...

That shouldn't make a noticeable difference. If the stat does cause any
disk accesses, those would also have been caused by the mkdir, and if it
doesn't (i.e. everything is already in the cache) the time for the stat
calls is completely swamped by the mkdir's.

Click to expand...

Both stat and mkdir are system calls and 'one system call' is going to
be faster than 'two system calls'.

As stated, that's trivially untrue ('one call to exec(2)' will *not* be
faster than 'two calls to time(2)' except under pathological
circumstances), but even if I translate that into 'an additional system
call will take additional time', it's not necessarily true.

In this case I think that the stat system call would normally add a
little time but it will be completely swamped by the time spent in
mkdir:

Each successful mkdir call will cause at least 5 disk accesses on a
typical Linux file system: 1 for the journal, 2 for the inode and
content of the parent directory and 2 for the inode and content of the
new directory (Oh, I forgot the bitmaps, add another 2 or 4 ...). These
will happen *after* mkdir returns because of the writeback cache, and
the kernel will almost certainly succeed in coalescing at least some and
maybe many of those writes, but if you create a lot of directories
(George wrote about "thousands", in my tests I created about 150000)
these writes will eventually dominate.

Now, in addition to writing new blocks, where does the pair
stat($d); mkdir($d)
spend time?

If the ancestor directories of $d are in cache (that would be the normal
case), both stat and mkdir will walk exactly the same in-memory
structure until they fail to find $d. So, yes, that part will be
uselessly duplicated, but it's very fast compared to actually writing a
new directory to the disk, so the extra time is negligible.

If the ancestor directories of $d are not in cache, stat will load them
into the cache, which may take a noticable time. But that time will then
be saved by mkdir which can now use the cache instead of loading the
directories itself: So again the difference is one walk through
in-memory structures, which is insignificant compared to loading the
structures from disk and then writing a new directory (which will happen
anyway).

The ratios will be different depending on the relative speed of RAM and
storage: Maybe SSDs are fast enough that the additional walk through the
cache is noticable, but I doubt it. Of course anybody is free to post
benchmark results to prove me wrong.

hp

George Mpouras · Jul 17, 2013

your code is not good. It ALWAYS access the hard disk as many times as the
number of upper dirs .
This completly uneccassery .
To check ti with your own eyes run:

#!/usr/bin/perl
use Errno qw(EEXIST ENOTDIR);
my $access=0;
mkdir_p("/opt/d1/d2/d3/d4/d5") or die("$!");
print "disk touches : $access\n";

sub mkdir_p
{
my ($cur, $next, $remain);
$remain = $_[0];

while ($remain) {
($next, $remain) = $remain =~ /^(\/*[^\/]*)(.*)/;
$cur .= $next;

$access++;

mkdir($cur) or do {
return if $! != EEXIST;
next if $remain;

-d($cur) or $! = ENOTDIR, return;
};
}

return 1;
}

George Mpouras · Jul 17, 2013

ÎŸ "Peter J. Holzer" ÎÎ³ÏÎ±ÏˆÎµ ÏƒÏ„Î¿ Î¼Î®Î½Ï…Î¼Î±

yes there are about thousands dirs. try to undestand the code. looks
simple but it is not

What? It's trivial.

probably you understand nothing at all

Tim McDaniel · Jul 17, 2013

your code is not good. It ALWAYS access the hard disk as many times as the
number of upper dirs .

I think you don't understand disk caching. Any reasonable modern
system will cache data.

Rainer Weikusat · Jul 17, 2013

Peter J. Holzer said:
Peter J. Holzer said:

Quoth George Mpouras <[email protected]>:
sub Mkdir_recursive
{
return 1 if $_[0] eq '' || -d $_[0];
Mkdir_recursive( $_[0] =~/^(.*?)[\\\/][^\\\/]+$/ ) || return undef;
mkdir $_[0] || return undef
}

You'd be better off calling mkdir blind and keying off $! if it fails.
That way you save a stat in the case where the creation succeeds.

That shouldn't make a noticeable difference. If the stat does cause any
disk accesses, those would also have been caused by the mkdir, and if it
doesn't (i.e. everything is already in the cache) the time for the stat
calls is completely swamped by the mkdir's.

Click to expand...

Both stat and mkdir are system calls and 'one system call' is going to
be faster than 'two system calls'.

Click to expand...

As stated, that's trivially untrue ('one call to exec(2)' will *not* be
faster than 'two calls to time(2)' except under pathological
circumstances),

Yes. And a single call to pause(2) will take forever except 'under
pathological circumstances'. Another interesting (Linux-)system call would be
reboot because not only doesn't it ever return, it might even cause
the system to be powered down!

Morale: Thanks to inherent vagueness in natural language and the
sloppy ways people usually use it, anything can be interpreted in a
way which doesn't make sense.

but even if I translate that into 'an additional system
call will take additional time', it's not necessarily true.

What about trying the intended meaning: Calling stat and mkdir is
going to take more time than calling stat and not mkdir or mkdir and
not stat (the two system calls in question).

BTW: Should I immediately supply a fresh round of nonsensical
interpretations?

[...]

Each successful mkdir call will cause at least 5 disk accesses on a
typical Linux file system: 1 for the journal, 2 for the inode and
content of the parent directory and 2 for the inode and content of the
new directory (Oh, I forgot the bitmaps, add another 2 or 4 ...). These
will happen *after* mkdir returns because of the writeback cache, and
the kernel will almost certainly succeed in coalescing at least some and
maybe many of those writes, but if you create a lot of directories
(George wrote about "thousands", in my tests I created about 150000)
these writes will eventually dominate.

[more of this]

As I already tried to communicate in the earlier posting: 'Disk I/O'
is in itself a pathological situation or at least one the kernel tries
very hard to avoid. Since it is going to be slower than pretty much
everything else, talking about the execution speed of different
algorithms becomes completely meaningless then. Consequently, it
should be disregarded when doing this.

Rainer Weikusat · Jul 17, 2013

"George Mpouras"

your code is not good. It ALWAYS access the hard disk as many times as
the number of upper dirs .

Believe it or not but I understand how this algorithm works because I
wrote it. Anything like this depends heavily on what the expected
situation happens to be. Eg, if I run your code with an argument of
/tmp/a/b/c/d/e, assuming nothing except /tmp exists initially, it will
do the following system calls:

stat("/tmp/a/b/c/d/e", 0x602130) = -1 ENOENT (No such file or directory)
stat("/tmp/a/b/c/d", 0x602130) = -1 ENOENT (No such file or directory)
stat("/tmp/a/b/c", 0x602130) = -1 ENOENT (No such file or directory)
stat("/tmp/a/b", 0x602130) = -1 ENOENT (No such file or directory)
stat("/tmp/a", 0x602130) = -1 ENOENT (No such file or directory)
stat("/tmp", {st_mode=S_IFDIR|S_ISVTX|0777, st_size=94208, ...}) = 0
mkdir("/tmp/a", 0777) = 0
mkdir("/tmp/a/b", 0777) = 0
mkdir("/tmp/a/b/c", 0777) = 0
mkdir("/tmp/a/b/c/d", 0777) = 0
mkdir("/tmp/a/b/c/d/e", 0777) = 0

OTOH, the mkdir_p I posted will just do

mkdir("/tmp", 0777) = -1 EEXIST (File exists)
mkdir("/tmp/a", 0777) = 0
mkdir("/tmp/a/b", 0777) = 0
mkdir("/tmp/a/b/c", 0777) = 0
mkdir("/tmp/a/b/c/d", 0777) = 0
mkdir("/tmp/a/b/c/d/e", 0777) = 0

lest talk a litle more about directories	25	Jul 26, 2013
Finding all the links in a Unix file/directory path	3	May 12, 2009
Trying to build a SARIMAX model to forecast the S&P500 trend	0	Nov 5, 2023
Can someone pls help me with a little algorithm script	1	Nov 28, 2024
Fastest way to find a match?	10	Mar 12, 2008
Fastest way to calculate leading whitespace	19	May 8, 2010
Fastest way to detect a non-ASCII character in a list of strings.	2	Oct 17, 2010
I made a blockchain and want to make a cryptocurrency, but my code doesn't verify hash of each block	2	Jun 2, 2024

the fastest way to create a directory

George Mpouras

George Mpouras

Charlton Wilbur

Tim McDaniel

George Mpouras

George Mpouras

Peter J. Holzer

Peter J. Holzer

Tim McDaniel

Peter J. Holzer

George Mpouras

George Mpouras

Peter J. Holzer

Rainer Weikusat

Peter J. Holzer

George Mpouras

George Mpouras

Tim McDaniel

Rainer Weikusat

Rainer Weikusat

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads