The fastest way to create a directory


George Mpouras

Create a directory with all upper directories if missing.
It uses the minimum possible number of disk accesses and checks.




Mkdir_recursive('/some/dir/d1/d2') or die;

sub Mkdir_recursive
{
    return 1 if $_[0] eq '' || -d $_[0];
    Mkdir_recursive( $_[0] =~/^(.*?)[\\\/][^\\\/]+$/ ) || return undef;
    mkdir $_[0] || return undef
}
 

George Mpouras

Ο "Henry Law" έγÏαψε στο μήνυμα

Create a directory with all upper directories if missing.

Use File::Path.


No, it is slow.
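
For reference, the File::Path call being suggested looks roughly like this (a minimal sketch; make_path creates any missing intermediate directories, and the error-capturing option shown here is optional):

use File::Path qw(make_path);

# make_path creates the target directory together with any missing
# parent directories in a single call.
make_path('/some/dir/d1/d2', { error => \my $err });
die "make_path failed\n" if $err && @$err;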
 

Charlton Wilbur

"GM" == George Mpouras
GM> "Henry Law" wrote in the message:
GM>

GM> Use File::Path.

GM> No, it is slow.

If you create so many directories that the execution time of File::Path
is greater than the debugging time of roll-your-own code, Perl is
probably the wrong language for the task.

Charlton
 

Tim McDaniel

Create a directory with all upper directories if missing.

On Linux-like systems,
    system("mkdir", "-p", $dir);
Unless you're creating millions of directories per day, doing it in a
known way is far faster and far more reliable than trying to code
your own.
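
A slightly fuller sketch of that call with the usual exit-status check ($dir is assumed to hold the target path):

# mkdir -p creates all missing parent directories and succeeds
# quietly if the directory already exists.
system("mkdir", "-p", $dir) == 0
    or die "mkdir -p $dir failed: exit status " . ($? >> 8) . "\n";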
 

George Mpouras

Yes, there are thousands of dirs. Try to understand the code. It looks
simple but it is not.
 

George Mpouras

I have found that using Perl with clever code you can be faster than the
usual C.
Of course, if you write the same code in C it will be faster, but then
the deadlines will pass!
 

Peter J. Holzer

Ο "Henry Law" έγÏαψε στο μήνυμα



Use File::Path.


No, it is slow.

On my system it is exactly as fast as your version, creating 2*70644
directories.

Here is the test program:


#!/usr/bin/perl
use warnings;
use strict;
use File::Path qw(make_path);
use Time::HiRes qw(time);

$| = 1;

my $n = $ARGV[0];
{
    my $t0 = time;
    for my $i (0 .. $n) {
        for my $j (0 .. $n) {
            for my $k (0 .. $n) {
                make_path("t1/$i/$j/$k");
            }
        }
        printf("%3d %9.6f\n", $i, time - $t0);
    }
}
{
    my $t0 = time;
    for my $i (0 .. $n) {
        for my $j (0 .. $n) {
            for my $k (0 .. $n) {
                make_path("t2/$k/$j/$i");
            }
        }
        printf("%3d %9.6f\n", $i, time - $t0);
    }
}
__END__

(The second test program uses your Mkdir_recursive instead of make_path.)

With $n = 40, the first loop takes about 20 seconds, the second about 24
seconds for both programs. As expected, they are completely disk-bound.
strace shows that they are also performing essentially the same system
calls.

hp
 

Peter J. Holzer

Yes, there are thousands of dirs. Try to understand the code. It looks
simple but it is not.

What? It's trivial. It also has a bug: It doesn't perform silently under
use warnings. Ok, so maybe it isn't trivial after all ...

hp
 

Tim McDaniel

Yes, there are thousands of dirs. Try to understand the code. It looks
simple but it is not.

It *IS* simple. Microsecond efficiency is highly unlikely to be a
problem for you, so you should, in almost all cases, go with the
reliable solution.
 

Peter J. Holzer

Quoth George Mpouras:
Create a directory with all upper directories if missing.
It uses the minimum possible number of disk accesses and checks.

Mkdir_recursive('/some/dir/d1/d2') or die;

sub Mkdir_recursive
{
    return 1 if $_[0] eq '' || -d $_[0];
    Mkdir_recursive( $_[0] =~/^(.*?)[\\\/][^\\\/]+$/ ) || return undef;
    mkdir $_[0] || return undef
}

You'd be better off calling mkdir blind and keying off $! if it fails.
That way you save a stat in the case where the creation succeeds.

That shouldn't make a noticeable difference. If the stat does cause any
disk accesses, those would also have been caused by the mkdir, and if it
doesn't (i.e. everything is already in the cache) the time for the stat
calls is completely swamped by the mkdir's.

To my surprise, the second loop of my test program actually seems to be
a bit faster with a blind mkdir, but the difference is less than the
variability, so I'd need more runs to see whether the difference is
significant.
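
For reference, a minimal sketch of the "blind mkdir" variant being discussed (the function name is illustrative; EEXIST comes from the standard Errno module):

use Errno qw(EEXIST);

# Call mkdir first and only inspect $! when it fails; an
# already-existing directory counts as success.
sub mkdir_blind {
    my ($dir) = @_;
    return 1 if mkdir $dir;
    return 1 if $! == EEXIST && -d $dir;
    return undef;
}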

hp
 

Rainer Weikusat

Peter J. Holzer said:
Quoth George Mpouras:
Create a directory with all upper directories if missing.
It uses the minimum possible number of disk accesses and checks.

Mkdir_recursive('/some/dir/d1/d2') or die;

sub Mkdir_recursive
{
    return 1 if $_[0] eq '' || -d $_[0];
    Mkdir_recursive( $_[0] =~/^(.*?)[\\\/][^\\\/]+$/ ) || return undef;
    mkdir $_[0] || return undef
}

You'd be better off calling mkdir blind and keying off $! if it fails.
That way you save a stat in the case where the creation succeeds.

That shouldn't make a noticeable difference. If the stat does cause any
disk accesses, those would also have been caused by the mkdir, and if it
doesn't (i.e. everything is already in the cache) the time for the stat
calls is completely swamped by the mkdir's.

Both stat and mkdir are system calls and 'one system call' is going to
be faster than 'two system calls'. 'In certain situations' (e.g., for
the FFS and UFS filesystems, or one of the Linux ext? filesystems
mounted with the dirsync option), mkdir will be executed synchronously,
and then it is going to take a 'long' time (possibly an arbitrarily
long time for storage devices expected to fail routinely during normal
operation, where the error-handling method is 'retry until successful
or powered down', a.k.a. SSDs). But all of these ought to be regarded
as corner cases and "everything's in the cache" as the general one, and
in that case, Ben's suggestion is sensible if it is expected that
directories are more often created than not.

I don't think I'd use recursion for that because an iterative
equivalent is still fairly simple. Example:

------------------
use Errno qw(EEXIST ENOTDIR);

sub mkdir_p
{
    my ($cur, $next, $remain);

    $remain = $_[0];
    while ($remain) {
        ($next, $remain) = $remain =~ /^(\/*[^\/]*)(.*)/;
        $cur .= $next;

        mkdir($cur) or do {
            return if $! != EEXIST;
            next if $remain;

            -d($cur) or $! = ENOTDIR, return;
        };
    }

    return 1;
}

mkdir_p($ARGV[0]) or die("$!");
--------------------

NB: I didn't perform any benchmarks on this. But it works with
pathnames a la /////b/c///d/e//g while the original doesn't (OTOH, it
doesn't support MS-DOSE [German for 'tin'] but that's not my problem
:).
 

Peter J. Holzer

Peter J. Holzer said:
Quoth George Mpouras:
sub Mkdir_recursive
{
    return 1 if $_[0] eq '' || -d $_[0];
    Mkdir_recursive( $_[0] =~/^(.*?)[\\\/][^\\\/]+$/ ) || return undef;
    mkdir $_[0] || return undef
}

You'd be better off calling mkdir blind and keying off $! if it fails.
That way you save a stat in the case where the creation succeeds.

That shouldn't make a noticeable difference. If the stat does cause any
disk accesses, those would also have been caused by the mkdir, and if it
doesn't (i.e. everything is already in the cache) the time for the stat
calls is completely swamped by the mkdir's.

Both stat and mkdir are system calls and 'one system call' is going to
be faster than 'two system calls'.

As stated, that's trivially untrue ('one call to exec(2)' will *not* be
faster than 'two calls to time(2)' except under pathological
circumstances), but even if I translate that into 'an additional system
call will take additional time', it's not necessarily true.

In this case I think that the stat system call would normally add a
little time but it will be completely swamped by the time spent in
mkdir:

Each successful mkdir call will cause at least 5 disk accesses on a
typical Linux file system: 1 for the journal, 2 for the inode and
content of the parent directory and 2 for the inode and content of the
new directory (Oh, I forgot the bitmaps, add another 2 or 4 ...). These
will happen *after* mkdir returns because of the writeback cache, and
the kernel will almost certainly succeed in coalescing at least some and
maybe many of those writes, but if you create a lot of directories
(George wrote about "thousands", in my tests I created about 150000)
these writes will eventually dominate.

Now, in addition to writing new blocks, where does the pair
    stat($d); mkdir($d)
spend time?

If the ancestor directories of $d are in cache (that would be the normal
case), both stat and mkdir will walk exactly the same in-memory
structure until they fail to find $d. So, yes, that part will be
uselessly duplicated, but it's very fast compared to actually writing a
new directory to the disk, so the extra time is negligible.

If the ancestor directories of $d are not in cache, stat will load them
into the cache, which may take a noticeable time. But that time will then
be saved by mkdir which can now use the cache instead of loading the
directories itself: So again the difference is one walk through
in-memory structures, which is insignificant compared to loading the
structures from disk and then writing a new directory (which will happen
anyway).

The ratios will be different depending on the relative speed of RAM and
storage: Maybe SSDs are fast enough that the additional walk through the
cache is noticeable, but I doubt it. Of course anybody is free to post
benchmark results to prove me wrong.

hp
 

George Mpouras

Your code is not good. It ALWAYS accesses the hard disk as many times as
the number of upper dirs.
This is completely unnecessary.
To check it with your own eyes, run:



#!/usr/bin/perl
use Errno qw(EEXIST ENOTDIR);
my $access=0;
mkdir_p("/opt/d1/d2/d3/d4/d5") or die("$!");
print "disk touches : $access\n";


sub mkdir_p
{
    my ($cur, $next, $remain);
    $remain = $_[0];

    while ($remain) {
        ($next, $remain) = $remain =~ /^(\/*[^\/]*)(.*)/;
        $cur .= $next;

        $access++;

        mkdir($cur) or do {
            return if $! != EEXIST;
            next if $remain;

            -d($cur) or $! = ENOTDIR, return;
        };
    }

    return 1;
}
 

George Mpouras

Ο "Peter J. Holzer" έγÏαψε στο μήνυμα

Yes, there are thousands of dirs. Try to understand the code. It looks
simple but it is not.

What? It's trivial.


Probably you understand nothing at all.
 

Tim McDaniel

Your code is not good. It ALWAYS accesses the hard disk as many times as
the number of upper dirs.

I think you don't understand disk caching. Any reasonable modern
system will cache data.
 

Rainer Weikusat

Peter J. Holzer said:
Peter J. Holzer said:
Quoth George Mpouras:
sub Mkdir_recursive
{
    return 1 if $_[0] eq '' || -d $_[0];
    Mkdir_recursive( $_[0] =~/^(.*?)[\\\/][^\\\/]+$/ ) || return undef;
    mkdir $_[0] || return undef
}

You'd be better off calling mkdir blind and keying off $! if it fails.
That way you save a stat in the case where the creation succeeds.

That shouldn't make a noticeable difference. If the stat does cause any
disk accesses, those would also have been caused by the mkdir, and if it
doesn't (i.e. everything is already in the cache) the time for the stat
calls is completely swamped by the mkdir's.

Both stat and mkdir are system calls and 'one system call' is going to
be faster than 'two system calls'.

As stated, that's trivially untrue ('one call to exec(2)' will *not* be
faster than 'two calls to time(2)' except under pathological
circumstances),

Yes. And a single call to pause(2) will take forever except 'under
pathological circumstances'. Another interesting (Linux) system call
would be reboot, because not only does it never return, it might even
cause the system to be powered down!

Moral: Thanks to the inherent vagueness of natural language and the
sloppy ways people usually use it, anything can be interpreted in a
way which doesn't make sense.

but even if I translate that into 'an additional system
call will take additional time', it's not necessarily true.

What about trying the intended meaning: Calling stat and mkdir is
going to take more time than calling stat and not mkdir or mkdir and
not stat (the two system calls in question).

BTW: Should I immediately supply a fresh round of nonsensical
interpretations?

[...]
Each successful mkdir call will cause at least 5 disk accesses on a
typical Linux file system: 1 for the journal, 2 for the inode and
content of the parent directory and 2 for the inode and content of the
new directory (Oh, I forgot the bitmaps, add another 2 or 4 ...). These
will happen *after* mkdir returns because of the writeback cache, and
the kernel will almost certainly succeed in coalescing at least some and
maybe many of those writes, but if you create a lot of directories
(George wrote about "thousands", in my tests I created about 150000)
these writes will eventually dominate.

[more of this]

As I already tried to communicate in the earlier posting: 'Disk I/O'
is in itself a pathological situation or at least one the kernel tries
very hard to avoid. Since it is going to be slower than pretty much
everything else, talking about the execution speed of different
algorithms becomes completely meaningless then. Consequently, it
should be disregarded when doing this.
 

Rainer Weikusat

"George Mpouras"
Your code is not good. It ALWAYS accesses the hard disk as many times as
the number of upper dirs.

Believe it or not, I understand how this algorithm works because I
wrote it. Anything like this depends heavily on what the expected
situation happens to be. E.g., if I run your code with an argument of
/tmp/a/b/c/d/e, assuming nothing except /tmp exists initially, it will
make the following system calls:

stat("/tmp/a/b/c/d/e", 0x602130) = -1 ENOENT (No such file or directory)
stat("/tmp/a/b/c/d", 0x602130) = -1 ENOENT (No such file or directory)
stat("/tmp/a/b/c", 0x602130) = -1 ENOENT (No such file or directory)
stat("/tmp/a/b", 0x602130) = -1 ENOENT (No such file or directory)
stat("/tmp/a", 0x602130) = -1 ENOENT (No such file or directory)
stat("/tmp", {st_mode=S_IFDIR|S_ISVTX|0777, st_size=94208, ...}) = 0
mkdir("/tmp/a", 0777) = 0
mkdir("/tmp/a/b", 0777) = 0
mkdir("/tmp/a/b/c", 0777) = 0
mkdir("/tmp/a/b/c/d", 0777) = 0
mkdir("/tmp/a/b/c/d/e", 0777) = 0

OTOH, the mkdir_p I posted will just do

mkdir("/tmp", 0777) = -1 EEXIST (File exists)
mkdir("/tmp/a", 0777) = 0
mkdir("/tmp/a/b", 0777) = 0
mkdir("/tmp/a/b/c", 0777) = 0
mkdir("/tmp/a/b/c/d", 0777) = 0
mkdir("/tmp/a/b/c/d/e", 0777) = 0
 
