How to check for filetype existence quickly

fidokomik

I have a directory, say "c:\documents" on Windows or "/home/petr/
documents" on Linux. Many files with many filetypes (extensions) are
stored in this directory, say *.doc, *.txt, *.zip. I need to find the
fastest way to check whether a given filetype exists. Currently I use a
routine built on readdir() that returns true as soon as the passed
filetype is first found, but when there is a huge number of files and
the filetype is not present, the routine is very slow before returning
false.
Any idea?
 
Lars Eighner

In our last episode,
the lovely and talented fidokomik
broadcast on comp.lang.perl.misc:
I have a directory, say "c:\documents" on Windows or "/home/petr/
documents" on Linux. Many files with many filetypes (extensions) are
stored in this directory, say *.doc, *.txt, *.zip. I need to find the
fastest way to check whether a given filetype exists. Currently I use a
routine built on readdir() that returns true as soon as the passed
filetype is first found, but when there is a huge number of files and
the filetype is not present, the routine is very slow before returning
false.
Any idea?

Obviously no routine can say no such file exists until, in one way or
another, it has examined all of the files.

Did you try globbing to see if it is any faster:

$found = 0;
if (</usr/home/lars/saves/*.cgi>) {
    $found = 1;
}
 
RedGrittyBrick

fidokomik said:
I have a directory, say "c:\documents" on Windows or "/home/petr/
documents" on Linux. Many files with many filetypes (extensions) are
stored in this directory, say *.doc, *.txt, *.zip.

On Linux, file name extensions are not required and if present, may not
be a reliable guide to file type. `man file`.

For Windows, consider ADT, AFM, ALL etc in
http://en.wikipedia.org/wiki/List_of_file_formats_(alphabetical)

I guess you have control of these files and so the above isn't an issue
in this case.
I need to find the fastest way to check whether a given filetype
exists. Currently I use a routine built on readdir() that returns true
as soon as the passed filetype is first found, but when there is a huge
number of files and the filetype is not present, the routine is very
slow before returning false.
Any idea?

Create, maintain and use an index or cache? I'd use a hash, maybe backed
by DBM or suchlike. I'd schedule index updates or use the OS'
directory-change notification mechanism to ensure file additions,
deletions and renames get indexed.
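A minimal sketch of such an index: one readdir() pass builds a hash keyed by extension, and every later lookup is constant-time. The $dir default and the lowercasing are my assumptions; the DBM backing and change-notification wiring are left out.

```perl
use strict;
use warnings;

# Build the extension index in a single pass over the directory.
my $dir = shift // '.';
my %has_ext;
opendir my $dh, $dir or die "Cannot open $dir: $!";
while (my $name = readdir $dh) {
    $has_ext{lc $1} = 1 if $name =~ /\.(\w+)\z/;
}
closedir $dh;

# Each later lookup is O(1); rebuild the hash when the directory changes.
print $has_ext{doc} ? ".doc files present\n" : "no .doc files\n";
```

To persist the index between runs, the same hash could be tied to a DBM file via DB_File or similar.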
 
Justin C

I have a directory, say "c:\documents" on Windows or "/home/petr/
documents" on Linux. Many files with many filetypes (extensions) are
stored in this directory, say *.doc, *.txt, *.zip. I need to find the
fastest way to check whether a given filetype exists. Currently I use a
routine built on readdir() that returns true as soon as the passed
filetype is first found, but when there is a huge number of files and
the filetype is not present, the routine is very slow before returning
false.
Any idea?

ls | egrep doc\|txt\|zip

(you need to escape the 'or' operator, and use egrep instead of grep)

I know you want to use perl, but perl won't be as fast as this... unless
there are a very, very large[1] number of files.

Justin.

1. Depending on your concept of large.
 
Leon Timmermans

I have a directory, say "c:\documents" on Windows or "/home/petr/
documents" on Linux. Many files with many filetypes (extensions) are
stored in this directory, say *.doc, *.txt, *.zip. I need to find the
fastest way to check whether a given filetype exists. Currently I use a
routine built on readdir() that returns true as soon as the passed
filetype is first found, but when there is a huge number of files and
the filetype is not present, the routine is very slow before returning
false. Any idea?

To find a negative match, you will have to loop through the whole list;
that is unavoidable. However, if you find yourself testing the same
directory a number of times for different extensions, you could loop
through it once and save a list of the extensions you've found.

opendir my $dh, $basedir or die "Cannot open $basedir: $!";
my %is_found;
while (my $dirname = readdir $dh) {
    $dirname =~ / \. (\w+) \z /x or next;
    $is_found{$1}++;
}

for my $extension (qw/exe doc txt zip mp3/) {
    my $found = $is_found{$extension} ? "Found" : "Didn't find";
    print "$found $extension\n";
}

Regards,

Leon Timmermans
 
xhoster

fidokomik said:
I have a directory, say "c:\documents" on Windows or "/home/petr/
documents" on Linux. Many files with many filetypes (extensions) are
stored in this directory, say *.doc, *.txt, *.zip. I need to find the
fastest way to check whether a given filetype exists.

You will have to write a file system that is optimized for this (for
example, it stores directory information in some kind of tree based on the
reversed file name, so that extensions group together.) Then you have to
hack the operating system so that it can take advantage of the FS features.
Then you would have to hack perl so that it can take advantage of the OS
features.

Personally, I think I'd settle for something other than the fastest, and
just aim for good enough.

Xho

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.
 
cc96ai

if you have sub-directories,
you could use find():

use File::Find;
use File::Basename;

find(\&listfile, $dir);

sub listfile {
    if ( -f ) {
        my ($filename, $directories, $suffix) =
            fileparse($File::Find::name, qr/\.[^.]+\z/);
        # check the extension in $suffix
        ...
    }
}
 
Peter J. Holzer

ls | egrep doc\|txt\|zip

Please note that the OP is passed *one* filetype. So that would be

ls | egrep '\.doc$'

or

ls | egrep '\.txt$'

or

ls | egrep '\.zip$'

instead. Which can be simplified to "ls *.doc", etc. Which can be
simplified to a call to glob().
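In Perl that final simplification is a single glob() call in list context; the path below is illustrative.

```perl
use strict;
use warnings;

# List context: glob returns all matching names at once,
# so the test is simply whether the list is non-empty.
my @matches = glob('/usr/home/lars/saves/*.doc');
print @matches ? "found .doc\n" : "no .doc\n";
```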

(you need to escape the 'or' operator, and use egrep instead of grep)

I know you want to use perl, but perl won't be as fast as this... unless
there are a very, very large[1] number of files.

Perl is also likely to be faster for a small number of files - spawning
a shell which then spawns two other programs is not exactly a cheap
operation.

hp
 
fidokomik

Did you try globbing to see if it is any faster:

$found = 0;
if (</usr/home/lars/saves/*.cgi>) {
    $found = 1;
}
Hmm, easy and quick. Thank you Lars. But how can I pass a variable to
this while avoiding eval()? Is it possible?

The only thing I came up with is this:

my $searchfor = 'c:/images/*.jpg';
if (checkit($searchfor)) {do_something}
else {do_other}

sub checkit {
    return 1 if eval('<' . shift . '>');
    return 0;
}
 
Ben Morrow

Quoth fidokomik said:
Hmm, easy and quick. Thank you Lars. But how can I pass a variable to
this while avoiding eval()? Is it possible?

perldoc -f glob

glob is the function that underlies this meaning of <>, and it's
probably cleanest to simply call it directly. If you carefully read the
section on the <> operator in perldoc perlop, you will see that it is
possible to use variables in the glob form of <>, but rather tricky.

You may also want to look at the File::Glob or (as you appear to be on
Win32) the File::DosGlob extension, which implement other forms of
globbing.
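A sketch of the direct call with an interpolated variable, as suggested; the path and $extension are placeholders.

```perl
use strict;
use warnings;
use File::Glob qw(bsd_glob);   # glob() variant that copes with spaces in patterns

my $extension = 'jpg';
# Calling the function directly avoids both eval() and the tricky <> syntax.
my @files = bsd_glob("c:/images/*.$extension");
print scalar(@files), " .$extension file(s)\n";
```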

Ben
 
Jim Gibson

fidokomik said:
Hmm, easy and quick. Thank you Lars. But how can I pass a variable to
this while avoiding eval()? Is it possible?

The only thing I came up with is this:

my $searchfor = 'c:/images/*.jpg';
if (checkit($searchfor)) {do_something}
else {do_other}

sub checkit {
    return 1 if eval('<' . shift . '>');
    return 0;
}

Use the 'glob' function (untested):

return 1 if glob shift;
return 0;

Or possibly

return scalar glob shift;

See 'perldoc -f glob'
 
Tad J McClellan

fidokomik said:
Hmm, easy and quick. Thank you Lars. But how can I pass a variable to
this while avoiding eval()? Is it possible?


$extension = 'cgi';
if (</usr/home/lars/saves/*.$extension>){

Though that would be the bad kind of Lazy, IMO.

It makes it easier for the 1 programmer at the expense of making it
harder for the many readers/maintainers.

So I would instead write it for others rather than for myself:

if ( glob "/usr/home/lars/saves/*.$extension" ) {
 
xhoster

Globbing will go through all the files in the directory with no possibility
of stopping early. It won't be more than slightly faster on failure, and
will be substantially slower on success.


$extension = 'cgi';
if (</usr/home/lars/saves/*.$extension>){

Though that would be the bad kind of Lazy, IMO.

It makes it easier for the 1 programmer at the expense of making it
harder for the many readers/maintainers.

So I would instead write it for others rather than for myself:

if ( glob "/usr/home/lars/saves/*.$extension" ) {

Even this is bad. The glob is being executed in scalar context, so it
doesn't reset itself; it iterates, and the next time it gets invoked
(assuming the if is in a loop, or in a subroutine which gets called from
a loop), the new value of $extension is not even inspected unless the
old iterator has exhausted itself.


if ( () = glob "/usr/home/lars/saves/*.$extension" ) {
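The effect of that empty-list assignment shows up when the same glob expression runs more than once, as in a loop over extensions (a sketch, globbing the current directory):

```perl
use strict;
use warnings;

# The empty-list assignment forces list context, so glob returns the
# full result set on every call instead of iterating one name per call,
# and the freshly interpolated $extension is honoured each time.
for my $extension (qw/doc txt zip/) {
    if ( () = glob "*.$extension" ) {
        print "found .$extension\n";
    } else {
        print "no .$extension\n";
    }
}
```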


Xho

 
Justin C

Please note that the OP is passed *one* filetype. So that would be
Well spotted. Thanks for pointing it out.

[snip]
I know you want to use perl, but perl won't be as fast as this... unless
there are a very, very large[1] number of files.

Perl is also likely to be faster for a small number of files - spawning
a shell which then spawns two other programs is not exactly a cheap
operation.

Gasp! You mean, you don't *always* have a TERM to hand?! :) I see
what you mean, I was just thinking quick and dirty, and not "write once,
use many"... which is a habit I'm trying to cultivate.

Justin.
 
Bart Lateur

Globbing will go through all the files in the directory with no possibility
of stopping early.

No it won't. glob in scalar context is an iterator, it'll return the
first matching file, or undef on failure.
It won't be more than slightly faster on failure, and
will be substantially slower on success.

Define "substantially". The only reason to return early is when a lot of
files match, and there are only a few files of a different kind.
 
xhoster

Bart Lateur said:
No it won't.

Yes it will. This has been done to death here lately.
glob in scalar context is an iterator, it'll return the
first matching file, or undef on failure.

It might *return* only the first matching file, but it does so only after
it goes through all of them.

Xho

 
Peter J. Holzer

Please note that the OP is passed *one* filetype. So that would be
Well spotted. Thanks for pointing it out.

[snip]
I know you want to use perl, but perl won't be as fast as this... unless
there are a very, very large[1] number of files.

Perl is also likely to be faster for a small number of files - spawning
a shell which then spawns two other programs is not exactly a cheap
operation.

Gasp! You mean, you don't *always* have a TERM to hand?! :)

*I* may have, but the scripts I write don't. They often run as cron
jobs, or web applications, or whatever. But in this case the presence or
absence of a terminal is irrelevant: qx(ls | egrep something) has
exactly the same work to do whether there is a terminal or not.
I see what you mean, I was just thinking quick and dirty, and not
"write once, use many"... which is a habit I'm trying to cultivate.

You were arguing on performance grounds. Performance and "quick and
dirty" usually don't mix well.

hp
 
Jürgen Exner

fidokomik said:
I have a directory, say "c:\documents" on Windows or "/home/petr/
documents" on Linux. Many files with many filetypes (extensions) are
stored in this directory, say *.doc, *.txt, *.zip. I need to find the
fastest way to check whether a given filetype exists.

I would simply use a glob and check if it returns any results:

if (<*.doc>) {
    print ".doc file(s) found\n";
}

Only downside: it will read the directory in full even if the very first
file is a match.
The only way I know to avoid that is to do what you are doing already:
loop through the directory using readdir() and abort as soon as the
first match is found.
Which of these is faster in your environment you will have to
benchmark.

jue
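The readdir()-with-early-abort half of that comparison can be sketched as follows; ext_exists is a hypothetical helper name.

```perl
use strict;
use warnings;

# Returns true as soon as one file with the extension is seen;
# only a full scan can prove the negative.
sub ext_exists {
    my ($dir, $ext) = @_;
    opendir my $dh, $dir or die "Cannot open $dir: $!";
    while (defined(my $name = readdir $dh)) {
        if ($name =~ /\.\Q$ext\E\z/i) {
            closedir $dh;
            return 1;          # early exit on the first match
        }
    }
    closedir $dh;
    return 0;
}

print ext_exists('.', 'doc') ? "found\n" : "not found\n";
```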
 
