How to check for filetype existence quickly

fidokomik

I have a directory, say "c:\documents" on Windows or "/home/petr/
documents" on Linux. Many files with many filetypes (extensions) are
stored in this directory, say *.doc, *.txt, *.zip. I need to find the
fastest way to check whether a given filetype exists. Currently I use a
routine built on readdir() that returns true as soon as the passed
filetype is first found, but when there is a huge number of files and
the filetype is not present, the routine is very slow before returning
false.
Any idea?
 
Lars Eighner

In our last episode,
the lovely and talented fidokomik
broadcast on comp.lang.perl.misc:
I have a directory, say "c:\documents" on Windows or "/home/petr/
documents" on Linux. Many files with many filetypes (extensions) are
stored in this directory, say *.doc, *.txt, *.zip. I need to find the
fastest way to check whether a given filetype exists. Currently I use a
routine built on readdir() that returns true as soon as the passed
filetype is first found, but when there is a huge number of files and
the filetype is not present, the routine is very slow before returning
false.
Any idea?

Obviously no routine can say no such file exists until, in one way or
another, it has examined all of the files.

Did you try globbing to see if it is any faster:

$found = 0;
if (</usr/home/lars/saves/*.cgi>) {
    $found = 1;
}
 
RedGrittyBrick

fidokomik said:
I have a directory, say "c:\documents" on Windows or "/home/petr/
documents" on Linux. Many files with many filetypes (extensions) are
stored in this directory, say *.doc, *.txt, *.zip.

On Linux, file name extensions are not required and if present, may not
be a reliable guide to file type. `man file`.

For Windows, consider ADT, AFM, ALL etc in
http://en.wikipedia.org/wiki/List_of_file_formats_(alphabetical)

I guess you have control of these files and so the above isn't an issue
in this case.
I need to find the fastest way to check whether a given filetype
exists. Currently I use a routine built on readdir() that returns true
as soon as the passed filetype is first found, but when there is a huge
number of files and the filetype is not present, the routine is very
slow before returning false.
Any idea?

Create, maintain and use an index or cache? I'd use a hash, maybe backed
by DBM or suchlike. I'd schedule index updates or use the OS'
directory-change notification mechanism to ensure file additions,
deletions and renames get indexed.
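A minimal sketch of such an index: one readdir() pass builds a hash keyed by extension, and every later lookup is constant-time. The $dir default and the lowercasing are my assumptions; the DBM backing and change-notification wiring are left out.

```perl
use strict;
use warnings;

# Build the extension index in a single pass over the directory.
my $dir = shift // '.';
my %has_ext;
opendir my $dh, $dir or die "Cannot open $dir: $!";
while (my $name = readdir $dh) {
    $has_ext{lc $1} = 1 if $name =~ /\.(\w+)\z/;
}
closedir $dh;

# Each later lookup is O(1); rebuild the hash when the directory changes.
print $has_ext{doc} ? ".doc files present\n" : "no .doc files\n";
```

To persist the index between runs, the same hash could be tied to a DBM file via DB_File or similar.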
 
Justin C

I have a directory, say "c:\documents" on Windows or "/home/petr/
documents" on Linux. Many files with many filetypes (extensions) are
stored in this directory, say *.doc, *.txt, *.zip. I need to find the
fastest way to check whether a given filetype exists. Currently I use a
routine built on readdir() that returns true as soon as the passed
filetype is first found, but when there is a huge number of files and
the filetype is not present, the routine is very slow before returning
false.
Any idea?

ls | egrep doc\|txt\|zip

(you need to escape the 'or' operator, and use egrep instead of grep)

I know you want to use perl, but perl won't be as fast as this... unless
there are a very, very large[1] number of files.

Justin.

1. Depending on your concept of large.
 
Leon Timmermans

I have a directory, say "c:\documents" on Windows or "/home/petr/
documents" on Linux. Many files with many filetypes (extensions) are
stored in this directory, say *.doc, *.txt, *.zip. I need to find the
fastest way to check whether a given filetype exists. Currently I use a
routine built on readdir() that returns true as soon as the passed
filetype is first found, but when there is a huge number of files and
the filetype is not present, the routine is very slow before returning
false. Any idea?

To find a negative match, you will have to loop through the whole list;
that is unavoidable. However, if you find yourself testing the same
directory a number of times for different extensions, you could loop
through it once and save a list of the extensions you've found.

opendir my $dh, $basedir or die "Cannot open $basedir: $!";
my %is_found;
while (my $dirname = readdir $dh) {
    $dirname =~ / \. (\w+) \z /x or next;
    $is_found{$1}++;
}

for my $extension (qw/exe doc txt zip mp3/) {
    my $found = $is_found{$extension} ? "Found" : "Didn't find";
    print "$found $extension\n";
}

Regards,

Leon Timmermans
 
xhoster

fidokomik said:
I have a directory, say "c:\documents" on Windows or "/home/petr/
documents" on Linux. Many files with many filetypes (extensions) are
stored in this directory, say *.doc, *.txt, *.zip. I need to find the
fastest way to check whether a given filetype exists.

You will have to write a file system that is optimized for this (for
example, it stores directory information in some kind of tree based on the
reversed file name, so that extensions group together.) Then you have to
hack the operating system so that it can take advantage of the FS features.
Then you would have to hack perl so that it can take advantage of the OS
features.

Personally, I think I'd settle for something other than the fastest, and
just aim for good enough.

Xho

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.
 
cc96ai

if you have sub-directories,
you could use find():

use File::Find;
use File::Basename;

find(\&listfile, $dir);

sub listfile {
    if ( -f ) {
        my ($filename, $directories, $suffix) =
            fileparse($File::Find::name, qr/\.[^.]+\z/);
        # check the extension in $suffix
        ...
    }
}
 
Peter J. Holzer

ls | egrep doc\|txt\|zip

Please note that the OP is passed *one* filetype. So that would be

ls | egrep '\.doc$'

or

ls | egrep '\.txt$'

or

ls | egrep '\.zip$'

instead. Which can be simplified to "ls *.doc", etc. Which can be
simplified to a call to glob().
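In Perl that final simplification is a single glob() call in list context; the path below is illustrative.

```perl
use strict;
use warnings;

# List context: glob returns all matching names at once,
# so the test is simply whether the list is non-empty.
my @matches = glob('/usr/home/lars/saves/*.doc');
print @matches ? "found .doc\n" : "no .doc\n";
```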

(you need to escape the 'or' operator, and use egrep instead of grep)

I know you want to use perl, but perl won't be as fast as this... unless
there are a very, very large[1] number of files.

Perl is also likely to be faster for a small number of files - spawning
a shell which then spawns two other programs is not exactly a cheap
operation.

hp
 
fidokomik

Did you try globbing to see if it is any faster:

$found = 0;
if (</usr/home/lars/saves/*.cgi>) {
    $found = 1;
}
Hmm, easy and quick. Thank you Lars. But how can I pass a variable to
this while avoiding eval()? Is it possible?

The only thing I came up with is this:

my $searchfor = 'c:/images/*.jpg';
if (checkit($searchfor)) {do_something}
else {do_other}

sub checkit {
    return 1 if eval('<' . shift . '>');
    return 0;
}
 
Ben Morrow

Quoth fidokomik said:
Hmm, easy and quick. Thank you Lars. But how can I pass a variable to
this while avoiding eval()? Is it possible?

perldoc -f glob

glob is the function that underlies this meaning of <>, and it's
probably cleanest to simply call it directly. If you carefully read the
section on the <> operator in perldoc perlop, you will see that it is
possible to use variables in the glob form of <>, but rather tricky.

You may also want to look at the File::Glob or (as you appear to be on
Win32) the File::DosGlob extension, which implement other forms of
globbing.
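A sketch of the direct call with an interpolated variable, as suggested; the path and $extension are placeholders.

```perl
use strict;
use warnings;
use File::Glob qw(bsd_glob);   # glob() variant that copes with spaces in patterns

my $extension = 'jpg';
# Calling the function directly avoids both eval() and the tricky <> syntax.
my @files = bsd_glob("c:/images/*.$extension");
print scalar(@files), " .$extension file(s)\n";
```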

Ben
 
Jim Gibson

fidokomik said:
Hmm, easy and quick. Thank you Lars. But how can I pass a variable to
this while avoiding eval()? Is it possible?

The only thing I came up with is this:

my $searchfor = 'c:/images/*.jpg';
if (checkit($searchfor)) {do_something}
else {do_other}

sub checkit {
    return 1 if eval('<' . shift . '>');
    return 0;
}

Use the 'glob' function (untested):

return 1 if glob shift;
return 0;

Or possibly

return scalar glob shift;

See 'perldoc -f glob'
 
Tad J McClellan

fidokomik said:
Hmm, easy and quick. Thank you Lars. But how can I pass a variable to
this while avoiding eval()? Is it possible?


$extension = 'cgi';
if (</usr/home/lars/saves/*.$extension>){

Though that would be the bad kind of Lazy, IMO.

It makes it easier for the 1 programmer at the expense of making it
harder for the many readers/maintainers.

So I would instead write it for others rather than for myself:

if ( glob "/usr/home/lars/saves/*.$extension" ) {
 
xhoster

Globbing will go through all the files in the directory with no possibility
of stopping early. It won't be more than slightly faster on failure, and
will be substantially slower on success.


$extension = 'cgi';
if (</usr/home/lars/saves/*.$extension>){

Though that would be the bad kind of Lazy, IMO.

It makes it easier for the 1 programmer at the expense of making it
harder for the many readers/maintainers.

So I would instead write it for others rather than for myself:

if ( glob "/usr/home/lars/saves/*.$extension" ) {

Even this is bad. The glob is being executed in scalar context, so it
doesn't reset itself; it iterates, and the next time it gets invoked
(assuming the if is in a loop, or in a subroutine which gets called from
a loop), the new value of $extension is not even inspected unless the
old iterator has exhausted itself.


if ( () = glob "/usr/home/lars/saves/*.$extension" ) {
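The effect of that empty-list assignment shows up when the same glob expression runs more than once, as in a loop over extensions (a sketch, globbing the current directory):

```perl
use strict;
use warnings;

# The empty-list assignment forces list context, so glob returns the
# full result set on every call instead of iterating one name per call,
# and the freshly interpolated $extension is honoured each time.
for my $extension (qw/doc txt zip/) {
    if ( () = glob "*.$extension" ) {
        print "found .$extension\n";
    } else {
        print "no .$extension\n";
    }
}
```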


Xho

 
Justin C

Please note that the OP is passed *one* filetype. So that would be
Well spotted. Thanks for pointing it out.

[snip]
I know you want to use perl, but perl won't be as fast as this... unless
there are a very, very large[1] number of files.

Perl is also likely to be faster for a small number of files - spawning
a shell which then spawns two other programs is not exactly a cheap
operation.

Gasp! You mean, you don't *always* have a TERM to hand?! :) I see
what you mean, I was just thinking quick and dirty, and not "write once,
use many"... which is a habit I'm trying to cultivate.

Justin.
 
Bart Lateur

Globbing will go through all the files in the directory with no possibility
of stopping early.

No it won't. glob in scalar context is an iterator, it'll return the
first matching file, or undef on failure.
It won't be more than slightly faster on failure, and
will be substantially slower on success.

Define "substantially". The only reason to return early is when a lot of
files match, and there are only a few files of a different kind.
 
xhoster

Bart Lateur said:
No it won't.

Yes it will. This has been done to death here lately.
glob in scalar context is an iterator, it'll return the
first matching file, or undef on failure.

It might *return* only the first matching file, but it does so only after
it goes through all of them.

Xho

 
Peter J. Holzer

Please note that the OP is passed *one* filetype. So that would be
Well spotted. Thanks for pointing it out.

[snip]
I know you want to use perl, but perl won't be as fast as this... unless
there are a very, very large[1] number of files.

Perl is also likely to be faster for a small number of files - spawning
a shell which then spawns two other programs is not exactly a cheap
operation.

Gasp! You mean, you don't *always* have a TERM to hand?! :)

*I* may have, but the scripts I write don't. They often run as cron
jobs, or web applications, or whatever. But in this case the presence or
absence of a terminal is irrelevant: qx(ls | egrep something) has
exactly the same work to do whether there is a terminal or not.
I see what you mean, I was just thinking quick and dirty, and not
"write once, use many"... which is a habit I'm trying to cultivate.

You were arguing on performance grounds. Performance and "quick and
dirty" usually don't mix well.

hp
 
Jürgen Exner

fidokomik said:
I have a directory, say "c:\documents" on Windows or "/home/petr/
documents" on Linux. Many files with many filetypes (extensions) are
stored in this directory, say *.doc, *.txt, *.zip. I need to find the
fastest way to check whether a given filetype exists.

I would simply use a glob and check if it returns any results:

if (<*.doc>) {
    print ".doc file(s) found\n";
}

Only downside: it will read the directory in full even if the very first
file is a match.
The only way I know to avoid that is to do what you are doing already:
loop through the directory using readdir() and abort as soon as the
first match is found.
Which of these is faster in your environment you will have to
benchmark.

jue
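The readdir()-with-early-abort half of that comparison can be sketched as follows; ext_exists is a hypothetical helper name.

```perl
use strict;
use warnings;

# Returns true as soon as one file with the extension is seen;
# only a full scan can prove the negative.
sub ext_exists {
    my ($dir, $ext) = @_;
    opendir my $dh, $dir or die "Cannot open $dir: $!";
    while (defined(my $name = readdir $dh)) {
        if ($name =~ /\.\Q$ext\E\z/i) {
            closedir $dh;
            return 1;          # early exit on the first match
        }
    }
    closedir $dh;
    return 0;
}

print ext_exists('.', 'doc') ? "found\n" : "not found\n";
```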
 
