hash table usage questions


freesoft12

Hi,

I am storing a large number of file paths into a hash table's keys (to
avoid duplicate paths) with well-known extensions
like .cc, .cpp, .h, .hpp. If any of the paths is a symbolic link then
the link is stored in the value field.

My questions are:

1) Is a custom data structure better than using a hash to store the
file paths?

2) I want to remove some of the files from the hash table that don't
match a regular expression (say I am only interested in *.cc files)
a) Is there a smart way to apply this regular expression on the
hash table? My current solution iterates over each item in the hash
table and then stores the keys that don't match the regex in a
separate list. I then iterate over that list and remove each key from
the hash table.
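
[In code, the two-pass approach described above looks roughly like this; the hash contents and pattern are invented for illustration:]

```perl
# Hypothetical contents: path => symlink target (or undef).
my %paths = (
    'src/a.cc'  => undef,
    'src/b.cpp' => undef,
    'inc/c.h'   => '/usr/include/c.h',
);

# Pass 1: collect the keys that do NOT match the pattern of interest.
my @doomed;
for my $path (keys %paths) {
    push @doomed, $path unless $path =~ /\.cc$/;
}

# Pass 2: remove them from the hash.
delete $paths{$_} for @doomed;

print join(",", sort keys %paths), "\n";   # only src/a.cc is left
```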

3) Does Perl allocate new memory if I were to copy the keys (paths) in
the hash table into a list or is a reference just copied?

Regards
John
 

sln

> Hi,
>
> I am storing a large number of file paths into a hash table's keys (to
> avoid duplicate paths) with well-known extensions
> like .cc, .cpp, .h,.hpp. If any of the paths is a symbolic link then
> the link is stored in the value field.
>
> My questions are:
>
> 1) Is a custom data structure better than using a hash to store the
> file paths?
>
> 2) I want to remove some of the files from the hash table that don't
> match a regular expression (say I am only interested in *.cc files)
> a) Is there a smart way to apply this regular expression on the
> hash table? My current solution iterates over each item in the hash
> table and then stores the keys that don't match the regex in a
> separate list. I then iterate over that list and remove each key from
> the hash table.
>
> 3) Does Perl allocate new memory if I were to copy the keys (paths) in
> the hash table into a list or is a reference just copied?
  ^^^^^^^^^^^^^
>
> Regards
> John

Why don't you clearly state what your trying to do instead of grabbing
straws and spewing all the buzzwords in the book.

You obviously need to learn Perl from the beginner position.
You seem to wan't somebody to not only write the code for you, but
provide documentation. As it is now, you exhibit knowledge below what is
necessary to understand a solution should one be provided.

Not alot of people want to do your work and not get paid for it.
Can you do my work for me?

sln
 

Tim Greer

> Hi,
>
> I am storing a large number of file paths into a hash table's keys (to
> avoid duplicate paths) with well-known extensions
> like .cc, .cpp, .h,.hpp. If any of the paths is a symbolic link then
> the link is stored in the value field.
>
> My questions are:
>
> 1) Is a custom data structure better than using a hash to store the
> file paths?

By the sound of it, hashes might be fine. How are you originally
gathering the data to process (to obtain the paths), and exactly how
many are there? To avoid duplicates, hashes can be great for checking
that sort of thing.

> 2) I want to remove some of the files from the hash table that don't
> match a regular expression (say I am only interested in *.cc files)

You can easily do that.

> a) Is there a smart way to apply this regular expression on the
> hash table?

You'd probably want to determine if it's something you want before
adding to the hash in the first place. How are you going about
creating/populating the hash?

> My current solution iterates over each item in the hash
> table and then stores the keys that don't match the regex in a
> separate list. I then iterate over that list and remove each key from
> the hash table.

Can you provide the relevant portions of your code? You probably don't
need to iterate over anything, especially if you have a hash value
saved or rejected based upon the logistical conditions you mentioned
with an example above.

> 3) Does Perl allocate new memory if I were to copy the keys (paths) in
> the hash table into a list or is a reference just copied?

By the sound of it, you shouldn't need to store anything you don't want
to, or copy any keys, but I may have misunderstood? If you have
duplicate keys/another hash, it will use that much more memory. Maybe
I don't understand what you're asking?
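
[On question 3: `keys %h` returns copies of the key strings -- independent values, not aliases into the hash -- which a small sketch with invented data shows:]

```perl
my %h = ('a.cc' => 1, 'b.h' => 2);

# keys() returns copies of the key strings...
my @copied = keys %h;

# ...so mangling the copies leaves the hash itself untouched.
$_ .= '.bak' for @copied;

print join(",", sort keys %h), "\n";   # still a.cc,b.h
```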
 

Tad J McClellan

> I am storing a large number of file paths into a hash table's keys (to
> avoid duplicate paths)
>
> 2) I want to remove some of the files from the hash table that don't
> match a regular expression (say I am only interested in *.cc files)
> a) Is there a smart way to apply this regular expression on the
> hash table?


%h = map /\.cc$/
         ? ($_, $h{$_})
         : (),
     keys %h;
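
[Dropped into a tiny runnable script (sample data invented), the rebuild-the-hash idiom behaves like this:]

```perl
my %h = ('x.cc' => undef, 'y.h' => '/lib/y.h', 'z.cc' => undef);

# map's ternary emits a (key, value) pair for keepers and an empty
# list for rejects, so assigning the result rebuilds %h filtered.
%h = map /\.cc$/ ? ($_, $h{$_}) : (), keys %h;

print join(",", sort keys %h), "\n";   # x.cc,z.cc
```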
 

Uri Guttman

TJM> %h = map /\.cc$/
TJM> ? ($_, $h{$_})
TJM> : (),
TJM> keys %h;

delete @h{ grep !/\.cc$/, keys %h } ;

that's a bit simpler IMO and definitely should be faster. it also uses
delete with a hash slice which is a combo that should be more well
known.
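
[A self-contained demonstration of the delete-plus-hash-slice combo, with invented sample paths:]

```perl
my %h = (
    'foo.cc'  => undef,
    'bar.cpp' => undef,
    'baz.h'   => '/real/baz.h',   # pretend this one is a symlink
);

# grep picks the keys to drop; the @h{...} slice addresses them all
# at once; delete removes them in a single statement.
delete @h{ grep !/\.cc$/, keys %h };

print join(",", sort keys %h), "\n";   # foo.cc
```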

uri
 

Ted Zlatanov

fc> I am storing a large number of file paths into a hash table's keys
fc> (to avoid duplicate paths) with well-known extensions like .cc,
fc> .cpp, .h,.hpp. If any of the paths is a symbolic link then the link
fc> is stored in the value field.

By "large" do you mean thousands (A) or millions (B)?

fc> My questions are:

fc> 1) Is a custom data structure better than using a hash to store the
fc> file paths?

A: no

B: yes, consider nested hash tables with one level per directory. You
can also use SQLite to manage the data in a single DB file.
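
[A minimal sketch of the nested layout Ted describes -- one hash level per directory component. The paths are invented and the split assumes Unix-style separators:]

```perl
# Build a tree: each directory is a nested hash, each file a leaf.
my %tree;
for my $path ('src/core/a.cc', 'src/util/b.cc', 'include/c.h') {
    my @dirs = split m{/}, $path;
    my $file = pop @dirs;
    my $node = \%tree;
    $node = $node->{$_} ||= {} for @dirs;   # autovivify one level per dir
    $node->{$file} = undef;                 # leaf: symlink target or undef
}

# The full path 'src/core/a.cc' is now reachable as:
print exists $tree{src}{core}{'a.cc'} ? "found\n" : "missing\n";
```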

fc> 2) I want to remove some of the files from the hash table that don't
fc> match a regular expression (say I am only interested in *.cc files)
fc> a) Is there a smart way to apply this regular expression on the hash
fc> table? My current solution iterates over each item in the hash table
fc> and then stores the keys that don't match the regex in a separate
fc> list. I then iterate over that list and remove each key from the
fc> hash table.

A: use the solutions others have posted

B: you'll need a function to walk the nested hash tables and call a
check function for each entry. Accumulate the results into a temporary
list and delete it (if you worry that the temporary list will grow too
large, delete the entries in place). With SQLite this is a trivial SQL
statement.
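
[For the nested case, the walk-and-delete might be sketched like this; the layout and names are invented:]

```perl
# Recursively delete leaves whose names fail the check, pruning
# directory nodes that end up empty.
sub prune {
    my ($node, $keep) = @_;
    for my $name (keys %$node) {            # keys() snapshots the list,
        if (ref $node->{$name} eq 'HASH') { # so deleting here is safe
            prune($node->{$name}, $keep);
            delete $node->{$name} unless %{ $node->{$name} };
        }
        else {
            delete $node->{$name} unless $name =~ $keep;
        }
    }
}

my %tree = (
    src => { 'a.cc' => undef, 'b.cpp' => undef },
    doc => { 'readme.txt' => undef },
);
prune(\%tree, qr/\.cc$/);
# %tree now holds only src => { 'a.cc' => undef }
```

[With SQLite the same filtering would indeed collapse to one statement, e.g. `DELETE FROM files WHERE path NOT GLOB '*.cc'` against a hypothetical `files` table.]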

Ted
 

Ted Zlatanov

On Tue, 30 Dec 2008 01:34:43 GMT (e-mail address removed) wrote:

s> Why don't you clearly state what your trying to do instead of
s> grabbing straws and spewing all the buzzwords in the book.

There was nothing in the OP's questions that warranted your rudeness.

s> Not alot of people want to do your work and not get paid for it. Can
s> you do my work for me?

You, apparently, assume your time and intelligence are too precious to
waste helping people for free. This is not the right forum for you.

Ted
 

freesoft12

Thanks for all your code suggestions! I will give each of them a try!


Regards
John
 

freesoft12

> You obviously need to learn Perl from the beginner position.
> You seem to wan't somebody to not only write the code for you, but
> provide documentation. As it is now, you exhibit knowledge below what is
> necessary to understand a solution should one be provided.
>
> Not alot of people want to do your work and not get paid for it.
> Can you do my work for me?
>
> sln

Yes, I am a beginner. I am trying to learn Perl by reading and asking
for advice from intermediate & advanced users. Rather than just asking
questions, I posted the Perl script that I created after receiving a
suggestion from one of the answers to one of my previous posts.

Your asinine answers to my questions and to other people's posts to
this newsgroup (I checked) spoil the great work done by others in
teaching Perl to beginners.

I am going to recommend to this group's moderator that they cancel
your membership. Your attitude is detrimental to beginners learning
Perl from this newsgroup.

Regards
John
 

Jürgen Exner

Whom are you quoting here? It has been established custom for over two
decades to name the original author, because on Usenet you cannot assume
that the article you are replying to is visible to your readers.

> I am going to recommend to this group's moderator

That may be difficult because CLP is not moderated.

> that they cancel your membership.

That may be difficult, too, because there is no membership in Usenet.

> Your attitude is detrimental for beginners learning
> Perl from this newsgroup.

Can't comment on that because I can't tell whom you are talking about.
But yes, there are a few nutcases trolling in this NG, just like in
pretty much any other NG. Luckily they are easy to identify and just as
easy to filter.

jue
 

Tim Greer

Ted said:
> On Tue, 30 Dec 2008 01:34:43 GMT (e-mail address removed) wrote:
>
> s> Why don't you clearly state what your trying to do instead of
> s> grabbing straws and spewing all the buzzwords in the book.
>
> There was nothing in the OP's questions that warranted your rudeness.
>
> s> Not alot of people want to do your work and not get paid for it. Can
> s> you do my work for me?
>
> You, apparently, assume your time and intelligence are too precious to
> waste helping people for free. This is not the right forum for you.
>
> Ted

Unfortunately, this is the norm with that poster. Sometimes I get
confused, because (rarely) he actually does try and offer help (when
he's not trying to pretend he's the smartest, most important person
here... or trying to push his ridiculous parsing engine). Sometimes...
I have hope (but then I see his posts like the one you've quoted).
 

freesoft12

Here are the answers to the questions that Tim, Ted had asked:

My Perl project description:

1) I get thousands of files (lets call them: TFILES) from a C++
program that prints out all the files, being opened by the program
over a period of time, to a log. Hence there are several TFILES that
are opened many times.

2) My Perl script analyzes the TFILEs and collects & publishes various
statistics about each of the TFILEs.

3) the user can specify one or more filters (containing one or more
regular expressions) so that they can see the statistics about a
subset of TFILES (say, just *.cc and *.cpp files).

4) the filters can be specified in the foll 3 ways:
a) on the command line
- All filters are specified on the command line. Hence, I can
populate the hash table with just the TFILEs that match the filter(s)

b) interactively, from a Perl-Tk GUI
- For this case, I need to read in all the TFILEs into the
hash table and show the TFILEs to the user in the GUI. The user then
enters the regular expressions to create a filter file. I apply the
filter file and remove the filtered-out TFILEs from the hash table and
show the reduced hash table to the user again in the GUI.

My Question: My results show that there is quite some time spent in
copying the keys in the hash to the Perl-Tk GUI (once for creating the
filters) and then again, to show the filtered results.

Regards
John
 

Tad J McClellan

[ snip: sln has gone off his meds again ]

> Your asinine answers to my questions and to other people's posts to
> this newsgroup (I checked) spoil the great work done by others in
> teaching Perl to beginners.

Simply ignore the jackoffs and pay attention to the others.

> I am going to recommend to this group's moderator that they cancel
> your membership.


Neither of those are possible, as there are no moderators for
this newsgroup, and there is no concept of "membership".

That is how most Usenet newsgroups operate.
 

Ted Zlatanov

fc> Here are the answers to the questions that Tim, Ted had asked:

fc> My Perl project description:

fc> 1) I get thousands of files (lets call them: TFILES) from a C++
fc> program that prints out all the files, being opened by the program
fc> over a period of time, to a log. Hence there are several TFILES that
fc> are opened many times.

You just need a hash with filenames as keys. Anything else I mentioned
is overkill (but you should be aware that some day you may need to redo
things, so you should abstract the storage functionality behind
iterate/get/put functions).

fc> 2) My Perl script analyzes the TFILEs and collects & publishes various
fc> statistics about each of the TFILEs.

fc> 3) the user can specify one or more filters (containing one or more
fc> regular expressions) so that they can see the statistics about a
fc> subset of TFILES (say, just *.cc and *.cpp files).

fc> 4) the filters can be specified in the foll 3 ways:
fc> a) on the command line
fc> - All filters are specified on the command line. Hence, I can
fc> populate the hash table with just the TFILEs that match the filter(s)

fc> b) interactively, from a Perl-Tk GUI
fc> - For this case, I need to read in all the TFILEs into the
fc> hash table and show the TFILEs to the user in the GUI. The user then
fc> enters the regular expressions to create a filter file. I apply the
fc> filter file and remove the filtered-out TFILEs from the hash table and
fc> show the reduced hash table to the user again in the GUI.

fc> My Question: My results show that there is quite some time spent in
fc> copying the keys in the hash to the Perl-Tk GUI (once for creating the
fc> filters) and then again, to show the filtered results.

Your biggest delay is probably in populating a list widget with the file
names--something Perl can't really improve. As a test, append the file
names to a text area and see how much faster the operation is. I don't
see any of your Perl-Tk code so I can't tell what could be slow, but
eliminating widget updates is a good first step to find performance
bottlenecks in GUIs.

Ted
 

sln

> Here are the answers to the questions that Tim, Ted had asked:
>
> My Perl project description:

I'm pretty sure I asked the question.

> 1) I get thousands of files (lets call them: TFILES) from a C++
> program

There is no such thing as a C++ program. There are only programs
written in C++ (or other languages).

> that prints out all the files, being opened by the program
> over a period of time, to a log.

The executable in question prints out entire files to a log file?
What files are they, and where do they come from? Over and over again
huh? And they are .c or .cc or .cpp files to boot. Thousands of times
over and over and over and over again huh? What about the file names,
same thing.. thousands and thousands of times over and over again?

> Hence there are several TFILES that
> are opened many times.

Thousands and thousands of times, over and over again...

> 2) My Perl script analyzes the TFILEs and collects & publishes various
> statistics about each of the TFILEs.

You haven't got a script, that's why you post here. You never have a
script, that's why you post here. It's pretty clear why you post here...
Now you're publishing statistics.

> 3) the user can specify one or more filters (containing one or more
> regular expressions) so that they can see the statistics about a
> subset of TFILES (say, just *.cc and *.cpp files).

So the user can specify regular expression filters to parse said thousands
and thousands of said published statistics of said thousands and thousands
of said files opened by a C++ program. Say like *.c or *.cc or *.cpp or *.cxx..
Not too regy expressionist.

> 4) the filters can be specified in the foll 3 ways:
> a) on the command line
> - All filters are specified on the command line. Hence, I can
> populate the hash table with just the TFILEs that match the filter(s)

You really can't pass in regular expressions from the command line.
For reasons I won't go into, it's extremely limited and virtually useless.
The mechanics for such uselessness far outweigh the uselessness itself.

> b) interactively, from a Perl-Tk GUI

Perhaps you should do a native program with real controls using C++.

> - For this case, I need to read in all the TFILEs into the
> hash table and show the TFILEs to the user in the GUI. The user then
> enters the regular expressions to create a filter file. I apply the
> filter file and remove the filtered-out TFILEs from the hash table and
> show the reduced hash table to the user again in the GUI.

But how do you get all those published statistics for thousands and
thousands of files, over and over and over?

> My Question: My results show that there is quite some time spent in
> copying the keys in the hash to the Perl-Tk GUI (once for creating the
> filters) and then again, to show the filtered results.

Generally, there is no blocking in the control. If you have thousands
and thousands and thousands to populate a list control with, perhaps
a timer will help to populate it with, and/or use messages.

> Regards
> John

As a long, long, long time C programmer, I notice your focus is on C
source files. Statistically speaking, there are only a few categories
that your intentions fall into. One is Source Control type. If not
actually trying to implement a flavor of your own..., then trying to glean
results from existing output, possibly a command line version with its own
script language, that spits out reams and reams of data and statistics.

The other is, you're just plain nuts.
It's not good enough to spew buzzwords and hypotheticals and expect
thought to be applied, on your behalf, without respect #1 and humility #2.

sln
 

sln

> On Tue, 30 Dec 2008 01:34:43 GMT (e-mail address removed) wrote:
>
> s> Why don't you clearly state what your trying to do instead of
> s> grabbing straws and spewing all the buzzwords in the book.
>
> There was nothing in the OP's questions that warranted your rudeness.
>
> s> Not alot of people want to do your work and not get paid for it. Can
> s> you do my work for me?
>
> You, apparently, assume your time and intelligence are too precious to
> waste helping people for free. This is not the right forum for you.
>
> Ted

While your Enlish is correct, your grammar punctuates dumb.

sln
 

Tim Greer

> I meant En'lish..
>
> sln

Maybe you meant Engrish? Anyway, before you call someone on their
grammar, you should note the difference between your and you're in your
very next post. This advice comes at no charge to you, and you're
welcome.
 
