Search script to index dynamic pages

Rob

I recently tried to add a downloaded search CGI script to my site, before
realising that it would naturally only index static pages on the site
(i.e. only files that it could open using Perl routines). I have
altered the indexing routine so that it no longer crawls all directories
and indexes all files. Instead it opens a file containing a specified
list of files and indexes only those.

As the majority of the content on the website is dynamic content, does
anybody know of any search CGI scripts that will index pages with
dynamic CGI content? (e.g. "website.com/cgi-bin/viewpage.cgi?id=100" )

If such a script is not available, is there a way to return content
from a dynamic page to a CGI script (for indexing purposes)? I could
alter the indexing routine so that it does not just open the files it
is told to, but to return the content (that the script would output to
the server) to the indexing routine instead. This would then suit the
needs of the website perfectly.

Apologies if this has already been covered elsewhere; I have had no
success in finding a solution online. Any help with this would be much
appreciated.

Best Regards

Rob
 
Keith Keller

As the majority of the content on the website is dynamic content, does
anybody know of any search CGI scripts that will index pages with
dynamic CGI content? (e.g. "website.com/cgi-bin/viewpage.cgi?id=100" )

Not directly, but you might consider using something like KinoSearch,
which can create an index of anything you feed to it. You'd need to
code up the indexer yourself, but it's fairly straightforward, assuming
you have access to the backend content you're trying to index.

--keith
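
A minimal sketch of what "coding up the indexer yourself" could look like, assuming the text of each page can be pulled from the backend. This is a toy inverted index in plain Perl, not KinoSearch's API; the %pages data and the build_index and search_index names are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical backend content: page URL => plain text of that page.
my %pages = (
    '/cgi-bin/viewpage.cgi?id=100' => 'Perl search scripts for dynamic pages',
    '/cgi-bin/viewpage.cgi?id=101' => 'Static pages are easy to index',
);

# Build an inverted index: lowercased word => { url => occurrence count }.
sub build_index {
    my ($docs) = @_;
    my %index;
    for my $url (keys %$docs) {
        for my $word (map { lc $_ } $docs->{$url} =~ /(\w+)/g) {
            $index{$word}{$url}++;
        }
    }
    return \%index;
}

# Return the URLs that contain every word of the query, sorted by name.
sub search_index {
    my ($index, $query) = @_;
    my @words = map { lc $_ } $query =~ /(\w+)/g;
    return () unless @words;
    my %hits;
    for my $word (@words) {
        $hits{$_}++ for keys %{ $index->{$word} || {} };
    }
    my @found = grep { $hits{$_} == @words } keys %hits;
    return sort @found;
}

my $index = build_index(\%pages);
print "$_\n" for search_index($index, 'pages index');
```

A real indexer library adds stemming, ranking and on-disk storage, but the shape is the same: feed it text keyed by URL, then query the index rather than the pages.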
 
Jürgen Exner

Rob said:
As the majority of the content on the website is dynamic content, does
anybody know of any search CGI scripts that will index pages with
dynamic CGI content? (e.g. "website.com/cgi-bin/viewpage.cgi?id=100" )

How would that script know which parameters are supported? Is id=100
legal? Is id=100000000000000 legal? Is it myid=... instead of id=...?

If such a script is not available, is there a way to return content
from a dynamic page to a CGI script (for indexing purposes)?

That is trivial, see
perldoc -q "How do I fetch an HTML file?"

jue
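
Since a crawler cannot guess which parameter values are legal, one option is to hand the indexer an explicit URL list. The following is a sketch only, assuming the site owner can enumerate the valid ids (for example from the data behind viewpage.cgi); @valid_ids, uri_escape_simple and urls_for_ids are hypothetical names.

```perl
use strict;
use warnings;

# Assumption: the valid ids are known to the site owner; these are made up.
my @valid_ids = (100, 101, 102);

# Percent-encode everything except RFC 3986 unreserved characters.
# Assumes byte strings; encode wide characters to UTF-8 first.
sub uri_escape_simple {
    my ($value) = @_;
    $value =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord $1)/ge;
    return $value;
}

# Build one URL per known id for the indexer to fetch.
sub urls_for_ids {
    my ($base, @ids) = @_;
    return map { "$base?id=" . uri_escape_simple($_) } @ids;
}

print "$_\n" for urls_for_ids('http://website.com/cgi-bin/viewpage.cgi', @valid_ids);
```

The resulting list can replace the file of static filenames the indexing routine already reads.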
 
Rob

Thank you all for your responses.

I have tried to download and index the pages using the script with the
LWP module, but so far without success. I have done quite a lot of
Perl programming in the past, but the LWP module is quite new to me.

I have written a test routine just to see if it works but this has not
worked either (below):

######################################
#!/usr/bin/perl -w
use strict;
use LWP::Simple;

$data = get("http://www.samplesite.org");

print $data;
######################################

I just get server errors. The file permissions are correct and the LWP
module is installed on the server. Have I missed something obvious, or
used the 'get' routine incorrectly?

Any help would be great.

Regards

Rob
 
Rob

Did it occur to you that the text of the error messages
might be helpful in debugging the problem?

Yes, this did occur to me- however my hosting provider does not give
error logs. The code works when run on my own machine, but does not
work correctly when run on the server. If I pass the information from
a 'get' command to a variable it is simply left blank.

After it is working from the command line, *then* run it under
a web server.

I am now able to 'get' a page when I run the script from my own
computer - it successfully downloads the page and I can do what I like
with the data. However when I run this script online it does not work
at all. I have tried other techniques, such as WWW::Mechanize and
LWP::UserAgent, neither of which produce better results.

One thought was that there may have been firewall/bot protection on
the server, but as the script works from my own computer then it
should work from the server also?

Many thanks for your help so far,

Rob
 
Willem

Rob wrote:
)> Did it occur to you that the text of the error messages
)> might be helpful in debugging the problem?
)>
)
) Yes, this did occur to me- however my hosting provider does not give
) error logs.

Off the top of my head:

BEGIN {
    print "Content-type: text/plain\n\n";
    $SIG{__WARN__} = sub { print @_ };
    $SIG{__DIE__} = sub { print @_ unless $^S };
}

Should make the error message go to the browser,
and that should even work for compile-time errors.


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
 
Rob

Off the top of my head:

BEGIN {
        print "Content-type: text/plain\n\n";
        $SIG{__WARN__} = sub { print @_ };
        $SIG{__DIE__} = sub { print @_ unless $^S };

}

Should make the error message go to the browser,
and that should even work for compile-time errors.

Thank you for this. When I run with this code the only error I get is
that the variable '$data' is uninitialized (presumably because the
'get' function has not succeeded in passing any data to it).

The code now stands as follows (for the test file I have made):


################################################################################
#!/usr/bin/perl -w
use CGI qw(:all);
use LWP::Simple;

BEGIN {
print "Content-type: text/plain\n\n";
$SIG{__WARN__} = sub { print @_ };
$SIG{__DIE__} = sub { print @_ unless $^S };
}
my $data = get("http://www.samplesite.org");

open (CF2, "testtext.txt");
print CF2 "$data";
close(CF2);
################################################################################

Like I said, it works on my computer but not on the server.

Thanks,

Rob
 
Willem

Rob wrote:
)
)> Off the top of my head:
)>
)> BEGIN {
)>         print "Content-type: text/plain\n\n";
)>         $SIG{__WARN__} = sub { print @_ };
)>         $SIG{__DIE__} = sub { print @_ unless $^S };
)>
)> }
)>
)> Should make the error message go to the browser,
)> and that should even work for compile-time errors.
)
) Thank you for this. When I run with this code the only error I get is
) that the variable '$data' is uninitialized (presumably because the
) 'get' function has not succeeded in passing any data to it).

Yes. LWP::Simple doesn't do error reporting. At all.
You should use LWP::UserAgent if you want to know more than 'it failed'.

) The code now stands as follows (for the test file I have made):
)
)
) ################################################################################
) #!/usr/bin/perl -w
) use CGI qw(:all);
) use LWP::Simple;

use strict;
use warnings;

)
) BEGIN {
) print "Content-type: text/plain\n\n";
) $SIG{__WARN__} = sub { print @_ };
) $SIG{__DIE__} = sub { print @_ unless $^S };
) }
) my $data = get("http://www.samplesite.org");
)
) open (CF2, "testtext.txt");
) print CF2 "$data";
) close(CF2);

# Use lexical filehandles.

) ################################################################################
)
) Like I said, it works on my computer but not on the server.

Dump LWP::Simple, and code it using LWP::UserAgent

use LWP::UserAgent;
my $response = LWP::UserAgent->new->get("http://www.samplesite.org");
if ($response->is_success) {
    open (my $cf, '>', 'testtext.txt') or die "Failed to write: $!";
    print $cf $response->decoded_content;
    close $cf;
} else {
    die $response->status_line;
}


SaSW, Willem
 
Rob

Dump LWP::Simple, and code it using LWP::UserAgent

use LWP::UserAgent;

That has helped to show me that it is a matter of the request timing
out. I imagine this may be because of a security feature which is not
allowing a script within the site to download itself? I have tried
changing the IP address of the UserAgent using:

$ua->local_address("10.10.10.10");

but I am told that the method "local_address" can't be located through
the UserAgent package. After looking this up, it would appear that the
version of Perl is out of date on the server.

Rob
 
Xho Jingleheimerschmidt

Willem said:
Rob wrote:
)> Did it occur to you that the text of the error messages
)> might be helpful in debugging the problem?
)>
)
) Yes, this did occur to me- however my hosting provider does not give
) error logs.

Holy cow. Do you pay for this hosting provider?

Off the top of my head:

BEGIN {
print "Content-type: text/plain\n\n";
$SIG{__WARN__} = sub { print @_ };
$SIG{__DIE__} = sub { print @_ unless $^S };
}

Should make the error message go to the browser,
and that should even work for compile-time errors.

As long as the compile-time error occurs after the BEGIN.

I use:

use CGI::Carp qw(fatalsToBrowser);

Doesn't deal with the warnings, but that module provides other ways to
do that.

Xho
 
J. Gleixner

Rob wrote:
[...]
################################################################################
#!/usr/bin/perl -w
use CGI qw(:all);
use LWP::Simple;

BEGIN {
print "Content-type: text/plain\n\n";
$SIG{__WARN__} = sub { print @_ };
$SIG{__DIE__} = sub { print @_ unless $^S };
}
my $data = get("http://www.samplesite.org");

open (CF2, "testtext.txt");

Ahhhh.. that's (possibly) opening 'testtext.txt' for reading.

print CF2 "$data";

Not writing.

Add some error checking and open it for write.

close(CF2);
################################################################################

Like I said, it works on my computer but not on the server.

Doubtful, unless 'works' means it doesn't create a file.
 
Rob

Ahhhh.. that's (possibly) opening 'testtext.txt' for reading.

I had deleted the path for the purposes of posting it to this forum
and in the process deleted the ">" at the start, my mistake!

Doubtful, unless 'works' means it doesn't create a file.

It is creating a file, but leaving it empty - I'm awaiting a response
from my host now about their perl version.

Rob
 
J. Gleixner

Rob said:
I had deleted the path for the purposes of posting it to this forum
and in the process deleted the ">" at the start, my mistake!

OK.. and what if the open failed???? Add some simple error checking!

It is creating a file, but leaving it empty - I'm awaiting a response
from my host now about their perl version.

Why does your version of perl matter?

Since you're saying it works when you run it, but not when executed
as a CGI, then the first thing I'd look at is a permission
problem. Back-up a bit, and simplify your problem. Change your
script to simply open the file you want to write to (for write),
print something if open fails, and write something to it. That's all.
Does that work?

A very basic example:

use CGI qw( header );
print header;
my $file = '/some/path/to/file';
if ( open( my $fh, '>', $file ) )
{
    print $fh 'Testing 123';
    close $fh;

    # Read the file back to confirm the write succeeded.
    open( my $o, '<', $file ) or die "opening $file for read failed: $!";
    print "content of $file: ", <$o>;
    close( $o );
}
else
{
    print "opening $file for write failed: $!";
}
 
Willem

Rob wrote:
)
)> Dump LWP::Simple, and code it using LWP::UserAgent
)>
)> use LWP::UserAgent;
)
) That has helped to show me that it is a matter of the request timing
) out. I imagine this may be because of a security feature which is not
) allowing a script within the site to download itself?

Sounds unlikely. Especially if you're getting a timeout.
Timeouts usually mean firewalls silently dropping packets.

Have you tried using 'localhost' in place of the server name ?

) I have tried
) changing the IP address of the UserAgent using:
)
) $ua->local_address("10.10.10.10");

That's not likely to help, IMO.


SaSW, Willem
 
Rob

OK.. and what if the open failed????  Add some simple error checking!

The open didn't fail, it left me with an empty text file. The only
error I get when running the script is a timeout.

Why does your version of perl matter?

The version of Perl matters because I am trying to use LWP::UserAgent,
and the version of LWP::UserAgent on my server does not apparently
include the local_address function. I was hoping to use this as
currently when I try to get a page (using LWP::UserAgent) it is timing
out when running from the server, but working when I run from my own
computer. As this is an indexing routine, I would like it to work from
the server.

Back-up a bit, and simplify your problem. Change your
script to simply open the file you want to write to (for write),
print something if open fails, and write something to it. That's all.
Does that work?

Thanks for the example but it is not a file permission problem, I have
many scripts which read and write files on this server, none of which
are problematic.

Rob
 
Jim Gibson

Rob said:
but I am told that the method "local_address" can't be located through
the UserAgent package. After looking this up, it would appear that the
version of Perl is out of date on the server.

Here is a program I stole many years ago to print out server Perl
information. It probably depends upon the server being linux, but it
may be possible to adapt it to other environments:


#!/usr/bin/perl -T
use strict;
use File::Find;
use File::Basename;

my $debug;

# untaint path
$ENV{'PATH'} = '/usr/bin';

# get perl version
my $perlout = `perl -v`;
my $perlver;
if( $perlout =~ m/This is perl, v([\d.]+) built for/s ) {
    $perlver = $1;
}


print "Content-type: text/html\n\n";
print "<HTML>\n";
print "<HEAD><TITLE>Perl Environment</TITLE></HEAD>\n";
print "<BODY>\n";
print "<h1>Perl Version:</h1>\n";
print "Perl version is $perlver<br>\n" if $perlver;
print "<pre>\n";
print "$perlout\n";
print `perl -V`;
print "</pre>\n";
print "<H1>Perl Modules Installed:</H1>\n";
my( %modules, %seen );
my @subdirs = qw( i386-linux-thread-multi i686-linux );
my( $dirlen, $curdir);
for my $incdir ( @INC ) {
    $curdir = $incdir;
    $dirlen = length($incdir);
    print "\n<br>Look in $incdir ($dirlen):<br>\n\n" if $debug;
    find( {wanted=>\&add_module, no_chdir=>1}, $incdir);
}

print "<p><table border=1 cellspacing=2 cellpadding=4>\n";
print "<tr><th>Module</th><th>Location</th></tr>\n";
foreach my $file ( sort keys %modules ) {
    print "<tr><td>$file</td><td>$modules{$file}</td></tr>\n";
}
print "</table>\n";
print "</BODY></HTML>\n";
exit (0);

sub add_module
{
    # only include once
    return if $seen{$File::Find::name}++;

    # only include Perl modules ending with '.pm'
    return unless /\.pm$/;

    # eliminate unless belongs to active Perl version
    if( $perlver ) {
        return unless /$perlver/;
    }

    #return unless /site/;
    print " found $_<br>\n" if $debug;
    my $name;
    $name = substr($File::Find::name,$dirlen+1,-3);
    my $loc = substr($File::Find::name,0,$dirlen);
    print "name=$name, loc=$loc<br>\n" if $debug;
    $name =~ s/\//::/g;
    print " saving &quot;$name&quot; => &quot;$loc&quot;<br>\n"
        if $debug;
    $modules{$name} = $loc;
}
 
J. Gleixner

Rob said:
The open didn't fail, it left me with an empty text file. The only
error I get when running the script is a timeout.


The version of Perl matters because I am trying to use LWP::UserAgent,
and the version of LWP::UserAgent on my server does not apparently
include the local_address function. I was hoping to use this as
currently when I try to get a page (using LWP::UserAgent) it is timing
out when running from the server, but working when I run from my own
computer.

Still the version of perl really doesn't matter. The version of
that module might, however you can find that version yourself:

use LWP::UserAgent;
print $LWP::UserAgent::VERSION;

or to see if a method 'can' be called:

my $lwp = LWP::UserAgent->new();
print "local_address is available." if $lwp->can( 'local_address' );
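
The two checks above can be wrapped so that a missing module is reported rather than killing the script. This is a rough sketch; module_version is a hypothetical helper name, and the module list is just the ones mentioned in this thread.

```perl
use strict;
use warnings;

# Return a module's version string, 'unknown' if it loads but sets no
# $VERSION, or undef if the name is malformed or the module cannot be loaded.
sub module_version {
    my ($module) = @_;
    # Only accept plain package names before building a file path from them.
    return unless $module =~ /\A\w+(?:::\w+)*\z/;
    (my $file = "$module.pm") =~ s{::}{/}g;
    return unless eval { require $file; 1 };
    my $version = eval { $module->VERSION };
    return defined $version ? $version : 'unknown';
}

for my $module (qw(LWP LWP::UserAgent LWP::Simple)) {
    my $version = module_version($module);
    print "$module: ", (defined $version ? $version : 'not installed'), "\n";
}
```

Run as a CGI, it needs a Content-type header printed first, as in the earlier examples.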

As this is an indexing routine, I would like it to work from
the server.



No idea why you want to 'index' something through a CGI, but.... Do you
have shell access to the server? If you do, then connect to that
machine, using ssh/telnet/whatever, and do everything from that machine.
Try 'telnet localhost 80'. Do you get a connection? I guess the server
might not be allowing connections to port 80 from itself. If you do
get a connection, then using LWP::Debug might help figure out the
problem, or use the debugger and step through your program to
see what's happening, or not happening. There are a lot of possible
problems; working with someone who owns the machine is probably your
best bet.
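
Without shell access, roughly the same test as 'telnet localhost 80' can be made from a CGI script with the core IO::Socket::INET module. The following is a sketch under that assumption; check_port is a hypothetical helper name and the 5-second timeout is arbitrary.

```perl
use strict;
use warnings;
use IO::Socket::INET;

# Try a plain TCP connection, roughly what "telnet host port" does.
# Returns a short human-readable result string.
sub check_port {
    my ($host, $port, $timeout) = @_;
    my $socket = IO::Socket::INET->new(
        PeerAddr => $host,
        PeerPort => $port,
        Proto    => 'tcp',
        Timeout  => $timeout,
    );
    return "connected to $host:$port" if $socket;
    # On failure IO::Socket leaves its error text in $@.
    return "cannot connect to $host:$port: $@";
}

print "Content-type: text/plain\n\n";
print check_port('localhost', 80, 5), "\n";
```

Replacing 'localhost' with the site's public host name shows whether that path is the one being blocked. A quick refusal and a slow timeout point to different causes, so the error text is worth reading.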

Thanks for the example but it is not a file permission problem, I have
many scripts which read and write files on this server, none of which
are problematic.

OK. I haven't been following this very closely. That's usually a very
common problem.
 
Lawrence Statton

Rob said:
The version of Perl matters because I am trying to use LWP::UserAgent,
and the version of LWP::UserAgent on my server does not apparently
include the local_address function.


Would not, then, the version of LWP::UserAgent (and its ascendants) be
a much more interesting datum?

--L
 
Jürgen Exner

Rob said:
I am now able to 'get' a page when I run the script from my own
computer - it successfully downloads the page and I can do what I like
with the data.

Great! This means your Perl problem is solved.

Anything else is something else, like e.g. web server config issues,
missing modules, incorrect use of CGI, ... that list goes on and on.

jue
 
Rob

Great! This means your Perl problem is solved.

Thank you all for your advice and help with this. It turned out that
the real problem was the firewall on the server: I can't do any HTTP
communication using scripts within the site. I have overcome this by
running the indexer on my own machine and uploading the indexed
database each time, which now works fine.

Rob
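
The workaround described above might be sketched as follows. This is an illustration, not Rob's actual script: it assumes the pages are fetched locally with LWP::UserAgent and the result is saved with the core Storable module, and collect_pages, lwp_fetcher and the site_index.db file name are all hypothetical.

```perl
use strict;
use warnings;
use Storable qw(nstore retrieve);

# Fetch each URL with the supplied code ref and collect the page text.
# The fetcher must return the content, or undef on failure.
sub collect_pages {
    my ($fetch, @urls) = @_;
    my %content;
    for my $url (@urls) {
        my $text = $fetch->($url);
        if (defined $text) {
            $content{$url} = $text;
        }
        else {
            warn "skipping $url: fetch failed\n";
        }
    }
    return \%content;
}

# Default fetcher built on LWP::UserAgent (loaded only when called).
sub lwp_fetcher {
    require LWP::UserAgent;
    my $ua = LWP::UserAgent->new(timeout => 10);
    return sub {
        my ($url) = @_;
        my $response = $ua->get($url);
        return $response->is_success ? $response->decoded_content : undef;
    };
}

# Run locally as: perl build_index.pl URL [URL ...]
if (@ARGV) {
    my $pages = collect_pages(lwp_fetcher(), @ARGV);
    nstore($pages, 'site_index.db');    # file to upload to the server
    print "stored ", scalar(keys %$pages), " page(s) in site_index.db\n";
}
```

The search CGI on the server would then only need retrieve('site_index.db') and never make an HTTP request itself. nstore writes in network byte order, which helps when the local machine and the server differ in architecture.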
 
