"getting" a website

Uno · Mar 8, 2011

My family owns this website, and I serve as an agent in its behalf.

http://germanresistance.com/

I have the _Perl Cookbook_, but have never actually gone and taken
somebody's website before. I can assure you that the law is on my side
here, unless it isn't because Barack has pointy ears.

The guys I'm taking it from have to answer to the fbi and the local fuzz.

Looking for tips as I do this with or without you.

Michael Vilain · Mar 8, 2011

Uno <[email protected]> said:
My family owns this website, and I serve as an agent in its behalf.

http://germanresistance.com/

I have the _Perl Cookbook_, but have never actually gone and taken
somebody's website before. I can assure you that the law is on my side
here, unless it isn't because Barack has pointy ears.

The guys I'm taking it from have to answer to the fbi and the local fuzz.

Looking for tips as I do this with or without you.

If you don't have ftp access to the server's docroot, you'll only get
static html and image files. Since this is a Wordpress site, perl
programming will only help you capture the pages that the CMS constructs
and presents to a browser. I'd let the authorities do their job before
you muddy the waters of your case and make it that much harder to
prosecute the owners of the site. Start by contacting 1and1, the ISP
hosting the site.

Open your wallet wide.

Uno · Mar 9, 2011

What do you mean when you say "take" a website?

Tips on how to do what, exactly?

And how do expect to use Perl for whatever it is that you do mean to do?

Ask us a question about Perl, and we will answer it.

Boy, I must have been plowed when I wrote that. Upset, too, these guys
severely abused my uncle, who is ga-ga for Dietrich Bonhoeffer.
Anyways, this is from the material in chp 20:

$ cat gurl1.pl
#!/usr/bin/perl -w
# gurl - get content from an url

use LWP::Simple;
my $URL = 'http://germanresistance.com';
my $content = get($URL);

unless (defined ($content = get $URL)) {
die "could not get $URL\n";
}

print $content;
$

Really straightforward stuff here.

$ pwd
/home/dan/source/cookbook.examples/ch20
$ perl gurl1.pl
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" dir="ltr" lang="en-US">

<head profile="http://gmpg.org/xfn/11">
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta name="template" content="K2 1.0-RC8" />

<title> Christ and the German Resistance</title>

<link rel="stylesheet" type="text/css" media="screen"
href="http://germanresistance.com/wp-content/themes/k2/style.css" />

....

The output goes forever.

q1) What's best way to serialize a website with perl?

q2) What things don't get gotten with get?

Uno · Mar 9, 2011

If you don't have ftp access to the server's docroot, you'll only get
static html and image files. Since this is a Wordpress site, perl
programming will only help you capture the pages that the CMS constructs
and presents to a browser. I'd let the authorities do their job before
you muddy the waters of your case and make it that much harder to
prosecute the owners of the site. Start by contacting 1and1, the ISP
hosting the site.

Open your wallet wide.

I'm just a cash-flow device anymore. The "webmaster" is part of the
scam that bilked this guy out of shocking amounts of $$, and I've got to
make him superfluous somehow.

I'll get ftp access. Thx.

Uno · Mar 9, 2011

Ask us a question about Perl, and we will answer it.

$ perl gurl3.pl
Wide character in print at gurl3.pl line 16.
<td height="14"><a
href="http://www.germanresistance.com/documents/Intro_to_Bonhoeffer.pdf"
target="_blank">Introduction to Dietrich Bonhoeffer</a></td>
<td height="14"><a
href="http://www.germanresistance.com/documents/Chronology_of_the_life_of_Dietrich_Bonhoeï¬€er.pdf"
target="_blank">Chronology of the life of Dietrich Bonhoeffer</a></td>
<td height="14"><a
href="http://www.germanresistance.com/documents/Bonhoeï¬€er_on_Abortion.pdf"
target="_blank">Dietrich Bonhoeffer on Abortion</a></td>
<td height="14"><a
href="http://www.germanresistance.com/documents/Luther_Bonhoeï¬€er_and_Revolution.pdf"
target="_blank">Luther, Bonhoeffer and Revolution</a></td>
<td height="14"><a
href="http://www.germanresistance.com/doc...€er_and_the_Russian_Religious_Renaissance.pdf"
target="_blank">Dietrich Bonhoeffer and the Russian Religious
Renaissance</a></td>
<td height="14"><a
href="http://www.germanresistance.com/doc...ï¬€er_the_resistance_and_the_two_kingdoms.pdf"
target="_blank">Dietrich Bonhoeffer, the resistance, and the two
kingdoms</a></td>
<td height="14"><a
href="http://www.germanresistance.com/documents/Pius_XII_and_the_Jews.pdf"
target="_blank">Pius XII and the Jews</a></td>
<td height="14"><a
href="http://www.germanresistance.com/documents/Dietrich_Bonhoeffer_and_the_German_Resistance_'95.pdf"
target="_blank">Dietrich Bonhoeffer and the German Resistance</a></td>
<td height="14"><a
href="http://www.germanresistance.com/documents/Dietrich_Bonhoeï¬€er_and_Liberalism.pdf"
target="_blank">Dietrich Bonhoeffer and Liberalism</a></td>
<td height="14"><a
href="http://www.germanresistance.com/documents/Dietrich_Bonhoeï¬€er_and_Canossa.pdf"
target="_blank">Dietrich Bonhoeffer and Canossa</a></td>
<td height="14"><a
href="http://www.germanresistance.com/documents/Invidious_Comparisons.pdf"
target="_blank">Invidious Comparisons</a></td>
<td height="14"><a
href="http://www.germanresistance.com/documents/Agent_of_Grace.pdf"
target="_blank">Dietrich Bonhoeffer – a discussion of
“Bonhoeffer: Agent of Grace”</a></td>
<td height="14"><a
href="http://www.germanresistance.com/documents/The_stereotyping_of_Dietrich_Bonhoeï¬€er.pdf"
target="_blank">The stereotyping of Dietrich Bonhoeffer</a></td>
<td height="14"><a
href="http://www.germanresistance.com/documents/From_Dietrich_Bonhoeï¬€erâ€™s_Wedding_Sermon.pdf"
target="_blank">From Dietrich Bonhoeferâ€™s Wedding Sermon</a></td>
<td height="14"><a
href="http://www.germanresistance.com/documents/Dietrich_Bonhoeï¬€er_and_the_Theology_of_the_Cross.pdf"
target="_blank">Dietrich Bonhoeffer and the Theology of the Cross</a></td>
<td height="14"><a
href="http://www.germanresistance.com/documents/Dietrich_Bonhoeï¬€er_on_Authority.pdf"
target="_blank">Dietrich Bonhoeffer on Authority</a></td>
<td height="14"><a
href="http://www.germanresistance.com/documents/Dietrich_Bonhoeï¬€er_and_the_Formula_of_Concord.pdf"
target="_blank">Dietrich Bonhoeffer and the Formula of Concord</a></td>
<td height="14"><a
href="http://www.germanresistance.com/documents/The_German_Resistance_60_years.pdf"
target="_blank">The German Resistance – 60 years</a></td>
<td height="14"><a
href="http://www.germanresistance.com/documents/Dietrich_Bonhoeï¬€er_and_Karl_Barth.pdf"
target="_blank">Dietrich Bonhoeffer and Karl Barth</a></td>
<td height="14"><a
href="http://www.germanresistance.com/documents/Luther_and_Bonhoeï¬€er_misunderstood.pdf"
target="_blank">Luther and Bonhoeffer misunderstood</a></td>
<td height="14"><a
href="http://www.germanresistance.com/documents/The_German_Resistance_and_Dietrich_Bonhoeï¬€er_'07.pdf"
target="_blank">The German Resistance and Dietrich Bonhoeffer</a></td>
$ cat gurl3.pl
#!/usr/bin/perl -w
# gurl - get content from an url

use LWP::Simple;

if ( ! open FILE, ">bonhoeffer") {
die "Cannot create file: $!";
}
my $URL = 'http://germanresistance.com/index-of-papers/';
my $content = get($URL);

unless (defined ($content = get $URL)) {
die "could not get $URL\n";
}

print FILE $content;
close FILE;

if ( ! open FILE2, "<bonhoeffer") {
die "Cannot open file: $!";
}

while (<FILE2>) {
if (m{
http://www.germanresistance.com/documents # first part
..* #anything in between
pdf # last part
}six) {
print $_;
}
}
$

So I'm pecking away at this. I want this script to match on the pdf
documents. I think I get them all to match, but how do I strip off all
the rest of it?

For example this is now:
<td height="14"><a
href="http://www.germanresistance.com/documents/The_German_Resistance_60_years.pdf"
target="_blank">The German Resistance – 60 years</a></td>

, and I want it to be:
http://www.germanresistance.com/documents/The_German_Resistance_60_years.pdf

, without the newline.

Also, I couldn't really think of a way to get this done without opening
a file twice, but I'm sure there is a way to a) Have $content stored in
a file and b) run it through the while clause.

Thanks for your comment.

Jürgen Exner · Mar 9, 2011

Uno said:
For example this is now:
<td height="14"><a
href="http://www.germanresistance.com/documents/The_German_Resistance_60_years.pdf"
target="_blank">The German Resistance – 60 years</a></td>

, and I want it to be:
http://www.germanresistance.com/documents/The_German_Resistance_60_years.pdf

Simple. Exactly the same way as it has been done a gazillion of times
before in this NG: you take an HTML parser, you run your text through
that parser, and you retrieve the value of the href attribute.

jue

Keith Keller · Mar 9, 2011

q1) What's best way to serialize a website with perl?

I'm going to take a WAG that you want to download as much as possible
using GETs (ftp will be better, since you mentioned that you're trying
to get it). You will probably want to look at wget's recursive
retrieval options. If you later need to parse and modify the HTML, then
as Jurgen suggested you should use the various HTML parsers available
(e.g., HTML:

arser, HTML::TreeBuilder).

If that's wrong then you need to explain exactly what "serializing" a
website means.

q2) What things don't get gotten with get?

Anything dynamic, e.g., a backend database, scripts, flat files
generated dynamically, SSI directives (do people still use those?),
probably lots more.

If you get ftp access, you could use the Net::FTP module to retrieve
files (or use a tool like ncftpget).

--keith

Uno · Mar 9, 2011

Simple. Exactly the same way as it has been done a gazillion of times
before in this NG: you take an HTML parser, you run your text through
that parser, and you retrieve the value of the href attribute.

Uno · Mar 9, 2011

I'm going to take a WAG that you want to download as much as possible
using GETs (ftp will be better, since you mentioned that you're trying
to get it). You will probably want to look at wget's recursive
retrieval options. If you later need to parse and modify the HTML, then
as Jurgen suggested you should use the various HTML parsers available
(e.g., HTML:arser, HTML::TreeBuilder).

If that's wrong then you need to explain exactly what "serializing" a
website means.

By "serialize," I mean have a faithful representation of both the
directories and the files.

Anything dynamic, e.g., a backend database, scripts, flat files
generated dynamically, SSI directives (do people still use those?),
probably lots more.

If you get ftp access, you could use the Net::FTP module to retrieve
files (or use a tool like ncftpget).

Ok, thx, keith. I don't want to let this guy know what I'm doing, and I
just have to hope that he doesn't notice all the attention that has
recently occurred to pdf docs about how Dietrich Bonhoeffer can be
shoe-horned into the parochial, regressive, dyspeptic notions of the
Lutheran Church--Missouri Synod.

Isn't there a usenet group that discusses legality, property rights,
criminal behavior in a web context?

Uno · Mar 9, 2011

Well, perl's great and all that, but there's a well know, and free tool
specifically designed to do this:

http://www.gnu.org/software/wget/manual/html_node/Recursive-Download.html#Recursive-Download

Bugbear, I'm in shock, in a good way. This is something that would
never happen to a windows user:

wget -rH -Dgermanresistance.com http://germanresistance.com >text1

$ wget -rH -Dgermanresistance.com http://germanresistance.com >text1
--2011-03-09 12:23:04-- http://germanresistance.com/
Resolving germanresistance.com... 74.208.40.60
Connecting to germanresistance.com|74.208.40.60|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `germanresistance.com/index.html'

[ <=> ] 7,755 38.0K/s in
0.2s

2011-03-09 12:23:05 (38.0 KB/s) - `germanresistance.com/index.html'
saved [7755]

Loading robots.txt; please ignore errors.
--2011-03-09 12:23:05-- http://germanresistance.com/robots.txt
Connecting to germanresistance.com|74.208.40.60|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Saving to: `germanresistance.com/robots.txt'

.... [big snip]

2011-03-09 12:23:23 (27.5 KB/s) -
`germanresistance.com/contact-us/feed/index.html' saved [708]

FINISHED --2011-03-09 12:23:23--
Downloaded: 41 files, 295K in 7.5s (39.0 KB/s)
$

I told my uncle that it would take me 50 seconds. Nice to know I've
still got the touch. They've made him believe that this website is some
unbelievable, irreplaceable thing. They also said it was in Dan's basement.

Out of curiosity, can somebody tell where this server is, physically?

I wasn't expecting any of this to happen, but when I saw that this
script was saving filings for me in a directory structure, then I set
about finding it, and they put it right in one's home:
$ pwd
/home/dan/germanresistance.com
$ ls -l
total 40
drwxr-xr-x 3 dan dan 4096 2011-03-09 12:23 contact-us
drwxr-xr-x 3 dan dan 4096 2011-03-09 12:23 home
-rw-r--r-- 1 dan dan 7755 2011-03-09 12:23 index.html
drwxr-xr-x 2 dan dan 4096 2011-03-09 12:23 index-of-papers
-rw-r--r-- 1 dan dan 24 2011-03-09 12:23 robots.txt
drwxr-xr-x 5 dan dan 4096 2011-03-09 12:23 wp-content
drwxr-xr-x 3 dan dan 4096 2011-03-09 12:23 wp-includes
-rw-r--r-- 1 dan dan 42 2011-03-09 12:23 xmlrpc.php
-rw-r--r-- 1 dan dan 858 2011-03-09 12:23 xmlrpc.php?rsd
$ ls *
index.html robots.txt xmlrpc.php xmlrpc.php?rsd

contact-us:
feed index.html

home:
feed

index-of-papers:
index.html

wp-content:
plugins themes uploads

wp-includes:
js wlwmanifest.xml
$

Had only a few minutes for this today and feel like I hit a home run.
Now to get my truck back from the shop and pay the bill.

Keith Keller · Mar 9, 2011

wget -rH -Dgermanresistance.com http://germanresistance.com >text1

Since you're using wget, this is offtopic for comp.lang.perl.misc.

[snip]

--keith

Jürgen Exner · Mar 10, 2011

Uno said:
By "serialize," I mean have a faithful representation of both the
directories and the files.

That is impossible if you are limited to access by HTTP.

HTTP does not know about directories or files, it knows only about URLs.
And a URL may map to anything or everything on the server, from
primitive static files over SSI or "dynamic web pages" to responses that
are generated on the fly by the server and never exist as a file or
anything even remotly resembling a file.

And if you are not limited to access by HTTP then use that access and
this whole discussion becomes pointless.

jue

Uno · Mar 10, 2011

Wayne, Pennsylvania?

<http://www.melissadata.com/lookups/iplocation.asp?ipaddress=74.208.40.6
0>
<http://www.geobytes.com/IpLocator.htm?GetLocation>
<http://www.geobytes.com/IpLocator.htm?GetLocation>

Please limit your posts to discussions of Perl. Thank you.

Right, Jim, and I appreciate the forum's forbearance on a thread with me
beginning with an emotional drunkpost. I try to bring the discussion
back to the topic by my own efforts, which were doubly successful today:

$ pwd
/home/dan/source/cookbook.examples/ch18/www.merrillpjensen.com
$ ls
colorschemes img0.png live_tinc.js style1.css
images index.html main.css style.css
$ perl gurl6.pl
Can't open perl script "gurl6.pl": No such file or directory
$ cd ..
$ perl gurl6.pl
Name "main::remotefile" used only once: possible typo at gurl6.pl line 18.
$ ls
ch18.code expn gurl6.pl hostaddrs mxhost www.merrillpjensen.com
zax.html
$ cat zax.html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN"
"http://www.w3.org/TR/html4/frameset.dtd">
<html>
<head>
....

HUGE SNIP

....
</body>
</noframes>
</frameset>
</html>
$ cat gurl6.pl
#!/usr/bin/perl -w
# gurl - get content from an url

use LWP::Simple;
require HTML::TokeParser;
use Net::FTP;

my $domain = 'www.merrillpjensen.com';
my $username = 'u61210220';
my $password = '';
my $file = 'index.html';
my $file2 = 'zax.html';

$ftp = Net::FTP->new($domain) or die "Can't connect: $@\n";
$ftp->login($username, $password) or die "Couldn't login\n";

$ftp->get($file, $file2)
or die "Can't fetch $remotefile : $!\n";

# end output and script listing

So this is my site, which I bought from 1and1, who just happen to be my
uncle's provider as well. Lucky breaks fall to those who break things
for a living, only to improve them after the demo portion.

What was the remote server telling me here:?

Name "main::remotefile" used only once: possible typo at gurl6.pl line 18.

Jürgen Exner · Mar 10, 2011

Uno said:
or die "Can't fetch $remotefile : $!\n";

Name "main::remotefile" used only once: possible typo at gurl6.pl line 18.
What was the remote server telling me here:?

Nothing.
This has nothing to do with a remote server but is a warning from perl
which is telling you, that you are using the variable $remotefile only
once:
- you are not declaring the variable (which begs the question why aren't
you using strict? Had you used strict then perl would have told you)
- you are not defining this variable anywhere
- and you are using it only once in this one line.
And because you never assign any value to $remotefile you can just as
well remove it completely from the die statement.

jue

Uno · Mar 10, 2011

Nothing.
This has nothing to do with a remote server but is a warning from perl
which is telling you, that you are using the variable $remotefile only
once:
- you are not declaring the variable (which begs the question why aren't
you using strict? Had you used strict then perl would have told you)
- you are not defining this variable anywhere
- and you are using it only once in this one line.
And because you never assign any value to $remotefile you can just as
well remove it completely from the die statement.

I think I understand:

$ pwd
/home/dan/source/cookbook.examples/ch18
$ perl gurl6.pl
$ ls
ch18.code gurl6.pl mxhost zax2.html
expn hostaddrs www.merrillpjensen.com zax.html
$ cat gurl6.pl
#!/usr/bin/perl -w
# gurl - get content from an url

use LWP::Simple;
require HTML::TokeParser;
use Net::FTP;

my $domain = 'www.merrillpjensen.com';
my $username = 'u61210220';
my $password = '';
my $file = 'index.html';
my $file2 = 'zax2.html';

$ftp = Net::FTP->new($domain) or die "Can't connect: $@\n";
$ftp->login($username, $password) or die "Couldn't login\n";

$ftp->get($file, $file2)
or die "Can't fetch $file : $!\n";

$

Wenn ich also "tja" sagte, koenntest Du behaupten, dass ich etwa die
Pfade verloren habe. Allerdings existeirt ein neues Datei zax2.

Mir geschieht das recht.

Uno · Mar 10, 2011

Tips on how to do what, exactly?

And how do expect to use Perl for whatever it is that you do mean to do?

Ask us a question about Perl, and we will answer it.

I'm looking for a reference in _Learning Perl_. which I bought at your
recommendation.

This is a matching control:

while (<FILE2>) {
if (m{
http://www.germanresistance.com/documents # first part
.* #anything in between
pdf # last part
}six) {
print $_;
}
}

[had to print as quotation which makes me feel like mao tse dong:
antiquated}

What if I wanted to match on zip as well? I've been looking through
that book, and in want of better advice, find it to be a family argument
with cavemen.

On the one hand, the "fred, dino, or barney stuff" isn't quite cutting
it for me, because I can't find the page I read the original material on
in _Learning Perl_.

I believe the part that needs expansion is:

pdf # last part

q43) How do I match on zip as well?

Also, presumably q44, why did $_ return more than I wanted?

Gruesst euch.

Jürgen Exner · Mar 10, 2011

Uno said:
q43) How do I match on zip as well?

perldoc perlretut
perldoc perlre

jue

Uno · Mar 18, 2011

On 03/10/2011 06:35 AM, Tad McClellan wrote:

[re-ordered, for thematic reasons}

Uno<[email protected]> wrote:

$_ is a variable.

It cannot "return" anything.

It does have a value though:

why is the value of $_ more than I wanted.

For us to answer that, we would need to know what you wanted!

What did you want?

What I want is a reasonable discussion of perl. My notion of
"reasonable" might be another's notion of "highly idiomatic and usually OT."

I suspect that a default variable is something that one just has to get
used to, like quantum mechanics.

Most excellent! You have found the right spot.

By adding alternation and grouping:

(pdf|zip) # last part

or, since you are already using the m//x modifier:

( # last part (will be available in $1)
pdf | zip # either one of these
)

or, if you don't need to capture:

(?: # last part
pdf | zip # either one of these
)

Uno · Mar 18, 2011

perldoc perlretut

$ perldoc perlretut
You need to install the perl-doc package to use this program.
$

Apparently, perl documentation doesn't survive a re-install, necessarily.

Jürgen Exner · Mar 18, 2011

Uno said:
$ perldoc perlretut
You need to install the perl-doc package to use this program.

Well, then why don't you tell the adminstrator of your system that he
didn't finish his job?

jue

Need help with code on website (noob)	2	Jul 18, 2022
How to store data from a sign up form on a website into an sql databse	1	Sep 9, 2022
Want to host websites that I will probably be the only user from home. Sacrilege, I know, but it has always been a dream of mine. Where do I start?	2	Aug 13, 2024
How Do I Set text on an Image and use the image as a border?	7	Mar 16, 2023
Making a Website with Ruby (not rails?)	11	Dec 9, 2010
New "coder", I have a project that i want to accomplish	1	Apr 22, 2020
Problem debugging Website	0	Mar 4, 2010
A patented website form?	0	Dec 22, 2008

"getting" a website

Uno

Michael Vilain

Uno

Uno

Uno

Jürgen Exner

Keith Keller

Uno

Uno

Uno

Keith Keller

Jürgen Exner

Uno

Jürgen Exner

Uno

Uno

Jürgen Exner

Uno

Uno

Jürgen Exner

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads