Regexp help.

Cab · Jun 2, 2006

Hi all.

I'm trying to set up a script to strip out URL's from the body of a
Usenet post.

Any clues please? I have some expressions that I'm using, but they're
very long winded and inefficient, as seen below. At the moment, I've
done this in bash, but want to eventually set up a perl script to do
this.

So far I've got this small script that will remove URLs that start at
the beginning of a line, into a file. This is the easy part (Note, I
know this is messy, but this is still a dev script, at the moment).

---
echo remove spaces from the start of lines
sed 's/^ *//g' sorted_file > 1

echo Remove all '>' from a file.
sed '/>/d' 1 > 2

echo uniq the file
uniq 2 > 3

echo Move all lines beginning with http or www into another file
sed -n '/^http/p' 3 > 4
sed -n '/^www/p' 3 >> 4

echo Remove all junk on lines from "space" to EOL
sed '/ .*$/d' 4 > 4.1

echo uniq the file
uniq 4.1 > 4.2

echo So far, I've got a file with all www and http only.
mv 4.2 http_and_www_only
---

Once I've stripped these lines (easy enough), I have a file that
remains like this:

----
And the URL is:
Anton, try reading: url:http://ukrm.net/faq/UKRMsCBT.html
Anyone got any experience with http://www.girlsbike2.com/ ? SWMBO needs
Anyone still got the url of the pages about the woman who keeps going
Are available on: http://www.spete.net/ukrm/sedan06/index.html
are July 6-8. The reason being "Power Big Meet",
http://www.bigmeet.com/ ,
Are you sure? http://www.usgpru.net/
a scout around www.nslu2-linux.org - and perhaps there isn't any easier
asked where the sinks were and if you could plug curling tongs into the
----

The result I want is a list like the following:

http://ukrm.net/faq/UKRMsCBT.html
http://www.girlsbike2.com/
http://www.spete.net/ukrm/sedan06/index.html
http://www.bigmeet.com/
http://www.usgpru.net/
www.nslu2-linux.org

Can anyone give me some clues or pointers to websites where I can go
into this in more detail please?

Mirco Wahab · Jun 2, 2006

Thus spoke Cab (on 2006-06-02 15:57):

I'm trying to set up a script to strip out URL's from the body of a
Usenet post.
The result I want is a list like the following:

http://ukrm.net/faq/UKRMsCBT.html
http://www.girlsbike2.com/
http://www.spete.net/ukrm/sedan06/index.html
http://www.bigmeet.com/
http://www.usgpru.net/
www.nslu2-linux.org

The following prints all links
(starting w/http or www) from $text

use:
$> perl dumplinks.pl < text.txt

#!/usr/bin/perl
use strict;
use warnings;

my $data = do {local $/; <> };
print "$1\n" while $data =~ /(\b(http|www)\S+)/g;

# or:
# while (<>) {
# print "$1\n" while /(\b(http|www)\S+)/g;
# }

Of course, this can be done by an one-liner ;-)

Regards

Mirco

Dr.Ruud · Jun 2, 2006

Cab schreef:

Subject: Regexp help.

Please go and read the Posting Guidelines.

Paul Lalli · Jun 2, 2006

Cab said:
I'm trying to set up a script to strip out URL's from the body of a
Usenet post.

Can anyone give me some clues or pointers to websites where I can go
into this in more detail please?

open the original file for reading
open two files for writing - one for the modified file, one for the
list of URLs
loop through each line of the original file
Search for a URI, using Regexp::Common::URI. Replace it with nothing,
and be sure to capture the URI.
print the modified line to the modified file
print the captured URI to the URI file.

Documentation to help you in this goal:
open a file: perldoc -f open
Looping: perldoc perlsyn
Reading a line from a file: perldoc -f readline
Using search-and-replace: perldoc perlop, perldoc perlretut
Regexp::Common::URI:
http://search.cpan.org/~abigail/Regexp-Common-2.120/lib/Regexp/Common/URI.pm
printing to a file: perldoc -f print

Once you have made your *perl* attempt, if it doesn't work the way you
want, feel free to post it here to seek assistance. In the mean time,
be sure to read the posting guidelines for this group. They are posted
here twice a week.

Paul Lalli

Xicheng Jia · Jun 2, 2006

Cab said:
Hi all.

I'm trying to set up a script to strip out URL's from the body of a
Usenet post.

Any clues please? I have some expressions that I'm using, but they're
very long winded and inefficient, as seen below. At the moment, I've
done this in bash, but want to eventually set up a perl script to do
this.

So far I've got this small script that will remove URLs that start at
the beginning of a line, into a file. This is the easy part (Note, I
know this is messy, but this is still a dev script, at the moment).

---
echo remove spaces from the start of lines
sed 's/^ *//g' sorted_file > 1

echo Remove all '>' from a file.
sed '/>/d' 1 > 2

echo uniq the file
uniq 2 > 3

echo Move all lines beginning with http or www into another file
sed -n '/^http/p' 3 > 4
sed -n '/^www/p' 3 >> 4

echo Remove all junk on lines from "space" to EOL
sed '/ .*$/d' 4 > 4.1

echo uniq the file
uniq 4.1 > 4.2

echo So far, I've got a file with all www and http only.
mv 4.2 http_and_www_only
---

Once I've stripped these lines (easy enough), I have a file that
remains like this:

----
And the URL is:
Anton, try reading: url:http://ukrm.net/faq/UKRMsCBT.html
Anyone got any experience with http://www.girlsbike2.com/ ? SWMBO needs
Anyone still got the url of the pages about the woman who keeps going
Are available on: http://www.spete.net/ukrm/sedan06/index.html
are July 6-8. The reason being "Power Big Meet",
http://www.bigmeet.com/ ,
Are you sure? http://www.usgpru.net/
a scout around www.nslu2-linux.org - and perhaps there isn't any easier
asked where the sinks were and if you could plug curling tongs into the
----

The result I want is a list like the following:

http://ukrm.net/faq/UKRMsCBT.html
http://www.girlsbike2.com/
http://www.spete.net/ukrm/sedan06/index.html
http://www.bigmeet.com/
http://www.usgpru.net/
www.nslu2-linux.org

you can start from here:

lynx -dump http://your_url | grep -o '$http\|www$://.*'

then filter out any unwanted links.

HTH,
Xicheng

Cab · Jun 2, 2006

Mirco said:
Thus spoke Cab (on 2006-06-02 15:57):

The following prints all links
(starting w/http or www) from $text

use:
$> perl dumplinks.pl < text.txt

#!/usr/bin/perl
use strict;
use warnings;

my $data = do {local $/; <> };
print "$1\n" while $data =~ /(\b(http|www)\S+)/g;

# or:
# while (<>) {
# print "$1\n" while /(\b(http|www)\S+)/g;
# }

Of course, this can be done by an one-liner ;-)

Regards

Mirco

Ta very much for that. Very helpful.

Cab · Jun 2, 2006

Paul said:
Documentation to help you in this goal:
open a file: perldoc -f open
Looping: perldoc perlsyn
Reading a line from a file: perldoc -f readline
Using search-and-replace: perldoc perlop, perldoc perlretut
Regexp::Common::URI:
http://search.cpan.org/~abigail/Regexp-Common-2.120/lib/Regexp/Common/
URI.pm printing to a file: perldoc -f print

^^^^^^^^^^^^^^^^^^^

Ah, that's handy. Thanks.

Dr.Ruud · Jun 2, 2006

Mirco Wahab schreef:

my $data = do {local $/; <> };
print "$1\n" while $data =~ /(\b(http|www)\S+)/g;

{ local ($", $\, $/) = ("\n", "\n", undef) ;
print "@{[ <> =~ /(\b(?:http:|www\.)\S+)/g ]}"
}

But read `perldoc -q URL`.

John W. Krahn · Jun 3, 2006

Dr.Ruud said:
Mirco Wahab schreef:

my $data = do {local $/; <> };
print "$1\n" while $data =~ /(\b(http|www)\S+)/g;

Click to expand...

{ local ($", $\, $/) = ("\n", "\n", undef) ;
print "@{[ <> =~ /(\b(?:http:|www\.)\S+)/g ]}"
}

{ local ( $,, $\, $/ ) = ( "\n", "\n" );
print <> =~ /\b(?:http:|www\.)\S+/g
}

John

Dr.Ruud · Jun 3, 2006

John W. Krahn schreef:

Dr.Ruud:

Mirco Wahab:

my $data = do {local $/; <> };
print "$1\n" while $data =~ /(\b(http|www)\S+)/g;

Click to expand...

{ local ($", $\, $/) = ("\n", "\n", undef) ;
print "@{[ <> =~ /(\b(?:http:|www\.)\S+)/g ]}"
}

Click to expand...

{ local ( $,, $\, $/ ) = ( "\n", "\n" );
print <> =~ /\b(?:http:|www\.)\S+/g
}

Yes, that certainly is a cleaner variant. I did hesitate to put the
C<undef> at the end of the rightside list, but decided it would be more
educational. But then I was already trapped in using C<$"> where C<$,>
is cleaner.

John W. Krahn · Jun 3, 2006

Dr.Ruud said:
John W. Krahn schreef:

Dr.Ruud:

Mirco Wahab:
my $data = do {local $/; <> };
print "$1\n" while $data =~ /(\b(http|www)\S+)/g;
{ local ($", $\, $/) = ("\n", "\n", undef) ;
print "@{[ <> =~ /(\b(?:http:|www\.)\S+)/g ]}"
}

Click to expand...

{ local ( $,, $\, $/ ) = ( "\n", "\n" );
print <> =~ /\b(?:http:|www\.)\S+/g
}

Click to expand...

Yes, that certainly is a cleaner variant. I did hesitate to put the
C<undef> at the end of the rightside list, but decided it would be more
educational. But then I was already trapped in using C<$"> where C<$,>
is cleaner.

Thanks, and you could also do it like this:

{ local ( $\, $/ ) = "\n";
print for <> =~ /\b(?:http:|www\.)\S+/g
}

John

Mumia W. · Jun 3, 2006

John said:
Dr.Ruud said:

Mirco Wahab schreef:

my $data = do {local $/; <> };
print "$1\n" while $data =~ /(\b(http|www)\S+)/g;

Click to expand...

{ local ($", $\, $/) = ("\n", "\n", undef) ;
print "@{[ <> =~ /(\b(?:http:|www\.)\S+)/g ]}"
}

Click to expand...

{ local ( $,, $\, $/ ) = ( "\n", "\n" );
print <> =~ /\b(?:http:|www\.)\S+/g
}

John

Due to sentence structure, people like to put periods and commas on the
end of their urls, so I decided to strip them off. I'm sorry this is so
longwinded compared to the others:

use strict;
use warnings;

my $data = q{
And the URL is:
Anton, try reading: url:http://ukrm.net/faq/UKRMsCBT.html
Anyone got any experience with http://www.girlsbike2.com/ ? SWMBO needs
Anyone still got the url of the pages about the woman who keeps going
Are available on: http://www.spete.net/ukrm/sedan06/index.html
are July 6-8. The reason being "Power Big Meet",
http://www.bigmeet.com/ www4.redhat.com,
Get a better browser: ftp.mozilla.org.
Are you sure? http://www.usgpru.net/
a scout around www.nslu2-linux.org - and perhaps there isn't any easier
asked where the sinks were and if you could plug curling tongs into the
};

local $_;
open (FH, '<', \$data)
or die("Couldn't open in-memory file: $!\n");

my @urls =
map { /^(.*?)[,.]?$/; }
map { /\b(?:http|ftp|www\d*\.)\S+/g; } <FH>;
print join "\n", @urls;

close FH;

Xicheng Jia · Jun 3, 2006

Mumia said:
John said:

Dr.Ruud said:

Mirco Wahab schreef:

my $data = do {local $/; <> };
print "$1\n" while $data =~ /(\b(http|www)\S+)/g;
{ local ($", $\, $/) = ("\n", "\n", undef) ;
print "@{[ <> =~ /(\b(?:http:|www\.)\S+)/g ]}"
}

Click to expand...

{ local ( $,, $\, $/ ) = ( "\n", "\n" );
print <> =~ /\b(?:http:|www\.)\S+/g
}

John

Click to expand...

Due to sentence structure, people like to put periods and commas on the
end of their urls, so I decided to strip them off. I'm sorry this is so
longwinded compared to the others:

use strict;
use warnings;

my $data = q{
And the URL is:
Anton, try reading: url:http://ukrm.net/faq/UKRMsCBT.html
Anyone got any experience with http://www.girlsbike2.com/ ? SWMBO needs
Anyone still got the url of the pages about the woman who keeps going
Are available on: http://www.spete.net/ukrm/sedan06/index.html
are July 6-8. The reason being "Power Big Meet",
http://www.bigmeet.com/ www4.redhat.com,
Get a better browser: ftp.mozilla.org.
Are you sure? http://www.usgpru.net/
a scout around www.nslu2-linux.org - and perhaps there isn't any easier
asked where the sinks were and if you could plug curling tongs into the
};

local $_;
open (FH, '<', \$data)
or die("Couldn't open in-memory file: $!\n");

my @urls =
map { /^(.*?)[,.]?$/; }
map { /\b(?:http|ftp|www\d*\.)\S+/g; } <FH>;
print join "\n", @urls;

close FH;

what if I add one line at the end of your data, say: $data .= "\nI like
ftpd httpd www....". I guess the result is not what you wanted..

Xicheng

Mumia W. · Jun 3, 2006

Xicheng said:
what if I add one line at the end of your data, say: $data .= "\nI like
ftpd httpd www....". I guess the result is not what you wanted..

Xicheng

Right, writing a RE that does a complete job of separating URLs from
text is not trivial. Tom Christiansen wrote one in FMTEYEWTK, and it's
a lot more than two lines

But the OPs requirements have been more than fulfilled.

Big problem I need to solve with some unix utils	1	Jun 19, 2022
Search Results with Pagination	0	Oct 25, 2024
Getting error message with code	1	Sep 26, 2024
Help with my responsive home page	2	Dec 14, 2022
DJForm Login Help!	1	Sep 28, 2024
Help with Paypal Live Transactions	1	May 19, 2023
I need advice re mysqli dropdown	0	Sep 21, 2016
Login form no longer working	2	Feb 18, 2023

Regexp help.

Cab

Mirco Wahab

Dr.Ruud

Paul Lalli

Xicheng Jia

Cab

Cab

Dr.Ruud

John W. Krahn

Dr.Ruud

John W. Krahn

Mumia W.

Xicheng Jia

Mumia W.

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads