Cab
Hi all.
I'm trying to set up a script to strip out URLs from the body of a
Usenet post.
Any clues please? I have some expressions that I'm using, but they're
very long-winded and inefficient, as seen below. At the moment I've
done this in bash, but eventually I want to turn it into a perl script.
So far I've got this small script that pulls the lines where a URL
starts at the beginning of the line out into a file. This is the easy
part (note: I know it's messy, but it's still a dev script at the
moment).
---
#!/bin/bash
echo "Remove spaces from the start of lines"
sed 's/^ *//' sorted_file > 1
echo "Remove all quoted lines (containing '>') from the file"
sed '/>/d' 1 > 2
echo "uniq the file"
uniq 2 > 3
echo "Copy all lines beginning with http or www into another file"
sed -n '/^http/p' 3 > 4
sed -n '/^www/p' 3 >> 4
echo "Remove all junk on lines from the first space to EOL"
sed 's/ .*$//' 4 > 4.1   # substitute rather than delete, or those lines vanish
echo "uniq the file"
uniq 4.1 > 4.2
echo "So far, I've got a file with all www and http only."
mv 4.2 http_and_www_only
---
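Since I'm heading for perl anyway, I did wonder whether the whole
pipeline above could collapse into a single pass. Something like this
seems plausible, though the character class is a guess on my part and
I've dropped the quoted-line filter in favour of just deduplicating:
---
perl -ne 'print "$1\n" while m{((?:https?://|www\.)[^\s>]+)}g' sorted_file | sort -u
---
The sort -u replaces the separate uniq steps, and the while-//g loop
ought to pick up URLs in the middle of a line as well as at the start.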
Once I've stripped those lines out (easy enough), the file that
remains looks like this:
----
And the URL is:
Anton, try reading: url:http://ukrm.net/faq/UKRMsCBT.html
Anyone got any experience with http://www.girlsbike2.com/ ? SWMBO needs
Anyone still got the url of the pages about the woman who keeps going
Are available on: http://www.spete.net/ukrm/sedan06/index.html
are July 6-8. The reason being "Power Big Meet",
http://www.bigmeet.com/ ,
Are you sure? http://www.usgpru.net/
a scout around www.nslu2-linux.org - and perhaps there isn't any easier
asked where the sinks were and if you could plug curling tongs into the
----
The result I want is a list like the following:
http://ukrm.net/faq/UKRMsCBT.html
http://www.girlsbike2.com/
http://www.spete.net/ukrm/sedan06/index.html
http://www.bigmeet.com/
http://www.usgpru.net/
www.nslu2-linux.org
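As a first stab at the eventual perl version (the regex and the
trailing-punctuation trim are guesses on my part, so treat it as a
sketch rather than anything tested), I've got this far:
---
#!/usr/bin/perl
use strict;
use warnings;

# Read a post body on STDIN; print each URL once, in order of first sighting.
my %seen;
while (my $line = <STDIN>) {
    next if $line =~ /^\s*>/;    # skip quoted lines, as in the sed version
    while ($line =~ m{((?:https?://|www\.)[^\s>]+)}g) {
        my $url = $1;
        $url =~ s/[.,]+$//;      # trim trailing punctuation such as "foo/,"
        print "$url\n" unless $seen{$url}++;
    }
}
---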
Can anyone give me some clues or pointers to websites where I can go
into this in more detail please?