extracting urls

mnml · Nov 18, 2007

Hi, I made a little function to extract URLs from any content with a
regular expression, but it doesn't really work.
When I try to extract URLs from http://google.com I only get 4 results
in my array:

* http://images.google.nl/imghp?oe=ISO-8859-1&hl=nl&tab=wi
* http://
* .nl
* /imghp?oe=ISO-8859-1&hl=nl&tab=wi

Here is the code of my function:

public static void find_url(String content) {
    Pattern p = Pattern.compile(
        "(@)?(http://)?[a-zA-Z_0-9\\-]+(\\.\\w[a-zA-Z_0-9\\-]+)+"
        + "(/[#&\\n\\-=?\\+\\%/\\.\\w]+)?");

    Matcher m = p.matcher(content);

    if (m.find())
    {
        for (int i=0; i<=m.groupCount(); i++) {
            myVar.urls = m.group(i);
        }
    }

}

Andrew Thompson · Nov 18, 2007

Hi, I made a ...

...little boo-boo in multi-posting this message
to comp.lang.java.help, after making a post to
comp.lang.java.programmer.

Please refrain from multi-posting, in future.

X-post to c.l.j.p./h., w/ f-u to c.l.j.h. only.

SadRed · Nov 18, 2007

Don't clutter the forum with your multi posts, please!
Your regex code is very wrong. Study this code and go to bed. I didn't
touch your weird regex string but I firmly believe it is also wrong
for your desired purpose which I don't know in its details.
----------------------------------------------
import java.net.*;
import java.util.regex.*;
import java.io.*;
import java.util.*;

public class Mnm {

    public static void main(String[] args) throws Exception {
        String contStr = "";
        String line = null;

        Locale.setDefault(Locale.US);
        // String urlStr = "http://google.com";
        String urlStr = "http://www.google.com/ig?hl=en";

        if (args.length > 0) {
            urlStr = args[0];
        }

        URL url = new URL(urlStr);
        InputStream is = url.openStream();

        BufferedReader br = new BufferedReader(new InputStreamReader(is));
        while ((line = br.readLine()) != null) {
            contStr += line;
        }

        findUrl(contStr);
    }

    public static void findUrl(String content) {
        int gc, counter, gcounter;
        gc = counter = gcounter = 0;

        Pattern p = Pattern.compile(
            "(@)?(http://)?[a-zA-Z_0-9\\-]+(\\.\\w[a-zA-Z_0-9\\-]+)+"
            + "(/[#&\\n\\-=?\\+\\%/\\.\\w]+)?");

        Matcher m = p.matcher(content);
        gc = m.groupCount();
        for (int i = 0; i <= gc; ++i) {
            System.out.println("GROUP" + i + " : ");
            while (m.find()) {
                ++counter;
                ++gcounter;
                System.out.println(gcounter + ".> " + m.group(i));
            }
            m.reset(content); // for next group
            gcounter = 0;
        }
        if (counter == 0) {
            System.out.println("--no match--");
        }
    }
}
----------------------------------------
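
For comparison, find_url in the question calls find() once and then walks the capture groups of that single match, so the entries it reports are group 0 and its sub-groups rather than separate URLs. A minimal sketch, assuming only the complete matches are wanted: call find() in a loop and keep group() each time. The class name UrlCollector and the method findUrls are made up for this illustration, and the pattern is the one from the thread, unchanged, so it still over-matches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlCollector {
    // Pattern copied from the thread as-is; it is known to match more than URLs.
    private static final Pattern URL_PATTERN = Pattern.compile(
        "(@)?(http://)?[a-zA-Z_0-9\\-]+(\\.\\w[a-zA-Z_0-9\\-]+)+"
        + "(/[#&\\n\\-=?\\+\\%/\\.\\w]+)?");

    // Returns every complete match (group 0), one list entry per find().
    public static List<String> findUrls(String content) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL_PATTERN.matcher(content);
        while (m.find()) {          // advance to the next match until none are left
            urls.add(m.group());    // group() with no argument is the whole match
        }
        return urls;
    }

    public static void main(String[] args) {
        String sample = "see http://images.google.nl/imghp?hl=nl"
            + " and http://www.google.com/favicon.ico";
        for (String url : findUrls(sample)) {
            System.out.println(url);
        }
    }
}
```

Returning a list instead of assigning to a single field also avoids overwriting the previous result on every iteration.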

mnml · Nov 18, 2007

Thanks for your example. Yes, the regexp is wrong: with your example it
was returning things like:

3.> http://www.google.com/favicon.ico
4.> http://www.google.com/favicon.ico
5.> WeTHhV4cOxM.js
6.> document.location.hostname
7.> domain.indexOf
8.> domain.substring
9.> document.cookie
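
Entries such as document.cookie are consistent with the (http://)? part being optional: without a required scheme, any dotted identifier satisfies the pattern. A minimal sketch, assuming only absolute http or https URLs are wanted; the class name StrictUrlFinder, the method findAbsoluteUrls and the simplified character classes are choices made for this illustration, not a full URL grammar.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StrictUrlFinder {
    // The scheme is mandatory here, so bare dotted names are no longer matched.
    private static final Pattern ABSOLUTE_URL = Pattern.compile(
        "https?://[a-zA-Z0-9\\-]+(\\.[a-zA-Z0-9\\-]+)+(/[#&\\-=?+%/.\\w]*)?");

    public static List<String> findAbsoluteUrls(String content) {
        List<String> urls = new ArrayList<>();
        Matcher m = ABSOLUTE_URL.matcher(content);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String sample = "x = document.cookie; img http://www.google.com/favicon.ico";
        System.out.println(findAbsoluteUrls(sample));
    }
}
```

The trade-off is that relative links and scheme-less host names are skipped entirely.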

Roedy Green · Nov 19, 2007

Pattern p = Pattern.compile("(@)?(http://)?[a-zA-Z_0-9\\-]+(\\.\\w[a-zA-Z_0-9\\-]+)+(/[#&\\n\\-=?\\+\\%/\\.\\w]+)?");

To find the problem, keep chopping the tail end off the pattern and
redoing the search. When the elements it missed come back, you know the
problem was in the bit you just chopped.

I usually compose these a bit at a time, adding just one phrase
before testing again.

For other hints see http://mindprod.com/jgloss/regex.html
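
A small harness along these lines can make that piece-by-piece testing mechanical. It is only a sketch: the class name RegexBuildUp, the helpers prefix and firstMatch, and the sample input are assumptions for illustration, and the pieces are simply the thread's pattern split at its obvious seams.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexBuildUp {
    // The pattern from the thread, cut into the phrases it is built from.
    static final String[] PIECES = {
        "(@)?",
        "(http://)?",
        "[a-zA-Z_0-9\\-]+",
        "(\\.\\w[a-zA-Z_0-9\\-]+)+",
        "(/[#&\\n\\-=?\\+\\%/\\.\\w]+)?"
    };

    // Joins the first n pieces into one regex.
    static String prefix(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append(PIECES[i]);
        }
        return sb.toString();
    }

    // Returns the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String sample = "http://www.google.com/favicon.ico";
        // Add one piece at a time and watch how far the match extends.
        for (int n = 1; n <= PIECES.length; n++) {
            System.out.println(prefix(n) + "  ->  " + firstMatch(prefix(n), sample));
        }
    }
}
```

Reading the output from the last line upward is the "chop the tail off" direction; reading it downward is the "add a phrase and test" direction.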

mnml · Nov 20, 2007

ok, thank you for the link

Chris · Nov 20, 2007

mnml said:
Hi, I made a little function to extract urls from any content with a
regular expression but it doesn't really work.

If you're extracting URLs from HTML, it's a lot easier to try to
recognize the anchor tags. Write a regex to recognize:

<a ~ href="~" ~ >

where ~ means "up to". I've implemented this in a lexer and it works
reliably. (Regexes work a little differently in a lexer, so I don't have
a regex to post). Just adjust to handle mixed case.
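
One possible java.util.regex rendering of the `<a ~ href="~" ~ >` idea is sketched below. It is not the lexer version mentioned above: the class name HrefExtractor, the method extractHrefs and the example.com input are made up for illustration, and the sketch assumes double-quoted href values only. Regex-based HTML scanning stays fragile on unusual markup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefExtractor {
    // <a, anything up to href="...", then anything up to the closing >.
    // CASE_INSENSITIVE handles mixed-case tags and attributes such as <A HREF=...>.
    private static final Pattern ANCHOR = Pattern.compile(
        "<a\\b[^>]*?\\bhref\\s*=\\s*\"([^\"]*)\"[^>]*>",
        Pattern.CASE_INSENSITIVE);

    // Returns the href value of every anchor tag found in html.
    public static List<String> extractHrefs(String html) {
        List<String> hrefs = new ArrayList<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            hrefs.add(m.group(1)); // group 1 is the text between the quotes
        }
        return hrefs;
    }

    public static void main(String[] args) {
        String html = "<p><A class=\"x\" HREF=\"http://example.com/a?b=1\">x</A>"
            + " <a href=\"/rel\">y</a></p>";
        System.out.println(extractHrefs(html));
    }
}
```

Because the URL is taken from inside the quotes, the pattern does not need to describe what a URL looks like, which is where the original expression ran into trouble.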

mnml · Nov 21, 2007

ok, thank you
