extracting urls

mnml

Hi, I made a little function to extract URLs from any content with a
regular expression, but it doesn't really work.
When I try to extract URLs from http://google.com I only get 4 results
in my array:

* http://images.google.nl/imghp?oe=ISO-8859-1&hl=nl&tab=wi
* http://
* .nl
* /imghp?oe=ISO-8859-1&hl=nl&tab=wi



Here is the code of my function:

public static void find_url(String content) {
    Pattern p = Pattern.compile(
        "(@)?(http://)?[a-zA-Z_0-9\\-]+(\\.\\w[a-zA-Z_0-9\\-]+)+"
        + "(/[#&\\n\\-=?\\+\\%/\\.\\w]+)?");

    Matcher m = p.matcher(content);

    if (m.find())
    {
        for (int i=0; i<=m.groupCount(); i++) {
            myVar.urls = m.group(i);
        }
    }
}
 
Andrew Thompson

Hi, I made a ...

...little boo-boo in multi-posting this message
to comp.lang.java.help, after making a post to
comp.lang.java.programmer.

Please refrain from multi-posting, in future.

X-post to c.l.j.p./h., w/ f-u to c.l.j.h. only.
 
SadRed

Don't clutter the forum with your multi-posts, please!
Your regex code is very wrong. Study this code and go to bed. I didn't
touch your weird regex string, but I firmly believe it is also wrong
for your desired purpose, which I don't know in detail.
----------------------------------------------
import java.net.*;
import java.util.regex.*;
import java.io.*;
import java.util.*;

public class Mnm{

    public static void main(String[] args) throws Exception{
        StringBuilder contStr = new StringBuilder();
        String line = null;

        Locale.setDefault(Locale.US);
        // String urlStr = "http://google.com";
        String urlStr = "http://www.google.com/ig?hl=en";

        // an optional command-line argument overrides the default URL
        if (args.length > 0){
            urlStr = args[0];
        }

        URL url = new URL(urlStr);
        InputStream is = url.openStream();

        // read the whole page into one string (line breaks are dropped)
        BufferedReader br = new BufferedReader(new InputStreamReader(is));
        while ((line = br.readLine()) != null){
            contStr.append(line);
        }
        br.close();

        findUrl(contStr.toString());
    }

    public static void findUrl(String content) {
        int gc, counter, gcounter;
        gc = counter = gcounter = 0;

        // the regex from the original post, untouched
        Pattern p = Pattern.compile(
            "(@)?(http://)?[a-zA-Z_0-9\\-]+(\\.\\w[a-zA-Z_0-9\\-]+)+"
            + "(/[#&\\n\\-=?\\+\\%/\\.\\w]+)?");

        Matcher m = p.matcher(content);
        gc = m.groupCount();
        // for each group (0 = the whole match), walk over every match
        for (int i = 0; i <= gc; ++i){
            System.out.println("GROUP" + i + " : ");
            while (m.find()){
                ++counter;
                ++gcounter;
                System.out.println(gcounter + ".> " + m.group(i));
            }
            m.reset(content); // for next group
            gcounter = 0;
        }
        if (counter == 0){
            System.out.println("--no match--");
        }
    }
}
----------------------------------------
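
For comparison, here is a minimal sketch of what the original find_url was probably meant to do: call find() in a loop and keep only group 0 (the whole match) of each hit. A single if (m.find()) followed by a loop over groupCount() walks the capture groups of the first match only, which would explain the 4 results. The class name UrlCollector and the method collectUrls are made up for illustration, and the regex is the one from the original post, unchanged.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlCollector {

    // Regex copied from the original post; only the line wrap is repaired.
    private static final Pattern URL_PATTERN = Pattern.compile(
        "(@)?(http://)?[a-zA-Z_0-9\\-]+(\\.\\w[a-zA-Z_0-9\\-]+)+"
        + "(/[#&\\n\\-=?\\+\\%/\\.\\w]+)?");

    // Returns the full text of every match, in order of appearance.
    static List<String> collectUrls(String content) {
        List<String> urls = new ArrayList<String>();
        Matcher m = URL_PATTERN.matcher(content);
        while (m.find()) {          // one iteration per match, not per group
            urls.add(m.group());    // group() is group 0, the whole match
        }
        return urls;
    }

    public static void main(String[] args) {
        String html =
            "<a href=\"http://images.google.nl/imghp?oe=ISO-8859-1&hl=nl&tab=wi\">Images</a>"
            + " <a href=\"http://www.google.com/favicon.ico\">icon</a>";
        for (String url : collectUrls(html)) {
            System.out.println(url);
        }
    }
}
```

Groups 1 to 4 are still available inside the loop through m.group(1) and so on, if the pieces of each URL are needed.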
 
mnml

Thanks for your example. Yeah, the regexp is wrong; with your example it
was returning stuff like:

3.> http://www.google.com/favicon.ico
4.> http://www.google.com/favicon.ico
5.> WeTHhV4cOxM.js
6.> document.location.hostname
7.> domain.indexOf
8.> domain.substring
9.> document.cookie
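
Hits like document.cookie get through because the (http://)? prefix is optional, so any dotted name in the page's JavaScript matches too. One possible way to narrow it, sketched below, is to make the scheme mandatory. The STRICT pattern is only an assumption about what is wanted (it also drops the (@)? prefix and ignores https and relative links); the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SchemeRequired {

    // The pattern from the original post: "http://" is optional.
    static final Pattern LOOSE = Pattern.compile(
        "(@)?(http://)?[a-zA-Z_0-9\\-]+(\\.\\w[a-zA-Z_0-9\\-]+)+"
        + "(/[#&\\n\\-=?\\+\\%/\\.\\w]+)?");

    // Hypothetical stricter variant: same pieces, but "http://" is required.
    static final Pattern STRICT = Pattern.compile(
        "http://[a-zA-Z_0-9\\-]+(\\.\\w[a-zA-Z_0-9\\-]+)+"
        + "(/[#&\\n\\-=?\\+\\%/\\.\\w]+)?");

    // Collects the whole text of every match of p in content.
    static List<String> findAll(Pattern p, String content) {
        List<String> hits = new ArrayList<String>();
        Matcher m = p.matcher(content);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String script =
            "var c = document.cookie; var u = \"http://www.google.com/favicon.ico\";";
        System.out.println("loose : " + findAll(LOOSE, script));
        System.out.println("strict: " + findAll(STRICT, script));
    }
}
```

On that sample line the loose pattern reports both document.cookie and the favicon URL, while the strict one reports only the URL.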
 
mnml

Pattern p = Pattern.compile(
    "(@)?(http://)?[a-zA-Z_0-9\\-]+(\\.\\w[a-zA-Z_0-9\\-]+)+"
    + "(/[#&\\n\\-=?\\+\\%/\\.\\w]+)?");

To find out the problem, keep chopping the tail end off and redoing
the search. When the elements it missed come back, you know the
problem was in the bit you just chopped.

I usually compose these just a bit at a time, adding on just a phrase
before testing.

For other hints see http://mindprod.com/jgloss/regex.html
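
The chop-and-retest idea can be sketched roughly like this: try the pattern one phrase at a time against a known sample and watch what each step matches. The helper firstMatch and the class name RegexBuildUp are made up for illustration; the sample is the first URL from the original post.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexBuildUp {

    // Returns the first match of regex in sample, or null if there is none.
    static String firstMatch(String regex, String sample) {
        Matcher m = Pattern.compile(regex).matcher(sample);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String sample = "http://images.google.nl/imghp?oe=ISO-8859-1&hl=nl&tab=wi";

        // The pattern from the post, grown one phrase at a time.
        String[] steps = {
            "(http://)?",
            "(http://)?[a-zA-Z_0-9\\-]+",
            "(http://)?[a-zA-Z_0-9\\-]+(\\.\\w[a-zA-Z_0-9\\-]+)+",
            "(http://)?[a-zA-Z_0-9\\-]+(\\.\\w[a-zA-Z_0-9\\-]+)+"
                + "(/[#&\\n\\-=?\\+\\%/\\.\\w]+)?"
        };

        for (String step : steps) {
            System.out.println(step + "  ->  " + firstMatch(step, sample));
        }
    }
}
```

If a step suddenly matches less than the one before it, the phrase just added is the place to look.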

ok, thank you for the link
 
Chris

mnml said:
Hi, I made a little function to extract urls from any content with a
regular expression but it doesn't really work.

If you're extracting URLs from HTML, it's a lot easier to try to
recognize the anchor tags. Write a regex to recognize:

<a ~ href="~" ~ >

where ~ means "up to". I've implemented this in a lexer and it works
reliably. (Regexes work a little differently in a lexer, so I don't have
a regex to post). Just adjust to handle mixed case.
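
A rough java.util.regex version of that idea might look like the sketch below. It is not the lexer mentioned above, just one possible reading of <a ~ href="~" ~ >: it assumes double-quoted href values only, uses CASE_INSENSITIVE for the mixed-case part, and the names HrefExtractor and extractHrefs are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefExtractor {

    // <a, anything up to whitespace + href="...", anything up to >.
    // Group 1 is the text between the double quotes.
    private static final Pattern ANCHOR = Pattern.compile(
        "<a\\b[^>]*?\\shref\\s*=\\s*\"([^\"]*)\"[^>]*>",
        Pattern.CASE_INSENSITIVE);

    // Returns the href value of every anchor tag found in html.
    static List<String> extractHrefs(String html) {
        List<String> hrefs = new ArrayList<String>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            hrefs.add(m.group(1));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        String html =
            "<A class=\"x\" HREF=\"http://www.google.com/ig?hl=en\">iG</A> "
            + "<a href=\"/imghp?hl=nl\">Images</a>";
        for (String href : extractHrefs(html)) {
            System.out.println(href);
        }
    }
}
```

Relative links such as /imghp?hl=nl come back as written, so they would still need to be resolved against the page URL. Single-quoted or unquoted href values are not handled by this sketch.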
 
mnml

ok, thank you :)
 
