Optimizing HTML links parser

chingooo3k

Hi,

I am a newbie to Java and HTML parsing, although I have done lex/yacc
compilers before. I am trying to leech any http link from a given file,
be it a proper 'http://www.....' or just a reference like
'/somedirectory/..../stuff' .... For now I plan on running quick tests
on the local file references to see if they exist or not on the hard
drive, and so I got into Java and regular expressions ....

Can the Java gurus here (hehe ok I'm not being picky) please comment on
my code and how I can optimize it ? Please don't just say it 'sucks' (I
know it does) .. give me a 'because' and perhaps some pointers on how
to make it not so sucky :)

Thanks.

*******************************
*******************************
import java.io.*;
import java.util.regex.*;


public class InternalLinkChecker
{
    private static Pattern pattern;
    private static Matcher matcher;
    private static String REGEX;
    private static BufferedReader in = null;
    private static FileWriter out_rep = null;

    public static void main (String [] args)
    {
        try
        {
            if(args.length != 1)
                throw new IllegalArgumentException("Need to let me know which file.");
            else
            {
                File file = new File(args[0]);

                if (file.exists())
                {
                    in = new BufferedReader(new FileReader(file));
                    StringBuffer buff = new StringBuffer();
                    int c;

                    // read the whole file into memory, one character at a time
                    while((c=in.read())!= -1)
                        buff.append((char) c);

                    StringBuffer temp2 = new StringBuffer();
                    String blah;
                    String [] Split;
                    int count = 0;

                    // '<a href=' followed by non-whitespace characters, ending at a quote
                    REGEX = "(<a href=)[^\\s]+(\")";

                    pattern = Pattern.compile(REGEX, Pattern.CASE_INSENSITIVE);
                    matcher = pattern.matcher(buff);

                    while(matcher.find())
                    {
                        System.out.println("----------------------");
                        System.out.println("I found: \' " + matcher.group() + "' \n" +
                            "Range: " + matcher.start() + " to " + matcher.end());

                        count++;

                        temp2.append(matcher.group());
                    }

                    System.out.println("\n so I found a total of " + count + " URLS.");

                    blah = temp2.toString();

                    // strip the '<a href="' prefix, then split on the closing quotes
                    blah = blah.replaceAll("(?i)<A HREF=\"","");

                    Split = blah.split("\"");

                    out_rep = new FileWriter(new File("Rep.txt"));

                    // write one link per line (Split[i], not the array itself)
                    for (int i=0; i<Split.length; i++)
                        out_rep.write(Split[i] + "\n");
                }
                else
                {
                    throw new IllegalArgumentException("Your file does not exist!");
                }
            }
        }
        catch (IOException e)
        {
            System.err.println(e);
            e.printStackTrace();
        }
        finally
        {
            try
            {
                // either stream may still be null if an exception came early
                if (in != null) in.close();
                if (out_rep != null) out_rep.close();
            }
            catch (IOException ex)
            {
                ex.printStackTrace();
                System.err.println(ex);
            }
        }
    }
}

*****************************
******************************
 
Joan

Hi,

I am a newbie to java and html parsing although I have done lex/yacc
compilers before. I am trying to leech any http link from a given file
be it a proper 'http://www.....' or just a reference like
'/somedirectory/..../stuff' .... For now I plan on running quick tests
on the local file references to see if they exist or not on the hard
drive and so I got into Java and regular expressions ....

One thing that has always bothered me: how do you tell where the
filename ends if it can contain blanks? For example:

http://abc.com/~joan/My Documents

Is the filename My or is it My Documents?

How do you address this?
 
chingooo3k

REGEX = "(<a href=)[^\\s]+(\")";

takes care of it ... the file name does not need to end in .shtml or
whatever, I am ending my regular expression looking for " ...
 
chingooo3k

To address the second issue of spaces, I am assuming the user will not
have spaces in their filenames... honestly, I think having whitespaces
is just wrong because it leads to broken links and other issues later
on .. so I am thinking it is safe to assume this restriction.


--offtopic, but how do you edit your posts ?
 
Daniel Dyer

To address the second issue of spaces, I am assuming the user will not
have spaces in their filenames... honestly, I think having whitespaces
is just wrong because it leads to broken links and other issues later
on .. so I am thinking it is safe to assume this restriction.

Spaces in URLs should be encoded as "%20" (search for URL encoding).

--offtopic, but how do you edit your posts ?

You can't. This is USENET son, we don't make mistakes.

As you are posting from Google Groups, you might want to read Andrew's
explanation of its relationship with USENET
(http://www.physci.org/codes/javafaq.jsp#usenet).

Dan.
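
For anyone following along, the "%20" encoding mentioned above can be sketched with java.net.URI. This is only an illustration, not code from the thread: the class and method names are made up, and it reuses Joan's example path with the host abc.com. The multi-argument URI constructor quotes characters that are illegal in a path, and getPath() gives back the decoded form, which is what you would want before checking whether a local file exists.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper class; names are illustrative only.
final class SpaceEncodingSketch {

    // Builds a URI from unencoded parts. The multi-argument constructor
    // percent-encodes characters that are not legal in a path (e.g. spaces).
    static String encodePath(String scheme, String host, String path) {
        try {
            return new URI(scheme, host, path, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot build URI", e);
        }
    }

    // Returns the decoded path of an already-encoded URI string,
    // so "%20" turns back into a real space.
    static String decodePath(String encoded) {
        return URI.create(encoded).getPath();
    }

    public static void main(String[] args) {
        String encoded = encodePath("http", "abc.com", "/~joan/My Documents");
        System.out.println(encoded);             // http://abc.com/~joan/My%20Documents
        System.out.println(decodePath(encoded)); // /~joan/My Documents
    }
}
```

Note that java.net.URLEncoder is meant for form data and turns a space into "+", so it is probably not what you want for path segments.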
 
chingooo3k

still no comments on the actual code and its optimization ... cmon
gurus, I know you are hiding in there :)

no editing huh ? hehe ok .. so it's like the remove command in Unix :)
 
Daniel Dyer

still no comments on the actual code and its optimization ... cmon
gurus, I know you are hiding in there :)

I haven't analysed your code in detail, but a few things I noticed:

Firstly, trying to match "<a href=\"" won't work in all cases. You can
have whitespace around the '=' character, you can have more than one
whitespace character between the 'a' and "href" and they don't have to be
spaces (they could be tabs or new lines).
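
A rough sketch of a more tolerant pattern, under some assumptions: the class and method names are invented for illustration, other attributes may appear before href, and only double-quoted values are handled (single-quoted or unquoted values would need more alternatives). Capturing the value in a group also avoids the replaceAll/split step afterwards.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; names are illustrative only.
final class HrefRegexSketch {

    // "<a", at least one whitespace character (space, tab or newline),
    // optionally other attributes, then "href", optional whitespace around
    // '=', and a double-quoted value captured in group 1.
    private static final Pattern HREF = Pattern.compile(
            "<a\\s+(?:[^>]*?\\s)?href\\s*=\\s*\"([^\"]*)\"",
            Pattern.CASE_INSENSITIVE);

    // Returns the href values in the order they appear in the text.
    static List<String> extractLinks(CharSequence html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://www.example.com/\">x</a>\n"
                + "<A\n\tHREF = \"/dir/page.html\">y</A>\n"
                + "<a class=\"c\" href=\"z.html\">z</a>";
        for (String link : extractLinks(html)) {
            System.out.println(link);
        }
    }
}
```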

Secondly, if you are using Java 5, use StringBuilder instead of
StringBuffer since your code is not multi-threaded and doesn't need to
synchronise. That said, performance gains will probably not be noticeable.

Thirdly, I would use the readLine method on BufferedReader, rather than
reading one character at a time.
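
A minimal sketch of the readLine approach combined with StringBuilder, assuming Java 7 or later for try-with-resources (on Java 5 a finally block would do the same job). The class and method names are made up. readLine drops the line terminator, so a '\n' is appended to keep tags on different lines from running together.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper class; names are illustrative only.
final class ReadAllSketch {

    // Reads everything from the given reader, line by line, and closes it.
    // Line endings are normalised to '\n' because readLine strips them.
    static String readAll(Reader source) throws IOException {
        StringBuilder text = new StringBuilder();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                text.append(line).append('\n');
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for new FileReader(file) here.
        System.out.print(readAll(new StringReader("<a href=\"a.html\">\r\n</a>")));
    }
}
```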

Finally, why not change your code so that it accepts a URL rather than a
file system path (you can use a file:// URL if you need to access local
files)? That way you can point your program at a page on the web to
extract links (or even recursively extract the links from files that are
linked to from the first file).
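
One way this could look, as a sketch only: the class and method names are invented, the page is assumed to be UTF-8, and www.example.com is a placeholder host. The same code reads http:// and file:// URLs, and URI.resolve turns a relative reference such as '/somedirectory/stuff' into an absolute one based on the page it was found in.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; names are illustrative only.
final class UrlSourceSketch {

    // Reads the whole page behind the URL (http://, file://, ...).
    // Assumes the content is UTF-8, which is not true of every page.
    static String fetch(URL page) throws IOException {
        StringBuilder text = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(page.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                text.append(line).append('\n');
            }
        }
        return text.toString();
    }

    // Resolves an href found in a page against that page's own address.
    static URI resolve(URI base, String href) {
        return base.resolve(href);
    }

    public static void main(String[] args) throws IOException {
        URI base = URI.create("http://www.example.com/docs/index.html");
        System.out.println(resolve(base, "/somedirectory/stuff")); // http://www.example.com/somedirectory/stuff
        System.out.println(resolve(base, "other.html"));           // http://www.example.com/docs/other.html
        if (args.length == 1) {
            // e.g. java UrlSourceSketch file:///tmp/page.html
            System.out.print(fetch(URI.create(args[0]).toURL()));
        }
    }
}
```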

Dan.
 
chingooo3k

Thanks Dan.

<quote>
Firstly, trying to match "<a href=\"" won't work in all cases. You can
have whitespace around the '=' character, you can have more than one
whitespace character between the 'a' and "href" and they don't have to
be spaces (they could be tabs or new lines).
</quote>

Yup.. this is just a prototype I guess so I didn't think about that but
it's pretty easy to include optional whitespaces using regular
expressions....

<quote>
Secondly, if you are using Java 5, use StringBuilder instead of
StringBuffer since your code is not multi-threaded and doesn't need to
synchronise. That said, performance gains will probably not be
noticeable.
</quote>

hmm cool ... well I just started this so there is no threading, but
what I have in mind (assuming I can at least get this to work) is a
moderately complex GUI which will probably need threading.. I still
don't know for sure.

<quote>
Thirdly, I would use the readLine method on BufferedReader, rather than
reading one character at a time.
</quote>

ah nice .. I was afraid to use readLine because I read somewhere it has
some bugs/issues... but ok I'll use this instead :)

<quote>
Finally, why not change your code so that it accepts a URL rather than
a file system path (you can use a file:// URL if you need to access
local files)? That way you can point your program at a page on the web
to extract links (or even recursively extract the links from files that
are linked to from the first file).
</quote>

hmmm eventually yes but right now I'm just trying to get it up and
doing something useful for me. Currently, I am still unsure how fast
this is and how accurate so that's why I was afraid of major blunders
in my approach.

Or maybe there are already kick ass parsers that can leech html links
?? the ones I googled all have some crap dependency issues and some
don't even ship with the proper source files (HTMLSchema in tagsoup
html parser) .... need direction !! :)

Thanks for all the help again!
 
Abhijat Vatsyayan

Why not use antlr and an HTML grammar?
The JDK comes with an HTML parser. Why not use that?
What about an SGML parser (never used one, never seen one in Java)?
Anyone know of a good (comprehensive) SGML grammar file for antlr?
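
For reference, the parser that ships with the JDK lives in javax.swing.text.html.parser. A minimal sketch of pulling href values out with it follows; the class and method names here are made up, while ParserDelegator and HTMLEditorKit.ParserCallback are the JDK classes. The third argument to parse tells the parser to ignore any charset declared in the document, since a Reader has already decoded the text.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class; names are illustrative only.
final class SwingParserSketch {

    // Collects the href attribute of every <a> start tag the parser reports.
    static List<String> extractLinks(Reader html) throws IOException {
        final List<String> links = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };
        // true = ignore the charset declared inside the document
        new ParserDelegator().parse(html, callback, true);
        return links;
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><p><a href=\"a.html\">one</a> "
                + "<A HREF=\"/dir/b.html\">two</A></p></body></html>";
        for (String link : extractLinks(new StringReader(html))) {
            System.out.println(link);
        }
    }
}
```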


 
chingooo3k

antlr and an html grammar :) problem is, nobody is following HTML
guidelines properly... so I need a parser that won't choke on the nasty
HTML we see a lot these days.... not fun to do all by myself :) I could
have the proper grammar, just the majority not following it is what I'm
saying.


JDK comes with the dirty old shitty java something browser parser
engine in javax.swing.... they used to ship a few years back. It's not
been keeping up and has other issues too *cough* multi-threading
*cough*.... I did get it to work and it was pretty easy but I don't
trust it enough.

SGML parser.. dunno, sounds cool though :)

nope, not going into antlr stuff man .. no way .. if anyone else runs
into the same issue as myself, get yourself a proper parser like HTML
Parser on SourceForge. It is HUGE. I had to dig through I-lost-track-of-
how-many lines of code.... but at the end of the day, it gets the
job done so I'm happy :) ah yes the luxury of not having to go through
the crap of nifty little things like setting up proper base urls for
online / offline pages.. I love you guys (not in the wrong way) ...

Thanks for all the help.
 
Rogan Dawes

Another alternative to HTMLParser which you mention in another email, is
TagSoup.

TagSoup is an HTML parser that generates SAX events based on what it reads.

It might also be useful for you . . .

Regards,

Rogan
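
To give an idea of what "generates SAX events" means in practice, here is a rough sketch of a SAX handler that collects href values. The class and method names are made up. To keep it runnable with only the JDK, it uses the stock XML parser on well-formed markup; the assumption is that TagSoup's parser (org.ccil.cowan.tagsoup.Parser, which implements XMLReader) could be passed in instead to cope with messy real-world HTML.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class; names are illustrative only.
final class SaxLinkSketch {

    // Records the href attribute of every <a> element the reader reports.
    static final class LinkHandler extends DefaultHandler {
        final List<String> links = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("a".equalsIgnoreCase(qName)) {
                String href = atts.getValue("href");
                if (href != null) {
                    links.add(href);
                }
            }
        }
    }

    // Works with any XMLReader; swapping in TagSoup's reader is the assumed use.
    static List<String> extractLinks(XMLReader reader, String markup)
            throws IOException, SAXException {
        LinkHandler handler = new LinkHandler();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(markup)));
        return handler.links;
    }

    public static void main(String[] args) throws Exception {
        // JDK XML parser as a stand-in; it only accepts well-formed input.
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        String markup = "<html><body><a href=\"a.html\">x</a>"
                + "<a name=\"n\">y</a><a href=\"/dir/b.html\">z</a></body></html>";
        for (String link : extractLinks(reader, markup)) {
            System.out.println(link);
        }
    }
}
```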
 
