Optimizing HTML links parser

chingooo3k · Nov 3, 2005

Hi,

I am a newbie to Java and HTML parsing, although I have done lex/yacc
compilers before. I am trying to leech any http link from a given file,
be it a proper 'http://www.....' or just a reference like
'/somedirectory/..../stuff' .... For now I plan on running quick tests
on the local file references to see if they exist or not on the hard
drive, and so I got into Java and regular expressions ....

Can the java gurus here (hehe ok I'm not being picky) please comment on
my code and how I can optimize it ? Please don't just say it 'sucks' (I
know it does) .. give me a 'because' and perhaps some pointers on how
to make it not so sucky

Thanks.

*******************************
*******************************
import java.io.*;
import java.util.regex.*;

public class InternalLinkChecker
{
    private static Pattern pattern;
    private static Matcher matcher;
    private static String REGEX;
    private static BufferedReader in = null;
    private static FileWriter out_rep = null;

    public static void main (String [] args)
    {
        try
        {
            if (args.length != 1)
                throw new IllegalArgumentException("Need to let me know which file.");
            else
            {
                File file = new File(args[0]);

                if (file.exists())
                {
                    // Read the whole file into memory, one character at a time.
                    in = new BufferedReader(new FileReader(file));
                    StringBuffer buff = new StringBuffer();
                    int c;

                    while ((c = in.read()) != -1)
                        buff.append((char) c);

                    StringBuffer temp2 = new StringBuffer();
                    String blah;
                    String [] Split;
                    int count = 0;

                    REGEX = "(<a href=)[^\\s]+(\")";

                    pattern = Pattern.compile(REGEX, Pattern.CASE_INSENSITIVE);
                    matcher = pattern.matcher(buff);

                    // Report every match and collect the matches back to back.
                    while (matcher.find())
                    {
                        System.out.println("----------------------");
                        System.out.println("I found: \' " + matcher.group() + "' \n" +
                            "Range: " + matcher.start() + " to " + matcher.end());

                        count++;

                        temp2.append(matcher.group());
                    }

                    System.out.println("\n so I found a total of " + count + " URLS.");

                    blah = temp2.toString();

                    // Strip the leading <a href=" and split on the closing quotes.
                    blah = blah.replaceAll("(?i)<A HREF=\"", "");

                    Split = blah.split("\"");

                    out_rep = new FileWriter(new File("Rep.txt"));

                    // Write each extracted link on its own line.
                    for (int i = 0; i < Split.length; i++)
                        out_rep.write(Split[i] + "\n");
                }
                else
                {
                    throw new IllegalArgumentException("Your file does not exist!");
                }
            }
        }
        catch (IOException e)
        {
            System.err.println(e);
            e.printStackTrace();
        }
        finally
        {
            try
            {
                // Either stream may still be null if an exception came early.
                if (in != null) in.close();
                if (out_rep != null) out_rep.close();
            }
            catch (IOException ex)
            {
                ex.printStackTrace();
                System.err.println(ex);
            }
        }
    }
}

*****************************
******************************
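
The post mentions testing whether local file references exist on the hard drive. A minimal sketch of that check might look like the following. The names siteRoot and pageDir are hypothetical (they do not appear in the posted code), and the sketch assumes links starting with '/' are relative to a site root directory while all others are relative to the page's own directory. It does not decode %20 or other escapes.

```java
import java.io.File;

class LocalLinkCheckSketch {
    // siteRoot and pageDir are assumed inputs, not part of the posted program.
    static boolean existsLocally(File siteRoot, File pageDir, String href) {
        // Drop any "#fragment" or "?query" suffix; neither names a file.
        int cut = href.length();
        int hash = href.indexOf('#');
        int query = href.indexOf('?');
        if (hash >= 0) cut = Math.min(cut, hash);
        if (query >= 0) cut = Math.min(cut, query);
        String path = href.substring(0, cut);

        File target;
        if (path.startsWith("/")) {
            // Root-relative reference: resolve against the site root.
            target = new File(siteRoot, path.substring(1));
        } else {
            // Plain relative reference: resolve against the page's directory.
            target = new File(pageDir, path);
        }
        return target.exists();
    }

    public static void main(String[] args) {
        if (args.length != 3) {
            System.err.println("usage: LocalLinkCheckSketch <siteRoot> <pageDir> <href>");
            return;
        }
        System.out.println(existsLocally(new File(args[0]), new File(args[1]), args[2]));
    }
}
```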

Joan · Nov 3, 2005

Hi,

I am a newbie to java and html parsing although I have done lex/yacc
compilers before. I am trying to leech any http link from a given file
be it a proper 'http://www.....' or just a reference like
'/somedirectory/..../stuff' .... For now I plan on running quick tests
on the local file references to see if they exist or not on the hard
drive and so I got into Java and regular expressions ....

One thing has always bothered me: how do you tell where the filename
ends if it can contain blanks? For example:

http://abc.com/~joan/My Documents

Is the filename My or is it My Documents?

How do you address this?

chingooo3k · Nov 3, 2005

REGEX = "(<a href=)[^\\s]+(\")";

takes care of it ... the file name does not need to end in .shtml or
whatever, I am ending my regular expression looking for " ...

chingooo3k · Nov 3, 2005

To address the second issue of spaces, I am assuming the user will not
have spaces in their filenames... honestly, I think having whitespaces
is just wrong because it leads to broken links and other issues later
on .. so I am thinking it is safe to assume this restriction.

--offtopic, but how do you edit your posts ?

Daniel Dyer · Nov 3, 2005

To address the second issue of spaces, I am assuming the user will not
have spaces in their filenames... honestly, I think having whitespaces
is just wrong because it leads to broken links and other issues later
on .. so I am thinking it is safe to assume this restriction.

Spaces in URLs should be encoded as "%20" (search for URL encoding).
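
As a small illustration of the "%20" point, one way to get that encoding in Java is the multi-argument java.net.URI constructor, which quotes characters that are illegal in a path. This is only a sketch; the host and path are taken from the example earlier in the thread, and the class and method names are made up for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;

class EncodeSpacesSketch {
    // Builds a URI from its parts; the constructor percent-encodes the
    // space in the path, so the result is safe to embed in an href.
    static String encode(String scheme, String host, String path)
            throws URISyntaxException {
        return new URI(scheme, host, path, null).toASCIIString();
    }

    public static void main(String[] args) throws URISyntaxException {
        // prints http://abc.com/~joan/My%20Documents
        System.out.println(encode("http", "abc.com", "/~joan/My Documents"));
    }
}
```

With the space encoded, a pattern that stops at whitespace or at the closing quote sees the whole filename as one token.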

--offtopic, but how do you edit your posts ?

You can't. This is USENET son, we don't make mistakes.

As you are posting from Google Groups, you might want to read Andrew's
explanation of its relationship with USENET
(http://www.physci.org/codes/javafaq.jsp#usenet).

Dan.

chingooo3k · Nov 3, 2005

still no comments on the actual code and its optimization ... cmon
gurus, I know you are hiding in there

no editing huh ? hehe ok .. so it's like the remove command in Unix

Daniel Dyer · Nov 3, 2005

still no comments on the actual code and its optimization ... cmon
gurus, I know you are hiding in there

I haven't analysed your code in detail, but a few things I noticed:

Firstly, trying to match "<a href=\"" won't work in all cases. You can
have whitespace around the '=' character, you can have more than one
whitespace character between the 'a' and "href", and they don't have to be
spaces (they could be tabs or new lines).
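
A pattern that tolerates the whitespace variations described above could be sketched as follows. It is an assumption-laden illustration, not a full HTML tokenizer: it only handles double-quoted values, and the class and method names are invented here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefRegexSketch {
    // "<a", one or more whitespace characters (space, tab, newline),
    // optionally other attributes, then "href", optional whitespace around
    // '=', and a double-quoted value captured as group 1.
    private static final Pattern HREF = Pattern.compile(
            "<a\\s+(?:[^>]*?\\s)?href\\s*=\\s*\"([^\"]*)\"",
            Pattern.CASE_INSENSITIVE);

    static List<String> extract(CharSequence html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1)); // only the URL, without the tag text
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<A\n\tHREF = \"/docs/index.html\">x</A> "
                + "<a class=\"c\" href=\"http://example.com/\">y</a>";
        // prints [/docs/index.html, http://example.com/]
        System.out.println(extract(html));
    }
}
```

Capturing the value in a group also removes the need for the later replaceAll and split steps.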

Secondly, if you are using Java 5, use StringBuilder instead of
StringBuffer since your code is not multi-threaded and doesn't need to
synchronise. That said, performance gains will probably not be noticeable.

Thirdly, I would use the readLine method on BufferedReader, rather than
reading one character at a time.
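
The second and third suggestions together might look like this sketch, which reads line by line into a StringBuilder. The class and method names are hypothetical. Note that readLine drops the line terminator, so the sketch appends '\n' again to keep line boundaries as whitespace.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class ReadLinesSketch {
    static String readAll(Reader source) throws IOException {
        StringBuilder text = new StringBuilder();
        // try-with-resources closes the reader even if readLine throws.
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                text.append(line).append('\n');
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.print(readAll(new StringReader("<a href=\"x\">\r\nlink</a>")));
    }
}
```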

Finally, why not change your code so that it accepts a URL rather than a
file system path (you can use a file:// URL if you need to access local
files)? That way you can point your program at a page on the web to
extract links (or even recursively extract the links from files that are
linked to from the first file).
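
A rough sketch of reading from a URL instead of a File follows. The same code then accepts file:// and http:// addresses. The UTF-8 charset is an assumption made for the example (a real page may declare another encoding), and the class and method names are invented.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;

class UrlSourceSketch {
    static String fetch(URL page) throws IOException {
        StringBuilder text = new StringBuilder();
        // openStream works for any scheme the JDK has a handler for,
        // including file: and http:. UTF-8 is assumed here.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(page.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                text.append(line).append('\n');
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: UrlSourceSketch <url>");
            return;
        }
        System.out.print(fetch(URI.create(args[0]).toURL()));
    }
}
```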

Dan.

chingooo3k · Nov 3, 2005

Thanks Dan.

<quote>
Firstly, trying to match "<a href=\"" won't work in all cases. You can
have whitespace around the '=' character, you can have more than one
whitespace character between the 'a' and "href" and they don't have to
be spaces (they could be tabs or new lines).
</quote>

Yup.. this is just a prototype I guess so I didn't think about that but
it's pretty easy to include optional whitespaces using regular
expressions....

<quote>
Secondly, if you are using Java 5, use StringBuilder instead of
StringBuffer since your code is not multi-threaded and doesn't need to
synchronise. That said, performance gains will probably not be
noticeable.
</quote>

hmm cool ... well I just started this so there is no threading, but
what I have in mind (assuming I can at least get this to work) is a
moderately complex GUI which will probably need threading.. I still
don't know for sure.

<quote>
Thirdly, I would use the readLine method on BufferedReader, rather than
reading one character at a time.
</quote>

ah nice .. I was afraid to use readLine because I read somewhere it has
some bugs/issues... but ok I'll use this instead

<quote>
Finally, why not change your code so that it accepts a URL rather than
a file system path (you can use a file:// URL if you need to access local
files)? That way you can point your program at a page on the web to
extract links (or even recursively extract the links files that are
linked to from the first file).
</quote>

hmmm eventually yes but right now I'm just trying to get it up and
doing something useful for me. Currently, I am still unsure how fast
this is and how accurate so that's why I was afraid of major blunders
in my approach.

Or maybe there are already kick ass parsers that can leech html links
?? the ones I googled all have some crap dependency issues and some
don't even ship with the proper source files (HTMLSchema in tagsoup
html parser) .... need direction !!

Thanks for all the help again!

Abhijat Vatsyayan · Nov 3, 2005

Why not use ANTLR and an HTML grammar?
The JDK comes with an HTML parser. Why not use that?
What about an SGML parser (I have never used one, and never seen one in Java)?
Does anyone know of a good (comprehensive) SGML grammar file for ANTLR?
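
For reference, the parser shipped with the JDK lives under javax.swing.text.html. A minimal sketch of pulling href values out with it could look like this; the class name and the extract method are invented for the example, and the sketch makes no claim about how well this parser copes with badly broken HTML.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class SwingLinkSketch {
    static List<String> extract(Reader html) throws IOException {
        final List<String> links = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // Called once per opening tag; keep only <a> tags with an href.
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };
        // The third argument tells the parser to ignore charset declarations.
        new ParserDelegator().parse(html, callback, true);
        return links;
    }

    public static void main(String[] args) throws IOException {
        String page = "<html><body><a href=\"a.html\">A</a>"
                + "<A HREF=\"/b/c\">B</A></body></html>";
        System.out.println(extract(new StringReader(page)));
    }
}
```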


chingooo3k · Nov 7, 2005

antlr and an html grammar

The problem is, nobody is following HTML guidelines properly... so I
need a parser that won't choke on the nasty HTML we see a lot these
days.... not fun to do all by myself. I could have the proper grammar;
it's just that the majority are not following it, is what I'm saying.

The JDK comes with the dirty old shitty java something browser parser
engine in javax.swing.... they used to ship a few years back. It's not
been keeping up and has other issues too *cough* multi-threading
*cough*.... I did get it to work and it was pretty easy but I don't
trust it enough.

SGML parser.. dunno, sounds cool though

nope, not going into antlr stuff man .. no way .. if anyone else runs
into the same issue as myself, get yourself a proper parser like HTML
parser on SourceForge. It is HUGE. I had to dig through I lost track
of how many lines of code.... but at the end of the day, it gets the
job done so I'm happy

ah yes, the luxury of not having to go through the crap of nifty little
things like setting up proper base URLs for online / offline pages.. I
love you guys (not in the wrong way) ...

Thanks for all the help.

Rogan Dawes · Nov 8, 2005

Another alternative to HTMLParser which you mention in another email, is
TagSoup.

TagSoup is an HTML parser that generates SAX events based on what it reads.

It might also be useful for you . . .
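
To illustrate what "generates SAX events" means for link extraction, here is a sketch of a SAX handler that collects href values. To keep it runnable without extra libraries it uses the JDK's own XML reader as a stand-in, which only accepts well-formed markup. The assumption is that with TagSoup on the classpath its XMLReader implementation would be substituted where marked, so the same handler could be fed messy HTML. The class names here are invented.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class SaxLinkSketch {
    // Receives one startElement event per opening tag.
    static class LinkHandler extends DefaultHandler {
        final List<String> links = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            // localName is empty when the reader is not namespace-aware.
            String name = localName.isEmpty() ? qName : localName;
            if ("a".equalsIgnoreCase(name)) {
                String href = atts.getValue("href");
                if (href != null) {
                    links.add(href);
                }
            }
        }
    }

    static List<String> extract(String wellFormedMarkup) throws Exception {
        // Stand-in reader from the JDK; a TagSoup XMLReader would go here.
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        LinkHandler handler = new LinkHandler();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(wellFormedMarkup)));
        return handler.links;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extract(
                "<html><body><a href=\"a.html\">x</a></body></html>"));
    }
}
```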

Regards,

Rogan
