dayzman
Hi,
I'm looking for a program that extracts the structure of unstructured
HTML documents. It should make good guesses about the different font
styles used to represent headings. For example, some documents use
<font size = 24> for headings and others use <h1>; both should produce
the same output structure. The output can be XML or another format, and
manual intervention should stay minimal. Does anyone know of such a
program (preferably open-source)?
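To make the goal concrete, here is a rough sketch of the kind of normalization I mean, using only Python's standard html.parser module. The names HeadingExtractor and headings_to_xml, the <document>/<heading> output format, and the font-size threshold of 18 are all made up for illustration. A real tool would need far better heuristics (relative sizes, bold text, nesting, CSS).

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingExtractor(HTMLParser):
    """Collect text from <h1>..<h6> and from <font> tags with a large size."""

    def __init__(self, min_font_size=18):
        super().__init__()
        # Assumed cutoff: a <font> at or above this size counts as a heading.
        self.min_font_size = min_font_size
        self.headings = []      # heading texts, in document order
        self._open_tag = None   # tag name of the heading currently being read
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if self._open_tag is not None:
            return  # already inside a heading; ignore nested tags
        if tag in HEADING_TAGS:
            self._open_tag = tag
        elif tag == "font":
            size = (dict(attrs).get("size") or "").strip()
            if size.isdigit() and int(size) >= self.min_font_size:
                self._open_tag = tag

    def handle_data(self, data):
        if self._open_tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._open_tag:
            # Collapse whitespace runs so both markups yield identical text.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.headings.append(text)
            self._open_tag = None
            self._buffer = []


def headings_to_xml(html):
    """Return a flat XML string listing the headings detected in html."""
    parser = HeadingExtractor()
    parser.feed(html)
    parser.close()
    items = "".join(f"<heading>{escape(t)}</heading>" for t in parser.headings)
    return f"<document>{items}</document>"


print(headings_to_xml("<h1>Intro</h1><p>text</p>"))
print(headings_to_xml("<font size = 24>Intro</font><p>text</p>"))
```

Both calls should produce the same output, which is the behaviour I'm after. This sketch ignores heading levels and does not handle a large <font> nested inside another one.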
Cheers,
Michael