Buildfile: build-html.xml
version-init:
[mkdir] Created dir: C:\Documents and Settings\Anupam Jain\Desktop\nekohtml-0.9.5\bin\html\src\org\cyberneko\html
version:
[echo] Generating bin/html/src/org/cyberneko/html/Version.java
[echo] Generating bin/html/src/MANIFEST_html
compile:
[javac] Compiling 26 source files to C:\Documents and Settings\Anupam Jain\Desktop\nekohtml-0.9.5\bin\html
[javac] C:\Documents and Settings\Anupam Jain\Desktop\nekohtml-0.9.5\src\html\org\cyberneko\html\HTMLScanner.java:89: org.cyberneko.html.HTMLScanner is not abstract and does not override abstract method getXMLVersion() in org.apache.xerces.xni.XMLLocator
[javac] public class HTMLScanner
[javac] ^
[javac] Note: Some input files use or override a deprecated API.
[javac] Note: Recompile with -Xlint:deprecation for details.
[javac] Note: Some input files use unchecked or unsafe operations.
[javac] Note: Recompile with -Xlint:unchecked for details.
[javac] 1 error
BUILD FAILED
C:\Documents and Settings\Anupam Jain\Desktop\nekohtml-0.9.5\build-html.xml:51: Compile failed; see the compiler error output for details.
Total time: 16 seconds
So basically the error is: org.cyberneko.html.HTMLScanner is not
abstract and does not override the abstract method getXMLVersion() in
org.apache.xerces.xni.XMLLocator
- Anupam
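A possible reading of this error, offered as an assumption rather than a diagnosis: the Xerces jar on the compile classpath may declare getXMLVersion() in XMLLocator, while the NekoHTML 0.9.5 sources were written against a release whose XMLLocator did not have it. The sketch below reproduces the mechanism with hypothetical stand-in types (OldLocator, NewLocator and HtmlScanner are invented for illustration and are not Xerces or NekoHTML classes).

```java
public class LocatorMismatch {

    /** Hypothetical stand-in for the interface as an older library release declared it. */
    interface OldLocator {
        int getLineNumber();
    }

    /** Hypothetical stand-in for the newer interface, which gained one abstract method. */
    interface NewLocator extends OldLocator {
        String getXMLVersion();
    }

    /** A concrete class must implement every abstract method of the interface it compiles against. */
    static class HtmlScanner implements NewLocator {
        private final int line = 1;

        @Override
        public int getLineNumber() {
            return line;
        }

        // Deleting this method makes javac report that HtmlScanner
        // "is not abstract and does not override abstract method getXMLVersion()".
        @Override
        public String getXMLVersion() {
            return "1.0";
        }
    }

    public static void main(String[] args) {
        NewLocator locator = new HtmlScanner();
        System.out.println(locator.getLineNumber() + " " + locator.getXMLVersion());
    }
}
```

If that assumption holds, the two usual ways out are to compile against the Xerces release the sources expect, or to use sources that implement the extra method.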
Philippe said:
Hi,
After 2 weeks of searching and trial and error, I finally decided to
turn to the group to find a solution to my problem (something I should
have done much earlier).
This is the deal:
On a JSP page, I want to grab a URL, parse and change its HTML, and
send the result back to the JSP page. I take the URL from the user in a
textbox (not the browser location box).
In the Java class file (that I have imported in the JSP), I first tried
the Xerces parser, until I realised it only supports well-formed XML.
So I switched to OpenXML, which supports HTML. But it took about 10
minutes to parse a page, and even then it threw an Out of Memory
exception, even after I increased the buffer size of Tomcat to a good
amount and even when parsing a page as simple as www.google.com.
But if I don't use the DOCUMENT_HTML option in OpenXML and just treat
the HTML as a normal XML file, it does parse it properly (maybe it
skips the unterminated tags). However, there's no way to return the XML
to the browser, because doc.getDocumentElement().toString() returns
'HTML: 1 nodes'
So then I switched to JTidy and tried to convert HTML to XHTML. But it
seems the Document type returned by JTidy doesn't support most standard
document methods (including converting XML to a string using
doc.getDocumentElement().toString()), leaving me back where I started.
Can anybody suggest a good way to approach my problem?
All I want to do is grab a URL's HTML, add some tags to it (a couple
of appendChild() calls) and then send the HTML back to the user to be
displayed (interpreted) by the browser.
I'll be really thankful for your help!
Anupam
[/QUOTE]
hi,
I did exactly the same thing with NekoHTML : parsing the HTML to XML,
then selecting some nodes with XPath, appending/replacing some nodes,
and transforming or serializing it back to HTML
http://people.apache.org/~andyc/neko/doc/html/index.html
(a nice tool)
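The parse / select / append / serialize sequence described above can be sketched with the JDK's own XML APIs. This is only an illustration under stated assumptions: the input here is already well-formed, so the JDK parser stands in for NekoHTML (whose API is not shown), and the class and method names (DomRoundTrip, parse, appendNote, serialize) are hypothetical. The serialize step uses an identity Transformer, which sidesteps the toString() problem mentioned in the question.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class DomRoundTrip {

    /** Parses well-formed markup into a DOM tree (stand-in for the HTML parser step). */
    static Document parse(String markup) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse markup", e);
        }
    }

    /** Appends <p class="note">text</p> to the first node matching the XPath expression. */
    static void appendNote(Document doc, String xpath, String text) {
        Node target;
        try {
            target = (Node) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, doc, XPathConstants.NODE);
        } catch (Exception e) {
            throw new IllegalStateException("Bad XPath: " + xpath, e);
        }
        if (target == null) {
            throw new IllegalArgumentException("No node matches " + xpath);
        }
        Element note = doc.createElement("p");
        note.setAttribute("class", "note");
        note.setTextContent(text);
        target.appendChild(note);
    }

    /** Serializes the DOM with an identity transform; method is "xml" or "html". */
    static String serialize(Document doc, String method) {
        try {
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.METHOD, method);
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize document", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<html><head><title>t</title></head>"
                + "<body><h1>Hello</h1></body></html>");
        appendNote(doc, "/html/body", "Added by the server");
        System.out.println(serialize(doc, "html"));
    }
}
```

With a real HTML parser in place of parse(), the XPath would probably need adjusting: the Active Tags sample later in this post addresses elements in upper case and in the XHTML namespace (xhtml:DIV, xhtml:TABLE).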
--------------------------------------------
Have you considered a full XML solution?
With Active Tags I used a few tags/actions to achieve this. For this
purpose you could run RefleX on top of Tomcat:
http://reflex.gforge.inria.fr/
(a nice tool too)
RefleX comes with a servlet that can run Active Tags.
Your code would then look like this:
<web:service
xmlns:web="http://www.inria.fr/xml/active-tags/web"
xmlns:io="http://www.inria.fr/xml/active-tags/io"
xmlns:xcl="http://www.inria.fr/xml/active-tags/xcl"
xmlns:xhtml="http://www.w3.org/1999/xhtml"><!--understand it as an HTTP service-->
<!--things that are performed when the server starts-->
<web:init>
<!--share a stylesheet with all HTTP requests-->
<xcl:parse-stylesheet name="ralyx.xsl"
source="web:///WEB-INF/xslt/ralyx.xsl" scope="shared"/>
</web:init>
<!--map the URL-path with a regexp-->
<web:mapping
match="^/(\d{4})/Fiches/([\p{Lower}\d\-_+]+)/\2\.(html|xml)$"
method="GET" mime-type="">
<!--use an HTML parser because the documents are not
well-formed ; <xcl:parse-html> uses NekoHTML-->
<xcl:parse-html name="fiche"
source="http://www.inria.fr/recherche/equipes/{ $web:match/node()[ 2 ]
}.en.html"/>
<xcl:set name="corps" value="{
$fiche//xhtml:DIV[@class='corps'] }"/>
<xcl:set name="about" value="{ $corps/xhtml:TABLE//xhtml:TD[2] }"/>
<xcl:replace referent="{ $about }">
<td width="200px" align="right" class="projet">
<div class="menu_box">{ $about/node() }</div>
</td>
</xcl:replace>
<!--rebuild a new document-->
<xcl:document name="projet">
<projet xml="xml" title="{ string(
$corps/preceding-sibling::xhtml:H1 ) }">
{ $corps }
</projet>
</xcl:document>
<!--relativizing URLs in <A href> and <IMG src>-->
<xcl:for-each name="link" select="{ $projet//xhtml:A[@href] }">
<xcl:attribute referent="{ $link }" name="href" value="{
io:resolve-uri('http://www.inria.fr/recherche/equipes/', string(
$link/@href ) ) }"/>
</xcl:for-each>
<xcl:for-each name="link" select="{ $projet//xhtml:IMG[@src] }">
<xcl:attribute referent="{ $link }" name="src" value="{
io:resolve-uri('http://www.inria.fr/recherche/equipes/', string(
$link/@src ) ) }"/>
</xcl:for-each>
<!--selecting the stylesheet-->
<xcl:set xcl:if="{ $web:match/node()[ 3 ] != 'xml' }"
name="xslt" value="{ $ralyx.xsl }"/>
<!--back to the browser-->
<xcl:transform
output="{ value( $web:response/@web:output ) }"
source="{ $projet }"
stylesheet="{ $xslt }"
/>
</web:mapping>
</web:service>
The result is a new HTML document that contains an updated part of
another HTML document (this mapping acts almost like a proxy); it is
used in a real application deployed at INRIA.
To use it, simply declare the ReflexServlet in Tomcat:
<web-app>
<display-name>RefleX application</display-name>
<description>My RefleX application</description>
<servlet>
<servlet-name>ReflexServlet</servlet-name>
<display-name>RefleX servlet</display-name>
<description>Runs an Active Sheet</description>
<servlet-class>org.inria.reflex.ReflexServlet</servlet-class>
<init-param>
<param-name>activeSheetPath</param-name>
<param-value>web:///WEB-INF/active-sheet.xml</param-value>
</init-param>
<load-on-startup>1</load-on-startup>
</servlet>
<servlet-mapping><!--custom mappings-->
<servlet-name>default</servlet-name>
<url-pattern>*.gif</url-pattern>
</servlet-mapping>
<servlet-mapping><!--RefleX mapping-->
<servlet-name>ReflexServlet</servlet-name>
<url-pattern>/</url-pattern>
</servlet-mapping>
</web-app>
When downloading RefleX, check the dependencies and make sure that
NekoHTML 0.9.5 is in the full distribution. For the moment, the latest
available version of RefleX (0.1.2) uses NekoHTML 0.9.4, which has a
bug regarding namespace URIs. This issue is fixed in NekoHTML 0.9.5,
which is available online and will be in RefleX 0.1.3 (coming soon).
Enjoy :)
--
Cordialement,
///
(. .)
--------ooO--(_)--Ooo--------
| Philippe Poulard |[/QUOTE]