Dear all,
I am seeing strange behavior in my code and, to be honest, I don't
know how to solve my issue because I am not familiar with encoding
issues.
Here is what I would like to do:
1 - Parse an HTML file
2 - Extract the part of this page that is XML
3 - Store this XML in a database
It seems simple, but I ran into an encoding issue.
Here is my code snippet to parse the web page:
URL url = new URL(getURLToUpdate());
URLConnection urlconn = url.openConnection();
Log.d("MGR", "open url");
Document doc = null;
try {
    // isolate the KML part
    String page = FormatUtility.slurp(urlconn.getInputStream());
    // index of KML start and stop
    int indexStartKML = page.indexOf(Constant.TAG_KML_START);
    int indexStopKML = page.indexOf(Constant.TAG_KML_STOP);
    String kml = page.substring(indexStartKML, indexStopKML + 6);
    // Remove the CDATA information
    kml = kml.replace("<![CDATA[", "");
    kml = kml.replace("]]>", "");
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    DocumentBuilder db = dbf.newDocumentBuilder();
    InputSource inStream = new InputSource();
    inStream.setCharacterStream(new StringReader(kml));
    doc = db.parse(inStream);
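To narrow down where the characters get damaged, it may help to check the parsing step in isolation. The sketch below is only an illustration: the class name KmlStringParse and the sample XML are assumptions, not part of my code. It parses a String through a StringReader, as in the snippet above. When an InputSource carries a character stream, the parser reads the characters as they are and disregards the encoding named in the XML declaration, so this step should not alter text that was decoded correctly beforehand.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, used only to test the parsing step on its own.
final class KmlStringParse {

    // Parses XML that is already held in a String. Because the InputSource
    // is built from a character stream, the encoding named in the XML
    // declaration is not used to decode anything.
    static Document parse(String xml) throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return db.parse(new InputSource(new StringReader(xml)));
    }
}
```

If accented characters survive this call when the input String is known to be good, the damage most likely happens earlier, when the bytes are turned into a String.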
Here is the slurp() method:
public static String slurp(InputStream in) throws IOException {
    StringBuffer out = new StringBuffer();
    byte[] b = new byte[4096];
    // read the stream chunk by chunk until it is exhausted
    for (int n; (n = in.read(b)) != -1;) {
        // convert the bytes just read to a String and append them
        out.append(new String(b, 0, n));
    }
    return out.toString();
}
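One thing that could be tried here is a variant of slurp() that names the charset explicitly, since new String(b, 0, n) decodes with the platform default charset. The following is a minimal sketch under that assumption; the class name CharsetSlurp and the charsetName parameter are hypothetical and not part of my FormatUtility class.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

// Hypothetical variant of slurp(): the caller states the charset.
final class CharsetSlurp {

    static String slurp(InputStream in, String charsetName) throws IOException {
        // The reader decodes bytes to chars with the given charset and also
        // copes with multi-byte characters that straddle two reads.
        Reader reader = new InputStreamReader(in, Charset.forName(charsetName));
        StringBuilder out = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        while ((n = reader.read(buf)) != -1) {
            out.append(buf, 0, n);
        }
        return out.toString();
    }
}
```

With the page in question, the call would look like CharsetSlurp.slurp(urlconn.getInputStream(), "ISO-8859-1"), assuming the charset announced by the server is the one actually used for the bytes.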
I tried to force the encoding, but with no success. I don't know where
to look now: when I load the page from the input stream, or when I
convert the stream into a String?
Any help or ideas would be highly appreciated!
Thanks for reading (this is for a freeware project ;-) )!
C.
PS: These are the response headers of the web page:
Date Tue, 29 Jul 2008 21:16:23 GMT
Server Apache
X-Powered-By PHP/5.1.4
Expires Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma no-cache
Keep-Alive timeout=15, max=99
Connection Keep-Alive
Transfer-Encoding chunked
Content-Type text/html; charset=ISO-8859-1
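Rather than hard-coding the charset, it could be read from the Content-Type header shown above. This is a rough sketch only: the class ContentTypeCharset and the method charsetOf are hypothetical names, and the parsing is deliberately simple (it just looks for a charset= parameter).

```java
import java.util.Locale;

// Hypothetical helper: pulls the charset parameter out of a Content-Type value.
final class ContentTypeCharset {

    // Returns the charset named in contentType, or fallback when the header
    // is missing or carries no charset parameter.
    static String charsetOf(String contentType, String fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                // strip optional surrounding double quotes
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? fallback : value;
            }
        }
        return fallback;
    }
}
```

The input would come from urlconn.getContentType(), for example charsetOf(urlconn.getContentType(), "ISO-8859-1"); the fallback value here is an assumption based on the header above.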
The web page is served with the ISO-8859-1 charset.
The XML header (once extracted) specifies UTF-8 as its encoding.