Using tidy to convert a Google Scholar page to XML

  • Thread starter রুদ্র ব্যাণার্জী

রুদ্র ব্যাণার্জী

Dear friends,
I am trying to convert a Google Scholar page to XML.
First, I fetch the page using this script:
#!/usr/bin/python
# Python 2 script: fetch a Google Scholar results page and save it as sch.html.
import urllib2

url = "http://scholar.google.co.uk/scholar?q=albert+einstein+1905&btnG=&hl=en&as_sdt=0,5&as_sdtp="
request = urllib2.Request(url, headers={"User-Agent": "Mozilla/5.0 Cheater/1.0"})
response = urllib2.urlopen(request)
# Write the raw response body to disk; the with-block closes the file.
with open('sch.html', 'wb') as f:
    f.write(response.read())

This gives sch.html, which starts with:
<!doctype html><html><head><meta http-equiv="Content-Type"
content="text/html;charset=UTF-8"><meta http-equiv="X-UA-Compatible"
content="IE=Edge"><meta name="viewport"
content="width=device-width,initial-scale=1,minimum-scale=1,maximum-scale=2"><meta name="format-detection" content="telephone=no">

If I try to use tidy to convert this HTML page to XML, I get:
$ tidy <sch.html |more
line 3 column 40 - Warning: <style> isn't allowed in <div> elements
line 3 column 23 - Info: <div> previously mentioned
/**************************
AND MANY MORE WARNINGS
**************************/
Info: Document content looks like HTML 4.01 Transitional
Info: No system identifier in emitted doctype
131 warnings, 0 errors were found!

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta name="generator" content=
"HTML Tidy for Linux (vers 25 March 2009), see www.w3.org">
<meta http-equiv="Content-Type" content=
"text/html; charset=us-ascii">
<meta http-equiv="X-UA-Compatible" content="IE=Edge">
<meta name="viewport" content=
"width=device-width,initial-scale=1,minimum-scale=1,maximum-scale=2">
<meta name="format-detection" content="telephone=no">
<title>albert einstein+1905 - Google Scholar</title>

<script type="text/javascript">
var gs_ts=Number(new Date());
</script>
<style type="text/css">
html,body,form,table,div,h1,h2,h3,h4,h5,h6,img,ol,ul,li,button{margin:0;padding:
0;border:0;}table{border-collapse:collapse;border-width:0;empty-cells:show;}#gs_
top{position:relative;min-width:980px;_width:expression(document.documentElement
.clientWidth<982?"980px":"auto");}.gs_el_ph #gs_top,.gs_el_ta
#gs_top{min-width:
300px;_width:expression(document.documentElement.clientWidth<302?"300px":"auto")
;}body,td{font-size:13px;font-family:Arial,sans-serif;line-height:1.24}body{back


So, this is still HTML, not XML. How can I convert the page to
XML?
 
Dave Angel

What makes you think it's possible? (Possible automatically, that is.)
There is no well-defined mapping from arbitrary HTML to XML, so a program
that tries this is just guessing in many places. Further, many, if not
most, web pages are not even valid HTML, just good enough to work with
most browsers. Now, if the page were valid XHTML, then it would already
be valid XML.
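
To make the HTML versus XHTML point concrete, here is a minimal sketch of a well-formedness check. It assumes Python 3 (the script above targets Python 2) and uses only the standard library. The helper name `is_well_formed_xml` and the two sample strings are made up for illustration, modelled on the `<meta>` tags in sch.html.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(text):
    """Return True if text parses as XML, False otherwise (hypothetical helper)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# HTML as browsers accept it: <meta> is never closed, so an XML parser rejects it.
html_snippet = ('<html><head><meta name="format-detection" '
                'content="telephone=no"></head><body></body></html>')

# The same markup written XHTML-style: every element is explicitly closed.
xhtml_snippet = ('<html><head><meta name="format-detection" '
                 'content="telephone=no"/></head><body></body></html>')

print(is_well_formed_xml(html_snippet))
print(is_well_formed_xml(xhtml_snippet))
```

The first call reports False and the second True. The only difference between the two strings is the closing slash on `<meta>`, which is the kind of thing a converter has to guess at for every sloppy construct in a real page.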

Do you have a license from Google? If not, better read their terms of
service. While they probably won't pursue the occasional page scraping,
you should consider the costs before spending too much effort. Besides,
they have APIs for most of their services, and there might be one
that'll be much easier to use than trying to scrape the HTML.

Do you have a plan for what to do when the page layout changes?

You should look into Beautiful Soup; it's designed for parsing sloppily
written HTML. I've no direct experience with it, but it gets
recommended a lot.
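
As a rough illustration of what a tolerant parser has to do, here is a sketch that uses only the Python 3 standard library rather than Beautiful Soup itself. It feeds sloppy HTML through `html.parser` and rebuilds it as an ElementTree, which can then be serialized as well-formed XML. The class name `SloppyHTMLToTree`, the function `html_to_xml`, and the wrapping `document` root element are all assumptions made for this example. It does not apply HTML's implied end-tag rules (an unclosed `<p>` simply nests the next one), so treat it as a stand-in for the idea, not a replacement for a real library.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that HTML never closes; they must not stay open in the tree.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class SloppyHTMLToTree(HTMLParser):
    """Hypothetical helper: build an ElementTree from tolerant HTML parsing."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        # Artificial root so stray top-level content still has a parent.
        self.root = ET.Element("document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        # Valueless attributes (e.g. <input disabled>) get an empty value.
        attrib = {name: (value or "") for name, value in attrs}
        element = ET.SubElement(self.stack[-1], tag, attrib)
        if tag not in VOID_TAGS:
            self.stack.append(element)

    def handle_endtag(self, tag):
        # Close up to the nearest matching open element; ignore stray end tags.
        for index in range(len(self.stack) - 1, 0, -1):
            if self.stack[index].tag == tag:
                del self.stack[index:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):
            # Text that follows a child element is stored as that child's tail.
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def html_to_xml(html_text):
    """Return html_text re-serialized as a well-formed XML string."""
    builder = SloppyHTMLToTree()
    builder.feed(html_text)
    builder.close()
    return ET.tostring(builder.root, encoding="unicode")


sloppy = ('<html><head><meta charset="UTF-8"><title>t</title></head>'
          '<body><p>one<p>two<br>three</body></html>')
xml_text = html_to_xml(sloppy)
ET.fromstring(xml_text)  # parses without error, so the output is well-formed
print(xml_text)
```

Because the serializer escapes `<` in text, stylesheet content such as `clientWidth<982` comes out as `clientWidth&lt;982`, which is what an XML consumer needs. Comments and the doctype are dropped in this sketch. A library such as Beautiful Soup handles far more edge cases and would be the better choice for real pages.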
 
