tidy to convert google scholar page in xml

রুদ্র ব্যাণার্জী · Oct 8, 2012

Dear friends,
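I am trying to convert a Google Scholar page to XML.
First, I fetch the page with this script:
#!/usr/bin/python
# Python 2 script: fetch a Google Scholar results page and save it as sch.html.
from HTMLParser import HTMLParser  # imported, but not used below
import urllib2

url = "http://scholar.google.co.uk/scholar?q=albert+einstein+1905&btnG=&hl=en&as_sdt=0,5&as_sdtp="
# Send a browser-like User-Agent header with the request.
request = urllib2.Request(url, headers={"User-Agent": "Mozilla/5.0 Cheater/1.0"})
response = urllib2.urlopen(request)
# The with-block closes the file, so the page is fully written to disk.
with open('sch.html', 'w') as f:
    f.write(response.read())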
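
For readers on Python 3, where urllib2 no longer exists, the same fetch can be sketched with urllib.request. This is only a rough port under assumptions: the names SCHOLAR_URL, build_request and save_page are made up for illustration, and the URL and User-Agent string are copied from the script above.

```python
import urllib.request

# URL and User-Agent are taken from the original script.
SCHOLAR_URL = ("http://scholar.google.co.uk/scholar"
               "?q=albert+einstein+1905&btnG=&hl=en&as_sdt=0,5&as_sdtp=")
USER_AGENT = "Mozilla/5.0 Cheater/1.0"


def build_request(url=SCHOLAR_URL):
    """Return a Request carrying a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def save_page(path="sch.html", url=SCHOLAR_URL):
    """Download the page and write the raw bytes to path; return the byte count."""
    with urllib.request.urlopen(build_request(url)) as response:
        data = response.read()
    # Binary mode: response.read() returns bytes in Python 3.
    with open(path, "wb") as f:
        f.write(data)
    return len(data)

# save_page()  # uncomment to perform the real network request
```

Writing in binary mode sidesteps any guess about the page encoding; the bytes on disk are exactly what the server sent.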

This gives an sch.html that starts with:
<!doctype html><html><head><meta http-equiv="Content-Type"
content="text/html;charset=UTF-8"><meta http-equiv="X-UA-Compatible"
content="IE=Edge"><meta name="viewport"
content="width=device-width,initial-scale=1,minimum-scale=1,maximum-scale=2"><meta name="format-detection" content="telephone=no">

If I use tidy to convert this HTML page to XML, I get:
$ tidy <sch.html |more
line 3 column 40 - Warning: <style> isn't allowed in <div> elements
line 3 column 23 - Info: <div> previously mentioned
/**************************
AND MANY MORE WARNINGS
**************************/
Info: Document content looks like HTML 4.01 Transitional
Info: No system identifier in emitted doctype
131 warnings, 0 errors were found!

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta name="generator" content=
"HTML Tidy for Linux (vers 25 March 2009), see www.w3.org">
<meta http-equiv="Content-Type" content=
"text/html; charset=us-ascii">
<meta http-equiv="X-UA-Compatible" content="IE=Edge">
<meta name="viewport" content=
"width=device-width,initial-scale=1,minimum-scale=1,maximum-scale=2">
<meta name="format-detection" content="telephone=no">
<title>albert einstein+1905 - Google Scholar</title>

<script type="text/javascript">
var gs_ts=Number(new Date());
</script>
<style type="text/css">
html,body,form,table,div,h1,h2,h3,h4,h5,h6,img,ol,ul,li,button{margin:0;padding:
0;border:0;}table{border-collapse:collapse;border-width:0;empty-cells:show;}#gs_
top{position:relative;min-width:980px;_width:expression(document.documentElement
.clientWidth<982?"980px":"auto");}.gs_el_ph #gs_top,.gs_el_ta
#gs_top{min-width:
300px;_width:expression(document.documentElement.clientWidth<302?"300px":"auto")
;}body,td{font-size:13px;font-family:Arial,sans-serif;line-height:1.24}body{back

So the output is still HTML, not XML. How can I convert the page to
XML?

Dave Angel · Oct 8, 2012

What makes you think it's possible? (Possible automatically, that is.)
There is no mapping from HTML to XML, so a program that tries this is
just guessing in many places. Further, many, if not most, web pages are
not even valid HTML, just good enough to work with most browsers. Now,
if the page were valid XHTML, then it would already be valid XML.
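
To make that distinction concrete, here is a small sketch using only the standard library's xml.etree.ElementTree. The two sample pages and the helper name is_well_formed_xml are invented for illustration; they are not taken from the Google Scholar output.

```python
import xml.etree.ElementTree as ET

# A well-formed XHTML page: every element is closed, attributes are quoted.
XHTML_PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>albert einstein 1905</title></head>
  <body><p>First result<br/>Second result</p></body>
</html>"""

# Browser-tolerated HTML: <meta>, <br> and <p> are never closed.
SLOPPY_HTML = """<html><head><meta charset="UTF-8"><title>albert einstein 1905</title></head>
<body><p>First result<br>Second result</body></html>"""


def is_well_formed_xml(text):
    """Return True if text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed_xml(XHTML_PAGE))   # True
print(is_well_formed_xml(SLOPPY_HTML))  # False
```

An XML parser stops at the first mismatched tag, which is why a page that browsers render happily can still be rejected outright.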

Do you have a license from Google? If not, better read their terms of
service. While they probably won't pursue the occasional page scraping,
you should consider the costs before spending too much effort. Besides,
they have APIs for most of their services, and there might be one
that'll be much easier to use than trying to scrape the html.

Do you have a plan for what to do when the page layout changes?

You should look into Beautiful Soup; it's designed for parsing sloppily
written HTML. I've no direct experience with it, but it gets
recommended a lot.
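
Beautiful Soup is a third-party package, so as a rough illustration of the same idea, the sketch below uses only the standard library's html.parser to read tag soup and rebuild it as an XML tree. It is not Beautiful Soup and is far less robust. The class SoupToXml, the function html_to_xml and the wrapper element name "document" are all hypothetical names chosen for this example, and the rules for closing tags are exactly the kind of guessing described above.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never take a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "param", "source", "track", "wbr"}


class SoupToXml(HTMLParser):
    """Build an ElementTree from sloppy HTML, guessing where tags close."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        # Hypothetical wrapper so the result always has a single root.
        self.root = ET.Element("document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        # Valueless attributes (e.g. <input disabled>) become empty strings.
        attrib = {name: (value or "") for name, value in attrs}
        elem = ET.SubElement(self.stack[-1], tag, attrib)
        if tag not in VOID_TAGS:
            self.stack.append(elem)

    def handle_endtag(self, tag):
        # Close the nearest open element of this name, implicitly closing
        # anything opened inside it; stray end tags are ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):
            # Text following a child element belongs in that child's tail.
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def html_to_xml(html_text):
    """Return a well-formed XML string built from possibly sloppy HTML."""
    parser = SoupToXml()
    parser.feed(html_text)
    parser.close()
    return ET.tostring(parser.root, encoding="unicode")


sloppy = ('<html><head><meta charset="UTF-8"><title>Einstein 1905</title>'
          '</head><body><p>First<br>Second</body></html>')
xml_text = html_to_xml(sloppy)
print(xml_text)
```

The output can be fed back into ET.fromstring without error, but the tree reflects this sketch's guesses about the markup, not anything the page author intended, and it will need revisiting whenever the page layout changes.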
