Frank Potter
There are ten web pages I want to deal with.
from http://www.af.shejis.com/new_lw/html/125926.shtml
to http://www.af.shejis.com/new_lw/html/125936.shtml
Each of them uses the Chinese charset "gb2312", and Firefox
displays all of them correctly, as readable Chinese.
My job is to fetch every page, extract its HTML title, and
display the title in a Linux shell terminal.
My problem is that for some pages I get a human-readable title (in
Chinese), but for other pages I get garbled characters. Since every page
has the same charset, I don't know why I can't get every title in the
same way.
Here's my Python code, get_title.py:
Code:
#!/usr/bin/python
import urllib2
from BeautifulSoup import BeautifulSoup

min_page=125926
max_page=125936

def make_page_url(page_index):
    return ur"".join([ur"http://www.af.shejis.com/new_lw/html/",str(page_index),ur".shtml"])

def get_page_title(page_index):
    url=make_page_url(page_index)
    print "now getting: ", url
    user_agent='Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headers={'User-Agent':user_agent}
    req=urllib2.Request(url,None,headers)
    response=urllib2.urlopen(req)
    #print response.info()
    page=response.read()
    #extract title by beautiful soup
    soup=BeautifulSoup(page)
    full_title=str(soup.html.head.title.string)
    #title is in the format of "title --title"
    #use this code to delete the "--" and the duplicate title
    title=full_title[full_title.rfind('-')+1::]
    return title

for i in xrange(min_page,max_page):
    print get_page_title(i)
Will somebody please help me out? Thanks in advance.
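For comparison, here is a minimal sketch of the same title extraction with the page bytes decoded explicitly instead of relying on automatic charset detection. It is not a confirmed fix. It assumes the garbled titles come from pages containing bytes outside strict GB2312, which is why it decodes with gb18030, a superset of gb2312. It is written for Python 3 with the standard library only; the helper name extract_title and the sample page bytes are hypothetical and stand in for the real downloads.

```python
import re


def extract_title(raw_bytes, encoding="gb18030"):
    """Decode raw page bytes explicitly, then return the cleaned <title> text."""
    # Assumption: the page is GB2312-compatible; gb18030 is a superset of it.
    # errors="replace" substitutes stray bytes instead of raising an exception.
    text = raw_bytes.decode(encoding, errors="replace")
    match = re.search(r"<title[^>]*>(.*?)</title>", text,
                      re.IGNORECASE | re.DOTALL)
    if match is None:
        return None
    full_title = match.group(1).strip()
    # The post says titles look like "title --title";
    # keep only the part after the last '-'.
    return full_title[full_title.rfind("-") + 1:].strip()


# Hypothetical sample page, encoded as the server is assumed to send it.
sample = "<html><head><title>中文标题 --中文标题</title></head></html>".encode("gb2312")
title = extract_title(sample)
```

Here title is a Unicode string, so it only needs encoding once, to the terminal's charset, when it is printed. Comparing the output for a page that currently displays correctly and one that does not would show whether the decoding step is where the two cases diverge.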