Extracting titles from web pages, but sometimes getting garbled text

Frank Potter

There are ten web pages I want to deal with.
from http://www.af.shejis.com/new_lw/html/125926.shtml
to http://www.af.shejis.com/new_lw/html/125936.shtml

Each of them declares the Chinese charset "gb2312", and Firefox
displays all of them correctly, as readable Chinese.

My job is to fetch every page, extract its HTML title, and
display the title in a Linux shell terminal.

My problem is that for some pages I get a human-readable title (in
Chinese), but for other pages I get garbled text. Since every page
has the same charset, I don't know why I can't get every title in the
same way.

Here's my Python code, get_title.py:

Code:
#!/usr/bin/python
import urllib2
from BeautifulSoup import BeautifulSoup

min_page=125926
max_page=125936

def make_page_url(page_index):
    return ur"".join([ur"http://www.af.shejis.com/new_lw/html/",
                      str(page_index), ur".shtml"])

def get_page_title(page_index):
    url=make_page_url(page_index)
    print "now getting: ", url
    user_agent='Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headers={'User-Agent':user_agent}
    req=urllib2.Request(url,None,headers)
    response=urllib2.urlopen(req)
    #print response.info()
    page=response.read()

    #extract title by beautiful soup
    soup=BeautifulSoup(page)
    full_title=str(soup.html.head.title.string)

    #title is in the format of "title --title"
    #use this code to delete the "--" and the duplicate title
    title=full_title[full_title.rfind('-')+1::]

    return title

for i in xrange(min_page,max_page):
    print get_page_title(i)

Will somebody please help me out? Thanks in advance.
 
Paul McGuire

This pyparsing solution seems to extract what you were looking for,
but I don't know if this will render to Chinese or not.

-- Paul

from pyparsing import makeHTMLTags,SkipTo
import urllib

titleStart,titleEnd = makeHTMLTags("title")
scanExpr = (titleStart + SkipTo("- -",include=True) +
            SkipTo(titleEnd).setResultsName("titleChars") + titleEnd)

def extractTitle(htmlSource):
    titleSource = scanExpr.searchString(htmlSource, maxMatches=1)[0]
    return titleSource.titleChars


for urlIndex in range(125926,125936+1):
    url = "http://www.af.shejis.com/new_lw/html/%d.shtml" % urlIndex
    pg = urllib.urlopen(url)
    html = pg.read()
    pg.close()
    print url,':',extractTitle(html)


Gives:

http://www.af.shejis.com/new_lw/html/125926.shtml : GSM±¾µØÍø×éÍø·½Ê½
http://www.af.shejis.com/new_lw/html/125927.shtml : GSM±¾µØÍø×éÍø·½Ê½³õ̽
http://www.af.shejis.com/new_lw/html/125928.shtml : GSMµÄÊý¾ÝÒµÎñ
http://www.af.shejis.com/new_lw/html/125929.shtml : GSMµÄÊý¾ÝÒµÎñºÍ³ÐÔØÄÜÁ¦
http://www.af.shejis.com/new_lw/html/125930.shtml : GSMµÄÍøÂçÑݽø-´ÓGSMµ½GPRSµ½3G £¨¸½Í¼£©
http://www.af.shejis.com/new_lw/html/125931.shtml : GSM¶ÌÏûÏ¢ÒµÎñÔÚË®Çé×Ô¶¯²â±¨ÏµÍ³ÖеÄÓ¦ÓìØ
http://www.af.shejis.com/new_lw/html/125932.shtml : £Ç£Ó£Í½»»»ÏµÍ³µÄÍøÂçÓÅ»¯
http://www.af.shejis.com/new_lw/html/125933.shtml : GSMÇл»µô»°µÄ·ÖÎö¼°½â¾ö°ì·¨
http://www.af.shejis.com/new_lw/html/125934.shtml : GSMÊÖ»ú²¦½ÐÊл°Ä£¿é¾ÖÓû§¹ÊÕϵÄÆÊÎö
http://www.af.shejis.com/new_lw/html/125935.shtml : GSMÊÖ»úµ½WCDMAÖն˵ÄÑݱä
http://www.af.shejis.com/new_lw/html/125936.shtml : GSMÊÖ»úµÄάÐÞ·½·¨
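
The titles above look like GB2312 bytes displayed as Latin-1 characters, which is what tends to happen when raw page bytes are printed without being decoded first. Below is a minimal sketch of that effect and how to reverse it. It assumes Python 3, and the sample string and both helper names are made up for illustration; nothing here is fetched from the site.

```python
# -*- coding: utf-8 -*-
# Sketch: how GB2312 bytes turn into Latin-1 "mojibake" and back.
# The sample text and helper names are illustrative assumptions.

def to_mojibake(text, real_charset="gb2312", wrong_charset="latin-1"):
    """Encode text in its real charset, then misread the bytes."""
    return text.encode(real_charset).decode(wrong_charset)

def from_mojibake(garbled, real_charset="gb2312", wrong_charset="latin-1"):
    """Undo the misreading: recover the bytes, then decode them properly."""
    return garbled.encode(wrong_charset).decode(real_charset)

original = u"GSM手机的维修方法"
garbled = to_mojibake(original)      # ASCII part survives, the rest is garbled
restored = from_mojibake(garbled)

print(restored == original)          # the round trip is lossless
```

The round trip only works if no bytes were lost along the way, so output that has already been re-wrapped or truncated may not decode cleanly.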
 
Paul McGuire

After looking at the pyparsing results, I think I see the problem with
your original code. You are selecting only the characters after the
rightmost "-" character, but you really want to select everything to
the right of "- -". In some of the titles, the encoded Chinese
includes a "-" character, so you are chopping off everything before
that.

Try changing your code to:
title=full_title.split("- -")[1]

I think then your original program will work.

-- Paul
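
To make the difference between the two slicing approaches concrete, here is a small sketch run on a made-up title. The sample string and both function names are hypothetical, and `partition` is used in place of `split(...)[1]` only so that a title without the separator does not raise an `IndexError`.

```python
# Sketch comparing the two slicing strategies on a made-up title.
# The sample string is hypothetical; the real page titles may differ.

def after_last_hyphen(full_title):
    """Original approach: keep what follows the rightmost '-'."""
    return full_title[full_title.rfind('-') + 1:]

def after_separator(full_title, sep="- -"):
    """Suggested approach: keep what follows the first '- -' separator."""
    head, found, tail = full_title.partition(sep)
    # Fall back to the whole title if the separator is missing.
    return tail.strip() if found else full_title.strip()

sample = "Site - - GSM evolution-from GSM to 3G"

print(after_last_hyphen(sample))   # loses everything before the inner '-'
print(after_separator(sample))     # keeps the whole part after '- -'
```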
 
Frank Potter

Thank you, I tried again and figured it out.
The problem was in how I used Beautiful Soup. I worked with it a year ago,
also on Chinese HTML pages, and had no errors. I read the
old code and found the difference: convert the page to unicode before
feeding it to Beautiful Soup, and then everything is OK.
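
The changed line was not posted, but with the script above it would presumably mean decoding `page` (for example `page.decode("gb2312")`) before calling `BeautifulSoup(page)`. The sketch below shows the same decode-then-parse idea under some assumptions: it is written for Python 3, it swaps in the standard library's `HTMLParser` for Beautiful Soup so it runs without third-party packages, and the sample HTML and the names `TitleParser` and `extract_title` are made up for illustration.

```python
# -*- coding: utf-8 -*-
# Sketch: decode the raw bytes to unicode first, then parse the title.
# Uses the standard library's HTMLParser as a stand-in for Beautiful Soup.
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collects the text found inside the <title> element."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)

def extract_title(raw_bytes, charset="gb2312"):
    # Decode before parsing; undecodable bytes become U+FFFD, not an error.
    text = raw_bytes.decode(charset, "replace")
    parser = TitleParser()
    parser.feed(text)
    parser.close()
    return "".join(parser.parts).strip()

# Hypothetical page content, encoded the way the server would send it.
sample_bytes = (u"<html><head><title>GSM手机的维修方法</title></head>"
                u"<body></body></html>").encode("gb2312")

print(extract_title(sample_bytes) == u"GSM手机的维修方法")
```

Pages labelled "gb2312" sometimes contain characters from the larger GBK or GB18030 sets, so passing `charset="gb18030"` may be worth trying if some titles still come out with replacement characters.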
 
