Michiel Overtoom
elca said:
http://news.search.naver.com/search.naver?sm=tab_hty&where=news&query=korea+times&x=0&y=0
That is a Korean portal site; I searched it with the keyword 'korea times',
and I want to save the scraped results to a text file named 'blogscrap_save.txt'.
Aha, now we're getting somewhere.
Getting and parsing that page is no problem, and doesn't need JavaScript
or Internet Explorer.
import urllib2
import BeautifulSoup

# Fetch the search-results page and parse it.
doc = urllib2.urlopen("http://news.search.naver.com/search.naver?sm=tab_hty&where=news&query=korea+times&x=0&y=0")
soup = BeautifulSoup.BeautifulSoup(doc)
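As an aside, the long query string in that URL can also be built programmatically instead of pasted by hand; a minimal sketch using only the standard library (the parameter names are taken from the URL above, and the compatibility import is just so the same snippet runs on Python 3, where the function moved):

```python
# urlencode lives in urllib on Python 2 and urllib.parse on Python 3.
try:
    from urllib import urlencode          # Python 2
except ImportError:
    from urllib.parse import urlencode    # Python 3

# Build the Naver search query string; urlencode handles the escaping
# (e.g. the space in "korea times" becomes "+").
params = [("sm", "tab_hty"), ("where", "news"),
          ("query", "korea times"), ("x", "0"), ("y", "0")]
query = urlencode(params)
url = "http://news.search.naver.com/search.naver?" + query
```

This makes it easy to search for other keywords later by changing one tuple.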
By analyzing the structure of that page you can see that the articles
are presented in an unordered list which has class "type01". The
interesting bit in each list item is encapsulated in a <dd> tag with
class "sh_news_passage". So, to parse the articles:
ul = soup.find("ul", "type01")
for li in ul.findAll("li"):
    dd = li.find("dd", "sh_news_passage")
    print dd.renderContents()
This example prints them, but you could also save them to a file (or a
database, whatever).
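For completeness, a minimal sketch of the saving step that was asked about: collect the passages in the loop instead of printing them, then write them to 'blogscrap_save.txt'. The `passages` list here is a stand-in for the strings you would gather from dd.renderContents():

```python
# Stand-in for the strings collected from dd.renderContents() in the loop above.
passages = ["first article passage", "second article passage"]

# Write one passage per line to the requested file.
with open("blogscrap_save.txt", "w") as f:
    for text in passages:
        f.write(text + "\n")
```

In the real script you would do `passages.append(dd.renderContents())` inside the loop before this block.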
Greetings,