I wonder if I would be able to collect data from such a page using Python

Comment Holder

Hi,
I am totally new to Python. I have noticed many videos showing how to collect data with Python, but before I start learning I would like to know whether Python can accomplish my goal.

Here is the example of the target page:
http://and.medianewsonline.com/hello.html
In this example, there are 10 articles.

What exactly I need to do is the following:
1- Collect the article title, date, source, and contents.
2- Export the final results to Excel or a database client. That is, I need all of the items from step 1 in one row per article, each saved in a separate column. For example:

Title1 Date1 Source1 Contents1
Title2 Date2 Source2 Contents2

I appreciate any advice regarding my case.

Thanks & Regards//
 
Joel Goldstick

I'm guessing that you are not only new to Python, but that you haven't
much experience in writing computer programs at all. So you need to
learn the basics first. There is a good tutorial on the Python site,
along with lots of links to other resources.

Then do this:

1. Write code to access the page you require. The Requests module can
help with that.
2. Write code to select the data you want. The BeautifulSoup module
is excellent for this.
3. Write code to save your data in comma-separated value (CSV) format.
4. Import it into Excel or wherever you need it.

Now, go off and write the code. When you get stuck, copy and paste
the portion of the code that is giving you problems, along with the
traceback. You can also get help on the python-tutor mailing list.
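
Steps 3 and 4 above can be sketched with the standard library's csv module. This is only a minimal sketch, not a full solution: the `articles` list is made-up placeholder data standing in for whatever steps 1 and 2 would extract, and the column names and the `articles_to_csv` helper are assumptions based on the layout requested in the original post.

```python
import csv
import io

# Hypothetical placeholder rows; real values would come from the scraped page.
articles = [
    {"title": "Title1", "date": "Date1", "source": "Source1", "contents": "Contents1"},
    {"title": "Title2", "date": "Date2", "source": "Source2", "contents": "Contents2"},
]

# Assumed column order, matching the example rows in the original post.
FIELDS = ["title", "date", "source", "contents"]


def articles_to_csv(rows, out):
    """Write one article per row, one field per column, to a file-like object."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# An in-memory buffer keeps the sketch self-contained. For a real file, use
# open("articles.csv", "w", newline="", encoding="utf-8") instead.
buffer = io.StringIO()
articles_to_csv(articles, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

A file written this way can then be opened in Excel or imported by most database clients.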
 
Comment Holder

Many thanks Joel,

You are right to some extent. I come from a finance background, but I am very familiar with what could be referred to as non-native languages such as Matlab and VBA. In fact, I have developed a couple of complete programs.

I have asked this question, because I am a little worried about the structure of this particular page, as there are no specific defined classes.

I know how powerful Python is, but I wonder if it could do the job with this particular page.

Again, many thanks Joel, I appreciate your guidance.
All Best//
 
Joel Goldstick

Your biggest hurdle will be to get proficient with Python. Give
yourself a weekend with a good tutorial. You won't be very skilled,
but you will get the gist of things.

Also, google Beautiful Soup. You need the latest version. It's v4, I
think. They have a GREAT tutorial. Spend a few hours with it and you
will see your way to getting the data you want from your web pages.
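
As a rough illustration of where the Beautiful Soup tutorial leads, here is a minimal sketch that pulls fields out of markup that has no class attributes, relying only on tag names and nesting. The HTML snippet is invented for the example and is an assumption; the real page's structure may well differ, so the tag choices (`div`, `h2`, `i`, `b`, `p`) and the `extract_articles` helper would need adjusting.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Invented sample markup (an assumption, not the real page): no classes or ids.
SAMPLE_HTML = """
<html><body>
<div><h2>Title1</h2><i>Date1</i> <b>Source1</b><p>Contents1</p></div>
<div><h2>Title2</h2><i>Date2</i> <b>Source2</b><p>Contents2</p></div>
</body></html>
"""


def extract_articles(html):
    """Return one dict per article block, found by tag name alone."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for block in soup.find_all("div"):
        title = block.find("h2")
        if title is None:
            continue  # skip blocks that do not look like an article
        rows.append({
            "title": title.get_text(strip=True),
            "date": block.find("i").get_text(strip=True),
            "source": block.find("b").get_text(strip=True),
            "contents": block.find("p").get_text(strip=True),
        })
    return rows


articles = extract_articles(SAMPLE_HTML)
for article in articles:
    print(article)
```

The point of the sketch is only that missing classes are not a blocker: tag names and position within the page are often enough to locate each field.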

Since you gave a sample web page, I am guessing that you need to log
in to the site for the 'real data'. For that, you will need to
understand how logging in works over HTTP (for example, sessions and
cookies), which may be new to you. At any rate, study the Requests
module documentation. Python 2 comes with urllib and urllib2, which
cover the same ground (in Python 3 that functionality lives in the
urllib package, e.g. urllib.request), but Requests is a lot simpler
to understand.
 
Comment Holder

Dear Joel,

Many thanks for your help - I think I shall start this way and see how it goes. My concern was whether the task can be accomplished with Python, and from your posts, I guess it can - so I shall give it a try :).

Again, thanks a lot & all best//
 
Joel Goldstick

You're welcome. One thought popped into my mind. Since the site
seems to be from the Wall Street Journal, you may want to look into
whether they have an API for searching and retrieving articles. If
they do, this would be simpler and probably safer than parsing web
pages. From time to time, websites change their layout, which would
probably break your program. APIs, however, tend to be more stable.

good luck to you
 
Terry Reedy

CM: You still seem a bit doubtful. If you are wondering why no one else
has answered, it is because Joel has given you a really good answer that
cannot be beat without writing your code for you.

> You're welcome. One thought popped into my mind. Since the site
> seems to be from the Wall Street Journal, you may want to look into
> whether they have an api for searching and retrieving articles. If
> they do, this would be simpler and probably safer than parsing web
> pages. From time to time, websites change their layout, which would
> probably break your program. However APIs are more stable

Including this suggestion, which I did not think of.
 
Comment Holder

Dear Terry,

Many thanks for your comments. Actually, I was doubtful, because the target page doesn't have a neat structure. But after all of your contributions, I think the task can be achieved very well with Python.

Thanks again & all best//
 
Comment Holder

Dear Piet,

Many thanks for your assistance. It is much appreciated. I have just installed Python 3.3.2 and BeautifulSoup 4.3.1. I tried running the code, but ran into some syntax errors.

> I wonder how you would want that with multiparagraph contents.

I am looking to save all the paragraphs of an article in one field, so that the subsequent analysis becomes easier.
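
One way to keep every paragraph of an article in a single field is to join the paragraph strings before writing the row. The following is a minimal sketch under assumptions: the paragraphs are taken to be already extracted into a list of strings, and the `paragraphs` values, the `join_paragraphs` helper, and the newline separator are all hypothetical choices. The csv module quotes fields that contain newlines, so the joined text should still occupy one field.

```python
import csv
import io

# Hypothetical paragraphs of one article, as they might come out of a parser.
paragraphs = ["First paragraph.", "  Second paragraph.  ", "", "Third paragraph."]


def join_paragraphs(parts, separator="\n"):
    """Join non-empty paragraphs into one string for a single CSV field."""
    return separator.join(p.strip() for p in parts if p.strip())


contents = join_paragraphs(paragraphs)

# Write one row; the multi-line contents field is quoted automatically.
buffer = io.StringIO()
csv.writer(buffer).writerow(["Title1", "Date1", "Source1", contents])

# Read the row back to check that the article is still a single field.
row = next(csv.reader(io.StringIO(buffer.getvalue(), newline="")))
print(len(row))
print(row[3])
```

If embedded line breaks are inconvenient for the later analysis, passing `separator=" "` would flatten the article into one line instead.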

As I am new, I won't ask for assistance before I get some general idea about Python. I shall dedicate the weekend for this purpose, or at least Sunday. Once I am done, I will post my results back in here.

Thanks again & all best//
 
Chris Angelico

> As I am new, I won't ask for assistance before I get some general idea about Python. I shall dedicate the weekend for this purpose, or at least Sunday. Once I am done, I will post my results back in here.

Smart move :) I strongly recommend the inbuilt tutorial, if you
haven't seen it already:

http://docs.python.org/3/tutorial/

And you're using the current version, which is good. Saves the hassle
of figuring out what's different in an old version.

All the best!

ChrisA
 