Reading *.json from URL - json.loads() versus urllib.urlopen.readlines()


Bryan Britten

Hey, everyone!

I'm very new to Python and have only been using it for a couple of days, but have some experience in programming (albeit mostly statistical programming in SAS or R) so I'm hoping someone can answer this question in a technical way, but without using an abundant amount of jargon.

The issue I'm having is that I'm trying to pull information from a website to practice Python with, but I'm having trouble getting the data in a timely fashion. If I use the following code:

<code>
import json
import urllib

urlStr = "https://stream.twitter.com/1/statuses/sample.json"

twtrDict = [json.loads(line) for line in urllib.urlopen(urlStr)]
</code>

I get a memory issue. I'm running 32-bit Python 2.7 with 4 gigs of RAM if that helps at all.

If I use the following code:

<code>
import urllib

urlStr = "https://stream.twitter.com/1/statuses/sample.json"

fileHandle = urllib.urlopen(urlStr)

twtrText = fileHandle.readlines()
</code>

It takes hours (upwards of 6 or 7, if not more) to finish computing the last command.

With that being said, my question is whether there is a more efficient manner to do this. I'm worried that if it's taking this long to process the .readlines() command, trying to work with the data is going to be a computational nightmare.

Thanks in advance for any insights or advice!
 

Roy Smith

Bryan Britten said:
If I use the following code:

<code>
import urllib

urlStr = "https://stream.twitter.com/1/statuses/sample.json"

fileHandle = urllib.urlopen(urlStr)

twtrText = fileHandle.readlines()
</code>

It takes hours (upwards of 6 or 7, if not more) to finish computing the last
command.

I'm not surprised! readlines() reads in the ENTIRE file in one gulp. That's a lot of tweets!

Bryan Britten said:
With that being said, my question is whether there is a more efficient manner to do this.

In general, when reading a large file, you want to iterate over lines of
the file and process each one. Something like:

<code>
for line in urllib.urlopen(urlStr):
    twtrDict = json.loads(line)
</code>

You still need to download and process all the data, but at least you
don't need to store it in memory all at once. There is an assumption
here that there's exactly one json object per line. If that's not the
case, things might get a little more complicated.
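Fleshed out, a minimal sketch might look like this (the "text" key comes from Twitter's payload, the blank keep-alive lines are a detail of the stream, and the print is just a stand-in for whatever processing you want to do):

<code>
import json
import urllib

urlStr = "https://stream.twitter.com/1/statuses/sample.json"

for line in urllib.urlopen(urlStr):
    line = line.strip()
    if not line:                 # the stream sends blank keep-alive lines
        continue
    tweet = json.loads(line)     # only one parsed tweet in memory at a time
    print(tweet.get("text"))     # stand-in for real processing
</code>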
 

Bryan Britten

Try to not sigh audibly as I ask what I'm sure are two asinine questions.

1) How is this approach different from twtrDict = [json.loads(line) for line in urllib.urlopen(urlStr)]?

2) How do I tell how many JSON objects are on each line?
 

Denis McMahon

Bryan Britten said:
Try to not sigh audibly as I ask what I'm sure are two asinine questions.

1) How is this approach different from twtrDict = [json.loads(line) for line in urllib.urlopen(urlStr)]?

2) How do I tell how many JSON objects are on each line?

Your code at (1) creates a single list of all the json objects.

The code you replied to loaded each object, assumed you did something with it, and then overwrote it with the next one.

As for (2) - either inspection, or errors from the json parser.
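To make the contrast concrete (a sketch; the processing step is whatever you want to do with each tweet):

<code>
import json
import urllib

urlStr = "https://stream.twitter.com/1/statuses/sample.json"

# (1) parses every line up front and keeps every result alive at once:
twtrDict = [json.loads(line) for line in urllib.urlopen(urlStr)]

# The loop instead holds one parsed object at a time; each iteration
# rebinds twtrDict, so the previous object can be garbage-collected:
for line in urllib.urlopen(urlStr):
    twtrDict = json.loads(line)
    # ... process twtrDict here, before the next line replaces it
</code>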
 

Fábio Santos

Bryan Britten said:
Try to not sigh audibly as I ask what I'm sure are two asinine questions.

1) How is this approach different from twtrDict = [json.loads(line) for line in urllib.urlopen(urlStr)]?

The suggested approach made use of generators. Just because you can iterate over something, that doesn't mean it is all in memory ;)

Check out the difference between range() and xrange() in Python 2.
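For instance:

<code>
# Python 2: range() builds the entire list in memory up front;
# xrange() is a lazy object that produces each value on demand.
eager = range(10 ** 7)     # ~10 million ints allocated immediately
lazy = xrange(10 ** 7)     # tiny object, values created as you iterate

for i in lazy:
    if i > 3:
        break              # we never paid for the values we didn't use
</code>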
 

Dave Angel

Bryan Britten said:
Hey, everyone!

I'm very new to Python and have only been using it for a couple of days, but have some experience in programming (albeit mostly statistical programming in SAS or R) so I'm hoping someone can answer this question in a technical way, but without using an abundant amount of jargon.

The issue I'm having is that I'm trying to pull information from a website to practice Python with, but I'm having trouble getting the data in a timely fashion. If I use the following code:

<code>
import json
import urllib

urlStr = "https://stream.twitter.com/1/statuses/sample.json"

twtrDict = [json.loads(line) for line in urllib.urlopen(urlStr)]
</code>

I get a memory issue. I'm running 32-bit Python 2.7 with 4 gigs of RAM if that helps at all.

Which OS?

The first question I'd ask is how big this file is. I can't tell, since
it needs a user name & password to actually get the file. But it's not
unusual to need at least double that space in memory, and in Windoze
you're limited to two gig max, regardless of how big your hardware might be.

If you separately fetch the file, then you can experiment with it,
including cutting it down to a dozen lines, and see if you can deal with
that much.

How could you fetch it? With wget, with a browser (and saveAs), with a
simple loop which uses read(4096) repeatedly and writes each block to a
local file. Don't forget to use 'wb', as you don't know yet what line
endings it might use.
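Something like this, roughly (the local filename is arbitrary, and since this particular stream never ends, you'd want to interrupt it or add a stop condition once you have enough data):

<code>
import urllib

urlStr = "https://stream.twitter.com/1/statuses/sample.json"

src = urllib.urlopen(urlStr)
with open("sample.json", "wb") as dst:  # 'wb': keep line endings untouched
    while True:
        block = src.read(4096)          # one 4 KB chunk at a time
        if not block:                   # empty string means end of data
            break
        dst.write(block)
src.close()
</code>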

Once you have an idea what the data looks like, you can answer such
questions as whether it's json at all, whether the lines each contain a
single json record, or what.

For all we know, the file might be a few terabytes in size.
 

Dennis Lee Bieber

Dave Angel said:
...it's not unusual to need at least double that space in memory, and in Windoze you're limited to two gig max, regardless of how big your hardware might be.

If the boot config is set for "server mode", WinXP can give 3GB to a user process.
 

Bryan Britten

Dave Angel said:
Which OS?

I'm operating on Windows 7.

Dave Angel said:
The first question I'd ask is how big this file is. I can't tell, since it needs a user name & password to actually get the file.

If you have Twitter, you can just use your log-in information to access the file.

Dave Angel said:
But it's not unusual to need at least double that space in memory, and in Windoze you're limited to two gig max, regardless of how big your hardware might be.

If you separately fetch the file, then you can experiment with it, including cutting it down to a dozen lines, and see if you can deal with that much.

How could you fetch it? With wget, with a browser (and saveAs), with a simple loop which uses read(4096) repeatedly and writes each block to a local file. Don't forget to use 'wb', as you don't know yet what line endings it might use.

I'm not familiar with using read(4096), I'll have to look into that. When I tried to just save the file, my computer just sat in limbo for some time and didn't seem to want to process the command.

Dave Angel said:
Once you have an idea what the data looks like, you can answer such questions as whether it's json at all, whether the lines each contain a single json record, or what.

Based on my *extremely* limited knowledge of JSON, that's definitely the type of file this is. Here is a snippet of what is seen when you log in:

{"created_at":"Tue May 28 03:09:23 +0000 2013","id":339216806461972481,"id_str":"339216806461972481","text":"RT @aleon_11: Sigo creyendo que las noches lluviosas me acercan mucho m\u00e1s a ti!","source":"\u003ca href=\"http:\/\/blackberry.com\/twitter\" rel=\"nofollow\"\u003eTwitter for BlackBerry\u00ae\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":310910123,"id_str":"310910123","name":"\u2661","screen_name":"LaMarielita_","location":"","url":null,"description":"MERCADOLOGA & PUBLICISTA EN PROCESO, AMO A MI DIOS & MI FAMILIA\u2665 ME ENCANTA REIRME , MOLESTAR & HABLAR :D BFF, pancho, ale & china :) LY\u2661","protected":false,"followers_count":506,"friends_count":606,"listed_count":1,"created_at":"Sat Jun 04 15:24:19 +0000 2011","favourites_count":207,"utc_offset":-25200,"time_zone":"Mountain Time (US & Canada)","geo_enabled":false,"verified":false,"statuses_count":17241,"lang":"es","contributors_enabled":false,"is_translator":false,"profile_background_color":"FF6699","profile_background_image_url":"http:\/\/a0.twimg.com\/images\/themes\/theme11\/bg.gif","profile_background_image_url_https":"https:\/\/si0.twimg.com\/images\/themes\/theme11\/bg.gif","profile_background_tile":true,"profile_image_url":"http:\/\/a0.twimg.com\/profile_images\/3720425493\/13a48910e56ca34edeea07ff04075c77_normal.jpeg","profile_image_url_https":"https:\/\/si0.twimg.com\/profile_images\/3720425493\/13a48910e56ca34edeea07ff04075c77_normal.jpeg","profile_link_color":"B40B43","profile_sidebar_border_color":"CC3366","profile_sidebar_fill_color":"E5507E","profile_text_color":"362720","profile_use_background_image":true,"default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweeted_status":{"created_at":"Tue May 28 02:57:40 +0000 2013","id":339213856922537984,"id_str":"339213856922537984","text":"Sigo creyendo que las noches lluviosas me acercan mucho m\u00e1s a ti!","source":"web","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":105252134,"id_str":"105252134","name":"Alejandra Le\u00f3n","screen_name":"aleon_11","location":"Guatemala","url":null,"description":"La vida se disfruta m\u00e1s, cuando no se le pone tanta importancia.","protected":false,"followers_count":143,"friends_count":251,"listed_count":0,"created_at":"Fri Jan 15 20:49:38 +0000 2010","favourites_count":83,"utc_offset":-28800,"time_zone":"Pacific Time (US & 
Canada)","geo_enabled":false,"verified":false,"statuses_count":1863,"lang":"es","contributors_enabled":false,"is_translator":false,"profile_background_color":"F8F2FC","profile_background_image_url":"http:\/\/a0.twimg.com\/profile_background_images\/811443451\/81abf2f37ee3e37deda396befa7fb557.jpeg","profile_background_image_url_https":"https:\/\/si0.twimg.com\/profile_background_images\/811443451\/81abf2f37ee3e37deda396befa7fb557.jpeg","profile_background_tile":true,"profile_image_url":"http:\/\/a0.twimg.com\/profile_images\/3578979563\/e973196904e25af5d960f2971616eb61_normal.jpeg","profile_image_url_https":"https:\/\/si0.twimg.com\/profile_images\/3578979563\/e973196904e25af5d960f2971616eb61_normal.jpeg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/105252134\/1364957374","profile_link_color":"F01A1A","profile_sidebar_border_color":"000000","profile_sidebar_fill_color":"7AC3EE","profile_text_color":"3D1957","profile_use_background_image":true,"default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweet_count":2,"favorite_count":0,"entities":{"hashtags":[],"symbols":[],"urls":[],"user_mentions":[]},"favorited":false,"retweeted":false,"lang":"es"},"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"symbols":[],"urls":[],"user_mentions":[{"screen_name":"aleon_11","name":"Alejandra Le\u00f3n","id":105252134,"id_str":"105252134","indices":[3,12]}]},"favorited":false,"retweeted":false,"filter_level":"low"}
 

Fábio Santos

Bryan Britten said:
I'm not familiar with using read(4096), I'll have to look into that. When I tried to just save the file, my computer just sat in limbo for some time and didn't seem to want to process the command.

That's just file.read with an integer argument. You can read a file by chunks by repeatedly calling that function until you get the empty string.

Bryan Britten said:
Based on my *extremely* limited knowledge of JSON, that's definitely the type of file this is. Here is a snippet of what is seen when you log in:
....

That's json. It's pretty big, but not big enough to stall a slow computer for more than half a second.

-

I've looked for documentation on that method on Twitter.

It seems that it's part of the twitter streaming api.

https://dev.twitter.com/docs/streaming-apis

What this means is that the requests aren't supposed to end. They are supposed to be read gradually, using the lines to split the response into meaningful chunks. That's why you can't read the data all at once, and why your browser never gets around to downloading it. Both urlopen and your browser block while waiting for the request to end.

Here's more info on streaming requests on their docs:

https://dev.twitter.com/docs/streaming-apis/processing

For streaming requests in Python, I would point you to the requests library, but I am not sure it handles streaming requests.
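If it does, it would look something like this (a sketch: stream=True and Response.iter_lines() are the relevant pieces in recent requests versions, and the ('user', 'pass') tuple is a placeholder; Twitter's streaming API is moving to OAuth rather than basic auth):

<code>
import json
import requests

urlStr = "https://stream.twitter.com/1/statuses/sample.json"

# stream=True asks requests not to wait for the (endless) body to finish.
# ('user', 'pass') is a placeholder for real credentials.
resp = requests.get(urlStr, auth=("user", "pass"), stream=True)

for line in resp.iter_lines():
    if not line:                 # skip keep-alive newlines
        continue
    tweet = json.loads(line)
    print(tweet.get("text"))     # one tweet at a time, as it arrives
</code>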
 

Bryan Britten

Thanks to everyone for the help and insight. I think for now I'll just back away from this file and go back to something much easier to practice with.
 
