Split string but ignore quotes

Scooter

I'm trying to reformat an Apache log file that was written with a
custom output format into W3C format using a Python script. The problem
is the field-to-field matching. My code splits each line on spaces,
which fails at the user agent because that field itself contains
spaces. The user agent is enclosed in double quotes, though. Is there a
way to split on a delimiter without splitting inside quoted text?

For example, a line might look like:

2009-09-29 12:00:00 - GET / "Mozilla/4.0 (compatible; MSIE 7.0;
Windows NT 6.0; GTB5; SLCC1; .NET CLR 2.0.50727; Media Center PC
5.0; .NET CLR 3.0.04506; .NET CLR 3.5.21022)" http://somehost.com 200
1923 1360 31715 -
 
Björn Lindqvist

2009/9/29 Scooter said:
[snip]

Try shlex:
['2009-09-29', '12:00:00', '-', 'GET', '/', 'Mozilla/4.0 (compatible;
MSIE 7.0; Windows NT 6.0; GTB5; SLCC1; .NET CLR 2.0.50727; Media
Center PC 5.0; .NET CLR 3.0.04506; .NET CLR 3.5.21022)',
'http://somehost.com', '200']
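
The call that produced the list above did not survive in the post. A minimal sketch, assuming it was shlex.split applied to the whole log line (the names line and fields are illustrative, not from the post). The list above stops at the status code, presumably because the input was cut there; the full sample line yields twelve fields.

```python
import shlex

# The sample log line from the question, joined onto one logical line.
line = ('2009-09-29 12:00:00 - GET / "Mozilla/4.0 (compatible; MSIE 7.0; '
        'Windows NT 6.0; GTB5; SLCC1; .NET CLR 2.0.50727; Media Center PC '
        '5.0; .NET CLR 3.0.04506; .NET CLR 3.5.21022)" http://somehost.com 200 '
        '1923 1360 31715 -')

# shlex.split follows POSIX shell quoting rules: whitespace separates
# fields, and a double-quoted run becomes one field with the quotes removed.
fields = shlex.split(line)

print(len(fields))   # number of fields found
print(fields[5])     # the user agent, without the surrounding quotes
```

Note that shlex.split raises ValueError on an unbalanced quote, so a malformed log line needs a try/except around the call.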
 
MRAB

Björn Lindqvist said:
[snip]

The regex solution is:
7.0; Windows NT 6.0; GTB5; SLCC1; .NET CLR 2.0.50727; Media Center PC
5.0; .NET CLR 3.0.04506; .NET CLR 3.5.21022)" http://somehost.com 200'
['2009-09-29', '12:00:00', '-', 'GET', '/', '"Mozilla/4.0 (compatible;
MSIE 7.0; Windows NT 6.0; GTB5; SLCC1; .NET CLR 2.0.50727; Media Center
PC 5.0; .NET CLR 3.0.04506; .NET CLR 3.5.21022)"',
'http://somehost.com', '200']
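
The pattern itself is missing from the post as it appears here. One pattern that gives output of the same shape, with the quotes kept on the user agent, is sketched below. It is an assumption, not necessarily the pattern originally posted, and the names FIELD, line and fields are illustrative.

```python
import re

# The sample log line from the question, joined onto one logical line.
line = ('2009-09-29 12:00:00 - GET / "Mozilla/4.0 (compatible; MSIE 7.0; '
        'Windows NT 6.0; GTB5; SLCC1; .NET CLR 2.0.50727; Media Center PC '
        '5.0; .NET CLR 3.0.04506; .NET CLR 3.5.21022)" http://somehost.com 200 '
        '1923 1360 31715 -')

# A field is either a double-quoted run (tried first, quotes kept)
# or a run of non-whitespace characters.
FIELD = re.compile(r'"[^"]*"|\S+')

fields = FIELD.findall(line)

# Drop the surrounding quotes if they are not wanted.
unquoted = [f.strip('"') for f in fields]
```

This sketch does not handle escaped quotes inside a quoted field, and a quote that starts in the middle of an unquoted token is not treated as opening a quoted field.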
 
Simon Forman

s = '''2009-09-29 12:00:00 - GET / "Mozilla/4.0 (compatible; MSIE 7.0;
Windows NT 6.0; GTB5; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0;
.NET CLR 3.0.04506; .NET CLR 3.5.21022)" http://somehost.com 200 1923
1360 31715 -'''

# Works only when the line has exactly one quoted field (two " characters);
# any other count makes the unpacking raise ValueError.
initial, user_agent, trailing = s.split('"')

# Then depending on what you want to do with them...
foo = initial.split() + [user_agent] + trailing.split()
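
If a line can contain more than one quoted field, the same split-on-quote idea can be stretched a little. This is a rough sketch rather than part of the code above, and split_keep_quoted is a made-up name: after splitting on the quote character, the odd-numbered pieces are the ones that were inside quotes.

```python
def split_keep_quoted(line):
    """Split on whitespace, keeping double-quoted runs as single fields."""
    fields = []
    for index, piece in enumerate(line.split('"')):
        if index % 2:
            # Odd-numbered pieces sat between a pair of quotes: keep whole.
            fields.append(piece)
        else:
            # Even-numbered pieces are outside quotes: split on whitespace.
            fields.extend(piece.split())
    return fields


example = 'GET / "Mozilla/4.0 (compatible; MSIE 7.0)" 200 "some referrer"'
result = split_keep_quoted(example)
```

It assumes the quotes are balanced and never escaped; an odd number of quote characters silently treats the tail of the line as quoted.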
 
BJ Swope

Would the csv module be appropriate?
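
For what it's worth, a minimal sketch of the csv idea, assuming a single space as the delimiter and the double quote as the quote character (the names line, reader and fields are illustrative):

```python
import csv

# The sample log line from the question, joined onto one logical line.
line = ('2009-09-29 12:00:00 - GET / "Mozilla/4.0 (compatible; MSIE 7.0; '
        'Windows NT 6.0; GTB5; SLCC1; .NET CLR 2.0.50727; Media Center PC '
        '5.0; .NET CLR 3.0.04506; .NET CLR 3.5.21022)" http://somehost.com 200 '
        '1923 1360 31715 -')

# csv.reader accepts any iterable of lines, so a one-element list works
# for a single line; a file object works the same way for a whole log.
reader = csv.reader([line], delimiter=' ', quotechar='"')
fields = next(reader)

print(fields[5])   # the user agent, with the surrounding quotes removed
```

One difference from str.split(): with a single-space delimiter, two spaces in a row produce an empty field instead of being collapsed.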

 
Processor-Dev1l

The best option is to use the shlex module, as Björn said.
This is quite a simple question, and you would surely have found the
answer yourself by searching the Python docs a little :)
 
