Hans Almåsbakk
Hi,
I have a problem which I believe has been seen before:
Finding the correct pattern to use, in order to split a line correctly,
using the split function in the re module.
I'm new to regexps, and they aren't always easy for a newbie to comprehend.
The lines I want to split are like this:
(The following is one line, even if the news client splits it up.)
"abc ",,"-",,,,,"Doe, John D.",2004,"A long text, which may contain many
characters. Dots, commas, and if I'm real unlucky: maybe even
"-characters","-",32454,,
These lines are in a CSV file exported from Excel.
Comma is obviously the separator, but as you can see a comma might
occur between " ", and if that is the case, it should not be (a
separator).
Then I pondered a way of using " chars in the splitting as well,
something like "?,"? (an optional " before and after the comma), which of
course also goes wrong. A " may or may not occur around the splitting comma,
but that pattern would also match single commas inside quoted text; see the
example.
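To show what I mean, here is a minimal sketch of that attempt. The sample
line is a shortened, made-up version of the real one (an assumption for
illustration, not actual data from the file):

```python
import re

# Shortened, hypothetical sample modelled on the line above.
line = '"abc ",,"-","Doe, John D.",2004'

# Split on a comma with an optional " on either side.
fields = re.split(r'"?,"?', line)
print(fields)

# The comma inside the quoted name is treated as a separator too,
# so the name comes back as two pieces: 'Doe' and ' John D.'
# That gives 6 fields instead of the 5 I want.
```

So the pattern cannot tell a comma inside the quotes from a comma between
fields.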
Any pointers will be greatly appreciated. Maybe I'm attacking this problem
the wrong way right from the start? (Not that I can see another way
myself.)
Regards