Sideswipe
I know this question has been asked before, and believe me I checked
the newsgroup and web extensively before asking, but I think my needs
are slightly different.
I need to parse either a CSV or a tab-delimited file, BUT I need to
keep the delimiting token -- I am parsing these files as generated
from Excel, and the user expects them to process EXACTLY as they appear
in the spreadsheet.
I am cross-posting this in the Perl and Java groups because my
implementation is in Java, but Perl users use regular expressions far
more frequently.
Here are the 3 different regular expressions I have found/created, but
none are correct. The only thing I can do reliably is get rid of all
the delimiters. I have to keep the delimiters because the
information I am accessing is column based (and thus fixed).
private final Pattern COLUMN_PATTERN = Pattern.compile("(\"[^\"]*\",,|[^,]+)"); // I think this is close
private final Pattern COLUMN_PATTERN = Pattern.compile("([^\",]*|\"([^\"]|\"\")+\")(,)");
private final Pattern COLUMN_PATTERN = Pattern.compile(",(?=(?:[^\\\"]*\\\"[^\\\"]*\\\")*(?![^\\\"]*\\\"))");
So, you have the following cases:
1) A continuous string, with or without spaces -> separated by a
single ',' (comma)
2) The string has a comma in it and is wrapped in "" -> it is followed
by a ",," double-comma token. So the string in "" is a token and the
double comma is also a token
3) Blank cells are just a single comma ,
That's my understanding of the cases. The logic should be IDENTICAL
for tab-delimited files, simply substituting the delimiter character.
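
To make the three cases concrete, here is a minimal sketch of one way to tokenize a line while keeping the delimiters. It is only an illustration under some assumptions: the class name DelimiterKeepingTokenizer and the method tokenize are hypothetical names, a doubled quote ("") inside a quoted cell is assumed to be an escaped quote, and every delimiter is emitted as its own token, so ",," comes back as two comma tokens rather than one double-comma token. Cells containing line breaks are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: splits one line into cell tokens AND delimiter tokens.
final class DelimiterKeepingTokenizer {

    /** Returns the cells and delimiters of a single line, in input order. */
    static List<String> tokenize(String line, char delimiter) {
        // Quote the delimiter so that any character is taken literally.
        String d = Pattern.quote(String.valueOf(delimiter));

        // Alternatives are tried left to right at each position:
        //   1. a quoted cell; "" inside the quotes is assumed to be an escaped quote
        //   2. a single delimiter, kept as its own token
        //   3. an unquoted cell: a run of characters up to the next delimiter
        Pattern token = Pattern.compile(
                "\"(?:[^\"]|\"\")*\""
                + "|" + d
                + "|(?:(?!" + d + ").)+");

        List<String> tokens = new ArrayList<>();
        Matcher m = token.matcher(line);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Sample line made up for this sketch: plain cell, quoted cell, blank cell, number.
        String line = "plain text,\"Smith, John\",,42";
        List<String> tokens = tokenize(line, ',');
        for (String t : tokens) {
            System.out.println("[" + t + "]");
        }
        // Nothing is dropped, so joining the tokens gives back the original line.
        System.out.println(String.join("", tokens).equals(line));
    }
}
```

For a tab-delimited file the same call should work with tokenize(line, '\t'). Since a blank cell shows up as two delimiter tokens in a row with nothing between them, the column position can be recovered by counting delimiter tokens.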