Can Python fix vcard files?

Dotan Cohen

KDE's Kontact PIM breaks quoted-printable vcard files because it
linebreaks in the middle of a word. Take this text for example:
NOTE;CHARSET=UTF-8;ENCODING=QUOTED-PRINTABLE:=D7=A9=D7=95=D7=A8=D7=94 =D7=A
 8=D7=90=D7=A9=D7=95=D7=A0=D7=94.\n=D7=94=D7=A9=D7=95=D7=A8=D7=94 =D7=94=D7=
 A9=D7=A0=D7=99=D7=94 =D7=9B=D7=\n

The whole thing should be on one line, and the spaces at the beginning
of each line shouldn't be there at all. I have a directory with 422
files corrupted like this.

Can Python go through a directory of files and replace each instance
of "newline-space" with nothing? The system is Ubuntu 8.04 with KDE if
it matters. Thanks.

--
Dotan Cohen

http://what-is-what.com
http://gibberish.co.il
א-ב-ג-ד-ה-ו-ז-ח-ט-י-ך-כ-ל-ם-מ-ן-נ-ס-ע-ף-פ-ץ-צ-ק-ר-ש-ת

ä-ö-ü-ß-Ä-Ö-Ü
 
Paul Boddie

KDE's Kontact PIM breaks quoted-printable vcard files because it
linebreaks in the middle of a word. Take this text for example:
NOTE;CHARSET=UTF-8;ENCODING=QUOTED-PRINTABLE:=D7=A9=D7=95=D7=A8=D7=94 =D7=A
 8=D7=90=D7=A9=D7=95=D7=A0=D7=94.\n=D7=94=D7=A9=D7=95=D7=A8=D7=94 =D7=94=D7=
 A9=D7=A0=D7=99=D7=94 =D7=9B=D7=\n

The whole thing should be on one line, and the spaces at the beginning
of each line shouldn't be there at all. I have a directory with 422
files corrupted like this.

Although I think it's "rude" to break quoted-printable characters in
the middle (as seen above), isn't it permitted by the specification to
wrap lines to a predetermined length? It's been a while since I looked
at the specification, but this is one of the things that
implementations have to be able to handle.

Can Python go through a directory of files and replace each instance
of "newline-space" with nothing? The system is Ubuntu 8.04 with KDE if
it matters. Thanks.

You should file a bug against Kontact: the KDE developers love fixing
bugs, especially in their old work. ;-)

Paul
 
Dotan Cohen

2008/10/14 Paul Boddie said:
You should file a bug against Kontact: the KDE developers love fixing
bugs, especially in their old work. ;-)

I had to reopen an old bug on this:
https://bugs.kde.org/show_bug.cgi?id=68350

I would really appreciate it if the knowledgeable folks here would
chime in on that bug. Thanks!

 
Paul Boddie

KDE's Kontact PIM breaks quoted-printable vcard files because it
linebreaks in the middle of a word. Take this text for example:
NOTE;CHARSET=UTF-8;ENCODING=QUOTED-PRINTABLE:=D7=A9=D7=95=D7=A8=D7=94 =D7=A
 8=D7=90=D7=A9=D7=95=D7=A0=D7=94.\n=D7=94=D7=A9=D7=95=D7=A8=D7=94 =D7=94=D7=
 A9=D7=A0=D7=99=D7=94 =D7=9B=D7=\n
[...]

Although I think it's "rude" to break quoted-printable characters in
the middle (as seen above), isn't it permitted by the specification to
wrap lines to a predetermined length? It's been a while since I looked
at the specification, but this is one of the things that
implementations have to be able to handle.

The vCard specification (RFC 2426 [1]) refers to RFC 2425 [2], which
says this in section 5.8.1:

A logical line MAY be continued on the next physical line anywhere
between two characters by inserting a CRLF immediately followed by a
single white space character (space, ASCII decimal 32, or horizontal
tab, ASCII decimal 9).

This is like the iCalendar specification (RFC 2445 [3]), section 4.1:

Lines of text SHOULD NOT be longer than 75 octets, excluding the line
break. Long content lines SHOULD be split into a multiple line
representations using a line "folding" technique. That is, a long
line can be split between any two characters by inserting a CRLF
immediately followed by a single linear white space character (i.e.,
SPACE, US-ASCII decimal 32 or HTAB, US-ASCII decimal 9).

I didn't find anything which forbids splitting quoted-printable
character values in these specifications.
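
For illustration, the unfolding rule quoted from RFC 2425 can be sketched
in a few lines of Python. This is a minimal sketch, not a full vCard
parser: the name unfold and the sample data are made up for the example,
and accepting a bare LF in place of CRLF is an assumption made to cope
with files saved with Unix line endings.

```python
import re

# A fold is a line break immediately followed by exactly one space or tab
# (RFC 2425, section 5.8.1). Accepting a bare LF is an assumption here.
_FOLD = re.compile(r"\r?\n[ \t]")


def unfold(text):
    """Join folded physical lines back into logical lines."""
    return _FOLD.sub("", text)


# Hypothetical sample: one NOTE folded twice, followed by an unfolded TEL.
folded = "NOTE:This is a long no\r\n te that was fol\r\n ded twice\r\nTEL:12345\r\n"
print(repr(unfold(folded)))
```

Only the single whitespace character after the line break is removed, so
any further spaces on the continuation line stay part of the value.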

Paul

[1] http://www.ietf.org/rfc/rfc2426.txt
[2] http://www.ietf.org/rfc/rfc2425.txt
[3] http://www.ietf.org/rfc/rfc2445.txt
 
Lawrence D'Oliveiro

The vCard specification (RFC 2426 [1]) refers to RFC 2425 [2], which
says this in section 5.8.1:

A logical line MAY be continued on the next physical line anywhere
between two characters by inserting a CRLF immediately followed by a
single white space character (space, ASCII decimal 32, or horizontal
tab, ASCII decimal 9).

I didn't find anything which forbids splitting quoted-printable
character values in these specifications.

What adds to the confusion is that quoted-printable has its own convention
for soft-wrapping long lines, using an equals sign followed by a newline.
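
To make the difference concrete, here is a small sketch of the
quoted-printable soft line break using the standard library's quopri
module. The sample bytes are made up for the example (they reuse the first
Hebrew word from the NOTE above); note that the break falls between
escapes and there is no leading space on the continuation line, unlike a
vCard fold.

```python
import quopri

# A quoted-printable soft line break is an "=" at the very end of a line.
# The decoder drops the "=" together with the newline that follows it.
soft_wrapped = b"=D7=A9=D7=95=\n=D7=A8=D7=94"

decoded = quopri.decodestring(soft_wrapped)
print(decoded)  # raw UTF-8 bytes of the decoded word
```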
 
Paul Boddie

What adds to the confusion is that quoted-printable has its own convention
for soft-wrapping long lines, using an equals sign followed by a newline.

I think the necessary approach involves interpreting data in the vCard
"content model" before interpreting data in the quoted-printable
"content model". That is, follow the vCard rules around line
formatting to first reconstruct encoded content, then do what you
would normally do with that encoded content. It's a bit like parsing
XML and then attempting to read text from the document's parsed
representation, rather than just matching a particular region with a
regular expression and finding that it yields "&lt;" and "&gt;"
instead of the expected "<" and ">".
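
The two-step order described above might look roughly like this in
Python. It is only a sketch under a few assumptions: the helper names
unfold and decode_qp_value are invented for the example, the property is
split at its first colon without any real parameter parsing, and the
sample is a shortened version of the NOTE from the original post.

```python
import quopri
import re


def unfold(text):
    # vCard layer: drop each line break that is followed by one space or tab.
    return re.sub(r"\r?\n[ \t]", "", text)


def decode_qp_value(value, charset="utf-8"):
    # Quoted-printable layer: turn "=XX" escapes into bytes, then into text.
    return quopri.decodestring(value.encode("ascii")).decode(charset)


# Shortened NOTE from the original post, folded in the middle of "=A8".
raw = (
    "NOTE;CHARSET=UTF-8;ENCODING=QUOTED-PRINTABLE:=D7=A9=D7=95=D7=A8=D7=94 =D7=A\r\n"
    " 8=D7=90=D7=A9=D7=95=D7=A0=D7=94\r\n"
)

logical_line = unfold(raw).rstrip("\r\n")           # step 1: vCard rules
name_and_params, _, value = logical_line.partition(":")
note = decode_qp_value(value)                        # step 2: quoted-printable
print(ascii(note))
```

Decoding before unfolding would hand the decoder a line ending in "=D7=A",
an incomplete escape, which is why the vCard layer has to come first.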

Paul
 
Lawrence D'Oliveiro

My test file has newlines not preceded by an equals sign:

As was mentioned upthread by Paul Boddie, the vCard spec has its own
convention for continuing a value across multiple lines. Provided you stick
to that, you should be OK.
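
As for the original question about a whole directory, a sketch along the
following lines might do for personal use. The function name
unfold_directory and the "*.vcf" pattern are assumptions, not anything
Kontact prescribes, and the files are handled as bytes so that no
character encoding has to be guessed. It rewrites files in place, so it
would be wise to run it on a copy of the directory first.

```python
import re
from pathlib import Path

# Bytes pattern: a line break (CRLF or bare LF) followed by one space or tab.
FOLD = re.compile(rb"\r?\n[ \t]")


def unfold_directory(directory, pattern="*.vcf"):
    """Unfold every matching file in place; return the names that changed."""
    changed = []
    for path in sorted(Path(directory).glob(pattern)):
        original = path.read_bytes()
        fixed = FOLD.sub(b"", original)
        if fixed != original:
            # Only rewrite files that actually contained folded lines.
            path.write_bytes(fixed)
            changed.append(path.name)
    return changed
```

Calling unfold_directory on the copied directory returns the list of files
it rewrote, which gives a quick check against the expected 422.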
 
Dotan Cohen

2008/10/15 Lawrence D'Oliveiro:

Thanks. The RFC pages for vcard (http://www.ietf.org/rfc/rfc2426.txt
and http://www.ietf.org/rfc/rfc2425.txt) are very difficult for me to
read. I'm using the test file to learn, and I will work out the kinks
on other files that I come across. This is for personal use, not
production, so I can be sloppy :)

 