Eclipse/PyDev - BOM Lexical Error

TheOne · Oct 4, 2010

Hi.

I installed eclipse/pydev today.
I created a pydev project and added python source files with utf-8
BOM.
Eclipse/Pydev reports lexical error :
Lexical error at line 1, column 1. Encountered: "\ufeff" (65279),
after : ""

I want the source files to have BOM character. How could I shut off
this error msg?

My eclipse is Helios, pydev version is 1.6.2.2010090812

Thanks.

Diez B. Roggisch · Oct 4, 2010

TheOne said:
Hi.

I installed eclipse/pydev today.
I created a pydev project and added python source files with utf-8
BOM.
Eclipse/Pydev reports lexical error :
Lexical error at line 1, column 1. Encountered: "\ufeff" (65279),
after : ""

I want the source files to have BOM character. How could I shut off
this error msg?

No idea. Why do you want it? Is somebody else processing these scripts?
If it's about declaring them to be utf-8, you should consider placing

# -*- coding: utf-8 -*-

on the first or second line. That works for python, and should work for eclipse.

Diez

TheOne · Oct 5, 2010

No idea. Why do you want it? Is somebody else processing these scripts?
If it's about declaring them to be utf-8, you should consider placing

# -*- coding: utf-8 -*-

on the first or second line. That works for python, and should work for eclipse.

Diez

I also included that "# -*- coding:" line. I just don't want me or other
project members to accidentally save them in a different encoding. So I
(and my team) thought it would be safe to have the BOM character.
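
One way to catch an accidental save in another encoding without relying on a BOM is simply to check that the bytes decode as UTF-8. The following is only a rough sketch under that assumption; the helper name `is_valid_utf8` and the sample strings are made up for illustration and are not part of PyDev or Eclipse.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (with or without a BOM)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The same text saved in two different encodings (hypothetical sample data).
saved_as_utf8 = "caf\u00e9".encode("utf-8")
saved_as_cp1252 = "caf\u00e9".encode("cp1252")

print(is_valid_utf8(saved_as_utf8))
print(is_valid_utf8(saved_as_cp1252))
```

A check like this could run over the project's source files before a commit. Note that pure-ASCII files pass it whatever the editor was set to, since ASCII is a subset of UTF-8.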

Anybody any idea?

TIA.

Lawrence D'Oliveiro · Oct 5, 2010

TheOne said:
I want the source files to have BOM character.

What exactly is the point of a BOM in a UTF-8-encoded file?

Diez B. Roggisch · Oct 5, 2010

Lawrence D'Oliveiro said:
What exactly is the point of a BOM in a UTF-8-encoded file?

It's an MS-specific thing that makes a file identifiable as
UTF-8-encoded under Windows. The name BOM is obviously BS, but it's the
way it is.

Diez

Diez B. Roggisch · Oct 5, 2010

TheOne said:
I also included that "# -*- coding:" line. I just don't want me or other
project members to accidentally save them in different encoding. So I
(and my team) thought it would be safe to have the BOM character.

Well, my team and I don't use it, and we rarely if ever (I can't
remember when) have issues with this. And we are using vim, emacs and pydev.

Diez

TheOne · Oct 5, 2010

It's an MS-specific thing that makes a file identifiable as
UTF-8-encoded under Windows. The name BOM is obviously BS, but it's the
way it is.

Diez

I didn't know that it's a MS-thing. (Is it really?)
Anyway, it would be great if I could make my eclipse/pydev to
understand the BOM character and suppress the lexical error msg.

Lawrence D'Oliveiro · Oct 5, 2010

TheOne said:
Anyway, it would be great if I could make my eclipse/pydev to
understand the BOM character and suppress the lexical error msg.

What exactly is the point of a BOM in a UTF-8-encoded file?

Diez B. Roggisch · Oct 5, 2010

Lawrence D'Oliveiro said:
What exactly is the point of a BOM in a UTF-8-encoded file?

It's a marker like the "coding: utf-8" in python-files. It tells the
software aware of it that the content is UTF-8. Naming it "BOM" is
obviously stupid, but that's the way it is called.

Diez

Terry Reedy · Oct 5, 2010

I didn't know that it's a MS-thing. (Is it really?)

Yes, who else would 'customize' an international standard by corrupting
it when adopting it. Sort of like animals pissing on things to mark them
as theirs.

Here is the relevant part of
https://secure.wikimedia.org/wikipedia/en/wiki/UTF-8
"Byte order mark

Many Windows programs (including Windows Notepad) add the bytes 0xEF,
0xBB, 0xBF at the start of any document saved as UTF-8. This is the
UTF-8 encoding of the Unicode byte order mark (BOM), and is commonly
referred to as a UTF-8 BOM even though it is not relevant to byte order.
The BOM can also appear if another encoding with a BOM is translated to
UTF-8 without stripping it.

The presence of the UTF-8 BOM may cause interoperability problems with
existing software that could otherwise handle UTF-8, for example:

* Older text editors may display the BOM as "ï»¿" at the start of
the document, even if the UTF-8 file contains only ASCII and would
otherwise display correctly.
* Programming language parsers not explicitly designed for UTF-8
can often handle UTF-8 in string constants and comments, but cannot
parse the BOM at the start of the file.
* Programs that identify file types by leading characters may fail
to identify the file if a BOM is present even if the user of the file
could skip the BOM. Or conversely they will identify the file when the
user cannot handle the BOM. An example is the Unix shebang syntax.
* Programs that insert information at the start of a file will
result in a file with the BOM somewhere in the middle of it (this is
also a problem with the UTF-16 BOM). One example is offline browsers
that add the originating URL to the start of the file.

If compatibility with existing programs is not important, the BOM could
be used to identify if a file is UTF-8 versus a legacy encoding, but
this is still problematic due to many instances where the BOM is added
or removed without actually changing the encoding, or various encodings
are concatenated together. Checking if the text is valid UTF-8 is more
reliable than using BOM.
"

Anyway, it would be great if I could make my eclipse/pydev to
understand the BOM character and suppress the lexical error msg.

It IS an error for *decoded* unicode strings to contain the BOM
'character'. BOM is only intended for use in multibyte transfer
*encodings*. Its very illegality within text is what makes it useful for
its purpose. Eclipse understands that, hence

Eclipse/Pydev reports lexical error :

Python deals with this by having separate standard utf_8 and utf_8_sig
(signature) codecs for encoding and decoding.

So if you insist on mal-forming your files, you need to tell
eclipse/pydev to use the equivalent of the utf_8_sig codec.
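
To make the codec distinction concrete, here is a minimal sketch showing how the same bytes decode under each codec. It assumes Python 3 syntax, and the sample source line is invented for illustration.

```python
import codecs

# Bytes of a small source file saved as "UTF-8 with BOM" (hypothetical content).
raw = codecs.BOM_UTF8 + b"print('hello')\n"

as_utf8 = raw.decode("utf_8")      # plain codec: the BOM survives as U+FEFF
as_sig = raw.decode("utf_8_sig")   # signature-aware codec: the BOM is dropped

print(ord(as_utf8[0]))             # 65279, the code point in the lexer error
print(as_sig.startswith("print"))  # True

# Encoding with utf_8_sig writes the three signature bytes back out.
print("x".encode("utf_8_sig"))
```

The 65279 here is the same value PyDev reports in its "Lexical error" message, which suggests its lexer is seeing text decoded the first way.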

Lawrence D'Oliveiro · Oct 6, 2010

It's a marker like the "coding: utf-8" in python-files. It tells the
software aware of it that the content is UTF-8.

But if the software is aware of it, then why does it need to be told?

Naming it "BOM" is obviously stupid, but that's the way it is called.

It is in fact a Unicode BOM character, and I can understand why it’s called
that. What I’m trying to understand is why you need to put one in a
UTF-8-encoded file.

Diez B. Roggisch · Oct 7, 2010

Lawrence D'Oliveiro said:
But if the software is aware of it, then why does it need to be told?

Let me rephrase: windows editors such as notepad recognize the BOM, and
then assume (hopefully rightfully so) that the rest of the file is text
in utf-8 encoding.

So it is similar to the coding-header in Python.

It is in fact a Unicode BOM character, and I can understand why it’s called
that. What I’m trying to understand is why you need to put one in a
UTF-8-encoded file.

I hope that's clear now. It says "I'm a UTF-8 file".

Diez

Lawrence D'Oliveiro · Oct 8, 2010

Let me rephrase: windows editors such as notepad recognize the BOM, and
then assume (hopefully rightfully so) that the rest of the file is text
in utf-8 encoding.

But they can only recognize it as a BOM if they assume UTF-8 encoding to
begin with. Otherwise it could be interpreted as some other coding.

Ethan Furman · Oct 8, 2010

Lawrence said:
But they can only recognize it as a BOM if they assume UTF-8 encoding to
begin with. Otherwise it could be interpreted as some other coding.

Not so. The first three bytes are the flag. For example, in a .dbf
file, the first byte determines what type of dbf the file is: \x03 =
dBase III, \x83 = dBase III with memos, etc. More checking should
naturally be done to ensure the rest of the fields make sense for the
dbf type specified.

MS decided that if the first three bytes = \xEF \xBB \xBF then it's a
UTF-8 file, and if it is not, don't open it with an MS product.
Likewise, MS will add those bytes to any UTF-8 file it saves.

Naturally, this causes problems for non-MS usages, but anybody who's had
to work with both MS and non-MS platforms/products/methodologies knows
that MS does not play well with others.

~Ethan~

Lawrence D'Oliveiro · Oct 10, 2010

Ethan said:
Not so. The first three bytes are the flag.

But this is just a text file. All parts of its contents are text, there is
no “flag”.

If you think otherwise, then tell us what are these three “flag” bytes for a
Windows-1252-encoded text file?

Ethan Furman · Oct 11, 2010

Lawrence said:
But this is just a text file. All parts of its contents are text, there is
no “flag”.

If you think otherwise, then tell us what are these three “flag” bytes for a
Windows-1252-encoded text file?

MS treats those first three bytes as a flag -- if they equal the BOM, MS
treats it as UTF-8, if they equal anything else, MS does not treat it as
UTF-8.

If you think otherwise, hop on an MS machine and test it out.

~Ethan~

Lawrence D'Oliveiro · Oct 11, 2010

Ethan said:
MS treats those first three bytes as a flag -- if they equal the BOM, MS
treats it as UTF-8, if they equal anything else, MS does not treat it as
UTF-8.

So what does it treat it as? You previously gave examples of flag values for
dBase III. What are the flag values for Windows-1252, versus, say,
ISO-8859-15?

Ethan Furman · Oct 11, 2010

Lawrence said:
So what does it treat it as? You previously gave examples of flag values for
dBase III. What are the flag values for Windows-1252, versus, say,
ISO-8859-15?

I am not aware of any other flag values for text files besides the BOM
for UTF-8. If the BOM is not there, I imagine MS defaults to whatever
the locale for that machine is, but I do not know for sure.

~Ethan~

Lawrence D'Oliveiro · Oct 14, 2010

Ethan said:
I am not aware of any other flag values for text files besides the BOM
for UTF-8.

Then how can you say “MS treats those first three bytes as a flag”, then?

Steven D'Aprano · Oct 14, 2010

Then how can you say “MS treats those first three bytes as a flag”,
then?

Because Microsoft tools treat those first three bytes as a flag. An
*optional* flag, but still a flag. If the first three bytes of a text
file equal the UTF-8 BOM, most MS tools treat them as a BOM. If they
equal any other value, then they are not treated as a BOM, but merely
part of the file's contents.
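
The "optional flag" behaviour described above can be sketched in a few lines. This is not how any Microsoft tool is actually implemented, only an illustration of the rule; the function name `split_bom` is hypothetical.

```python
import codecs


def split_bom(data: bytes):
    """Return (has_bom, payload): strip a leading UTF-8 BOM if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        # The first three bytes are the signature, not part of the content.
        return True, data[len(codecs.BOM_UTF8):]
    # Any other leading bytes are ordinary file content.
    return False, data


print(split_bom(b"\xef\xbb\xbfhello"))
print(split_bom(b"hello"))
```

When the flag is absent the function says nothing about the encoding. That is the gap Lawrence keeps pointing at: the reader still has to fall back on a default or on some other heuristic.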

http://blogs.msdn.com/b/oldnewthing/archive/2004/03/24/95235.aspx
http://blogs.msdn.com/b/oldnewthing/archive/2007/04/17/2158334.aspx

It's not just Notepad either:

http://support.microsoft.com/kb/301623
http://msdn.microsoft.com/en-us/library/cc295463.aspx

The Python interpreter does the same thing too:

http://docs.python.org/reference/lexical_analysis.html#encoding-declarations
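
For the curious, Python 3's standard library exposes the interpreter's encoding detection as `tokenize.detect_encoding`, so the BOM handling can be observed directly. This is a small sketch assuming Python 3 (the function did not exist in the Python 2 releases current when this thread started); the wrapper name `source_encoding` and the sample sources are made up.

```python
import codecs
import io
import tokenize


def source_encoding(source: bytes) -> str:
    """Return the encoding Python would use to read this source file."""
    encoding, _first_lines = tokenize.detect_encoding(io.BytesIO(source).readline)
    return encoding


with_bom = codecs.BOM_UTF8 + b"x = 1\n"
with_bom_and_cookie = codecs.BOM_UTF8 + b"# -*- coding: utf-8 -*-\nx = 1\n"
plain = b"x = 1\n"

print(source_encoding(with_bom))             # utf-8-sig
print(source_encoding(with_bom_and_cookie))  # utf-8-sig
print(source_encoding(plain))                # utf-8
```

So a file carrying both the BOM and the coding line, as in the original question, is fine for the interpreter itself. The complaint comes from PyDev's own lexer.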
