Why BOM in logging message?

R

Roy Smith

We've got 10 (supposedly) identical servers, all running Ubuntu 12.04,
Python 2.7, Django 1.3. We log to syslog using the logging module and
a custom fomatter.

'formatters': {
'verbose': {
'format': '%(asctime)s [%(process)d]: %(program)s %(session_id)s %(request_id)s %(request_id_digest)s %(remote_addr)s %(name)s %(level\
name)s %(funcName)s() %(message)s',
'()': 'songza.logging.ContextFormatter',
},
},

There's nothing particularly exciting in the formatter code:

class ContextFormatter(logging.Formatter):
def format(self, record):
record.program = context.get_program()
record.request_id = context.get_request_id()
record.request_id_digest = context.get_request_id_digest()
record.session_id = context.get_session_id() or '-'
record.remote_addr = context.get_remote_addr() or '-'
return logging.Formatter.format(self, record)

What's weird is that two of the servers, and only those two, stick a
BOM (Byte Order Mark) in front of the message they log. It shows up
in syslog as:

2013-01-09T00:00:00+00:00 web5.songza.com <U+FEFF>2013-01-0900:00:00,754 [18979]: [etc...]

The other machines, never put the BOM in. Given that all 10 machines
are ostensibly clones of each other, we're scratching our heads what
might be different about those two which cause the BOMs to appear.
Any ideas?

I suppose it's possible it's a syslog config problem and not a Python
problem, but I figured I'd start at the beginning of the chain.
 
C

Chris Angelico

What's weird is that two of the servers, and only those two, stick a
BOM (Byte Order Mark) in front of the message they log.

Could it be this issue you're looking at?

http://bugs.python.org/issue14452

What are the exact Python versions in use? Are the two different
servers running an older revision of Py 2.7?

ChrisA
 
J

John Gordon

In said:
What's weird is that two of the servers, and only those two, stick a
BOM (Byte Order Mark) in front of the message they log. It shows up
in syslog as:
2013-01-09T00:00:00+00:00 web5.songza.com <U+FEFF>2013-01-0900:00:00,754 [18979]: [etc...]

I worked on an application that would insert a BOM in syslog messages if
the logged message contained unicode, but not if it was plain ascii.

Not sure if this relates to your issue, but it's similar enough that it
seemed worth mentioning.
 
R

Roy Smith

John Gordon said:
In said:
What's weird is that two of the servers, and only those two, stick a
BOM (Byte Order Mark) in front of the message they log. It shows up
in syslog as:
2013-01-09T00:00:00+00:00 web5.songza.com <U+FEFF>2013-01-0900:00:00,754
[18979]: [etc...]

I worked on an application that would insert a BOM in syslog messages if
the logged message contained unicode, but not if it was plain ascii.

Not sure if this relates to your issue, but it's similar enough that it
seemed worth mentioning.

That doesn't seem to be it. All messages from web{2,5} have BOMs, no
message from web{1,3,4,6,7,8,9,10} ever does.

I even tried looking at the output of socket.gethostname() on the
various machines to see if maybe the hostname had some unicode character
in it. No joy.
 
R

Roy Smith

It sounds like it might be it, but we're running 2.7.3 on all machines.

Well, this is fascinating. It turns out that while all of our
machines report that they're running 2.7.3, they have two different
versions of /usr/lib/python2.7/logging/handlers.py!

-rw-r--r-- 1 root root 45076 Aug 1 05:39 /usr/lib/python2.7/logging/handlers.py
-rw-r--r-- 1 root root 45143 Apr 20 2012 /usr/lib/python2.7/logging/handlers.py

The April 24th version has the BOM code in SysLogHander.emit():

| # Message is a string. Convert to bytes as required by RFC 5424
| if type(msg) is unicode:
| msg = msg.encode('utf-8')
| if codecs:
| msg = codecs.BOM_UTF8 + msg
| msg = prio + msg

and the August 1st version doesn't:

| # Message is a string. Convert to bytes as required by RFC 5424
| if type(msg) is unicode:
| msg = msg.encode('utf-8')
| msg = prio + msg

Is it possible there's two different 2.7.3 builds that somehow got
packaged by Ubuntu at different times? The pattern of which machines
have the August code and which have the April code correlates with
when we rolled out each server instance.
 
C

Chris Angelico

Well, this is fascinating. It turns out that while all of our
machines report that they're running 2.7.3, they have two different
versions of /usr/lib/python2.7/logging/handlers.py!

-rw-r--r-- 1 root root 45076 Aug 1 05:39 /usr/lib/python2.7/logging/handlers.py
-rw-r--r-- 1 root root 45143 Apr 20 2012 /usr/lib/python2.7/logging/handlers.py

Ha, that would do it!

I don't have a corresponding system to compare against, but what
package is that file managed by?

$ dpkg -S /usr/lib/python2.7/logging/handlers.py

See if both systems show it as part of the same package, and if so, if
the package is at the same version on each. On my Maverick desktop, I
have 2.6, and the package is python2.6-minimal.

ChrisA
 
C

Chris Angelico

Yup, on some machines we've got 2.7.3-0ubuntu3, and on others,
2.7.3-0ubuntu3.1 of python2.7-minimal. Details at:

https://launchpad.net/ubuntu/+source/python2.7/2.7.3-0ubuntu3.1

I love it when everything adds up! The message even cites the specific
change. It's like a cryptic crossword - once you get the answer, you
KNOW it's the answer because suddenly it all makes sense :)

Thanks for bringing a fun problem to solve! It's (unfortunately) a
refreshing change to read a post from someone who knows how to ask
smart questions.

ChrisA
 
T

Terry Reedy

Well, this is fascinating. It turns out that while all of our
machines report that they're running 2.7.3, they have two different
versions of /usr/lib/python2.7/logging/handlers.py!

-rw-r--r-- 1 root root 45076 Aug 1 05:39 /usr/lib/python2.7/logging/handlers.py
-rw-r--r-- 1 root root 45143 Apr 20 2012 /usr/lib/python2.7/logging/handlers.py

The April 24th version has the BOM code in SysLogHander.emit():

The 'BOM' you are discussing here is not a true 2 or 4 byte
byte-order-mark, but the pseudo-BOM (pseudo because a stream of single
bytes has no byte order within the single bytes) that Micro$tupid adds
(sometimes) to utf-8 encoded bytes to mark their encoding as utf-8
rather than anything else. In otherwords, it is a non-(anti-)standard
U(nicode)E(ncoding)M(ark). It is actually the utf-8 encoding of the
2-byte BOM, and hence not a single mark! It is really confusing when
people use 'BOM' to refer to a UEM sequence instead of a BOM.
| # Message is a string. Convert to bytes as required by RFC 5424
| if type(msg) is unicode:
| msg = msg.encode('utf-8')
| if codecs:
| msg = codecs.BOM_UTF8 + msg
| msg = prio + msg

2.7.3 was released in April.
and the August 1st version doesn't:

| # Message is a string. Convert to bytes as required by RFC 5424
| if type(msg) is unicode:
| msg = msg.encode('utf-8')
| msg = prio + msg

The issue referenced in an earlier messaged was to remove the UEM where
it did not belong.
Is it possible there's two different 2.7.3 builds that somehow got
packaged by Ubuntu at different times?

As you discovered, Ubuntu sometimes updates a release with bug patches
before we release a new official bugfix release. 2.7.4, with many
bugfixes, is still to see the light of day.
The pattern of which machines
have the August code and which have the April code correlates with
when we rolled out each server instance.

Great detective work ;-).
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,961
Messages
2,570,131
Members
46,689
Latest member
liammiller

Latest Threads

Top