Enhancing the Gateway (Help Needed)

  • Thread starter James Edward Gray II

James Edward Gray II

Here's the short story on the current situation with our mailing list
to Usenet gateway:

* Our Usenet host rejects multipart/alternative messages
because they are technically illegal Usenet posts
* This means that some emails do not reach comp.lang.ruby
(several messages each day according to the logs)
* We don't like this

To solve this, we want to enhance the gateway to convert
multipart/alternative messages into something we can legally post to
Usenet. I have two thoughts on this strategy:

1. If possible, we should gather all text/plain portions of an email
and post those with a content-type of text/plain
2. If that fails, we can just post the original body but force the
content-type to text/plain for maximum compatibility

Now I need all of you email and Usenet experts to tell me if that's a
sane strategy. If another approach would be better, please clue me in.

I've pretty much made it this far. The code at the bottom of this
message is the mail_to_news.rb script used by the gateway, rewritten
using this strategy.

If you aren't familiar with the gateway code, you can get details
from the articles at:

http://blog.grayproductions.net/categories/the_gateway

There's one problem left I know I haven't solved correctly. Help me
figure out a decent strategy for this last piece and we can deploy
the new code.

The outstanding issue is how to handle character sets for the
constructed message. You'll see in the code below that I just pull
the charset param from the original message, but after looking at a
few messages, I realize that this doesn't make sense. For example,
here are the relevant portions of a recent post that wasn't gated
correctly:

Content-Type: multipart/alternative; boundary=Apple-Mail-18-445454026

--Apple-Mail-18-445454026
Content-Transfer-Encoding: 7bit
Content-Type: text/plain;
    charset=US-ASCII;
    delsp=yes;
    format=flowed

As you can see, the overall email doesn't have a charset but each
text portion can. If we are going to merge these parts, what's the
best strategy for handling the charset?

I thought of trying to convert them all to UTF-8 with Iconv, but I'm
not sure what to do if a type doesn't declare a charset or when Iconv
chokes on what is declared? Please share your opinions.
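
One way to picture a fallback for both cases (no declared charset, or a
declared charset the converter rejects) is the sketch below. It is only
an illustration under stated assumptions: it uses Ruby's built-in
String#encode instead of the Iconv library mentioned above, to_utf8 is a
hypothetical helper name, and treating unlabelled text as US-ASCII while
replacing undecodable bytes with "?" is one possible policy, not
something the gateway currently does.

```ruby
# Hypothetical helper, not part of mail_to_news.rb.
# Decodes +text+ from its declared charset into UTF-8.
def to_utf8(text, charset = nil)
  label = charset.to_s.strip
  label = "US-ASCII" if label.empty? # assumed default when nothing is declared
  options = { invalid: :replace, undef: :replace, replace: "?" }

  begin
    text.dup.force_encoding(Encoding.find(label))
        .encode(Encoding::UTF_8, **options)
        .scrub("?") # also cleans input that was already labelled UTF-8
  rescue ArgumentError, Encoding::ConverterNotFoundError
    # Unknown or unconvertible charset label: treat the bytes as US-ASCII.
    text.dup.force_encoding(Encoding::US_ASCII)
        .encode(Encoding::UTF_8, **options)
  end
end

puts to_utf8("caf\xE9".b, "ISO-8859-1") # decoded via the declared charset
puts to_utf8("caf\xE9".b)               # no charset: the stray byte becomes "?"
puts to_utf8("plain text", "x-bogus")   # unknown label: handled as US-ASCII
```

The design choice here is to never raise: a post with a few "?"
characters still reaches the newsgroup, whereas an exception would drop
the message entirely.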

If you are feeling really adventurous, rewrite the relevant portion
of the code below, which I have bracketed with FIX ME comments.

Here's the script:

#!/usr/bin/env ruby

# written by James Edward Gray II <[email protected]>

$KCODE = "u"

GATEWAY_DIR = File.join(File.dirname(__FILE__), "..").freeze

$LOAD_PATH << File.join(GATEWAY_DIR, "config") <<
              File.join(GATEWAY_DIR, "lib")

require "tmail"

require "servers_config"
require "nntp"

require "logger"
require "timeout"

# prepare log
log = Logger.new(ARGV.shift || $stdout)
log.datetime_format = "%Y-%m-%d %H:%M "

# build incoming and outgoing message object
incoming = TMail::Mail.parse($stdin.read)
outgoing = TMail::Mail.new

# skip any flagged messages
if incoming["X-Rubymirror"].to_s == "yes"
  log.info "Skipping message ##{incoming.message_id}, sent by news_to_mail"
  exit
elsif incoming["X-Spam-Status"].to_s =~ /\AYes/
  log.info "Ignoring Spam ##{incoming.message_id}: " +
           "#{incoming.subject}–#{incoming.from}"
  exit
end

# only allow certain headers through
%w[from subject in_reply_to transfer_encoding date].each do |header|
  outgoing.send("#{header}=", incoming.send(header))
end
outgoing.message_id = incoming.message_id.sub(/\.+>$/, ">")
%w[X-ML-Name X-Mail-Count X-X-Sender].each do |header|
  outgoing[header] = incoming[header].to_s if incoming.key?(header)
end

# doctor headers for Ruby Talk
outgoing.references = if incoming.key? "References"
  incoming.references
else
  if incoming.key? "In-Reply-To"
    incoming.reply_to
  else
    if incoming.subject =~ /^Re:/
      outgoing.reply_to = "<this_is_a_dummy_message-id@rubygateway>"
    end
  end
end
outgoing["X-Ruby-Talk"] = incoming.message_id
outgoing["X-Received-From"] = <<END_GATEWAY_DETAILS.gsub(/\s+/, " ")
This message has been automatically forwarded from the ruby-talk mailing list by
a gateway at #{ServersConfig::NEWSGROUP}. If it is SPAM, it did not originate at
#{ServersConfig::NEWSGROUP}. Please report the original sender, and not us.
Thanks! For more details about this gateway, please visit:
http://blog.grayproductions.net/categories/the_gateway
END_GATEWAY_DETAILS
outgoing["X-Rubymirror"] = "Yes"

# translate the body of the message, if needed
if incoming.multipart? and incoming.sub_type == "alternative"
  ### FIX ME ###
  # handle multipart/alternative messages
  # extract body
  body = ""
  extract_text = lambda do |message_or_part|
    if message_or_part.multipart?
      message_or_part.each_part { |part| extract_text[part] }
    elsif message_or_part.content_type == "text/plain"
      body += message_or_part.body
    end
  end
  extract_text[incoming]
  if body.empty?
    outgoing.body = "Note: the content-type of this message was altered by " +
                    "the gateway.\n\n#{incoming.body}"
  else
    outgoing.body = "Note: non-text portions of this message were stripped " +
                    "by the gateway.\n\n#{body}"
  end
  # set the content type of the new message
  outgoing.set_content_type( "text", "plain",
                             "charset" => incoming.type_param("charset") )
  ### END FIX ME ###
else
  %w[content_type body].each do |header|
    outgoing.send("#{header}=", incoming.send(header))
  end
end

log.info "Sending message ##{incoming.message_id}: " +
         "#{incoming.subject}–#{incoming.from}…"
log.info "Message looks like:\n#{outgoing.encoded}"

# connect to NNTP host
begin
  nntp = nil
  Timeout.timeout(30) do
    nntp = Net::NNTP.new( ServersConfig::NEWS_SERVER,
                          Net::NNTP::NNTP_PORT,
                          ServersConfig::NEWS_USER,
                          ServersConfig::NEWS_PASS )
  end
rescue Timeout::Error
  log.error "The NNTP connection timed out"
  exit -1
rescue
  log.fatal "Unable to establish connection to NNTP host: #{$!.message}"
  exit -1
end

# attempt to send newsgroup post
unless $DEBUG
  begin
    result = nil
    Timeout.timeout(30) { result = nntp.post(outgoing.encoded) }
  rescue Timeout::Error
    log.error "The NNTP post timed out"
    exit -1
  rescue
    log.fatal "Unable to post to NNTP host: #{$!.message}"
    exit -1
  end
  log.info "… Sent. nntp.post() result: #{result}"
end

__END__

Thanks for the help.

James Edward Gray II
 

Bill Kelly

James Edward Gray II said:
> 1. If possible, we should gather all text/plain portions of an email
> and post those with a content-type of text/plain

Do we get many HTML-only messages, having a text/html part, without a
corresponding text/plain part?

Or is that too uncommon to worry about?


Regards,

Bill
 

Nobuyoshi Nakada

Hi,

At Mon, 29 Oct 2007 06:20:48 +0900,
James Edward Gray II wrote in [ruby-talk:276334]:
> To solve this, we want to enhance the gateway to convert multipart/
> alternative messages into something we can legally post to Usenet. I
> have two thoughts on this strategy:
>
> 1. If possible, we should gather all text/plain portions of an email
> and post those with a content-type of text/plain

Rather, I want it to be done by FML itself on ruby-lang.org.

> 2. If that fails, we can just post the original body but force the
> content-type to text/plain for maximum compatibility

I do it locally by `w3m -dump -T text/html`.

> The outstanding issue is how to handle character sets for the
> constructed message. You'll see in the code below that I just pull
> the charset param from the original message, but after looking at a
> few messages, I realize that this doesn't make sense. For example,
> here are the relevant portions of a recent post that wasn't gated
> correctly:
>
> Content-Type: multipart/alternative; boundary=Apple-Mail-18-445454026
>
> --Apple-Mail-18-445454026
> Content-Transfer-Encoding: 7bit
> Content-Type: text/plain;
>     charset=US-ASCII;
>     delsp=yes;
>     format=flowed
>
> As you can see, the overall email doesn't have a charset but each
> text portion can. If we are going to merge these parts, what's the
> best strategy for handling the charset?

"alternative" means each body actually has the same content, so, in
theory, you can and should select one of them. Merging them all is
the wrong behavior. I suspect you mean multipart/relative.

> I thought of trying to convert them all to UTF-8 with Iconv, but I'm
> not sure what to do if a type doesn't declare a charset or when Iconv
> chokes on what is declared? Please share your opinions.

It should default to US-ASCII.
 

James Edward Gray II

> Hi,
>
> At Mon, 29 Oct 2007 06:20:48 +0900,
> James Edward Gray II wrote in [ruby-talk:276334]:
>> To solve this, we want to enhance the gateway to convert multipart/
>> alternative messages into something we can legally post to Usenet. I
>> have two thoughts on this strategy:
>>
>> 1. If possible, we should gather all text/plain portions of an email
>> and post those with a content-type of text/plain
>
> Rather, I want it to be done by FML itself on ruby-lang.org.

Excellent. Are there any plans to make that happen?

I'm trying to get it in the gateway so we can stop having this
discussion. ;) But if there are plans to have the list itself do
it, that's great.

> I do it locally by `w3m -dump -T text/html`.

Yes, I assume we could use lynx/links to similar effect. My strategy
wasn't as clever, but I thought by swapping the content type we would
at least get the content, though it would have some noise.

> "alternative" means each body actually has the same content, so, in
> theory, you can and should select one of them. Merging them all is
> the wrong behavior.

Now you know why I asked for help. I know so little about email
rules. Thanks for explaining this.

This is good news because it greatly simplifies the process.

Do you know if multipart content can be nested? For example, could a
single part of a multipart message itself be multipart? The design
of TMail seems to support this, but again it's easier if that's not
the case.

> I suspect you mean multipart/relative.

I wasn't even aware of that format, to be honest. I knew of
multipart/mixed (which our Usenet host will allow) and multipart/
alternative. What is the purpose of multipart/relative?

> It should default to US-ASCII.

Do you mean that US-ASCII is the charset when one is not specified?

Thanks for all the information.

James Edward Gray II
 

James Edward Gray II

> Do we get many HTML-only messages, having a text/html part, without a
> corresponding text/plain part?

I know I have seen it at least once in the past. I suspect it's
rare, but that's just me guessing. When dealing with the Internet at
large, I think we always need to be prepared for the worst case
scenario.

> Or is that too uncommon to worry about?

You made a good point here that I should try looking at some actual
Ruby Talk messages to see what we're up against. I'll put together a
script to comb through a subset of the archives…

James Edward Gray II
 

Nobuyoshi Nakada

Hi,

At Mon, 29 Oct 2007 12:18:40 +0900,
James Edward Gray II wrote in [ruby-talk:276357]:
> Excellent. Are there any plans to make that happen?

I'm asking eban.

> Do you know if multipart content can be nested? For example, could a
> single part of a multipart message itself be multipart? The design
> of TMail seems to support this, but again it's easier if that's not
> the case.

Yes, and the depth isn't restricted.

> I wasn't even aware of that format, to be honest. I knew of
> multipart/mixed (which our Usenet host will allow) and multipart/
> alternative. What is the purpose of multipart/relative?

As the above.

> Do you mean that US-ASCII is the charset when one is not specified?

RFC 2045 Internet Message Bodies November 1996

5.2. Content-Type Defaults

Default RFC 822 messages without a MIME Content-Type header are taken
by this protocol to be plain text in the US-ASCII character set,
which can be explicitly specified as:

Content-type: text/plain; charset=us-ascii

This default is assumed if no Content-Type header field is specified.
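
The default quoted above can be sketched in a few lines of Ruby. This is
an illustration only: charset_of and DEFAULT_CHARSET are hypothetical
names, and the regular expression is a simplified reading of a
Content-Type value, not a full MIME parameter parser.

```ruby
DEFAULT_CHARSET = "us-ascii" # per RFC 2045, section 5.2

# Hypothetical helper: returns the declared charset of a Content-Type
# value, or the RFC default when none is declared.
def charset_of(content_type)
  match = content_type.to_s.match(/;\s*charset\s*=\s*"?([^";\s]+)"?/i)
  match ? match[1].downcase : DEFAULT_CHARSET
end

part_header = "text/plain;\n\tcharset=US-ASCII;\n\tdelsp=yes;\n\tformat=flowed"
puts charset_of(part_header)                   # declared on the part
puts charset_of('text/plain; charset="UTF-8"') # quoted parameter value
puts charset_of(nil)                           # no header at all: default
```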
 

Nobuyoshi Nakada

Hi,

At Mon, 29 Oct 2007 13:17:24 +0900,
Nobuyoshi Nakada wrote in [ruby-talk:276371]:
> As the above.

Oops, it was multipart/related, and I removed the paragraph
mentioned about it. My mistake, sorry.
 

James Edward Gray II

> Hi,
>
> At Mon, 29 Oct 2007 13:17:24 +0900,
> Nobuyoshi Nakada wrote in [ruby-talk:276371]:
>> As the above.
>
> Oops, it was multipart/related, and I removed the paragraph
> mentioned about it. My mistake, sorry.

I've been looking into this a little this morning.

We do receive multipart/related messages, though they seem fairly
uncommon compared to multipart/alternative. They don't appear to be
gated properly. In fact, the mailing list archives don't even seem
to show them. For example 271796 was a multipart/related message and
I can't find it in the archives or on comp.lang.ruby.

To understand what we are dealing with here, I read:

http://www.faqs.org/rfcs/rfc2387.html

This type does not seem easy to deal with and I am open to suggestions
for the best strategy to use.

James Edward Gray II
 

mortee

James said:
> I've been looking into this a little this morning.
>
> We do receive multipart/related messages, though they seem fairly
> uncommon compared to multipart/alternative. They don't appear to be
> gated properly. In fact, the mailing list archives don't even seem to
> show them. For example 271796 was a multipart/related message and I
> can't find it in the archives or on comp.lang.ruby.
>
> To understand what we are dealing with here, I read:
>
> http://www.faqs.org/rfcs/rfc2387.html
>
> This type does not seem easy to deal with and I am open to suggestions
> for the best strategy to use.

AFAIK it's mostly used for HTML messages with images embedded in the
email itself. I guess it would mostly be one part of a
multipart/alternative message, of which one alternative should be
text/plain anyway. Otherwise, you're most likely left with HTML to
strip, and images which you may either drop or attach to the output as
files.

Sorry if I happen to be wrong on one point or the other.

mortee
 

Todd Benson

I haven't built enough clout in this group for my opinion to matter,
but here goes...

James did a great job with the gateway ... no doubt about that.
Should we even have it? I absolutely think so.

The lowest common denominator for language is US-ASCII (is that a good
thing or bad thing? You decide).

Make sure, James and others, that you label the reformed
emails/postings with some kind of rejoinder that says something to the
effect of "mail/posting has been modified to make it available."

Todd
 

James Edward Gray II

> AFAIK it's mostly used for HTML messages with images embedded in the
> email itself.

Yeah, I think that's what I'm seeing in my analysis of the messages.

> I guess it would mostly be one part of a multipart/alternative
> message, of which one alternative should be text/plain anyway.

Most of the cases I have found have a multipart/alternative section
inside the multipart/related section, like this example shows:

271796: multipart/related ()
    multipart/alternative ()
    image/png ()

Obviously I need to extend my statistics gathering script to handle
the nesting, but I've checked this message by hand and there was a
text/plain part in there.
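
A statistics pass that follows the nesting could look roughly like the
sketch below. It is not the actual script: Node is a hypothetical
stand-in for a parsed message part (not TMail's API), outline is an
invented name, and the leaf parts in the example are assumptions about
what a message shaped like 271796 might contain.

```ruby
# Hypothetical stand-in for a parsed message part (not TMail's API).
Node = Struct.new(:content_type, :charset, :parts)

# Returns one line per part, indented by nesting depth.
def outline(node, depth = 0)
  line = "#{'    ' * depth}#{node.content_type} (#{node.charset})"
  [line] + Array(node.parts).flat_map { |child| outline(child, depth + 1) }
end

# Made-up example; the text/plain and text/html leaves are assumptions.
message = Node.new("multipart/related", nil, [
  Node.new("multipart/alternative", nil, [
    Node.new("text/plain", "US-ASCII"),
    Node.new("text/html", "US-ASCII")
  ]),
  Node.new("image/png")
])

puts outline(message)
```

Because the walk is recursive, it copes with any nesting depth, which
matters since the depth of multipart content is not restricted.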
> Otherwise, you're most likely left with HTML to
> strip, and images which you may either drop or attach to the output as
> files.

Right. Which means I still need to settle on an HTML strategy as well.

> Sorry if I happen to be wrong on one point or the other.

The other usage that seems common, more common than the HTML case in
fact, is as part of a signed message:

271822: multipart/signed ()
    multipart/related ()
    application/pgp-signature ()

I've not yet checked to see if these messages are gated properly with
our current setup.

James Edward Gray II
 

mortee

Todd said:
> The lowest common denominator for language is US-ASCII (is that a good
> thing or bad thing? You decide).

Aside from any language bias: the language of this list/group is
certainly English, which does just fine in ASCII. So IMHO we wouldn't
lose much by falling back to that in case of some iconv errors. At
least, certainly not enough to justify the extra effort of working
around it.

mortee
 

James Edward Gray II

> I haven't built enough clout in this group for my opinion to matter,
> but here goes...

I'm in over my head with all this email stuff and need all the help I
can get. The gateway belongs to all of us, not me. So don't be
shy. Help me fix this right and we all benefit.

> James did a great job with the gateway ... no doubt about that.

Just to be totally clear, I didn't make the original gateway. I'm
just the current caretaker.

> Make sure, James and others, that you label the reformed
> emails/postings with some kind of rejoinder that says something to the
> effect of "mail/posting has been modified to make it available."

I will absolutely do this. The code I posted earlier in this thread
already does.

James Edward Gray II
 

F. Senault

On 29 October at 16:06, James Edward Gray II wrote:
> On Oct 29, 2007, at 9:20 AM, mortee wrote:
>
> Right. Which means I still need to settle on an HTML strategy as well.

I'm not sure you have that many HTML only messages. For my mailbox, I
have an HTML-only filter. It catches 0.5% of my incoming mail, and it's
100% spam.

OTOH, I seem to recall we looked at a weird multipart/alternative
message recently which had only one plain text part.
> The other usage that seems common, more common than the HTML case in
> fact, is as part of a signed message:
>
> 271822: multipart/signed ()
>     multipart/related ()
>     application/pgp-signature ()
>
> I've not yet checked to see if these messages are gated properly with
> our current setup.

Yes. I have <[email protected]> / ruby-talk 276326,
for instance. I can't guarantee it's propagated as well as a pure text
message, but it should be on most servers.

Fred
 

F. Senault

On 28 October at 22:20, James Edward Gray II wrote:

> The outstanding issue is how to handle character sets for the
> constructed message. You'll see in the code below that I just pull
> the charset param from the original message, but after looking at a
> few messages, I realize that this doesn't make sense. For example,
> here are the relevant portions of a recent post that wasn't gated
> correctly:
>
> Content-Type: multipart/alternative; boundary=Apple-Mail-18-445454026
>
> --Apple-Mail-18-445454026
> Content-Transfer-Encoding: 7bit
> Content-Type: text/plain;
>     charset=US-ASCII;
>     delsp=yes;
>     format=flowed
>
> As you can see, the overall email doesn't have a charset but each
> text portion can. If we are going to merge these parts, what's the
> best strategy for handling the charset?

Well, usually, you don't have more than one charset in a message; you
should push the charset of the part back to the main header and be done
with it.

Now, if you have more than one text part and different charsets, it's a
bit more complicated...

> I thought of trying to convert them all to UTF-8 with Iconv, but I'm
> not sure what to do if a type doesn't declare a charset or when Iconv
> chokes on what is declared? Please share your opinions.

Hm... Complain to the poster / the software writer? :)

Fred
 

James Edward Gray II

Fred, you always show up when I need you. That's why you're still my
best friend. ;)

> On 29 October at 16:06, James Edward Gray II wrote:
>
> I'm not sure you have that many HTML only messages. For my mailbox, I
> have an HTML-only filter. It catches 0.5% of my incoming mail, and
> it's 100% spam.

Yes, you may be right about that. Perhaps not much of a concern.
I'm not seeing any such messages in my sample data.

> OTOH, I seem to recall we looked at a weird multipart/alternative
> message recently which had only one plain text part.

Sadly, that's extremely common. Have a look at just the beginning of
my sample data:

271456: multipart/alternative ()
    text/plain (UTF-8)
271541: multipart/signed ()
    text/plain (utf-8)
    application/pgp-signature ()
271567: multipart/signed ()
    text/plain (iso-8859-1)
    application/pgp-signature ()
271588: multipart/signed ()
    text/plain (utf-8)
    application/pgp-signature ()
271569: multipart/alternative ()
    text/plain (ISO-8859-1)
271578: multipart/alternative ()
    text/plain (ISO-8859-1)
271566: multipart/signed ()
    text/plain (iso-8859-1)
    application/pgp-signature ()
271568: multipart/alternative ()
    text/plain (ISO-8859-1)
271444: multipart/alternative ()
    text/plain (ISO-8859-1)
271452: multipart/alternative ()
    text/plain (ISO-8859-1)
271640: multipart/alternative ()
    text/plain (UTF-8)
271669: multipart/alternative ()
    text/plain (ISO-8859-1)
…

Good thing those are super easy to fix. ;)
> Yes. I have <[email protected]> / ruby-talk 276326,
> for instance. I can't guarantee it's propagated as well as a pure text
> message, but it should be on most servers.

Awesome. That's good to know. Thanks for checking that for me.

James Edward Gray II
 

James Edward Gray II

> Now I need all of you email and Usenet experts to tell me if that's
> a sane strategy.

OK, here is the revised plan, folks. Complain now if you see flaws:

* The gateway will only alter messages with a top-level content-type
of multipart/alternative or multipart/related
* For both types of messages, it will search for the first text/plain
part and promote that to the body, discarding other types (this is
probably not the ideal handling of multipart/related, but it seems to
fit the messages we are seeing on Ruby Talk)
* All modified messages will begin with a disclaimer on the first line

James Edward Gray II
 

James Edward Gray II

> OK, here is the revised plan, folks. Complain now if you see flaws:

I forgot one detail…

> * The gateway will only alter messages with a top-level content-type
> of multipart/alternative or multipart/related
> * For both types of messages, it will search for the first text/plain
> part and promote that to the body, discarding other types (this is
> probably not the ideal handling of multipart/related, but it seems to
> fit the messages we are seeing on Ruby Talk)

* If we fail to find a text/plain part, the gateway will keep the
body as is, but force the content-type of the message to text/plain
in the hopes of getting the content through with some noise (it seems
this will be needed for very few messages, possibly none)

> * All modified messages will begin with a disclaimer on the first line

James Edward Gray II
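
For anyone who wants to see the plan as code, here is a minimal sketch
of the selection logic. It deliberately avoids TMail so it can run on
its own: MailPart, first_plain_text and gate are hypothetical names, the
message is modelled as a plain Struct, and the disclaimer strings are
the ones from the script posted at the start of the thread. Reusing the
chosen part's charset for the new message follows Fred's suggestion and
is an assumption, not settled gateway behavior.

```ruby
# Hypothetical stand-in for a parsed message or part (not TMail's API).
MailPart = Struct.new(:content_type, :charset, :body, :parts)

ALTERED_TYPES = %w[multipart/alternative multipart/related].freeze

# Depth-first search for the first text/plain part.
def first_plain_text(part)
  return part if part.content_type == "text/plain"
  Array(part.parts).each do |child|
    found = first_plain_text(child)
    return found if found
  end
  nil
end

# Applies the plan: only the two top-level types are altered.
def gate(message)
  return message unless ALTERED_TYPES.include?(message.content_type)

  plain = first_plain_text(message)
  if plain
    note = "Note: non-text portions of this message were stripped by the gateway."
    MailPart.new("text/plain", plain.charset, "#{note}\n\n#{plain.body}", nil)
  else
    note = "Note: the content-type of this message was altered by the gateway."
    MailPart.new("text/plain", message.charset, "#{note}\n\n#{message.body}", nil)
  end
end

# Made-up message: multipart/related wrapping multipart/alternative.
example = MailPart.new("multipart/related", nil, nil, [
  MailPart.new("multipart/alternative", nil, nil, [
    MailPart.new("text/plain", "UTF-8", "Hello, Ruby Talk!"),
    MailPart.new("text/html", "UTF-8", "<p>Hello, Ruby Talk!</p>")
  ]),
  MailPart.new("image/png")
])

puts gate(example).body
```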
 

mortee

James said:
> * If we fail to find a text/plain part, the gateway will keep the body
> as is, but force the content-type of the message to text/plain in the
> hopes of getting the content through with some noise (it seems this will
> be needed for very few messages, possibly none)

Do you think *anyone* would ever attempt to read a post which would show
its html source as plain text (that will happen if you force the
content-type of a html mail to text/plain)? I guess you should either
drop those or try to strip the html tags.
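
To make the tag-stripping option concrete, here is a rough sketch. It is
a crude regular-expression approach, not a real HTML parser and not the
w3m dump mentioned earlier in the thread; strip_html and ENTITIES are
hypothetical names, and only a handful of entities are decoded. Odd
markup will defeat it, so treat it as an illustration of the idea only.

```ruby
ENTITIES = { "&amp;" => "&", "&lt;" => "<", "&gt;" => ">",
             "&quot;" => '"', "&nbsp;" => " " }.freeze

# Hypothetical helper: reduces simple HTML to readable plain text.
def strip_html(html)
  text = html.gsub(%r{<(script|style)\b.*?</\1\s*>}mi, "")         # drop non-content blocks
  text = text.gsub(%r{<br\s*/?>|</(?:p|div|li|h[1-6])\s*>}i, "\n") # keep block breaks
  text = text.gsub(/<[^>]*>/, "")                                  # remove remaining tags
  text = text.gsub(/&(?:amp|lt|gt|quot|nbsp);/, ENTITIES)          # decode common entities
  text.gsub(/[ \t]+/, " ").gsub(/ ?\n ?/, "\n").gsub(/\n{3,}/, "\n\n").strip
end

puts strip_html("<html><body><p>Hello <b>Ruby</b> &amp; friends</p><p>Bye</p></body></html>")
```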

mortee
 
