[OT] a little about regex

Fulvio

***********************
Your mail has been scanned by InterScan MSS.
***********************


Hello,

I'm trying to get an assertion working that filters addresses from some
domain, but only if it's prefixed by '.com'.
Even putting the result in a negated test, I can't get the wanted result.

The thought, in program terms:
>>> def filter(adr):
...     import re
...     allow = re.compile('.*\.my(>|$)')
...     deny = re.compile('.*\.com\.my(>|$)')
...     cnt = 0
...     if deny.search(adr): cnt += 1
...     if allow.search(adr): cnt += 1
...     return cnt
...
It seems I'm missing a better regex that would keep both filters from
firing on the same address. I'm thinking of a lookbehind (negative or
positive), but I haven't managed to make it work yet.
I think either 'allow' should require that there is no '.com' before '.my',
or 'deny' should match _only_ when '.com' comes before '.my'. Sorry, I can't
get the syntax right.

Suggestions are welcome.

F
 
Ron Adam

Fulvio said:
> Hello,
>
> I'm trying to get an assertion working that filters addresses from some
> domain, but only if it's prefixed by '.com'.
> Even putting the result in a negated test, I can't get the wanted result.
>
> The thought, in program terms:
>
> >>> def filter(adr):
> ...     import re
> ...     allow = re.compile('.*\.my(>|$)')
> ...     deny = re.compile('.*\.com\.my(>|$)')
> ...     cnt = 0
> ...     if deny.search(adr): cnt += 1
> ...     if allow.search(adr): cnt += 1
> ...     return cnt
> ...
> 1
>
> It seems I'm missing a better regex that would keep both filters from
> firing on the same address. I'm thinking of a lookbehind (negative or
> positive), but I haven't managed to make it work yet.
> I think either 'allow' should require that there is no '.com' before '.my',
> or 'deny' should match _only_ when '.com' comes before '.my'. Sorry, I can't
> get the syntax right.
>
> Suggestions are welcome.
>
> F

Instead of using two separate ifs, use an if/elif and be sure to test the
narrower filter first. (You have them in the correct order.) That way it will
skip the more general filter and not increment cnt twice.
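
A minimal sketch of the if/elif idea, assuming the two patterns from the original post. The function name `classify` is made up for the example.

```python
import re

# Patterns from the original post, written as raw strings.
deny = re.compile(r'.*\.com\.my(>|$)')
allow = re.compile(r'.*\.my(>|$)')

def classify(adr):
    """Return 1 if either filter fires, 0 if neither does."""
    cnt = 0
    if deny.search(adr):       # narrower filter is tested first
        cnt += 1
    elif allow.search(adr):    # only reached when deny did not match
        cnt += 1
    return cnt
```

With this shape an address ending in '.com.my' is counted once, by deny only, even though the allow pattern would also match it.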

It's not exactly clear what output you are seeking. If you want 0 for not
filtered and 1 for filtered, then look to Fred's hint.

Or are you writing a test at the moment, where a 1 means it passed only one
filter, so you know your filters are working as designed?

Another approach would be to assign values for filtered, accepted, and undefined
and set those accordingly instead of incrementing and decrementing a counter.
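
One way this could look, as a sketch only. The three constant names and the function name `status` are hypothetical; the patterns are the ones from the original post.

```python
import re

# Hypothetical names for the three outcomes.
FILTERED, ACCEPTED, UNDEFINED = 'filtered', 'accepted', 'undefined'

deny = re.compile(r'.*\.com\.my(>|$)')
allow = re.compile(r'.*\.my(>|$)')

def status(adr):
    """Return a named outcome instead of a counter."""
    if deny.search(adr):
        return FILTERED
    if allow.search(adr):
        return ACCEPTED
    return UNDEFINED   # neither filter had an opinion
```

The caller can then compare against a name rather than guess what a count of 1 or 2 means.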

Cheers,
Ron
 
Rob Wolfe

Fulvio said:
> I'm trying to get an assertion working that filters addresses from some
> domain, but only if it's prefixed by '.com'.
> Even putting the result in a negated test, I can't get the wanted result.
> [...]
>
> It seems I'm missing a better regex that would keep both filters from
> firing on the same address. I'm thinking of a lookbehind (negative or
> positive), but I haven't managed to make it work yet.
> I think either 'allow' should require that there is no '.com' before '.my',
> or 'deny' should match _only_ when '.com' comes before '.my'. Sorry, I can't
> get the syntax right.
>
> Suggestions are welcome.

Try this:

def filter(adr):    # note that "filter" is a builtin function also
    import re

    allow = re.compile(r'.*(?<!\.com)\.my(>|$)')  # negative lookbehind
    deny = re.compile(r'.*\.com\.my(>|$)')
    cnt = 0
    if deny.search(adr): cnt += 1
    if allow.search(adr): cnt += 1
    return cnt
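
To see what the negative lookbehind changes, here is a small check of both patterns against a few made-up addresses (the sample addresses are assumptions, not from the thread).

```python
import re

allow = re.compile(r'.*(?<!\.com)\.my(>|$)')  # '.my' not preceded by '.com'
deny = re.compile(r'.*\.com\.my(>|$)')

# Hypothetical sample addresses.
samples = [
    'user@example.my',
    '<user@example.my>',
    'user@example.com.my',
    'user@telecom.my',      # ends in 'com.my' but has no '.com' label
    'user@example.org',
]

# Map each address to (allow matched, deny matched).
results = {adr: (bool(allow.search(adr)), bool(deny.search(adr)))
           for adr in samples}

for adr, (allowed, denied) in results.items():
    print(adr, allowed, denied)
```

No address triggers both patterns any more, and 'user@telecom.my' is still allowed because the lookbehind checks for the literal four characters '.com'.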


HTH,
Rob
 
Fulvio

|def filter(adr):    # note that "filter" is a builtin function also
|    import re

I didn't know that, but my function's name _does_ start with an underscore
(a bit of localization :) )
|    allow = re.compile(r'.*(?<!\.com)\.my(>|$)')  # negative lookbehind
|    deny = re.compile(r'.*\.com\.my(>|$)')

Great, it works perfectly. I found my errors.
I didn't use r ahead of the patterns, and I was close to the 'allow' pattern,
but it didn't give a positive result and KregexEditor reported it as wrong,
mainly because of the '<' inside the pattern. I think that is not normal
regex input; it's only valid in Python. Am I right?

More details are in the previous thread.

F
 
Fulvio

|Instead of using two separate if's, Use an if - elif and be sure to test

Thank you, Ron, for the input :)
I'll look into that approach too. Meanwhile I faced the total disaster :) of
deleting all my emails from all servers ;(
(I had saved them locally, luckily :) )
|It's not exactly clear on what output you are seeking.  If you want 0 for
| not filtered and 1 for filtered, then look to Freds Hint.

Actually the return value is used like this:

if _filter(hdrs,allow,deny):
    # allow and deny are objects prepared by re.compile(pattern)
    _del(Num_of_Email)

In short, a true result means the message is unwanted and gets deleted.
And now the function is:

def _filter(msg,al,dn):
    """Try to classify a list of lines against a set of compiled
    patterns."""
    a = 0
    for hdrline in msg:
        # deny has first priority and stops any further searching;
        # it scores 10 times the number of lines
        if dn.search(hdrline): return len(msg) * 10
        if al.search(hdrline): return 0
        a += 1
    return a  # number of lines that matched neither pattern (zero if msg is empty)


The patterns are taken from a configuration file. Those named Axx = 'pattern'
are the allow patterns; the others, named Dxx, block under different criteria.
Here they are:

[Filters]
A01 = ^From:.*\.it\b
A02 = ^(To|Cc):.*frioio@
A03 = ^(To|Cc):.*the_sting@
A04 = ^(To|Cc):.*calm_me_or_die@
A05 = ^(To|Cc):.*further@
A06 = ^From:.*\.za\b
D01 = ^From:.*\.co\.au\b
D02 = ^Subject:.*\*\*\*SPAM\*\*\*

*Slightly faked to preserve some privacy* :)
I'm using configparser to fetch their values, and they are joined by:

# compare with ==, not "is": identity of equal strings is not guaranteed
allow = re.compile('|'.join([k[1] for k in ifil if k[0] == 'a']))
deny = re.compile('|'.join([k[1] for k in ifil if k[0] == 'd']))

ifil is the input filters section.
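
For illustration, a self-contained sketch of reading such a section and building the two regexes. It assumes `ifil` is the list of (name, value) pairs returned by `ConfigParser.items()`, which may differ from the real program; the helper `build` and the in-memory config text are made up for the example.

```python
import configparser
import re

# Hypothetical in-memory copy of part of the [Filters] section.
CONFIG_TEXT = r"""
[Filters]
A01 = ^From:.*\.it\b
A02 = ^(To|Cc):.*frioio@
D01 = ^From:.*\.co\.au\b
D02 = ^Subject:.*\*\*\*SPAM\*\*\*
"""

# interpolation=None keeps any '%' in a pattern literal.
parser = configparser.ConfigParser(interpolation=None)
parser.read_string(CONFIG_TEXT)
ifil = parser.items('Filters')   # option names come back lowercased: 'a01', 'd01', ...

def build(prefix):
    """Join every pattern whose option name starts with `prefix`."""
    patterns = [value for name, value in ifil if name.startswith(prefix)]
    # Each alternative is wrapped in a non-capturing group to keep it intact.
    return re.compile('|'.join('(?:%s)' % p for p in patterns))

allow = build('a')
deny = build('d')
```

Note that configparser lowercases option names by default, so the test is on 'a' and 'd' rather than 'A' and 'D'.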

At this point I suppose I've got it right; I'm just a bit curious whether
there's a better way, with a single compiled regex for all of the options.
Basically the program will work: it filters according to the config and
synchronizes with the local $HOME/Mail/trash (configurable path). This last
option removes from the server the emails that are in the local trash.
Todo = back up local and remote emails for those filtered as good,
multithreading to connect to all servers in parallel,
SSL for POP3 and IMAP4 as well.
Currently I have a problem issuing the command that tells the IMAP server to
flag as "Deleted" a message that counts as spam. I know the message details,
but the correct command is a bit obscure to me.
BTW, who's Fred?

F
 
Ant

Rob Wolfe wrote:
> def filter(adr):    # note that "filter" is a builtin function also
>     import re
>
>     allow = re.compile(r'.*(?<!\.com)\.my(>|$)')  # negative lookbehind
>     deny = re.compile(r'.*\.com\.my(>|$)')
>     cnt = 0
>     if deny.search(adr): cnt += 1
>     if allow.search(adr): cnt += 1
>     return cnt

Which makes the 'deny' code here redundant so in this case the function
could be reduced to:

import re

def allow(adr):
    allow = re.compile(r'.*(?<!\.com)\.my(>|$)')  # negative lookbehind
    if allow.search(adr):
        return True
    return False

Though having the explicit allow and deny expressions may make what's
going on clearer than the fairly esoteric negative lookbehind.
 
Rob Wolfe

Fulvio said:
> Great, it works perfectly. I found my errors.
> I didn't use r ahead of the patterns, and I was close to the 'allow' pattern,
> but it didn't give a positive result and KregexEditor reported it as wrong,
> mainly because of the '<' inside the pattern. I think that is not normal
> regex input; it's only valid in Python. Am I right?

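The "(?...)" sequences are extension notation. They are not part of POSIX
regular expressions, but they are not specific to Python either: the syntax
comes from Perl, and Perl-compatible engines such as PCRE also support
lookbehind.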

Regards,
Rob
 
Fulvio

|    allow = re.compile(r'.*(?<!\.com)\.my(>|$)')  # negative lookbehind
|    if allow.search(adr):
|        return True
|    return False

I'd point out that:
allow = re.search(r'.*(?<!\.com)\.my(>|$)',adr)

will do the same as yours, since the module-level re.search() call does the
compilation that your code does separately.
|Though having the explicit allow and deny expressions may make what's
|going on clearer than the fairly esoteric negative lookbehind.

This makes me think that your point is quite correct.
In my case the policy is meant to be "deny all except those specified".
It may also go the other way round, so I should refine the way the filtering
acts. In fact the (temporarily) ignored score is the basis of the method to
be applied.
Obviously we are mainly talking about email addresses here, so my intention
is something like the mailfilter concept: the program may block an entire
domain while allowing some addresses, and everything from ".my" is allowed
except what comes from ".com.my" (mostly annoying emails :p )

All in all, I've aimed for flexible programming, since I'm thinking it may be
published some time to benefit multiplatform users, as Python is
multiplatform.
With that in mind, I'm a bit curious whether there are sites on the web where
small programs are welcome and people like me can express all of their
ignorance about how to use Python. For such ignorance I might compete for the
Nobel Prize :)

Also, the newsgroup doesn't contemplate splitting into beginners and
high-level programmers (HLP). Of course the HLP are welcome to discuss on
such an NG :).

F
 
Ron Adam

Fulvio said:
> Thank you, Ron, for the input :)
> I'll look into that approach too. Meanwhile I faced the total disaster :) of
> deleting all my emails from all servers ;(
> (I had saved them locally, luckily :) )
>
> Actually the return value is used like this:
>
> if _filter(hdrs,allow,deny):
>     # allow and deny are objects prepared by re.compile(pattern)
>     _del(Num_of_Email)
>
> In short, a true result means the message is unwanted and gets deleted.
> And now the function is:
>
> def _filter(msg,al,dn):
>     """Try to classify a list of lines against a set of compiled
>     patterns."""
>     a = 0
>     for hdrline in msg:
>         # deny has first priority and stops any further searching;
>         # it scores 10 times the number of lines
>         if dn.search(hdrline): return len(msg) * 10
>         if al.search(hdrline): return 0
>         a += 1
>     return a  # number of lines that matched neither pattern (zero if msg is empty)

I see, is this a cleanup script to remove the least wanted items?

The allow/deny caused me to think it was more along the lines of a white/black
list, whereas keep/discard would be terms more suitable to cleaning out items
already allowed.

Or is it a bit of both? Why the score?

Just curious, I don't think I have any suggestions that will help in any
specific ways.

I would think the allow(keep?) filters would always have priority over deny filters.

> The patterns are taken from a configuration file. Those named Axx = 'pattern'
> are the allow patterns; the others, named Dxx, block under different criteria.
> Here they are:
>
> [Filters]
> A01 = ^From:.*\.it\b
> A02 = ^(To|Cc):.*frioio@
> A03 = ^(To|Cc):.*the_sting@
> A04 = ^(To|Cc):.*calm_me_or_die@
> A05 = ^(To|Cc):.*further@
> A06 = ^From:.*\.za\b
> D01 = ^From:.*\.co\.au\b
> D02 = ^Subject:.*\*\*\*SPAM\*\*\*
>
> *Slightly faked to preserve some privacy* :)
> I'm using configparser to fetch their values, and they are joined by:
>
> # compare with ==, not "is": identity of equal strings is not guaranteed
> allow = re.compile('|'.join([k[1] for k in ifil if k[0] == 'a']))
> deny = re.compile('|'.join([k[1] for k in ifil if k[0] == 'd']))
>
> ifil is the input filters section.
>
> At this point I suppose I've got it right; I'm just a bit curious whether
> there's a better way, with a single compiled regex for all of the options.

I think keeping the allow filter separate from the deny filter is good.

You might be able to merge the header lines and run the filters across the whole
header at once instead of each line.
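
A rough sketch of the merged-header idea, assuming the header lines are joined with '\n' and the patterns are compiled with re.MULTILINE so that '^' still anchors at the start of each line. The function name `is_unwanted` and the shortened pattern set are made up for the example, and unlike the per-line loop this version always gives deny priority regardless of line order.

```python
import re

# Shortened, hypothetical pattern set in the style of the [Filters] section.
allow = re.compile(r'^From:.*\.it\b|^(To|Cc):.*frioio@', re.MULTILINE)
deny = re.compile(r'^From:.*\.co\.au\b|^Subject:.*\*\*\*SPAM\*\*\*', re.MULTILINE)

def is_unwanted(hdrs):
    """Run each filter once over the whole header instead of per line."""
    header = '\n'.join(hdrs)      # '.' never crosses '\n', so '.*' stays on one line
    if deny.search(header):
        return True
    if allow.search(header):
        return False
    return True   # assumption: anything matching neither list counts as unwanted
```

This needs two searches per message instead of two per header line.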
> Basically the program will work: it filters according to the config and
> synchronizes with the local $HOME/Mail/trash (configurable path). This last
> option removes from the server the emails that are in the local trash.
> Todo = back up local and remote emails for those filtered as good,
> multithreading to connect to all servers in parallel,
> SSL for POP3 and IMAP4 as well.
> Currently I have a problem issuing the command that tells the IMAP server to
> flag as "Deleted" a message that counts as spam. I know the message details,
> but the correct command is a bit obscure to me.

I can't help you here. Sorry.
> BTW, who's Fred?
>
> F

Fredrik, see...

news://news.cox.net:119/[email protected]
 
Fulvio

|I see, is this a cleanup script to remove the least wanted items?

Yes. It will probably stay this way for a while.
I'm not prepared to come up with a new algorithm.
|Or is it a bit of both?  Why the score?

As explained in another post, there should be a way to define a deny/allow
rule with some particular exception (i.e. deny all ".com" but not
(e-mail address removed)).
|I would think the allow(keep?) filters would always have priority over deny
|filters.

It's a matter of how much discerning capacity is involved. The previous post
brought this point up. I'm thinking of allowing all ".uk" (let us say) but not
"info.uk" (all references are purely meant as examples). Applying a regex
denial on ".info.uk" surely won't match plain ".uk".
|I think keeping the allow filter separate from the deny filter is good.
Agreed. I was simply supposing that a regex can do negative matching.
|You might be able to merge the header lines and run the filters across the
|whole header at once instead of each line.

I like this idea, which is good; I still need a bit of thinking to code it.
I need to work out what the right separator between fields is, otherwise it
may cause problems with different charsets.
|I can't help you here.  Sorry.

Found it :), by try&fail.
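
For anyone with the same question, flagging a message as deleted with the standard imaplib module can be sketched as below. The helper name `flag_deleted` is made up, and the host, user and password in the usage comment are placeholders; this is not necessarily the command that was found by trial and error.

```python
import imaplib

def flag_deleted(conn: imaplib.IMAP4, msg_num: str) -> bool:
    """Set the \\Deleted flag on one message of the currently selected mailbox."""
    typ, _data = conn.store(msg_num, '+FLAGS', '\\Deleted')
    return typ == 'OK'

# Typical use (host, user and password are placeholders):
#   conn = imaplib.IMAP4_SSL('imap.example.org')
#   conn.login('user', 'password')
#   conn.select('INBOX')
#   flag_deleted(conn, '1')
#   conn.expunge()   # permanently removes messages flagged \Deleted
#   conn.logout()
```

The flag only marks the message; it stays on the server until expunge() is called or the mailbox is closed.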
   
|news://news.cox.net:119/[email protected]

I can't reach newsgroups other than the ones my ISP gives me. I'm curious,
and I'll give it a try.

F
 
