Weird problem matching with REs

Andrew Berg · May 29, 2011

I have an RE that should work (it even works in Kodos [1], but not in my
code), but it keeps failing to match characters after a newline.

I'm writing a little program that scans the webpage of an arbitrary
application and gets the newest version advertised on the page.

test3.py:

# -*- coding: utf-8 -*-

import configparser
import re
import urllib.request
import os
import sys
import logging
import collections

class CouldNotFindVersion(Exception):
    def __init__(self, app_name, reason, exc_value):
        self.value = 'The latest version of ' + app_name + ' could not be determined because ' + reason
        self.cause = exc_value
    def __str__(self):
        return repr(self.value)

class AppUpdateItem():
    def __init__(self, config_file_name, config_file_section):
        self.section = config_file_section
        self.name = self.section['Name']
        self.url = self.section['URL']
        self.filename = self.section['Filename']
        self.file_re = re.compile(self.section['FileURLRegex'])
        self.ver_re = re.compile(self.section['VersionRegex'])
        self.prev_ver = self.section['CurrentVersion']
        try:
            self.page = str(urllib.request.urlopen(self.url).read(),
                            encoding='utf-8')
            self.file_URL = self.file_re.findall(self.page)[0] #here is where it fails
            self.last_ver = self.ver_re.findall(self.file_URL)[0]
        except urllib.error.URLError:
            self.error = str(sys.exc_info()[1])
            logging.info('[' + self.name + ']' + ' Could not load URL: ' + self.url + ' : ' + self.error)
            self.success = False
            raise CouldNotFindVersion(self.name, self.error,
                                      sys.exc_info()[0])
        except IndexError:
            logging.warning('Regex did not return a match.')
    def update_ini(self):
        self.section['CurrentVersion'] = self.last_ver
        with open(config_file_name, 'w') as configfile:
            config.write(configfile)
    def rollback_ini(self):
        self.section['CurrentVersion'] = self.prev_ver
        with open(config_file_name, 'w') as configfile:
            config.write(configfile)
    def download_file(self):
        self.__filename = self.section['Filename']
        with open(self.__filename, 'wb') as file:
            self.__file_req = urllib.request.urlopen(self.file_URL).read()
            file.write(self.__file_req)

if __name__ == '__main__':
    config = configparser.ConfigParser()
    config_file = 'checklist.ini'
    config.read(config_file)
    queue = collections.deque()
    for section in config.sections():
        try:
            queue.append(AppUpdateItem(config_file, config[section]))
        except CouldNotFindVersion as exc:
            logging.warning(exc.value)
    for elem in queue:
        if elem.last_ver != elem.prev_ver:
            elem.update_ini()
            try:
                elem.download_file()
            except IOError:
                logging.warning('[' + elem.name + '] Download failed.')
            except:
                elem.rollback_ini()
            print(elem.name + ' succeeded.')

checklist.ini:

[x264_64]
name = x264 (64-bit)
filename = x264.exe
url = http://x264.nl/x264_main.php
fileurlregex = http://x264.nl/x264/64bit/8bit_depth/revision\n{0,3}[0-9]{4}\n{0,3}/x264\n{0,3}.exe
versionregex = [0-9]{4}
currentversion = 1995

The part it's supposed to match in http://x264.nl/x264_main.php:

<a href="http://x264.nl/x264/64bit/8bit_depth/revision
1995
/x264

.exe"

I was able to make a regex that matches in my code, but it shouldn't:
http://x264.nl/x264/64bit/8bit_depth/revision.\n{1,3}[0-9]{4}.\n{1,3}/x264.\n{1,3}.\n{1,3}.exe
I have to add a dot before each "\n". There is no character not
accounted for before those newlines, but I don't get a match without the
dots. I also need both those ".\n{1,3}" sequences before the ".exe". I'm
really confused.

Using Python 3.2 on Windows, in case it matters.

[1] http://kodos.sourceforge.net/ (using the compiled Win32 version
since it doesn't work with Python 3)

Steven D'Aprano · May 29, 2011

I have an RE that should work (it even works in Kodos [1], but not in my
code), but it keeps failing to match characters after a newline.

Not all regexes are the same. Different regex engines accept different
symbols, and sometimes behave differently, or have different default
behavior. That your regex works in Kodos but not Python might mean you're
writing a Kodos regex instead of a Python regex.

I'm writing a little program that scans the webpage of an arbitrary
application and gets the newest version advertised on the page.

Firstly, most of the code you show is irrelevant to the problem. Please
simplify it to the shortest, most simple example you can give. That would
be a simplified piece of text (not the entire web page!), the regex, and
the failed attempt to use it. The rest of your code is just noise for the
purposes of solving this problem.

Secondly, you probably should use a proper HTML parser, rather than a
regex. Resist the temptation to use regexes to rip out bits of text from
HTML; it almost always goes wrong eventually.

I was able to make a regex that matches in my code, but it shouldn't:
http://x264.nl/x264/64bit/8bit_depth/revision.\n{1,3}[0-9]{4}.\n{1,3}/x264.\n{1,3}.\n{1,3}.exe

What makes you think it shouldn't match?

By the way, you probably should escape the dots, otherwise it will match
strings containing any arbitrary character, rather than *just* dots:
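
A minimal sketch of the difference, using an illustrative string rather than the real page. Only the x264.nl host comes from the thread; the variable names are assumptions made for the example.

```python
import re

# Unescaped: each '.' in the pattern matches any character except '\n'.
unescaped = re.compile('http://x264.nl/')
# Escaped: '\.' matches a literal dot only.
escaped = re.compile(r'http://x264\.nl/')

print(bool(unescaped.match('http://x264Znl/')))  # matches: '.' accepted 'Z'
print(bool(escaped.match('http://x264Znl/')))    # no match
print(bool(escaped.match('http://x264.nl/')))    # matches

# re.escape() can do the escaping for a fixed piece of text.
print(re.escape('x264.nl'))
```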

http://x264Znl ...blah blah blah

Andrew Berg · May 29, 2011

You are aware that most text-emitting processes on Windows, and Internet
text protocols like the HTTP standard, use the two-character “CR LF”
sequence (U+000D U+000A) for terminating lines?

Yes, but I was not having trouble with just '\n' before, and the pattern
did match in Kodos, so I figured Python was doing its newline magic like
it does with the write() method for file objects.
http://x264.nl/x264/64bit/8bit_depth/revision[\r\n]{1,3}[0-9]{4}[\r\n]{1,3}/x264[\r\n]{1,3}.exe
does indeed match. One thing that confuses me, though (and one reason I
dismissed the possibility of it being a newline issue): isn't '.'
supposed to not match '\r'?
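
A minimal sketch of the newline issue, assuming a simplified sample in which every line break is a single CR LF pair (the real page may differ). The patterns are shortened versions of the ones quoted above, and the variable names are made up for the example.

```python
import re

# Assumed sample text: bytes decoded with str(..., encoding='utf-8')
# keep their '\r\n' line endings; no newline translation happens.
page = 'http://x264.nl/x264/64bit/8bit_depth/revision\r\n1995\r\n/x264\r\n.exe'

# '\n{0,3}' cannot consume the '\r' that precedes each '\n'.
lf_only = re.compile(r'revision\n{0,3}[0-9]{4}\n{0,3}/x264\n{0,3}\.exe')
# '[\r\n]{1,3}' accepts either character, so CR LF pairs are covered.
crlf_aware = re.compile(r'revision[\r\n]{1,3}[0-9]{4}[\r\n]{1,3}/x264[\r\n]{1,3}\.exe')

print(lf_only.search(page))              # None
print(crlf_aware.search(page) is not None)
```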

Andrew Berg · May 29, 2011

I have an RE that should work (it even works in Kodos [1], but not in my
code), but it keeps failing to match characters after a newline.

Not all regexes are the same. Different regex engines accept different
symbols, and sometimes behave differently, or have different default
behavior. That your regex works in Kodos but not Python might mean you're
writing a Kodos regex instead of a Python regex.

Kodos is written in Python and uses Python's regex engine. In fact, it
is specifically intended to debug Python regexes.

Firstly, most of the code you show is irrelevant to the problem. Please
simplify it to the shortest, most simple example you can give. That would
be a simplified piece of text (not the entire web page!), the regex, and
the failed attempt to use it. The rest of your code is just noise for the
purposes of solving this problem.

I wasn't sure how much would be relevant since it could've been a
problem with other code. I do apologize for not putting more effort into
trimming it down, though.

Secondly, you probably should use a proper HTML parser, rather than a
regex. Resist the temptation to use regexes to rip out bits of text from
HTML; it almost always goes wrong eventually.

I find this a much simpler approach, especially since I'm dealing with
broken HTML. I guess I don't see how the effort put into learning a
parser and adding the extra code to use it pays off in this particular
endeavor.

I was able to make a regex that matches in my code, but it shouldn't:
http://x264.nl/x264/64bit/8bit_depth/revision.\n{1,3}[0-9]{4}.\n{1,3}/x264.\n{1,3}.\n{1,3}.exe

What makes you think it shouldn't match?

AFAIK, dots aren't supposed to match carriage returns or any other
whitespace characters.

By the way, you probably should escape the dots, otherwise it will match
strings containing any arbitrary character, rather than *just* dots:

You're right; I overlooked the dots in the URL.

Steven D'Aprano · May 29, 2011

On 2011.05.29 08:09 AM, Steven D'Aprano wrote: [...]
Kodos is written in Python and uses Python's regex engine. In fact, it
is specifically intended to debug Python regexes.

Fair enough.

I find this a much simpler approach, especially since I'm dealing with
broken HTML. I guess I don't see how the effort put into learning a
parser and adding the extra code to use it pays off in this particular
endeavor.

The temptation to take short-cuts leads to the Dark Side.

Perhaps you're right, in this instance. But if you need to deal with
broken HTML, try BeautifulSoup.

AFAIK, dots aren't supposed to match carriage returns or any other
whitespace characters.

They won't match *newlines* \n unless you pass the DOTALL flag, but they
do match whitespace:
True
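
A minimal sketch of that behaviour. The test strings are illustrative assumptions, not the original interactive session.

```python
import re

# Which single characters does '.' accept between 'a' and 'b'?
samples = {'space': ' ', 'tab': '\t', 'carriage return': '\r', 'line feed': '\n'}
results = {name: bool(re.search('a.b', 'a' + ch + 'b'))
           for name, ch in samples.items()}

for name, matched in results.items():
    print(name, matched)

# With re.DOTALL the dot also matches '\n'.
with_dotall = bool(re.search('a.b', 'a\nb', re.DOTALL))
print('line feed with DOTALL', with_dotall)
```

Without flags, only the line feed is rejected; the carriage return is treated like any other character.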

Andrew Berg · May 29, 2011

They won't match *newlines* \n unless you pass the DOTALL flag, but they
do match whitespace:

True

I got things mixed up there (was thinking whitespace instead of
newlines), but I thought dots aren't supposed to match '\r' (carriage
return). Why is '\r' not considered a newline character?

Roy Smith · May 29, 2011

Andrew Berg said:
Kodos is written in Python and uses Python's regex engine. In fact, it
is specifically intended to debug Python regexes.

Named after the governor of Tarsus IV?

Andrew Berg · May 29, 2011

Named after the governor of Tarsus IV?

Judging by the graphic at http://kodos.sourceforge.net/help/kodos.html ,
it's named after the Simpsons character.

John S · May 29, 2011

On 2011.05.29 09:18 AM, Steven D'Aprano wrote:
What makes you think it shouldn't match?

I got things mixed up there (was thinking whitespace instead of
newlines), but I thought dots aren't supposed to match '\r' (carriage
return). Why is '\r' not considered a newline character?

Dots don't match end-of-line-for-your-current-OS is how I think of
it.

While I almost always nod my head at Steven D'Aprano's comments, in
this case I have to say that if you just want to grab something from a
chunk of HTML, full-blown HTML parsers are overkill. True, malformed
HTML can throw you off, but they can also throw a parser off.

I could not make your regex work on my Linux box with Python 2.6.

In your case, and because x264 might change their HTML, I suggest the
following code, which works great on my system. YMMV. I changed your
newline matches to use \s and put some capturing parentheses around
the date, so you could grab it.
.... print m.group(1)
....
1995

\s is your friend -- matches space, tab, newline, or carriage return.
\s* says match 0 or more spaces, which is what's needed here in case
the web site decides to *not* put whitespace in the middle of a URL...
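
A minimal sketch of this approach in Python 3 syntax. The sample text is an assumption reconstructed from the anchor quoted earlier in the thread, with CR LF line endings assumed; it is not the original session.

```python
import re

# Assumed sample of the relevant part of the page.
html = ('<a href="http://x264.nl/x264/64bit/8bit_depth/revision\r\n'
        '1995\r\n/x264\r\n\r\n.exe"')

# \s* tolerates any run of whitespace (including none) at each break;
# the parentheses capture the four-digit revision; '\.' is a literal dot.
pattern = re.compile(
    r'http://x264\.nl/x264/64bit/8bit_depth/revision\s*([0-9]{4})\s*/x264\s*\.exe')

m = pattern.search(html)
if m:
    print(m.group(1))  # the captured revision number
```

Here `m.group(0)` would still contain the embedded whitespace, so it needs cleaning before being used as a download URL.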

As Steven said, when you want to match a dot, it needs to be escaped,
although it will work by accident much of the time. Also, be sure to
use a raw string when composing REs, so you don't run into backslash
issues.

HTH,
John Strickler

Andrew Berg · May 29, 2011

Dots don't match end-of-line-for-your-current-OS is how I think of
it.

IMO, the docs should say the dot matches any character except a line
feed ('\n'), since that is more accurate.

True, malformed
HTML can throw you off, but they can also throw a parser off.

That was part of my point. html.parser.HTMLParser from the standard
library will definitely not work on x264.nl's broken HTML, and fixing it
requires lxml (I'm working with Python 3; I've looked into
BeautifulSoup, and it does not work with Python 3 at all). Admittedly,
fixing x264.nl's HTML only requires one or two lines of code, but really
nasty HTML might require quite a bit of work.

In your case, and because x264 might change their HTML, I suggest the
following code, which works great on my system. YMMV. I changed your
newline matches to use \s and put some capturing parentheses around
the date, so you could grab it.

I've been meaning to learn how to use parenthesis groups.

Also, be sure to
use a raw string when composing REs, so you don't run into backslash
issues.

How would I do that when grabbing strings from a config file (via the
configparser module)? Or rather, if I have a predefined variable
containing a string, how do I change it into a raw string?

John S · May 29, 2011

I've been meaning to learn how to use parenthesis groups.

How would I do that when grabbing strings from a config file (via the
configparser module)? Or rather, if I have a predefined variable
containing a string, how do I change it into a raw string?

When reading the RE from a file it's not an issue. Only literal
strings can be raw. If the data is in a file, the data will not be
parsed by the Python interpreter. This was just a general warning to
anyone working with REs. It didn't apply in this case.
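
A minimal sketch of the distinction. The section name `demo` and the shortened pattern are assumptions for illustration, and `interpolation=None` is used only so that configparser leaves the value untouched.

```python
import configparser
import re

# In source code the r prefix changes how the literal is parsed:
print(len(r'\n'), len('\n'))  # backslash + 'n' versus one line feed

# Text read from a config file never goes through the Python parser,
# so the backslashes arrive exactly as written in the file.
ini_text = r'''
[demo]
versionregex = revision\s*([0-9]{4})
'''
config = configparser.ConfigParser(interpolation=None)
config.read_string(ini_text)

pattern_text = config['demo']['versionregex']
print(pattern_text)

m = re.search(pattern_text, 'revision\r\n1995')
print(m.group(1))
```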

--john strickler

Thomas 'PointedEars' Lahn · May 29, 2011

Andrew said:
Judging by the graphic at http://kodos.sourceforge.net/help/kodos.html ,
it's named after the Simpsons character.

<OT>

I don't think that's a coincidence; both are from other planets and both are
rather evil[tm]. Kodos the Executioner, arguably human, became a dictator
who had thousands killed (by his own account, not to let the rest die of
hunger); Kodos the slimy extra-terrestrial is a conqueror (and he likes to
zap humans as well ;-))

[BTW, Tarsus IV, a planet where thousands (would) have died of hunger and
have died in executions was probably yet another hidden Star Trek euphemism.
I have found out that Tarsus is, among other things, the name of a
collection of bones in the human foot next to the heel. Bones as a
reference to death aside, see also Achilles for the heel. But I'm only
speculating here.]

</OT>
