Michel Perez
Hi:
I'm so new to Python that I haven't got the right idea about regular
expressions yet. This is what I want to do:
Use Python to extract some information and then replace those expressions
with others. I use wikitext as a base, and this is what I have so far:
<code file="parse.py">
paragraphs = """
= Test '''wikitest'''=
[[Image:image_link.jpg|rigth|thumbnail|200px|"PREMIER"]]

[http://www.google.com.cu]
::''Note: This is just an example to test some regular expressions stuffs.''

The ''wikitext'' is a text format that helps a lot. In concept is a
simple [[markup]] [[programming_language|language]]. That helps to make
simple create documentations texts.

==Wikitext==

Created by Warn as a ...

<nowiki>[</nowiki> this is a normal <nowiki>sign]</nowiki>
""".split('\n\n')

import re

wikipatterns = {
    'a_nowiki': re.compile(r"<nowiki>(.\S+)</nowiki>"),  # nowiki
    'section': re.compile(r"\=(.*)\="),                  # section one tags
    'sectiontwo': re.compile(r"\=\=(.*?)\=\="),          # section two tags
    'wikilink': re.compile(r"\[\[(.*?)\]\]"),            # links tags
    'link': re.compile(r"\[(.*?)\]"),                    # external links tags
    'italic': re.compile(r"\'\'(.*?)\'\'"),              # italic text tags
    'bold': re.compile(r"\'\'\'(.*?)\'\'\'"),            # bold text tags
}

for pattern in wikipatterns:
    print "===> processing pattern :", pattern, "<=============="
    for paragraph in paragraphs:
        print wikipatterns[pattern].findall(paragraph)
</code>
But when I run it, the result is not what I want. It looks something like this:
<code>
michel@cerebellum:/local/python$python parser.py
===> processing pattern : bold <==============
['braille']
[]
[]
[]
[]
[]
===> processing pattern : section <==============
[" Test '''wikitest'''"]
[]
[]
['=Wikitext=']
[]
[]
===> processing pattern : sectiontwo <==============
[]
[]
[]
['Wikitext']
[]
[]
===> processing pattern : link <==============
['[Image:image_link.jpg|rigth|thumbnail|200px|"PREMIER"']
['http://www.google.com.cu']
['[markup', '[programming_language|language']
[]
[]
['</nowiki> this is a normal <nowiki>sign']
===> processing pattern : italic <==============
["'wikitest"]
['Note: This is just an example to test some regular expressions stuffs.']
['wikitext']
[]
[]
[]
===> processing pattern : wikilink <==============
['Image:image_link.jpg|rigth|thumbnail|200px|"PREMIER"']
[]
['markup', 'programming_language|language']
[]
[]
[]
===> processing pattern : a_nowiki <==============
[]
[]
[]
[]
[]
['sign]']
</code>
In the first case the result is OK.
In the second, the first result is OK, but the second is not, because it
is a level-two section, not a level-one section.
In the third, the results are OK.
In the fourth, the first and third results are wrong because they are
level-two links, but the second is OK.
The fifth is OK.
The sixth shows only one result and it should show two.
Please help.
PS: I'm really sorry about my technical English.
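One possible direction for the problems described above (section levels, link levels, and the missing nowiki match) is to use lookarounds and a non-greedy nowiki group. This is only a sketch under stated assumptions: the constant names and the `extract` helper are hypothetical and not part of parse.py, section headings are assumed to occupy a whole line, and nowiki spans are assumed to be removed before looking for links.

```python
import re

# Hypothetical names; none of these come from parse.py.
# Non-greedy group, so a one-character body such as "[" also matches.
NOWIKI = re.compile(r"<nowiki>(.*?)</nowiki>")
# Exactly one "=" on each side: (?!=) and (?<!=) reject "==".
SECTION_ONE = re.compile(r"^=(?!=)[ \t]*(.*?)[ \t]*(?<!=)=$", re.MULTILINE)
# Exactly two "=" on each side.
SECTION_TWO = re.compile(r"^==(?!=)[ \t]*(.*?)[ \t]*(?<!=)==$", re.MULTILINE)
WIKILINK = re.compile(r"\[\[(.*?)\]\]")
# A single "[" that is neither preceded nor followed by another "[".
EXTERNAL_LINK = re.compile(r"(?<!\[)\[(?!\[)([^\[\]]*)\](?!\])")


def extract(text):
    """Return the matches of each pattern in text, keyed by a label."""
    found = {"nowiki": NOWIKI.findall(text)}
    # Drop <nowiki> spans so their brackets are not read as links.
    plain = NOWIKI.sub("", text)
    found["section_one"] = SECTION_ONE.findall(plain)
    found["section_two"] = SECTION_TWO.findall(plain)
    found["wikilink"] = WIKILINK.findall(plain)
    found["external_link"] = EXTERNAL_LINK.findall(plain)
    return found


sample = "\n".join([
    "= Test '''wikitest'''=",
    '[[Image:image_link.jpg|rigth|thumbnail|200px|"PREMIER"]]',
    "[http://www.google.com.cu]",
    "simple [[markup]] [[programming_language|language]].",
    "==Wikitext==",
    "<nowiki>[</nowiki> this is a normal <nowiki>sign]</nowiki>",
])

found = extract(sample)
for name in sorted(found):
    print("%s: %r" % (name, found[name]))
```

On this sample the nowiki pattern returns both bodies (`[` and `sign]`), `==Wikitext==` is reported only as a level-two section, and the double-bracket links no longer show up among the external links.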