Replacing and inserting strings within .txt files with the use of regex

Νίκος

Two problems here:

str.replace doesn't use regular expressions. You'll have to use the re
module to use regexps. (the re.sub function to be precise)

'.'  matches a single character. Any character, but only one.
'.*' matches as many characters as possible. This is not what you want,
since it will match everything between the *first* <? and the *last* ?>.
You want non-greedy matching.

'.*?' is the same thing, without the greed.
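To make the greedy versus non-greedy difference concrete, here is a minimal sketch. The sample string is made up for illustration, and the question marks of the PHP tags are escaped with a backslash because '?' is itself a regex metacharacter.

```python
import re

# Hypothetical sample: two short PHP blocks on a single line.
sample = 'A<?php one(); ?>B<?php two(); ?>C'

# Greedy: '.*' runs from the first '<?' all the way to the last '?>'.
greedy = re.sub(r'<\?.*\?>', '', sample)

# Non-greedy: '.*?' stops at the nearest '?>', so each block is removed separately.
non_greedy = re.sub(r'<\?.*?\?>', '', sample)

print(greedy)      # AC
print(non_greedy)  # ABC
```

The greedy version also swallows the 'B' that sits between the two blocks, which is the behaviour the advice above warns about.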

Thank you.

So I guess this needs to be written as:

src_data = re.sub( '<?(.*?)?>', '', src_data )

The 'r' prefix doesn't need to be put before the regex here, since the
regex doesn't contain backslashes.
You will have to find the </body> tag before inserting the string.
str.find should help -- or you could use str.replace and replace the
</body> tag with your counter line, plus a new </body>.
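A minimal sketch of the str.replace approach, assuming a made-up page and a simplified counter line (both names are illustrative, not taken from the real project):

```python
# Hypothetical page content and counter line.
page = '<html><body><p>Hello</p></body></html>'
counter_line = '<h4>Visitors: %(counter)d</h4>'

# Replace the closing tag with the counter line followed by a new closing tag.
page = page.replace('</body>', counter_line + '</body>')

# The %(counter)d placeholder can be filled in later with %-formatting.
rendered = page % {'counter': 42}
print(rendered)
```

One caveat with filling the placeholder this way: any other literal '%' in the page would have to be doubled to '%%' first, otherwise the formatting step raises an error.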

Ah yes! Damn, why didn't I think of it... str.replace should do the
trick. I was stuck trying to figure out regexes.

So, I guess that should work:

src_data = src_data.replace('</body>', '<br><br><h4><font
color=green> Αριθμός Επισκεπτών: %(counter)d said:
No it's not. You're just giving up too soon.

Yes, you are right. Your hints keep me going, and thank you for that.
 
Νίκος

Now the code looks as follows:

=============================
#!/usr/bin/python

import re, os, sys


id = 0 # unique page_id

for currdir, files, dirs in os.walk('test'):

    for f in files:

        if f.endswith('php'):

            # get abs path to filename
            src_f = join(currdir, f)

            # open php src file
            print ( 'reading from %s' % src_f )
            f = open(src_f, 'r')
            src_data = f.read() # read contents of PHP file
            f.close()

            # replace tags
            print ( 'replacing php tags and contents within' )
            src_data = re.sub( '<?(.*?)?>', '', src_data )

            # add ID
            print ( 'adding unique page_id' )
            src_data = ( '<!-- %d -->' % id ) + src_data
            id += 1

            # add template variables
            print ( 'adding counter template variable' )
            src_data = src_data.replace('</body>', '<br><br><center><h4><font color=green> Αριθμός Επισκεπτών: %(counter)d </body>' )

            # rename old php file to new with .html extension
            src_file = src_file.replace('.php', '.html')

            # open newly created html file for inserting data
            print ( 'writing to %s' % dest_f )
            dest_f = open(src_f, 'w')
            dest_f.write(src_data) # write contents
            dest_f.close()

I just tried to test it. I created a folder named 'test' on my 'd:\'
drive.
Then I put two .php files from the original project inside it, to check
that it works for those two files before running it on the whole copy and
afterwards on the original project.

So I opened a command prompt on my Win7 machine and tried:

D:\>convert.py

D:\>

It just printed an empty line and nothing else. Why didn't it even try to
open the folder and the files within?
It doesn't give me a syntax error!
Something with the os.walk() method perhaps?
 
Peter Otten

If there is a folder D:\test and it does contain some PHP files (double-
check!) the extension could be upper-case. Try

if f.lower().endswith("php"): ...

or

php_files = fnmatch.filter(files, "*.php")
for f in php_files: ...

Peter
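For reference, a minimal sketch of walking a tree and filtering for PHP files. One thing worth checking against it: os.walk yields (dirpath, dirnames, filenames) in that order, so in the script posted earlier, which unpacks the tuple as `currdir, files, dirs`, the name `files` would actually hold the subdirectory names. That may be why nothing was printed. The helper name `find_php_files` and the temporary demo folder are made up for illustration.

```python
import fnmatch
import os
import tempfile


def find_php_files(root):
    """Return the paths of all *.php files below root, ignoring case."""
    found = []
    # os.walk yields (dirpath, dirnames, filenames), in that order.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            if fnmatch.fnmatch(name.lower(), '*.php'):
                found.append(os.path.join(dirpath, name))
    return sorted(found)


# Small self-contained demo in a temporary folder.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, 'sub'))
    for rel in ('index.php', 'notes.txt', os.path.join('sub', 'PAGE.PHP')):
        with open(os.path.join(root, rel), 'w'):
            pass
    demo = [os.path.relpath(p, root) for p in find_php_files(root)]

print(demo)
```

Note also that a relative path such as 'test' is resolved against the current working directory, so the script has to be started from the folder that contains 'test'.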
 
Νίκος

If there is a folder D:\test and it does contain some PHP files (double-
check!) the extension could be upper-case. Try

if f.lower().endswith("php"): ...

or

php_files = fnmatch.filter(files, "*.php")
for f in php_files: ...

Peter

The extension is in lower case. The folder is there, the php files are
there; I don't know why it doesn't want to go into d:\test to find
them.

That's one problem.

The other one is:

I made the code simpler by specifying the filename myself.

=========================
# get abs path to filename
src_f = 'd:\\test\\index.php'

# open php src file
print ( 'reading from %s' % src_f )
f = open(src_f, 'r')
src_data = f.read() # read contents of PHP file
f.close()
=========================

But although it now finds the file, I get this error in the command prompt:

D:\>aconvert.py
reading from d:\test\index.php
Traceback (most recent call last):
  File "D:\aconvert.py", line 16, in <module>
    src_data = f.read() # read contents of PHP file
  File "C:\Python32\lib\encodings\cp1253.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9f in position 321: character maps to <undefined>

Something with the damn encodings again!!
 
Peter Otten

Hmm, at one point in this thread you switched from Python 2.x to Python 3.2.
There are a lot of subtle and not so subtle differences between 2.x and 3.x,
and I recommend that you stick to one while you are still in newbie mode.

If you want to continue to use 3.x I recommend that you at least use the
stable 3.1 version.

Now one change from Python 2 to 3 is that open(filename, "r") gives you a
beast that is unicode-aware and assumes that the file uses your system's
default encoding (taken from the locale, so not necessarily utf-8) unless you
tell it otherwise with open(..., encoding=whatever). So what is the charset
used for your index.php?

Peter
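A minimal sketch of passing the encoding explicitly. In the Python 3 releases of that era, open() in text mode falls back to locale.getpreferredencoding(False) when no encoding is given, which on a Greek Windows system would typically be cp1253 and would match the cp1253.py frame in the traceback. Byte 0x9f is also what you would see in a UTF-8 file containing a capital omicron (encoded as CE 9F), so a UTF-8 file read as cp1253 is a plausible explanation, though that is only a guess from the traceback. The temporary file below is made up for the demo.

```python
import locale
import os
import tempfile

# The encoding open() falls back to in text mode when none is given.
print(locale.getpreferredencoding(False))

# Hypothetical file containing Greek text, saved as UTF-8.
path = os.path.join(tempfile.mkdtemp(), 'index.php')
with open(path, 'w', encoding='utf-8') as f:
    f.write('<p>Αριθμός Επισκεπτών</p>')

# Naming the encoding makes the read independent of the system locale.
with open(path, 'r', encoding='utf-8') as f:
    src_data = f.read()

# ascii() avoids console encoding problems when printing non-ASCII text.
print(ascii(src_data))
```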
 
Νίκος


Yes, yesterday I switched to Python 3.2, Peter.

When I open index.php in Notepad++ it says it's in utf-8 without
BOM, and besides English characters it also contains Greek characters
for display.

The file was made by my client in Dreamweaver.

So since it's utf-8, what's the problem with opening it?
 
Peter Otten

Python says it's not, and I tend to believe it. You can open the file with

open(..., errors="replace")

but you will lose data (which is already garbled, anyway).

Again: in the unlikely case that Python is causing your problem -- you do
understand what an alpha version is?

Peter
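To illustrate what errors="replace" does, here is a small sketch under the assumption that the file holds UTF-8 bytes but is opened as cp1253. The file and its contents are made up for the demo.

```python
import os
import tempfile

# Hypothetical file holding UTF-8 encoded Greek text.
path = os.path.join(tempfile.mkdtemp(), 'index.php')
with open(path, 'wb') as f:
    f.write('Αριθμός'.encode('utf-8'))

# Strict decoding with the wrong codec fails on bytes cp1253 leaves undefined.
try:
    with open(path, encoding='cp1253') as f:
        f.read()
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# errors='replace' substitutes U+FFFD for each undecodable byte.
with open(path, encoding='cp1253', errors='replace') as f:
    lossy = f.read()

print(strict_ok)
print(ascii(lossy))
```

The read succeeds, but the text that comes back is not the original Greek, which is the data loss mentioned above.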
 
Νίκος

Python says it's not, and I tend to believe it.

You are right!

I tried the exact same open call in the IDLE environment and got
the encoding of the file from there!
<_io.TextIOWrapper name='d:\\test\\index.php' encoding='cp1253'>

That's why the error in my previous post said
File "C:\Python32\lib\encodings\cp1253.py", line 23, in decode
It tried to use the cp1253 encoding.

But now, since Python as we see can understand the nature of the
encoding, what is stopping it from opening the file?
 
Peter Otten

It doesn't. You have to tell it. *If* the file uses cp1253 you can open it with

open(..., encoding="cp1253")

Note that if the file is not in cp1253 python will still happily open it as
long as it doesn't contain the following bytes:

>>> for i in range(256):
...     try: chr(i).decode("cp1253") and None
...     except: print i
...
129
136
138
140
141
142
143
144
152
154
156
157
158
159
170
210
255

Peter
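The snippet above is Python 2 syntax. A rough Python 3 equivalent, as a sketch, collects the byte values that the cp1253 codec cannot decode:

```python
# Collect every single-byte value that cp1253 leaves undefined.
undefined = []
for i in range(256):
    try:
        bytes([i]).decode('cp1253')
    except UnicodeDecodeError:
        undefined.append(i)

print(undefined)
```

The byte from the traceback, 0x9f, is 159 in decimal, which appears in the list above.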
 
MRAB

Νίκος said:
Thank you.

So I guess this needs to be written as:

src_data = re.sub( '<?(.*?)?>', '', src_data )

In a regex '?' is a special character, so if you want a literal '?' you
need to escape it. Therefore:

    src_data = re.sub(r'<\?(.*?)\?>', '', src_data)
 
Νίκος

It doesn't. You have to tell.

Why doesn't it? The IDLE output indicates that it knows the file's
encoding is "cp1253", which means it can identify it.

*If* the file uses cp1253 you can open it with
open(..., encoding="cp1253")

Note that if the file is not in cp1253 python will still happily open it as
long as it doesn't contain the following bytes:


>>> for i in range(256):
...     try: chr(i).decode("cp1253") and None
...     except: print i
...
129
136
138
140
141
142
143
144
152
154
156
157
158
159
170
210
255

Peter

I'm afraid it does, because when I tried:

f = open(src_f, 'r', encoding="cp1253" )

I got the same error again... What are those characters? Don't they
belong to the same weird 'cp1253' encoding too? Why can't Python open
them?
 
Νίκος

In a regex '?' is a special character, so if you want a literal '?' you
need to escape it. Therefore:

     src_data = re.sub(r'<\?(.*?)\?>', '', src_data)

I see, or perhaps even this:

   src_data = re.sub(r'<?(.*?)?>', '', src_data)

maybe it works here as well.
 
MRAB

Νίκος said:
I see, or perhaps even this:

src_data = re.sub(r'<?(.*?)?>', '', src_data)

maybe it works here as well.

No. That regex means that it should match:

<? # optional '<'
(.*?)? # optional group of any number of any characters
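A small sketch of what the two patterns do in practice, using a made-up one-line page. With the question marks unescaped, every stretch of text up to the next '>' matches, so the HTML tags and the text between them are deleted along with the PHP block.

```python
import re

# Hypothetical page: HTML around one PHP block.
sample = '<p>Hi</p><?php echo 1; ?><p>Bye</p>'

# Unescaped: '?' makes the preceding item optional instead of matching '?'.
unescaped = re.sub('<?(.*?)?>', '', sample)

# Escaped: only the '<? ... ?>' block matches.
escaped = re.sub(r'<\?(.*?)\?>', '', sample)

print(repr(unescaped))  # ''
print(repr(escaped))    # '<p>Hi</p><p>Bye</p>'
```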
 
Νίκος

Please tell me that no matter what weird characters the files have inside,
I can still open those files and make the necessary replacements.
 
Peter Otten

Νίκος said:
Please tell me that no matter what weird characters the files have inside,
I can still open those files and make the necessary replacements.

Go back to 2.6 for the moment and defer learning about unicode until you're
done with the conversion job.
 
Νίκος

Go back to 2.6 for the moment and defer learning about unicode until you're
done with the conversion job.

You are correct again! 3.2 caused the problem. I switched to 2.7 and
now I don't have that problem anymore. The file opens okay!

It ALMOST converts correctly!

# replace tags
print ( 'replacing php tags and contents within' )
src_data = re.sub( '<\?(.*?)\?>', '', src_data )

It only converts the first instance of the php tags and not the rest.
But why?
 
Thomas Jollans

http://docs.python.org/library/re.html#re.S

You probably need to pass the re.DOTALL flag.
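A minimal sketch of the effect of re.DOTALL (also spelled re.S), using a made-up sample in which one PHP block spans several lines. The flag is passed through the `flags` keyword of re.sub, because the fourth positional argument of re.sub is `count`, not `flags`.

```python
import re

# Hypothetical sample: the first PHP block spans three lines, the second one line.
sample = 'A<?php\n  echo 1;\n?>B<?php echo 2; ?>C'
pattern = r'<\?.*?\?>'

# Without the flag '.' does not match a newline, so the first block survives.
without_flag = re.sub(pattern, '', sample)

# With re.DOTALL '.' matches newlines too, so both blocks are removed.
with_flag = re.sub(pattern, '', sample, flags=re.DOTALL)

print(repr(without_flag))  # 'A<?php\n  echo 1;\n?>BC'
print(repr(with_flag))     # 'ABC'
```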
 
Νίκος

When replacing text in an HTML document with re.sub, you want to use
the re.S (singleline) option; otherwise your pattern won't match when
the opening tag is on one line and the closing is on another.

That's exactly the problem I am facing now with this statement.

src_data = re.sub( '<\?(.*?)\?>', '', src_data )

You mean I have to change it like this?

src_data = re.S ( '<\?(.*?)\?>', '', src_data ) ?
 
Νίκος

Now the code looks as follows:

=============================
#!/usr/bin/python

import re, os, sys

id = 0  # unique page_id

for currdir, files, dirs in os.walk('test'):

        for f in files:

                if f.endswith('php'):

                        # get abs path to filename
                        src_f = join(currdir, f)

                        # open php src file
                        print ( 'reading from %s' % src_f )
                        f = open(src_f, 'r')
                        src_data = f.read()             # read contents of PHP file
                        f.close()

                        # replace tags
                        print ( 'replacing php tags and contents within' )
                        src_data = re.sub( '<?(.*?)?>', '', src_data )

                        # add ID
                        print ( 'adding unique page_id' )
                        src_data = ( '<!-- %d -->' % id ) + src_data
                        id += 1

                        # add template variables
                        print ( 'adding counter template variable' )
                        src_data = src_data.replace('</body>', '<br><br><center><h4><font color=green> Αριθμός Επισκεπτών: %(counter)d </body>' )

                        # rename old php file to new with .html extension
                        src_file = src_file.replace('.php', '.html')

                        # open newly created html file for inserting data
                        print ( 'writing to %s' % dest_f )
                        dest_f = open(src_f, 'w')
                        dest_f.write(src_data)          # write contents
                        dest_f.close()

I just tried to test it. I created a folder named 'test' on my 'd:\'
drive.
Then I put two .php files from the original project inside it, to check
that it works for those two files before running it on the whole copy and
afterwards on the original project.

So I opened a command prompt on my Win7 machine and tried:

D:\>convert.py

D:\>

It just printed an empty line and nothing else. Why didn't it even try to
open the folder and the files within?
It doesn't give me a syntax error!
Something with the os.walk() method perhaps?

Can you help with this too, please?

Now I am able to convert just a single file, 'd:\test\index.php'.

But this needs to be done for ALL the php files in every subfolder.
for currdir, files, dirs in os.walk('test'):

        for f in files:

                if f.endswith('php'):

Should the above lines enter the folders and find the php files in each
folder so that they can be edited?
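For what it is worth, here is a hedged sketch of how the pieces discussed in this thread could fit together for a whole tree. It is written for Python 3 and assumes the files are UTF-8. The names `convert_text` and `convert_tree` are made up for illustration, and it writes a new .html file next to each .php file instead of renaming, which is a choice of this sketch and not something settled in the thread.

```python
import os
import re

# Escaped '?' plus DOTALL, so blocks spanning several lines are matched too.
PHP_BLOCK = re.compile(r'<\?.*?\?>', re.DOTALL)

# Counter markup taken from the script posted above.
COUNTER_LINE = ('<br><br><center><h4><font color=green> '
                'Αριθμός Επισκεπτών: %(counter)d ')


def convert_text(src_data, page_id):
    """Strip PHP blocks, prepend the page id, add the counter before </body>."""
    src_data = PHP_BLOCK.sub('', src_data)
    src_data = ('<!-- %d -->' % page_id) + src_data
    return src_data.replace('</body>', COUNTER_LINE + '</body>')


def convert_tree(root, encoding='utf-8'):
    """Convert every .php file below root; return the number of files written."""
    page_id = 0
    # os.walk yields (dirpath, dirnames, filenames) and descends into subfolders.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.lower().endswith('.php'):
                continue
            src_path = os.path.join(dirpath, name)
            with open(src_path, 'r', encoding=encoding) as f:
                src_data = f.read()
            dest_path = os.path.splitext(src_path)[0] + '.html'
            with open(dest_path, 'w', encoding=encoding) as f:
                f.write(convert_text(src_data, page_id))
            page_id += 1
    return page_id


# Quick check of the text transformation on a made-up snippet.
print(ascii(convert_text('<?php x(); ?><body>hi</body>', 7)))
```

Calling `convert_tree('test')` from the folder that contains 'test' would then process every subfolder. Trying it on a copy of the project first, as planned above, seems wise.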
 
