extract certain values from file with re

Fabian Braennstroem · Oct 6, 2006

Hi,

I would like to remove certain lines from a log file. I had
some sed/awk scripts for this, but now, I want to use python
with its re module for this task.

Actually, I have two different log files. The first file looks
like:

...
'some text'
...

ITER I----------------- GLOBAL ABSOLUTE RESIDUAL -----------------I I------------ FIELD VALUES AT MONITORING LOCATION ----------I
NO UMOM VMOM WMOM MASS T EN DISS ENTH U V W P TE ED T
1 9.70E-02 8.61E-02 9.85E-02 1.00E+00 1.61E+01 7.65E+04 0.00E+00 1.04E-01-8.61E-04 3.49E-02 1.38E-03 7.51E-05 1.63E-05 2.00E+01
2 3.71E-02 3.07E-02 3.57E-02 1.00E+00 3.58E-01 6.55E-01 0.00E+00 1.08E-01-1.96E-03 4.98E-02 7.11E-04 1.70E-04 4.52E-05 2.00E+01
3 2.64E-02 1.99E-02 2.40E-02 1.00E+00 1.85E-01 3.75E-01 0.00E+00 1.17E-01-3.27E-03 6.07E-02 4.02E-04 4.15E-04 1.38E-04 2.00E+01
4 2.18E-02 1.52E-02 1.92E-02 1.00E+00 1.21E-01 2.53E-01 0.00E+00 1.23E-01-4.85E-03 6.77E-02 1.96E-05 9.01E-04 3.88E-04 2.00E+01
5 1.91E-02 1.27E-02 1.70E-02 1.00E+00 8.99E-02 1.82E-01 0.00E+00 1.42E-01-6.61E-03 7.65E-02 1.78E-04 1.70E-03 9.36E-04 2.00E+01
...
...
...

2997 3.77E-04 2.89E-04 3.05E-04 2.71E-02 5.66E-04 6.28E-04 0.00E+00 -3.02E-01 3.56E-02-7.97E-02-7.11E-02 4.08E-02 1.86E-01 2.00E+01
2998 3.77E-04 2.89E-04 3.05E-04 2.71E-02 5.65E-04 6.26E-04 0.00E+00 -3.02E-01 3.63E-02-8.01E-02-7.10E-02 4.02E-02 1.83E-01 2.00E+01
2999 3.76E-04 2.89E-04 3.05E-04 2.70E-02 5.64E-04 6.26E-04 0.00E+00 -3.02E-01 3.69E-02-8.04E-02-7.10E-02 3.96E-02 1.81E-01 2.00E+01
3000 3.78E-04 2.91E-04 3.07E-04 2.74E-02 5.64E-04 6.26E-04 0.00E+00 -3.01E-01 3.75E-02-8.07E-02-7.09E-02 3.91E-02 1.78E-01 2.00E+01
&&&&&& -------------------------------------------------------------- ----

....
'some text'
....

I actually want to extract the lines with the numbers, write
them to a file and finally use gnuplot for plotting them. A
nicer and more Pythonic way would be to extract those numbers,
write them into an array according to their column and plot
those using the gnuplot or matplotlib module.

Unfortunately, I am pretty new to the re module and tried
the following so far:

import re
pat = re.compile('\ \ \ NO.*?&&&&&&', re.DOTALL)
print re.sub(pat, '', open('log_star_orig').read())

but this works just the other way around, which means that
the original log file is printed without the number part. So
the next step would be to delete the part from the first
line to '\ \ \ \ NO' and the part from '&&&&&&' to the end,
but I do not know how to address the first and last line!?

It would be nice if you could give me a hint. I would be
especially interested in any idea of how I can put those
columns in arrays, so I can plot them right away!

A more difficult log file looks like:

======================================================================
OUTER LOOP ITERATION = 1 CPU SECONDS = 2.40E+01
----------------------------------------------------------------------
| Equation | Rate | RMS Res | Max Res | Linear Solution |
+----------------------+------+---------+---------+------------------+
| U-Mom | 0.00 | 1.0E-02 | 5.0E-01 | 4.9E-03 OK|
| V-Mom | 0.00 | 2.4E-14 | 5.6E-13 | 3.8E+09 ok|
| W-Mom | 0.00 | 2.5E-14 | 8.2E-13 | 8.3E+09 ok|
| P-Mass | 0.00 | 1.1E-02 | 3.4E-01 | 8.9 2.7E-02 OK|
+----------------------+------+---------+---------+------------------+
| K-TurbKE | 0.00 | 1.8E+00 | 1.8E+00 | 5.8 2.2E-08 OK|
| E-Diss.K | 0.00 | 1.9E+00 | 2.0E+00 | 12.4 2.2E-08 OK|
+----------------------+------+---------+---------+------------------+

======================================================================
OUTER LOOP ITERATION = 2 CPU SECONDS = 8.57E+01
----------------------------------------------------------------------
| Equation | Rate | RMS Res | Max Res | Linear Solution |
+----------------------+------+---------+---------+------------------+
| U-Mom | 1.44 | 1.5E-02 | 5.3E-01 | 9.6E-03 OK|
| V-Mom |99.99 | 1.1E-03 | 6.2E-02 | 5.7E-02 OK|
| W-Mom |99.99 | 1.9E-03 | 6.0E-02 | 5.9E-02 OK|
| P-Mass | 0.27 | 3.0E-03 | 2.0E-01 | 8.9 7.9E-02 OK|
+----------------------+------+---------+---------+------------------+
| K-TurbKE | 0.03 | 5.4E-02 | 4.4E-01 | 5.8 2.9E-08 OK|
| E-Diss.K | 0.05 | 8.9E-02 | 9.3E-01 | 12.4 2.6E-08 OK|
+----------------------+------+---------+---------+------------------+

....
....
....

======================================================================
OUTER LOOP ITERATION = 416 CPU SECONDS = 2.28E+04
----------------------------------------------------------------------
| Equation | Rate | RMS Res | Max Res | Linear Solution |
+----------------------+------+---------+---------+------------------+
| U-Mom | 0.96 | 1.8E-04 | 5.8E-03 | 1.8E-02 OK|
| V-Mom | 0.98 | 3.6E-05 | 1.5E-03 | 4.4E-02 OK|
| W-Mom | 0.99 | 4.5E-05 | 2.1E-03 | 4.3E-02 OK|
| P-Mass | 0.96 | 8.3E-06 | 3.0E-04 | 12.9 4.0E-02 OK|
+----------------------+------+---------+---------+------------------+
| K-TurbKE | 0.98 | 1.5E-03 | 3.0E-02 | 5.7 2.5E-06 OK|
| E-Diss.K | 0.97 | 4.2E-04 | 1.1E-02 | 12.3 3.9E-08 OK|
+----------------------+------+---------+---------+------------------+

With my sed/awk/grep/gnuplot script I would extract the
values in the 'U-Mom' row using grep and print a certain
column (e.g. 'Max Res') to a file and print it with gnuplot.
Maybe I have to remove those '|' using sed before...
Do you have an idea how I can do this entirely in
Python?

Thanks for your help!

Greetings!
Fabian

Bernard · Oct 6, 2006

Hi Fabian,
I'm still a youngster in Python but I think I can help with the
"extracting data from the log file" part. As I'm seeing it right now,
the only character separating the numbers below is the space character.
You could try splitting all the lines by that character starting from
the NO Column. The starting point of the split function could easily be
defined by regexes. Using this regex: \s+\d+\s{1,2}[\d\w\.-]*\s+ ... I
was able to extract the first two columns of every row. And since the
whole document is structured like a table, you could define a
particular index for each of the columns of the split result.

I sincerely hope this can help in any way
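
A minimal sketch of the regex route, written for Python 3. It assumes every value after the iteration number is in scientific notation, as in the sample rows. Some neighbouring fields in those rows run together (for example 1.04E-01-8.61E-04), so the sketch pulls the numbers out with a pattern rather than splitting on spaces. The names SCI_FLOAT and parse_row are illustrative only.

```python
import re

# Matches numbers in scientific notation such as 9.70E-02 or -8.61E-04.
SCI_FLOAT = re.compile(r'[-+]?\d+\.\d+E[-+]\d+')

def parse_row(line):
    """Return (iteration, [values]) for a data row, or None for any other line."""
    tokens = line.split(None, 1)
    if len(tokens) < 2 or not tokens[0].isdigit():
        return None  # header, separator, free text or blank line
    values = [float(v) for v in SCI_FLOAT.findall(tokens[1])]
    return int(tokens[0]), values

sample = ("1 9.70E-02 8.61E-02 9.85E-02 1.00E+00 1.61E+01 7.65E+04 0.00E+00 "
          "1.04E-01-8.61E-04 3.49E-02 1.38E-03 7.51E-05 1.63E-05 2.00E+01")
iteration, values = parse_row(sample)
print(iteration, len(values), values[7], values[8])
```

Because the pattern stops at the end of each exponent, the fused pair comes back as two separate values.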

johnzenger · Oct 6, 2006

Can you safely assume that the lines you want to extract all contain
numbers, and that the lines you do not wish to extract do not contain
numbers?

If so, you could just use the Linux grep utility: "grep '[0123456789]'
filename"

Or, in Python:

import re

inf = open("your-filename-here.txt")
outf = open("result-file.txt", "w")
digits = re.compile(r"\d")  # matches any single digit

for line in inf:
    # keep only the lines that contain at least one digit
    if digits.search(line):
        outf.write(line)
outf.close()
inf.close()

As for your "more difficult" file, take a look at the csv module. I
think that by changing the delimiter from a comma to a |, you will be
95% of the way to your goal.
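
A rough sketch of the csv idea, written for Python 3, assuming the table rows look like the ones in the sample log. The helper name table_rows is made up for illustration. csv.reader accepts any iterable of strings, so a list of lines works as well as an open file.

```python
import csv

log_lines = [
    "| Equation | Rate | RMS Res | Max Res | Linear Solution |",
    "+----------------------+------+---------+---------+------------------+",
    "| U-Mom | 0.00 | 1.0E-02 | 5.0E-01 | 4.9E-03 OK|",
    "| P-Mass | 0.00 | 1.1E-02 | 3.4E-01 | 8.9 2.7E-02 OK|",
]

def table_rows(lines):
    """Yield stripped cells for each '|'-delimited row, skipping ruler lines."""
    for fields in csv.reader(lines, delimiter='|'):
        # A row "| a | b |" splits into ['', ' a ', ' b ', '']; drop the empty ends.
        if len(fields) > 2:
            yield [cell.strip() for cell in fields[1:-1]]

rows = list(table_rows(log_lines))
header, body = rows[0], rows[1:]
max_res = header.index('Max Res')  # look the column up by its heading
for row in body:
    print(row[0], float(row[max_res]))
```

Lines without a '|' (the rulers and the OUTER LOOP lines) come back as a single field and are skipped by the length check.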

bearophileHUGS · Oct 6, 2006

Fabian Braennstroem:

A more difficult log file looks like:
...
With my sed/awk/grep/gnuplot script I would extract the
values in the 'U-Mom' row using grep and print a certain
column (e.g. 'Max Res') to a file and print it with gnuplot.
Maybe I have to remove those '|' using sed before...

This is a possible (quite raw) solution for the second log using string
methods only:

data = """
======================================================================
OUTER LOOP ITERATION = 1 CPU SECONDS = 2.40E+01
----------------------------------------------------------------------
| Equation | Rate | RMS Res | Max Res | Linear Solution |
+----------------------+------+---------+---------+------------------+
| U-Mom | 0.00 | 1.0E-02 | 5.0E-01 | 4.9E-03 OK|
| V-Mom | 0.00 | 2.4E-14 | 5.6E-13 | 3.8E+09 ok|
| W-Mom | 0.00 | 2.5E-14 | 8.2E-13 | 8.3E+09 ok|
| P-Mass | 0.00 | 1.1E-02 | 3.4E-01 | 8.9 2.7E-02 OK|
+----------------------+------+---------+---------+------------------+
| K-TurbKE | 0.00 | 1.8E+00 | 1.8E+00 | 5.8 2.2E-08 OK|
| E-Diss.K | 0.00 | 1.9E+00 | 2.0E+00 | 12.4 2.2E-08 OK|
+----------------------+------+---------+---------+------------------+

======================================================================
OUTER LOOP ITERATION = 2 CPU SECONDS = 8.57E+01
----------------------------------------------------------------------
| Equation | Rate | RMS Res | Max Res | Linear Solution |
+----------------------+------+---------+---------+------------------+
| U-Mom | 1.44 | 1.5E-02 | 5.3E-01 | 9.6E-03 OK|
| V-Mom |99.99 | 1.1E-03 | 6.2E-02 | 5.7E-02 OK|
| W-Mom |99.99 | 1.9E-03 | 6.0E-02 | 5.9E-02 OK|
| P-Mass | 0.27 | 3.0E-03 | 2.0E-01 | 8.9 7.9E-02 OK|
+----------------------+------+---------+---------+------------------+
| K-TurbKE | 0.03 | 5.4E-02 | 4.4E-01 | 5.8 2.9E-08 OK|
| E-Diss.K | 0.05 | 8.9E-02 | 9.3E-01 | 12.4 2.6E-08 OK|
+----------------------+------+---------+---------+------------------+
""".splitlines()

print [float(l.split("|")[4]) for l in data if 'U-Mom' in l]

Output:
[0.5, 0.53000000000000003]

Bye,
bearophile
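
For plotting it may help to keep the iteration number next to each value. A small Python 3 sketch along the same lines, assuming each block starts with an "OUTER LOOP ITERATION = n" line as in the sample; the names ITERATION and residual_history are hypothetical:

```python
import re

ITERATION = re.compile(r'OUTER LOOP ITERATION\s*=\s*(\d+)')

def residual_history(lines, equation='U-Mom', column=4):
    """Return [(iteration, value), ...] for one equation row per outer loop."""
    history = []
    current = None  # iteration number of the block being read
    for line in lines:
        match = ITERATION.search(line)
        if match:
            current = int(match.group(1))
        elif current is not None and equation in line:
            # column 4 of the '|' split is 'Max Res', column 3 is 'RMS Res'
            history.append((current, float(line.split('|')[column])))
    return history

sample = """\
OUTER LOOP ITERATION = 1 CPU SECONDS = 2.40E+01
| U-Mom | 0.00 | 1.0E-02 | 5.0E-01 | 4.9E-03 OK|
| V-Mom | 0.00 | 2.4E-14 | 5.6E-13 | 3.8E+09 ok|
OUTER LOOP ITERATION = 2 CPU SECONDS = 8.57E+01
| U-Mom | 1.44 | 1.5E-02 | 5.3E-01 | 9.6E-03 OK|
""".splitlines()

print(residual_history(sample))
```

Here re is used only for the one line that really needs pattern matching; the table rows are still handled with plain string methods.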

Scott David Daniels · Oct 6, 2006

bearophileHUGS wrote:
<a fine solution component for the second problem>

Use his solution like:
datafile = open(data_file_name, 'r')
for line in datafile:
    if 'U-Mom' in line:
        print float(line.split("|")[4])
datafile.close()

For the earlier problem:

def data_specific(source):
    global headings  # in case some other bit wants to read them
    saw_top = False
    gen = iter(source)
    for line in gen:
        cut = line.split(None, 1)
        if len(cut) > 1 and (cut[0] == 'ITER'
                and 'GLOBAL ABSOLUTE RESIDUAL' in cut[1]):
            break
    else:
        return
    headings = gen.next().split()  # column headings
    starts = range(11, 74, 9) + range(75, 138, 9)  # for fixed-width
    for line in gen:
        data = line.split()
        if data and data != ['...']:  # suppress blank lines
            if data[0] == '&&&&&&':  # found the terminator
                break
            assert line[10] == ' ' and line[74] == ' '
            yield [int(line[:10])] + [
                float(line[n : n+9]) for n in starts]

datafile = open(data_file_name, 'r')
for row in data_specific(datafile):
    print row  # or row[headings.index('MASS')] or whatever
datafile.close()

The general theme here is: don't use re unless it is a good solution.
Sometimes you know which columns things are in, sometimes you know a
separator, sometimes there is a mix, and sometimes you do need a regular
expression. Save re for when you need to do pattern matching.

--Scott David Daniels
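
On the "columns in arrays" part of the question, one possible sketch (Python 3) turns parsed rows into one list per column. It uses only the first five columns of the sample, because the header line of that log splits into 16 words while each data row holds 15 numbers, and "T" appears twice, so a plain split() of the header would not give one unique name per column. The function name columns_by_heading is illustrative.

```python
def columns_by_heading(headings, rows):
    """Transpose row-wise data into a dict mapping heading -> list of values."""
    columns = list(zip(*rows))  # one tuple per column
    if len(columns) != len(headings):
        raise ValueError('expected %d columns, got %d'
                         % (len(headings), len(columns)))
    return {name: list(values) for name, values in zip(headings, columns)}

headings = ['NO', 'UMOM', 'VMOM', 'WMOM', 'MASS']
rows = [
    [1, 9.70e-02, 8.61e-02, 9.85e-02, 1.00e+00],
    [2, 3.71e-02, 3.07e-02, 3.57e-02, 1.00e+00],
    [3, 2.64e-02, 1.99e-02, 2.40e-02, 1.00e+00],
]
columns = columns_by_heading(headings, rows)
print(columns['NO'], columns['UMOM'])
```

columns['NO'] and columns['UMOM'] can then be handed as x and y values to a plotting call such as matplotlib's pyplot.plot.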

Paul McGuire · Oct 6, 2006

Fabian Braennstroem said:
Hi,

I would like to remove certain lines from a log file. I had
some sed/awk scripts for this, but now, I want to use python
with its re module for this task.

Actually, I have two different log files. The first file looks
like:

...
'some text'
...

ITER I----------------- GLOBAL ABSOLUTE RESIDUAL -----------------I I------------ FIELD VALUES AT MONITORING LOCATION ----------I
NO UMOM VMOM WMOM MASS T EN DISS ENTH U V W P TE ED T
1 9.70E-02 8.61E-02 9.85E-02 1.00E+00 1.61E+01 7.65E+04 0.00E+00 1.04E-01-8.61E-04 3.49E-02 1.38E-03 7.51E-05 1.63E-05 2.00E+01
2 3.71E-02 3.07E-02 3.57E-02 1.00E+00 3.58E-01 6.55E-01 0.00E+00 1.08E-01-1.96E-03 4.98E-02 7.11E-04 1.70E-04 4.52E-05 2.00E+01
3 2.64E-02 1.99E-02 2.40E-02 1.00E+00 1.85E-01 3.75E-01 0.00E+00 1.17E-01-3.27E-03 6.07E-02 4.02E-04 4.15E-04 1.38E-04 2.00E+01
4 2.18E-02 1.52E-02 1.92E-02 1.00E+00 1.21E-01 2.53E-01 0.00E+00 1.23E-01-4.85E-03 6.77E-02 1.96E-05 9.01E-04 3.88E-04 2.00E+01
5 1.91E-02 1.27E-02 1.70E-02 1.00E+00 8.99E-02 1.82E-01 0.00E+00 1.42E-01-6.61E-03 7.65E-02 1.78E-04 1.70E-03 9.36E-04 2.00E+01
...

The pyparsing wiki includes an example
(http://pyparsing.wikispaces.com/space/showimage/dictExample2.py) for
parsing test data of the form:

+-------+------+------+------+------+------+------+------+------+
| | A1 | B1 | C1 | D1 | A2 | B2 | C2 | D2 |
+=======+======+======+======+======+======+======+======+======+
| min | 7 | 43 | 7 | 15 | 82 | 98 | 1 | 37 |
| max | 11 | 52 | 10 | 17 | 85 | 112 | 4 | 39 |
| ave | 9 | 47 | 8 | 16 | 84 | 106 | 3 | 38 |
| sdev | 1 | 3 | 1 | 1 | 1 | 3 | 1 | 1 |
+-------+------+------+------+------+------+------+------+------+

and accessing the parsed data (returned in the example in the variable
'data') as:
print "data keys=", data.keys()
print "data['min']=", data['min']
print "sum(data['min']) =", sum(data['min'])
print "data.max =", data.max
print "sum(data.max) =", sum(data.max)
print "data.columns =", data.columns

Giving:
data keys= ['ave', 'min', 'sdev', 'columns', 'max']
data['min']= [7, 43, 7, 15, 82, 98, 1, 37]
sum(data['min']) = 290
data.max = [11, 52, 10, 17, 85, 112, 4, 39]
sum(data.max) = 330
data.columns = ['A1', 'B1', 'C1', 'D1', 'A2', 'B2', 'C2', 'D2']

Not too dissimilar from your example.

-- Paul

Paddy · Oct 6, 2006

Fabian said:
Hi,
....

I actually want to extract the lines with the numbers, write
them to a file and finally use gnuplot for plotting them. A
nicer and more python way would be to extract those numbers,
write them into an array according to their column and plot
those using the gnuplot or matplotlib module

You might try comparing Ploticus to Gnuplot for your graph plotting
http://ploticus.sourceforge.net/

.... but if you already know gnuplot, and it does what you want then ...

- Pad.

Matteo · Oct 6, 2006

Fabian said:
Hi,

I would like to remove certain lines from a log file. I had
some sed/awk scripts for this, but now, I want to use python
with its re module for this task.

Actually, I have two different log files. The first file looks
like:

...
'some text'
...

ITER I----------------- GLOBAL ABSOLUTE RESIDUAL -----------------I I------------ FIELD VALUES AT MONITORING LOCATION ----------I
NO UMOM VMOM WMOM MASS T EN DISS ENTH U V W P TE ED T
1 9.70E-02 8.61E-02 9.85E-02 1.00E+00 1.61E+01 7.65E+04 0.00E+00 1.04E-01-8.61E-04 3.49E-02 1.38E-03 7.51E-05 1.63E-05 2.00E+01
2 3.71E-02 3.07E-02 3.57E-02 1.00E+00 3.58E-01 6.55E-01 0.00E+00 1.08E-01-1.96E-03 4.98E-02 7.11E-04 1.70E-04 4.52E-05 2.00E+01

....

Just a thought, but what about using exceptions - something like:

for line in logfile:
    vals = line.split()
    try:
        no = int(vals[0])
        # parse line as needed
    except (ValueError, IndexError):  # first item is not a number, or the line is blank
        pass  # ignore line, or parse it separately

Coming from C++, using exceptions in this way still feels a bit creepy
to me, but I've been assured that this is very pythonic, and I'm slowly
adopting this style in my python code.

Parsing the line can be easy too:
(umom,vmom,wmom,mass...) = map(float,vals[1:])

-matt
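
A self-contained Python 3 sketch of the exception-based filter, assuming data rows are exactly the lines whose first token is an integer. The generator name numbered_lines is hypothetical. Blank lines produce an empty token list, so IndexError is caught alongside ValueError.

```python
def numbered_lines(lines):
    """Yield (number, remaining_tokens) for lines whose first token is an integer."""
    for line in lines:
        vals = line.split()
        try:
            number = int(vals[0])
        except (IndexError, ValueError):
            # Blank line (no tokens) or a first token that is not an integer.
            continue
        yield number, vals[1:]

log = [
    "'some text'",
    "",
    "NO UMOM VMOM WMOM MASS",
    "1 9.70E-02 8.61E-02 9.85E-02 1.00E+00",
    "&&&&&& ----",
    "2 3.71E-02 3.07E-02 3.57E-02 1.00E+00",
]
for number, rest in numbered_lines(log):
    print(number, [float(v) for v in rest])
```

One caveat for the full rows in the original log: a token such as 1.04E-01-8.61E-04 is two numbers run together, and float() raises ValueError on it, so those rows need fixed-width slicing (as in Scott David Daniels's reply) or a pattern-based split before conversion.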

hanumizzle · Oct 7, 2006

Coming from C++, using exceptions in this way still feels a bit creepy
to me, but I've been assured that this is very pythonic, and I'm slowly
adopting this style in my python code.

Parsing the line can be easy too:
(umom,vmom,wmom,mass...) = map(float,vals[1:])

Style question.

Should one consider the map function deprecated and use [float(val)
for val in vals[1:]] instead, or not? I'm not sure myself.

-- Theerasak
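
For comparison, a short Python 3 sketch showing that the two spellings give the same values. map is still a regular built-in and is not deprecated; in Python 3 it returns a lazy iterator, so it is wrapped in list() here. The four-value row is a shortened, made-up stand-in for a full log line.

```python
vals = "1 9.70E-02 8.61E-02 9.85E-02 1.00E+00".split()

# map() is lazy in Python 3, so wrap it in list() to get the values at once.
with_map = list(map(float, vals[1:]))
with_comprehension = [float(v) for v in vals[1:]]

assert with_map == with_comprehension

# Tuple unpacking works with either form.
umom, vmom, wmom, mass = with_comprehension
print(umom, vmom, wmom, mass)
```

Which one to use is mostly a matter of taste; the comprehension tends to read more easily once the expression is more than a single function call.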
