* 'struct-like' list *

Ernesto

I'm still fairly new to Python, so I need some guidance here...

I have a text file with lots of data. I only need some of the data. I
want to put the useful data into an [array of] struct-like
mechanism(s). The text file looks something like this:

[BUNCH OF NOT-USEFUL DATA....]

Name: David
Age: 108 Birthday: 061095 SocialSecurity: 476892771999

[MORE USELESS DATA....]

Name........

I would like to have an array of "structs." Each struct has

struct Person{
    string Name;
    int Age;
    int Birthday;
    int SS;
}

I want to go through the file, filling up my list of structs.

My problems are:

1. How to search for the keywords "Name:", "Age:", etc. in the file...
2. How to implement some organized "list of lists" for the data
structure.

Any help is much appreciated.
 
Rene Pijlman

Ernesto:
1. How to search for the keywords "Name:", "Age:", etc. in the file...

You could use regular expression matching:
http://www.python.org/doc/lib/module-re.html

Or plain string searches:
http://www.python.org/dev/doc/devel/lib/string-methods.html

Ernesto:
2. How to implement some organized "list of lists" for the data
structure.

You could make it a list of bunches, for example:
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52308

Or a list of objects of your custom class.
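
To make the "bunch" idea concrete, here is a minimal sketch written for Python 3 (the other code in this thread is Python 2). The Bunch class below paraphrases the approach of the linked recipe rather than copying it, and the sample line and field names are assumptions taken from the original question.

```python
class Bunch:
    """A tiny struct-like container: keyword arguments become attributes."""
    def __init__(self, **fields):
        self.__dict__.update(fields)

    def __repr__(self):
        items = ", ".join("%s=%r" % kv for kv in sorted(self.__dict__.items()))
        return "Bunch(%s)" % items

# Plain string search: check for the keyword, then keep what follows it.
line = "Name: David"   # sample line, assumed to look like the question's data
people = []
if line.startswith("Name:"):
    people.append(Bunch(name=line[len("Name:"):].strip()))

people[0].age = 108    # further fields can be attached later
print(people)
```

A list of such objects behaves much like an array of structs: people[0].name, people[0].age, and so on.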
 
Schüle Daniel

Ernesto said:
I would like to have an array of "structs." Each struct has

struct Person{
    string Name;
    int Age;
    int Birthday;
    int SS;
}


The easiest way would be:

class Person:
    pass

john = Person()
david = Person()

john.name = "John Brown"
john.age = 35
# etc.

Think of john as a namespace, with attributes (as we call them) added at
runtime.

A better approach would be to make a real class with a constructor:

class Person(object):
    def __init__(self, name, age):
        self.name = name
        self.age = age
    def __str__(self):
        return "person name = %s and age = %i" % (self.name, self.age)

john = Person("john brown", 35)
print john # this calls __str__

Ernesto said:
I want to go through the file, filling up my list of structs.

My problems are:

1. How to search for the keywords "Name:", "Age:", etc. in the file...
2. How to implement some organized "list of lists" for the data

This depends on the structure of the file.
Consider this format:

New
Name: John
Age: 35
Id: 23242
New
Name: xxx
Age
Id: 43324
OtherInfo: foo
New

Here you could read everything as one string and split it on "New".

Here is a small example:
>>> txt = "fooXbarXfoobar"
>>> txt.split("X")
['foo', 'bar', 'foobar']
>>>

In a more complicated case I would use a regexp, but
I doubt this is necessary in your case.
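
The split-on-"New" idea could be sketched roughly as follows, in Python 3. The sample text is adapted from the format shown in this post, and storing each record as a dict is an assumption made for illustration, not something the post prescribes.

```python
# Sample input adapted from the "New"-delimited format discussed in the post
text = """New
Name: John
Age: 35
Id: 23242
New
Name: xxx
Id: 43324
OtherInfo: foo
"""

records = []
for chunk in text.split("New"):
    record = {}
    for line in chunk.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that do not have a "key: value" shape
            record[key.strip()] = value.strip()
    if record:  # the text before the first "New" yields an empty chunk
        records.append(record)

print(records)
```

One caveat: a bare split on "New" would also cut inside any value that happens to contain that word, so a real file may need a stricter separator test.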

Regards, Daniel
 
Paul McGuire

Ernesto -

Since you are searching for keywords and matching fields, and trying to
populate data structures as you go, this sounds like a good fit for
pyparsing. Pyparsing has built-in features for scanning through text and
extracting data, with suitably named data fields for accessing later.

Download pyparsing at http://pyparsing.sourceforge.net.

-- Paul

------------------------------------------------
from pyparsing import *

inputData = """[BUNCH OF NOT-USEFUL DATA....]

Name: David
Age: 108 Birthday: 061095 SocialSecurity: 476892771999

[MORE USELESS DATA....]

Name: Fred
Age: 101 Birthday: 061065 SocialSecurity: 587903882000

[MORE USELESS DATA....]

Name: Barney
Age: 99 Birthday: 061265 SocialSecurity: 698014993111

[MORE USELESS DATA....]

"""

dob = Word(nums,exact=6)
# this matches your sample data, but I think SSN's are only 9 digits long
socsecnum = Word(nums,exact=12)

# define the personalData pattern - use results names to associate
# field names with matched tokens, can then access data as if they were
# attributes on an object
personalData = ( "Name:" + empty + restOfLine.setResultsName("Name") +
                 "Age:" + Word(nums).setResultsName("Age") +
                 "Birthday:" + dob.setResultsName("Birthday") +
                 "SocialSecurity:" + socsecnum.setResultsName("SS") )

# use personalData.scanString to scan through the input, returning the
# matching tokens, and their respective start/end locations in the string
for person,s,e in personalData.scanString(inputData):
    print "Name:", person.Name
    print "Age:", person.Age
    print "DOB:", person.Birthday
    print "SSN:", person.SS
    print

# or use a list comp to scan the whole file, and return your Person data,
# giving you your requested array of "structs" - not really structs, but
# ParseResults objects
persons = [person for person,s,e in personalData.scanString(inputData)]

# or convert to Python dict's, which some people prefer to pyparsing's
# ParseResults
persons = [dict(p) for p,s,e in personalData.scanString(inputData)]
print persons[0]
print

# or create an array of Person objects, as suggested in previous postings
class Person(object):
    def __init__(self,parseResults):
        self.__dict__.update(dict(parseResults))

    def __str__(self):
        return "Person(%s, %s, %s, %s)" % (
            self.Name, self.Age, self.Birthday, self.SS)

persons = [Person(p) for p,s,e in personalData.scanString(inputData)]
for p in persons:
    print p.Name,"->",p
--------------------------------------
prints out:
Name: David
Age: 108
DOB: 061095
SSN: 476892771999

Name: Fred
Age: 101
DOB: 061065
SSN: 587903882000

Name: Barney
Age: 99
DOB: 061265
SSN: 698014993111

{'SS': '476892771999', 'Age': '108', 'Birthday': '061095', 'Name': 'David'}

David -> Person(David, 108, 061095, 476892771999)
Fred -> Person(Fred, 101, 061065, 587903882000)
Barney -> Person(Barney, 99, 061265, 698014993111)
 
Raymond Hettinger

Since you're just starting out in Python, this problem presents an
excellent opportunity to learn Python's two basic approaches to text
parsing.

The first approach involves looping over the input lines, searching for
key phrases, and extracting them using string slicing and using
str.strip() to trim irregular length input fields. The start/stop
logic is governed by the first and last key phrases and the results get
accumulated in a list. This approach is easy to program, maintain, and
explain to others:

# Approach suitable for inputs with fixed input positions
# (the slice offsets assume the column layout of the original file)
result = []
for line in inputData.splitlines():
    if line.startswith('Name:'):
        name = line[7:].strip()
    elif line.startswith('Age:'):
        age = line[5:8].strip()
        bd = line[20:26]
        ssn = line[45:54]
        result.append((name, age, bd, ssn))
print result

The second approach uses regular expressions. The pattern is to search
for a key phrase, skip over whitespace, and grab the data field in a
parenthesized group. Unlike slicing, this approach is tolerant of
loosely formatted data where the target fields do not always appear in
the same column position. The trade-off is having less flexibility in
parsing logic (i.e. the target fields must arrive in a fixed order):

# Approach for more loosely formatted inputs
import re
pattern = r'''(?x)
Name:\s+(\w+)\s+
Age:\s+(\d+)\s+
Birthday:\s+(\d+)\s+
SocialSecurity:\s+(\d+)
'''
print re.findall(pattern, inputData)
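
As a variation on the regular-expression approach, named groups let each match come back with field names attached. This is only a sketch in Python 3 (the thread's code is Python 2); the group names, the choice to convert only the age to an integer, and the one-record sample input are assumptions made for illustration.

```python
import re

# One record in the shape of the sample data from the question
input_data = """[BUNCH OF NOT-USEFUL DATA....]

Name: David
Age: 108 Birthday: 061095 SocialSecurity: 476892771999

[MORE USELESS DATA....]
"""

# re.VERBOSE plays the same role as the (?x) flag: whitespace in the
# pattern is ignored, so the pattern can be laid out one field per line.
pattern = re.compile(r"""
    Name:\s+(?P<name>\w+)\s+
    Age:\s+(?P<age>\d+)\s+
    Birthday:\s+(?P<birthday>\d+)\s+
    SocialSecurity:\s+(?P<ss>\d+)
""", re.VERBOSE)

people = []
for match in pattern.finditer(input_data):
    person = match.groupdict()          # {'name': ..., 'age': ..., ...}
    person["age"] = int(person["age"])  # birthday stays a string to keep its leading zero
    people.append(person)

print(people)
```

Each element of people is a plain dict, so a field is read as people[0]["name"].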

Other respondents have suggested the third-party PyParsing module which
provides a powerful general-purpose toolset for text parsing; however,
it is always worth mastering Python basics before moving on to special
purpose tools. The above code fragments are easy to construct and not
hard to explain to others. Maintenance is a breeze.


Raymond


P.S. Once you've formed a list of tuples, it is trivial to create
Person objects for your Pascal-like structure:

class Person(object):
    def __init__(self, (name, age, bd, ssn)):
        self.name=name; self.age=age; self.bd=bd; self.ssn=ssn

personlist = map(Person, result)
for p in personlist:
    print p.name, p.age, p.bd, p.ssn
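
For readers on Python 3, where tuple parameters in a function signature are no longer allowed, one possible equivalent is collections.namedtuple. This is a sketch, not part of the original post; the sample tuples are assumed to have the (name, age, bd, ssn) shape built by the parsing loop.

```python
from collections import namedtuple

# A struct-like record type with named, read-only fields
Person = namedtuple("Person", ["name", "age", "bd", "ssn"])

# Sample tuples in the (name, age, bd, ssn) shape produced by the parsing loop
result = [("David", "108", "061095", "476892771999"),
          ("Fred", "101", "061065", "587903882000")]

personlist = [Person(*fields) for fields in result]
for p in personlist:
    print(p.name, p.age, p.bd, p.ssn)
```

A namedtuple is immutable; if the fields need to change after parsing, an ordinary class remains the simpler choice.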
 
Ernesto

Thanks for the approach. I decided to use regular expressions. I'm
going by the code you posted (below). I replaced the re.findall
line with a read() on my file handle, like this:

print re.findall(pattern, myFileHandle.read())

This prints out only brackets []. Is a 're.compile' perhaps necessary?
 
Schüle Daniel

If you see [], that means findall didn't find anything
that matches your pattern.
Compiling the pattern beforehand with re.compile
would not make findall find the text;
it is only there as an optimization.

Consider:

lines = [line for line in open("foo.txt").readlines()
         if re.match(r"\d+", line)]

In this case it's better to pre-compile the regexp once and use it
to match all lines:

number = re.compile(r"\d+")
lines = [line for line in open("foo.txt").readlines() if number.match(line)]

Fire up interactive Python and play with re and patterns.
Speaking from my own experience, the probability is
against you getting the pattern right the first time.
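
One way to do that experimenting is to grow the pattern one piece at a time and watch where it stops matching. The following is a Python 3 sketch; the sample text is hypothetical and stands in for the contents of your own file, and the pieces come from the pattern posted earlier in the thread.

```python
import re

# Hypothetical sample text; substitute the contents of your own file.
text = "Name: David\nAge: 108 Birthday: 061095 SocialSecurity: 476892771999\n"

pieces = [
    r"Name:\s+(\w+)\s+",
    r"Age:\s+(\d+)\s+",
    r"Birthday:\s+(\d+)\s+",
    r"SocialSecurity:\s+(\d+)",
]

# Grow the pattern piece by piece; the first False shows which piece
# fails to match the real data.
matched = []
pattern = ""
for piece in pieces:
    pattern += piece
    found = re.findall(pattern, text)
    matched.append(bool(found))
    print(bool(found), repr(pattern))

# A compiled pattern returns exactly what the module-level function returns.
compiled = re.compile(pattern)
same = compiled.findall(text) == re.findall(pattern, text)
print(same)
```

It is also worth checking that the file handle has not already been read to the end, since a second read() returns an empty string and findall then has nothing to search.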

Regards, Daniel
 
Bengt Richter

Ernesto said:
I'm still fairly new to python, so I need some guidance here...

I have a text file with lots of data. I only need some of the data. I
want to put the useful data into an [array of] struct-like
mechanism(s). The text file looks something like this:

[BUNCH OF NOT-USEFUL DATA....]

Name: David
Age: 108 Birthday: 061095 SocialSecurity: 476892771999

[MORE USELESS DATA....]

Name........

Does the useful data always come in fixed-format pairs of lines, as in your example?
If so, you could just iterate through the lines of your text file, as in the example at the end [1].

Ernesto said:
I would like to have an array of "structs." Each struct has

struct Person{
    string Name;
    int Age;
    int Birthday;
    int SS;
}

You don't normally want to do real structs in Python. You probably want to define
a class to contain the data, e.g., class Person in the example at the end [1].

Ernesto said:
I want to go through the file, filling up my list of structs.

My problems are:

1. How to search for the keywords "Name:", "Age:", etc. in the file...
2. How to implement some organized "list of lists" for the data
structure.

It may be very easy, if the format is fixed and space-separated and line-paired
as in your example data, but you will have to tell us more if not.

[1] Example:

----< ernesto.py >---------------------------------------------------------
class Person(object):
    def __init__(self, name):
        self.name = name
    def __repr__(self): return 'Person(%r)'%self.name

def extract_info(lineseq):
    lineiter = iter(lineseq) # normalize access to lines
    personlist = []
    for line in lineiter:
        substrings = line.split()
        if substrings and isinstance(substrings, list) and substrings[0] == 'Name:':
            try:
                name = ' '.join(substrings[1:]) # allow for names with spaces
                line = lineiter.next()
                age_hdr, age, bd_hdr, bd, ss_hdr, ss = line.split()
                assert age_hdr=='Age:' and bd_hdr=='Birthday:' and ss_hdr=='SocialSecurity:', \
                    'Bad second line after "Name: %s" line:\n %r'%(name, line)
                person = Person(name)
                person.age = int(age); person.bd = int(bd); person.ss=int(ss)
                personlist.append(person)
            except Exception,e:
                print '%s: %s'%(e.__class__.__name__, e)
    return personlist

def test():
    lines = """\
[BUNCH OF NOT-USEFUL DATA....]

Name: David
Age: 108 Birthday: 061095 SocialSecurity: 476892771999

[MORE USELESS DATA....]

Name: Ernesto
Age: 25 Birthday: 040181 SocialSecurity: 123456789

Name: Ernesto
Age: 44 Brithdy: 040106 SocialSecurity: 123456789

Name........
"""
    persondata = extract_info(lines.splitlines())
    print persondata
    ssdict = {}
    for person in persondata:
        if person.ss in ssdict:
            print 'Rejecting %r with duplicate ss %s'%(person, person.ss)
        else:
            ssdict[person.ss] = person
    print 'ssdict keys: %s'%ssdict.keys()
    for ss, pers in sorted(ssdict.items(), key=lambda item:item[1].name): #sorted by name
        print 'Name: %s Age: %s SS: %s' % (pers.name, pers.age, pers.ss)

if __name__ == '__main__': test()
---------------------------------------------------------------------------

this produces output:

[10:07] C:\pywk\clp>py24 ernesto.py
AssertionError: Bad second line after "Name: Ernesto" line:
'Age: 44 Brithdy: 040106 SocialSecurity: 123456789'
[Person('David'), Person('Ernesto')]
ssdict keys: [123456789, 476892771999L]
Name: David Age: 108 SS: 476892771999
Name: Ernesto Age: 25 SS: 123456789

if you want to try this on a file, (we'll use the source itself here
since it includes valid example data lines), do something like:
AssertionError: Bad second line after "Name: Ernesto" line:
 'Age: 44 Brithdy: 040106 SocialSecurity: 123456789\n'
[Person('David'), Person('Ernesto')]

tweak to taste ;-)

Regards,
Bengt Richter
 
Bengt Richter

----< ernesto.py >---------------------------------------------------------
[...]

Just noticed:

substrings = line.split()
if substrings and isinstance(substrings, list) and substrings[0] == 'Name:':
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^--not needed

str.split always returns a list, even if it's length 1, so that was harmless but should be

if substrings and substrings[0] == 'Name:':

(the first term is needed because ''.split() => [], to avoid [][0])
Sorry.

Regards,
Bengt Richter
 
