f = open(r'c:\somefile.txt', 'w')
f.write('0123456789\n0123456789\n0123456789')
Not the most explanatory sample data... It would be better if the
records had different contents.
f = open(r'c:\somefile.txt', 'r')
for line in f:
Here you extract one "line" from the file
    f.seek(3,0)
    print f.read(1) #just to know if it's printing the right column
And here you ignored the entire line you read, seeking to the fourth
byte from the beginning of the file, and reading just one byte from it.
I have no idea of how seek()/read() behaves relative to line
iteration in the for loop... Given the small size of the test data set
it is quite likely that the first "for line in f" resulted in the entire
file being read into a buffer, and that buffer scanned to find the line
ending and return the data preceding it; then the buffer position is set
to after that line ending so the next "for line" continues from that
point.
But in a situation with a large data set, or an unbuffered I/O
system, the seek()/read() could easily result in resetting the file
position used by the "for line", so that the second call returns
"456789\n"... And all subsequent calls too, resulting in an infinite
loop.
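One way to sidestep the question entirely is to never move the file
position inside the loop: the line has already been read, so the wanted
column can be taken from it by indexing. A minimal sketch, assuming the
goal was only to print the fourth character of each record. The helper
name column_of is made up for this illustration, io.StringIO stands in
for the real file, the records are given different contents so a wrong
column would show, and the code is written for a current Python 3 rather
than 2.5:

```python
import io

def column_of(lines, index):
    """Return the character at position `index` of each line (hypothetical helper).

    Lines too short to have that column are skipped.
    """
    result = []
    for line in lines:
        record = line.rstrip("\n")      # drop the line ending before indexing
        if index < len(record):
            result.append(record[index])
    return result

# In-memory stand-in for the data file; each record is different on purpose.
sample = io.StringIO(u"0123456789\nabcdefghij\nABCDEFGHIJ")

# Index the line already in hand instead of calling seek()/read() on the file.
fourth_column = column_of(sample, 3)
print(fourth_column)
```

Since the file object is only ever advanced by the loop itself, it no
longer matters how the iterator buffers its reads.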
Presuming the assignment requires pulling multiple selected fields
from individual records, where each record is of the same
format/spacing, AND that the field selection can not be preprogrammed...
Sample data file (use fixed width font to view):
-=-=-=-=-=-
Wulfraed       09Ranger  1915
Bask Euren     13Cleric  1511
Aethelwulf     07Mage    0908
Cwiculf        08Mage    1008
-=-=-=-=-=-
Sample format definition file:
-=-=-=-=-=-
Name	0-14
Level	15-16
Class	17-24
THAC0	25-26
Armor	27-28
-=-=-=-=-=-
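The column ranges in the definition are inclusive at both ends, while a
Python slice excludes its end index, so a field defined as start-end
corresponds to line[start:end+1]. A small sketch of that mapping applied
to the first sample record; parse_definition is a hypothetical helper
used only for this illustration, and Python 3 syntax is assumed:

```python
def parse_definition(text):
    """Parse one 'name<TAB>start-end' line into (name, start, end).

    Hypothetical helper; `end` is inclusive, as in the definition file.
    A single column number such as '5' yields start == end.
    """
    name, columns = text.rstrip("\n").split("\t")
    bounds = [int(part) for part in columns.split("-")]
    return name, bounds[0], bounds[-1]

# First record of the sample data: 15 + 2 + 8 + 2 + 2 = 29 characters.
record = "Wulfraed       09Ranger  1915"

name, start, end = parse_definition("Class\t17-24")
# The slice end is exclusive, hence end + 1 for an inclusive column range.
value = record[start:end + 1]
print("%s: %r" % (name, value))
```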
Code to process (Python 2.5, with minimal error handling):
-=-=-=-=-=-
class Extractor(object):
    def __init__(self, formatFile):
        ff = open(formatFile, "r")
        self._format = {}   #field name -> (start, end), end inclusive
        self._length = 0    #minimum record length needed by the format
        for line in ff:
            form = line.split("\t")     #file must be tab separated
            if len(form) != 2:
                print "Invalid file format definition: %s" % line
                continue
            name = form[0]
            columns = form[1].split("-")
            if len(columns) == 1:       #single column definition
                start = int(columns[0])
                end = start
            elif len(columns) == 2:
                start = int(columns[0])
                end = int(columns[1])
            else:
                print "Invalid column definition: %s" % form[1]
                continue
            self._format[name] = (start, end)
            #end is an inclusive index, so the record needs end+1 characters
            self._length = max(self._length, end + 1)
        ff.close()

    def __call__(self, line):
        data = {}
        line = line.rstrip("\r\n")      #line ending is not part of the record
        if len(line) < self._length:
            print "Data line is too short for required format: ignored"
        else:
            for (name, (start, end)) in self._format.items():
                data[name] = line[start:end+1]
        return data

if __name__ == "__main__":
    FORMATFILE = "SampleFormat.tsv"
    DATAFILE = "SampleData.txt"

    characterExtractor = Extractor(FORMATFILE)
    df = open(DATAFILE, "r")
    for line in df:
        fields = characterExtractor(line)
        for (name, value) in fields.items():
            print "Field name: '%s'\t\tvalue: '%s'" % (name, value)
        print
    df.close()
-=-=-=-=-=-
Output from running above code:
-=-=-=-=-=-
Field name: 'Armor'		value: '15'
Field name: 'THAC0'		value: '19'
Field name: 'Level'		value: '09'
Field name: 'Class'		value: 'Ranger  '
Field name: 'Name'		value: 'Wulfraed       '

Field name: 'Armor'		value: '11'
Field name: 'THAC0'		value: '15'
Field name: 'Level'		value: '13'
Field name: 'Class'		value: 'Cleric  '
Field name: 'Name'		value: 'Bask Euren     '

Field name: 'Armor'		value: '08'
Field name: 'THAC0'		value: '09'
Field name: 'Level'		value: '07'
Field name: 'Class'		value: 'Mage    '
Field name: 'Name'		value: 'Aethelwulf     '

Field name: 'Armor'		value: '08'
Field name: 'THAC0'		value: '10'
Field name: 'Level'		value: '08'
Field name: 'Class'		value: 'Mage    '
Field name: 'Name'		value: 'Cwiculf        '
-=-=-=-=-=-
Note that string fields have not been trimmed, and numeric fields
are still in text format... The format definition file would need to be
expanded to include a "string", "integer", "float" (and "Boolean"?) code
in order for the extractor to do proper type conversions.
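One possible shape for that extension, as a sketch only: a third
tab-separated column carrying a type code, and a table mapping each code
to a conversion function. The type codes, the CONVERTERS table, to_bool
and convert_field are all assumptions made up for this illustration, not
an existing format, and Python 3 syntax is assumed:

```python
def to_bool(text):
    """Interpret a few common spellings of true; anything else is False."""
    return text.strip().lower() in ("1", "y", "yes", "t", "true")

# Hypothetical type codes for a third column in the format definition file.
CONVERTERS = {
    "string": lambda text: text.strip(),   # trim the fixed-width padding
    "integer": int,                        # int() ignores surrounding spaces
    "float": float,
    "boolean": to_bool,
}

def convert_field(raw, type_code):
    """Convert the raw slice `raw` according to `type_code`.

    Unknown type codes fall back to returning the text untouched.
    """
    converter = CONVERTERS.get(type_code)
    if converter is None:
        return raw
    return converter(raw)

# Raw slices as the extractor would return them for the first sample record.
raw_fields = [("Name", "Wulfraed       ", "string"),
              ("Level", "09", "integer"),
              ("Class", "Ranger  ", "string")]

converted = dict((name, convert_field(raw, code))
                 for (name, raw, code) in raw_fields)
print(converted)
```

A field that fails to convert (say, letters in an "integer" column) would
raise ValueError here; whether to skip the record or report it is a
decision left to the caller.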