Reading by positions plain text files

javivd

Hi all,

Sorry, newbie question:

I have a database in a plain text file (it could be .txt or .dat, it's the
same) that I need to read in Python in order to do some data
validation. For other files of this kind I have used the split()
method, reading line by line. But split() relies on a separator
character (I think... all I know is that it works OK).

Now I have a case in which a second file has been provided (besides the
database) that tells me which columns of the data file hold each variable,
because there is no blank or tab character separating the
variables; they are stuck together. This second file specifies each
variable name and its position:


VARIABLE NAME    POSITION (COLUMN) IN FILE
var_name_1       123-123
var_name_2       124-125
var_name_3       126-126
...
...
var_name_N       512-513 (last positions)

How can I read this so that each position in the file is associated with
the corresponding variable name?

Thanks a lot!!

Javier
 
Tim Harig

I am unclear on the format of these positions. They do not look like
what I would expect from absolute references in the data. For instance,
123-123 would hold only one byte, which could change for different
encodings and depending on how you mark line endings. Frankly, the use of the
word "column" in the header suggests that the data *is* separated by
line endings rather than by absolute position, and that the position refers to
the line number. In that case, you can use splitlines() to break up
the data and then address the proper line by index. Nevertheless,
you can use file.seek() to move to an absolute offset in the file,
if that really is what you are looking for.
 
MRAB

It sounds like a similar problem to this:

http://groups.google.com/group/comp.../123422d510187dc3?show_docid=123422d510187dc3
 
javivd

I work in a survey research firm. The data I'm talking about has a lot
of 0-1 variables, meaning yes or no answers to a lot of questions. So only one
character position (not byte) is needed, which explains the 123-123
kind of positions for a lot of the variables.

And no, MRAB, it's not a similar problem (at least as far as I understood
it). I have to associate the positions this file gives me with the
variable names this file gives me for those positions.

Thank you both, and sorry for my English!

J
 
MRAB

You just have to parse the second file to build a list (or dict)
containing the name, start position and end position of each variable:

variables = [("var_name_1", 123, 123), ...]

and then work through that list, extracting the data between those
positions in the first file and putting the values in another list (or
dict).

You also need to check whether the positions are 1-based or 0-based
(Python uses 0-based).
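
The two steps described above could be sketched roughly as follows. This is written for Python 3 (the code elsewhere in the thread is Python 2), and the helper names `parse_layout` and `extract_record`, the `one_based` flag and the tiny sample layout are assumptions made for illustration, not anything taken from the actual files.

```python
def parse_layout(lines):
    """Build a list of (name, start, end) tuples from 'name start-end' lines."""
    variables = []
    for line in lines:
        parts = line.split()
        # Skip blank lines, the header, and anything without a 'start-end' range.
        if len(parts) < 2:
            continue
        start, dash, end = parts[1].partition("-")
        if not (dash and start.isdigit() and end.isdigit()):
            continue
        variables.append((parts[0], int(start), int(end)))
    return variables


def extract_record(line, variables, one_based=True):
    """Return {name: text} for one data line; both positions are inclusive."""
    shift = 1 if one_based else 0
    return {name: line[start - shift:end - shift + 1]
            for name, start, end in variables}


# Hypothetical layout and data line, just to exercise the functions.
layout = parse_layout([
    "VARIABLE NAME POSITION (COLUMN) IN FILE",
    "var_a 1-1",
    "var_b 2-3",
    "var_c 4-4",
])
record = extract_record("1420", layout)
print(record)
```

Whether `one_based` should be true depends on how the layout file counts columns, which is exactly the check mentioned above.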
 
Tim Chase

MRAB may be referring to my reply in that thread where you can do
something like

OFFSETS = 'offsets.txt'
offsets = {}
f = file(OFFSETS)
f.next() # throw away the headers
for row in f:
    varname, rest = row.split()[:2]
    # sanity check
    if varname in offsets:
        print "[%s] in %s twice?!" % (varname, OFFSETS)
    if '-' not in rest: continue
    start, stop = map(int, rest.split('-'))
    offsets[varname] = slice(start, stop+1) # 0-based offsets
    #offsets[varname] = slice(start-1, stop) # 1-based offsets
f.close()

def do_something_with(data):
    # your real code goes here
    print data['var_name_2']

for row in file('data.txt'):
    data = dict((name, row[offsets[name]]) for name in offsets)
    do_something_with(data)

There are additional robustness checks I'd include if your
offsets file isn't controlled by you (people send me daft data).

-tkc
 
Tim Harig

Then file.seek() is what you are looking for; but, you need to be aware of
line endings and encodings as indicated. Make sure that you open the file
using whatever encoding was used when it was generated or you could have
problems with multibyte characters affecting the offsets.
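
A small Python 3 illustration of the encoding concern. The sample string is invented; it only shows that a character position and a byte offset stop agreeing once a multibyte character appears earlier in the line.

```python
# Hypothetical record: the second character takes two bytes in UTF-8.
line = "A\u00f101"
raw = line.encode("utf-8")

# Indexing the decoded text counts characters...
char_at_2 = line[2]
# ...while an offset into the encoded data counts bytes.
byte_at_2 = raw[2:3]

print(len(line), len(raw))    # number of characters vs. number of bytes
print(repr(char_at_2), repr(byte_at_2))
```

Opening the file in text mode with the encoding it was written in, and slicing each decoded line, keeps the positions in characters rather than bytes.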
 
javivd

Ok, I will try it and let you know. Thanks all!!
 
javivd

I've tried your advice and something is wrong. Here is my code:

f = open(r'c:c:\somefile.txt', 'w')
f.write('0123456789\n0123456789\n0123456789')
f.close()

f = open(r'c:\somefile.txt', 'r')

for line in f:
    f.seek(3,0)
    print f.read(1) #just to know if its printing the right column

I used .seek() in this manner, but it is not working.

Let me put the problem another way. I have a .txt file with NO
headers and NO blanks between any columns. But I know that
columns, say, 13 to 15 hold the variable VARNAME_1 (a three-digit
variable, of course). How can I extract those columns into a list called
VARNAME_1?

Obviously, this should extend to all the positions and variables I
have to extract from the file.

Thanks!

J
 
Tim Harig

Note that in my earlier replies I specifically questioned the use of an
absolute file position vs. a position within a column. These are two
different things, and you use different methods to extract each.

> f = open(r'c:c:\somefile.txt', 'w')

I suspect you don't need to use the c: twice.

> f.write('0123456789\n0123456789\n0123456789')

Note that the file you are writing contains three lines. Is the data that
you are looking for located at an absolute position in the file or at a
position within an individual line? If the latter, note that line endings
may be composed of more than a single character.

> f.write('0123456789\n0123456789\n0123456789')
              ^ position 3 using seek()

> for line in f:

Perhaps you meant:
    for character in f.read():
or
    for line in f.read().splitlines():

> f.seek(3,0)

This will always take you back to the exact fourth position in the file
(indicated above).

> I used .seek() in this manner, but it is not working.

It is working the way it is supposed to.

If you want the absolute position 3 in a file, then:

f = open('somefile.txt', 'r')
f.seek(3)
variable = f.read(1)

If you want position 3 within each line (what I have been calling a column):

f = open('somefile.txt', 'r').read().splitlines()
for column in f:
    variable = column[3]
 
Tim Harig

>> I used .seek() in this manner, but it is not working.
>
> It is working the way it is supposed to.
> If you want the absolute position in a column:
>
> f = open('somefile.txt', 'r').read().splitlines()
> for column in f:
>     variable = column[3]

or:

f = open('somefile.txt', 'r')
for column in f.readlines():
    variable = column[3]
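
For the question as restated earlier in the thread (collecting one range of columns from every line into a list), a Python 3 sketch along the same lines might look like this. The function name `column_values` is an assumption, positions are taken as 0-based and inclusive, and the sample uses columns 3 to 5 only because its lines are ten characters long.

```python
import io


def column_values(lines, first, last):
    """Collect characters first..last (0-based, inclusive) from every line."""
    values = []
    for line in lines:
        line = line.rstrip("\r\n")       # drop the line ending before slicing
        values.append(line[first:last + 1])
    return values


# io.StringIO stands in for an open text file here.
sample = io.StringIO("0123456789\nabcdefghij\nABCDEFGHIJ")
VARNAME_1 = column_values(sample, 3, 5)
print(VARNAME_1)
```

No seek() is involved: each line is read once and sliced, so the position is always relative to the start of that line.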
 
Dennis Lee Bieber

> f = open(r'c:c:\somefile.txt', 'w')
>
> f.write('0123456789\n0123456789\n0123456789')

Not the most explanatory sample data... It would be better if the
records had different contents.

> f.close()
>
> f = open(r'c:\somefile.txt', 'r')
>
> for line in f:

Here you extract one "line" from the file.

>     f.seek(3,0)
>     print f.read(1) #just to know if its printing the right column

And here you ignored the entire line you read, seeking to the fourth
byte from the beginning of the file, and reading just one byte from it.

I have no idea of how seek()/read() behaves relative to line
iteration in the for loop... Given the small size of the test data set
it is quite likely that the first "for line in f" resulted in the entire
file being read into a buffer, and that buffer scanned to find the line
ending and return the data preceding it; then the buffer position is set
to after that line ending so the next "for line" continues from that
point.

But in a situation with a large data set, or an unbuffered I/O
system, the seek()/read() could easily result in resetting the file
position used by the "for line", so that the second call returns
"456789\n"... And all subsequent calls too, resulting in an infinite
loop.


Presuming the assignment requires pulling multiple selected fields
from individual records, where each record is of the same
format/spacing, AND that the field selection can not be preprogrammed...

Sample data file (use fixed width font to view):
-=-=-=-=-=-
Wulfraed       09Ranger  1915
Bask Euren     13Cleric  1511
Aethelwulf     07Mage    0908
Cwiculf        08Mage    1008
-=-=-=-=-=-

Sample format definition file:
-=-=-=-=-=-
Name    0-14
Level   15-16
Class   17-24
THAC0   25-26
Armor   27-28
-=-=-=-=-=-

Code to process (Python 2.5, with minimal error handling):
-=-=-=-=-=-

class Extractor(object):
    def __init__(self, formatFile):
        ff = open(formatFile, "r")
        self._format = {}
        self._length = 0
        for line in ff:
            form = line.split("\t") #file must be tab separated
            if len(form) != 2:
                print "Invalid file format definition: %s" % line
                continue
            name = form[0]
            columns = form[1].split("-")
            if len(columns) == 1:   #single column definition
                start = int(columns[0])
                end = start
            elif len(columns) == 2:
                start = int(columns[0])
                end = int(columns[1])
            else:
                print "Invalid column definition: %s" % form[1]
                continue
            self._format[name] = (start, end)
            #end is an inclusive 0-based index, so a line needs end+1 characters
            self._length = max(self._length, end + 1)
        ff.close()

    def __call__(self, line):
        data = {}
        if len(line) < self._length:
            print "Data line is too short for required format: ignored"
        else:
            for (name, (start, end)) in self._format.items():
                data[name] = line[start:end+1]
        return data


if __name__ == "__main__":
    FORMATFILE = "SampleFormat.tsv"
    DATAFILE = "SampleData.txt"

    characterExtractor = Extractor(FORMATFILE)

    df = open(DATAFILE, "r")
    for line in df:
        fields = characterExtractor(line)
        for (name, value) in fields.items():
            print "Field name: '%s'\t\tvalue: '%s'" % (name, value)
        print

    df.close()
-=-=-=-=-=-

Output from running above code:
-=-=-=-=-=-
Field name: 'Armor'             value: '15'
Field name: 'THAC0'             value: '19'
Field name: 'Level'             value: '09'
Field name: 'Class'             value: 'Ranger  '
Field name: 'Name'              value: 'Wulfraed       '

Field name: 'Armor'             value: '11'
Field name: 'THAC0'             value: '15'
Field name: 'Level'             value: '13'
Field name: 'Class'             value: 'Cleric  '
Field name: 'Name'              value: 'Bask Euren     '

Field name: 'Armor'             value: '08'
Field name: 'THAC0'             value: '09'
Field name: 'Level'             value: '07'
Field name: 'Class'             value: 'Mage    '
Field name: 'Name'              value: 'Aethelwulf     '

Field name: 'Armor'             value: '08'
Field name: 'THAC0'             value: '10'
Field name: 'Level'             value: '08'
Field name: 'Class'             value: 'Mage    '
Field name: 'Name'              value: 'Cwiculf        '
-=-=-=-=-=-

Note that string fields have not been trimmed, also numeric fields
are still in text format... The format definition file would need to be
expanded to include a "string", "integer", "float" (and "Boolean"?) code
in order for the extractor to do proper type conversions.
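
One possible shape for that extension, as a Python 3 sketch rather than a change to the code above. The `CONVERTERS` table, the four-element format tuples and the rule that a Boolean field is true when it holds "1" are all assumptions made for illustration.

```python
# Hypothetical type codes mapped to conversion functions.
CONVERTERS = {
    "string": str.strip,                          # trim the space padding
    "integer": int,
    "float": float,
    "boolean": lambda text: text.strip() == "1",  # assumed 0/1 coding
}

# (name, start, end, type code); positions are 0-based and inclusive.
FORMAT = [
    ("Name", 0, 14, "string"),
    ("Level", 15, 16, "integer"),
    ("Class", 17, 24, "string"),
    ("THAC0", 25, 26, "integer"),
    ("Armor", 27, 28, "integer"),
]


def extract_typed(line, fmt=FORMAT):
    """Slice each field out of a fixed-width line and convert it."""
    return {name: CONVERTERS[kind](line[start:end + 1])
            for name, start, end, kind in fmt}


# Build the first sample record with explicit padding so the widths are exact.
sample = "Wulfraed".ljust(15) + "09" + "Ranger".ljust(8) + "19" + "15"
row = extract_typed(sample)
print(row)
```

A field that cannot be converted (for example blanks in an integer column) raises ValueError here; real survey data would probably need a policy for missing values.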
 
Dennis Lee Bieber

If it isn't clear from the code -- the DATA file is SPACE FILLED,
but the DEFINITION file uses a TAB to separate the columns, not spaces.
 
javivd

Clearly it's working. Although this code is beyond my Python knowledge
(I don't get along with classes; maybe it's a good moment to learn
about them...), I'll dig into it.

Thanks a lot! It really helps...

J
 
