ts8807385
I was wondering if anyone has done a chi-squared test in Python. I wrote two
functions that do it (I think... see below), but I do not understand
how to interpret the results. As an experiment, I'm implementing ent
in Python. ent tests the randomness of files, and chi squared is
probably the best test for this purpose compared to the other tests.
Many of the statistical tests are easy (like Arithmetic Mean, etc.) and
I have no problems interpreting the results from those, but chi
squared has stumped me. Here are my two simple functions; run them if
you'd like to better understand the output:

import os
import os.path

def observed(f):
    # argument f is a filepath/filename
    #
    # Return a list of observed characters in decimal ord(char).
    # Decimal value of characters may be 0 through 255.
    # [43, 54, 0, 255, 4, etc.]
    chars = []
    #print f
    fd = open(f, 'rb')
    bytes = fd.read(13312)
    fd.close()
    for byte in bytes:
        chars.append(ord(byte))
    #print chars
    if len(chars) != 13312:
        print "Wait... chars does not equal 13312 in observed!!!"
        return None
    else:
        return chars

def chi(char_list):
    # Expected frequency of characters. I arrived at this like so:
    # expected = number of observations/number of possibilities
    # 52 = 13312/256
    expected = 52.0
    print "observed\texpected\tx2"
    # 0 - 255
    for x in range(0,256):
        observed = 0
        for char in char_list:
            if x == char:
                observed +=1
        # The three chi squared calculations
        # one = observed - expected
        # two = one squared
        # x2 = two/expected
        #
        #      (observed - expected) squared
        # x2 = -----------------------------
        #                expected
        one = observed - expected
        two = one * one
        x2 = two/expected
        print observed, "\t", expected, "\t", x2

chi(observed("filepath"))

The output looks similar to this:
observed expected x2
62 52.0 1.92307692308
46 52.0 0.692307692308
60 52.0 1.23076923077
68 52.0 4.92307692308
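
As a quick check of the arithmetic in the rows above, here is a minimal sketch that recomputes the per-value term (observed - expected) squared over expected for the four sample counts. It is written in Python 3 syntax, which is an assumption (the functions above use Python 2 print statements), and the helper name cell_x2 is made up for illustration.

```python
def cell_x2(observed, expected):
    """Contribution of one byte value: (observed - expected)^2 / expected."""
    diff = observed - expected
    return diff * diff / expected

# Observed counts copied from the sample output; expected is 13312 / 256.
EXPECTED = 13312 / 256.0

for count in (62, 46, 60, 68):
    print(count, EXPECTED, cell_x2(count, EXPECTED))
```

Each printed x2 is only one byte value's share of the statistic. On its own it says little; it is the sum over all 256 values that gets compared against the chi-squared distribution.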
I know this is a bit off-topic here, just hoping someone could help me
interpret the x2 variable. After that, I'll be OK. I need to sum up
things to get an overall x2 for the bytes I've read, but before doing
that, I wanted to post this note. Please feel free to comment on any
aspect of this. If I've got something entirely wrong, let me know.
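
For the summing step, a minimal sketch under stated assumptions: the overall statistic is the sum of the 256 per-value terms, and with 256 possible byte values it has 255 degrees of freedom. A chi-squared distribution with 255 degrees of freedom has mean 255 and standard deviation sqrt(2 * 255), roughly 22.6, so a total near 255 is what uniformly random bytes would tend to give. The function names are hypothetical, and the tail probability uses the Wilson-Hilferty normal approximation rather than an exact calculation, so treat the p-value as approximate.

```python
import math

def total_x2(counts):
    """Sum (observed - expected)^2 / expected over all categories.

    counts holds one observed count per possible value (256 for bytes).
    """
    n = sum(counts)
    expected = n / float(len(counts))
    return sum((c - expected) ** 2 / expected for c in counts)

def upper_tail_p(x2, df):
    """Approximate probability that a chi-squared variable with df degrees
    of freedom is at least x2 (Wilson-Hilferty approximation, an assumption
    of this sketch; a statistics library would give an exact value)."""
    if x2 <= 0:
        return 1.0
    k = 2.0 / (9.0 * df)
    z = ((x2 / df) ** (1.0 / 3.0) - (1.0 - k)) / math.sqrt(k)
    # Upper tail of the standard normal distribution at z.
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Hypothetical counts: perfectly uniform except for two byte values.
counts = [52] * 256
counts[0] += 16
counts[1] -= 16
x2 = total_x2(counts)
print(x2, upper_tail_p(x2, len(counts) - 1))
```

One common reading, offered as a rule of thumb rather than a fixed standard: a p-value very close to 0 means the byte counts deviate from uniform more than chance would explain, and a p-value very close to 1 means they are suspiciously even. Values in the broad middle are consistent with random data.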
BTW, I selected 13 KB (13,312 bytes) as it seems to be efficient and a
decent size to test; the data could be any amount above this (up to and
including the whole file).
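
Since the sample size is a free choice, here is a hedged sketch that derives the expected count from however many bytes were actually read instead of hard-coding 52.0. It also counts all 256 values in a single pass rather than scanning the list once per value. The names byte_counts and chi_squared are hypothetical, and the code is Python 3 syntax (an assumption, though it avoids anything that would break on Python 2.6 or later).

```python
def byte_counts(data):
    """Return a list of 256 counts, one per possible byte value."""
    counts = [0] * 256
    # bytearray yields integers 0..255 on both Python 2 and Python 3.
    for value in bytearray(data):
        counts[value] += 1
    return counts

def chi_squared(data):
    """Overall chi-squared statistic of data against a uniform byte
    distribution, with expected = len(data) / 256."""
    if len(data) == 0:
        raise ValueError("need at least one byte")
    expected = len(data) / 256.0
    return sum((c - expected) ** 2 / expected for c in byte_counts(data))

# Usage with an in-memory sample; for a file, read it in 'rb' mode first.
sample = bytearray(range(256)) * 52   # 13312 bytes, every value 52 times
print(len(sample), chi_squared(sample))
```

A frequently quoted guideline for chi-squared tests is an expected count of at least 5 per category, which for 256 byte values means roughly 1,280 bytes or more. A 13,312-byte sample, with an expected count of 52, is comfortably above that.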
Thanks,
Tiff