chi squared (X2) in Python

ts8807385

I was wondering if anyone has done this in Python. I wrote two
functions that do it (I think... see below), but I do not understand
how to interpret the results. As an experiment, I'm implementing ent
in Python. ent tests the randomness of files, and chi squared is
probably the best of its tests for this purpose.
Many of the other statistical tests are easy (arithmetic mean, etc.) and
I have no problem interpreting the results from those, but chi
squared has stumped me. Here are my two simple functions; run them if
you'd like to better understand the output:

import os
import os.path

def observed(f):
    # Argument f is a filepath/filename.
    #
    # Return a list of the observed byte values as integers.
    # Each value may be 0 through 255, e.g. [43, 54, 0, 255, 4, ...].
    # Returns None if fewer than 13312 bytes could be read.

    # Read the first 13312 bytes in binary mode.
    fd = open(f, 'rb')
    data = fd.read(13312)
    fd.close()

    # bytearray yields integers on both Python 2.6+ and Python 3,
    # so no per-byte ord() call is needed.
    chars = list(bytearray(data))

    if len(chars) != 13312:
        print("Wait... chars does not equal 13312 in observed!!!")
        return None
    else:
        return chars

def chi(char_list):
    # Expected frequency of each byte value. I arrived at this like so:
    #   expected = number of observations / number of possibilities
    #   52 = 13312 / 256
    expected = 52.0

    print("observed\texpected\tx2")

    # Byte values 0 - 255
    for x in range(0, 256):
        # Count how many times byte value x was observed.
        observed_count = 0
        for char in char_list:
            if x == char:
                observed_count += 1

        # The three chi squared calculations:
        #   one = observed - expected
        #   two = one squared
        #   x2  = two / expected
        #
        #        (observed - expected) squared
        #   x2 = -----------------------------
        #                  expected
        one = observed_count - expected
        two = one * one
        x2 = two / expected

        print("%d\t%s\t%s" % (observed_count, expected, x2))


chi(observed("filepath"))

The output looks similar to this:

observed expected x2
62 52.0 1.92307692308
46 52.0 0.692307692308
60 52.0 1.23076923077
68 52.0 4.92307692308
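
Each row above can be checked by hand from the formula in the code comments. For the first row, (62 - 52) squared is 100, and 100 / 52 is about 1.923. The short sketch below recomputes the x2 column for the four sample rows; the names EXPECTED and contribution are made up for illustration and are not part of the functions above.

```python
# Recompute (observed - expected)^2 / expected for the four sample rows.
# EXPECTED and contribution() are illustrative names, not from the post's code.
EXPECTED = 13312 / 256.0  # 52.0 expected occurrences per byte value


def contribution(observed_count, expected=EXPECTED):
    """Return one byte value's contribution to the chi squared sum."""
    diff = observed_count - expected
    return diff * diff / expected


sample_counts = [62, 46, 60, 68]  # the observed column from the output above
for count in sample_counts:
    print("%d\t%.1f\t%.12g" % (count, EXPECTED, contribution(count)))
```

A count exactly equal to the expected 52 contributes 0, and the contribution grows with the square of the distance from 52, which is why 68 (16 away) contributes far more than 60 (8 away).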

I know this is a bit off-topic here; I'm just hoping someone could help me
interpret the x2 variable. After that, I'll be OK. I still need to sum the
per-value results to get an overall x2 for the bytes I've read, but before
doing that, I wanted to post this note. Please feel free to comment on any
aspect of this. If I've got something entirely wrong, let me know.
BTW, I selected 13 KB (13,312 bytes) because it seems efficient and a decent
size to test; the data could be any amount above this (up to and including
the whole file).
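
The summing step mentioned above could be sketched roughly as follows. This is only a minimal sketch, not ent's actual implementation: the function name chi_squared_total and the two test inputs are hypothetical, and it assumes every one of the 256 byte values is equally likely, the same assumption behind the expected value of 52 in chi().

```python
from collections import Counter


def chi_squared_total(data, possibilities=256):
    """Sum (observed - expected)^2 / expected over every possible byte value.

    Hypothetical helper: assumes all byte values are equally likely.
    """
    values = bytearray(data)  # yields ints 0..255 on Python 2.7 and 3
    if not values:
        raise ValueError("need at least one byte")
    expected = len(values) / float(possibilities)
    counts = Counter(values)  # single pass; values never seen count as 0
    total = 0.0
    for value in range(possibilities):
        diff = counts[value] - expected
        total += diff * diff / expected
    return total


# Two made-up extremes, each 13312 bytes long.
uniform = bytearray(range(256)) * 52   # every value appears exactly 52 times
constant = bytearray(13312)            # nothing but zero bytes

print(chi_squared_total(uniform))
print(chi_squared_total(constant))
```

With 256 possible values the total is normally compared against a chi squared distribution with 255 degrees of freedom (256 minus one). That distribution has a mean of 255, so a total in that neighbourhood is what uniformly random bytes would tend to produce, while the perfectly even input scores 0 and the all-zero input scores in the millions.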

Thanks,

Tiff
 
