strange array size problem

chip9munk

Hello everybody!

One strange problem, please help!

I have the following 2D array: users_elements_matrix
numpy.shape(users_elements_matrix) is (100,43)

and array merged_binary_ratings
numpy.shape(merged_binary_ratings) is (100,)

Now, when I run:
numpy.linalg.lstsq(users_elements_matrix, merged_binary_ratings)
I get some ridiculous numbers for the coefficients: they are all the same, 1.38946385e+15.

What is really strange is that if I run
numpy.linalg.lstsq(users_elements_matrix[:,0:42], merged_binary_ratings)
I get OK numbers.

I tested several things and have examined the matrix; everything is OK with the data.

How is it possible that one additional column (a variable in the linear regression)
has such a strange impact?!

I am losing my mind here, please help!

Thanks!
 
chip9munk

One more thing.

The problem is not in the last column: if I use it in a regression (only that column, or with a few others), I get results. But if I use all 43 columns, Python breaks!

whhhyyyy?!?!?!

thanks!
 
Robert Kern

The numpy-discussion mailing list is probably the best place to ask. I recommend
posting a complete working example (with data) that demonstrates the problem.
Use pastebin.com or a similar service if necessary.

http://www.scipy.org/scipylib/mailing-lists.html

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 
Oscar Benjamin

Have you tried testing the rank with numpy.linalg.matrix_rank? I'm
guessing that the extra column makes the matrix rank deficient (up to
floating point error).


Oscar
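
The rank check suggested above can be sketched on made-up data. The arrays below are hypothetical stand-ins, not the poster's users_elements_matrix: three independent columns plus a fourth that is deliberately built as the sum of the first two.

```python
import numpy as np

# Hypothetical data: 100 rows, 3 independent integer-valued columns.
rng = np.random.default_rng(0)
A = rng.integers(0, 5, size=(100, 3)).astype(float)

# Append a column that is an exact linear combination of existing ones.
dependent = A[:, 0] + A[:, 1]
A_full = np.column_stack([A, dependent])

rank_without = np.linalg.matrix_rank(A)       # rank of the first 3 columns
rank_with = np.linalg.matrix_rank(A_full)     # rank after adding the 4th

print(A_full.shape, rank_without, rank_with)
```

If the rank does not go up when a column is added, that column carries no new information for the regression.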
 
chip9munk

Interesting!
The rank of the whole matrix minus the last column:
numpy.linalg.matrix_rank(users_elements_matrix[:,0:42]) is 42

but the rank of the whole matrix is also 42:
numpy.linalg.matrix_rank(users_elements_matrix[:,0:43]) is 42

But what does that mean?!

Could you briefly explain what to do now?

thank you!
 
Oscar Benjamin

It means that the additional column is a linear combination of the
existing columns. This means that your system of equations can contain
a contradiction. Essentially you're trying to get the least squares
solution to something like:

3*x + y = 1
1*x + 2*y = 4
1*x + 2*y = 5 # Contradicts the equation above

Because of floating point error it isn't *exactly* a contradiction so
you get silly values instead of an error.

http://en.wikipedia.org/wiki/Rank_(linear_algebra)


Oscar
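
One way to see why a dependent column can lead to silly coefficients is that the coefficients stop being unique. The sketch below uses made-up data (the names X, coef_a, coef_b and null_vec are illustrative only): when the third column equals the sum of the first two, shifting the coefficients along [1, 1, -1] by any amount leaves the predictions unchanged, so very large coefficients fit exactly as well as small ones.

```python
import numpy as np

# Hypothetical design matrix whose third column is col0 + col1.
rng = np.random.default_rng(1)
X = rng.integers(0, 5, size=(50, 2)).astype(float)
X = np.column_stack([X, X[:, 0] + X[:, 1]])

# Because col0 + col1 - col2 == 0, this vector is mapped to zero by X.
null_vec = np.array([1.0, 1.0, -1.0])

coef_a = np.array([1.0, 2.0, 0.0])      # a modest coefficient vector
coef_b = coef_a + 1e6 * null_vec        # a wildly different one

pred_a = X @ coef_a
pred_b = X @ coef_b

# Both coefficient vectors produce the same predictions.
print(np.allclose(pred_a, pred_b))
```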
 
chip9munk

I get it. Given the type of data, this is possible, and I am sure
that is the issue.
Now I have to figure out how to solve it, but at least you have
identified the problem for me.

Thank you very much for the clear and prompt solution!

Best
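
The thread does not say how the problem was eventually solved. As one possible way forward, here is a minimal sketch on made-up data, assuming a reasonably recent NumPy in which numpy.linalg.lstsq accepts rcond=None. With that setting, singular values that are negligible relative to the largest one are treated as zero, and the function also returns the effective rank, which can be compared with the number of columns.

```python
import numpy as np

# Hypothetical data: 3 independent columns plus one dependent column,
# and a made-up binary target (not the poster's actual arrays).
rng = np.random.default_rng(2)
X = rng.integers(0, 5, size=(100, 3)).astype(float)
X = np.column_stack([X, X[:, 0] + X[:, 1]])
y = rng.integers(0, 2, size=100).astype(float)

# rcond=None asks NumPy to discard negligible singular values.
coef, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)

print("columns:", X.shape[1], "effective rank:", rank)
print("largest |coefficient|:", np.abs(coef).max())
```

An alternative, if the redundant variable can be identified from how the data was built, is simply to leave that column out of the regression.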
 
