-------- Original-Nachricht --------
Datum: Wed, 14 May 2008 18:52:08 +0900
Von: Calamitas <
[email protected]>
An: (e-mail address removed)
Betreff: Re: Gecoder syntax question
Gecoder to formulate the problem ?
Dear Peter and Andreas,
thank you for responding.
Gecode/R does not handle float values in constraints.
I somehow previously suspected that but I was afraid I might
have overlooked something. Thanks for the information.
First off, your optimization problem seems to make little sense to me
the way you specified it.
Well, as I asked a syntax question, I didn't want to give a lengthy description of the background of the problem I'm trying to solve. So the
motivation might not be obvious. Here it is.
x and y should be vectors counting occurrences of words in a sentence
or a paragraph from a text. A is a matrix made from an entire given text using a procedure that can be considered a variant of Latent Semantic Analysis,
http://en.wikipedia.org/wiki/Latent_semantic_analysis
and the goal is to find (a set of) synonyms/paraphrases/explanations (=y) of the meaning of a given sentence/paragraph (=x) - here I am not concerned with the order of the words - so I abusively treat a sentence/paragraph and its word count vector as the same thing. A synonym/paraphrase/explanation cannot use the exact same words as what it is supposed to explain, so I cannot just solve a matrix equation to retrieve x=y from Ax=Ay.
The matrix A will always be singular ("more words than concepts" - see below), but as it is a linear mapping between finite-dimensional
vector spaces, it is continuous and this:
If A is singular, the optimization problem is
unbound and thus there is no maximum for ||x-y|| for a fixed value of
||A(x-y)||.
cannot happen.
A is in fact a matrix that comes about using an SVD with some predefined
maximum amount of non-zero eigenvalues and associated eigenvectors, where the latter can be interpreted as different concepts, and Ax is then a vector expressing how much sentence x was about
each of the concepts talked about in the text x is taken from.
Maybe you should quantify what 'mapped closely' means,
like ||A(x-y)|| <= 1.
Of course, I'll introduce a quantification, but I am not sure yet whether
I should give some hard bound right away or search for the so-and-so-many
y vectors that come closest in some iterative form ...
But I don't think Gecode/R can handle floats, or at least not for the
unknowns. It seems to be a finite domain solver.
As Andreas said, constraints cannot be floats. This can be helped by
multiplying with some constant and cutting off.
The unknowns (entries of y) and the entries of x are integers anyway - as I wrote in the original post.
If you only care
about say 3 digits after the comma, you could scale all values up by
1000 and round to integers. My gut feeling is that this can easily
give results that are way off, and not only for ill-conditioned
problems.
I share that fear, too.
Mmh, however, well here, the multiplication and cutting off would take place after the SVD, and I don't want to use the Euclidean norm, because it introduces numerical instability by minimizing one or a few large
squared sums in exchange for maximizing many small squared sums in the
components.
But your problem seems numerical in nature. I'm inclined to solve it
by using the singular value decomposition to find the (or a) direction
in which A scales the least. Using vector d to denote this direction,
you can pick y = x + a * d, and that should give you the largest value
for ||x-y|| for the specific value of ||A(x-y)||. Both norms above can
be expressed as linear functions of a which is the only variable left,
and you can easily pick the a you want.
If A is singular, the smallest
singular value will be 0 and you can pick a as large as you want and
|| A(x-y) || will still be 0. Of course, I'm assuming here that "a
norm" means Euclidian norm.
As I said above, the x and y are discrete, and as a matter of fact, I wanted Gecoder to do the searching here - there
might only be 1 or 2 more or less occurrences of each word in a paraphrase,
but a good paraphrase might actually render a set of words by another
one having differences in more than one direction, like in these bits
"George W. Bush"
"the current president of the U.S."
"Mr. Clinton's successor"
all having the same meaning in a newspaper article, say.
So thank you again for your kind suggestions, I'll try some multiplication
and cutting-off.
Best regards,
Axel