I'm a bit confused. Evaluating the levenshtein distance in the
classical method requires a matrix that you have to get memory for
from somewhere. Even if you have 'int c_levenshtein( const string&
s1, const string& s2 )' or some levenshtein member function, you still
have to provide memory for the matrix; it's not innate to 's1' and
's2'. I agree that 'pm_workspace' is a kind of small string
optimization for malloc that is not directly related to the algorithm,
but you still need to get the memory from somewhere. Can you
pseudocode your own version of levenshtein using the std::string
framework, so I can better understand where you're getting the memory
for the matrix from, and classify what parts of the function are
"high" and "low" level?
There are not just two levels, "high" and "low"; there is a potentially
open-ended hierarchy of levels. I am very much oriented to the actual working
code, so for me the notions "lower" and "higher" just mean the order in which
dynamic-link libraries are loaded into the process space, assuming that each
feature is implemented in its own dynamic-link library.
And I said the technique just reminded me of the short-string optimization
in C++, not that it would be applicable to this very function. Anyway,
here is the translation of the levenshtein function into C++:
\code
int c_levenshtein( const std::string& s1, const std::string& s2 )
{
    // Note: the try-catch is probably unneeded here; it is just to
    // replicate the way the original function reports any error
    // by returning a non-informative -1 error code.
    try {
        /* If one of the strings is empty "", the edit distance is
           equal to the length of the non-empty string. */
        if (s1.empty() || s2.empty()) {
            return s1.length() + s2.length();
        }
        int m = s1.length() + 1;
        int n = s2.length() + 1;
        std::vector<int> proximity_matrix(n*m);
        gc_compute_levenshtein_matrix(
            s1.c_str(), s2.c_str(), &m, &n, &proximity_matrix[0] );
        return proximity_matrix[m*n - 1];
    } catch(...) {
        return -1;
    }
}
\endcode
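To make the memory question fully explicit: here is a self-contained sketch of
what the matrix fill inside gc_compute_levenshtein_matrix amounts to, written
out inline (the function name levenshtein_inline and the loop structure are my
illustration, not the original C code). The only place memory is obtained is
the std::vector construction; everything else operates on that one buffer.

\code
#include <algorithm>
#include <string>
#include <vector>

// Sketch: the classical (m+1) x (n+1) dynamic-programming fill,
// with the matrix stored row-major in a single std::vector.
int levenshtein_inline( const std::string& s1, const std::string& s2 )
{
    const int m = static_cast<int>(s1.length()) + 1;
    const int n = static_cast<int>(s2.length()) + 1;

    // The one and only allocation: the whole matrix, zero-initialized.
    std::vector<int> d(static_cast<std::size_t>(m) * n);

    for (int i = 0; i < m; ++i) d[i * n] = i;   // first column: delete all of s1
    for (int j = 0; j < n; ++j) d[j] = j;       // first row: insert all of s2

    for (int i = 1; i < m; ++i) {
        for (int j = 1; j < n; ++j) {
            const int cost = (s1[i - 1] == s2[j - 1]) ? 0 : 1;
            d[i * n + j] = std::min({
                d[(i - 1) * n + j] + 1,          // deletion
                d[i * n + (j - 1)] + 1,          // insertion
                d[(i - 1) * n + (j - 1)] + cost  // substitution
            });
        }
    }
    return d[m * n - 1];  // bottom-right cell is the edit distance
}
\endcode

So the "high-level" part is the std::string/std::vector interface, and the
"low-level" part is the index arithmetic into the flat buffer, which is the
same in C and C++.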
The intermediate array here uses std::vector instead of std::string, and I
have not heard of any implementation of std::vector that applies any kind of
"small-string" optimization. So this C++ version probably involves a dynamic
memory allocation even for short input strings, and so probably runs slower
than the C version in that case.
Making it faster would take more work, and I would not be convinced that this
is needed unless a profiler told me so. On the other hand, if more work were
needed, I could encapsulate it in a class and just replace the name
std::vector with my class name.
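For illustration, such a replacement class (the name small_int_buffer and the
capacity of 256 are my own choices, not anything from the original code) might
keep a fixed on-stack array for small matrices and fall back to the heap only
for larger ones, in the spirit of the short-string optimization:

\code
#include <cstddef>

// Sketch of a std::vector<int> stand-in with a stack buffer for small sizes.
class small_int_buffer
{
public:
    explicit small_int_buffer( std::size_t size )
        : m_size(size),
          m_data(size <= StackCapacity ? m_stack : new int[size])
    {}

    ~small_int_buffer()
    {
        if (m_data != m_stack)
            delete[] m_data;   // free only if we spilled to the heap
    }

    // Non-copyable, to keep the ownership story trivial for the sketch.
    small_int_buffer( const small_int_buffer& ) = delete;
    small_int_buffer& operator=( const small_int_buffer& ) = delete;

    int&        operator[]( std::size_t i )       { return m_data[i]; }
    const int&  operator[]( std::size_t i ) const { return m_data[i]; }
    std::size_t size() const                      { return m_size; }

private:
    static const std::size_t StackCapacity = 256;  // tuning parameter

    std::size_t m_size;
    int         m_stack[StackCapacity];  // used when size <= StackCapacity
    int*        m_data;                  // points at m_stack or at heap memory
};
\endcode

With that, short input strings would cost no heap allocation at all, which is
exactly the effect pm_workspace has in the C version.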