M
markspace
For some reason I was thinking about sub-string searches today. I
looked up Patricia tries (a kind of radix tree) to see if they would
help. While interesting, the radix tree seems to have a lot of overhead
for large numbers of entries.
The radix tree uses a bucket at each level to hold all children (and
there could be quite a lot of children). Each child if present requires
a pointer (an object in Java) to hold it. For the example given, this
could be as much as one object per character in each string, plus the
bucket to hold it and its siblings. If the number strings is very
large, this could really result in an explosion of memory usage.
<http://en.wikipedia.org/wiki/Radix_tree>
So is there a better way for large data sets?
For the example give, it appears to be a kind of incremental search.
Those, first we find the "r" which is common to all the words in the
example. Then if we're looking for, say, "romane" we'd find that the
"om" is common to all words that match. Then we find that "an" matches
both "romane" and "romanus". Finally we find the "e" which tells us
"romane" exist in the tree.
So I implemented a simple "incremental binary search". Given a string,
the search tells you if there's any elements that begin with the
parameter string. It also remembers the previous match, so that the
next search picks up from that location. "ro" could be matched next
along with "rom". If the search ever returns a non-match condition, you
know you've hit the maximum extent. There are no further matches,
regardless how many additional characters you append.
This seems useful for "short-circuiting" long string matches, allowing
you to pick out dictionary matches without inuring average n^2 time.
My question is: does this really buy me anything? Or did I just
duplicate some other well known algorithms that works just as well or
better?
/*
* Copyright 2012 Brenden Towey. All rights reserved.
*/
package com.techdarwinia.trie;
import java.io.FileReader;
import java.io.Reader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Scanner;
/**
*
* @author Brenden Towey
*/
public class IncrementalStringSearch
{
private int low;
private int high;
private String[] array;
public IncrementalStringSearch( String[] array )
{
this.array = array;
high = array.length - 1;
}
public boolean find( String findString )
{
if( low > high ) return false;
for( ;; ) {
int index = ( low + high ) / 2;
String foundSub = array[index];
if( array[index].length() > findString.length() )
foundSub = array[index].substring( 0, findString.length() );
else
foundSub = array[index];
if( foundSub.equals( findString ) )
return true;
if( foundSub.compareTo( findString ) > 0 ) {
high = index - 1;
if( high < low ) return false;
} else {
low = index + 1;
if( low > high ) return false;
}
}
}
public static void main( String[] args ) throws Exception
{
ArrayList<String> passwords = new ArrayList<>();
Reader ins = new FileReader( "test/com/techdarwinia/trie/bpw.txt" );
Scanner scanner = new Scanner( ins );
scanner.useDelimiter( ",?\\s+" );
while( scanner.hasNext() ) {
passwords.add( scanner.next() );
}
ins.close();
Collections.sort( passwords );
String test = "passxword ";
for( int i = 0; i < test.length(); i++ ) {
IncrementalStringSearch inc = new IncrementalStringSearch(
passwords.toArray( new String[0] ) );
for( int j = i + 1; j <= test.length(); j++ ) {
String string = test.substring( i, j );
if( !inc.find( string ) ){
if( j > i+1 && passwords.contains( test.substring( i,
j-1 ) ) ) {
System.out.println( "Found: "+test.substring( i, j-1 ) );
}
break;
}
}
}
}
}
looked up Patricia tries (a kind of radix tree) to see if they would
help. While interesting, the radix tree seems to have a lot of overhead
for large numbers of entries.
The radix tree uses a bucket at each level to hold all children (and
there could be quite a lot of children). Each child if present requires
a pointer (an object in Java) to hold it. For the example given, this
could be as much as one object per character in each string, plus the
bucket to hold it and its siblings. If the number strings is very
large, this could really result in an explosion of memory usage.
<http://en.wikipedia.org/wiki/Radix_tree>
So is there a better way for large data sets?
For the example give, it appears to be a kind of incremental search.
Those, first we find the "r" which is common to all the words in the
example. Then if we're looking for, say, "romane" we'd find that the
"om" is common to all words that match. Then we find that "an" matches
both "romane" and "romanus". Finally we find the "e" which tells us
"romane" exist in the tree.
So I implemented a simple "incremental binary search". Given a string,
the search tells you if there's any elements that begin with the
parameter string. It also remembers the previous match, so that the
next search picks up from that location. "ro" could be matched next
along with "rom". If the search ever returns a non-match condition, you
know you've hit the maximum extent. There are no further matches,
regardless how many additional characters you append.
This seems useful for "short-circuiting" long string matches, allowing
you to pick out dictionary matches without inuring average n^2 time.
My question is: does this really buy me anything? Or did I just
duplicate some other well known algorithms that works just as well or
better?
/*
* Copyright 2012 Brenden Towey. All rights reserved.
*/
package com.techdarwinia.trie;
import java.io.FileReader;
import java.io.Reader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Scanner;
/**
*
* @author Brenden Towey
*/
public class IncrementalStringSearch
{
private int low;
private int high;
private String[] array;
public IncrementalStringSearch( String[] array )
{
this.array = array;
high = array.length - 1;
}
public boolean find( String findString )
{
if( low > high ) return false;
for( ;; ) {
int index = ( low + high ) / 2;
String foundSub = array[index];
if( array[index].length() > findString.length() )
foundSub = array[index].substring( 0, findString.length() );
else
foundSub = array[index];
if( foundSub.equals( findString ) )
return true;
if( foundSub.compareTo( findString ) > 0 ) {
high = index - 1;
if( high < low ) return false;
} else {
low = index + 1;
if( low > high ) return false;
}
}
}
public static void main( String[] args ) throws Exception
{
ArrayList<String> passwords = new ArrayList<>();
Reader ins = new FileReader( "test/com/techdarwinia/trie/bpw.txt" );
Scanner scanner = new Scanner( ins );
scanner.useDelimiter( ",?\\s+" );
while( scanner.hasNext() ) {
passwords.add( scanner.next() );
}
ins.close();
Collections.sort( passwords );
String test = "passxword ";
for( int i = 0; i < test.length(); i++ ) {
IncrementalStringSearch inc = new IncrementalStringSearch(
passwords.toArray( new String[0] ) );
for( int j = i + 1; j <= test.length(); j++ ) {
String string = test.substring( i, j );
if( !inc.find( string ) ){
if( j > i+1 && passwords.contains( test.substring( i,
j-1 ) ) ) {
System.out.println( "Found: "+test.substring( i, j-1 ) );
}
break;
}
}
}
}
}