Algorithm considerations

I

Ike

I am trying to determine an efficient algorithm to discern if a preposition
occurs in a sentence. Prepositions are runs of given words which are
anywhere from one to three words long. I already have a list of these
propositions (for example, 'with' is in the list as a preposition, as well
as 'with regards to').

I want to find if a sentence contains one of these propositions, and if it
does, use the longest preposition (in other words, if 'with regards to'
appears in the sentence, I want that to be the preposition that is chosen
there, not 'with').

The only caveat is that the sentence itself is already parsed, word-by-word,
into an array of Strings. In other words "The cat is in the hat" is a
String[0]="The", String[1]="cat"...String[5]="hat".

As for how I store my list of prepositions themselves, I can store them in
any type of collection(s). What I am trying to discern is what is an
efficient means for finding what preopositions occur in this String[],
taking into account that if more than one preposition traverses the same
word, the longest preposition is selected.

Does anyone have any ideas on what might be an efficient means to do this?
Thanks, Ike
 
D

Daniel Pitts

I am trying to determine an efficient algorithm to discern if a preposition
occurs in a sentence. Prepositions are runs of given words which are
anywhere from one to three words long. I already have a list of these
propositions (for example, 'with' is in the list as a preposition, as well
as 'with regards to').

I want to find if a sentence contains one of these propositions, and if it
does, use the longest preposition (in other words, if 'with regards to'
appears in the sentence, I want that to be the preposition that is chosen
there, not 'with').

The only caveat is that the sentence itself is already parsed, word-by-word,
into an array of Strings. In other words "The cat is in the hat" is a
String[0]="The", String[1]="cat"...String[5]="hat".

As for how I store my list of prepositions themselves, I can store them in
any type of collection(s). What I am trying to discern is what is an
efficient means for finding what preopositions occur in this String[],
taking into account that if more than one preposition traverses the same
word, the longest preposition is selected.

Does anyone have any ideas on what might be an efficient means to do this?
Thanks, Ike

Considering your input size seems small enough, I would start with the
brute force approach.
In other words:

for every position in my word array
find the longest proposition at that location.

This is an O(n^2) algorithm, but since sentences are generally < 10
words, it should be good enough.

Only after you make your application do what you want should you worry
about efficiency. If you find it is running too slow, use a profiler
to find out why and where.

If it happens to be that logic above, I would suggest using a
deterministic-finate-state-automata (DFA) It will run in linear time,
but its more complex to implement.
Basically, you'd be implementing a special-case regexp with tokens
that are words instead of letters.
 
E

Eric Sosman

Daniel Pitts wrote On 02/26/07 13:37,:
[...]

This is an O(n^2) algorithm, but since sentences are generally < 10
words, it should be good enough.

At least fifteen words, more depending on how you count
the mathematical notations.
Only after you make your application do what you want should you worry
about efficiency.

Fifteen words.
If you find it is running too slow, use a profiler
to find out why and where.

Seventeen words.
If it happens to be that logic above, I would suggest using a
deterministic-finate-state-automata (DFA)

At least sixteen words.
It will run in linear time,
but its more complex to implement.

Twelve words.
Basically, you'd be implementing a special-case regexp with tokens
that are words instead of letters.

Fifteen or sixteen words.

Total: Ninety or more words in six sentences, for an
average length of at least fifteen words per sentence. An
O(n^2) algorithm will therefore run about 125% longer than
you might have anticipated. Do not, on pain of tedium, use
this method on nineteenth-century German philosophical
treatises! ;-)
 
D

Daniel Pitts

Daniel Pitts wrote On 02/26/07 13:37,:
This is an O(n^2) algorithm, but since sentences are generally < 10
words, it should be good enough.

At least fifteen words, more depending on how you count
the mathematical notations.
Only after you make your application do what you want should you worry
about efficiency.

Fifteen words.
If you find it is running too slow, use a profiler
to find out why and where.

Seventeen words.
If it happens to be that logic above, I would suggest using a
deterministic-finate-state-automata (DFA)

At least sixteen words.
It will run in linear time,
but its more complex to implement.

Twelve words.
Basically, you'd be implementing a special-case regexp with tokens
that are words instead of letters.

Fifteen or sixteen words.

Total: Ninety or more words in six sentences, for an
average length of at least fifteen words per sentence. An
O(n^2) algorithm will therefore run about 125% longer than
you might have anticipated. Do not, on pain of tedium, use
this method on nineteenth-century German philosophical
treatises! ;-)

My estimates were off, but my advice is sound.
 
L

Lew

Ike said:
(for example, 'with' is in the list as a preposition, as well
as 'with regards to')


I would not diagram "with regards [sic] to" as a preposition. I wouldn't even
diagram "with regard to" as a preposition. A prepositional phrase, perhaps.

There are two prepositions in that snippet, "with" and "to". The object of
"with" is "regard", and the object of "to" is some noun that will follow.

- Lew
 
I

Ike

I took Daniel's advice -- and it works fine. Considering there are, at the
outside, 150-160 preposition(al phrases of 1-3 words). Code enclosed in case
someone upon searching this archive has a similar problem:

/** typically you would set startpoint to 0, but I have amended the
* code here such that you can start at any arbitrary point in the sentence
*/
private void startwithprep(String [] ta, int startpoint){
StringBuffer sb=null;
sb=new StringBuffer();
String tester=null;
for(int i=0;i<3;i++){
if(ta.length>startpoint+i){
sb.append(ta[startpoint+i]);
tester=sb.toString().toLowerCase();
if(prepositions.contains(tester)){ //prepositions is an
ArrayList of Strings
//here's the match, do as you like
}
sb.append(" ");
}
}
}
 
M

Mark Jeffcoat

Considering your input size seems small enough, I would start with the
brute force approach.
In other words:

for every position in my word array
find the longest proposition at that location.

This is an O(n^2) algorithm, but since sentences are generally < 10
words, it should be good enough.

Since we have a list of all the possible prepositions, and
the longest preposition is of (small) constant length, this
is a O(n) algorithm.

You're not going to be able to beat that for asymptotic
efficiency. [Bad proof: someone might actually end a sentence
with a preposition].


As far as actual efficiency is concerned, I'd go with Daniel's
DFA idea.
 
M

Mark Jeffcoat

Ike said:
/** typically you would set startpoint to 0, but I have amended the
* code here such that you can start at any arbitrary point in the sentence
*/
private void startwithprep(String [] ta, int startpoint){
StringBuffer sb=null;
sb=new StringBuffer();
String tester=null;
for(int i=0;i<3;i++){
if(ta.length>startpoint+i){
sb.append(ta[startpoint+i]);
tester=sb.toString().toLowerCase();
if(prepositions.contains(tester)){ //prepositions is an
ArrayList of Strings
//here's the match, do as you like
}
sb.append(" ");
}
}
}

Cool. Can you make your preposition list available? Your
problem is likely solved, but I'd be interested looking at
the performance numbers for a couple of different approachs;
I'm curious about how much difference micro-optimizations
could really make in a case like this.


On a tangent, I'd always write this function as returning a
String, returning where you have "//here's the match" (and
probably null on no match.) I think that style gives you
better functional decomposition, and makes it much simpler
to test startWithPrep in isolation.

(If pressed further, I'd mumble something about Bertrand
Meyer's Command/Query separation, and hint that I'd maybe
read his 1300 page book on OOP, even though that is, in
fact, quite false.)
 
I

Ike

----- Original Message -----
From: "Mark Jeffcoat" <[email protected]>
Newsgroups: comp.lang.java.programmer
Sent: Monday, February 26, 2007 8:46 PM
Subject: Re: Algorithm considerations
Cool. Can you make your preposition list available? Your
problem is likely solved, but I'd be interested looking at
the performance numbers for a couple of different approachs;
I'm curious about how much difference micro-optimizations
could really make in a case like this.

I"VE ENCLOSED IT AT THE BOTTOM AS SQL COMMANDS IN CASE SOMEONE WANTS TO USE
IT IN MYSQL AS I AM DOING.
On a tangent, I'd always write this function as returning a
String, returning where you have "//here's the match" (and
probably null on no match.) I think that style gives you
better functional decomposition, and makes it much simpler
to test startWithPrep in isolation.

I"M ACTUALLY USING IT TO RETAG PARTS-OF-SPEECH, WITH BRILL POS TAGS. SO IN
MY INSTANTIATION OF IT, I AM PASSING THE FUNCTION ANOTHER ARRAY WHICH IS THE
TAGS CORRESPONDING TO THE WORDS, AND SWAPPING THE TAGS WITH 'IN' IN LIEU OF
WHAT THEY ARE (UNLESS THEY ARE ALREADY TAGGED 'CS').

HERE'S MY LIST. I AM NOT INCLUDING POSTPREPOSITIONS (LIKE 'ago') .
ADDITIONALLY, MY alsoCS FIELD MEANS THAT THIS WORD IS ALSO OFTEN USED AS A
SUBORDINATING CONJUNCTION:

# --------------------------------------------------------

#
# Table structure for table `prepositions`
#

CREATE TABLE `prepositions` (
`id` int(3) NOT NULL auto_increment,
`preposition` varchar(40) NOT NULL default '',
`alsoCS` int(3) NOT NULL default '0',
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

#
# Dumping data for table `prepositions`
#

INSERT INTO `prepositions` VALUES (1, 'abaft', 0);
INSERT INTO `prepositions` VALUES (2, 'about', 0);
INSERT INTO `prepositions` VALUES (3, 'above', 0);
INSERT INTO `prepositions` VALUES (4, 'according to', 0);
INSERT INTO `prepositions` VALUES (5, 'across', 0);
INSERT INTO `prepositions` VALUES (6, 'after', 1);
INSERT INTO `prepositions` VALUES (7, 'against', 0);
INSERT INTO `prepositions` VALUES (8, 'ahead of', 0);
INSERT INTO `prepositions` VALUES (9, 'along', 0);
INSERT INTO `prepositions` VALUES (10, 'along with', 0);
INSERT INTO `prepositions` VALUES (11, 'alongside', 0);
INSERT INTO `prepositions` VALUES (12, 'amid', 0);
INSERT INTO `prepositions` VALUES (13, 'among', 0);
INSERT INTO `prepositions` VALUES (14, 'amongst', 0);
INSERT INTO `prepositions` VALUES (15, 'apart from', 0);
INSERT INTO `prepositions` VALUES (16, 'around', 0);
INSERT INTO `prepositions` VALUES (17, 'as', 1);
INSERT INTO `prepositions` VALUES (18, 'as far as', 0);
INSERT INTO `prepositions` VALUES (19, 'as well as', 0);
INSERT INTO `prepositions` VALUES (20, 'at', 0);
INSERT INTO `prepositions` VALUES (21, 'back of', 0);
INSERT INTO `prepositions` VALUES (22, 'before', 1);
INSERT INTO `prepositions` VALUES (23, 'behind', 0);
INSERT INTO `prepositions` VALUES (24, 'below', 0);
INSERT INTO `prepositions` VALUES (25, 'beneath', 0);
INSERT INTO `prepositions` VALUES (26, 'beside', 0);
INSERT INTO `prepositions` VALUES (27, 'between', 0);
INSERT INTO `prepositions` VALUES (28, 'beyond', 0);
INSERT INTO `prepositions` VALUES (29, 'but', 0);
INSERT INTO `prepositions` VALUES (30, 'by', 0);
INSERT INTO `prepositions` VALUES (31, 'concerning', 0);
INSERT INTO `prepositions` VALUES (32, 'contrary to', 0);
INSERT INTO `prepositions` VALUES (33, 'despite', 0);
INSERT INTO `prepositions` VALUES (34, 'down', 0);
INSERT INTO `prepositions` VALUES (35, 'during', 0);
INSERT INTO `prepositions` VALUES (36, 'except', 0);
INSERT INTO `prepositions` VALUES (37, 'excepting', 0);
INSERT INTO `prepositions` VALUES (38, 'for', 0);
INSERT INTO `prepositions` VALUES (39, 'from', 0);
INSERT INTO `prepositions` VALUES (40, 'in', 0);
INSERT INTO `prepositions` VALUES (41, 'in addition to', 0);
INSERT INTO `prepositions` VALUES (42, 'in back of', 0);
INSERT INTO `prepositions` VALUES (43, 'in front of', 0);
INSERT INTO `prepositions` VALUES (44, 'in lieu of', 0);
INSERT INTO `prepositions` VALUES (45, 'in place of', 0);
INSERT INTO `prepositions` VALUES (46, 'in regard to', 0);
INSERT INTO `prepositions` VALUES (47, 'in spite of', 0);
INSERT INTO `prepositions` VALUES (48, 'in view of', 0);
INSERT INTO `prepositions` VALUES (49, 'inside', 0);
INSERT INTO `prepositions` VALUES (50, 'instead of', 0);
INSERT INTO `prepositions` VALUES (51, 'into', 0);
INSERT INTO `prepositions` VALUES (52, 'like', 0);
INSERT INTO `prepositions` VALUES (53, 'near', 0);
INSERT INTO `prepositions` VALUES (54, 'of', 0);
INSERT INTO `prepositions` VALUES (55, 'off', 0);
INSERT INTO `prepositions` VALUES (56, 'on', 0);
INSERT INTO `prepositions` VALUES (57, 'on account of', 0);
INSERT INTO `prepositions` VALUES (58, 'out', 0);
INSERT INTO `prepositions` VALUES (59, 'out of', 0);
INSERT INTO `prepositions` VALUES (60, 'outside', 0);
INSERT INTO `prepositions` VALUES (61, 'over', 0);
INSERT INTO `prepositions` VALUES (62, 'past', 0);
INSERT INTO `prepositions` VALUES (63, 'rather than', 0);
INSERT INTO `prepositions` VALUES (64, 'regarding', 0);
INSERT INTO `prepositions` VALUES (65, 'round', 0);
INSERT INTO `prepositions` VALUES (66, 'since', 1);
INSERT INTO `prepositions` VALUES (67, 'through', 0);
INSERT INTO `prepositions` VALUES (68, 'throughout', 0);
INSERT INTO `prepositions` VALUES (69, 'till', 0);
INSERT INTO `prepositions` VALUES (70, 'to', 0);
INSERT INTO `prepositions` VALUES (71, 'together with', 0);
INSERT INTO `prepositions` VALUES (72, 'toward', 0);
INSERT INTO `prepositions` VALUES (73, 'towards', 0);
INSERT INTO `prepositions` VALUES (74, 'under', 0);
INSERT INTO `prepositions` VALUES (75, 'underneath', 0);
INSERT INTO `prepositions` VALUES (76, 'until', 1);
INSERT INTO `prepositions` VALUES (77, 'unto', 0);
INSERT INTO `prepositions` VALUES (78, 'up', 0);
INSERT INTO `prepositions` VALUES (79, 'up to', 0);
INSERT INTO `prepositions` VALUES (80, 'upon', 0);
INSERT INTO `prepositions` VALUES (81, 'versus', 0);
INSERT INTO `prepositions` VALUES (82, 'via', 0);
INSERT INTO `prepositions` VALUES (83, 'with', 0);
INSERT INTO `prepositions` VALUES (84, 'with regard to', 0);
INSERT INTO `prepositions` VALUES (85, 'within', 0);
INSERT INTO `prepositions` VALUES (86, 'without', 0);
INSERT INTO `prepositions` VALUES (87, 'worth', 0);
INSERT INTO `prepositions` VALUES (88, 'with regards to', 0);
INSERT INTO `prepositions` VALUES (89, 'aboard', 0);
INSERT INTO `prepositions` VALUES (90, 'absent', 0);
INSERT INTO `prepositions` VALUES (91, 'amidst', 0);
INSERT INTO `prepositions` VALUES (92, 'astride', 0);
INSERT INTO `prepositions` VALUES (93, 'atop', 0);
INSERT INTO `prepositions` VALUES (94, 'besides', 0);
INSERT INTO `prepositions` VALUES (95, 'following', 0);
INSERT INTO `prepositions` VALUES (96, 'notwithstanding', 0);
INSERT INTO `prepositions` VALUES (97, 'mid', 0);
INSERT INTO `prepositions` VALUES (98, 'minus', 0);
INSERT INTO `prepositions` VALUES (99, 'onto', 0);
INSERT INTO `prepositions` VALUES (100, 'opposite', 0);
INSERT INTO `prepositions` VALUES (101, 're', 0);
INSERT INTO `prepositions` VALUES (102, 'subsequent to', 0);
INSERT INTO `prepositions` VALUES (103, 'prior to', 0);
INSERT INTO `prepositions` VALUES (104, 'next to', 0);
INSERT INTO `prepositions` VALUES (105, 'near to', 0);
INSERT INTO `prepositions` VALUES (106, 'owing to', 0);
INSERT INTO `prepositions` VALUES (107, 'outside of', 0);
INSERT INTO `prepositions` VALUES (108, 'on to', 0);
INSERT INTO `prepositions` VALUES (109, 'in to', 0);
INSERT INTO `prepositions` VALUES (110, 'inside of', 0);
INSERT INTO `prepositions` VALUES (111, 'far from', 0);
INSERT INTO `prepositions` VALUES (112, 'as to', 0);
INSERT INTO `prepositions` VALUES (113, 'aside from', 0);
INSERT INTO `prepositions` VALUES (114, 'because of', 0);
INSERT INTO `prepositions` VALUES (115, 'close to', 0);
INSERT INTO `prepositions` VALUES (116, 'due to', 0);
INSERT INTO `prepositions` VALUES (117, 'by means of', 0);
INSERT INTO `prepositions` VALUES (118, 'in accordance with', 0);
INSERT INTO `prepositions` VALUES (119, 'on behalf of', 0);
INSERT INTO `prepositions` VALUES (120, 'on top of', 0);
INSERT INTO `prepositions` VALUES (121, 'in case of', 0);
INSERT INTO `prepositions` VALUES (122, 'betwixt', 0);
INSERT INTO `prepositions` VALUES (123, 'circa', 0);
INSERT INTO `prepositions` VALUES (124, 'anti', 0);
INSERT INTO `prepositions` VALUES (125, 'cum', 0);
INSERT INTO `prepositions` VALUES (126, 'per', 0);
INSERT INTO `prepositions` VALUES (127, 'qua', 0);
INSERT INTO `prepositions` VALUES (128, 'sans', 0);
INSERT INTO `prepositions` VALUES (129, 'vis-a-vis', 0);
INSERT INTO `prepositions` VALUES (130, 'vis a vis', 0);
 
O

Oliver Wong

[context is: picking out keywords -- "prepositions" -- out of
already-parsed sentences.]

Ike said:
#
# Table structure for table `prepositions`
#

CREATE TABLE `prepositions` (
`id` int(3) NOT NULL auto_increment,
`preposition` varchar(40) NOT NULL default '',
`alsoCS` int(3) NOT NULL default '0',
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

#
# Dumping data for table `prepositions`
#

INSERT INTO `prepositions` VALUES (1, 'abaft', 0);
INSERT INTO `prepositions` VALUES (2, 'about', 0);
INSERT INTO `prepositions` VALUES (3, 'above', 0);
INSERT INTO `prepositions` VALUES (4, 'according to', 0);
[and so on...]

You might want to make your preposition column indexable, as you will
probably be searching through that column pretty frequently. And if the
values in the preposition column are unique, you might want to drop the
`id` column altogether and make your preposition column your primary key.

- Oliver
 
I

Ike

Oliver Wong said:
[context is: picking out keywords -- "prepositions" -- out of
already-parsed sentences.]

You might want to make your preposition column indexable, as you will
probably be searching through that column pretty frequently. And if the
values in the preposition column are unique, you might want to drop the
`id` column altogether and make your preposition column your primary key.

- Oliver


As you can see from the code I enclosed, I actually just read it directly
into an ArrayList and work with the ArrayList itself.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,994
Messages
2,570,223
Members
46,813
Latest member
lawrwtwinkle111

Latest Threads

Top