C++ code for parsing syllables?

me

I'm pulling my hair out trying to figure out code for
parsing and counting syllables in simple English
sentences.

Can someone throw the dog a bone on where to start?
 
Michael Angelo Ravera

I'm pulling my hair out trying to figure out code for
parsing and counting syllables in simple English
sentences.

Can someone throw the dog a bone on where to start?

This isn't really a C++ question, but a Computational Linguistics
question.

The first step is in recognizing vowel groups. Once you recognize
vowel groups, you can try to determine whether the group forms one or
more syllables.
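
A minimal sketch of that first step, assuming that "vowel group" simply means a maximal run of the letters a, e, i, o, u and y. The names `is_vowel` and `count_vowel_groups` are made up for illustration, and the result is only a starting estimate: deciding whether a group is one syllable, several, or none is the part that still needs rules or a dictionary.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: treats a, e, i, o, u and y as vowels (an assumption;
// y is not always a vowel in English).
bool is_vowel(char c) {
    switch (std::tolower(static_cast<unsigned char>(c))) {
        case 'a': case 'e': case 'i': case 'o': case 'u': case 'y':
            return true;
        default:
            return false;
    }
}

// Counts maximal runs of consecutive vowels in a single word.
int count_vowel_groups(const std::string& word) {
    int groups = 0;
    bool in_group = false;
    for (char c : word) {
        const bool v = is_vowel(c);
        if (v && !in_group) ++groups;  // a new group starts here
        in_group = v;
    }
    return groups;
}
```

With this definition "parsing" has two groups, which matches its syllable count, while "sentence" has three groups but only two spoken syllables, so the second step cannot be skipped.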
 
Kai-Uwe Bux

Andy said:
Daniel said:
<snip> [..]
This web site

http://www.wordcalc.com/

seems to do what you want. Except... "The rhythm of life" contains two
syllables. Half a syllable per word.

Good luck. This is a hard problem.

Maybe from a linguistic point of view, it is hard. But algorithmically, it
seems somewhat easy: English has about 1,000,000 words (with very inclusive
counting) and the number of syllables in each of them is known. So just do a
table look-up. This approach also has the advantage of being applicable to
any language (and for most it will be easier, since English has such a huge
vocabulary).

It's a finite problem and in fact smaller than, say, the problem of finding
phone numbers based on name and address. The interesting part would be to
use frequency information about words to make the look-up fast; or to find a
good data structure to reduce memory consumption.

Of course, there is the issue of words being added to the language. However,
a rule-based algorithm should not be expected to cope with the new words
either: its rules are just designed to deal with the known words.


Best

Kai-Uwe Bux
 
Daniel Pitts

Andy said:
Daniel said:
(e-mail address removed) wrote:

I'm pulling my hair out trying to figure out code for
parsing and counting syllables in simple English
sentences.

Can someone throw the dog a bone on where to start?

Google is your friend:
http://english.glendale.cc.ca.us/phonics.rules.html
<snip>
[..]
This web site

http://www.wordcalc.com/

seems to do what you want. Except... "The rhythm of life" contains two
syllables. Half a syllable per word.

Good luck. This is a hard problem.

Maybe from a linguistic point of view, it is hard. But algorithmically, it
seems somewhat easy: English has about 1,000,000 words (with very inclusive
counting) and the number of syllables in each of them is known. So just do a
table look-up. This approach also has the advantage of being applicable to
any language (and for most it will be easier, since English has such a huge
vocabulary).

It's a finite problem and in fact smaller than, say, the problem of finding
phone numbers based on name and address. The interesting part would be to
use frequency information about words to make the look-up fast; or to find a
good data structure to reduce memory consumption.

How about a hash map for both of those?

Actually, with only 1 million words, the entire data structure can
easily fit in memory on even the cheapest of today's desktop/server
machines (mobile/embedded are a different story), which makes look-up
extremely fast.
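
A rough sketch of the table look-up idea using `std::unordered_map`. Everything named here (`SyllableTable`, `make_sample_table`, `lookup_syllables`, `count_sentence_syllables`) is hypothetical, and the four entries are typed in by hand for the example; a real table would be loaded from a pronunciation dictionary. Returning -1 for unknown words is just one possible convention.

```cpp
#include <algorithm>
#include <cctype>
#include <sstream>
#include <string>
#include <unordered_map>

// Hypothetical table type: lower-case word -> syllable count.
using SyllableTable = std::unordered_map<std::string, int>;

// Hand-entered sample data, for illustration only.
SyllableTable make_sample_table() {
    return {{"the", 1}, {"rhythm", 2}, {"of", 1}, {"life", 1}};
}

// Returns the stored count, or -1 if the word is not in the table.
int lookup_syllables(const SyllableTable& table, const std::string& word) {
    const auto it = table.find(word);
    return it == table.end() ? -1 : it->second;
}

// Sums the counts of whitespace-separated words; returns -1 as soon as a
// word is missing. Punctuation is not stripped in this sketch.
int count_sentence_syllables(const SyllableTable& table,
                             const std::string& sentence) {
    std::istringstream words(sentence);
    std::string word;
    int total = 0;
    while (words >> word) {
        std::transform(word.begin(), word.end(), word.begin(),
                       [](unsigned char c) {
                           return static_cast<char>(std::tolower(c));
                       });
        const int n = lookup_syllables(table, word);
        if (n < 0) return -1;
        total += n;
    }
    return total;
}
```

With the sample table, `count_sentence_syllables(make_sample_table(), "The rhythm of life")` gives 5. A word that is missing from the table would need some fallback, such as a rule-based guess.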
 
me

Daniel T. said:
The exceptions remind me of a joke by Emo Phillips.

Most states do not end in the letter "a." The only ones that do are
Alabama, Georgia, Florida, Louisiana, Oklahoma, Arizona, California,
Nevada, Alaska, Montana, Nebraska, South Dakota, North Dakota,
Minnesota, Iowa, Indiana, Pennsylvania, North Carolina, South
Carolina, West Virginia, east Virginia, and Missouri.

That's funny!

I live in MissourA as well!!
 
James Kanze

Pay special attention to rule 1.
The rhythm can be foretold by looking at where the vowels are,
right? So "rhythm" has ... err... two syllables, because it's
split by the Y which counts as a vowel,

The y is the only possible vowel, so rhythm can't have more than
one syllable. Except that as I hear it (and according to
dictionaries), it has two: in this case, the m acts as a
syllable.
whereas "foretold" obviously has three syllables, centred
around the three vowels.
Or is that centered?

Rule 7 and the second point under 1 in the Basic Syllable Rules
do imply that silent e's don't count:). (Of course, they don't
give any hint as to how a program is to determine whether an e
is silent or not.)
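
To make the silent-e difficulty concrete, here is a sketch of one common heuristic, not something taken from the linked rules: count vowel groups, then drop a final "e" that follows a consonant, unless the word ends in consonant + "le". The function names are hypothetical and the rule is an assumption chosen for illustration.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: a, e, i, o, u and y count as vowel letters.
bool is_vowel_letter(char c) {
    return c == 'a' || c == 'e' || c == 'i' || c == 'o' || c == 'u' || c == 'y';
}

// Heuristic estimate for one word: vowel groups, minus a trailing silent e.
int estimate_syllables(const std::string& input) {
    std::string word;
    for (char c : input)
        word.push_back(static_cast<char>(
            std::tolower(static_cast<unsigned char>(c))));

    // Count maximal runs of vowel letters.
    int groups = 0;
    bool in_group = false;
    for (char c : word) {
        const bool v = is_vowel_letter(c);
        if (v && !in_group) ++groups;
        in_group = v;
    }

    const std::size_t n = word.size();
    // Final "e" after a consonant, as in "life".
    const bool ends_in_consonant_e =
        n >= 2 && word[n - 1] == 'e' && !is_vowel_letter(word[n - 2]);
    // Consonant + "le", as in "table", where the final e is assumed to count.
    const bool ends_in_consonant_le =
        n >= 3 && ends_in_consonant_e && word[n - 2] == 'l' &&
        !is_vowel_letter(word[n - 3]);

    if (groups > 1 && ends_in_consonant_e && !ends_in_consonant_le) --groups;
    return groups;
}
```

This gets "life" (1) and "table" (2) right, but it still reports 3 for "foretold", because the silent e is in the middle of the word, and 1 for "rhythm", because the syllabic m is not a vowel letter at all.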
This web site

http://www.wordcalc.com/

seems to do what you want. Except... "The rhythm of life"
contains two syllables. Half a syllable per word.

Good luck. This is a hard problem.

To put it mildly. Compare "cooper" with the beginning of
"cooperation".

And that's without internationalization: the rules will be
distinctly different in French or in German than in English.

For starters, you'll probably want to see
http://tug.org/docs/liang/. To my knowledge, no one has done
better since (and it works for all, or at least most, languages,
with a simple replacement of the machine-generated tables).
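
For a feel of how pattern tables of that kind are applied, here is a small sketch of the matching step only. It assumes patterns written as letters with digits between them, where the highest digit at each gap wins and an odd value permits a break. The nine patterns are the ones I recall from the often-quoted "hyphenation" worked example, so treat them as illustrative rather than as a real table, and all function names are made up. Note that hyphenation points are not the same thing as syllable boundaries, so this gives at best a lower bound on the syllable count.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// A pattern split into letters and gap levels; levels[i] sits before letters[i].
// For "hen5at": letters = "henat", levels = {0, 0, 0, 5, 0, 0}.
struct Pattern {
    std::string letters;
    std::vector<int> levels;
};

Pattern parse_pattern(const std::string& text) {
    Pattern p;
    p.levels.push_back(0);
    for (char c : text) {
        if (std::isdigit(static_cast<unsigned char>(c))) {
            p.levels.back() = c - '0';  // digit applies to the current gap
        } else {
            p.letters.push_back(c);
            p.levels.push_back(0);      // open the gap after this letter
        }
    }
    return p;
}

// Illustrative patterns only (an assumption, not a complete table).
std::vector<std::string> sample_patterns() {
    return {"hy3ph", "he2n", "hena4", "hen5at", "1na",
            "n2at",  "1tio", "2io",   "o2n"};
}

// Inserts '-' at every gap whose highest matching level is odd.
// TeX's minimum prefix/suffix lengths are not modelled here.
std::string hyphenate(const std::string& word,
                      const std::vector<std::string>& patterns) {
    const std::string padded = "." + word + ".";  // dots mark the word edges
    // scores[i] is the level of the gap just before padded[i].
    std::vector<int> scores(padded.size() + 1, 0);
    for (const std::string& text : patterns) {
        const Pattern p = parse_pattern(text);
        for (std::size_t pos = padded.find(p.letters);
             pos != std::string::npos;
             pos = padded.find(p.letters, pos + 1)) {
            for (std::size_t k = 0; k < p.levels.size(); ++k)
                scores[pos + k] = std::max(scores[pos + k], p.levels[k]);
        }
    }
    std::string out;
    for (std::size_t i = 1; i + 1 < padded.size(); ++i) {  // skip both dots
        if (i > 1 && scores[i] % 2 == 1) out.push_back('-');
        out.push_back(padded[i]);
    }
    return out;
}
```

With these patterns, `hyphenate("hyphenation", sample_patterns())` yields "hy-phen-ation": three pieces for a word that has four syllables, which shows why the break points alone undercount.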
 
