Need Help with Programming Science Project

theguy · Jan 24, 2014

I have a science project that involves designing a program which can examine a bit of text with the author's name given, then figure out who the author is if another piece of example text without the name is given. I so far have three different authors in the program and have already put in the example text but for some reason, the program always leans toward one specific author, Suzanne Collins, no matter what insane number I try to put in or how much I tinker with the coding. I would post the code, but I don't know if it's fine to put it here, as it contains pieces from books. I do believe that would go against copyright laws. If I can figure out a way to put it in without the bits from the stories, then I'll do so, but as of now, any help is appreciated. I understand I'm not exactly making it easy since I'm not putting up any code, but I'm kind of desperate for help here, as I can't seem to find anybody or anything else helpful in any way. Thank you.

Peter Otten · Jan 24, 2014

theguy said:
I have a science project that involves designing a program which can
examine a bit of text with the author's name given, then figure out who
the author is if another piece of example text without the name is given.
I so far have three different authors in the program and have already put
in the example text but for some reason, the program always leans toward
one specific author, Suzanne Collins, no matter what insane number I try
to put in or how much I tinker with the coding. I would post the code, but
I don't know if it's fine to put it here, as it contains pieces from
books. I do believe that would go against copyright laws. If I can figure
out a way to put it in without the bits from the stories, then I'll do so,
but as of now, any help is appreciated. I understand I'm not exactly making
it easy since I'm not putting up any code, but I'm kind of desperate
for help here, as I can't seem to find anybody or anything else helpful
in any way. Thank you.

If I were to speculate what your program might look like:

text_samples = {
    "Suzanne Collins": "... some text by collins ...",
    "J. K. Rowling": "... some text by rowling ...",
    #...
}

unknown = "... sample text by unknown author ..."

def calc_match(text1, text2):
    import random
    return random.random()

guessed_author = None
guessed_match = None

for author, text in text_samples.items():
    match = calc_match(unknown, text)
    print(author, match)
    if guessed_author is None or match > guessed_match:
        guessed_author = author
        guessed_match = match

print("The author is", guessed_author)

The important part of this script is not the text samples or the loop to
determine the best match -- it's the algorithm used to determine how well
two texts match.
In the above example that algorithm is encapsulated in the calc_match()
function and it's really bad: it gives you random numbers between 0 and 1.
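
As a rough illustration of something less random, here is a minimal sketch of a calc_match() that compares the average word length of the two texts. The choice of statistic, the helper name average_word_length() and the 1 / (1 + diff) scaling are all assumptions made for this example; none of them is taken from the original program.

```python
def average_word_length(text):
    """Return the mean number of characters per whitespace-separated word."""
    words = text.split()
    if not words:
        return 0.0
    return sum(len(word) for word in words) / float(len(words))


def calc_match(text1, text2):
    """Return a similarity in (0, 1]; 1.0 means identical average word length."""
    diff = abs(average_word_length(text1) - average_word_length(text2))
    return 1.0 / (1.0 + diff)


# Both texts average 2.0 characters per word, so the match is perfect.
print(calc_match("a bb ccc", "dd dd dd"))
```

A single statistic like this is far too weak to tell real authors apart, but it drops into the loop above unchanged because it keeps the same signature and the same "higher is better" convention.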

For us to help you it should be sufficient if you post the analog of this
function in your code together with a description in plain English of how it
is meant to calculate the similarity between two texts.

Alternatively, instead of the copyrighted texts, grab text samples from
Project Gutenberg whose copyright has expired.

Make sure that the resulting post is as short as possible -- long text
samples don't make the post clearer than short ones.

bob gailer · Jan 24, 2014

I have a science project that involves designing a program which can examine a bit of text with the author's name given, then figure out who the author is if another piece of example text without the name is given. I so far have three different authors in the program and have already put in the example text but for some reason, the program always leans toward one specific author, Suzanne Collins, no matter what insane number I try to put in or how much I tinker with the coding. I would post the code, but I don't know if it's fine to put it here, as it contains pieces from books. I do believe that would go against copyright laws.

AFAIK copyright laws apply to reproducing something for profit. I doubt
that posting it here will matter.

In any case do post your code; you could trim the fat out of the text if
you need to.

Chris Angelico · Jan 25, 2014

AFAIK copyright laws apply to reproducing something for profit. I doubt that
posting it here will matter.

Incorrect; posting not-for-profit can still be a violation of
copyright. But as Peter said, the text itself isn't critical. Post
with placeholder text, as he suggested, and we can look at the code.

ChrisA

Terry Reedy · Jan 25, 2014

Incorrect; posting not-for-profit can still be a violation of
copyright. But as Peter said, the text itself isn't critical. Post
with placeholder text, as he suggested, and we can look at the code.

In the US, short quotations are allowed for 'fair use'.

Roy Smith · Jan 25, 2014

Ben Finney said:
That's a common misconception that has never been true.

<URL:http://www.faqs.org/faqs/law/copyright/myths/part1/>

Copyright is a legal monopoly in a work, reserving a large set of
actions to the copyright holders. Without license from the copyright
holders, or an exemption under the law, you cannot legally perform those
actions.

[The rest of this post is based on my "I am not a lawyer" understanding
of the law. Also, this is based on US copyright law; things may be
different elsewhere, and I haven't the foggiest idea what law applies to
an international forum such as this]

On the other hand (where Ben Finney's post is the first hand), there is
the Fair Use Doctrine (FUD), which grants certain exemptions. The US
Copyright Office has a page (http://www.copyright.gov/fls/fl102.html)
about this.

As a real-life example, I believe I can safely invoke the FUD to quote
the leading paragraphs from today's New York Times and New York Post
articles about the same event and give their Flesch-Kincaid Reading Ease
and Grade Level scores, if I was comparing the writing style of the two
newspapers:

----------------------------------------------

NY Times:

The crime gripped the public’s imagination, for both its magnitude and
its moxie: In the predawn hours of Dec. 11, 1978, a group of masked
gunmen seized about $6 million in cash and jewels from a cargo building
at Kennedy International Airport.

Reading Ease Score: 56.6
Grade Level: 10.6

----------------------------------------------

NY Post:

On Dec. 11, 1978, armed mobsters stole $5 million in cash and nearly $1
million in jewels from a Lufthansa airlines vault at JFK Airport, in
what would be for decades the biggest-ever heist on US soil.

Reading Ease Score: 76.2
Grade Level: 7.3

----------------------------------------------

The scores above were computed by http://www.readability-score.com/

In my opinion, this meets all of the requirements of the FUD. I'm
quoting short passages, and using them to critique the writing styles of
the two papers.

In the OP's case, he's analyzing published works as input to a text
analysis algorithm. In my personal opinion, posting samples of those
texts, for the purpose of discussing how his algorithm works, would be
well within the bounds of Fair Use.

kvxdelta · Jan 25, 2014

Alright. I have the code here. Now, I just want to note that the code was not designed to work "quickly" or be very well-written. It was rushed, as I only had a few days to finish the work, and by the time I wrote the program, I hadn't worked with Python (which I never had TOO much experience with anyways) for a while. (About a year, maybe?) It was a bit foolish to take up the project, but here's the code anyways:

#D.J. Machale - Pendragon
#Pendragon: Book Six - The Rivers of Zadaa
#Page 98
#The sample sentences for this author. I put each sentence into a separate variable because I knew no other way to divide the sentence. I also removed spaces so they wouldn't be counted.
djmachale_1 = 'WheretonowIaskedLoor'
djmachale_2 = 'ToaplacewherewewillnotbedisturbedbyBatuorRokadorsheanswered'
djmachale_3 = 'WelefttheroomfollowingLoorthroughthetwistingtunnelthatIhadwalkedthroughseveraltimesbeforeonvisitingtoZadaa'
djmachale_4 = 'Shortlyweleftthesmallertunneltoenterthehugecavernthatonceheldanundergroundriver'
djmachale_5 = 'WhenSpaderandIwerefirstheretherewasafour-storywaterfallononesideoftheimmensecavernthatfedadeepragingriver'
djmachale_6 = 'Nowtherewasonlyadribbleofwaterthatfellfromarockymouthintoapathetictrickleofastreamatthebottomofthemostlydryriverbed'
djmachale_7 = 'WhathappenedhereAlderasked'
djmachale_8 = 'ThereisalottotellLooranswered'
djmachale_9 = 'Later'
djmachale_10 = 'Alderacceptedthat'
djmachale_11 = 'Hewasaneasyguy'
djmachale_12 = 'Loorledustotheopeningthatwasoncehiddenbehindthewaterfallbutwasnowinplainsight'
djmachale_13 = 'Weclimbedafewstonestairssteppedthroughtheportalandenteredaroomthatheldthewater-controldeviceIhavedescribedtoyoubefore'
djmachale_14 = 'Toremindyouguysthisthinglookedlikeoneofthosegiantpipe-organsthatyouseeinchurch'
djmachale_15 = 'Butthesepipesranhorizontallydisappearingintotherockwalloneithersideoftheroom'
djmachale_16 = 'Therewasaplatforminfrontofitthatheldanamazingarrayofswitchesandvalves'
djmachale_17 = 'WhenIfirstcameheretherewasaRokadorengineeronthatplatformfeverishlyworkingthecontrolslikeanexpert'
djmachale_18 = 'Ihadnoideawhatthedevicedidotherthanknowingithadsomethingtodowithcontrollingtheflowofwaterfromtherivers'
djmachale_19 = 'Theguyhadmapsanddiagramsthathereferredtowhilehequicklymadeadjustmentsandtoggledswitches'
djmachale_20 = 'Nowtheplatformwasempty'

#djmwords contains the amount of words in each sentence
#djmwords_total is the total word count between all the samples
djmwords = [6, 15, 22, 17, 26, 29, 5, 8, 1, 3, 5, 19, 25, 18, 16, 17, 20,25, 18, 5]
djmwords_total = sum(djmwords)
avgWORDS_per_SENTENCE_DJMACHALE = (djmwords_total/20)

#Each variable becomes the total number of letters in each sentence
djmachale_1 = len(djmachale_1)
djmachale_2 = len(djmachale_2)
djmachale_3 = len(djmachale_3)
djmachale_4 = len(djmachale_4)
djmachale_5 = len(djmachale_5)
djmachale_6 = len(djmachale_6)
djmachale_7 = len(djmachale_7)
djmachale_8 = len(djmachale_8)
djmachale_9 = len(djmachale_9)
djmachale_10 = len(djmachale_10)
djmachale_11 = len(djmachale_11)
djmachale_12 = len(djmachale_12)
djmachale_13 = len(djmachale_13)
djmachale_14 = len(djmachale_14)
djmachale_15 = len(djmachale_15)
djmachale_16 = len(djmachale_16)
djmachale_17 = len(djmachale_17)
djmachale_18 = len(djmachale_18)
djmachale_19 = len(djmachale_19)
djmachale_20 = len(djmachale_20)

#DJMACHALE_TOTAL is the total letter count between all the samples
DJ_Machale = [djmachale_1, djmachale_2, djmachale_3, djmachale_4, djmachale_5, djmachale_6, djmachale_7, djmachale_8, djmachale_9, djmachale_10, djmachale_11, djmachale_12, djmachale_13, djmachale_14, djmachale_15, djmachale_16, djmachale_17, djmachale_18, djmachale_19, djmachale_20]
DJMACHALE_TOTAL = (djmachale_1+djmachale_2+djmachale_3+djmachale_4+djmachale_5+djmachale_6+djmachale_7+djmachale_8+djmachale_9+djmachale_10+djmachale_11+djmachale_12+djmachale_13+djmachale_14+djmachale_15+djmachale_16+djmachale_17+djmachale_18+djmachale_19+djmachale_20)
avgLETTERS_per_SENTENCE_DJMACHALE = (DJMACHALE_TOTAL/20)

avgLETTERS_per_WORD_DJMACHALE = (DJMACHALE_TOTAL/djmwords_total)

#----------------------------------------------------------------------------------------------------------------------------------------------------------------------
#Suzanne Collins - The Hunger Games
#The Hunger Games
#Page 103
suzannecollins_1 = 'AsIstridetowardtheelevatorIflingmybowtoonesideandmyquivertotheother'
suzannecollins_2 = 'IbrushpastthegapingAvoxeswhoguardtheelevatorsandhitthenumbertwelvebuttonwithmyfist'
suzannecollins_3 = 'ThedoorsslidetogetherandIzipupward'
suzannecollins_4 = 'Iactuallymakeitbacktomyfloorbeforethetearsstartrunningdownmycheeks'
suzannecollins_5 = 'IcanheartheotherscallingmefromthesittingroombutIflydownthehallintomyroomboltthedoorandflingmyselfontomybed'
suzannecollins_6 = 'ThenIreallybegintosob'
suzannecollins_7 = 'NowIvedoneit'
suzannecollins_8 = 'NowIveruinedeverything'
suzannecollins_9 = 'IfIdevenstoodaghostofachanceitvanishedwhenIsentthatarrowflyingattheGamemakers'
suzannecollins_10 = 'Whatwilltheydotomenow'
suzannecollins_11 = 'Arrestme'
suzannecollins_12 = 'Executeme'
suzannecollins_13 = 'CutmytongueandturnintoanAvoxsoIcanwaitonthefutretributesofPanem'
suzannecollins_14 = 'WhatwasIthinkingshootingattheGamemakers'
suzannecollins_15 = 'OfcourseIwasntIwasshootingatthatapplebecauseIwassoangryatbeingignored'
suzannecollins_16 = 'Iwasnttryingtokilloneofthem'
suzannecollins_17 = 'IfIweretheydbedead'
suzannecollins_18 = 'Ohwhatdoesitmatter'
suzannecollins_19 = 'ItsnotlikeIwasgoingtowintheGamesanyway'
suzannecollins_20 = 'Whocareswhattheydotome'

suzcwords = [19, 19, 8, 16, 6, 4, 4, 20, 7, 2, 2, 19, 8, 18, 8, 6, 5, 11,7]
suzcwords_total = (19+19+8+16+6+4+4+20+7+2+2+19+8+18+8+6+5+11+7)
avgWORDS_per_SENTENCE_SUZANNECOLLINS = (suzcwords_total/20)

suzannecollins_1 = len(suzannecollins_1)
suzannecollins_2 = len(suzannecollins_2)
suzannecollins_3 = len(suzannecollins_3)
suzannecollins_4 = len(suzannecollins_4)
suzannecollins_5 = len(suzannecollins_5)
suzannecollins_6 = len(suzannecollins_6)
suzannecollins_7 = len(suzannecollins_7)
suzannecollins_8 = len(suzannecollins_8)
suzannecollins_9 = len(suzannecollins_9)
suzannecollins_10 = len(suzannecollins_10)
suzannecollins_11 = len(suzannecollins_11)
suzannecollins_12 = len(suzannecollins_12)
suzannecollins_13 = len(suzannecollins_13)
suzannecollins_14 = len(suzannecollins_14)
suzannecollins_15 = len(suzannecollins_15)
suzannecollins_16 = len(suzannecollins_16)
suzannecollins_17 = len(suzannecollins_17)
suzannecollins_18 = len(suzannecollins_18)
suzannecollins_19 = len(suzannecollins_19)
suzannecollins_20 = len(suzannecollins_20)

Suzanne_Collins = [suzannecollins_1, suzannecollins_2, suzannecollins_3, suzannecollins_4, suzannecollins_5, suzannecollins_6, suzannecollins_7, suzannecollins_8, suzannecollins_9, suzannecollins_10, suzannecollins_11, suzannecollins_12, suzannecollins_13, suzannecollins_14, suzannecollins_15, suzannecollins_16, suzannecollins_17, suzannecollins_18, suzannecollins_19, suzannecollins_20]
SUZANNECOLLINS_TOTAL = (suzannecollins_1+suzannecollins_2+suzannecollins_3+suzannecollins_4+suzannecollins_5+suzannecollins_6+suzannecollins_7+suzannecollins_8+suzannecollins_9+suzannecollins_10+suzannecollins_11+suzannecollins_12+suzannecollins_13+suzannecollins_14+suzannecollins_15+suzannecollins_16+suzannecollins_17+suzannecollins_18+suzannecollins_19+suzannecollins_20)
avgLETTERS_per_SENTENCE_SUZANNECOLLINS = (SUZANNECOLLINS_TOTAL/20)

avgLETTERS_per_WORD_SUZANNECOLLINS = (SUZANNECOLLINS_TOTAL/suzcwords_total)

#-----------------------------------------------------------------------------------------------------------------------------------------
#Richard Peck - The Last Safe Place on Earth
#The Last Safe Place on Earth
#Page 1-2

richardpeck_1 = 'HalloweensaweekandahalfawayHomecomingtheweekendafter'
richardpeck_2 = 'ItsthattimeofyearandcominghomeImthinkingWhatagreateveningtobegoingsomewherewithagirlmyarmdrapedoverhersoftshoulderthetwoofusscuffingthroughtheleaves'
richardpeck_3 = 'ImseeinggirlseverywhereIlooksomeofthemrealmostnot'
richardpeck_4 = 'Iseegirlsintheshapesthetreetrunksmakeandintheformationsoftheclouds'
richardpeck_5 = 'Iseealotofgirlsthisfall'
richardpeck_6 = 'Imnotobsessed'
richardpeck_7 = 'Imintenthgrade'
richardpeck_8 = 'SoIwascominghomeonfoot'
richardpeck_9 = 'Therewereacoupleofbooksinmybackpack'
richardpeck_10 = 'OnewasRayBradburysFahrenheit451whichweweresupposedtobereadingforMrsLenkysclass'
richardpeck_11 = 'Iplannedtobuckledownonschoolworkandreallyhitthebooksnextyearsenioryearatthelatest'
richardpeck_12 = 'MeanwhileIwastakingeverydayasitcametryingtogetatoeholdonhighschool'
richardpeck_13 = 'ButthefactisIdidntreallythinkhighschoolwashappeninguntilIfoundagirl'
richardpeck_14 = 'ItwasapostcardeveningalongTranquilyLanetheactualnameofourstreet'
richardpeck_15 = 'Thehazewaslikebonfiresmoke,thoughwecantburnleaveswithinthevillagelimits'
richardpeck_16 = 'Itwasared-and-goldworldwithpurpleeveningcomingon'
richardpeck_17 = 'OurhouseisthebigwhitebrickwiththegreenshutterslikeahouseonaChristmascard'
richardpeck_18 = 'Weusedtoliveinthewesternsuburbs'
richardpeck_19 = 'ButwhenDianaandIwereinsixthgradethejuniorhighouttherehadacoupleofknifefightsthatmadethenews'
richardpeck_20 = 'Thegangsweremovinginsowemovedout'

richwords = [11, 36, 12, 17, 8, 3, 4, 7, 9 , 17, 19, 17, 17, 14, 15, 12, 18, 9, 23, 9]
richwords_total = (11+36+12+17+8+3+4+7+9+17+19+17+17+14+15+12+18+9+23+9)
avgWORDS_per_SENTENCE_RICHARDPECK = (richwords_total/20)

richardpeck_1 = len(richardpeck_1)
richardpeck_2 = len(richardpeck_2)
richardpeck_3 = len(richardpeck_3)
richardpeck_4 = len(richardpeck_4)
richardpeck_5 = len(richardpeck_5)
richardpeck_6 = len(richardpeck_6)
richardpeck_7 = len(richardpeck_7)
richardpeck_8 = len(richardpeck_8)
richardpeck_9 = len(richardpeck_9)
richardpeck_10 = len(richardpeck_10)
richardpeck_11 = len(richardpeck_11)
richardpeck_12 = len(richardpeck_12)
richardpeck_13 = len(richardpeck_13)
richardpeck_14 = len(richardpeck_14)
richardpeck_15 = len(richardpeck_15)
richardpeck_16 = len(richardpeck_16)
richardpeck_17 = len(richardpeck_17)
richardpeck_18 = len(richardpeck_18)
richardpeck_19 = len(richardpeck_19)
richardpeck_20 = len(richardpeck_20)

Richard_Peck = [richardpeck_1, richardpeck_2, richardpeck_3, richardpeck_4, richardpeck_5, richardpeck_6, richardpeck_7, richardpeck_8, richardpeck_9, richardpeck_10, richardpeck_11, richardpeck_12, richardpeck_13, richardpeck_14, richardpeck_15, richardpeck_16, richardpeck_17, richardpeck_18, richardpeck_19, richardpeck_20]
RICHARDPECK_TOTAL = (richardpeck_1+richardpeck_2+richardpeck_3+richardpeck_4+richardpeck_5+richardpeck_6+richardpeck_7+richardpeck_8+richardpeck_9+richardpeck_10+richardpeck_11+richardpeck_12+richardpeck_13+richardpeck_14+richardpeck_15+richardpeck_16+richardpeck_17+richardpeck_18+richardpeck_19+richardpeck_20)
avgLETTERS_per_SENTENCE_RICHARDPECK = (RICHARDPECK_TOTAL/20)

avgLETTERS_per_WORD_RICHARDPECK = (RICHARDPECK_TOTAL/richwords_total)

#---------------------------------------------------------------------------------------------------------
#EXAMPLE SLOT
example1 = 'Wepulledthefilmfortheten-thirtynewstohearhowtheWarriorshaddoneagainsttheLakeVillaVikinsontheVikingshomefield'
example2 = 'WedlostbutitwascloseandC.E.andIwentbacktotheDracula'
example3 = 'Itwasgettinglatewhenthephonerang'
example4 = 'DeepinhispopcornworldDaddidntanswerit'
example5 = 'Ipickedupinthedenanditwasawoman'
example6 = 'IwavedatC.E.toturndownthesoundsbecausethewomanwascrying'
example7 = 'Whoisthis'
example8 = 'ItwasMrsCunningham'
example9 = 'Icantfindmydaughtershesaid'
example10 = 'IcantfindPace'
example11 = 'SheshereIsaid'
example12 = 'Shesupstairswithmysister'
example13 = 'AmomentofsilencethenandMrsCunninghamsvoiceshuddered'
example14 = 'IssheYoutellhertostayrightthereImcomingover'
example15 = 'SoweneverdidseehowtheDraculafilmended'
example16 = 'HeyPaceIsaidupthestairs'
example17 = 'Yourmomscomingover'
example18 = 'ThisbroughteverybodytothefronthallPacefirst'
example19 = 'DianawasbehindherandMominherrobeandMarnieinherpajamas'
example20 = 'BeforeDrandMrsCunninghamgothereDadwasinthefronthalltooinhisapron'

examplewords = [25, 15, 8, 9, 11, 14, 3, 4, 6, 4, 4, 5, 10, 7, 4, 9, 14, 17]
examplewords_total = sum(examplewords)
avgWORDS_per_SENTENCE_EXAMPLE = (examplewords_total/20)

example1 = len(example1)
example2 = len(example2)
example3 = len(example3)
example4 = len(example4)
example5 = len(example5)
example6 = len(example6)
example7 = len(example7)
example8 = len(example8)
example9 = len(example9)
example10 = len(example10)
example11 = len(example11)
example12 = len(example12)
example13 = len(example13)
example14 = len(example14)
example15 = len(example15)
example16 = len(example16)
example17 = len(example17)
example18 = len(example18)
example19 = len(example19)
example20 = len(example20)

example = [example1, example2, example3, example4, example5, example6, example7, example8, example9, example10, example11, example12, example13, example14, example15, example16, example17, example18, example19, example20]
EXAMPLE_TOTAL = (example1+example2+example3+example4+example5+example6+example7+example8+example9+example10+example11+example12+example13+example14+example15+example16+example17+example18+example19+example20)
avgLETTERS_per_SENTENCE_EXAMPLE = (EXAMPLE_TOTAL/20)

avgLETTERS_per_WORD_EXAMPLE = (EXAMPLE_TOTAL/examplewords_total)

#------------------------------------------------------------------------------------------------------------------------------
#Tests for similarities and prints (displays) the author whom the program believes to have written the example text

#I used a scoreboard system of sorts to determine which author was most similar to the example. Each time the program finds a match to one in each of the tests, it adds a point to that author here.
DJMachalePossibility = 0
SuzanneCollinsPossibility = 0
RichardPeckPossibility = 0

#Matches average letters/sentence in example with most likely author
#I attempted to find the closest value by subtracting the example's value from each of the authors. The author with the smallest distance from the example would be marked up one point.

avgLPS_DJ_EXAMPLE = (avgLETTERS_per_SENTENCE_DJMACHALE-avgLETTERS_per_SENTENCE_EXAMPLE)

avgLPS_SUZC_EXAMPLE = (avgLETTERS_per_SENTENCE_SUZANNECOLLINS-avgLETTERS_per_SENTENCE_EXAMPLE)

avgLPS_RICH_EXAMPLE = (avgLETTERS_per_SENTENCE_RICHARDPECK-avgLETTERS_per_SENTENCE_EXAMPLE)

LPS_Comparisons = [avgLPS_DJ_EXAMPLE, avgLPS_SUZC_EXAMPLE, avgLPS_RICH_EXAMPLE]
avgLPS_Match = min(LPS_Comparisons)

if avgLPS_Match == avgLPS_DJ_EXAMPLE:
    DJMachalePossibility = (DJMachalePossibility+1)

if avgLPS_Match == avgLPS_SUZC_EXAMPLE:
    SuzanneCollinsPossibility = (SuzanneCollinsPossibility+1)

if avgLPS_Match == avgLPS_RICH_EXAMPLE:
    RichardPeckPossibility = (RichardPeckPossibility+1)

#Matches average words/sentence in example with most likely author

avgWPS_DJ_EXAMPLE = (avgWORDS_per_SENTENCE_DJMACHALE-avgWORDS_per_SENTENCE_EXAMPLE)

avgWPS_SUZC_EXAMPLE = (avgWORDS_per_SENTENCE_SUZANNECOLLINS-avgWORDS_per_SENTENCE_EXAMPLE)

avgWPS_RICH_EXAMPLE = (avgWORDS_per_SENTENCE_RICHARDPECK-avgWORDS_per_SENTENCE_EXAMPLE)

WPS_Comparisons = [avgWPS_DJ_EXAMPLE, avgWPS_SUZC_EXAMPLE, avgWPS_RICH_EXAMPLE]
avgWPS_Match = min(WPS_Comparisons)

if avgWPS_Match == avgWPS_DJ_EXAMPLE:
    DJMachalePossibility = (DJMachalePossibility+1)

if avgWPS_Match == avgWPS_SUZC_EXAMPLE:
    SuzanneCollinsPossibility = (SuzanneCollinsPossibility+1)

if avgWPS_Match == avgWPS_RICH_EXAMPLE:
    RichardPeckPossibility = (RichardPeckPossibility+1)

#Matches average letters/word in example with most likely author

avgLPW_DJ_EXAMPLE = (avgLETTERS_per_WORD_DJMACHALE-avgLETTERS_per_WORD_EXAMPLE)

avgLPW_SUZC_EXAMPLE = (avgLETTERS_per_WORD_SUZANNECOLLINS-avgLETTERS_per_WORD_EXAMPLE)

avgLPW_RICH_EXAMPLE = (avgLETTERS_per_WORD_RICHARDPECK-avgLETTERS_per_WORD_EXAMPLE)

LPW_Comparisons = [avgLPW_DJ_EXAMPLE, avgLPW_SUZC_EXAMPLE, avgLPW_SUZC_EXAMPLE]
avgLPW_Match = min(LPW_Comparisons)

if avgLPW_Match == avgLPW_DJ_EXAMPLE:
    DJMachalePossibility = (DJMachalePossibility+1)

if avgLPW_Match == avgLPW_SUZC_EXAMPLE:
    SuzanneCollinsPossibility = (SuzanneCollinsPossibility+1)

if avgLPW_Match == avgLPW_RICH_EXAMPLE:
    RichardPeckPossibility = (RichardPeckPossibility+1)

AUTHOR_SCOREBOARD = [DJMachalePossibility, SuzanneCollinsPossibility, RichardPeckPossibility]

#The author with the most points on them would be considered the program's guess.
Match = max(AUTHOR_SCOREBOARD)

print AUTHOR_SCOREBOARD

if Match == DJMachalePossibility:
    print "The author should be D.J. Machale."

if Match == SuzanneCollinsPossibility:
    print "The author should be Suzanne Collins."

if Match == RichardPeckPossibility:
    print "The author should be Richard Peck."

Rustom Mody · Jan 25, 2014

Alright. I have the code here. Now, I just want to note that the code was not designed to work "quickly" or be very well-written. It was rushed, as I only had a few days to finish the work, and by the time I wrote the program, I hadn't worked with Python (which I never had TOO much experience with anyways) for a while. (About a year, maybe?) It was a bit foolish to take up the project, but here's the code anyways:

<snipped>

Ewwww!

If you (or anyone with basic python experience) rewrites that code, it will become
1/50th the size and all that you call 'code' will reside in data files.

That can mean one of json, xml, yml, ini, pickle, ini, csv etc

If you need further help in understanding/choosing, post back
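
To make the data-file idea a little more concrete, here is a minimal sketch using JSON. The file layout, the name SAMPLES_JSON and the helper words_per_sentence() are assumptions invented for the example, and the sentences are placeholders rather than quotations from any book. The JSON text is embedded as a string so the sketch runs on its own; in practice it would live in a separate file and be read with json.load().

```python
import json

# Stand-in for the contents of a hypothetical samples.json file:
# one list of sentences per author.
SAMPLES_JSON = """
{
    "Author A": ["The cat sat.", "It rained all day long."],
    "Author B": ["Stop.", "We walked home in the dark."]
}
"""


def words_per_sentence(sentences):
    """Average number of whitespace-separated words per sentence."""
    word_counts = [len(sentence.split()) for sentence in sentences]
    return sum(word_counts) / float(len(sentences))


samples = json.loads(SAMPLES_JSON)

for author in sorted(samples):
    print("%s: %.2f words per sentence"
          % (author, words_per_sentence(samples[author])))
```

Adding a fourth author then means adding one entry to the data, with no change to the code.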

theguy · Jan 25, 2014

<snipped>

Ewwww!

If you (or anyone with basic python experience) rewrites that code, it will become

1/50th the size and all that you call 'code' will reside in data files.

That can mean one of json, xml, yml, ini, pickle, ini, csv etc

If you need further help in understanding/choosing, post back

I know. I'm kind of ashamed of the code, but it does the job I need it to up to a certain point, where it for some reason continually gives me Suzanne Collins as the author. It always gives three points to her name in the AUTHOR_SCOREBOARD list. The code, though, is REALLY bad. I'm trying to simply get it to do the things needed for the program. If I could get it to actually calculate the "points" for AUTHOR_SCOREBOARD properly, then all my problems would be solved. Luckily, I'm not being graded on the elegance or conciseness of my code. Thank you for the constructive criticism, though I am really seeking help with my little problem involving that dang scoreboard. Thank you.

Dave Angel · Jan 25, 2014

Alright. I have the code here. Now, I just want to note that the code was not designed to work "quickly" or be very well-written. It was rushed, as I only had a few days to finish the work, and by the time I wrote the program, I hadn't worked with Python (which I never had TOO much experience with anyways) for a while. (About a year, maybe?) It was a bit foolish to take up the project, but here's the code anyways: ..........

LPW_Comparisons = [avgLPW_DJ_EXAMPLE, avgLPW_SUZC_EXAMPLE, avgLPW_SUZC_EXAMPLE]
avgLPW_Match = min(LPW_Comparisons)

if avgLPW_Match == avgLPW_DJ_EXAMPLE:
    DJMachalePossibility = (DJMachalePossibility+1)

if avgLPW_Match == avgLPW_SUZC_EXAMPLE:
    SuzanneCollinsPossibility = (SuzanneCollinsPossibility+1)

if avgLPW_Match == avgLPW_RICH_EXAMPLE:
    RichardPeckPossibility = (RichardPeckPossibility+1)

AUTHOR_SCOREBOARD = [DJMachalePossibility, SuzanneCollinsPossibility, RichardPeckPossibility]

#The author with the most points on them would be considered the program's guess.
Match = max(AUTHOR_SCOREBOARD)

print AUTHOR_SCOREBOARD

if Match == DJMachalePossibility:
    print "The author should be D.J. Machale."

if Match == SuzanneCollinsPossibility:
    print "The author should be Suzanne Collins."

if Match == RichardPeckPossibility:
    print "The author should be Richard Peck."

1. When you calculate averages, you should be using floating
point divide.
    avg = float(a) / b

2. When you subtract two values, you need an abs, because
otherwise min() will home in on the negative values.

3. Realize that having Match equal more than one author's score
is not that unlikely.

4. If you want to vary what you call strictness, you're really
going to need to learn about functions.
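
Points 1 to 3 can be sketched together in one small function. This is only an illustration: the name closest_authors() and the numbers are made up, and it is not a drop-in replacement for the posted code.

```python
def closest_authors(author_values, example_value):
    """Return (smallest distance, sorted list of all authors at that distance)."""
    distances = {}
    for author, value in author_values.items():
        # abs() so that a large negative difference cannot win the min() below
        distances[author] = abs(value - example_value)
    best = min(distances.values())
    # Keep every author that ties for the best distance, not just the first
    winners = sorted(a for a, d in distances.items() if d == best)
    return best, winners


# float() forces true division even on Python 2, where 7 / 2 would give 3
average = float(7) / 2
print(average)

# Made-up averages, purely for illustration
letters_per_sentence = {"Author A": 60.0, "Author B": 40.0, "Author C": 52.0}
print(closest_authors(letters_per_sentence, 55.0))
```

Without the abs(), the differences would be 5.0, -15.0 and -3.0, and min() would pick Author B, the author furthest from the example, instead of Author C.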

Gregory Ewing · Jan 25, 2014

theguy said:
I so far have
three different authors in the program and have already put in the example
text but for some reason, the program always leans toward one specific
author, Suzanne Collins, no matter what insane number I try to put in or how
much I tinker with the coding.

It's obvious what's happening here: all the other
authors have heavily borrowed from Suzanne Collins.
You've created a plagiarism detector!

Gregory Ewing · Jan 25, 2014

theguy said:
If I could get it to actually
calculate the "points" for AUTHOR_SCOREBOARD properly, then all my problems
would be solved.

Have you tried getting it to print out the values
it's getting for the scores, and comparing them
with what you calculate by hand?
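
One way to do that is a small reporting helper that prints the raw inputs next to each average. The sketch below is an assumption about how this could look (the name report() and the numbers are made up). It also warns when the list of word counts does not have one entry per sentence, which seems relevant here: in the code as posted, suzcwords appears to hold 19 numbers and examplewords 18, while both totals are divided by 20.

```python
def report(label, letter_counts, word_counts):
    """Print the averages for one text so they can be checked by hand."""
    if len(letter_counts) != len(word_counts):
        print("WARNING %s: %d sentences but %d word counts"
              % (label, len(letter_counts), len(word_counts)))
    letters = sum(letter_counts)
    words = sum(word_counts)
    stats = {
        "letters/sentence": letters / float(len(letter_counts)),
        "words/sentence": words / float(len(word_counts)),
        "letters/word": letters / float(words),
    }
    for name in sorted(stats):
        print("%s %s = %.3f" % (label, name, stats[name]))
    return stats


# Made-up numbers; the second list is deliberately one entry short.
stats = report("example", [12, 30, 18], [3, 7])
```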

Denis McMahon · Jan 25, 2014

I know. I'm kind of ashamed of the code, but it does the job I need it
to up to a certain point

OK, well first of all take a step back and look at the problem.

You have n exemplars, each from a known author.

You analyse each exemplar, and determine some statistics for it.

You then take your unknown sample, determine the same statistics for the
unknown sample.

Finally, you compare each exemplar's stats with the sample's stats to try
and find a best match.

So, perhaps you want a dictionary of { author: statistics }, and a
function to analyse a piece of text, which might call other functions to
get e.g. avg words / sentence, avg letters / sentence, avg word length, and
the standard deviation of each, and the short word ratio (words <= 3 chars
vs words >= 4 chars) and some other statistics.

Given the statistics for each exemplar, you might store these in your
dictionary as a tuple.
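
A minimal runnable sketch of such an analysis function might look like the following. The sentence splitting and word matching are deliberately naive, the short-word figure is computed as a fraction of all words rather than a short/long ratio (to avoid dividing by zero), and the sample sentences are placeholders; all of these are assumptions made for the example.

```python
import re


def analyse(text):
    """Return (words/sentence, letters/sentence, letters/word, short-word fraction)."""
    # Naive sentence split on '.', '!' and '?'; real prose needs something smarter.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Words are runs of letters and apostrophes; apostrophes are then dropped.
    raw_words = re.findall(r"[A-Za-z']+", text)
    words = [w for w in (raw.replace("'", "") for raw in raw_words) if w]
    if not sentences or not words:
        raise ValueError("text contains no sentences or no words")
    letters = sum(len(w) for w in words)
    short = sum(1 for w in words if len(w) <= 3)
    n_sentences = float(len(sentences))
    n_words = float(len(words))
    return (n_words / n_sentences, letters / n_sentences,
            letters / n_words, short / n_words)


# The dictionary of { author: statistics }, with a placeholder text.
exemplar_stats = {
    "Author A": analyse("The cat sat. It rained all day long!"),
}
print(exemplar_stats["Author A"])
```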

What follows isn't Python, it's a description of an algorithm that just
looks a bit Pythonic:

# tuple of weightings applied to different stats
stat_weightings = ( 1.0, 1.3, 0.85, ...... )

def get_some_stat( t ):
    # calculate some numerical statistic on a block of text
    # return it

def analyse( f ):
    text = read_file( f )
    return ( get_some_stat( text ), ...... )

exemplar_data = {}

for exemplar_file in exemplar_files:
    exemplar_data[author] = analyse( exemplar_file )

sample_data = analyse( sample_file )

scores = {}

x = 0

# score for a piece of work is sum of ( diff of stat * weighting )
# for all the stats, lower score = closer match
for author in keys( exemplar_data ):
    # start each author's score from zero
    tmp = 0
    for i in range( len( exemplar_data[ author ] ) ):
        # absolute difference, so a negative diff cannot lower the score
        tmp = tmp + abs( exemplar_data[ author ][ i ] -
                         sample_data[ i ] ) * stat_weightings[ i ]
    scores[ author ] = tmp
    # remember the highest score as the starting point for the search below
    if tmp > x:
        x = tmp

names = []

for author in keys( scores ):
    if scores[ author ] < x:
        x = scores[ author ]
        names = [ author ]
    elif scores[ author ] == x:
        names.append( author )

print "the best matching author(s) is/are: ", names

Then all you have to do is find enough ways to calculate stats, and the
magic coefficients to use in the stat_weightings.
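
For comparison, here is a small runnable sketch of the scoring step. It uses the absolute difference for each statistic, which is an assumption on my part about what "diff of stat" is meant to be, and the statistics and weightings are made-up numbers. The function names score() and best_matches() are invented for the example.

```python
def score(exemplar_stats, sample_stats, weightings):
    """Weighted sum of absolute differences; lower means a closer match."""
    return sum(abs(e - s) * w
               for e, s, w in zip(exemplar_stats, sample_stats, weightings))


def best_matches(exemplar_data, sample_stats, weightings):
    """Return (lowest score, sorted list of the authors achieving it)."""
    scores = dict((author, score(stats, sample_stats, weightings))
                  for author, stats in exemplar_data.items())
    lowest = min(scores.values())
    return lowest, sorted(a for a, s in scores.items() if s == lowest)


# Made-up statistics (two per author) and weightings, purely for illustration.
stat_weightings = (1.0, 0.5)
exemplar_data = {
    "Author A": (15.0, 4.0),
    "Author B": (10.0, 5.0),
    "Author C": (12.0, 4.5),
}
sample_data = (11.0, 5.0)

print("the best matching author(s) is/are: %s"
      % (best_matches(exemplar_data, sample_data, stat_weightings),))
```

With these numbers the scores are 4.5, 1.0 and 1.25, so Author B is reported. Returning a list keeps ties visible instead of silently picking one author.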

Dennis Lee Bieber · Jan 25, 2014

<snipped>

Ewwww!

I think my reaction was more guttural.
If you (or anyone with basic python experience) rewrites that code, it will become
1/50th the size and all that you call 'code' will reside in data files.

That can mean one of json, xml, yml, ini, pickle, ini, csv etc

If you need further help in understanding/choosing, post back

Heck, at the very least turn all those xxxx_99 variables into single
lists.... The posted code looks like something from 1968 K&K BASIC.

Rustom Mody · Jan 25, 2014

Heck, at the very least turn all those xxxx_99 variables into single
lists.... The posted code looks like something from 1968 K&K BASIC.

Yes, that's correct.

My suggestion of data-files is a second step.

A first step is just converting to using internal (python) data structures.
[And not 1968 BASIC scalars!]
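
As a sketch of that first step, the twenty numbered variables per author can become one list, and the counts can be derived from the text instead of being typed in by hand. The sentences below are placeholders, and strip_to_letters() is a hypothetical helper that only roughly mimics the hand-stripped strings in the posted code.

```python
# One list per author instead of twenty numbered variables.
sentences = [
    "The cat sat.",
    "It rained all day long.",
    "Nobody came.",
]


def strip_to_letters(sentence):
    """Drop everything except letters (spaces and punctuation are removed)."""
    return "".join(ch for ch in sentence if ch.isalpha())


# Derive both counts from the same list so they can never get out of step.
word_counts = [len(s.split()) for s in sentences]
letter_counts = [len(strip_to_letters(s)) for s in sentences]

avg_words_per_sentence = sum(word_counts) / float(len(sentences))
avg_letters_per_sentence = sum(letter_counts) / float(len(sentences))
avg_letters_per_word = sum(letter_counts) / float(sum(word_counts))

print(word_counts)
print(letter_counts)
```

Dividing by len(sentences) rather than a hard-coded 20 means the averages stay correct when sentences are added or removed.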

alex23 · Jan 28, 2014

I have a science project that involves designing a program which can examine a bit of text with the author's name given, then figure out who the author is if another piece of example text without the name is given.

This sounds like exactly the sort of thing NLTK was made for. Here's an
example of using it for this requirement:

http://www.aicbt.com/authorship-attribution/
