Dear Group,
I understand that you are upset, but I have tried and am not getting anywhere yet. I am trying a bit more and will include my efforts by tonight. If things work out, that would be nice. Let us see.
Regards,
Subhabrata.
Dear Group,
The only progress I could make was to write one file, "mydata.txt", and convert it into MALLET format as "output.mallet", as can be seen in REPORT NO.2.
The exercises I tried are copied and pasted below as REPORT NO.1 and REPORT NO.2, as I could not attach them. Apologies if it looks like "junk". The only other change you may find is that I changed the path of MALLET, as I was trying to follow the tutorial I found at
http://programminghistorian.org/lessons/topic-modeling-and-mallet
REPORT NO.1:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Microsoft Windows [Version 6.1.7601]
Copyright (c) 2009 Microsoft Corporation. All rights reserved.
C:\Users\subhabrata>cd\
C:\>cd mallet
C:\mallet>
C:\mallet>ant
Buildfile: C:\mallet\build.xml
init:
[copy] Copying 1 file to C:\mallet\class
compile:
[javac] C:\mallet\build.xml:60: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
BUILD SUCCESSFUL
Total time: 6 seconds
C:\mallet>ant jar
Buildfile: C:\mallet\build.xml
init:
[copy] Copying 1 file to C:\mallet\class
compile:
[javac] C:\mallet\build.xml:60: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
jar:
[jar] Building jar: C:\mallet\dist\mallet.jar
BUILD SUCCESSFUL
Total time: 1 second
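A quick note for anyone retracing this: once ant jar has built dist\mallet.jar, everything below is driven through the launcher in bin, which on Windows is mallet.bat. Run with no arguments it simply prints the list of available commands, e.g. (from C:\mallet):

rem prints the list of Mallet 2.0 commands
bin\mallet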
C:\mallet>dir
Volume in drive C is Acer
Volume Serial Number is 7A35-B119
Directory of C:\mallet
05/17/2013 01:02 AM <DIR> .
05/17/2013 01:02 AM <DIR> ..
05/17/2013 01:02 AM <DIR> bin
09/02/2011 12:50 PM 2,897 build.xml
05/17/2013 01:02 AM <DIR> class
05/17/2013 01:02 AM <DIR> dist
05/17/2013 01:02 AM <DIR> lib
09/02/2011 12:50 PM 11,918 LICENSE
09/02/2011 12:50 PM 3,566 Makefile
09/02/2011 12:50 PM 2,519 pom.xml
05/17/2013 01:02 AM <DIR> sample-data
05/17/2013 01:02 AM <DIR> src
05/17/2013 01:02 AM <DIR> stoplists
09/02/2011 12:50 PM <DIR> test
4 File(s) 20,900 bytes
10 Dir(s) 415,472,402,432 bytes free
C:\mallet>cd bin
C:\mallet\bin>dir
Volume in drive C is Acer
Volume Serial Number is 7A35-B119
Directory of C:\mallet\bin
05/17/2013 01:02 AM <DIR> .
05/17/2013 01:02 AM <DIR> ..
09/02/2011 12:50 PM 635 classifier2info
09/02/2011 12:50 PM 632 csv2classify
09/02/2011 12:50 PM 631 csv2vectors
09/02/2011 12:50 PM 2,347 mallet
09/02/2011 12:50 PM 2,471 mallet.bat
09/02/2011 12:50 PM 1,771 mallethon
09/02/2011 12:50 PM 63 prepend-license.sh
09/02/2011 12:50 PM 636 svmlight2vectors
09/02/2011 12:50 PM 633 text2classify
09/02/2011 12:50 PM 632 text2vectors
09/02/2011 12:50 PM 636 vectors2classify
09/02/2011 12:50 PM 632 vectors2info
09/02/2011 12:50 PM 631 vectors2topics
09/02/2011 12:50 PM 635 vectors2vectors
14 File(s) 12,985 bytes
2 Dir(s) 415,471,616,000 bytes free
C:\mallet\bin>mallet import-dir --input pathway\to\the\directory\with\the\files --output tutorial.mallet --keep-sequence --remove-stopwords
Labels =
pathway\to\the\directory\with\the\files
Exception in thread "main" java.lang.IllegalArgumentException: C:\mallet\bin\pathway\to\the\directory\with\the\files is not a directory.
at cc.mallet.pipe.iterator.FileIterator.<init>(FileIterator.java:108)
at cc.mallet.pipe.iterator.FileIterator.<init>(FileIterator.java:145)
at cc.mallet.classify.tui.Text2Vectors.main(Text2Vectors.java:312)
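I think I see what went wrong here: "pathway\to\the\directory\with\the\files" is just the placeholder from the tutorial, so MALLET is right that no such directory exists. Presumably it should point at a real folder of plain-text files; something like the following, assuming a hypothetical folder C:\mallet\mydocs holding my documents:

rem C:\mallet\mydocs is a hypothetical folder of plain-text files
mallet import-dir --input C:\mallet\mydocs --output tutorial.mallet --keep-sequence --remove-stopwords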
C:\mallet\bin>svmlight2vectors
'svmlight2vectors' is not recognized as an internal or external command,
operable program or batch file.
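My guess about this "not recognized" error: the files in bin without a .bat extension, such as svmlight2vectors and text2vectors, look like Unix shell scripts, and cmd.exe on Windows cannot run them. Only mallet.bat seems to be meant for Windows, which is why the plain mallet command works below. For SVMLight-format data the Windows route would presumably be the import-svmlight subcommand, for example with a hypothetical input file data.svmlight:

rem data.svmlight is a hypothetical file in SVMLight format
mallet import-svmlight --input data.svmlight --output data.mallet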
C:\mallet\bin>mallet
Mallet 2.0 commands:
import-dir         load the contents of a directory into mallet instances (one per file)
import-file        load a single file into mallet instances (one per line)
import-svmlight    load a single SVMLight format data file into mallet instances (one per line)
train-classifier   train a classifier from Mallet data files
train-topics       train a topic model from Mallet data files
infer-topics       use a trained topic model to infer topics for new documents
estimate-topics    estimate the probability of new documents given a trained model
hlda               train a topic model using Hierarchical LDA
prune              remove features based on frequency or information gain
split              divide data into testing, training, and validation portions
Include --help with any option for more information
C:\mallet\bin>mallet train-classifier --help
A tool for training, saving and printing diagnostics from a classifier on vectors.
  --help TRUE|FALSE
   Print this command line option usage information. Give argument of TRUE for longer documentation.
   Default is false
  --prefix-code 'JAVA CODE'
   Java code you want run before any other interpreted code. Note that the text is interpreted without modification, so unlike some other Java code options, you need to include any necessary 'new's when creating objects.
   Default is null
  --config FILE
   Read command option values from a file.
   Default is null
  --report [train|test|validation]:[accuracy|f1:label|confusion|raw]
   Default is test:accuracy test:confusion train:accuracy
  --trainer ClassifierTrainer constructor
   Java code for the constructor used to create a ClassifierTrainer. If no '(' appears, then "new " will be prepended and "Trainer()" will be appended. You may use this option multiple times to compare multiple classifiers.
   Default is new NaiveBayesTrainer()
  --output-classifier FILENAME
   The filename in which to write the classifier after it has been trained.
   Default is classifier.mallet
  --input FILENAME
   The filename from which to read the list of training instances. Use - for stdin.
   Default is text.vectors
  --training-file FILENAME
   Read the training set instance list from this file. If this is specified, the input file parameter is ignored.
   Default is text.vectors
  --testing-file FILENAME
   Read the test set instance list from this file. If this option is specified, the training-file parameter must be specified and the input-file parameter is ignored.
   Default is text.vectors
  --validation-file FILENAME
   Read the validation set instance list from this file. If this option is specified, the training-file parameter must be specified and the input-file parameter is ignored.
   Default is text.vectors
  --training-portion DECIMAL
   The fraction of the instances that should be used for training.
   Default is 1.0
  --validation-portion DECIMAL
   The fraction of the instances that should be used for validation.
   Default is 0.0
  --unlabeled-portion DECIMAL
   The fraction of the training instances that should have their labels hidden. Note that these are taken out of the training-portion, not allocated separately.
   Default is 0.0
  --random-seed INTEGER
   The random seed for randomly selecting a proportion of the instance list for training.
   Default is 0
  --num-trials INTEGER
   The number of random train/test splits to perform.
   Default is 1
  --classifier-evaluator CONSTRUCTOR
   Java code for constructing a ClassifierEvaluating object.
   Default is null
  --verbosity INTEGER
   The level of messages to print: 0 is silent, 8 is most verbose. Levels 0-8 correspond to the java.logger predefined levels off, severe, warning, info, config, fine, finer, finest, all. The default value is taken from the mallet logging.properties file, which currently defaults to INFO level (3).
   Default is -1
  --noOverwriteProgressMessages true|false
   Suppress writing-in-place on terminal for progress messages - repetitive messages of which only the latest is generally of interest.
   Default is false
  --cross-validation INT
   The number of folds for cross-validation (DEFAULT=0).
   Default is 0
C:\mallet\bin>mallet train-classifier --training-file text.vectors
java.io.FileNotFoundException: text.vectors (The system cannot find the file specified)
        at java.io.FileInputStream.open(Native Method)
        at java.io.FileInputStream.<init>(Unknown Source)
        at cc.mallet.types.InstanceList.load(InstanceList.java:787)
        at cc.mallet.classify.tui.Vectors2Classify.main(Vectors2Classify.java:262)
Exception in thread "main" java.lang.IllegalArgumentException: Couldn't read InstanceList from file text.vectors
        at cc.mallet.types.InstanceList.load(InstanceList.java:794)
        at cc.mallet.classify.tui.Vectors2Classify.main(Vectors2Classify.java:262)
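This one, I think, is my own mistake again: text.vectors is only the default value of --input / --training-file, and I never created such a file. Going by the help above, the flow would presumably be to import labelled data first and then point train-classifier at the resulting file, something like the sketch below. Here labels.txt is a hypothetical file with one instance per line in the form "name label text", imported without --keep-sequence because the classifier works on feature vectors rather than sequences:

rem labels.txt is a hypothetical labelled data file, one "name label text" instance per line
mallet import-file --input labels.txt --output labels.mallet
rem hold out 10% of the instances for testing
mallet train-classifier --input labels.mallet --trainer NaiveBayes --training-portion 0.9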
C:\mallet\bin>
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
REPORT NO.2:
   The number of iterations between printing a brief summary of the topics so far.
   Default is 50
  --output-model-interval INTEGER
   The number of iterations between writing the model (and its Gibbs sampling state) to a binary file. You must also set the --output-model to use this option, whose argument will be the prefix of the filenames.
   Default is 0
  --output-state-interval INTEGER
   The number of iterations between writing the sampling state to a text file. You must also set the --output-state to use this option, whose argument will be the prefix of the filenames.
   Default is 0
  --optimize-interval INTEGER
   The number of iterations between reestimating dirichlet hyperparameters.
   Default is 0
  --optimize-burn-in INTEGER
   The number of iterations to run before first estimating dirichlet hyperparameters.
   Default is 200
  --use-symmetric-alpha true|false
   Only optimize the concentration parameter of the prior over document-topic distributions. This may reduce the number of very small, poorly estimated topics, but may disperse common words over several topics.
   Default is false
  --use-ngrams true|false
   Rather than using LDA, use Topical-N-Grams, which models phrases.
   Default is false
  --use-pam true|false
   Rather than using LDA, use Pachinko Allocation Model, which models topical correlations. You cannot do this and also --use-ngrams.
   Default is false
  --alpha DECIMAL
   Alpha parameter: smoothing over topic distribution.
   Default is 50.0
  --beta DECIMAL
   Beta parameter: smoothing over unigram distribution.
   Default is 0.01
  --gamma DECIMAL
   Gamma parameter: smoothing over bigram distribution.
   Default is 0.01
  --delta DECIMAL
   Delta parameter: smoothing over choice of unigram/bigram.
   Default is 0.03
  --delta1 DECIMAL
   Topic N-gram smoothing parameter.
   Default is 0.2
  --delta2 DECIMAL
   Topic N-gram smoothing parameter.
   Default is 1000.0
  --pam-num-supertopics INTEGER
   When using the Pachinko Allocation Model (PAM) set the number of supertopics. Typically this is about half the number of subtopics, although more may help.
   Default is 10
  --pam-num-subtopics INTEGER
   When using the Pachinko Allocation Model (PAM) set the number of subtopics.
   Default is 20
Exception in thread "main" java.lang.IllegalArgumentException: Unrecognized option 2: --keep-sequence
        at cc.mallet.util.CommandOption$List.process(CommandOption.java:344)
        at cc.mallet.util.CommandOption.process(CommandOption.java:146)
        at cc.mallet.topics.tui.Vectors2Topics.main(Vectors2Topics.java:200)
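The paste above starts partway through the usage message that train-topics printed before failing. If I read the error correctly, --keep-sequence is not a train-topics option at all; in the tutorial it belongs to the import step, and train-topics is then given only the imported .mallet file. So the two steps would presumably look something like this, reusing my mydata.txt:

rem import step: --keep-sequence and --remove-stopwords go here
mallet import-file --input mydata.txt --output output.mallet --keep-sequence --remove-stopwords
rem topic-modelling step: no --keep-sequence here
mallet train-topics --input output.mallet --num-topics 10 --output-topic-keys keys.txt --output-doc-topics composition.txt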
C:\mallet\bin>dir
Volume in drive C is Acer
Volume Serial Number is 7A35-B119
Directory of C:\mallet\bin
05/17/2013 01:20 AM <DIR> .
05/17/2013 01:20 AM <DIR> ..
09/02/2011 12:50 PM 635 classifier2info
09/02/2011 12:50 PM 632 csv2classify
09/02/2011 12:50 PM 631 csv2vectors
09/02/2011 12:50 PM 2,347 mallet
09/02/2011 12:50 PM 2,471 mallet.bat
09/02/2011 12:50 PM 1,771 mallethon
05/17/2013 01:19 AM 85 mydata.txt
05/17/2013 01:20 AM 7,379 output.mallet
09/02/2011 12:50 PM 63 prepend-license.sh
09/02/2011 12:50 PM 636 svmlight2vectors
09/02/2011 12:50 PM 633 text2classify
09/02/2011 12:50 PM 632 text2vectors
09/02/2011 12:50 PM 636 vectors2classify
09/02/2011 12:50 PM 632 vectors2info
09/02/2011 12:50 PM 631 vectors2topics
09/02/2011 12:50 PM 635 vectors2vectors
16 File(s) 20,449 bytes
2 Dir(s) 415,176,773,632 bytes free
C:\mallet\bin>text2vectors --input output.mallet --output x1.vectors
'text2vectors' is not recognized as an internal or external command,
operable program or batch file.
C:\mallet\bin>text2vectors
'text2vectors' is not recognized as an internal or external command,
operable program or batch file.
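Same story as svmlight2vectors above: text2vectors has no .bat version, so Windows will not run it. In any case I suspect it was the wrong tool here, since output.mallet is already in MALLET's instance format and would not need converting again; the text2* scripts take raw text as input. The Windows equivalent for importing raw text would presumably be the import-file / import-dir subcommands already shown above.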
C:\mallet\bin>mallet
Mallet 2.0 commands:
import-dir         load the contents of a directory into mallet instances (one per file)
import-file        load a single file into mallet instances (one per line)
import-svmlight    load a single SVMLight format data file into mallet instances (one per line)
train-classifier   train a classifier from Mallet data files
train-topics       train a topic model from Mallet data files
infer-topics       use a trained topic model to infer topics for new documents
estimate-topics    estimate the probability of new documents given a trained model
hlda               train a topic model using Hierarchical LDA
prune              remove features based on frequency or information gain
split              divide data into testing, training, and validation portions
Include --help with any option for more information
C:\mallet\bin>mallet train-topics --input output.mallet --output x1.mallet
A tool for estimating, saving and printing diagnostics for topic models, such as LDA.
  --help TRUE|FALSE
   Print this command line option usage information. Give argument of TRUE for longer documentation.
   Default is false
  --prefix-code 'JAVA CODE'
   Java code you want run before any other interpreted code. Note that the text is interpreted without modification, so unlike some other Java code options, you need to include any necessary 'new's when creating objects.
   Default is null
  --config FILE
   Read command option values from a file.
   Default is null
  --input FILENAME
   The filename from which to read the list of training instances. Use - for stdin. The instances must be FeatureSequence or FeatureSequenceWithBigrams, not FeatureVector.
   Default is null
  --language-inputs FILENAME [FILENAME ...]
   Filenames for polylingual topic model. Each language should have its own file, with the same number of instances in each file. If a document is missing in one language, there should be an empty instance.
   Default is (null)
  --testing FILENAME
   The filename from which to read the list of instances for empirical likelihood calculation. Use - for stdin. The instances must be FeatureSequence or FeatureSequenceWithBigrams, not FeatureVector.
   Default is null
  --output-model FILENAME
   The filename in which to write the binary topic model at the end of the iterations. By default this is null, indicating that no file will be written.
   Default is null
  --input-model FILENAME
   The filename from which to read the binary topic model to which the --input will be appended, allowing incremental training. By default this is null, indicating that no file will be read.
   Default is null
  --inferencer-filename FILENAME
   A topic inferencer applies a previously trained topic model to new documents. By default this is null, indicating that no file will be written.
   Default is null
  --evaluator-filename FILENAME
   A held-out likelihood evaluator for new documents. By default this is null, indicating that no file will be written.
   Default is null
  --output-state FILENAME
   The filename in which to write the Gibbs sampling state at the end of the iterations. By default this is null, indicating that no file will be written.
   Default is null
  --output-topic-keys FILENAME
   The filename in which to write the top words for each topic and any Dirichlet parameters. By default this is null, indicating that no file will be written.
   Default is null
  --topic-word-weights-file FILENAME
   The filename in which to write unnormalized weights for every topic and word type. By default this is null, indicating that no file will be written.
   Default is null
  --word-topic-counts-file FILENAME
   The filename in which to write a sparse representation of topic-word assignments. By default this is null, indicating that no file will be written.
   Default is null
  --xml-topic-report FILENAME
   The filename in which to write the top words for each topic and any Dirichlet parameters in XML format. By default this is null, indicating that no file will be written.
   Default is null
  --xml-topic-phrase-report FILENAME
   The filename in which to write the top words and phrases for each topic and any Dirichlet parameters in XML format. By default this is null, indicating that no file will be written.
   Default is null
  --output-doc-topics FILENAME
   The filename in which to write the topic proportions per document, at the end of the iterations. By default this is null, indicating that no file will be written.
   Default is null
  --doc-topics-threshold DECIMAL
   When writing topic proportions per document with --output-doc-topics, do not print topics with proportions less than this threshold value.
   Default is 0.0
  --doc-topics-max INTEGER
   When writing topic proportions per document with --output-doc-topics, do not print more than INTEGER number of topics. A negative value indicates that all topics should be printed.
   Default is -1
  --num-topics INTEGER
   The number of topics to fit.
   Default is 10
  --num-threads INTEGER
   The number of threads for parallel training.
   Default is 1
  --num-iterations INTEGER
   The number of iterations of Gibbs sampling.
   Default is 1000
  --random-seed INTEGER
   The random seed for the Gibbs sampler. Default is 0, which will use the clock.
   Default is 0
  --num-top-words INTEGER
   The number of most probable words to print for each topic after model estimation.
   Default is 20
  --show-topics-interval INTEGER
   The number of iterations between printing a brief summary of the topics so far.
   Default is 50
  --output-model-interval INTEGER
   The number of iterations between writing the model (and its Gibbs sampling state) to a binary file. You must also set the --output-model to use this option, whose argument will be the prefix of the filenames.
   Default is 0
  --output-state-interval INTEGER
   The number of iterations between writing the sampling state to a text file. You must also set the --output-state to use this option, whose argument will be the prefix of the filenames.
   Default is 0
  --optimize-interval INTEGER
   The number of iterations between reestimating dirichlet hyperparameters.
   Default is 0
  --optimize-burn-in INTEGER
   The number of iterations to run before first estimating dirichlet hyperparameters.
   Default is 200
  --use-symmetric-alpha true|false
   Only optimize the concentration parameter of the prior over document-topic distributions. This may reduce the number of very small, poorly estimated topics, but may disperse common words over several topics.
   Default is false
  --use-ngrams true|false
   Rather than using LDA, use Topical-N-Grams, which models phrases.
   Default is false
  --use-pam true|false
   Rather than using LDA, use Pachinko Allocation Model, which models topical correlations. You cannot do this and also --use-ngrams.
   Default is false
  --alpha DECIMAL
   Alpha parameter: smoothing over topic distribution.
   Default is 50.0
  --beta DECIMAL
   Beta parameter: smoothing over unigram distribution.
   Default is 0.01
  --gamma DECIMAL
   Gamma parameter: smoothing over bigram distribution.
   Default is 0.01
  --delta DECIMAL
   Delta parameter: smoothing over choice of unigram/bigram.
   Default is 0.03
  --delta1 DECIMAL
   Topic N-gram smoothing parameter.
   Default is 0.2
  --delta2 DECIMAL
   Topic N-gram smoothing parameter.
   Default is 1000.0
  --pam-num-supertopics INTEGER
   When using the Pachinko Allocation Model (PAM) set the number of supertopics. Typically this is about half the number of subtopics, although more may help.
   Default is 10
  --pam-num-subtopics INTEGER
   When using the Pachinko Allocation Model (PAM) set the number of subtopics.
   Default is 20
Exception in thread "main" java.lang.IllegalArgumentException: Unrecognized option 2: --output
        at cc.mallet.util.CommandOption$List.process(CommandOption.java:344)
        at cc.mallet.util.CommandOption.process(CommandOption.java:146)
        at cc.mallet.topics.tui.Vectors2Topics.main(Vectors2Topics.java:200)
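So the same kind of mistake as before on my part: train-topics has no plain --output option. Going by its own help above, the results are requested with specific flags such as --output-state, --output-topic-keys and --output-doc-topics. Something like the following is what I now believe the tutorial intends, with the output filenames being my own arbitrary choices:

rem keys.txt, composition.txt and topic-state.gz are arbitrary output names of my choosing
mallet train-topics --input output.mallet --num-topics 10 --output-state topic-state.gz --output-topic-keys keys.txt --output-doc-topics composition.txt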
C:\mallet\bin>mallet --input output.mallet --trainer NaiveBayes
Mallet 2.0 commands:
import-dir         load the contents of a directory into mallet instances (one per file)
import-file        load a single file into mallet instances (one per line)
import-svmlight    load a single SVMLight format data file into mallet instances (one per line)
train-classifier   train a classifier from Mallet data files
train-topics       train a topic model from Mallet data files
infer-topics       use a trained topic model to infer topics for new documents
estimate-topics    estimate the probability of new documents given a trained model
hlda               train a topic model using Hierarchical LDA
prune              remove features based on frequency or information gain
split              divide data into testing, training, and validation portions
Include --help with any option for more information
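Here I think I simply forgot the command name: mallet needs one of the commands listed above before any options, otherwise it just prints this list again. What I probably meant was along these lines, though I suspect that output.mallet, having been imported with --keep-sequence for topic modelling, would first need to be re-imported without --keep-sequence before a classifier will accept it:

rem classifier-data.mallet is a hypothetical re-import of my data without --keep-sequence
mallet train-classifier --input classifier-data.mallet --trainer NaiveBayes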
C:\mallet\bin>
Regards,
Subhabrata.