A. S. Bradbury
= Ariel release 0.0.1
== Install
  gem install ariel
If the gem has not yet propagated, either wait or download the .gem file
from my RubyForge page and install that.
== Announcement
This is the first public release of Ariel - A Ruby Information Extraction
Library. See my previous post, ruby-talk:200140
[http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/200140] for more
background information. This release supports defining a tree document
structure and learning rules to extract each node of this tree. Handling of
list extraction and learning is not yet implemented, and is the next
immediate priority. See the examples directory included in this release and
below for discussion of the included examples. Rule learning is functional,
and appears to work well, but many refinements are possible. Look out for
more updates and a new release shortly.
== About Ariel
Ariel intends to assist in extracting information from semi-structured
documents including (but not in any way limited to) web pages. Although you
may use libraries such as Hpricot or Rubyful Soup, or even plain Regular
Expressions to achieve the same goal, Ariel approaches the problem very
differently. Ariel relies on the user labeling examples of the data they want
to extract, and then finds patterns across several such labeled examples in
order to produce a set of general rules for extracting this information from
any similar document. It uses the MIT license.
== Examples
This release includes two examples in the examples directory (which should now
be in the directory to which rubygems installed ariel). The first is the
google_calculator directory (inspired by Justin Bailey's post to my Ariel
progress report). The structure is very simple: a calculation is extracted
from the page, and then the actual result is extracted from that calculation.
Three labeled examples are included. Ariel reads each of these, tokenizes
them, and extracts each label. Four sets of rules are learnt:
1. Rules to locate the start of the calculation in the original document.
2. Rules to locate the end of the calculation in the original document
   (applied from the end of the document).
3. Rules to locate the start of the result of the calculation from the
   extracted calculation.
4. Rules to locate the end of the result of the calculation from the
   extracted calculation (applied from the end of the calculation).
Take note of 3 and 4 - this is the advantage of treating a document as a tree
in this way. Deeply nested elements can be located by generating a series of
simple rules, rather than generating a rule with complexity that increases at
each level. Sets of rules are generated because it may not be possible to
generate a single rule that will catch all cases. A rule is found that
matches as many of the examples as possible (and fails on the rest). These
examples are then removed, and a rule is found that matches as many of the
remaining examples as possible, and so on. When it comes to applying these
learnt rules, the rules are tried in order until one matches.
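The covering loop described above can be sketched roughly as follows. This is an illustration only, not Ariel's code: the method names learn_rule_set and first_matching_rule are hypothetical, and a "rule" is assumed to be any callable that returns true when it matches an example.

```ruby
# Hypothetical sketch of the covering loop; not Ariel's actual implementation.
# A "rule" is assumed to be any object that responds to call(example) and
# returns true when it matches that example.

# Greedily build an ordered rule set from a pool of candidate rules.
def learn_rule_set(candidates, examples)
  remaining = examples.dup
  rule_set = []
  until remaining.empty?
    # Pick the candidate that matches the most remaining examples.
    best = candidates.max_by { |rule| remaining.count { |ex| rule.call(ex) } }
    break if best.nil?
    covered = remaining.select { |ex| best.call(ex) }
    break if covered.empty? # no candidate matches anything that is left
    rule_set << best
    remaining -= covered
  end
  rule_set
end

# Try the rules in order and return the first one that matches.
def first_matching_rule(rule_set, document)
  rule_set.find { |rule| rule.call(document) }
end

starts_with_a = ->(s) { s.start_with?("a") }
starts_with_b = ->(s) { s.start_with?("b") }
rules = learn_rule_set([starts_with_b, starts_with_a], %w[apple avocado banana])
```

Here starts_with_a is chosen first because it covers two of the three examples, and starts_with_b is then learnt to cover the one that is left.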
To see this example for yourself, execute structure.rb in the
examples/google_calculator directory to create a locally writable
structure.yaml. Then run:
  ariel -D -m learn -s structure.yaml -d /examplepath/labeled
You'll have to wait a while (see my note about performance below). At the end,
the learnt rules will be printed in YAML format, and structure.yaml will be
updated to include these rules. Apply these learnt rules to some unlabeled
documents by running:
  ariel -D -m extract -s structure.yaml -d /examplepath/unlabeled
You should see the results of a successful extraction printed to your
terminal, such as this one:
  Results for unlabeled/2:
  calculation: 3.5 U.S. dollars = 1.8486241 British pounds
  result: 1.8486241 British pounds
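To make the nesting concrete, here is a minimal sketch of how a start rule (applied forward) and an end rule (applied from the end) could first isolate the calculation, and then isolate the result within it. The landmark-based rules, the method names, and the token list are all assumptions made for illustration; they are not Ariel's rule format or the real page markup.

```ruby
# Hypothetical sketch: the landmark-based rules and the token list below are
# illustrative assumptions, not Ariel's rule format or the real page markup.

# Scan forward and return the index of the first token after the landmark.
def apply_start_rule(tokens, landmark)
  index = tokens.index(landmark)
  index && index + 1
end

# Scan backward from the end and return the index of the last token before
# the landmark. A nil landmark means "extend to the final token".
def apply_end_rule(tokens, landmark)
  return tokens.length - 1 if landmark.nil?
  index = tokens.rindex(landmark)
  index && index - 1
end

# Extract the tokens between a start landmark and an optional end landmark.
def extract(tokens, start_after:, end_before: nil)
  first = apply_start_rule(tokens, start_after)
  last = apply_end_rule(tokens, end_before)
  return nil if first.nil? || last.nil? || first > last
  tokens[first..last]
end

page = %w[<html> <b> 3.5 U.S. dollars = 1.8486241 British pounds </b> </html>]

# Level 1: rules applied to the whole page.
calculation = extract(page, start_after: "<b>", end_before: "</b>")
# Level 2: rules applied only to the extracted calculation.
result = extract(calculation, start_after: "=")

puts "calculation: #{calculation.join(' ')}"
puts "result: #{result.join(' ')}"
```

The second-level rule only needs to know about the "=" token. It never has to describe where the calculation sits in the page, which is the point of treating the document as a tree.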
The second example (raa) learns rules using just 2 labeled examples. This is
probably fewer than I'd recommend in most cases, but it works here. This
example consists of project entries in the Ruby Application Archive. The
structure of the page is very flat, so all rules are applied to the full
page. Rules are learnt and applied as shown above. The structure.yaml files
included in the examples directories already include rules generated by
Ariel; use these if you just want to see extraction working.
Note: The interface demonstrated by ariel above is not very flexible or
friendly; it just serves as a demonstration for the moment.
== Performance
Generating rules takes quite a long time. It is always going to be an
intensive operation, but there are some very simple and obvious improvements
in efficiency that can be made. For a start, the rule candidate refining
process currently re-applies the same rules over and over every time the
remaining rule candidates are ranked. This is where most time is spent, and
caching the results of these rule applications should make a big difference.
This will definitely be implemented. Other performance improvements are bound
to be possible, but my focus at this time is to get something that works.
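As a rough idea of what such caching might look like, the sketch below memoizes the outcome of applying a rule to an example, so that repeated ranking passes do not re-apply the same rule. The CachedMatcher class and its methods are hypothetical and are not part of Ariel; a rule is again assumed to be a callable.

```ruby
# Hypothetical sketch of caching rule applications; not part of Ariel.
class CachedMatcher
  attr_reader :applications

  def initialize
    @cache = {}
    @applications = 0 # counts how often a rule was really applied
  end

  # Return the cached outcome for this (rule, example) pair if there is one;
  # otherwise apply the rule once and remember the outcome.
  def match?(rule, example)
    key = [rule.object_id, example]
    return @cache[key] if @cache.key?(key)
    @applications += 1
    @cache[key] = rule.call(example)
  end
end

matcher = CachedMatcher.new
has_equals = ->(s) { s.include?("=") }
examples = ["1 + 1 = 2", "no result here"]

# Two ranking passes over the same examples: the second pass hits the cache.
counts = Array.new(2) { examples.count { |ex| matcher.match?(has_equals, ex) } }
```

After both passes the rule has been applied only once per example. The cache key uses the rule's object_id, so this only helps when the same rule object is reused between passes.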
== Credits
Ariel is developed by Alex Bradbury as a Google Summer of Code project under
the mentoring of Austin Ziegler.
== Links
Watch my development through the subversion repository at
http://rubyforge.org/projects/ariel
I've also just started using the tracker at http://code.google.com/p/ariel/