irishhacker
What percentage of Perl users use Perl for data munging
(cleaning up data, data transformation, etc.) on a fairly regular
basis?
Perl is particularly good at regular expressions, which are useful for
some types of data munging.
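To make "regular expressions for data munging" concrete, here is a minimal sketch of the kind of cleanup I mean. It is written in Python purely for illustration (Perl's s/// operator handles the same patterns), and the record layout and the clean_record name are made up for the example, not taken from any real data set.

```python
import re

def clean_record(line):
    """Normalize one raw, comma-separated record (hypothetical format)."""
    # Collapse runs of whitespace and trim both ends.
    line = re.sub(r"\s+", " ", line).strip()
    # Remove stray spaces around the comma delimiters.
    line = re.sub(r"\s*,\s*", ",", line)
    # Rewrite MM/DD/YYYY dates as YYYY-MM-DD.
    line = re.sub(r"\b(\d{2})/(\d{2})/(\d{4})\b", r"\3-\1-\2", line)
    return line

print(clean_record("  Smith ,  John ,03/15/2007 ,  42 "))
```

Three substitutions are enough to tidy the whitespace, the delimiters, and the date format; real data usually needs more rules than this.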
There are three main choices for data munging. Which choice is best
depends on the type of data problem one has: munging tasks vary widely,
both in degree of difficulty and in flavor.
ALL-PURPOSE PROGRAMMING LANGUAGES
obvious example: Perl
SPECIALIZED PROGRAMMING LANGUAGES
obvious example: SAS DATA step (but extremely expensive), also SPSS
(same thing: used to get data ready for analysis)
PSPP (GPL open source re-implementation of the SPSS programming language,
@ http://directory.fsf.org/math/stats )
DAP (GPL open source re-implementation of the SAS programming language,
@ http://directory.fsf.org/math/stats )
vilno (GPL open source, another data transformation programming
language and engine, @ http://code.google.com/p/vilno )
GRAPHICAL USER INTERFACE
Kettle ( http://kettle.pentaho.org )
KETL ( http://www.ketl.org ), and on and on.
GUI tools are particularly popular for the "T" part of "ETL".
ETL is always marketed as having a GUI front-end; no one ever mentions
using an ETL programming language.
If the complexity/quality of the data is not that bad, and hence the
required munging is not too complicated, then a GUI product is good.
But if Murphy's law strikes with the databases (if something can go
wrong, it will), programming languages provide more flexibility for bad
situations.