[ANN] Metadata 1.1

Ilmari Heikkinen

tarball: http://dark.fhtr.org/repos/metadata/metadata-1.1.tar.gz
gem: http://dark.fhtr.org/repos/metadata/metadata-1.1.gem
git: http://dark.fhtr.org/repos/metadata


Changes
-------
* more README documentation
- all output fields in appendix
- grouped tested formats
* more extensive testing
* fixed a bug with document text extraction
* took out empty Document.PageSizeNames

* use more fields from extract
(keywords, language, revision history among others)

* use more dcraw metadata, ignore failed exif for raws
* renamed Image.Frames to Image.FrameCount
* added Image.LayerCount for layered images
* use more fields from exif: colorspace, colormode
* fixed exif output to use numbers instead of strings where
appropriate (focal length, exposure time, ISO speed, Fnumber)

* optional md5sum and/or sha1sum in the metadata:
mdh [-m] [-s]
and
Metadata.sha1sum|md5sum = true|false


Thanks
------

Konrad Meyer for his patient testing and bug reports.
Darren Kirby for the heads-up on wmainfo's ASF-parsing capabilities
(along with being the author of wmainfo-rb and flacinfo-rb.)


Description
-----------

This package `Metadata' comes with a library called `metadata' and
a small program called `mdh'.

The library probes files for their metadata (e.g. jpeg dimensions
and camera make, mp3 artist, pdf text and word count) and returns the
metadata as a Hash. All strings in the metadata are converted to UTF-8.

The `mdh' program can print out file metadata as YAML and package the
metadata with the file.

The metadata hash follows the shared file metadata spec naming, with some
additional fields; see the list at the end of this file (Appendix A).

For details on the MDH file format, see the end of this file (Appendix B).


Usage
-----

# print out metadata for myfile.jpg
mdh myfile.jpg

# create myfile.jpg.mdh, which consists of an MDH metadata header + myfile.jpg
mdh -c myfile.jpg

# print out the metadata header from an MDH file
mdh -e -p myfile.jpg.mdh

# strip out the metadata header from an MDH file and save it to myfile.jpg
mdh -e myfile.jpg.mdh

# print out the list of options
mdh -h

irb> require 'metadata'
irb> Metadata.extract('myfile.jpg')
irb> Metadata.extract_text('myfile.pdf')
irb> Pathname.new("myfile.jpg").metadata
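
Since the library returns a plain Hash, ordinary Hash code is enough to
work with the result. The sketch below assumes the keys are strings of
the form "Namespace.Field" as listed in Appendix A; the announcement does
not state the key type, so treat that as an assumption. The sample values
are invented for illustration and are not real extractor output, and
`group_by_namespace` is a hypothetical helper.

```ruby
# Sample hash shaped like the fields in Appendix A.
# Values are made up; a real hash would come from Metadata.extract.
SAMPLE_METADATA = {
  'File.Format'      => 'image/jpeg',
  'File.Size'        => 123456,
  'Image.Width'      => 1600.0,
  'Image.Height'     => 1200.0,
  'Image.CameraMake' => 'ExampleCam'
}

# Hypothetical helper: split "Namespace.Field" keys into nested hashes,
# e.g. { 'Image' => { 'Width' => 1600.0, ... }, 'File' => { ... } }.
def group_by_namespace(metadata)
  grouped = {}
  metadata.each do |key, value|
    namespace, field = key.split('.', 2)
    (grouped[namespace] ||= {})[field] = value
  end
  grouped
end
```

With the sample above, group_by_namespace(SAMPLE_METADATA)['Image'] is a
hash with the keys 'Width', 'Height' and 'CameraMake'.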


List of supported formats
-------------------------

Audio:
Whatever you manage to make mplayer play.
Plus special handlers for FLAC, m4a, ape, musepack, wavpack and wma.

Successfully tested with:
mp3, flac, ogg, wav, ra, m4a, wma

Should also work:
wv, mpc, ape


Video:
Whatever you manage to make mplayer play.

Successfully tested with:
wmv, mov, divx, xvid, flv, ogm, mpg, mkv


Images:
Should handle pretty much anything.
I.e. anything handled by ExifTool, ImageMagick, Imlib2 or dcraw.

Successfully tested with:
Web formats:
jpeg, png, gif, svg
Camera raws:
nef, dng, crw, pef, orf
Image editor state dumps:
psd, xcf
The rest:
tga, tif, bmp, xpm, ppm


Documents:
Successfully tested with:
Web formats:
html, txt
Print formats:
pdf, ps, ps.gz
OO formats:
sxi, odp
MS formats:
doc, ppt, xls

- I'm using unoconv to convert OO & MS docs to temp PDFs for the text &
dimensions extraction, so those bits of data are missing. MSOffice docs
are missing dimensions for the same reason. Here's a way to get them:
( first, get Thumbnailer: http://dark.fhtr.org/repos/thumbnailer/ )
$ thumbnailer -s 1 -k foo.odp /tmp/foo.jpg
$ mdh foo.odp
$ rm foo.odp-temp.pdf /tmp/foo.jpg


Others:
- BitTorrent .torrent files
- Archive contents
- Whatever `extract' outputs and I am handling


Requirements
------------

* Ruby 1.8

* Tons of metadata extraction programs and libs.
This package has many dependencies since there is no single universal
metadata header format that all files use. Blame resource forks, filename
extensions, bags of bytes and mimetypes.

List of gems:
flacinfo-rb
wmainfo-rb
MP4Info
id3lib-ruby
apetag

List of Debian packages:
dcraw
libimlib2-ruby
extract
libimage-exiftool-perl
poppler-utils
mplayer
html2text
imagemagick
unhtml
pstotext
antiword
catdoc
shared-mime-info

* You do want to install the latest versions of dcraw and
shared-mime-info to be able to handle camera raw images.
http://cybercom.net/~dcoffin/dcraw/
http://freedesktop.org/wiki/Software/shared-mime-info

* Python + chardet library
http://chardet.feedparser.org/


Install
-------

Decompress the archive and enter its top directory.
Then type:

($ su)
# ruby setup.rb

This simple step installs this program under the default
location of Ruby libraries. You can also install files into
your favorite directory by supplying setup.rb some options.
Try "ruby setup.rb --help".


Appendix A: Metadata fields
--------------------------------------

This list contains the metadata fields output by Metadata and mdh.
The list follows the shared file metadata spec for the most part.
http://wiki.freedesktop.org/wiki/Specifications/shared-filemetadata-spec

field name | field type
----------------------------------------------------------------------
Archive.Contents array of pathnames

Audio.Band string
Audio.Composer string
Audio.Conductor string
Audio.Copyright string (copyright message)
Audio.Grouping string
Audio.Image binary string (embedded image data)
Audio.InterpretedBy string
Audio.Lyricist string
Audio.Publisher string
Audio.RemixedBy string
Audio.Subtitle string
Audio.Tempo integer
Audio.VariableBitrate boolean
Audio.Writer string
Audio.Publicationright string
Audio.File string
Audio.EAN/UPC string
Audio.ISBN string
Audio.Catalog string
Audio.LC string
Audio.Media string
Audio.Index string
Audio.Related string
Audio.ISRC string
Audio.Abstract string
Audio.Language string
Audio.Bibliography string
Audio.Introplay string
Audio.Dummy string
Audio.DebutAlbum string
Audio.RecordDate string
Audio.RecordLocation string
v-- ORIGINAL FIELDS USED --v
Audio.Title string
Audio.Artist string
Audio.Album string
Audio.AlbumArtist string
Audio.AlbumTrackCount integer
Audio.TrackNo integer
Audio.DiscNo integer
Audio.Performer string
Audio.Duration float
Audio.ReleaseDate datetime
Audio.Comment string
Audio.Genre string
Audio.Codec string
Audio.Samplerate integer
Audio.Bitrate float
Audio.Channels integer
Audio.Lyrics string

Doc.Album string
Doc.Artist string
Doc.Charset string
Doc.Description string
Doc.Genre string
Doc.Language string
Doc.ModifyDate date
Doc.PageSizeName string (A4, A5, letter, ...)
Doc.RevisionHistory array of strings
Doc.ParagraphCount integer
Doc.LineCount integer
Doc.CharacterCount integer
Doc.LastSavedBy string
Doc.Keywords array of strings
Doc.Template string
v-- ORIGINAL FIELDS USED --v
Doc.Title string
Doc.Subject string
Doc.Author string
Doc.PageCount integer
Doc.WordCount integer
Doc.Created datetime

File.Software string (software used to create the file)
File.MD5Sum string (md5sum of file's contents)
File.SHA1Sum string (sha1sum of file's contents)
v-- ORIGINAL FIELDS USED --v
File.Format string (mime type, inode/directory for dirs)
File.Size integer
File.Content string
File.Modified string

Image.DateCreated date
Image.DateTimeCreated date
Image.DateTimeOriginal date
Image.DimensionUnit string (px, mm, pt, ...)
Image.Editor string
Image.EXIF string (exiftool output)
Image.FrameCount integer
Image.LayerCount integer
Image.Modified date
Image.OriginatingProgram string
Image.ComponentCount integer
Image.ColorMode string (e.g. RGB)
Image.ColorSpace string (e.g. sRGB)
v-- ORIGINAL FIELDS USED --v
Image.Height float
Image.Width float
Image.Title string
Image.Date datetime
Image.Creator string
Image.Description string
Image.Software string
Image.CameraMake string
Image.CameraModel string
Image.ExposureProgram string
Image.ExposureTime float
Image.Fnumber float
Image.Flash boolean
Image.FocalLength float
Image.ISOSpeed float
Image.MeteringMode string
Image.WhiteBalance string
Image.Copyright string

Location.Latitude float
Location.Longitude float

Video.Album string
Video.Artist string
Video.Bitrate integer
Video.Codec string
Video.Comment string
Video.Duration float
Video.Framerate float (frames per second)
Video.Genre string
Video.ReleaseDate date
Video.Title string
Video.TrackNo integer
Video.Demuxer string

BitTorrent.Name string
BitTorrent.Files array of { 'path' => string,
'length' => integer,
'md5sum' => string }
BitTorrent.Length integer (size of single-file torrents)
BitTorrent.MD5Sum string (md5sum for single-file torrents)
BitTorrent.PieceCount integer
BitTorrent.PieceLength integer (length of a single piece)
BitTorrent.Comment string
BitTorrent.Announce string (announce url)
BitTorrent.AnnounceList array of arrays of strings
BitTorrent.Nodes array of [hostname, port] arrays



Appendix B: The MDH file format
-------------------------------

MDH files are built as follows:

bytes | content
---------------
3 | "MDH" - MDH file format identifier
1 | "\x01" - MDH file format version number
4 | Long, network byte order - the size of the metadata struct in bytes
var | YAML - The MDH metadata struct
var | The actual file contents

All string fields in the metadata are UTF-8.
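
The layout above maps directly onto Array#pack and String#unpack. The
following is a minimal sketch of writing and reading that layout, not the
actual mdh implementation; the method names `mdh_pack` and `mdh_unpack`
are hypothetical. It assumes the 4-byte size is an unsigned 32-bit
big-endian integer ("N" in pack notation) counting the bytes of the YAML
text.

```ruby
require 'yaml'

# Build an MDH blob: "MDH", version byte 1, 4-byte big-endian YAML size,
# the YAML metadata, then the original file contents.
def mdh_pack(metadata, contents)
  yaml = metadata.to_yaml
  # a3 = 3-byte magic, C = version byte, N = 32-bit network-order length,
  # a* = raw bytes. Packing everything at once yields one binary string.
  ['MDH', 1, yaml.bytesize, yaml, contents].pack('a3CNa*a*')
end

# Split an MDH blob back into [metadata_hash, file_contents].
def mdh_unpack(blob)
  magic, version, size = blob.unpack('a3CN')
  raise ArgumentError, 'not an MDH file' unless magic == 'MDH'
  raise ArgumentError, "unsupported MDH version #{version.inspect}" unless version == 1
  # Skip the 8-byte header, read `size` bytes of YAML, the rest is the file.
  yaml, contents = blob.unpack("x8a#{size}a*")
  [YAML.load(yaml), contents]
end
```

Usage would look like `blob = mdh_pack(hash, File.read(path))` followed by
`metadata, contents = mdh_unpack(blob)`. Because the header carries the
length of the YAML, a reader can stop after the header and never touch
the file contents.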


License
-------

Ruby's


Konrad Meyer

Is the gem working now? If so, very cool.

Thanks,
--
Konrad Meyer http://konrad.sobertillnoon.com/

Ilmari Heikkinen

Quoth Konrad Meyer:

> Is the gem working now? If so, very cool.

It's working, but it's not on RubyForge. And I'm sort of queasy about
putting it there, due to the dephell of external programs.

Justification for dephell: those projects live or die based on
whether they handle everything in their specialty area. And I'm
too busy for NIH :-/
 
