[a tool for anonymizing XSLT testcases]
Feel free to present your solution here
OK, here goes.
* * *
The method that I have used when anonymizing SGML, XML and DTD files (or
any textual content for that matter) is roughly the following:
* Create scrambling key
I do this with a simple substitution cipher (alphabet soup). It works
like ROT-13, but instead of shifting the alphabet I shuffle it randomly.
This is not a perfect method, because the scrambling key can be
recovered using statistical methods and educated guesses, but it is
good enough for most cases.
---8<---8<---
use List::Util qw(shuffle);
$alpha = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
# Despite its name, $regex holds the shuffled alphabet, i.e. the key
$regex = join('', shuffle( split //, $alpha) );
---8<---8<---
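If the same key is needed again later (say, to anonymize one more file
for the same testcase), one option is to seed the random number
generator. A minimal sketch: make_key and the seed value are names made
up for illustration, and it assumes that shuffle draws from Perl's
built-in generator, which is the List::Util default.
---8<---8<---
use strict;
use warnings;
use List::Util qw(shuffle);

# Hypothetical helper: returns the plain alphabet and a shuffled copy.
# An optional seed makes the key repeatable on the same Perl build.
sub make_key {
    my ($seed) = @_;
    srand($seed) if defined $seed;
    my $alpha = join '', 'a' .. 'z', 'A' .. 'Z';
    my $key   = join '', shuffle(split //, $alpha);
    return ($alpha, $key);
}

my ($alpha, $key) = make_key(42);    # 42 is an arbitrary example seed
print "$alpha\n$key\n";
---8<---8<---
A fixed seed is as sensitive as the key itself, so it should stay out
of anything that gets published with the testcase.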
* Define keywords
By keywords I mean the words that must not be scrambled, so that the
end result still makes sense. For example DTD keywords would be "ELEMENT",
"ATTLIST", "PCDATA" and so forth. For XSLT there would be keywords like
"xsl:stylesheet" and "preceding-sibling::".
---8<---8<---
@keywords = ("ELEMENT", "ATTLIST", "PCDATA", [...] );
---8<---8<---
* Read the file and mangle all letters
Just read the file and transliterate each line according to the
scrambling key. (tr/// does not interpolate variables, hence the eval.)
---8<---8<---
eval "\$line =~ tr/$alpha/$regex/";
---8<---8<---
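The same transformation can also be sketched without the string eval,
using a lookup table and s///. This is only an alternative under
stated assumptions: %map and scramble() are illustrative names, and
$key plays the role of the shuffled alphabet.
---8<---8<---
use strict;
use warnings;
use List::Util qw(shuffle);

my $alpha = join '', 'a' .. 'z', 'A' .. 'Z';
my $key   = join '', shuffle(split //, $alpha);

# Lookup table: plain letter => scrambled letter.
my %map;
@map{ split //, $alpha } = split //, $key;

# Hypothetical helper: scramble every ASCII letter, leave the rest alone.
sub scramble {
    my ($text) = @_;
    $text =~ s/([A-Za-z])/$map{$1}/g;
    return $text;
}

print scramble(qq{<title id="t1">Lawnmower</title>\n});
---8<---8<---
Digits, punctuation and markup characters pass through untouched, so
the result keeps the shape of the original line.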
* Revert all keywords
Go through all the keywords and replace each scrambled keyword with its
original form.
---8<---8<---
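foreach $keyword (@keywords) {
    # Scramble the keyword the same way the line was scrambled
    $scramble = $keyword;
    eval "\$scramble =~ tr/$alpha/$regex/";
    # Put the original keyword back, every time it occurs on the line
    $line =~ s/\Q$scramble\E/$keyword/g;
}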
---8<---8<---
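The keyword restoring can also be done in one pass over the line. A
rough sketch under assumptions of its own: make_restorer is a
hypothetical helper, and the key is a fixed reversed alphabet only so
that the example is repeatable.
---8<---8<---
use strict;
use warnings;

# Hypothetical helper: takes the letter => scrambled-letter table and the
# keyword list, returns a sub that puts the original keywords back.
sub make_restorer {
    my ($map, @keywords) = @_;
    return sub { $_[0] } unless @keywords;

    my %restore;    # scrambled keyword => original keyword
    for my $keyword (@keywords) {
        (my $scrambled = $keyword) =~ s/([A-Za-z])/$map->{$1}/g;
        $restore{$scrambled} = $keyword;
    }

    # Longest first, so a keyword wins over a shorter one it contains.
    my $alternatives = join '|',
        map  { quotemeta($_) }
        sort { length($b) <=> length($a) } keys %restore;
    my $regex = qr/($alternatives)/;

    return sub {
        my ($line) = @_;
        $line =~ s/$regex/$restore{$1}/g;
        return $line;
    };
}

# Toy key for the demo only: reversed alphabet (a->z, b->y, ...).
my %map;
@map{ 'a' .. 'z', 'A' .. 'Z' } = (reverse('a' .. 'z'), reverse('A' .. 'Z'));

my $restore = make_restorer(\%map, "ELEMENT", "ATTLIST", "PCDATA");
(my $line = '<!ELEMENT concept (#PCDATA)>') =~ s/([A-Za-z])/$map{$1}/g;
print $restore->($line), "\n";
---8<---8<---
Like the loop above, this also restores a keyword that happens to be
part of a longer name; anchoring the pattern on word boundaries would
be one way to tighten that.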
* Loop for all files using the same scrambling key
This makes sure that the scrambled DTD definitions and XML tags match,
so that they can still be used together. Just pass all the files in a
single run, like:
---8<---8<---
$ ./anonymizer.pl *.dtd *.xml
---8<---8<---
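Put together, the steps could look roughly like the script below. It is
a sketch, not the actual anonymizer.pl: the ".anon" output name, the
short keyword list and the anonymize_line helper are assumptions made
for illustration.
---8<---8<---
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(shuffle);

my @keywords = ("ELEMENT", "ATTLIST", "PCDATA");    # extend as needed

# One key per run, shared by every file given on the command line.
my $alpha = join '', 'a' .. 'z', 'A' .. 'Z';
my $key   = join '', shuffle(split //, $alpha);

# tr/// is compiled once with the key baked in (both are letters only).
my $scramble = eval "sub { my (\$s) = \@_; \$s =~ tr/$alpha/$key/; return \$s }"
    or die "Cannot build scrambler: $@";

sub anonymize_line {
    my ($line) = @_;
    $line = $scramble->($line);
    for my $keyword (@keywords) {
        my $scrambled = $scramble->($keyword);
        $line =~ s/\Q$scrambled\E/$keyword/g;
    }
    return $line;
}

# Assumption: each result is written next to its input as "<name>.anon".
for my $file (@ARGV) {
    open my $in,  '<', $file        or die "Cannot read $file: $!";
    open my $out, '>', "$file.anon" or die "Cannot write $file.anon: $!";
    print {$out} anonymize_line($_) while <$in>;
    close $in;
    close $out or die "Cannot close $file.anon: $!";
}
---8<---8<---
Because the key is created once, before the loop, a name scrambled in a
DTD comes out the same way in every XML file of the same run.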
* * *
The result looks something like this (part of DITA concept.mod and
lawnmower concept sample):
---8<---8<---
<!ELEMENT SCySRFz ((%zTzJR;), (%zTzJRMJzZ;)?,
(%ZaCXzfRZS; | %MuZzXMSz;)?,
(%FXCJCB;)?, (%SCyuCfk;)?, (%XRJMzRf-JTycZ;)?,
(%SCySRFz-TyEC-zkFRZ;)* ) >
<!ATTLIST SCySRFz
Tf ID #REQUIRED
SCyXRE CDATA #IMPLIED
%ZRJRSz-MzzZ;
%JCSMJTmMzTCy-MzzZ;
%MXSa-MzzZ;
CbzFbzSJMZZ
CDATA #IMPLIED
fCjMTyZ CDATA "&TySJbfRf-fCjMTyZ;" >
---8<---8<---
<?xml version="1.0" encoding="utf-8"?>
<!-- daTZ ETJR TZ FMXz CE zaR vwdH AFRy dCCJcTz FXCxRSz aCZzRf Cy
qCbXSRECXBR.yRz. qRR zaR MSSCjFMykTyB JTSRyZR.zKz ETJR ECX
MFFJTSMuJR JTSRyZRZ.-->
<!-- (W) WCFkXTBaz wUY WCXFCXMzTCy 2001, 2005. HJJ tTBazZ tRZRXDRf.
*-->
<!DOCTYPE SCySRFz PUBLIC "-//AHqwq//vdv vwdH WCySRFz//gi"
"../../fzf/SCySRFz.fzf">
<SCySRFz Tf="JMPyjCPRXSCySRFz" xml:lang="en-us">
<zTzJR>IMPyjCPRX</zTzJR>
<SCyuCfk><F>daR JMPyjCPRX TZ M jMSaTyR bZRf zC Sbz BXMZZ Ty zaR kMXf.
IMPyjCPRXZ SMy uR
RJRSzXTS, BMZ-FCPRXRf, CX jMybMJ.</F></SCyuCfk>
</SCySRFz>
---8<---8<---
* * *
One can of course argue that this is not the most efficient way of
anonymizing, but it has worked for me so far. The biggest drawbacks of
this method are:
* URI scrambling
All filenames, folders and paths are scrambled in the process, so the
script should rename the files at the same time. However, it can be
difficult to spot those filenames automatically, and the URIs may also
contain folder names. Adding filenames as keywords and/or manually
renaming the files and folders afterwards might be the way to go.
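As a starting point for the renaming, the key can be applied to the
file name itself, so that the name on disk matches the references
inside the files. A sketch only: scrambled_path is a made-up helper, and
the one-letter shift stands in for the real scrambling sub.
---8<---8<---
use strict;
use warnings;
use File::Basename qw(fileparse);
use File::Spec;

# Hypothetical helper: keeps the directory, scrambles only the file name
# (extension included, since references inside the files get scrambled too).
sub scrambled_path {
    my ($path, $scramble) = @_;
    my ($name, $dir) = fileparse($path);
    return File::Spec->catfile($dir, $scramble->($name));
}

# Stand-in for the real key: shift lowercase letters by one.
my $toy_scramble = sub { my ($s) = @_; $s =~ tr/a-z/b-za/; return $s };

for my $path (@ARGV) {
    print "$path -> ", scrambled_path($path, $toy_scramble), "\n";
}
---8<---8<---
Folder names in relative URIs (like ../../dtd/) would need the same
treatment, which this sketch does not attempt.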
* listing keywords
Collecting the keywords from the specifications into the keyword list
by hand is a fairly big task, but it needs to be done only once.
* not true encryption
As mentioned, the scrambling key can be reverse engineered. It is
fairly easy to recover the key with statistical analysis. The effort
grows if the script is extended so that, instead of a simple
substitution, each letter is replaced with a varying number of random
letters. One step harder would be to have a "lossy key" that deletes
some of the letters, but then the script has to make sure that the end
result still makes sense (for example <i>). There are also other ways to
improve the key, but they all require more focus on the implementation.
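To illustrate the statistical weakness: a plain substitution keeps the
letter frequencies of the original text. The sketch below ranks the
letters of one scrambled sentence from the sample above; letter_ranking
is an illustrative name, not part of any existing tool.
---8<---8<---
use strict;
use warnings;

# Rank the ASCII letters of a text, most frequent first
# (ties are broken alphabetically).
sub letter_ranking {
    my ($text) = @_;
    my %count;
    $count{$_}++ for $text =~ /[A-Za-z]/g;
    return sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
}

my $scrambled = 'daR JMPyjCPRX TZ M jMSaTyR bZRf zC Sbz BXMZZ Ty zaR kMXf.';
my @rank = letter_ranking($scrambled);
print "Most frequent scrambled letters: @rank[0 .. 4]\n";
---8<---8<---
On a text of realistic length, the top of this ranking tends to line up
with the most common letters of the source language (e, t, a, o and so
on for English), which already gives away a good part of the key.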
Anyway, IMO this method is a compromise between functionality, the
amount of time spent on the script and secrecy, and it really depends on
the overall project and the NDA whether this method is suitable and
sufficient.