Javascript Collection, Obfuscation, Crawling?

Steve H.

Hello all,
I am a visiting researcher at a laboratory this summer, and my
current task is investigating JavaScript obfuscation techniques. I am
trying to get a relatively large sample of websites containing
JavaScript code so I can analyze it and determine whether it is:
1) obfuscated
2) malicious

I have a fairly good idea of what the result will be, but it would
be nice to have statistics on my side. That said, I believe a very
large sample size will be necessary for my analysis.

Now for my question: does anyone know of a way to use a web browser
or other component to automatically find JavaScript samples? Google
has not yielded any results, and the code search merely searches
repositories, which is not exactly what I need.

Short of rolling my own crawler, can anyone offer any suggestions that
will aid me in my task?

Thanks!
 
Evertjan.

Steve H. wrote on 24 jul 2007 in comp.lang.javascript:
Hello all,
I am a visiting researcher at a laboratory this summer and my
current task is investigating javascript obfuscation techniques. I am
trying to get a relatively large sample of website containing
javascript code so I can analyze it and determine if it is:
1) obfuscated
2) malicious

What is the sense of determining if js is obfuscated?

You would first need a decent definition of obfuscation.

Do you really think, or does your employer, that the level of
"obfuscation" is a measure of probability of maliciousness?

The common understanding on this NG is, methinks, that obfuscation only
deters the users that cannot even read plain js.
I have a fairly decent inference what the result will be, but it would
be nice to have statistics on my side. Having said that, I believe it
will be necessary to have a very large sample size to perform my
analysis.

You should do a random pilot and extrapolate, having determined the
randomness with other parameters. A professional statistician looking
over your shoulder is a must here. Do not throw salt into her eyes.
Now for my question, does anyone know if there are any ways to utilize
a web browser or other component to automatically find javascript
samples? Google has not yielded any results, and the code search
merely searches repositories; not exactly what I need.

Short of rolling my own crawler, can anyone offer any suggestions that
will aid me in my task?

Someone has to crawl, and if it's not Google, it must be you, meseems.

Building one is not that difficult; just write an XMLHTTP
(XMLHttpRequest) function.

I would use Google with some simple words to get a large number of URLs
quickly, measure the number of bytes between <script> and </script> in
the received strings, and check for external .js files.
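
The measurement suggested above could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name summarizeScripts and the sample page are hypothetical, and the regular expressions approximate HTML rather than parse it, so unusual markup will be miscounted.

```javascript
// Hypothetical helper: summarize the <script> usage in one HTML string.
// Regex matching is an approximation; it is not a real HTML parser.
function summarizeScripts(html) {
  const scriptTag = /<script\b([^>]*)>([\s\S]*?)<\/script\s*>/gi;
  const srcAttr = /(?:^|\s)src\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))/i;
  const result = { inlineBytes: 0, externalSources: [] };
  let match;
  while ((match = scriptTag.exec(html)) !== null) {
    const [, attributes, body] = match;
    const src = srcAttr.exec(attributes);
    if (src) {
      // External script: remember where it would have to be fetched from.
      result.externalSources.push(src[1] ?? src[2] ?? src[3]);
    } else {
      // Inline script: count UTF-8 bytes rather than UTF-16 code units.
      result.inlineBytes += new TextEncoder().encode(body).length;
    }
  }
  return result;
}

// Made-up page used only to exercise the helper.
const samplePage = [
  '<html><head>',
  '<script src="/js/lib.js"></script>',
  '<script type="text/javascript">var a=1;</script>',
  '</head><body></body></html>',
].join('\n');

console.log(summarizeScripts(samplePage));
```

In a crawler, the HTML string would come from whatever HTTP client is in use, and each entry in externalSources would need to be resolved against the page URL and fetched separately before its size could be counted.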
 
Steve H.

Steve H. wrote on 24 jul 2007 in comp.lang.javascript:

What is the sense of determining if js is obfuscated?

You would first need a decent definition of obfuscation.

Do you really think, or does your employer, that the level of
"obfuscation" is a measure of probability of maliciousness?

No, I do not think this, nor does my employer.
The common understanding on this NG is, methinks, that obfuscation only
deters the users that cannot even read plain js.

I agree.
You should do a random pilot and extrapolate, having determined the
randomness with other parameters. A professional statistician looking
over your shoulder is a must here. Do not throw salt into her eyes.

This is a bit assuming, but thank you for the suggestion. Let's just
say that there are enough people in my vicinity to verify my results
and ensure that I perform statistical tests properly. Having said that,
I am no stranger to the field.
Someone has to crawl, and if it's not Google, it must be you, meseems.

Building one is not that difficult; just write an XMLHTTP
(XMLHttpRequest) function.

I wasn't really concerned with difficulty, I was just wondering if
someone knew of a method to save me some time; I am currently juggling
multiple projects and this one is a little lower in priority than
others.
I would use Google with some simple words to get a fast amount of URLs
and measure the amount of bytes between <script and /script> in the
received strings, and check for external .js files.

I will probably write my own crawler in conjunction with the Google
API.

Thank you again for your suggestions, but I found many of your
statements assuming and/or loaded. I wish you would have asked me
questions for clarification without introducing a bias into the way
you ask said questions; personally, I find that a bit insulting.
 
RobG

Detecting obfuscated code should be fairly straightforward. Look for
the patterns:

function <identifier>
var <identifier>

and compare the amount of white space to character data. If the
average length of identifiers is short (say 2 characters) and the
percentage of white space is very low (say less than 5%, testing will
tell), the code is likely obfuscated.
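
A rough sketch of this heuristic might look like the following. The function names, the sample strings, and the exact cut-offs are assumptions for illustration; the thresholds simply reuse the guesses above (2 characters, 5%) and would need tuning against real data. The regular expression also matches inside strings and comments, so the numbers are approximate.

```javascript
// Hypothetical helper implementing the heuristic described above.
function obfuscationMetrics(source) {
  // Names that follow "function" or "var"; only the first name of each
  // "var a, b" list is captured, which is enough for a rough average.
  const declaration = /\b(?:function|var)\s+([A-Za-z_$][\w$]*)/g;
  const names = Array.from(source.matchAll(declaration), (m) => m[1]);
  const totalNameLength = names.reduce((sum, name) => sum + name.length, 0);
  const whitespace = (source.match(/\s/g) || []).length;

  return {
    identifierCount: names.length,
    averageIdentifierLength: names.length ? totalNameLength / names.length : 0,
    whitespaceRatio: source.length ? whitespace / source.length : 0,
  };
}

// Thresholds are the rough guesses from the post, not measured values.
function looksObfuscated(source, maxAverageLength = 2, maxWhitespaceRatio = 0.05) {
  const m = obfuscationMetrics(source);
  return (
    m.identifierCount > 0 &&
    m.averageIdentifierLength <= maxAverageLength &&
    m.whitespaceRatio < maxWhitespaceRatio
  );
}

// Made-up samples used only to exercise the helpers.
const readable =
  'function addNumbers(first, second) {\n  var total = first + second;\n  return total;\n}\n';
const packed =
  'function a(b,c){var d=b+c;d=d*d+b*c-(b-c)*(b+c);d=d/(b*b+c*c+1);return d}';

console.log(looksObfuscated(readable), looksObfuscated(packed));
```

Very short scripts tend to fail the white space test even when packed, because the spaces that keywords require make up a large share of the text, so the ratio is more meaningful on longer files.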

I don't know if you intend to infer any particular motive to
obfuscation, but when used to minimize identifier lengths and remove
all unnecessary white space (i.e. minification) it can seriously
reduce the size of scripts, providing the benefits of faster downloads
and lower data volume. The fact that obfuscated code is also (very)
difficult to read is seen as a bonus by some, though it should not be
the primary purpose for using it.

For example, Google's map scripts are (or were, I haven't checked
lately) obfuscated, yet within a very short time manually
'de-obfuscated' versions appeared on the web, published by those who
wanted to share how they worked. I expect Google wasn't concerned about
that as they were likely after the minification benefits rather than
attempting to protect their copyright.

As for malicious code, I think you need to know exactly what you are
looking for, e.g. the recently publicised IE and Firefox protocol
handling flaw or the supposed iPhone vulnerability. I think
javascript might be used as a transport to deliver a malicious
object (say, an applet, animation or image), but it is unlikely that
the script itself will be malicious.
 
Evertjan.

Steve H. wrote on 24 jul 2007 in comp.lang.javascript:
Thank you again for your suggestions, but I found many of your
statements assuming and/or loaded. I wish you would have asked me
questions for clarification without introducing a bias into the way
you ask said questions; personally, I find that a bit insulting.

You were on the asking side, providing not even enough info about your
own presumed qualities, so if you want only niceties, try a paid
helpdesk.

This is usenet, so get used to it, Steve.
This is a bit assuming, but thank you for the suggestion. Let's just
say that there are enough people in my vicinity to verify my results
and ensure that perform statistical tests properly. Having said that,
I am no stranger to the field.

Again, how could we know you are "no stranger to the field" of
statistics?

In the medical field, where I work, checking your own research statistics
is rightly felt to introduce hidden biases.
No, I do not think this, nor does my employer.

So why are you [plural] searching for obfuscation at all, if,
as I surmise, you are after malicious code on the web?

===

I think a pilot that is properly conducted, in the statistical sense,
will give you a reasonable idea of the computer time involved in finding
enough of the code you are after. Perhaps the main enterprise would take
12 years, or 2 hours of computer time; who is to say without a pilot? And
even then extrapolation, the standard goal of a pilot, remains dangerous,
as some hidden timing effect could act exponentially or the pilot's URL
batch could prove to be non-representative on a larger scale.
 
Thomas 'PointedEars' Lahn

RobG said:
I don't know if you intend to infer any particular motive to
obfuscation, but when used to minimize identifier lengths and remove
all unnecessary white space (i.e. minification) it can seriously
reduce the size of scripts, providing the benefits of faster downloads
and lower data volume. [...]

I wouldn't be so sure about that. For example, omitting white space
characters tends to require delimiter characters that were otherwise not
needed.
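
A small illustration of that point, assuming the simplest case of two statements separated only by a line break (the variable and function names are made up for the example):

```javascript
// Two statements separated only by a line break; automatic semicolon
// insertion lets the parser treat them as separate statements.
const multiLine = 'var a = 1\nvar b = 2';

// Naively deleting every white space character fuses tokens together.
const naive = multiLine.replace(/\s/g, '');

// A working one-line form keeps the space after "var" and replaces the
// line break with a semicolon, so that separator still costs a character.
const oneLine = 'var a=1;var b=2';

// Returns true when the source parses as a function body.
function isValidScript(source) {
  try {
    new Function(source); // parses the source without running it
    return true;
  } catch (error) {
    return false;
  }
}

console.log(naive, isValidScript(naive));
console.log(oneLine, isValidScript(oneLine));
```

The spaces around the operators can be dropped for free, but the line break and the space after each keyword have to be kept or traded for another character.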


PointedEars
 
Mark Szlazak

Also, another test for obfuscation may be to check whether there are
any comments in the script. Comments are usually removed from the
source in compressed/obfuscated code.
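
A comment check along these lines could be sketched as below. The helper name countComments and the sample strings are hypothetical, and the scanner is deliberately simple: it skips quoted strings so that a "//" inside a URL is not counted, but it does not understand regular expression literals or template literals.

```javascript
// Hypothetical helper: count comments with a small scanner that skips
// string literals. Regex literals and template literals are not handled.
function countComments(source) {
  let count = 0;
  let quote = null; // quote character of the string we are inside, if any
  for (let i = 0; i < source.length; i++) {
    const ch = source[i];
    const next = source[i + 1];
    if (quote) {
      if (ch === '\\') i++; // skip the escaped character
      else if (ch === quote) quote = null;
    } else if (ch === '"' || ch === "'") {
      quote = ch;
    } else if (ch === '/' && next === '/') {
      // Line comment: jump to the end of the line.
      count++;
      const end = source.indexOf('\n', i);
      i = end === -1 ? source.length : end;
    } else if (ch === '/' && next === '*') {
      // Block comment: jump to the closing "*/".
      count++;
      const end = source.indexOf('*/', i + 2);
      i = end === -1 ? source.length : end + 1;
    }
  }
  return count;
}

// Made-up samples used only to exercise the helper.
const commented =
  '// add two numbers\nfunction add(a, b) {\n  /* sum */\n  return a + b;\n}\n';
const stripped = 'function add(a,b){return a+b}var u="http://example.com/";';

console.log(countComments(commented), countComments(stripped));
```

A count of zero is only a hint, since hand-written scripts without comments are common too; it is probably best combined with the other measures rather than used alone.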
 
