Extracting and replacing url within href tag

Adnan Siddiqi

Hi
Suppose I have the following URLs coming from an HTML document:

<a href="http://mydomain1.com">Domain1</a>
<a
href="http://subdomain.domain.com/myfile.anyext">http://subdomain.domain.com/myfile.anyext</a>


<a href="http://subdomain.domain2.com/myfile.anyext">Domain2</a>

Now, I want to search for a URL pattern within the href attribute only,
and check whether it contains a particular domain, for instance
"domain2.com". If it does, the URL should be replaced with the following URL:

"http://redirectUrl.com/http://subdomain.domain2.com/myfile.anyext"

Can anyone shed light on this?

Thank you

-adnan
 
VK

Adnan said:
Hi
Suppose I have the following URLs coming from an HTML document:

<a href="http://mydomain1.com">Domain1</a>
<a
href="http://subdomain.domain.com/myfile.anyext">http://subdomain.domain.com/myfile.anyext</a>


<a href="http://subdomain.domain2.com/myfile.anyext">Domain2</a>

Now, I want to search for a URL pattern within the href attribute only,
and check whether it contains a particular domain, for instance
"domain2.com". If it does, the URL should be replaced with the following URL:

"http://redirectUrl.com/http://subdomain.domain2.com/myfile.anyext"

<script type="text/javascript">
function patchLinks() {
  var len = document.links.length;
  var lnk = null;
  for (var i=0; i<len; i++) {
    // take the i-th link, not the whole collection
    lnk = document.links[i];
    if (lnk.href.indexOf('domain2.com') != -1) {
      lnk.href = 'http://redirectUrl.com/' + lnk.href;
    }
  }
}

window.onload = patchLinks;
</script>
 
Thomas 'PointedEars' Lahn

VK said:
Adnan said:
Now, I want to search for a URL pattern within the href attribute only,
and check whether it contains a particular domain, for instance
"domain2.com". If it does, the URL should be replaced with the following URL:

"http://redirectUrl.com/http://subdomain.domain2.com/myfile.anyext"

<script type="text/javascript">
function patchLinks() {
  var len = document.links.length;
  var lnk = null;
  for (var i=0; i<len; i++) {
    // take the i-th link, not the whole collection
    lnk = document.links[i];
    if (lnk.href.indexOf('domain2.com') != -1) {
      lnk.href = 'http://redirectUrl.com/' + lnk.href;
    }
  }
}

window.onload = patchLinks;
</script>


Please /think/ before you code.


PointedEars
 
Thomas 'PointedEars' Lahn

Adnan said:
Suppose I have the following URLs coming from an HTML document:

<a href="http://mydomain1.com">Domain1</a>
<a
href="http://subdomain.domain.com/myfile.anyext">http://subdomain.domain.com/myfile.anyext</a>
<a href="http://subdomain.domain2.com/myfile.anyext">Domain2</a>

Now, I want to search for a URL pattern within the href attribute only,
and check whether it contains a particular domain, for instance
"domain2.com". If it does, the URL should be replaced with the following URL:

"http://redirectUrl.com/http://subdomain.domain2.com/myfile.anyext"

This is not a valid URL/URI. See RFC3986 and below.
can anyone shed light upon this?

First of all, you want to do this server-side, not client-side.

However, the language used then may be an ECMAScript implementation
as well. The only difference to the solution presented here is that
you will need to determine what is a link differently, and that you
have to parse the source code instead (unless you can make use of an
existing markup parser implementation).

Second, use Regular Expressions.

....
<html>
<head>
...
<meta http-equiv="Content-Script-Type" content="text/javascript">
<script type="text/javascript">
var _global = this;

/**
* Patches links referring to specific domains so that their target
* URL is appended to another URL.
*
* @param sDomains: string
* Links with domains to be redirected, delimited by <tt>|</tt>.
* @param sRedirectBase: string
* Base URI (prefix) for the redirection.
*/
function patchLinks(sDomains, sRedirectBase)
{
/**
* Tries hard to escape a string according to the query component
* specification in RFC3986.
*
* @partof
* http://pointedears.de/scripts/string.js
* @param s: string
* @return type string
* <code>s</code> escaped, or unescaped if escaping through
* <code>encodeURIComponent()</code> or <code>escape()</code>
* is not possible.
*/
function esc(s)
{
/**
* @author
* (C) 2003-2006 Thomas Lahn &lt;[email protected]&gt;
* Distributed under the GNU GPL v2.
* @partof
* http://pointedears.de/scripts/types.js
* @argument s
* String to be determined a method type, i.e. "object" for
* IE DOM methods, "function" otherwise. The type must have
* been retrieved with the `typeof' operator.
*
* Note that in contrast to @link{#isMethod()}, this
* method may also return <code>true</code> if the value of
* the <code>typeof</code> operand is <code>null</code>; to be
* sure that the operand is a method reference, you have to
* && (AND)-combine the <code>isMethodType(...)</code>
* expression with the method reference identifier.
*
* Use this method instead of <code>isMethod()</code> if
* you want to avoid warnings in case the property to be
* tested is not defined, or errors in case the property
* cannot be read.
* @return
* <code>true</code> if <code>s</code> is a method type,
* <code>false</code> otherwise.
* @type boolean
* @see #isMethod()
*/
function isMethodType(s)
{
return /\s*(function|object)\s*/.test(s);
}

return (isMethodType(typeof encodeURIComponent)
&& encodeURIComponent
? encodeURIComponent(s)
: (isMethodType(typeof escape) && escape
? escape(s)
: s));
}

for (var links = document.links, i = links && links.length; i--;)
{
  var
    link = links[i],
    rx = new RegExp(
      "^(ht|f)tps?:\\/\\/([^.]+\\.)*("
      + sDomains.replace(/\./g, "\\.")
      + ")(\\/|$)");

  if (rx.test(link.href))
  {
    link.href = sRedirectBase + esc(link.href);
  }
}
}
</script>
</head>

<body onload="patchLinks('domain2.com', 'http://redirectUrl.com/');">
...
</body>
</html>


PointedEars
 
Michael Winter

Adnan Siddiqi wrote:
[snip]

This is not a valid URL/URI. See RFC3986 and below.

To a point; the path doesn't contain hierarchical information. For that
reason, it's certainly a questionable URI - it would be more
conventional to include the embedded URI in the query string - but
nevertheless it does match the grammar expressed in RFC 3986.

Assuming your gripe is syntactic, rather than semantic (and I would
agree on the latter - no debate there), then I can only see two possible
causes: the colon and the empty segment.

Path segments may contain colons,

path-abempty = *( "/" segment )
segment = *pchar
pchar = unreserved / pct-encoded / sub-delims
/ ":" / "@"

as long as it isn't within the first path segment in a relative-path
reference:

relative-part = "//" authority path-abempty
/ path-absolute
/ path-noscheme
/ path-empty
path-noscheme = segment-nz-nc *( "/" segment )
segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims
/ "@" )
; non-zero-length segment without any colon ":"

Neither the prose nor the grammar prohibits empty path segments within a
path (and both easily could if the intention was there). In fact, the
prose alludes to the possibility of empty path segments as it states
that a path cannot begin "//" when there is no authority component, but
it doesn't say that such a sequence can't appear elsewhere.

[snip]

Mike
 
Thomas 'PointedEars' Lahn

Michael said:
To a point; the path doesn't contain hierarchical information.

It does not have to:

| 3.3. Path
|
| The path component contains data, usually organized in hierarchical
| form [...]
|
| A path consists of a sequence of path segments separated by a slash
| ("/") character. A path is always defined for a URI, though the
| defined path may be empty (zero length). Use of the slash character
| to indicate hierarchy is only required when a URI will be used as the
| context for relative references.
For that reason, it's certainly a questionable URI -

There is a better reason for calling it questionable at best.
it would be more conventional to include the embedded URI in the query
string -

I did/do not care about conventions just because they exist. Unquestioned
conventions can lead to unfounded traditions, and unfounded traditions tend
to lead to a standstill in development. There is no need for a query part
here if the appended URI is properly escaped.
but nevertheless it does match the grammar expressed in RFC 3986.

To a point, yes.
Assuming your gripe is syntactic, rather than semantic (and I would
agree on the latter - no debate there), then I can only see two possible
causes: the colon and the empty segment.

Path segments may contain colons,

path-abempty = *( "/" segment )
segment = *pchar
pchar = unreserved / pct-encoded / sub-delims
/ ":" / "@"

These productions apply, but see below.
as long as it isn't within the first path segment in a relative-path
reference:

relative-part = "//" authority path-abempty
/ path-absolute
/ path-noscheme
/ path-empty
path-noscheme = segment-nz-nc *( "/" segment )
segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims
/ "@" )
; non-zero-length segment without any colon ":"

These productions do not apply here. The URI retrieved from the `href'
property will be an absolute URI, even if the value of the corresponding
`href' attribute is a URI reference that could produce this
(URI-reference :: relative-ref :: relative-part [ "?" query ] [ "#"
fragment ]).
Neither the prose nor the grammar prohibits empty path segments within a
path (and both easily could if the intention was there). In fact, the
prose alludes to the possibility of empty path segments as it states
that a path cannot begin "//" when there is no authority component, but
it doesn't say that such a sequence can't appear elsewhere.

You are misunderstanding the RFC, and your logic is flawed. For an
/(ht|f)tps?:/ URI/URL (see subsection 1.1.3) must contain an authority
component because of the need for a host (a general URI does not need to,
as it may be a URN).

Let's start with the initial production:

| URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
| [...]
| hier-part = "//" authority path-abempty
| / path-absolute
| / path-rootless
| / path-empty

Subsequent productions are:

| authority = [ userinfo "@" ] host [ ":" port ]
| [...]
| path-abempty = *( "/" segment )
| path-absolute = "/" [ segment-nz *( "/" segment ) ]
| path-rootless = segment-nz *( "/" segment )
| path-empty = 0<pchar>

Obviously (for the reasons given above), of those for /(ht|f)tps?:/
URIs/URLs only the production

| hier-part = "//" authority path-abempty

applies. And I concur that path-abempty can produce both `:' and `//'.

But: the character sequence `scheme:' (here: `http:') is clearly defined.
As per

| 2.2. Reserved Characters

`:' is such a character:

| reserved = gen-delims / sub-delims
|
| gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"

Furthermore (in the same subsection),

| URI producing applications should percent-encode data octets that
| correspond to characters in the reserved set unless these characters
| are specifically allowed by the URI scheme to represent data in that
| component.

The reason for this recommendation (SHOULD) is that another `scheme:'
character sequence within the URI can render the URI ambiguous.

So the example given by the OP may be a syntactically valid URI, but one
that is at least unwise to use. And considering that this was only an
example, and that the URI reference to be found in the `href' attribute
value (that is resolved to an absolute URI upon property access) can
contain _any_ character valid in a URI _iff it is not part of the path
component_, especially reserved ones, one can rightfully say that simple
concatenation of a base absolute (redirect) URI with the retrieved absolute
URI of the hyperlink (most certainly) will not result in a (valid) URI.
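
The effect can be illustrated with a minimal sketch, assuming a runtime that provides the WHATWG `URL` class (modern browsers, Node.js); that parser follows the URL Standard rather than RFC 3986 itself, so treat the output as indicative only. The link target with a query and a fragment is a made-up example, not one from the original question. With plain concatenation, the `?` and `#` of the appended URL end up delimiting the query and fragment of the redirect URL; with percent-encoding, the whole link stays inside the path:

```javascript
// Hypothetical link target with reserved characters outside its path.
var target = "http://subdomain.domain2.com/myfile.anyext?id=1#top";
var base = "http://redirectUrl.com/";

// Plain concatenation: "?" and "#" now delimit components of the outer URL.
var plain = new URL(base + target);

// Percent-encoded: the whole target remains part of the outer URL's path.
var escaped = new URL(base + encodeURIComponent(target));

console.log(plain.pathname);              // the path without "?id=1#top"
console.log(plain.search, plain.hash);    // query and fragment of the OUTER URL
console.log(escaped.pathname);            // the encoded target, in one piece
console.log(decodeURIComponent(escaped.pathname.slice(1)) === target);
```

Whether the redirect service expects the target in the path or in a query parameter is not stated in the thread; the sketch only shows what each form looks like to a parser.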


PointedEars
 
Michael Winter

Michael Winter wrote:
[snip]
[...] the path doesn't contain hierarchical information.

It does not have to:

Yes, in general, URIs do not need to have hierarchical paths, but HTTP
URIs are hierarchical. That said, the hierarchy can be arbitrary; it
certainly doesn't need to follow a directory structure, for example.
Yes, I know you know that, I'm just trying to eliminate an unnecessary
response. :)

[snip]
These productions do not apply here.

I never said they did. You seemed to miss the part where I wrote, "in a
relative-path reference" (a term defined in 4.2 Relative Reference). The
URI suggested by the OP isn't a relative-path reference, but an absolute
URI (4.3 - though a fragment might not be prohibited). The information
above was just a qualification to prevent misinterpretation of the
preceding statement. That is, a colon can appear in path segments, but
not /all/ path segments.

[snip]
You are misunderstanding the RFC, and your logic is flawed.

I disagree. I have simply stated facts. However, it's hard to refute
conclusively unless you identify what you think I have misunderstood, or
where exactly logic has apparently failed me.
For an /(ht|f)tps?:/ URI/URL (see subsection 1.1.3) must contain an
authority component because of the need for a host (a general URI
does not need to, as it may be a URN).

I already knew that. I think you misunderstood why I wrote what I did,
though that is perhaps my fault. I started by making comments specific
to HTTP URIs, but then shifted to a more generic treatment without
explicitly noting it.

Your problem with what I wrote would seem to revolve around my mention
of the authority component. It only occurred in relation to empty path
segments in general, and not specifically to the URI suggested by the OP
(so the specifics of HTTP URIs are irrelevant, in this instance).

To some, allowing empty path segments might seem to be an oversight, or
a simplification of the grammar. Given the number of revisions to the
URI syntax RFCs, the former is unlikely, but the latter isn't entirely
unreasonable. However, even if that were the case, the RFC needn't have
limited itself to stating:

If a URI does not contain an authority component, then the path
cannot begin with two slash characters ("//").
-- 3.3 Path

It could have just forbidden empty segments, instead.

As I didn't know what the grounds were for your objection to the
proposed URI, I hoped to cover the two obvious syntactic possibilities.
On reflection, I should have just asked. :)

[snipped well-intentioned quotation of the grammar]
As per

| 2.2. Reserved Characters

`:' is such a character:

| reserved = gen-delims / sub-delims
|
| gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"

Furthermore (in the same subsection),

| URI producing applications should percent-encode data octets that
| correspond to characters in the reserved set unless these characters
| are specifically allowed by the URI scheme to represent data in that
| component.

The reason for this recommendation (SHOULD) is that another `scheme:'
character sequence within the URI can render the URI ambiguous.

I don't see why. A scheme can only occur at the start of a URI, and a
URI may only start in five ways. Three of those require an unambiguous
delimiter: authority (//), query (?), and fragment (#). The remaining
two are the scheme itself and a path. If the path begins with a
slash, that too is unambiguous. If it doesn't, then a colon in the first
segment will be confusing, but this can be resolved by adding a leading
dot segment (foo:bar -> ./foo:bar).

So, if a colon occurs before any slash, question mark, or hash
characters, it delimits the scheme. Anywhere else and it is part of some
(sub-)component.
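
As a rough check of this rule, here is a minimal sketch that applies the component-splitting regular expression given in RFC 3986, Appendix B, to the URI from the original question and to the `foo:bar` example. The helper name `splitUri` is hypothetical and used only for illustration:

```javascript
// Regular expression from RFC 3986, Appendix B, for splitting a URI reference.
var rfc3986 = /^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/;

// Hypothetical helper: returns the five generic components (undefined if absent).
function splitUri(uri) {
  var m = rfc3986.exec(uri); // every group is optional, so this always matches
  return { scheme: m[2], authority: m[4], path: m[5], query: m[7], fragment: m[9] };
}

var op = splitUri("http://redirectUrl.com/http://subdomain.domain2.com/myfile.anyext");
console.log(op.scheme, op.authority, op.path);

// A colon before any "/", "?" or "#" is read as the scheme delimiter ...
console.log(splitUri("foo:bar"));
// ... while a leading dot segment keeps the colon inside the path.
console.log(splitUri("./foo:bar"));
```

Only the first colon is treated as the scheme delimiter; the second `http:` lands in the path component.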
So the example given by the OP may be a syntactically valid URI, but one
that is at least unwise to use.

I agree.

[snip]

Mike
 
Thomas 'PointedEars' Lahn

Michael said:
I never said they did. You seemed to miss the part where I wrote, "in a
relative-path reference" (a term defined in 4.2 Relative Reference). The
URI suggested by the OP isn't a relative-path reference, but an absolute
URI (4.3 - though a fragment might not be prohibited). The information
above was just a qualification to prevent misinterpretation of the
preceding statement. That is, a colon can appear in path segments, but
not /all/ path segments.

Referring to irrelevant parts of the grammar does not strike me as being
reasonable or helpful. The matter is complicated enough already, more
noise will rather hinder its clarification.
For an /(ht|f)tps?:/ URI/URL (see subsection 1.1.3) must contain an
authority component because of the need for a host (a general URI
does not need to, as it may be a URN).

I already knew that. I think you misunderstood why I wrote what I did,
though that is perhaps my fault. I started by making comments specific
to HTTP URIs, but then shifted to a more generic treatment without
explicitly noting it.
ACK
As per

| 2.2. Reserved Characters

`:' is such a character:

| reserved = gen-delims / sub-delims
|
| gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"

Furthermore (in the same subsection),

| URI producing applications should percent-encode data octets that
| correspond to characters in the reserved set unless these characters
| are specifically allowed by the URI scheme to represent data in that
| component.

The reason for this recommendation (SHOULD) is that another `scheme:'
character sequence within the URI can render the URI ambiguous.

I don't see why.

| The purpose of reserved characters is to provide a set of delimiting
| characters that are distinguishable from other data within a URI.
A scheme can only occur at the start of a URI, and a
URI may only start in five ways. Three of those require an unambiguous
delimiter: authority (//), query (?), and fragment (#). The remaining
two are the scheme itself and a path. If the path begins with a
slash, that too is unambiguous.

| A subset of the reserved characters (gen-delims) is used as
| delimiters of the generic URI components described in Section 3. A
| component's ABNF syntax rule will not use the reserved or gen-delims
| rule names directly; instead, each syntax rule lists the characters
| allowed within that component (i.e., not delimiting it), and any of
| those characters that are also in the reserved set are "reserved"
| for use as subcomponent delimiters within the component.
If it doesn't, then a colon in the first segment will be confusing, but
this can be resolved by adding a leading dot segment
(foo:bar -> ./foo:bar).

That would, however, require it to be a URI reference instead of a URI.
So, if a colon occurs before any slash, question mark, or hash
characters, it delimits the scheme. Anywhere else and it is part of some
(sub-)component.

It is not that simple.

I hope you also agree with what you have snipped below because *that*
contained the key point of this paragraph.


PointedEars
 
Adnan Siddiqi

VK you rock!

Thanks a lot


-adnan
 
VK

Thomas said:
No (his solution is error-prone at best).

This solution uses DOM 0 - thus it works in every browser ever produced
with JavaScript/JScript support, starting with Netscape 2.

If you have a yet more universal solution, I'm anxious to see it.
 
Thomas 'PointedEars' Lahn

VK said:
This solution uses DOM 0

It uses features that are also available in DOM Level 0.
- thus it works in every browser ever produced
with JavaScript/JScript support, starting with Netscape 2.

Wrong. If you knew what you are talking about, you would also know that
"DOM Level 0" refers to features common to Netscape 3.0 and IE 3.0. But
even if we ignore that, your statement is still wrong.
If you have a yet more universal solution, I'm anxious to see it.

I have posted it already.


PointedEars
 
VK

Thomas said:
I have posted it already.

When you need to replace a bulb, are you turning around the lamp
yourself or are you using your hand only? With this code I'm not sure
anymore... :)
 
Thomas 'PointedEars' Lahn

VK said:
When you need to replace a bulb, are you turning around the lamp
yourself or are you using your hand only? With this code I'm not sure
anymore... :)

Obviously you have not understood my code, which is hardly surprising.
FWIW, the main issues that my approach covers and yours does not, are:

1. Only the domain of the link's URL should matter.
2. The resulting URL must be properly escaped.

How this is achieved is a different matter.
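
For readers comparing the two approaches, the two points can be sketched as follows. This is not the code posted above; it assumes a runtime with the WHATWG `URL` class and `String.prototype.endsWith`, and the names `hostMatches` and `redirectUrl` as well as the second and third sample URLs are hypothetical:

```javascript
// Returns true if the host of `href` is `domain` or one of its subdomains.
// Note: `new URL()` throws a TypeError if `href` is not an absolute URL.
function hostMatches(href, domain) {
  var host = new URL(href).hostname.toLowerCase();
  domain = domain.toLowerCase();
  return host === domain || host.endsWith("." + domain);
}

// Builds the redirect URL with the original URL percent-encoded.
function redirectUrl(href, redirectBase) {
  return redirectBase + encodeURIComponent(href);
}

var samples = [
  "http://subdomain.domain2.com/myfile.anyext", // host is a subdomain of domain2.com
  "http://notdomain2.com/",                     // a different domain
  "http://mydomain1.com/?ref=domain2.com"       // domain appears only in the query
];

samples.forEach(function (href) {
  // Substring test versus host-only test for each sample.
  console.log(href, href.indexOf("domain2.com") != -1, hostMatches(href, "domain2.com"));
});

console.log(redirectUrl(samples[0], "http://redirectUrl.com/"));
```

The substring test accepts all three samples, while the host-only test accepts just the first.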


PointedEars
 
