Extract domain name

Charles Calvert · Aug 20, 2010

All,

I have the same basic issue as discussed in this thread last year:
<http://groups.google.com/group/comp.lang.ruby/browse_thread/thread/77cfd3250a633c7e/0430f50818a01ddd>.

Justin Collins points out the greatest difficulty with the situation,
i.e. that when dealing with a country code TLD, one may well have a
different number of parts (e.g. example.co.uk) than when dealing with
a gTLD (example.com).

The only solution that has occurred to me is to have a list of known
TLDs and second-level domains (e.g. co.uk) that are insufficiently
specific, requiring a subdomain for additional specificity. The
problem is that this requires maintenance as well as initial research.

Does anyone have any suggestions for an alternative method to solve
this problem? I'm currently using Addressable::URI
(http://addressable.rubyforge.org/api/classes/Addressable/URI.html) to
parse the URLs and extract the host names.
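
To make the list-based idea above concrete, here is a minimal sketch. It
uses the standard library's URI rather than Addressable::URI so that it
runs without any gems, and KNOWN_SUFFIXES and registrable_domain are
hypothetical names with a tiny sample list, not a complete or
authoritative one.

```ruby
require "uri"
require "set"

# Hypothetical, hand-maintained sample; a real list would be far longer.
KNOWN_SUFFIXES = Set.new(%w[com org net uk co.uk org.uk]).freeze

# Returns the matched suffix plus one more label (e.g. "example.co.uk"),
# or nil when the host is missing, is itself a suffix, or has no known suffix.
def registrable_domain(url)
  host = URI.parse(url).host
  return nil if host.nil?

  host = host.downcase
  return nil if KNOWN_SUFFIXES.include?(host)

  labels = host.split(".")
  # Walk from the longest candidate suffix to the shortest,
  # so "co.uk" is found before "uk".
  (1...labels.length).each do |i|
    suffix = labels[i..-1].join(".")
    return labels[(i - 1)..-1].join(".") if KNOWN_SUFFIXES.include?(suffix)
  end
  nil
end
```

With this sample list, registrable_domain("http://www.example.co.uk/path")
gives "example.co.uk" and registrable_domain("http://www.example.com")
gives "example.com". The maintenance burden mentioned above is exactly
the upkeep of KNOWN_SUFFIXES.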

Mr zengr · Aug 20, 2010

I think the best way is to actually match against a list of TLDs and
gTLDs.

Mozilla has a list of domains:
http://mxr.mozilla.org/mozilla/source/netwerk/dns/src/effective_tld_names.dat

A stackoverflow question on the same topic:
http://stackoverflow.com/questions/569137/how-to-get-domain-name-from-url

Their solution is regex.
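
As a rough illustration of matching against such a list without a giant
regex: the Mozilla file is plain text with one rule per line, "//"
comments, "*" wildcard labels and "!" exception rules. The sketch below
parses a small inline sample in that style. SAMPLE_RULES, parse_rules,
rule_matches?, suffix_length and split_host are all hypothetical names,
and the sample is illustrative rather than a copy of the real file.

```ruby
# A few lines in the style of effective_tld_names.dat (illustrative sample).
SAMPLE_RULES = <<~DAT
  // comment lines start with two slashes
  com
  uk
  co.uk
  *.ck
  !www.ck
DAT

# Split the list text into normal (plain or wildcard) rules and "!" exceptions.
def parse_rules(text)
  rules = { normal: [], exception: [] }
  text.each_line do |line|
    rule = line.strip.split(/\s+/).first
    next if rule.nil? || rule.start_with?("//")

    if rule.start_with?("!")
      rules[:exception] << rule[1..-1].split(".")
    else
      rules[:normal] << rule.split(".")
    end
  end
  rules
end

# True when the rule matches the rightmost labels of the host;
# "*" stands for any single label.
def rule_matches?(rule, labels)
  return false if rule.length > labels.length

  rule.reverse.zip(labels.reverse).all? { |r, l| r == "*" || r == l }
end

# Number of labels that make up the suffix of the host.
def suffix_length(rules, labels)
  exception = rules[:exception].find { |r| rule_matches?(r, labels) }
  return exception.length - 1 if exception # exception: drop its leftmost label

  longest = rules[:normal].select { |r| rule_matches?(r, labels) }.map(&:length).max
  longest || 1 # unlisted TLDs fall back to a one-label suffix
end

# Returns subdomain, domain and suffix, or nil if the host is itself a suffix.
def split_host(rules, host)
  labels = host.downcase.split(".")
  n = suffix_length(rules, labels)
  return nil if labels.length <= n

  { subdomain: labels[0...-(n + 1)].join("."),
    domain: labels[-(n + 1)],
    suffix: labels[-n..-1].join(".") }
end
```

With the sample rules, split_host(parse_rules(SAMPLE_RULES),
"www.example.co.uk") gives subdomain "www", domain "example" and suffix
"co.uk". A real program would read the downloaded file instead of the
inline sample.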

Brian Candler · Aug 20, 2010

And remember that some things which look like domains have set
themselves up as registries - e.g. uk.com

Charles Calvert · Aug 23, 2010

[snip my question about extracting the domain name (e.g. "example.com"
from "www.example.com")]

Mr zengr said:
I think the best way is to actually match against a list of TLDs and
gTLDs.

Mozilla has a list of domains:
http://mxr.mozilla.org/mozilla/source/netwerk/dns/src/effective_tld_names.dat

Wow. That's a big help. Thanks.

Mr zengr said:
A stackoverflow question on the same topic:
http://stackoverflow.com/questions/569137/how-to-get-domain-name-from-url

Interesting.

Mr zengr said:
Their solution is regex.

As the poster pointed out, matching everything leads to a huge regex,
which is likely to cause maintenance problems (though he indicated
that they started generating the regex from other data to address
that) and would make me concerned about resource allocation, though I
couldn't find anything in the core Ruby Doc about a max length for a
regex.

On the other hand it might be more performant than looping through a
bunch of substring matches or matching against database records. I
sense some testing in my future.
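
A minimal sketch of what that testing might look like, using Benchmark
from the standard library. SUFFIXES is a hypothetical seven-entry
sample, and the method names are made up for illustration; timings will
depend on the real list size and the Ruby version, so this settles
nothing by itself.

```ruby
require "benchmark"
require "set"

# Hypothetical sample; a real suffix list has thousands of entries.
SUFFIXES = %w[com org net uk co.uk org.uk uk.com].freeze
HOSTS = %w[www.example.com example.co.uk shop.example.uk.com example.invalid].freeze

# Approach 1: one regex generated from the list. Because the pattern is
# anchored at the end of the string, the leftmost match is the longest suffix.
SUFFIX_RE = /(?:\A|\.)(#{Regexp.union(SUFFIXES).source})\z/

def suffix_by_regex(host)
  match = SUFFIX_RE.match(host)
  match && match[1]
end

# Approach 2: test each trailing run of labels against a Set, longest first.
SUFFIX_SET = Set.new(SUFFIXES).freeze

def suffix_by_lookup(host)
  labels = host.split(".")
  (0...labels.length).each do |i|
    candidate = labels[i..-1].join(".")
    return candidate if SUFFIX_SET.include?(candidate)
  end
  nil
end

# Time both approaches over the same hosts and print the results.
def compare_speed(iterations = 10_000)
  Benchmark.bm(8) do |bm|
    bm.report("regex:")  { iterations.times { HOSTS.each { |h| suffix_by_regex(h) } } }
    bm.report("lookup:") { iterations.times { HOSTS.each { |h| suffix_by_lookup(h) } } }
  end
end
```

Calling compare_speed prints one timing row per approach. Both methods
return the same suffix for every host in HOSTS, so the comparison is
like for like.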

Thanks,

Michael Fellinger · Aug 23, 2010

Charles said:
On the other hand it might be more performant than looping through a
bunch of substring matches or matching against database records. I
sense some testing in my future.

http://github.com/pauldix/domainatrix

--
Michael Fellinger
CTO, The Rubyists, LLC
I check email a couple times daily; to reach me sooner, use:
http://awayfind.com/manveru

Charles Calvert · Aug 23, 2010

[snip]

http://github.com/pauldix/domainatrix

For a minute, I thought your reply was generated by a porn spam bot
until I saw github in the URL.

For those reading the thread, this is a gem that uses
http://publicsuffix.org/ to parse domain names and identify the suffix
(e.g. "com" or "co.uk"), domain, subdomains, etc. It extends
Addressable::URI.

That was very helpful. Thanks.
