Parsing for email addresses

galileo228 · Feb 15, 2010

Hey all,

I'm trying to write python code that will open a textfile and find the
email addresses inside it. I then want the code to take just the
characters to the left of the "@" symbol, and place them in a list.
(So if (e-mail address removed) was in the file, 'galileo228' would be
added to the list.)

Any suggestions would be much appreciated!

Matt

Jonathan Gardner · Feb 15, 2010

galileo228 said:
I'm trying to write python code that will open a textfile and find the
email addresses inside it. I then want the code to take just the
characters to the left of the "@" symbol, and place them in a list.
(So if (e-mail address removed) was in the file, 'galileo228' would be
added to the list.)

Any suggestions would be much appreciated!

You may want to use regexes for this. For every match, split on '@'
and take the first bit.

Note that the actual specification for email addresses is far more
than a single regex can handle. However, for almost every single case
out there nowadays, a regex will get what you need.
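
As a rough sketch of that suggestion in Python 3: the pattern below is a deliberately loose approximation, not an RFC-compliant matcher, and the names `EMAIL_RE` and `local_parts` and the sample text are made up for illustration.

```python
import re

# Loose, illustrative pattern: something@label.label (NOT RFC-compliant).
EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+')


def local_parts(text):
    """Return the part left of '@' for every address-like match in text."""
    # findall() returns whole matches here because the only group is non-capturing.
    return [match.split('@')[0] for match in EMAIL_RE.findall(text)]


sample = "Contact alice@example.com or bob.smith@mail.example.org today."
print(local_parts(sample))
```

Splitting each match on '@' keeps the regex simple; a capturing group around the local part would work just as well.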

Tim Chase · Feb 16, 2010

Jonathan said:
You may want to use regexes for this. For every match, split on '@'
and take the first bit.

Note that the actual specification for email addresses is far more
than a single regex can handle. However, for almost every single case
out there nowadays, a regex will get what you need.

You can even capture the local part as the regexp finds each
address. As Jonathan mentions, finding RFC-compliant email addresses
can be a hairy/intractable problem. But you can get a pretty close
approximation:

import re

r = re.compile(r'([-\w._+]+)@(?:[-\w]+\.)+(?:\w{2,5})', re.I)
# If you want to allow local domains like
#   user@localhost
# then change the "+" that follows the "(?:[-\w]+\.)" group
# to a "*" and the "{2,5}" to "+" to unlimit
# the TLD. This will change the outcome
# of the last test "jim@com" to True.

for test, expected in (
        ('(e-mail address removed)', True),
        ('(e-mail address removed)', True),
        ('@example.com', False),
        ('@sub.example.com', False),
        ('@com', False),
        ('jim@com', False),
        ):
    m = r.match(test)
    if bool(m) ^ expected:
        print("Failed: %r should be %s" % (test, expected))

emails = set()
with open('test.txt') as f:
    for line in f:
        for match in r.finditer(line):
            emails.add(match.group(1))
print("All the emails:", ', '.join(emails))
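
To make the comment about local domains concrete, here is a small side-by-side sketch in Python 3. The names `strict` and `relaxed` and the sample addresses are illustrative only, since the original test addresses were redacted by the forum.

```python
import re

# Pattern from the post: at least one "label." group, then a 2-5 character TLD.
strict = re.compile(r'([-\w._+]+)@(?:[-\w]+\.)+(?:\w{2,5})', re.I)
# Variant described in the comment: "+" -> "*" after the group, "{2,5}" -> "+".
relaxed = re.compile(r'([-\w._+]+)@(?:[-\w]+\.)*(?:\w+)', re.I)

# Print whether each sample address matches the strict and relaxed patterns.
for address in ('jim@example.com', 'jim@com', 'user@localhost', '@com'):
    print(address, bool(strict.match(address)), bool(relaxed.match(address)))
```

Both patterns still reject an address with nothing to the left of the "@", because the captured local part needs at least one character.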

-tkc

galileo228 · Feb 16, 2010

Hey all, thanks as always for the quick responses.

I actually found a very simple way to do what I needed to do. In
short, I needed to take an email which had a large number of addresses
in the 'to' field, and place just the identifiers (everything to the
left of @domain.com), in a python list.

I simply highlighted all the addresses and placed them in a text file
called emails.txt. Then I had the following code which placed each
line in the file into the list 'names':

Code:

fileHandle = open('/Users/Matt/Documents/python/results.txt','r')
names = fileHandle.readlines()

Now, the 'names' list has values looking like this:
['(e-mail address removed)\n', '(e-mail address removed)\n', etc].
So I ran the following code:

Code:

st_list = []
for x in names:
    st_list.append(x.replace('@domain.com\n',''))

And that did the trick! 'st_list' now has ['aaa12', 'bbb34', etc].

Obviously this only worked because all of the domain names were the
same. If they were not then based on your comments and my own
research, I would've had to use regex and the split(), which looked
massively complicated to learn.

Thanks all.

Matt

Tim Chase · Feb 16, 2010

galileo228 said:
Code:

fileHandle = open('/Users/Matt/Documents/python/results.txt','r')
names = fileHandle.readlines()

Now, the 'names' list has values looking like this:
['(e-mail address removed)\n', '(e-mail address removed)\n', etc].
So I ran the following code:

Code:

for x in names:
    st_list.append(x.replace('@domain.com\n',''))

And that did the trick! 'Names' now has ['aaa12', 'bbb34', etc].

Obviously this only worked because all of the domain names were the
same. If they were not then based on your comments and my own
research, I would've had to use regex and the split(), which looked
massively complicated to learn.

The complexities stemmed from several factors that, with more
details, could have made the solutions less daunting:

(a) you mentioned "finding" the email addresses -- this makes
it sound like there's other junk in the file that has to be
sifted through to find "things that look like an email address".
If the sole content of the file is lines containing only email
addresses, then "find the email address" is a bit like [1]

(b) you omitted the detail that the domains are all the same.
Even if they're not the same, (a) reduces the problem to a much
easier task:

s = set()
with open('results.txt') as f:
    for line in f:
        s.add(line.rsplit('@', 1)[0].lower())
print(s)
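
One caveat with the loop above: a blank line, or a line with no '@' in it, would be added to the set as-is. A slightly more defensive sketch in Python 3 strips whitespace and skips such lines. The helper name `local_parts_from_lines` and the sample lines are hypothetical.

```python
def local_parts_from_lines(lines):
    """Collect the lowercased text left of the last '@' on each line."""
    result = set()
    for line in lines:
        line = line.strip()
        if '@' not in line:
            continue  # blank line or something that is not an address
        result.add(line.rsplit('@', 1)[0].lower())
    return result


# Hypothetical input; an open file object can be passed the same way.
sample_lines = ['AAA12@domain.com\n', 'bbb34@other.org\n', '\n', 'no address\n']
print(sorted(local_parts_from_lines(sample_lines)))
```

Because the function only iterates over its argument, it accepts a list of strings or a file opened for reading.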

If it was previously a CSV or tab-delimited file, Python offers
batteries-included processing to make it easy:

import csv
f = open('results.txt', newline='')
r = csv.DictReader(f) # CSV
# r = csv.DictReader(f, delimiter='\t') # tab delim
s = set()
for row in r:
    s.add(row['Email'].lower())
f.close()

or even

f = open(...)
r = csv.DictReader(...)
s = set(row['Email'].lower() for row in r)
f.close()
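
For a self-contained illustration of the DictReader approach, here is a Python 3 sketch that reads CSV text from an in-memory string instead of a file. The rows and the "Name" column are invented for the example; only the 'Email' header matches the snippets above. It also adds the rsplit step from earlier to keep just the part left of the '@', which the snippets above do not do.

```python
import csv
import io

# Made-up CSV content standing in for a real export with an "Email" column.
csv_text = "Name,Email\nAlice,Alice@Example.com\nBob,bob@example.org\n"

reader = csv.DictReader(io.StringIO(csv_text))
# Lowercase each address, then keep only the text left of the last '@'.
local_parts = {row['Email'].lower().rsplit('@', 1)[0] for row in reader}
print(sorted(local_parts))
```

For a tab-delimited export, passing delimiter='\t' to csv.DictReader should be the only change needed.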

Hope this gives you more ideas to work with.

-tkc

[1] http://jacksmix.files.wordpress.com/2007/05/findx.jpg

galileo228 · Feb 17, 2010

Tim -

Thanks for this. I actually did intend to have to sift through other
junk in the file, but then figured I could just cut and paste emails
directly from the 'to' field, thus making life easier.

Also, in this particular instance, the domain names were the same, and
thus I was able to figure out my solution, but I do need to know how
to handle the same situation when the domain names are different, so
your response was most helpful.

Apologies for leaving out some details.

Matt
