One-liner to remove duplicate records


Ninja Li

Hi,

I have a file with the following sample data delimited by "|" with
duplicate records:

20100430|20100429|John Smith|-0.07|-0.08|
20100430|20100429|John Smith|-0.07|-0.08|
20100430|20100429|Ashley Cole|1.09|1.08|
20100430|20100429|Bill Thompson|0.76|0.78|
20100429|20100428|Time Apache|2.10|2.24|

The first three fields "date_1", "date_2" and "name" are unique
identifiers of a record.

Is there a simple way, like a one-liner, to remove the duplicates, such
as the repeated "John Smith" record?

Thanks in advance.

Nick Li
 

sln

Ninja Li said:
Hi,

I have a file with the following sample data delimited by "|" with
duplicate records:

20100430|20100429|John Smith|-0.07|-0.08|
20100430|20100429|John Smith|-0.07|-0.08|
20100430|20100429|Ashley Cole|1.09|1.08|
20100430|20100429|Bill Thompson|0.76|0.78|
20100429|20100428|Time Apache|2.10|2.24|

The first three fields "date_1", "date_2" and "name" are unique
identifiers of a record.

Is there a simple way, like a one-liner, to remove the duplicates, such
as the repeated "John Smith" record?

Thanks in advance.

Nick Li

I could think of a way, but it takes 2 lines, sorry.
-sln
 

John Bokma

Ninja Li said:
Hi,

I have a file with the following sample data delimited by "|" with
duplicate records:

20100430|20100429|John Smith|-0.07|-0.08|
20100430|20100429|John Smith|-0.07|-0.08|
20100430|20100429|Ashley Cole|1.09|1.08|
20100430|20100429|Bill Thompson|0.76|0.78|
20100429|20100428|Time Apache|2.10|2.24|

The first three fields "date_1", "date_2" and "name" are unique
identifiers of a record.

Is there a simple way, like a one-liner, to remove the duplicates, such
as the repeated "John Smith" record?

Yes.

But have you tried to write a multi-line Perl program first? Moving from
a working Perl program to a one-liner might be easier than starting
straight with the one-liner.

Also read up on what perl's various command-line options (documented in
perlrun) do.
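
For what it's worth, a multi-line starting point could look roughly like
this (a sketch only, assuming the field layout from the sample data and
that the script is saved as dedup.pl and run as "perl dedup.pl file.txt"):

#!/usr/bin/perl
use strict;
use warnings;

my %seen;
while ( my $line = <> ) {
    my @fields = split /\|/, $line;
    next if @fields < 3;                    # skip blank or malformed lines
    my $key = join '|', @fields[ 0 .. 2 ];  # date_1, date_2, name
    print $line unless $seen{$key}++;       # keep only the first record per key
}

Once that behaves, the -n, -a and -F switches let you fold the loop and
the split into a one-liner.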
 

sln

I could think of a way, but it takes 2 lines, sorry.

Wait, this might work.

c:\temp>perl -a -F"\|" -n -e "/^$/ and next or !exists $hash{$key = join '',@F[0..2]} and ++$hash{$key} and print" file.txt
20100430|20100429|John Smith|-0.07|-0.08|
20100430|20100429|Ashley Cole|1.09|1.08|
20100430|20100429|Bill Thompson|0.76|0.78|
20100429|20100428|Time Apache|2.10|2.24|

c:\temp>
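
Unrolled, each pass of that implicit -n loop amounts to something like
this (a sketch of the same logic; the key is simply the first three
autosplit fields glued together):

next if /^$/;                       # skip empty lines
my $key = join '', @F[0..2];        # date_1 . date_2 . name
unless ( exists $hash{$key} ) {     # first time this key shows up
    ++$hash{$key};
    print;                          # keep the record
}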

-sln
 

Dr.Ruud

Ninja said:
I have a file with the following sample data delimited by "|" with
duplicate records:

20100430|20100429|John Smith|-0.07|-0.08|
20100430|20100429|John Smith|-0.07|-0.08|
20100430|20100429|Ashley Cole|1.09|1.08|
20100430|20100429|Bill Thompson|0.76|0.78|
20100429|20100428|Time Apache|2.10|2.24|

The first three fields "date_1", "date_2" and "name" are unique
identifiers of a record.

Is there a simple way, like a one-liner, to remove the duplicates, such
as the repeated "John Smith" record?

If the data is as strict as presented, you can use

sort -u <input

sort <input |uniq

or simply use the whole line as a hash key:

perl -wne'$_{$_}++ or print' <input

(the first underscore is not really necessary)
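
If the trailing numeric fields could ever differ between otherwise
duplicate records, the same hash trick can key on just the first three
fields instead of the whole line (a sketch, untested beyond the sample
above):

perl -F'\|' -ane '$s{join "|", @F[0..2]}++ or print' <input

Here -F'\|' autosplits each line on the literal "|" into @F, and the
hash remembers every date_1/date_2/name combination the first time it
gets printed.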
 

Jürgen Exner

Ninja Li said:
I have a file with the following sample data delimited by "|" with
duplicate records:

20100430|20100429|John Smith|-0.07|-0.08|
20100430|20100429|John Smith|-0.07|-0.08|
20100430|20100429|Ashley Cole|1.09|1.08|
20100430|20100429|Bill Thompson|0.76|0.78|
20100429|20100428|Time Apache|2.10|2.24|

The first three fields "date_1", "date_2" and "name" are unique
identifiers of a record.

Is there a simple way, like a one-liner, to remove the duplicates, such
as the repeated "John Smith" record?

Your data is sorted already, so a simple call to 'uniq' will do the job:
http://en.wikipedia.org/wiki/Uniq
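
For example (assuming the data lives in file.txt; uniq only collapses
adjacent duplicate lines, which is fine here because the repeated
records sit next to each other):

uniq file.txt > deduped.txt

If the duplicates were ever not adjacent, piping through sort first, as
in Dr.Ruud's "sort <input | uniq", would still work.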

jue
 

sln

Ninja Li said:
Hi,

I have a file with the following sample data delimited by "|" with
duplicate records:

20100430|20100429|John Smith|-0.07|-0.08|
20100430|20100429|John Smith|-0.07|-0.08|
20100430|20100429|Ashley Cole|1.09|1.08|
20100430|20100429|Bill Thompson|0.76|0.78|
20100429|20100428|Time Apache|2.10|2.24|

The first three fields "date_1", "date_2" and "name" are unique
identifiers of a record.

Is there a simple way, like a one-liner, to remove the duplicates, such
as the repeated "John Smith" record?

Thanks in advance.

Nick Li

Another way:

perl -anF"\|" -e "tr/|// > 1 and ++$seen{qq<@F[0..2]>} > 1 and next or print" file.txt
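
Spelled out, that does roughly the following for each line (a sketch of
the same logic; tr/|// just counts the "|" characters, so the first test
skips anything that doesn't look like a record):

if ( tr/|// > 1 ) {                     # at least two "|": treat the line as a record
    next if ++$seen{"@F[0..2]"} > 1;    # date_1/date_2/name already seen: drop it
}
print;                                  # first occurrences (and non-record lines) are kept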

-sln
 
