Removing lines containing same first string boundaries?

Tuxedo · Mar 17, 2014

I have a plain text file with each line in the format:

Start of line followed immediately by a string of character(s), a
whitespace, another string, a newline.

-------- file.txt -------

SOMESTRING XXX
SOMESTRING ZZZ
SOMEOTHERSTRING YYYZZ23
DIFFERENTSTRING HELLO

-----------

I would like to output each line that contains a string of a first
character sequence but not repeat any line(s) with the same string as a
first character sequence that appear further down the file. The output of
running a perl procedure against the above file would then be:

SOMESTRING XXX
SOMEOTHERSTRING YYYZZ23
DIFFERENTSTRING HELLO

In other words, no repetition should occur of any first word boundary on
each line in case the sequence happens to reappear on other line(s) as a
first character boundary before each line's first whitespace.

Alternatively, if given a parameter such as '^SOMESTRING' the output
against the file would be narrowed down to:

SOMESTRING XXX

The second character string boundary (XXX) after the whitespace is
arbitrary and should not affect the result but can be included in the
output even if it happens to match SOMESTRING. So the output becomes the
first occurence of ^SOMESTRING plus the remaining characters on the same
line up until newline.

Or if for example '^SOME' is passed as a parameter, the result would be:

SOMESTRING XXX
SOMEOTHERSTRING YYYZZ23

In which ways can this be done efficiently in Perl?

Many thanks for any ideas.

Tuxedo

Tuxedo · Mar 17, 2014

Jürgen Exner wrote:

[...]

What have you tried so far? Where are you stuck? What doesn't work as
expected?

Nothing so far, as I didn't try anything yet.

I would simply use split() to isolate the leading word and then use a
hash to track which words have already been printed.
[...]

perldoc -f grep

Thanks for the above pointers. I will look into these.

Tuxedo

John Black · Mar 17, 2014

I have a plain text file with each line in the format:

Start of line followed immediately by a string of character(s), a
whitespace, another string, a newline.

-------- file.txt -------

SOMESTRING XXX
SOMESTRING ZZZ
SOMEOTHERSTRING YYYZZ23
DIFFERENTSTRING HELLO

-----------

I would like to output each line that contains a string of a first
character sequence but not repeat any line(s) with the same string as a
first character sequence that appear further down the file. The output of
running a perl procedure against the above file would then be:

SOMESTRING XXX
SOMEOTHERSTRING YYYZZ23
DIFFERENTSTRING HELLO

In other words, no repetition should occur of any first word boundary on
each line in case the sequence happens to reappear on other line(s) as a
first character boundary before each line's first whitespace.

Alternatively, if given a parameter such as '^SOMESTRING' the output
against the file would be narrowed down to:

SOMESTRING XXX

The second character string boundary (XXX) after the whitespace is
arbitrary and should not affect the result but can be included in the
output even if it happens to match SOMESTRING. So the output becomes the
first occurence of ^SOMESTRING plus the remaining characters on the same
line up until newline.

Or if for example '^SOME' is passed as a parameter, the result would be:

SOMESTRING XXX
SOMEOTHERSTRING YYYZZ23

In which ways can this be done efficiently in Perl?

Many thanks for any ideas.

Tuxedo

Just use a regex to match each line to your input parameter.

if ($line =~ /($param\w*)\s+.*/) { # If the first string matches input parameter
$string1 = $1; # Extract string1 from the current line

if (grep {$string1 ne $_} @matched_array { # If string1 has not been seen before
print $line; # Print the line
push (@matched_array, $string1); # Put the string in the matched array
}
}

I have not tested this for errors but that is what I would start with. Not shown is the loop
to grab each line of the file into $line.

John Black

Rainer Weikusat · Mar 17, 2014

Tuxedo said:
I have a plain text file with each line in the format:

Start of line followed immediately by a string of character(s), a
whitespace, another string, a newline.

-------- file.txt -------

SOMESTRING XXX
SOMESTRING ZZZ
SOMEOTHERSTRING YYYZZ23
DIFFERENTSTRING HELLO

-----------

I would like to output each line that contains a string of a first
character sequence but not repeat any line(s) with the same string as a
first character sequence that appear further down the file.
[...]

Alternatively, if given a parameter such as '^SOMESTRING' the output
against the file would be narrowed down to:

SOMESTRING XXX

The second character string boundary (XXX) after the whitespace is
arbitrary and should not affect the result

[...]

-----------
my (%seen, $filter, $tag);

$filter = $ARGV[0] // '.';

while (<STDIN>) {
($tag) = /^(\S+)/ or next;

next if $seen{$tag} || !/$filter/o;

$seen{$tag} = 1;
print;
}

Rainer Weikusat · Mar 17, 2014

John Black said:
I have a plain text file with each line in the format:

Start of line followed immediately by a string of character(s), a
whitespace, another string, a newline.

-------- file.txt -------

SOMESTRING XXX
SOMESTRING ZZZ
SOMEOTHERSTRING YYYZZ23
DIFFERENTSTRING HELLO

-----------

I would like to output each line that contains a string of a first
character sequence but not repeat any line(s) with the same string as a
first character sequence

Click to expand...

[...]

Alternatively, if given a parameter such as '^SOMESTRING' the output
against the file would be narrowed down to:

SOMESTRING XXX

Click to expand...

[...]

Just use a regex to match each line to your input parameter.

if ($line =~ /($param\w*)\s+.*/) { # If the first string matches input parameter
$string1 = $1; # Extract string1 from the current line

if (grep {$string1 ne $_} @matched_array { # If string1 has not been seen before
print $line; # Print the line
push (@matched_array, $string1); # Put the string in the matched array
}
}

This is atrociously inefficient/ unscalable as the running time of the
algorithm is proportional to the square of the number of lines which are
printed.

Charlton Wilbur · Mar 17, 2014

HL> Strong smell of homework assignment here ...

Indeed. Highly detailed and precise specification combined with "I
haven't tried anything yet...."

Tuxedo, go and try. Come back when you have a specific, concrete
question that isn't "Solve my problem for me."

Charlton

Rainer Weikusat · Mar 17, 2014

Tuxedo said:
Rainer Weikusat wrote:

[...]

-----------
my (%seen, $filter, $tag);

$filter = $ARGV[0] // '.';

while (<STDIN>) {
($tag) = /^(\S+)/ or next;

next if $seen{$tag} || !/$filter/o;

$seen{$tag} = 1;
print;
}

Click to expand...

Many thanks for the above procedure.

How exactly should it be run against a file and keyword parameter?

It expects data to process on stdin and a keyword, if any, as first
argument, eg, assuming that the script text is in a file named a.pl and
the test text you posted in a file named text,

[rw@sable]/tmp#perl a.pl SOME <text
SOMESTRING XXX
SOMEOTHERSTRING YYYZZ23

NB: while (<>) doesn't localize $_ implicitly. If this is supposed to be
part of a larger program, a

local $_;

might need to be added in a suitable scope to avoid overwriting someone
else's $_.

John Black · Mar 17, 2014

Why are you using an O(n^2) algorithm when the same can be achieved with
an O(n) algorithm using a hash, which on top of everything else is even
simpler to write?

jue

Are you saying that a lookup for a key in a hash array does not have to search the hash for a
key match in a similar way to how grep is searching an array?

John Black

Randy Westlund · Mar 17, 2014

Are you saying that a lookup for a key in a hash array does not have to search the hash for a
key match in a similar way to how grep is searching an array?

John Black

Yes. The idea of a hash is that it can transform (aka hash) the key
(an O(1) operation) into something like an array index, which allows
an index-based lookup (another O(1) operation).

The downside of using a hash is memory overhead in setting up the
hash table.

Rainer Weikusat · Mar 17, 2014

John Black said:
Are you saying that a lookup for a key in a hash array does not have
to search the hash for a key match in a similar way to how grep is
searching an array?

Of course not. A 'hash' is not an array. It's an associative array
implemented as hash table with separate chaining. This means it is an
array of pointers to linked lists containing the actual entries. When
searching for a key, the key is transformed to a number by using it as
input for the so-called 'hash function' and 'truncating' the resulting
number down to the current size of the list pointer array via
mod. Assuming that H(key) is the hash function, an array index is
calculated as

ndx = H(key) % @hash_array

Nowadays, using hash arrays whose sizes are powers of two is common
because the modulo-operation can then be performed with a binary
and. The list $hash_array[ndx] points to is then searched for the actual
key. In ideal conditions (when the table is large enough), all list
sizes will be 1 which means the lookup requires only a single string
comparison. Even if the list has a number of entries (because of
so-called 'hash collisions'), it will still be much shorter than a list
containing all key/value-pairs in the hash if a 'reasonable' hash
function is used (it is supposed to produce a 'uniform' distribution of
values).

Tuxedo · Mar 17, 2014

Rainer Weikusat wrote:

[...]

It expects data to process on stdin and a keyword, if any, as first
argument, eg, assuming that the script text is in a file named a.pl and
the test text you posted in a file named text,

[rw@sable]/tmp#perl a.pl SOME <text
SOMESTRING XXX
SOMEOTHERSTRING YYYZZ23

NB: while (<>) doesn't localize $_ implicitly. If this is supposed to be
part of a larger program, a

local $_;

might need to be added in a suitable scope to avoid overwriting someone
else's $_.

Many thanks for the detailed how-to! I will test.

Tuxedo

Rainer Weikusat · Mar 17, 2014

[...]

-----------
my (%seen, $filter, $tag);

$filter = $ARGV[0] // '.';

while (<STDIN>) {
($tag) = /^(\S+)/ or next;

next if $seen{$tag} || !/$filter/o;

$seen{$tag} = 1;
print;
}

There are a couple of more possible gotchas with this, in particular,

- the /o flag means the regex will be compiled exactly once per program
run. This might become an issue in conditions where the value of
$filter can change, ie, if thas was put into a subroutine.

$filter = qr/$filter/

could be used to pre-compile the current $filter in such cases.

- the filter match is performed on the whole input line, not only on the
tag, meaning, it will match in the 2nd and subsequent fields if not
anchored to the beginning of the line.

$tag !~ /$filter/

could be used instead.

Rainer Weikusat · Mar 18, 2014

Jürgen Exner said:
Yes. That's why it's called a hash in the first place. Accessing a hash
element is typically O(1).

This is a little too simplistic. The the amount of work necessarily to
locate a random key which is stored in the hash is proportional to the
average length of each used hash chain divided by 2 and the time to
determine that some key is not in the hash is proportional to the
average length of a hash chain. Assuming a 'good' hash function is being
used (which also covers the case of 'degenerated' key sets better known
as 'algorithmic complexity attack), the average length of a hash chain
should be the number of keys in the hash divided by the number of entries
in the array holding the chain pointers. That still proportional to the
number of entries in the hash, aka O(n), except that it is possible to
make the table sufficiently large to ensure that the average length of a
hash chain will be small.

As an example:

[rw@sable]~#perl -MDevel:

eek -e '%a = 'aaaa' .. 'zzzz'; Dump(\%a)'
SV = RV(0x817b824) at 0x817b818
REFCNT = 1
FLAGS = (TEMP,ROK)
RV = 0x8195d28
SV = PVHV(0x8180984) at 0x8195d28
REFCNT = 2
FLAGS = (SHAREKEYS)
ARRAY = 0xb6807008 (0:110767, 1:94524, 2:40904, 3:12427, 4:2864, 5:545, 6:99, 7:12, 8:2)
hash quality = 98.6%
KEYS = 228488
FILL = 151377
MAX = 262143
RITER = -1
EITER = 0x0
Elt "utei" HASH = 0x8ccc0001
SV = PV(0xaffa700) at 0xb70b0bf8
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0xb000720 "utej"\0
CUR = 4
LEN = 8
Elt "vqmw" HASH = 0xe2100002
SV = PV(0xb07d430) at 0xb70cf8c8
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0xb0843d0 "vqmx"\0
CUR = 4
LEN = 8
Elt "yehi" HASH = 0x8d680003
SV = PV(0xb22c828) at 0xb230360
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0xb233fc0 "yehj"\0
CUR = 4
LEN = 8

There are 228,488 keys in the hash and the chain pointer table has a
size of 262,143 which means the average length of a chain should be
about 0.87. However, only 151,377 table entries are in use so
'average chain length' for the 'locate key which exists' case is about
1.5 and about 42% of the available table slots are unused (the 'ARRAY ='
line shows the actual distribution of chain lengths).

Assuming the key set is small, there's also the problem that calculating
the hash value is usually more expensive than comparing two keys and
locating an item on a linked-list is more expensive than locating one in
an array.

The general idea behind 'using a hash' is that determining
exactly when this become better than doing linear searches in an array
isn't worth the effort for a small number of entries and that hash
lookups will very likely perform better than most linear searches as the
key set gets larger (at the expense of wasting a possibly 'large' amount
of memory).

Rainer Weikusat · Mar 18, 2014

Martijn Lievaart said:
And Perl has a very "reasonable" implementation, much effort has been put
into that (if only to defeat DOS-attacks).

This is similar to the rotating hash, but it actually mixes the
internal state. It takes 9n+9 instructions and produces a full
4-byte result. Preliminary analysis suggests there are no
funnels.

This hash was not in the original Dr. Dobb's article. I
implemented it to fill a set of requirements posed by Colin
Plumb. Colin ended up using an even simpler (and weaker) hash
that was sufficient for his purpose.

http://burtleburtle.net/bob/hash/doobs.html

[SCNR]

Kaz Kylheku · Mar 18, 2014

I have a plain text file with each line in the format:

Start of line followed immediately by a string of character(s), a
whitespace, another string, a newline.

-------- file.txt -------

SOMESTRING XXX
SOMESTRING ZZZ
SOMEOTHERSTRING YYYZZ23
DIFFERENTSTRING HELLO

This can be implemented using a very simple, clear on-liner in awk, right from
your shell prompt.

The lines marked <- are my tty input; the others are awk output:

$ awk '{ if (! ($1 in seen)) { print $0 ; seen[$1] } }'
A B <-
A B
A C <-
A D <-
D 1 <-
D 1
D 2 <-
D 3 <-
A Z <-

Rainer Weikusat · Mar 18, 2014

Kaz Kylheku said:
I have a plain text file with each line in the format:

Start of line followed immediately by a string of character(s), a
whitespace, another string, a newline.

-------- file.txt -------

SOMESTRING XXX
SOMESTRING ZZZ
SOMEOTHERSTRING YYYZZ23
DIFFERENTSTRING HELLO

Click to expand...

This can be implemented using a very simple, clear on-liner in awk, right from
your shell prompt.

The lines marked <- are my tty input; the others are awk output:

$ awk '{ if (! ($1 in seen)) { print $0 ; seen[$1] } }'

perl -ane 'print, $seen{$F[0]} = 1 unless $seen{$F[0]}'

Tim McDaniel · Mar 18, 2014

Kaz Kylheku said:
Kaz Kylheku said:

I have a plain text file with each line in the format:

Start of line followed immediately by a string of character(s), a
whitespace, another string, a newline.

-------- file.txt -------

SOMESTRING XXX
SOMESTRING ZZZ
SOMEOTHERSTRING YYYZZ23
DIFFERENTSTRING HELLO

Click to expand...

This can be implemented using a very simple, clear on-liner in awk, right from
your shell prompt.

The lines marked <- are my tty input; the others are awk output:

$ awk '{ if (! ($1 in seen)) { print $0 ; seen[$1] } }'

Click to expand...

perl -ane 'print, $seen{$F[0]} = 1 unless $seen{$F[0]}'

perl -ane 'print if !$seen{$F[0]}++'

perl -ane '$seen{$F[0]}++ || print'

Rainer Weikusat · Mar 18, 2014

Kaz Kylheku said:
Kaz Kylheku said:

I have a plain text file with each line in the format:

Start of line followed immediately by a string of character(s), a
whitespace, another string, a newline.

-------- file.txt -------

SOMESTRING XXX
SOMESTRING ZZZ
SOMEOTHERSTRING YYYZZ23
DIFFERENTSTRING HELLO

This can be implemented using a very simple, clear on-liner in awk, right from
your shell prompt.

The lines marked <- are my tty input; the others are awk output:

$ awk '{ if (! ($1 in seen)) { print $0 ; seen[$1] } }'

Click to expand...

perl -ane 'print, $seen{$F[0]} = 1 unless $seen{$F[0]}'

Click to expand...

perl -ane 'print if !$seen{$F[0]}++'

perl -ane '$seen{$F[0]}++ || print'

Well, there are quite a few more more-or-less bizarre variants, eg,

perl -ape '$_=$seen{$F[0]}++?"":$_'

or

perl -ape '$seen{$F[0]}++&&undef$_'

Kaz Kylheku · Mar 18, 2014

Kaz Kylheku said:
Kaz Kylheku said:

I have a plain text file with each line in the format:

Start of line followed immediately by a string of character(s), a
whitespace, another string, a newline.

-------- file.txt -------

SOMESTRING XXX
SOMESTRING ZZZ
SOMEOTHERSTRING YYYZZ23
DIFFERENTSTRING HELLO

This can be implemented using a very simple, clear on-liner in awk, right from
your shell prompt.

The lines marked <- are my tty input; the others are awk output:

$ awk '{ if (! ($1 in seen)) { print $0 ; seen[$1] } }'

Click to expand...

perl -ane 'print, $seen{$F[0]} = 1 unless $seen{$F[0]}'

Click to expand...

perl -ane 'print if !$seen{$F[0]}++'

perl -ane '$seen{$F[0]}++ || print'

gawk '{if(!seen[$1]++)print}'

(GNU Awk has bignums, which takes care of the overflow.)

Rainer Weikusat · Mar 18, 2014

Rainer Weikusat said:
I have a plain text file with each line in the format:

Start of line followed immediately by a string of character(s), a
whitespace, another string, a newline.

-------- file.txt -------

SOMESTRING XXX
SOMESTRING ZZZ
SOMEOTHERSTRING YYYZZ23
DIFFERENTSTRING HELLO

This can be implemented using a very simple, clear on-liner in awk, right from
your shell prompt.

The lines marked <- are my tty input; the others are awk output:

$ awk '{ if (! ($1 in seen)) { print $0 ; seen[$1] } }'

perl -ane 'print, $seen{$F[0]} = 1 unless $seen{$F[0]}'

Click to expand...

perl -ane 'print if !$seen{$F[0]}++'

perl -ane '$seen{$F[0]}++ || print'

Click to expand...

Well, there are quite a few more more-or-less bizarre variants, eg,

perl -ape '$_=$seen{$F[0]}++?"":$_'

or

perl -ape '$seen{$F[0]}++&&undef$_'

Coming to think of that, no self-respecting quibbler would ever use a
hash named %seen. And this auto-split thing is also much too
straight-forward. So what about

perl -pe '$£{(/(\S+)/)[0]}++&&undef$_'

?

Php combine identical lines in text file	4	Oct 11, 2023
Help with importing from multiple files and printing lines in designated spot to spit out one file.	1	Jan 16, 2023
ifstream >> string with UTF-8?	6	Sep 9, 2009
Problem Splitting Text String	2	Dec 29, 2022
Turning lines of a file into array?	10	May 4, 2013
Python point location of intersect between two lines	0	Feb 28, 2018
Clickable link conversion regex?	0	Nov 30, 2012
Can't solve problems! please Help	0	Sep 26, 2022

Removing lines containing same first string boundaries?

Tuxedo

Tuxedo

John Black

Rainer Weikusat

Rainer Weikusat

Charlton Wilbur

Rainer Weikusat

John Black

Randy Westlund

Rainer Weikusat

Tuxedo

Rainer Weikusat

Rainer Weikusat

Rainer Weikusat

Kaz Kylheku

Rainer Weikusat

Tim McDaniel

Rainer Weikusat

Kaz Kylheku

Rainer Weikusat

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads