Best way to replace a set of strings in large files?

Ryan Chan · Dec 10, 2009

Hello,

Consider the case:

You have 200 lines of mappings in CSV format, e.g.

apple,orange
boy,girl
....

You have a 500MB file and want to apply all 200 mappings to it.
What would be the most efficient way to do it?

Thanks.

cvhLE · Dec 11, 2009

If you want to replace the whole line, or you know the column where
the replacement is needed and the line has clear separators, it may be
a lot faster to do it with awk:

cat csv|awk -F"," '$2~/apple/ {$2="orange"; print $1,$2}' ...

Otherwise I don't see a reason not to use the most obvious way:
start at line 1 and run to the end ... especially if you don't
know *where* the 200 lines are ...

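#!/usr/bin/perl
use strict;
use warnings;

# Map each search string to its replacement.
my %replace = ('apple' => 'orange', 'boy' => 'girl');
# One alternation over all keys: longest first, metacharacters escaped.
my $alt = join '|', map { quotemeta($_) } sort { length $b <=> length $a } keys %replace;
my $r = qr/($alt)/;
while (<>) {
    s/$r/$replace{$1}/g;    # all mappings applied in one pass per line
    print;
}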

[08:07:43] cvh@lenny:~$ echo "a boy named sue sings a song for apple jack" | perl repl.pl
a girl named sue sings a song for orange jack
[08:07:45] cvh@lenny:~$ echo "a boy named sue sings a song for apple jack" > test.txt
[08:07:59] cvh@lenny:~$ perl repl.pl test.txt
a girl named sue sings a song for orange jack
[08:08:11] cvh@lenny:~$ perl repl.pl test.txt >test_replace.txt
[08:08:24] cvh@lenny:~$ cat test_replace.txt
a girl named sue sings a song for orange jack
[08:08:40] cvh@lenny:~$
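
Since the question has the mappings in a CSV file rather than hard-coded, here is a minimal sketch of the same single-pass idea that loads the pairs from a file first. It assumes plain two-column lines with no quoting or embedded commas. The sub names (load_mapping, build_regex, replace_all) and the file names in the usage comment are made up for illustration and are not from the posts above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read "search,replacement" pairs from an open filehandle into a hash.
# Assumption: plain two-column lines, no quoting, no embedded commas.
sub load_mapping {
    my ($fh) = @_;
    my %map;
    while (my $line = <$fh>) {
        chomp $line;
        next unless length $line;
        my ($from, $to) = split /,/, $line, 2;
        next unless defined $to;        # skip lines without a comma
        $map{$from} = $to;
    }
    return \%map;
}

# Build one alternation over all keys. Longest keys come first so a longer
# key wins over a shorter key that is its prefix; quotemeta escapes any
# regex metacharacters in the keys.
sub build_regex {
    my ($map) = @_;
    my $alt = join '|', map { quotemeta($_) }
              sort { length $b <=> length $a || $a cmp $b } keys %$map;
    return qr/($alt)/;
}

# Apply every mapping to one string in a single pass.
sub replace_all {
    my ($text, $re, $map) = @_;
    $text =~ s/$re/$map->{$1}/g;
    return $text;
}

# Usage (hypothetical file names): perl repl_csv.pl map.csv big.txt > out.txt
if (@ARGV) {
    my $map_file = shift @ARGV;
    open my $fh, '<', $map_file or die "Cannot open $map_file: $!";
    my $map = load_mapping($fh);
    close $fh;
    my $re = build_regex($map);
    while (my $line = <>) {             # remaining files, line by line
        print replace_all($line, $re, $map);
    }
}
```

Reading line by line keeps memory use flat no matter how large the input is. Like the script above, this matches substrings, not whole words, so "apple" inside "pineapple" is replaced too.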

sln · Dec 11, 2009

I would assume this process would take
a long time.

At a minimum, it would take

500,000,000
x
200
-----------------
100,000,000,000

100 billion character comparisons
if nothing ever matched.
If a word still doesn't match, but its first character
matches before backtracking, that doubles:

100,000,000,000
x
2
----------------
200,000,000,000

brings the total up to 200 billion character
comparisons.

Since all of this is a conservative estimate,
I would average (conservatively) 4 character
comparisons per mapping per byte in the file and say

500,000,000
x
800
-----------------
400,000,000,000

400 billion comparisons.
Add to that the minutiae of backtracking, loading
buffers, writing to disk, and the underpinning layers
Perl has to go through to execute C code, and I would
go out for coffee or take a nap.

-sln
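
The back-of-envelope figures above can be reproduced with a few lines of Perl. This is only a sketch of that worst-case model (file bytes x mappings x characters examined per attempt), not a measurement. The sub names estimate_comparisons and commify are made up for illustration.

```perl
use strict;
use warnings;

# Rough count of character comparisons under the model in the post:
# every byte of the file is tried against every mapping, and each attempt
# examines an assumed number of characters.
sub estimate_comparisons {
    my ($file_bytes, $num_maps, $chars_per_attempt) = @_;
    return $file_bytes * $num_maps * $chars_per_attempt;
}

# Insert thousands separators for readability.
sub commify {
    my ($n) = @_;
    1 while $n =~ s/^(\d+)(\d{3})/$1,$2/;
    return $n;
}

my $file_bytes = 500_000_000;   # 500MB file, as in the question
my $num_maps   = 200;           # 200 mapping lines

for my $chars (1, 2, 4) {
    printf "%d char(s) per attempt: %s comparisons\n",
        $chars, commify(estimate_comparisons($file_bytes, $num_maps, $chars));
}
```

This prints 100,000,000,000, then 200,000,000,000, then 400,000,000,000, matching the three totals in the post. How close a real run comes to this depends on the regex engine. Perl 5.10 and later can compile an alternation of literal strings into a trie, which may need far fewer comparisons than the model assumes, so timing it on a sample of the real file is the only way to know.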
