Chris Chris
Hi there,
I am trying to find a solution for this problem:
I am generating a large amount of data from different data feeds (CSV).
I have a short Ruby program that parses all of the CSV files (input),
manipulates the data, and then writes the data to a new CSV file,
combined_data.csv.
My problem is that the data feeds I get are formatted differently.
For example, let's use iPods:
1.csv
iPod Video 20 GB | black, white
Sony BlueRay Player | green, blue
2.csv
IPOD video 20Gigs | $250 | apple
SONY Blue Ray Disc Player | $300 | sony
3.csv
I pod Video 20GB | apple.com/ipod
Sony BRD Player | sony.com/blueray
For my combined_data.csv, I might want to combine the data from the 3
feeds like so:
combined_data.csv
iPod Video 20 GB | black, white | $250 | apple | apple.com/ipod
Thus, where I have 3 relevant rows of data, I want to combine them into
one. Of course, the data isn't in the same order in the different CSV
files.
The only "common element" is the name (product name in this case), but
as you will see above, there are usually some differences. So I have to
use the name to combine the data.
Lowercasing and removing whitespace won't work because the words
themselves can differ (e.g. "20 GB" vs. "20Gigs", "BlueRay" vs. "BRD").
I basically have no idea how this can be accomplished (especially since
there are 100s of products), apart from manually adding a common ID,
e.g.
if n == "IPOD video 20Gigs"
  id = "ipodv20"
end
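Spelled out for more than one name, that manual approach would probably end up as a lookup table rather than a chain of ifs. A rough sketch only: the PRODUCT_IDS constant and the product_id helper are names made up here for illustration, and the table would have to be filled in by hand for every spelling in every feed.

```ruby
# Hypothetical lookup table: every known spelling maps to one shared ID.
# Each entry has to be typed in by hand, which is the part that does not scale.
PRODUCT_IDS = {
  "iPod Video 20 GB"  => "ipodv20",
  "IPOD video 20Gigs" => "ipodv20",
  "I pod Video 20GB"  => "ipodv20"
}.freeze

# Returns the shared ID for a product name, or nil if the name is unknown.
def product_id(name)
  PRODUCT_IDS[name]
end
```

Any spelling that is not in the table comes back as nil, so new feed entries would silently fail to match until someone adds them.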
The second thing I thought of: maybe there's a way in Ruby to compare
two strings and get back a percentage of how similar they are?
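To make the idea concrete, here is a rough sketch of the kind of thing I mean. It is only an assumption on my part that a character-bigram Dice coefficient is a sensible measure here; it is not a built-in Ruby method, and the helper names normalize, bigrams, and similarity are made up for illustration.

```ruby
# Lowercase the name and drop everything except letters and digits,
# so "iPod Video 20 GB" and "I pod Video 20GB" start out closer together.
def normalize(name)
  name.downcase.gsub(/[^a-z0-9]/, "")
end

# Return every pair of adjacent characters in the string.
def bigrams(str)
  str.chars.each_cons(2).map(&:join)
end

# Dice coefficient over character bigrams, as a percentage (0.0 to 100.0).
# Each bigram in b can be matched at most once.
def similarity(a, b)
  x = bigrams(normalize(a))
  y = bigrams(normalize(b))
  return 0.0 if x.empty? || y.empty?

  remaining = y.dup
  shared = x.count do |pair|
    idx = remaining.index(pair)
    remaining.delete_at(idx) if idx
  end
  200.0 * shared / (x.size + y.size)
end

# Usage: pick the name from feed 1 that is closest to a name from feed 2.
feed1_names = ["iPod Video 20 GB", "Sony BlueRay Player"]
best = feed1_names.max_by { |n| similarity("IPOD video 20Gigs", n) }
puts best
```

With something like this, each name in one feed could be paired with its highest-scoring name in another feed, and anything below some cutoff could be flagged for manual review. What cutoff would be safe is exactly the part I am unsure about.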
If you have any ideas, I'd appreciate your input.
Cheers, Chris