On 01-Nov-12 5:41 AM, (e-mail address removed) wrote:
Hi,
Can you please show the way to quickly search such big Xml file, in a Visual C++ project?
http://dl.dropbox.com/u/40211031/List.zip
Did you generate these 1,000,002 lines of XML data, or is this from the
real world?
In case someone does not like downloading a 57 MB zip file, or
expanding it into 722 MB of rather pointless example lines, here is an
abbreviated version:
<?xml version="1.0" encoding="UTF-16"?>
<Appdata>
<Data Attr0="00" Attr1="01" Attr2="02" Attr3="03" Attr4="04" Attr5="05"
Attr6="06" Attr7="07" Attr8="08" Attr9="09" Attr10="010" Attr11="011"
Attr12="012" Attr13="013" Attr14="014" Attr15="015" Attr16="016"
Attr17="017" Attr18="018" Attr19="019">Node_Number0</Data>
<Data Attr0="10" Attr1="11" Attr2="12" Attr3="13" Attr4="14" Attr5="15"
Attr6="16" Attr7="17" Attr8="18" Attr9="19" Attr10="110" Attr11="111"
Attr12="112" Attr13="113" Attr14="114" Attr15="115" Attr16="116"
Attr17="117" Attr18="118" Attr19="119">Node_Number1</Data>
... (999,996 similar lines omitted) ...
<Data Attr0="9999980" Attr1="9999981" Attr2="9999982" Attr3="9999983"
Attr4="9999984" Attr5="9999985" Attr6="9999986" Attr7="9999987"
Attr8="9999988" Attr9="9999989" Attr10="99999810" Attr11="99999811"
Attr12="99999812" Attr13="99999813" Attr14="99999814" Attr15="99999815"
Attr16="99999816" Attr17="99999817" Attr18="99999818"
Attr19="99999819">Node_Number999998</Data>
<Data Attr0="9999990" Attr1="9999991" Attr2="9999992" Attr3="9999993"
Attr4="9999994" Attr5="9999995" Attr6="9999996" Attr7="9999997"
Attr8="9999998" Attr9="9999999" Attr10="99999910" Attr11="99999911"
Attr12="99999912" Attr13="99999913" Attr14="99999914" Attr15="99999915"
Attr16="99999916" Attr17="99999917" Attr18="99999918"
Attr19="99999919">Node_Number999999</Data>
</Appdata>
I'm assuming you *generated* this file by way of example. If not, well,
it's so extremely structured that you could throw it away and use a
simple algorithm to generate the "data" for any line immediately. (And
then it would not be "data", it would be a calculation.)
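
To illustrate the "calculation" point, here is a minimal sketch that rebuilds any line from its number alone. It assumes the pattern visible in the excerpt holds for the whole file: each attribute value is the decimal line number followed by the decimal attribute index, and the content is "Node_Number" plus the line number. The names attrValue and makeLine are made up for this example.

```cpp
#include <string>

// Hypothetical helper: value of attribute `attr` (0..19) on line `line`,
// assuming the value is the line number followed by the attribute index,
// as in the excerpt (line 1, Attr19 -> "119").
std::string attrValue(unsigned line, unsigned attr) {
    return std::to_string(line) + std::to_string(attr);
}

// Hypothetical helper: rebuilds the whole <Data> element for a given line.
std::string makeLine(unsigned line) {
    std::string s = "<Data";
    for (unsigned a = 0; a < 20; ++a) {
        s += " Attr" + std::to_string(a) + "=\"" + attrValue(line, a) + "\"";
    }
    s += ">Node_Number" + std::to_string(line) + "</Data>";
    return s;
}
```

If the real file deviates from that pattern anywhere, this obviously stops being a substitute for the data.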
Anyway, XML is a poor choice for this particular set of data. Write a
program to convert it into a binary format, where each "line" uses 20
integers (one per attribute) and one string of a fixed length of 20
bytes. That takes up no more than 1,000,000 x (20 * sizeof(int) + 20)
~ 100 MB of memory, assuming 4-byte ints. Small enough to be loaded into
the RAM of today's computers.
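
A rough sketch of such a fixed-size record, with helpers to dump and reload the whole array in one go. The sample shows 20 attributes (Attr0 to Attr19), so this version stores 20 32-bit integers plus the 20-byte string, which comes to 100 bytes per record. The names Record, makeRecord, saveRecords and loadRecords are made up for this example, and parsing the XML into these records is left out.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <fstream>
#include <string>
#include <vector>

constexpr std::size_t kAttrCount = 20;  // Attr0..Attr19 in the sample
constexpr std::size_t kNameLen = 20;    // fixed-length content field

struct Record {
    std::int32_t attr[kAttrCount];
    char name[kNameLen];  // zero-padded; not NUL-terminated if all 20 bytes are used
};
static_assert(sizeof(Record) == kAttrCount * sizeof(std::int32_t) + kNameLen,
              "Record is expected to have no padding");

// Packs already-parsed values into one record; names longer than 20 bytes are truncated.
Record makeRecord(const std::array<std::int32_t, kAttrCount>& attrs,
                  const std::string& name) {
    Record r{};  // zero-initialise, so the name is zero-padded
    std::memcpy(r.attr, attrs.data(), sizeof r.attr);
    std::memcpy(r.name, name.data(), std::min(name.size(), kNameLen));
    return r;
}

// Writes all records as one raw block (same machine and compiler assumed).
bool saveRecords(const std::string& path, const std::vector<Record>& recs) {
    std::ofstream out(path, std::ios::binary);
    out.write(reinterpret_cast<const char*>(recs.data()),
              static_cast<std::streamsize>(recs.size() * sizeof(Record)));
    return static_cast<bool>(out);
}

// Reads the whole file back into memory; returns an empty vector on failure.
std::vector<Record> loadRecords(const std::string& path) {
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) return {};
    const std::streamsize bytes = in.tellg();
    if (bytes <= 0) return {};
    std::vector<Record> recs(static_cast<std::size_t>(bytes) / sizeof(Record));
    in.seekg(0);
    in.read(reinterpret_cast<char*>(recs.data()),
            static_cast<std::streamsize>(recs.size() * sizeof(Record)));
    if (!in) return {};
    return recs;
}
```

The largest value in the sample ("99999919") still fits in a 32-bit int, and "Node_Number999999" is 17 characters, so both fields are wide enough for this file.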
What counts as searching "quickly" depends on what you want to search
for. If, for example, you may need to grab a single digit out of any
attribute or content (say, a '9' that can occur in the middle of
'Attr2="4593252"'), you are better off storing everything as strings.
You could also sort the list on one or more of the Attr fields, so that
lookups on those fields become binary searches. And if you prefer lookup
speed over memory usage, you could even keep a sorted index for *each*
of the attribute fields plus the data field, where every index holds
pointers to the 'actual' data rather than copies of it.
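
The pointer idea could look roughly like this: one sorted vector of pointers per field you want to search on, and a binary search through it. This is only a sketch. It assumes the data already sits in memory as fixed-size records with 20 integer attributes, and the names Record, buildIndex and findByAttr are made up for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed in-memory layout: 20 integer attributes plus a 20-byte content field.
struct Record {
    std::int32_t attr[20];
    char name[20];
};

// Builds an index for one attribute: pointers into `recs`, sorted by attr[field].
// The records themselves are not moved or copied.
std::vector<const Record*> buildIndex(const std::vector<Record>& recs,
                                      std::size_t field) {
    std::vector<const Record*> idx;
    idx.reserve(recs.size());
    for (const Record& r : recs) idx.push_back(&r);
    std::sort(idx.begin(), idx.end(),
              [field](const Record* a, const Record* b) {
                  return a->attr[field] < b->attr[field];
              });
    return idx;
}

// Binary search in an index built for the same `field`.
// Returns a record whose attr[field] equals `value`, or nullptr if there is none.
const Record* findByAttr(const std::vector<const Record*>& idx,
                         std::size_t field, std::int32_t value) {
    auto it = std::lower_bound(idx.begin(), idx.end(), value,
                               [field](const Record* r, std::int32_t v) {
                                   return r->attr[field] < v;
                               });
    if (it != idx.end() && (*it)->attr[field] == value) return *it;
    return nullptr;
}
```

Each index costs one pointer per record, so 21 indexes over a million records add roughly 84 MB in a 32-bit build and twice that in a 64-bit one. The pointers stay valid only as long as the record vector is not resized. Note that this only helps for exact-value lookups; the "digit in the middle" kind of search still needs a scan over strings.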
[Jw]