George Durzi
Consider this excerpt from some HTML. (This is a copy from View->Source,
except for the comment)
<TABLE WIDTH=100% CELLPADDING=0 CELLSPACING=0 border=0>
<?xml version="1.0" encoding="UTF-16"?>
<!-- need to extract whatever is here -->
</TABLE>
I need to extract all the HTML that would be in the <!-- need to extract
whatever is here --> section. So I did the following.
1. Retrieve the HTML into a string variable
Interesting observation: when I look at the contents of the string, every
double quote has been escaped, so they all show as \" instead of "
2. Remove carriage returns and newlines from the string
ResultHtml = ResultHtml.Replace("\r", string.Empty);
ResultHtml = ResultHtml.Replace("\n", string.Empty);
3. Use a Regex to try and find a match
string sFind = "<TABLE WIDTH=100% CELLPADDING=0 CELLSPACING=0 border=0>" +
"<?xml version=\"1.0\" encoding=\"UTF-16\"?>" + "((.|\n)*?)" + "</TABLE>";
Regex rx = new Regex(sFind,
RegexOptions.IgnoreCase|RegexOptions.IgnorePatternWhitespace);
Match m1 = rx.Match(ResultHtml);
if (m1.Success)
// do something
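On the observation in step 1: the \" is most likely just how the debugger displays a string, not what the string holds. A quick check, assuming a small console program (the class and method names here are made up for illustration):

```csharp
using System;

public static class QuoteCheck
{
    // Hypothetical sample: C# source needs \" to put a double quote in a literal.
    public static string Sample()
    {
        return "version=\"1.0\"";
    }

    public static void Main()
    {
        string s = Sample();

        // The backslashes are not stored in the string:
        // "version=" (8) + quote (1) + "1.0" (3) + quote (1) = 13 characters.
        Console.WriteLine(s.Length);        // prints 13
        Console.WriteLine(s.IndexOf('\\')); // prints -1, no backslash present
        Console.WriteLine(s);               // prints version="1.0"
    }
}
```

If the same holds for ResultHtml, the quotes in the HTML are plain " characters and do not need to be stripped before matching.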
I never get a match. I tried this with some simpler HTML, and the regex
works fine to retrieve what was between two table tags.
I also tried stripping all double quotes from ResultHtml, and then trying:
string sFind = "<TABLE WIDTH=100% CELLPADDING=0 CELLSPACING=0 border=0>" +
"<?xml version=1.0 encoding=UTF-16?>" + "((.|\n)*?)" + "</TABLE>";
Still no match.
The string in my HTML which I'm trying to match exists exactly as in sFind.
Any idea?
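Two things that may be worth checking, offered as a guess rather than a confirmed diagnosis. In a .NET pattern the ? is a quantifier, so <?xml and ?> are not matched literally. RegexOptions.IgnorePatternWhitespace also removes the unescaped spaces from the pattern, so "TABLE WIDTH" turns into "TABLEWIDTH". A minimal sketch that escapes the literal parts with Regex.Escape and drops that option follows. ExtractBody is a hypothetical helper, and the \s* between the tag and the XML declaration assumes that only whitespace separates them:

```csharp
using System;
using System.Text.RegularExpressions;

public static class TableExtractor
{
    // Hypothetical helper: returns the text between the XML declaration
    // and the closing </TABLE>, or null when nothing matches.
    public static string ExtractBody(string html)
    {
        string openTag = "<TABLE WIDTH=100% CELLPADDING=0 CELLSPACING=0 border=0>";
        string xmlDecl = "<?xml version=\"1.0\" encoding=\"UTF-16\"?>";

        // Regex.Escape turns ?, . and spaces into literals.
        string pattern = Regex.Escape(openTag) + @"\s*" + Regex.Escape(xmlDecl)
                       + "(.*?)" + Regex.Escape("</TABLE>");

        // Singleline lets . match newlines, so the input needs no pre-stripping.
        Regex rx = new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);
        Match m = rx.Match(html);
        return m.Success ? m.Groups[1].Value.Trim() : null;
    }

    public static void Main()
    {
        string html =
            "<TABLE WIDTH=100% CELLPADDING=0 CELLSPACING=0 border=0>\r\n" +
            "<?xml version=\"1.0\" encoding=\"UTF-16\"?>\r\n" +
            "<!-- need to extract whatever is here -->\r\n" +
            "</TABLE>";

        // prints <!-- need to extract whatever is here -->
        Console.WriteLine(ExtractBody(html));
    }
}
```

With RegexOptions.Singleline in place, the (.|\n)*? group can be written as (.*?), and steps 2 and 3 collapse into a single call.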