MS Word to XHTML

Caversham

Is there any macro / other tool - free or commercial - that can split
long Word docs into multiple XHTML pages?

Any comments on the quality/effectiveness of suitable products also
welcomed.
 
Roy Schestowitz

__/ [Caversham] on Sunday 11 September 2005 06:02 \__
Is there any macro / other tool - free or commercial - that can split
long Word docs into multiple XHTML pages?

Any comments on the quality/effectiveness of suitable products also
welcomed.

I would advise you to do the following:

* Download Open Office 2 beta (openoffice.org)

* Install it on your Windows machine

* Open the Word document in Open Office

* Save or export as HTML

* Fragment the output as required, probably by hand (WYSIWYG programs like
Word have no notion of structure or semantics)

* Run HTMLTidy on the resulting HTML (find it on sourceforge.net)

* Modify output to fit XHTML standards

* Use search & replace for the task above

* Lastly, make sure your code validates (W3C validator)

Good luck,

Roy
 
Alan J. Flavell

On Sun, 11 Sep 2005, Roy Schestowitz wrote (seen on alt.html):

[...]
* Fragment the output as required, probably by hand (WYSIWYG programs
like Word have no notion of structure or semantics)

This isn't by any means aimed at you personally, but your posting
triggered a response from me, and it looks as if knowledge is proceeding
backwards.

Proper use of MS Word uses Styles, oriented towards the structure of the
document. (If I had my way, I'd rip the direct styling buttons out of the
main menu of Word, and hide them away in an Advanced Users menu). Such
properly-made Word documents are reasonably capable of being converted
well to structural HTML, and a stylesheet suitable for web use can then be
applied (it usually won't be the same "style sheet" (= style template) as
would be suitable for a printed Word document, of course!).

I had some experience, around 1997-8, with the (payware) rtftohtml program
- subsequently renamed and marketed under the company name Logictran - it
had this pretty-much sorted out. I must admit I haven't got experience of
it since the change of name, but I can say that the principles of the
original program seemed to be what I was looking for, unlike most of the
other pseudo-WYSIWYG garbage from other places (that offended all sense of
what is suitable for the WWW).

With that rtftohtml program, decently structured Word could be turned into
decently structured HTML, and split on chapter or section headings quite
automatically, with HTML indexes and table of contents generated
automatically. OK, there were some rough edges, but at least the
principles showed up just fine. I find it sad that some 7 years later we
seem to have fallen back to the stone age of direct styling and
pseudo-WYSIWYG in most of the Word conversions that I have seen.

[Note - there are other programs called rtftohtml or rtf2html - it may be
that some of them do a similar job, I can't speak for or against them,
I'm just commenting as a reasonably satisfied user of version 4 of this
particular program from around 1998 onwards.]
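
For anyone wanting to experiment with the heading-based splitting described above, here is a minimal Python sketch. It assumes the conversion has already produced well-formed XHTML in which each chapter starts at an `h1` element directly under `body`. The function name `split_on_headings` and the "Untitled" fallback title are made up for illustration, and this is not a description of how rtftohtml itself worked.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Serialise XHTML elements without a namespace prefix.
ET.register_namespace("", XHTML_NS)


def split_on_headings(xhtml: str, heading: str = "h1") -> list[str]:
    """Split one XHTML document into pages, starting a new page at each heading."""
    root = ET.fromstring(xhtml)
    body = root.find(f"{{{XHTML_NS}}}body")
    if body is None:
        raise ValueError("no <body> element found")
    heading_tag = f"{{{XHTML_NS}}}{heading}"

    # Group the children of <body>: each heading opens a new group.
    # Content before the first heading ends up in a group of its own.
    groups = []
    for child in list(body):
        if child.tag == heading_tag or not groups:
            groups.append([])
        groups[-1].append(child)

    pages = []
    for group in groups:
        page = ET.Element(f"{{{XHTML_NS}}}html")
        head = ET.SubElement(page, f"{{{XHTML_NS}}}head")
        title = ET.SubElement(head, f"{{{XHTML_NS}}}title")
        # Use the heading text as the page title when the group starts with one.
        first = group[0]
        if first.tag == heading_tag:
            title.text = "".join(first.itertext())
        else:
            title.text = "Untitled"
        new_body = ET.SubElement(page, f"{{{XHTML_NS}}}body")
        new_body.extend(group)
        pages.append(ET.tostring(page, encoding="unicode"))
    return pages
```

Each returned string can be written to its own file, and a table of contents could be built from the page titles. Note that the standard-library parser will reject named entities such as `&nbsp;` unless they are converted to numeric references first. The whole approach only works if the headings are really there, i.e. if the Word document was written with heading styles in the first place.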
 
Toby Inkster

Roy said:
* Run HTMLTidy on the resulting HTML (find it on sourceforge.net)
* Modify output to fit XHTML standards
* Use search & replace for the task above

Tidy can do all of this -- use the "-asxhtml" option.
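
A minimal sketch of driving this from a script, assuming the `tidy` executable is installed and on the PATH. The helper names `build_tidy_command` and `convert_to_xhtml` are made up for illustration; `-asxhtml`, `-q` (quiet) and `-o` (output file) are Tidy command-line options.

```python
import shutil
import subprocess
from pathlib import Path


def build_tidy_command(src: Path, dst: Path) -> list[str]:
    """Return the argument list that asks Tidy to write src as XHTML to dst."""
    return ["tidy", "-asxhtml", "-q", "-o", str(dst), str(src)]


def convert_to_xhtml(src: Path, dst: Path) -> int:
    """Run Tidy on src and return its exit status."""
    if shutil.which("tidy") is None:
        raise FileNotFoundError("tidy executable not found on PATH")
    # Tidy exits with 0 on success, 1 if there were warnings, 2 on errors,
    # so a non-zero status is returned to the caller instead of raising.
    result = subprocess.run(build_tidy_command(src, dst), check=False)
    return result.returncode
```

Something like `convert_to_xhtml(Path("export.html"), Path("export.xhtml"))` would then replace the manual search-and-replace steps, although the result should still be run through the W3C validator.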
 
SpaceGirl

Alan said:
On Sun, 11 Sep 2005, Roy Schestowitz wrote (seen on alt.html):

[...]
* Fragment the output as required, probably by hand (WYSIWYG programs
like Word have no notion of structure or semantics)


This isn't by any means aimed at you personally, but your posting
triggered a response from me, and it looks as if knowledge is proceeding
backwards.

Proper use of MS Word uses Styles, oriented towards the structure of the
document. (If I had my way, I'd rip the direct styling buttons out of the
main menu of Word, and hide them away in an Advanced Users menu). Such
properly-made Word documents are reasonably capable of being converted
well to structural HTML, and a stylesheet suitable for web use can then be
applied (it usually won't be the same "style sheet" (= style template) as
would be suitable for a printed Word document, of course!).

I had some experience, around 1997-8, with the (payware) rtftohtml program
- subsequently renamed and marketed under the company name Logictran - it
had this pretty-much sorted out. I must admit I haven't got experience of
it since the change of name, but I can say that the principles of the
original program seemed to be what I was looking for, unlike most of the
other pseudo-WYSIWYG garbage from other places (that offended all sense of
what is suitable for the WWW).

With that rtftohtml program, decently structured Word could be turned into
decently structured HTML, and split on chapter or section headings quite
automatically, with HTML indexes and table of contents generated
automatically. OK, there were some rough edges, but at least the
principles showed up just fine. I find it sad that some 7 years later we
seem to have fallen back to the stone age of direct styling and
pseudo-WYSIWYG in most of the Word conversions that I have seen.

[Note - there are other programs called rtftohtml or rtf2html - it may be
that some of them do a similar job, I can't speak for or against them,
I'm just commenting as a reasonably satisfied user of version 4 of this
particular program from around 1998 onwards.]

Word XP and upwards stores its documents in XML format, doesn't it? You
could probably write your own XSLT to turn it into HTML fairly easily.

--


x theSpaceGirl (miranda)

# lead designer @ http://www.dhnewmedia.com #
# remove NO SPAM to email, or use form on website #
# this post (c) Miranda Thomas 2005
# explicitly no permission given to Forum4Designers
# to duplicate this post.
 
Alan J. Flavell

Alan J. Flavell wrote:

[comprehensive quote of my posting, without apparently having anything
relevant to say about it.]
Word XP and upwards stores its documents in XML format, doesn't it?

So what? XML is only a format for defining markup. If the markup
doesn't do anything meaningful (specifically - if it only creates a
visual result on a printed page, without having any significant
structure) then it's not going to turn into effective HTML: it'd just
be the usual garbage in / garbage out that we're accustomed to with
Word conversions to soi-disant "web" format.
You could probably write your own XSLT to turn it into HTML fairly
easily.

There seems to be some kind of conceptual disconnect here. Most Word
documents (in my experience) simply don't contain the necessary
structure for useful conversion to HTML: they've been created as a
purely visual construction for printing onto paper. It's irrelevant
what underlying technology you use (RTF, XML, SGML, whatever) - the
problem is that the source material simply does not represent the
needed structures, *because the document authors do not put it there*.

You might as well try to convert cheese into fresh cream: both are
fine milk products, it's true, but instead of trying to convert the
one into the other, you'd do better to produce them both starting from
fresh milk. And the kind of "fresh milk" that's needed here is
logically structured text markup. Not visual formatting. Until the
authors of Word documents can grasp that, the prospects for conversion
of Word to web formats are poor, IMHO.
 
Toby Inkster

Alan said:
You might as well try to convert cheese into fresh cream: both are
fine milk products, it's true, but instead of trying to convert the
one into the other, you'd do better to produce them both starting from
fresh milk.

That is a very nice analogy -- I must try to remember it.
 
Roy Schestowitz

__/ [Toby Inkster] on Sunday 11 September 2005 10:02 \__
Tidy can do all of this -- use the "-asxhtml" option.

I didn't know about the existence of this option. Perhaps I am using a
(very) old version of tidy. I wasn't impressed the last time I used it,
which was over a year ago. I must also have thought about complex cases
when I suggested the steps above. Placements of images, for example, might
pose some difficulties, especially if they float.

OpenOffice.org will be a decent tool for steering away from non-standard attributes
and hard-coded fonts. The last thing the World Wide Web needs is more code
that is 'made up', which non-MS browsers like Firefox must accept and adapt
to. Sad, yet inevitable.

It sometimes upsets me that kids at school are taught to compose using
WYSIWYG paradigms. It only encourages information to be uninterpretable.
Like Zeldman once said, people used to toss bottles out the car's window
until they realised the impact of carelessness and laziness (misquotation,
but something to that effect anyway).

Roy
 
Roy Schestowitz

__/ [Alan J. Flavell] on Sunday 11 September 2005 11:19 \__
Alan J. Flavell wrote:

[comprehensive quote of my posting, without apparently having anything
relevant to say about it.]
Word XP and upwards stores its documents in XML format, doesn't it?

So what? XML is only a format for defining markup. If the markup
doesn't do anything meaningful (specifically - if it only creates a
visual result on a printed page, without having any significant
structure) then it's not going to turn into effective HTML: it'd just
be the usual garbage in / garbage out that we're accustomed to with
Word conversions to soi-disant "web" format.
You could probably write your own XSLT to turn it into HTML fairly
easily.

There seems to be some kind of conceptual disconnect here. Most Word
documents (in my experience) simply don't contain the necessary
structure for useful conversion to HTML: they've been created as a
purely visual construction for printing onto paper. It's irrelevant
what underlying technology you use (RTF, XML, SGML, whatever) - the
problem is that the source material simply does not represent the
needed structures, *because the document authors do not put it there*.

You might as well try to convert cheese into fresh cream: both are
fine milk products, it's true, but instead of trying to convert the
one into the other, you'd do better to produce them both starting from
fresh milk. And the kind of "fresh milk" that's needed here is
logically structured text markup. Not visual formatting. Until the
authors of Word documents can grasp that, the prospects for conversion
of Word to web formats are poor, IMHO.

I fully agree with you on that point; any attempt to rephrase the same
ideas would only dilute them. As a way forward, I suggest that
the OP, who clearly wants to publish material on the Web, learns LaTeX.
Should the idea of editing raw text prove daunting, I suggest LyX < lyx.org >
[LyX: Front-end to LaTeX]. 5 minutes with LyX would help anyone realise
the difference and convey the idea, e.g. varying outputs, styles,
imposition of structure, etc.

Only a few days ago, somebody in the LyX mailing lists mentioned his
upcoming presentation on "Word: What you See Is What a Mess". The
presentation I deliver on Wednesday is well-formed XHTML <
http://schestowitz.com/Weblog/archives/2005/09/11/public-speaking/ > and is
powered by Eric Meyer's S5.

Roy
 
Stefan Ram

Caversham said:
Is there any macro / other tool - free or commercial - that can split
long Word docs into multiple XHTML pages?

I have a macro "Wrocco" that extracts XML from a document,
including paragraph and character styles and document
properties, but not everything (no formatting or tables).

The VBA source code and some links to other resources can
be found on the project page:

http://www.purl.org/stefan_ram/pub/wrocco_en

If you use any tool to create XML from Word (including
XHTML), you could then use XSLT to split this into multiple
pages, I assume.
 
Joris Gillis

Hi,

Alan said:
So what? XML is only a format for defining markup. If the markup
doesn't do anything meaningful (specifically - if it only creates a
visual result on a printed page, without having any significant
structure) then it's not going to turn into effective HTML: it'd just
be the usual garbage in / garbage out that we're accustomed to with
Word conversions to soi-disant "web" format.


There seems to be some kind of conceptual disconnect here. Most Word
documents (in my experience) simply don't contain the necessary
structure for useful conversion to HTML: they've been created as a
purely visual construction for printing onto paper. It's irrelevant
what underlying technology you use (RTF, XML, SGML, whatever) - the
problem is that the source material simply does not represent the
needed structures, *because the document authors do not put it there*.

You might as well try to convert cheese into fresh cream: both are
fine milk products, it's true, but instead of trying to convert the
one into the other, you'd do better to produce them both starting from
fresh milk. And the kind of "fresh milk" that's needed here is
logically structured text markup. Not visual formatting. Until the
authors of Word documents can grasp that, the prospects for conversion
of Word to web formats are poor, IMHO.

I warmheartedly applaud your brilliant analysis. You stated your point very clearly.

It's depressing to see what a tiny percentage of people realize (or bother with) the importance of structural markup.

The future does not look bright. I have seen so-called 'IT classes' where they make innocent people believe they are IT experts when they can change the background color of characters typed in Word...

regards,
 
SpaceGirl

Roy said:
__/ [Alan J. Flavell] on Sunday 11 September 2005 11:19 \__

Alan J. Flavell wrote:

[comprehensive quote of my posting, without apparently having anything
relevant to say about it.]

Word XP and upwards stores its documents in XML format doesn't it?

So what? XML is only a format for defining markup. If the markup
doesn't do anything meaningful (specifically - if it only creates a
visual result on a printed page, without having any significant
structure) then it's not going to turn into effective HTML: it'd just
be the usual garbage in / garbage out that we're accustomed to with
Word conversions to soi-disant "web" format.

Word documents, being style based, are easy to convert. Use XSLT to
strip out all the crap so that all you end up with is basic HTML - <p>'s
and <h>'s. I wasn't suggesting that anything more complicated than that
should be attempted - but I HAVE seen it done pretty successfully with
Word 2003 files. In the case of that client (although I wasn't part of
the team who wrote those tools), their customers would submit Word
documents and the XSLT would convert them into both HTML and PDFs, and
the reproduction was almost perfect (styling and colours anyway).
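
To make the "basic HTML" idea concrete, here is a minimal sketch of the same kind of transformation, written in Python rather than XSLT. It assumes a document saved in Word 2003's XML format (the `w:` namespace below) and a hypothetical mapping from paragraph style IDs to HTML elements; the function name `wordml_to_html` is made up, and this is not the client's tool.

```python
import xml.etree.ElementTree as ET
from html import escape

# Namespace of the Word 2003 XML format (assumed input format).
W = "http://schemas.microsoft.com/office/word/2003/wordml"

# Hypothetical mapping from Word paragraph style IDs to HTML elements.
STYLE_TO_TAG = {"Heading1": "h1", "Heading2": "h2", "Heading3": "h3"}


def wordml_to_html(wordml: str) -> str:
    """Reduce a Word 2003 XML document to bare headings and paragraphs."""
    root = ET.fromstring(wordml)
    out = []
    for para in root.iter(f"{{{W}}}p"):
        # The paragraph style, if any, is the only structure we keep.
        style = para.find(f"{{{W}}}pPr/{{{W}}}pStyle")
        style_id = style.get(f"{{{W}}}val") if style is not None else None
        tag = STYLE_TO_TAG.get(style_id, "p")
        # Concatenate the text runs; direct formatting (bold, fonts) is dropped.
        text = "".join(t.text or "" for t in para.iter(f"{{{W}}}t"))
        if text.strip():
            out.append(f"<{tag}>{escape(text)}</{tag}>")
    return "\n".join(out)
```

A paragraph that was merely made big and bold by hand, without a heading style, comes out as a plain `<p>`, so the quality of the result depends entirely on how the author used styles.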

That wasn't what I saw, but like I said I wasn't on that team. As far as
I could tell they wrote a simple parser.

Strange, as I've never had a problem. Generally I have to do it in a
sort of round-robin of programs; First save your Word documents as PDF,
then save the PDF as a web page. It works just fine.

<snip stuff I can't be bothered to read, seeing as everyone else is being
so fucking rude>


 
Roy Schestowitz

__/ [SpaceGirl] on Sunday 11 September 2005 20:46 \__
Roy said:
__/ [Alan J. Flavell] on Sunday 11 September 2005 11:19 \__

On Sun, 11 Sep 2005, SpaceGirl wrote:


Alan J. Flavell wrote:

[comprehensive quote of my posting, without apparently having anything
relevant to say about it.]


Word XP and upwards stores its documents in XML format doesn't it?

So what? XML is only a format for defining markup. If the markup
doesn't do anything meaningful (specifically - if it only creates a
visual result on a printed page, without having any significant
structure) then it's not going to turn into effective HTML: it'd just
be the usual garbage in / garbage out that we're accustomed to with
Word conversions to soi-disant "web" format.

Word documents, being style based, are easy to convert. Use XSLT to
strip out all the crap so that all you end up with is basic HTML - <p>'s
and <h>'s. I wasn't suggesting that anything more complicated than that
should be attempted - but I HAVE seen it done pretty successfully with
Word 2003 files. In the case of that client (although I wasn't part of
the team who wrote those tools), their customers would submit Word
documents and the XSLT would convert them into both HTML and PDFs, and
the reproduction was almost perfect (styling and colours anyway).

That wasn't what I saw, but like I said I wasn't on that team. As far as
I could tell they wrote a simple parser.


I believe that's possible, but it depends on the standard that the author
sticks to. Word does not /force/ the author to add structural information.
Hence, hacks are allowed which leave bits hanging loose.

Strange, as I've never had a problem. Generally I have to do it in a
sort of round-robin of programs; First save your Word documents as PDF,
then save the PDF as a web page. It works just fine.


I have had bad experiences converting PDFs to HTML. I even wrote about this
very conversion < http://schestowitz.com/Weblog/archives/2005/05/24/pdf-to-html/ >
because I found it frustrating. PDF embeds objects positioned to fit the
medium, e.g. A4 paper, so it is bound to lose what is necessary for a good
conversion.

<snip stuff I can't be bothered to read, seeing as everyone else is being
so fucking rude>


Are you referring to me? Did I say anything rude? Please clarify if
possible.

Roy
 
Alan J. Flavell

As a way forward, I suggest that
the OP, who clearly wants to publish material on the Web, learns LaTeX.

Well, this drifts somewhat off the topic of some of the crossposted
groups, but our physicists are accustomed to writing their
publications in some form of latex, and I can say that when I was
handling the web-ifying of their publications, several years back, I
was (for the most part) getting good results from a program called
latex2html, and most problems were attributable to identifiable
causes, none of which were usually a major hindrance. (Back then we
had to make do with the deplorable HTML version called HTML/3.2, but,
aside from that, the principles seemed right).
Should the idea of editing raw text prove daunting, I suggest LyX
< lyx.org > [LyX: Front-end to LaTeX]. 5 minutes with LyX would help
anyone realise the difference and convey the idea, e.g. varying
outputs, styles, imposition of structure, etc.

Only a few days ago, somebody in the LyX mailing lists mentioned his
upcoming presentation on "Word: What you See Is What a Mess".

googled!

It's really the principles which count here: but in practical terms,
I'm sure you're right in aiming at a format which promotes >doing the
right thing< by default - as opposed to one which has prominent
direct-formatting buttons on its user interface, and logical markup as
an apparently advanced topic which, I'm afraid, too many authors
seem to disdain learning.

all the best
 
Roy Schestowitz

[Groups distribution reduced]

__/ [Alan J. Flavell] on Monday 12 September 2005 17:33 \__
Well, this drifts somewhat off the topic of some of the crossposted
groups, but our physicists are accustomed to writing their
publications in some form of latex, and I can say that when I was
handling the web-ifying of their publications, several years back, I
was (for the most part) getting good results from a program called
latex2html, and most problems were attributable to identifiable
causes, none of which were usually a major hindrance. (Back then we
had to make do with the deplorable HTML version called HTML/3.2, but,
aside from that, the principles seemed right).


I use latex2html almost religiously. I estimate that about 1000 pages in my
site are in one form or another a product of latex2html, which has always
produced better output than lyx2html, for example. I discussed latex2html
in depth a couple of days ago and I continue to promote it.

Should the idea of editing raw text prove daunting, I suggest LyX
< lyx.org > [LyX: Front-end to LaTeX]. 5 minutes with LyX would help
anyone realise the difference and convey the idea, e.g. varying
outputs, styles, imposition of structure, etc.

Only a few days ago, somebody in the LyX mailing lists mentioned his
upcoming presentation on "Word: What you See Is What a Mess".

googled!

It's really the principles which count here: but in practical terms,
I'm sure you're right in aiming at a format which promotes >doing the
right thing< by default - as opposed to one which has prominent
direct-formatting buttons on its user interface, and logical markup as
an apparently advanced topic which, I'm afraid, too many authors
seem to disdain learning.

all the best


Only last night I was in a similar position involving my supervisor who
heads the Computer Science Department [I believe it is sensible to make
this public given the nature of the discussion]. For a Windows-centric
person like himself, who uses Office almost exclusively, it was difficult
to satisfy a Linux-dominated department. Conversion of a Word document to
HTML, also to be embedded in E-mail (I must bite my tongue) was never a
good idea. The final outcome is a PDF attachment with hyperlinks. My
arguments about standards, structure-based composition and the like seem to
have led to this result, which I suspect many will be satisfied with.

Best Wishes,

Roy
 
Peter Flynn

Toby said:
That is a very nice analogy -- I must try to remember it.

The others in common use are

Turning hamburgers back into cows
Turning scrambled eggs back into chickens

///Peter
 
