Converting from Microsoft Word to HTML

(covering Word 5, 6, Office95, Office97, and a non-Microsoft product that actually works; also micro write-ups for Quark and Frame users)

for the Web Tools Review by Philip Greenspun


May 2005 Update: I'm not actively maintaining this page anymore but I did just recently try a product called "Word to Web" (WordToWeb 2.5) to see if it would produce less complex, easier-to-edit HTML than Word 2003 does on its own. The answer is "no". In fact, it did a much worse job than Microsoft Word by itself.

Word 5.0 (the bad old days)

There are two tried-and-true ways to get documents from Microsoft Word 5.0 onto the Web:

Word 6.0 (Microsoft Realizes Internet Exists)

Even though I'm generally opposed to making Bill Gates even richer, I decided that I could spare $135 for the academic version of Microsoft Office with Word 6.0. This would, I decided, fix the Text with Layout bug and let me try out Microsoft Internet Assistant.

The Academic version of Office comes only on floppy disk; 32 of them. It took me about two hours and 120 megabytes of disk space to install the complete package.

I started Word 6.0.1, the new fixed zippy version of Word that was supposed to address the program's sluggishness on the Macintosh. PowerMac-native Word 6 ran substantially slower than Word 5 ran in emulation on my PowerMac.

"OK, so it is a major pig," I thought to myself, "and proves once again that C programs beyond a certain complexity are neither fast nor small. Still, it will be nice to be able to Save As Text with Layout without crashing the machine."

I tried saving a simple 10-page paper with no figures or equations or anything special (beyond two footnotes) as Text with Layout. An error box appeared. No output file was produced. My machine crashed a few seconds later.

"Oh well, so two years and all the C programmers Bill Gates could bring in from India weren't enough to squash this bug. At least I can play with Internet Assistant."

It turns out that Internet Assistant only runs on the PC version of Word.

Did I feel like I'd been cheated out of $135? Hell no! Aficionados of viewgraph design claim that Aldus Persuasion is way better, but PowerPoint produced a nice stack of colored viewgraphs for a conference talk. Oh, this PowerMac native C program is real zippy. On a 66 MHz PowerMac, it was barely able to keep up with my typing in certain modes.

Office 95

It took me about a year but I finally got Windows NT to run on a PC. Then I was able to enjoy total Microsoft quality, both operating systems and applications. I installed Office for Win95 and then the very latest Internet Assistant. Now I could just type "Save As HTML" from any Word document. It was incredibly convenient and I bowed down to worship Bill Gates.

As soon as I had rolled up my prayer rug, though, I noticed that the HTML output from Word/Internet Assistant was garbage. For example, it would start to wrap a headline in an H2 tag but then forget to close it, so huge blocks of text were rendered as a headline. There were hundreds of extraneous PRE, BR, and P tags. Worse than useless.

"At least I can go back to my old way of doing things," I mumbled to myself, "I'll just Save As RTF and then use rtftohtml." No such luck. The latest version of Word, at least with Internet Assistant installed, puts some crud in RTF files that rtftohtml doesn't understand.

Something that Works

What finally worked was a beautiful commercial product called HTML Transit from InfoAccess. This works on whole groups of Word docs, producing index pages and local tables of contents in a highly configurable manner. It is a beautiful program that can almost make up for a day of using Microsoft software. HTML Transit understands Word tables and turns them into HTML tables. It isn't too smart about "smart" quotes, though, and it turns them into "&#146;" which is not part of the legal HTML command set. It happens to be rendered by Netscape on Mac/PC as a fancy single quote but on Unix boxes it means nothing so the user sees "dont" when you wrote "don't". perl -i.bak -pe "s/&#146;/'/g" should fix it up. Less easily fixed are equations. I tried translating my pathetic master's thesis which has a lot of equations in the text. These all ended up mangled. Despite these nits, HTML Transit is by far the best tool available for translating Microsoft Word into HTML. It also understands Interleaf, Frame, WordPerfect, RTF and a bunch of other formats (not PageMaker though). Unfortunately, it only runs on Windows.
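
The Perl one-liner above handles a single character. A slightly broader sketch in Python follows, assuming the stray punctuation shows up either as numeric references in the Windows-1252 range (145 to 151) or as literal curly quotes and dashes. The function name plain_quotes and the choice of ASCII replacements are illustrative assumptions, not anything HTML Transit or Word prescribes.

```python
import re

# Windows-1252 code points that Word-derived HTML tends to emit as numeric
# references. Mapping them to plain ASCII is this sketch's own choice.
CP1252_TO_ASCII = {
    145: "'",   # left single quote
    146: "'",   # right single quote / apostrophe
    147: '"',   # left double quote
    148: '"',   # right double quote
    150: "-",   # en dash
    151: "--",  # em dash
}

# The same punctuation when it arrives as literal Unicode characters.
UNICODE_TO_ASCII = {
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
    "\u2013": "-", "\u2014": "--",
}

_NUMERIC_REF = re.compile(r"&#(\d+);")


def plain_quotes(html: str) -> str:
    """Replace Word-style smart punctuation with ASCII equivalents."""
    def replace_ref(match):
        code = int(match.group(1))
        # Leave any reference we do not recognise exactly as it was.
        return CP1252_TO_ASCII.get(code, match.group(0))

    html = _NUMERIC_REF.sub(replace_ref, html)
    for fancy, plain in UNICODE_TO_ASCII.items():
        html = html.replace(fancy, plain)
    return html
```

Calling plain_quotes on the text of each converted file and writing the result back would cover the same ground as the Perl command, plus the double quotes and dashes.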

Office 97

Of course, you'd expect a company like Microsoft, with $8 billion in the bank, to eventually get it right... A friend of mine contributing a review to my on-line photography magazine wrote the original in Office97. He saved it as HTML and the results were remarkable. Where you'd have expected to find an <H3>, instead you got
<B><FONT FACE="Arial" SIZE=4><P>
The document was filled with special 8-bit ASCII characters that aren't part of the legal HTML character set, e.g., "smart quotes" and long dashes. There was an extra "<P>&nbsp;</P>" in between all of the paragraphs.

In short, a terrible unusable non-standard mess reflecting a complete ignorance of the original point of HTML (that the browser does the formatting).

What About Quark?

Quark does not have native support for writing HTML versions of its documents. There are a couple of plug-ins that you can buy that allegedly output HTML but I haven't been able to get the whole collection to work.

What About Frame?

Adobe's Frame 5.5 has the ability out of the box to output HTML (if you have an older version, you'll have to download a plug-in from the Adobe web site and/or upgrade to 5.5). It takes about 5 minutes to install Frame on a WinNT machine, read in a document, and write out an HTML file. I will be able to tell you more after I get off my butt and finish the on-line edition of Database Backed Web Sites (Macmillan used Frame to produce the final film).

... Well, I've tried Frame 5.5 now. It includes an impressive array of controls for converting documents to HTML and is way better than the plug-in that Adobe distributed with 5.0. I think that the philosophy and overall power are similar to what you get with InfoAccess. One truly impressive thing that Frame will do is simultaneously output a cascading style sheet so that the final HTML stays reasonably clean yet readers with modern browsers can get the benefits of design choices you've made. (Note: I wasn't able to finish the on-line edition of Database Backed Web Sites because Macmillan didn't supply me with something that would load cleanly into Frame and display. But whatever I saw in Frame was ultimately viewable in the final HTML document.)

The bottom line on Frame is that it remains an extremely powerful way to manage a set of documents that you intend to make available simultaneously in print, PDF, and HTML. But in order to get the most out of Frame, you'll have to invest a few days thinking about styles and what they should mean. Frame is a good enough piece of software that there are actually rewards to taking an intelligent and formal approach to your problem. But if you want to be stupid, you can think of Frame as a version of Microsoft Word with most of the bugs taken out.


philg@mit.edu

Reader's Comments

Word97 save as HTML... not only does it produce bletcherous HTML, but it forgets to put a TITLE in half the time. Kill it. Kill it now!

Maybe Word98 will work as good as that rtftohtml Perl script from 93?

-- Christian Mogensen, April 22, 1998

The company I work for offers a pretty good Word2HTML converter to the public free of charge. It is not perfect, but it is pretty good. If you go to http://www.micromodeling.com/actsol/actsol48.htm you can download it for free. Nothing to fill out, no-one will call you, no hassles whatever. This is just something we use and give away. I hope that this is useful to your readers.

-- Noah Clements, December 8, 1998
For an amusing account of the non-standard HTML code produced by the "Save As...HTML" feature, and a method to correct some of the more egregious mistakes, check out: http://www.fourmilab.ch/webtools/demoronizer

-- Andreas Yankopolus, December 31, 1998
mswordview is a Unix application which converts Word 8 documents to HTML. You can set netscape up to use mswordview to display .doc attachments.

http://www.csn.ul.ie/~caolan/docs/MSWordView.html

-- Andrew Morton, January 4, 1999

Star Office, which is free from Star Division, is an office suite (word processor, etc.). I have been successful using it to read MS Word files and to generate HTML versions of those files. I have not tried this with tables, however.

Found at http://www.stardivision.com, it installs easily on Solaris, Linux, etc. I think Windows NT/95 as well.



-- Patrick Logan, February 14, 1999

I finally gave up fighting with the staff over using Word, but I was able to convince them to save their work for the Web in RTF format (nearly a "standard"). We use RTF2HTML which results in doing a minimum of hand-coding. Although you don't get the source, you do get the scripts (not Perl) which are fully customizable. It doesn't have the overhead of running a CGI script and is available for most common platforms.

As always, YMMV (your mileage may vary).

Craig

-- Craig Burgess, April 29, 1999

The correct URL for the demoroniser is:

http://www.fourmilab.ch/webtools/demoroniser/

-- Frances Prevas, September 23, 1999

Every time I see a virii alert about MSWord macros I just chuckle to myself. Why in the world anyone would allow themself to be abused by that monstrosity (of MSWord) is beyond me.
For the best html-developer/editor ever made for the Windoze environment set your browser to
NoteTabPro where for $19.95 you get an editor that I believe rivals EMACS and has a great feature known as clipbooks which allow for customized development for anything an interactive developer would want (even does BINARIES!)
Available FREE Clipbooks:
HTML, CSS, Java/Script, SGML, and on-and-on...if the clipbook hasn't been developed, the open environment of NoteTabPro allows you to write your own clipbooks.
I have no financial interest in NoteTabPro but simply believe it is the best thing to come along since Gates invented the internet!
Check it out, trash MSWord, regain your HD space, your memory, and your upgrade-itis.

-- Mark Comerford, October 6, 1999

Give Word2000 a try

Sorry for the crass commercialism, but I noticed that you had not updated the information on this page to reflect the latest version of Microsoft Office. Unlike the previous two (I must admit, half-hearted) attempts at saving documents in HTML format, the (teeming millions of) Office developers went all out to make HTML a full-fledged document format and not just a poor cousin to the old binary formats.

If you just want really simple HTML, you might be better off using FrontPage or Visual Notepad, but if you have existing documents in Word, Excel or PowerPoint, I think you'll find surprisingly good fidelity to the look of the original document and a much better use of HTML tags to represent that look.

By the way, the HTML for this comment was created by Word2000.

-- Mike Koss, October 28, 1999
Every time I create an HTML file using the built-in HTML "converters" in MS Office97, I need to open the .htm file (another annoyance courtesy of Microsoft) in NoteTab Light, and strip out all of the extra junk in the file (e.g. STYLE="vnd.ms-excel.numberformat:$#,##0") for every special number/date format produced by Excel or Access; not to mention all of the Font Face tags. By manually stripping the extra labels, I am able to get rid of several thousand bytes of extra ASCII garbage that does not add anything to my HTML. It's an extra step, but since I know that most people who look at my files are using dial-up connections from home, it is the least that I can do.

-- John Fracisco, October 29, 1999
Regarding the "Try Office 2000" comment above: take a look at the HTML source this thing produced.

First, an embedded stylesheet about 30 lines long, completely specifying margins, fonts, etc. for each paragraph class. Each style used several non-standard attributes and values. Then, for each paragraph, it added a <span style=''> wrapper to override the styles in the class. Finally, it added some scripting, apparently for the hell of it.

So the end answer is no, the output of Office 2000 is no better than any previous MS effort, and is probably worse. (The explanation is that they're using HTML as an actual complete file format -- a replacement for .doc. They're using stylesheets as the equivalent of templates. Because there are lots of things in MS formatting that HTML doesn't support, they have to use lots of non-standard extensions. As usual, MS has totally missed the point: HTML is *not* a formatting language, it's a semantic markup language.)

-- Steve Greenland, January 24, 2000

For mass find-and-replace, I enjoy the Allaire products HomeSite and ColdFusion Studio. They let you do find and replace on whole folders. Makes changing MS HTML into legible/legal HTML a little easier. I've noticed MS HTML does sneaky stuff like incorrect nesting that works with IE, but breaks Netscape (subtle browser war tactic???). The Allaire products also have a "code sweeper" function which you can set to do things like "strip the font tag" or "strip ending P tags". You can customize the usage of all tags with code sweeper. Also there is a validate function that will point out the nesting errors. It's handy and so far my favorite editor. It has a WYSIWYG thing too, but it's not that great and I never use it. The 4.0 versions of these programs run great on Windows98/NT but I would be careful with CF Studio 4.5. I had strange memory problems with it.

-- Phillip Harrington, January 25, 2000
I have recently started converting books for the Web, and luckily these books were produced in PageMaker 6.5. I had never used the HTML export from PageMaker, because generally I start a project in Dreamweaver. The HTML export from PageMaker essentially takes a PageMaker style and lets you map that style to H1, P, etc. There are some funky problems with font colors, but a search and replace in Dreamweaver is fairly quick. {shameless plug} you can see how it finally turned out at my Georgia Coast book {/shameless plug} If you have any specific questions about how to use PageMaker you can mail me directly.

-- John Lenz, April 25, 2000
I would just like to add to the Word 2000 "thread". I am maintaining a site where the principal content producer uses Word. I get emailed the docs, then have to convert them to HTML. At first, I was just cutting and pasting into GoLive 5, and manually editing for lists and breaks etc. I was getting tired of this, and thought I would try the save as HTML option. I was completely amazed at the amount of rubbish in the resulting file, xml this, namespace that. I've gone back to cut'n'paste!

-- Mark Horrocks, November 23, 2000

MSWordView is now wvWare.

Having tried Office 2000 and Office 2001 (Mac) conversion to HTML and seen the awful results, the choice of a un*x (including MacOSX) converter is really cool.



-- Bob Kerstetter, January 11, 2001

Hi, regarding Microsoft 2000 and its inability to understand what the words "tidy code" mean. I used MS Word 2000 when our site needed to create an intranet and it was a total nightmare. It kept creating all these files and folders on the server that took up space and loading time. It created a file folder for every single page of HTML, which was just plain ignorant if you ask me. I persisted with this until my manager let me buy FrontPage 2000, which is still a little bit messy but better value. I feel that Word is okay for quick, simple pages that aren't going to need much maintenance. FrontPage is very well integrated with all the other MS packages but does tend to spit out some garbage in the form of FPDB includes etc. and doesn't tend to like files created in other Web design applications, having its own tilted view on the world. If you want a dynamic database-driven website, then FrontPage is great for the novice who hasn't got time to learn ASP, JavaScript etc. It's a good training ground to pick all that up. Thanks

-- Hazera Bibi, January 18, 2001

Try downloading this utility from the microsoft web site. It's saved me hours of reinputting line breaks after I've copied and pasted Word text into Dreamweaver. It's a real life-saver for webmasters.

Yes - it really does work!!

Word 2000 'crud' HTML filter: http://office.microsoft.com/downloads/2000/Msohtmf2.aspx

ruth arnold
www.spacehoppa.com



-- Ruth Arnold, May 30, 2001
I can't believe no one mentioned the fact that Dreamweaver (4.0 at least) has a function that will import Word-generated HTML and clean it up for you. It does a fabulous job and allows you to pick and choose how severe you want the clean-up to be.

-- kim simms, June 21, 2001

Another useful tool is Dave Raggett's Tidy program. It can be found at http://www.w3.org/People/Raggett/tidy/. It will clean up your HTML, and has numerous options so you can customize how it formats (or cleans) the HTML. It's been ported to most OSs, and the source code is available if you want to modify how it works. Since it's a command-line program, it can be hooked into any decent editor -- that is, ones that allow you to run programs and capture the output.



-- David Wall, August 15, 2001
Well, I was searching the web for conversion projects from MS-generated HTML files to pure HTML tagged files... And I landed on this page and found a lot of useful info.

I have developed a Java/JSP/JavaScript/HTML-based web-enabled application that converts .txt files to .htm files. It gives the end user a choice of operations paragraph by paragraph, and the processed paragraphs are then written by the JSP, with tags, to the .htm file. I found the speed of conversion to be about 45 to 50 files an hour! For this you have to save every MS-HTML file as .txt and then give it to my program as input. I recently converted about 1200 content files for a German website.

Anyone interested in offloading projects or in additional info? Please do write to me at emmanuel_chris@rediffmail.com

-- Benjamin Christopher, November 5, 2001

Yes, Dreamweaver has a nice utility for cleaning the HTML code generated by Microsoft Word. It also lets you choose how aggressive the cleaning should be, and it works satisfactorily on Word 97 HTML. Unfortunately, even with the strongest cleaning, it is not able to get rid of the <span style = ...> definition which Word 2000 puts at the beginning of every sentence. If you import a Word 2000 HTML file, you will not be able to change the font of the document except by editing the HTML source, line by line... I'll give the Office 2000 HTML filter 2.0 (by Microsoft) a try, hoping it works!

-- Luca Bonci, February 21, 2002
I have found that eWebEditPro from Ektron (www.ektron.com) does a pretty good job of cleaning up Word 2000. It will produce xhtml output. It's not perfect, though - maybe about the same as Dreamweaver 4 but I haven't tested the difference. I'd like to find something that strips off all the font styles and leaves layout structure in place.

-- Andy Harrison, May 15, 2002
After struggling with the crud you get out of Word, even if you copy and paste into an HTML-friendly editor, I came up with this approach using Ant 1.5's very nice ReplaceRegExp task (sorry about the formatting loss here, but I'm too tired to reformat this nicely for text right now):

<target name="strip-test">
  <replaceregexp flags="g" match="&lt;/FONT&gt;" replace="">
    <fileset dir="${publish.dir}"/>
  </replaceregexp>
  <!-- match only up to the first closing bracket; a greedy (.*) would swallow everything up to the last one on the line -->
  <replaceregexp flags="g" match="&lt;FONT[^&gt;]*&gt;" replace="">
    <fileset dir="${publish.dir}"/>
  </replaceregexp>
  <replaceregexp flags="g" match="&lt;P class=[^&gt;]*&gt;" replace="&lt;P&gt;">
    <fileset dir="${publish.dir}"/>
  </replaceregexp>
</target>

This effectively strips out the offensive font, style and non-standard <o:p> tag. There may be a way to optimize this by combining the replace expressions into a set of nested expressions, and it could easily be extended to strip out other junk. It's nicely speedy and easy to add to an Ant script for processing directories recursively leveraging Ant's <fileset> tag.

Ant is really wonderful if you're not familiar with it. Hope this is helpful to someone out there trying to clean their MS Word junk.

-- Daniel Seltzer, July 23, 2002

Lots of wonderful information here guys, thanks.

Another utility to try is:

http://www.textism.com/resources/cleanwordhtml/

Using the *Office 2000 HTML Filter 2.0* from Microsoft and the page above does a good job of cleaning out Microsoft's, um, inaccuracies.

Now, if I can only teach my users not to use all caps . . .

-- Grey Gremlin, July 25, 2002

I have a vaguely db-backed personal web site on which I had to append some MSWord documents. The easiest solution I found is to open the Word file in OpenOffice Writer, (http://www.openoffice.org) save as HTML then edit the file in Emacs.

OpenOffice does generate a load of crap in the html file, but it's nowhere near as bad as Word 2000. Most of my work is done by a rather ugly Emacs macro (which should really be a Lisp procedure, but that will have to wait until I actually learn Lisp) that uses replace-regexp on a couple of tags, namely:

- delete SPAN ("</?SPAN[^>]*>" -> "")

- remove attributes from p and h tags ("<H1[^>]*>" -> "<h1>")

The macro also adds calls to my header and footer scripts and edits the header to use external CSS stylesheets.

Overall it works rather well and I can get .doc files up really fast. I still have to correct a few things by hand but with something more involved than my macro (by a better programmer) I think that wouldn't even be necessary.
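
For readers without Emacs, the two substitutions listed above can be sketched in Python. This is only an approximation of the macro, which is not shown here: the function name strip_word_cruft is made up, the second pattern is widened from H1 to P and H1 through H6 on the assumption that this is what "p and h tags" means, and attribute values containing a literal ">" would confuse it.

```python
import re

# Opening or closing SPAN tags, with or without attributes.
SPAN_TAG = re.compile(r"</?span[^>]*>", re.IGNORECASE)

# Opening P or H1..H6 tags; \b keeps PRE, PARAM and friends from matching.
ATTR_TAG = re.compile(r"<(p|h[1-6])\b[^>]*>", re.IGNORECASE)


def strip_word_cruft(html: str) -> str:
    """Delete SPAN tags and drop every attribute from P and heading tags."""
    html = SPAN_TAG.sub("", html)
    # Rewrite each matched opening tag as the bare lowercase tag name.
    return ATTR_TAG.sub(lambda m: "<" + m.group(1).lower() + ">", html)
```

Closing tags are left as they were, so the output mixes lowercase opening tags with whatever case the exporter used for the closing ones.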

By the way OpenOffice (free version of StarOffice 6) is quite good. It does mostly everything MsOffice does and it's free. And the equation editor is *way* better.

-- Serge Boucher, December 30, 2002

I am a farmer, not a computer expert, but here we are in 2003 and it seems even farms need web pages. I can accept that. I traded some beets for some web development work, and even got a crash course on using Dreamweaver on a Mac. Wow, could it be, a computer that actually works! Looks complicated though. Next I needed to change something and thought it would be a simple matter to open an HTML document in MS Word (the latest and greatest, in a university PC lab), make my changes and save as HTML. Word changed everything around so the images wouldn't display correctly on a Mac, and now I am wasting my time wading through code I don't understand. Sorry Mr. Gates, it looks like you still don't get it.

-- mac burgess, February 6, 2003
I'm sorry if someone has already said this, but there were a lot of responses and I may have missed it. UK legislation to be introduced in about a year's time will be very harsh on company websites that do not offer adequate accessibility options, i.e. allowing users to change font or font size and background colour, to help people with learning or reading difficulties. When Word creates HTML it seems to put in so many tags that a lot of these facilities will not work. This is perhaps a consideration if you are using Word for a vaguely commercial site, as the UK government has claimed it will aggressively enforce this law.

-- Nathan Mcilree, July 1, 2003
We have found that the HTML exported from a MS-Word document by OpenOffice 1.1 is much cleaner than the Microsoft version. In particular it uses relative font sizes rather than the idiotic point-sized fonts, so the user's screen preferences are honoured. The output file is also significantly smaller than the equivalent Word export (or indeed the original Word file).

-- Andrew Macpherson, March 18, 2004
Also there is a useful plain-vanilla utility called antiword (which does a nice job of just grabbing the text), useful for creating indexes and the like (I personally use it for Plone, an open source CMS)

The other utility around is wvWare, but this has a lot of cascaded dependencies so is difficult to compile (build from scratch, as there is no off-the-shelf version) on some systems.

Good luck, and I wish you well extracting your intellectual property from M$ proprietary format !

-- stu hannay, August 9, 2004

Hi All, If you are in education and wish to convert your MS Word material into "clean" "section 508" compliant html, then look no further than a rather useful plugin for Word. "CourseGenie" from http://www.coursegenie.com is an excellent tool and is growing in its use by academics who do not want to be webmasters. A 1 minute demo of CourseGenie is at: http://www.coursegenie.com/demos. It's been tipped as the next generation tool. All the Best, Michael Bailey.

-- Mike Bailey, October 12, 2004
There is a CMS called PHPWebSite which is open source. In a settings file, you can specify what html tags to strip from input (for example, when pasting from word into a textarea for creating content). I disallow the (P) tag.

I have customised this functionality to do the following: before stripping html tags, replace the (/P) and (/p) tags with (BR /).

Then there is a posting on www.php.net for the strip_tags function, in which a comment talks about a function which can strip attributes from specific tags. (The function allows you to specify an array of attributes not to strip - but that part doesn't seem to work - it will, however, strip all attributes).

Before stripping the tags, I then strip all the attributes for all the allowed tags except for anchor tags (A) and any tags to do with tables (table), (tr), (td), (th), etc.

The result is that only the tags I allow are kept, all paragraphs are converted to (BR) tags (and because Word seems to insert an empty paragraph between each 'real' paragraph, this works out ok), formatting for tables is retained, things like bold and italic text are retained, but all the garbage is thrown out.

Actually, on top of that we use a wysiwyg editor in the CMS, so if there is anything else that needs fixing (eg centering text) - it's pretty quick.

The flipside is that you don't have (P) tags.
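
The pipeline described above (turn closing P tags into BR, keep only a whitelist of tags, strip attributes everywhere except on anchors and table tags) can be sketched in Python. This is not the PHP code used in the CMS, which is not shown here. It swaps strip_tags for the standard library's HTML parser, and the whitelist contents and the names WordPasteCleaner and clean_paste are assumptions chosen for illustration.

```python
from html import escape
from html.parser import HTMLParser

# Assumed whitelist; the real settings file may allow a different set.
KEEP_TAGS = {"a", "b", "i", "br", "table", "tr", "td", "th"}
# Tags that keep their attributes (anchors and table markup).
KEEP_ATTRS_ON = {"a", "table", "tr", "td", "th"}
# Elements whose text content should never reach the output.
SKIP_CONTENT = {"style", "script"}


class WordPasteCleaner(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skipping = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self.skipping += 1
        if tag not in KEEP_TAGS:
            return  # drop the tag itself, keep any text inside it
        kept = attrs if tag in KEEP_ATTRS_ON else []
        rendered = "".join(
            f' {name}="{escape(value or "")}"' for name, value in kept
        )
        self.out.append(f"<{tag}{rendered}>")

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT and self.skipping:
            self.skipping -= 1
        if tag == "p":
            self.out.append("<br />")  # paragraph end becomes a line break
        elif tag in KEEP_TAGS and tag != "br":
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skipping:
            self.out.append(escape(data, quote=False))


def clean_paste(html: str) -> str:
    """Return pasted Word HTML reduced to the whitelisted tags."""
    parser = WordPasteCleaner()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Using a parser rather than regular expressions means namespaced Word tags such as o:p fall out of the whitelist check with no special handling.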

If anyone wants info on the exact code we're using, youcantryreachingme AT NOSPAM hotmail DOT com .

-- Chris Notdisclosing, April 22, 2005

I don't want to turn this site into a single-topic forum, but the trial Malkue product doesn't seem to work very well in IE. It passes W3C but there's an unmatched <title /> tag in the code which seems to cause IE to go blank. When removed, the translated text appears in a single unbroken line...

All day spent trying to find a Word to HTML converter which both passes W3C and is also legible *sigh* (tried YAWC too, it crashes my Word97)

-- Gaylord Lussac, March 26, 2006

For a few years now I've been developing http://Docvert.org which lets you convert Microsoft Word into standards compliant (x)HTML. It lets you control every tag and attribute, or to just use several of the inbuilt templates for conversion to clean HTML. It's free and open source software, and you can install it for shared use by a whole office of people. It's even got a Microsoft Office toolbar that'll let you convert documents.

-- Matthew Holloway, May 2, 2008
  1. Use Gmail to Convert Word Docs to HTML

    If you have a MS Word doc that you want to convert to HTML, the last thing you'd ever use is the "Save as Web Page..." command in Word. Instead, you can send the attachment to your Gmail account and use the "View as HTML" link. Once the page is displayed in your browser, go to "View Source" and copy the code. Most of it is very clean and quite useable.

    FROM: oreillynet.com 2006: use_gmail_to_convert_word_docs

    I tried GMail on one document, it said: The attachment cannot be viewed as HTML. Download the attachment to view it in its original format. :)

  2. I also tried
    HTMLTIDY -raw -clean -omit infile.html > outfile.html
    It decreased the file size from 690 kb to 580 kb, but did not clean useless tags.

  3. I have not tried them, but a friend suggested three doc-to-HTML services that do not convert images:

    http://gdsland.com/excel2html/home.php -- excel to html
    http://www.zamzar.com/
    http://media-convert.com/konvertieren/


-- Evgenii Philippov, October 23, 2008