April 25, 2007
Converting Word documents to HTML

Eeek! Who needs all this extra code?
While I write my own copy for this blog, and some of my other sites, much of what I post on the Web is written by others. This material comes to me in a variety of formats, from Open Office to .pdf files, but most of it is in Microsoft Word, and all of it needs to be converted to HTML. There are a variety of ways to do this, but I'll just review three—two common approaches and my preferred method.
Three common approaches to converting Word documents
- Open the file in Word and save as HTML. This is not recommended. When you do this, Microsoft Word adds all sorts of extra coding, much of which is not what you originally intended. I tried that with this entry just to see how weird it would be and it turned out to be 450 lines long (as opposed to 82 with my HTML).
- Copy and paste to Dreamweaver. Reliability varies. Copy the text from your Word file, open an existing HTML file in Dreamweaver, save it with a new name, select the text you wish to replace, switch to Design view, and paste in your content. Switch to Code view to see how it worked. If your Word document was perfectly formatted, this may turn out fine. If the author did a lot of editing, you may find mysterious characters, extra spaces, or the wrong codes (for items such as headers). If there are only a few, you can delete them. If there are a lot, you may want to start over using the next method.
- Use Word's Find and Replace feature to substitute HTML for Word formatting. This is what I usually do. The following instructions will show you how.
Clean up any odd or special characters
- Open your Word file
- Find and replace & with &amp;
- Look for any other special characters such as trademarks, umlauts, em dashes or percent signs and replace with HTML character or text as appropriate. Charts to look up characters are available at http://www.webstandards.org/learn/reference/charts/entities/.
- Find and replace ’ with ' and ” with " to remove curly apostrophes and curly quotes (if appropriate). Curly apostrophes and quotes are typographically correct and can be replaced by special characters, but straight quotes work more consistently in some situations, such as HTML e-mail.
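If you would rather script this cleanup than run each Find and Replace by hand, the same substitutions can be sketched in Python. This is only an illustration, not part of the Word workflow above: the function names are made up for this example, and it assumes the document text has already been copied out of Word as plain text. Note that `html.escape` also converts < and > to entities, which goes slightly beyond the ampersand step in the list.

```python
import html

# Curly punctuation mapped to straight equivalents (the "if appropriate" option).
STRAIGHT_QUOTES = {
    "\u2018": "'",  # left single quote
    "\u2019": "'",  # right single quote / apostrophe
    "\u201c": '"',  # left double quote
    "\u201d": '"',  # right double quote
}


def clean_special_characters(text, straighten_quotes=True):
    """Escape &, < and >, then optionally straighten curly quotes.

    Hypothetical helper for illustration; run it before any HTML tags
    are added, otherwise the tags themselves would be escaped too.
    """
    # quote=False leaves ' and " untouched; only &, < and > become entities.
    escaped = html.escape(text, quote=False)
    if straighten_quotes:
        escaped = escaped.translate(str.maketrans(STRAIGHT_QUOTES))
    return escaped


def to_numeric_entities(text):
    """Replace every non-ASCII character (em dashes, umlauts, trademark
    signs) with a numeric character reference such as &#8212;."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    sample = "Tom & Jerry\u2019s \u201cbig\u201d day \u2014 100%"
    print(to_numeric_entities(clean_special_characters(sample)))
```

Numeric references are used here instead of named entities simply because Python can produce them without a lookup table; the entity charts linked above remain the place to find the named forms.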
Add coding for bold and italic
Put <strong> immediately before each bold entity and </strong> after and <em> before each italic entity and </em> after. I usually color these red so that I can easily see if I've closed any tags that I opened.
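Coloring the tags red is a visual check. As a complement, a short script can flag tags that were opened but never closed. The sketch below uses Python's standard `html.parser` module; the class and function names are invented for this example, and the list of void tags is deliberately short, so treat it as a rough aid and not a replacement for a real validator.

```python
from html.parser import HTMLParser

# Tags that never take a closing tag (a short, non-exhaustive list).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class TagBalanceChecker(HTMLParser):
    """Tracks open tags on a stack and records mismatches."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing forms such as <br /> open and close in one step.
        pass

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if tag not in self.open_tags:
            self.problems.append(f"unexpected </{tag}>")
            return
        # Anything opened after the matching tag was never closed.
        while self.open_tags:
            opened = self.open_tags.pop()
            if opened == tag:
                break
            self.problems.append(f"unclosed <{opened}>")


def find_unclosed_tags(markup):
    """Return a list of human-readable tag problems (empty if balanced)."""
    checker = TagBalanceChecker()
    checker.feed(markup)
    checker.close()
    return checker.problems + [f"unclosed <{t}>" for t in checker.open_tags]


if __name__ == "__main__":
    print(find_unclosed_tags("<p>This is <strong>bold and <em>italic</em>.</p>"))
```

An empty list means every `<strong>` and `<em>` you opened has a matching close.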
Add HTML paragraph formatting
- Find and replace paragraph marks (^p) with </p>^p<p>.
- Move the extra <p> from the end of the last paragraph to the beginning of the first paragraph.
- If necessary replace </p> <p> with blank space.
- Replace manual line breaks (^l) with <br />^l.
- Manually change p> to h3>, h5> or the appropriate code for heads and subheads.
- Replace p> with li> for any bulleted text. Add <ul> before and </ul> after the bulleted sections.
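The paragraph steps above can also be sketched as a small script, for readers who prefer to do this outside Word. The function names are hypothetical, and the sketch assumes the text arrives with one paragraph per line (Word's ^p) and with manual line breaks (^l) preserved as vertical-tab characters, which may not hold for every way of exporting text from Word.

```python
def paragraphs_to_html(text):
    """Wrap each paragraph in <p>...</p> and turn manual breaks into <br />.

    Assumptions (not from the original workflow): paragraphs are separated
    by "\n", and manual line breaks are vertical tabs ("\v").
    """
    blocks = []
    for paragraph in text.split("\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue  # skip empty paragraphs instead of emitting <p></p>
        paragraph = paragraph.replace("\v", "<br />\n")
        blocks.append(f"<p>{paragraph}</p>")
    return "\n".join(blocks)


def bullets_to_html(items):
    """Wrap bulleted text in <li> tags inside a single <ul>."""
    lines = ["<ul>"] + [f"<li>{item}</li>" for item in items] + ["</ul>"]
    return "\n".join(lines)


if __name__ == "__main__":
    print(paragraphs_to_html("First paragraph.\n\nSecond\vwith a break."))
    print(bullets_to_html(["Open your Word file", "Save the result"]))
```

Because each paragraph is wrapped individually, there is no stray `<p>` left at the end to move to the top. Heads and subheads still need to be retagged by hand, since plain text carries no hint about which lines are headings.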
Save file then open an existing HTML file (from your site) in Dreamweaver
- In Code view, save the Dreamweaver file with a new name (thus creating a new file).
- Copy and paste the coded text from your Word file to replace the main text in your HTML file.
Add links
- In Dreamweaver, select the text you would like to link, copy the url to which it will link, then paste this into the link box in the properties panel. In the case of e-mail links you need to add mailto: to the beginning of the address (instead of http://).
- When a sentence ends with a link, check to make sure that it is followed by a period. The period should come immediately after the </a> without any space preceding it.
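If you are hand-coding links instead of using the properties panel, the two rules above (mailto: for e-mail addresses, and the period directly after the closing tag) look like this in a short Python sketch. The `make_link` helper and the example address are invented for illustration, and the test for "is this an e-mail address" is a simple assumption: the target contains @ and has no scheme.

```python
import html


def make_link(text, target):
    """Build an <a> element, adding mailto: for bare e-mail addresses.

    Assumption for this sketch: a target containing "@" with no "://"
    is an e-mail address. Real addresses and URLs can be messier.
    """
    is_email = "@" in target and "://" not in target
    if is_email and not target.startswith("mailto:"):
        target = "mailto:" + target
    # Escape the href so characters such as & stay valid inside the attribute.
    return f'<a href="{html.escape(target, quote=True)}">{text}</a>'


if __name__ == "__main__":
    # The period follows </a> immediately, with no space before it.
    print("Questions? " + make_link("Write to us", "webdev@example.edu") + ".")
    print("Visit the " + make_link("Case home page", "http://www.case.edu/") + ".")
```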
Now give your code a quick review; if it looks clean, post it to the Web. View the page in your Web browser, then validate it using the W3C Markup Validation Service to find any errors you may have missed. If everything checks out, you're done!
Posted by: Heidi Cool April 25, 2007 04:52 PM | Category: Content , Heidi's Entries , How-to , Tips and Tricks
Comments
Great information. I had one more suggestion that seems to work pretty well for me (and it is fast). I upload the Word document, or email it to myself, to my Google Docs area. From there, I can view the document in HTML. View source and you have a pretty clean version of HTML to copy and paste.
Bill
Charleston web site design
I just drop my Word Doc in Notepad, kill the formatting, then I drop it into Dreamweaver and go from there.
George,
dropping a Word Doc into Notepad doesn't get rid of the curly quotes.
Good Morning,
When I try to copy and paste HTML codes into a Web document the code will not convert to a button etc. I also get the following error message:
FTP Folder Error
An error occurred copying a file to the FTP server. Make sure you have permission to put files on this server.
Details
The process cannot access the file because it is being used by another process.
What can I do to remedy this problem?
Peace,
Carl, www.psychezpublishing.com
Great recommendations. I use a similar method to George with great success.