Preparing documents for the web

This note explains how to get existing documents ready for the web or for loading into a database. A document that is already in an 'application-specific format' (a Word document, an Excel spreadsheet, a PowerPoint presentation, etc.) can be attached to a content page in our CMS (Content Management System) without any modification, but then a user has to download the document and have the application (e.g. Word, Excel, PowerPoint) on their system in order to read it. In some cases this is probably OK. For example, for a large Excel spreadsheet there is probably no better way than to attach the document and hope the recipient has Excel.

However, for most content it's better to have the content immediately viewable on the web, without requiring a download or an application on the viewer's PC. In this case, the existing documents must be converted to plain text, because the formatting embedded in an application-specific document will not be readable on the web. There are three distinct cases to deal with: unformatted text, formatted text, and hard copy.

Unformatted text is the easiest. By 'unformatted' I mean that there are no headers, bold, italics, underlines, footnotes, etc. To put this text on the web, click 'create content' in our CMS, then copy the text from the application and paste it into the content page. Click 'submit' and it's done.

Formatted text is slightly more difficult, because the formatting cannot be carried over from the original to the web copy: the two systems of formatting are completely different. You can copy the content from the application (e.g. Word) and paste it into a content box in the CMS, but all the formatting will be lost. Once the text is in the CMS, it can be marked up to restore the formatting (the subject of another note). We will be adding a rich-text function that will make this process easier, but it will still involve tediously adding back the formatting.
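Even after the formatting is gone, text pasted from a word processor often still carries typographic characters (curly quotes, long dashes, non-breaking spaces) that are not strictly plain text. The following is a minimal Python sketch of one way to normalize them before pasting. The function name `to_plain_text` and the character table are my own illustration, not part of our CMS, and the table assumes you want plain ASCII punctuation; adjust it if the CMS handles these characters fine.

```python
# Hypothetical helper: replace common word-processor punctuation
# with plain ASCII equivalents. The table is illustrative, not complete.
WORD_CHARS = {
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote / apostrophe
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # ellipsis
    "\u00a0": " ",    # non-breaking space
}

# Build the translation table once; str.translate applies it in one pass.
_TABLE = str.maketrans(WORD_CHARS)


def to_plain_text(text: str) -> str:
    """Return text with typographic punctuation replaced by ASCII."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    sample = "\u201cHello\u201d \u2013 it\u2019s\u00a0fine\u2026"
    print(to_plain_text(sample))
```

Anything not listed in the table is passed through unchanged, so accented letters and other legitimate non-ASCII text are left alone.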

Hard copy (a printed page) is the most difficult, since it must first be scanned and then run through OCR (Optical Character Recognition) software to translate the scan into text. This process is far from perfect, and the result will need a large amount of editing to clean it up. The best way to perform the OCR is to scan and translate to plain text only; don't try to have the OCR discern the formatting as well. Once the plain text is clean, copy it into the CMS and format it there.
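Some of the OCR clean-up is mechanical and can be scripted before the manual proofreading starts. Here is a rough Python sketch, assuming the OCR output is a plain-text string with hard line breaks at the end of every scanned line. The function name `clean_ocr_text` is hypothetical, and it only handles line breaks, end-of-line hyphenation, and stray whitespace; misread characters still have to be fixed by hand.

```python
import re


def clean_ocr_text(raw: str) -> str:
    """Tidy line breaks and spacing in OCR output (hypothetical helper)."""
    # Normalize Windows and old Mac line endings to "\n".
    text = raw.replace("\r\n", "\n").replace("\r", "\n")

    # Rejoin words split by a hyphen at the end of a scanned line,
    # e.g. "docu-\nment" becomes "document".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)

    # Treat one or more blank lines as a paragraph boundary.
    paragraphs = re.split(r"\n\s*\n", text)

    cleaned = []
    for paragraph in paragraphs:
        # Join the remaining hard-wrapped lines and collapse runs of spaces.
        paragraph = re.sub(r"\s+", " ", paragraph).strip()
        if paragraph:
            cleaned.append(paragraph)

    # One blank line between paragraphs, ready to paste into the CMS.
    return "\n\n".join(cleaned)


if __name__ == "__main__":
    scanned = "This is a docu-\nment that was\nscanned.\n\n\nSecond   para-\ngraph."
    print(clean_ocr_text(scanned))
```

One caveat: the hyphen rule also joins genuinely hyphenated words that happen to break at the end of a line ("well-" / "known" becomes "wellknown"), so the output still needs a read-through.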

Having documents in the CMS has huge advantages: they can be viewed around the world on many types of platforms (e.g. PC, Mac, palmtops, etc.), and they get all the other benefits of the CMS (searching, comments, tagging, etc.). The best strategy is therefore to author in plain text and mark up the text once it's in the CMS. In general, try to avoid formatting documents in a word processor, as this makes them more difficult to move to the web.