Preparing documents for the web
This note explains how to get existing documents ready for the web or for loading into a database. An existing document that is already in an 'application-specific format', such as a Word document, spreadsheet, or PowerPoint presentation, can be attached to a content page in our CMS (Content Management System) without any modification, but then a user has to download the document and have the application (e.g. Word, Excel, PowerPoint) on their system in order to read it. In some cases, this is probably OK. For example, if it's a large Excel spreadsheet, there is probably no better way than to just attach the document and hope the recipient has Excel.
However, for most content it's better to have the content immediately viewable on the web, without requiring a download or an application on the viewer's PC. In this case, our existing documents must be converted to plain text, because the formatting embedded in the application-specific document will not be readable on the web. There are three distinct cases to deal with: unformatted text, formatted text, and hard copy.
Unformatted text is the easiest. By 'unformatted' I mean that there are no headers, bold, italics, underlines, footnotes, etc. To put this text on the web, just click 'create content' in our CMS, and cut and paste the text from the application into the content page. Click 'submit' and it's done.
Formatted text is slightly more difficult, because the formatting cannot be carried over from the original to the web copy - the system of formatting is completely different. You can cut the content from the application (e.g. Word) and paste it into a content box in the CMS, but all the formatting will be lost. After the text is in the CMS, it can be marked up to add back the formatting (the subject of another note). We will be adding a rich text function that will make this process easier, but it will still involve tediously adding back the formatting.
Hard-copy (a printed page) is the most difficult, since it must first be scanned and OCR software (Optical Character Recognition) used to translate the scan into text. This process is far from perfect, and there will be a large amount of editing to clean it up. The best way to perform the OCR is to just scan and translate to plain-text; don't try to have the OCR discern the formatting as well. Once the plain-text is clean, then copy the text into the CMS and format it there.
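Some of the routine clean-up of OCR output can be scripted before the hand editing starts. The following is a minimal sketch in Python, not part of our CMS or of any OCR package: the function name clean_ocr_text and the three rules it applies (normalise line endings, rejoin words hyphenated across a line break, collapse stray whitespace) are assumptions chosen for illustration.

```python
import re


def clean_ocr_text(raw: str) -> str:
    """Tidy a few common OCR artifacts in plain text (illustrative rules only)."""
    # Normalise Windows and old Mac line endings to "\n".
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Rejoin words split by a hyphen at the end of a line: "docu-\nment" -> "document".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Treat one or more blank lines as a paragraph break.
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = []
    for para in paragraphs:
        # Join the lines of a paragraph and collapse runs of spaces and tabs.
        joined = " ".join(line.strip() for line in para.split("\n"))
        joined = re.sub(r"[ \t]+", " ", joined).strip()
        if joined:
            cleaned.append(joined)
    # Separate paragraphs with a single blank line.
    return "\n\n".join(cleaned)
```

Note that the hyphen rule also joins genuine compounds that happen to break at a line end (for example 'plain-' followed by 'text'), so the result still needs proofreading against the printed page.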
Having documents in the CMS has huge advantages in terms of wide viewership around the world across many types of platforms (e.g. PC, Mac, palmtops, etc.) and all the other advantages of the CMS (searching, comments, tagging, etc.), so the best strategy is to stay with plain-text authoring and then mark up the text once it's in the CMS. In general, try to avoid formatting any documents in a word processor, as this will make the documents more difficult to move to the web.
Comments
Oral history transcription
Joan and I spoke about this today - regarding transcriptions - those that are electronic and those that are only paper (hard) copy. Regarding paper-copy-only transcriptions - if they are scanned (rather than re-typed into the content) - is there a "save as" option that allows one to name the document as well as set the format [plain text]?
Additionally, we spoke about opening up the transcription floppy disks, editing the electronic version, and saving the edits back on the disk to be uploaded to the web site.
I understand that if we simply uploaded them to the website, then they could be edited on the website from anywhere - but there is an apprehension about processes - so we will begin with a small number of electronic transcriptions, experiment and go from there.
Saving docs as plain text
Most word processors allow you to 'save as' and choose a type. For plain text, you'd use a file type of 'Text' in OpenOffice, and 'Plain Text' in Microsoft Word. But if you're putting the content in our CMS, it's easiest just to cut it from the word processor and paste it into the body box in the CMS.
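When a whole folder of documents needs the 'save as text' treatment, the step can be scripted rather than done one file at a time. The sketch below is only one possible approach and rests on assumptions: that LibreOffice is installed and its soffice program is on the PATH (older OpenOffice installs may not accept the same options), and that the folder and function names, which are made up for this example, suit your setup.

```python
import subprocess
from pathlib import Path


def build_convert_command(doc: Path, out_dir: Path) -> list:
    """Build the command line that asks LibreOffice to write a .txt copy of doc."""
    # Assumption: 'soffice' is on the PATH; 'txt:Text' selects the plain-text export filter.
    return [
        "soffice", "--headless",
        "--convert-to", "txt:Text",
        "--outdir", str(out_dir),
        str(doc),
    ]


def convert_folder(src_dir: Path, out_dir: Path,
                   suffixes=(".odt", ".doc", ".docx")) -> list:
    """Convert every word-processor file in src_dir; return the expected .txt paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    converted = []
    for doc in sorted(src_dir.iterdir()):
        if doc.is_file() and doc.suffix.lower() in suffixes:
            # check=True raises an error if the conversion command fails.
            subprocess.run(build_convert_command(doc, out_dir), check=True)
            converted.append(out_dir / (doc.stem + ".txt"))
    return converted
```

For example, convert_folder(Path("transcripts"), Path("transcripts_txt")) would leave the originals untouched and write the text copies to the second folder (both folder names are hypothetical). All formatting is dropped in the conversion, just as it is with cut and paste.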
As for putting content on the CMS that isn't ready, there are (at least) two approaches that help keep it internal until it is ready. First, don't check the 'publish' box in the publishing options at the bottom of the create content page. Then, that page is not visible to anyone who doesn't have permissions for 'content administration'.
Second, we could install a workflow module which explicitly moves a piece of content through stages for approval, notice, etc. This is more rigid, but might be just the ticket as the system grows.
About those floppies... I would write-protect those floppies before you stick them in a PC, and copy them straight away onto a hard disk into an archival storage folder and burn a CD with those files to preserve them. Then, copy the files that you want to use into another folder and work from those. Floppies are really vulnerable to corruption, and writing stuff back to them runs the risk of horking (a technical term) the entire floppy.
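The copy-to-archive-then-work-from-a-copy routine above can be sketched as a short Python script. This is an illustration under assumptions, not a tested archival tool: the function names and the three folder arguments are hypothetical, the checksum comparison is an extra safeguard added here, and burning the CD is still a separate manual step.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file, read in small chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_then_stage(source_dir: Path, archive_dir: Path, working_dir: Path) -> list:
    """Copy files from source_dir to archive_dir, verify them, then make working copies."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    working_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(source_dir.iterdir()):
        if not src.is_file():
            continue  # this sketch skips sub-folders
        archived = archive_dir / src.name
        shutil.copy2(src, archived)  # the source is only read, never written to
        if sha256_of(src) != sha256_of(archived):
            raise IOError(f"Archive copy of {src.name} does not match the original")
        # The working copy is made from the archive copy, so the floppy is read only once more.
        shutil.copy2(archived, working_dir / src.name)
        copied.append(src.name)
    return copied
```

All editing then happens in the working folder; the archive folder and the floppy are never written to by this script.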
Plain text
While in a hurry at work, I screwed this up again - sent an .odt which no one could open at home - I do need to paste a reminder on the computer about saving as a plain text doc - can this be done automatically?