This is the homepage of the LHC Repository System. From here, you can access information about the various collections held by LHC.
Accessions - These are the umbrella records of donations (and occasionally purchases) held by the LHC. The specific objects in the collection are given object numbers, and the first 7 digits of those object numbers are the accession number. The accession records don't have any information about the objects themselves.
These object records are found in the 5 principle collections:
Archives - These are collections of papers for an individual or institution.
Audio - This is primarly the oral history audio collection, although it also includes other audio material such as speeches, music, etc.
Library - This is the collection of books, pamphlets, brochures, magazines, etc.
Artifacts - These are the records describing the hundreds of artifacts that LHC has in it's collection, such as bottles, buttons, pins, shovels, knobs, etc.
Images - The largest collection at LHC, the images repository consists of photographs, maps, drawings, posters, etc.
To search the entire repository database, use this search window:
Respository Search
You can enter a full ObjectID (oid) or any part of one. For example, you could enter 2008 to just see the objects accessioned in 2008, or 2008.006 to find all objects included in the accession 2008.006.
The following sections of this online book give details about the LHC Repository and its development:
Now that the conversion of the PastPerfect repository into Drupal is complete, we will develop the supporting structures and utilities to attach the ancillary files used by the repository. We'll use the term 'blob' as a generic description of these attachments. (If you're really interested in that name, you can find out more here: http://www.cvalde.net/misc/blob_true_history.htm) In our case, blobs are the files of images, oral histories, finding aids, etc. which are attached to the repository records as ancillary files and then displayed in appropriate ways such as an image browser, audio player, etc.
The basic steps to accommodating the blobs is as follows:
This development effort will be ongoing through the summer of 2011.
The following issues are on the short list of things to do for the repository system:
This document describes the conversion of the LHC's current repository, which uses PastPerfect (PP) on a single PC, into a web-based repository built on Drupal, which is the content Management System (CMS) upon which the LHC website and CRM system is built.
Note: As of February 2011, this phase of the repository development is complete. These pages are left for archival purposes only.
This section is a workspace for the collaboration and coordination of the conversion project, which is a major IT effort that will dramatically impact the core business processes at LHC. Every part of the archival and collections workflow will be affected, so it's important to have a focal point for tracking the design and changes to the system. Each step will be documented here so that everyone involved knows where to go to find out what's happening and why, and can help with their input.
The overview of the project can be found in the LHC Technology Plan, which has a page about each of the 5 main areas of the repository (images, archives, audio, books, and objects).
This document is much more detailed than the LHC Technology Plan, and much more frequently changing. It's a working document and will evolve continuously as the project evolves over the next couple years.
Here is the current migration state of each of the databases: (Note: these database listings, and other spreadsheets have been removed from these 'Development Project' pages since the content is now available in Drupal. The links are left just for reference, but don't work.)
Archives DB
Audio DB
Books DB
Images DB
Objects DB
The following sections describe the migration project in detail:
Here's an outline of the steps to build the new Drupal repository and migrate the pp (PastPerfect) data to it:
1. Import pp data into spreadsheets and do a 'phase I' review and manipulation of data. The import has been done (for all but the 'Objects' data set), and is described below. The phase I review is where we decide what fields to keep, which to merge, and do some wholesale changing and cleaning up of the data. More on this later...
2. Load the data into mysql tables for phase II review and processing. Once the data is in a proper databases (mysql) it becomes possible to work with it in a much more flexible way, and to put it on the web so that we can easily access subsets of the records and fields from web pages. This will enable other staff at LHC to access the data while we're working on the migration project. (They can also use pp for viewing, just not changing) Also, we can easily concatenate fields, make substitutions, split data into tags, etc.
3. Create new data structures in Drupal. New content types need to be created for each category of data (accessions, images, books, archives, oral-histories, and objects). The fields for each have to be precisely defined. This first involves a detail mapping of the fields which we create from step 1. Also, we develop the taxonomy system and menu structure that will let us get to the records.
4. Import data into Drupal. Once the new content types are created, we import the data from each of the mysql tables into Drupal. At this point, the data is now part of the Drupal system and can be accessed, edited, listed, searched, etc.
5. Create the file system and linkages to connect the Drupal records to the underlying file objects (scans, oral-history files, archive finding aid pdfs, etc.)
6. Create views, printouts, access control, etc. This will be an ongoing process as we refine the system.
7. Develop workflow processes and train the staff.
It's very important that we don't go backwards in this progression, as each step is very time-consuming. Having moved the data out of pp and into spreadsheets, we can no longer import changes to the pp system. Once we import the spreadsheet data into mysql we can't make changes to the spreadsheets.
I've imported the pp data to spreadsheets in order to do the 'phase I' review of the data. Below are links to the spreadsheets. Download these into a new folder on your machine so you have a local copy to play with.
The accessions file lists all of the accessions. This data will be split into two spreadsheets: contact info, which will go into CiviCRM, and accession records, which will go into Drupal as a new content type "Accessions". We need to decide what goes where.
The archives file contains everything from the 'Archives' set in pp, except the oral histories, which have been moved into their own file. There will be 2 new content types in Drupal, one for archives and one for oral histories. If we are likely to collect other audio besides oral histories (eg. speeches, recordings of events, factory sounds, etc.) then maybe we should broaden this new content type to 'audio'.
The images file contains everything from the 'Photos' set in pp. I've changed the name to images to reflect the fact that images other than photos belong in this new content type, like maps, drawings, etc.
Last, the books file contains everything from the 'Library' set in pp. I've changed the name to books because I thought 'library' was too general. In a sense, everything is part of the library. If you think 'books' is too specific (because it includes pamphlets) then perhaps we should change it back to library, or something else.
Here are the spreadsheets, they're all version 1 (v1):
http://lawrencehistory.org/files/misc/db/accessions-v1.xls
http://lawrencehistory.org/files/misc/db/archives-v1.xls
http://lawrencehistory.org/files/misc/db/oral-histories-v1.xls
http://lawrencehistory.org/files/misc/db/images-v1.xls
http://lawrencehistory.org/files/misc/db/books-v1.xls
Start by just looking at the data - don't change anything. I've done some initial processing by deleting many fields (columns) which were not being used. Also, if every record (row) contained the same data for a field I deleted that field (eg. all records in the oral history collection had the same name for 'collection' so I deleted the field. We can still add it back to the final database, but it's simpler to work with the databases if we remove fields that don't contain distinguishing data.). If a field had entries for only a few records I moved that data into another field and deleted the field (column).
The spreadsheet representation of the data has the advantage of being able to see all of the data and navigating easily around it, scrolling up and down to see how a particular field is being used, and left and right to see how a particular record uses fields.
Next, we'll move the data into mysql and make it available on the web. We'll also nail down the final field mapping and I'll create the new Drupal content types. (Steps 2 & 3). This schedule could be thrown if we hit snags, so set expectations that there may well be delays.
One of the most critical parts of the review at this stage is to determine what fields we're going to use, and how we're going to blend fields.
This is a BIG project, but I think we'll have a killer system when it's done. The entire repository will be accessible from anywhere in the world, and we'll have the ability to allow Lawrencians to explore their history from their web browser. It should be a relatively easy next step to add their photos, artifacts and oral histories to our repository.
Archives-v6.txt is the complete dump of the archive database imported to mysql. (from archives-v5.xls). Once it's fully loaded into the browser you can use the browser's find facility to find records. Check the overall content and use of fields to make sure they look right.
Fielduse-archives.txt is a listing of the use of the other fields which have repeating data values, along with the number of occurrences of that value in the field. We need to look at this with an eye toward reducing the number of fields, concatenating them into other fields (like description) where possible.
Classfields-archives.txt is a listing of the use of the various fields that are used for classification. With the images db we have the benefit of a primary taxonomy field (otherno), but we have no such field with the archives db so we need to 'grab' something from one of these fields to use for that taxonomy... or create another field for that purpose. This is the biggest issue with the archives db and will take the most time to sort out.
Audio-v7.txt is the complete dump of the oral-histories-v5.xls spreadsheet. (1.4M). This is a much simpler db, so it should be quick to check for completeness, use of fields, etc. Note that, due to Amita's machine-like precision in recording the OH status in the notes (F, L, C, R, T) I was easily able to extract this and put the data into new fields of the corresponding name so that we can operate on that field (filter, sort). In most cases, I've been able to remove traces of the indication in the note field, but some of them remain (which is why you'll see an occasional letter (T, L, etc.) in the note field. )
Audio-fielduse.txt is the listing of the fields with recurring values in fields, along with the count of each occurrence of that value. Decide what fields can be consolidated.
Audio-title.txt is a list of the objectid and titles for the db. Notice that some titles have an added phrase after the person's name (I've separated it with a semi-colon). We should probably move this into another field so that the title field is left with only the person's name.
Audio-dates.txt gives a list of the dates from the date field in the spreadsheet (renamed to date_old) and a new date field which extracts a single date. I've done this so that the interviews can be sorted and filtered by date, which requires the field to be in a true date format. The way that the date field was being used made this impossible because of inconsistencies in the format (sometimes there were two dates, sometimes there were additional notes, and every date was in a non-sortable format). In order to make this change, we need to move the notes and additional dates into the description field. This file is sorted by date so that you can see the problem records (there aren't many) at the top.
Audio-bydate.txt is a file which gives a listing of all interviews by date. This isn't needed for checking anything, but is a useful way to see current state of the db.
We need to decide on the name of this db. I've called it 'audio' rather than 'oral-histories' so that it's a little broader and can accomodate additional resources we might add in the future which aren't just oral-histories, such as speeches, music, discussions, factory sounds!, etc. The db and Drupal presentation of this content type will be the same for all these, so I think it makes sense to choose a broader name for the db than just oral-histories.
I've imported images-v11.ods into mysql and created a variety of outputs which will let us further process the database in preparation for importing it into Drupal.
Here are the 4 files:
Images-v15.txtis a first-pass at the mysql database created from the images-v11.ods file (it's a 6MB file, so it will take awhile to load)
FieldUse.txt an analysis of the various fields used, showing how many records have the instance of the field shown.
ClassFields.txt the same kind of analysis, but on the fields used for classification
Tags.txt a list of those classification fields boiled down to a single word, to give you an idea of how many less 'tags' there could be.
The next step is to look carefully at these files in with the following in mind:
One major area to work on is the classification scheme. This is a BIG problem, but the starting point is to look at the various classification fields and how they're presently used. This is what the ClassFields.txt file shows. I think we're converging on the idea that there would be a relatively simple hierarchical taxonomy (eg. Mills, Churches, Schools, etc.) plus a 'tags' field with all possible tags that would be used to describe a record. The last file, Tags.txt, shows the result of taking all of the classification fields (CLASSES, PEOPLE, PHTGRAPHER, PLACE, STERMS, and SUBJECTS) and extracting a list of single words to use as tags. If we adopt this approach, then we need to pick a field which best represents the simple hierarchical taxonomy (perhaps SUBJECTS?) and clean that up so that it reflects the taxonomy choices we want. This is probably the most time-consuming step left before we import these records into Drupal.
Library-v6.txt is the complete dump of the library database imported to mysql. (from books-v5.xls). It's big (5M) so will take awhile to download. Once it's fully loaded into the browser you can use the browser's find facility to find records. Check the overall content and use of fields to make sure they look right.
Library-fielduse.txt is a listing of the use of the other fields which have repeating data values, along with the number of occurrences of that value in the field. As with the images db we need to look at this with an eye toward reducing the number of fields, concatenating them into other fields (like description) where possible.
library-classfields.txt is a listing of the use of the various fields that are used for classification. With the images db we have the benefit of a primary taxonomy field (otherno), but we have no such field with the library db so we need to 'grab' something from one of these fields to use for that taxonomy... or create another field for that purpose. This is the biggest issue with the library db and will take the most time to sort out.
objects-v5.txt is the complete dump of the object database imported to mysql. (from objects-v5.xls). Once it's fully loaded into the browser you can use the browser's find facility to find records. Check the overall content and use of fields to make sure they look right.
Fielduse-objects.txt is a listing of the use of the other fields which have repeating data values, along with the number of occurrences of that value in the field. We need to look at this with an eye toward reducing the number of fields, concatenating them into other fields (like description) where possible.
Classfields-objects.txt is a listing of the use of the various fields that are used for classification. With the images db we have the benefit of a primary taxonomy field (otherno), but we have no such field with the objects db so we need to 'grab' something from one of these fields to use for that taxonomy... or create another field for that purpose. This is the biggest issue with the objects db and will take the most time to sort out.
The past-perfect dataset has been reduced for each of the 5 databases as follows:
* Fields that aren't used have been deleted
* Fields that aren't useful for sorting or fast filtering have been merged
* Fields that have similar data have been combined
The Dublin Core metadata set (http://dublincore.org/documents/usageguide/elements.shtml) has been used as the basis for organizing the fields.
The STERMS, SUBJECTS, AND CLASSES fields have been combined into a single TAGS field, and processed into single words with redundancies removed. For example, if the SUBJECTS field previously had
"Essex Company, Arlington Trust Bank Company, Essex Bank."
The TAGS field now reads:
"Essex, Company, Arlington, Trust, Bank".
In this way, that record will be retrieved if the user searches on any of the previous words, but it dramatically simplifies the organization of tag data. It also simplifies the entry of new tag data.
The PEOPLE field has been processed in the same way.
Important: For the TAGS field (see list of all values in tags field), any tag which wasn't used at least 4 times was deleted. The thinking behind this is that the power of the tag is to provide grouping for similar objects. If there aren't at least 4 of them, then it doesn't make much sense. It's kind of like having a divider in a notebook with only one page in it. By doing this, the number of tags was reduced from 4,000 to about 1,500, which should improve the performance of the indexing. For most of those objects which had only a single tag that word will be used in the full-text description and will be searchable, but that isn't as meaningful as having the word chosen as a tag.
For the PEOPLE field (see list of all values in people field) I did not do this reduction, and left peoples names on the field even if there was only one. The thinking here was that a person's name might have value in the faceted search. For example, once someone chooses a tag of, say, "Essex and Company" then all of the names which also have tags of "Essex, Company" will be selectable. Please give some careful thought to this distinction (between the handling of PEOPLE and TAGS and let me know if you agree or not.
It would be easily possible to eliminate more values from either of these two lists.
Here are the resulting databases:
Archives DB
Audio DB
Books DB
Images DB
Objects DB
These database files don't include the following administrative fields:
CONDEXAM (person)
CONDDATE (date)
CONDITION (many entries – short assessment: good, fair, excellent, etc.)
DISPVALUE appears to be redundant to condition (good, excellent, etc.) few differences
CONDNOTES entries - arch: 24, books 5, images: 765, obj 50 – useful
CATBY (person)
CATDATE (date)
UPDATEDBY (person) This will be kept by Drupal
UPDATED (date)
STATUS entries: mostly "OK", a few "missing"
STATUSBY (person)
STATUSDATE (date)
INVNBY (person)
INVNDATE (date)
MAINTDATE (date)
These fields are all pretty clean... the question is which of them we need to migrate to Drupal?
Fields to be determined - These are additional fields that need some decisions about whether they are kept or merged.
Last, look at the Columns spreadsheet which maps the past-perfect columns (fields) into the Dublin Core set (dc fields) and administrative fields. This spreadsheet has been the guiding document behind the database migration, and might be difficult to make sense of, but you'll get the idea. When I'm done I'll try to clean it up.
At this point, the databases are almost ready for migration into Drupal. The next step is to build the new content types in Drupal, but first we need to make sure that we know exactly which fields we'll be using. This is the time to review each of the DBs carefully and make sure the field selections, combinations, deletions, etc. are right, and that the choice and processing of classification and tag fields are optimal. Changes at this stage are quite easy, but will be very difficult once the migration into Drupal has taken place.
The web page listings of the repository data have now all been updated to reflect the final schema:
Archives DB
Audio DB
Books DB
Images DB
Objects DB
The following changes have been made:
* Field Names have been defined to be used for the Drupal content types which correspond to the dublin-core (dc) specification. So, for example, I've changed 'author' to 'creator'. We are free to use whatever label for a field we'd like on the Drupal screens, so we could use 'Author' for that label, but it makes sense to standardize on consistent dc-compatible field names for the underlying database. These 'drupal' field names are shown in the third column in the table below. The second column is the corresponding dc name. An 'a' after the drupal field name means that this field will be created for all 5 of the databases so that they each provide for the base dc field set. Each of the 5 databases then show the actual pp field that is being used for that field. If the field name is capitalized, then it's used just as it came off the pp db. If it's lower case, it's a field that I created from merging other fields or otherwise processing the data (eg. date formatting).
* Sequence The fields have been sequenced into what we think is a sensible sequence. (dc doesn't standardize this). The first column is an arbitrary sequence number which is only used for this spreadsheet. This sequence will be implemented in Drupal (unless there are changes).
The last step before creating the Drupal content types and importing the data into them is to do a careful final check on the 5 web listings of the databases. Specifically, check the choice of fields names, the field sequence, and most importantly, the data itself. This is the last opportunity to easily clean-up data, merge fields, or do other data processing. The above web listings represent ALL of the data that would be imported, and the exact fields that it will be imported into.
If you want the following table in spreadsheet form, it's attached at the end of this post.
| seq | dc field | drupal field | archives | audio | books | images | objects | |
| 10 | Identifier | id_object | a | OBJECTID | OBJECTID | OBJECTID | OBJECTID | OBJECTID |
| 11 | Identifier | id_accession | a | ACCESSNO | ACCESSNO | ACCESSNO | ACCESSNO | ACCESSNO |
| 12 | Identifier | id_old | OLDNO | OLDNO | OLDNO | OLDNO | ||
| 13 | Identifier | id_neg | NEGNO | |||||
| 14 | Identifier | id_alt | altid | altid | ||||
| 19 | type | type | a | tblname | tblname | tblname | tblname | tblname |
| 20 | title | title | a | TITLE | TITLE | TITLE | TITLE | TITLE |
| 21 | title | title2 | TITLEX | |||||
| 22 | title | title_generic | a | OBJNAME | OBJNAME | OBJNAME | OBJNAME | OBJNAME |
| 30 | description | description | a | description | description | description | description | description |
| 32 | description | description_notes | NOTES | |||||
| 50 | subject | subject_tags | a | tags | tags | tags | tags | tags |
| 55 | subject | subject_people | a | subj_people | subj_people | subj_people | subj_people | subj_people |
| 60 | subject | subject_collection | a | COLLECTION | COLLECTION | COLLECTION | COLLECTION | |
| 70 | creator | creator | a | AUTHOR | AUTHOR | PHTGRAPHER | ARTIST | |
| 75 | creator | creator2 | AUTHORX | |||||
| 80 | contributor | contributor | a | PUBLISHER | STUDIO | |||
| 90 | date | date | a | DATEACCUM | PUBDATE | |||
| 100 | coverage | coverage_date | a | date | date | date | date | |
| 110 | coverage | coverage_event | a | EVENT | EVENT | EVENT | ||
| 120 | coverage | coverage_place | a | PLACE | MADE | |||
| 150 | format | format | a | EXTENT | EXTENT | PHYSDESC | dimensions | dimensions |
| 160 | format | format_level | LEVEL | |||||
| 162 | format | format_physical | PHYSCHAR | |||||
| 165 | format | format_system | SYSTEM | |||||
| 200 | provenance | provenance | a | CUSTODIAL | PROVENANCE | PROVENANCE | OWNED | |
| 210 | provenance | provenance_credit | CREDIT | CREDIT | CREDIT | CREDIT | ||
| 250 | language | language | a | LANGUAGE | LANGUAGE | LANGUAGE | ||
| 300 | rights | rights | a | ACCESS | ACCESS | |||
| 700 | relations | relations | a | RELATED | ORALHISTNO | ORALHISTNO | ||
| 800 | admin | stored | a | PERMLOC | PERMLOC | PERMLOC | PERMLOC | PERMLOC |
| 810 | admin | stored_neg | NEGLOC | |||||
| 820 | admin | image_catalog | OTHERNO | |||||
| 825 | admin | original_copy | ORIGCOPY | |||||
| 830 | admin | finding_aid | FINDAID | |||||
| 850 | admin | chk_release | r | |||||
| 850 | admin | chk_transcript | t | |||||
| 850 | admin | chk_folder | f | |||||
| 850 | admin | chk_line_index | l | |||||
| 850 | admin | chk_copied | c | |||||
| 915 | admin | condition | a | CONDITION | CONDITION | CONDITION | CONDITION | |
| 927 | admin | display_value | DISPVALUE | DISPVALUE | DISPVALUE | |||
| 925 | admin | condition_notes | CONDNOTES | CONDNOTES | CONDNOTES | CONDNOTES | ||
| 930 | admin | cat_by | a | CATBY | CATBY | CATBY | CATBY | CATBY |
| 935 | admin | cat_date | a | CATDATE | CATDATE | CATDATE | CATDATE | CATDATE |
| 950 | admin | status | a | STATUS | STATUS |
| Attachment | Size |
|---|---|
| Fields (column) list - columns-3.xls | 29.5 KB |