MYSQL File Analysis for Image Database

I've imported images-v11.ods into mysql and created a variety of outputs which will let us further process the database in preparation for importing it into Drupal.

Here are the 4 files:

Images-v15.txtis a first-pass at the mysql database created from the images-v11.ods file (it's a 6MB file, so it will take awhile to load)

FieldUse.txt an analysis of the various fields used, showing how many records have the instance of the field shown.

ClassFields.txt the same kind of analysis, but on the fields used for classification

Tags.txt a list of those classification fields boiled down to a single word, to give you an idea of how many less 'tags' there could be.

The next step is to look carefully at these files in with the following in mind:

  • Can fields be deleted? - It looks like a number of fields don't add any value. Specifically, the following fields look like they could be deleted:
    • CONDEXAM - Is this being used?
    • STATUS - Is this being used?
    • DATASET - Is this being used?
    • GPARENT - almost every record has the same value
    • MEDIUM - less than 30 records have anything but blank
    • OBJNAME - all but 100 say Print, Photographic
    • ORALHISTNO - only a few dozen records have a value
    • STATUS - Is this being used?
    • STATUS BY - Is this being used?
    • TEMPBY - Is this being used?
    • UPDATEDBY - Is this being used?
  • Can fields be combined to reduce the number. It seems like several fields could be added to the end of the description field thereby reducing the number of fields (which is generally a good thing to do). Keep a field if we'll need to sort on it, filter on it, or want to show it clearly in reports and screens. If it's just data that needs to be retained but we'll never sort or filter on it, then it could be added to the description field. Other fields might also be able to be combined, like EARLYDATE-LATEDATE.
  • Can data be cleaned up to reduce the number of variations. So, for example, NEGLOC has several variations of the same thing. These should be combined into one 'spelling'... what spelling should that be?
  • How can we combine permloc and temploc so that we have only one location field?, or do we need two?

One major area to work on is the classification scheme. This is a BIG problem, but the starting point is to look at the various classification fields and how they're presently used. This is what the ClassFields.txt file shows. I think we're converging on the idea that there would be a relatively simple hierarchical taxonomy (eg. Mills, Churches, Schools, etc.) plus a 'tags' field with all possible tags that would be used to describe a record. The last file, Tags.txt, shows the result of taking all of the classification fields (CLASSES, PEOPLE, PHTGRAPHER, PLACE, STERMS, and SUBJECTS) and extracting a list of single words to use as tags. If we adopt this approach, then we need to pick a field which best represents the simple hierarchical taxonomy (perhaps SUBJECTS?) and clean that up so that it reflects the taxonomy choices we want. This is probably the most time-consuming step left before we import these records into Drupal.