Digital Humanities Summer Institute #DHSI18 Day 3 [Morning] Making Choices About Your Data

Paige Morgan and Yvonne Lam

Standardized rights statements: http://rightsstatements.org/en/
  • Controlled vocabularies
  • Working with OpenRefine
  • Free work time
  • Lunch
  • Reading: Against Cleaning
  • Free work time
  • Tomorrow: Meeting with FemDH
Controlled vocabulary: a set of carefully chosen words and phrases used to help structure and define information so that it can be easily returned in a search, or parsed by analysis programs. May be the basis for taxonomies and ontologies; can be hierarchical or restricted in various ways.

Ex) Pizza vocabulary.
Crust (deep dish, crispy)
Sauce (marinara, alfredo, olive oil)
Cheese (mozzarella, provolone, parmesan)
Veggies (mushrooms, green peppers, onions, tomatoes, olives)
Meat

We can say every pizza must have a crust, must have one or more sauces, etc.
We can add another layer and say there are 'veggie pizzas' and 'meat pizzas', define what belongs in each category, and add a rule that veggie pizzas cannot contain any ingredient from the meat category. The principle is that people can only have what we put into the vocab, and the vocab can only describe what we have.
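The pizza rules above can be sketched in a few lines of Python. This is only an illustration, not something from the session: the function name check_pizza is made up, and "pepperoni" is an assumed meat term, since the notes list no meat options.

```python
# Controlled vocabulary from the pizza example.
VOCAB = {
    "crust": {"deep dish", "crispy"},
    "sauce": {"marinara", "alfredo", "olive oil"},
    "cheese": {"mozzarella", "provolone", "parmesan"},
    "veggies": {"mushrooms", "green peppers", "onions", "tomatoes", "olives"},
    "meat": {"pepperoni"},  # assumption: the notes give no meat terms
}


def check_pizza(pizza, kind=None):
    """Return a list of problems; an empty list means the record is valid."""
    problems = []
    # Every category and term must come from the vocabulary.
    for category, terms in pizza.items():
        if category not in VOCAB:
            problems.append(f"unknown category: {category}")
            continue
        for term in terms:
            if term not in VOCAB[category]:
                problems.append(f"'{term}' is not in the {category} vocabulary")
    # Rules layered on top of the vocabulary.
    if not pizza.get("crust"):
        problems.append("a pizza must have a crust")
    if not pizza.get("sauce"):
        problems.append("a pizza must have one or more sauces")
    if kind == "veggie" and pizza.get("meat"):
        problems.append("a veggie pizza cannot contain meat")
    return problems


order = {"crust": ["crispy"], "sauce": ["marinara"], "fruit": ["pineapple"]}
print(check_pizza(order))
```

The pineapple order is rejected because there is no fruit category yet, which is the point of the principle: the vocab has to be updated, and documented, before the data can use the new term.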

Have we left anything out? Yes: goat cheese, pineapple (we would need to add a fruit category), bleu cheese, pears. You can update a controlled vocabulary, but it works better if you have your data dictionary and your documentation in place and have thought carefully about whether there is going to be any confusion.

In the scholarship in your field, does everyone agree on those terms and categories? Describe the choices you made, which ones are controversial, and why you made them. That is what the data dictionary is for.

Ways of thinking about controlled vocabs
  • they can only have what we put into the controlled vocab; vocab can only describe what we have
  • Where is the material in your controlled vocab coming from?
    • Are there groups of terms in your area? Do people in your field fully agree on those terms?
    • External vocabs?
    • Your discipline?
  • How much compromise is appropriate?
Example of a range of vocabularies/taxonomies in action: dbpedia.org, which exposes Wikipedia content as structured data. Its URLs mirror Wikipedia's, with 'wiki' replaced by 'db' in the site name. Look up John Lennon and read the subject list to see how a person is parsed. [dbc is part of a particular ontology.] This gives a sense of the different categories this particular vocab uses to describe people. For your own vocabulary, you might start from a base vocabulary and add items: give the originators credit, but you can say you think the vocab is incomplete or racist or misogynist, and ask what's there and what's not there.

How much compromise is appropriate? You could use dbpedia's categories to structure your info and keep it consistent with dbpedia, which would make it possible for your info to go into Wikipedia/dbpedia. But you may have to decide whether a term in an existing vocabulary means enough of the same thing as yours, or whether there will be confusion.

One place to find people's taxonomies: Linked Open Vocabularies (getting into linked open data) http://lov.okfn.org/dataset/lov/

Example: Getty Program vocabulary: vocab.getty.edu/ontology

VIAF - Virtual International Authority File (from authoritative libraries). If you want to replace plain names with URIs, OpenRefine will allow that.

Are controlled vocabularies and taxonomies a way to retain more complexity in your data?

Taxonomies vs. ontologies. The terms are used in different contexts and with different tech. In linked open data you are more likely to hear about ontologies; in social science work, taxonomies are more common. The distinction is not much bigger than the one between a codebook and a data dictionary. "Ontology" here is the linked open data sense, not the philosophical one: in our context, the ontology is the rules (if it is a pizza, it must have a crust; if an item has meat on it, it cannot be classified as vegetarian; etc.)

OpenRefine
Import dataset
   Edit cells -> Common transforms -> trim leading and trailing whitespace, then collapse consecutive (internal) whitespace (we won't be generating URIs, so we're not going to do this for every column right now)
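For anyone who wants to see what those two whitespace transforms do to a cell, here is a rough Python equivalent. It is a sketch of the idea rather than OpenRefine's own code, and the function name is made up.

```python
import re


def trim_and_collapse(value):
    """Strip whitespace at the edges and squeeze internal runs to one space."""
    return re.sub(r"\s+", " ", value.strip())


print(repr(trim_and_collapse("  Leadership   Institute \t")))
```

Cleaning whitespace first matters because "Leadership Institute" and "Leadership  Institute" would otherwise show up as two different values in a facet.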

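Edit cells -> Transform
Use GREL code to find and replace special characters, for example:
value.replace('a','')
This will replace a with nothing (the second argument is two single quotes with nothing between them). Put the character you want to remove in place of a.
OpenRefine will show you a preview before you apply the change.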
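The same find-and-replace idea, sketched in Python for comparison. The sample cells and the list of unwanted characters are invented for illustration; printing the before and after values side by side stands in for the preview.

```python
def remove_characters(value, unwanted):
    """Delete every occurrence of each unwanted string from value."""
    for item in unwanted:
        value = value.replace(item, "")
    return value


column = ["$1,200", "$35", "n/a*"]  # made-up sample cells
cleaned = [remove_characters(cell, ["$", ",", "*"]) for cell in column]

# Show each original value next to its cleaned version, like a preview.
for before, after in zip(column, cleaned):
    print(before, "->", after)
```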

Faceting is just identifying the different options within a particular column and laying them out. 

Facet -> text facet (for text column)
Notice that the items facet has two blanks, and one value, 'gift', is not a proper category. To fix a value, go to the cell and edit it.
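A text facet is essentially a count of each distinct value in a column. A minimal Python sketch, with an invented items column in which the two blanks are an empty cell and a cell holding a single space (one guess at why blanks could appear twice):

```python
from collections import Counter


def text_facet(column):
    """Count how often each distinct value appears, most common first."""
    return Counter(column).most_common()


items = ["book", "book", "", " ", "gift", "journal", "book"]  # made-up data
for value, count in text_facet(items):
    print(repr(value), count)
```

Printing with repr makes the two kinds of blank visible, which is the sort of inconsistency a facet is good at surfacing.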

You can apply multiple facets at the same time.
You might need to check outliers.

Selecting a facet item and then adding a text filter might let me grab open-text responses: I can standardize the leadership institutes, and can mine for answers in the end qual questions.

Transforming date to ISO standard:
Edit cells -> transform
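value.toString('yyyy-MM-dd')
This works when the cell is already a date. If the cell is still text, convert it first: value.toDate().toString('yyyy-MM-dd')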
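For comparison, the same conversion in Python. The input format here (month/day/year) is an assumption for the sake of the example; the real pattern depends on how dates were entered in your dataset.

```python
from datetime import datetime


def to_iso_date(value, input_format="%m/%d/%Y"):
    """Parse a date string in a known format and return it as yyyy-MM-dd."""
    return datetime.strptime(value, input_format).date().isoformat()


print(to_iso_date("06/13/2018"))
```

Stating the input format explicitly avoids silently swapping day and month, which is the main risk when dates are parsed automatically.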

Useful OpenRefine resources
Google Refine Expression Language (GREL)
Faceting and filtering
Date functions
Column editing
Five steps you can take to save time with OpenRefine (for some datasets)

Separate Columns
Pulldown Edit column -> Split into several columns. It asks what separator you want to use. It expects a comma; delete that and type in a space.
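What the split does to a column can be sketched in Python as follows. The function name and the sample names are made up; the padding with empty strings is one assumed way of keeping every row the same width when some cells have fewer parts.

```python
def split_column(rows, separator=" "):
    """Split each cell on the separator and pad rows to equal length."""
    parts = [cell.split(separator) for cell in rows]
    width = max((len(p) for p in parts), default=0)
    return [p + [""] * (width - len(p)) for p in parts]


names = ["Ada Lovelace", "Grace Hopper", "Cher"]  # made-up sample cells
for row in split_column(names):
    print(row)
```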


