Digital Humanities Summer Institute #DHSI18 Day 2 [Morning] Making Choices About Your Data
Paige Morgan and Yvonne Lam
Clean data vs tidy data
Cleaner data is grouped into the fewest possible 'boxes' or categories, which makes data more interoperable and legible to other agencies. Think 'race/ethnicity': either a few checkboxes/labels, or an open field where folks can write in anything at all (which makes running analysis difficult). Ambiguity and complexity: how does having more or less ambiguity in your data/project affect where the work goes?
Limited categories are legible and understandable to others. If you are studying something that manifests differently among categories, you'd need the 'messier,' more detailed data.
machine parsable <----------> non machine parsable
less accurate <----------> more accurate representation of complexity
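The trade-off above can be sketched in a few lines: collapsing free-text entries into a fixed set of categories makes the data machine parsable, at the cost of flattening anything that doesn't fit. This is a hypothetical illustration; the category labels and mapping are invented, not from the session.

```python
# Hypothetical sketch: mapping free-text responses onto a limited set of
# categories. Everything that doesn't match is flagged rather than forced
# into a box, so the ambiguity stays visible for review.

CATEGORY_MAP = {
    "uk": "United Kingdom",
    "u.k.": "United Kingdom",
    "england": "United Kingdom",
    "usa": "United States",
    "u.s.": "United States",
}

def categorize(response: str) -> str:
    """Map a free-text response to a limited category, or flag it for review."""
    key = response.strip().lower()
    return CATEGORY_MAP.get(key, "UNCATEGORIZED: " + response)

print(categorize("England"))   # falls into the 'United Kingdom' box
print(categorize("Mars"))      # flagged, not silently discarded
```

Where the work goes is a design choice: here it goes into maintaining the mapping table and reviewing the flagged leftovers.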
Book recommendation: Sorting Things Out (Bowker & Star) — on data about causes of death and disease. The dataset originated for people working on merchant ships; doctors had an incentive to lie about whether someone died onboard or in port, whether they reported in sick, what they had, etc.
Difference between source-based data (where people can enter whatever they like, so it's likely to have more complexity) and method-based data (limited: we represent the source as best we can, but we also want to run analysis and answer our RQs, so we make some decisions). Another example: normalizing spelling (text analysis!), when you know words aren't spelled as in Modern English and a word has one or more spellings in the language of the time. If you erase that variation and make sure every instance is spelled 'clean' rather than 'clene' or 'cleen,' your searches will find all the examples. Women/womyn: are you hiding some nuance of your data that makes a certain group of people visible or invisible, and how do you handle that? Example: accents were normalized out of the menu dataset; the idea that language should be prescriptivized is classist and ableist.
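The spelling-normalization example above can be sketched as a simple variant table. This is a hypothetical illustration, not from the session; the variant list is invented.

```python
# Hypothetical sketch of the normalization trade-off: collapsing historical
# spelling variants onto one modern form so searches and counts find every
# instance. Note that the same move can erase deliberate choices (womyn).

VARIANTS = {
    "clene": "clean",
    "cleen": "clean",
    "cleane": "clean",
    "womyn": "women",   # normalizing here hides an intentional spelling
}

def normalize(tokens):
    """Return tokens with known variants mapped to a canonical spelling."""
    return [VARIANTS.get(t.lower(), t.lower()) for t in tokens]

print(normalize("The cleane roome was clene".split()))
```

Every variant now counts as the same word, which is exactly what you want for frequency analysis and exactly what you don't want if the variation itself carries meaning.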
ex) Thomas Padilla: for a Comic Book Artists of Color database, he chose not to normalize data on race at all, giving people full freedom to describe their race however they preferred, including no normalizing of the spelling.
Raw data versus cooked data (instead of messy vs. clean). Reading recommendation: Lisa Gitelman's edited collection on this; she argues there's no such thing as raw data, because you're always getting someone's interpretation. (In quant fields, "cooked" has a bad connotation.) Consider: how stable are your research question and tool? As your knowledge of the tool and your RQ change, you're making decisions along the way that have impact.
Platform Choice: Things to Bear in Mind
See the New Yorker article by Maura Winckle (sp?) on the word "tool." You may think of tools and platform choice as merely instrumental, but they're not neutral. Sometimes you make a utilitarian choice, but what choices is the tool making for you, and what did the toolmaker intend?
Choosing tool or platform
- You might use more than one tool over course of your project
- No one tool is likely to fulfill all your needs
- Some platforms do one unique thing really well
- What input/format does tool require?
- Sustainability questions
- Can you download/export your material from this tool once you put it in?
- Who made the tool? Who are their audiences? What is their revenue stream (how long is it likely to last?)
- Collaboration questions
- Is it easy to share in-progress material with others (if you need to?)
- Accessibility questions
- How does this tool work for people using assistive technology?
- How does this tool work for people who are in locations with low bandwidth/internet access?
Platforms (see slide)
- Mapping/GIS platforms
- Google My Maps
- Plot points on a map, create descriptions, draw lines on a map, basic styling; can include images and videos. No import or export, and all done by hand.
- Google Fusion Tables
- Palladio
- Tableau/Tableau Public
- StoryMap JS
- Knight Lab's competitor to ArcGIS StoryMaps
- Text analysis platforms
- AntConc
- text mining program widely used in corpus linguistics
- many tutorials available
- can extract data to spreadsheets on Windows PCs; pairs well with Tableau for visualization
- Voyant / Voyant Server
- produces word clouds and visualizations
- good gateway tool? Less powerful than AntConc
- Works well with languages other than English
- Relational database
- MySQL
- command line tool
- AirTable
- a good starting point; has good free and good premium versions
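As a sense of what the relational options above buy you, here is a minimal sketch using Python's built-in sqlite3 module standing in for MySQL/AirTable. The table and column names are invented for illustration.

```python
# Minimal relational-database sketch with sqlite3 (in-memory database).
# The payoff of the relational model is the join: one query across tables.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE artists (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE works (id INTEGER PRIMARY KEY, title TEXT, "
             "artist_id INTEGER REFERENCES artists(id))")
conn.execute("INSERT INTO artists (id, name) VALUES (1, 'Example Artist')")
conn.execute("INSERT INTO works (title, artist_id) VALUES ('Example Work', 1)")

rows = conn.execute(
    "SELECT artists.name, works.title FROM works "
    "JOIN artists ON works.artist_id = artists.id"
).fetchall()
print(rows)  # [('Example Artist', 'Example Work')]
```

Splitting artists and works into separate tables is the same "fewest boxes" decision discussed earlier: structure makes the data queryable, but every column is a choice about what counts.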
Omeka: a content management system, preloaded with metadata standards in use. Ooh, ask for Paige's MOU re: working with Omeka.
Jekyll + Wax: generate static sites.
Reclaim Hosting (domain hosting) is recommended by a fellow attendee.
WordPress is another CMS.
---------
Is your content/data/material:
- Text and images that you want to show to folks? Omeka, Scalar, Timeline JS
- Text and images that go on a map? Google My Maps, StoryMap JS
- Text that you want to analyze for patterns? AntConc, Voyant
- Stuff that you want to do in various permutations? Google Fusion Tables
- Information that you want to make interactive/filterable? Tableau, Google Fusion Tables, AirTable
- Stuff that you want to organize by tagging it? Scalar, Omeka
Project Time
Comments