Digital Humanities Summer Institute #DHSI18 Day 1 [Morning] Making Choices About Your Data
Paige Morgan and Yvonne Lam [ #wrangledata ]
Goals
- Spreadsheet of data and metadata you can take to a librarian or developer
- Clearer idea of what research questions you can ask of your data
- Better sense of what tools would be a good fit for your data; or what you would need to do to your data to make it work better with certain tools
- Start of specific plans about work that you want to do ON your data
So much depends on what you're going to prioritize, because you are not going to learn all the things at once. Encouragement to think carefully, realistically, and generously with ourselves about setting goals for what we're going to learn. Goals, milestones.
FemTechNet MEALS Framework
The idea is to poke a little bit at assumptions we have about how technologies work, about good ways of using them, and about what counts as an acceptable thing to apply technology to (mostly discussing digital tech). Not only is there this idea of how tech gets used and what it is, but also: who is technology for? The feminist DH class that troubled those assumptions was evocative for the instructors. When you work with a digital technology, an idea with little behind it can sound good, but when you implement it you find it's far more complicated and takes a lot more work than was planned. Overwork used to be valued: people in DH are expected to learn stuff, make good choices, and keep producing strong, powerful outputs that get grants, without a real understanding of the labor and material conditions involved. There is pressure to make decisions for other people.
- Technology is MATERIAL, though it is often presented as transcendent
- Technology involves EMBODIMENT, though it is often presented as disembodied
- Tech solicits AFFECT, though it is often presented as highly rational
- Tech requires LABOR, though it is often presented as labor-saving
- Tech is SITUATED in particular contexts, though often presented as universal
- Tech promotes particular VALUES, though often presented as value-neutral
- Tech assumes MASTERY OF TACIT KNOWLEDGE PRACTICES, although often presented as transparent.
[Get slides later from tinyURL]
Introductions
Vocabulary for this week
Common vocab as we discuss what we're going to do with our data.
Unstructured vs structured data
ex: Donald Trump's doctor's letter certifying his good health
ex: American Revolutionary War-era deserter clothing
Text mining and text analysis look for patterns and count words in unstructured or less-structured data. For example, look at how often certain pronouns occur (he, she, etc.), using programs to examine and count the data. The data is minimally structured. All 7 Harry Potter novels: keep separate files for each novel so you can compare volumes, or break them down into single chapters to trace whether something peaks at certain points of a book or drops off. There is no "better if more structured": it depends on your RQs.
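The pronoun-counting idea can be sketched in a few lines of Python (a hypothetical illustration; the pronoun list and sample chapters are assumptions, not materials from the session):

```python
import re
from collections import Counter

# Pronouns to track (an assumption for illustration).
PRONOUNS = {"he", "she", "they"}

def pronoun_counts(text):
    """Count occurrences of each tracked pronoun in one chunk of text."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w in PRONOUNS)

# Minimally structured data: one string per chapter, so counts can be
# compared across chapters to see where a pronoun peaks or drops off.
chapters = [
    "He said that she would follow them.",
    "She and he argued; then she left.",
]
for number, chapter in enumerate(chapters, start=1):
    print(number, dict(pronoun_counts(chapter)))
```

Splitting one corpus into per-novel or per-chapter files is the only "structure" this kind of analysis needs.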
Structured data is just data arranged in specific ways so computers can do more things with it. Structured data can look like a spreadsheet, which is a good way of thinking and talking about structured data. There are decisions to make about which fields to include (see the North American Deserted Soldiers project by White at UMiami: the vocabulary used in columns X to G structures the clothing descriptions). The more you structure your data, the more it's possible to ask a computer to count or analyze it. A spreadsheet is not the only way to create structured data.
Trump's doctor's letter certifying good health: you can pull out any manner of data here, which is another way of structuring data. TEI can be used to encode more than just text. TEI is another way of structuring data by putting things into boxes: view, speaker, paragraphs of text. You could box the adjectives from the letter. You could build a dataset of letters from all presidents' physicians over time and structure it that way (if such letters exist to build a dataset from).
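A minimal sketch of the "boxing" idea, using Python's standard-library ElementTree (the fragment below is an invented, TEI-flavored illustration, not real TEI markup or actual text from the letter):

```python
import xml.etree.ElementTree as ET

# An invented, TEI-flavored fragment: the adjectives are "boxed"
# inside <adj> elements nested in paragraphs.
doc = """<letter>
  <p>His physical strength and stamina are <adj>extraordinary</adj>.</p>
  <p>His test results were <adj>excellent</adj>.</p>
</letter>"""

# Once the structure exists, a program can pull out every boxed item.
root = ET.fromstring(doc)
adjectives = [adj.text for adj in root.iter("adj")]
print(adjectives)
```

Real TEI uses its own element names and namespace; the point here is only that markup turns a prose document into structure a program can query.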
Four types of data models
Some are leery of thinking of materials as data because it sounds impersonal and clinical, and data is assumed to be about finding commonalities (as in big data), whereas humanities projects much of the time attempt to show things that are unique. But data is just *a* representation, not the Truth. Data models are just the ways that, if you are going to structure data, it will be more or less structured. See reading.
- tabular
- each data item is structured as a line of field values. Fields are the same for all items; a header line can indicate their names. But note, as with the database above: too many holes in the cells creates errors. As you look at spreadsheets and tables, ask whether most of your cells are full or empty. Data can be homogeneous (easy to fit into a table, all cells full; library books are to some degree a good example of homogeneous data, since most have a title, author, date, and subject, so you don't have to worry about them being like soldier uniforms) or heterogeneous (each item carries lots of different types of data that may be very unique; ex: a database of prices in literature and fiction, where you expect each entry to be a thing and a price in shillings, which works until you discover it's common for day laborers to be paid 6 pence a day plus potatoes or beer or whatever, so the column you thought would be numbers is no longer mathematical. Also, some items would be apples at 6 shillings a pound, others the finest almonds from wherever, or syphilis medication: potentially useful for others, but it makes the data very heterogeneous). Depending on what you want to do... what are the RQs?
- We don't want to suggest that good data has all the heterogeneity taken out of it; that leads to disappearing people and populations. Be clear and up front about what your data does not effectively represent: what it allows us to answer, but also what its weaknesses are.
- Read: *Roads to Power*, about road-building in early 19th-c. Britain; the author talks about holes in the data, and what she extrapolates from the data that isn't there.
- relational
- Data are structured as tables, each with its own set of attributes; records in one table can relate to records in another by referencing a key column. (MySQL is an example.)
- meta-markup
- TEI is the best example for DH people; it's a particular flavor of XML. Top-level boxes, then views, then speakers, speaking in paragraphs; the data might get more complex (see coded Shakespeare, down to the spaces between words). Hierarchical, with a tree-like appearance. For textual data that you want to add more granularity to. Ex: track emotional beats in Dickens or Brontë or whomever.
- Opinion: a lot of humanities data is pretty complex, and some of the most complex data and questions people want to ask sometimes do not lend themselves to tabular or relational data, and maybe not to meta-markup either.
- RDF: Resource Description Framework (graph data), or non-relational data
- A more complex data model that *can* work with humanities data. A graph structure can take any shape: blobs and arrows between blobs. Each fact about a data item is expressed as a triple, connecting a subject to an object through a precise relationship. This leads to graph-structured data that can take any shape. Movie release dates vs. television show airing dates cause issues because of their differences; it might not matter depending on the questions you want to ask. Do you need separate fields for release vs. airing dates? Either way, you need to represent your dates the same way!! Otherwise, if you invite people to write queries against your data, those queries might be structured so that they catch one type of date and not another.
- It's called a non-relational database because not everything is directly related to everything else.
- Large companies like big data, so tools for graph databases are only starting to become common enough that more people are using them. Relational models are much older.
- Tool called Dydra: it can take your non-relational data, structured in triples, load it, and pre-bake queries that others can run without needing to code. (Ex: run a query on a term and retrieve DOIs for JSTOR articles containing that term within whatever parameters.) It doesn't require you to be a full-stack developer. Some tools require a full developer to get up and running; Dydra is a free tool that doesn't. It also means users don't need to be able to code in order to query your data.
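The price-column problem from the tabular discussion above can be sketched in Python (the rows are invented for illustration; the point is that a column you expected to be numeric needs a modeling decision once in-kind payments appear):

```python
import csv
import io

# Invented price rows: the "price" column mixes plain numbers with
# in-kind payments, so it is heterogeneous rather than purely numeric.
raw = """item,price
apples (per lb),6
finest almonds,9
day laborer wage,6 pence plus beer
"""

numeric, needs_modeling = [], []
for row in csv.DictReader(io.StringIO(raw)):
    try:
        numeric.append((row["item"], float(row["price"])))
    except ValueError:
        # Not castable to a number: a modeling decision is required,
        # e.g. separate columns for cash and in-kind components.
        needs_modeling.append(row)

print(len(numeric), "numeric rows;", len(needs_modeling), "rows need a decision")
```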
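The relational model's key-column idea can be sketched with Python's built-in sqlite3 instead of MySQL (the tables and rows below are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Two tables; letters point back to authors through the author_id key column.
cur.execute("CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute(
    "CREATE TABLE letters (id INTEGER PRIMARY KEY, "
    "author_id INTEGER REFERENCES authors(id), year INTEGER)"
)
cur.executemany("INSERT INTO authors VALUES (?, ?)",
                [(1, "Bornstein"), (2, "Jackson")])
cur.executemany("INSERT INTO letters VALUES (?, ?, ?)",
                [(1, 1, 2015), (2, 1, 2016), (3, 2, 2014)])

# A join follows the key column from one table into the other.
rows = cur.execute(
    "SELECT a.name, COUNT(*) FROM letters l "
    "JOIN AUTHORS a ON l.author_id = a.id "
    "GROUP BY a.name ORDER BY a.name"
).fetchall()
print(rows)
```

The join is what the relational model buys you: each fact lives in one table, and queries recombine tables on demand.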
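The subject-relationship-object triples described above can be sketched as plain Python tuples (a toy, in-memory stand-in for a real triple store or a hosted service like Dydra; the facts are invented):

```python
# Each fact is a (subject, relationship, object) triple; together the
# triples form a graph that can take any shape.
triples = [
    ("JawsTheMovie", "releasedIn", "1975"),
    ("TwinPeaksS1", "firstAiredIn", "1990"),
    ("JawsTheMovie", "directedBy", "Spielberg"),
]

def query(data, subject=None, relationship=None, obj=None):
    """Return the triples matching every non-None field."""
    return [
        (s, r, o) for (s, r, o) in data
        if (subject is None or s == subject)
        and (relationship is None or r == relationship)
        and (obj is None or o == obj)
    ]

# The modeling wrinkle from the notes: "releasedIn" and "firstAiredIn"
# are different relationships, so a query for one misses the other.
print(query(triples, relationship="releasedIn"))
```

A real triple store layers a query language such as SPARQL over this same pattern-matching idea.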
Many tools available; these 4 cover most of the bases. If your data doesn't fit into one of these models, let's talk about it.
Social network analysis: tabular? Ex: who is sending letters to whom? Tracking poems being copied down in commonplace books as network analysis. Tables include: which commonplace books this poet appears in, where are the books in which they appear, and how are they dated?
Vocabulary: tool vs. platform. Sometimes used interchangeably, but a tool is for *doing* things, while a platform is for stability: a platform to host your data, a tool to analyze it? For example, Tableau allows you to create data visualizations you can publicize so people can interact with them. You're putting your data into it: is it a platform? Or a tool, because people can use it to change things? The distinction might be helpful when thinking about licensing, who has access, where the data lives, and how we get people access, but it tends to be loose when you're actually working with stuff. An implication of "platform": it may consist of many tools, and may have a hosted component as well as a tool component. Think locally hosted on your laptop (tool) vs. hosted elsewhere (platform). Ex: NVivo lends itself to meta-markup; it's meant to let you generate stats about your content depending on what you're looking for, but it is not interested in letting you export material TO others. Meta-markup languages in the context of DH, by contrast, exist explicitly to allow people to do stuff with your stuff. NVivo is a tool you use for analysis; it's not intended for processing content in a way you will share with others so they can ask other questions with it.
Method-oriented structured data
You have a very specific question you want to ask.
ID info in the sources -> Identify precisely the RQs and purpose of the database -> Design database to accommodate only the info needed to answer questions -> Convert info to data -> perform analysis -> generate research outputs.
If you're trying to track a particular thing, might structure your data based purely on answering one question.
See image from Institute for Historical Research's free online course "Designing databases for historical research."
Source-oriented structured data
ID info in sources -> design database to include ALL of the information [...]
ex) With Trump's doctor's letter, we could consider adjectives, body parts, etc. But no one said you have to cover all the bases! (Though you might need to cover everything.) Digital Yoknapatawpha (Faulkner scholarship): for one text, "That Evening Sun," see what data is tracked and trace what is occurring in the text. To Faulkner scholars this is meaningful.
It can quickly become not meaningful; there is a lot of affect in what's included, how it is recorded, etc. What is doable/feasible? We are not interested in encoding everything in Faulkner's stories, but others may be. You need to think through the reasons why you want to do something and take into account the labor and materials you have available. Can you land a $4m grant from the NEH to build a source-oriented database, with an adequately planned budget so the plan is actually feasible? You want to think about usability: think in advance about who your audience is and what kinds of questions they will want to ask. Go to conferences, talk about it, pay attention to what folks are interested in. If you try to be comprehensive, you will get stuck in a goal that explodes your project milestones. If you create a complicated system, you need to teach people to use it, which becomes a question about labor. If you're fortunate enough to get student labor, are you teaching them usable skills, or something so specialized that its value as a credential students can use gets a little weird?
Think of yourself as a customer for your own data: not you now, but you six months from now. Be nice to your future self; it helps frame your thinking. Think about staging the project: what is the next thing I could do to make the case that I should have an intern, a grad student, or half of a database? What is the next showable milestone I could reasonably reach from where I am now? Do I even have the resources to do this now? What, concretely, will this do? Small interventions and small starts can be just as effective as massive million-dollar efforts.
Lunch