UC DLFX 2018 Combined Session G Notes
Digital Conversion in the Modern Research Ecosystem
Stefan Elnabli (UCSD) - Media Curation Librarian and Supervisor, Digital Reformatting Operations
Interesting place to be: half-time in collections and half in digital collections.
Digital conversion/digitizing isn't just a means of preservation or of access, but foundational to the entire creation process of library collections, which includes their use in modern research. The files we produce and disseminate should be considered tools.
In service of digital scholarship and knowledge production, digital conversion is integral to the preservation, access, and future of our digital collections. What is the basis for knowledge production? How do we get from data to knowledge? How do we use our digital collections to answer different kinds of research questions? What is the value of digital collections and how can we measure it?
The OAIS model (Open Archival Information System): a producer submits to a repository--the basis for allowing you to test the assertion that these things are being preserved. Those responsible for digital conversion assume the role of producer, contributing submission information packages (SIPs). SIPs abide by requirements. We need to be aware of the end users, and need to be specific in defining the designated community so you can add value to the data they need. How? Permeate the whole process.
Stuff comes in, people migrate bits. Bits migrated or created need to comply with best practices and with end users' needs. Need to be format experts, know care and handling, know digitization and migration, know how the files generated are sustainable, know what metadata is important and how things can be accessed, so that data disseminated maximizes the potential for scholarship and knowledge production.
Should care because the fruit of our labor isn't merely access or preservation but the knowledge produced from the data contributed to the repository and our work, which starts at the point of digital conversion of our collections. We can benefit by extending our digital purview. Often the case that the needs of researchers and faculty feed back into user stories that feed back into development and digital conversion practices. How can the conversion process enable ADA compliance when digitizing AV--how can we envision closed-caption metadata being used by communities? If a researcher needs to convey data visually, how can we enable that?
DIKW pyramid - knowledge production and wisdom. Structurally, a triangle; functionally, it implies a transformation between each of these levels.
Data vs Info
Data: factual info; has no value in isolation because it needs a context.
Info: processing data to make it meaningful by contextualizing it.
In libraries and archives, we are stewards of data and info, and add value by making it accessible and presenting it in dynamic ways for researchers to develop knowledge and wisdom.
Knowledge - synthesis of info to convey accumulated learning
Wisdom - added value of experience, morality, etc.
Our work contributes to the wisdom of society and how things develop and change over time. Sees digitization as the foundation of this pyramid, but it permeates all of DIKW.
Trends in digital production?
Digital scholarship. Data visualization, Geospatial and temporal mapping, text mining.
Digital humanities division of NEH: 3D modeling for textile collections (Ball State); image analysis for archival discovery (U Nebraska-Lincoln). Digital conversion was the basis: extraction of machine-readable data for analysis.
The demands of digital humanities research influence the digitization process. The capability of conversion to produce data types influences scholarly research capabilities.
To be part of this process, need foundational infrastructures.
- platforms for online exhibitions
- applications for use and reuse of content
- interoperability frameworks
- open access to data and information
Data conversion is integral
In service of knowledge production, it enables preservation of data in ways that can be used in digital scholarship.
Knowing what is needed requires interaction with many--librarians, IT, researchers, etc.
Building the Discography of American Historical Recordings into a 78rpm Mass Digitization Project
David Seubert (UCSB), Curator of the Performing Arts Collection, Project Director and PI for this project
Build a database of historical sound recordings.
History of the ADP/DAHR: moved to UCSB in 2003
Current scope is large - documents 5 American record labels, 1892-1941, and growing. Authoritative data from original sources. A prismatic view of the data. Been online since about 2007 as a database. Great, but silent. The genesis of DAHR's mass digitization started with LoC. Gratis license from Sony in NYC - stream anything for free prior to 1945 (10,000 recordings). A source of data for a lot of different things: can repurpose for MARC and for digital projects like a speaking discography. A master recording registry, like Hathi is for books. Collaborative and multi-institutional. Systematic digitization, not random: building the discography from the best collections, the deeply curated ones. Packard Humanities Institute support. [Handout] Multiple streams of data inputs, then content inputs from library collections, then metadata management systems, then to the digitization lab, then various online platforms. Released DAHR MARC records.
An ecosystem of providing access. No separate units to do this. Primary source inputs like company recording ledgers; create the database from this data. Turned into authoritative metadata. Work in FileMaker databases. From the beginning, thought like librarians, so not a lot of back-end work to be done. Not compliant with RDA but easy to make compliant. Second data input stream: published discographies (UCSB buys perpetual access - buying perpetual digital rights). A partner in India takes PDFs of books; they have a parallel database (SQL) on their servers, with lots of folks keying into a parallel database that mirrors it exactly; can pull from that and upload with minor editing. Decca: 85,000 master recordings.
UCSB collections - want to be able to pull any record off of the shelf and put it into the digital workflow without a one-off, labor-intensive process. Lightweight data entry system (same database): a cataloger goes through and enters data off the label (no consultation of primary sources; make a note of that). The master registry database (MySQL) was originally created as a web-based database because of working with LoC (would otherwise have used FileMaker, but synchronizing is a headache). A technician making the selection of analog to digitize can see which ones haven't been digitized and do the selection; then onto a cart to go down to the digital lab.
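A minimal sketch of what that "not yet digitized" selection could look like against a MySQL registry; the table and column names (discs, series, digitized_date, etc.) and the connection details are assumptions for illustration, not the actual DAHR schema.

    # Hypothetical query against the master registry: list discs in a series
    # with no digitized audio yet, so a technician can pull them for the lab.
    import mysql.connector  # assumes the MySQL Connector/Python driver

    conn = mysql.connector.connect(
        host="localhost", user="registry", password="secret", database="dahr"
    )
    cur = conn.cursor()
    cur.execute(
        """
        SELECT call_number, label_name, matrix_number
        FROM discs
        WHERE series = %s AND digitized_date IS NULL
        ORDER BY call_number
        """,
        ("Victor 16000",),
    )
    for call_number, label_name, matrix_number in cur.fetchall():
        print(call_number, label_name, matrix_number)
    conn.close()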
This isn't automated yet: the master management system. We do it by series, in a spreadsheet.
Revamping the workflow: the prior cost was $21/side, excluding expensive metadata. Multiple turntables for multiple ingest. Automated everything possible--writing Python. A workflow management tool and QC module in FileMaker that staff can use for quality control; the new cost is $7/side.
Audio is a time-based medium and has to be captured in real time. A student cleans the discs. Run 2 machines at once. Workflow management tabs; all of the stuff is automated; scanning a barcode initiates all the processes. More than 2 turntables is hard. Set up one while the other is ingesting, entering technical metadata into the FileMaker database. With those spinning, we capture audio 48 out of every 60 minutes. Image capture - script-driven ingest process: the operator puts the disc under the camera, the camera snaps a photo, and Python moves the file to crop and deskew it and sends it to the data repository.
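A rough sketch of that script-driven image step under stated assumptions: a scanned barcode names the capture, the raw photo is pulled from the camera's drop folder, cropped (a production script would also deskew), and staged for the repository. The folder paths, file naming, and crop box are hypothetical, not the actual UCSB scripts.

    # Hypothetical barcode-driven label-image ingest: move the camera's capture,
    # crop it, and stage both files for delivery to the repository.
    import shutil
    from pathlib import Path
    from PIL import Image  # Pillow

    CAMERA_DROP = Path("/captures/incoming")
    STAGING = Path("/captures/staging")

    def ingest_label_image(barcode: str) -> Path:
        raw = CAMERA_DROP / f"{barcode}.tif"      # camera writes one file per shot
        out = STAGING / f"{barcode}_label.tif"
        img = Image.open(raw)
        # Placeholder crop box; a real script would detect the disc edge
        # and deskew before cropping.
        img.crop((200, 200, 3800, 3800)).save(out)
        shutil.move(str(raw), str(STAGING / f"{barcode}_raw.tif"))  # keep the original
        return out

    # ingest_label_image("ucsb_78_0001234")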
The QC module is a window into where the data is stored. Look at the label image, listen to the audio file, and check it. The thing that kills projects is cleaning up the 1% of errors, so get away from that time-intensive work: anything that hits an error goes back to the beginning to be redone; no high-touch fixing.
Not built yet: the publishing engine. You have to be able to publish content in real time, not batch it for a checkpoint. The principle: click where you want to publish it to, hit Go, then Python packages it up to submit to DAHR or YouTube or the Internet Archive, or the repository.
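The publishing engine isn't built yet, but the principle described above could reduce to a small dispatcher like the sketch below: bundle a disc's files, then hand the package to each selected destination. The target names, paths, and packaging are hypothetical placeholders; each real uploader would wrap that platform's own API or deposit route.

    # Hypothetical "hit Go" publisher: tar up a disc's files and push the
    # package toward each selected destination.
    import tarfile
    from pathlib import Path

    STAGING = Path("/captures/staging")

    def package(barcode: str) -> Path:
        bundle = STAGING / f"{barcode}.tar"
        with tarfile.open(bundle, "w") as tar:
            for f in STAGING.glob(f"{barcode}*"):   # audio, label image, metadata
                if f != bundle:
                    tar.add(f, arcname=f.name)
        return bundle

    def publish(barcode: str, targets: list[str]) -> None:
        bundle = package(barcode)
        for target in targets:
            # A real uploader per target (DAHR, YouTube, Internet Archive, repository)
            # would go here; this sketch only reports what it would do.
            print(f"would submit {bundle.name} to {target}")

    # publish("ucsb_78_0001234", ["DAHR", "Internet Archive"])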
Need to finish the build out of publishing engine.
Complete negotiations with record labels - the recordings are not copyrighted, but they are not public domain either. Sound recordings were not eligible for federal copyright prior to 1972 (the actual sound recording, that is; the written sheet music is covered).
Need to select a digital platform (dependent on above)
[Sample on YouTube]
Q: Can others use this data?
A: Not yet; we're beginning to provide data for those doing big data and digital humanities, but there is no XML output option.
Fed Doc Big Data Big Items
Lynne Grigsby (UCB)
Most UCs are repositories.
You get it, you keep it, you let people see it. UC got permission to change "you keep it" part.
Goals: shared print collection at UC, FedDocs collection at HathiTrust, clear shelf space
In the beginning:
records from NRLF and SRLF
took what they said were FedDocs
Goal: single copy at an RLF; others sent to Google for destructive scanning or offered
Lists of monographs, lists of serials
All things we'd done before
But... a mess.
Regroup
Determined:
- would go through OCLC to determine if it was a FedDoc (means we get ALL records from campuses)
- merge only on OCLC number (knowing the issues of imperfection, possible dupes; see the sketch after this list)
- hired a developer with experience with data but not MARC - experience in munging lots of data
- realized we needed more info about the records (locations to exclude; different campuses enter info differently in the records)
- wrote detailed specs, redesigned the workflow
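A minimal sketch of the merge-on-OCLC-number step, assuming the campus records have already been reduced to simple rows; the field names and prefix handling are illustrative, not the project's actual code.

    # Hypothetical merge of campus holdings on OCLC number alone.
    # Imperfect numbers and duplicates are expected; they simply cluster together.
    from collections import defaultdict

    def merge_on_oclc(records):
        """records: iterable of dicts with 'oclc', 'campus', and 'location' keys."""
        merged = defaultdict(list)
        for rec in records:
            digits = (rec.get("oclc") or "").strip().lstrip("ocmnOCMN()L")  # drop prefixes like ocm/ocn/(OCoLC)
            if digits.isdigit():
                merged[int(digits)].append((rec["campus"], rec["location"]))
        return merged

    holdings = merge_on_oclc([
        {"oclc": "ocm01234567", "campus": "UCB", "location": "NRLF"},
        {"oclc": "1234567", "campus": "UCLA", "location": "SRLF"},
    ])
    # Both rows cluster under OCLC 1234567 -> candidates for one shared-print copy.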
Big Items--Large Foldouts
USGS - for example, professional papers
Soil Surveys
Maps critical part of documents
Check out, suppress, disbind (tough for archivists!), scan, QC, send to HathiTrust, get public domain records from HT, convert to electronic, load to OskiCat (library catalog).
Scanned maps with wide scanners (Canon document feed, and a WideTEK 48" large-format scanner)
Just USGS.
Submitted to Hathi -
628 volumes
244,138 pages
5.4 TB
*Everything still has one print copy at an RLF designated as shared print. If available in shared print, check your catalog and decide what to do with your copy: either keep it, send it for scanning, or pull it from the shelves.
HathiTrust has a FedDoc advisory group - the goal is to make these truly accessible to the public. Gap filling - a registry of all FedDocs that exist (10 institutions are on that advisory group) - to ID the most important titles and make sure all the info is there to say something is complete. Appendix with a proposed project schedule for UC libraries on different campuses.