Morning Workshop: Regular Expressions (Digital Humanities Summer Institute #DHSI18)

DHSI Morning Workshop: Regular Expressions by John Simpson
Description: Regular Expressions are a powerful tool for searching text to find patterns of characters. They are often used to extract postal codes, phone numbers, and emails from large sets of documents and when combined with a little bit of scripting they can turn tedious and error prone work done “by hand” into fast, effective, and automatic searching. In this workshop you will learn the basic syntax for regular expressions and deploy them to extract useful information in cases where doing it “by hand” would be tedious.

Point browser to https://regex101.com/ and to gutenberg.org/ebooks/13

Text version of The Hunting of the Snark.
Most of the workshop should be discussion dialog.
cwrc.ca/rsc-src
Regex good for matching patterns of characters

A PDF document in the background is a lot of XML. A lot of that stuff is not helpful, lots of XML vomit of individual lines, but you can use it to zoom in on a particular piece of text. On a website you can find the text, and there are tools for that. But once you get the paragraph, unless there's TEI markup of every word, you're stuck: a blob of raw text that you want to extract data from.

cwrc.ca/rsc-src - a list of all the libraries and archives in Canada: where they are, when they opened, when they closed, etc. Think of it as a phone book for libraries and archives. The source was a PDF, copyrighted and protected/locked, and a pair of grad students were doing the extraction by hand. Instead: get the XML form of the PDF, then extract the addresses. By hacking between XML and regex, it was possible to extract all the addresses.

Happens in background, unspoken, gets the data that powers the pretty things.

Get text of a thing (Snark) - get plain text UTF-8. Copy and paste into regex101. Use this tool because it abstracts away the syntax of the host language. On the command line, or in C, R, Python, etc., you can use regex; today this cuts that away. Also, you'd need to build the regex in those tools, and I don't know anyone who writes it in Python first to see if it works. This tool will explain the regex, run the search, and give a quick help guide.

4 Principles

  1. Effective vs efficient - efficient is terse beautiful tight piece of code that only the ultimate guru could read in an instant and it just works. Then our code that spans many lines, gets job done. Just get in and get out. Today: effective. 
  2. Know your data. If you don't know your data, won't know if you found what you want or not.
  3. Start greedy and then get conservative. Much easier to ask for too much and then say this is the stuff I don't want and cut out, than to be missing what you need because overly constrained. 
  4. Specificity is speed. Balance with #1. If you are scraping data in massive quantities and running complicated regular expressions - something that takes 5 seconds vs .5 seconds can add up over hundreds of thousands of documents. 

Hunt snarks. "snark" only has one match, which is not right. The difference is between lowercase and uppercase: you just typed a string of characters, and that is a regular expression. CTRL+F is the same sort of thing - exact character matching. The power comes when we can put patterns in there, for when we want S or s. "Snark" with a capital gets 35 matches in 306 steps. Steps: the engine is doing character matching. A regex is just a command telling the engine to match characters, and it goes character by character. When it finds a lowercase s, it checks whether the next letter is n, then whether the next is an a - and each check requires more steps.
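A minimal Python sketch of the same point, assuming a short made-up sample string rather than the full poem (so the counts differ from the 35 above):

```python
import re

# Hypothetical sample text, not the actual poem.
sample = "The Snark was a Boojum. No snarking! A Snark, a Snark."

# A plain string is already a regular expression: it matches exactly
# those characters, in that order, with that case.
upper_hits = re.findall("Snark", sample)
lower_hits = re.findall("snark", sample)

print(len(upper_hits))  # 3
print(len(lower_hits))  # 1 (inside "snarking")
```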

The space matters. Look for "Snark " with a trailing space: out of 35, 11 have a space after them. Space at the beginning, wherever: it demands precision. Secret silent deadly space at the end! If you want all cases where Snark is at the end of a line, you will miss the ones where the line ends with spaces used for formatting. Super picky: cases and spaces matter. Settings->whitespace will show you the spaces.
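The trailing-space trap can be sketched in Python. The three lines below are invented for illustration, and the tolerant pattern is just one possible way around the problem, not something shown in the workshop:

```python
import re

# Hypothetical lines; the second has invisible trailing spaces.
sample = "they hunted the Snark\nthey charmed the Snark  \nthe Snark was a Boojum"

# With re.MULTILINE, $ matches at the end of each line.
strict = re.findall(r"Snark$", sample, flags=re.MULTILINE)

# Allowing optional spaces or tabs before the line end catches both.
tolerant = re.findall(r"Snark[ \t]*$", sample, flags=re.MULTILINE)

print(len(strict))    # 1
print(len(tolerant))  # 2
```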

A tab is a text character represented typically as \t sometimes called a token. "\" is read as an escape character and says "next character is not its usual self" so it reads \t as a tab. Sometimes will show up as an ->, dots, or bigger space. We can search them.
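A small Python sketch of searching for tabs, assuming a made-up tab-separated line:

```python
import re

# Hypothetical tab-separated line; "\t" in a Python string is a real tab.
line = "Bellman\tcaptain\tbell"

# r"\t" in the pattern is the regex token for a tab character.
tabs = re.findall(r"\t", line)
fields = re.split(r"\t", line)

print(len(tabs))  # 2
print(fields)     # ['Bellman', 'captain', 'bell']
```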

Only one lowercase snark, in "snarking"

To find all snarks, uppercase or lowercase: depends on the environment you deploy in, Python or R. On the right side of the regex bar is the flag menu - turn on case insensitivity. Where is case insensitivity not OK? Proper names that are also common nouns, languages where case makes it a different word, a name that is also an adjective (Brown), any time the capital carries information. Does the title count as part of the text? Acronyms. How many possible capitalization variants are there for Snark? Our group said 5! = 120. (Oops: S can be s or S, n can be n or N, and so on: 2x2x2x2x2 = 2^5 = 32. Our answer is for if the letters can be out of order.) 32. That's a lot and it's just a 5-letter word. This is why it's important to start greedy: it's easy to forget what others may have done, especially with large volumes of text.
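The 2^5 count and the case-insensitive flag can both be checked in Python. This is a sketch on an invented sample sentence; re.IGNORECASE is Python's counterpart to the flag in regex101:

```python
import itertools
import re

# Every way to capitalize the five letters of "snark": 2 choices per letter.
variants = {
    "".join(chars)
    for chars in itertools.product(*[(c.lower(), c.upper()) for c in "snark"])
}
print(len(variants))  # 32

# Hypothetical sample text, not the actual poem.
sample = "The Snark, the SNARK, and some snarking."

# One case-insensitive search instead of 32 separate ones.
hits = re.findall("snark", sample, flags=re.IGNORECASE)
print(hits)  # ['Snark', 'SNARK', 'snark']
```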

Turn off case insensitivity. Can search for each kind, then add it up in your programming language. The pipe is the "or" equivalent, used inclusively. Snark|snark: up to 59,000 steps. Can search for my string or the other string. Can search for all the other variants: Snark|snark|SNARK (53,000 steps, takes a while). Can also do either/or with square brackets: [Ss]nark (faster, fewer steps). (S|s)nark - 89,000 steps. Gives you the full match but also the substring in the reader pane. Round parentheses group things (like math: applies to these things, applies first). Square brackets are used for a set, a single object. Treated differently by the engine.
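The three spellings of "S or s" side by side, as a Python sketch on an invented sentence. Step counts are a regex101 feature and are not reproduced here; the point is only that all three find the same text, and that the parenthesized version also exposes the grouped substring:

```python
import re

# Hypothetical sample text, not the actual poem.
sample = "The Snark went snarking after another Snark."

alternation = re.findall(r"Snark|snark", sample)
char_class = re.findall(r"[Ss]nark", sample)

# With a capturing group, findall returns the group (the substring),
# so finditer is used here to show both the full match and the group.
grouped = [(m.group(0), m.group(1)) for m in re.finditer(r"(S|s)nark", sample)]

print(alternation)  # ['Snark', 'snark', 'Snark']
print(char_class)   # ['Snark', 'snark', 'Snark']
print(grouped)      # [('Snark', 'S'), ('snark', 's'), ('Snark', 'S')]
```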

[] denotes a set of characters from which one is chosen at that position; whatever comes after it completes the match. Can have multiple instances of this. Can feed it specific characters, but can also pass a range: [a-z]nark or [A-Z]. [A-z] will run through ALL the letters, capital and not, but it also picks up the six ASCII characters that sit between Z and a ([ \ ] ^ _ `), so [A-Za-z] is the safer way to say "any letter". Can't state [a-Z] because that is out of order in the ASCII character set. Can do [A-z0-9] because it reads it as A-z and 0-9.

[A-z][A-z]ark will find "ark" with any two leading letters.
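A quick Python check of how ranges behave, on made-up strings. Worth seeing once: [A-z] covers ASCII codes 65 to 122, so it lets a few punctuation characters through along with the letters.

```python
import re

# [A-z] includes [ \ ] ^ _ ` as well as the letters;
# [A-Za-z] is letters only.
loose = re.findall(r"[A-z]", "a_Z^9")
strict = re.findall(r"[A-Za-z]", "a_Z^9")
print(loose)   # ['a', '_', 'Z', '^']
print(strict)  # ['a', 'Z']

# Two letters from a set, then the literal "ark".
sample = "The Snark saw a shark in the park."
two_plus_ark = re.findall(r"[A-Za-z][A-Za-z]ark", sample)
print(two_plus_ark)  # ['Snark', 'shark']
```

"park" is skipped because only one letter comes before "ark".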

How would you find and capture every entire word ending in "ing" in the poem
[A-z] needs to be infinitized. What about words ending in punctuation, a space, or a line break?
Token for a word character is \w - Help says \w matches any word character (equal to [a-zA-Z0-9_])
\wing grabs a single character before ing; on its own \w doesn't infinitize the capture before "ing".
Could do \w\w\w\w\w\w but ick. Nicer: quantifiers. + means one or more and * means zero or more, so: \w+
\w+
matches any word character (equal to [a-zA-Z0-9_])
Now we can search for character strings, then sets of letters with square brackets, saw how to use the or pipe, now we can use some tokens (\w for "all the word characters"), and now the quantifiers. Two you'll use a lot: the plus and the asterisk.

\w+ing -- reading this: is any word character /one or more times /followed by an i followed by an n followed by a g .
[\w-]+ing -- reading this as "the set of all/any word characters or a hyphen" "repeated one or more times" "followed by lowercase i followed by lowercase n followed by lowercase g"

need to say "this is the end of the word" - can use space at the end, but also period, etc
Word boundary token is \b
\b[\w-]+ing\b
The leading \b reduces the number of steps; the trailing \b says where the word stops.
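The build-up from \wing to \b[\w-]+ing\b, sketched in Python on an invented sentence (not a line from the poem):

```python
import re

# Hypothetical sample sentence.
sample = "Charming things kept happening to the lace-making king."

# One word character, then "ing".
one_char = re.findall(r"\wing", sample)

# One or more word characters, then "ing" (can stop mid-word).
one_or_more = re.findall(r"\w+ing", sample)

# Word characters or hyphens, anchored to word boundaries on both sides.
bounded = re.findall(r"\b[\w-]+ing\b", sample)

print(one_char)     # ['ming', 'hing', 'ning', 'king', 'king']
print(one_or_more)  # ['Charming', 'thing', 'happening', 'making', 'king']
print(bounded)      # ['Charming', 'happening', 'lace-making', 'king']
```

Note that "things" drops out of the last result: the trailing \b refuses a match where "ing" is followed by another letter.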


If you only want words of 6 or fewer letters in total that end in -ing, you need a count. Terser case first: all 6-letter words that end in "ing".
\b\w{3}ing\b -- word boundary, any word character for the three places before a lowercase i followed by lowercase n followed by lowercase g.
Can also pass a range - all 6, 7, and 8 letter words: \b\w{3,5}ing\b - or anything 3 or more would be {3,}. Watch the space! Many engines read {3, 5} with a space as literal text, not a quantifier. Can't do {,5} here; write {0,5} instead.
\b(\w{5}|\w{3})ing\b leaves out the 4 before (sing-ing, land-ing, etc)
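The counted quantifiers, sketched in Python on an invented sentence. Two assumptions to flag: the alternation is written with a non-capturing group (?:...) so that findall returns whole words, and the last line shows how Python's re module in particular treats a space inside the braces.

```python
import re

# Hypothetical sample sentence, not taken from the poem.
sample = "The king was baking, singing and charming every thing."

six_letters = re.findall(r"\b\w{3}ing\b", sample)
six_to_eight = re.findall(r"\b\w{3,5}ing\b", sample)
six_or_fewer = re.findall(r"\b\w{0,3}ing\b", sample)

# (?:...) groups without capturing, so findall returns whole matches.
three_or_five = re.findall(r"\b(?:\w{5}|\w{3})ing\b", sample)

# In Python's re module a space inside the braces turns the quantifier
# into literal text, so this finds nothing in the sample.
with_space = re.findall(r"\b\w{3, 5}ing\b", sample)

print(six_letters)    # ['baking']
print(six_to_eight)   # ['baking', 'singing', 'charming']
print(six_or_fewer)   # ['king', 'baking', 'thing']
print(three_or_five)  # ['baking', 'charming']
print(with_space)     # []
```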

Aside problems: thing-um-a-jig (don't want) and bathing-machine and lace-making (want)
Most regex you can do, it's the niggling things that require more.

Advanced territory - how do you deal with the hyphens. Keep lace-making in full (capturing "making") but ditch "thing-um-a-jig"

Can export as JSON, CSV, or plain text. Might need to change the max execution time from the default of 10 seconds.

\b\w+ing\b

How do we capture all of the -ing words appropriately thing-um-a-jig (don't want) and bathing-machine and lace-making (want). What characteristics do they have that you want to grab onto, what features does it have that would allow us to keep or toss

Capture all previous to ING as long as continuous string for something like lace-making
Get rid of all after ING as long as continuous string

To capture lace-making --\b[\w-]+ing\b[^-]

Word boundary, any character from the set that includes any word character and set includes hyphen, any amount of times one or more, followed by ing, followed by word boundary, followed by a set that does not include hyphen. Immune to adding more hyphens in middle.

We need a way to say "not" - not these things. The caret does that: "nothing in this set can be here". [^-] means the next character can't be a hyphen. Note that in the syntax [^ is ONE thing when seen by the engine: [^ means "A NEGATING SET".
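What the negated set does in practice, as a Python sketch. The sample line is invented, built from the three problem words:

```python
import re

# Hypothetical line built from the three hyphenated problem words.
sample = "a thing-um-a-jig, a bathing-machine and lace-making too"

# [^-] must match one real character that is not a hyphen, so that
# character becomes part of the match.
matches = re.findall(r"\b[\w-]+ing\b[^-]", sample)
print(matches)  # ['lace-making ']
```

Two things to notice: the match carries the trailing space, and "bathing-machine" is rejected along with "thing-um-a-jig" because its "ing" is also followed by a hyphen.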

Fancier way using negative lookaheads and lookbehinds - we won't do those today.

Last challenge today is to write a regular expression to capture all and only the words in this poem (not the punctuation)

Extract the words (what is a word)
WAIT -
Our solution \b[\w-]+ing\b[^-] captures space after words

Read \b[\w-]*ing(?!-|\w)\b -- word boundary, followed by any character from this set (any word character or hyphen) repeated zero or more times, ending in "ing". Then (?! is a negative lookahead: when I've made all these matches, I look ahead, and if I don't like what I see, drop it and don't return the match. Here -|\w means "a hyphen or a word character", so the "ing" can't be followed by either; it can only be followed by a word boundary.

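\b[\w-]*ing(?!\w)\b - can't have any word characters after the "ing". Note: an empty alternative, as in (?!|\w), always matches, so that lookahead would reject everything.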
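The two lookahead variants compared in Python, on an invented line built from the three problem words. Unlike [^-], a lookahead checks the next character without making it part of the match:

```python
import re

# Hypothetical line built from the three hyphenated problem words.
sample = "a thing-um-a-jig, a bathing-machine and lace-making too"

# "ing" may not be followed by a hyphen or a word character.
no_hyphen_after = re.findall(r"\b[\w-]*ing(?!-|\w)\b", sample)

# "ing" may not be followed by a word character (hyphens allowed).
no_word_after = re.findall(r"\b[\w-]*ing(?!\w)\b", sample)

print(no_hyphen_after)  # ['lace-making']
print(no_word_after)    # ['thing', 'bathing', 'lace-making']
```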

Now - rip out all the words from the text
Starting with \b[\w-]+ing\b[^-]

Our solution: \b[\w-]+\b[^_|\W]

Problems: possessives are gone (it treats "it's" as two words), and the underscores of _was_ in the text

Walkthrough solution (note: had to grab the smart quotes from the text because we don't have them):

\b(\w+[-’']?)+\b this solution doesn't get rid of the underscores of _was_
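The walkthrough pattern run in Python on an invented line. The underscore cleanup at the end is an assumption added for illustration, one possible follow-up step in a script rather than part of the workshop solution:

```python
import re

# Hypothetical sample line mixing a straight and a curly apostrophe.
sample = "It's the Bellman’s thing-um-a-jig, and it _was_ a Boojum."

pattern = r"\b(\w+[-’']?)+\b"

# finditer + group(0) gives the full match; findall would return only
# the last repetition of the capturing group.
words = [m.group(0) for m in re.finditer(pattern, sample)]
print(words)

# One possible cleanup (an assumption, not from the workshop):
# strip leading and trailing underscores after matching.
cleaned = [w.strip("_") for w in words]
print(cleaned)
```

Possessives and hyphenated words survive as single words, and "_was_" comes back with its underscores until the cleanup step removes them.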

? = zero or one of. A quantifier without the parentheses
Lookahead -as long as there are no double hyphens, treat it as a word and make a match

--------------------

Character we haven't seen today: the dot - . Understood as any character whatsoever (unless seen as \.) Then .+ means any character whatsoever over and over and over which will eat the entire document. SO: .nark
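The dot in Python, sketched on a made-up two-line string. One caveat: by default in Python the dot stops at a newline, so .+ eats a line at a time; with re.DOTALL it eats everything.

```python
import re

# Hypothetical two-line sample.
sample = "Snark\nshark? 3nark. nark"

# Any single character followed by "nark".
one_char = re.findall(r".nark", sample)

# \. is a literal full stop, not "any character".
literal_dot = re.findall(r"nark\.", sample)

# .+ is greedy: the whole line by default, the whole text with DOTALL.
per_line = re.findall(r".+", sample)
whole = re.findall(r".+", sample, flags=re.DOTALL)

print(one_char)     # ['Snark', '3nark', ' nark']
print(literal_dot)  # ['nark.']
print(per_line)     # ['Snark', 'shark? 3nark. nark']
print(whole == [sample])  # True
```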

------------------
Regex101 - code generator on right hand side, export on left.











