Text Analysis Pedagogy Institute

In partnership with:

Funding for this 24-month project provided by:
The National Endowment for the Humanities

Funding received: September 2020

We are now planning TAP Institute 2023. For updates, join our mailing list.

TAP Institute Participants: A public directory of teachers and researchers from the TAP Institute who have chosen to share their information publicly for networking. (Add your information!)
2021-2022 TAP Institute Whitepaper: A paper describing the project's findings
Even more DH Text Analysis Teaching/Learning Materials

Open Educational Resources

All resources are licensed

Beginner Courses

Python Basics 1-5

Nathan Kelber

Course Materials

This course is appropriate for complete beginners who have never programmed or done text analysis before.
Taught from a humanist perspective, this introductory course will help you write your first code and unlock the potential of text analysis.

Introduction to R Programming

Jacalyn Huband

Course Materials

This course is appropriate for complete beginners who have never programmed or done text analysis before.
This course is a gentle introduction to R programming. With an emphasis on text analysis, this course will help you begin your adventures in programming.

A Gentle Introduction to Optical Character Recognition with PyTesseract

Hannah Jacobs

Course Materials

Python Basics required
This course will introduce the concept of “Optical Character Recognition” (OCR), various tools available for performing OCR, and important considerations for successfully OCRing digitized text. Using Tesseract in Python, we’ll walk through the entire process using a variety of examples to show the range of challenges scholars can face when performing OCR. By the end of the course, participants should be able to use the course’s Jupyter Notebooks to perform OCR on their own; they should be able to identify possible technical challenges presented by specific texts and propose potential solutions; and they should be able to assess the degree of accuracy they have achieved in performing OCR.

A Practical Guide to Text Data Curation

Xanda Schofield

Course Materials

Python Basics required
No matter how exciting your research question is or how fancy your models are, all text analysis projects depend on having text data that is tidy enough to analyze. This course surveys some practices of text data curation to filter out irrelevant text, refine a corpus vocabulary, and identify text artifacts in real-world text collections. We will explore how to approach these tasks using Python libraries such as NLTK and spaCy, and how some text models, like LDA topic models, can serve as tools for diagnosing recurring corpus issues.
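
As a rough illustration of two of the curation tasks named above (filtering out irrelevant text and refining a corpus vocabulary), here is a minimal sketch in plain Python. It is not taken from the course materials and deliberately avoids NLTK and spaCy; the function names, the thresholds, and the sample documents are all hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def refine_vocabulary(docs, min_docs=2, max_doc_share=0.9):
    """Keep words found in at least `min_docs` documents but in no more
    than `max_doc_share` of all documents (hypothetical thresholds)."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))  # count each word once per document
    limit = max_doc_share * len(docs)
    return {word for word, n in doc_freq.items() if min_docs <= n <= limit}


# Hypothetical mini-corpus containing a stray artifact ("pagebreak").
docs = [
    "The whale surfaced near the ship.",
    "The ship sailed at dawn pagebreak",
    "The captain watched the whale.",
    "The crew repaired the ship.",
]

vocab = refine_vocabulary(docs)
# Drop every token that did not survive the vocabulary filter.
filtered = [[t for t in tokenize(d) if t in vocab] for d in docs]
print(sorted(vocab))
```

Document-frequency cutoffs like these remove both one-off artifacts and words so common that they carry little signal. Suitable thresholds depend on the corpus.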

Web Scraping and Text Analysis in Bilingual Social Media

Rubria Rocha De Luna

Course Materials

Requires Facebook account. No prior programming experience required.
This course teaches attendees how to scrape social media posts, download the information in CSV format, clean it, and do basic analysis such as word frequency. We will work through exercises with posts in Spanish, English, or Spanglish taken from Facebook pages belonging to organizations of migrants returned to Mexico. We will use tools such as Facepager, Notepad, Word, and RStudio.

Intermediate Courses

Python Intermediate 1-4

Nathan Kelber and Zhuo Chen

Course Materials

Python Basics required
An introduction to intermediate Python skills including comprehensions, working with .txt, .csv, and .json files, navigating filepaths with pathlib, and object-oriented programming (OOP).
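
To give a sense of how these skills fit together, here is a minimal sketch that uses only the Python standard library. It is not taken from the course materials; the `Corpus` class, the file names, and the sample texts are hypothetical.

```python
import csv
import json
import tempfile
from pathlib import Path


class Corpus:
    """A small class (OOP) that gathers the .txt files in a folder."""

    def __init__(self, folder):
        self.folder = Path(folder)

    def word_counts(self):
        # Dictionary comprehension: file name -> number of words in the file.
        return {
            path.name: len(path.read_text(encoding="utf-8").split())
            for path in sorted(self.folder.glob("*.txt"))
        }


# Build a throwaway folder so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "a.txt").write_text("call me Ishmael", encoding="utf-8")
    (folder / "b.txt").write_text("it was the best of times", encoding="utf-8")

    counts = Corpus(folder).word_counts()

    # Save the same results as JSON and as CSV.
    (folder / "counts.json").write_text(json.dumps(counts), encoding="utf-8")
    with open(folder / "counts.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["file", "words"])
        writer.writerows(counts.items())

    # Read both files back in to confirm the round trip.
    reloaded = json.loads((folder / "counts.json").read_text(encoding="utf-8"))
    with open(folder / "counts.csv", newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))

print(counts)
```

Note that values read back from the CSV file are strings, while the JSON file preserves the integer counts.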

Data Analysis with Pandas

Melanie Walsh

Course Materials

Python Basics required
This workshop will introduce students to a popular Python package known as Pandas, a tool for data analysis and manipulation that is widely used among data scientists. Participants will learn how to work with CSV files and JSON files, how to filter and aggregate data, how to make bar charts and time series plots, how to merge datasets with common values, and more. All case studies and examples will feature data relevant to the humanities, such as (potentially) library circulation data, screenplay data, and social media data.

Visualizing Humanities Data

Zoe LeBlanc

Course Materials

Python Basics required
This course will introduce participants to some of the foundations and horizons of visualizing humanities data. To help us generate datasets we will lightly explore some text analysis methods, and then focus on some of the possibilities and pitfalls of visualizing data derived from these methods. In particular, this course will introduce participants to the principles of the grammar of graphics and exploratory data analysis through using the Python library Altair and Jupyter Notebooks. The goal of this course is to help participants learn how to incorporate visualizing humanities data into their research workflows, for both sharing aggregated information and making arguments.

Text Analysis in Ancient/Medieval Languages

William Mattingly

Course Materials

Python Basics required
This workshop will introduce students to natural language processing (NLP) and text analysis in ancient and medieval languages. We will use Latin as a case study. Day 1 will focus on the basics of NLP and spaCy, one of the leading NLP libraries for Python. Day 2 will address the textual problems of working with ancient/medieval languages, including how to handle highly inflected languages; lemmatization without a lemmatizer; and accounting for textual, geographical, and temporal variances of the language. Day 3 will address a single text analysis problem: named entity recognition (NER) in Latin. On this final day, we will develop a workflow for solving this problem. Students will leave this workshop with a strong understanding of NLP and NER. They will also have an understanding of how to solve text analysis problems in highly inflected or dead languages. Students will be provided with the resources for further learning. Finally, students will leave the workshop with a working NER model that they can use and improve in the future.

Working with Twitter Data

Melanie Walsh

Course Materials

Python Basics and command line experience recommended.
This course will prepare students to collect, analyze, and visualize Twitter data. Students will learn how to work with the Twitter API and with the Python library twarc, one of the most popular tools for Twitter data. We will also introduce basic text analysis methods that are appropriate for short documents like tweets. Participants who are eligible for the Academic Research Track of the Twitter API will have the opportunity to work with the entire historical archive of tweets (2006-2022).

Introduction to Natural Language Processing with spaCy

William Mattingly

Course Materials

Python Basics required
This course will introduce the key concepts of natural language processing (NLP) and an NLP Python library, spaCy. spaCy allows users to build robust pipelines for text analysis. On Day 1, we will learn about NLP concepts and how to install and use the spaCy library generally. On Day 2, we will learn how to use spaCy to identify linguistic features within a document. On Day 3, we will learn how to apply those features to solve real-world information extraction problems.

Multilingual Newspaper Data and Visualizations

Sylvia Fernández Quintanilla

Course Materials

No prior programming experience required.
In this course, attendees will learn to perform close-reading text analysis on bilingual (Spanish and English) newspapers hosted in various digital repositories; create and clean bilingual datasets; select and edit images from the newspapers; and adapt these datasets for visualizations (mapping, timelines, and networks) that approach the material through time, space, and cultural and historical contexts. We will use tools like Excel, OpenRefine, Carto, Timeline JS, and GraphCommons.

Introduction to Pandas (William Mattingly)

William Mattingly

Course Materials

Python Basics required.
This course introduces students to working with tabular data in Python through the Pandas library. On Day 1, you will learn how to install and import Pandas; you will also learn about some of its basic features, such as the DataFrame. Day 2 will focus on finding, organizing, and sorting data. Day 3 will focus on advanced searching methods, such as filtering, querying, grouping, and GroupBy. A few additional lessons will be provided on plotting data in Pandas.

Advanced Courses

Intro to Machine Learning

William Mattingly

Course Materials

Python Basics required. Introduction to NLP with spaCy is recommended.
This workshop will introduce students to machine learning (ML), from its early beginnings to its modern applications; students will also be introduced to a branch of ML known as deep learning. We will specifically address how ML can be used to solve text-based problems. Day 1 will focus on the basics of ML, the key concepts and terms that practitioners must know. Day 2 will be dedicated to a common ML problem: text classification. Day 3 will focus on an adjacent problem: topic modeling. On both days, students will be provided a workflow for solving these problems. Students will leave this workshop with a firm conceptual understanding of ML and a basic understanding of how to engage in ML via Python. Finally, students will be provided with the resources for further learning.

Intro to Machine Learning

Grant Glass

Course Materials

Python Basics required. Knowledge of Pandas recommended.
This course will introduce you to many of the Machine Learning techniques available for analyzing textual data in Python. You will be introduced to the theory and method of Machine Learning and given some practical skills on how to write and execute machine learning code in Python. Some basic experience with Python will be required for participation in the class coding projects, but feel free to join us if you want to have a better understanding of what Machine Learning techniques can do for humanists. Generally speaking, this class will help you think about humanities problems through the lens of Machine Learning.

Named Entity Recognition

Zoe LeBlanc

Course Materials

Python Basics required
This course will introduce participants to one of the core areas of natural language processing: named entity recognition. While annotating datasets with set standards is one of the oldest areas of DH research (particularly with the Text Encoding Initiative), this course will focus on some of the newer approaches for identifying and annotating objects of interest in any given text. The course will focus on using the Python library spaCy, both with its built-in functionality and with extensions for more specific uses. While this course is taught in English, participants are encouraged to bring sources in multiple languages. Ultimately, participants will learn both how to leverage NER in their research and how to tailor NER to their specific textual sources.

Machine Learning for Humanists

Grant Glass

Course Materials

Python Basics required. Knowledge of Pandas recommended.
This course will introduce students to the variety of machine learning (ML) algorithms available for textual analysis. Throughout the three days of the course, we will address how ML can be used to solve text-based problems. Day 1 will focus on the basics of ML, and students will use supervised learning to work through a research question. Day 2 will be dedicated to a common ML technique: topic modeling. Day 3 will focus on more advanced techniques such as using language models to classify text. Every day, students will be provided a workflow for using these techniques on their own research questions.

Introduction to Multilingual Named Entity Recognition

William Mattingly

Course Materials

Python Basics required. Introduction to NLP with spaCy is recommended.
This course will introduce students to named entity recognition, with an emphasis on multilingual documents. On Day 1, we will address some of the common issues one faces in handling multilingual documents, such as inconsistent text encoding and text standardization, and look at some of the current state-of-the-art transformer-based language models. We will also meet some of the key features of spaCy’s NER pipelines. On Day 2, we will jump into rules-based NER with spaCy. On Day 3, we will explore machine learning (ML) based NER in spaCy. Here, we will learn the essentials of creating good datasets for training NER models.
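
To make the idea of rules-based NER concrete, here is a minimal sketch built on a lookup list (a gazetteer) and Python's standard `re` module. It is not spaCy and is not taken from the course materials; the gazetteer entries, the labels, and the `find_entities` function are hypothetical.

```python
import re

# Hypothetical gazetteer: surface form -> entity label.
# "Roma" and "Rome" are listed separately to cover Latin and English text.
GAZETTEER = {
    "Roma": "PLACE",
    "Rome": "PLACE",
    "Cicero": "PERSON",
    "Marcus Tullius Cicero": "PERSON",
}


def find_entities(text, gazetteer=GAZETTEER):
    """Return (start, end, text, label) for every gazetteer match."""
    # Try longer names first so "Marcus Tullius Cicero" wins over "Cicero".
    names = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, names)) + r")\b")
    return [
        (m.start(), m.end(), m.group(), gazetteer[m.group()])
        for m in pattern.finditer(text)
    ]


entities = find_entities("Marcus Tullius Cicero left Rome; Cicero in Roma mansit.")
for start, end, name, label in entities:
    print(start, end, name, label)
```

A rule list like this only finds the exact forms it contains, so inflected forms such as "Romam" or "Ciceronis" would be missed. That limitation is one reason to combine rules with trained models.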

How to do Things with Topic Models

Rafael Alvarado

Course Materials

Python Basics and Python Intermediate recommended
This workshop will introduce students to the concept of topic models and how they have been used to advance humanistic research. Topics to be covered include topic models as a general task in text analytics, creating topic models from scratch using Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF), visualizing their results, evaluating their performance, and interpreting their results. In addition, students will be exposed to examples of how topic models have been used in humanistic and social science research. Work will be conducted using Python 3 and Jupyter Notebooks.

Even More DH Text Analysis Teaching/Learning Materials

PythonHumanities.com by William Mattingly
Programming Historian by various authors
The Carpentries by various authors
Digital Humanities Research Institutes by various authors
Computational Humanities Research
YaleDHLab Lab Workshops
Jupyter notebooks for digital humanities curated by Quinn Dombrowski
Data Sitter's Club by various authors
HathiTrust Digital Library Collections and Tools
Documenting the Now

Books on Python, Text Analysis, and DH

Automate the Boring Stuff with Python: Practical Programming for Total Beginners (2019) by Al Sweigart
Python Crash Course: A Hands-On, Project-Based Introduction to Programming (2019) by Eric Matthes
Machine Learning with Python Cookbook (2018) by Chris Albon
Natural Language Processing in Action (2019) by Hobson Lane, Cole Howard, and Hannes Max Hapke
Humanities Data Analysis: Case Studies with Python by Folgert Karsdorp, Mike Kestemont, and Allen Riddell
Technical Textbooks List by Scott B. Weingart
Introduction to Named Entity Recognition by William Mattingly

Books on Data Ethics

Algorithms of Oppression (2018) by Safiya Noble
Race After Technology (2019) by Ruha Benjamin
Data Feminism (2020) by Catherine D'Ignazio and Lauren F. Klein

Course Examples

Humanities Analytics by Matt Lavin
Introduction to Cultural Analytics and Python by Melanie Walsh
CodeLab by Shane Lin, Zoe LeBlanc, and Brandon Walsh
Computational and Inferential Thinking: The Foundations of Data Science by Ani Adhikari, John DeNero, David Wagner