helps you use the command line to work through common challenges that come up when working with digital primary sources.
. . . click on a button and it will show you a basic command, broken down to show what each piece does.
. . . for more descriptive breakdowns, explore with explainshell.
. . . before executing commands, review and install dependencies.
. . . consider contributing.
Sometimes you want to take text data out of searchable pdfs, so you can begin to use text analysis methods and tools.
This will convert all text searchable pdfs in a folder to text files.
for file in *.pdf; do pdftotext "$file" "${file%.*}.txt"; done
Sometimes you want to generate text data from images of pages, so you can begin to use text analysis methods and tools.
This will convert all tiff images of pages in a folder to text files.
for i in *.tiff; do tesseract "$i" "yourfoldername_$i"; done
Sometimes you want to convert PDF images of pages to TIFF images of pages in order to perform optical character recognition (OCR) using Tesseract. Tesseract can in turn be used to generate plain text data from the TIFF images of pages.
This will convert all PDF images of pages in a folder to TIFF images of pages.
mogrify -type Grayscale -compress lzw -density 300 -format tiff -depth 8 "*.pdf"
Sometimes you want to remove HTML markup from webpages you save, so you can begin to use text analysis methods and tools.
This will remove markup from HTML files and convert them to txt files. The textutil command is only available on macOS.
textutil -convert txt *.html
If you take notes in markdown, sometimes you want to publish them as html.
This will convert a md file to an html file.
pandoc foobar.md -f markdown -t html -s -o foobar.html
Sometimes you want to generate a simple slideshow from a text file that uses markdown syntax.
This will convert a txt file to an html slideshow.
pandoc -s --webtex -i -t slidy input_filename -o slideshow_name.html
If you have a bunch of PDFs of images, studying them as images computationally is easier if you change them to an image format.
This will split multipage PDFs and convert them to individual PNG files.
find ./ -name "*.pdf" -exec mogrify -format png {} \;
Sometimes you have a segment of a video that you want to convert to an animated gif for the web, or even for a bit of presentation fun.
This will extract a segment (here, the first 13 seconds) from an mp4 video file and convert it into an animated gif.
ffmpeg -ss 0 -t 13 -i inputfile.mp4 outputfile.gif
Sometimes you want to extract images from pdfs and store them as separate files.
This will extract all the images from pdfs and save them as separate .png files. The number of the page on which each image was found will be included in the file name.
for f in *.pdf; do pdfimages -p -png "$f" "${f%.*}"; done
Contributed by Silvia Gutiérrez.
Sometimes you have a large video and want to resize it to make it more accessible.
This will resize a video by dimension.
ffmpeg -i inputfile.mp4 -vf scale=640:480 outputfile.mp4
Sometimes you have a video file and want to extract the audio.
This will extract audio from a video file and create an mp3.
ffmpeg -i inputfile.mp4 -b:a 192K -vn outputfile.mp3
Sometimes you want to search across files and combine results as a basis for data analysis.
This will find every line that contains a string of letters or numbers (here, the digits 1850 to 1859) in multiple files and save the results in a single file.
grep -E '185[0-9]' *.txt > foobar.txt
Sometimes you want to search across sub-directories and preserve the contexts in which something occurs.
This will find a string of letters or numbers in all files below the current directory, keep up to 50 characters either side of each match, and output the results to the shell.
grep -E -r -o '.{0,50}\b185[0-9]\b.{0,50}' ./
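As a follow-on to the two searches above, here is a minimal sketch of tallying how often each matched year occurs across files. The temporary folder, the file names a.txt and b.txt, and their contents are made up for illustration. The -o flag prints only the matched text and -h leaves out the file names, so that sort and uniq -c can count the years.

```shell
# Create two small sample files (hypothetical data, for illustration only).
workdir=$(mktemp -d)
printf 'Born in 1851, moved in 1857.\n' > "$workdir/a.txt"
printf 'Letters dated 1851 and 1860.\n' > "$workdir/b.txt"

# Print each matching year on its own line, then count how often each occurs,
# most frequent first.
year_counts=$(grep -E -o -h '185[0-9]' "$workdir"/*.txt | sort | uniq -c | sort -rn)
echo "$year_counts"
```

With this sample data the output lists 1851 with a count of 2, followed by 1857 with a count of 1. The year 1860 is not counted because it does not match the pattern.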
Sometimes you want to normalize a text, so you can use text analysis methods and tools with greater precision.
This will delete all punctuation from a txt file.
tr -d '[:punct:]' < foobar-in.txt > foobar-out.txt
Sometimes you want to normalize the case of a text, so you can use text analysis methods and tools with greater precision.
This will convert all uppercase letters in a txt file to lowercase.
tr '[:upper:]' '[:lower:]' < foobar-in.txt > foobar-out.txt
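The two normalization steps above can also be chained with a pipe, so that no intermediate file is needed. This is a minimal sketch: the sample sentence and the variable names are made up for illustration.

```shell
# Sample text (made up for illustration).
sample='The Cat, the DOG; and the cat!'

# Delete punctuation, then lowercase, by piping one tr command into the next.
normalized=$(printf '%s\n' "$sample" | tr -d '[:punct:]' | tr '[:upper:]' '[:lower:]')
echo "$normalized"
# prints: the cat the dog and the cat
```

The character classes are quoted so that the shell passes them to tr unchanged instead of trying to match them against file names.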
Sometimes you end up working with csv files that were produced using Microsoft Excel. Excel often introduces irregularities into the data structure. If, for example, you wanted to import a csv produced using Excel into sqlite, it would fail, given that Excel uses \r to indicate new lines in the data rather than the \n that sqlite expects.
This will replace all the \r line breaks with \n line breaks.
tr -s '\r' '\n' < original.csv > normalized.csv
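To check that the conversion worked, one option is to count lines before and after. This is a minimal sketch on a made-up three-row file in a temporary folder. It relies on wc -l counting \n characters, so a file that uses only \r reports zero lines.

```shell
workdir=$(mktemp -d)
# Hypothetical sample: three rows, each ended by a carriage return only.
printf 'id,name\r1,Ada\r2,Grace\r' > "$workdir/original.csv"

# Replace \r with \n, squeezing any run of repeated line breaks into one.
tr -s '\r' '\n' < "$workdir/original.csv" > "$workdir/normalized.csv"

# wc -l counts \n characters, so the counts show whether the rows are now separated.
lines_before=$(wc -l < "$workdir/original.csv" | tr -d ' ')
lines_after=$(wc -l < "$workdir/normalized.csv" | tr -d ' ')
echo "before: $lines_before, after: $lines_after"
# prints: before: 0, after: 3
```

Because of the -s option, empty lines in the original file are also squeezed out of the result.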
Sometimes you have text files that contain a particular string of text that you want removed or replaced, for example the start of a URL, but the text file is too big to be opened in your favorite graphical text editor.
From a list of URLs following the pattern http://viaf.org/viaf/24597135, this will substitute http://viaf.org/viaf/ with ID:. To remove the string instead of replacing it, leave the replacement empty, as in the second command.
sed 's/http:\/\/viaf.org\/viaf\//ID:/' < input.txt > output.txt
sed 's/http:\/\/viaf.org\/viaf\///' < input.txt > output.txt
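When the text to replace contains many slashes, sed accepts another character as the delimiter, which avoids the backslash escapes. Below is a minimal sketch of the same substitution written with | as the delimiter. The second identifier in the sample input is made up for illustration.

```shell
# Two sample URLs; the second identifier is hypothetical.
converted=$(printf 'http://viaf.org/viaf/24597135\nhttp://viaf.org/viaf/100000001\n' |
  sed 's|http://viaf\.org/viaf/|ID:|')
echo "$converted"
# prints:
# ID:24597135
# ID:100000001
```

The dot in viaf.org is escaped here so that it matches a literal dot rather than any character.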
Sometimes you want a list of files and to give those files numeric IDs. This can be a useful first step in building a database.
This will find all the files in the directory 'folder' (and all subdirectories), assign those files unique IDs, and save the output as a csv file that can be opened as a spreadsheet.
i=0; find folder -type f | while IFS= read -r image; do echo "$i,$image"; i=$((i+1)); done > foobar.csv
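The order in which find lists files is not guaranteed, so the same folder can receive different IDs on different runs. One possible variation, sketched below, sorts the paths before numbering them. The temporary folder and the .tif file names are made up for illustration, and LC_ALL=C is an assumption chosen to make the sort order the same on every machine.

```shell
# Build a small sample tree (hypothetical file names).
workdir=$(mktemp -d)
mkdir -p "$workdir/folder/sub"
touch "$workdir/folder/b.tif" "$workdir/folder/a.tif" "$workdir/folder/sub/c.tif"

# Sort the paths first so that IDs are assigned in a repeatable order.
(
  cd "$workdir" || exit 1
  i=0
  find folder -type f | LC_ALL=C sort | while IFS= read -r path; do
    echo "$i,$path"
    i=$((i+1))
  done
) > "$workdir/foobar.csv"
cat "$workdir/foobar.csv"
# prints:
# 0,folder/a.tif
# 1,folder/b.tif
# 2,folder/sub/c.tif
```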
Sometimes you want to download many documents or images that are linked from a webpage.
This will use a list of item URLs to automate download of items.
wget -w 10 --limit-rate=20k -i item_urllist.txt
Sometimes you want to traverse multiple levels of a website to download certain file formats like images.
This will save all the .jpg and .jpeg files within three links from 'http://www.foo.bar'.
wget -w 1 -r -l 3 -A jpeg,jpg --limit-rate=100k http://www.foo.bar
Sometimes you want to save work on the command line, so you can remember what steps you took with your data.
This will save command line history to a txt file.
history > history.txt
Sometimes you can't remember a shell command but know you've used it more than once in the recent past.
This will create an output of the 20 most common commands in your shell history.
history | cut -c 8- | sort | uniq -c | sort -rn | head -n "${1:-20}"
Sometimes you need to edit many file names, so that you can make them more consistent.
This will remove the first occurrence of 'foobar' from the name of every txt file in a folder. Replace 'foobar' with whatever string you want to remove.
for file in *.txt; do mv "$file" "${file/foobar/}"; done
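Because renaming cannot easily be undone, it can help to preview the changes first. This is a minimal sketch of a dry run: putting echo in front of mv prints each command instead of running it. The temporary folder and the file names are made up for illustration, and the ${file/foobar/} substitution assumes the bash shell.

```shell
# Create two sample files (hypothetical names).
workdir=$(mktemp -d)
touch "$workdir/letter_foobar_01.txt" "$workdir/letter_foobar_02.txt"

# Dry run: echo prints each mv command without renaming anything.
preview=$(cd "$workdir" && for file in *.txt; do echo mv "$file" "${file/foobar/}"; done)
echo "$preview"
# prints:
# mv letter_foobar_01.txt letter__01.txt
# mv letter_foobar_02.txt letter__02.txt
```

Once the printed commands look right, removing echo performs the renaming.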
CC-BY Thomas Padilla and James Baker, adapted from ffmprovisr and Script Ahoy