the sourcecaster

helps you use the command line to work through common challenges that come up when working with digital primary sources.

. . . click on a button and it will show you a basic command, broken down to show what each piece does.

. . . for more descriptive breakdowns, explore with explainshell.

. . . before executing commands, review and install dependencies.

. . . consider contributing.

casting

pdf to txt

Sometimes you want to take text data out of searchable pdfs, so you can begin to use text analysis methods and tools.

This will convert all text searchable pdfs in a folder to text files.

for file in *.pdf; do pdftotext "$file" "${file%.*}.txt"; done

for file in: for every file in the folder
*.pdf: that ends in the extension pdf
do pdftotext: use the pdftotext program to extract the embedded text from each pdf
"$file" "${file%.*}.txt": read each pdf and name the new text file after it, with the original extension removed (%.*) and .txt added

tiff to txt

Sometimes you want to generate text data from images of pages, so you can begin to use text analysis methods and tools.

This will convert all tiff images of pages in a folder to text files.

for i in *.tiff; do tesseract "$i" "yourfoldername_$i"; done

for i in: for every file in the folder
*.tiff: that ends in the extension tiff
do tesseract: use the tesseract program to OCR each tiff image
"$i" "yourfoldername_$i"; done: read each image and prepend yourfoldername_ to the name of each text file produced by tesseract (the quotes keep file names with spaces intact)

pdf to tiff

Sometimes you want to convert PDF images of pages to TIFF images of pages in order to perform optical character recognition (OCR) using Tesseract. Tesseract can in turn be used to generate plain text data from the TIFF images of pages.

This will convert all PDF images of pages in a folder to TIFF images of pages.

mogrify -type Grayscale -compress lzw -density 300 -format tiff -depth 8 "*.pdf"

mogrify: convert all files in folder
-type Grayscale: specify Grayscale image type
-compress lzw: use LZW compression
-density 300: specify pixel density of 300
-format tiff: generate tiff files
-depth 8: at 8 bit depth
"*.pdf": from all pdf files

html to txt

Sometimes you want to remove HTML markup from webpages you save, so you can begin to use text analysis methods and tools.

This will remove markup from HTML files and convert them to txt files.

textutil -convert txt *.html

textutil: start textutil and
-convert: use the convert function
txt *.html: to convert all files in the current directory that end in .html to txt files

markdown to html

If you take notes in markdown, sometimes you want to publish them as html.

This will convert a md file to an html file.

pandoc foobar.md -f markdown -t html -s -o foobar.html

pandoc: Use pandoc to convert
foobar.md: the foobar.md file
-f markdown: by specifying the input file format as markdown
-t html: with desired output as html
-s -o foobar.html: to output a standalone foobar.html file

txt to slideshow

Sometimes you want to generate a simple slideshow from a text file that uses markdown syntax.

This will convert a txt file to an html slideshow.

pandoc -s --webtex -i -t slidy input_filename -o slideshow_name.html

pandoc: Use pandoc to
-s --webtex: create a standalone document that can render TeX formulas
-i: support incremental display of slideshow sections
-t slidy: specify output format slidy
input_filename: specify input filename
-o slideshow_name.html: output slideshow file

pdf to png

If you have a bunch of PDFs of images, studying them as images computationally is easier if you change them to an image format.

This will split multipage PDFs and convert them to individual PNG files.

find ./ -name "*.pdf" -exec mogrify -format png {} \;

find ./ -name "*.pdf": find all files that end in the pdf extension
-exec mogrify -format png {} \;: and use mogrify to convert them to png files

mp4 to animated gif

Sometimes you have a segment of a video that you want to convert to an animated gif for the web, or even for presentation fun.

This will cut a segment from an mp4 video file and convert it into an animated gif.

ffmpeg -ss 0 -t 13 -i inputfile.mp4 outputfile.gif

ffmpeg: start ffmpeg and
-ss 0: specify the start of the animated gif at second 0 of the file
-t 13: specify a 13 second duration from the designated start
-i inputfile.mp4 outputfile.gif: designate the input mp4 file and name the output gif file

extracting images from pdf files

Sometimes you want to extract images from pdfs and store them as separate files.

This will extract all the images from pdfs and save them as separate .png files. The page on which each image was found will be encoded in the file name.

for f in *.pdf; do pdfimages -p -png "$f" "${f%.*}"; done

for f in: for every file (f) in the folder
*.pdf: with a .pdf extension
do pdfimages: use pdfimages, an open-source command-line utility for extracting images
-p -png: add the page where each image was found to its file name (-p) and save it as a png (-png)
"$f" "${f%.*}"; done: read every file ("$f") and name the images the same way as the file but without the pdf extension ("${f%.*}")

Contributed by silvia gutiérrez.
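
Both this loop and the pdf to txt loop rely on the ${f%.*} expansion to build output names. A minimal sketch of what that expansion produces, using made-up file names; root_name is a hypothetical helper, and neither pdfimages nor pdftotext is run here:

```shell
# root_name is a hypothetical helper: it prints a file name minus its last extension.
root_name() {
  # %.* removes the shortest trailing match of ".anything"
  printf '%s\n' "${1%.*}"
}

root_name "letters 1850.pdf"
root_name "diary.v2.pdf"
```

Only the last extension is removed, so diary.v2.pdf becomes diary.v2. Quoting the argument keeps names that contain spaces in one piece.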

big mp4 to smaller mp4

Sometimes you have a large video and want to resize it to make it more accessible.

This will resize a video by dimension.

ffmpeg -i inputfile.mp4 -vf scale=640:480 outputfile.mp4

ffmpeg: start ffmpeg and
-i inputfile.mp4: designate input mp4 file
-vf scale=640:480: specify dimensions of the resized file
outputfile.mp4: name resized file

mp4 to mp3

Sometimes you have a video file and want to extract the audio.

This will extract audio from a video file and create an mp3.

ffmpeg -i inputfile.mp4 -b:a 192K -vn outputfile.mp3

ffmpeg: start ffmpeg and
-i inputfile.mp4: designate input mp4 file
-b:a 192K: set audio bitrate at 192K
-vn: leave the video stream out of the output
outputfile.mp3: name the output mp3 file

wrangling

search and combine

Sometimes you want to search across files and combine results as a basis for data analysis.

This will find a string of letters or numbers in multiple files and save the results in a single file.

egrep '185[0-9]' *.txt > foobar.txt

egrep '185[0-9]': look for any date between 1850 and 1859
*.txt: in every file in this directory that ends in the extension .txt
> foobar.txt: and save all lines that match to foobar.txt
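
A minimal sketch of the same search run against throwaway sample files, so the shape of the combined output is visible. The file names and their contents are invented for illustration; grep -E is equivalent to egrep, and the results are written to a file that does not end in .txt so that the search pattern cannot pick up its own output:

```shell
# Work in a temporary directory so no real files are touched.
workdir=$(mktemp -d)
printf 'Born in 1853.\nMoved in 1861.\n' > "$workdir/a.txt"
printf 'Letter dated 1857.\n' > "$workdir/b.txt"

# Search every .txt file and collect the matching lines in one place.
grep -E '185[0-9]' "$workdir"/*.txt > "$workdir/matches.out"
cat "$workdir/matches.out"
```

Because more than one file is searched, each saved line is prefixed with the name of the file it came from, which helps when tracing a match back to its source.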

contextual search

Sometimes you want to search across sub-directories and preserve the contexts in which something occurs.

This will find a string of letters or numbers in multiple files, save 50 characters either side of the match, and output the results to the shell.

egrep -r -o '.{0,50}\b185[0-9]\b.{0,50}' ./

egrep -r -o: look recursively in this directory and all sub-directories and return only
'.{0,50}\b185[0-9]\b.{0,50}' ./: the dates 1850-1859, as whole words (\b), with up to fifty characters either side
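
A minimal sketch of the contextual search on an invented sample file, with the window narrowed to 10 characters so the effect is easy to see. The directory layout and sentence are assumptions, -h is added only to hide the file name in the output, and \b is assumed to be supported by your grep (it is in GNU grep):

```shell
# Build a small directory tree with one sample file in a sub-directory.
workdir=$(mktemp -d)
mkdir -p "$workdir/sub"
printf 'The ship sailed in 1853 from Liverpool.\n' > "$workdir/sub/log.txt"

# -r searches sub-directories, -o prints only the matched text,
# and {0,10} keeps up to 10 characters either side of the date.
context=$(grep -E -r -o -h '.{0,10}\b185[0-9]\b.{0,10}' "$workdir")
echo "$context"
```

Widening {0,10} back to {0,50} gives the fifty-character window used in the command above.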

remove punctuation

Sometimes you want to normalize a text, so you can use text analysis methods and tools with greater precision.

This will delete all punctuation from a txt file.

tr -d '[:punct:]' < foobar-in.txt > foobar-out.txt

tr -d '[:punct:]': delete all punctuation (the quotes stop the shell from treating the brackets as a file name pattern)
< foobar-in.txt: from foobar-in.txt
> foobar-out.txt: and save as foobar-out.txt
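
A minimal sketch of the same deletion applied to a single invented sentence instead of a file, which is a quick way to check what counts as punctuation before running it on real data:

```shell
sample='Dear Sir, I write (in haste!) from St. Louis; all is well.'

# Feed the sample to tr on standard input and capture the result.
cleaned=$(printf '%s\n' "$sample" | tr -d '[:punct:]')
echo "$cleaned"
```

Note that the full stop in an abbreviation such as St. is removed along with everything else, so abbreviations and hyphenated words may need separate handling.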

normalize case

Sometimes you want to normalize the case of a text, so you can use text analysis methods and tools with greater precision.

This will convert all upper case letters in a txt file to lower case.

tr '[:upper:]' '[:lower:]' < foobar-in.txt > foobar-out.txt

tr '[:upper:]' '[:lower:]': transform upper case letters to lower case letters
< foobar-in.txt: in foobar-in.txt
> foobar-out.txt: and save as foobar-out.txt
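
Removing punctuation and normalizing case are often done together. A minimal sketch that chains the two tr commands with a pipe, using an invented sentence in place of a file:

```shell
sample='Dear Sir, I write from St. Louis.'

# First strip punctuation, then lower-case what is left.
normalized=$(printf '%s\n' "$sample" | tr -d '[:punct:]' | tr '[:upper:]' '[:lower:]')
echo "$normalized"
```

The order of the two steps does not matter here, since neither one changes the characters the other acts on.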

normalize line breaks, excel

Sometimes you end up working with csv files that were produced using Microsoft Excel. Excel often introduces irregularities into the data structure. If for example, you wanted to import a csv produced using Excel into sqlite, it would fail given that Excel uses \r to indicate new lines in the data rather than the \n that sqlite expects.

This will replace all the \r line breaks with \n line breaks.

cat original.csv | tr -s '\r' '\n' > normalized.csv

cat original.csv: Input the original csv
tr -s '\r' '\n': replace all \r line breaks with \n line breaks, squeezing runs of repeated line breaks into one (-s)
> normalized.csv: save normalized data into new file
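
A minimal sketch of the conversion on invented csv data, fed in with printf rather than read from a file. It also shows an alternative for files that use \r\n pairs, where deleting the \r is enough; which kind of line break your file has is an assumption to check first:

```shell
# Records separated by a lone \r: translate each \r to \n.
from_cr=$(printf 'id,name\r1,Ada\r2,Grace\r' | tr -s '\r' '\n')
echo "$from_cr"

# Records separated by \r\n pairs: deleting \r leaves clean \n line breaks.
from_crlf=$(printf 'id,name\r\n1,Ada\r\n' | tr -d '\r')
echo "$from_crlf"
```

Because -s squeezes repeated line breaks, any intentionally empty lines in the original are collapsed as well.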

find & replace characters

Sometimes you have text files that contain a particular string of text that you want removed or replaced, for example the start of a URL, but the text file is too big to be opened in your favorite graphical editor.

From a list of URLs following the pattern http://viaf.org/viaf/24597135, this will substitute http://viaf.org/viaf/ with ID:. To remove something instead of replacing it, leave the replacement empty, as in the second command.

sed 's/http:\/\/viaf.org\/viaf\//ID:/' < input.txt > output.txt
sed 's/http:\/\/viaf.org\/viaf\///' < input.txt > output.txt

sed 's/http:\/\/viaf.org\/viaf\//ID:/': Pattern 's/whatwelookfor/whatwereplaceitby/', while escaping the slashes
< input.txt: the input
> output.txt: save the modified output in a file called output.txt
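
sed accepts other characters as the delimiter, which avoids escaping every slash in a URL. A minimal sketch using | as the delimiter on a single sample line; the dot is escaped so that it matches only a literal dot:

```shell
url='http://viaf.org/viaf/24597135'

# s|old|new| works like s/old/new/ but leaves the slashes readable.
id_line=$(printf '%s\n' "$url" | sed 's|http://viaf\.org/viaf/|ID:|')
echo "$id_line"
```

Without a trailing g, only the first match on each line is replaced, which is what a list with one URL per line needs.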

list files and give them IDs

Sometimes you want a list of files and to give those files numeric IDs. This can be a useful first step in building a database.

This will find all the files in the directory 'folder' (and all subdirectories), assign those files unique IDs, and save the output as a spreadsheet.

i=0; find folder -type f | while read image; do echo "$i,$image"; i=$((i+1)); done > foobar.csv

i=0; find folder -type f |: start a count at zero, find every file in the directory 'folder' and all subdirectories, and pass that list along
while read image; do echo "$i,$image"; i=$((i+1)); done: read the first file found and assign to it the ID 0. Repeat for each subsequent file, increasing the ID by 1 each time
> foobar.csv: save the output in a file called foobar.csv
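
A minimal sketch of the same loop run on an invented folder, so the resulting csv can be inspected. The file names are made up; sort is an addition that keeps the ID order repeatable between runs, and IFS= read -r is used so that unusual file names are read unchanged:

```shell
# Create a small sample tree in a temporary directory.
workdir=$(mktemp -d)
mkdir -p "$workdir/folder/sub"
touch "$workdir/folder/a.tif" "$workdir/folder/sub/b.tif"

i=0
find "$workdir/folder" -type f | sort | while IFS= read -r image; do
  echo "$i,$image"   # one row per file: ID, then path
  i=$((i+1))
done > "$workdir/files.csv"
cat "$workdir/files.csv"
```

Starting from i=1 instead of i=0 gives IDs that begin at 1, which some databases prefer.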

getting

scrape web based items, using list

Sometimes you want to download many documents or images that are linked from a webpage.

This will use a list of item URLs to automate download of items.

wget -w 10 --limit-rate=20k -i item_urllist.txt

wget: Start wget and specify
-w 10: wait 10 seconds between requests to the target server
--limit-rate=20k: limit the download rate to 20 kilobytes per second
-i item_urllist.txt: input and use a plain text file that contains URLs for each item you want to download
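
The list file is plain text with one URL per line. A minimal sketch of generating one when item URLs follow a numbered pattern; the pattern http://www.foo.bar/items/N.jpg and the ID range are assumptions for illustration, and nothing is downloaded here:

```shell
# Write the list into a temporary directory.
listdir=$(mktemp -d)

# One URL per line, built from a hypothetical numbered pattern.
for id in 1 2 3; do
  echo "http://www.foo.bar/items/$id.jpg"
done > "$listdir/item_urllist.txt"
cat "$listdir/item_urllist.txt"
```

The resulting file can then be passed to wget with -i in place of item_urllist.txt.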

scrape web based items, by format

Sometimes you want to traverse multiple levels of a website to download certain file formats like images.

This will save all the .jpg and .jpeg files within three links from 'http://www.foo.bar'.

wget -w 1 -r -l 3 -A jpeg,jpg --limit-rate=100k http://www.foo.bar

wget: Start wget and specify
-w 1: wait 1 second between requests to the target server
-r -l 3: retrieve matching items recursively within three links from 'http://www.foo.bar'
-A jpeg,jpg: only download jpeg and jpg files
--limit-rate=100k: limit the download rate to 100k per second

managing

save command line history

Sometimes you want to save work on the command line, so you can remember what steps you took with your data.

This will save command line history to a txt file.

history > history.txt

history: take command line history
> history.txt: and save it to a history.txt file

find common scripts

Sometimes you can't remember a shell command but know you've used it more than once in the recent past.

This will list the 20 most common commands in your shell history.

history | cut -c 8- | sort | uniq -c | sort -rn | head -${1:-20}

history: take command line history
cut -c 8-: remove the line numbers that history adds, keeping everything from the 8th character on
sort: sort what is left so that all similar lines are adjacent
uniq -c: count unique lines
sort -rn: reverse sort the output numerically
head -${1:-20}: print the top 20 lines (${1:-20} falls back to 20 when no first argument is set)
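
The history command only has content in an interactive shell, so a minimal sketch of the pipeline is easier to follow on stand-in data. The sample below imitates the way bash prints history (a line number padded to five columns, then two spaces), which is an assumption about your shell, and it keeps the top 2 lines rather than 20:

```shell
# Stand-in for the output of `history`; the commands are invented.
sample_history='    1  ls -l
    2  cd data
    3  ls -l
    4  grep 1853 a.txt
    5  ls -l
    6  cd data'

# Strip the numbers, group identical commands, count them, and rank by count.
top=$(printf '%s\n' "$sample_history" | cut -c 8- | sort | uniq -c | sort -rn | head -n 2)
echo "$top"
```

Each output line is a count followed by the command, with the most frequently used command first.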

edit many filenames

Sometimes you need to edit many file names, so that you can make them more consistent.

This will remove 'foobar', or whatever string you put in its place, from the name of every txt file in the folder.

for file in *.txt; do mv "$file" "${file/foobar/}"; done

for file in: for every file in the folder
*.txt: that ends in the extension .txt
do mv: use the mv (move) command to rename each file
"$file" "${file/foobar/}"; done: replacing the first 'foobar' in each file name, or whatever string you put in its place, with nothing