Let’s explore the text of Vasari’s Lives of the Artists.
The full text of an English translation can be found on Project Gutenberg.
We can use a shell script to download the files to a local folder:
#!/bin/bash

DESTDIR="vasari"
BOOKS="25326 25759 26860 28420 28421 28422 31845 31938 32362 33203"

mkdir -p "${DESTDIR}"
pushd "${DESTDIR}"
for ebook in ${BOOKS}
do
    wget "http://www.gutenberg.org/ebooks/${ebook}.txt"
done
popd
And then we can use R’s “tm” library to load the files:
## For "dissimilar" in tm
## install.packages("proxy")
## For "plot" on dtm in tm
## source("http://bioconductor.org/biocLite.R")
## biocLite("Rgraphviz")
## install.packages("tm")
library(tm)

dir <- "vasari"
prefix <- "pg"
extension <- ".txt"

## These are the Project Gutenberg book numbers for Lives Of The Artists
## Skip volume 10 (33203), as it doesn't follow the same format
books <- c(25326, 25759, 26860, 28420, 28421, 28422, 31845, 31938, 32362)

## Make a file path for a book
bookPath <- function(id){
    paste(dir, "/", prefix, id, extension, sep="")
}

## Load a single file into a string
loadFile <- function(filename){
    readChar(filename, file.info(filename)$size)
}

## Load several files
loadFiles <- function(filenames){
    sapply(filenames, loadFile, USE.NAMES=FALSE)
}

## Load the texts
texts <- loadFiles(sapply(books, bookPath))
We can then extract the entry for each artist, clean up the data, create a corpus, and then clean up the corpus:
## Extract entries on each artist
extractArtists <- function(text){
    ## Split each file into blocks at "\nLIFE OF"
    artists <- unlist(strsplit(text, "\nLIFE OF"))
    ## Discard the first block, which is the introduction
    ## The last block will be discarded by the article cleaning function
    artists[2:length(artists)]
}

## Clean footnotes, etc. from article text
cleanArticle <- function(text){
    ## Truncate at \nFOOTNOTES:
    text <- unlist(strsplit(text, split="\nFOOTNOTES:"))[1]
    ## Remove [text in square brackets]
    text <- gsub("\\[[^]]+\\]", "", text)
    ## Remove punctuation
    text <- gsub("[[:punct:]]", "", text)
    ## Lowercase words
    text <- tolower(text)
    text
}

## Get the first line of a string
firstLine <- function(text){
    unlist(strsplit(text, "\n"))[1]
}

## The original listing is truncated above; the following is a plausible
## reconstruction of the remaining steps it describes

## Extract and clean each artist's entry
artists <- unlist(sapply(texts, extractArtists, USE.NAMES=FALSE))
artists <- sapply(artists, cleanArticle, USE.NAMES=FALSE)

## Create the corpus, then clean it up
artists.clean <- Corpus(VectorSource(artists))
artists.clean <- tm_map(artists.clean, stripWhitespace)
artists.clean <- tm_map(artists.clean, removeWords, stopwords("english"))
We can then create a term/document matrix (and remove infrequently used terms) to explore the corpus:
## Term/document matrix
dtm <- DocumentTermMatrix(artists.clean)
## Remove infrequent terms to save memory
dtm <- removeSparseTerms(dtm, 0.4)
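As an aside, it can help to see what a term/document matrix looks like at a small scale. The sketch below uses a toy in-memory corpus of made-up sentences (not the Vasari texts) and assumes only that the tm package is installed:

```r
## A toy corpus (invented sentences), just to show the matrix shape
library(tm)

docs <- c("the marble figure was painted",
          "the chapel figure in florence",
          "painted chapel in rome")
corpus <- VCorpus(VectorSource(docs))

## One row per document, one column per distinct term
dtm <- DocumentTermMatrix(corpus)
dim(dtm)        ## 3 documents by however many terms survive tokenising
Terms(dtm)      ## the column names, i.e. the vocabulary

## Terms with total frequency of at least 2 across the corpus
findFreqTerms(dtm, 2)
```

On the real corpus the matrix is far wider, which is why removeSparseTerms is worth running before exploring it.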
We can find frequently used terms:
## Frequent terms in the matrix
findFreqTerms(dtm, 60)
 [1] "lorenzo"    "painted"    "pietro"     "life"       "andrea"
 [6] "francesco"  "giovanni"   "beautiful"  "executed"   "antonio"
[11] "domenico"   "duke"       "marble"     "jacopo"     "church"
[16] "hand"       "little"     "able"       "afterwards" "age"
[21] "art"        "beauty"     "caused"     "chapel"     "christ"
[26] "city"       "day"        "death"      "del"        "design"
[31] "excellent"  "figure"     "figures"    "finished"   "florence"
[36] "friend"     "head"       "held"       "house"      "judgment"
[41] "left"       "likewise"   "manner"     "master"     "messer"
[46] "painter"    "painting"   "palace"     "pictures"   "placed"
[51] "pope"       "reason"     "rome"       "seen"       "sent"
[56] "set"        "time"       "various"    "whereupon"
We can see which words are strongly associated:
findAssocs(dtm, "painting", 0.8)
  painting    painter   painters       hand     little    painted   pictures
      1.00       0.90       0.90       0.89       0.89       0.89       0.89
 beautiful    figures      grace       lady       save     beauty   executed
      0.87       0.87       0.86       0.86       0.86       0.85       0.85
    manner   portrait       time     worthy  excellent      hands   likewise
      0.85       0.85       0.85       0.85       0.84       0.84       0.84
particular       seen       sent      truth       able        art  craftsmen
      0.84       0.84       0.84       0.84       0.83       0.83       0.83
    friend      house       left      lived     living     return        age
      0.83       0.83       0.83       0.83       0.83       0.83       0.82
   besides     christ        day    finally    mention   received      study
      0.82       0.82       0.82       0.82       0.82       0.82       0.82
    chapel       city  diligence excellence       head     honour     master
      0.81       0.81       0.81       0.81       0.81       0.81       0.81
    nature       rome       true       held  wherefore
      0.81       0.81       0.81       0.80       0.80
And we can plot those associations:
## Plot associations between terms
plot(dtm, findFreqTerms(dtm, 120),
     attrs=list(graph=list(),
                node=list(shape="rectangle", fontsize="72",
                          fixedsize="false")))
Which looks like this: