Art Data Analysis: Art & Language

art&lang_index1-01.jpgArt & Language are a conceptual art group founded in the late 1960s in England. Much of their early work didn’t look like art. It was essays, mathematical notation, transcripts of conversations, all different kinds of written materials. Faced with the opportunity to exhibit in a gallery setting to an artworld audience, A&L needed a way of realistically presenting their work in a way that a viewer who hadn’t been part of the original conversations might have a chance of being able to navigate the results.

161.jpgA&L’s solution was to assemble copies of all the texts in filing cabinets and produce an index to them. Texts were given “markers” (tags) and indexes of the relationships between each text’s tags were produced in print or on microfilm. Mainframe computer time was used to create the index for Index 04, although reports differ on which computer was used and whether the index was in fact random or not.
162.jpgThis is an obvious forerunner to Google or It is also a use of what would now be regarded as search technology to produce a genuinely artistic solution to a genuine artistic problem.

Art Data Analysis: Emily Vanderpoel

(Via Ptak Science Blog, which has more information)
cu31924008632964_0322.jpg“Color problems : a practical manual for the lay student of color” (1902) – Emily Noyes Vanderpoel – Download from

Vanderpoel’s colour proportion analyses look like colour quantized and re-ordered low-resolution image pixels. They are a useful historical precedent and visual model for computational analysis of images.

Art Data Analysis: Roger de Piles

(Via Ptak Science Books)

“To his last published work: Cours de peinture par principes avec un balance de peintres (1708) de Piles appended a list of fifty-six major painters in his own
time, with whose work he had acquainted himself as a connoisseur during
his travels.

To each painter in the list he gave marks from 0 to 18 for
composition, drawing, color and expression. This gave an overview of
aesthetic appreciation hingeing on the balance between color and design.
The highest marks went to Raffaello Sanzio and Rubens,
with a slight bias on color for Rubens, a slight bias on drawing for
Raphaël. Painters who scored very badly in anything but color were Giovanni Bellini, Giorgione and remarkably Michelangelo Caravaggio
with 16 on color and 0 (zero) on expression. Painters who fell far
behind Rubens and Raphaël but whose balance between color and design was
perfect were Lucas van Leyden, Sebastian Bourdon, Albrecht Dürer.”

The Wikipedia page reproduces this list as a table from which the values can be easily extracted. Like this:

Andrea del Sarto,12,16,9,8
Federico Barocci,14,15,6,10
Jacopo Bassano,6,8,17,0
Giovanni Bellini,4,6,14,O
Sebastian Bourdon,10,8,8,4
Charles Le Brun,16,16,8,16
I Carracci,15,17,13,13
Cavalier D'Arpino,10,10,6,2
Daniele da Volterra,12,15,5,8
Abraham van Diepenbeeck,11,10,14,6
Il Domenichino,15,17,9,17
Albrecht Dürer,8,10,10,8
Giovanni da Udine,10,8,16,3
Giulio Romano,15,16,4,14
Guido Reni,NA,13,9,12
Jacob Jordaens,10,8,16,6
Lucas Jordaens,13,12,9,6
Giovanni Lanfranco,14,13,10,5
Leonardo da Vinci,15,16,4,14
Lucas van Leyden,8,6,6,4
Otho Venius,13,14,10,10
Palma il Vecchio,5,6,16,0
Palma il Giovane,12,9,14,6
Il Parmigianino,10,15,6,6
Gianfrancesco Penni,O,15,8,0
Perin del Vaga,15,16,7,6
Sebastiano del Piombo,8,13,16,7
Francesco Salviati,13,15,8,8
Eustache Le Sueur,15,15,4,15
Pietro Testa,11,15,0,6
Van Dyck,15,10,17,13
Taddeo Zuccari,13,14,10,9
Federico Zuccari,10,10,8,8

Or you can get a scan of the original text from Google Books here. The table of scores starts on page 408 of the PDF version.

This kind of mock-objective scoring of artworks using whatever system is fun but even at the time was open to ridicule. By Hogarth, for example.

What might be more interesting would be if lots of people scored artworks or artists similarly.

Exploring Art Data 13

Let’s go back and explore one image from the Haystacks series further. We’ll be able to apply these same techniques to the whole series (and to large imagesets) later.

We’ll use the thumbnail of the first image in the series:


haystack-screenshot.pngWe’ve already got a box plot of its brightness and a plot of its palette. Now we will create a brightness histogram, a colour histogram, and measure its brightness, mean of brightness SD, and entropy (visual complexity).

Brightness histogram:

## Plot brightness histogram
breaks=0:255/255, xlim=c(0, 1), xlab="Value", col=grey(0), border=NULL)

brightness-histogram.pngColour Histogram (based on the work of Dr. Sai Chaitanya Gaddam):

## install.packages("plyr") # for ddply
## Colour histogram
colourImageHist<-function(bitmap, binResolution){
scaledRGBs<-floor(rgbs * binResolution)
counts<-ddply(scaledRGBs, .(red, green, blue), nrow)
attr(counts, "binResolution")<-binResolution
colourHist<-colourImageHist(artworksThumbnails[[1]], binResolution)
## Plot colour histogram
values<-apply(colourHist, 1,
function(row){sum(row[c(1, 2, 3)])})
binResolution<-attr(colourHist, "binResolution")
colours<-apply(colourHist, 1,
function(row){rgb(row[1] / binResolution,
row[2] / binResolution,
row[3] / binResolution)})
barplot(countsOrdered, col=coloursOrdered, border=NA)

colour-histogram.pngColour histogram cloud (based on the work of Dr. Sai Chaitanya Gaddam):

## Plot colour cloud ## sigmoid<-function(x){ 1 / (1 + exp(-x)) } colourCloudX<-function(r,g,b){ sigmoid((g - r + 1) / 2) } colourCloudY<-function(r,g,b){ eps<-length(colours) (b + eps)/(r + g + b + 3 * eps) } randomPointInRadius<-function(x, y, radius){ q<-runif(1) * (pi * 2) r<-radius * sqrt(runif(1)) h<-r * cos(q) v<-r * sin(q) c(x + h, y + v) } randomizedBinPoints<-function(xs, ys, radius){ coords<-array(dim=c(length(xs), 2)) for(index in 1:length(xs)){ point<-randomPointInRadius(xs[index], ys[index], radius) coords[index, 1]<-point[1] coords[index, 2]<-point[2] } coords } colourCloud<-function(hist, title="", radius=0.005, dotSize=1){ binResolution<-attr(hist, "binResolution") bins<-hist[c(1,2,3)] / binResolution ## Rep each item by the count that applies to its bin, this allows us to ## process the repeated item lists directly rather than nest loops & counters xs<-rep(apply(bins, 1, function(col){colourCloudX(col[1], col[2], col[3])}), hist$count) ys<-rep(apply(bins, 1, function(col){colourCloudY(col[1], col[2], col[3])}), hist$count) hexes<-rep(apply(bins, 1, function(c){rgb(c[1], c[2], c[3])}), hist$count) coords<-randomizedBinPoints(xs, ys, radius) plot(coords[,1], coords[,2], bg=hexes, main=title, cex=dotSize, col=NULL, pch=21, xlab="", ylab="", xaxt="n", yaxt="n") } colourCloud(colourHist)

colour-cloud.pngImage brightness mean (not very interesting for a single image):


[1] 0.6130143

Mean of image brightness standard deviation:


[1] 0.1069661

Image brightness entropy:

## Calculate image entropy imageEntropy<-function(histogram){ nonzeroCounts<-histogram$counts[histogram$counts > 0] probs<-nonzeroCounts / sum(nonzeroCounts) -sum(probs * log2(probs)) } imageEntropy(hist(imageData(grayscaleArtworksThumbnails[[1]]), breaks=0:255/255, plot=FALSE))

[1] 7.112374

Image colour entropy:

breaks=0:255/255, plot=FALSE))

[1] 7.572906
Exploring Art Data 12

Back to Vasari’s Lives.

We can compare Vasari’s description of Giovanni Cimabue to Wikipedia’s article on the artist.

The results show a surprising degree of similarity:

## install.packages("RCurl")
## Strip wiki code
## Remove {{stuff}}
text<-gsub("\\{\\{[^}]+\\}\\}", "", text)
## Remove [[stuff]]
text<-gsub("\\[\\[[^]]+\\]\\]", "", text)
## Remove [stuff]
text<-gsub("\\[[^]]+\\]", "", text)
## Remove 
text<-gsub("<[^>]+>", "", text)
## Remove punctuation
#text<-gsub("[[:punct:]]", "", text)
## Lowercase words
## Get the text of a page from Wikipedia
curlEscape(subject), "&action=raw", sep=""),
.opts=list(useragent="Mozilla/5.0 (X11; U; Linux i686; fr; rv: Gecko/20090913 Firefox/3.5.3"))
cimabue.corpus<-Corpus(VectorSource(c(artists[1], cimabuePage)),
dissimilarity(cimabueDtm, method="cosine")

They seem reassuringly similar (similarity is 1.0 – dissimilarity):

1 2 0.1079431
Exploring Art Data 11

Let’s look at a more contemporary source than Vasari, Cultural Bloggers Interviewed.

We can download the PDF with a shell script:

wget ""

And then load in the data and process it in R using tm again (with a slight modification to the function that cleans up the text):

library(tm) blogfile<-"./cultural_blogger.pdf" bloggers.names<-c("Claire Welsby", "Michelle Kasprzak", "Alek Tarkowski", "Marco Mancuso", "Anne Helmond", "Robert Misik", "Marta Peirano & José Luis de Vicente", "Alessandro Ludovico", "Régine Debatty") bloggers<-data.frame(name=bloggers.names, from=c(6, 11, 15, 19, 23, 27, 31, 35, 41), to=c(10, 14, 18, 22, 26, 30, 34, 40, 44)) ## Clean footnotes, etc. from article text cleanArticle<-function(text){ ## Remove urls. Would miss final url in a document ;-) text<-lapply(text, function(line){gsub("http://.+\\s", "", line, perl=TRUE)}) ## Remove punctuation text<-lapply(text, function(line){gsub("[[:punct:]]", "", line)}) ## Lowercase words text<-lapply(text, tolower) text } ## Load the blogger texts bloggers.texts<-apply(bloggers, 1, function(blogger){ reader<-readPDF(PdftotextOptions=paste("-layout", "-f", blogger[2], "-l", blogger[3])) reader(elem=list(uri=blogfile), language="en", id=blogger[1])}) ## Clean up the blogger texts bloggers.texts<-lapply(bloggers.texts, cleanArticle) ## Make a corpus of the bloggers texts bloggers.corpus<-Corpus(VectorSource(bloggers.texts), readerControl=list(language="english", reader=readPlain)) ## Remove whitespace within terms bloggers.clean<-tm_map(bloggers.corpus, stripWhitespace) ## Remove stopwords bloggers.clean<-tm_map(bloggers.clean, removeWords, stopwords("english")) ## Stem words ## No, this looks weird in the results ##bloggers.clean<-tm_map(bloggers.clean, stemDocument) ## Term/document matrix dtm<-DocumentTermMatrix(bloggers.clean) ## Remove infrequent terms to save memory dtm<-removeSparseTerms(dtm, 0.4)

Then we can find the most common terms:

## Frequent terms in the matrix findFreqTerms(dtm, 4)

 [1] "art"          "artists"      "arts"         "audience"     "based"
[6] "bit"          "blog"         "blogging"     "blogs"        "community"
[11] "contemporary" "content"      "difficult"    "example"      "experience"
[16] "feel"         "include"      "involved"     "issues"       "mainly"
[21] "media"        "people"       "platform"     "post"         "probably"
[26] "project"      "public"       "regarding"    "scene"        "technology"
[31] "thats"        "time"         "via"          "website"      "world"
[36] "course"       "definitely"   "describe"     "dont"         "facebook"
[41] "focus"        "interview"    "job"          "money"        "personal"
[46] "research"     "started"      "cultural"     "culture"      "digital"
[51] "music"        "write"        "writing"      "active"       "consider"
[56] "critical"     "english"      "following"    "hand"         "information"
[61] "network"      "popular"      "tools"        "actually"     "especially"
[66] "etc"          "hard"         "led"          "live"         "lot"
[71] "question"     "ive"          "online"       "read"         "video"
[76] "book"         "changed"      "european"     "model"        "moment"
[81] "specific"     "start"        "times"        "economic"     "readers"    

Look at associations:

## Frequently associated terms findAssocs(dtm, "blogging", 0.2)

 blogging      dont      read   usually      chat     video      blog    follow
1.00      0.74      0.61      0.57      0.56      0.55      0.49      0.45
research     blogs      hard       via      life       etc      live      role
0.42      0.38      0.38      0.38      0.37      0.35      0.35      0.33
scene  cultural       job  question      able interview     money       ive
0.31      0.30      0.30      0.30      0.27      0.27      0.24      0.23
led    course
0.21      0.20 

Find similar bloggers:

## Dissimilarity dis<-dissimilarity(dtm, method="cosine") ## The most similar bloggers for each blogger, in order of similarity similarityMin<-0.25 mostSimilarBloggers<-apply(dis, 1, function(row){ sorted<-sort(row) ordered<-order(row) ## 0.0 == same blogger ordered[sorted > 0.0 & sorted < similarityMin] }) for(doc in 1:length(mostSimilarBloggers)){ mostSimilar<-unlist(mostSimilarBloggers[doc]) if(length(mostSimilar) > 0){ count<-min(length(mostSimilar), 5) similar<-paste(bloggers.names[mostSimilar[1:count]], collapse=", ") }else{ similar<-"None" } cat(bloggers.names[[doc]], ": ", similar, "\n\n") }

Claire Welsby :  None
Michelle Kasprzak :  Régine Debatty, Anne Helmond
Alek Tarkowski :  Anne Helmond, Régine Debatty
Marco Mancuso :  None
Anne Helmond :  Alek Tarkowski, Michelle Kasprzak, Régine Debatty
Robert Misik :  None
Marta Peirano & José Luis de Vicente :  None
Alessandro Ludovico :  None
Régine Debatty :  Michelle Kasprzak, Alek Tarkowski, Anne Helmond 

Cluster bloggers:

## Clusters of similar bloggers clusterCount<-3 clusters<-kmeans(dtm, clusterCount) clusters.bloggers<-lapply(1:clusterCount, function(cluster){ bloggers.names[clusters$cluster == cluster]}) for(cluster in 1:clusterCount){ cat("Cluster", cluster, ":", paste(unlist(clusters.bloggers[cluster]), collapse=", "), "\n\n") }

Cluster 1 : Michelle Kasprzak, Alek Tarkowski, Anne Helmond, Régine Debatty
Cluster 2 : Claire Welsby, Marco Mancuso
Cluster 3 : Robert Misik, Marta Peirano & José Luis de Vicente, Alessandro Ludovico 

And plot associations between terms used in the text:

## Plot associations between terms plot(dtm, findFreqTerms(dtm, 6), attrs=list(graph=list(), node=list(shape="rectangle", fontsize="120", fixedsize="false")))


Exploring Art Data 10

Let’s make a word clouds for all the artists:

## install.packages('snippets',,'')
## Create a word cloud for all artists
freqAll<-termFreq(PlainTextDocument(paste(artists, collapse=" "),
control=list(removePunctuation=TRUE, removeNumbers=TRUE,
stopwords=TRUE, minDocFreq=100))
cloud(freqAll, col =, fit=TRUE))

vasari-wordcloud-all.pngAnd here’s a tag cloud for just one artist (Giovanni Cimabue) to compare it with:

## Create a word cloud for one artist
freq<-termFreq(PlainTextDocument(artists[1], language="en"),
control=list(removePunctuation=TRUE, removeNumbers=TRUE,
stopwords=TRUE, minDocFreq=3))
cloud(freq, col =, fit=TRUE))


Art Data Analysis: A Very Data Christmas

I thought it would be fun to explore
the lyrics of Christmas carols, and see how the word usage in these
songs compares with today’s lexicon. To do so I needed two things:
first, Christmas carol texts; and second, a way to compare the usage of
words in those songs to that of today.

A simple Google search for Christmas carol lyrics yielded this site, which I downloaded into a single text file. Then, I used the R tm package to create a clean word corpus from this text, stripping out English stopwords, punctuation and case. This left me with 755 words to explore...

Art Data Analysis: Software Studies

TimeDIff_SUM_ALL_color_reduced.jpgLev Manovich’s Software Studies initiative at UCSD is applying big data quantitative methods to mass media in a technique called Cultural Analytics. I particularly like their studies of US Presidential campaign ads (image above) and of manga images.

If art is the superstructure of kitsch or if an artist is an aesthetic summator then this is paradigmatic art, using the techniques of the age to depict the visual environment as renaissance artists used trade maths.

Exploring Art Data 9

Now let’s see which artists are described most similarly by Vasari:

## Dissimilarity
dis<-dissimilarity(dtm, method="cosine")
## The most similar artists for each artist, in order of similarity
mostSimilarArtists<-apply(dis, 1,
## 0.0 == same artist
ordered[sorted > 0.0 & sorted < similarityMin]
for(doc in 1:length(mostSimilarArtists)){
if(length(mostSimilar) > 0){
count<-min(length(mostSimilar), 5)
similar<-paste(artists.names[mostSimilar[1:count]], collapse=", ")
cat(artists.names[[doc]], ": ", similar, "\n\n")

Which gives us (truncated…):

Giovanni Cimabue :  Giotto, Masaccio, Agnolo Gaddi
Arnolfo Di Lapo :  Niccola And Giovanni Of Pisa
Niccola And Giovanni Of Pisa :  Arnolfo Di Lapo, Agostino And Agnolo Of Siena
Andrea Tafi :  None
Gaddo Gaddi :  None
Margaritone :  None
Giotto :  Buonamico Buffalmacco, Pietro Perugino, Giovanni Cimabue, Raffaello Da Urbino, Taddeo Gaddi

And let’s cluster the artists together using k-means clustering:

## Clusters of similar artists
clusters<-kmeans(dtm, clusterCount)
artists.names[clusters$cluster == cluster]})
for(cluster in 1:clusterCount){
cat("Cluster", cluster, ":",
paste(unlist(clusters.artists[cluster]), collapse=", "),

Which gives us:

Cluster 1 : Filippo Brunelleschi, Torrigiano, Antonio Da San Gallo The Younger, Niccolò Called Tribolo, Simone Mosca, Fra Giovanni Agnolo Montorsoli
Cluster 2 : Domenico Puligo, Francesco Mazzuoli, Niccolò Soggi, Giovanni Antonio Bazzi Called Il Sodoma
Cluster 3 : Giotto, Lorenzo Di Bicci, Ercole Ferrarese, Baldassarre Peruzzi, Madonna Properzia De Rossi, Girolamo Da Treviso, Il Rosso, Franciabigio, Giulio Romano, Fra Sebastiano Viniziano Del Piombo, Domenico Beccafumi Of Siena, Cristofano Gherardi Called Doceno Of Borgo San Sepolcro, Michele San Michele, Giovanni Da Udine, Battista Franco, Daniello Ricciarelli
Cluster 4 : Baccio Bandinelli
Cluster 5 : Andrea Tafi, Gaddo Gaddi, Margaritone, Stefano Painter Of Florence And Of Ugolino Sanese, Pietro Laurati, Ambrogio Lorenzetti, Pietro Cavallini, Tommaso Called Giottino, Giovanni Dal Ponte, Agnolo Gaddi, Berna, Duccio, Antonio Viniziano, Jacopo Di Casentino, Gherardo Starnina, Lippo, Don Lorenzo Monaco, Taddeo Bartoli, Niccolò Aretino, Dello, Nanni Dantonio Di Banco, Masolino Da Panicale, Giuliano Da Maiano, Piero Della Francesca, Leon Batista Alberti, Lazzaro Vasari, Antonello Da Messina, Alesso Baldovinetti, Vellano Da Padova, Benozzo Gozzoli15, Galasso Ferrarese17, Desiderio Da Settignano, Mino Da Fiesole, Lorenzo Costa, Benedetto Da Maiano, Jacopo Called Lindaco, Giorgione Da Castelfranco, Antonio Da Correggio, Mariotto Albertinelli, Raffaellino Del Garbo, Simone Called Il Cronaca, Marco Calavrese, Francesco Granacci Il Granaccio, Giuliano Bugiardini
Cluster 6 : Baccio D Agnolo, Perino Del Vaga, Bastiano Da San Gallo Called Aristotile, Francesco Salviati, Taddeo Zucchero
Cluster 7 : Pietro Perugino, Raffaello Da Urbino, Andrea Del Sarto, Jacopo Da Pontormo
Cluster 8 : Arnolfo Di Lapo, Niccola And Giovanni Of Pisa, Agostino And Agnolo Of Siena, Andrea Pisano, Jacopo Della Quercia, Luca Della Robbia, Lorenzo Ghiberti, Donato, Michelozzo Michelozzi, Andrea Verrocchio, Bramante Da Urbino, Andrea Dal Monte Sansovino, Benedetto Da Rovezzano, Pierino Piero Da Vinci, Giovan Francesco Rustici
Cluster 9 : Michelagnolo Buonarroti
Cluster 10 : Giovanni Cimabue, Buonamico Buffalmacco, Simone Sanese, Taddeo Gaddi, Andrea Di Cione Orcagna, Spinello Aretino, Paolo Uccello, Parri Spinelli, Masaccio, Fra Filippo Lippi, Cosimo Rosselli, Sandro Botticelli, Andrea Mantegna, Filippo Lippi Called Filippino, Bernardino Pinturicchio, Francesco Francia, Luca Signorelli Of Cortona, Leonardo Da Vinci10, Piero Di Cosimo, Fra Bartolommeo Di San Marco, Guglielmo Da Marcilla, Lorenzo Di Credi, Giovanni Antonio Sogliani, Giovanni Antonio Lappoli

Michaelangelo is clearly singular. 😉