February 28, 2009

OED, again

A little more on the OED. The idea of creating a publicly-accessible version has obviously been floating around for a few years. As well it might: not only would an open OED be fantastically useful, but there's a certain justice in bringing it back to the community. As Kragen Sitaker writes, the original OED

is one of the earliest instances of what are now called "pro-am" or "commons-based peer production" projects. From 1857 to 1928, thousands of readers collected examples of uses of words their dictionaries didn't define; they mailed these examples on slips of paper to a small number of editors, who undertook to collate them into a dictionary.

Kragen's attempt to liberate the OED was the most effective: not only did he get one set of the OED scanned, he also cooked up some code making it possible to look up individual words. Alas, his system is now offline - such is the fate of one-man projects. Rufus Pollock's attempt to revive it, within the framework of the Open Knowledge Foundation, seems not to have got anywhere.

More ambitious are the Distributed Proofreaders, a group who take OCR'ed books, edit and correct them by hand, and pass them on to Project Gutenberg. They've been contemplating the idea of tackling the OED for some time now. But it's a pretty daunting project - both in scale, and in the complexity of the typography - and every attempt seems to peter out.

Which is all a bit of a disappointment. I'm not quite foolhardy enough to launch myself into digitising the OED just yet, but there must be at least some prospect of making those scans slightly more user-friendly.

The Oxford English Dictionary, free

[update: Here is a very rough interface, which will be improved whenever I next have some free time]

Using the OED online costs £200/year, which is silly. Fortunately the first edition is out of copyright, and available at the Internet Archive. Unfortunately, it's a bit tricky to find the right volume in a format that doesn't expect you to download 200MB to look up a word. Djvu seems the best option; you need to install a browser plugin first, but then you can look at individual pages quite easily. Here are links to each volume:

A-B, C, D-E, F-G (pdf only), H-K, L, M-N, O-P (flip-book only), Q-R, S-SH, SI-SU (flip-book only), SV-TH, TI-U, V-Z (flip-book only)

Other formats are at these links (yes, there are two separate scans, one from the University of Toronto and another from Kragen Sitaker):

  • Volume 1, A-B: Sitaker
  • Volume 2, C: Sitaker
  • Volume 3, D-E: Toronto (partial), Complete?
  • Volume 4, F-G: Sitaker, Toronto (no djvu for either)
  • Volume 5, H-K: Sitaker
  • Volume 6A, L: Sitaker
  • Volume 6B, M-N: Sitaker
  • Volume 7, O-P: Toronto (flip-book only), Unlabelled (flipbook/pdf only)
  • Volume 8A, Q-R: Sitaker
  • Volume 8B, S-SH: Sitaker
  • Volume 9A, SH-SU: Sitaker
  • Volume 9B, SV-TH: Sitaker, Toronto
  • Volume 10A, TI-U: Sitaker, Toronto
  • Volume 10B, V-Z: Toronto, Sitaker
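Since the fiddly part is working out which volume a word lives in, here is a minimal sketch of that step in Python. It assumes the volume boundaries are exactly the letter ranges listed above (S-SH, SI-SU and so on), and the names `VOLUMES` and `volume_for` are made up for illustration; it only tells you which volume to open, not which page.

```python
from bisect import bisect_right

# First headword prefix of each volume, in alphabetical order, paired with
# the range label used in the list above. Assumed, not checked against scans.
VOLUMES = [
    ("A", "A-B"), ("C", "C"), ("D", "D-E"), ("F", "F-G"), ("H", "H-K"),
    ("L", "L"), ("M", "M-N"), ("O", "O-P"), ("Q", "Q-R"), ("S", "S-SH"),
    ("SI", "SI-SU"), ("SV", "SV-TH"), ("TI", "TI-U"), ("V", "V-Z"),
]
STARTS = [start for start, _ in VOLUMES]


def volume_for(word: str) -> str:
    """Return the range label of the volume that should contain `word`."""
    key = word.strip().upper()
    if not key or not ("A" <= key[0] <= "Z"):
        raise ValueError(f"cannot place {word!r} in an A-Z volume")
    # The right volume is the last one whose starting prefix sorts <= the word.
    index = bisect_right(STARTS, key) - 1
    return VOLUMES[index][1]


if __name__ == "__main__":
    for word in ("catamite", "shoe", "sit", "the", "tiger"):
        print(word, "->", volume_for(word))
```

A binary search over the starting prefixes is probably overkill for fourteen volumes, but it keeps the awkward splits inside S and T in one sorted table rather than a chain of special cases.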

November 8, 2006

History of printing

This post is brought to you by the awestruck feeling of finding yet another underexplored bit of world history....

We all know Gutenberg wasn't the first person to experiment with movable type; it had been tried in China before. What I hadn't realised was just how international the world was first time round. One of the first examples of movable type comes from the Tangut Empire. They were printing in a language unrelated to Chinese, written in a script inspired by Chinese characters - but with a set of 6000+ totally different logograms. And some of the first texts that they tried to print like this were Buddhist texts translated from Sanskrit (possibly via Tibetan).

So: this culture created a writing system inspired by the Chinese, a religion from India, and out of them developed movable type 400-odd years before Gutenberg. Impressive, no?

But, there's a flaw. Movable type makes a lot less sense with 6000 characters than it does with an alphabet of 30-something. So for the most part, they just printed by carving wood-blocks, one per page. So when they created a Tangut version of the Tripitaka, the Buddhist scriptural canon, they used 130,000 blocks. Most of them are now in London or St. Petersburg, having been raided by people like Aurel Stein. Here are some papers on Tangut history and language.

[The picture is a fragment from a written Tangut text of the Platform Sutra, taken from the British Library]

November 7, 2006

Glassy Essence

John Updike reviews Salinger's Franny and Zooey. Reading about the Glass family, like reading about the Bagthorpes or watching The Royal Tenenbaums, is a guilty pleasure tinged with recognition and wish-fulfilment. Updike says much the same:

Of Zooey, we are assured he has a "somewhat preposterous ability to quote, instantaneously and, usually, verbatim, almost anything he had ever read, or even listened to, with genuine interest." The purpose of such sentences is surely not to particularize imaginary people but to instill in the reader a mood of blind worship, tinged with envy.

Many of the stories are online here

October 28, 2006

This medieval bestiary feels very much like the etymologies in Sanskrit works like Yaska's Nirukta. Both of them shift between what we'd now think of as etymology (i.e. finding plausible historical roots for words), and a more alien sense that the word, through etymology, somehow captures the entire nature of the thing described. I suppose in the West this goes back to the "Platonism without Plato" that drives medieval scholasticism, and there is something pretty similar in India.

The he-goat is a wanton and frisky animal, always longing for sex; as a result of its lustfulness its eyes look sideways - from which it has derived its name. For, according to Suetonius, hirci are the corners of the eyes. Its nature is so very heated that its blood alone will dissolve a diamond, against which the properties of neither fire nor iron can prevail.

Also, like all these books, it is a very pretty thing.

October 3, 2006

Conference reloaded

How can you develop a service without sharing a language with your users?

Holed up in Budapest, my head too messed up to do any proper work (eep! the doom she is a-coming!), I've been listening to danah Boyd's keynote at the blogtalk conference that's just winding up in Vienna.

She touches on the fact that the creators of Orkut don't have the faintest idea what their Portuguese or Hindi-speaking users are doing. I'd always vaguely assumed that there would be a fair few Portuguese-speakers within the Orkut development team, for instance. But obviously not.

It'd be a nice little project for a journalist or an anthropologist, to work out how much the developers of these sites know about their users.

September 19, 2006

A place for concordances

IntraText Digital Library: a project to create a digital library with concordances, statistics, and other useful things, based around some XML structure, worked out in immense detail, so that you can do whatever you like to the texts.

Would be so much better, though, if you could use all the tools on an entire corpus, rather than a single text. Or at least, it'd save on the pile of ad hoc scripts I end up using to work out what's happening with chunks of Sanskrit.
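For what it's worth, the kind of ad hoc script I mean looks roughly like this: a keyword-in-context concordance run over several texts at once. It is only a sketch, not anything IntraText provides. The function name `kwic` and the corpus layout (a dict of text name to text) are my own assumptions, and the `\w+` tokenizer assumes romanised text; Devanagari would most likely need a different tokenizer.

```python
import re
from typing import Dict, Iterator, Tuple


def kwic(corpus: Dict[str, str], keyword: str,
         width: int = 3) -> Iterator[Tuple[str, str, str, str]]:
    """Yield (text name, left context, match, right context) for every hit.

    `width` is the number of words of context kept on each side.
    """
    target = keyword.casefold()
    for name, text in corpus.items():
        # Crude tokenizer: runs of word characters, punctuation dropped.
        tokens = re.findall(r"\w+", text)
        for i, token in enumerate(tokens):
            if token.casefold() == target:
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                yield name, left, token, right


if __name__ == "__main__":
    # Toy corpus, invented purely to show the output shape.
    corpus = {
        "text_a": "dharma is the way; Dharma protects",
        "text_b": "no match here",
    }
    for name, left, hit, right in kwic(corpus, "dharma", width=2):
        print(f"{name}: {left:>12} [{hit}] {right}")
```

Matching is exact on the word form, so inflected forms and sandhi are ignored entirely, which is precisely where a properly marked-up corpus would earn its keep.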

February 23, 2005

Some etymologies

Dubious, but hilarious. The town name ‘Baldock’ is apparently a corruption of ‘Baghdad’, and the result of exotically-minded Knights Templar settling there in the 12th century. Then ‘catamite’, a word which seems to be cropping up everywhere in my set texts, is derived from Ganymede. Ganymede? Catamite? There’s an intermediate Latin stage of Catamitus, but that doesn’t really explain it.

February 21, 2005

Scholarly jabs

Ryan writes about the lost art of academic jibes. He’s a little pessimistic - take this gem of a footnote from Richard Drayton’s recent book Nature’s Government:

“I owe a particular debt to Mr. Desmond for rescuing me from writing the more parochial history of Kew which would have followed from the publication of my doctoral dissertation. He read and commented on it in 1994, but failed to cite it since, he later advised me, he had ‘put it aside’ before writing. I take encouragement from the fact that Desmond was able so often to agree with the patterns and periods I had described for Kew’s history”

February 14, 2005

Linguists and journalists

Today’s earth-shattering revelation is the similarity between journalists and comparative linguists. They both bug an assortment of subject experts, then string together garbled misinterpretations of the responses, and publish to acclaim from the ignorant.

In a less huffy mood, I might have some good things to say about interdisciplinarity and spreading information.