Folksonomies: a new style of metadata

1. What is a Folksonomy?

'Folksonomy' (a contraction of 'folk taxonomy') is a term coined by Thomas Vander Wal to describe a form of informal metadata. Metadata (data about data) has a long history, online and offline, in information science, but it is generally associated with formal categorisation schemes.

Vander Wal noticed a somewhat different approach being used by a range of new websites, such as Furl, Flickr, Del.icio.us, Books we Like and 43 Things.

The approach to metadata used by these sites diverged from the norm in three ways. In each case, it involved:

2. How new is Folksonomy?

To Vander Wal, as to many other commentators, this has appeared as a new development. He is right insofar as the near-simultaneous adoption of similar systems by several major startups is notable and fascinating.

Yet, consciously or not, these sites are standing on the shoulders of giants, and little of what they do is entirely new. In the following paragraphs I will attempt to identify precursors, or comparisons, for the features of folksonomies.

Vander Wal himself has made the provocative statement that "The original folksonomy is the web". His explanation is as follows:

Increasingly in user testing I run across people who do not use the site navigation to find information on a site, nor do they use in-site search. I regularly see people go out to their favorite search engine and put in the term for what they believe they are looking for. The search engines often (and increasingly so) return the information they were looking for using the user's term. The term may not be what the site uses in its controlled vocabulary and may be a term that is rarely used on the site. How did the search find it? Outside links using the term pointing to the content. The original folksonomy is the web.

Looking back still further, we can find echoes of folksonomy in the nineteenth-century writings of Charles Ammi Cutter. In 1876 he described a 'syndetic catalog' - one based (like a folksonomy) on informal cross-references rather than on an established hierarchy.

3. The weaknesses of Folksonomies

Amidst the excitement of new sites, it is easy to become blind to the problems of folksonomies. Many of these have not yet become apparent in practice, simply because folksonomy-based systems have not yet reached the level of popularity at which they will attract malicious attacks and encounter serious scaling problems.

Spam is perhaps the greatest danger. In the words of one Slashdot poster, "all metadata matures to spam". Discussions of how folksonomies could deal with spam have produced several possible responses.

One is a mechanism similar to the 'ignore' feature of Del.icio.us, which currently allows users to filter out tags from particular users. This could be expanded to globally filter out users whom many people have chosen to ignore.
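A minimal sketch of how such a global ignore filter might work. This is not how Del.icio.us implements its feature; the data layout, the function names and the threshold are all assumptions made for illustration.

```python
from collections import defaultdict

def globally_ignored(ignore_lists, threshold):
    """Return users ignored by at least `threshold` distinct other users."""
    counts = defaultdict(int)
    for ignorer, ignored in ignore_lists.items():
        for user in set(ignored):
            if user != ignorer:          # ignoring yourself does not count
                counts[user] += 1
    return {user for user, n in counts.items() if n >= threshold}

def visible_items(items, viewer, ignore_lists, threshold):
    """Drop items from users the viewer ignores or who are globally ignored."""
    hidden = set(ignore_lists.get(viewer, ()))
    hidden |= globally_ignored(ignore_lists, threshold)
    return [item for item in items if item["user"] not in hidden]

# Hypothetical data: who ignores whom, and two tagged links.
ignore_lists = {
    "alice": ["spammer"],
    "bob": ["spammer", "carol"],
    "dave": ["spammer"],
}
items = [
    {"user": "spammer", "url": "http://example.com/pills", "tags": ["photos"]},
    {"user": "carol", "url": "http://example.com/cats", "tags": ["photos"]},
]

# alice never ignored carol, so she still sees carol's link.
print([item["user"] for item in visible_items(items, "alice", ignore_lists, 3)])
```

The threshold matters: set it too low and a handful of users could silence someone they merely dislike.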

Spammers might try to make their links show up in many inboxes by tagging them with large numbers of inappropriate keywords. This could be combatted by banning people who use too many tags, or by allowing users to ignore items which are also assigned 'adult' tags.

One suggestion is to let genuine users tag spam:

Let's say message A is tagged "a b c d e" and I think it don't belong in tag d, I can unmark it and it will be tagged "a b c d* e" where the d* tag is a 'tainted' belonging... Taintedness gets a score: misclassification hits and correction hits, each user can vote if they think the link is misclassified or not. We can the apply statistical methods and below a certain threshold the link is deleted from that tag... This would work very much as statistical spam filters work, through a kind of 'karma voting' system. Registered users can give one vote per link if they so desire... A spammer trying to promote a link would have to vote from many accounts (possible but a lot of work) for every link he wants misclassified (and as long as enough people vote against it, it will be detagged, because, why did it raise suspicion in the first place?)
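The voting scheme described above can be sketched roughly as follows. The proposal does not say which statistical method to use, so the simple ratio, the threshold of 0.5 and the minimum of three votes are assumptions, as are all the names. As a simplification, each user gets one vote per link-and-tag pair.

```python
class TagVotes:
    """Votes on whether one link belongs under one tag."""

    def __init__(self):
        self.votes = {}  # user -> True if that user thinks the tag is wrong

    def vote(self, user, misclassified):
        # One vote per user: a later vote replaces the earlier one.
        self.votes[user] = misclassified

    def score(self):
        """Fraction of voters who consider the tag correct (1.0 if no votes)."""
        if not self.votes:
            return 1.0
        correct = sum(1 for wrong in self.votes.values() if not wrong)
        return correct / len(self.votes)

def surviving_tags(tag_votes, threshold=0.5, min_votes=3):
    """Keep a tag unless enough voters have pushed its score below threshold."""
    kept = []
    for tag, votes in tag_votes.items():
        if len(votes.votes) >= min_votes and votes.score() < threshold:
            continue  # the link is removed from this tag
        kept.append(tag)
    return kept

# Message A is tagged "a b c d e"; three users object to tag d, one defends it.
tag_votes = {tag: TagVotes() for tag in "abcde"}
for user in ("u1", "u2", "u3"):
    tag_votes["d"].vote(user, misclassified=True)
tag_votes["d"].vote("u4", misclassified=False)

print(surviving_tags(tag_votes))  # ['a', 'b', 'c', 'e']
```

The `min_votes` floor stops a single objection from removing a tag, which is the same property that forces a spammer to vote from many accounts.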

Making registration harder could deter some spammers.

An email from Anselm Hook suggests a trust-network filter, working on the assumption that if a friend of a friend links to a feed, they are probably not a spammer.
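One way to read the trust-network idea is as a breadth-first search over a friendship graph, limited to two hops (friends, and friends of friends). The sketch below is an interpretation, not Hook's design; the graph layout and the function names are assumptions.

```python
from collections import deque

def trusted_users(friends, me, max_hops=2):
    """Return everyone within `max_hops` friendship links of `me`."""
    seen = {me}
    queue = deque([(me, 0)])
    while queue:
        user, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not look beyond friends of friends
        for friend in friends.get(user, ()):
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, hops + 1))
    return seen

def trusted_items(items, friends, me, max_hops=2):
    """Keep only items posted by users inside the trust network."""
    trusted = trusted_users(friends, me, max_hops)
    return [item for item in items if item["user"] in trusted]

# Hypothetical friendship graph: me -> ann -> ben -> cat.
friends = {"me": ["ann"], "ann": ["ben"], "ben": ["cat"]}
items = [
    {"user": "ben", "url": "http://example.com/feed"},
    {"user": "cat", "url": "http://example.com/other"},
]

# ben is a friend of a friend; cat is three hops away and is filtered out.
print([item["user"] for item in trusted_items(items, friends, "me")])
```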

Which of these solutions will work remains to be seen - but spam will without doubt become one of the biggest problems for folksonomies.

Pornography and other inappropriate content. Folksonomies seem perfectly adapted to deal with this - just tag the inappropriate content as such, and enable users to filter out what they consider inappropriate. Users could be enabled to automatically ignore anything tagged with one of a list of markers for content they did not wish to see.
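A per-user filter of this kind needs very little machinery. In this sketch the item layout, the function name and the example tags are assumptions; each user would keep their own list of markers.

```python
def filter_unwanted(items, blocked_tags):
    """Drop every item carrying at least one tag from the user's blocklist."""
    blocked = {tag.lower() for tag in blocked_tags}
    return [
        item for item in items
        if blocked.isdisjoint(tag.lower() for tag in item["tags"])
    ]

# Hypothetical items and a user's personal list of unwanted markers.
items = [
    {"url": "http://example.com/a", "tags": ["photos", "Adult"]},
    {"url": "http://example.com/b", "tags": ["photos", "cats"]},
]
print([item["url"] for item in filter_unwanted(items, ["adult", "nsfw"])])
```

The filter only works if someone has applied the marker tag in the first place, so it depends on taggers' goodwill as much as on code.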

Synonyms. If I call it one thing and you call it another, how will I ever find items you've tagged? In a formal classification system we will both be shoehorned into using the same terminology - but not so in a folksonomy. One answer to this is to give each user an idea of what tags others are using - by autocompletion of tag entry, or by displaying a 'cloud' of commonly-used tags for a particular type of item.
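Both remedies rest on the same data: a count of how often each tag has been used. The sketch below assumes tags are available as a flat list of strings; the function names and sample tags are illustrative only.

```python
from collections import Counter

def suggest_tags(prefix, tag_counts, limit=5):
    """Autocomplete: tags starting with `prefix`, most popular first."""
    prefix = prefix.lower()
    matches = [(tag, n) for tag, n in tag_counts.items() if tag.startswith(prefix)]
    matches.sort(key=lambda pair: (-pair[1], pair[0]))  # by count, then name
    return [tag for tag, _ in matches[:limit]]

def tag_cloud(tag_counts, top=10):
    """The `top` most used tags with their counts, for display as a cloud."""
    return tag_counts.most_common(top)

# Hypothetical tags applied by different users to photos of the same city.
all_tags = ["nyc", "newyork", "nyc", "new-york", "nyc", "newyork", "cats"]
counts = Counter(all_tags)

print(suggest_tags("n", counts))  # ['nyc', 'newyork', 'new-york']
print(tag_cloud(counts, top=2))   # [('nyc', 3), ('newyork', 2)]
```

Ranking suggestions by popularity nudges users toward the commonest variant without forbidding the others.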

Automatic metadata. In all the existing folksonomies, keywords are consciously chosen by users. In some ways this is a step back from systems where metadata is automatically generated from the nature of the data or from the behaviour of users. Examples of collecting metadata from users include Alexa (gathering information from a browser toolbar), Amazon (shopping patterns), and Google's use of descriptions taken from links to a page. Non-internet examples would include citation databases. Examples of automatically generating metadata from the data itself include search engines indexing the contents or modification date of a page.

4. Synthesis - combining the best of Folksonomies with the rest of the world

Folksonomies, then, have many disadvantages to balance out their advantages. How could we get the best of both worlds, combining the informal manual classification offered by folksonomies with formal and automatic classification by other means?

There is scope for folksonomy-based systems to integrate the use of automatically generated metadata. Flickr does store, but does not yet make effective use of, EXIF (Exchangeable Image File Format) data. This is the information automatically recorded by most digital cameras, detailing camera model, date, shutter speed, and other details. Any system (including a folksonomy) using this data will be able to guess more useful metadata, such as whether a photo was taken inside or outside, at night or during the day.
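As a rough illustration of such guessing, the sketch below takes EXIF fields that have already been read into a dictionary and proposes tags from them. DateTimeOriginal, Flash and ExposureTime are standard EXIF fields, but the heuristics (daytime between 07:00 and 19:00, flash suggesting indoors, a fast shutter suggesting outdoors) and the tag names are crude assumptions, not anything Flickr does.

```python
from datetime import datetime

def guess_tags(exif):
    """Suggest tags from already-parsed EXIF fields; every rule is a guess."""
    tags = []

    # EXIF stores timestamps as "YYYY:MM:DD HH:MM:SS" in the camera's clock.
    taken = exif.get("DateTimeOriginal")
    if taken:
        hour = datetime.strptime(taken, "%Y:%m:%d %H:%M:%S").hour
        tags.append("day" if 7 <= hour < 19 else "night")

    flash = exif.get("Flash")            # integer; lowest bit set = flash fired
    exposure = exif.get("ExposureTime")  # shutter time in seconds
    if flash is not None and flash & 1:
        tags.append("probably-indoors")
    elif exposure is not None and exposure <= 1 / 250:
        tags.append("probably-outdoors")
    return tags

print(guess_tags({"DateTimeOriginal": "2005:03:14 22:10:05", "Flash": 1}))
print(guess_tags({"DateTimeOriginal": "2005:03:14 12:30:00", "Flash": 0,
                  "ExposureTime": 1 / 500}))
```

Guesses like these are best offered to the user as suggested tags rather than applied silently, since camera clocks are often wrong and flash is used outdoors too.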

It might also be possible to combine folksonomies with expert-run classification systems, as Louis Rosenfeld suggests. As an example, librarians could comb Del.icio.us for information on what labels seemed most natural to users.

The most beautiful metaphor, though, is that given by Peter Merholz. He talks about 'desire lines': the mud tracks that appear across parks as walkers take short-cuts. A cunning authority will allow these desire lines to form and then pave them over, ensuring that the paving is on routes where it will be used. Just so with folksonomies - they provide a rough popular classification, which can then be used as the basis for something more permanent, formal and useful. That approach might show us the true power of the folksonomy.