Tagging and Ontologies
Jun. 30th, 2006 08:56 am
When I encountered the online tagging of data (first on del.icio.us, then on LJ) it really brought it home to me that most ontologies do not fit into hierarchies or simple groups. When it comes to most of the things we deal with in day-to-day life, what we actually have are interlocking sets of observations and categorisations that are best described using a series of independent labels.
However, I think the _next_ stage is going to be the really tricky (and interesting) bit - some of the labels can themselves be part of hierarchies, or synonyms for other labels. I have no wish, for instance, to tag something as being a "short film" and a "film", when short films _are_ films. But current tagging technology doesn't allow me to set up that link. Nor does it allow me to say that things tagged with "humor" by Americans are equivalent to things tagged as "humour" by Brits. What we currently have is a way to produce a mish-mash of data, with some interesting trends in it. The ability to construct things from it will make it a lot more powerful.
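A minimal sketch of the kind of link described above, assuming a tagging system that lets the user declare synonyms and broader tags. The `SYNONYMS` and `PARENTS` tables and the `expand_tags` function are hypothetical names for illustration; neither del.icio.us nor LJ is known to offer anything like them.

```python
# Hypothetical user-defined tables (not a feature of any existing tagging site).
SYNONYMS = {"humor": "humour"}      # variant spelling -> canonical tag
PARENTS = {"short film": "film"}    # narrower tag -> broader tag


def expand_tags(tags):
    """Return the canonical form of each tag plus every broader tag it implies."""
    expanded = set()
    for tag in tags:
        tag = SYNONYMS.get(tag, tag)
        # Walk up the hierarchy so that "short film" also counts as "film".
        while tag is not None and tag not in expanded:
            expanded.add(tag)
            tag = PARENTS.get(tag)
    return expanded


print(sorted(expand_tags(["short film", "humor"])))
# ['film', 'humour', 'short film']
```

With something like this, tagging an item once as "short film" would be enough for it to turn up under "film" as well.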
no subject
Date: 2006-06-30 08:14 am (UTC)
no subject
Date: 2006-06-30 08:20 am (UTC)
Oh, and this was all in response to zara's post here
no subject
Date: 2006-06-30 08:23 am (UTC)
It's not the obvious tags, like place or some commonly shared hierarchy, it's the personal ones that fail...
no subject
Date: 2006-06-30 08:31 am (UTC)
no subject
Date: 2006-06-30 08:39 am (UTC)
Word stemming is a mostly solved problem for English; there is even some free code in the indexing server Xapian[1]. We could even use Soundex[2] to group similar items.
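As a rough illustration of the Soundex idea, here is a small sketch of the classic American Soundex coding. It is written from the usual description of the algorithm, not taken from Xapian, and the function name `soundex` is just a label chosen for this example.

```python
# Consonant groups that share a Soundex digit (1 through 6).
CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for letter in letters:
        CODES[letter] = str(digit)


def soundex(word):
    """Return the four-character Soundex code for a word ('' if it has no letters)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for c in letters[1:]:
        code = CODES.get(c, "")
        # Emit a digit only when it differs from the previous letter's digit.
        if code and code != previous:
            result += code
        # 'h' and 'w' do not separate repeated codes; vowels do.
        if c not in "hw":
            previous = code
    # Pad with zeros and cut to one letter plus three digits.
    return (result + "000")[:4]


print(soundex("humor"), soundex("humour"))  # H560 H560
```

Both spellings land in the same bucket, so tags could be grouped by code. The catch is that unrelated words can share a code too, so this is a hint for grouping rather than proof of equivalence.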
Ideally, we imagine the tag space as a metric space, and define a metric to see how related two items are based on a variety of categories. (Searching in a metric space is still an open problem, but it's a nice mathematical model of the space all the same.)
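One possible metric, purely as an assumption for illustration (the comment does not name one), is the Jaccard distance between two items' tag sets: 0 when the tags are identical, 1 when nothing is shared. The function name and the sample tags below are made up.

```python
def jaccard_distance(tags_a, tags_b):
    """Distance between two tag collections: 1 - |intersection| / |union|."""
    a, b = set(tags_a), set(tags_b)
    if not a and not b:
        return 0.0  # two untagged items are treated as identical
    return 1.0 - len(a & b) / len(a | b)


# Hypothetical tag sets for two photos.
print(jaccard_distance({"cat", "animal", "photo"}, {"dog", "animal", "photo"}))  # 0.5
```

Jaccard distance satisfies the triangle inequality, so it is a genuine metric, although it only looks at shared tags and knows nothing about what the tags mean.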
Re: (short) film.
So, we need adjectives in tags then.
Tags are a bad solution to the problem of data organization. They are simple to implement though, and easy to use. I would rather have full text indexing, and automatic extraction of metadata than tags.
[1] http://www.xapian.org/
[2] http://en.wikipedia.org/wiki/Soundex
no subject
Date: 2006-06-30 10:42 am (UTC)
To give a few examples of problems with full text search:
You can't do text-search on pictures (all of my LJ photos are tagged, and that's how my galleries are constructed).
I've got a load of photos marked "cat" and "dog" - but without a way of saying "Cat -> Animal" and "Dog -> Animal" a search on "animals" will find nothing at all.
And in a purely textual sense, I can ramble on for 500 words about my walk in the park with Tara, but unless I mention she's a dog you're never going to find that post with a search for dog walking.
no subject
Date: 2006-06-30 11:31 am (UTC)
Full text indexing isn't the same as treating every word as a tag, as word order is important.
Automatic tagging should be user-defined and extensible, perhaps in the same way you can train a spam filter.
With regards to the dog/Tara example - "tara" is a statistically improbable word (unlike 'and', 'they', 'blog', etc.), and "dog" isn't that common. So with automatic relations, you can see that "tara" and "dog" come up together frequently, and thus a search for one may also trigger a search for the other.
When I said full text indexing, I meant indexing of all the data, including metadata (date, time, GPS, etc.). To me, this includes correlations too, such as how often words appear together.
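A rough sketch of the "words that come up together" idea, assuming posts are plain strings. The function names, the sample posts and the `min_count` threshold are all invented for illustration. It uses raw counts only, so a real version would also need to discount common words, which is the "statistically improbable" part.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(posts):
    """Count how many posts contain each unordered pair of distinct words."""
    pairs = Counter()
    for text in posts:
        words = sorted(set(text.lower().split()))
        pairs.update(combinations(words, 2))
    return pairs


def related(word, pairs, min_count=2):
    """Words that share at least `min_count` posts with `word`."""
    hits = set()
    for (a, b), count in pairs.items():
        if count >= min_count:
            if a == word:
                hits.add(b)
            elif b == word:
                hits.add(a)
    return hits


# Made-up posts for the example.
posts = [
    "tara the dog chased a ball",
    "walked tara my dog today",
    "park walk with tara",
]
print(related("tara", cooccurrence(posts)))  # {'dog'}
```

A search for "dog" could then be widened to posts mentioning "tara", including the third post, which never says "dog" at all.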
Anyway:
This goes back to the metric space idea. You can define objects and their relation (i.e. the distance between them). You could probably do this sort of grouping for tags easily too.
Ideally, I would like a situation where I have to do as little organisation as possible, and use the pre-existing relationships between files (from the same device, created at the same time, same type, emailed to the same person, etc.).
Contrived example: Photos 1-10 have a date embedded in them. They are all close to each other, with a small distance between them. A party in the calendar also has a date attached to it.
You search for the party, it finds the photos.
I know I will have to add relations at some point, but I'd rather the computer did most of the work for me.
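The contrived party example could be sketched like this, taking the gap between timestamps as the distance. The file names, the dates and the 12-hour window are all made up for the illustration.

```python
from datetime import datetime, timedelta


def items_near(event_time, items, window=timedelta(hours=12)):
    """Return the names of items whose timestamp is within `window` of the event."""
    return [name for name, taken in items if abs(taken - event_time) <= window]


# Hypothetical calendar entry and photo timestamps.
party = datetime(2006, 6, 24, 20, 0)
photos = [
    ("photo1.jpg", datetime(2006, 6, 24, 21, 15)),
    ("photo2.jpg", datetime(2006, 6, 24, 23, 40)),
    ("photo3.jpg", datetime(2006, 6, 28, 9, 0)),
]
print(items_near(party, photos))  # ['photo1.jpg', 'photo2.jpg']
```

No tags are involved: the search for the party finds the photos purely because their embedded dates sit close to the calendar entry.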
no subject
Date: 2006-06-30 12:08 pm (UTC)
But I also know that the chances of that working _well_ in the near future are small. For a start, my photos are stored with LJ and my calendar is held on Yahoo.
Amplified Intelligence does seem to be the way forward - as a tiny example, when I tag things in del.icio.us it shows me previous tags used for that link, highlighting the ones I've used previously myself. I can then select some to use myself, making the job of tagging much easier. Further improvements like this are the direction I think we're likely to see, with full textual analysis coming in alongside as it becomes more mainstream (and presumably less processor intensive).
no subject
Date: 2006-06-30 09:33 am (UTC)
no subject
Date: 2006-06-30 10:19 am (UTC)
Language is a dynamic entity. It evolves, it changes and people use it with different perceptions (American buzzard = British vulture). LJ is something that is dependent on natural language rather than controlled vocabularies, so attributing any decent form of tagging structure is a herculean task.
no subject
Date: 2006-06-30 11:53 am (UTC)
Or, life is P2P not top down.
no subject
Date: 2006-06-30 12:08 pm (UTC)
no subject
Date: 2006-06-30 12:50 pm (UTC)
*makes the obvious comment that they are not, necessarily, equivalent*
I've seen quite a few Americans online using the phrase "British humor"... you'd think they could just type "humour" and use the different spelling to give the word a more specific meaning.
no subject
Date: 2006-07-01 03:32 am (UTC)
no subject
Date: 2006-06-30 12:54 pm (UTC)
Flickr tags have useful "clusters", so if you search for something then, as well as coming up with anything tagged with that, it can suggest clusters within that search. Searching for "school", for example, brings up one cluster that's pictures of kids, one that's pictures of buildings, one that's pictures of buses, and one that's pictures to do with universities/colleges.
no subject
Date: 2006-07-01 04:54 am (UTC)
flickr's clusters for 'dog' do throw up 'animal'...
http://www.flickr.com/photos/tags/dog/clusters/
and ditto 'dog' for the 'animal' clusters...
http://www.flickr.com/photos/tags/animal/clusters/
I don't know if flickr does it, but given the above clusters it'd be possible for software to search an individual's photos for 'animal' and find their dog photos without them having an 'animal' tag against any of them.