FAQ

About Peerworks and Individualized Tagging

Q: What is Peerworks?
Q: What is the goal of Peerworks?
Q: What is tagging, and why are we using it?
Q: What is individualized tagging, and how do we support it?
Q: How can individualized tagging help the whole community?
Q: Can this approach help people who don’t do their own tagging?
Q: How can individualized tagging improve online communities?
Q: Is individualized tagging available as an Internet service?
Q: How has individualized tagging been tested?
Q: What software is used for project implementation and coding?

Q: What is Peerworks?

Peerworks is a project focused on building content classification tools to improve online browsing, collaboration, and social discovery. We plan to make our technology open source and to work with existing websites and content management systems that want to implement it. We also plan to work with researchers to enhance the technology and to better understand how people use it.

Q: What is the goal of Peerworks?

We want to help communities organize themselves, and help individuals learn what they want to know. Ideally, a site using Peerworks technology would be able to show each user all the items — and only the items — that are useful or interesting to them. To do this, our technology lets users indicate their interests through individualized tagging. Once users have indicated their individual interests, we can find those users who share interests.

Based on these shared interests, users could form broader relationships with each other and build more community structure. This kind of structure can remain fluid and can change as users’ interests and judgments evolve.

Q: What is tagging, and why are we using it?

Tagging simply means labeling items of content with personally meaningful words or phrases. Any user can tag any item of content with any number of tags. Tagging is a very flexible way of organizing content because, unlike traditional hierarchical organization, categories can overlap; an item of content can belong to as many categories as a tagger likes.

There are two approaches to tagging — consensus and individualized. Consensus tagging requires that all users of a site use the same tags, and use them consistently. This is very difficult to enforce, and creates a lot of tension between people who have different interests or tagging styles. We have chosen to use individualized tagging. This is not a new idea — some major sites use it already, and others are considering it — but we take it to a new level by learning each user’s tagging preferences.

Q: What is individualized tagging, and how do we support it?

Individualized tagging lets each user decide on the meaning of the tags they use, rather than having to track a consensus meaning. In our approach, individuals express their interests by inventing tags, or choosing ones that already exist, and tagging a few items with each tag.

Using these examples, our software builds a definition of what that tag means to that individual — we call that definition a “classifier.” Then our software can look at all the other items on the site, and decide fairly accurately how the user would tag those items — it “classifies” all the items. The user can correct (train) the software at any time, making its definition more accurate.
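As a rough illustration of the idea, here is a minimal sketch of a per-user, per-tag classifier. It assumes a simple naive Bayes style word model as a stand-in; the class name, method names, and example texts are hypothetical and are not taken from the Peerworks code.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class TagClassifier:
    """One user's definition of one tag, learned from examples (hypothetical)."""

    def __init__(self):
        self.counts = {True: Counter(), False: Counter()}  # word counts per class
        self.docs = {True: 0, False: 0}  # number of training items per class

    def train(self, text, tagged):
        """Record one example; also used when the user corrects a prediction."""
        self.counts[tagged].update(tokenize(text))
        self.docs[tagged] += 1

    def score(self, text):
        """Return the log-odds that the user would apply this tag to the text."""
        vocab = set(self.counts[True]) | set(self.counts[False])
        totals = {c: sum(self.counts[c].values()) for c in (True, False)}
        log_odds = math.log((self.docs[True] + 1) / (self.docs[False] + 1))
        for word in tokenize(text):
            if word not in vocab:
                continue  # words never seen in training carry no evidence
            # Laplace-smoothed word probabilities for each class.
            p_yes = (self.counts[True][word] + 1) / (totals[True] + len(vocab))
            p_no = (self.counts[False][word] + 1) / (totals[False] + len(vocab))
            log_odds += math.log(p_yes / p_no)
        return log_odds

    def classify(self, text):
        """True if the model thinks the user would apply the tag."""
        return self.score(text) > 0


# A financier teaches the software what "bank" means to them.
bank = TagClassifier()
bank.train("interest rates at the central bank", True)
bank.train("bank loans and credit markets", True)
bank.train("river erosion along the bank", False)
bank.train("lake sediment and water flow", False)

print(bank.classify("the bank raised interest rates on loans"))
print(bank.classify("sediment in the river water"))
```

Calling `train` again with a corrected label is how this sketch models the user training the software: each correction shifts the word counts and therefore the definition.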

Once our software has tagged all the content, the user can view it filtered and sorted by tags, to show the items they are interested in, organized in ways that work well for them.

Q: How can individualized tagging help the whole community?

First, let’s mention a point that may not be obvious. Since each individual is tagging for their own benefit, they have an incentive to tag items with as much detail as needed to filter and organize the content the way they want. Since the software will feed back its understanding of what they mean, they have a continuing incentive to train the software to understand their meaning correctly. This implies that the software tends to build up an increasingly accurate set of definitions for each user’s interests and preferences. It turns out that it is fairly easy to compare two tag definitions and see how similar they are. Based on this, we can find users with similar interests and preferences.

This is different from existing systems that judge similarity based on tag names alone. For example, if a financier and a limnologist both use the tag name "bank," an existing system might think they are talking about the same thing. However, our software will build very different definitions for the two users: the financier's definition will be about institutions that control money, and the limnologist's definition will be about earth that channels water. So by comparing definitions, we can identify users who share interests. More generally, we can look at the full range of definitions across an entire site and track the shifting range of topics that interest users of that site. We can then cluster topics together based on shared interest, automatically generate discussion groups based on overlapping topics, and so on.
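To make the "bank" example concrete, the sketch below compares definitions rather than tag names. It assumes each definition can be reduced to a set of word weights and uses cosine similarity as one plausible comparison; the weights and the third user are invented for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-weight vectors (dicts)."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical learned definitions: word -> weight for each user's tag.
financier_bank = {"loan": 0.9, "interest": 0.8, "deposit": 0.7, "credit": 0.6}
limnologist_bank = {"river": 0.9, "erosion": 0.8, "sediment": 0.7, "water": 0.6}
economist_finance = {"loan": 0.7, "credit": 0.8, "interest": 0.9, "market": 0.5}

# Same tag name, unrelated meanings.
same_name = cosine_similarity(financier_bank, limnologist_bank)
# Different tag names, closely related meanings.
same_topic = cosine_similarity(financier_bank, economist_finance)

print(same_name, same_topic)
```

In this toy data the two "bank" definitions share no words, so they do not match at all, while "bank" and "finance" from two different users score as close neighbors.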

We explicitly accept (and celebrate!) that the users of individualized tags will be a diverse population, each with different interests and ways of classifying the world. Our goal is to accommodate the real diversity of the user population, while also giving people ways to adopt each others’ perspectives, and to collaborate when they want to.

Q: Can this approach help people who don’t do their own tagging?

"Consensus" tag spaces can be created by clustering similar definitions. A site can provide views that are organized according to the consensus tags to help everyone see the big picture. These views can also provide a friendly way for new members of an online community to get up to speed on the existing topics. And people who don't do their own tagging can simply view items using the consensus tags.

The consensus tag space will automatically evolve over time as personal tagging decisions shift. This helps to avoid a straitjacket of fixed topics that have to be changed manually. Emerging issues, changes in community views, and similar shifts will automatically show up in the consensus tag space.
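One simple way to picture the clustering step is sketched below. It assumes each definition has been reduced to a set of characteristic words, and it uses a greedy single-pass grouping with a Jaccard overlap threshold. The user names, tags, and threshold are hypothetical, and a real system might use a very different clustering method.

```python
def jaccard(a, b):
    """Overlap between two sets of characteristic words, from 0.0 to 1.0."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_definitions(definitions, threshold=0.4):
    """Greedily group definitions whose overlap with a cluster's first member
    meets the threshold. Each cluster is a list of (user, tag) keys."""
    clusters = []
    for key, words in definitions.items():
        for cluster in clusters:
            seed_words = definitions[cluster[0]]
            if jaccard(words, seed_words) >= threshold:
                cluster.append(key)
                break
        else:
            clusters.append([key])  # no close match: start a new cluster
    return clusters


# Hypothetical per-user definitions, reduced to their most characteristic words.
definitions = {
    ("ana", "bank"): {"loan", "interest", "credit", "deposit"},
    ("raj", "finance"): {"loan", "interest", "credit", "market"},
    ("lee", "bank"): {"river", "erosion", "sediment", "water"},
    ("kim", "rivers"): {"river", "water", "sediment", "flood"},
}

consensus = cluster_definitions(definitions)
for cluster in consensus:
    print(cluster)
```

Re-running the grouping whenever definitions change is what would let the consensus clusters drift along with individual tagging decisions, with no manual topic list to maintain.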

Q: How can individualized tagging improve online communities?

There are lots of ways, and we have only thought of a few:

Individualized tagging uses the knowledge and judgment of thousands to drive the creation of a "social landscape," which becomes a map of the interests of the community as a whole. Right now, people have to make these judgments anyway for their own purposes, but they can't easily feed them back into the community and contribute to the commonwealth. Learning individual preferences through tagging lets everyone benefit from these individual choices.

Q: Is individualized tagging available as an Internet service?

Not yet. We've had working classifiers since September 2006, and since then we've built a cross-validation system and a series of test databases, both of which are required for testing the effectiveness of the statistical tagging. Most of what we've learned during the past few months has been how to fit the user interface to the semantics of individualized tagging, and how to tune the classifiers effectively. The classifier accuracy is probably adequate for people to use individualized tagging already, but we have to move from a prototype architecture to a more scalable one that lets us break out the classifiers as a service. This isn't a huge job, but we still have a few things to do. Our blog has more project details and status.
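For readers unfamiliar with cross-validation, the sketch below shows the general technique: hold out part of the manually tagged items, train on the rest, and measure how often the predictions match the held-out tags. The keyword learner is a deliberately trivial stand-in, and all names and data here are assumptions rather than a description of our actual test system.

```python
def k_fold_accuracy(items, train_fn, k=5):
    """Average accuracy over k folds. `items` is a list of (text, tagged) pairs;
    `train_fn` takes training pairs and returns a predict(text) -> bool function."""
    if len(items) < k:
        raise ValueError("need at least k items")
    folds = [items[i::k] for i in range(k)]
    accuracies = []
    for i, held_out in enumerate(folds):
        training = [pair for j, fold in enumerate(folds) if j != i for pair in fold]
        predict = train_fn(training)
        correct = sum(predict(text) == tagged for text, tagged in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / k


def train_keyword_model(training):
    """Stand-in learner: remember words seen only in positively tagged items."""
    positive, negative = set(), set()
    for text, tagged in training:
        (positive if tagged else negative).update(text.lower().split())
    keywords = positive - negative
    return lambda text: any(word in keywords for word in text.lower().split())


# Hypothetical manually tagged items for one user's "finance" tag.
items = [
    ("loan rates rise", True), ("river levels drop", False),
    ("new loan rules", True), ("river bank erosion", False),
    ("loan demand falls", True), ("river flood warning", False),
    ("cheap loan offers", True), ("river water quality", False),
    ("loan default risk", True), ("river delta survey", False),
]

accuracy = k_fold_accuracy(items, train_keyword_model, k=5)
print(accuracy)
```

Because every item is held out exactly once, the averaged score reflects how the classifier handles items it has never been trained on.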

Q: How has individualized tagging been tested?

For testing, we're using a database of blog posts that we collect from about 1,172 RSS feeds, chosen for feed quality and diversity. Currently we collect about 28,000 usable items each month. An item is deemed usable if the feed is okay, the item has enough content, the feed has enough items, etc. However, we don't test on the entire database, because testing requires complete manual tagging of a set of items, and no user can tag that many items. Instead, we semi-randomly pull out a representative sample of 2,500 to 5,000 items and tag those to create a test dataset for research.
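The selection process described above can be sketched roughly as follows. The thresholds, field names, and the use of plain random sampling are all assumptions made for illustration; in particular, the text does not specify how the sample is made representative.

```python
import random
from collections import Counter


def usable_items(items, healthy_feeds, min_words=50, min_feed_items=10):
    """Keep items from healthy feeds that have enough text and come from a
    feed with enough items. Both thresholds are illustrative guesses."""
    per_feed = Counter(item["feed"] for item in items)
    return [
        item for item in items
        if item["feed"] in healthy_feeds
        and len(item["text"].split()) >= min_words
        and per_feed[item["feed"]] >= min_feed_items
    ]


def draw_test_sample(items, size, seed=0):
    """Draw a reproducible random sample of items to be tagged by hand."""
    rng = random.Random(seed)
    return rng.sample(items, min(size, len(items)))


# Hypothetical collected items from four feeds.
long_text = " ".join(["word"] * 60)
collected = (
    [{"feed": "a", "text": long_text} for _ in range(12)]    # usable
    + [{"feed": "b", "text": long_text} for _ in range(3)]   # feed too small
    + [{"feed": "c", "text": "too short"} for _ in range(12)]  # not enough content
    + [{"feed": "d", "text": long_text} for _ in range(12)]  # feed not healthy
)

usable = usable_items(collected, healthy_feeds={"a", "b", "c"})
sample = draw_test_sample(usable, size=5)
print(len(usable), len(sample))
```

Fixing the random seed keeps the hand-tagged sample stable between runs, which matters when the same test dataset is reused to compare classifier changes.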

Tagging lots of items is only necessary for our development testing. Users of individualized tagging as a service only need to tag enough items to adequately train the system. Typically the system learns the meaning of a tag pretty well from five or so examples.

Q: What software is used for project implementation and coding?

We're using Trac to manage the project and Ruby on Rails as the implementation vehicle. However, Ruby on Rails isn't a constraint on sites that use individualized tagging; we're planning to package our work as a REST service so it can be integrated with sites regardless of their implementation. We are deciding incrementally how much of the classification to move from Ruby to C for speed; so far it is all in Ruby. Compute performance is okay right now but can certainly be improved. The project will be open source: the code will be available, we'll open up Trac to external access, and we'll be set up to work with outside contributors. We also plan to offer support for sites that want to test our software and integrate it into their systems.

Peerworks is a nonprofit effort, funded by The Kaphan Foundation. Project management and much of the design are provided by Mindloom.
Early investigation of this approach was done by Pliant Research, and is discussed in this interview.