Research project: Increasing innovation velocity in the enterprise using text clustering


21 July 2004

I have an interest in social software in the enterprise - the use of tools like blogs, wikis, document management, RSS, communities, discussion boards, and so on within large organisations to foster “bottom-up” knowledge management and collaboration.

I really became aware of the problem when I received a note from one of the guys at Amazon. He argued that as an organisation they prided themselves on hiring above-average people with above-average desire to create new things, but as a company they still found that ideas were lost and that starting new projects took too long.

We have worked with organisations to implement social software, looking at tools like Ecademy, Socialtext, Enable2, eGroupWare, and others. I argue that the creation of personal content (blogs) and collaborative content (wikis and document management), and its distribution (RSS, Atom), is a solved problem.

What is not a solved problem is how to connect teams or individuals within the organisation that are, unbeknown to each other, working with the same or similar ideas.

I wish to investigate the idea that auto-classification and automatic taxonomy generation may be useful to enable such teams to make contact. The basic idea is to be able to cluster groups of output (blogs, workspaces, etc.) that are discussing similar ideas.

The challenge in the enterprise is that there is simply too much content being generated for a human to follow it (one of my clients has 60,000 employees and some 2,500 active, funded projects).

Search is not the answer, because an individual team does not know that it needs to search.

I do not want to rely exclusively on existing (“top-down”) taxonomies. Chances are that if you have a taxonomy then you have a project, and if you have a project then there are existing processes that enable people to know about it and contribute to it.

I am interested in new projects and emerging ideas within the organisation, and how to bring together the team that can make them happen. This means that I am interested in “the taxonomy of tomorrow”, which is something you haven’t formally built yet.

My theory is that automatic classification and taxonomy generation should be effective when applied within a single enterprise, as the vocabulary and topics will be fairly standard.

I do not wish to rely on authors creating their own categories. In my experience, people don’t categorise. Getting anybody to document what they are doing is enough of a challenge without bringing up topics like “information architecture”.

That is why I am looking for automatic (unsupervised) text clustering. If I have somebody in London who has great ideas for my retail shops; somebody in Manchester who is experimenting with practical changes to my consumer stores; and a man in Glasgow who would like to promote change in our high-street outlets; how do I enable them to discover each other and work together?
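
A minimal sketch of what such unsupervised clustering could look like, assuming TF-IDF word weights, cosine similarity, and a simple "link anything above a threshold" grouping. The posts, the stopword list, the function names, and the 0.25 threshold are all invented for illustration. Note that the three store-related posts deliberately share surrounding vocabulary ("opening hours", "loyalty card") even though they name the stores differently; whether real enterprise content overlaps this much is exactly the open question.

```python
import math
import re
from collections import Counter

# Hypothetical posts, invented for illustration only.
POSTS = {
    "london": "Ideas for our retail shops: longer opening hours and a loyalty card for customers",
    "manchester": "Trialling longer opening hours and a loyalty card in two consumer stores",
    "glasgow": "Proposal to roll out a loyalty card and longer opening hours across high-street outlets",
    "leeds": "Quarterly payroll system migration and database upgrade schedule",
}

# A tiny, hand-picked stopword list (an assumption, not a standard one).
STOPWORDS = {"a", "and", "for", "in", "our", "the", "to", "two", "across", "of"}


def tokenize(text):
    """Lower-case the text and keep alphabetic words that are not stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per document."""
    tokens = {name: tokenize(text) for name, text in docs.items()}
    doc_freq = Counter(term for toks in tokens.values() for term in set(toks))
    n = len(docs)
    vectors = {}
    for name, toks in tokens.items():
        counts = Counter(toks)
        # Smoothed inverse document frequency, so no weight is ever zero.
        vectors[name] = {
            term: count * (math.log((1 + n) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster(docs, threshold=0.25):
    """Group documents that are linked, directly or via others, by similarity >= threshold."""
    vectors = tfidf_vectors(docs)
    unassigned = set(docs)
    clusters = []
    for name in docs:
        if name not in unassigned:
            continue
        group, frontier = set(), [name]
        while frontier:
            current = frontier.pop()
            if current in group:
                continue
            group.add(current)
            frontier.extend(
                other for other in unassigned
                if other not in group
                and cosine(vectors[current], vectors[other]) >= threshold
            )
        unassigned -= group
        clusters.append(sorted(group))
    return clusters


if __name__ == "__main__":
    for group in cluster(POSTS):
        print(group)
```

With these invented posts, London, Manchester and Glasgow end up in one group and the payroll post stays on its own. The threshold is arbitrary; on real content it would need tuning, and posts that share no vocabulary at all would not be connected by this approach.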

I cannot use a standard text classifier on this because I do not have a training set.

An alternative approach explored by people like Matt Mower of eVectors is to assume that 10-20% of people will classify and use that to automatically classify the rest. That is an interesting assumption and a well-understood problem (classify text based on examples) with well-documented solutions, from naive Bayes through neural networks and on to support vector machines and similar techniques (listed here in roughly increasing order of performance).
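
For comparison, here is a minimal sketch of the "classify text based on examples" approach, using a multinomial naive Bayes classifier with add-one smoothing. It is not a description of what eVectors built; the class name, the category labels, and the training posts are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_counts = Counter()            # posts seen per category
        self.word_counts = defaultdict(Counter)  # word counts per category
        self.vocab = set()

    def train(self, labelled_posts):
        for text, label in labelled_posts:
            self.class_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocab.add(word)

    def classify(self, text):
        total_posts = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            # Log prior plus the sum of log likelihoods of the known words.
            score = math.log(count / total_posts)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in tokenize(text):
                if word in self.vocab:  # words never seen in training are ignored
                    score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical posts that the 10-20% of authors who do categorise have labelled.
LABELLED = [
    ("payroll database migration schedule", "it"),
    ("server upgrade and database backup", "it"),
    ("loyalty card trial in stores", "retail"),
    ("longer opening hours for shops", "retail"),
]

if __name__ == "__main__":
    model = NaiveBayes()
    model.train(LABELLED)
    print(model.classify("database upgrade planned"))
    print(model.classify("loyalty card for shops"))
```

The classifier can only ever answer with a category that somebody has already used as a label, which is the limitation this approach carries.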

However, the issue is that you are always using “yesterday’s taxonomy” to categorise. I am not very interested in this, because chances are that if you have a useful taxonomy then you have existing projects within the organisation dealing with the issues, and promoting existing (funded) projects within a company is a (largely) solved organisational problem.

I’m interested in “tomorrow’s taxonomy” to bring together people around new ideas. In the example above, assume that retail stores are a new idea and that the corporate terminology (“retail shops”, “consumer stores”, or “high-street outlets”) has not yet been embedded within the corporate culture. How can I bring together the idea-man in London with the guys in Manchester who can implement his ideas and the manager in Glasgow who can promote the change within the organisation and help it become a change project?

The assumption (which I want to test) is that there will be enough shared vocabulary and a sufficiently limited set of goals and topics that automatic text clustering can work within the single enterprise. The additional advantage is that it doesn’t have to be perfect. I was working with an organisation that estimated that it took them about one year to get from ideas to funded projects. If we can change that average to 11 months, say by driving the time down to six months for the 17% of connections that we successfully make (which seems like an unambitious hope), then that organisation has a real and lasting advantage over its competitors.
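
The 11-month figure is a simple weighted average: 17% of ideas take six months and the remaining 83% still take twelve. A short check of that arithmetic (the function and parameter names are mine, chosen for illustration):

```python
def average_months(baseline=12.0, improved=6.0, share_connected=0.17):
    """Average idea-to-project time when a share of ideas gets the faster path."""
    return share_connected * improved + (1 - share_connected) * baseline


if __name__ == "__main__":
    # 0.17 * 6 + 0.83 * 12 = 1.02 + 9.96 = 10.98 months, i.e. roughly 11.
    print(round(average_months(), 2))
```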

Ideas and suggestions would be most welcome.