On 2004-07-21 14:07:00, Allan Engelhardt wrote in CYBAEA Journal:
I have an interest in social software in the enterprise: the use of tools like blogs, wikis, document management, RSS, communities, discussion boards, and so on within large organisations to foster “bottom up” knowledge management, collaboration, etc.
I really became aware of the problem when I received a note from one of the guys at Amazon. He argued that as an organisation they prided themselves on hiring above-average people with above-average desire to create new things, but as a company they still found that ideas were lost and that starting new projects took too long.
We have worked with organisations to implement social software, looking at tools like Ecademy, Socialtext, Enable2, eGroupWare, and others. I argue that the creation of personal content (blogs) and collaborative content (wikis plus document management), and its distribution (RSS, Atom), are solved problems.
What is not a solved problem is how to connect teams or individuals within the organisation that are, unbeknown to each other, working with the same or similar ideas.
I wish to investigate the idea that auto-classification and automatic taxonomy generation may be useful to enable such teams to make contact. The basic idea is to be able to cluster groups of output (blogs, workspaces, etc.) that are discussing similar ideas.
The challenge in the enterprise is that there is simply too much content being generated for a human to follow it (one of my clients has 60,000 employees and some 2,500 active, funded projects).
Search is not the answer, because an individual team does not know that it needs to search.
I do not want to rely exclusively on existing (“top-down”) taxonomies. Chances are that if you have a taxonomy then you have a project, and if you have a project then there are existing processes that enable people to know about it and contribute to it.
I am interested in new projects and emerging ideas within the organisation, and how to bring together the team that can make them happen. This means that I am interested in “the taxonomy of tomorrow”, which is something you haven't formally built yet.
My theory is that automatic classification and taxonomy generation should be effective when applied within a single enterprise, as the vocabulary and topics will be fairly standard.
I do not wish to rely on authors creating their own categories. In my experience, people don't categorise. Getting anybody to document what they are doing is enough of a challenge without bringing up topics like “information architecture”.
That is why I am looking for automatic (unsupervised) text clustering. If I have somebody in London who has great ideas for my retail shops; somebody in Manchester who is experimenting with practical changes to my consumer stores; and a man in Glasgow who would like to promote change in our high-street outlets; how do I enable them to discover each other and work together?
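To make the London, Manchester and Glasgow example concrete, here is a minimal sketch of unsupervised clustering. It is an illustration under stated assumptions, not a tested enterprise solution: the four short documents, the stopword list, the smoothed TF-IDF weighting and the 0.1 similarity threshold are all invented for the example. Note that it links the three authors only because they share surrounding vocabulary ("layout", "checkout", "queues"); it knows nothing about "shops", "stores" and "outlets" being synonyms.

```python
import math
import re
from collections import Counter
from itertools import combinations

# Hypothetical sample posts; real input would be blog entries or workspace pages.
DOCS = {
    "london": "ideas for new layout and checkout queues in our retail shops",
    "manchester": "experimenting with checkout queues and layout changes in consumer stores",
    "glasgow": "promote layout change and shorter checkout queues in high-street outlets",
    "finance": "quarterly budget forecast and payroll audit schedule",
}

# A tiny illustrative stopword list (an assumption, not a standard one).
STOPWORDS = {"a", "and", "for", "in", "of", "our", "the", "to", "with"}


def tokenize(text):
    """Lowercase, split on non-letters, and drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def tfidf_vectors(docs):
    """Map each document id to a sparse TF-IDF vector (term -> weight)."""
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(docs)
    doc_freq = Counter(term for toks in tokens.values() for term in set(toks))
    # Smoothed IDF, so terms shared by most documents keep some weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return {
        doc_id: {t: count * idf[t] for t, count in Counter(toks).items()}
        for doc_id, toks in tokens.items()
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster(docs, threshold=0.1):
    """Group documents into connected components of the graph whose edges
    join every pair with cosine similarity at or above the threshold."""
    vectors = tfidf_vectors(docs)
    parent = {doc_id: doc_id for doc_id in docs}

    def find(x):
        # Union-find lookup with path compression.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(docs, 2):
        if cosine(vectors[a], vectors[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for doc_id in docs:
        groups.setdefault(find(doc_id), set()).add(doc_id)
    return list(groups.values())


GROUPS = cluster(DOCS)
```

With these sample posts, the three shop-related authors end up in one group and the finance post stays on its own. On real content the threshold and weighting would need tuning, and plain word overlap may well be too weak when teams share no vocabulary at all.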
I cannot use a standard text classifier on this because I do not have a training set.
An alternative approach, explored by people like Matt Mower of eVectors, is to assume that 10-20% of people will classify and to use their classifications to automatically classify the rest. That is an interesting assumption and a well-understood problem (classify text based on examples) with well-documented solutions, from naive Bayes through neural networks and on to support vector machines and similar methods (listed here in roughly increasing order of performance).
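For comparison, the "classify text based on examples" problem can be sketched with a small multinomial naive Bayes classifier. This is a generic textbook version with add-one smoothing, not a description of any eVectors implementation; the labelled posts and the category names "retail" and "it" are made up for the example.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical posts that the 10-20% of diligent authors have categorised.
LABELLED = [
    ("store layout and checkout queues", "retail"),
    ("shop opening hours and checkout staff", "retail"),
    ("server upgrade and database backup", "it"),
    ("network outage and server monitoring", "it"),
]


def tokenize(text):
    """Lowercase and split on non-letters."""
    return re.findall(r"[a-z]+", text.lower())


def train(labelled):
    """Collect per-category document counts, term counts and the vocabulary."""
    class_docs = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labelled:
        class_docs[label] += 1
        for term in tokenize(text):
            term_counts[label][term] += 1
            vocab.add(term)
    return class_docs, term_counts, vocab


def classify(text, model):
    """Return the category with the highest log posterior for the text."""
    class_docs, term_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_docs.items():
        total_terms = sum(term_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for term in tokenize(text):
            if term not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = term_counts[label][term] + 1
            score += math.log(count / (total_terms + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


MODEL = train(LABELLED)
print(classify("new checkout layout for the store", MODEL))
```

The classifier can only ever answer with a category that already appears in the labelled examples.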
However, the issue is that you are always using “yesterday's taxonomy” to categorise. I am not very interested in this, because chances are that if you have a useful taxonomy then you have existing projects within the organisation dealing with the issues, and promoting existing (funded) projects within a company is a (largely) solved organisational problem.
I'm interested in “tomorrow's taxonomy” to bring together people around new innovative ideas. In the example above, assume that retail stores are a new idea and that the corporate terminology (“retail shops”, “consumer stores”, or “high-street outlets”) has not yet been embedded within the corporate culture. How can I bring together the idea-man in London with the guys in Manchester who can implement the ideas and the manager in Glasgow who can promote the change within the organisation and help it become a change project?
The assumption (which I want to test) is that there will be enough shared vocabulary and a sufficiently limited set of goals and topics that automatic text clustering can work within the single enterprise. The additional advantage is that it doesn't have to be perfect. I was working with an organisation that estimated that it took them about one year to get from ideas to funded projects. If we can change that average to 11 months, say by driving the time down to six months for the 17% of connections that we successfully make (which seems like an unambitious hope), then that organisation has a real and lasting advantage over its competitors.
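The arithmetic behind the 11-month figure is a simple weighted average, sketched below. The function names are made up for the illustration, and the model assumes that every idea either takes the full 12 months or, if a connection is made, exactly six.

```python
def average_months(share_connected, fast_months=6.0, baseline_months=12.0):
    """Average idea-to-funding time when a share of ideas takes the fast path."""
    return share_connected * fast_months + (1 - share_connected) * baseline_months


def required_share(target_months, fast_months=6.0, baseline_months=12.0):
    """Share of ideas that must take the fast path to reach the target average."""
    return (baseline_months - target_months) / (baseline_months - fast_months)


AVERAGE = average_months(0.17)   # 0.17 * 6 + 0.83 * 12
SHARE = required_share(11.0)     # (12 - 11) / (12 - 6)
print(round(AVERAGE, 2), round(SHARE, 3))
```

Connecting 17% of ideas gives an average of about 10.98 months, and the exact share needed to hit 11 months is one in six, or roughly 16.7%.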
Ideas and suggestions would be most welcome.
On 2009-07-02 20:33:00, Allan Engelhardt wrote in CYBAEA Data and Analysis:
I am a sucker for good quality data. I wrote about data.gov, the US Government data site, before, and now I find OECD Statistics, which has some 300 data sets, many of which seem to be readily accessible (though some may require subscription).
On 2009-06-16 10:27:00, Allan Engelhardt wrote in CYBAEA Data and Analysis:
I like the "multicore" library for a particular task. I can easily write an if(require("multicore",...)) check so that my function automatically uses the parallel mclapply() instead of lapply() where it is available. Which is grand 99% of the time, except when my function is itself called from mclapply() (or one of the lower-level functions), in which case much CPU thrashing and grinding of teeth will result.
So, I needed a function to determine if my function was called from any function in the "multicore" library. Here it is.
On 2009-06-12 10:23:00, Allan Engelhardt wrote in CYBAEA Data and Analysis:
Somebody on the R-help mailing list asked how to get Rmpi working on his Fedora Linux machine so he could do high-performance computing on a cluster of machines (or a single multicore machine) using the R statistical computing and analysis platform. Since it is unusually painful to get working, I might as well copy the instructions here.
On 2009-06-09 11:23:00, Allan Engelhardt wrote in CYBAEA Data and Analysis:
O’Reilly has published Data Mashups in R as a $4.99 PDF download in their Short Cut series. In 27 pages it takes you through an example of how to combine foreclosure information with maps and geographical information to produce plots like the one here. This is all done with the R statistical computing and analysis platform.
On 2009-06-01 07:07:00, Allan Engelhardt wrote in CYBAEA Data and Analysis:
Hugh Miller, the leader of the team that won the KDD Cup 2009 Slow Challenge (which we wrote about recently), kindly provides more information about how to win this public challenge using the R statistical computing and analysis platform on a laptop (!).