On 2004-07-21 14:07:00, Allan Engelhardt wrote in CYBAEA Journal:
I have an interest in social software in the enterprise - the use of tools like blogs, wikis, document management, RSS, communities, discussion boards, and so on within large organisations to foster “bottom up” knowledge management, collaboration, etc.
I really became aware of the problem when I received a note from one of the guys at Amazon. He argued that as an organisation they prided themselves on hiring above-average people with above-average desire to create new things, but as a company they still found that ideas were lost and that starting new projects took too long.
We have worked with organisations to implement social software, looking at tools like Ecademy, Socialtext, Enable2, eGroupWare, and others. I argue that the creation of personal (blog) and collaborative (wiki + document management) content, and its distribution (RSS, Atom), is a solved problem.
What is not a solved problem is how to connect teams or individuals within the organisation that are, unbeknown to each other, working with the same or similar ideas.
I wish to investigate the idea that auto-classification and automatic taxonomy generation may be useful to enable such teams to make contact. The basic idea is to be able to cluster groups of output (blogs, workspaces, etc.) that are discussing similar ideas.
The challenge in the enterprise is that there is simply too much content being generated for a human to follow it (one of my clients has 60,000 employees and some 2,500 active, funded projects).
Search is not the answer, because an individual team does not know that it needs to search.
I do not want to rely exclusively on existing (“top-down”) taxonomies. Chances are that if you have a taxonomy then you have a project, and if you have a project then there are existing processes that enable people to know about them and contribute to them.
I am interested in new projects and emerging ideas within the organisation, and how to bring together the team that can make them happen. This means that I am interested in “the taxonomy of tomorrow”, which is something you haven't formally built yet.
My theory is that automatic classification and taxonomy generation should be effective when applied within a single enterprise, as the vocabulary and topics will be fairly standard.
I do not wish to rely on authors creating their own categories. In my experience, people don't categorise. Getting anybody to document what they are doing is enough of a challenge without bringing up topics like “information architecture”.
That is why I am looking for automatic (unsupervised) text clustering. If I have somebody in London who has great ideas for my retail shops, somebody in Manchester who is experimenting with practical changes to my consumer stores, and a man in Glasgow who would like to promote change in our high-street outlets, how do I enable them to discover each other and work together?
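To make the idea concrete, here is a minimal sketch of unsupervised grouping with no training set. It is not a method from any of the tools mentioned above: the sample posts, the stop-word list, the smoothed TF-IDF weighting, and the 0.2 similarity threshold are all assumptions chosen for illustration. Posts are linked when the cosine similarity of their term vectors reaches the threshold, and linked posts are merged into one group.

```python
import math
import re
from collections import Counter
from itertools import combinations

# Hypothetical sample posts, invented for illustration only.
POSTS = {
    "london": "Ideas for our retail shops: better window displays and faster checkout for customers",
    "manchester": "Experimenting with checkout changes and window displays in our consumer stores",
    "glasgow": "Promoting change in high-street outlets: customers want faster checkout",
    "leeds": "Quarterly payroll system migration and database backup schedule",
}

# A deliberately tiny stop-word list (an assumption, not a standard one).
STOPWORDS = {"a", "an", "and", "the", "for", "our", "in", "of", "to", "with"}


def tokenize(text):
    """Lower-case the text and keep alphabetic tokens that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per document (smoothed IDF)."""
    tokens = {name: tokenize(text) for name, text in docs.items()}
    doc_freq = Counter(t for toks in tokens.values() for t in set(toks))
    n = len(docs)
    vectors = {}
    for name, toks in tokens.items():
        counts = Counter(toks)
        vectors[name] = {
            t: c * (math.log((1 + n) / (1 + doc_freq[t])) + 1.0)
            for t, c in counts.items()
        }
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster(docs, threshold=0.2):
    """Group documents by merging every pair at or above the threshold."""
    vectors = tfidf_vectors(docs)
    parent = {name: name for name in docs}

    def find(x):
        # Follow parent links to the group representative.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(docs, 2):
        if cosine(vectors[a], vectors[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in docs:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(g) for g in groups.values())


print(cluster(POSTS))
# [['glasgow', 'london', 'manchester'], ['leeds']]
```

Note that nothing here knows that “retail shops”, “consumer stores” and “high-street outlets” mean the same thing. The three authors are grouped only because the rest of their vocabulary overlaps, and Manchester and Glasgow are joined indirectly through London. Whether real enterprise content has enough overlap of this kind is precisely the open question.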
I cannot use a standard text classifier on this because I do not have a training set.
An alternative approach, explored by people like Matt Mower of eVectors, is to assume that 10-20% of people will classify and to use that to automatically classify the rest. That is an interesting assumption and a well-understood problem (classify text based on examples) with well-documented solutions, from naive Bayes through neural networks and on to support vector machines and similar (listed here in roughly increasing order of performance).
However, the issue is that you are always using “yesterday's taxonomy” to categorise. I am not very interested in this, because chances are that if you have a useful taxonomy then you have existing projects within the organisation dealing with the issues, and promoting existing (funded) projects within a company is a (largely) solved organisational problem.
I'm interested in “tomorrow's taxonomy” to bring together people around new innovative ideas. In the example above, assume that retail stores are a new idea and that the corporate terminology (“retail shops”, “consumer stores”, or “high-street outlets”) has not yet been embedded within the corporate culture. How can I bring together the idea-man in London with the guys in Manchester who can implement the ideas and the manager in Glasgow who can promote the change within the organisation and help it become a change project?
The assumption (which I want to test) is that there will be enough shared vocabulary and a sufficiently limited set of goals and topics that automatic text clustering can work within the single enterprise. The additional advantage is that it doesn't have to be perfect. I was working with an organisation that estimated that it took them about one year to get from ideas to funded projects. If we can change that average to 11 months, say by driving the time down to six months for the 17% of connections that we successfully make (which seems like an unambitious hope), then that organisation has a real and lasting advantage over its competitors.
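The arithmetic behind the 11-month figure is a simple weighted average, sketched below. The function names are hypothetical; the 12-month baseline, six-month fast track, and 17% share are the numbers from the paragraph above.

```python
def average_months(baseline, fast, share_connected):
    """Average time when a share of ideas takes the fast route."""
    return share_connected * fast + (1 - share_connected) * baseline


def required_share(baseline, fast, target):
    """Share of ideas that must take the fast route to hit the target average."""
    return (baseline - target) / (baseline - fast)


avg = average_months(baseline=12, fast=6, share_connected=0.17)
share = required_share(baseline=12, fast=6, target=11)

print(round(avg, 2))    # 10.98
print(round(share, 3))  # 0.167
```

So connecting roughly one idea in six is enough to take a month off the average, which is where the 17% comes from.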
Ideas and suggestions would be most welcome.
On 2010-07-13 07:47:00, Allan Engelhardt wrote in CYBAEA Data and Analysis:
I am not sure apeescape’s ggplot2 area plot with intensity colouring is really the best way of presenting the information, but it had me intrigued enough to replicate it using base R graphics.
The key technique is to draw a gradient line, which R does not support natively, so we have to roll our own code for that. Unfortunately, lines(..., type="l") does not recycle the colour col= argument, so we end up with rather more loops than I thought would be necessary.
We also get a nice opportunity to use the under-appreciated read.fwf function.
Read more (~535 words).
On 2010-06-22 11:45:00, Allan Engelhardt wrote in CYBAEA Journal:
We have a mild obsession with employee productivity and how it declines as companies get bigger. We have previously found that when you treble the number of workers, you halve their individual productivity, which is scary.
We now re-do the analysis four years later and, just because we can, we are using the leading companies of the London stock exchange instead of the largest American companies.
The results still hold. We called it the 3/2 rule: treble the number of workers and you halve their individual productivity. Large companies with ten times the number of employees are ¼ as productive as their smaller competitors.
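The two figures above are consistent with each other if we read the rule as a power law, which is an assumption made here for illustration rather than something stated in this summary. Under that reading, productivity per employee scales as the size ratio raised to the power -log(2)/log(3), about -0.63. The function name below is hypothetical.

```python
import math

# Exponent chosen so that trebling the headcount halves the productivity.
# Treating the rule as a power law is an assumption for this sketch.
EXPONENT = -math.log(2) / math.log(3)


def relative_productivity(size_ratio):
    """Productivity per employee relative to a company size_ratio times smaller."""
    return size_ratio ** EXPONENT


print(round(EXPONENT, 3))                   # -0.631
print(round(relative_productivity(3), 3))   # 0.5
print(round(relative_productivity(10), 3))  # 0.234
```

Ten times the employees gives about 0.23 of the productivity, which rounds to the ¼ quoted above.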
Employee productivity is a big issue. If all the FTSE-100 companies achieved their average profits per employee, then the index would generate almost £1 trn of additional net profits for the economy.
Read more (~245 words).
On 2010-06-22 11:20:00, Allan Engelhardt wrote in CYBAEA Data and Analysis:
We have a mild obsession with employee productivity and how it declines as companies get bigger. We have previously found that when you treble the number of workers, you halve their individual productivity, which is mildly scary.
We revisit the analysis for the FTSE-100 constituent companies and find that the relation still holds four years later and across a continent.
Read more (~763 words, 5 comments).
On 2010-06-17 09:05:00, Allan Engelhardt wrote in CYBAEA Data and Analysis:
Following on from my previous post about improving the performance of R by linking with optimized linear algebra libraries, I thought it would be useful to try out the five benchmarks Revolution Analytics have on their Revolutionary Performance pages.
Read more (~300 words, 2 comments).
On 2010-06-15 10:21:00, Allan Engelhardt wrote in CYBAEA Data and Analysis:
Can we make our analysis using the R statistical computing and analysis platform run faster? Usually the answer is yes, and the best way is to improve your algorithm and variable selection.
But recently David Smith suggested that a big benefit of their (commercial) version of R was that it was linked to a better linear algebra library. So I decided to investigate.
The quick summary is that it only really makes a difference for fairly artificial benchmark tests. For “normal” work you are unlikely to see a difference most of the time.
Read more (~934 words, 1 comment).