Umbria – Market Intelligence from Blogs

FORTUNE has an article (“Blogging for Dollars”) that covers Umbria, a company based here in Colorado that tracks what bloggers are saying about its clients (aka mining blogs for market intelligence).

Economically, this market is finally starting to take shape — the ideas and attempts have been out there for a few years, but consumer companies have been on the fence about whether the blogosphere is worth listening in on. Until recently, that is. Umbria claims they’ll have $2M revenue this year and will be profitable next year, but the overall market for this kind of service is still only $20M according to the article (Intelliseek has about 1/3rd of that market).

Technologically, Umbria also sounds pretty interesting. They claim to have a competitive edge in automating most of the process:

Umbria’s solution is entirely software-based. [Umbria’s] competitors also meet with clients to interpret the data and suggest strategic responses. “Ultimately we rely on both technology and humans for analysis,” says Max Kalehoff, marketing director for BuzzMetrics [another Umbria competitor]. “Umbria takes an extremely automated approach.”

Umbria’s technology sounds like a pipeline of parsers that generates features that in turn drive product and sentiment classifiers (and those drive reporting):

Every few hours Umbria sends an application called a spider out over the web to scour the blogosphere for postings about the firm’s clients, most of which are big consumer companies, such as Electronic Arts, SAP, and Sprint. By analyzing keywords in blogs, Umbria can classify each citation thematically. In the case of Sprint, for example, Umbria’s software can tell whether a blogger is talking about customer service, the company’s advertisements, or a particular calling plan.

Another big challenge is to decipher what’s on a blogger’s mind. To figure out whether an opinion is strong or tepid, for example, it helps to know that “awesome” is a stronger endorsement than “pretty cool,” and that “shoddy” is less damning than “abominable.” Umbria has several employees with Ph.D.s in linguistics and artificial intelligence who are forever tweaking the software to make it better at categorizing opinions.

I can’t help thinking that more manual tweaking goes into each client’s setup than this description lets on, but still, I’m glad they’re seeing success, and I bet those linguists are having fun with the blogosphere, even if they have to do a bit of slumming to come up with their rules:

The software can also estimate the author’s age and gender. Elongated spellings (“soooooooo”), multiple exclamation marks (!!!), and acronyms such as POS (“parent over shoulder”) suggest a teenage female member of Generation Y (born after 1979). The blogger is probably a teenage boy if a posting is rife with hip-hop terminology such as “aight” (translation: “all right”) and “true dat” (“I agree!”).

There you have it, you don’t even have to know the language to have your voice heard by the people who want to sell you more stuff. Now that’s power. On one side of that function, at least.

Spolsky: Windows Live, Marimba Phenomenon

Joel Spolsky had a similarly disappointing experience with Windows Live as I did. He calls it the Marimba Phenomenon:

The Marimba Phenomenon is what happens when you spend more on PR and marketing than on development. “Result: everybody checks out your code, and it’s not good yet. These people will be permanently convinced that your code is simple and inadequate, even if you improve it drastically later.”

Hadn’t heard it called that before, but it happens often enough, and the name is great: In the mid 90’s, the company Marimba trumpeted and eventually launched a product called Castanet that was something like a Java-based push platform (this wasn’t long after, or maybe even concurrently to, PointCast — remember them?). To Java types, and many non-Java types (or Java non-types?), Marimba Castanet sounded foundational, revolutionary, indispensable, and had an aura of universal usefulness, with a hip 90’s name to boot. I think the product is still around in a different incarnation. Anyway, the release of Castanet revealed a product that just didn’t live up to the high expectations that had been set. It worked (and probably still works), but it wasn’t foundational, revolutionary, indispensable or universal. So we moved on.

Microsoft has a bit more staying than Marimba, but as long as there’s choice in the Web marketplace, it can’t afford too many launches going off the rails like this.

Windows Half-Live?

Windows Live looks like Microsoft’s tardy and half-baked answer to My Yahoo! It’s a customizable portal, with placeholders for weather and news and feed subscriptions etc. According to Bill Gates’ announcement today (video at CNET), Windows and Office are not required to use Windows Live. But try it from Safari on your Mac, and you’ll get just a fraction of the page (only an MSN search box). On Firefox (at least on Mac) you’ll see this:

Firefox support is coming soon. Please be patient 🙂

You know, a garage startup can maybe get away with this kind of thing. But this is Microsoft! And the announcement was a major event, not some leak of an internal research project. OK, so it works only in IE, and I guess Windows Live is destined to be the home page for millions of unsuspecting users of the next version of IE. But if Microsoft wants the rest of us to pay attention and if it wants to be taken seriously in its efforts to catch up with the new realities of Web-as-Desktop (call it whatever you want, but don’t call it Web 2.0), then it has to demonstrate that it’s a) adding some value — that’s TBD for Windows Live — and b) not going to make a fool of itself by trying to bring its insidious embrace-and-extend practices to Web content. That would be fun to watch, though. Never a dull moment … [nor a productive one].

Tagyu

Now this is cool. Tagyu is an “auto-tagging” service of sorts, created by Adam Kalsey. You paste in some text (or submit via their REST API) and it suggests tags, using some kind of a similarity metric between your text and already tagged texts in Tagyu’s index (gathered from del.icio.us etc.).

So far, I’ve tried a few different texts, and about half the time the returned tags are great. This is impressive, because this is not an easy problem to solve, but 50% precision is not quite enough for prime time. If someone (sploggers?) unleashes Tagyu to auto-tag a large volume of posts that feed back into the del.icio.us and Tagyu system, that would be detrimental to improving precision of the system, unless you could assign some kind of a score to the quality of tags (yes, that’s a chicken/egg thing).

Maybe we need some kind of a large-scale tag-quality feedback system. Some clever piece of javascript that lets you click “this tag is right on” or “this tag is a cruel joke” when reading someone’s blog or feed. Of course, if you’re an idiot at tagging, you’re not going to install that piece of javascript. An aggregator might be the best place to do that, where attention.xml lives (eventually).

This is the first service of this kind that I’m aware of, and there are lots of applications of this kind of thing in blog search. There could be an ad-matching app in there, too. And, an intermediate step in Tagyu is matching content to other content (and then to tags). I hope Adam Kalsey keeps up the R&D effort on this. Tagyu has a super-clean looking site. Very nice.

btw, for this post’s text, Tagyu returned the following tags: tagging del.icio.us tools. Looks good to me.

—–
(Via BuzzMachine)

Filemaker is pesterware

I downloaded a 30-day trial of Filemaker for OS X about 2-3 weeks ago. Had some ideas for a notetaking system with tagging, dynamic cross-linking, flexible querying, stuff like that. I haven’t had time to even unzip the trial, but I’ve already received two phone calls and two follow-up emails from Filemaker sales reps. Don’t they have anything better to do? Isn’t Filemaker selling without this kind of pestering? What’s wrong with these people? If I have trouble with it, I know where to go. If I want to buy a license, I know where to go for that, too.

At this rate, it’s unlikely that I’ll even take the time to install the trial. I’ll keep using MacJournal and see if I can uncover some features that get me closer to what I’m imagining my note taking application to be. MacJounral is a nice piece of software, and I haven’t been pestered by them once.

Podcast Recommendation: The Word Nerds

If you’re a word nerd or, heck, if you just speak or are learning English, you’ll enjoy The Word Nerds, a weekly podcast by two brothers on topics like The Cold War and Hostile Language, Collective Nouns, The Unnamed Antecedent and segments like The Rude Word of the Week. It’s obvious these guys spend hours preparing for their podcasts. They don’t just sit down and start talking (I won’t name names. Well, maybe in a future post I will). Anyway, The Word Nerds gets the coveted remylabs podcast seal of approval, which means I’ll stay subscribed in iTunes NetNewsWire (does OPML, unlike iTunes) as long as they keep ‘casting.

Yahoo == IR talent magnet | The tip of the iceberg

Article in NY Times today, Yahoo is wooing I.B.M. Technical Talent:

Yahoo plans to announce Thursday that it is recruiting scientists who pioneered an advanced search-engine technology at I.B.M.’s Silicon Valley research laboratory.

Prabhakar Raghavan, a computer scientist who once led the Clever effort, joined Yahoo last week as head of research. He left I.B.M. in 2000 to become a vice president and chief scientist at Verity Inc., a maker of search and retrieval software for corporations; he was later named chief technical officer.

Yahoo offers one of the best opportunities to explore new ideas in search, Mr. Raghavan said

One area that will be pursued is new search technologies related to digital media.

It’s been fun to watch Google being forced from the position of category killer to more-or-less evenly matched contestant over the last year or two. There’s a mind-boggling amount of innovation happening in search, which is levelling the playing field for new entrants, but even the stuff we’re seeing now is only the beginning. Search, and other modes of information retrieval, will become even more ubiquitous and integrated than they are now, and we’ll wonder how an OS like Windows without integrated search ever came to dominate a market. The desktop market itself may go away (yes, I’ve been reading Paul Graham’s book Hackers and Painters, which contains this great essay on server-based software from 2001, which is still relevant and engaging, as are his many other essays).

Search is poised to become the great collective memory, and new research being brought to market in real services, along with the availability of public APIs, will speed progress toward that reality. But it won’t be just the extent of information covered by search that will grow, but also interconnectivity of seach services and, most importantly, new modes of retrieving information (the only mode now in widespread use is keyword search, which is as old computer science itself — or much older, if you count manual versions such as file cabinets and card catalogs and other manually compiled indexes). I don’t see any reason why search shouldn’t aim to duplicate in software all of the modes in which humans retrieve information in their own brains (by context, by association and so on) or from others, by interactive question answering or guided discovery.