TechFuga and News Aggregation

TechFuga launched a new version today. Among the many improvements in the service are better clustering, better search, and an interesting feature called “upcoming news” that attempts to surface stories that are not yet popular but are showing signs of gaining popularity.


I like these services because they efficiently surface interesting news that is domain specific (e.g. tech, politics, sports). In many ways they also represent a future for media, because aggregation is proving to be a bigger lever for publishers than organic traffic growth. You could almost say that aggregation is a perfect complement to search: whereas search surfaces results based on specific keywords, aggregation surfaces results based on domain.

There is a real debate about the legitimacy of purely machine-driven aggregation. Gabe Rivera, whom I consider an authority on this subject, explained this in great detail back in December when he announced that Techmeme would be augmenting its service with human editors.

The problem is context and relevance. Current mainstream aggregators determine these by extracting keywords from unstructured text, building giant dictionaries that help derive context, and then examining link patterns. Semantic search technologies go a step further by building triples that attempt to bring greater context to the entities found in unstructured text.
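To make the two ideas above a little more concrete, here is a minimal sketch in Python. It is not how any particular aggregator works: the stopword list, the function names, and the hand-written triples are all assumptions for illustration. Real systems use far larger dictionaries and proper entity extraction rather than simple word counts.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real system would rely on
# the "giant dictionaries" mentioned above.
STOPWORDS = {"a", "an", "the", "of", "to", "and", "in", "is", "for", "on", "with"}

def extract_keywords(text, top_n=3):
    """Return the top_n most frequent non-stopword terms in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_n)]

# A triple is (subject, predicate, object). These are written by hand for
# illustration; they are not the output of a real semantic extractor.
triples = [
    ("Freebase", "is_a", "database"),
    ("Techmeme", "uses", "human editors"),
]

def objects_for(subject, predicate, store):
    """Look up every object that matches a subject and predicate."""
    return [o for s, p, o in store if s == subject and p == predicate]

text = "Digg launched a recommendation engine. The engine ranks Digg stories for Digg users."
keywords = extract_keywords(text, top_n=2)
freebase_types = objects_for("Freebase", "is_a", triples)
```

The keyword list tells you which words a story leans on, while the triples say how two things relate, and that relationship is the extra context that frequency counts alone cannot provide.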

Current relevance technologies have several problems. Entity extraction, done well, is hard, and small errors magnify into gaping mismatches from the user’s perspective. There is the stale news problem, where old news is resubmitted based on perceived relevance. And link patterns are prone to surfacing a lot of stories that are essentially identical because they all trace back to the same source. Link analysis also doesn’t work very well in domains where there isn’t a lot of linking, such as food publications, which explains why aggregation sites tend to focus on technology, sports, celebrity news, and politics/current events.

Semantic technologies offer an appealing future, but make no mistake: they are demanding from a development standpoint. Freebase attempts to circumvent the semantic challenges by relying on existing data sets that have already been organized (e.g. Wikipedia) and on a community-based approach that allows for massive data organization by domain without relying solely on machines to solve the semantic problem. But Freebase is a database at its core, one that other services can use to build context; it is not itself a news or content aggregation site.

Despite all the challenges with news aggregation, the fact remains that this is a very logical approach for publishers and users alike. By clustering related content together and presenting it with an appealing user experience that moves content up/down based on popularity and relevancy, we all benefit.
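As a rough illustration of moving content up and down by popularity and relevancy, here is one way such a score could be sketched. The formula, the 24-hour half-life, and all of the names are my own assumptions rather than anything these services have published. The time decay is one simple way to address the stale news problem.

```python
def rank_score(popularity, relevance, age_hours, half_life_hours=24.0):
    """Combine popularity and relevance, halving the score every half_life_hours."""
    decay = 0.5 ** (age_hours / half_life_hours)
    return popularity * relevance * decay

# Hypothetical stories: (title, popularity, relevance, age in hours).
stories = [
    ("Old but heavily linked story", 100, 0.5, 72),
    ("Fresh story gaining links", 40, 0.8, 0),
]

ranked = sorted(stories, key=lambda s: rank_score(s[1], s[2], s[3]), reverse=True)
top_title = ranked[0][0]
```

With these numbers the fresh story outranks the older one despite having fewer links, because three half-lives have cut the older story's score to an eighth.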

Related Content Startups Raising Capital

OneSpot is an interesting company and it is good news to read that they have successfully raised $4.2m in capital.

There are a bunch of companies in the “related content” space, and what is interesting is that there are several deployment models that not only complement each other but also offer unique monetization options. OneSpot is one such company. Like DayLife, NewsGator, Zemanta, Sphere, and Inform, it offers a service that combines a revenue opportunity for publishers and content owners with a compelling user experience.

“Obviously, the business guys ask, ‘Are we getting more visits, are people staying longer?’ but from the editorial side the question is, ‘Does this make my site a better place for my audience?’” Cohen said. The Journal found the answer was “yes,” because it meant Journal readers saw even more relevant and comprehensive news coverage, coverage that wasn’t limited to what the Journal had the resources to cover. The fact that Journal editors only had to put in a few minutes of work a day to make it work didn’t hurt, either.

[From OneSpot raises $4.2M for customized news aggregation » VentureBeat]

One of the challenges with delivering good related content is having access to a broad range of full text content sources. It’s really difficult to do related content with partial text excerpts, because driving a search function requires good entity extraction to know what to search for, and an excerpt gives the extractor very little text to work with.

I’ve been using a wide range of these services and find two weaknesses in most of them. The first is the surfacing of related content that is basically a bunch of copies of the same source. I don’t want to read 10 articles that are carbon copies of one another; I want 10 articles that are related to the source but offer a range of perspectives.
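The carbon copy problem is one that can be attacked fairly directly. Below is a minimal sketch using word shingles and Jaccard similarity, a common near-duplicate technique. I am not claiming any of these services use it, and the shingle size, the 0.6 threshold, and the function names are assumptions chosen for the example.

```python
import re

def shingles(text, size=3):
    """Return the set of overlapping word n-grams (shingles) in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def jaccard(a, b):
    """Size of the intersection divided by size of the union; 1.0 if both are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def drop_near_duplicates(articles, threshold=0.6):
    """Keep the first article of each near-duplicate group, preserving order."""
    kept, kept_shingles = [], []
    for article in articles:
        current = shingles(article)
        if all(jaccard(current, seen) < threshold for seen in kept_shingles):
            kept.append(article)
            kept_shingles.append(current)
    return kept

# Illustrative headlines only; real input would be full article text.
articles = [
    "OneSpot raises 4.2 million dollars for customized news aggregation",
    "OneSpot raises 4.2 million dollars for customized news aggregation today",
    "Publishers debate whether human editors improve machine aggregation",
]
unique = drop_near_duplicates(articles)
```

The second headline shares eight of nine shingles with the first, so it is dropped, while the third shares none and is kept. This only catches textual copies. Finding articles that offer genuinely different perspectives is a much harder problem.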

Secondly, the quality of the surfaced articles is inconsistent and, more significantly, it doesn’t improve as a function of how I use (click) the delivered content. I don’t think active rating (thumb up/down) is the answer, but my clickstream should offer enough behavioral cues to feed back into the search engine and refine the results. However, before any of this could happen there would have to be integration with user profiles, or at least a cookie-based approach.
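Here is a small sketch of what feeding clicks back into the results might look like. Everything in it is hypothetical: the `ClickProfile` class, the idea of tagging each result with a single topic, and the boost formula are assumptions, and the user ID stands in for whatever a profile or cookie would provide.

```python
from collections import defaultdict

class ClickProfile:
    """Per-user topic click counts, keyed by a hypothetical cookie or profile ID."""

    def __init__(self):
        self._clicks = defaultdict(lambda: defaultdict(int))

    def record_click(self, user_id, topic):
        self._clicks[user_id][topic] += 1

    def topic_weight(self, user_id, topic):
        """Share of this user's clicks that went to topic; 0.0 for unknown users."""
        user_clicks = self._clicks.get(user_id)
        if not user_clicks:
            return 0.0
        return user_clicks.get(topic, 0) / sum(user_clicks.values())

def rerank(results, profile, user_id, boost=1.0):
    """Sort (title, topic, base_score) results by base_score scaled up for clicked topics."""
    def score(result):
        _, topic, base_score = result
        return base_score * (1.0 + boost * profile.topic_weight(user_id, topic))
    return sorted(results, key=score, reverse=True)

profile = ClickProfile()
for _ in range(3):
    profile.record_click("cookie-123", "tech")

results = [
    ("Election roundup", "politics", 0.9),
    ("New aggregator launches", "tech", 0.8),
]
personalized = rerank(results, profile, "cookie-123")
anonymous = rerank(results, profile, "unknown-visitor")
```

For the visitor with a click history the tech story moves to the top, while an unknown visitor sees the original ordering. No thumbs up or down are required, only the clicks the user was going to make anyway.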

Digg’s Recommendation Engine

We’ve been developing filtering technologies based on behaviors and expressed likes/dislikes. It’s hard stuff and one thing is evident: relying on a single mechanism or ideology for recommendations is a strategy fraught with risk.

If you rely on active participants, people training the recommendation engine, you simply won’t get the data inputs necessary to deliver good recommendations. It’s equally true that if you rely simply on historical behaviors you will end up with a recommendation engine that breaks easily when an outlier condition is observed. Anyone who has purchased a random gift item for someone on Amazon knows exactly what I am referring to: the suggested items list gets polluted.
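One simple guard against the gift purchase problem is to ignore categories that show up too rarely in a user's history. The sketch below illustrates that idea only. The thresholds and names are assumptions, and I have no knowledge of how Amazon or Digg actually handle outliers.

```python
from collections import Counter

def stable_interests(history, min_count=2, min_share=0.2):
    """Return categories that appear often enough to be treated as real interests.

    A category must appear at least min_count times and account for at least
    min_share of all events, so a one-off purchase is ignored.
    """
    counts = Counter(history)
    total = sum(counts.values())
    return {
        category
        for category, n in counts.items()
        if n >= min_count and n / total >= min_share
    }

# Hypothetical purchase history with one out-of-character gift.
history = ["tech books"] * 5 + ["baby toys"] + ["cameras"] * 3
interests = stable_interests(history)
```

Here the single "baby toys" purchase is filtered out while the two recurring categories survive. The trade-off is that a genuinely new interest takes a few events to register, which is one more reason not to rely on a single mechanism.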

The user experience is also a sticky subject because the recommendation results have to ride alongside the main content or be easy to navigate to. The simple fact is that users want recommendations as something extra rather than as the main experience. The challenges that Digg’s recommendation engine is experiencing are representative of what I am talking about: a less than stellar user experience that reflects problems with the UI and, more significantly, with the recommendation results themselves.

After using it for quite some time, like most such ideas, I find it utterly useless. I use Digg in the following way: I check out the front page and the upcoming Technology section for interesting stories. The recommendation engine merely gets in my way, making me go through a couple of extra clicks to get what I want (whenever Digg doesn’t automatically log me in, which is often). The stories that the recommendation engine feeds me seem completely random; standard categorization by topics works way better, and checking only what’s recommended feels like I’m missing out on good stories.

[From So, How’s That Digg Recommendation Engine Been Working For You?]

Personally, I’m a big believer in the value of recommendation engines as a feature which augments a primary user experience and am impressed by the progress we have made on this front. In many ways this is running in parallel to efforts to surface related content because both efforts require building metadata about content that includes key entities, categories, sentiment, and additional taxonomy data that helps narrow the content focus.

It’s also true that there can be too much of a good thing, and users have little patience for a system that returns volumes of links and excerpts that are essentially identical. It’s therefore essential to have a filtering mechanism that attempts to surface just the best content according to quality and popularity filters.

I have long contended that nobody ever really means “I need more content” or more sources. When people say this, it is usually a way of saying “I need better content,” meaning content that is discovered, filtered, and then presented in a manner that helps people find the things that they did not know they did not know. We’re getting there.