TechFuga and News Aggregation

TechFuga launched a new version today. Among the many improvements in the service are better clustering, better search, and an interesting feature called “upcoming news” that attempts to surface news that is not yet popular but is showing early signs of gaining popularity.


I like these services because they efficiently surface interesting news that is domain specific (e.g. tech, politics, sports). In many ways they also represent a future for media, because aggregation is proving to be a bigger lever for publishers than organic traffic growth. You could almost say that aggregation is a perfect complement to search: whereas search surfaces results based on specific keywords, aggregation surfaces results based on domain.

There is a real debate about the legitimacy of purely machine-driven aggregation. Gabe Rivera, whom I consider an authority on this subject, explained this in great detail back in December when he announced that Techmeme would be augmenting its service with human editors.

The problem is context and relevance. Current mainstream aggregators determine these by extracting keywords from unstructured text, building giant dictionaries that help derive context, and then examining link patterns. Semantic search technologies go a step further by building triples that attempt to bring greater context to entities in unstructured text.
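To make the keyword and link-pattern signals a little more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not how any particular aggregator works: the stop-word list, the function names, and the input shapes are all hypothetical, and the triple is simply written as a (subject, predicate, object) tuple.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list; real systems rely on far larger dictionaries.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "with"}

def extract_keywords(text, top_n=5):
    """Return the most frequent non-stop-word tokens in a piece of text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return [word for word, _ in counts.most_common(top_n)]

def inbound_link_counts(links):
    """Count how many distinct sources link to each target.

    `links` is an iterable of (source, target) pairs (an assumed input shape).
    """
    sources_by_target = {}
    for source, target in links:
        sources_by_target.setdefault(target, set()).add(source)
    return {target: len(sources) for target, sources in sources_by_target.items()}

# A semantic triple is just (subject, predicate, object); this one is illustrative.
example_triple = ("Techmeme", "augments service with", "human editors")
```

A story's keywords hint at what it is about, while the inbound link count is a rough proxy for how much attention it is getting. The hard part, entity extraction done well, is exactly what this sketch leaves out.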

The problem with current relevance technologies is threefold. First, entity extraction, done well, is hard, and small errors magnify into gaping mismatches from the user’s perspective. Second, there is the stale news problem, where old news is resubmitted based on perceived relevance. Third, link patterns are prone to surfacing many nearly identical stories because they all trace back to the same source. Link analysis also doesn’t work very well in domains where there isn’t a lot of linking, such as food publications, which explains why aggregation sites tend to focus on technology, sports, celebrity news, and politics/current events.
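The same-source and stale-news problems can be sketched in a few lines of Python. This is a rough illustration only: the `cites` mapping (each story URL mapped to the one URL it credits, or None), the function names, and the two-day freshness window are all assumptions made for the example.

```python
from datetime import timedelta

def root_source(url, cites):
    """Follow citation links until reaching a story that cites nothing known."""
    seen = set()  # guards against citation loops
    while cites.get(url) is not None and url not in seen:
        seen.add(url)
        url = cites[url]
    return url

def group_by_source(cites):
    """Group story URLs by the original story they ultimately trace back to."""
    groups = {}
    for url in cites:
        groups.setdefault(root_source(url, cites), []).append(url)
    return groups

def is_stale(published, now, max_age=timedelta(days=2)):
    """Treat a story as stale once it is older than max_age (an assumed cutoff)."""
    return now - published > max_age
```

An aggregator could then show one entry per group instead of every near-identical repost, and check the original's publication date before letting an old story resurface.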

Semantic technologies offer an appealing future, but make no mistake about it: these technologies are demanding from a development standpoint. Freebase attempts to circumvent the semantic challenges by relying on existing data sets that have already been organized (e.g. Wikipedia) and on a community-based approach that allows for massive data organization by domain without relying solely on machines to solve the semantic problem. But Freebase is a database at its core, one that other services can use to build context; it is not itself a news or content aggregation site.

Despite all the challenges with news aggregation, the fact remains that this is a very logical approach for publishers and users alike. By clustering related content together and presenting it with an appealing user experience that moves content up/down based on popularity and relevancy, we all benefit.