Source discovery and ingestion

Introduction

Twingly maintains a complex system designed to discover, download and index public information. This is an overview of how we discover and download this data.

Source discovery

A big part of the job is finding the sources; we have several systems that work together to find good blogs to ingest. These are some of the major discovery systems.

Blog provider monitoring

Twingly has over a hundred different systems that monitor various blog hosting platforms and communities. These detect newly registered blogs.

All posts that are ingested into Twingly’s systems are scanned for links to other blogs. Once we find a blog link, we immediately set up monitoring for its feed.
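A minimal sketch of this step, assuming Python with the requests and BeautifulSoup libraries (the function names and the feed-detection heuristic are illustrative, not our actual implementation):

    import requests
    from bs4 import BeautifulSoup

    def extract_outbound_links(post_html):
        # Collect every outbound link found in an ingested post.
        soup = BeautifulSoup(post_html, "html.parser")
        return {a["href"] for a in soup.find_all("a", href=True)}

    def discover_feed_url(blog_url):
        # Fetch the candidate blog page and look for an advertised RSS/Atom feed.
        response = requests.get(blog_url, timeout=10)
        soup = BeautifulSoup(response.text, "html.parser")
        link = soup.find("link", rel="alternate",
                         type=lambda t: t in ("application/rss+xml",
                                              "application/atom+xml"))
        return link["href"] if link else None

Any link for which a feed can be found is handed over to the monitoring described under “Source monitoring” below.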

Social media monitoring

Custom software monitors blog references in social media and detects when people link to blogs. This happens in real time and is used both to detect new blogs and as a signal to fetch new content from a known blog.
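In spirit, the routing decision looks like the sketch below (hypothetical names; the real pipeline is considerably more involved):

    def handle_social_media_link(url, known_blogs, scheduler, discovery_queue):
        # A blog link just seen in a social media post.
        if url in known_blogs:
            # Known blog: treat the mention as a signal that new content may be out.
            scheduler.poll_soon(known_blogs[url])
        else:
            # Unknown blog: send it through the discovery pipeline.
            discovery_queue.put(url)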

Ping

Ping is available for people who want us to index their blog. XML-RPC Ping is the machine-to-machine version, used to send a notification when a new post is available.
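With Python’s standard xmlrpc client, a blogging platform could send such a notification roughly like this; the de facto standard method is weblogUpdates.ping, and the endpoint URL below is a placeholder (see our documentation for the actual ping address):

    import xmlrpc.client

    # Placeholder endpoint; use the ping URL from our documentation.
    server = xmlrpc.client.ServerProxy("https://rpc.example.com/")

    # weblogUpdates.ping takes the blog's name and its URL.
    result = server.weblogUpdates.ping("My Blog", "https://myblog.example.com/")
    print(result)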

General web crawling

We have spiders that crawl the general web for blogs. At the moment this is a minor system and generates fewer sources than our other systems.

Customer collaboration

Customers that already have a blog ingestion solution can migrate their sources to us.


Source ingestion

Once we have found a blog and located its feed, we need to download it continuously as new content is posted.

We build on open feeds (RSS and Atom) to get the posts from blogs; these formats allow machines to parse and extract the content reliably. The feeds need to be polled regularly to get the latest posts, but not too often, as we don’t want to waste the blog provider’s resources or our own.
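A simplified polling step, assuming Python and the feedparser library, could look like the sketch below; conditional requests (ETag/Last-Modified) make unchanged feeds cheap for both sides:

    import feedparser

    def poll_feed(feed_url, etag=None, modified=None):
        # Conditional GET: the server replies 304 if the feed hasn't changed.
        feed = feedparser.parse(feed_url, etag=etag, modified=modified)

        if getattr(feed, "status", None) == 304:
            return [], etag, modified  # nothing new since the last poll

        # New or changed feed: return the entries plus updated validators.
        return feed.entries, getattr(feed, "etag", None), getattr(feed, "modified", None)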

After the feed posts are downloaded, we analyze them to detect their language.
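As an illustration, a library such as langdetect can do this in a couple of lines (our production setup differs, but the idea is the same):

    from langdetect import detect

    post_text = "Det här är ett blogginlägg skrivet på svenska."
    print(detect(post_text))  # prints "sv"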

The data is then available either to be streamed through the LiveFeed API or queried through the Search API.
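As a rough illustration, querying the Search API boils down to an authenticated HTTP request; the endpoint and parameter names below are placeholders, so consult the API documentation for the exact interface:

    import requests

    # Placeholder endpoint and parameters; see the Search API documentation.
    response = requests.get(
        "https://api.example.com/blog/search",
        params={"apikey": "YOUR_API_KEY", "q": "your search query"},
        timeout=10,
    )
    response.raise_for_status()
    print(response.text)  # matching blog posts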

During the feed discovery stage, our Recon crawler looks for robots.txt and honors its rules.
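Python’s standard library shows the principle; our Recon crawler is custom-built but follows the same rules (the user agent string here is illustrative):

    from urllib.robotparser import RobotFileParser

    robots = RobotFileParser()
    robots.set_url("https://blog.example.com/robots.txt")
    robots.read()

    # Only fetch the feed if robots.txt allows our user agent to do so.
    if robots.can_fetch("ExampleBot", "https://blog.example.com/feed"):
        print("allowed to fetch the feed")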


Source monitoring

Once we have discovered and ingested a blog, we need to continuously index all new posts from that blog.

At the core, we have a scheduling system that we call “Autoping.” Autoping is, simplified, a list of (millions of) blogs and a per-blog time interval indicating how often we should poll it. The poll interval is updated continuously based on the blog’s activity; a very active blog will have a shorter interval than an inactive one.
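Conceptually, Autoping can be pictured as a priority queue of blogs keyed on their next poll time, where the interval is tightened or relaxed after every poll. A toy sketch of that idea (the interval bounds and the halving/doubling rule are illustrative, not our actual tuning):

    import heapq
    import time

    MIN_INTERVAL = 60          # very active blogs: poll every minute
    MAX_INTERVAL = 24 * 3600   # inactive blogs: poll once a day

    # Each entry is (next_poll_time, interval_in_seconds, feed_url).
    queue = [(time.time(), 3600, "https://blog.example.com/feed")]

    def run_scheduler(poll_feed):
        while queue:
            next_poll, interval, blog = heapq.heappop(queue)
            time.sleep(max(0, next_poll - time.time()))

            new_posts = poll_feed(blog)
            if new_posts:
                interval = max(MIN_INTERVAL, interval // 2)  # active blog: poll more often
            else:
                interval = min(MAX_INTERVAL, interval * 2)   # quiet blog: back off

            heapq.heappush(queue, (time.time() + interval, interval, blog))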

Besides Autoping, our discovery systems also help keep blogs up to date. A blog link captured from Twitter might, for example, trigger a fetch sooner than Autoping would schedule it.

Latency challenges

These are known challenges that may cause varying indexing latency, i.e. the time from post publication to indexing.