Twingly maintains a complex system designed to discover, download and index public information. This is an overview of how we discover and download this data.
A big part of the job is finding the sources. We have several systems that work together to find good blogs to ingest. These are some of the major discovery systems.
Twingly has over a hundred different systems that monitor different blog hosting platforms and communities. These detect newly registered blogs.
All posts that are ingested into Twingly’s systems are scanned for links to other blogs. Once we find a blog link we immediately set up monitoring for its feed.
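The link scan described above can be illustrated with a minimal sketch. It uses only Python's standard library and assumes the post body is available as HTML; the names `LinkCollector` and `external_hosts` are hypothetical and not taken from Twingly's systems. Deciding whether a linked host is actually a blog, and locating its feed, would be separate steps that are not shown.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in an HTML fragment."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the post's own URL.
                self.links.append(urljoin(self.base_url, value))


def external_hosts(post_url, post_html):
    """Return the hosts linked from a post, excluding the post's own host."""
    collector = LinkCollector(post_url)
    collector.feed(post_html)
    own_host = urlparse(post_url).netloc
    hosts = set()
    for link in collector.links:
        parsed = urlparse(link)
        # Keep only web links that point away from the current blog.
        if parsed.scheme in ("http", "https") and parsed.netloc not in ("", own_host):
            hosts.add(parsed.netloc)
    return hosts
```

Each returned host would then be a candidate for feed discovery and monitoring.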
We run custom software that monitors blog references in social media and detects when people link to blogs. This is done in real time and is used both to detect new blogs and as a signal to fetch new content from a known blog.
We have spiders that crawl the general web for blogs. At the moment this is a minor system and generates fewer sources than our other systems.
Our customers have the ability to migrate their sources to us if they have an existing blog ingestion solution.
Once we have found a blog and located its feed we need to continuously download it when new content is posted.
We build upon open feeds (RSS and Atom) to get the posts from the blogs. These formats allow machines to parse and extract the content in a reliable way. The feeds need to be polled regularly to get the latest posts, but not too often, as we don’t want to waste the blog provider’s resources or our own.
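To show why feeds are convenient for machines, here is a minimal sketch that extracts post titles and links from an RSS 2.0 or Atom document using Python's standard library. The function name `parse_feed` is hypothetical, and the sketch ignores many real-world details (malformed XML, other feed versions, dates, content bodies) that a production parser would have to handle.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def parse_feed(xml_text):
    """Return a list of (title, link) tuples from an RSS 2.0 or Atom document."""
    root = ET.fromstring(xml_text)
    posts = []
    if root.tag == "rss":
        # RSS 2.0: posts are <item> elements with <title> and <link> children.
        for item in root.iter("item"):
            title = item.findtext("title", default="")
            link = item.findtext("link", default="")
            posts.append((title, link))
    elif root.tag == ATOM_NS + "feed":
        # Atom: posts are <entry> elements; the post URL is a <link href="...">.
        for entry in root.iter(ATOM_NS + "entry"):
            href = ""
            for link in entry.findall(ATOM_NS + "link"):
                # A missing rel attribute means "alternate", i.e. the post itself.
                if link.get("rel", "alternate") == "alternate":
                    href = link.get("href", "")
                    break
            title = entry.findtext(ATOM_NS + "title", default="")
            posts.append((title, href))
    return posts
```

Comparing the returned links with those already seen for the same blog is one simple way to tell whether a poll found new posts.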
After the feed posts are downloaded we analyze them to detect the language.
Once we have discovered and ingested a blog, we need to continuously index all new posts from that blog.
At the core, we have a scheduling system that we call “Autoping.” Simplified, Autoping is a list of millions of blogs, each with a time interval indicating how often we should poll it. The poll interval is updated continuously based on the blog’s activity: a very active blog will have a shorter interval than an inactive one.
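The idea of a per-blog interval that adapts to activity can be sketched as follows. This is an illustration only, not Autoping's actual algorithm: the halve-or-grow rule, the bounds `MIN_INTERVAL` and `MAX_INTERVAL`, and the names `next_interval` and `PollSchedule` are all assumptions made for the example.

```python
import heapq

MIN_INTERVAL = 60         # seconds; assumed lower bound
MAX_INTERVAL = 24 * 3600  # seconds; assumed upper bound


def next_interval(current, found_new_posts):
    """Halve the interval after a productive poll, otherwise grow it by 50%."""
    proposed = current / 2 if found_new_posts else current * 1.5
    return max(MIN_INTERVAL, min(MAX_INTERVAL, proposed))


class PollSchedule:
    """Min-heap of (due_time, blog_url) plus one poll interval per blog."""

    def __init__(self):
        self._heap = []
        self._interval = {}

    def add(self, blog_url, now, interval=3600):
        """Register a blog and schedule its first poll one interval from now."""
        self._interval[blog_url] = interval
        heapq.heappush(self._heap, (now + interval, blog_url))

    def pop_due(self, now):
        """Remove and return the blogs whose due time has passed, earliest first."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            due.append(heapq.heappop(self._heap)[1])
        return due

    def report(self, blog_url, now, found_new_posts):
        """Record a poll result and reschedule the blog with its updated interval."""
        interval = next_interval(self._interval[blog_url], found_new_posts)
        self._interval[blog_url] = interval
        heapq.heappush(self._heap, (now + interval, blog_url))
        return interval
```

A worker loop would call `pop_due` with the current time, poll each returned blog, and call `report` with the outcome. Active blogs drift toward the lower bound and quiet blogs toward the upper bound.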
Besides Autoping, our discovery systems also help to keep the blogs up to date. For example, a link to a blog on Twitter might be captured, and the blog scheduled for a fetch, before Autoping's next poll of that blog is due.
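One way to picture this interplay is a signal handler that pulls a blog's next poll forward when a link to it is seen elsewhere. The sketch below is a hypothetical illustration: it assumes the schedule is a plain dictionary from feed URL to next due time (in seconds), and `apply_signal` is not a name from Twingly's systems.

```python
def apply_signal(due_times, blog_url, now):
    """Pull a blog's next poll forward to `now` when an external signal arrives.

    Returns True if the blog was already known, False if it was newly added.
    """
    known = blog_url in due_times
    # Never push a poll later: keep the earlier of the scheduled time and now.
    due_times[blog_url] = min(due_times.get(blog_url, now), now)
    return known
```

A blog that is already overdue keeps its earlier due time, while an unknown blog is added to the schedule and becomes due immediately.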
These are known challenges that may cause varying indexing latency, i.e. the time from when a post is published until it is indexed.