Language detection changes

Historically, when trying to identify the language of a given blog post, we have only looked at the post’s raw text, i.e. the post’s body text, stripped of HTML. In general this works very well, provided that the body text is “long enough”.

However, we have noticed that some bloggers tend to write very little in the actual blog post. For example, the post’s body text may consist of only a single word and a bunch of images. Against the odds, we still attempted to identify the language of such posts, with varying results.

To somewhat increase the accuracy of our language identification we have, effective yesterday, decided to include the post’s title when identifying the language, provided that the title is not the same as the body text. Naturally, if the title is short or non-existent, this will likely not improve the situation at all, but when the title is at least a few words long we should expect to see more reliable results.
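The rule above can be illustrated with a small Python sketch. It is not the actual implementation: the names `strip_html` and `text_for_detection` are hypothetical, and comparing title and body case-insensitively after collapsing whitespace is an assumption about what “the same” means. The sketch only builds the text that would be handed to a language identifier; the identifier itself is left out.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(html):
    """Return the text content of an HTML fragment, whitespace collapsed."""
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    # Join text nodes with spaces so adjacent paragraphs do not run together.
    return " ".join(" ".join(extractor.parts).split())


def text_for_detection(title, body_html):
    """Build the text to identify: title plus body, unless they are the same.

    Assumption: "the same" is checked case-insensitively after whitespace
    has been collapsed.
    """
    body = strip_html(body_html)
    title = " ".join((title or "").split())
    if title and title.casefold() != body.casefold():
        return f"{title}\n{body}".strip()
    return body
```

For a post whose body is a single word and an image, `text_for_detection("Min sommar i bilder", "<p>Wow</p><img src='a.jpg'>")` yields the title followed by the body, which gives the identifier more to work with. When the title merely repeats the body, only the body is used, so the same text is not counted twice.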

The language improvements apply to the Twingly Search API, the Twingly LiveFeed API and our public search.

Nothing is for free, though. Since the change, we have noticed that some Tumblr blogs use a certain title [1] and omit the body text, leaving us with only the title for identification (in our old algorithm these posts would not have been identified at all! [2]). This has caused an interesting eight-fold increase in the number of posts identified as Vietnamese. We hope to address this peculiarity promptly.

Oh, and as a bonus, as of today we have started to identify Chinese (zh) posts. We expect this to improve the quality across all languages, as Chinese posts may previously have been misidentified as other languages.


  • We now include the blog post’s title, rather than just the body text, when identifying its language. This should improve the quality of the language field.
  • We can now detect Chinese posts. This should also improve the quality for other languages.

As always, please contact us if you have any questions or concerns.

  1. Hint 

  2. In some cases we would fall back on our best guess for the blog in general 

Twingly Blog Box 3.1.0

Twingly Blog Box 3.1.0 has been released. This version of the Blog Box introduces full support for sites using HTTPS. As always, check out the documentation for more info about the Blog Box.

Twingly Blog Box 3.0.0

Twingly Blog Box 3.0.0 has been released. The file size has been reduced by over 30%, and all assets and data are now always loaded over HTTPS. Check out the Twingly Blog Box changelog to see all the changes that have been made.

The reason behind the big version jump is that Twingly Blog Box now uses Semantic Versioning.

Twingly Blog Box 2.0.7

Version 2.0.7 of our Twingly Blog Box has been released. Check out the Twingly Blog Box changelog to see what changes have been made.

Twingly Blog Box Search documentation

We have added the Blog Box Search product to the Twingly Blog Box documentation. Enjoy!