HN Time Machine: finally some Hacker News history!

May 12, 2020 / 5 min read

  • Philipp Eisen, Data Scientist
  • Christoffer Åhrling, Software Developer

Think of the number of times you have read a comment on Hacker News along the lines of "we did this very similarly x years ago". It happens quite a lot, right? That's what sparked the idea for this side project. So we put together an app that pairs today's Hacker News stories with the most semantically similar stories from 2006 to 2015, using semantic similarity search. You can find it here if you want to have a look. It is built from a few components, including hnswlib, DistilSentenceBert, and Vue.js' HackerNews clone.

Hacker News puts stories into the context of developers' opinions

For those of us working in tech, Hacker News is without a doubt one of our go-to news sources. It helps us keep up with what is currently hot in tech and, at the same time, puts the articles into the context of opinions from other people working in the field.

What if we add some temporal context?

People are often very keen to point out that something has been done before, so we thought it was finally time to give the people of Hacker News what they want. We built the HN Time Machine to present each current story together with the three stories from 2006 to 2015 that rank highest in semantic similarity.

Using semantic similarity ranking to retrieve the most similar past stories

So, why did we focus on semantic similarity? Well, in more traditional search and ranking approaches, it is common to represent text as a sparse vector that assigns each dimension to one word in a vocabulary. The value in a word's dimension is then determined by how often that word appears in the text that the vector encodes. This way of encoding a piece of text does not take the order in which words appear into account. In addition, two pieces of text that say the same thing but use different words will be represented by vectors that are very different from each other. This is especially problematic for short texts such as sentences and small paragraphs - or Hacker News story titles.
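As a toy illustration of that failure mode (not part of the app's code), here is a minimal bag-of-words encoding in Python: two titles that mean roughly the same thing but share no words end up with orthogonal vectors.

```python
# A minimal bag-of-words sketch (illustrative only, not the app's code).
# Two titles with the same meaning but no shared words end up orthogonal.
from collections import Counter

def bow_vector(text, vocabulary):
    """Sparse count vector: one dimension per word in the vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

a = "Show HN: a fast key-value store"
b = "Launching my quick KV database"

vocab = sorted(set(a.lower().split()) | set(b.lower().split()))
va, vb = bow_vector(a, vocab), bow_vector(b, vocab)

# Dot product is 0: this encoding sees no similarity at all.
print(sum(x * y for x, y in zip(va, vb)))  # -> 0
```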

This is where semantic sentence encoding comes into play. A semantic sentence encoder takes a sentence as input and represents it as a dense vector in a semantically meaningful way. This means that two sentences that are semantically similar will be represented by vectors that are close to each other according to some metric - even if they don't use the same words. More recent encoding approaches also take word order into account and can disambiguate, for example, a river bank from a bank that is a financial institution.
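For contrast with the bag-of-words toy above, here is a minimal sketch of dense encoding with the sentence-transformers library. The model name is an assumption; any SBERT-style model behaves the same way.

```python
# Sketch of dense semantic encoding with sentence-transformers.
# The model name is an assumption; any SBERT-style model works the same way.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("distilbert-base-nli-stsb-mean-tokens")

titles = [
    "Show HN: a fast key-value store",
    "Launching my quick KV database",
]
emb = model.encode(titles)  # shape: (2, embedding_dim)

# Cosine similarity is high even though the titles share no words.
cos = np.dot(emb[0], emb[1]) / (np.linalg.norm(emb[0]) * np.linalg.norm(emb[1]))
print(round(float(cos), 2))
```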

This app uses DistilSentenceBert [1] to encode sentences. The encoded vectors are then stored in hnswlib - a fast approximate nearest neighbor search engine [2]. When the page loads, the title of each story on the current page is encoded as a vector, which is then used to query the index. The stories corresponding to the three most similar vectors are shown as similar stories under each story.
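Below is a rough sketch of how that retrieval step can be wired up with the hnswlib Python bindings. The parameter values and the random stand-in vectors are illustrative assumptions, not the app's actual configuration.

```python
# Sketch of the retrieval step with hnswlib (parameters are illustrative,
# and random vectors stand in for the real encoded titles).
import hnswlib
import numpy as np

dim = 768             # embedding size of the DistilBERT-based encoder
num_stories = 10_000  # the real index holds ~1.46M story titles

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num_stories, ef_construction=200, M=16)

# Offline: encode every 2006-2015 title and add it under its story id.
vectors = np.random.rand(num_stories, dim).astype(np.float32)
index.add_items(vectors, ids=np.arange(num_stories))

index.set_ef(50)  # query-time recall/speed trade-off

# At page load: encode the current title and fetch the 3 nearest past stories.
query = np.random.rand(dim).astype(np.float32)
labels, distances = index.knn_query(query, k=3)
print(labels[0])  # story ids of the 3 most similar past titles
```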

The indexed HN stories come from the publicly available BigQuery dataset

You have probably noticed by now that we mentioned "stories between 2006 and 2015" twice - two seemingly arbitrary years. The reason the app only retrieves stories between those dates is purely practical.

We are using the publicly available hacker_news dataset on BigQuery (bigquery-public-data.hacker_news).

This dataset includes stories between 2006-10-09 18:21:51 UTC and 2015-10-13 08:44:34 UTC. In total, there are 1,459,558 stories that have an id and are neither dead nor deleted.
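For reference, a query along these lines reproduces that selection. The table and column names assume the legacy schema of the public dataset, so treat this as a sketch rather than the exact query we ran.

```python
# Sketch of pulling the indexed stories from BigQuery. Table and column
# names assume the legacy public schema; adjust if it has changed.
from google.cloud import bigquery

client = bigquery.Client()

query = """
    SELECT id, title, time_ts
    FROM `bigquery-public-data.hacker_news.stories`
    WHERE id IS NOT NULL
      AND (dead IS NULL OR dead = false)
      AND (deleted IS NULL OR deleted = false)
"""
rows = list(client.query(query).result())
print(len(rows))  # should be ~1,459,558 stories
```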

It would absolutely be possible to keep an index of all past and new stories. But that would require some more work and would go beyond the scope of what this side project sets out to be (for now…).

By the way, you can import datasets from BigQuery on the Peltarion platform to train a model on your data.

Limitations

You will notice that the results are sometimes a bit weird or unexpected. Part of the reason is that the semantic encoder was trained on data that is somewhat different from Hacker News titles, and the definition of what counts as similar can vary greatly between domains. This could be combated by fine-tuning the encoder model to better fit the definition of similarity in this context. To make such a model possible in the future, we added a ★★★★★ rating next to each result; your ratings could help produce better results down the line.
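As a hypothetical sketch of what that fine-tuning could look like (the app does not do this today), star ratings could be mapped to similarity labels for sentence-transformers' standard training loop:

```python
# Hypothetical fine-tuning sketch: user star ratings (1-5) become
# similarity labels in [0, 1] for a standard sentence-transformers loop.
from sentence_transformers import InputExample, SentenceTransformer, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("distilbert-base-nli-stsb-mean-tokens")

# (current title, suggested past title, star rating) - made-up examples.
rated = [
    ("Show HN: a fast key-value store", "Launching my quick KV database", 5),
    ("Show HN: a fast key-value store", "Ask HN: favorite keyboard?", 1),
]
train_examples = [
    InputExample(texts=[a, b], label=(stars - 1) / 4.0) for a, b, stars in rated
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```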

As a last word, this demo is mainly meant to explore the potential and limitations of semantic similarity ranking; it by no means claims to be the best approach for this use case. Even so, we hope you enjoy reading Hacker News through this new historical lens on what has been written about a topic before. Hopefully it can also help us better understand how certain topics have evolved over the past decade and a half!

Citations

[1] Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.

[2] Malkov, Y. A., & Yashunin, D. A. Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence. Preprint: https://arxiv.org/abs/1603.09320

  • Philipp Eisen, Data Scientist

    Philipp is an AI Research Engineer with an M.Sc. in Data Science from KTH Royal Institute of Technology. He has a strong passion for deep learning, NLP, and big data techniques. At Peltarion, Philipp has mainly focused on helping companies build and improve products with the latest NLP techniques, including one project designed to fundamentally change the way people create and interact with market research. Previously, Philipp worked at King, using machine learning to simulate players in Candy Crush franchise games to help release better content to players.

  • Christoffer Åhrling, Software Developer

    Christoffer Åhrling is a software developer at Peltarion with over eight years of experience in frontend development and technical consulting. He is particularly passionate about web performance, code quality, and solving complex problems with clean, maintainable code. At Peltarion he has worked on several different projects, from enhancing existing functionality to adding new features.