Search is a feature we need to work offline, since it is an important way to find articles. Before designing the search engine, several key constraints were considered:
To address these issues, the following approach is taken:
We do not provide (yet!):
The index chosen is an inverted index (a reverse hashtable): every word is mapped to the list of documents it occurs in, together with a score for that word in each document it appears in. The higher the score, the more important that word is to the document.
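A minimal sketch of such an inverted index, assuming it is stored as a map from word to (document id, score) pairs (the names here are hypothetical, not the actual implementation):

```python
from collections import defaultdict

# Inverted index: each word maps to a list of (document id, score) pairs.
index = defaultdict(list)

def add_entry(word, doc_id, score):
    """Record that `word` appears in document `doc_id` with the given score."""
    index[word].append((doc_id, score))

add_entry("bookmarks", 1029, 2.3)
add_entry("bookmarks", 1000, 1.5)
print(index["bookmarks"])  # [(1029, 2.3), (1000, 1.5)]
```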
The score is computed with TF-IDF, an algorithm that scores the importance of each word in an article relative to a corpus of many articles; it effectively extracts the most important words in any article. In our case, the scores for terms in the title are multiplied by 1.2, weighting the title more heavily than the summary.
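The scoring can be sketched as follows. This is a standard TF-IDF formulation with the 1.2 title boost from the text; the exact TF and IDF variants used, and all names here, are assumptions for illustration:

```python
import math

TITLE_BOOST = 1.2  # title terms are weighted 1.2x relative to the summary

def tf_idf(term, doc_terms, corpus):
    """Score one term in one document (a list of terms) against a corpus."""
    tf = doc_terms.count(term) / len(doc_terms)
    docs_with_term = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / (1 + docs_with_term))
    return tf * idf

# Hypothetical corpus and document, each represented as lists of terms.
corpus = [["firefox", "bookmarks"], ["firefox", "sync"], ["privacy", "settings"]]
doc = {"title": ["firefox", "bookmarks"], "summary": ["manage", "your", "bookmarks"]}

# Combined score for "bookmarks": boosted title score plus summary score.
score = (TITLE_BOOST * tf_idf("bookmarks", doc["title"], corpus)
         + tf_idf("bookmarks", doc["summary"], corpus))
```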
For each search term, we look up in the index the list of documents and scores the term is associated with. We then add up the scores for each article and sort the results: the document with the highest total score is displayed at the top, and the document with the lowest at the bottom.
As an example, if we have the following index:
{ "bookmarks": [1029, 2.3, 1000, 1.5], "firefox": [1000, 0.9, 1010, 0.7, 1111, 0.8] }
and if we searched for the term “firefox bookmarks”, the following is computed:
[ [1000, 2.4], // 1.5 + 0.9 from bookmarks and firefox
  [1029, 2.3],
  [1111, 0.8],
  [1010, 0.7] ]
These documents will be displayed in that order.
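The lookup-sum-sort procedure above can be sketched in Python. The index here is the example from the text, restated as (document id, score) pairs for readability; the function names are hypothetical:

```python
from collections import defaultdict

# The example index, with each word mapped to (document id, score) pairs.
index = {
    "bookmarks": [(1029, 2.3), (1000, 1.5)],
    "firefox": [(1000, 0.9), (1010, 0.7), (1111, 0.8)],
}

def search(query):
    """Sum the per-term scores for each document, then sort highest first."""
    totals = defaultdict(float)
    for term in query.split():
        for doc_id, score in index.get(term, []):
            totals[doc_id] += score
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

print(search("firefox bookmarks"))
# [(1000, 2.4), (1029, 2.3), (1111, 0.8), (1010, 0.7)]
```

Document 1000 comes out on top because it accumulates scores from both terms (1.5 + 0.9 = 2.4), matching the worked example above.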