How to Build a Search Engine from Scratch in Python — Part 1

What is a search engine?

Searching for talks

Set up

  • Python 3.5.0
  • Django 2.0.1
  • Scikit-Learn 0.18.1
  • Numpy 1.13.3
  • NLTK 3.2.5

pip install -r requirements.txt

Database

  • Transcript: the whole transcript of each talk
  • URL: the video URL corresponding to the talk transcript

Reading data

Feature Extraction

Giving weights to words

document_1 = "I love watching movies when it's cold outside"
document_2 = "Toy Story is the best animation movie ever, I love it!"
document_3 = "Watching horror movies alone at night is really scary"
document_4 = "He loves to watch films filled with suspense and unexpected plot twists"
document_5 = "My mom loves to watch movies. My dad hates movie theaters. My brothers like any kind of movie. And I haven't watched a single movie since I got into college"
documents = [document_1, document_2, document_3, document_4, document_5]

Preprocessing

documents = [document.split(" ") for document in documents]
document_1 = ['I', 'love', 'watching', 'movies', 'when', "it's", 'cold', 'outside']
document_2 = ['Toy', 'Story', 'is', 'the', 'best', 'animation', 'movie', 'ever,', 'I', 'love', 'it!']
document_3 = ['Watching', 'horror', 'movies', 'alone', 'at', 'night', 'is', 'really', 'scary']
document_4 = ['He', 'loves', 'to', 'watch', 'films', 'filled', 'with', 'suspense', 'and', 'unexpected', 'plot', 'twists']
document_5 = ['My', 'mom', 'loves', 'to', 'watch', 'movies.', 'My', 'dad', 'hates', 'movie', 'theaters.', 'My', 'brothers', 'like', 'any', 'kind', 'of', 'movie.', 'And', 'I', "haven't", 'watched', 'a', 'single', 'movie', 'since', 'I', 'got', 'into', 'college']
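
Notice that splitting on spaces leaves punctuation glued to some tokens ('ever,', 'it!', 'movies.') and keeps capitalization, so 'Watching' and 'watching' would count as two different words. Here is a minimal sketch of one possible cleanup, assuming that lowercasing plus a simple regular expression is good enough for this toy corpus. The tokenize helper is a hypothetical name used only for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and return runs of letters/apostrophes as tokens.

    Hypothetical helper: a deliberately simple rule, not a full tokenizer.
    """
    return re.findall(r"[a-z']+", text.lower())


print(tokenize("Toy Story is the best animation movie ever, I love it!"))
# ['toy', 'story', 'is', 'the', 'best', 'animation', 'movie', 'ever', 'i', 'love', 'it']
```

Keeping the apostrophe inside the pattern means contractions such as "it's" and "haven't" survive as single tokens.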

Stemming and Lemmatization

Normalized Term Frequency
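
A minimal sketch of normalized term frequency, assuming one common definition: the number of times a term appears in a document divided by the total number of terms in that document. The function name term_frequency and the sample tokens are illustrative assumptions.

```python
from collections import Counter


def term_frequency(tokens):
    """Return {term: count / total_terms} for one tokenized document."""
    counts = Counter(tokens)
    total = len(tokens)
    # An empty document yields an empty dict, so no division by zero occurs.
    return {term: count / total for term, count in counts.items()}


tokens = ['my', 'mom', 'loves', 'movies', 'my', 'dad']
tf = term_frequency(tokens)
print(tf['my'])   # 2 occurrences out of 6 tokens
```

Dividing by the document length keeps long documents from dominating just because they contain more words.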

Inverse Document Frequency
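
A minimal sketch of inverse document frequency, assuming the plain variant idf(t) = ln(N / df(t)), where N is the number of documents and df(t) is the number of documents containing the term t. Other variants add smoothing terms, so treat this formula and the name inverse_document_frequency as assumptions for illustration.

```python
import math


def inverse_document_frequency(tokenized_docs):
    """Return {term: ln(N / df)} over a list of tokenized documents."""
    n_docs = len(tokenized_docs)
    doc_freq = {}
    for tokens in tokenized_docs:
        # Use a set so a term counts once per document, however often it repeats.
        for term in set(tokens):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


docs = [['movie', 'love'], ['movie', 'toy'], ['horror']]
idf = inverse_document_frequency(docs)
print(idf['movie'], idf['horror'])
```

A term that appears in every document gets a weight of zero with this variant, which is the point: it tells us nothing about which document is relevant.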

All in one
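
A minimal, self-contained sketch that combines both weights into TF-IDF vectors over a shared vocabulary. It assumes the simple definitions tf = count / document length and idf = ln(N / df); the function name tf_idf_vectors is hypothetical, and a library such as Scikit-Learn may use different smoothing and normalization.

```python
import math
from collections import Counter


def tf_idf_vectors(tokenized_docs):
    """Return (vocabulary, vectors), one TF-IDF vector per document."""
    n_docs = len(tokenized_docs)
    vocabulary = sorted({term for tokens in tokenized_docs for term in tokens})

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized_docs for term in set(tokens))
    idf = {term: math.log(n_docs / doc_freq[term]) for term in vocabulary}

    vectors = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        vectors.append([counts[term] / total * idf[term] for term in vocabulary])
    return vocabulary, vectors


docs = [['movie', 'love'], ['movie', 'toy']]
vocabulary, vectors = tf_idf_vectors(docs)
print(vocabulary)
print(vectors)
```

Every vector has the same length and the same term order, which is what makes the documents comparable later on.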

Cosine Similarity

Representation of documents as vectors (taken from https://goo.gl/ppES7b)
Example of cosine similarity measures (taken from https://goo.gl/ppES7b)
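
A minimal sketch of cosine similarity between two vectors of equal length, using the standard definition: the dot product divided by the product of the vector norms. Returning 0.0 for a zero vector is an assumption made here to avoid dividing by zero, and the function name cosine_similarity is illustrative.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: a zero vector is similar to nothing
    return dot / (norm_a * norm_b)


query = [1.0, 1.0, 0.0]
document = [1.0, 0.0, 0.0]
print(cosine_similarity(query, document))
```

Because the result depends only on the angle between the vectors, a short query and a long document can still score as very similar.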

It’s not over yet!

Thanks to Hugo Dzin

