BigSnarf blog

Infosec FTW

Building your own search engine in Python

  • Learn core concepts of search
  • Learn associated terminology
  • Understand that this is document-based search, not an RDBMS
  • The inverted index is what gets searched; it links each term back to the documents that contain it
  • Python code – inverted index class
  • Technique for stemming words
  • Understand n-grams
  • Understand tokenizers and n-gram processing
  • Understand fields
  • Understand the document handler
  • Search Engine
  • Concept of Sharding
  • Concept of Faceting
  • Concept of Boost
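
To make the inverted index idea from the list above concrete, here is a minimal sketch, not the class from the talk. The class name InvertedIndex, its method names, the regex tokenizer, and the sample documents are all assumptions made for illustration. It maps each term to the set of document ids containing it and answers queries with AND semantics.

```python
import re
from collections import defaultdict


class InvertedIndex:
    """Toy inverted index: maps each term to the ids of documents containing it.

    Hypothetical illustration only; names and behavior are assumptions.
    """

    def __init__(self):
        self.postings = defaultdict(set)   # term -> set of document ids
        self.documents = {}                # document id -> original text

    @staticmethod
    def tokenize(text):
        # Lowercase, then keep runs of ASCII letters and digits as tokens.
        return re.findall(r"[a-z0-9]+", text.lower())

    def add(self, doc_id, text):
        # Store the document and record its id under every term it contains.
        self.documents[doc_id] = text
        for term in self.tokenize(text):
            self.postings[term].add(doc_id)

    def search(self, query):
        # AND semantics: a document must contain every query term.
        terms = self.tokenize(query)
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result


index = InvertedIndex()
index.add(1, "Building a search engine in Python")
index.add(2, "Python for infosec log analysis")
index.add(3, "Search engine sharding and faceting")

print(index.search("python"))         # documents 1 and 2
print(index.search("search engine"))  # documents 1 and 3
print(index.search("python search"))  # document 1 only
```

The lookup touches only the posting sets for the query terms, which is the main difference from scanning every document or row for a match.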

#############################################

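min_gram = 3
max_gram = 6
terms = {}

# tokens is the list of words produced by the tokenizer, for example:
tokens = ["building", "search", "engine"]

for position, token in enumerate(tokens):
  # Edge n-grams: take prefixes of each token from min_gram up to max_gram
  # characters, never longer than the token itself.
  for window_length in range(min_gram, min(max_gram, len(token)) + 1):
    gram = token[:window_length]
    # Record every token position at which this gram occurs.
    terms.setdefault(gram, set())
    terms[gram].add(position)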
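
The same loop can be wrapped in functions so its output is easy to inspect. This is a sketch under assumptions: the helper names edge_ngrams and build_terms and the sample sentence are hypothetical and do not come from the original post. Because each gram is taken from the start of the token, these are prefix ("edge") n-grams, which suits search-as-you-type matching.

```python
def edge_ngrams(token, min_gram=3, max_gram=6):
    """Return the prefixes of token with length between min_gram and max_gram."""
    longest = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, longest + 1)]


def build_terms(tokens, min_gram=3, max_gram=6):
    """Map each prefix gram to the set of token positions it came from."""
    terms = {}
    for position, token in enumerate(tokens):
        for gram in edge_ngrams(token, min_gram, max_gram):
            terms.setdefault(gram, set()).add(position)
    return terms


# Hypothetical sample input; a real tokenizer would supply this list.
tokens = "searching search engines in python".split()
terms = build_terms(tokens)

print(edge_ngrams("python"))      # ['pyt', 'pyth', 'pytho', 'python']
print(sorted(terms["sea"]))       # [0, 1]
print(sorted(terms["search"]))    # [0, 1]
print("in" in terms)              # False
```

Tokens shorter than min_gram, such as "in", produce no grams at all, so they cannot be found through this index.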

Update link: http://dr-josiah.blogspot.ca/2010/07/building-search-engine-using-redis-and.html

http://vimeo.com/61292741
