In the majority of cases, farming out your search needs to a company like Google is a good idea. Simply using the `site:[url]` operator (for example, `site:example.com privacy policy`) is sufficient for most sites.
You get a range of benefits from outsourcing search to an external entity:
- they do all the heavy lifting
- lots of time and money is invested in ranking and matching your searches
- people are familiar with the search results format
- social rankings are taken into account for content where it's applicable
- there's an opportunity for user tailored search results
- there's likely a much finer granularity in what terms are searched and matched - for example using advanced stemming and synonyms
That said, there are certain cases when farming out your search to an external provider is perhaps not what you want. You may require that only a subset of the data on your website be searchable; you may have short-lifetime data that you want to make searchable; you may want a more complex search mechanism; or you might need to rank search results in a unique, pre-defined manner.
Having access to our own site search opens up some niceties that are lost to us whilst farming the job out:
- we can show a mixture of content types, each with an appropriate listing entry; for example dated elements can show how fresh they are whereas static pages have no need to
- we can hide content that we don't want the user to be able to see without the need to set up and maintain a `robots.txt` file
- we can show hidden content that wouldn't be crawled by a search engine; perhaps showing user-specific content or site elements which require authentication to view
- we can promote content as we see fit, elevating certain results in our searches based on our own criteria
- there's a possibility, if you're so inclined, to roll advertisements out into your search results (please don't do this)
- you can guarantee that there are no adverts shown in your results (please do, do this)
Searching and matching content on your own website is a somewhat trivial problem to solve. Query for the content from your underlying database; parse any static files held on the server; pull all this content into a data structure, then loop over it and check whether the search terms exist in the content. Spit out a list of matching elements at the end.
You don't have to worry about malicious content, or erroneous elements of the site matching the wrong pages; you specifically control which elements your search finds and how they match up to each page. You create a list of matching pages on your site and display them to the user exactly the way you want to. Simple.
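As a minimal sketch of that loop - assuming the content has already been pulled into a list of dicts with a `text` field (the names here are illustrative, not a prescribed schema):

```python
def naive_search(terms, documents):
    """Return every document whose text contains all of the search terms."""
    lowered = [t.lower() for t in terms]
    return [
        doc for doc in documents
        if all(term in doc["text"].lower() for term in lowered)
    ]
```

Calling `naive_search(["site", "search"], pages)` would hand back the matching subset, unranked and in whatever order the pages were collected.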
The complexity starts to appear when you begin ranking content in your search results. The simplest, most rudimentary approach is not to rank at all. This might be an option if you have relatively little content to search, or you're quite restrictive in how you match content. As soon as you start dealing with a lot of content, or more open-ended matching criteria, you ultimately end up with too many results to display unranked.
Taking this to the next step, you can rank on how many times your query is matched in your data. This of course skews results towards longer pieces of content, which naturally contain more term occurrences; a problem you'd solve by normalising for document length. All of this is perhaps a bit too complex for our site search. We don't have issues with content stuffing or keyword padding - we certainly shouldn't, unless you have a terrible content strategy. Counting term matches and length-normalising would suit well enough for a simple textual search, but it gives no weight to what the elements we're searching are actually about and how well they truly represent the search terms.
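For illustration, a term-frequency score with a simple length normalisation - dividing raw counts by the document's word count is one basic mechanism:

```python
def term_frequency_score(terms, text):
    """Raw term counts, normalised by the document's length in words."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(words.count(t.lower()) for t in terms) / len(words)
```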
We actually have a lot of hidden information rolled into our data structures. The title of a news story, if written properly, should be indicative of the content; shouldn't that mean that if we have a match in the title it's more likely to be relevant than if we only match in the story text? Finding a match in certain fields of a data structure should have more weight than in others.
In order to rank our search results we need to assign a weight to each of the matches we find. I'm proposing a cumulative score, calculated from a series of metadata associated with each search result.
Page Type
The first metric to rank on is perhaps the type of page you've found in your search results. Certain page types will be more prominent in your results, depending on your user base and the use case behind the searches they're making.
For a typical site I'd propose a page type ranking similar to:
- User / account pages: 2
- Normal pages: 1
- News / Blogs: 1½
- Meta pages: ¾
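In code, that table could be a plain lookup; the keys here are hypothetical page-type names, not a prescribed taxonomy:

```python
PAGE_TYPE_WEIGHTS = {
    "user": 2.0,   # user / account pages
    "page": 1.0,   # normal pages
    "news": 1.5,   # news / blogs
    "meta": 0.75,  # meta pages
}
```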
Matching Fields
As previously mentioned, we have a built-in weighting system in terms of how we structure our data. Entry titles should hold more weight; metadata probably less than average.
Giving us something along the lines of:
- Title fields: 2½
- Body / content fields: 1
- Meta fields: ¾
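And likewise for the fields (again, the names are just placeholders):

```python
FIELD_WEIGHTS = {
    "title": 2.5,
    "body": 1.0,   # body / content fields
    "meta": 0.75,
}
```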
Freshness
Dependent upon the type of website we're searching, the freshness of content can play a large role. For large community-driven websites, newer content can be the most relevant content. There's a decision to make as to how you match freshness bands to your user base.
For a basic site with a blog or news section I'd propose something like:
- Today: 5
- Yesterday: 4
- This week: 2
- This month: 1
- Otherwise: ¾
This allows new content to bubble to the top and stay there for a week or so, after which it will be ranked like all other content, eventually losing a portion of its weight as it becomes less relevant to current searches.
Again, the bands you break these entries into and the weights given to each will depend very much on your users.
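Taking the bands above at face value, a sketch might look like this - note that "this month" is read here as the last 31 days, which is an assumption:

```python
from datetime import date
from typing import Optional

def freshness_weight(published: date, today: Optional[date] = None) -> float:
    """Map a publication date onto the freshness bands above."""
    days = ((today or date.today()) - published).days
    if days <= 0:
        return 5.0    # today
    if days == 1:
        return 4.0    # yesterday
    if days <= 7:
        return 2.0    # this week
    if days <= 31:
        return 1.0    # this month
    return 0.75       # everything older
```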
Other elements
There are other ways of promoting content. If our matching mechanism is very loose and doesn't require that every field is matched to show a result, then we may want to count term matches and factor those in. Similarly, we may want to greatly boost the weight of elements where the search text is matched exactly and contains more than one search term.
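One hedged way to implement that exact-phrase boost, borrowing the value of 10 from the worked example further down:

```python
# Assumed boost value; taken from the worked examples below.
EXACT_PHRASE_BOOST = 10.0

def exact_match_weight(query: str, text: str) -> float:
    """Boost results that contain the whole multi-term query as one phrase."""
    if len(query.split()) > 1 and query.lower() in text.lower():
        return EXACT_PHRASE_BOOST
    return 1.0
```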
Putting it all together
You've got your matching entries from the search and a list of weights for the criteria each entry matched on; all we need to do now is combine these into an overall weight so the elements can be ranked against each other.
For this, simply multiply the distinct weights together.
As a set of examples:
A 2 day old (2) blog post (1.5) matched in the title (2.5) would have a weight of
2 * 1.5 * 2.5 = 7.5
Whereas a blog post (1.5) from today (5) matched on metadata (0.75) would have a weight of
1.5 * 5 * 0.75 = 5.625
A normal page (1) matched in the body (1) with an exact term match (10) would have a weight of
1 * 1 * 10 = 10
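Pulling the earlier sketches together - and assuming each result dict carries the hypothetical fields those sketches used - the whole ranking step reduces to a product and a sort:

```python
def rank(results):
    """Multiply each result's distinct weights into one score, then sort."""
    for r in results:
        # Static pages without a date get a neutral freshness weight of 1.
        fresh = freshness_weight(r["published"]) if r.get("published") else 1.0
        r["score"] = (
            PAGE_TYPE_WEIGHTS[r["type"]]
            * FIELD_WEIGHTS[r["matched_field"]]
            * fresh
            * exact_match_weight(r["query"], r["text"])
        )
    return sorted(results, key=lambda r: r["score"], reverse=True)
```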
These examples aren't perfect, and they're not going to fit every website; but with a few tweaks to the weightings for each searchable element, some specifics added for unique content types, and everything tailored to your users, you should end up with a good ranking system for your site search.