PostgreSQL full text search with Django
PostgreSQL 8.3 is coming out soon with full text search integrated into the core database system. It's pretty well documented in chapter 12 of the PostgreSQL docs. The docs are a bit intimidating, but it turns out to be pretty easy to use with Django.
Let's say you're doing a stereotypical blog application, named 'blog', and have a model for entries such as:
This is a generic Full Text Search engine for Django projects
Currently implements three backends: dummy, simple and pgsql.
- dummy - just uses ILIKE to do the search (no indexes, very slow)
 - simple - implements the search using two helper tables for the indexes
 - pgsql - uses PostgreSQL 8.3 full text search engine
 
It should be possible to easily integrate MySQL, Sphinx and Xapian backends too.
Install
To install the latest version:
svn checkout http://django-fts.googlecode.com/svn/trunk/ django-fts
cd django-fts
python setup.py install
Note: You will need to install the Snowball Python bindings if you want to use the Snowball stemmer. If you don't, a bundled stemmer based on the Porter algorithm will be used (the bindings are also not required if you are using the PostgreSQL backend). Get the Snowball bindings package from http://snowball.tartarus.org/wrappers/PyStemmer-1.0.1.tar.gz
Usage example
Add the fts app to your settings.py file and optionally configure an fts backend (simple by default):
INSTALLED_APPS = (
    #...
    'fts',
)
#FTS_BACKEND = 'pgsql://' # or 'dummy://' or 'simple://'
Assume that we have this model in our imaginary application:
from django.db import models

class Blog(models.Model):
    title = models.CharField(max_length=100)
    body = models.TextField()

    def __unicode__(self):
        return u"%s" % (self.title)
Now we want to add full text search to the Blog model. To do so, subclass your model from fts.SearchableModel instead of django.db.models.Model. The new model may look like this:
from django.db import models
import fts

class Blog(fts.SearchableModel):
    title = models.CharField(max_length=100)
    body = models.TextField()

    # Defining a SearchManager without fields will use all CharFields and TextFields.
    # This is the default and you do not need to explicitly add the following line:
    # objects = fts.SearchManager()

    # You can pass a list of fields that should be indexed
    # objects = fts.SearchManager( fields=('title','body') )

    # The fields you pass as parameters can be foreign fields ('myfield__foreign_field')
    # or even functions (functions should receive the instance as the only parameter)

    # You may also specify fields as a dictionary, mapping each field to a weight for ranking purposes
    # see http://www.postgresql.org/docs/8.3/static/textsearch-features.html#TEXTSEARCH-MANIPULATE-TSVECTOR
    #objects = SearchManager( fields={
    #    'title': 'A',
    #    'body': 'B',
    #} )

    def __unicode__(self):
        return u"%s" % (self.title)
In the django shell create some instances of models:
python ./manage.py shell
>>> from core.models import Blog
>>> p = Blog(title='This is the title', body='The body of the article')
>>> p.save()
>>> p = Blog(title='This is the second title', body='The body of another article in the blog')
>>> p.save()
>>> p = Blog(title='This is the third title', body='The body of yet another simple article')
>>> p.save()
Now perform a search:
>>> result = Blog.objects.search('simple').all()
>>> result.count()
1
>>> result
[<Blog: This is the third title>]
Additional information
You can force an index update to all or some instances:
>>> p.update_index()
>>> Blog.objects.update_index()
>>> Blog.objects.update_index(pk=1)
>>> Blog.objects.update_index(pk=[1,2])
You can omit the search function and make the search directly:
>>> result = Blog.objects('simple')
>>> result.count()
1
>>> result
[<Blog: This is the third title>]
PostgreSQL specific information
The PostgreSQL backend is heavily based on the code from http://www.djangosnippets.org/snippets/1328/ by Dan Watson.
If using the pgsql backend, don't forget to add a GIN or GiST index to your tables: http://www.postgresql.org/docs/8.3/static/textsearch-indexes.html
Example
CREATE INDEX "tablename_search_index" ON "tablename" USING gin("search_index");
Note: You should index the search_index column, not your text or char columns.
Your best bet is to use Django raw querysets; I use them with MySQL to perform full text matching. If the data is all in the database and Postgres provides the matching capability, then it makes sense to use it. Postgres also offers some really useful features for full text queries, such as stemming.
Basically it lets you write the actual query you want yet returns models (as long as you are querying a model table obviously).
The advantage this gives you is that you can first test the exact query you will be using directly in Postgres. The documentation covers full text queries pretty well.
The main gotcha with raw querysets at the moment is they don't support count. So if you will be returning lots of data and have memory constraints on your application you might need to do something clever.
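As a rough sketch of the raw-queryset approach, the helper below only builds the SQL string and parameter list; it does not import Django. The table name blog_blog, the search_index tsvector column, and the helper name build_fts_query are assumptions made for illustration, not part of any library.

```python
def build_fts_query(table, tsvector_column, user_query, config="english"):
    """Return (sql, params) for a PostgreSQL full text match.

    `table` and `tsvector_column` are interpolated into the SQL text, so
    they must be trusted identifiers from your own code, never user input.
    The configuration name and the search text travel as query parameters.
    """
    sql = (
        'SELECT * FROM "{table}" '
        'WHERE "{column}" @@ plainto_tsquery(%s::regconfig, %s)'
    ).format(table=table, column=tsvector_column)
    return sql, [config, user_query]


# Hypothetical table and column names, used only for this example.
sql, params = build_fts_query("blog_blog", "search_index", "simple article")
```

With Django, the pair could then be handed to a raw queryset, for example Blog.objects.raw(sql, params), after trying the same statement by hand in psql.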
search
A boolean full-text search, taking advantage of full-text indexing. This is
 like contains but is significantly faster due to full-text indexing.
Example:
Entry.objects.filter(headline__search="+Django -jazz Python")
SQL equivalent:
SELECT ... WHERE MATCH(tablename, headline) AGAINST ('+Django -jazz Python' IN BOOLEAN MODE);
Note this is only available in MySQL and requires direct manipulation of the
 database to add the full-text index. By default Django uses BOOLEAN MODE for
 full text searches. Please check MySQL documentation for additional details.
12.3. Controlling Text Search
To implement full text searching there must be a function to
 create a tsvector from a document and a
 tsquery from a user query. Also, we need to
 return results in a useful order, so we need a function that
 compares documents with respect to their relevance to the query.
 It's also important to be able to display the results nicely.
 PostgreSQL provides support for
 all of these functions.
12.3.1. Parsing Documents
PostgreSQL provides the
 function to_tsvector for
 converting a document to the tsvector
 data type.
    to_tsvector([ config regconfig, ] document text) returns tsvector
to_tsvector parses a textual
 document into tokens, reduces the tokens to lexemes, and
 returns a tsvector which lists the
 lexemes together with their positions in the document. The
 document is processed according to the specified or default
 text search configuration. Here is a simple example:
SELECT to_tsvector('english', 'a fat  cat sat on a mat - it ate a fat rats');
to_tsvector
-----------------------------------------------------
'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4
In the example above we see that the resulting tsvector does not contain the words a, on, or it, the word rats
 became rat, and the punctuation sign
 - was ignored.
The to_tsvector function
 internally calls a parser which breaks the document text into
 tokens and assigns a type to each token. For each token, a list
 of dictionaries (Section
 12.6) is consulted, where the list can vary depending on
 the token type. The first dictionary that recognizes the token emits one or more
 normalized lexemes to represent the
 token. For example, rats became
 rat because one of the dictionaries
 recognized that the word rats is a
 plural form of rat. Some words are
 recognized as stop words (Section
 12.6.1), which causes them to be ignored since they occur
 too frequently to be useful in searching. In our example these
 are a, on,
 and it. If no dictionary in the list
 recognizes the token then it is also ignored. In this example
 that happened to the punctuation sign - because there are in fact no dictionaries
 assigned for its token type (Space
 symbols), meaning space tokens will never be indexed. The
 choices of parser, dictionaries and which types of tokens to
 index are determined by the selected text search configuration
 (Section 12.7). It
 is possible to have many different configurations in the same
 database, and predefined configurations are available for
 various languages. In our example we used the default
 configuration english for the English
 language.
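The pipeline described above (tokenize, drop stop words, normalize, record positions) can be imitated with a toy Python sketch. This is not PostgreSQL's parser or dictionary code: the three-word stop list and the crude plural stripping in toy_stem are assumptions chosen only so that the sample sentence behaves like the example.

```python
import re

# Toy stop list: just the stop words that occur in the sample sentence.
STOP_WORDS = {"a", "on", "it"}


def toy_stem(word):
    """Very crude normalization: strip one trailing 's' from longer words."""
    if len(word) > 3 and word.endswith("s"):
        return word[:-1]
    return word


def toy_to_tsvector(document):
    """Map each surviving lexeme to the list of its word positions."""
    positions = {}
    # Only alphanumeric runs count as words; punctuation such as '-' is
    # skipped and does not consume a position.
    for pos, token in enumerate(re.findall(r"[A-Za-z0-9]+", document), start=1):
        word = token.lower()
        if word in STOP_WORDS:
            continue  # stop words still occupy a position but are not stored
        positions.setdefault(toy_stem(word), []).append(pos)
    return dict(sorted(positions.items()))


vector = toy_to_tsvector("a fat  cat sat on a mat - it ate a fat rats")
```

For this one sentence the sketch yields the same lexemes and positions as the to_tsvector output shown earlier; real configurations use proper stemmers and much larger stop lists.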
The function setweight can be
 used to label the entries of a tsvector
 with a given weight, where a weight is
 one of the letters A, B, C, or D. This is typically used to mark entries coming
 from different parts of a document, such as title versus body.
 Later, this information can be used for ranking of search
 results.
Because to_tsvector(NULL) will return NULL,
 it is recommended to use coalesce
 whenever a field might be null. Here is the recommended method
 for creating a tsvector from a structured
 document:
UPDATE tt SET ti = setweight(to_tsvector(coalesce(title,'')), 'A') || setweight(to_tsvector(coalesce(keyword,'')), 'B') || setweight(to_tsvector(coalesce(abstract,'')), 'C') || setweight(to_tsvector(coalesce(body,'')), 'D');
Here we have used setweight to
 label the source of each lexeme in the finished tsvector, and then merged the labeled tsvector values using the tsvector concatenation operator ||. (Section
 12.4.1 gives details about these operations.)
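When the set of weighted columns comes from application code (for example a field-to-weight mapping like the one django-fts accepts), the same SQL expression can be assembled programmatically. The sketch below is only a string builder; the function name weighted_tsvector_sql is hypothetical, and column names are assumed to be trusted identifiers.

```python
VALID_WEIGHTS = ("A", "B", "C", "D")


def weighted_tsvector_sql(field_weights):
    """Build a setweight(...) || setweight(...) expression.

    `field_weights` is a sequence of (column, weight) pairs. Column names
    are pasted into the SQL text, so they must come from your own code.
    """
    parts = []
    for column, weight in field_weights:
        if weight not in VALID_WEIGHTS:
            raise ValueError("weight must be one of A, B, C, D: %r" % (weight,))
        # coalesce() guards against NULL columns, as recommended above.
        parts.append(
            "setweight(to_tsvector(coalesce(%s,'')), '%s')" % (column, weight)
        )
    return " || ".join(parts)


expr = weighted_tsvector_sql([("title", "A"), ("body", "D")])
```

The resulting string can be dropped into an UPDATE statement such as the one above in place of the hand-written expression.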
12.3.2. Parsing Queries
PostgreSQL provides the
 functions to_tsquery and
 plainto_tsquery for converting a
 query to the tsquery data type.
 to_tsquery offers access to more
 features than plainto_tsquery,
 but is less forgiving about its input.
    to_tsquery([ config regconfig, ] querytext text) returns tsquery
to_tsquery creates a
 tsquery value from querytext, which must consist of single
 tokens separated by the Boolean operators & (AND), | (OR) and
 ! (NOT). These operators can be
 grouped using parentheses. In other words, the input to
 to_tsquery must already follow
 the general rules for tsquery input, as
 described in Section
 8.11. The difference is that while basic tsquery input takes the tokens at face value,
 to_tsquery normalizes each token
 to a lexeme using the specified or default configuration, and
 discards any tokens that are stop words according to the
 configuration. For example:
SELECT to_tsquery('english', 'The & Fat & Rats');
to_tsquery
---------------
'fat' & 'rat'
As in basic tsquery input, weight(s)
 can be attached to each lexeme to restrict it to match only
 tsvector lexemes of those weight(s). For
 example:
SELECT to_tsquery('english', 'Fat | Rats:AB');
to_tsquery
------------------
'fat' | 'rat':AB
to_tsquery can also accept
 single-quoted phrases. This is primarily useful when the
 configuration includes a thesaurus dictionary that may trigger
 on such phrases. In the example below, a thesaurus contains the
 rule supernovae stars : sn:
SELECT to_tsquery('''supernovae stars'' & !crab');
to_tsquery
---------------
'sn' & !'crab'
Without quotes, to_tsquery
 will generate a syntax error for tokens that are not separated
 by an AND or OR operator.
    plainto_tsquery([ config regconfig, ] querytext text) returns tsquery
plainto_tsquery transforms
 unformatted text querytext to
 tsquery. The text is parsed and
 normalized much as for to_tsvector, then the & (AND) Boolean operator is inserted between
 surviving words.
Example:
 SELECT plainto_tsquery('english', 'The Fat Rats');
plainto_tsquery
-----------------
'fat' & 'rat'
Note that plainto_tsquery
 cannot recognize either Boolean operators or weight labels in
 its input:
SELECT plainto_tsquery('english', 'The Fat & Rats:C');
plainto_tsquery
---------------------
'fat' & 'rat' & 'c'
Here, all the input punctuation was discarded as being space
 symbols.
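Because to_tsquery rejects tokens that are not separated by an operator, an application that wants something other than plainto_tsquery's implicit AND (for instance OR between words) has to build the operator-separated string itself. A minimal sketch, assuming free-form user text and a hypothetical helper name to_tsquery_input:

```python
import re


def to_tsquery_input(user_text, operator="&"):
    """Turn free-form text into operator-separated tokens for to_tsquery.

    Punctuation (including any '&', '|', '!' or ':' the user typed) is
    discarded, so the result contains only words joined by `operator`.
    """
    if operator not in ("&", "|"):
        raise ValueError("operator must be '&' or '|'")
    words = re.findall(r"\w+", user_text)
    return (" %s " % operator).join(words)


and_query = to_tsquery_input("The Fat Rats")
or_query = to_tsquery_input("neutrino dark", operator="|")
```

The returned string should still be passed to to_tsquery as a bound parameter rather than pasted into the SQL text. An input with no words produces an empty string, which the caller may want to treat as "no search".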
12.3.3. Ranking Search Results
Ranking attempts to measure how relevant documents are to a
 particular query, so that when there are many matches the most
 relevant ones can be shown first. PostgreSQL provides two predefined ranking
 functions, which take into account lexical, proximity, and
 structural information; that is, they consider how often the
 query terms appear in the document, how close together the
 terms are in the document, and how important is the part of the
 document where they occur. However, the concept of relevancy is
 vague and very application-specific. Different applications
 might require additional information for ranking, e.g. document
 modification time. The built-in ranking functions are only
 examples. You can write your own ranking functions and/or
 combine their results with additional factors to fit your
 specific needs.
The two ranking functions currently available are:
ts_rank([ weights float4[], ] vector tsvector, query tsquery [, normalization integer ]) returns float4
    Standard ranking function.
ts_rank_cd([ weights float4[], ] vector tsvector, query tsquery [, normalization integer ]) returns float4
    This function computes the cover density ranking for the given document vector and query, as described in Clarke, Cormack, and Tudhope's "Relevance Ranking for One to Three Term Queries" in the journal "Information Processing and Management", 1999.
    This function requires positional information in its input. Therefore it will not work on "stripped" tsvector values; it will always return zero.
For both these functions, the optional weights argument offers the ability to
 weigh word instances more or less heavily depending on how they
 are labeled. The weight arrays specify how heavily to weigh
 each category of word, in the order:
{D-weight, C-weight, B-weight, A-weight}
If no weights are provided,
 then these defaults are used:
{0.1, 0.2, 0.4, 1.0}
Typically weights are used to mark words from special areas
 of the document, like the title or an initial abstract, so that
 they can be treated as more or less important than words in the
 document body.
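The weights array is easy to get backwards, since it runs from D to A. As a small illustration, the hypothetical helper below takes the weights by label and emits the array literal in the order the ranking functions expect; its defaults are the ones listed above.

```python
def weights_literal(a=1.0, b=0.4, c=0.2, d=0.1):
    """Return a float4[] literal in {D-weight, C-weight, B-weight, A-weight} order."""
    return "{%s, %s, %s, %s}" % (d, c, b, a)


default_weights = weights_literal()
# Count B-labelled words a bit more heavily than the default.
custom_weights = weights_literal(b=0.5)
```

The string can then be supplied as the first argument of ts_rank or ts_rank_cd.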
Since a longer document has a greater chance of containing a
 query term it is reasonable to take into account document size,
 e.g. a hundred-word document with five instances of a search
 word is probably more relevant than a thousand-word document
 with five instances. Both ranking functions take an integer
 normalization option that
 specifies whether and how a document's length should impact its
 rank. The integer option controls several behaviors, so it is a
 bit mask: you can specify one or more behaviors using
 | (for example, 2|4).
- 0 (the default) ignores the document length
- 1 divides the rank by 1 + the logarithm of the document length
- 2 divides the rank by the document length
- 4 divides the rank by the mean harmonic distance between extents (this is implemented only by ts_rank_cd)
- 8 divides the rank by the number of unique words in document
- 16 divides the rank by 1 + the logarithm of the number of unique words in document
- 32 divides the rank by itself + 1
If more than one flag bit is specified, the transformations
 are applied in the order listed.
It is important to note that the ranking functions do not
 use any global information, so it is impossible to produce a
 fair normalization to 1% or 100% as sometimes desired.
 Normalization option 32 (rank/(rank+1)) can be applied to scale all ranks
 into the range zero to one, but of course this is just a
 cosmetic change; it will not affect the ordering of the search
 results.
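To make the bit mask and option 32 concrete, here is a small Python sketch. It only decodes the flags and reproduces the rank/(rank+1) arithmetic; it does not compute any ranking itself, and the function names are made up for this example.

```python
# (bit, effect) pairs in the order the transformations are applied.
NORMALIZATION_FLAGS = [
    (1, "divide by 1 + log(document length)"),
    (2, "divide by document length"),
    (4, "divide by mean harmonic distance between extents (ts_rank_cd only)"),
    (8, "divide by number of unique words"),
    (16, "divide by 1 + log(number of unique words)"),
    (32, "divide by itself + 1"),
]


def describe_normalization(option):
    """List the behaviours selected by a normalization bit mask."""
    if option == 0:
        return ["ignore document length"]
    return [text for bit, text in NORMALIZATION_FLAGS if option & bit]


def scale_to_unit_range(rank):
    """Option 32: map a non-negative rank into the range [0, 1)."""
    return rank / (rank + 1.0)


selected = describe_normalization(2 | 4)
ranks = [3.1, 2.4, 2.01317, 0.818218]
scaled = [scale_to_unit_range(r) for r in ranks]
```

Because rank/(rank+1) increases with rank, the scaled values keep the same order as the originals, which is why the change is only cosmetic.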
Here is an example that selects only the ten highest-ranked
 matches:
SELECT title, ts_rank_cd(textsearch, query) AS rank
FROM apod, to_tsquery('neutrino|(dark & matter)') query
WHERE query @@ textsearch
ORDER BY rank DESC LIMIT 10;
title                     |   rank
-----------------------------------------------+----------
Neutrinos in the Sun                          |      3.1
The Sudbury Neutrino Detector                 |      2.4
A MACHO View of Galactic Dark Matter          |  2.01317
Hot Gas and Dark Matter                       |  1.91171
The Virgo Cluster: Hot Plasma and Dark Matter |  1.90953
Rafting for Solar Neutrinos                   |      1.9
NGC 4650A: Strange Galaxy and Dark Matter     |  1.85774
Hot Gas and Dark Matter                       |   1.6123
Ice Fishing for Cosmic Neutrinos              |      1.6
Weak Lensing Distorts the Universe            | 0.818218
This is the same example using normalized ranking:
SELECT title, ts_rank_cd(textsearch, query, 32 /* rank/(rank+1) */ ) AS rank
FROM apod, to_tsquery('neutrino|(dark & matter)') query
WHERE  query @@ textsearch
ORDER BY rank DESC LIMIT 10;
title                     |        rank
-----------------------------------------------+-------------------
Neutrinos in the Sun                          | 0.756097569485493
The Sudbury Neutrino Detector                 | 0.705882361190954
A MACHO View of Galactic Dark Matter          | 0.668123210574724
Hot Gas and Dark Matter                       |  0.65655958650282
The Virgo Cluster: Hot Plasma and Dark Matter | 0.656301290640973
Rafting for Solar Neutrinos                   | 0.655172410958162
NGC 4650A: Strange Galaxy and Dark Matter     | 0.650072921219637
Hot Gas and Dark Matter                       | 0.617195790024749
Ice Fishing for Cosmic Neutrinos              | 0.615384618911517
Weak Lensing Distorts the Universe            | 0.450010798361481
Ranking can be expensive since it requires consulting the
 tsvector of each matching document, which
 can be I/O bound and therefore slow. Unfortunately, it is
 almost impossible to avoid since practical queries often result
 in large numbers of matches.
12.3.4. Highlighting Results
To present search results it is ideal to show a part of each
 document and how it is related to the query. Usually, search
 engines show fragments of the document with marked search
 terms. PostgreSQL provides a
 function ts_headline that
 implements this functionality.
ts_headline([ config regconfig, ] document text, query tsquery [, options text ]) returns text
ts_headline accepts a document
 along with a query, and returns an excerpt from the document in
 which terms from the query are highlighted. The configuration
 to be used to parse the document can be specified by config; if config is omitted, the default_text_search_config configuration is
 used.
If an options string is
 specified it must consist of a comma-separated list of one or
 more option=value pairs.
 The available options are:
- StartSel, StopSel: the strings with which query words appearing in the document should be delimited to distinguish them from other excerpted words.
- MaxWords, MinWords: these numbers determine the longest and shortest headlines to output.
- ShortWord: words of this length or less will be dropped at the start and end of a headline. The default value of three eliminates the English articles.
- HighlightAll: Boolean flag; if true the whole document will be highlighted.
Any unspecified options receive these defaults:
StartSel=<b>, StopSel=</b>, MaxWords=35, MinWords=15, ShortWord=3, HighlightAll=FALSE
For example:
SELECT ts_headline('english', 'The most common type of search
is to find all documents containing given query terms
and return them in order of their similarity to the
query.', to_tsquery('query & similarity'));
ts_headline
------------------------------------------------------------
given <b>query</b> terms
and return them in order of their <b>similarity</b> to the
<b>query</b>.
SELECT ts_headline('english', 'The most common type of search
is to find all documents containing given query terms
and return them in order of their similarity to the
query.',
to_tsquery('query & similarity'),
'StartSel = <, StopSel = >');
ts_headline
-------------------------------------------------------
given <query> terms
and return them in order of their <similarity> to the
<query>.
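The effect of StartSel and StopSel can be mimicked with a toy highlighter in Python. This sketch is not how ts_headline is implemented: it matches whole words case-insensitively, does no stemming, and never shortens the document the way MaxWords and MinWords do. The name toy_headline is an assumption for illustration.

```python
import re


def toy_headline(document, query_words, start_sel="<b>", stop_sel="</b>"):
    """Wrap every occurrence of a query word in start_sel/stop_sel."""
    wanted = {w.lower() for w in query_words}

    def mark(match):
        word = match.group(0)
        if word.lower() in wanted:
            return start_sel + word + stop_sel
        return word

    # Visit each word once; everything between words is left untouched.
    return re.sub(r"\w+", mark, document)


text = "given query terms and return them in order of their similarity to the query."
default_marks = toy_headline(text, ["query", "similarity"])
custom_marks = toy_headline(text, ["query", "similarity"], "<", ">")
```

Swapping the delimiters changes only the markers around the matched words, which is the same effect the options string has in the second ts_headline example.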
ts_headline uses the original
 document, not a tsvector summary, so it
 can be slow and should be used with care. A typical mistake is
 to call ts_headline for
 every matching
 document when only ten documents are to be shown.
 SQL subqueries can help;
 here is an example:
SELECT id, ts_headline(body, q), rank
FROM (SELECT id, body, q, ts_rank_cd(ti, q) AS rank
FROM apod, to_tsquery('stars') q
WHERE ti @@ q
ORDER BY rank DESC LIMIT 10) AS foo;