First, the beginning
There was an issue on Plume, issue 149, about adding some search capabilities. I had never worked on something like that, never touched ElasticSearch or anything similar, so I thought it would be cool to discover how such things work.
The first thing to do was to go on crates.io to see what crates (libraries) could help me.
After some quick sorting, I could already eliminate most of the results: some because they are tied to a database back-end we might not provide, others because they are not documented enough. In the end, I chose Tantivy because it was the only one that was well documented, was not a client library for another piece of software, and actually fit our needs.
Choosing a schema
After digging through Tantivy's documentation and its examples for quite some time, I got a good idea of how it works and what it is capable of. As one would do with a SQL database, I started to think about what I would store, and how. Here is what I came up with:
```
fields:
    post_id       => i64(id)   -> Stored
    creation_date => i64(date) -> Indexed
    instance      => i64(id)   -> Indexed
    author        => string    -> Indexed | IndexRecordOption::Basic                 > ContentTokenizer
    hashtag       => string    -> Indexed | IndexRecordOption::Basic                 > ContentTokenizer
    mention       => string    -> Indexed | IndexRecordOption::Basic                 > ContentTokenizer
    blog          => string    -> Indexed | IndexRecordOption::WithFreqsAndPositions > ContentTokenizer
    content       => string    -> Indexed | IndexRecordOption::WithFreqsAndPositions > ContentTokenizer
    subtitle      => string    -> Indexed | IndexRecordOption::WithFreqsAndPositions > ContentTokenizer
    title         => string    -> Indexed | IndexRecordOption::WithFreqsAndPositions > ContentTokenizer
    license       => string    -> Indexed | IndexRecordOption::Basic                 > PropertyTokenizer
    lang          => string    -> Indexed | IndexRecordOption::Basic                 > PropertyTokenizer

tokenizers:
    PropertyTokenizer => NgramTokenizer(2..8, prefix_only=false).LowerCaser
        applied to: lang, license
    ContentTokenizer  => SimpleTokenizer.limit(40).LowerCaser
        applied to: author, blog, content, hashtag, subtitle, title
```
Pretty much everything is here: field names, their types, how they should be managed... but this is still a first sketch. At this point I hadn't played with Tantivy yet, and things might still change. Let me explain the choices I made a bit. First, let's start with types; Tantivy supports four:
- signed and unsigned integers, 64 bits for both
- text, which is exactly what it looks like
- facets, which are a bit like paths, and can be useful to store categories and subcategories. See this link for more information, as I won't talk about them again.
Each field comes with some options, allowing us to tell Tantivy how to store and index things. The first option is whether the field is stored. A stored field can be retrieved after searching for a document. Plume already stores its content in a database, so we mostly need to get the id of a post; we don't want to store every post twice, as that would be a waste of space. Then, the question is whether it is indexed. Most fields are, because we want to search through those indexes, but post_id is not, because no one wants to search for a post they have already found. Edit: in the end, post_id needs to be indexed after all, as it is the key we use when we try to delete or edit a post.
Strings come with some more options: what should the index contain about them? Only which words they contain, or also their frequency and position? Storing positions for hashtags is not of much use, as we don't put them in any particular order. Nor is frequency, as a hashtag usually doesn't appear twice in a post. On the other hand, searching for an exact expression is quite useful in a title or a full post. Then, how do we tokenize our text? The simple, but not most efficient, way to tokenize text would be to just split each word apart. This could be enough, but we can do better, and I actually wanted to play a bit with Tantivy, so I looked into what more I could do. Licenses are single words, so we could index them the easy way and it would be enough, but I wanted someone searching for Creative Commons content to also find CC-0, CC-BY... so I used the n-gram tokenizer, which transforms a word into every n-gram it is composed of. "CC-0" would therefore become ["CC", "CC-", "CC-0", "C-", "C-0", "-0"] if we consider n-grams of length 2 and more, and would then be matched by "CC". We then lowercase the result so "CC" and "cc" are equivalent.
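To make the expansion concrete, here is a small stdlib-only sketch of what n-gram extraction does. The function name and exact emission order are my own illustration; in practice Tantivy's NgramTokenizer does this for us:

```rust
/// Emit every lowercased n-gram of `word` with length in `min..=max`:
/// all grams starting at position 0, then position 1, and so on.
fn ngrams(word: &str, min: usize, max: usize) -> Vec<String> {
    let chars: Vec<char> = word.to_lowercase().chars().collect();
    let mut out = Vec::new();
    for start in 0..chars.len() {
        for len in min..=max {
            if start + len > chars.len() {
                break;
            }
            out.push(chars[start..start + len].iter().collect());
        }
    }
    out
}

fn main() {
    // "CC-0" expands to ["cc", "cc-", "cc-0", "c-", "c-0", "-0"],
    // so the query term "cc" matches the license "CC-0".
    println!("{:?}", ngrams("CC-0", 2, 8));
}
```

Because every gram is indexed, a short prefix like "cc" hits all Creative Commons licenses at once, at the cost of a larger index.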
For post content, being able to search for any n-gram would however generate a lot of data to store; we need something else. The something I found is called stemming. The idea is to extract the stem of each word, so that "random" matches "randomly". Without stemming, both are considered totally different tokens, and won't match each other. However, stemming rules are very language-dependent (the same pair in French would be "aléatoire" and "aléatoirement", for instance; the suffix is "ment" instead of "ly"). Tantivy provides a built-in stemmer for English, but not for other languages, so I decided not to use this yet. Maybe later. Another feature is stopwords, which stops some words from being indexed because they are of no use. For example, "the" is a very common word in English, so searching for it is pointless. But stopwords can cause some issues: a very common word in one language may be a very rare one in another. To make matters worse, sometimes common words mean something special. Think of "The Who", a famous rock band: with stopwords, it's likely both "the" and "who" would get dropped, and no one would ever find that band again.
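To illustrate the idea, and only the idea, here is a deliberately naive suffix-stripper. Real stemmers, like the Snowball one Tantivy ships for English, use full linguistic rule sets; `naive_stem` is purely a hypothetical toy:

```rust
/// Extremely naive English "stemmer": strips a few common suffixes so
/// related word forms collapse to the same token. Real stemmers apply
/// far more careful rules; this only shows why stemming helps matching.
fn naive_stem(word: &str) -> String {
    let w = word.to_lowercase();
    for suffix in ["ly", "ing", "ed", "s"] {
        if let Some(stem) = w.strip_suffix(suffix) {
            // Keep at least a 3-letter stem to avoid mangling short words.
            if stem.len() >= 3 {
                return stem.to_string();
            }
        }
    }
    w
}

fn main() {
    // "random" and "randomly" now map to the same index token.
    println!("{} / {}", naive_stem("random"), naive_stem("randomly"));
}
```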
First lines of code
So now that we know a bit about Tantivy, it's time to actually write some code. As this is the first time I'm using this crate, I started with some tests, using the above configuration but outside of Plume, just to discover how everything works.
After playing a bit with Tantivy, I've already learnt some interesting things about its internals. I started with the basic_search example, which I modified to my needs: first to use the schema I defined above, then to test a bit how queries work. As I was really interested in stemming, I looked into how to implement a TokenFilter, basically copied LowerCaser, and added some println! calls to understand how it was called. The struct we create ourselves is, well, created only once, by ourselves. Its transform function is called on each document we add, and creates a new TokenStream. This TokenStream then iterates over each token in the document, modifying or removing the ones it wants. For queries, the TokenStream sees one token per word, or one token for multiple words if they are quoted.
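Stripped of Tantivy's actual trait machinery (whose exact signatures vary between versions), the filter idea boils down to wrapping a token stream and rewriting tokens as they pass through. The Token struct and function below are my own simplified stand-ins, not Tantivy's types:

```rust
// A token roughly as a tokenizer sees it: a piece of text plus its position.
#[derive(Debug, PartialEq)]
struct Token {
    position: usize,
    text: String,
}

// A "filter" in the spirit of Tantivy's LowerCaser: it wraps the stream
// produced by the tokenizer and rewrites each token on the way through.
fn lower_case_filter(stream: impl Iterator<Item = Token>) -> impl Iterator<Item = Token> {
    stream.map(|mut tok| {
        tok.text = tok.text.to_lowercase();
        tok
    })
}

fn main() {
    let tokens = vec![
        Token { position: 0, text: "Hello".into() },
        Token { position: 1, text: "Plume".into() },
    ];
    let lowered: Vec<Token> = lower_case_filter(tokens.into_iter()).collect();
    println!("{:?}", lowered);
}
```

A stemming filter would have exactly the same shape, with the lowercasing replaced by stem extraction.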
So it would be possible to put a sort of header on each document indicating its language, and to stem differently based on that. For queries, we would then have to either try each language or detect the language of the query. Although it won't be included in this first try, it's good to know.
Also, according to the documentation, the QueryParser is bad at handling queries that are not properly formatted. This should be fixed soon, as there is a pull request trying to add a lenient mode to the parser. But for now, we will need some sort of adapter, taking a query from a user and transforming it into a Query ourselves, or into a query string that Tantivy will understand every time.
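One naive way to build such an adapter, shown here as a hypothetical sketch rather than the code Plume ended up with, is to replace every character the query grammar could treat specially before handing the string to the parser:

```rust
/// Turn arbitrary user input into a query string that a strict parser
/// should always accept, by blanking out anything that isn't a letter,
/// digit or whitespace. Note this also destroys quoted phrases and
/// field prefixes, which is why a real adapter needs to be smarter.
fn sanitize_query(input: &str) -> String {
    input
        .chars()
        .map(|c| if c.is_alphanumeric() { c } else { ' ' })
        .collect::<String>()
        .split_whitespace()
        .collect::<Vec<_>>()
        .join(" ")
}

fn main() {
    println!("{}", sanitize_query("title:\"The Who\" +(rock"));
}
```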
Integrating to Plume
So now that we have explored a bit how it works, let's start playing with it inside Plume. As a public interface for the module, we will need some kind of builder, something to add, edit and remove documents, and obviously something to run searches. We also want some way to regenerate the whole index via a new plm sub-command.
The first thing to do is to create the module; that's what I've done in this commit. There is nothing really fancy: adding Tantivy to the dependencies, along with whatlang to extract the language from content, and itertools because it makes life easier when doing certain things with iterators. I made two functions to create an index, one only opening an existing one, and the other creating it. Plume should only open an existing index; plm, once implemented, will be used to create it beforehand.

Well, actually there is one slightly fancy thing: date storage. Tantivy allows for integer storage, which is obviously what I went with. But I did not store dates as seconds since 1970, the famous Unix time, because Tantivy's documentation states that RangeQuery, which we need to search for documents created between two dates, iterates over the whole range. Iterating over millions of seconds is definitely slower than iterating over the 30 days of a month. So I decided dates would be stored as a number of days since a particular point, to be more efficient. I also made the choice to commit to Tantivy's store on every post creation/deletion, which turned out to be quite slow. Tantivy seems to prefer large batch jobs over single-document ones.
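The day-based storage boils down to one division; `timestamp_to_days` is a hypothetical helper name, but the arithmetic is what matters:

```rust
const SECONDS_PER_DAY: i64 = 86_400;

/// Store creation dates as whole days since the Unix epoch instead of
/// seconds: a RangeQuery over one month then spans ~30 values instead
/// of ~2.6 million. div_euclid rounds toward negative infinity, so
/// pre-1970 timestamps still land on the correct day.
fn timestamp_to_days(unix_seconds: i64) -> i64 {
    unix_seconds.div_euclid(SECONDS_PER_DAY)
}

fn main() {
    // 2018-01-01T00:00:00Z is 1_514_764_800 seconds after the epoch.
    println!("{}", timestamp_to_days(1_514_764_800)); // 17532
}
```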
Then in another commit, I added the search command to plm, with two sub-commands: one to create a new index, and the other to read all posts from Plume's database and copy them into the index. I also added calls to the functions I created in the previous commit, so that creating a new post, or deleting a blog, is reflected accordingly in Tantivy's results. This required me to change the prototypes of many functions, but wasn't actually much work. The Rust compiler just kept yelling at me "you don't have enough arguments for this function", but once it quieted down, indexing seemed to be working. I also added a commit() function to my Searcher, which simply calls Tantivy's, and removed the commit on each edition. That way I can just commit regularly, with a thread waking up every half an hour or so. Changing function prototypes broke tests, as I forgot to update them, but this was quickly fixed.
A new threadpool
So now I needed to call commit() regularly, but our current thread pool only allowed running tasks as soon as possible, and could not schedule one for later. I didn't feel like running a new dedicated thread just for auto-committing, so I decided to move to a more advanced threadpool.
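Whatever pool ends up running it, the periodic commit behaves like this minimal sketch. The plain dedicated thread and the dummy Searcher here are purely for illustration; the real code goes through the scheduling threadpool and Tantivy's IndexWriter:

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

// Hypothetical stand-in for the Searcher wrapping Tantivy's IndexWriter.
struct Searcher {
    pending: u32, // documents added since the last commit
}

impl Searcher {
    fn commit(&mut self) {
        // In Plume this would call IndexWriter::commit() and make the
        // pending documents visible to searches.
        self.pending = 0;
    }
}

/// Wake up every `every` and flush pending writes in one batch,
/// which Tantivy handles much better than per-document commits.
fn spawn_committer(searcher: Arc<Mutex<Searcher>>, every: Duration) -> thread::JoinHandle<()> {
    thread::spawn(move || loop {
        thread::sleep(every);
        searcher.lock().unwrap().commit();
    })
}

fn main() {
    let searcher = Arc::new(Mutex::new(Searcher { pending: 5 }));
    let _committer = spawn_committer(Arc::clone(&searcher), Duration::from_millis(50));
    thread::sleep(Duration::from_millis(200));
    println!("pending after a while: {}", searcher.lock().unwrap().pending);
}
```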
The next thing was to finally do some searching. At this point, it looked like everything was working: doing some computation, writing results to disk... but I hadn't done a single search yet. So I made a basic front-end with the bare minimum, and used Tantivy's QueryParser to make my first queries. And hooray, everything seems to work, even paginated search. Doing advanced queries is a bit of a pain, but it works.
However, there is still at least one big problem: Rocket (the web framework we use in Plume) does not support clean shutdown. Tantivy's index writer takes a lock on the index directory, and is supposed to release it when it gets dropped, but with Rocket's State owning it, this drop never happens, leading to an error every time you start Plume after having killed it, even with ctrl-c. This was fixed in the next commit, where I added a drop_writer function to Searcher. After calling it, any attempt to write to the index will panic, so it is called on process termination, just before calling exit. There might be a race condition where one could edit a document after the writer got dropped but before Plume exited; however, this would only mean their post won't be indexed, and as I see no easy way to fix it, I will leave it that way.
Road to advanced search
The next commit is quite big. It contains some refactoring to avoid code duplication, and some fields changed type because, on reflection, indexing the instance name into Tantivy is easier than indexing its id and running an SQL query to map one to the other. I also created my own tokenizer, which is a copy of SimpleTokenizer, but which doesn't cut tokens on punctuation, as instance domains contain dots, and usernames can contain underscores or other non-alphanumeric characters. This commit also contains a query parser, which is useful mainly because it never fails. That doesn't mean it's better than Tantivy's; it's just way less generic, so it can make assumptions Tantivy can't. It also converts date ranges from the string a browser sends (YYYY-MM-DD) to our own internal representation. I also modified the front-end to match the back-end, with a basic search field and many advanced fields hidden unless you open the spoiler. This is pretty much the final template: it's not exactly beautiful, but it's functional.
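The behavioural difference from SimpleTokenizer can be shown with a stdlib-only sketch that splits on whitespace alone (a simplification of the real tokenizer, which mirrors SimpleTokenizer's structure rather than using split_whitespace):

```rust
/// Split on whitespace only, unlike a tokenizer that also cuts on
/// punctuation: "social.example.org" and "admin_user" must each
/// survive as a single, lowercased token.
fn whitespace_tokens(text: &str) -> Vec<String> {
    text.split_whitespace().map(str::to_lowercase).collect()
}

fn main() {
    println!("{:?}", whitespace_tokens("Posts by admin_user on social.example.org"));
}
```

A punctuation-splitting tokenizer would break the domain into ["social", "example", "org"], making an exact instance search impossible.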
Before merging, I decided to split searcher.rs into multiple sub-modules, as the code is quite dense. There aren't that many lines, but I used macros a lot to reduce duplication, and that does not help readability. I also added tests for the parser, fixed some issues with it, made it smarter when you search for posts from an author on another instance, and that's it.
To conclude, I'd say this was very interesting. Tantivy is a very powerful crate; having taken a quick look at its internals, it is also a complex piece of software, but the interface it offers is not that hard to use. Most difficulties I encountered were because I decided I wanted to write a query parser matching our needs, but one of Tantivy's examples manages to do most of what I've done (with a smaller schema and using Tantivy's built-in QueryParser) in about 70 lines of actual code. Thanks a lot to Fulmicoton for this library; it was a pleasure to discover full-text search with it.
Now if we want to make even better use of Tantivy in Plume here is a short list of things we could do:
- Tantivy can extract fragments of documents matching a query; when showing search results, Plume could print these instead of the post's subtitle
- We don't use stemming yet, which would probably make results more accurate
- Tantivy's development branch supports Levenshtein distance queries and regexps; maybe allowing one of those could make sense in some cases

For those who are not only interested in the public interface of a library, but also in its internals, I recommend reading this post. From what I've understood, Tantivy uses this kind of scheme for its data structures, as does Lucene, the back-end behind ElasticSearch.