FDB_hiroshima

testing link to posts

New Plume Alpha

Hello! It has been a long time since the last article about Plume development was published (actually, there was the very good but technical article by @FDB_hiroshima about integrating tantivy into Plume). But it doesn't mean we were not active during the last months! Quite the opposite, we have continuously been improving Plume, and we now think that it is time for a new alpha release.

What changed since the first alpha?

First of all, there was a number of quite important changes on the organizational level. Indeed, we decided to adopt a Code of Conduct, to make sure contributing to Plume is a pleasant experience for everyone. If you are curious, you can read it here (it is based on the Contributor Covenant Code of Conduct).

We also opened a Loomio group to discuss non-technical issues. Loomio is a tool designed to discuss and vote on specific topics. It only requires an email address to participate, and we would be happy to welcome you to our debates. If you wante to help Plume, but don't have much time or lack technical skills, participating on Loomio is a great way to contribute.

Among other things, we used Loomio to choose the new Plume logo, and after a poll, @trwnh@mastodon.social's proposition was chosen. It will now be displayed by default on all the instances and used for branding on other websites (Matrix, GitHub, etc). However, we plan to integrate other logos as well (so that the other proposition aren't lost) and we will add an option for you to choose the one you'd like to use on your instance.

Plume also has a website to introduce the project, host its documentation and contribution guides (this last page is based on Funkwhale's page).

But what about Plume itself? As you can imagine, since the first alpha was published, a lot of changes have happened.

Let's start with the bug fixes. There was a lot of them (we are still in an alpha phase after all), but the most important ones are probably the following:

We now handle long articles better
Articles and blogs without any authors (when the last of them deletes their account) are now deleted to avoid crashes because of articles or blogs without authors
We have addressed an important security issue in the way we check who has the right to do an action or not (for instance, previously, it was relatively easy for an attacker to delete a picture you uploaded)
@vbrandl also fixed a bug that was breaking the notification page when a comment was deleted

Most of theses changes are not visible (because it works as expected now), but will noticably improve your experience with Plume (as well as your security).

There was also some design updates: @dfeyer@social.ttree.ch improved the contrast of input fields, and improved the title's style so that they are readable even when wrapped on multiple lines (and he started to work on more important design updates, that may come in the next release). Plume should also work better on small screens now.

On the federation side, @KokaKiwi and @FDB_hiroshima committed quite a lot of bug fixes. I (@Ana) also made a lot of changes to the code that handles incoming activities, making it easier to maintain and more compliant with the ActivityPub specification. The federation should be a lot more reliable now! @FDB_hiroshima also implemented two very important security and privacy related features: signature verification, to make sure the activities we receive from other instances are indeed coming from these instances, and blind key rotation. The NodeInfo, that gives information about a Fediverse instance for websites like the-federation.info, was also updated by @0x1C3B00DA@edolas.world to support both the version 2.0 and 2.1 of the standard.

@mareklach@mastodon.xyz also made a lot of changes to the phrasing used in the interface. Thanks to him, the English translation of Plume is now nicer to read. The term "follow" was also replaced by "subscribe" which is less person-centric and more adapted in the context of blogs.

The translations were also updated: Plume is almost completely translated in ten languages now!

I want to thank @alangarciar, @amikigu, @ardydo, @khannaankitaa, @ButterflyOfFire@mstdn.fr, @Devil505, @faho, @fitoshido, @FDB_hiroshima, @heldergg, @ida27, @Lacrymology, @m4sk1n@101010.pl, @ManuelFranz, @mareklach@mastodon.xyz, @metalbiker@ins.mastalab.app, @Moutmoutausore, @netopyrr, @ryonakano, @Silviu200530, @Swedneck, @UniqueActive, @wilPoly and @xosem@mstdn.io (that's a lot of people!) for their help. By the way, we are now using Crowdin to translate Plume, which makes it much more convenient when you are not familiar with GitHub.

Of course, we implemented a bunch of new features these last months! Among them:

You can now delete and edit blogs
Articles can now contain hashtags, which are correctly federated to other ActivityPub platforms
Articles can have cover images which will make them standout more in the article listings
View who is subscribing to a user by going to their profile
Your password can now be reset, in case you forgot it

A new editor, that should be more pleasant to use was also introduced! It is for the moment quite basic, but we plan to add a lot of features to make writing articles with Plume easier, even if you don't know Markdown.

We also improved integration with other services. Microformat tags were added making it easier for external apps to display Plume articles in their interface. RSS feeds are also more discoverable now (a little link was added on blog and user pages), and a bug that broke them in some case was fixed. OpenGraph metadata were also added, so that when you share a Plume article on another website, a nice preview card can be displayed.

The comments improved a lot too. It is now possible to delete them. We also correctly handle comment visibility from other instances, even if it is not yet possible to change it when commenting from Plume. Content warnings (as they are called in Mastodon) are also now supported in comments.

We also added features that will please people hosting an instance. First of all, thanks to @hirojin@dev.glitch.social it is now possible to use SQlite instead of PostgreSQL as a database engine. We also replaced the not so well working "setup script" that asked you questions to help you setup your instance with a command line tool, called plm, that has various commands to manage your instance. The default license for new instance was also changed to CC-BY-SA, which is a better default than CC-0. And finally, it is possible to choose which logo to use for your instance from the configuration.

A REST API to create articles was also added. We built a command line tool, called amsterdam (thanks @Doshirae@social.wxcafe.net for the great name) on top of this new API to import posts stored on your computer.

Finally, maybe the biggest feature in this release, a search engine was added to Plume!

It has been implemented with the amazing tantivy crate, which is working very well and barely consumes any resources. If you want details about the technical aspects of this feature, you can read this article.

On the technical side, a lot of changes happened too. Thanks to @kemonine@social.holdmybeer.solutions and @eliotberriot@mastodon.eliotberriot.com our Docker files are working great, even on ARM, the Docker image became smaller and is now published in the Docker registry (and in Lollipop cloud's one for ARM builds).

The styling is now done with SCSS, which is much more convenient than regular CSS.

We also added some constraints at the database level (there were almost none before), to avoid saving invalid data.

We try to take advantage of Rust power as much as possible. For that purpose, we reworked our error handling to make it use the powerful Result type instead of making Plume crash on error. We also converted our front-end code from JavaScript to Rust thanks to WebAssembly. It means we can now easily share code between the back-end and the front-end if needed. Our templates have also been rewritten using ructe, which is a crate that compiles your page templates to Rust functions, guaranteeing their correctness before Plume starts, and improving performances.

A feature that recently landed in Rust is what we call proc-macro. They are special functions that can transform some code to something else in very powerful ways. I used this feature to create a few proc-macro that automatically generate translation files and use them in Rust apps. If you are curious, here is the repository.

We also started to write more tests, and even if the coverage is still very low (around 30% at the moment), we are making progress, and I hope we can be near to 100% of tested code for the first real release. @FDB_hiroshima also migrated our Continuous Integration to Circle CI, which is much more faster than Travis CI.

How to try this new version

If you want to give this new version a try, Fediverse.blog is already up-to-date. Otherwise, just give some time to your instance admin to update. 😄

If you are an instance admin, and you want to update or install a new instance, please refer to the release notes on GitHub.

What will come next?

Even if we made a lot of progress, tons of features are still missing for Plume to be really usable. I can't tell for sure what will be next, but I would like to work on multi-authors blogs and articles, and on moderation tools, because the one we currently have are too basic to be really useful. @hirojin@dev.glitch.social is also working on custom domains for blogs, so that may be in the next update!

However, I don't know how much I will be able to contribute in the following months, because my exams period will start soon, and I will have less time to give to Plume. Fortunately, I'm not the only person working on Plume.

I probably forgot a lot of people in this article, and I don't have everyone's real @, I'm sorry. I would like to thank everyone who made this release possible (even if your name was not cited here). Thank you! 💜

Adding full text search to Plume

First the beginning

There was an issue on Plume, issue 149, about adding some search capabilities to Plume. I've never worked on something like that, never touched ElasticSearch or anything, so I through it would be cool to discover how such things worked.

The first think to do was to go on crates.io to see what crates (libraries) could help me.

Doing some quick sorting, I could already eliminate most of the result. Some because they're tied to a database back-end we might not provide, other because they're not documented enough. In the end, I choosed Tantivy because it was the only one which was well documented, not a client library for another software, and actually feats to our need.

Choosing a schema

After digging quite some time in Tantivy's documentation, and in it's examples, I made myself a good idea of how it works and what it was capable of. As one would do with a SQL database, I started to search what I would store, and how. Here is what it got me to :

fields:
    post_id       => i64(id)   -> Stored
    creation_date => i64(date) -> Indexed
    instance      => i64(id)   -> Indexed
    author        => string    -> Indexed | IndexRecordOption::Basic                 > ContentTokenizer
    hashtag       => string    -> Indexed | IndexRecordOption::Basic                 > ContentTokenizer
    mention       => string    -> Indexed | IndexRecordOption::Basic                 > ContentTokenizer
    blog          => string    -> Indexed | IndexRecordOption::WithFreqsAndPositions > ContentTokenizer
    content       => string    -> Indexed | IndexRecordOption::WithFreqsAndPositions > ContentTokenizer
    subtitle      => string    -> Indexed | IndexRecordOption::WithFreqsAndPositions > ContentTokenizer
    title         => string    -> Indexed | IndexRecordOption::WithFreqsAndPositions > ContentTokenizer
    license       => string    -> Indexed | IndexRecordOption::Basic                 > PropertyTokenizer
    lang          => string    -> Indexed | IndexRecordOption::Basic                 > PropertyTokenizer


Tokenizers:
    PropertyTokenizer => NgramTokenizer(2..8, prefix_only=false).lowerCase:
        lang, license
    ContentTokenizer => SimpleTokenizer.limit(40).lowerCaser:
    author, blog, content, hashtag, subtitle, title

Pretty much all is here, fields names, their types, how they should be managed... but this is still a first sketch, at this time I haven't played with Tantivy yet, and things might still change. Let me explain a bit this the choices I made. First lets start with types, Tantivy support 4 :

signed and unsigned integers, 64 bits for both
text, which is exactly what it looks like
facet, which is a bit like a path, and can be useful to store categories and subcategories. See this link for more information, as I won't talk about them again.

Fields options

With each field came some options, allowing us to tell Tantivy how to store and index things. The first option is it being stored. A stored field can be retrieved after searching for a document. Plume already store it's content in a database, so we mostly needs to get the id of a post, we don't want to store every post in double, that would be a wast of space. Then, the question is it indexed? Most fields are, because we want to search through those index, but post_id is not, because no one want to search for a post they already found. Edit: after all post_id need to be indexed as this is the key we use when we try to delete or edit a post.

String indexing

With strings come some more options, what should the index contain about them? Only what words are contained, or also their frequency and position? Storing position for hashtag is not of much use, we don't put them in a nice order. Nor is frequency as there is usually no hashtag twice in a post. On another hand, searching for an expression is quite useful in a title or a full post. Then how do we tokenize our text? The simple, but not most efficient way, to tokenize text would be to simply split each word apart. This could be enough, but we could do better, and I actually want to play a bit with Tantivy, so I searched what I could do more. For licenses, their are single words, so we could index the the easy way and it would be enough, but I wanted someone who search for Creative Commons content to also find CC-0, CC-BY... so I used the n-gram tokenizer, which transform a word in every n-gram it is composed of. "CC-0" would therefore become ["CC", "CC-", "CC-0", "C-", "C-0", "-0"] if we consider 2-grams, and would then be matched by "CC". We then lowercase the result so "CC" and "cc" are equivalent.

Stemming

For post content, being able to search any n-gram would however generate a lot of data to store, we need to use something else. The something I found is called stemming. The idea is to extract the stem of each word, so that "random" would match "randomly". Without stemming both a considered totally different tokens, and wont match each other. However stemming rules are very language dependent (the same in french would be "aléatoire" and "aléatoirement" for instance, the suffix is "ment" instead of "ly"). Tantivy provide a built in stemmer for English, but not for other languages, so I decided to not use this yet. Maybe latter. Another feature is stopwords, it stop some words from being indexed, because they are of no use. For example "the" is a very common word in English, searching for it has not use. But stopwords can cause some issues. Maybe a very common word in a language is also a very rare one in another. To make the mater worse, sometime common words mean something special. Think "The Who", a famous rock band, using stop words, it's likely both "the" and "who" would get stopped, and no one would ever find that band again.

First lines of code

So now that we know a bit about Tantivy, it's time to actually write some code. First as it's the first time I use this crate, I'll do some tests, with the above configuration, but outside of Plume, just to discover how everything works.

After playing a bit with Tantivy, I've already learnt some interesting things about it's internals. I started with the basic_search example, that I modified to my needs. First to used my own schema that I defined above, and tested a bit how query worked. Then as I was really interested in stemming, I searched how to implement a TokenFilter, and basically copied LowerCaser, and added some println to understand how it was called.

TokenFilter internals

The struct we create ourselves is, well, created only one time by ourselves. It's transform function is called on each document we add, and create a new TokenStream. This TokenStream then iterate over each token in the document, modifying or removing the ones it want. For queries, the TokenStream see one token per word, or one token for multiple words if they are quoted. So it could be possible to put a sort of header to each document indicating it's language, and to stem differently based on that. For queries, we would then have to either test each language or find the language of a query. Although it won't be included in the first try, it's good to know. Also, according to the documentation, the QueryParser is bad at running queries which are not properly formatted. This will be fixed soon as there is a pull request trying to add a lenient mode to the parser. But for now we will need to make some sort of adapter, getting a query from a user and transforming it to a Query by our-self, or into a query string that Tantivy will understand every-time.

Integrating to Plume

So now that we have explored a bit how it works, lets start playing with it inside Plume. As a public interface for the module, we will need some kind of builder, something to add, edit and remove documents, and obviously something to do searches. We also want some way to regenerate the whole index via a new plm sub-command.

The model

The first thing to do is to create the module, it's what I've done in this commit. There is nothing really fancy, adding Tantivy to the dependencies, also adding whatlang to extract language from content, and also itertools because it make life easier when doing certain things with iterators. I've made two function to create an index, one only reading an existing one, and the other creating it. Plume should only open en existing index, and plm, once implemented, will be used to create it beforehand. Well actually there is a thing a bit fancy. Date storage. Tantivy allow for int storage, which is obviously what I went to. But I did not store date as seconds since 1970, the famous Unix time, because Tantivy's documentation state that RangeQuery, which we need to search for document created between two dates, iterate over the whole range. Iterating over millions of seconds is definitely slower than iterating over the 30 days of a month. So I decided date would be stored as number of days since a particular point, to be more efficient. I also made the choice to commit to Tantivy's store on every post creation/deletion, which turned out to be quite slow. Tantivy seems to prefers large batch jobs over single document ones.

plm sub-commands

Then in another commit, I added the command search to plm, with two sub-command, one to create a new index, and the other to read all posts from Plume database and copy them in the index. I also added calls to the function I created in the last commit, so that creating a new post, or deleting a blog, reflect accordingly into Tantivy's result. This required me to change the prototype of many functions, but wasn't actually a big work. Rust compiler just kept yelling at me "I don't have enough argument for this function", but once it shutted up, indexing seemed to be working. I also added a commit() function to my Searcher, which would simply call Tantivy's one and removed commit on each edition. That way I could just commit regularly, with a thread waking up half an hour or so. Changing prototypes of function broke tests as I forgot to update theme, but this was quickly fixed.

A new threadpool

So now I needed to call commit() regularly, but our current thread pool only allowed to run task as soon as possible, and could not schedule one. I didn't feel like running a new dedicated thread to auto commiting, so I decided to move to a more advanced threadpool.

First search

Next thing was to finally do some search. At this point it looked like it was working, doing some computation, writing result to disc... but I haven't done a single search yet. So I made a basic front-end with the base minimum, and used Tantivy's QueryParser to make my first queries. And hooray all seems to be working. Even paginated search. At this point this is working, doing advanced queries is a bit of pain, but it work.

Shutdown gracefully

However there is still at least one big problem, Rocket (the web framework we use in Plume) does not support clean shutdown. Tantivy's index writer get a lock on the index directory, and is supposed to release it when it get dropped, but with Rocket State owning it, this drop never happen. Leading to an error every time you start Plume after having killing it, even with ctrl-c. This was fixed in the next commit, where I added a drop_writer function to Searcher. After calling it, any try to write to the index will panic. So this is called on process termination, just before calling exit. There might be a race condition where one could edit a document after the writer got dropped, but before Plume exited, however this would only mean it's post won't be indexed, and as I see no easy way to fix it I will leave it that way.

Road to advanced search

The next commit is quite big. It contain some refactoring to avoid code duplication, some field where changed of type because after reflection, indexing the instance name into Tantivy is easier than indexing the id, and going to an SQL query to map one to the other. I also created my own tokenizer, which is a copy of SimpleTokenizer, but which don't cut token on punctuation, as instance domain contains dot, and user name can contain underscores or other non alphanumeric characters. This commit also contain a query parser, that is useful mainly because it never fail. Which don't mean it's better than Tantivy's one, it's just way less generic, so it can make assumption Tantivy can't. It also convert date range from the string a browser send (YYYY-MM-DD) to our own internal representation. I also modified the front-end to match with the back-end, having a basic search field, and many advanced field hidden if you don't open the spoiler. This is pretty much the final template, it's not exactly beautiful, but it's functional.

Code cleaning

Before merging I decided to split searcher.rs into multiple sub-modules as the code is quite dense. There is not that much lines, but I used macros a lot to reduce duplication, and that does not help with readability. I also added tests to the parser, fixed some issues with it, and made it more intelligent when you search for post from an author of another instance, and that's it.

Conclusion

To conclude I'd say this was very interesting. Tantivy is a very powerful crate, having took a quick look at it's internal, this is also a complex software, but the interface it give is not that hard to use. Most difficulties I encountered were because I decided I wanted to write a query parser matching our needs, but one of Tantivy's example manage to do most I've done (with a smaller schema and using Tantivy build-in QueryParser) in like 70 lines of actual code. Thanks a lot to Fulmicoton for this library, this was a pleasure to discover full text search with it.

Going further

Now if we want to make even better use of Tantivy in Plume here is a short list of things we could do:

Tantivy can extract fragments of documents matching a query, when showing search results, Plume could print this instead of post's subtitle
We don't use stemming yet, which would probably make results more accurate
Tantivy development branch support for Levenshtein distance query and regexp, maybe allowing to use one of those could make sense in some case. For those who are not only interested in the public interface of a library, but also in its internal, I recommend you to read this post. From what I've understood Tantivy use this kind of scheme for it's data structure, as does Lucene the backend behind ElasticSearch