  • 201509.05

    Nginx returns error on file upload

    I love Nginx and have never had a problem with it. Until now.

    Turtl, the private Evernote alternative, allows uploading files securely. However, after switching to a new server on Linode, uploads broke for files over 10K. The server was returning a 404.

    I finally managed to reproduce the problem in cURL, and to my surprise, the requests were getting stopped by Nginx. All other requests were going through fine, and the error only happened when uploading a file of 10240 bytes or more.

    The first thing I thought was that Nginx v1.8.0 had a bug, but nobody on the internet seemed to have this problem. So I installed v1.9.4. Now the server returned a 500 error instead of a 404. Still no answer as to why.

    I finally found it: playing with client_body_buffer_size seemed to change the threshold for which files would trigger the error and which wouldn’t, but ultimately the error was still there. Then I read about how Nginx uses temporary files to store request body data. I checked that folder (in my case /var/lib/nginx/client_body): it was writable by the nginx user, but the parent folder /var/lib/nginx was owned by root:root and set to 0700. I made /var/lib/nginx readable/writable by the nginx user, and everything started working.

    Check your permissions

    So, check your folder permissions. Nginx wasn’t returning any useful errors: first a 404 (which I’m assuming was a bug fixed in a later version), then a 500. It’s important to note that after switching to v1.9.4, the Permission Denied error did show up in the error log, but at that point I had already decided the logs were useless (v1.8.0 had silently ignored the problem).

    Another problem

    This is an edit! Shortly after I applied the above fix, I started getting another error. My backend was getting the requests, but the entire request was being buffered by Nginx before being proxied. This is annoying to me because the backend is async and is made to stream large uploads.

    After some research, I found the fix (I put this in the backend proxy’s location block):

    proxy_request_buffering off;
    

    This tells Nginx to just stream the request to the backend (exactly what I want).

  • 201507.29

    Turtl's new syncing architecture

    For those of you just joining us, I’m working on an app called Turtl, a secure Evernote alternative. Turtl is an open-source note-taking app with client-side encryption which also allows private collaboration. Think of it as a private Evernote with a self-hosted option (sorry, no OCR yet =]).

    Turtl’s version 0.5 (the current version) has syncing, but it was never designed to support offline mode and requires clients to be online. The newest upcoming release supports a fully offline mode (except for a few things like login, password changes, etc). This post will attempt to describe how syncing in the new version of Turtl works.

    Let’s jump right in.

    Client IDs (or the “cid”)

    Each object having a globally unique ID that can be client-generated makes syncing painless. We do this using a few methods, some of which are actually borrowed from MongoDB’s Object ID schema.

    Every client that runs the Turtl app creates and saves a client hash if it doesn’t have one. This hash is a SHA256 hash of some (cryptographically secure) random data (current time + random uuid).
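
    For illustration, here’s a rough sketch of what generating that hash could look like using the browser’s Web Crypto API (Turtl has its own crypto layer, so treat the names and APIs here as an approximation, not the actual code):

    // sketch only: hash the current time + a random uuid and hex-encode it.
    // Turtl's real code uses its own crypto wrappers; this just shows the idea.
    function generate_client_hash() {
        var seed = new Date().getTime() + ':' + crypto.randomUUID();
        return crypto.subtle.digest('SHA-256', new TextEncoder().encode(seed))
            .then(function(buf) {
                // hex-encode the 32-byte digest into 64 hex characters
                return Array.from(new Uint8Array(buf)).map(function(b) {
                    return b.toString(16).padStart(2, '0');
                }).join('');
            });
    }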

    This client hash is then baked into every id of every object created from then on. Turtl uses the composer.js framework (somewhat similar to Backbone) which gives every object a unique ID (“cid”) when created. Turtl replaces Composer’s cid generator with its own that creates IDs like so:

    12 hex chars timestamp | 64 hex chars client hash | 4 hex chars counter
    

    For example, the cid

    014edc2d6580b57a77385cbd40673483b27964658af1204fcf3b7b859adfcb90f8b8955215970012
    

    breaks down as:

     timestamp    client hash                                                      counter
    ------------|----------------------------------------------------------------|--------
    014edc2d6580 b57a77385cbd40673483b27964658af1204fcf3b7b859adfcb90f8b895521597 0012
     |                                    |                                        |
     |- 1438213039488                     |- unique hash                           |- 18
    

    The timestamp is a new Date().getTime() value (with leading 0s to support longer timestamps eventually). The client hash we already went over, and the counter is a value tracked in memory that increments each time a cid is generated. The counter has a max value of 65535 (four hex characters), meaning the only way a client can produce a duplicate cid is by creating more than 65,536 objects in a single millisecond, or over 65 million in one second. We have some devoted users, but even for them, creating 65M notes in a second would be difficult.
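
    As a rough sketch (hypothetical code, not Turtl’s actual generator), putting those three pieces together looks something like this:

    // sketch of a cid generator: timestamp + client hash + wrapping counter
    var cid_counter = 0;

    function generate_cid(client_hash) {
        // 12 hex chars of millisecond timestamp, left-padded with 0s
        var timestamp = new Date().getTime().toString(16);
        while(timestamp.length < 12) timestamp = '0' + timestamp;

        // 4 hex chars of an in-memory counter that wraps at 0xffff
        var counter = (cid_counter++ % 65536).toString(16);
        while(counter.length < 4) counter = '0' + counter;

        return timestamp + client_hash + counter;
    }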

    So, the timestamp, client hash, and counter ensure that each cid created is unique not just to the client, but globally within the app as well (unless two clients create the same client hash somehow, but this is implausible).

    What this means is that we can create objects endlessly in any client, each with a unique cid, use those cids as primary keys in our database, and never have a collision.

    This is important because we can create data in the client without any server intervention or server-generated IDs. A client can be offline for two weeks and then sync all of its changes the next time it connects, without problems and without needing a server to validate its objects’ IDs.

    Using this scheme for generating client-side IDs has not only made offline mode possible, but has greatly simplified the syncing codebase in general. Also, having a timestamp at the beginning of the cid makes it sortable by order of creation, a nice perk.

    Queuing and bulk syncing

    Let’s say you add a note in Turtl. First, the note data is encrypted (serialized). The result of that encryption is shoved into the local DB (IndexedDB) and the encrypted note data is also saved into an outgoing sync table (also IndexedDB). The sync system is alerted “hey, there are outgoing changes in the sync table” and if, after a short period, no more outgoing sync events are triggered, the sync system takes all pending outgoing sync records and sends them to a bulk sync API endpoint (in order).
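
    A stripped-down sketch of that “wait for a quiet moment, then send everything in bulk” behavior (the helper names and endpoint here are hypothetical, not the actual Turtl sync code):

    // sketch: debounce outgoing changes, then send them in one bulk call
    var sync_timer = null;

    function notify_outgoing_change() {
        if(sync_timer) clearTimeout(sync_timer);
        sync_timer = setTimeout(function() {
            // get_outgoing_sync_records / api_post are hypothetical helpers
            get_outgoing_sync_records().then(function(records) {
                return api_post('/sync', records);    // the bulk sync endpoint
            });
        }, 1000);
    }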

    The API processes each one, going down the list of items and updating the changed data. It’s important to note that Turtl doesn’t support deltas! It only passes full objects, and replaces those objects when any one piece has changed.

    For each successful outgoing sync item that the API processes, it returns a success entry in the response with the corresponding local outgoing sync ID (which was passed in). This allows the client to say “this one succeeded, remove it from the outgoing sync table” on a granular basis, automatically retrying entries that failed on the next outgoing sync.

    Here’s an example of a sync sent to the API:

    [
        {id: 3, type: 'note', action: 'add', data: { <encrypted note data> }}
    ]
    

    and a response:

    {
        success: [
            {id: 3, sync_ids: ['5c219', '5c218']}
        ]
    }
    

    We can see that sync item “3” was successfully updated in the API, which allows us to remove that entry from our local outgoing sync table. The API also returns the server-side generated sync IDs for the records it creates in its syncing log. We hold on to these IDs so that when the same changes come back down later as incoming syncs, we can ignore them and avoid double-applying data changes.
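
    In client-side pseudocode (hypothetical names again), handling that response might look like:

    // sketch: clear out succeeded outgoing records and remember the
    // server-side sync IDs so we can skip them when they come back down
    var ignored_sync_ids = {};

    function handle_sync_response(response) {
        response.success.forEach(function(item) {
            remove_outgoing_sync_record(item.id);    // hypothetical helper
            item.sync_ids.forEach(function(sync_id) {
                ignored_sync_ids[sync_id] = true;
            });
        });
    }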

    Why not use deltas?

    Wouldn’t it be better to pass diffs/deltas around than full objects? If two people edit the same note in a shared board at the same time, then the last-write-wins architecture would overwrite data!

    Yes, diffs would be wonderful. However, consider this: at some point, an object would exist as an original plus a set of diffs, which would have to be collapsed back into the main object. Because both the main object and the diffs are client-encrypted, the server has no way of doing this.

    What this means is that the clients would not only have to sync notes/boards/etc but also the diffs for all those objects, and collapse the diffs into the main object then save the full object back to the server.

    To be clear, this is entirely possible. However, I’d much rather get the whole-object syncing working perfectly before adding additional complexity of diff collapsing as well.

    Polling for changes

    Whenever data changes in the API, a log entry is created in the API’s “sync” table, describing what was changed and who it affects. This is also the place where, in the future, we might store diffs/deltas for changes.

    When the client asks for changes, it does so using a sequential ID, saying “hey, get me everything affecting my profile that happened after <last sync id>”.

    The client uses long-polling to check for incoming changes (either to one’s own profile or to shared resources). This means that the API call used holds the connection open until either a) a certain amount of time passes or b) new sync records come in.
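
    A bare-bones version of that client loop might look like this (the endpoint and helpers are hypothetical):

    // sketch: long-poll the API, apply whatever comes back, then poll again
    function poll_for_changes(last_sync_id) {
        // the API holds this request open until new sync records arrive
        // or a timeout passes; '/sync?after=' and api_get are hypothetical
        api_get('/sync?after=' + last_sync_id).then(function(response) {
            (response.records || []).forEach(apply_incoming_sync);
            poll_for_changes(response.last_id || last_sync_id);
        });
    }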

    The API uses RethinkDB’s changefeeds to detect new data by watching the API’s sync table. This means incoming changes show up very fast (usually within a second of being logged in the API). RethinkDB’s changefeeds are terrific and eliminate the need to poll your database endlessly. They also collapse changes over a one-second window, meaning the feed doesn’t fire immediately after a new sync record comes in; it waits a second for more records. This is mainly because syncs happen in bulk, and it’s easier to wait a bit and grab a few of them than to make five API calls.
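
    The subscription on the API side looks roughly like this (shown with the RethinkDB JavaScript driver purely for illustration; it isn’t the actual API code):

    // illustration with the RethinkDB JavaScript driver (not the actual API code).
    // watch the "sync" table, squashing changes that arrive within ~1 second
    var r = require('rethinkdb');

    // "connection" is assumed to be an already-open driver connection
    r.table('sync').changes({squash: 1}).run(connection, function(err, cursor) {
        if(err) throw err;
        cursor.each(function(err, change) {
            // change.new_val is the newly-logged sync record; hand it to any
            // long-poll connections currently waiting (hypothetical helper)
            notify_waiting_clients(change.new_val);
        });
    });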

    For each sync record that comes in, it’s linked against the actual data stored in the corresponding table (so a sync record describing an edited note will pull that note, in its current form, out of the “notes” table). Each sync record is then handed back to the client, in order of occurrence, so it can be applied to the local profile.

    The result is that changes to a local profile are applied to all connected clients within a few seconds. This also works for shared boards, which are included in the sync record searches when polling for changes.

    File handling

    Files are synced separately from everything else. This is mainly because they can’t just be shoved into the incoming/outgoing sync records due to their potential size.

    Instead, the following happens:

    Outgoing syncs (client -> API)

    When a new file is attached to a note and saved, a “file” sync item is created and passed into the outgoing sync queue without the content body. Keep in mind that at this point, the file contents are already safe (in encrypted binary form) in the files table of the local DB. The sync system notices the outgoing file sync record (sans file body) and pulls it aside. Once the normal sync has completed, the sync system adds the file record(s) it found to a file upload queue (after which the outgoing “file” sync record is removed). The upload queue (using Hustle) grabs the encrypted file contents from the local files table and uploads them to the API’s attachment endpoint.
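
    Very roughly, the upload step looks like this (a generic queue consumer sketch; the names and endpoint are made up and this is not Hustle’s actual API):

    // sketch of the upload worker (generic queue consumer, hypothetical names)
    upload_queue.consume(function(job) {
        // the file body is already encrypted and sitting in the local files table
        return get_local_file(job.file_id).then(function(encrypted_body) {
            // hypothetical endpoint for the note attachment upload
            return api_put('/notes/' + job.note_id + '/attachment', encrypted_body);
        });
    });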

    Attaching a file to a note creates a “file” sync record in the API, which alerts clients that there’s a file change on that note they should download.

    It’s important to note that file uploads happen after all other syncs in that bulk request are handled, which means that the note will always exist before the file even starts uploading.

    Encrypted file contents are stored on S3.

    Incoming syncs (API -> client)

    When the client sees an incoming “file” sync come through, much like with outgoing file syncs, it pulls the record aside and adds it to a file download queue instead of processing it normally. The download queue grabs the file via the note attachment API call and, once downloaded, saves it into the local files database table.

    After this is all done, if the note that the file is attached to is in memory (decrypted and in the user’s profile) it is notified of the new file contents and will re-render itself. In the case of an image attachment, a preview is generated and displayed via a Blob URL.
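
    The preview step uses the standard Blob/object-URL browser APIs, roughly like so (variable names here are hypothetical):

    // sketch: turn decrypted image bytes into a previewable object URL
    var blob = new Blob([decrypted_bytes], {type: file_meta.type});
    preview_img.src = URL.createObjectURL(blob);    // hypothetical <img> element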

    What’s not in offline mode?

    All actions work in offline mode, except for a few that require server approval:

    • login (requires checking your auth against the API’s auth database)
    • joining (creating an account)
    • creating a persona (requires a connection to see if the email is already taken)
    • changing your password
    • deleting your account

    What’s next?

    It’s worth mentioning that after v0.6 launches (which will include an Android app), there will be a “Sync” interface in the app that shows you what’s waiting to be synced out, as well as file uploads/downloads that are active/pending.

    For now, you’ll just have to trust that things are working ok in the background while I find the time to build such an interface =].

  • 201111.21

    Composer.js - a new Javascript MVC framework for Mootools

    So my brother Jeff and I are building two Javascript-heavy applications at the moment (heavy as in all-js front-end). We needed a framework that provides loose coupling between the pieces, event/message-based invoking, and maps well to our data structures. A few choices came up, most notably Backbone.js and Spine. These are excellent frameworks. It took a while to wrap my head around the paradigms because I was so used to writing five layers deep of embedded events. Now that I have the hang of it, I can't think of how I ever lived without it. There's just one large problem...these libraries are for jQuery.

    jQuery isn't bad. We've always gravitated towards Mootools though. Mootools is a framework that makes javascript more usable, whereas jQuery is nearly a completely new language in itself written on top of javascript (and mainly for DOM manipulation). Both have their benefits, but we were always good at javascript before the frameworks came along, so something that made that knowledge more useful was an obvious choice for us.

    I'll also say that after spending some time with these frameworks and being sold (I especially liked Backbone.js) I gave jQuery another shot. I ported all of our common libraries to jQuery and I spent a few days getting used to it and learning how to do certain things. I couldn't stand it. The thing that got me most was that there is no distinction between a DOM node and a collection of DOM nodes. Maybe I'm just too used to Moo (4+ years).

    composer

    So we decided to roll our own. Composer.js was born. It merges aspects of Spine and Backbone.js into a Mootools-based MVC framework. It's still in progress, but we're solidifying a lot of the API so developers won't have to worry about switching their code when v1 comes around.

    Read the docs, give it a shot, and let us know if you have any problems or questions.

    Also, yes, we blatantly ripped off Backbone.js in a lot of places. We're pretty open about it, and also pretty open about attributing everything we took. They did some really awesome things. It's not so much that we wanted to do things differently as that we wanted a supported Mootools MVC framework that works like Backbone.
