• 201002.04

    NginX as a caching reverse proxy for PHP

    So I got to thinking. There are some good caching reverse proxies out there, maybe it's time to check one out for beeets. Not that we get a ton of traffic or we really need one, but hey what if we get digged or something? Anyway, the setup now is not really what I call simple. HAproxy sits in front of NginX, which serves static content and sends PHP requests back to PHP-FPM. That's three steps to load a fucking page. Most sites use apache + mod_php (one step)! But I like to tinker, and I like to see requests/second double when I'm running ab on beeets.

    So, I'd like to try something like Varnish (sorry, Squid) but that's adding one more step in between my requests and my content. Sure it would add a great speed boost, but it's another layer of complexity. Plus it's a whole nother service to ramp up on, which is fun but these days my time is limited. I did some research and found what I was looking for.

    NginX has made me cream my pants every time I log onto the server since the day I installed it. It's fast, stable, fast, and amazing. Wow, I love it. Now I read that NginX can cache FastCGI responses based on the caching headers in the response. So I set it up, modified the beeets api to send back some Cache-Control junk, and voilà...a 2800% speed boost on some of the more complicated functions in the API.
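
    To make the "based on caching headers" part concrete, here's a rough Python model of the decision. It's a simplified sketch, not NginX's actual logic: the function name and the 1-second default are mine, and the real thing also looks at X-Accel-Expires, Expires and Set-Cookie.

```python
import re

DEFAULT_TTL = 1  # seconds; stands in for a fallback like `fastcgi_cache_valid any 1s`


def effective_ttl(headers, default_ttl=DEFAULT_TTL):
    """Return how many seconds to cache a response, or None for "don't cache".

    Simplified model: only the Cache-Control header is inspected.
    """
    cache_control = headers.get("Cache-Control", "").lower()
    directives = [d.strip() for d in cache_control.split(",") if d.strip()]

    # Responses marked as uncacheable are never stored.
    if any(d in ("no-cache", "no-store", "private") for d in directives):
        return None

    # An explicit max-age from the backend wins over the configured default.
    for directive in directives:
        match = re.fullmatch(r"max-age=(\d+)", directive)
        if match:
            return int(match.group(1))

    # No usable header: fall back to the default from the config.
    return default_ttl


print(effective_ttl({"Cache-Control": "public, max-age=300"}))  # 300
print(effective_ttl({}))                                        # 1
print(effective_ttl({"Cache-Control": "no-store"}))             # None
```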

    Here's the config I used:

    # in http {}
    fastcgi_cache_path /srv/tmp/cache/fastcgi_cache levels=1:2
                               keys_zone=php:16m
                               inactive=5m max_size=500m;
    # after our normal fastcgi_* stuff in server {}
    fastcgi_cache php;
    fastcgi_cache_key $request_uri$request_body;
    fastcgi_cache_valid any 1s;
    fastcgi_pass_header Set-Cookie;
    fastcgi_buffers 64 4k;

    So we're giving it a 500 MB cache. The fastcgi_cache_valid line says that any response is cached for 1 second, but this gets overridden by the Cache-Control headers sent by PHP. I'm using $request_body in the cache key because in our API, the actual request is sent through like:

    GET /events/tags/1 HTTP/1.1
    Host: ...
    {"page":1,"per_page":10}

    The params are sent through the HTTP body even in a GET. Why? I spent a good amount of time trying to get the API to accept the params through the query string, but decided that adding $request_body to one line in an NginX config was easier than re-working the structure of the API. So far so good.
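
    Here's a small Python sketch of why the body matters in the key. As far as I understand the NginX docs, the cache file name is the MD5 of the key, and levels=1:2 builds the directory names from the tail end of that hash. The function name is made up for illustration; the paths and key parts come from the config above.

```python
import hashlib


def cache_file_path(cache_root, request_uri, request_body, levels=(1, 2)):
    """Approximate where a cached response would land on disk.

    The key mirrors `fastcgi_cache_key $request_uri$request_body`.
    """
    key = request_uri + request_body
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()

    # levels=1:2 -> last hex char is the first directory,
    # the two chars before it are the second directory.
    parts = []
    end = len(digest)
    for width in levels:
        parts.append(digest[end - width:end])
        end -= width

    return "/".join([cache_root.rstrip("/"), *parts, digest])


root = "/srv/tmp/cache/fastcgi_cache"
page_1 = cache_file_path(root, "/events/tags/1", '{"page":1,"per_page":10}')
page_2 = cache_file_path(root, "/events/tags/1", '{"page":2,"per_page":10}')

# Same URI, different body: two separate cache entries.
print(page_1)
print(page_2)
```

    Without $request_body in the key, both requests would hash to the same file and page 2 would get page 1's cached response.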

    That's NginX's FastCGI module acting as a reverse proxy cache. Ideally in our setup, HAproxy would be replaced by a reverse proxy cache like Varnish, and NginX would just stupidly forward requests to PHP like it was earlier today...but I like HAproxy. Having a health-checking load-balancer on every web server affords some interesting failover opportunities.

    Anyway, hope this helps someone. NginX can be a caching reverse proxy. Maybe not the best, but sometimes, just sometimes, simple > faster.

  • 200912.01

    A simple (but long-winded) guide to REST web services

    After all my research on what it means for a service to be RESTful, I think I've finally got a very good understanding. Once you understand a critical mass of information on the subject, something clicks and the first thing that comes into your head is "Oh yeah! That makes sense!"

    It's important to think of a REST web service as a web site. How does a website work?

    • A website works using HTTP. If you need to fetch something on a website, you use the HTTP verb "GET." If you need to change something, you use "POST." A RESTful web service uses other HTTP verbs as well, namely PUT and DELETE, and can also implement OPTIONS to show which methods are appropriate for a resource.
    • A website has resources. A resource can be information, images, flash, etc. These resources can have different representations: HTML, a jpeg, an embedded video. REST is the same way. It is resource-centric. Want a list of users? GET /users. Want an event? GET /events/5. Want to edit that event? PUT /events/5. Every resource has a unique URL to identify it!
    • Resources are not dealt with directly. Instead, representations of resources are used. This can be a bit hard to grasp. What is a user? It's a nebulous object somewhere that I cannot interact with. It is an idea, an entity. A representation is a form of the user resource I can interact with. A representation can be a comma delimited list, JSON, XML...anything the client and server both understand. How do we know what we're interacting with? Media types:
    • As a website will tell you what kind of image you're requesting, a REST service tells you what kind of resource representation you are receiving. This is done using media types. For instance, if I do a GET /events/7, the Content-Type may be "application/vnd.beeets.event+json" which tells us this is a vendor-specific media type (the "vnd") and it's an event in JSON format. You can pass these media types in your Accept headers to specify what type of representation you would like. These media types are documented somewhere so that clients will know exactly what to expect when consuming them.
    • If you request a page that doesn't exist or you aren't authorized to view, a website will tell you. This is done using the status line of the response. A good REST service will utilize HTTP status codes to do the same. 200 OK, 404 Not Found, 500 Internal Server Error, etc. These have already been defined and refined over many, many years by people who have been doing this a lot longer than you (probably)...use them.
    • A website will have links from one page to another. This is one of the main points of a REST service, and is also widely forgotten or misunderstood (it took me a while to figure it out even doing intense research). Resources in a REST service link to each other, letting a client know what resources can be found where, and how they relate to each other. An HTML page has links in it. So does a REST resource. Links can be structured however you like, but some good things to include are the URI of the linked resource, the relationship it has with the current resource, and the media type. This creates what's known as a "loose coupling" between client and server. A client can crawl the server and figure out, only knowing a pre-defined set of media types, what resources are where and how to find them. This principle is known as HATEOAS (or "Hypermedia as the Engine of Application State").
    • REST is stateless. This means that the server does not track any sort of client state. There are no session tokens the client uses to identify itself. There are no cookies set. Every request to the REST service must contain all information needed to make that request. Need to access a restricted resource? Send your authentication info for each request. It's that simple. Isn't it easier to track session? Not really. Maybe it's easier on a small level, but once you start needing to scale, you will wish you'd gone stateless. Using a combination of HTTP basic authentication and API/Secret request signing, you don't have to send over plain text passwords at all. Hell, even throw in a timestamp with each request to minimize replay attacks. You can get as crazy as you'd like with security. Or for those who prefer security over performance, use SSL.
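
    The request signing with a timestamp mentioned in the last point could look something like this in Python. This is a minimal sketch under my own assumptions: the message layout, the SHA-256 HMAC and the 5 minute window are illustrative choices, not a description of how any particular API does it.

```python
import hashlib
import hmac
import time


def sign_request(secret, method, path, body, timestamp):
    """Build a hex HMAC over the parts of the request we want to protect."""
    message = "\n".join([method.upper(), path, str(timestamp), body])
    return hmac.new(secret.encode(), message.encode(), hashlib.sha256).hexdigest()


def verify_request(secret, method, path, body, timestamp, signature,
                   now=None, max_skew=300):
    """Server side: reject stale timestamps, then compare signatures."""
    now = time.time() if now is None else now
    if abs(now - timestamp) > max_skew:
        return False  # too old (or too far in the future): possible replay
    expected = sign_request(secret, method, path, body, timestamp)
    # compare_digest avoids leaking information through timing differences
    return hmac.compare_digest(expected, signature)


# The client signs with its secret; the secret itself never goes over the wire.
body = '{"title":"Paris Hilton naked onstage (yuck)"}'
sig = sign_request("client-secret", "PUT", "/events/6", body, 1259640768)
print(verify_request("client-secret", "PUT", "/events/6", body,
                     1259640768, sig, now=1259640800))  # True
```

    Since every request carries its own signature and timestamp, the server doesn't need to remember anything about the client between requests.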

    Now for some examples. Because I'm currently working on an event application, we'll use that for most of the examples.

    Let's get a list of events from our server:

    GET /events
    Host: api.beeets.com
    Accept: application/vnd.beeets.events+json
    {"page":1,"per_page":10}
    -----------------------------------------
    HTTP/1.1 200 OK
    Date: Tue, 01 Dec 2009 04:12:48 GMT
    Content-Length: 1430
    Content-Type: application/vnd.beeets.events+json
    {
    	"total":81,
    	"events":
    	[
    		{
    			"links":
    			[
    				{
    					"uri":"/events/6",
    					"rel":"/rel/event self edit",
    					"type":"application/vnd.beeets.event"
    				},
    				{
    					"uri":"/locations/121",
    					"rel":"/rel/location",
    					"type":"application/vnd.beeets.location"
    				}
    			],
    			"id":6,
    			"title":"Paris Hilton naked onstage",
    			...
    		},
    		...
    	]
    }

    What do we have? A list of events, with links to the resource representations of those events. Notice we also have links to another resource: the location. We can leave that for now, but let's pull up an event:

    GET /events/6
    Host: api.beeets.com
    Accept: application/vnd.beeets.event+json
    -----------------------------------------
    HTTP/1.1 200 OK
    Date: Tue, 01 Dec 2009 04:12:48 GMT
    Content-Length: 666
    Content-Type: application/vnd.beeets.event+json
    {
    	"links":
    	[
    		{
    			"uri":"/events/6",
    			"rel":"/rel/event self edit",
    			"type":"application/vnd.beeets.event"
    		},
    		{
    			"uri":"/locations/121",
    			"rel":"/rel/location",
    			"type":"application/vnd.beeets.location"
    		}
    	],
    	"id":6,
    	"title":"Paris Hilton naked onstage",
    	"date":"2009-12-05T04:00:00Z"
    }

    Using the link provided in the event listing, we managed to pull up an individual event, which we know how to parse because we know the media type...but wait, what's this? OMG, someone is trying to smear Paris!! She's on at 8:30!!! NOT 8!!! Let's edit...if we do a PUT with new information, we'll be able to save Paris' good name:

    PUT /events/6
    Host: api.beeets.com
    Accept: application/vnd.beeets.event+json
    Authorization: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==
    {"title":"Paris Hilton naked onstage (yuck)","date":"2009-12-05T04:30:00Z"}
    -----------------------------------------
    HTTP/1.1 200 OK
    Date: Tue, 01 Dec 2009 04:12:48 GMT
    Content-Length: 666
    Content-Type: application/vnd.beeets.event+json
    {
    	"links":
    	[
    		{
    			"uri":"/events/6",
    			"rel":"/rel/event self edit",
    			"type":"application/vnd.beeets.event"
    		},
    		{
    			"uri":"/locations/121",
    			"rel":"/rel/location",
    			"type":"application/vnd.beeets.location"
    		}
    	],
    	"id":6,
    	"title":"Paris Hilton naked onstage (yuck)",
    	"date":"2009-12-05T04:30:00Z"
    }

    What have we learned? Given one URL (/events), we have discovered two more (/locations/[id] and /events/[id]). We've also seen the media types in the responses that allow the client to know what kind of resource it's dealing with and how to consume it.
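
    On the client side, following those links can be as dumb as this Python sketch. The find_links helper is something I made up for illustration; the JSON is the event representation from above.

```python
import json


def find_links(representation, rel):
    """Return the URIs of all links whose space-separated rel contains `rel`."""
    return [
        link["uri"]
        for link in representation.get("links", [])
        if rel in link.get("rel", "").split()
    ]


event = json.loads("""
{
    "links": [
        {"uri": "/events/6", "rel": "/rel/event self edit",
         "type": "application/vnd.beeets.event"},
        {"uri": "/locations/121", "rel": "/rel/location",
         "type": "application/vnd.beeets.location"}
    ],
    "id": 6,
    "title": "Paris Hilton naked onstage",
    "date": "2009-12-05T04:00:00Z"
}
""")

print(find_links(event, "/rel/location"))  # ['/locations/121']
print(find_links(event, "edit"))           # ['/events/6']
```

    The client never builds "/locations/" + id by hand. It only knows the rel names and media types, so the server is free to move things around.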

    Hopefully this pounds two really important points in: media types and HATEOAS. Without them, it's not REST. You can't just pass application/xml or application/json for every response. Sure, maybe the client can decode it, but they don't know what it is, and without linking to other resources, they don't know how to find anything...unless you want to document everything and never change your service.
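
    As a rough illustration of what a client gets out of a specific media type, here's a Python sketch that picks one apart. The function and the keys of the returned dict are my own naming, nothing standard.

```python
def parse_media_type(media_type):
    """Split a media type into its type, subtype name and "+suffix" format."""
    # Drop any parameters such as "; charset=utf-8"
    main = media_type.split(";", 1)[0].strip().lower()
    type_, _, subtype = main.partition("/")
    name, _, suffix = subtype.partition("+")
    return {
        "type": type_,
        "name": name,
        "vendor": name.startswith("vnd."),
        "format": suffix or None,
    }


print(parse_media_type("application/vnd.beeets.event+json"))
# {'type': 'application', 'name': 'vnd.beeets.event', 'vendor': True, 'format': 'json'}

print(parse_media_type("application/json"))
# {'type': 'application', 'name': 'json', 'vendor': False, 'format': None}
```

    With the vendor type, the client knows it's holding an event and that it decodes as JSON. With plain application/json, all it knows is how to decode it.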

    Some other tips/points:

    • Give yourself a few initial entry points to your REST service. You should be able to discover all of the resources in it just by crawling. If you can't, you haven't done HATEOAS correctly. This is a lot harder than it sounds, but it's more than useful later on. Think of your REST service like a website with good navigation.
    • Remember to implement the OPTIONS verb for your resources. It will tell the client what verbs can be used on what resources. With some decent routing built into your application, this should be a cakewalk.
    • As mentioned, you can use HTTP basic authentication for your requests. If the client is anything but a web browser, you won't have to serve up an ugly popup login box, you can just do all that shit transparently. If you don't want to send a cleartext password (please don't!) you can salt the password on the client side and send it over. Hash the password again with the client's secret for added security. Crackers will be amazed at your 1337 computer hacking skillz. You can then verify the hashed salted value on the server side. Add client-secret request signing with a timestamp for uber security.
    • Read a lot more info on REST. It seems that SO many "RESTful" services out there are half-baked and made by people who researched the topic for half a day. Some good ones to take points from are the Sun Cloud API and the Netflix API. Notice the documentation of media types and LACK of documentation on every single URL you can request. This is that loose-coupling stuff I was talking about.
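
    For the OPTIONS tip above, here's a Python sketch of how a routing table can answer the question directly. The routes and the verbs on them are hypothetical, just enough to show the idea.

```python
# Hypothetical routing table: URL pattern -> verbs the application handles.
ROUTES = {
    "/events": {"GET", "POST"},
    "/events/{id}": {"GET", "PUT", "DELETE"},
}


def allow_header(route_pattern, routes=ROUTES):
    """Build the value of the Allow header for an OPTIONS response."""
    # OPTIONS itself is always allowed on a route that answers OPTIONS.
    methods = set(routes[route_pattern]) | {"OPTIONS"}
    return ", ".join(sorted(methods))


print(allow_header("/events/{id}"))  # DELETE, GET, OPTIONS, PUT
print(allow_header("/events"))       # GET, OPTIONS, POST
```

    If your router already knows which verbs each resource handles, the OPTIONS response is just that set written into a header.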

    That's it for now! I wrote this as a culmination of knowledge for the last week or so of research I've done...please let me know if any information is missing or incorrect and I can make updates. Hope it was helpful!
