First, see part I here.
Second, Sun published specJ2004 using the upcoming SJSAS 8.2 PE, which contains a similar Grizzly version than the current one available in GlassFish. I think it is the first ever NIO implementation that got benchmarked I suspect it’s our best score ever, but I didn’t look at all the results.
OK enough marketing, and let see how Grizzly looks like under the hood. The next couple of lines will describe the main components in Grizzly, why they exists and how they can be customized. All the classes decribed in the text are available here
The main entry point is the Pipeline. I don’t know why I’ve picked that name from Tomcat’s Catalina, since everybody confuse Catalina Pipeline with Grizzly Pipeline. I guess I should have picked ThreadPoolWrapper or something along that line, because a Grizzly Pipeline component is really a Thread Pool Wrapper. A Pipeline is reponsible of the execution of a Task implementation. There is currently three Pipeline implementation included in the code base:
(1) LinkedListPipeline: thread pool based on a linked list
(2) ThreadPoolExecutorPipeline: thread pool based on java.util.concurrent.ThreadPoolExecutor
(3) ExecutorServicePipeline: based on java.util.concurrent.Executors
Surprisingly, all benchmarks perform better when (1) is used, hence the default. It seems the park()/unpark() uses in (2)(3) are slower that a simple lock.
The main entry point in Grizzly is the SelectorThread. The SelectorThread is where the NIO selector is/are created. It is possible to use more than one Selector, based on the number of cpu GlassFish is installed on. When processing a request, the SelectorThread always create instances of Task, and pass the instance to the Pipeline. The SelectorThread can pass three type of Task:
AcceptTask to handle NIO event OP_ACCEPT
ReadTask|AsyncReadTask|ReadBlockingTask to handle OP_READ
ProcessorTask to handle the request processing, and OP_WRITE
The SelectorThread is configurable and can create one Pipeline per Task, or share a Pipeline amongs Tasks. The best performance I’ve measured as of now is when OP_ACCEPT is executed on the same thread as the SelectorThread (so outside a Pipeline), OR_READ and OP_WRITE on the same thread using a Pipeline. Note that during that time, I’ve faced that strange bug, which is now fixed in 5.0 ur7 and 6.0. Make sure you are using the right VM version if you plan to play with Grizzly.
The ReadTask is mainly responsible for pre-processing the request bytes. By pre-processing, I means to decide when we think we have enough bytes to start the processing of the request. Different strategy can be used to pre-process the request, and a strategy can be implemented using the StreamAlgorithm interface.
The ReadTask always do the first read on a socketChannel, and then delegate the work to a StreamAlgorithm implementation. Grizzly has three StreamAlgorithm implementation:
(1) StateMachineAlgorithm: this algorithm reads the request bytes, and seek for the HTTP content-length header. The strategy here is to find the content-length and reads bytes until the requests headers and the body is fully read (because with NIO, you can’t predict if you have read all the bytes or not). Once all the bytes are read, the algorithm tells the ReadTask it can pass the processing to the ProcessorTask. The algorithm only supports HTTP 1.1, and GET/POST/HEAD/OPTIONS (but not PUT). But performance is quite impressive, but you must always cache the request in memory, which is bad for PUT or large POST body.
(2) ContentLengthAlgorithm: same as (1), except supports all HTTP method and HTT 0.9/1.0/1.1. This Algorithm is based on Coyote HTTP/11 connector way of parsing the request, and perform very well.
(3) NoParsingAlgorithm: this algorithm doesn’t pre process the requests bytes. The processing of the bytes will be delayed until the ProcessorTask is executed. The strategy here is to assume we have read all the bytes, and if bytes weren’t all read, let the ProcessorTask decide when/how to read the missing bytes. This is achieved by the ByteBufferInputStream class, which internally use a pool of Selectors to register the socketChannel and waits for more bytes.
(3) is the current default startegy.
The ReadTask also manage the life cycle of the SelectionKey (if you aren’t familiar with NIO, consider a SelectionKey == Socket). To avoid holding a thread when processing a persistent connection (keep-alive), like most of the current Java HTTP Server implementation, the ReadTask implement a strategy where threads aren’t used in between request. The strategy used is:
(1) If the request is completed, but keep-alived, release the thread and register the SelectionKey for IO event
(2) if the request isn’t completed, release the thread but keep the ReadTask attached to the SelectionKey, to keep the connection state/history.
(3) if the request is completed and not keep-alived, release the thread and the SelectionKey.
This strategy prevent one thread per request, and enable Grizzly to server more that 10 000 concurrent users with only 30 threads. Note that (2) happens when a StreamAlgorithm fail to properly parse the request (ex: by not predicting properly the end of the stream). By default, (2) is never executed, and shouldn’t. But I’m still exploring strategy, so the functionality is still available.
ReadTask also implement the missing socket.setSoTimeout() functionality which was dropped in J2SE when a socketChannel is non blocking. I don’t know what was the rational behind not supporting this API, but it is pretty bad because every NIO implementation will have to implement that mechanism.
Currently, the implementation reside in KeepAlivePipeline, which internally uses a java.util.concurrent scheduler to simulate socket.setSoTimeout. Performance wise, I have to admit Grizzly perform better when this mechanism is turned off. But since setSoTimeout() is crucial when parsing HTTP requests (to detect DOS, attack, etc.), the implementation is turned on by default. I’m still trying to find a better way to implement it, as I really don’t like the current implementation, event if the Rule architecture looks promising, specialy when application resource allocation (ARA) is used. I will soon discuss what ARA does, but you can take a look at the code here
Next, the ProcessorTask mainly operates on the request bytes, parsing the request line and headers. Once parsed, it pass the request to an adapter, which is configurable (the current adapter is the entry point inside the Servlet Container). The ProcessorTask can execute synchronously the request, or delegate the request processing to an AsyncHandler implementation (I will discuss asynchronous request processing in my next blog, which I hope isn’t in 6 months )
ReadTask and ProcessorTask can be configured with Handler, which can be used to intercept the state of the processing. Current Handler implementation include a new static resource cache, which serve static resources (html, gif, etc.) using NIO MappedByteBuffer, without delegating the call to the Catalina Default Servlet. Only static resources that aren’t protected by a filter or a security constraint are candidate to the FileCache.
Handler are always associated with a StreamAlgorithm, and execute at the ByteBuffer level.
Ouf…I should blog more often, so my blogs are shorter . To recap, here is the current interfaces available in Grizzly that can be customized:
1. Pipeline: Thread Pool Wrapper, entry point for Task
2. StreamAlgorithm: The parsing strategy used.
3. Handler: Manipulating the requests before it get parsed.
4. AsyncHandler: The entry point for implementing asynchronous request processing system.
In my next blog, I would like to cover the OP_WRITE headache we had in Grizzly, specially when executing on a win32 platform. I would also like to introduce the Asynchronous Request Processing hooks available in Grizzly, e.g. the ability to serve a request over a long period of time without forcing the entire state to be kept around (e.g. when a business process is calling out to another one over a slow protocol, or if there is a work flow interruption, like the “manager approval email” case).
That’s it for now. Let me know what you think. I’m very interested about improving the setSoTimeout implementation, and to explore more performant StreamAlgorithm implementation.
_uacct = “UA-3111670-1″;