What are the problems with the library? Prometheus is designed for simplicity and speed. In general, a developer should never think about whether to add a new metric or not: it should be safe to add it (unless it has labels with very high cardinality). Unfortunately, the Haskell library does not offer that. In the past I've experienced severe speed degradation with metrics enabled, and that led to disabling them for an extended period. Today we have carefully analysed the speed and are slowly adding metrics one by one, measuring performance and improving the library where needed. Besides, prometheus itself is not ideal for some kinds of metrics, but more on that later.
When making changes, I've followed the following rules:
To solve that problem, I've introduced variants of counter and gauge for integral values. It's a pity that you have to use floating points in a place where you only have integer values. The Prometheus documentation says it's OK, because you won't have rounding errors up to very high amounts, and this way you can have the same structures for both integers and floats. It's a valid point; still, it's possible to experiment. If we take integers, then we can use the atomic instructions fetchAddIntArray#, fetchSubIntArray#, atomicWriteIntArray# (or the ones from the atomic-primops package). The current implementation also uses CAS instructions, but it does so on boxed values, so for each write to the storage you allocate additional structures; with the integer approach you should not. In basic single-threaded experiments, IntCounter and IntGauge give a 10-20% speedup, and that implementation does not introduce additional allocations, so why not get the benefit for free? Interestingly, the Go library claims the same performance as the Int* approach; it's done by packing floats into ints and performing CAS operations on the ints. It looks a bit weird, but it works.
But why does such a small speedup matter? The evaluation time of the measured activity is usually a few orders of magnitude higher, so in any case an additional 20ns per operation will be just statistical noise. That is true. But the situation starts to change when you look at the more advanced metric types: for example, a histogram is structured as a set of counter values acting as buckets, so the counter cost is multiplied and you almost immediately get to the next order of magnitude, as the sketch below illustrates.
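For illustration, here is a sketch of such a bucketed structure built from the same atomic counters (all names are mine; a real Prometheus histogram also tracks a sum and a count, which I omit):

import Data.Atomics.Counter (AtomicCounter, incrCounter, newCounter)
import Data.Maybe (fromMaybe)
import qualified Data.Vector as V

-- Buckets hold per-bucket (not cumulative) counts; the scraper can fold
-- them into the cumulative form that Prometheus expects.
data IntHistogram = IntHistogram
  { upperBounds :: V.Vector Double -- sorted upper bounds of the buckets
  , buckets :: V.Vector AtomicCounter -- one extra counter for +Inf
  }

newIntHistogram :: [Double] -> IO IntHistogram
newIntHistogram bounds =
  IntHistogram (V.fromList bounds)
    <$> V.replicateM (length bounds + 1) (newCounter 0)

-- One observation is a bucket search plus a single fetch-and-add.
observe :: Double -> IntHistogram -> IO ()
observe x h = () <$ incrCounter 1 (buckets h V.! i)
  where
    -- index of the first bucket that fits, or the +Inf bucket
    i = fromMaybe (V.length (upperBounds h))
          (V.findIndex (>= x) (upperBounds h))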
But in general, those structures were just a first step and an experiment.
The next changes are more engaging; let's take a look at a real case. We use a concurrent logging mechanism: each thread writes logs to a TBQueue and continues execution, and a background thread reads records from the queue and dumps them to the handle. The queue is bounded, so if too many entries are generated, threads block on write. We have two parameters to configure:
But how do we configure those parameters, and how do we check that everything is going well with the process? The usual solution is to add the following metrics:
Based on those values, you may configure several graphs and alerts: the rate of messages and the number of records in the queue. For the first value you get an excellent plot, but for the difference you get a line of zeros unless something goes off. It happens because the metrics are scraped infrequently (in our case once a minute), and all the values normalise in that time. So with such metrics you get alerting, but you don't have insight into how exactly the system operates, and you can't use it to tune parameters. But what to do? One answer is to scrape more often, but that would incur high costs on the measurement-tracking system and the applications.
To solve that, I tried the following approach; it's highly experimental but already works in our production. We measure the number of in-flight messages as a gauge, but in addition to the current value, that gauge stores the maximum value over a window. The question is how to define a window, and I don't want to track time. Hence, the window is the time between two scrapes. It's a semi-optimal solution, as now the observer affects the system, and an analytical system may no longer get some values. But this approach works, and instead of a line of zeros you may get a beautiful plot:
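A minimal sketch of such a gauge, with my own names (the real implementation differs): writers update the current value and the running maximum, and the scraper atomically reads the maximum while resetting it to the current value, which makes the window exactly the interval between two scrapes.

import Data.IORef

-- (current value, maximum seen since the last scrape)
newtype MaxGauge = MaxGauge (IORef (Int, Int))

newMaxGauge :: IO MaxGauge
newMaxGauge = MaxGauge <$> newIORef (0, 0)

setMaxGauge :: MaxGauge -> Int -> IO ()
setMaxGauge (MaxGauge ref) x =
  atomicModifyIORef' ref $ \(_, m) -> ((x, max m x), ())

-- Called by the scraper: report the window maximum and start a new window.
drainMax :: MaxGauge -> IO Int
drainMax (MaxGauge ref) =
  atomicModifyIORef' ref $ \(cur, m) -> ((cur, cur), m)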
There are some problems with that approach: it's not evident how to generalise this solution so that it handles other functions, not just the maximum, nor whether it will be easy to handle floating points as well.
But that is not all: I'm also using the prometheus-client-ghc package; some time ago I had a post about it. The problem with that package is that it reads metrics only when asked by the scraper, so as a result it has only totals and information about the last GC. So you can calculate totals, averages, and otherwise random data. That's not good. To solve that, I've completely rewritten the package; now it works in the following way:
All of that functionality is battle-tested on our services and behaves quite well.
It's still not ideal: it handles only two GC generations instead of any number, and it does not calculate histograms, only maximum values.
For the library, I have the following plans: introduce all the experimental metrics that my project needs, like the window gauge (at this point, I see a use for a floating-window gauge). After that, I'd like to go to the core library and apply the same approach to histograms, and later to the way labels are managed. Then I'm going to experiment with the way the data is exported; I expect that it's very suboptimal and we can win a lot with a slightly different approach, but I don't have evidence yet. After that, I hope to add HDR histogram support, as the current approach renders histograms unusable for many cases, but that is a topic for another post.
Why am I using a separate repo and have not opened any PRs? The main reason is that I plan to change almost the entire implementation, and I'm not sure the current maintainers will be happy about all the changes. Unfortunately, I need all those changes yesterday, or in the worst case tomorrow, in production, so I've decided to work without cooperating with upstream and save myself some time. Maybe I'm wrong, but I'm open to suggestions.
There are several sites (physical places, usually schools) where people can take the contest, and for each of them access can be allowed only for a restricted set of networks.
So we needed a built-in IP filtering app that, for a given contest, can check whether a given user has permission to access it. After a competition has finished, its users may access their results from any IP. And there can be concurrent restricted and unrestricted contests running at the same time.
There are a few additional constraints:
We need the last constraint as additional protection from DDoS attacks: we don't do any work if the IP is not allowed, and we do not consume system resources. Actually, at first the idea was to use IP filtering internally, but the feature was so helpful for organizers that they decided to use it for the main events.
Leaving technical details aside, we may assume that for each user we can map their login to a contest id and a site id without access to a database. So basically we need to write a function:
check :: ContestId -> SchoolId -> IP4 -> Bool
that will check whether a user has access. Note the missing IO or any other context here.
The simplest solution is to have a HashMap (ContestId, SchoolId) [Net4Addr] or a HashMap ContestId (HashMap SchoolId [Net4Addr]). The latter allows a fast path for the case where the contest is not filtered.
Is there any problem with this solution? There is. The hashmap structure is a very "branchy" tree, and if it's quite big, the GC will have a hard time traversing it. It may not be a big problem: if the tree changes rarely, it goes to the older generation and affects major GCs only. And in one project I did have experience with keeping large tries (from half a gigabyte to several gigabytes) in memory. However, it may still negatively affect the performance of a service, and we want a better story, especially if it's cheap.
What do other languages do in this case? There are several options: straightforward use of some in-memory DB, or an external cache like redis; both provide much more functionality than needed for our use case. Another option is to use off-heap data structures; in that case the data structure does not affect the GC. This is possible in Haskell as well, and it does solve the problem. But such a solution may be complicated, and we lose the ability to use first-class language features.
We want something better. And there is a solution: compact regions. The interested reader may check the original paper. A compact region is a region that contains a Haskell structure stored in contiguous blocks of memory; the structure doesn't have any references outside of the region. Because of this, it can be seen as a single object by the garbage collector, and it doesn't affect the GC. The programmer may still access the stored value and use any Haskell features when working with it.
The simplest version may look like this:
mkCheck
  :: IO (HashMap ContestId (HashMap SchoolId (Vector Net4Addr)))
  -> IO (Handle, UpdateHandle)
mkCheck mkCache = do
  ref <- newIORef =<< compact =<< mkCache {- 1 -}
  let hdl = Handle
        { lookupSchool = \cid key -> runExceptT $ do
            storage <- liftIO $ getCompact <$> readIORef ref {- 2 -}
            c_storage <- HM.lookup cid storage ?! NoRestriction {- 3 -}
            V.toList <$> HM.lookup key c_storage ?! Missing {- 4 -}
        }
  pure ( hdl
       , UpdateHandle $ mkCache >>= compact >>= writeIORef ref ) {- 5 -}
Here we create a cache with an update function. We take a cache population function as a parameter. Then we create a cache and build a compact region out of it ({- 1 -}). We store the region in an IORef, a basic mutable variable with atomic CAS updates. When we read the value ({- 2 -}), we get it out of the IORef and out of the compact region, and we can work with it as with any other Haskell value (steps {- 3, 4 -}). On line {- 5 -} we return an update function that builds a new value of the cache when called.
So basically the only lines added to the naive algorithm are 1, 2 and 5; the rest of the algorithm remains unchanged.
Note: there may be other architectural choices for how to provide access to the API and to its updates, but I doubt they are very solution-dependent, so I'd like to avoid an API discussion.
So now let's discuss this solution. During the first tests on data gathered from previous contests, the initial hashmap had a size of 177696 bytes (reported by ghc-datasize). That is a tiny number, but those numbers were for single contests, and we know it will grow. After compacting it was only 98304 bytes (reported by the compactSize function). Isn't that quite nice, especially when we get it for free? The rest of the algorithm remains unchanged: the only differences are a call to the compact function during the update and getCompact at the beginning of the lookup.
Are there any other costs? Yes, there are: during an update we rebuild the whole structure even when we could do it only partially. There are more elaborate solutions that avoid this problem, but we decided to run the next event using the current one: we just set the update function to run once per 10 minutes. And everything went fine, except that 10 minutes was too long to wait after applying changes.
After examining the feedback and the results of the contest, we decided to decrease the update time to one minute and make the cache structure more complex:
After all those changes, it was not feasible to build a new structure every minute, especially since it's known that most of the time the structure does not change. Compact regions provide a way to add a structure to an existing region:
compactAdd :: Compact b -> a -> IO (Compact a)
This function takes a compact region holding a value of type b, adds a value of type a, and returns a reference to the value of type a in the region. So if we want to update a region, we can write:
compactAdd b (update (getCompact b))
If a structure update keeps the old parts of the structure untouched as much as possible, then we will store only the updated parts. And for most persistent immutable structures (e.g. HashMap), we have that already.
But if we go this way, we will never free old parts of the structure that are no longer used, and we introduce a memory leak. To avoid that problem we need a strategy for when to rebuild the structure from scratch. So we will call compactAdd often and sometimes call compact. I've decided to keep the following policy: "rebuild the structure once its size grows to more than twice what it was when the initial version was built".
c <- compact =<< makeCache
ref <- newIORef c
old_size <- newIORef =<< compactSize c {-1-}
let update = do
      prev_compact <- readIORef ref
      current_size <- compactSize prev_compact
      prev_size <- readIORef old_size
      let inplace = prev_size * 2 > current_size {-2-}
      ... new_cache
      if inplace
        then compactAdd prev_compact new_cache >>= writeIORef ref {-3-}
        else do
          z <- compact new_cache {-4-}
          new_size <- compactSize z {-5-}
          writeIORef old_size new_size
          setGauge metric_size $ realToFrac new_size {-6-}
          writeIORef ref z
On line {-1-} we store the initial size in addition to the reference to the cache itself. On line {-2-} we apply our strategy and decide whether we want to update in place or build a new compact region. On line {-3-} we store an update to the structure in the current region. On lines {-4,5-} we rebuild a new cache, and on line {-6-} we store the size in the metrics, so we can check whether everything goes well.
But that's not everything: we need to check how updates take place. Assume you have a HashMap and you update a value for a given key with the id function. You'll get an equal value, but from the low-level perspective you'll get a new structure, and that matters when you store the value into the compact region. So we track whether there were updates, and if there were none, we explicitly return the same structure and do not update the reference at all.
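A sketch of that check, with hypothetical names and the ghc-compact interface: the refresh function reports whether anything changed, and on "no change" we leave the region completely untouched.

import Data.IORef
import GHC.Compact (Compact, compactAdd, getCompact)

refresh :: IORef (Compact cache) -> (cache -> Maybe cache) -> IO ()
refresh ref f = do
  region <- readIORef ref
  case f (getCompact region) of
    Nothing -> pure () -- nothing changed: don't grow the region
    Just new -> compactAdd region new >>= writeIORef ref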
With all the changes, the structure works quite nicely. At some point I'd like to turn it into a library; unfortunately, I'm not sure I know how to separate it from the project-specific parts.
First, we define the options we need:
Besides, we want to be able to set up sound strategies for cache eviction, but we will discuss that before providing our solution.
The simplest option is to take some LRU cache. This approach keeps memory bounded by allowing only a fixed number of items in the cache, and it increases the hit rate by evicting the least recently used items first.
The best Haskell package for an LRU cache I've found so far is lrucaching. This package is based on psqueues, which provides the fastest immutable priority queues in the Haskell ecosystem. Here is how the pseudo-code may look:
result <- atomicModifyIORef' ref $ \cache -> {-1-}
  case lookup key cache of {-2-}
    Nothing -> (cache, Left key)
    Just (value, cache') -> (cache', Right value)
case result of
  Left key' -> do
    value <- performQuery key'
    atomicModifyIORef' ref $ \cache ->
      (insert key' value cache, ())
    pure value
  Right value -> pure value
Here we first check whether the value is in the cache. We use the atomicModifyIORef' call, a CAS operation that is quite fast, though we may run several retries under contention. Then we perform a query and save the result of the computation in the cache if needed, or just return the cached result otherwise.
The solution is nice and simple, and we use it in several places in the codebase. However, it has a problem: if many threads come for the same key at once, none of the requests finds the key, and each executes a query to the database. So this approach does not save us from the initial burst of requests, which is the most dangerous one for our service. We want to solve the problem of such bursts: to increase the chances that if two requests come for the same key simultaneously, only one request to the database is made.
The simplest but unacceptable solution is to introduce a critical section, so only one request runs at a time:
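A minimal sketch of that coarse approach (all names here are hypothetical): a single global MVar serialises every request, which is exactly why it's too blunt.

import Control.Concurrent.MVar (MVar, withMVar)
import Data.IORef
import Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map

requestSerialised
  :: Ord key
  => MVar () -- the single global lock
  -> IORef (Map key result) -- the cache itself
  -> key
  -> (key -> IO result)
  -> IO result
requestSerialised lock ref key f = withMVar lock $ \_ -> do
  cache <- readIORef ref
  case Map.lookup key cache of
    Just v -> pure v -- cache hit
    Nothing -> do
      v <- f key -- every other request waits here, even for unrelated keys
      writeIORef ref (Map.insert key v cache)
      pure v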
This solution is too coarse; we still want requests for different keys to run simultaneously. We could achieve that by keeping a lock per key in the cache. Still, if we go this way, the solution exceeds the complexity budget quite fast: you'd need an STM solution with explicit locking and careful exception handling. You can try to implement that yourself.
The interesting thing is that the GHC runtime system already provides tooling that is enough to build such a cache without an explicit lock: when a thunk is being evaluated, all other threads accessing that thunk block on the evaluation and automatically get the result once it's evaluated. Now let's check our solution.
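A tiny self-contained illustration of that runtime behaviour (with GHC's default lazy blackholing a little duplicated work is possible, but blocking is the typical outcome): both threads demand the same thunk, one evaluates it, the other blocks until the result is ready, and the sum is computed once.

import Control.Concurrent (forkIO, threadDelay)
import Control.Exception (evaluate)

main :: IO ()
main = do
  let shared = sum [1 .. 50000000 :: Int] -- one shared, expensive thunk
  _ <- forkIO (evaluate shared >>= print) -- both threads force the thunk
  evaluate shared >>= print -- whoever comes second blocks until it's done
  threadDelay 100000 -- give the forked thread time to print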
Before describing the solution, let's talk a bit more about cache eviction. In the early stages of our system we decided that cached values may be updated, but it's OK to keep a value in the cache for a limited time. So an LRU cache doesn't work here, as the naive variant does not provide such a guarantee: a value may live in the cache forever. Instead, we keep an unlimited number of values but remove the values that are too old. We still use the psqueues package.
So the cache itself provides the following interface:
data Handle key result = Handle
  { requestOrInternal
      :: POSIXTime -- ^ current time
      -> key -- ^ key
      -> (key -> IO result) -- ^ function to get the value
      -> IO result
  , ...
  }
The meaning is the following: we provide the current time, a key, and a function that generates a value in case the key is not found, and this function returns a result (or throws an exception). A Handle provides us with a way to change the actual implementation without changing the interface or the code that uses it; you may read more about that approach in the following posts: 1, 2.
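A hypothetical usage example (UserId, User, and fetchUserFromDb are invented for illustration): fetch a value through the cache, hitting the database only on a miss.

import Data.Time.Clock.POSIX (getPOSIXTime)

getUser :: Handle UserId User -> UserId -> IO User
getUser h uid = do
  now <- getPOSIXTime
  requestOrInternal h now uid fetchUserFromDb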
Now, to the actual implementation. Let's introduce it line by line:
new :: IO (Handle a b)
new = do
  ref <- newIORef PSQ.empty -- Create a new priority queue.
  pure $ Handle
    { requestOrInternal = \current_time key f -> mdo
        ...
mdo provides recursive do notation (the RecursiveDo extension); it allows us to refer to values that we will only get later in the block.
Now we need to read the cache, to see if there is a value there:
m_result <-
  atomicModifyIORef ref
    $ swap
    . PSQ.alter
        (\case
          Just (p, ~v)
            | p >= current_time ^-^ configLongestAge -> {- 1 -}
                (Just v, Just (current_time, v))
          _ -> (Nothing, Just (current_time, Lazy eresult)) {- 2 -}
        )
        key
We update the value under the key and get a result using the following rules: in case {- 1 -}, if there is a value and it has lived no longer than the maximum allowed age, we keep the value untouched and return it; otherwise, in case {- 2 -}, we return nothing from the cache and store the result of our future call in the cache (!). So basically we have stored a result that we don't even have at the moment, as we have not run the query yet. It may sound mind-blowing, but it's perfectly fine in a lazy language.
Note the lazy match on the value (~v) on line {- 1 -}; it means that we don't try to inspect the value and return whatever is there. It seems this protection is not required, but it's better to be safe than sorry.
Another thing is that we wrap our result in a Lazy data structure:
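A definition consistent with its use below is a plain data wrapper with a lazy field. Note that a newtype would not do here: forcing a newtype wrapper forces the wrapped value, while forcing a data constructor leaves its field as a thunk.

data Lazy a = Lazy { getLazy :: a }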
This way, the WHNF of the value doesn't evaluate the result itself; otherwise we'd get a <loop> when storing the result (as we don't have the result yet, and our thread would need the result at hand to finish the alter action).
Now we can analyze the result and perform the query if needed (skipping some logging):
eresult <- try $ maybe
  (f key) {- 1 -}
  (\r ->
    evaluate (getLazy r) >>= \case {- 2 -}
      Left s -> throwIO s
      Right x -> pure x
  )
  m_result
We analyze the result: as you remember, in case we have Nothing there, we should perform our request, and we do that on line {- 1 -}. The result of the request is bound to the eresult name (this is exactly the value we have already put in the cache), and it holds either an exception or the result. Otherwise, there is already a result (or a thread working on generating it). We need to force it (evaluate (getLazy r)). In case the value is already evaluated, we get the result immediately; otherwise, we block on the evaluation, and the Haskell runtime handles it. We have two options afterwards: either the query resulted in an exception, and we rethrow it, or we get the result.
Now we need to do some cleanup: we do not want to keep an exception result in the cache, so we clean the cache in this case:
result <- case eresult of
  Left (s :: SomeException) -> do
    atomicModifyIORef' ref $ \v -> (PSQ.delete key v, ())
    throwIO s
  Right x -> pure x
We must keep the exceptional value in the cache at first: this way, all the threads that made the same request while our thread was performing it also raise the exception.
The last step is to try to clean the oldest value. There may be better cleaning strategies, but this one is straightforward, and it works:
atomicModifyIORef' ref $ swap . PSQ.alterMin
  (\case
    Nothing -> ((), Nothing)
    Just (kk, p, v)
      | p < current_time ^-^ configLongestAge -> ((), Nothing)
      | otherwise -> ((), Just (kk, p, v))
  )
pure result
This solution worked just fine for 1.5 years under heavy usage, and we have experienced 2 bugs in it:
1. a missing Lazy wrapper;
2. storing Lazy result in the cache instead of Lazy (Either SomeException result). As a result, in case of an exception in f, nobody could populate the value in the cache, and other threads waited forever for the update. This unfortunate event hit 2 users :(.
TLDR: we want to make our servant service reply 404 whenever we see a particular query part in some endpoint subtree. If you are not interested in why I need that, you can proceed to the solution directly.
I'm working on building a few web services. One of them runs contests. In such contests, lots of people come at once to solve some tasks, so we get a severe load in a bounded amount of time (it's not entirely true, as contests are running all the time, but most of them are quite small). However, we have a number of significant events with up to 20k unique visitors in just 3 hours, generating a load a bit higher than 500 rps. This is not a high value, but the main problem is that any issue with availability or responsiveness hurts our reputation a lot. Additionally, in the past we have experienced DDoS attacks during such events. Currently our services can handle, in the worst case, a load at least three times higher than that; but still, we are interested in reducing the surface for possible attacks. Even if we can handle many requests, the network bandwidth in the cluster is not very wide; as a result, it's possible to cause a denial of service by requesting many large files.
An external CDN service (pic. 1) is just a 3rd-party service that provides an API for uploading and removing files. Such a service takes over all responsibilities for distributing data and providing the required quality of service. All you need to do in your user-facing service is to give out links to this third-party service. The cost is that you need to upload and control files on that service explicitly.
A transparent CDN service (pic. 2) is a service that acts as a distributed proxy for your service. In addition to proxying requests, it stores and propagates files on its nodes based on the response headers it sees. With the help of those you can control all the data on the CDN. Besides, such a service often provides a firewall and anti-DDoS services.
We chose a transparent CDN service because our patterns of working with contest files are quite intricate: until frozen, files are not expected to be cached and may mutate; once frozen, a file can never mutate and should be persisted. Additionally, we get DDoS protection and a firewall for free.
But just plugging a transparent CDN service in front of yours will not work. A CDN is very simple: it does not and cannot analyse the actual traffic. So if a malicious user asks for https://our.service/content/big_image and then https://our.service/content/big_image?foobar, both requests will pass through the CDN and will need to be served by our service. This way an attacker can generate an infinite number of URLs and attack the service no matter whether there is a CDN or not.
But what to do? One option is to forbid access to content if there is a query part in the URL. It looks like a sane option, as "valid" clients never access content with query parameters added. For some parts of the service this is done by NGINX rules; but for other parts, the service itself is responsible for generating contents and setting the proper headers, and we want it to be able to reply with 404 to such queries.
We use servant as a framework to write our server. It allows writing a declarative description of the service structure at the type level. More details may be found in the servant tutorial, or the very basic howto.
We want to restrict query parameters on some part of the endpoint tree. It may look like this:
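For example, the restricted API type might be declared like this (the endpoint names here are hypothetical):

type Api =
       "api" :> ApiEndpoints
  :<|> RestrictQueryParams :> "content" :> ContentEndpoint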
This way, ContentEndpoint doesn't know that it was restricted and can implement its logic without knowing how it is called. This is a crucial point that provides the necessary abstraction level.
We are going to implement RestrictQueryParams now. To achieve this we need to introduce a new servant combinator: something like Capture, QueryParam, or the others from the Servant.API hierarchy. This machinery is not particularly well documented, but there are enough examples in servant itself.
The first thing we need is to introduce a type for the combinator:
-- | Forbid passing any query parameter.
-- If the parameter is given, then we throw 404
data RestrictQueryParams
  deriving Typeable
Note that it doesn't have any data constructors: this type may live at the type level only.
The next step is implementing a HasServer instance. This instance describes how we parse a request and how we work with a response.
instance
  (HasServer api context) {-1-}
  => HasServer (RestrictQueryParams :> api) context where {-2-}
  type ServerT (RestrictQueryParams :> api) m = ServerT api m {-3-}
  hoistServerWithContext = undefined
  route = undefined
The instance is very straightforward: it tells that the internal server ({-1-}) wrapped in RestrictQueryParams ({-2-}) is also a server. The ServerT type family ({-3-}) gives the resulting server type; in our case it's the same as for the internal server, since RestrictQueryParams doesn't change it. hoistServerWithContext tells how to change the underlying monad if needed; you can skip it and see how it's implemented in the other combinators:
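For a pass-through combinator like this one it simply delegates to the inner API; a sketch consistent with those combinators (not necessarily the exact code from the post):

hoistServerWithContext _ pc nt s =
  hoistServerWithContext (Proxy :: Proxy api) pc nt s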
And last but not least is the route method, which describes the routing itself.
route Proxy context subserver =
    route (Proxy :: Proxy api) context $ add subserver
  where
    add Delayed{..} =
      Delayed
        { paramsD = withRequest check *> paramsD
        , ..
        }

    check :: Request -> DelayedIO ()
    check req
      | not $ B.null $ rawQueryString req =
          delayedFailFatal err404
      | otherwise = pure ()
Here we add a parameter check to the internal server's checks (add subserver). It inspects the Wai.Request (withRequest check) in addition to the checks added by the internal server (*> paramsD), keeping the other checks untouched. In the check function we look at rawQueryString and reply with a 404 if it's not empty. A description of the request checks is given in the docs.
Instances for client and swagger are trivial; they pass all the work to the internal API.
instance (HasSwagger api) => HasSwagger (RestrictQueryParams :> api) where
  toSwagger _ = toSwagger (Proxy :: Proxy api)
instance HasClient m api => HasClient m (RestrictQueryParams :> api) where
  type Client m (RestrictQueryParams :> api) = Client m api
  clientWithRoute pm Proxy req = clientWithRoute pm (Proxy :: Proxy api) req
  hoistClientMonad pm _ f cl = hoistClientMonad pm (Proxy :: Proxy api) f cl
In this form it has landed in our codebase: it just solves the problem and can be reused in other projects.
But there is a problem. The sub-server doesn't know that it runs in a restricted context, but the server itself should know whether this restriction can be safely applied: it can't be used if the server uses query parameters. So we want to prove that we don't have such a problem. This can be achieved using custom type errors; see more in the docs on the GHC wiki.
To use them we need to introduce an additional type family. You may think of a type family as a function on types that can perform pattern matching on types and return some result. In our case this function should either return an error or the same type. We need to traverse the servant endpoint structure and return a type error when required.
type family CheckNoParams e where
  CheckNoParams (QueryParam sym x :> y) =
    TypeError ('Text "can't use QueryParam under RestrictQueryParams")
  CheckNoParams (f :> g) = f :> (CheckNoParams g)
  CheckNoParams (f :<|> g) = CheckNoParams f :<|> CheckNoParams g
  CheckNoParams a = a
With this code we return a type error as soon as we see a QueryParam (the first equation; in a closed type family the equations are tried in order, so this case must come before the generic f :> g one), recursively go through all the endpoints, and return the type unchanged otherwise.
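One hypothetical way to wire the family in is to apply it at the point where the restriction is introduced, so that an endpoint subtree only compiles when it is parameter-free:

type SafeRestrict api = RestrictQueryParams :> CheckNoParams api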
If you use RestrictQueryParams on an endpoint that uses a query parameter, you'll get the following error message:
I think after this post it will land in the codebase as well.
But it's possible to go even further and allow only the headers that are actually used by the internal API. Unfortunately, this problem is much more complex, and I'm not sure the checks would not be too expensive; if anyone has a solution, I'll be happy to check it out.
I'll take the techempower benchmark as a site specification. The API is simple enough yet represents interesting features and doesn't take much time to write. I'm not going to compete against other implementations of that benchmark (at least I think so at the time of writing this post). Today I'm going to build a skeleton of the site with a single endpoint.
I'll take servant as the web framework; to my taste it is the best framework you can use unless you have exceptional needs. The main feature of servant is that it allows you to generate and test much of the code for free, without much extra cost.
The servant framework has a very nice tutorial and documentation that can be found on its read-the-docs site.
When writing an application using servant you first need to define its API:
type Api
  = JsonApi

type JsonApi
  = Description
      "Raw JSON output API \
      \ For each request, an object mapping the key message \
      \ to \"Hello, World!\" must be instantiated."
  :> "json"
  :> Get '[JSON] Message
I prefer to keep a type synonym for each endpoint (or endpoint structure), as that allows using the type in other parts of the program for code generation.
This type explains how a handler does its work, and :> splits the type into URL pieces. This type says that the application can handle GET requests to the URL /json if the accept type is application/json, and when doing that it returns a Message.
The additional Description part comes from the servant-swagger package. A few more lines provide additional information about our API:
apiVersion :: T.Text
apiVersion = "0.0.1"

swagger :: Swagger
swagger = toSwagger (Proxy @Api)
  & info.title .~ "Experimental API"
  & info.description ?~
      "This is a benchmark site, and used for general \
      \ experiments and blog examples."
  & info.version .~ apiVersion
Now we can run the server. Our server consists of the swagger UI and our application.
type API
  = SwaggerSchemaUI "swagger-ui" "swagger.json"
  :<|> Api

run :: IO ()
run = do
  Warp.run configPort
    $ prometheus def { prometheusInstrumentPrometheus = False }
    $ serve (Proxy @API)
    $ swaggerSchemaUIServer swagger
      :<|> server
  where
    configPort :: Int
    configPort = 8080

    server :: Handler Message
    server = pure $ Message "Hello World!"
Remember the prometheus def lines from the previous post. And the application runner is:
import Prometheus
import Prometheus.Metric.GHC
import Sample.Server (run)
main :: IO ()
main = do
  _ <- register ghcMetrics
  run
Now we have an application that returns {message:"Hello, World!"} at the json URL and a swagger UI at swagger-ui/. With that interface you can explore the site API, send requests, and observe the results. And all of that comes for free.
There are a few more things I’d like to discuss before moving to metrics:
Naming conventions: it's worth defining common conventions for converting Haskell data types into their JSON representation and using them across the project.
Encoding tests: with servant and swagger you can automatically test the serialisation of values for all the types used in the API, as sketched below. The tests also check that the specification is up to date.
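A sketch of such a test suite, assuming hspec and Arbitrary, ToJSON, and ToSchema instances for the body types: servant-swagger's validateEveryToJSON generates a property per type used in the API, checking that the ToJSON output of arbitrary values matches the declared ToSchema.

import Data.Proxy (Proxy (..))
import Servant.Swagger.Test (validateEveryToJSON)
import Test.Hspec (describe, hspec)

main :: IO ()
main = hspec $
  describe "ToJSON matches ToSchema" $
    validateEveryToJSON (Proxy :: Proxy Api)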
Now we have a simple site with a helper interface and a specification. There are many missing pieces, for example: a. configuration parsing; b. logging; c. more autogenerated tests; d. nice defaults for RTS options.
All of them will be covered in the following posts.
In order to build everything, I prefer to use the nix package manager. Both stack and cabal-install are nice tools for the job, but with nix you can add more features to the build system, for example building docker containers, building packages written in other languages, and setting up a development environment. Build scripts for the package can be found in overlays.nix; build scripts for its docker container are in docker.nix.
At this point we are ready to set up our environment with Grafana and Prometheus. The configs are the same as the ones described earlier, and they can be found on GitHub.
Grafana reports look like this:
Mutator CPU time shows how much CPU time you've spent doing actual work. If the total CPU time is much higher than the mutator CPU time, then you are likely to have problems with GC.
The config itself can be taken from GitHub.
Unfortunately, the GC information tells us about the last GC only, so everything that happened between scrape intervals will be missing from our data. In the next post I'm going to run some benchmarks of the application (and potentially introduce other endpoints) and discuss whether the missing information is actually a problem and what can be done about it.
I will try to split the description into a series of posts; in this one I'll describe the general setup. After reading this post you'll be able to set up the metrics system of your Haskell application (or suggest to me how to do it better). At this point you'll be able to get some information about your application and set up alerts based on it. In the following posts we will go deeper into each metric, check whether those metrics are helpful, and look at which pieces are missing and how that could be improved.
Let's spend a bit of time defining the problem we want to solve and describing its solution area. The purpose of a metrics system is to tell whether your application is alive and behaves as expected. It should not give you more than statistical information about your application. We can split that information into two categories:
The borderline between those two is quite fuzzy: for example, you may have general web-server statistics, like the number of processed requests or the time to reply, which are application-specific but applicable to any web server, so I'd put them in the first group. The main trap here is trying to solve the wrong problem: metrics cannot work as an exact source of information. Nor can they replace tracing or a log server; you need other tools for those purposes.
In the Haskell ecosystem, there are a few packages providing metrics support. The best known one, with a long history, is EKG. This package offers a few metric types and a large variety of systems you can integrate with. While EKG is generally a good generic solution, I found that some companies are trying to move away from that package (I was not able to gather concrete reports of what the problem with it was, so I will try to avoid answering that question).
Otherwise, we can take a specific solution that works great with a single system. At Tweag we are used to using Prometheus. With Prometheus you can dump your metrics into a well-maintained system that other people are usually familiar with. Hackage offers an excellent library for working with Prometheus: prometheus-client. Even if you like EKG more, or have projects that already use it, you can use the adapter for EKG, ekg-prometheus-adapter. I have not used that package myself, but I hope that it just works, or at least could be easily fixed.
For the application setup I'm going to use Docker Compose. With this approach we will be able to cover all the details, and it can be adapted to a more complex system like Kubernetes.
Let's start writing the docker compose file; I've omitted all irrelevant links and configuration.
version: "3"
services:
  haskell-app:
    image: <your-image>
    ports:
      - '8080:8080'
  node_exporter:
    image: prom/node-exporter
    expose:
      - 9100
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./config/prometheus:/etc/prometheus
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
    ports:
      - '9090:9090'
    links:
      - haskell-app
  grafana:
    image: grafana/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=XxXchangemeXxX
    depends_on:
      - prometheus
    ports:
      - "3000:3000"
    links:
      - prometheus
    volumes:
      - grafana_data:/var/lib/grafana
    user: "104"
volumes:
  prometheus_data: {}
  grafana_data: {}
Prometheus config:
global:
  scrape_interval: 5s
  external_labels:
    monitor: 'my-monitor'
scrape_configs:
  - job_name: 'myapp'
    static_configs:
      - targets:
          - haskell-app:8080
It's possible to configure Grafana declaratively as well, but as I don't have a final solution that works out of the box on any system, I tend to set up Grafana manually: just log into your instance and go through the onboarding process.
Now we are ready to set up our Haskell application.
Setting up a Haskell application may be pretty simple. To dump GHC statistics you may use prometheus-metrics-ghc. To make full use of this package you need to enable gathering of runtime statistics with:
build-depends: prometheus-metrics-ghc
ghc-options: "-with-rtsopts=-T"
Then add to your main:
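Presumably the line meant here is the registration call, consistent with the full example below:

_ <- register ghcMetrics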
At this point you gather RTS stats, but you don't export your metrics yet. To export your data you may want to use wai-middleware-prometheus. This package allows you to provide metrics inside any wai/warp application.
import qualified Network.Wai.Handler.Warp as Warp
import Network.Wai.Middleware.Prometheus
import Prometheus
import Prometheus.Metric.GHC

main :: IO ()
main = do
  _ <- register ghcMetrics
  Warp.run 9090
    $ prometheus def { prometheusInstrumentPrometheus = False }
    $ yourApplication
Or use the metricsApp function if you don't have any web application. Prometheus will then scrape that data from your application. At this point you'll have some basic information about your endpoints and GC stats, and you can add your application-specific data using the Prometheus interface.
We will cover the interesting stats in the next posts, but for now you may be interested in the following data:
ghc_allocated_bytes_total - to build a rate plot based on that metric;
ghc_num_gcs - to build a rate plot of GCs;
ghc_mutator_wall_seconds_total/(ghc_mutator_wall_seconds_total+ghc_gc_wall_seconds_total) - to see how the time is split between the mutator and the GC;
metrics from the ghc_gcdetails category - this data may not be very useful, as it shows data since the last GC only, so you may report the same GC multiple times if no GC happened during the report period, or miss some reports if more than one GC happens.
I hope this information will be useful; I will try to dig into concrete metrics examples in the following posts.
The first thing is a discussion of explicitly pinning capabilities to cores. It's possible using the +RTS -qa flag, as mentioned by nh2 on Reddit. As I noted in the previous blog post, my approach will not work correctly with this option (for some reason I had used -xm instead of -qa in that post, I'm sorry), and I'll need to redefine more functions. But in general, pinning capabilities to cores may work on all possible CPU layouts. I have not looked deeply into that issue, as in most of our cases the -qa flag gave me worse performance, so your program needs some special properties to benefit from hard pinning. I think it's possible to use /proc/cpuinfo to make smarter decisions when pinning capabilities.
The entire thread is very entertaining, and if you are interested in the topic I recommend checking out the other comments as well.
Secondly, there was a question whether my reasoning was incorrect and it's enough to leave one thread off and still get better performance. We used this approach in some projects; however, for one particular case the results with N-1 threads were very depressing:
Cumulative quantiles per tag (N7)
99% 98% 95% 90% 85% 80% 75% 50%
Overall 4600ms 4380ms 3980ms 3540ms 3400ms 3280ms 3210ms 1105ms
get 4600ms 4390ms 3980ms 3550ms 3410ms 3290ms 3210ms 1145ms
put 4600ms 4380ms 3980ms 3540ms 3400ms 3280ms 3210ms 1100ms
Cumulative quantiles per tag (N4)
99% 98% 95% 90% 85% 80% 75% 50%
Overall 139ms 105ms 37ms 17ms 12ms 8ms 6ms 2ms
get 139ms 104ms 37ms 18ms 12ms 9ms 7ms 2ms
put 139ms 105ms 37ms 17ms 12ms 8ms 6ms 2ms
There is a 1 to 3 orders of magnitude difference in response times; without digging deeper, I have decided to stick with -N4 for now.
Third, @TerrorJack advised me to improve the teardown procedure in wrapper.c: it should check whether the RTS was stopped and report its status. So I have rechecked the sources and introduced a few updates that allow reporting the status of the running Haskell command (the same way the RTS does), and which do not require using the FFI extension in the Haskell code.
Some updates to this post can be found in the next one.
Recently I've written a small Haskell application that performs some cryptography routines, query management, and communication with Redis. We wanted to test the capabilities of the application and measure the RPS it can support. We used Yandex Tank for this purpose: it can generate load for a site and build some fantastic reports (it may require GitHub authorization).
The results were quite interesting. At first everything went well: the application was able to process about 1k requests per second, and that was enough for the expected load. However, when Yandex Tank raised the pressure to about 2k RPS, the situation became worse: the application stopped being responsive and was able to process only 200 requests per second. That was troublesome. On the Yandex Tank plots there was a period of reasonable performance and then a period of unresponsiveness.
As it happens, the first suspect in such cases is the garbage collector; you can always hear lots of scary stories about how GC can ruin your life. Keeping that in mind, the GC had been tuned beforehand and I had prepared some metrics. However, this time the situation was quite OK: garbage collector pauses were all below 10ms, and 99% of the total program time was spent on actual work and not on garbage collection. Memory usage was too big, around 700Mb, and I didn't expect that.
On the left plot you can see reports of the last GC; they are not very precise, as they may miss some GCs or report the same GC twice, but they still tell us the order of magnitude. On the right plot we see the ratio of the time spent in the mutator to the total running time.
Other parts of the system were not under stress and were capable of handling higher loads, so the issue was in my program. Although I hadn't spent much time optimising the program, it should have behaved better.
Another surprising fact was that the issue didn't appear on my system, which is comparable to the one where the stress tests were running, yet could handle a higher load. Accidentally I realised the difference, and the following dialogue took place:
Me: What is the CPU on the system?
Admin: i7, 4 cores, eight threads!
Me: Ah! Add GHCRTS=-N4 to the container's environment.
After that single adjustment the situation changed drastically: the program was now able to process 4.5k RPS (comparable to the maximum load a single instance of Yandex Tank can generate), mutator time got above 99%, GC pauses were still ~10ms but very rare, and memory usage was about 25Mb.
The connect time was still terrible in that case, but it was fixed by reconfiguring and improving the network, which is out of scope here.
So this is just another story about suboptimal defaults for the runtime system. It happens that when you pass the +RTS -N option, you tell the runtime system to start as many capabilities as the number of cores you have. However, the RTS makes no distinction between real and virtual cores. It appears that the RTS cannot extract enough benefit from virtual cores, and performance is not good enough in that case.
While it is pretty impressive that we can optimise a program without any changes to the source code, it's interesting what we can do in the general case. It's problematic that we cannot set good options that work on any CPU and give decent performance. To find an answer, I've started the ht-no-more project. It lives in my playground for now, but I can extract it into a separate repo. I hope that at some point it will be mature enough to be used as a library or even land in the RTS source code.
The idea is to gather information about the architecture during program startup and then set up the RTS with a proper configuration. We want the +RTS -N option to still behave well.
N.B. From this point on, we assume that we run on Linux only, that we have procfs mounted, and that we can write non-portable code. Now our life is comfortable, and we can proceed with the task.
The first question is how we can tie our process to a CPU. There are the sched_getaffinity() and sched_setaffinity() calls. Those methods perform hard wiring of a process, and all its descendant processes, to the given CPUs. So we can use them to mask the CPUs we are not interested in.
int sched_setaffinity(pid_t pid, size_t cpusetsize,
cpu_set_t *mask);
int sched_getaffinity(pid_t pid, size_t cpusetsize,
cpu_set_t *mask);
First, let's write a simple C program that demonstrates the API.
#define _GNU_SOURCE // allow to use non-portable functions
#include <sched.h>
#include <string.h>
#include <stdlib.h>
#include <stdio.h>
int main(int argc, char * argv[]) {
  cpu_set_t set; // define CPU set
  CPU_ZERO(&set); // mark all CPUs as unused
  CPU_SET(0, &set); // allow use of the first CPU only
  sched_setaffinity(0, sizeof(cpu_set_t), &set); // wire the process to the CPU
  int result = system("bash"); // start bash
  return result;
}
In this first program we allow the application to run only on the first CPU. We will need this program later for testing purposes: I've used it to check that my program behaves well in the constrained case.
The next question is how to check whether a processor is real or virtual. The only way I've found is to parse the /proc/cpuinfo file. We are interested in the core id entry for each processor: it tells the index of the real core that the CPU sits on. For example, on my machine I have:
As I have Hyper-Threading disabled, all cores are real ones. On a Digital Ocean host, I have:
all CPUs on the same core (at least for the container).
Now we can combine the answers to the questions and write the code. We need to remember that there is an additional case to cover: if a CPU was already disabled for our program, then we don't want to "unmask" it. As a result, I ended up with the following code:
int setcpus() {
  cpu_set_t set;
  int ret = 0;
  ret = sched_getaffinity(0, sizeof(cpu_set_t), &set);
  if (ret == -1) {
    fprintf(stderr, "Error: failed to get cpu affinity");
    return 0; // Report failure, so the caller falls back to the normal procedure.
  }
  int current_cpu = -1;
  int current_core = -1;
  FILE *cpuinfo = fopen("/proc/cpuinfo", "rb");
  char *arg = 0;
  size_t size = 0;
  while (getdelim(&arg, &size, '\n', cpuinfo) != -1) {
    if (strstr(arg, "core id") != NULL) {
      current_core++;
      char * found = strchr(arg, ':');
      if (found) {
        int cpu = atoi(found+1);
        if (current_cpu != cpu) {
          current_cpu++;
          if (CPU_ISSET(current_core, &set)) {
            CPU_SET(current_core, &set); // XXX: this is a noop.
            fprintf(stderr, "%i real core - enabling\n", current_core);
          } else {
            fprintf(stderr, "%i was disabled - skipping\n", current_core);
          }
        } else {
          fprintf(stderr, "%i is virtual - skipping\n", current_core);
          CPU_CLR(current_core, &set);
        }
      } else {
        return 1;
      }
    }
  }
  ret = sched_setaffinity(0, sizeof(cpu_set_t), &set);
  if (ret == -1) {
    fprintf(stderr, "Error: failed to set affinities - falling back to default procedure\n");
    procno = 0;
  } else {
    procno = current_cpu;
  }
  free(arg);
  fclose(cpuinfo);
  return 0;
}
Now we are ready to build a cabal project. I'm skipping all the irrelevant parts generated by cabal init:

executable ht-no-more
  main-is: wrapper.c
  other-modules: Entry
  build-depends: base >=4.10 && <4.11
  default-language: Haskell2010
  ghc-options: -no-hs-main
To highlight the important things: first of all, our main module is a C file. This does not work with old cabal versions that allowed only Haskell modules to be the main one. Then we add -no-hs-main, an option that tells GHC not to create its own "main" and to use the "main" function that we define. We define the Entry.hs Haskell module that provides an entry function, because we no longer have a Haskell main. In that module we create a single function that reports how many capabilities the RTS has created.
module Entry where
import Control.Concurrent
foreign export ccall entry :: IO ()
entry :: IO ()
entry = print =<< getNumCapabilities
The only non-trivial thing we need is to export a foreign function. The compiler generates a C-compatible object called entry that we can call from C.
We follow the GHC User's Guide to define our main function (sidenote: if you haven't read the GHC User's Guide, please do; it's the most authoritative and precise source of information about GHC features and extensions).
#include "HsFFI.h"
#ifdef __GLASGOW_HASKELL__
#include "Entry_stub.h"
#include "Rts.h"
#endif

int main(int argc, char * argv[]) {
  setcpus();
#if __GLASGOW_HASKELL__ >= 703
  {
    RtsConfig conf = defaultRtsConfig;
    conf.rts_opts_enabled = RtsOptsAll;
    hs_init_ghc(&argc, &argv, conf);
  }
#else
  hs_init(&argc, &argv);
#endif
  entry();
  hs_exit();
  return 0;
}
However, we are not done yet. If you compile this program and run it with +RTS -N, you'll see the expected logs, but the program reports a number of capabilities equal to the number of virtual cores. It happens because with +RTS -N, GHC asks for the number of configured processors and creates that many capabilities. Instead, we want the number of capabilities to be equal to the number of real cores. Furthermore, we don't want to patch GHC just yet, because our code is too hacky.
The GHC RTS is linked statically with each Haskell project. It means that we can use the linker to make the RTS use our method instead of the one provided with GHC; we are interested in redefining uint32_t getNumberOfProcessors(void). For that we use the linker's wrap feature. If you tell the linker -Wl,-wrap,function, then each call to function calls __wrap_function instead, and the linker provides __real_function that you can use to reach the original function.
So we write:
static uint32_t procno = 0;

uint32_t __real_getNumberOfProcessors(void);

uint32_t __wrap_getNumberOfProcessors(void)
{
  if (procno == 0) {
    return __real_getNumberOfProcessors();
  } else {
    return procno;
  }
}
to get the desired result. We change the cabal file to provide the required options to the build:
ghc-options: -no-hs-main
             -threaded
             -optl-Wl,-wrap,getNumberOfProcessors
You can find the full code on GitHub. There is still much work to do on this project before you can use it for your application. Some further work which could be done:
use the getNumberOfProcessors wrapper alone, without calling sched_setaffinity;
handle the -xm flag (which pins capabilities to CPUs): with it, the current approach may fail.
All feedback is welcome.
On NixOS, the system is configured via /etc/nixos/configuration.nix, but all applications set up there get installed system-wide. Instead, you usually want your user environment to be configured per user. Also, you may want to move your environment from one host to another (possibly to another distribution or even another OS). Updating configuration.nix looks like too heavy-weight a solution for such a task.
Other solutions that I've seen were building one's own environment or using nix-env for installing applications per user. I was not able to adopt the former, and the latter does not provide a declarative config; also, one may collect lots of garbage from applications run only once. Besides, it would be nice to control dotfiles in the same style.
One day my colleague Nicolas Mattia suggested that I look at the tool he uses, called homies, which he has excellently described in his blog; I think everyone should read and check it out :). Unfortunately, this approach, at least to my understanding, has to be replicated by each user, because it's not customisable. So I've started my own project, homster, which solves the same problem. Currently it's primarily based on homies, though in the future it's going to diverge.
The general structure of the project is as follows. There is a default.nix file that describes all the packages, including the modified ones that I use. Into each modified package I pass my config, or it's configured to read the system config from the nix package directory, so I can configure common options in the homster project and update them on the host by the usual means.
So it seems that I have found the solution to all my needs, and I can have a declarative configuration for the user environment. I can still use nix-env -i for temporarily needed packages, but the nix-env -f homster/default.nix -i --remove-all command updates and cleans my environment. Most of the dotfiles can be kept in the project. Also, I can set up my environment anywhere I can install the nix package manager.
While implementing this, the most interesting problem was git. Git searches for its configuration in 3 places, according to the man page:
(prefix)/gitconfig
$(HOME)/.gitconfig or $XDG_CONFIG_HOME/git/config
.git/config in the repository
Besides, git takes a --config option that overrides the config search entirely. We can't use --config, because that would override project-specific options; we can't update the user-wide file with nix; so the only option left is the system-wide config. So we need to understand what (prefix) is. The man pages say it's the value of the PREFIX environment option. Let's check nixpkgs:
https://github.com/NixOS/nixpkgs/blob/54ba2c9afca07b0f14763b3697d00b637b2461e0/pkgs/applications/version-management/git-and-tools/git/default.nix#L86
It seems that nix sets that up, but let us check:
strace git config 2>&1 | grep gitconfig
access("/etc//gitconfig", R_OK) = -1 ENOENT (No such file or directory)
That is not what we were expecting. The story continues: we need to understand where git looks for its config. You can find it on GitHub (modulo the version that I've used, but that is irrelevant for the code I'm interested in):
const char *git_etc_gitconfig(void)
{
  static const char *system_wide;
  if (!system_wide)
    system_wide = system_path(ETC_GITCONFIG);
  return system_wide;
}
Okay, what is ETC_GITCONFIG?
https://github.com/git/git/blob/1f1cddd558b54bb0ce19c8ace353fd07b758510d/configure.ac#L387-L391
GIT_PARSE_WITH_SET_MAKE_VAR(gitconfig, ETC_GITCONFIG,
Use VALUE instead of /etc/gitconfig as the
global git configuration file.
If VALUE is not fully qualified it will be interpreted
as a path relative to the computed prefix at runtime.)
Finally, after searching for what GIT_PARSE_WITH_SET_MAKE_VAR means, we find that we need to pass --with-gitconfig=name as a parameter to configure.
So now we need to patch the nix package for git to do that. As usual, we want to do it declaratively and in an easy-to-change way, without redoing the work that the NixOS maintainers already did, like applying patches and packaging.
Everything described above can be done pretty quickly in nix: in my homster/git/default.nix I can override the default package for git as follows:
nix interprets the $out variable and substitutes the exact hash there. Then we can copy our file to the right place and get a new system-wide configuration that is controlled by us.
strace git config 2>&1 | grep gitconfig
access("/nix/store/66hi8rssnvhlxbwjg3qkc4bcs76fp8np-git-2.16.4/etc/gitconfig", R_OK) = 0
openat(AT_FDCWD, "/nix/store/66hi8rssnvhlxbwjg3qkc4bcs76fp8np-git-2.16.4/etc/gitconfig", O_RDONLY) = 3
Exactly what is needed.
As a result, I've ended up with a simple, incomplete irc-simple project. This project can't be used as a real IRC server in a real network, but it can be extended to support more features. The main intent was to show how a server could be written: how to deal with communication, parsing, and concurrency.
Unfortunately, the project is documented in Russian, as it was written for the Russian non-Haskell community. But if anyone is interested in extending it, translating it, or correcting mistakes, that would be awesome.
The project itself can be found at: