Elixir’s GenServers are great. Their fault tolerance makes them a natural choice for situations where you need to store some state over time in a resilient way. They’re not without their gotchas, though. In particular, it’s quite easy to fall into traps with respect to scheduling work within the GenServer’s process.
Consider this toy example:
defmodule Greeter do
  use GenServer

  def start_link(opts) do
    name = Access.get(opts, :name, __MODULE__)
    GenServer.start_link(__MODULE__, name, name: name)
  end

  def greet(server \\ __MODULE__, name) do
    GenServer.call(server, {:greet, name})
  end

  @impl GenServer
  def init(server_name) do
    :timer.send_interval(20_000, self(), :sleep)
    {:ok, server_name}
  end

  # Depending on where we are in the :sleep handler, we'll time out
  # before getting into the body of this!
  @impl GenServer
  def handle_call({:greet, name}, _from, server_name) do
    {:reply, "Hello #{name}, I'm #{server_name}", server_name}
  end

  @impl GenServer
  def handle_info(:sleep, server_name) do
    Process.sleep(10_000)
    {:noreply, server_name}
  end
end
If you call Greeter.greet/2 while the :sleep message is being handled in the “background,” you’ll either get a reply that takes several seconds to arrive (if the sleep is nearly finished) or, more likely, a timeout error: GenServer.call/3 defaults to a five-second timeout, and the handler sleeps for ten.
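Here’s roughly what the unlucky case looks like in an iex session, assuming you’ve started the server with Greeter.start_link([]) and your call lands early in the sleep (the name "Jane" is just for illustration):

iex> Greeter.greet("Jane")
** (exit) exited in: GenServer.call(Greeter, {:greet, "Jane"}, 5000)
    ** (EXIT) time out

Note that it’s the caller that blows up here; the Greeter process itself is still happily sleeping.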
Now, obviously that’s a super artificial example, and no one would write something like that. What about this one?
defmodule Greeter2 do
  use GenServer

  # . . .

  @impl GenServer
  def handle_info(:sleep, server_name) do
    %{minute: minute} = DateTime.utc_now()

    timeout =
      if minute == 0 do
        60_000
      else
        250
      end

    Process.sleep(timeout)
    {:noreply, server_name}
  end
end
This is even more brain-dead!
But… you actually see code that’s equivalent to this all the time in the wild. The :sleep handler is what it looks like when you interact with a remote API that is usually fast, but occasionally times out. (It’s no coincidence I have it timing out at the top of every hour—I’ve worked with a few different government APIs that refresh their data hourly, and for the first couple minutes of the hour requests to those endpoints often just time out.)
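In real code the handler doesn’t call Process.sleep/1, of course; it looks more like this sketch, where SomeApi.fetch/0 is a hypothetical wrapper around whichever HTTP client you’re using and the state is assumed to be a map:

@impl GenServer
def handle_info(:refresh, state) do
  # Usually this returns in well under a second. At the top of the hour,
  # though, it can block for the HTTP client's full request timeout, and
  # no other message gets handled until it comes back.
  case SomeApi.fetch() do
    {:ok, data} -> {:noreply, Map.put(state, :data, data)}
    {:error, _reason} -> {:noreply, state}
  end
end

Functionally it’s the same trap as Greeter2: the blocking just hides behind a function call that’s usually quick.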
A couple other ways this can occur:

- Doing slow work directly inside a synchronous callback (a database call in a handle_call/3, for instance)—it’ll be fine most of the time, but when the database gets overloaded, you could see cascading timeouts.

The fun thing about all these examples is that restarting the GenServer probably won’t help, because you often (re-)schedule the same kind of work upon restart, and the GenServer will keep crashing. (You may even be making the problem worse with retries!) And of course if the GenServer keeps crashing in just the right way, either its Supervisor will give up and just let it die or you’ll bring down the whole Supervisor.
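To make that database-in-a-callback case concrete, here’s a sketch of how innocent it looks, assuming a hypothetical Ecto repo MyApp.Repo and schema MyApp.Item:

@impl GenServer
def handle_call(:list_items, _from, state) do
  # Fine while the database answers in a few milliseconds. When it's
  # overloaded, every caller queues up behind this query and starts
  # hitting its own GenServer.call timeout.
  {:reply, MyApp.Repo.all(MyApp.Item), state}
end

Nothing about it looks risky until the day the database is slow for everyone at once.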
The right solution here really depends on your situation. The simplest fix (where it’s appropriate) is to spawn the sometimes-slow work into a Task, then send your GenServer process a message when it’s done.
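Here’s a minimal sketch of that pattern as a Greeter3 variant, assuming a Task.Supervisor named MyApp.TaskSupervisor is running somewhere in your supervision tree (that name, and the slow_work/0 stand-in, are made up for the example):

defmodule Greeter3 do
  use GenServer

  # start_link/1, greet/2, and handle_call/3 same as before...

  @impl GenServer
  def init(server_name) do
    :timer.send_interval(20_000, self(), :refresh)
    {:ok, server_name}
  end

  @impl GenServer
  def handle_info(:refresh, server_name) do
    # Do the sometimes-slow work in a separate process so this one stays
    # responsive. async_nolink means a crash in the task won't crash us.
    Task.Supervisor.async_nolink(MyApp.TaskSupervisor, fn -> slow_work() end)
    {:noreply, server_name}
  end

  # The task's result arrives as a {ref, result} message when it finishes.
  def handle_info({ref, _result}, server_name) when is_reference(ref) do
    Process.demonitor(ref, [:flush])
    # ...this is where you'd use the result to update state.
    {:noreply, server_name}
  end

  # async_nolink also monitors the task, so a failure shows up here as a
  # :DOWN message instead of taking this GenServer down with it.
  def handle_info({:DOWN, _ref, :process, _pid, _reason}, server_name) do
    {:noreply, server_name}
  end

  defp slow_work do
    # Stand-in for the occasionally slow API call or database query.
    Process.sleep(10_000)
  end
end

Now a slow hour at the API just means a long-running task off to the side, and the GenServer keeps answering greet/2 calls the whole time.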
Other possibilities include using :fuse (an Erlang circuit-breaker library) to stop retrying operations that fail repeatedly, which brings down the overall load on the system.