Great Grandma

2017-11-26 17.50.25 - Benjamin and Grandma Georgia B&W contrasty.jpg

Advertisements

Esoteric Data Structures and Where to Find Them

This is a summary of the CppCon 2017 talk given by Allan Deutsch.

Slot map

  • An unordered, associative container similar to hash map
  • Assigns items a unique identifier (unlike a hash map, where you provide an identifier)
  • Similar to a pool allocator, but with a map-like interface
  • Advantages over hash map:
    • True constant time lookup, erase, and insert (except when reallocating to grow the size)
    • Contiguous storage (for improved traversal/cache lines)
    • Reuses memory—if you delete a bunch of items, then insert new ones, it doesn’t result in a bunch of new memory allocations
    • Avoids the ABA problem, i.e.,
      1. You insert an item and receive a key
      2. You delete the item
      3. You insert a new item with the same key
      4. You attempt to access the item pointed to by  the old key, but get the new element
  • Disadvantages compared to hash maps:
    • More memory use, since it dynamically allocates blocks like a vector
    • Because it uses contiguous storage, memory addresses of elements are unstable, and lookup is slower than a raw pointer
    • Requires a small amount of extra memory
  • Potential use cases:
    • Storing game entities
    • Any place you’d like a map-like interface, but with constant performance and support for as-fast-as-possible iteration
  • How it works:
    • Array of “slots” (keys we give the user) which indicate each item’s index and the “generation” (the number of times this slot has been used)
    • Array of data, which gets pointed to by the slots
    • Free list head & tail, indicating which slots should be filled next
  • Variants
    • No separate index table
      • Pros:
        • Stable indices/memory addresses
        • Lookups only requires 1 indirection, not 2
      • Cons:
        • Slower iteration (since elements aren’t densely packed)
    • Constant size array
      • Pros:
        • No reallocations (so insert is always constant time)
          • Also means no memory usage spikes (where memory usage temporarily doubles as we allocate a new block and transfer the old one over)
        • Generations increase roughly uniformly
      • Cons:
        • Dynamic sizing is important for most use cases
    • Block allocation
      • Pros:
        • Constant inserts (like the constant-size array)
        • Spikes in memory due to reallocations are smaller
        • Iteration speeds are similar to original
      • Cons:
        • Elements aren’t fully contiguous, so iteration will always be some degree slower
        • More cache misses (scales inversely with block size—bigger blocks mitigate this!)
        • Adds a third indirection in lookup

Bloom filters

  • A probabilistic, super-duper memory efficient hash “set,” used primarily for short-circuiting otherwise expensive computations
  • Find results are only “definitely not” and “maybe”
  • Supports find and insert operations only—no actual data elements is stored (since, as the NSA informed us, metadata is not real data! 😉 ), so there’s such thing as a “get” op
    • This is especially useful for privacy purposes
  • How it works
    • On insert, take K (fast) hash functions, each hash function sets one bit in the Bloom filter’s bit field
    • On find, hash the object and see if each of the bits you get from your hash functions are set; if they are, the item in question was probably inserted in the past
    • Increasing the number of hash functions decreases your false positive probability (but increases the memory usage)

Navigation meshes

  • Representation of a search space (used for pathfinding) that reduces the number of nodes necessary to represent the traversable area
  • Grid representations suck: They don’t map well onto non-square shapes, and you typically use a lot of them
  • Triangle nav meshes improve upon this; using a mix of triangles and squares can improve it even further
  • Consider:Screen Shot 2017-11-02 at 4.06.26 PM.png
  • To actually traverse this, you would move from edge to edge to get between nodes, and when you reach your destination node, you could simply beeline to your goal (since you know it will be free of obstacles)
  • You can use Recast to create nav meshes, and A* or the like to search them

Hash pointers

  • A combination of a pointer and a hash of the thing being pointed to
  • Allows you to verify that the data hasn’t been changed since you got the pointer to it
  • Most common implementation is in Merkle trees in cryptocurrency blockchains
  • Tamper detection: suppose the data in a block has changed maliciously in a blockchain (a series of hash pointers + data). In order for the attacker to “get away with it,” they’ll have to change the data in Block X, plus the hash in Block Y that points to Block X. But, Block Z stored a hashpointer to Block X as well, and its hash included the now-altered hash in Block X, so the attacker has to also change the hash of Block Z (and so on up the chain).
    • Verification is O(N), for N number of blocks
  • Alternative: Merkle trees
    • Tree structure where only the leaves hold data, and all other nodes simply store 2 hash pointers to 2 children
    • Verification of the structure takes O(log(N)) time, but that also means it takes less tampering on the part of an attacker

Hiring Software Developers

We recently hired a pair of software developers at work, and I’ve gotten a couple requests to write up how we did it.

The traditional hiring process—source a billion resumes, eliminate them based on keywords, interview the top hundred people, then pick one—is pretty broken. Fundamentally, filtering by pre-existing experience with your tech makes you miss the enormous swath of smart people who could get up to speed on your stack quickly, and potentially become tremendously valuable to you. Whenever I hear people lament that it’s impossible to hire developers, my first question is always: how many good devs are you excluding from your search?

In addition to unduly weighing resume keywords, the traditional interview process also massively penalizes people who can’t code or problem solve under pressure. There’s a certain class of people who do well in high-pressure interviews, but that skill isn’t strongly correlated with overall value as a developer.

The alternative to all this is: just have people write code that exemplifies the role you’re hiring for. We wanted anyone who might be a good fit to apply, and we would narrow our choices by comparing the code they wrote.

At a high level, our interview process looked like this:

  1. A blog post asking for applicants. (This obviously only worked because we have a large number of developers reading our blog; I think the same principles could have applied to a StackOverflow job listing and the like.)
  2. A preliminary chat over email, to answer any questions they have and generally make sure they aren’t vastly under-qualified.
  3. A small set of “homework” problems—problems that are characteristic of the most difficult type of work they’d be doing, with a definite right answer (or set of right answers).
  4. A phone interview with the top performers on the “homework” questions.

Let me break down how we handled each of these stages…

The call for applications

In our blog post, we were very careful not to eliminate any potential hires. We wanted to hear from any smart, self-disciplined developer, period, and let the homework problems show us who to pursue.

In retrospect, specify a CS degree as a requirement was a mistake, one we won’t make next time; one of the two candidates we hired doesn’t have a degree, and he’s one of the strongest developers I’ve ever met. I hate to think that we would have weeded him out for no reason at all if he hadn’t had the self-confidence to apply anyway. Strong self-confidence is not a job requirement!

The instructions in the post were simple: send us an email introduction  with a brief overview of a project (or projects) you’ve enjoyed working on, and a discussion of projects you have not enjoyed working on. I tried to make it clear: there’s no need to stress about this email!

One final note about the call for applicants: this is also a time to sell your company.

The intro email

In the intro email the applicants sent, I was looking for a few things:

  • Past projects (not necessarily professional) that indicated this person knew software development. I didn’t follow up with the people whose projects amounted to “hello world.”
  • Interest in the type of work they’d be doing. If the person’s description of what they were interested in directly conflicted with the job we were trying to hire for, I sent back a response gently explaining why I didn’t think this would be a good fit.
  • Strong English communication skills. We’re an English speaking team, and especially for a development role, we can’t afford to have ongoing communication issues.
  • Ability to follow simple directions. This one was inadvertent, but I got quite a few emails lacking the details we requested in the blog post. I did not follow up with these people.

I also used the email chat to answer any questions they have and generally make sure they aren’t vastly under-qualified. My primary goal at this stage was to not eliminate anyone who could potentially knock our socks off in the coding challenge… but at the same time, to not waste anyone’s time.

I was also up front at this stage with people I expected we couldn’t offer competitive compensatation. For instance, a developer with 15 years experience in database programming working in Silicon Valley probably commands $150k+, but a) as a distributed company, we’re not going to pay SV premiums, and b) their (vast) experience is in a totally unrelated field. Some applicants were okay moving forward with the interview processs despite the expected pay cut, some were not.

The homework problems

The most important feature of the homework problems is that they had objective criteria on which they could be compared. Each problem had dozens of scoring criteria, such that anyone on our dev team could grade them and we could differentiate a really good solution from an okay one. And because we gave the same problems to everyone, we could compare their performance to the rest of the applicant pool.

I was aiming for a total time cost to the interviewees of under 2 hours for people with prior experience in the problem domain (in this case, performance tuning an algorithm in C++). But, for people who had never worked on something like this, they might have needed a significant amount of reading to get up to speed—whether that’s a primer on what sorts of things govern an algorithm’s performance on modern CPUs, or simply a crash course in C++. I did my best to provide any resources the applicants needed—again, I wasn’t trying to filter by previous experience!—such that a smart applicant could read the provided reference material and turn it into a strong homework submission given enough time. (For applicants that didn’t want to invest the time to go from zero knowledge to a working solution, I completely understood… but obviously we couldn’t move forward with those people.)

The interview

We scheduled Skype interviews with the six candidates who submitted the best homework solutions. Each interview was conducted by two to three senior devs on the team, in a (hopefully) low-stress discussion format.

Since the homework told us everything we needed to know about the person’s ability to code, the interview was focused on their ability to communicate about technical issues. We talked about their homework solutions, asked about their past work, etc. We also did our best to make sure they would be a good fit for the level of independence and autonomy that we need from our developers.

A few of my favorite questions we asked in person:

  • Can you tell us more about some of the challenges you faced on [some past project]? (This question works best if you can ask about a specific subsystem that you, the interviewer, have some domain knowledge on.)
  • In any sufficiently large codebase, you tend to accumulate some old, awful, legacy code that’s kept around because “it ain’t broke.” As a developer, it feels skeezy to leave that code in such a horrendous state, but you probably don’t need to touch it in order to implement any given feature/fix. What do you do? (My goal with this question was to assess how they balanced business needs against personal “hygeine.”)
  • Development teams operate on a continuum between “total developer anarchy” and “micromanagement of every task.” Where do you prefer to be along that continuum? (This question is trying to ensure the person won’t expect more oversight than we can give, and I tend to follow up with questions about how they’ve handled management in the past.)

Results

We were extremely happy with how this worked out. The two people we ended up hiring are phenomenal, but honestly, I expect that any of the people that made it to the phone interview would have been great hires… and most of those would have been filtered out in step 1 if we had posted the typical list of “job requirements.”