Esoteric Data Structures and Where to Find Them

This is a summary of the CppCon 2017 talk given by Allan Deutsch.

Slot map

  • An unordered, associative container similar to a hash map
  • Assigns items a unique identifier (unlike a hash map, where you provide an identifier)
  • Similar to a pool allocator, but with a map-like interface
  • Advantages over hash map:
    • True constant time lookup, erase, and insert (except when reallocating to grow the size)
    • Contiguous storage (for improved traversal/cache lines)
    • Reuses memory—if you delete a bunch of items, then insert new ones, it doesn’t result in a bunch of new memory allocations
    • Avoids the ABA problem, i.e.,
      1. You insert an item and receive a key
      2. You delete the item
      3. You insert a new item with the same key
      4. You attempt to access the item pointed to by the old key, but get the new element
  • Disadvantages compared to hash maps:
    • More memory use, since it dynamically allocates blocks like a vector
    • Because it uses contiguous storage, memory addresses of elements are unstable, and lookup is slower than a raw pointer
    • Requires a small amount of extra memory
  • Potential use cases:
    • Storing game entities
    • Any place you’d like a map-like interface, but with constant-time performance and support for as-fast-as-possible iteration
  • How it works:
    • Array of “slots” (keys we give the user) which indicate each item’s index and the “generation” (the number of times this slot has been used)
    • Array of data, which gets pointed to by the slots
    • Free list head & tail, indicating which slots should be filled next
  • Variants
    • No separate index table
      • Pros:
        • Stable indices/memory addresses
        • Lookups only require 1 indirection, not 2
      • Cons:
        • Slower iteration (since elements aren’t densely packed)
    • Constant size array
      • Pros:
        • No reallocations (so insert is always constant time)
          • Also means no memory usage spikes (where memory usage temporarily doubles as we allocate a new block and transfer the old one over)
        • Generations increase roughly uniformly
      • Cons:
        • Dynamic sizing is important for most use cases
    • Block allocation
      • Pros:
        • Constant inserts (like the constant-size array)
        • Spikes in memory due to reallocations are smaller
        • Iteration speeds are similar to original
      • Cons:
        • Elements aren’t fully contiguous, so iteration will always be some degree slower
        • More cache misses (scales inversely with block size—bigger blocks mitigate this!)
        • Adds a third indirection in lookup

Bloom filters

  • A probabilistic, super-duper memory efficient hash “set,” used primarily for short-circuiting otherwise expensive computations
  • Find results are only “definitely not” and “maybe”
  • Supports find and insert operations only—no actual data elements are stored (since, as the NSA informed us, metadata is not real data! 😉 ), so there’s no such thing as a “get” op
    • This is especially useful for privacy purposes
  • How it works
    • On insert, take K (fast) hash functions, each hash function sets one bit in the Bloom filter’s bit field
    • On find, hash the object and see if each of the bits you get from your hash functions are set; if they are, the item in question was probably inserted in the past
    • Increasing the number of hash functions decreases your false positive probability (but increases the memory usage)
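A minimal sketch of the above in C++. Note the hedges: real Bloom filters derive K independent hash values; here `std::hash` with a cheap second seed stands in (a common “double hashing” trick), and the sizes are illustrative:

```cpp
#include <bitset>
#include <functional>
#include <string>

template <size_t Bits, size_t K>
class BloomFilter {
    std::bitset<Bits> bits_;

    // Derive the i-th hash from two base hashes: h1 + i * h2.
    static size_t hash_i(const std::string& item, size_t i) {
        size_t h1 = std::hash<std::string>{}(item);
        size_t h2 = std::hash<std::string>{}(item + "#"); // cheap second hash
        return (h1 + i * h2) % Bits;
    }

public:
    // On insert, each of the K hash functions sets one bit.
    void insert(const std::string& item) {
        for (size_t i = 0; i < K; ++i) bits_.set(hash_i(item, i));
    }

    // false => definitely never inserted; true => maybe (false positives possible)
    bool maybe_contains(const std::string& item) const {
        for (size_t i = 0; i < K; ++i)
            if (!bits_.test(hash_i(item, i))) return false;
        return true;
    }
};
```

Usage is just `insert` and `maybe_contains`—there is deliberately no way to get an element back out.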

Navigation meshes

  • Representation of a search space (used for pathfinding) that reduces the number of nodes necessary to represent the traversable area
  • Grid representations suck: They don’t map well onto non-square shapes, and you typically use a lot of them
  • Triangle nav meshes improve upon this; using a mix of triangles and squares can improve it even further
  • Consider the example mesh from the talk (screenshot not reproduced here)
  • To actually traverse this, you would move from edge to edge to get between nodes, and when you reach your destination node, you could simply beeline to your goal (since you know it will be free of obstacles)
  • You can use Recast to create nav meshes, and A* or the like to search them

Hash pointers

  • A combination of a pointer and a hash of the thing being pointed to
  • Allows you to verify that the data hasn’t been changed since you got the pointer to it
  • Most common implementation is in Merkle trees in cryptocurrency blockchains
  • Tamper detection: suppose the data in a block has changed maliciously in a blockchain (a series of hash pointers + data). In order for the attacker to “get away with it,” they’ll have to change the data in Block X, plus the hash in Block Y that points to Block X. But, Block Z stored a hashpointer to Block X as well, and its hash included the now-altered hash in Block X, so the attacker has to also change the hash of Block Z (and so on up the chain).
    • Verification is O(N), for N number of blocks
  • Alternative: Merkle trees
    • Tree structure where only the leaves hold data, and all other nodes simply store 2 hash pointers to 2 children
    • Verification of the structure takes O(log(N)) time, though that also means an attacker has fewer hashes to alter along any single path
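The hash-pointer idea itself is tiny. A sketch, with the big caveat that `std::hash` stands in for a real cryptographic hash (a blockchain would use something like SHA-256); only the structure is the point:

```cpp
#include <functional>
#include <string>

// A pointer plus a hash of the thing being pointed to.
struct HashPtr {
    const std::string* target = nullptr;
    size_t hash = 0;

    static HashPtr to(const std::string& data) {
        return {&data, std::hash<std::string>{}(data)};
    }

    // Returns false if the pointed-to data changed since the pointer was made.
    bool verify() const {
        return target && std::hash<std::string>{}(*target) == hash;
    }
};
```

Re-hashing the target and comparing against the stored hash is exactly the tamper check described above.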

Unicode Characters in App Store Descriptions

In the last few years, Apple has cracked down on the special characters they allow in App Store descriptions. This is a list of all Unicode characters I’m aware of that they allow as of July 2017. (I suggest copying & pasting these into your own app descriptions.)

• (bullet)

‣ (small triangular bullet)

◂ (black left-pointing small triangle)

▶︎ (black right-pointing triangle)

◀︎ (black left-pointing triangle)

◆ (black diamond)

√ (square root symbol)

If you’re smart, you’ll use these to improve the readability (especially the scannability) of your copy. Since Apple doesn’t allow things like bold, italics, headings, etc., you can make do with some combination of symbols.

Speeding Up SVN Checkout for Large Repositories

After moving our gigantic SVN repo to a new server, we wanted to speed it up. Note that some of these recommendations are peculiar to using the svn+ssh:// protocol. If you’re serving SVN via Apache or something, you might need very different advice.

Here are all the things we changed on the server to speed up SVN. Note that these are in no particular order… it’s hard to say what will give you the biggest bang for your buck. That said, all of these are pretty cheap gags, so if checkout time is a priority, you might as well try them all!

  1. Ensure you’re running the latest version of Subversion. At the time of this writing, that means v1.9, which offers loads of performance improvements over v1.6 that we were running!
  2. Ensure you’re only checking out the files you really need. If you can get by checking out only some directories in the repo, rather than the whole thing, you might consider doing so. (For our usage—we only store art assets in SVN—this is just fine, because there are no real dependencies between subdirectories.)
  3. Set up a cron job to periodically run svnadmin pack on your repository as a way of reducing fragmentation of your repository’s shards.
  4. Upgrade to hosting the repo on an SSD. We found we were I/O bound by our spinning rust hard drives.
  5. Ensure your uplink speed is reasonable. Doing all of the above, we were saturating our old-as-hell 10 Mbit uplink (pushing 1.25 MB/sec isn’t hard off an SSD!).
  6. Try disabling compression (we do so in our svnserve wrapper script, seen below). By default, SVN uses compression level 5 (on a scale from 0 to 9). If you primarily store binary files that don’t benefit from compression, or you have a fast connection, this might be a win. In our case, our server’s CPU was pegged at 100% during a checkout; dropping compression removed the CPU as a bottleneck.
  7. If you’re using the svn+ssh:// protocol:
    1. Ensure you’re running the latest versions of OpenSSH and OpenSSL. Aside from being security holes, old versions have serious performance issues. (I’m ashamed to say our SVN server was running a 6 year old version of both tools! This is what we get for not having anyone “own” the maintenance on this server.)
    2. In your sshd configuration file (for our Ubuntu installation, this was located at /etc/ssh/sshd_config), disallow “old and busted” ciphers. If your server defaults to 3DES, it’s another security risk plus performance disaster. (It’s quite a bit slower than AES, which typically benefits from hardware acceleration.) You should have a line in the file that looks like this (exact cipher names vary by OpenSSH version; this is just an example):
      Ciphers aes128-ctr,aes192-ctr,aes256-ctr
      Don’t forget to restart your SSH server after making the change!
  8. If you’re using the svn:// protocol: Bump the size of SVN’s in-memory cache. It defaults to something like 16 MB, but if you’re running a dedicated server on halfway reasonable hardware, you can allocate way more than that. To do this, you’ll have to be using an svnserve wrapper. That is, you’ll have to have a custom shell script—something like /usr/local/bin/svnserve—that modifies the arguments that ssh+svn-connected users pass to svnserve. Something like this:
    # Set the umask so files are group-writable
    umask 002
    # Call the 'real' svnserve, also passing in the default repo location
    # Use a 4096 MB in-memory cache size, allowing both deltas and full texts of commits to be cached
    exec /usr/bin/svnserve "$@" --compression 0 --memory-cache-size=4096 --cache-txdeltas yes --cache-fulltexts yes
  9. If all of the above isn’t enough, it may be because you have a large number of files in your directories. (Modern OSes get less efficient as you get to an extremely large number of files.) You might consider re-sharding your repo, but note that this will take a long time (I’ve heard to expect 1 hour per GB to dump, then another hour per GB to reload).

After making the above changes (except re-sharding) on our svn+ssh:// server, we were able to go from an average download speed of about 100 kilobytes/second to about 6 megabytes/second—not bad at all!

Migrating a Large SVN Repository to a New Server

We ran into a situation at work where we needed to move our SVN repo from an old Linux server (running Ubuntu 10.04, in 2017!!) to a shiny new cloud instance.

Jump to the TL;DR

Dear God, don’t do this

The standard advice you see on the web is to use svnadmin dump on the old server, transfer the dump file, then use svnadmin load on the new server. That works fine for small repos, where the total transfer time will be negligible no matter how you do it, but for a large repo, it’s a disaster. Time estimates I’ve seen for this around the web say it’ll take roughly 1 hour per GB to dump, then another hour per GB to load… not to mention the fact that the dump files are larger overall than the raw filesystem data, so the transfer itself is slower!

Using rsync

Not wanting to spend ~150 hours on this for our 70 GB repo, I wanted to try moving just the raw filesystem data. This StackOverflow answer indicated such a thing could work using scp, but of course there are two scary things that could happen:

  1. What happens if you have a network connection hiccup mid-transfer? (Nobody wants to start over!)
  2. What happens if somebody adds a new commit while you’re working? (You could, at an organizational level, ask for a “lock” for the hours you need to do the transfer, but it’d be nice not to impede everyone’s work for that long.)

The solution, of course, is to use rsync! You run it once to transfer your directory initially, then obtain a lock; then you run it a second time to pick up any changes you missed from the first (long) transfer. Then it’s just a matter of upgrading the repo and preparing it for use.

So, from beginning to end, the complete steps are:

  1. On your old server:
    $ rsync -aPz /path/to/svn-repo/ username@newServer:/destination/path/
  2. Email your team to get a “lock” (no more committing to the old server!)
  3. Once more:
    $ rsync -aPz /path/to/svn-repo/ username@newServer:/destination/path/
    (picking up any changes that were made during the first copy)
  4. On the new server:
    $ svnadmin upgrade /destination/path/
    $ svnadmin verify /destination/path/

Notes on Game Programming Patterns by Robert Nystrom

Game Programming Patterns is a game developer’s guide to the design patterns most often useful in that domain. You can read it online for free.

Chapter 1: Intro

  • Architecture is a tradeoff between 3 types of “speed”:
    • speed of future modifications (making the program flexible),
    • execution speed (don’t want to overdo it and get into Java land, where there are 15 layers to get to a piece of code that actually does something)
    • short term development speed
  • Ideally, you’re decoupling and introducing abstractions in the places most likely to change, or most likely to need the future flexibility, but beware of overdoing it (YAGNI)

Chapter 2: The Command Pattern

  • “A command is a reified method call”—in other words, a callback/first-class function/function pointer/etc.
  • E.g., instead of binding button presses to actions directly, you could swap out the direct method calls for usages of a command object; then, the user could configure what happens when each button is pressed
    • Adds a layer of indirection between the press and the method call
    • Similarly, you could make the object being acted (e.g., a character) a parameter of the command; then, that same command class could be used to make any character do an action (could be used by the AI or a real player)
    • This is a bit like currying in Javascript—you wait to apply some parameters until just the right moment
  • Can also use this to decouple input from execution—input handler or AI generates commands, places them in a queue, and other code picks the commands off and executes them at its leisure
  • This is also how you implement undo/redo: you’ve got a long list of commands you’ve received, and each one has an execute() and an undo() method (since ideally it can apply just the changes it needs, rather than remembering the full state of the object at each time); then, you move forward in the queue by calling execute(), and backward in the queue by calling undo()
    • This same technique is useful for doing replays: you record the commands performed each frame, and to replay, you just run the normal game, executing the pre-recorded commands at the right time
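A minimal sketch of the execute()/undo() idea in C++ (the MoveCommand/Unit names are illustrative, not the book’s exact code). Each command applies just its delta, so undo doesn’t need a full snapshot of the object:

```cpp
#include <memory>
#include <vector>

struct Unit { int x = 0, y = 0; };

class Command {
public:
    virtual ~Command() = default;
    virtual void execute() = 0;
    virtual void undo() = 0;
};

class MoveCommand : public Command {
    Unit& unit_;
    int dx_, dy_;
public:
    MoveCommand(Unit& u, int dx, int dy) : unit_(u), dx_(dx), dy_(dy) {}
    void execute() override { unit_.x += dx_; unit_.y += dy_; }
    void undo() override { unit_.x -= dx_; unit_.y -= dy_; } // apply just the delta
};

// Undo history: move forward by calling execute(), backward by calling undo().
class History {
    std::vector<std::unique_ptr<Command>> done_;
public:
    void perform(std::unique_ptr<Command> c) {
        c->execute();
        done_.push_back(std::move(c));
    }
    void undo_last() {
        if (done_.empty()) return;
        done_.back()->undo();
        done_.pop_back();
    }
};
```

Recording the same command list with timestamps gives you the replay technique for free.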

Chapter 3: The Flyweight Pattern

  • Suppose you have a million quite heavyweight objects, all of which are similar in some regards (e.g., a forest of trees which all share the same mesh & textures)
    • Key observation: even though there are thousands of objects, they’re mostly similar
    • Pull out the data they have in common and move it into a separate class (e.g., TreeModel)
    • Only load one of those (and make it const, for Pete’s sake!)
    • All individual trees get a shared reference to the shared model
    • Shared stuff needs to be context free (hence const-ness)
  • Common application: instead of a bunch of functions to get particular properties, each of which has a giant switch on an enum (which has the problem that info about any given enum is spread way, way out), replace the enum entirely with a flyweight: one object of each type, which has its properties stored directly
    • Then, you store a pointer to the flyweight, rather than the enum
    • Advantage: hella less coupling; get to work with real objects that simply tell you about themselves
    • Advantages of OOP without creating a bunch of objects
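The TreeModel example from the chapter can be sketched like this (fields simplified to strings; a real model would hold mesh and texture data). The shared state is const, and every instance holds only its own context:

```cpp
#include <memory>
#include <string>
#include <vector>

// The data all trees have in common, pulled out into one shared object.
struct TreeModel {
    std::string mesh;     // stand-ins for heavyweight mesh/texture data
    std::string texture;
};

struct Tree {
    std::shared_ptr<const TreeModel> model;  // shared, context-free (const!)
    double x, y;                             // per-instance state stays here
};

std::vector<Tree> plant_forest(size_t count) {
    // Only load one model, ever.
    auto model = std::make_shared<const TreeModel>(TreeModel{"tree.mesh", "bark.png"});
    std::vector<Tree> forest;
    for (size_t i = 0; i < count; ++i)
        forest.push_back({model, double(i), 0.0});  // all trees share one model
    return forest;
}
```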

Chapter 4: The Observer Pattern

  • Defines a one-to-many dependency between objects so that when one object changes state, all its dependents are notified & updated automatically
    • Example: physics code notifying the achievement system—you really don’t want to couple the two, so the physics code says “I don’t know if anyone cares, but this thing happened; do with that what you will”
  • Typical parameters to the notification call are the object that sent the notification and a generic “data” field
  • Cost of this is negligible—but note that this occurs synchronously, so you have to be careful about blocking the main thread (if you need asynchronous, use the Event Queue pattern)
  • If you don’t want to keep a list of observers separate from the objects themselves, use an intrusive linked list, threading the subject’s list through the observers themselves
  • Note that you’ve got a design problem if two observers observing the same subject have some kind of ordering dependency between them
  • You may want a “last breath” notification if you’re observing an object that could be destroyed
  • Advantages: Extremely low coupling
  • Disadvantages: all coupling is visible only at runtime—hard to reason statically about who’s receiving notifications
    • In general, you don’t want to use this within a subsystem (where you typically need to reason about both sides of the communication to understand how something works), where you actually care; use it to communicate between independent subsystems
  • Could use a lighter weight solution: register member function pointers as observers instead of implementing an interface
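A sketch of that lighter-weight variant: observers are just registered functions rather than objects implementing an interface. Names here are mine, and note the notification is synchronous, as warned above:

```cpp
#include <functional>
#include <string>
#include <vector>

class Subject {
    std::vector<std::function<void(const std::string&)>> observers_;
public:
    void add_observer(std::function<void(const std::string&)> obs) {
        observers_.push_back(std::move(obs));
    }
    // "I don't know if anyone cares, but this thing happened."
    void notify(const std::string& event) {
        for (auto& obs : observers_) obs(event);  // runs synchronously!
    }
};
```

The physics code only knows about `Subject`; the achievement system registers itself and the two stay decoupled.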

Chapter 5: The Prototype Pattern

  • Key idea: Objects can spawn other objects similar to themselves.
  • Give base class (e.g., Monster) an abstract clone() method (factory method for duplicating “this” object)
  • This makes it easy to spawn new instances of each subtype ad nauseum
    • Spawner objects internally hold an instance of a subtype—a hidden one used as a template for all spawning
    • Doesn’t have to just spawn different types… can also spawn instances of the same type (e.g., Ghost) with different characteristics (fast Ghosts, slow Ghosts, etc.)—totally flexible!
    • Similarly, could do: Spawner * ghostSpawner = new SpawnerFor<Ghost>();
  • Most useful in the real world for data modeling
    • Just like we want to reuse code rather than copying & pasting, artists want to do the same
    • Scenario: designers need to specify attributes for monsters and items.
      • Commonly use JSON
      • Add support for a “prototype” field to support data reuse—any properties not in the object being declared fall back to its prototype
      • E.g.,
            {  
                "name": "goblin grunt",  
                "minHealth": 20,  
                "maxHealth": 30,  
                "resistances": ["cold", "poison"],  
                "weaknesses": ["fire", "light"]  
            }  
            {  
                "name": "goblin wizard",  
                "prototype": "goblin grunt",  
                "spells": ["fireball", "lightning bolt"]  
            }  
      • No need to set up any kind of “abstract” prototype… just delegate to the first concrete type we declared
    • This is a particularly good fit in games with one-off, special entities throughout the world
    • Makes it extremely cheap to create many tiny variations that make a world seem believable (e.g., don’t just give the player a basic sword 100 times throughout the game… give them the sword of extra damage, the sword of extra attack, the sword of lightning bolts, etc.)
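The clone() mechanism from the start of the chapter, sketched in C++ (Monster/Ghost and Spawner are the book’s examples; the fields are mine). The spawner holds a hidden prototype instance and clones it on demand:

```cpp
#include <memory>
#include <string>

class Monster {
public:
    virtual ~Monster() = default;
    virtual std::unique_ptr<Monster> clone() const = 0;  // duplicate "this" object
    virtual std::string name() const = 0;
};

class Ghost : public Monster {
    int speed_;
public:
    explicit Ghost(int speed) : speed_(speed) {}
    std::unique_ptr<Monster> clone() const override {
        return std::make_unique<Ghost>(*this);  // copies this ghost's characteristics
    }
    std::string name() const override { return "ghost"; }
    int speed() const { return speed_; }
};

// Spawner holds a prototype instance used as a template for all spawning.
class Spawner {
    std::unique_ptr<Monster> prototype_;
public:
    explicit Spawner(std::unique_ptr<Monster> proto) : prototype_(std::move(proto)) {}
    std::unique_ptr<Monster> spawn() const { return prototype_->clone(); }
};
```

A spawner built from a fast Ghost produces fast Ghosts; one built from a slow Ghost produces slow ones—same class, different prototypes.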

Chapter 6: (Problems with the) Singleton Pattern

  • Singleton is an easy way out to the question of “how do I get an instance?”
  • Some (rare) classes cannot operate correctly if there’s more than one instance (typically classes that interface with an external system that has its own global state—e.g., file writers, which need to protect against conflicting manipulations of the same file running asynchronously)
  • Thread-safe initialization in C++11:
    static FileSystem * instance()  
    {  
        static FileSystem * s_instance = new FileSystem();  
        return s_instance;  
    }  
  • Features:
    • Initialized at runtime (saves you from static initialization hell… but beware the issue of not knowing exactly when it might be initialized!)
    • Instance isn’t created if no one uses it
    • You can subclass it—e.g., different FileSystem initialization per platform, where the instance() method is defined #if PLATFORM == PLAYSTATION, #if PLATFORM == NINTENDO, etc.
  • Long-term issues:
    • It’s a global variable
      • Harder to reason about the code used by individual functions (ruins function purity)
      • Encourages coupling—the instance of our AudioPlayer is visible globally, so why not #include it in the middle of our physics code?
        • Controlling access to instances also controls coupling!
      • Not concurrent—prone to performance issues at best, synchronization bugs, deadlocks, race conditions, etc. at worst
    • “Solves” two problems—ensuring a single instance, and allowing global access—even if you only needed one of those
      • E.g., logging class—only needed global access. What happens when you want each subsystem to write its own log? Have to go change every call to Log::instance()!
  • Singletons are a way to avoid answering the question of how you architect your code
  • Lazy initialization takes away control
    • If you have a perf issue, need to be careful about when it’s initialized
    • Can’t control where in the heap memory gets allocated
    • Can solve this with static initialization… but if you wanted that, why not just make a static class in the first place and do away with the instance() calls altogether?!
  • What to do instead
    • See if you need the class at all. Many singletons are “managers” whose functionality should be baked into the thing they manage.
    • To ensure single instance: declare a static bool s_is_instantiated; assert() in the constructor that it’s false
    • To provide convenient access:
      • Pass it in! (Dependency injection)
      • Get it from a base class (e.g., have GameObject create a static instance so you don’t have to pass it to individual subclasses)
        • This is the “Subclass Sandbox”
      • Get it from something already global
      • Get it from a Service Locator
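The “ensure single instance without granting global access” suggestion is only a few lines—a static flag plus an assert in the constructor. A sketch (class name reuses the chapter’s FileSystem example):

```cpp
#include <cassert>

class FileSystem {
    static bool s_is_instantiated;
public:
    FileSystem() {
        assert(!s_is_instantiated);  // fires if a second instance is ever created
        s_is_instantiated = true;
    }
    ~FileSystem() { s_is_instantiated = false; }
    static bool instantiated() { return s_is_instantiated; }
};
bool FileSystem::s_is_instantiated = false;
```

Callers still receive the instance through dependency injection or a service locator; the flag only enforces uniqueness.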

Chapter 7: State

Allow an object to alter its behavior when its internal state changes. The object will appear to change its class.

  • Rather than an increasingly complicated series of nested conditionals, use a finite state machine.
    • E.g., the allowed actions for a character object change depending on its current state: standing, jumping, ducking, and diving all allow different actions
    • Complex branching and mutable state (fields that change over time) make for very error-prone code
    • Finite state machine characteristics:
      • fixed set of states
      • always in exactly one state at a time
      • events or input get sent to the state machine
      • each state has a set of transitions, and inputs can cause it to move to a different state
    • Naive implementation: a state enum, and a switch/case block within each of your methods (e.g., handle_input(), update(), etc.)
      • Downside: all the logic for a particular state gets spread out across many methods
    • Improvement: use dynamic dispatch (i.e., a virtual method)!
      • use a base CharacterState class, and create a separate class for each state (WalkingState, JumpingState, etc.) which implement the methods you need (handle_input(), update(), etc.)
      • Then, the Character object just has a state member variable which it delegates to—so Character::handle_input() calls m_state->handle_input(), etc.
      • Changing state is just a matter of pointing your state member to a new state object
      • Features:
        • Keeps all the logic related to a certain state together
        • Allows you to have member variables private to a given state
      • Alternatively, if your state objects have no fields, only a single static method, you can replace this with a state function pointer
      • How do you get a state object?
        • Create a single, static object and reuse it (if it has no fields beyond its virtual method implementations)—this is the Flyweight pattern (Chapter 3)
        • If you need to instantiate the states, have the states’ virtual handle_input() method return a pointer to the new state you should transition to
    • Further improvement: Enter & exit actions
      • Get called immediately before/after changing state
      • Give you a place to consolidate code that has to occur on transitions (regardless of what you’re going to/from)
      • May want to give them your main object (e.g., your Character) as a parameter so they can manipulate it
  • Issue with FSMs: they’re a poor fit for more complex problems (like game AI)
    • Possible solutions:
      • Concurrent state machines (e.g., one state for the character itself, one for its equipment, etc.)
        • If you have n states for the character and m states for its equipment, a single combined machine requires n × m states, but with concurrent state machines, it’s only n + m
        • Probably want a way for one state machine to consume an input so that the other doesn’t receive it (preventing both from erroneously responding to the same input)
      • Hierarchical state machines
        • Use inheritance to share implementation between similar states (e.g., standing, walking, running, sliding could all inherit from an “on the ground” state)
        • Thus, you get superstates and substates
        • If a substate doesn’t handle a given input, it’s responsible for passing it up the chain
        • Beware: inheritance is powerful, but it leads to very strong coupling. “It’s a big hammer, so swing it carefully.”
        • If you don’t want to use inheritance, can model a chain of superstates explicitly with a stack of states—go through the stack, and keep offering the input until one of the states in the stack consumes it
  • More issues: lack of history
    • An FSM knows which state it’s in, but not where it came from—no good way to go back to a previous state
    • Where a single FSM has a single pointer to a state, a pushdown automaton has a stack of them
    • Where an FSM always replaces its current state during a transition, a pushdown automaton can do that, or it can instead push a new state onto the stack, or pop the topmost off (transitioning to the previous state)
    • E.g., you transition from standing to firing; when firing is done, you need to know where you came from (a pop operation)

Chapter 8: Double Buffer

Cause a series of sequential operations to appear instantaneous or simultaneous.

  • Classic use case is computer graphics, of course—can’t have the video driver reading from the buffer at the same time you’re writing to it, or you get tearing
    • State gets modified incrementally throughout the game loop
    • May be “read” by the video driver in the middle of modification
    • Need to prevent the video driver from reading our half-written copy
    • Note that the swap operation must be atomic (or else you’re right back where you started!)
  • Typical implementation: two separate buffers in memory
    • The “swap” is just changing a pointer to point to the new “active” buffer
    • Important to note that outside code can’t store persistent pointers to the buffer
    • Can be useful to note that existing data in the “write” buffer comes from two frames ago—can use this to do motion blur (huh!)
  • Not just for graphics! Consider an AI, where you want all actors to appear to update simultaneously
    • In this case, each actor has an update() method, and a swap() method
    • In the swap() method, each actor changes their “next” state to their “current,” and resets their “next” state to default
    • Then, the “manager” object calls update() on all actors, then swap() at the end of the frame
    • This is a case where we copy the data between buffers (for every object!) rather than simply swapping a pointer
    • Note that data in the active buffer is only a frame old (this could be useful)

Chapter 9: Game Loop

Decouple the progression of game time from user input and processor speed.

  • At its most basic, this looks like:
        while (true)  
        {  
            processInput();  
            update(); // Cf. chapter 10 on the Update pattern  
            render();  
        }  
  • Each frame, your game needs to process user input (if applicable), but of course you don’t want to wait for input
  • Old games were designed for specific hardware, so the game did just enough work to run at the desired speed.
    • Aside: Older PCs came with “turbo” buttons: they were too fast to play old games, so you would turn the turbo mode off (it was on by default) to slow down the PC and make old games playable!
  • Game loop’s job is to run the game at a consistent speed regardless of the underlying hardware.
    • Tracks the passage of time each frame in order to control the rate of play
  • Bad implementation: variable/fluid time step
    • Make the time steps grow or shrink to match the real processing time required
    • Each cycle takes a certain amount of real time to process; just make the game time advanced be less than or equal to this!
    • An object moving across the screen has its velocity scaled by the elapsed time each frame
    • Good: the game is playable on hardware of differing speeds, and faster machines get smoother gameplay (better than simulating at a low frame rate and sleeping until you need to render the next frame)
    • Bad: the game is non-deterministic and unstable
      • An object moving at the same speed from the same starting location—but at different frame rates—will end up in a different place due to accumulated floating point rounding errors!
      • This primarily affects physics and networking
  • Better implementation: Fixed update time, variable rendering
    • Update using a fixed time step, but render only when we have time
    • Each cycle, we “catch up” to the real time elapsed using a series of fixed steps
      double previous = getCurrentTime();  
      double lag = 0; // how far game clock is behind real world  
      while (true)  
      {  
          double current = getCurrentTime();  
          double elapsed = current - previous;  
          previous = current;  
          lag += elapsed;  
          processInput();  
          // Update in fixed-length intervals until we (nearly) catch up to real time  
          while (lag >= MS_PER_UPDATE)  
          {  
              update();  
              lag -= MS_PER_UPDATE;  
          }  
          render();  
      }  
    • Important points:
      • Time step is no longer locked to frame rate; typically want to update faster than 60 FPS, but not so short that slow machines can’t get through an update() cycle fast enough
        • Safeguard against a too-slow update() by making the game bail after too many iterations through the loop
      • Rendering is typically much cheaper than updating, so by taking rendering out, you buy yourself a lot of CPU time
    • The challenge: rendering often comes between update() ticks (i.e., there is residual lag when render() is called)
      • Left uncontrolled, this makes for stutter
      • Solution: pass lag / MS_PER_UPDATE into render() and have it extrapolate locations based on current velocities (might draw a bullet half a frame ahead)
  • Other considerations:
    • Power usage: may clamp the frame rate to save on battery
  • Further reading:


Highlights from Robert C. Martin’s Clean Code

This page collects the things I found really insightful in Martin’s Clean Code. By “insightful,” I mean things I didn’t already practice as a programmer with a couple years of experience. Thus, I’ve skipped over tips like “don’t be afraid of long variable names” in favor of things like “functions should act at one level of abstraction.”

Chapter 1: Introduction

You know you are working on clean code when each routine you read turns out to be pretty much what you expected.

—Ward Cunningham

In other words, clean code reads as though it is obvious, simple, and compelling.

It’s the difference between watching a professional athlete (who makes the job look easy) and an amateur (who makes you appreciate how difficult the sport really is). Your code should give your readers the sense that the problem you’ve solved is unexpectedly easy (even when it isn’t).

Chapter 2: Variables

Add meaningful context

A variable’s name might be ambiguous if the context isn’t clear. Consider a variable called state; when used alone in a function, it’s not clear that it is part of an address. Consider either

  • wrapping these variables in a class (such that their context is always clear), or
  • adding a prefix to the variable name so that the reader will at least be pointed to the larger structure (e.g., use addrState instead of state).

Chapter 3: Functions

Functions should act at one level of abstraction

It’s confusing when a function mixes multiple levels of abstraction—for instance, if it acts at a very high level of abstraction by calling getHTML() and a very low level of abstraction by manipulating low-level strings.

When this happens,

Readers may not be able to tell whether a particular expression is an essential concept or an implementation detail.

Thus, you should hide steps at different levels of abstraction in different functions.
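A tiny sketch of what that looks like in practice (function names are mine, not Martin’s): the top-level routine reads as a sequence of same-level steps, and the string fiddling is hidden one level down.

```cpp
#include <string>

// Low-level detail: how strings get glued together.
std::string wrap_in_tag(const std::string& text, const std::string& tag) {
    return "<" + tag + ">" + text + "</" + tag + ">";
}

std::string render_title(const std::string& title) {
    return wrap_in_tag(title, "h1");
}

// High level: reads as what the page *is*, not how it's assembled.
std::string render_page(const std::string& title, const std::string& body) {
    return render_title(title) + wrap_in_tag(body, "p");
}
```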

Get rid of function parameters—especially low-level ones

The more arguments a function takes, the harder it is to use (it requires more mental energy) and test.

If an argument is at a different level of abstraction (e.g., getUserData( StringBuffer htmlBuffer )), it’s even worse—the user is forced to know an implementation detail in order to invoke the function!