When expense management company Expensify first designed its database architecture, it was building its own corporate card for the masses. There were strict requirements to hit to work with financial institutions: response times within milliseconds, multiple servers replicated in real time and every transaction logged and authenticated. At the time, it felt like overkill for a startup to build a high-level, enterprise-grade architecture. When it pivoted to expense reporting, it still had a robust technology at its disposal. Startups often don’t know until later just how valuable their early decisions turn out to be. Only in hindsight did David Barrett realize how the early constraints on Expensify’s database architecture led to many of its key competitive advantages.
Over eight years as the founder and CEO of Expensify, Barrett has seen how over-investing in early database design pays off, not just in the product’s functionality but also in enabling a radical business model that acquires customers very differently than its competitors do. Expensify grew over 120% in 2015, nearly 70% more than its closest competitor. The expense management startup now assists over 20,000 companies. Barrett credits a large part of the company’s ability to scale to its technology decisions, in particular its unusually powerful database architecture.
In this exclusive interview, Barrett explains how startups fall into damaging defaults when it comes to database architecture and deconstructs the specific mistakes to avoid. He shares the three key steps to follow to set yourself up for both technological and business-model scale and success. Lastly, he shares how to course-correct midstream if you need to change your database architecture.
Where Many Startups Falter With Database Architecture
Startups are told they are supposed to be “data-driven”, but they rarely know what that means at the time they are making their key technology decisions. “When you first start, you have no data and no customers. Everything you do easily fits in a single database, powered by a single web server. You choose the tools you’ve used before—or are currently in vogue—and hope for the best,” says Barrett. “As you grow, very slowly you start to need multiple servers because either the performance or reliability of a single server is no longer adequate. So one by one you shoehorn another server into your sole data center. Bit by bit, you add capacity as you scale.”
Here’s why that approach traps many startups:
Squirrels abound. In Pixar’s movie Up, the protagonist's talking golden retriever, Dug, stops mid-sentence to avert his eyes and exclaim “Squirrel!” whenever one crosses his line of vision. Don’t be Dug. “Most companies start with whatever is the most familiar to them. This is a safe choice, and it’s easy to argue for minimizing risk at the start. Unfortunately, people can only be familiar with the past — and the world belongs to the future,” says Barrett. “The bolder companies step back and see what’s new and better than they are familiar with. This can be good. But too many companies mistake ‘trendy’ for ‘better’, and choose unproven technologies that just happen to be popular when they are deciding what to use. This can kill your startup, in a slow and painful way.”
Years ago, Barrett once worked at a company that tried to build an earlier version of Dropbox. “All the technology was there, but they insisted on using this horrible P2P technology called JXTA rather than a simple series of centralized servers. Despite us having no customers—and thus no data to store—our CTO bet the company on a fancy distributed system claiming it was the only way to scale to petabytes of data,” says Barrett. “Like most sexy, ‘hyper-scalable’ solutions, in practice it couldn’t even scale to megabytes because it was a nightmare to use and had never been used in a production environment. The company pivoted several times, and then four years later Dropbox built a much more obvious solution involving a more sane mix of new and old—and took over the world.”
The key challenge for any new startup is to retain the benefits of the past with a dash of the future so as to stay cutting-edge without getting killed by the bleeding edge.
You’re not Google. That’s not to say you aren’t able to — or will — be a giant like Google. But don’t start with that assumption. “What's popular is usually some very novel technology driven by a sexy company with very complex needs that appear at scale. The likelihood that your needs as a brand-new startup are the same as Google’s is slim,” says Barrett. “There are good databases to choose out there such as Postgres, but they’re losing their shine because they’re too boring. It’s much more exciting to say, ‘Google uses this really diverse database technology. I want to be like Google so I’m going to use its technology.’”
But people forget that the reason Google developed its technology foundation the way it did is because it’s running the largest search engine in the world. “The technology that powers a search engine is not the same as that which powers a lot of other businesses. Google needs to absorb an enormous quantity of data every day, of which only a tiny fraction is changing or interesting. Back in the day, Google would essentially make a copy of the internet each day. This means it was able to use a method of data-indexing that is very fast to search and add data to, but extremely slow to selectively remove or change data once added. That’s fine for Google, because every day they’d just throw the whole thing out and start from scratch again,” says Barrett. “The vast majority of businesses aren’t doing something like this. Instead, they’re slowly accumulating a much more dense set of data — accounts, passwords and editable documents — that you can’t just throw out and start over, but need to update going forward over time.”
Is your startup making a copy of the internet daily? If not, stop trying to emulate Google’s database architecture.
How To Go In Eyes Wide Open on Database Architecture
The danger of copying technology foundations that are popular, or that belong to companies you admire, is that those companies are likely very different from yours. The decisions they made can place constraints on you. While they can navigate around those constraints, the same constraints may become showstoppers for you. Here’s how to mitigate that risk when designing your database architecture.
Do your math before specs seduce you. It’s easy to be impressed by the performance of databases without thinking of what you actually need. “Don’t swoon when you hear a database can be split over a thousand computers and process petabytes of data. First, ask yourself: ‘How much data do I really expect to have?’ The likelihood that you’ll need even within 1000 times that much data is very unlikely,” says Barrett. “Even today, if we didn’t care about reliability and maintainability, we could probably run all of Expensify from a single 5-year old database server. People forget just how ridiculously powerful and cheap computers are today. NASA went to the moon with a computer 10,000 times less powerful than your iPhone, and even a cheap server can process 1000 times more data than that. For the average startup, the likelihood that it’ll need more than a single server capacity-wise is very low for years.”
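To make "do your math" concrete, here is a minimal back-of-envelope sketch. Every input below (user count, records per user, record size, disk size) is a hypothetical placeholder rather than a figure from Expensify; the function names are also made up for illustration.

```python
def projected_storage_gb(users, records_per_user_per_year, bytes_per_record, years):
    """Rough total storage after `years`, in gigabytes (1 GB = 1e9 bytes)."""
    total_bytes = users * records_per_user_per_year * bytes_per_record * years
    return total_bytes / 1e9


def fits_on_one_server(estimate_gb, server_capacity_gb):
    """True if the estimate fits within a single server's storage."""
    return estimate_gb <= server_capacity_gb


# Hypothetical startup: 50,000 users, 500 records each per year,
# about 2 KB per record, projected over 5 years.
estimate = projected_storage_gb(
    users=50_000,
    records_per_user_per_year=500,
    bytes_per_record=2_000,
    years=5,
)

# Assumed disk size for one modest server; substitute your own number.
print(estimate, fits_on_one_server(estimate, server_capacity_gb=500))
```

If the estimate lands far below what one machine can hold, capacity is unlikely to be the constraint that should drive the database choice.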
Prioritize maintenance over capacity. “Today, the cheapest Dell server available has a dual-core 2.8GHz processor and 500GB of RAID storage for just over $700. Upgrade to an 8-core 3.7GHz box with 64GB of RAM and 10TB of storage for about $3,200 — far, far less than Expensify spends on coffee every month. Most startups will fail or be acquired before outgrowing the cheapest server on the market — especially since storage costs are falling far faster than most startups grow. And even though EC2 charges a premium for being cloud managed, it’s still a steal at twice the price. Regardless of whether you rent or buy your hardware, capacity is not and will never be a real problem for the overwhelming majority of startups. But maintenance will be. If everything goes wrong on your single server — or you just need to upgrade — what happens when you reboot it? Your entire service vanishes. Bottom line: computers are absurdly cheap — get multiple, not for capacity, but for redundancy.”
Split hairs on security. “Most databases can be accessed in two ways. The first is doing a generic query upon the database. The other way is via what's called a stored procedure. Granted, all databases have security measures, but they’re not created equal,” says Barrett. “Most of the trendy, new databases rely upon security built into web server code. The problem with that is your web servers are the most likely to be compromised, because they sit directly on the internet. And if they do, they can bypass the security measures and get unrestricted access to the database. A stored procedure executes inside the database itself, enabling you to build security into the database layer, outside the reach of web hackers. Unknowingly, the vast majority of startups — especially consumer companies — are choosing technology that simply can’t be made in a secure fashion. And once they do have customers, the cost to retrofit on security without downtime is so much more expensive than had they just done it from the start.”
Let’s buy a safe once we’ve got cash piled up in stacks. That’s how startups often think about database security.
Here’s What To Do
Early-stage, growth-stage. Consumer, enterprise. Every startup has its set of circumstances, but Barrett contends that there’s a rule-of-thumb plan that can serve a spectrum of technology companies seeking intelligent and intentional database architecture. Here’s how to do it:
Start with three data centers. Given that today’s technology makes it possible, Barrett believes that every startup should start in three data centers on Day One. “Three is the magic number,” he says. “One isn't good enough, because it's only a matter of time before it goes down. The internet will die or the power will fail. Whatever it is, something will go wrong, so a single database won’t cut it. The problem with two is that you're vulnerable to what's called a ‘split brain syndrome.’ If the servers at each data center lose contact with each other, it's unclear if the other's gone entirely or just inaccessible temporarily. If, at the same time, they both think that the other one's down, then they both think that they're in charge and duplicate efforts. In our world, that might mean that they both reimbursed the same expense report and double paid. That’s not good.”
With one clock, you always know the time. With two different clocks, you never do, because you don’t know which is right.
Since one data center is susceptible and two can easily cross wires, start with servers in three data centers. “If you have three, that means at any point, you can lose an entire data center, and there's still two remaining. Two can be a quorum and determine a decision,” says Barrett. “This may sound like a lot to set up on Day One, but it could be worse. Tackling this problem upfront is much better when no one’s breathing down your neck. Investors aren't upset with you or the customers aren't freaking out. Build the foundation to scale upfront. Get on three different data centers or three different availability zones in AWS. It's cheap and easy. It just takes foresight.”
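The quorum argument can be sketched in a few lines. This is a simplified illustration of a strict-majority rule, not Expensify's implementation; the function name and the way reachability is counted are assumptions made for the example.

```python
def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    """A node may make decisions only if it can reach a strict majority
    of the cluster, counting itself."""
    return reachable_nodes > cluster_size // 2


# Three data centers, one goes dark: the two survivors still see each other.
print(has_quorum(reachable_nodes=2, cluster_size=3))  # majority, keep serving

# The isolated data center sees only itself, so it must stand down.
print(has_quorum(reachable_nodes=1, cluster_size=3))

# Two data centers that lose contact: each sees only itself.
print(has_quorum(reachable_nodes=1, cluster_size=2))
```

With two data centers, a network split leaves neither side with a majority, so the choice is between stopping entirely or letting both sides act and risking the double payment Barrett describes. With three, any single failure still leaves a pair that can agree.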
Find and use a replication technology. Why don’t more startups start with three data centers? Building for three data centers means that on Day One — before you have any data — you need to deal with a replication problem. Each server in each data center needs to continuously share data with the others, such that each server shares the same level of information.
“But the classic technology for replication is really optimized for having a standby or backup server that everybody ‘fails over’ to when the primary is down. This failover process is either manual or hacked together with custom scripts that ‘drop’ requests along the way. Either way, failure is a serious all-hands-on-deck problem that affects live customers,” says Barrett. “More modern solutions are built with replication and failover in mind, such that a server going down is an unremarkable occurrence with no dropped requests, no manual action required, and no customer impact. Furthermore, unlike the classic solutions designed for very fast, very reliable networks inside a single datacenter, new solutions are optimized to work across relatively slow and unreliable internet connections between datacenters around the world.”
Part of the challenge is the predominance of an outdated relational database management system. “If you try to do it with MySQL, it’s going to be very hard because it’s an old database designed for a time when disks were super slow and small. The file system couldn’t be larger than 4GB. It was designed for a completely different world than the one we live in today,” says Barrett. “Today, everything's on SSDs or even cached in RAM — and therefore super fast. That’s the conundrum with MySQL. It’s still in use but optimized for a series of constraints that no longer exist, so it has all of the baggage of the old world, but it doesn't really have the advantages of the new world.”
There are tools that can help synchronize servers in different data centers, but the open-source, easy-to-implement ones are just emerging. “It’s still tough to launch with three data centers with some of the classic solutions, and almost impossible back in 2007 without something huge like Oracle. To be fair, Percona might have worked, but it was so new I didn’t discover it at the time,” says Barrett. “Since I had a background in P2P software, we decided to build a solution ourselves. The resulting technology, called Bedrock, makes it easy to operate a real-time replicated, geo-redundant distribution without the complexity of Oracle or MySQL. So far as each server is concerned, it just has a single database in its local environment, but the technology links them and takes care of all the replication. And just for kicks, we also made it speak the MySQL protocol so it can nearly be a drop-in replacement.”
Expensify plans to give this technology away for free given the few options that existed when it was looking for a solution. “Initially it was just a proprietary solution to a specific problem we couldn’t find off the shelf. But over time, we realized just how powerful it was, and how really anybody could use it. Because the idea of seamless clustering with automatic, lossless failover is relevant to anyone who cares about performance and availability — which is about everyone,” says Barrett. “But very few people would’ve had a reason to build it themselves because its core technology, the Paxos distributed consensus algorithm, is really tricky to get right. This is what enables a group of otherwise equal servers to reliably elect a master to orchestrate distributed transactions and rebalance traffic, and to do it within milliseconds of the failure occurring. But we’ve spent eight years honing it through several orders of magnitude, so it’s pretty hardened against a wide range of real-world conditions and thus we think ready for broader use.”
Decide whether or not to partition your data. After selecting three servers from different data centers and selecting a replication technology, determine whether or not you’ll partition your data. Partitioning involves breaking up a database into distinct independent parts of the whole. This choice affects a lot of future decisions.
With partitioning, ask: do I want to allow every user to share data with every other user or split them into disjointed groups?
“Virtually everybody starts out with ‘disjointed groups,’ because it’s conceptually the simplest. Whether you have a consumer product for individuals or enterprise products for companies, the relationships between users seem pretty obvious at the start,” says Barrett. “For example, if you’re doing enterprise document storage, it may initially seem obvious that you only need to share documents between employees inside the company. You might even sign contracts with enterprise customers requiring that data be physically isolated — after all, what’s the harm?”
The risk is one day you encounter an unexpected use case that links people who previously you thought were unlinkable. “Imagine a law firm works for two different clients, each of which is hosted on two different databases. Suddenly the ‘disjointed groups’ model breaks down, because putting the law firm on one database means it can’t ‘see’ the documents in the other. Your technology is now blocking your product from supporting this key use case — and you might’ve cemented this technology in place by signing enterprise agreements that depend on it.”
The alternative approach is to assume out of the gate that any two users might someday want to share data and thus design for a single shared database from the very start. “This puts you on a very different technology path, because you can no longer just ‘throw hardware’ at the problem. If your database gets full, you can’t just spin up a new one for your next batch of customers, you need to find a way to upgrade the entire thing — and do it without taking it down for everyone,” says Barrett. “The upside of maintaining a single continuous database for all your users is that you eliminate constraints of who can share with who, whether right away or any time in the future. The downside is a single giant database is much, much harder to maintain than a bunch of small databases — especially if you’re using a classic database solution.”
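The law firm example can be made concrete with a toy model. The names, documents and data layout below are all hypothetical and exist only to contrast the two designs; a real system would use database tables and access controls rather than Python dictionaries.

```python
# Design 1: disjointed groups. Each client company lives in its own database.
partitioned = {
    "db_a": {"contract-a"},
    "db_b": {"contract-b"},
}


def visible_docs_partitioned(home_db):
    """A user sees only documents stored in the database they were placed on."""
    return set(partitioned[home_db])


# Design 2: one shared database. Access is a per-document list of users.
shared_access = {
    "contract-a": {"alice@client-a", "lawyer@firm"},
    "contract-b": {"bob@client-b", "lawyer@firm"},
}


def visible_docs_shared(user):
    """A user sees every document whose access list includes them."""
    return {doc for doc, users in shared_access.items() if user in users}


# The law firm was placed on db_a, so client B's documents are out of reach.
print(visible_docs_partitioned("db_a"))

# In the shared design, the same lawyer can be granted both clients' documents.
print(visible_docs_shared("lawyer@firm"))
```

In the partitioned design there is no place to record that the lawyer belongs to both groups, which is the constraint Barrett warns about. The shared design removes that limit, at the cost of running one large database.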
A Primer to Course-Correct
Not every startup will have the luxury of starting from scratch. If you’ve already made some of these other choices — intentionally or unknowingly — there are ways to get on the right track. It may take some effort to hit the railroad switch, but the train hasn’t left the station.
Choose a database that replicates across multiple availability zones. “Most startups will say, ‘I’ve got a single EC2 instance which runs my web server, a single RDS instance which is my database, or I store some data in S3. That’s all in a single availability zone.’ Don’t paint yourself into a corner on day one: build to support multiple availability zones with Amazon or distinct data centers.”
Try Expensify’s replication technology. “Switching from MySQL to Bedrock is easy: if you’re small, you can likely take down your server in the middle of the night, export your data, re-import it into Bedrock and your web servers won’t know the difference because Bedrock speaks the MySQL protocol.”
Start with stored procedures. “Ask yourself: ‘How bad would it be if a hacker rooted my webserver?’ If the answer is ‘real bad’, then move your authentication logic to a stored procedure inside the database. It’s more secure, maintains better layering, and higher performance for the end user.”
Bringing it Together
Most wouldn’t describe database architecture as sexy. But it’s vital to get right before you have customers or their data. How you organize, scale and secure this data has an impact on not only your technology but also the scalability of your business model. Start with at least one server in each of three different data centers or availability zones. Choose a replication technology that allows them to precisely and reliably communicate with each other. Examine the relationship between your users to decide whether you should partition your data. Consider stored procedures to secure the data you have. The biggest challenge is not making a decision to take this approach, but getting comfortable veering away from popular database management systems like MySQL.
“Don’t make decisions around database architecture as if you’re Google. But also don’t put constraints on your startup that’ll eliminate its chance to be ‘a Google’ in the future. The truth is that many startups will fail well before some of these issues matter. But ask yourself: are you optimizing for success or failure?” asks Barrett. “It was rigorous requirements and chance early on that got us thinking more thoughtfully about database architecture. We now know that foundation helped us not only scale with the requirements we anticipated, but support all the amazing deep data and artificial intelligence opportunities we could have never imagined. That’s only possible given the decisions we first made with our database architecture, which keep all our data in one giant bucket that can be sliced, diced, and intermingled in any conceivable fashion. The alternative is grim. If you don't realize you're building a prison on your first day, those bars get really hard to move.”