
AuthZ: Carta’s highly scalable permissions system


How we built an authorization system based on Google Zanzibar

Permissions, also known as authorization, is the process of granting access to resources in your system. For any team, it’s crucial to get permissions right. At Carta, where we work with financial data all day, it’s the most important thing.

But we had a problem. Instead of maintaining one legacy system, we were maintaining five. The permissions could conflict — and they were impossible to extend. Our business needs were growing, and we had several new products in the funnel.

It was obvious we had to build a new authorization system to create leverage for engineering and product. We knew we needed it to be three things:

  1. Scalable
  2. Fast
  3. Generic enough for any new product needs

Sounds simple enough, but in reality it’s not that easy. In my career, I’ve seen permission systems that are too simple: they lack the features to support fine-grained access on single resources. I’ve also seen systems that are too complex, where one small change might unravel a whole policy of attribute-based permissions.

In this article, we’ll look at how my team — Identity and Access Management — took a creative approach to avoid those pitfalls by rebuilding Carta’s permissions system based on Google Zanzibar.

Experimenting with tokens

Our work began in mid-2019 when we started decomposing our monolithic application. Engineers were developing decomposed services and needed a way to authorize incoming requests. Permission data was still coupled to the legacy permission system in the monolith.

Our first experiment was a distributed, token-based access control system. We passed user permission data between services with an encoded JWT. Distributed tokens made access checks fast, but there was a huge cost to build them. Token build times had a very long tail and caused performance issues for power users with large permission sets.

A snapshot of token build times

Some tokens reached almost 1 MB (!!) in size. Almost every request payload had an embedded token, and response times suffered as large tokens passed between services.
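
To illustrate why the tokens ballooned, here is a minimal sketch; the claim shape is hypothetical, not Carta’s actual token, and it assumes the PyJWT library:

import jwt  # PyJWT

claims = {
    "sub": "user:bob",
    # One entry per (resource, action) pair the user can access. A power
    # user with tens of thousands of grants pushes the token toward 1 MB.
    "permissions": [
        {"resource": f"portfolio:{i}", "actions": ["view", "edit"]}
        for i in range(10_000)
    ],
}

token = jwt.encode(claims, "secret", algorithm="HS256")
print(f"encoded token size: {len(token) // 1024} KiB")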

Additionally, the authorization token was not extensible. Service owners had to build their own system for new permission sets. We decided to scrap the authorization token and investigate other options.

Navigating complex permissions

During our discovery process, we evaluated both open source and vendor products. Many of the projects we evaluated were not granular enough to handle our use cases, so we ruled them out. We then turned our focus to avoiding vendor lock-in.

We identified a promising project named Open Policy Agent (OPA). OPA is a framework that provides a single interface for authorization checks.

OPA serves as a system of record for authorization policies. Domain services push code-defined permission policies to OPA. Network traffic passes through OPA and policies are used to either authorize or deny incoming requests. OPA is built for distributed systems, so it scales well across a cluster.

OPA has some downsides. It requires a large amount of self-managed infrastructure. Permission checks are also slow. With OPA, the consumer controls the underlying data source. You store policy code in OPA, but policies query domain-level data for access checks. This worked fine for simple permissions, but complex ones caused major issues for us.

Complex permissions often queried several data models and added hundreds of milliseconds to response times. That wasn’t going to work for us.

Role explosion

Other projects failed to meet the requirements for our data because they operated on role-based access control (RBAC). RBAC grants access to users via permissive roles. Roles break down with large data sets due to a phenomenon called “role explosion.”

Role Explosion

Role explosion is caused when users have many roles for several different system entities. (For example, if a user has four roles for 10 different entities, the user has 40 managed roles.) This number has no upper bound and can become millions of managed roles if there are thousands of entities in the system.

At Carta, user permissions are complex because they’re scoped to portfolio assets. A user has access to their individual portfolio for a set of shares they own. They also may access securities through a fund that is invested in the company. Or the fund might even invest in another fund which owns shares in the company. On top of all that, the company might have access to the user’s portfolio so it can administer securities.

Users hold a permission for each portfolio they can access. With millions of portfolios and securities on the platform, our permissions system has to manage a huge number of individual roles.

Enter: Zanzibar

In mid-2019, Google released a white paper titled “Zanzibar: Google’s Consistent, Global Authorization System.”

Google uses Zanzibar to service millions of authorization requests per second. Several high-traffic products, including YouTube and Drive, use Zanzibar for authorization. Zanzibar is powerful because it is scalable, flexible, and fast.

One of Zanzibar’s core features is a uniform language that is used to define permissions. Zanzibar consumers use the uniform language to build Access Control Lists (ACLs). ACLs are like Unix file permissions: they give users access to individual resources in the system.

With Zanzibar, services compose abstractions for user permission groups. User permission groups can compose each other. They also grant access to low-level resource ACLs.

Unlike OPA, Zanzibar is the source of truth for the data. Consumers simply add permissions without thinking about how to implement fast lookups.

Zanzibar met all our requirements. It was scalable, fast, and generic.

But we had a problem: There weren’t any open source implementations available. We decided to build our own system, adding several of our own modifications to make it an even better fit for Carta.

Our implementation

At this point, I’ll start referring to our next-gen authorization system as AuthZ. AuthZ stands for authorization.

AuthZ configures permissions by adding and removing RelationTuples. A RelationTuple is a tuple consisting of an actor, relation, and object.

  • Actor: The entity performing the action (e.g. User:Bob). An actor can also be a grouping of entities (e.g. Group:7 Members). It can even be a non-user (e.g. Service:Notification-Service).
  • Relation: The action performed on an object (e.g. view, edit, add_card, etc.).
  • Object: The entity the actor is acting upon (e.g. Corporation:7).
A RelationTuple is represented as two nodes in a permissions graph
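
As a minimal sketch (mine, not Carta’s actual schema), a RelationTuple can be modeled as three freeform strings:

from dataclasses import dataclass

@dataclass(frozen=True)
class RelationTuple:
    actor: str      # e.g. "user:bob", or a userset like "group:7#member"
    relation: str   # e.g. "view", "edit", "issue_certificates"
    object: str     # e.g. "corporation:7"

# "User:Bob is a member of Group:7, and members of Group:7 can view
# Corporation:7": two direct relations that together imply indirect access.
tuples = {
    RelationTuple("user:bob", "member", "group:7"),
    RelationTuple("group:7#member", "view", "corporation:7"),
}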

Each part of the RelationTuple is freeform, so we don’t limit the type of permissions that you can set with AuthZ. For example, “read” and “issue_certificates” are both valid permissions.

Consumers combine RelationTuples using a domain-specific language. A RelationTuple creates direct access between an actor and an object, and chains of RelationTuples automatically imply indirect access. I’ll explain later how this makes AuthZ so powerful.

  • Direct Relation: A row in the database that has been explicitly set by a client
  • Indirect Relation: A relationship that is not stored as a row in the database, but is implied by a path through the graph
Direct relations vs. indirect relations

RelationTuples construct a permission graph. Consumers query the permission graph to determine if an actor has access to an object. An actor has access to an object if there is either a direct or indirect relationship between the actor and the object. (A good mnemonic for this process: “Actors act on Objects.”)
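
Building on the RelationTuple sketch above, a toy access check can walk the graph from the object back toward the actor. This illustrates the semantics only, not AuthZ’s real algorithm:

from collections import deque

def check(tuples: set, actor: str, relation: str, obj: str) -> bool:
    # Start from every subject that directly holds `relation` on `obj`.
    frontier = deque(
        t.actor for t in tuples if t.relation == relation and t.object == obj
    )
    seen = set()
    while frontier:
        subject = frontier.popleft()
        if subject == actor:
            return True  # found a direct or indirect relation
        if subject in seen:
            continue
        seen.add(subject)
        # A userset like "group:7#member" expands to every actor holding
        # the "member" relation on "group:7".
        if "#" in subject:
            group, _, rel = subject.partition("#")
            frontier.extend(
                t.actor for t in tuples if t.relation == rel and t.object == group
            )
    return False

# user:bob has no direct "view" tuple on corporation:7, but the graph
# grants indirect access through group:7.
assert check(tuples, "user:bob", "view", "corporation:7")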

Traversing the graph is expensive, so AuthZ maintains a custom secondary index. The secondary index answers access checks on the permissions graph in under 10 ms.

The architecture of this system is out of the scope of this article. We’ll post about it later, so follow Building Carta for updates.

Building an MVP

At Carta, we start every new project by building a minimum viable product (MVP). Internal services are no exception. MVPs help your team collect valuable early feedback. Early feedback prevents your team from spending time building the wrong features.

Our goals were:

  • Validate latency metrics
  • Collect customer usability feedback
  • Identify risks and roadblocks

Our team started by gathering requirements from our consumers. At that time, most domain services still used the authorization token to query legacy permissions. Most teams wanted a way to build new permissions.

We started implementing a basic version of the index algorithm from Zanzibar. Our first goal was to test this system on new permissions.

Our initial tests

Speed was one of the primary requirements for AuthZ. We built a prototype of the Zanzibar index to verify that we could achieve a similar level of performance.

Once we built the prototype, we began testing. We conducted several tests with our version of the Zanzibar index. We compared AuthZ’s custom index to a couple of alternatives for a baseline metric.

A snapshot from one of our early performance tests

The lookup time using the Zanzibar custom index is very fast. Checks on dense graphs aren’t much slower than checks on sparse graphs and the index is an order of magnitude faster than an alternative optimized caching strategy.

There is an initial cost for building the custom index, as seen in the higher average time to add an edge.

Eventually, this became a non-issue since AuthZ builds the index with an asynchronous pipeline. Consumers don’t have to wait for the index updates to propagate through the system. To reduce load, index updates are also distributed among several workers. As the number of consumers increases, we can scale up workers to process index events faster.

Consumer roadblocks

At this point, our initial product was simple. There were only three endpoints, sketched in the example below.

  • Add: Push a RelationTuple into the index
  • Remove: Remove a RelationTuple from the index
  • Check: Query the index for a permission
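
A minimal sketch of calling these endpoints; the host, paths, and payload shape are hypothetical, not Carta’s real interface:

import requests

AUTHZ = "https://authz.internal.example"  # hypothetical host

def add(actor: str, relation: str, obj: str) -> None:
    # Add: push a RelationTuple into the index.
    requests.post(
        f"{AUTHZ}/add",
        json={"actor": actor, "relation": relation, "object": obj},
    ).raise_for_status()

def remove(actor: str, relation: str, obj: str) -> None:
    # Remove: remove a RelationTuple from the index.
    requests.post(
        f"{AUTHZ}/remove",
        json={"actor": actor, "relation": relation, "object": obj},
    ).raise_for_status()

def check(actor: str, relation: str, obj: str) -> bool:
    # Check: query the index for a permission.
    resp = requests.post(
        f"{AUTHZ}/check",
        json={"actor": actor, "relation": relation, "object": obj},
    )
    resp.raise_for_status()
    return resp.json()["allowed"]

# Bootstrapping a permission set meant many individual calls like this:
add("user:bob", "view", "corporation:7")
assert check("user:bob", "view", "corporation:7")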

AuthZ’s Check endpoint performance impressed our consumers, but they felt like the API lacked features. Most use cases added several RelationTuples. Consumers called AuthZ tens of times during an update. Each call would add an individual RelationTuple.

There were also some concerns around scalability. At this point, the index was not yet built asynchronously, so updates were slow. Since permissions were added individually, a full update could take several seconds.

Finally, one of the biggest concerns was lack of visibility into the system. There was no way to inspect permissions once they were pushed into AuthZ. It was impossible to migrate an existing permission model without this feature.

Our consumers were reluctant to use AuthZ in a production environment.

This was hard feedback to hear. But this is why Carta works iteratively. They weren’t saying “No.” They were saying “Not right now.”

We knew that if we added a couple of features (to our MVP, which was already deployed), we could get the adoption we wanted.

Making improvements

The initial feedback was disappointing, but valuable. A few comments stood out to us:

  • Permissions are easy to add
  • There is a single source of truth
  • Checks are fast

This feedback was a positive indicator for our initial requirements. The MVP addressed most of the problems that we set out to solve. The scaling issues weren’t a big deal; we could solve those by implementing more features from the Zanzibar paper.

The MVP identified something we did not predict: Our consumers wanted tooling. They preferred software that helps them use the system, not features on the system itself. We quickly prioritized tooling due to the demand coming from our consumers.

Our consumers also had issues adding RelationTuples. They didn’t want to make 50 separate requests to bootstrap permissions. Rather, they wanted to add multiple permissions in a single call. This customer request became the Template feature.

Visualizer

During our initial tests, we built some tools to debug issues with the index. One of the tools was a simple graph visualizer. Given a set of RelationTuples, it was able to generate a JPEG image of the graph.

A prototype graphing tool we used for early testing

We thought about creating an endpoint to serve this image to our consumers but quickly decided it was a bad idea. Instead, we built query APIs that let consumers view portions of the graph. These APIs acted as our backend. We then created Concord, a web app that acts as our presentation layer for visualizing nodes in the graph.

Our visualizer tool in Concord

This feature alone addressed most of the concerns around discoverability by giving consumers the ability to identify and inspect previously added permissions.

We’ll share more information about Concord in a future blog post.

Namespaces

During our test sessions, we noticed that some services clobbered other services’ permissions. This happens when two services apply updates to the same permission type.

For instance, one service might add a permission for a bank account. Another might add a permission for a user account. Both might try to write a permission with the actor type: “Account.” One service could overwrite another service’s change.

An example of two services writing to the same permission type

We introduced namespaces to separate domain-level permissions. Namespaces prevent permission overwrites while still allowing cross-domain queries. Services write to one or many namespaces where they have write access. If a service attempts to write to a namespace it does not have access to, AuthZ rejects the update and throws an error.

Permissions can connect across namespaces

Carta uses shared namespaces for a few use cases. For instance, user permissions are in one namespace; document permissions are in another. AuthZ grants the document service access to the Identity public namespace, so the document service can give a user access to a document.

This is a major advantage of having a global permission store.

Templates

We solved the bulk update problem (add multiple permissions) by introducing RelationTuple templates. Templates allow consumers to predefine a set of changes that are applied in AuthZ. Templates make updates explicit and repeatable.

Consumers construct templates using the Mustache templating language and pass them to AuthZ with a set of data when a change occurs. AuthZ injects the data into the template and executes a bulk update event.

We encourage services to use separate templates for each use case.

Here’s an excerpt from a template we use for AuthZ meta permissions:

# Set admins for this namespace
{{#admin}}
authz_service:{{service_id}}, admin, authz_namespace:{{namespace_id}}
{{/admin}}
# Give other services write access if they need it
{{#write}}
authz_service:{{service_id}}, create_any_relation, authz_namespace:{{namespace_id}}
{{/write}}

AuthZ uses this template to add permissions for a new namespace. AuthZ injects the data for the change event: in this case, the namespace ID and any services that have access to the namespace. The template applies a bulk update to the permissions graph, and future AuthZ calls will include the RelationTuples applied by the template.
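
To make the mechanics concrete, here is a sketch of rendering a template like the one above with chevron, one Python implementation of Mustache; the data values are illustrative:

import chevron

template = """\
{{#admin}}
authz_service:{{service_id}}, admin, authz_namespace:{{namespace_id}}
{{/admin}}
{{#write}}
authz_service:{{service_id}}, create_any_relation, authz_namespace:{{namespace_id}}
{{/write}}
"""

data = {
    "admin": [{"service_id": "identity", "namespace_id": "identity_public"}],
    "write": [{"service_id": "documents", "namespace_id": "identity_public"}],
}

# Each rendered line is one RelationTuple applied as part of a bulk update:
#   authz_service:identity, admin, authz_namespace:identity_public
#   authz_service:documents, create_any_relation, authz_namespace:identity_public
print(chevron.render(template, data))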

Flatten

In some use cases, services display a list of resources that a user has access to (i.e., a search page).

Traversing the graph with our query API returns the correct data, but it’s expensive. Deep graphs can take seconds to retrieve a result set, as opposed to tens of milliseconds when using the custom index.

We determined we could repurpose the custom index to retrieve a flat list of resources for an actor. Since the call effectively flattens the graph, we called this API “Flatten.”

The Flatten API “flattens” the graph

The index uses special filters to reduce the results returned to the consumer. Consumers use the output of Flatten to query their database for resources to show to the user.
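
A hypothetical sketch of that flow; the endpoint path, parameters, and response shape are illustrative, not Carta’s real API:

import requests

AUTHZ = "https://authz.internal.example"  # hypothetical host

def flatten(actor: str, relation: str, object_type: str) -> list[str]:
    # Ask AuthZ for a flat list of object IDs the actor can act on.
    resp = requests.post(
        f"{AUTHZ}/flatten",
        json={"actor": actor, "relation": relation, "object_type": object_type},
    )
    resp.raise_for_status()
    return resp.json()["objects"]

# A search page first asks AuthZ which portfolios are visible...
portfolio_ids = flatten("user:bob", "view", "portfolio")
# ...then hydrates those IDs from the domain database, e.g.:
#   SELECT * FROM portfolios WHERE id = ANY(%(ids)s)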

List lookups were much faster with the index, but slowed down when consumers used filters.

We built a trigram index on top of the custom index to support faster lookups with filters. All of these optimizations reduced 95th-percentile lookup times to 50 ms.
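
To illustrate the idea behind a trigram index (a toy version, not our production implementation), names are broken into three-character grams up front so a filter query becomes a cheap set intersection:

def trigrams(s: str) -> set[str]:
    s = f"  {s.lower()} "  # pad so short strings still produce grams
    return {s[i : i + 3] for i in range(len(s) - 2)}

# Precompute grams for every indexed name.
index = {name: trigrams(name) for name in ["Acme Fund I", "Acme Fund II", "Beta LLC"]}

def matches(query: str) -> list[str]:
    q = trigrams(query)
    return [name for name, grams in index.items() if q <= grams]

print(matches("acme"))  # ['Acme Fund I', 'Acme Fund II']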

Version 2.0.0

A constant “build, measure, learn” feedback loop enabled our team to deliver immediate value. New features drove adoption across the company. Teams gravitated towards AuthZ because it was performant and easy to use.

But it wasn’t enough—yet. People wanted to use the new system to query old permissions. To drive further adoption, we implemented a legacy permissions proxy.

The proxy enables teams to call AuthZ for legacy permissions instead of using the JWT token. Legacy permission checks are slower than AuthZ calls. But the proxy encouraged a single interface. It is much easier to migrate customers on a single interface than it is on two disparate systems.

Currently, AuthZ serves seven different applications in our production environment, covering about 130 new, unique permission types. We see an average load of about 15 requests per second, and our metrics are growing quickly.

We’re working to replace the legacy permissions proxy with native AuthZ calls. Native AuthZ calls will make checks an order of magnitude faster on average than the proxy. The 95th percentile is more than two orders of magnitude faster.

AuthZ has made it easier for teams to manage permissions. Developers can build new products without having to build underlying authorization infrastructure.

The takeaway

When building something new — be it an internal service or external product — don’t be afraid to experiment. Our team had several failures before we built something that our consumers wanted to use. Ultimately, releasing early was key for us.

I challenge you to use this process in your own work. Is there a broken system at your company? Can you launch a simple experiment to help build something useful?

We’ll be covering the architecture of AuthZ in future articles. Leave a comment if you have questions about AuthZ or our development process. And if you’d like to help us build the next version of Carta, we’re hiring.



seriousben
7 days ago
Performant and authz systems are often not synonymous. Interesting approach to leveraging the Google Zanzibar thinking to increase performance.
Canada

The Idempotency-Key HTTP Header Field


Article URL: https://datatracker.ietf.org/doc/html/draft-ietf-httpapi-idempotency-key-header-00

Comments URL: https://news.ycombinator.com/item?id=27729610


seriousben
21 days ago
A draft RFC trying to standardize use of the Idempotency-Key HTTP header.

The HN comments (https://news.ycombinator.com/item?id=27729610) provide a lot of value. I really like that the Stripe intern who initially implemented the concept is participating in the discussion.
Canada
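
For context, the mechanism the draft standardizes looks roughly like this in practice; a minimal Python sketch with a hypothetical endpoint and payload:

import uuid
import requests

key = str(uuid.uuid4())  # one key per logical operation, reused on every retry

for attempt in range(3):
    try:
        resp = requests.post(
            "https://api.example.com/v1/payments",  # hypothetical endpoint
            json={"amount": 1000, "currency": "usd"},
            headers={"Idempotency-Key": key},
            timeout=5,
        )
        resp.raise_for_status()
        break  # success; a retry with the same key is deduplicated server-side
    except requests.RequestException:
        continue  # safe to retry: the server applies the operation once per key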

Stepping Back from Speaking


A personal note on why I wish to give up doing talks


seriousben
24 days ago
Leaders showing vulnerability is always great. Public speaking can be scary and draining. Even people with lots of experience have to manage this.
Canada

How a Docker footgun led to a vandal deleting NewsBlur’s MongoDB database


tl;dr: A vandal deleted NewsBlur’s MongoDB database during a migration. No data was stolen or lost.

I’m in the process of moving everything on NewsBlur over to Docker containers in prep for a big redesign launching next week. It’s been a great year of maintenance and I’ve enjoyed the fruits of Ansible + Docker for NewsBlur’s 5 database servers (PostgreSQL, MongoDB, Redis, Elasticsearch, and soon ML models). The day was wrapping up and I settled into a new book on how to tame the machines once they’re smarter than us when I received a strange NewsBlur error on my phone.

"query killed during yield: renamed collection 'newsblur.feed_icons' to 'newsblur.system.drop.1624498448i220t-1.feed_icons'"

There is honestly no set of words in that error message that I ever want to see again. What is drop doing in that error message? Better go find out.

I logged into the MongoDB machine to check out what state the DB was in, and I came across the following…

nbset:PRIMARY> show dbs
READ__ME_TO_RECOVER_YOUR_DATA   0.000GB
newsblur                        0.718GB

nbset:PRIMARY> use READ__ME_TO_RECOVER_YOUR_DATA
switched to db READ__ME_TO_RECOVER_YOUR_DATA
    
nbset:PRIMARY> db.README.find()
{ 
    "_id" : ObjectId("60d3e112ac48d82047aab95d"), 
    "content" : "All your data is a backed up. You must pay 0.03 BTC to XXXXXXFTHISGUYXXXXXXX 48 hours for recover it. After 48 hours expiration we will leaked and exposed all your data. In case of refusal to pay, we will contact the General Data Protection Regulation, GDPR and notify them that you store user data in an open form and is not safe. Under the rules of the law, you face a heavy fine or arrest and your base dump will be dropped from our server! You can buy bitcoin here, does not take much time to buy https://localbitcoins.com or https://buy.moonpay.io/ After paying write to me in the mail with your DB IP: FTHISGUY@recoverme.one and you will receive a link to download your database dump." 
}

Two thoughts immediately occurred:

  1. Thank goodness I have some recently checked backups on hand
  2. No way they have that data without me noticing

Three and a half hours before this happened, I switched the MongoDB cluster over to the new servers. When I did that, I shut down the original primary in order to delete it in a few days when all was well. And thank goodness I did that as it came in handy a few hours later. Knowing this, I realized that the hacker could not have taken all that data in so little time.

With that in mind, I’d like to answer a few questions about what happened here.

  1. Was any data leaked during the hack? How do you know?
  2. How did NewsBlur’s MongoDB server get hacked?
  3. What will happen to ensure this doesn’t happen again?

Let’s start by talking about the most important question of all, which is what happened to your data.

1. Was any data leaked during the hack? How do you know?

I can definitively write that no data was leaked during the hack. I know this because of two different sets of logs showing that the automated attacker only issued deletion commands and did not transfer any data off of the MongoDB server.

Below is a snapshot of the bandwidth of the db-mongo1 machine over 24 hours:

You can imagine the stress I experienced in the forty minutes between 9:35p, when the hack began, and 10:15p, when the fresh backup snapshot was identified and put into gear. Let’s break down each moment:

  1. 6:10p: The new db-mongo1 server was put into rotation as the MongoDB primary server. This machine was the first of the new, soon-to-be private cloud.
  2. 9:35p: Three hours later an automated hacking attempt opened a connection to the db-mongo1 server and immediately dropped the database. Downtime ensued.
  3. 10:15p: Before the former primary server could be placed into rotation, a snapshot of the server was made to ensure the backup would not delete itself upon reconnection. This cost a few hours of downtime, but saved nearly 18 hours of a day’s data by not forcing me to go into the daily backup archive.
  4. 3:00a: Snapshot completes, replication from original primary server to new db-mongo1 begins. What you see in the next hour and a half is what the transfer of the DB looks like in terms of bandwidth.
  5. 4:30a: Replication, which is inbound from the old primary server, completes, and now replication begins outbound on the new secondaries. NewsBlur is now back up.

The most important bit of information the above chart shows us is what a full database transfer looks like in terms of bandwidth. From 6p to 9:30p, the amount of data was the expected amount from a working primary server with multiple secondaries syncing to it. At 3a, you’ll see an enormous amount of data transferred.

This tells us that the hacker was an automated digital vandal rather than a concerted hacking attempt. And if we were to pay the ransom, it wouldn’t do anything because the vandals don’t have the data and have nothing to release.

We can also reason that the vandal was not able to access any files that were on the server outside of MongoDB due to using a recent version of MongoDB in a Docker container. Unless the attacker had access to a 0-day for both MongoDB and Docker, it is highly unlikely they were able to break out of the MongoDB server connection.

While the server was being snapshotted, I used that time to figure out how the hacker got in.

2. How did NewsBlur’s MongoDB server get hacked?

Turns out the ufw firewall I enabled and diligently kept on a strict allowlist with only my internal servers didn’t work on a new server because of Docker. When I containerized MongoDB, Docker helpfully inserted an allow rule into iptables, opening up MongoDB to the world. So while my firewall was “active”, doing a sudo iptables -L | grep 27017 showed that MongoDB was open to the world. This has been a Docker footgun since 2014.
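
For anyone looking to avoid the same footgun: publishing container ports bound to loopback, e.g. docker run -p 127.0.0.1:27017:27017 mongo, keeps Docker from exposing them publicly, and the DOCKER-USER iptables chain, which Docker evaluates before its own forwarding rules, is the supported place for filter rules that Docker won’t bypass.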

To be honest, I’m a bit surprised it took over 3 hours from when I flipped the switch to when a hacker/vandal dropped NewsBlur’s MongoDB collections and pretended to ransom about 250GB of data. This is the work of an automated hack and one that I was prepared for. NewsBlur was back online a few hours later once the backups were restored and the Docker-made hole was patched.

It would make for a much more dramatic read if I was hit through a vulnerability in Docker instead of a footgun. By having Docker silently override the firewall, Docker has made it easier for developers who want to open up ports on their containers at the expense of security. Better would be for Docker to issue a warning when it detects that the most popular firewall on Linux is active and filtering traffic to a port that Docker is about to open.

The second reason we know that no data was taken comes from looking through the MongoDB access logs. With these rich and verbose logging sources we can invoke a pretty neat command to find everybody who is not one of the 100 known NewsBlur machines that has accessed MongoDB.


$ cat /var/log/mongodb/mongod.log | egrep -v "159.65.XX.XX|161.89.XX.XX|<< SNIP: A hundred more servers >>"

2021-06-24T01:33:45.531+0000 I NETWORK  [listener] connection accepted from 171.25.193.78:26003 #63455699 (1189 connections now open)
2021-06-24T01:33:45.635+0000 I NETWORK  [conn63455699] received client metadata from 171.25.193.78:26003 conn63455699: { driver: { name: "PyMongo", version: "3.11.4" }, os: { type: "Linux", name: "Linux", architecture: "x86_64", version: "5.4.0-74-generic" }, platform: "CPython 3.8.5.final.0" }
2021-06-24T01:33:46.010+0000 I NETWORK  [listener] connection accepted from 171.25.193.78:26557 #63455724 (1189 connections now open)
2021-06-24T01:33:46.092+0000 I NETWORK  [conn63455724] received client metadata from 171.25.193.78:26557 conn63455724: { driver: { name: "PyMongo", version: "3.11.4" }, os: { type: "Linux", name: "Linux", architecture: "x86_64", version: "5.4.0-74-generic" }, platform: "CPython 3.8.5.final.0" }
2021-06-24T01:33:46.500+0000 I NETWORK  [conn63455724] end connection 171.25.193.78:26557 (1198 connections now open)
2021-06-24T01:33:46.533+0000 I NETWORK  [conn63455699] end connection 171.25.193.78:26003 (1200 connections now open)
2021-06-24T01:34:06.533+0000 I NETWORK  [listener] connection accepted from 185.220.101.6:10056 #63456621 (1266 connections now open)
2021-06-24T01:34:06.627+0000 I NETWORK  [conn63456621] received client metadata from 185.220.101.6:10056 conn63456621: { driver: { name: "PyMongo", version: "3.11.4" }, os: { type: "Linux", name: "Linux", architecture: "x86_64", version: "5.4.0-74-generic" }, platform: "CPython 3.8.5.final.0" }
2021-06-24T01:34:06.890+0000 I NETWORK  [listener] connection accepted from 185.220.101.6:21642 #63456637 (1264 connections now open)
2021-06-24T01:34:06.962+0000 I NETWORK  [conn63456637] received client metadata from 185.220.101.6:21642 conn63456637: { driver: { name: "PyMongo", version: "3.11.4" }, os: { type: "Linux", name: "Linux", architecture: "x86_64", version: "5.4.0-74-generic" }, platform: "CPython 3.8.5.final.0" }
2021-06-24T01:34:08.018+0000 I COMMAND  [conn63456637] dropDatabase config - starting
2021-06-24T01:34:08.018+0000 I COMMAND  [conn63456637] dropDatabase config - dropping 1 collections
2021-06-24T01:34:08.018+0000 I COMMAND  [conn63456637] dropDatabase config - dropping collection: config.transactions
2021-06-24T01:34:08.020+0000 I STORAGE  [conn63456637] dropCollection: config.transactions (no UUID) - renaming to drop-pending collection: config.system.drop.1624498448i1t-1.transactions with drop optime { ts: Timestamp(1624498448, 1), t: -1 }
2021-06-24T01:34:08.029+0000 I REPL     [replication-14545] Completing collection drop for config.system.drop.1624498448i1t-1.transactions with drop optime { ts: Timestamp(1624498448, 1), t: -1 } (notification optime: { ts: Timestamp(1624498448, 1), t: -1 })
2021-06-24T01:34:08.030+0000 I STORAGE  [replication-14545] Finishing collection drop for config.system.drop.1624498448i1t-1.transactions (no UUID).
2021-06-24T01:34:08.030+0000 I COMMAND  [conn63456637] dropDatabase config - successfully dropped 1 collections (most recent drop optime: { ts: Timestamp(1624498448, 1), t: -1 }) after 7ms. dropping database
2021-06-24T01:34:08.032+0000 I REPL     [replication-14546] Completing collection drop for config.system.drop.1624498448i1t-1.transactions with drop optime { ts: Timestamp(1624498448, 1), t: -1 } (notification optime: { ts: Timestamp(1624498448, 5), t: -1 })
2021-06-24T01:34:08.041+0000 I COMMAND  [conn63456637] dropDatabase config - finished
2021-06-24T01:34:08.398+0000 I COMMAND  [conn63456637] dropDatabase newsblur - starting
2021-06-24T01:34:08.398+0000 I COMMAND  [conn63456637] dropDatabase newsblur - dropping 37 collections

<< SNIP: It goes on for a while... >>

2021-06-24T01:35:18.840+0000 I COMMAND  [conn63456637] dropDatabase newsblur - finished

The above is a lot, but the important bit of information to take from it is that by using a subtractive filter, capturing everything that doesn’t match a known IP, I was able to find the two connections that were made a few seconds apart. Both connections from these unknown IPs occurred only moments before the database-wide deletion. By following the connection ID, it became easy to see the hacker come into the server only to delete it seconds later.

Interestingly, when I visited the IP address of the two connections above, I found a Tor exit router.

This means that it is virtually impossible to track down who is responsible due to the anonymity-preserving quality of Tor exit routers. Tor exit nodes have poor reputations due to the havoc they wreak. Site owners are split on whether to block Tor entirely, but some see the value of allowing anonymous traffic to hit their servers. In NewsBlur’s case, because NewsBlur is a home of free speech, allowing users in countries with censored news outlets to bypass restrictions and get access to the world at large, the continuing risk of supporting anonymous Internet traffic is worth the cost.

3. What will happen to ensure this doesn’t happen again?

Of course, being in support of free speech and providing enhanced ways to access speech comes at a cost. So for NewsBlur to continue serving traffic to all of its worldwide readers, several changes have to be made.

The first change is the one that, ironically, we were in the process of moving to. A VPC, a virtual private cloud, keeps critical servers accessible only from other servers in a private network. But in moving to a private network, I need to migrate all of the data off of the publicly accessible machines. And this was the first step in that process.

The second change is to use database user authentication on all of the databases. We had been relying on the firewall to provide protection against threats, but when the firewall silently failed, we were left exposed. Who’s to say this would have been caught if the firewall failed but authentication was in place. I suspect the password needs to be long enough to not be brute-forced, because eventually, knowing that an open but password-protected DB is there, it could very well end up on a list.

Lastly, a change needs to be made as to which database users have permission to drop the database. Most database users only need read and write privileges. The ideal would be a localhost-only user being allowed to perform potentially destructive actions. If a rogue database user starts deleting stories, it would get noticed a whole lot faster than a database being dropped all at once.
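
A minimal sketch of such a least-privilege user, using PyMongo and assuming authentication is already enabled; readWrite covers normal reads and writes but not dropDatabase:

from pymongo import MongoClient

# Connect as an administrative user (placeholder credentials).
admin = MongoClient("mongodb://admin:ADMIN_PASSWORD@localhost:27017/admin")

# The application user can read and write newsblur, but cannot drop the
# database; destructive admin actions stay with a separate, rarely used role.
admin["newsblur"].command(
    "createUser",
    "newsblur_app",
    pwd="A_LONG_RANDOM_PASSWORD",  # long enough to resist brute force
    roles=[{"role": "readWrite", "db": "newsblur"}],
)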

But each of these is only one piece of a defense strategy. As this well-attended Hacker News thread from the day of the hack made clear, a proper defense strategy can never rely on only one well-setup layer. And for NewsBlur that layer was an allowlist-only firewall that worked perfectly up until it didn’t.

As usual, the real heroes are backups. Regular, well-tested backups are a necessary component to any web service. And with that, I’ll prepare to launch the big NewsBlur redesign later this week.

seriousben
25 days ago
Great root cause analysis of a security incident.
Canada
5 public comments
chrisrosa
26 days ago
Great write up Samuel. And kudos for your swift and effective response.
San Francisco, CA
jshoq
26 days ago
This is a great account of how to recover a service from a major outage. NewsBlur was attacked by a scripter that used a well-known hole to attack the system. A well-planned and validated backup setup helped NewsBlur get their service back online quickly. This is a great read of a blameless post mortem executed well.
JS
Seattle, WA
jqlive
27 days ago
Thanks for the write-up, it was interesting to read and very transparent of you. It would be interesting to know how you'll be applying ML models to NewsBlur.
CN/MX
samuel
27 days ago
What a week. In other news, new blog design launched!
Cambridge, Massachusetts
deezil
26 days ago
Thanks for being above-board with all this! The HackerNews comment section was a little brutal towards you about some things, but I like that you've been transparent about everything.
samuel
26 days ago
HN only knows how to be brutal, which I always appreciate.
acdha
25 days ago
Thanks for writing this up. That foot-gun really needs fixing.
BLueSS
27 days ago
Thanks, Samuel, for your hard work and efforts keeping NewsBlur alive!

How to Work Hard

seriousben
25 days ago
Great essay on what it means to work hard.
Canada

Infosec Core Competencies

An incomplete list of things just about anybody working in Information Security would benefit from knowing.
seriousben
37 days ago
Amazing list of InfoSec tips.
Canada