CVE-2020-13254

June 7, 2020

Information Exposure Vulnerability with Django and Memcached

On Wednesday April 29th, Thread started experiencing a partial outage of our main backend service. We traced the issue to malformed Memcached keys and corrected it on thread.com. Along the way we suspected that this could be exploited on some Django sites using Memcached to cause private data exposure – either internal service data or data about other users. The only impact on Thread was HTTP 500 server errors seen by a small number of users; no private data was leaked.

We reported this to the Django security team on the same day through their preferred disclosure process, providing a full write up with a potential fix.

After some discussion it was concluded that the issue did indeed represent a security vulnerability in Django-based sites, and it was assigned the identifier CVE-2020-13254. The fix was reviewed and merged by the security team, and released in 3.0.7 and 2.2.13 on June 3rd.

This blog post covers…

Finding the vulnerability: What errors we saw, our debugging, and an unsatisfying conclusion.
Exploitation example: A simple Django example to show how this could be exploited.
Previous related Django discussion: The history of this issue in Django, discussing why it may not have been realised sooner.
Why should Django validate Memcached keys?: Technical discussion of why the existing behaviour was incorrect.
Why wasn’t this found sooner?: Discussion of why it’s easy to make these mistakes.
Reporting and fixing: How we reported and our experience contributing a fix to Django.

Finding the vulnerability

This bug was one of the hardest I’ve investigated in a while with many dead ends. We saw a number of symptoms indicating that cache queries of many kinds were failing, but one of the clearest examples was this:

# In `django/core/cache/backends/memcached.py`
def get_many(self, keys, version=None):
    key_map = {self.make_key(key, version=version): key for key in keys}
    ret = self._cache.get_multi(key_map.keys())

    # `KeyError` on this line for key `b':1:alternate-colours:1:15492594:213'`
    return {key_map[k]: v for k, v in ret.items()}

# Where:
key_map = {b':1:preferred-sizes-v2:88625:15492594': 'preferred-sizes-v2:88625:15492594'}
v = [15492576, 15492582, 15492619, 15492641]

Django provides a get_many function on its cache backends system. This takes an iterable of keys, and returns a mapping from those keys to the values from the cache, using the cache’s bulk query functionality if there is any.

The issue here was that while the keys input to the function were preferred-sizes keys (for the sizes of products that are appropriate for a given user), the cache had returned in ret an alternate-colours value which the cache backend was unable to match up with a key it was querying for, thus raising a KeyError.
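To make the failure mode concrete, here is a minimal sketch of the same logic outside Django. `StaleClient` and the simplified `make_key` are hypothetical stand-ins written for illustration, not Django or PyLibMC code; the only assumption is that the driver hands back a response that belonged to an earlier query.

```python
class StaleClient:
    """Hypothetical driver whose connection is out of step with the server."""

    def get_multi(self, keys):
        # Ignores the requested keys and returns a response that was meant
        # for an earlier, unrelated query.
        return {b":1:alternate-colours:1": ["red", "blue"]}


def make_key(key, version=1):
    # Simplified key function for this sketch: empty prefix, version, key.
    return f":{version}:{key}".encode()


def get_many(client, keys, version=1):
    key_map = {make_key(key, version): key for key in keys}
    ret = client.get_multi(key_map.keys())
    # Raises KeyError when the response contains a key we never asked for.
    return {key_map[k]: v for k, v in ret.items()}


def demo():
    """Return the unexpected key that triggers the KeyError, if any."""
    try:
        get_many(StaleClient(), ["preferred-sizes-v2:88625"])
    except KeyError as exc:
        return exc.args[0]
    return None
```

The `KeyError` is a lucky accident here: it only surfaces because `get_many` maps response keys back to request keys. A single-key `get` has no such check and would silently return the wrong value.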

This had us stumped. It looked like the cache was giving back the wrong data, but Memcached is a rock-solid piece of infrastructure, battle tested at companies far larger than us, so it was much more likely the bug was in our code.

The first port of call was what new changes we had shipped. We ship to production up to 40 times a day, so this is often a hard question to answer, but a suspicious commit had changed a lot about how we use some of our core cached data and how it’s serialised. We checked to make sure that the data in the cache for given keys was valid, and it was, so we went down a rabbit hole of investigating serialisation behaviour, pickling (a Python form of serialisation) and a number of other issues. This turned out to be a dead end.

Since the data was valid, and our querying and serialisation appeared to be correct, this suggested an issue between us and Memcached. After several hours we suspected the issue could have been file pointers being re-used. If two processes could get access to the same file pointer, and were both writing queries and attempting to read results, they could read each other’s results. We spent some time investigating how this could happen, but what convinced us that this wasn’t the issue was that our incorrect responses from Memcached were not malformed: they were not truncated in the middle of keys or values, behaviour we’d almost certainly see otherwise.

Eventually a colleague who had been working on a separate area of the code, and who had pushed changes that had not worked in production, asked us if it could be related. He had found that through several layers of abstraction, a value that he had been editing – a human-readable title – was ending up in a cache key. He had updated some code with the first multi-word title and therefore inadvertently introduced a space character into a cache key, something not allowed by Memcached.

By including spaces in cache keys, our connection was getting out of step with what data Memcached was responding with. This is best illustrated by the Two Ronnies Mastermind sketch.
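For those who prefer code to comedy, the following toy model sketches the effect. It is deliberately not the Memcached protocol: `ToyConnection` is a hypothetical class, and the assumption baked into it is simply that one malformed request makes the server emit two responses while the client reads exactly one response per request.

```python
from collections import deque


class ToyConnection:
    """Toy request/response connection; NOT the real Memcached protocol."""

    def __init__(self):
        self.store = {}
        self.pending = deque()  # responses sent by the server, not yet read

    def _send(self, line):
        parts = line.split(" ")
        if parts[0] == "get" and len(parts) == 2:
            self.pending.append(self.store.get(parts[1]))
        elif parts[0] == "set" and len(parts) == 3:
            self.store[parts[1]] = parts[2]
            self.pending.append("STORED")
        else:
            # Assumption for this sketch: a malformed request is answered
            # with two error responses instead of one.
            self.pending.append("ERROR")
            self.pending.append("ERROR")

    def request(self, line):
        # The client always reads exactly one response per request.
        self._send(line)
        return self.pending.popleft()


def demo():
    conn = ToyConnection()
    conn.request("set k1 v1")
    conn.request("set k2 v2")
    conn.request("set a b v3")  # the key "a b" contains a space
    return [conn.request(f"get {key}") for key in ["k2", "k1", "k2"]]
```

In this model `demo()` returns `["ERROR", "v2", "v1"]` rather than `["v2", "v1", "v2"]`: every answer arrives one question late, and nothing ever puts the connection back in step.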

After a day of reading source code of Django, PyLibMC and the C source of libmemcached, ruling out many possibilities such as inadvertently upgrading packages or processes sharing file descriptors, finding that this bug was “simply” a space in a cache key was a little disappointing. It does however illustrate how possible or even likely this is in other codebases, and how dangerous this could be.

Exploitation example

Exploiting this issue as a user of a website requires the following things:

The website must be using Django, Memcached, and PyLibMC or another driver for Memcached that does not validate keys (note that python-memcached does validate keys and is not thought to be exploitable).
User-control over content that will end up unprocessed in a cache key. This could be a string, but could equally be a value associated with a form control.
The website must be using the cache in such a way that cache keys referencing sensitive data are queried after those that can be controlled by the attacker – although this is not per request but over the lifetime of a server process.
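The second requirement also suggests a defensive habit for application code, independent of any fix in the framework. The sketch below is one possible approach, not something taken from Thread’s codebase or from Django: `safe_cache_key` is a hypothetical helper that replaces any user-controlled component with a hex digest, so that spaces and control characters can never reach the key. It assumes the `namespace` argument is a trusted, hard-coded string.

```python
import hashlib


def safe_cache_key(namespace, untrusted):
    """Build a cache key whose user-controlled part is replaced by a digest.

    `namespace` is assumed to be a trusted constant; `untrusted` may contain
    anything, including spaces and newlines.
    """
    digest = hashlib.sha256(untrusted.encode("utf-8")).hexdigest()
    return f"{namespace}:{digest}"
```

For example, `safe_cache_key("title", "C D")` yields a key made of the namespace, a colon, and 64 hexadecimal characters. The trade-off is that keys are no longer human-readable when inspecting the cache.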

The full example is available on GitHub at danpalmer/django-cve-2020-13254.

The example codebase demonstrates the exploitation in two ways, via a simple web interface and via a failing test case.

Exploiting via the web

The example provides a web interface with 2 forms, one that sets values in the cache and the other that gets them. These are directly translated into calls to the Django cache backend. Because the codebase does not implement any session or authentication system, multiple uses in the same browser tab are indistinguishable from multiple users on different machines.

To exploit:

Set keys of A and B to values a and b.
Attempt to set C D to value c d. This will error.
Attempt to retrieve key A, there will incorrectly be no result.
Attempt to retrieve key B, the result will incorrectly be a.

Demo via tests

This process can be expressed as a test case as follows:

from django.core.cache import cache
from django.test import TestCase


class CacheTests(TestCase):
    def test_cache(self):
        cache.set('k1', 'v1')
        cache.set('k2', 'v2')
        try:
            cache.set('a b', 'v3')
        except Exception:
            pass
        self.assertEqual(
            [
              cache.get(x) for x in
              ['k2', 'k1', 'k2', 'k1', 'k2', 'k1']
            ],
            ['v2', 'v1', 'v2', 'v1', 'v2', 'v1'],
        )

This fails with the following error:

=============================================================
FAIL: test_cache (demo.tests.CacheTests)
-------------------------------------------------------------
Traceback (most recent call last):
  File "tests.py", line 30, in test_cache
    'v1',
AssertionError: Lists differ

First differing element 0:
None
'v2'

- [None, 'v2', 'v1', 'v2', 'v1', 'v2']
?  ------

+ ['v2', 'v1', 'v2', 'v1', 'v2', 'v1']
?                              ++++++

-------------------------------------------------------------

As you can see, after the set, the cache results being returned are out of step with the queries being made.

Previous related Django discussion

During investigation we found that Django already validates cache keys to ensure that they do not contain spaces, as well as validating that they don’t include a number of other invalid characters and are under the maximum key length. Unfortunately this validation only happens on non-Memcached backends, and this was intentional!
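The kind of check involved is small. The following is a minimal sketch of such a validator, written for this post rather than copied from Django: `validate_key` and `InvalidCacheKey` are hypothetical names, and the rules are assumptions based on Memcached’s documented limits (at most 250 characters, no whitespace or control characters).

```python
MAX_KEY_LENGTH = 250  # assumed limit, taken from Memcached's documented key limit


class InvalidCacheKey(ValueError):
    """Raised when a key would not be safe to send to Memcached."""


def validate_key(key):
    """Return `key` unchanged if it is safe, otherwise raise InvalidCacheKey."""
    if len(key) > MAX_KEY_LENGTH:
        raise InvalidCacheKey(f"key is longer than {MAX_KEY_LENGTH} characters")
    for char in key:
        # Code points below 33 cover control characters and the space;
        # 127 is the DEL control character.
        if ord(char) < 33 or ord(char) == 127:
            raise InvalidCacheKey(f"key contains invalid character {char!r}")
    return key
```

Raising an error rather than emitting a warning is a design choice in this sketch: a rejected key fails loudly at the call site instead of corrupting the connection for later queries.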

From reading into the history it seems that in the pursuit of speed in some places, and developer experience in others, each applied unevenly, we ended up in this strange position where the backends that do not need it have it, and those that do don’t.

2008

In January 2008 issue #6447 was opened on Django’s bug tracker. It essentially suggests that because Memcached has these limitations, the cache backends used for local development (which just store the cache in process, unsuitable for production) should also do the same validation so that a developer using development backends locally but Memcached in production won’t be bitten by cache key validity issues once they deliver their code to production.

2010

On the same ticket it is decided that warnings (but not errors) will be added to non-Memcached backends to help, but that they won’t be added to the Memcached backend itself because:

any key mangling there could slow down a critical code path

While this dedication to performance is commendable, the key validation here is simple string checking on strings that must be 250 characters or shorter anyway (the Memcached key limit). This is not only likely to be a very quick operation, it’s also happening during a cache query that would incur a network round-trip.

2013

In February 2013 it was reported in #19914 that the test suite for Django was failing when using PyLibMC and the Memcached cache backend. During the investigation it was found that including spaces in a cache key…

causes subsequent requests to the server … to fail for the next few seconds

The conclusion of this ticket was to remove the offending test from the memcached backend test suite for PyLibMC.

Why should Django validate Memcached keys?

Throughout these tickets, the matter of whether Django should be validating keys came up several times, but why? As mentioned by commenters on those tickets, wouldn’t it be faster not to? Maybe it’s not Django’s responsibility to validate these keys.

From the famous Numbers Every Programmer Should Know (from 2009, so representative of the time this was being worked on), a main memory reference is around 100ns and a round-trip network request within the same datacentre is 500,000ns. The string validation may take a few memory accesses, so we could call it 1,000ns¹, but even then we’re still looking at a ~0.2% overhead on a cache query.
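The arithmetic behind that estimate, spelled out. The constant names are ours, and the 1,000ns validation cost is the rough guess from the paragraph above rather than a measurement:

```python
VALIDATION_NS = 1_000    # rough guess for validating one key
ROUND_TRIP_NS = 500_000  # same-datacentre network round trip


def overhead_percent(extra_ns, baseline_ns):
    """Extra cost expressed as a percentage of the baseline cost."""
    return 100 * extra_ns / baseline_ns


estimated_overhead = overhead_percent(VALIDATION_NS, ROUND_TRIP_NS)
```

Even if the guess for validation were off by a factor of ten, the overhead would still be around 2% of a single cache query.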

From this perspective it’s likely not that impactful, but another perspective is what level of abstraction we’re working at. Django is a relatively high level web framework – it aims to provide easy to use and safe tools for most things that web developers need to do. It does not aim to be the highest performance framework out there and such a framework would also likely not be based on Python. Django and Python already make speed trade-offs for developer productivity and safety, incurring performance overheads for preventing segfaults or making SQL injection attacks much less of a risk.

It’s worth noting that libmemcached also does not validate keys by default. This is probably much more appropriate as libmemcached is not designed to be a safe tool for working with caches, it’s designed to be a fast interface to Memcached that gives all control possible to the developer. A lack of validation here is appropriate for the level of abstraction that libmemcached provides.

Within the context of Django’s aims and Python’s values, skipping the validation to save this time is likely the wrong design choice, and the lack of impact means it’s probably the wrong technical choice, but it’s easy to get stuck in a performance focused view of code and forget about developer experience.

Why wasn’t this found sooner?

The ticket in 2013 came so close to realising the potential security issues, finding the exact behaviour that we at Thread observed, but missing the impact that it could have on a production system being used by untrusted users.

Having a security-focused mindset is hard. It’s something I practice as much as I can, but as developers it’s much easier to focus on what software should do rather than what it shouldn’t. I can’t fault the Django team for not spotting this: the reason we joined the dots at Thread was that we were seeing cache keys and values containing user IDs in our error monitoring. Without this we may well not have realised the impact.

Despite multiple people looking at this specific issue over the last ~10 years, no one raised it (publicly) as a security vulnerability. Even at Thread, it was only after three of us had worked on the bug we were investigating, and all wondered aloud if it could be a security vulnerability for us, that we finally connected the dots and realised that this was an issue that would likely affect other sites and should probably be fixed in Django.

Reporting and fixing

I wrote up a full description of the issue, along with a first-pass attempt at a fix for it in Django and sent this to the security team. Django thankfully publishes contact details for its security team and also explicitly mentions these details in their bug tracker, encouraging developers not to submit public bugs that could have a security impact. This is great practice for a framework behind millions of websites running in production.

I received a response confirming that they had received the report within a few hours. Several days later, the team had a short discussion on the email chain raising questions and pointing to tickets where this had been discussed before, albeit without the security perspective.

After some back and forth it was confirmed on May 6th that this was indeed an exploitable security vulnerability and that it should be fixed in Django.

I finished my patch, including tests and documentation fixes, and submitted it on May 8th. This was reviewed and accepted by the team.

The Django security team scheduled the patch for release in 3.1a2, 3.0.7 and 2.2.13 on June 3rd.

This whole process was very easy thanks to the Django security team. It’s easy to be defensive when someone tells you there is a security vulnerability in your product, but they came to the process with no ego. I already find the Django community to be helpful, friendly, and professional, and this process has served to further cement that feeling.

One thing I’ll be taking away from this experience is that it’s not always obvious when something is a security issue. It’s a nuanced balance of how code is used in production, attack vectors that might be levels of abstraction away, what the developer believes they are expected to do, and whether it’s appropriate from a performance perspective.

Thanks again to the Django security team, and also to my colleagues Alistair Lynn and Aaron Kirkbride, who both aided in debugging the issue and coming to the realisation of the wider impact of the bug.

¹ This is certainly debatable, but given a valid key is a maximum of 250 characters, we’re likely talking about a maximum of 250-500 bytes, assuming that most cache keys are ASCII or common extensions expressible in 2 bytes of Unicode data, as most written languages are. 500 bytes of a string being analysed will likely be loaded into the CPU cache in under 10 operations.