May 25, 2013

Should String Be An Abstract Class?

Why are HTTP headers handled as plain strings in programming?

Is there anything in software engineering that is just a string? If not, shouldn't String be an abstract class, forcing developers to subtype and at least name datatypes?

Domain-Driven Security

Former colleague Dan Bergh Johnsson, application security expert Erlend Oftedal, and I have been evangelizing the idea of Domain-Driven Security. We truly believe proper domain and data modeling will kill many of the standard security bugs such as SQL injection and cross-site scripting.

This blog post is a case for Domain-Driven Security and a case against strings.

The addHeader() Method in Java

Let's be concrete and dive directly into programming with HTTP headers.

In Java EE's interface HttpServletResponse we find the following method (ref):

void addHeader(java.lang.String name,
               java.lang.String value)

Not a heavily debated method as far as I know. On the contrary, it looks the way most such interfaces do. An implementation of the interface may look like this (ref):

public void addHeader(String name, String value) {
  if (isCommitted())
    return;

  if (included)
    return;     // Ignore any call from an included servlet

  synchronized (headers) {
    ArrayList values = (ArrayList) headers.get(name);
    if (values == null) {
      values = new ArrayList();
      headers.put(name, values);
    }
    values.add(value);
  }
}

It shows we can really set any string as an HTTP header. And that's convenient, right?

The Ubiquitous String

java.lang.String is the ubiquitous datatype that solves all our problems. It can contain anything and nothing and of course it has its sibling in any popular programming language out there. Let's have a look at what a string is.

Java strings are sequences of UTF-16 code units, which can represent over 100,000 Unicode characters. As far as I know C# and JavaScript do the same. The max size of strings is often limited by the max size of integers, typically 2^31 - 1 which is just over 2 billion.

So, a string …
  • is anything between 0 and 2 billion in length, 
  • can contain 100,000 different characters, and 
  • can be null.
Hardly a good spec for HTTP headers.
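
To make that list a bit more tangible, here is a small sketch. The class and method names are made up for illustration; only the java.lang calls are real. It shows that String.length() counts UTF-16 code units rather than characters, and that null, the empty string, and a bare CRLF are all perfectly legal String arguments.

```java
final class StringFacts {
    // U+1F600 lies outside the Basic Multilingual Plane, so Java stores it
    // as a surrogate pair: two UTF-16 code units for one code point.
    static final String GRINNING_FACE = new String(Character.toChars(0x1F600));

    /** Number of UTF-16 code units, which is what String.length() reports. */
    static int codeUnits(String s) {
        return s.length();
    }

    /** Number of Unicode code points in the string. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    /** Each of these compiles as a String argument; none is a sensible header. */
    static String[] legalButOdd() {
        return new String[] { null, "", "\r\n", GRINNING_FACE };
    }
}
```

Passing any element of legalButOdd() to a method that takes a String compiles without complaint, which is exactly the problem.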

HTTP Headers By the Spec

RFC 2616 gives us the formal specification of how HTTP headers should look. An excerpt will suffice for our discussion.

message-header = field-name ":" [ field-value ]

field-name     = token

field-value    = *( field-content | LWS )

field-content  = <the OCTETs making up the field-value
                        and consisting of either *TEXT or
                        combinations of token, separators, and
                        quoted-string>

token          = 1*<any CHAR except CTLs or separators>

CHAR           = <any US-ASCII character (octets 0 - 127)>

CTL            = <any US-ASCII control character
                        (octets 0 - 31) and DEL (127)>

separators     = "(" | ")" | "<" | ">" | "@" |
                 "," | ";" | ":" | "\" | <"> |
                 "/" | "[" | "]" | "?" | "=" |
                 "{" | "}" | SP  | HT

LWS            = [CRLF] 1*( SP | HT )

CRLF           = CR LF

OCTET          = <any 8-bit sequence of data>

TEXT           = <any OCTET except CTLs,
                        but including LWS>

Let's summarize.
  • HTTP header names are tokens: ASCII chars 32-126 except the 19 chars called separators (which include space and tab).
  • Then there shall be a colon.
  • Finally the header value can consist of TEXT, meaning any octets except control characters (with linear whitespace such as space and tab allowed), or a mix of tokens, separators, and quoted strings.
  • On top of this, web servers impose length constraints on headers. Apache, for instance, defaults to roughly 8,000 bytes per header field.
There's clearly a huge difference between just a string and RFC 2616.
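
The token rule from the grammar excerpt above is small enough to write down directly. What follows is only a sketch: HttpToken and its methods are names I made up, and the character set is derived from the quoted grammar, nothing more.

```java
import java.util.BitSet;

final class HttpToken {
    // The 19 separators from the grammar: ( ) < > @ , ; : \ " / [ ] ? = { } SP HT
    private static final String SEPARATORS = "()<>@,;:\\\"/[]?={} \t";
    private static final BitSet TOKEN_CHARS = new BitSet(128);

    static {
        // CHAR is octets 0-127 and CTLs are 0-31 plus 127, so start from 32-126.
        TOKEN_CHARS.set(32, 127);
        // Then remove every separator (HT is a CTL and was never in the set).
        for (char c : SEPARATORS.toCharArray()) {
            TOKEN_CHARS.clear(c);
        }
    }

    /** True if s is a non-empty sequence of token characters. */
    static boolean isToken(String s) {
        if (s == null || s.isEmpty()) {
            return false;
        }
        for (int i = 0; i < s.length(); i++) {
            if (!TOKEN_CHARS.get(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    /** How many distinct characters may appear in a token. */
    static int tokenCharCount() {
        return TOKEN_CHARS.cardinality();
    }
}
```

tokenCharCount() comes out at 77, so a header name draws from 77 characters, not 100,000.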

The Dangers of Unvalidated HTTP Headers

Can this go wrong? Is there any real danger in using plain strings for setting HTTP headers? Yes. Let's look at HTTP response splitting as an example.

We have built a site where an optional URL parameter tells the server which language to use.

www.example.com/?lang=Swedish

… redirects to …

www.example.com/

… with a custom header telling the web client to use Swedish. After all, we don't want that language parameter pestering our beautiful URL the rest of the session.

So in the redirect response we do the following:

response.addHeader("Custom-Language",
                   request.getParameter("lang"));

The result is an HTTP response like this:

HTTP/1.1 302 Moved Temporarily
Date: Wed, 24 Dec 2013 12:53:28 GMT
Location: http://www.example.com/
Set-Cookie: JSESSIONID=1pMRZOiOQzZiE6Y6iivsREg82pq9Bo1ape7h4YoHZ62RXjApqwBE!-1251019693; path=/
Custom-Language: Swedish
Connection: Close

But what if the request looks like this (%0d is carriage return, %0a is linefeed):

www.example.com/?lang=foobar%0d%0aContent-Length:%200%0d%0a%0d%0aHTTP/1.1%20200%20OK%0d%0aContent-Type:%20text/html%0d%0aContent-Length:%2025%0d%0a%0d%0a<html>Well, hello!</html>

That would generate the following HTTP response (linefeeds included):

HTTP/1.1 302 Moved Temporarily
Date: Wed, 24 Dec 2013 15:26:41 GMT
Location: http://www.example.com/
Set-Cookie: JSESSIONID=1pwxbgHwzeaIIFyaksxqsq92Z0VULcQUcAanfK7In7IyrCST9UsS!-1251019693; path=/
Custom-Language: foobar
Content-Length: 0

HTTP/1.1 200 OK
Content-Type: text/html
Content-Length: 25

<html>Well, hello!</html>
Content-Type: text/html

… which will be interpreted as two responses by the web browser. This is an example of the security attack called HTTP response splitting (link to WASC from where I've adapted my example). And that's just one of the dangers of letting users mess with headers. Setting or deleting cookies is another. In fact, the whole header section is in danger.
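
The mechanics are easy to reproduce outside a servlet container. The sketch below is not how any particular server writes its headers; headerLine() is a hypothetical, deliberately naive serializer, and decode() only approximates the URL decoding a container performs on request parameters.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

final class ResponseSplittingDemo {
    /** Hypothetical serializer that writes a header with no validation at all. */
    static String headerLine(String name, String value) {
        return name + ": " + value + "\r\n";
    }

    /** Decodes a raw query parameter, roughly as a container would. */
    static String decode(String rawParam) {
        try {
            return URLDecoder.decode(rawParam, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a Java platform.
            throw new IllegalStateException(e);
        }
    }

    /** True if the value would inject extra lines into the header section. */
    static boolean containsLineBreak(String value) {
        return value.indexOf('\r') >= 0 || value.indexOf('\n') >= 0;
    }
}
```

Feeding decode("foobar%0d%0aContent-Length:%200") into headerLine("Custom-Language", ...) yields two header lines instead of one. The attacker has written the second one.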

The HTTP response splitting vulnerability has been fixed under the hood in at least Tomcat 6+, Glassfish 2.1.1+, Jetty 7+, JBoss 3.2.7+. (Thanks for that info, Jeff Williams.)

Should We Fix the addHeader() API?

Now we can ask ourselves two different things. The first is – should we fix the addHeader() and related APIs? Yes. They should look something like this:

void addHeader(javax.servlet.http.HttpHeaderName name,
               javax.servlet.http.HttpHeaderValue value)

… where the two domain classes HttpHeaderName and HttpHeaderValue accept strings in their constructors and validate that the strings adhere to the RFC 2616 specification. In one blow all Java developers are relieved of the burden of writing that validation code themselves and of always having to remember to run it.
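
Here is a minimal sketch of what those two domain classes could look like. Neither class exists in javax.servlet.http; the names come from the proposal above and everything else is my assumption. In particular, the value rule is intentionally stricter than the grammar: it allows only tab, space, and visible ASCII, so CR and LF never get through, and the length cap is an arbitrary number.

```java
import java.util.regex.Pattern;

final class HttpHeaderName {
    // token = 1*<any CHAR except CTLs or separators>
    private static final Pattern TOKEN =
            Pattern.compile("[!#$%&'*+\\-.^_`|~0-9A-Za-z]+");

    private final String value;

    HttpHeaderName(String value) {
        if (!isValid(value)) {
            throw new IllegalArgumentException("Not a valid HTTP header name");
        }
        this.value = value;
    }

    static boolean isValid(String s) {
        return s != null && TOKEN.matcher(s).matches();
    }

    String value() {
        return value;
    }
}

final class HttpHeaderValue {
    // Assumption: only HT, SP, and visible ASCII. Stricter than the grammar,
    // which also permits folded lines and non-ASCII octets.
    private static final Pattern SAFE_TEXT = Pattern.compile("[\\t\\x20-\\x7E]*");
    private static final int MAX_LENGTH = 8000; // hypothetical cap

    private final String value;

    HttpHeaderValue(String value) {
        if (!isValid(value)) {
            throw new IllegalArgumentException("Not a valid HTTP header value");
        }
        this.value = value;
    }

    static boolean isValid(String s) {
        return s != null
                && s.length() <= MAX_LENGTH
                && SAFE_TEXT.matcher(s).matches();
    }

    String value() {
        return value;
    }
}
```

With types like these, new HttpHeaderValue(request.getParameter("lang")) throws on the response splitting payload before addHeader() is ever called, and a raw String no longer compiles as an argument.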

Should String Be An Abstract Class?

The larger question is about strings in general. Yes, they are super convenient. But we're fooling ourselves. We think the time we save by not modeling our domain, by not writing that validation code, by not narrowing down our APIs to do exactly what they're supposed to, we think that time is better spent on other activities. It's not.

I truly believe nothing is just a string. Nothing is any of 100,000 characters and anything between 0 and 2 billion in length.

Therefore String should be an abstract class, forcing us developers to subtype and think about what we're really handling.

Even better, why not have a way to declare that a class can only be used in object composition? That way programmers could choose if an "is-a" relation or a "has-a" relation is most suitable for narrowing down the String class.
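
Java can't do this with String itself since java.lang.String is final, but the idea can be approximated. The sketch below is an assumption-laden illustration, not a proposal for the JDK: DomainString and LanguageName are hypothetical names, the base class wraps a String (has-a) and is abstract, so nobody can hold "just a string" without first naming and validating it.

```java
import java.util.function.Predicate;

// Hypothetical stand-in for an abstract String: it cannot be instantiated
// directly, and every subtype has to supply a validation rule.
abstract class DomainString {
    private final String value;

    protected DomainString(String value, Predicate<String> rule) {
        if (value == null || !rule.test(value)) {
            throw new IllegalArgumentException(
                    "Invalid " + getClass().getSimpleName());
        }
        this.value = value;
    }

    final String value() {
        return value;
    }

    @Override
    public final String toString() {
        return value;
    }
}

// Hypothetical subtype for the lang parameter in the redirect example.
final class LanguageName extends DomainString {
    LanguageName(String value) {
        super(value, LanguageName::isValid);
    }

    // Assumed rule: 1 to 30 ASCII letters. A real site would more likely
    // check against the list of languages it actually supports.
    static boolean isValid(String s) {
        return s != null && s.matches("[A-Za-z]{1,30}");
    }
}
```

new LanguageName("Swedish") works; the response splitting payload fails in the constructor, long before it gets anywhere near a header.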

May 13, 2013

Introduction to Software Security

On April 22, 2013, I successfully defended my PhD in computer science, more specifically in the area of software security [fulltext pdf]. I thought I'd share some parts of the thesis in a more digestible format and allow myself to augment our results, comment, and have opinions, things you typically don't see in academic publications.

Let's start with my introductory chapter …


[Image: the thesis cover.]

"To put it quite bluntly: as long as there were no machines, programming was no problem at all; when we had a few weak computers, programming became a mild problem, and now we have gigantic computers, programming has become an equally gigantic problem. In this sense the electronic industry has not solved a single problem, it has only created them, it has created the problem of using its products."
–Edsger W. Dijkstra, The Humble Programmer, 1972

Computer software products are among the most complex artifacts, if not the most complex artifacts, mankind has created (see Dijkstra's quote above). Securing those artifacts against intelligent attackers who try to exploit flaws in software design and construction is a great challenge too.

Our research contributes to the field of software security. Software as an artifact meant to interact with its environment including humans. Security in the sense of withstanding active intrusion attempts against benign software.

Software Vulnerabilities

Software can be intentionally malicious such as viruses (programs that replicate and spread from one computer to another and cause harm to infected ones), trojans (malicious programs that masquerade as benign) and software containing logic bombs (malicious functions set off when specified conditions are met).

However, attacks against computer systems are not limited to intentionally malicious software. Benign software can contain vulnerabilities and such vulnerabilities can be exploited to make the benign software do malicious things. A successful exploit has traditionally been the same as an intrusion. But in the era of web application vulnerabilities that term is not used as often. Nevertheless, a successful cross-site scripting attack (XSS) can be seen as executing arbitrary code inside the web application. And arbitrary code execution in a web application may very well be of high impact if the application handles sensitive information (password fields, credit card numbers etc) or is authorized to do sensitive state changes on the server (money transfers, profile updates, message posting etc). I would therefore argue that XSS is an intrusion attack.

Vulnerabilities can be responsibly reported to the public by creating a so-called CVE Identifier – a unique, common identifier for a publicly known information security vulnerability. Identifiers are created by CVE Numbering Authorities for acknowledged vulnerabilities. Larger software vendors typically handle identifiers for their own products. Some of these participating vendors are Apple, Oracle, Ubuntu Linux, Microsoft, Google, and IBM.

The National Institute of Standards and Technology (NIST) has a statistical database of reported software vulnerabilities with a publicly accessible search interface. Two types of vulnerabilities are of particular interest in the context of our research, namely buffer overflows and format string vulnerabilities in software written in the programming language C. The statistics for Buffer Errors and Format String Vulnerabilities are shown below.



Reported software vulnerabilities due to buffer errors have increased significantly since 2002. Their percentage of the total number of reported vulnerabilities has also increased from 1-4 % between 2002 and 2006 to 10-16 % between 2008 and 2012. These statistics are in stark contrast to the statistics from CERT that Wagner et al. used to show that buffer overflows represented 50 % of all reported vulnerabilities in 1999 [pdf]. We have not investigated if there are significant differences in how the two statistics were produced. Still, up to 16 % of all reported vulnerabilities is a significant number.

The reported format string vulnerabilities peaked between 2007 and 2009 but have never reached 0.5 % of the total. Our experience is that format string vulnerabilities are less prevalent, easier to fix, and harder to exploit than buffer overflow vulnerabilities. Nevertheless format string vulnerabilities are still being used for exploitation such as the Corona iOS Jailbreak Tool.

Avoiding Software Intrusions

Intrusion attempts or attacks are made by malicious users or attackers against victims. A victim can be either a machine holding valuable assets or another human computer user. Securing software against intrusions calls for anti-intrusion techniques as defined by Halme and Bauer. We have taken the liberty of adapting and reproducing Halme and Bauer's figure showing anti-intrusion approaches, see below.


  1. Preempt – strike offensively against likely threat agents prior to an intrusion attempt. May affect innocents.
  2. Prevent – severely handicap the likelihood of a particular intrusion’s success.
  3. Deter – increase the necessary effort for an intrusion to succeed, increase the risk associated with an attempt, and/or devalue the perceived gain that would come with success.
  4. Deflect – lead an intruder to believe that he or she has succeeded in an intrusion attempt, whereas in fact the intrusion was redirected to where harm is minimized.
  5. Detect – discriminate intrusion attempts and intrusion preparation from normal activity and alert the operations. Detection can also be done in a post mortem analysis.
  6. Actively countermeasure – counter an intrusion as it is being attempted.

Avoiding the Vulnerabilities

There are many ways to achieve more secure software, i.e. to avoid having vulnerabilities. Microsoft's Security Development Lifecycle (SDL) defines seven phases where security enhancing activities and technologies apply:

  1. Training
  2. Requirements
  3. Design
  4. Implementation
  5. Verification
  6. Release
  7. Response

Further things can be done in an even wider scope. Programming languages can be constructed with security primitives which allow programmers to express security properties of the system they are writing – so-called security-typed languages, a part of language-based security [pdf]. Operating systems and deployment platforms can be hardened and secured both in construction and configuration.

Our research has focused on the Requirements and Implementation phases of Microsoft's SDL and on hardening of the runtime environment for software applications. Want to know what we found out? Stay tuned for upcoming posts where we dive into the details of our studies.