Archive for November, 2011

Transfer-Encoding: Chunked Debugging w/ WPT – 101

Hi there!

Just recently an interesting performance issue was raised in the WPT Forum. It serves really well as a showcase for the rich debugging capabilities of WPT, and in this Blog Post I want to focus on Transfer-Encoding: Chunked.

Transfer-Encoding: Chunked was defined in the HTTP1.1 RFC and can be a nice performance improvement regarding Time-to-Render.

In HTTP1.0 the default behaviour of Browser-Server interaction was to establish a TCP Connection to a Webserver using TCP’s well-known 3-way-handshake. After the Connection was established, the Browser fired its GET-Request and waited for the response. The requested object was sent down the wire from the Server to the Browser, and as soon as all Bytes were transmitted, the Server closed the TCP Connection. The Browser would then establish a new TCP connection to request the next object, and so on.

With HTTP1.1 so-called persistent connections became the default, meaning that a Browser could request multiple objects sequentially, one after another, on a single TCP connection (though it would have to wait for each request to be fulfilled before it could request the next object; otherwise it would be Pipelining).

Now, looking at performance, HTTP1.1 also defined a transfer method called Transfer-Encoding: Chunked. The reason for this was that sometimes the Webserver would already have a couple of Bytes ready to be sent, but would not yet know the total length (size) of the whole answer. Without chunking, the Webserver would have to wait until the last Byte had been handed to it, so that it could write the correct Content-Length Header for the full answer into the HTTP Header.

Otherwise, how would the Browser be able to know that the received answer is complete, and that it is allowed to send the next request on this very same TCP connection?

Transfer-Encoding: Chunked solves this problem. It basically tells the Browser: “Here are the first Bytes to your request, but more will come. I do not know yet how many, but I will mark the ending of the Byte stream with a defined marker”. How exactly this is marked can be seen in the RFC linked above. The performance improvement is that the Browser can already “work” with this partial answer and start rendering or requesting other objects. How to use this technique best is described here by Steve Souders.
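To make the “defined marker” a bit more tangible, here is a minimal sketch of the chunked framing in Python. It is my own illustration, not code from any Webserver: the function names encode_chunked and decode_chunked are made up for this example, and trailers and error handling are left out. Each chunk is sent as its size in hex, a CRLF, the data and another CRLF; a chunk of size zero marks the end of the answer.

```python
def encode_chunked(parts):
    """Frame each non-empty part as one chunk, then append the final zero-size chunk."""
    out = bytearray()
    for part in parts:
        if not part:
            continue  # a zero-size chunk would signal the end of the body
        out += f"{len(part):X}".encode("ascii") + b"\r\n" + part + b"\r\n"
    out += b"0\r\n\r\n"  # the end marker: chunk size 0, no trailers
    return bytes(out)


def decode_chunked(data):
    """Reassemble the body from a chunked byte stream (trailers are ignored)."""
    body = bytearray()
    pos = 0
    while True:
        line_end = data.index(b"\r\n", pos)
        # The size line may carry extensions after ';', which this sketch skips.
        size_field = data[pos:line_end].split(b";", 1)[0]
        size = int(size_field.decode("ascii"), 16)
        pos = line_end + 2
        if size == 0:
            break
        body += data[pos:pos + size]
        pos += size + 2  # skip the chunk data and its trailing CRLF
    return bytes(body)


wire = encode_chunked([b"Hello, ", b"world"])
print(wire)
print(decode_chunked(wire))
```

The point to notice is that the sender never needs the total length up front. It only needs the length of the piece it is about to send.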

So now that we have some basic understanding of the Why and How of Transfer-Encoding: Chunked, let’s get to the issue of this specific site:

In the waterfall below you can see how this website was loading. The base page, the HTML document, takes a very slow 15 seconds to load.

What might be the reason? I did repeat the test a couple of times to make sure this wasn’t a single packet being lost. But the pattern remained stable.

As you can see in the bandwidth utilization below the waterfall, it is most definitely not a bandwidth issue. Most of the time the bandwidth consumption is close to zero. The blue bar also shows that the content started to come down very early, but took a REALLY long time to complete.

So, clicking on the object will tell us a little bit more about it:

First you see the size of the object: a small 3.3 KByte. So the reason for the long transfer time is not the size of the object. You also see again that the first Bytes of the object arrived in under 1 second, but the full answer took 15 seconds.

Additionally you see that the Browser was allowing the content to be gzipped, and the Server indeed not only gzipped it, but also applied Transfer-Encoding: Chunked.

So, how can you dig deeper into this? A good idea is to use the TCPDump feature of WPT, which allows you to see each and every Byte and its timing on the wire. When doing this, you get the following picture:

Now you get some more information about what the issue is. You can see in the middle pane that the answer consisted of 3 chunks, of which the first 2 arrived almost instantly (Frames 8 and 28), and then, after 15 seconds, the remaining 444 Bytes arrived in Frame 242. So it seems that the beginning of the object could be sent very fast, but then the webserver had to wait 15 seconds for the last part to be generated. Unfortunately, the content is gzipped, so you can’t see what the content of these last 444 Bytes was. You can only guess that the code causing the delay sits somewhere in the last 15% of the document.

But fortunately enough there is Pat. Pat Meenan. He gave me the advice: You might want to look at the setHeader command from WPT’s scripting capabilities. Bingo! setHeader allows you to modify/override any of the Browser’s GET-Request Headers. And that’s what you can do. Simply use the Script:

setHeader Accept-Encoding: None

and you’re done!

Now when repeating this test, you see indeed in the Headers that no gzip was applied:

And in Wireshark you can see the content of the last chunk in plain text!

So now you CAN (or rather COULD) see what code was in the delayed frame(s). Could, because in this “real-life-example” the site owner fixed the issue before I was able to apply the Accept-Encoding: None Header. 🙂

Anyway, two scenarios might have been possible to see:

a) Issue still present -> You can see the piece of code causing it.

b) Issue gone -> You have a problem with flushing gzip buffers.

Or something totally wild, like server-side malware that tried and failed to attach its code to the bottom of the object… 😉


Objects in the LAN may appear SLOWER than they are…


Many of you are well aware that you should never judge the performance / load times of your site by testing from the local LAN. This is pretty common knowledge, as the results may be skewed by the fact that your LAN is often connected via a big pipe to the site you are working on.

But there is actually more to it. The results might be skewed in the opposite direction as well, and here I would like to point out what the reasons might be. And also why you should care anyway, even though you already know that testing from your LAN is not recommended.

So let’s answer the second question first. For the second time in my career I am responsible for a rather large portal. And for the second time, it was much slower from the local LAN compared to what normal customers see. The reason we did and do care is simply the doubt of our INTERNAL customers. Being in a tech department, our internal customers are Marketing and Customer Services. And these employees (like the rest of our company, the CEO for example) of course might be thinking (and some indeed are): “WTF, they are celebrating how fast our portal is, and even though I am almost directly connected to it, it is f*king slow!”

There are times when you are lucky and they confront you with that. And then you might have some good Videos under your belt, “proving” that the customer experience is much better. But I can assure you, doubts will remain (“They came back with some lame techie excuses”). And sometimes they don’t confront you with that, so you don’t even have the chance to defend yourself. We just had that recently, when we relaunched our Portal, announcing big performance improvements, and got some pretty harsh responses from our colleagues. So this is the reason you maybe SHOULD care that it is at least not SLOWER than what customers perceive.

Now that we have covered the motivation, let’s have a look at the root causes:

Debugging this was difficult, as workstations in the LAN a) rarely have admin privileges, so some of your tools might be difficult to get running, and b) are under the protection of data privacy laws, so tools like Wireshark might be forbidden. In our case most of the analysis was done using Fiddler.

Things we found, sorted by priority:

  1. Internet Explorer: This thing actually has a couple of issues. In my company IE8 is the mandatory Browser, and it is directed to a corporate proxy. The impact on performance is massive:
    IE 6 to IE 8 limit the number of TCP Connections to 2 when connecting through a proxy! As we shard our Portal across three domains, this means for IE 8 a difference of 18 vs. 2 connections.
    IE 6 to IE 8 by default downgrade from HTTP 1.1 to HTTP 1.0 when connecting through a Proxy! This is massive. You won’t have persistent connections, which is extremely painful with SSL (which is the case with our Portal), and you also lose the ability to use your carefully crafted Cache-Control Headers!
    The first issue can be solved via a Registry Key, the second one is a Browser Setting. Especially regarding persistent connections, be aware that you have to check the whole chain (Browser, Proxy, Webserver) to make sure that none of them is configured to downgrade to HTTP 1.0! Eric Lawrence from Microsoft, for example, has written an excellent Blogpost on that.
  2. Security: Within our LAN we actually have two kinds of proxies: one for unknown domains, and one for “known secure” domains, which means some kind of white list. Of course our portal is on it 🙂 The proxy for unknown domains checks each and every object for Viruses. Now when we introduced 2 sharded domains with the relaunch of our portal, we forgot to put them on the white list. As a result, all objects fetched from the sharded domains (~90%) went through a time-consuming Virus scan!
  3. DNS: As we found out, in our corporate setup one device in front of our local DNS Servers was configured to drop traffic on TCP Port 53. Unfortunately the workstations in our LAN were trying to resolve our Portal domains using TCP first, and only fell back to UDP after a timeout. So we had a nice lag in Time-to-Render right at the beginning. This behaviour has apparently been so common in the past that an RFC was published to stop people from thinking that UDP Port 53 is enough to support the DNS System.
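To get a feeling for what 18 vs. 2 connections means, here is a small napkin sketch in Python. It rests on loud assumptions: every object takes one request/response turn, all connections are equally fast, and the page size of 60 objects is a made-up number, not a measurement from our Portal. The function name request_rounds is also just for illustration.

```python
import math


def request_rounds(num_objects, parallel_connections):
    """Number of sequential request/response turns if every connection
    fetches one object per turn (idealized, no pipelining)."""
    return math.ceil(num_objects / parallel_connections)


# Hypothetical page with 60 objects
print(request_rounds(60, 18))  # 3 sharded domains x 6 connections
print(request_rounds(60, 2))   # IE behind the proxy
```

With these assumptions the proxy case needs 30 turns instead of 4, and each turn costs at least one round trip. Add the HTTP 1.0 downgrade, and every one of those turns also pays for a fresh TCP (and SSL) handshake.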

So… well, we fixed the issues, and now they (our colleagues and the CEO) live happily ever after. Testing, though, we still don’t do from our local LAN 🙂

A big “Thanks” goes out to Diemo S., Lars W. and Holger T. who actually DID the research and the fixes that I was just blogging about 🙂


Quiz: Guess the impact of 50 KByte on Page load via DSL

As you might recall, is one of our web properties we’re responsible for. And due to this we had a rather busy week. Reason for this can be found here. But I do not want to comment on that, but rather about the technical outcome 🙂 Because due to “the story” we had to redo quite a few of our graphics on our website.

And THAT was actually one thing I was eagerly waiting for.

Before “the story” we had a page header with a rather difficult image. Our current design language pretty often challenges us (or better, our agencies) quite a bit, as the graphics often consist of a photo-realistic part in front of a colour-gradient background. So the different compression methods fail one way or the other. If we compress using PNG8, the quality of the photo-realistic part degrades rather badly. If we compress using JPG, the colour gradients and sharp edges become really ugly, making it necessary to compress with high quality settings, resulting in rather large files. If we worked with 2 files using transparency and different compression methods, well, we would have 2 HTTP Requests instead of 1.

So, to make a long story short, this header image was formerly 72 KByte in size. I asked my colleagues to make sure that the new one would be much smaller, and to really push for that. What we got back was an 18 KByte image. I wasn’t totally satisfied with the result, as the agency used JPG again, even though the new image wasn’t really suitable for JPG. With PNG8 I was able to further reduce it to 8 KByte instead of 18 KByte. Nevertheless I was happy enough with the reduction of ~50 KByte.

Now back to the quiz question: How much faster do you think our site reaches its time to visually complete due to that reduction? (using the Frankfurt node of WPT @ 1.5 MBit/s, 50 ms RTT and IE8)

The answer might surprise some people (at least it did within our company). Normally you might do a napkin calculation like this: 50 KByte = 400 KBit. 400 KBit on a 1.5 MBit/s line should be transmitted in ~1/4th of a second = 250 ms. So the page load time might decrease by 250 ms.
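Written down as a few lines of Python, the napkin calculation looks like this. It is just the arithmetic from above, assuming 1 KByte = 8 KBit and 1 MBit = 1000 KBit; the function name naive_transfer_ms is made up for this example.

```python
def naive_transfer_ms(size_kbyte, line_mbit_per_s):
    """Transfer time in ms if the full line rate were available from the first byte."""
    size_kbit = size_kbyte * 8
    line_kbit_per_s = line_mbit_per_s * 1000
    return size_kbit / line_kbit_per_s * 1000


print(round(naive_transfer_ms(50, 1.5)))  # 267, i.e. roughly a quarter of a second
```

Note what the formula does NOT contain: no round trip time and no connection state at all. It treats the line as if it were running at full speed from the very first byte.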

But this omits TCP Slow Start from the equation! If you are unfamiliar with TCP Slow Start, the VERY, VERY simplified and brief explanation is: When a TCP Connection is established, it doesn’t utilize all available bandwidth from the beginning, but instead “slowly” increases the bandwidth utilization, to test the available bandwidth. A TCP connection “can’t know” the available bandwidth, so in order to not overload the network, it starts slowly and increases over time.

A much better and longer explanation is here by Steve Souders, an excellent Video from Velocity 2010 can be found here, and a really great animation visualizing it can be found here.

Sooo… What was the question again? Oh, right! The benefit of the image size reduction! Getting back to our napkin calculation: The former image was ~70 KByte in size, which is roughly 0.5 MBit, which should load in ~333 ms over a 1.5 MBit/s line. Right?

Again I used WPT with its tcpdump feature and loaded the image. And the result is, without DNS resolution…: ~666 ms! 🙂 So it is roughly double! Why so? You guessed it, the reason is TCP Slow Start.

As you can see in the image above, using Wireshark’s TCP Bandwidth Statistic Analysis, it takes close to 400 ms before this TCP Connection has reached its bandwidth limitation!

Now the problem is that the header image is right at the beginning of the HTML basepage. And therefore it gets loaded on a rather “cold” TCP Connection. With IE8 opening up to 6 connections per server, you will start close to the beginning of the page load with 5 “cold” TCP connections.

Just recently a lot of smart people started working on circumventing the limitations of TCP Slow Start in different areas. SPDY, for example, multiplexes requests on a single TCP connection, therefore going through TCP Slow Start only once. Firefox now reuses connections by the highest CWND. Starting with Linux Kernel 2.6.39 the initial CWND has been increased from 3 to 10.
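To see why the initial CWND matters, here is a very idealized sketch in Python. The assumptions are mine: a segment size of 1460 Bytes, a window that simply doubles every round trip, no packet loss, no delayed ACKs and no bandwidth cap. So it will not reproduce the timings measured above; it only counts round trips. The function name round_trips_needed is made up for this example.

```python
import math

MSS_BYTES = 1460  # assumed maximum segment size


def round_trips_needed(size_bytes, initial_cwnd, mss=MSS_BYTES):
    """Round trips an idealized slow start needs to deliver size_bytes,
    with the congestion window doubling after every round trip."""
    segments_left = math.ceil(size_bytes / mss)
    cwnd = initial_cwnd
    rounds = 0
    while segments_left > 0:
        segments_left -= cwnd  # send one full window in this round trip
        cwnd *= 2              # slow start: the window doubles per round trip
        rounds += 1
    return rounds


for size_kbyte in (72, 8):
    for initial_cwnd in (3, 10):
        rounds = round_trips_needed(size_kbyte * 1024, initial_cwnd)
        print(size_kbyte, "KByte, initial CWND", initial_cwnd, "->", rounds, "round trips")
```

In this model the old 72 KByte image needs 5 round trips on a cold connection with an initial CWND of 3, and still 3 with an initial CWND of 10. The 8 KByte version fits into 2 round trips, or a single one with the bigger window.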

But, as you can’t force your visitors to use a specific Browser, and you might not be able to choose your Linux Kernel, your best bet is still:

Reduce Bytes!
And don’t be fooled by the fact that 50 KByte on a 1.5 MBit/s line sounds negligible.

While rambling, I almost forgot the initial question: The impact of this 50 KByte saved on Time-to-visually-complete. See for yourself! 🙂
