
The Problem with GRE


Story behind #

There are different ways to connect two network sites to each other. Sometimes the best way is a dedicated layer 2 connection such as dark fiber; sometimes a logical interconnect such as a GRE or IPsec tunnel is enough. GRE is often chosen because it needs minimal configuration and is a simpler, less error-prone alternative to IPsec, as long as the payload does not need to be encrypted. But GRE also comes with a big problem: it changes the structure of the IP packet itself. So what is the problem with GRE, and why are the attitude and lack of knowledge of some administrators the cause of it?

Additional Headers due to GRE #

Before implementing GRE tunnels, it is important to consider a few points about how the packets are transported; otherwise there may be unpleasant surprises.

By default, the global Internet is built around a maximum packet size (MTU) of 1500 bytes. It is a de facto standard inherited from Ethernet rather than a law, but in practice no party expects anything larger than 1500 bytes; bigger packets are either fragmented or dropped. GRE encapsulation adds a new outer IP header (20 bytes) and a GRE header (4 bytes without optional fields) to our IP packet, 24 bytes combined, which would grow a full-size packet to 1524 bytes. To avoid this, we shrink the payload so that the encapsulated packet does not exceed the 1500 bytes used by the global domain. This requires a manual adjustment of the MTU to 1476 bytes (1500 minus 24) and of the TCP MSS to 1436 bytes (1476 minus 20 bytes of IP header and 20 bytes of TCP header).
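The arithmetic behind these two values can be sketched in a few lines of Python. This is only a minimal illustration: it assumes plain IPv4 headers without options and a basic 4-byte GRE header (no key, checksum or sequence number), and the function name `gre_tunnel_sizes` is made up for this example.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, assuming no TCP options
GRE_HEADER = 4    # bytes, assuming basic GRE without key/checksum/sequence


def gre_tunnel_sizes(path_mtu: int = 1500) -> dict:
    """Return the GRE overhead and the resulting tunnel MTU and TCP MSS."""
    overhead = IPV4_HEADER + GRE_HEADER            # new outer IP header + GRE header
    tunnel_mtu = path_mtu - overhead               # largest inner packet that still fits
    tcp_mss = tunnel_mtu - IPV4_HEADER - TCP_HEADER  # TCP payload of the inner packet
    return {"overhead": overhead, "tunnel_mtu": tunnel_mtu, "tcp_mss": tcp_mss}


sizes = gre_tunnel_sizes()
print(sizes)  # {'overhead': 24, 'tunnel_mtu': 1476, 'tcp_mss': 1436}
```

If the tunnel uses optional GRE fields, the overhead grows and both values shrink accordingly.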

Original IP Packet compared to a GRE IP Packet

ICMP suite is not only ping #

In theory, manual adjustment would not be necessary as long as Path MTU Discovery (PMTUD) works. For this, the sending host sets the Don't Fragment (DF) flag on its packets. If a router on the path cannot forward a packet because it is larger than the MTU of the next hop, it drops the packet and returns an ICMP Destination Unreachable error (Type 3, Code 4: Fragmentation needed and DF set) that carries the MTU of that next hop. The sender lowers its packet size and tries again, and this is repeated until the packets fit the whole path. But as soon as a device on the path blocks ICMP Type 3 Code 4 or the entire ICMP suite, PMTUD fails: the sender never learns the smaller MTU, the MTU/MSS mismatch persists, and communication breaks down because full-size packets are silently dropped and e.g. a TCP session stalls.

ICMP is dropped by policy, so it never reaches the webserver

As shown above, a policy drops the ICMP packets received by the router before they enter the server network. In this case the MTU size is not negotiated automatically.
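To make the discovery loop and the effect of such an ICMP drop more tangible, here is a small simulation. It is a simplified model and not how any real TCP/IP stack is implemented: the path is just a list of link MTUs, and the names `discover_path_mtu` and `icmp_blocked` are invented for this sketch.

```python
from typing import List, Optional


def discover_path_mtu(link_mtus: List[int], start_size: int,
                      icmp_blocked: bool = False,
                      max_tries: int = 10) -> Optional[int]:
    """Simulate PMTUD with DF set; return the usable packet size or None."""
    size = start_size
    for _ in range(max_tries):
        # First link on the path that is too small for the current packet size.
        bottleneck = next((mtu for mtu in link_mtus if mtu < size), None)
        if bottleneck is None:
            return size  # the packet fits every link on the path
        if icmp_blocked:
            # The "Fragmentation needed" error never reaches the sender,
            # so it keeps sending packets that are silently dropped.
            return None
        # ICMP Type 3 Code 4 reported the next-hop MTU, retry with that size.
        size = bottleneck
    return None


path = [1500, 1476, 1500]  # assumed example path with a GRE tunnel in the middle
print(discover_path_mtu(path, 1500))                     # 1476
print(discover_path_mtu(path, 1500, icmp_blocked=True))  # None
```

The second call is the black hole situation: nothing tells the sender to go below 1500 bytes.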

Examples with EVE-NG #

It's easy to give you an example based on EVE-NG. We rebuild a network to walk through this problem.

We have a customer site with Client1, a webserver somewhere on the global Internet, and a provider site that is connected to the customer site via GRE. Client1 in the customer site wants to download a 100 MB file via SCP from the webserver. It reaches the webserver via the direct connection between R1 and R2. The webserver reaches Client1 via R3 and R4 and finally through the GRE tunnel back to the customer site's router R1.

It's necessary to configure the MTU and TCP-MSS adjustment on the R1 outgoing interface to R2 and at least on the tunnel interfaces for inbound traffic from R4 to R1. In addition, we can configure the adjustment on the R1 tunnel interface to R4, but in this example no traffic is forwarded from R1 via GRE to R4.

Yeah, latency is bad, but it's still virtualization inside virtualization… virtualization inception

You can download the Lab and import it to your EVE-NG (you need to have the images for this): gre.unl

Test #1 - ICMP allowed, no MTU and TCP-MSS adjustment #

The first test covers the best-case scenario, in which nothing is blocked and PMTUD works as expected to identify the correct MTU and TCP-MSS size:

  • ICMP is allowed
  • No MTU adjustment
  • No TCP-MSS adjustment

Deleted the MTU and TCP-MSS adjustments for this on all three interfaces.

R4

root@r4#  show | com
[edit interfaces gr-0/0/0 unit 0 family inet]
-     mtu 1476;
-     tcp-mss 1436;

[edit]
root@r4#  commit check
configuration check succeeds

[edit]
root@r4#  commit
commit complete

[edit]
root@r4#

R1

root@r1#  show | com
[edit interfaces gr-0/0/0 unit 0 family inet]
-     mtu 1476;
-     tcp-mss 1436;
[edit interfaces ge-0/0/1 unit 0 family inet]
-     mtu 1476;
-     tcp-mss 1436;

[edit]
root@r1#  commit check
configuration check succeeds

[edit]
root@r1#  commit
commit complete

[edit]
root@r1#

If we now start the copy process on Client1 with Wireshark running on the webserver interface, we can see that everything works smoothly. The three-way handshake takes place, and an ICMP Destination unreachable message is received with the response to use “MTU of next hop: 1476” instead of the attempted “Total Length: 1500”, as shown in the pcap below.

Starting the copy process: slow as hell, but again, it's virtualization inception, so for the test it doesn't matter at all

ICMP is doing its job by negotiating the MTU size between both devices

But we can observe a hell of a lot of TCP retransmissions even though the negotiation took place. This now happens for each sequence.

TCP retransmissions

Download pcap file to see full capture: test-1.pcapng

Test #2 - ICMP allowed, TCP-MSS adjustment but no MTU adjustment #

Will it look better if we do just a TCP-MSS adjustment and let the MTU be negotiated automatically?

  • ICMP is allowed
  • TCP-MSS adjustment to 1436 bytes
  • No MTU adjustment
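What a TCP-MSS adjustment does conceptually can be sketched as follows: the router looks at the TCP options of a passing SYN and lowers the advertised MSS if it is larger than the configured value. This is a hedged illustration and not the Junos implementation; `clamp_mss_option` is a hypothetical helper, and a real device would also have to recompute the TCP checksum afterwards.

```python
import struct

TCP_OPT_END = 0  # End of option list
TCP_OPT_NOP = 1  # No-operation, a single padding byte
TCP_OPT_MSS = 2  # Maximum segment size: kind, length 4, 16-bit value


def clamp_mss_option(options: bytes, max_mss: int) -> bytes:
    """Return a copy of the TCP options with the MSS lowered to max_mss."""
    out = bytearray(options)
    i = 0
    while i < len(out):
        kind = out[i]
        if kind == TCP_OPT_END:
            break
        if kind == TCP_OPT_NOP:
            i += 1
            continue
        if i + 1 >= len(out):
            break  # truncated option, stop parsing
        length = out[i + 1]
        if length < 2 or i + length > len(out):
            break  # malformed option, stop parsing
        if kind == TCP_OPT_MSS and length == 4:
            (mss,) = struct.unpack_from("!H", out, i + 2)
            if mss > max_mss:
                struct.pack_into("!H", out, i + 2, max_mss)
        i += length
    return bytes(out)


# Example SYN options: MSS 1460, NOP, NOP, SACK permitted
syn_options = bytes([2, 4]) + struct.pack("!H", 1460) + bytes([1, 1, 4, 2])
clamped = clamp_mss_option(syn_options, 1436)
print(struct.unpack_from("!H", clamped, 2)[0])  # 1436
```

Because only the SYN is rewritten, both endpoints simply believe the other side asked for smaller segments, and no ICMP is needed for TCP traffic.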

Added the TCP-MSS adjustments for this on all three interfaces.

R4

root@r4#  show | com
[edit interfaces gr-0/0/0 unit 0 family inet]
+     tcp-mss 1436;

[edit]
root@r4#  commit check
configuration check succeeds

[edit]
root@r4#  commit
commit complete

[edit]
root@r4#

R1

root@r1#  show | com
[edit interfaces gr-0/0/0 unit 0 family inet]
+     tcp-mss 1436;
[edit interfaces ge-0/0/1 unit 0 family inet]
+     tcp-mss 1436;

[edit]
root@r1#  commit check
configuration check succeeds

[edit]
root@r1#  commit
commit complete

[edit]
root@r1#

The difference between Test #1 and Test #2 is that we can still see the negotiation, but we no longer see any ICMP Destination unreachable messages, thanks to the fixed TCP-MSS adjustment.

Less slow but still slow, with a little more throughput

We still observe too many TCP retransmissions. A funny aspect is that the TCP-MSS was automatically shrunk from 1436 to 1424 bytes.

The TCP-MSS was automatically shrunk to 1424 bytes instead of staying at 1436 bytes
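A likely explanation for the 1424 bytes, stated here as an assumption rather than something the capture proves: if both hosts use the TCP timestamps option, every segment carries 12 extra bytes of TCP options (10 bytes of option plus 2 bytes of padding), and these are taken out of the negotiated MSS. A quick check of the numbers, with `payload_per_segment` being a name invented for this sketch:

```python
TCP_TIMESTAMPS_OPTION = 12  # bytes: 10-byte timestamps option + 2 bytes padding (assumed in use)


def payload_per_segment(mss: int, option_bytes: int = TCP_TIMESTAMPS_OPTION) -> int:
    """Payload left in each segment once per-segment TCP options are subtracted."""
    return mss - option_bytes


print(payload_per_segment(1436))  # 1424
```

So the value would not be a second adjustment by the router, just the clamped MSS minus the options that travel in every segment.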

Download pcap file to see full capture: test-2.pcapng

Test #3 - ICMP blocked, no MTU and TCP-MSS adjustment #

This test shows the worst-case scenario. There is no communication between Client1 and the webserver, because ICMP is blocked, PMTUD does not work, and the MTU and TCP-MSS sizes are not adjusted.

  • ICMP is blocked
  • No MTU adjustment
  • No TCP-MSS adjustment

Deleted the MTU and TCP-MSS adjustments for this on all three interfaces.

R4

root@r4#  show | com
[edit interfaces gr-0/0/0 unit 0 family inet]
-     mtu 1476;
-     tcp-mss 1436;

[edit]
root@r4#  commit check
configuration check succeeds

[edit]
root@r4#  commit
commit complete

[edit]
root@r4#

R1

root@r1#  show | com
[edit interfaces gr-0/0/0 unit 0 family inet]
-     mtu 1476;
-     tcp-mss 1436;
[edit interfaces ge-0/0/1 unit 0 family inet]
-     mtu 1476;
-     tcp-mss 1436;

[edit]
root@r1#  commit check
configuration check succeeds

[edit]
root@r1#  commit
commit complete

[edit]
root@r1#

The ICMP filter is now running on the router port to which the webserver is connected.

R2

root@r2#  show | com
[edit interfaces ge-0/0/0 unit 0 family inet]
+     filter {
+         input icmp;
+         output icmp;
+     }
[edit]
+   firewall {
+       filter icmp {
+           term icmp-drop {
+               from {
+                   protocol icmp;
+               }
+               then {
+                   discard;
+               }
+           }
+           term accept {
+               then accept;
+           }
+       }
+   }

[edit]
root@r2#  commit check
configuration check succeeds

[edit]
root@r2#  commit
commit complete

[edit]
root@r2#

Yeah, that is the problem if ICMP is blocked and no adjustment takes place… we are not able to communicate through the GRE tunnel.

Nothing. Blackhole.

So we can see packets far above the 1500-byte MTU size and a lot of retransmissions. After a while there is an ARP broadcast in which the host asks for its own gateway. After the router has responded, the host tries again to finish the TCP handshake, but nope.

With a size of 1514 bytes (a 1500-byte IP packet plus the Ethernet header) it's not possible to communicate through the GRE tunnel
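Why this ends in a black hole can be sketched with the generic IP forwarding decision a router makes for an oversized packet. This is a simplified model of the standard behavior, not a description of what Junos does internally; `forward_decision` and its parameters are hypothetical names for this example.

```python
def forward_decision(packet_len: int, df_set: bool, next_hop_mtu: int,
                     icmp_reaches_sender: bool) -> str:
    """Describe what happens to one IP packet at a hop with a smaller MTU."""
    if packet_len <= next_hop_mtu:
        return "forwarded"
    if not df_set:
        return "fragmented and forwarded"
    if icmp_reaches_sender:
        # ICMP Type 3 Code 4 carries the next-hop MTU back to the sender.
        return f"dropped, sender told to use MTU {next_hop_mtu}"
    # The ICMP error is filtered somewhere, so the sender learns nothing.
    return "dropped silently"


# Test #1: ICMP allowed, the sender can adapt
print(forward_decision(1500, True, 1476, icmp_reaches_sender=True))
# Test #3: ICMP blocked, the sender keeps retransmitting 1500-byte packets
print(forward_decision(1500, True, 1476, icmp_reaches_sender=False))
```

The last case is exactly the combination tested here: DF set, a smaller MTU on the path, and no ICMP getting back.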

Download pcap file to see full capture: test-3.pcapng

Test #4 - ICMP blocked, MTU and TCP-MSS adjustment #

So what happens if we block ICMP again, but this time configure the MTU and TCP-MSS adjustment?

  • MTU adjustment to 1476 bytes
  • TCP-MSS adjustment to 1436 bytes
  • ICMP is blocked

Added the MTU and TCP-MSS adjustments for this on all three interfaces.

R4

root@r4#  show | com
[edit interfaces gr-0/0/0 unit 0 family inet]
+     mtu 1476;
+     tcp-mss 1436;

[edit]
root@r4#  commit check
configuration check succeeds

[edit]
root@r4#  commit
commit complete

[edit]
root@r4#

R1

root@r1#  show | com
[edit interfaces gr-0/0/0 unit 0 family inet]
+     mtu 1476;
+     tcp-mss 1436;
[edit interfaces ge-0/0/1 unit 0 family inet]
+     mtu 1476;
+     tcp-mss 1436;

[edit]
root@r1#  commit check
configuration check succeeds

[edit]
root@r1#  commit
commit complete

[edit]
root@r1#

The ICMP filter is still running on the router port to which the webserver is connected.

R2

root@r2#  show | com
[edit interfaces ge-0/0/0 unit 0 family inet]
+     filter {
+         input icmp;
+         output icmp;
+     }
[edit]
+   firewall {
+       filter icmp {
+           term icmp-drop {
+               from {
+                   protocol icmp;
+               }
+               then {
+                   discard;
+               }
+           }
+           term accept {
+               then accept;
+           }
+       }
+   }

[edit]
root@r2#  commit check
configuration check succeeds

[edit]
root@r2#  commit
commit complete

[edit]
root@r2#

Communication works thanks to the manual adjustment of MTU and TCP-MSS, even though PMTUD is not working.

No PMTUD, but the manual adjustments are doing the job

The TCP-MSS is again automatically shrunk to 1424 bytes, which is fine, as long as it does not exceed 1436 bytes.

We can see packets of 5762 bytes, so there are many more TCP retransmissions in this case
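One possible reading of the 5762 bytes, given as an assumption and not verified against the lab: if this is the frame length shown by Wireshark, if the TCP timestamps option is in use (32-byte TCP header), and if the capturing host uses segmentation offload, then Wireshark sees several segments bundled into one frame before the network card splits them. The numbers fit exactly; `segments_in_frame` is a name made up for this sketch.

```python
ETHERNET_HEADER = 14             # bytes
IPV4_HEADER = 20                 # bytes, assuming no IP options
TCP_HEADER_WITH_TIMESTAMPS = 32  # bytes: 20-byte TCP header + 12 bytes of options (assumed)


def segments_in_frame(frame_len: int, payload_per_segment: int) -> tuple:
    """Split a captured frame into (full segments, leftover payload bytes)."""
    payload = frame_len - ETHERNET_HEADER - IPV4_HEADER - TCP_HEADER_WITH_TIMESTAMPS
    return divmod(payload, payload_per_segment)


print(segments_in_frame(5762, 1424))  # (4, 0)
```

Under these assumptions the frame is four segments of 1424 bytes each, so what goes onto the wire would still respect the adjusted MSS.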

Communication between both devices is working. The full transfer starts with more TCP retransmissions, but then they become fewer and show up at a regular interval.

That's kinda interesting, maybe it's due to the virtualization inception?

Download pcap file to see full capture: test-4.pcapng

To conclude #

We were able to reproduce the biggest problem with GRE, which shows up in combination with a lack of knowledge about ICMP and the firewall rules/ACLs that block it. My recommendation is to always configure a manual adjustment for MTU and TCP-MSS on ALL outgoing interfaces (upstreams), and maybe also on the tunnel interface on your side if the routing is symmetric.

We could avoid such problems if administrators blocked only types 0 and 8, which are echo reply and echo request, and allowed at least type 3 with codes 1 and 4, which are Destination host unreachable and Fragmentation needed and DF set.
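The difference between the two policies can be sketched like this. It is only a model in Python and not router configuration: `lab_filter` mimics the blanket ICMP drop used in the tests, while `recommended_filter` is my reading of the recommendation above. Both function names are invented for this example.

```python
ECHO_REPLY = 0
DEST_UNREACHABLE = 3
ECHO_REQUEST = 8
CODE_HOST_UNREACHABLE = 1
CODE_FRAG_NEEDED = 4  # Fragmentation needed and DF set, used by PMTUD


def lab_filter(icmp_type: int, icmp_code: int) -> str:
    """Blanket policy from the tests: every ICMP packet is discarded."""
    return "discard"


def recommended_filter(icmp_type: int, icmp_code: int) -> str:
    """Block only ping (types 0 and 8) and let the rest of ICMP through."""
    if icmp_type in (ECHO_REPLY, ECHO_REQUEST):
        return "discard"
    return "accept"


pmtud_message = (DEST_UNREACHABLE, CODE_FRAG_NEEDED)
print(lab_filter(*pmtud_message))          # discard
print(recommended_filter(*pmtud_message))  # accept
print(recommended_filter(ECHO_REQUEST, 0)) # discard
```

With the second policy ping is still hidden, but the message PMTUD depends on gets through.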

Done.

Cheers mate,