
Support network ingress on arbitrary ports #9454

Merged: 1 commit into openshift:master on Aug 9, 2016

Conversation

@marun (Contributor) commented Jun 21, 2016

Add a new ExternalIP type to enable tracking of allocated IPs and to make it possible for an external address to be allocated to a service automatically from the range indicated by ExternalIPNetworkCIDRs. The type is global in scope, with its name being an IP address, to ensure that an IP is allocated at most once. For simplicity, an IP will be dedicated to a single service rather than having the port space of an IP shared between many services.

TODO:

  • Decide what policy and quota should be applied to the new type
  • Prevent deletion of an ExternalIP if a service is still using it
  • Delete ExternalIPs that are no longer referenced by a service
  • Update the admission controller (done)
    • Return an error if a requested external IP is already in use
    • Create a new ExternalIP object for each requested external IP
  • Add support for automatically allocating an external IP for a service that defines an appropriate annotation (e.g. allocateExternalIP)
    • Store the automatically allocated IP address in the service's externalIP list. This should eventually be replaced by ingress and the use of a status field to separate spec/intent from cluster state. For now, provide a degree of separation by supporting the allocation of a single address and refusing to allocate if externalIPs are specified statically.

cc: @openshift/networking @smarterclayton

@openshift-bot added the needs-rebase label (PR has merge conflicts with HEAD) on Jun 24, 2016
@openshift-bot removed the needs-rebase label on Jun 28, 2016
@eparis (Member) commented Jun 28, 2016

@jsafrane I'd love it if you could review this work...

@marun (Contributor, author) commented Jul 6, 2016

[test]


// ServiceRef references the service (in any namespace) that is
// bound to this external ip address.
ServiceRef kapi.ObjectReference `json:"serviceRef"`
Review comment (Contributor):

Should be in a status struct, just like PVs. The pattern here is exactly like PVC and PV, except we don't want to repeat the mistake we made there (making the name a user would see confusing). How does a user request one of these IPs?

Review comment (Contributor):

Can you write out in the description of the PR how you expect these to be used (the exact sequence of interactions that leads to one of these being allocated)?

@marun (Contributor, author) commented Jul 7, 2016

@smarterclayton

My goal was to allow automatic allocation of an 'ingress IP' to a service, as follows:

  • a service defines an annotation like 'allocateIngressIP'
  • a standard controller watches services for the annotation
  • when the annotation is seen:
    • allocate an ingress IP from the range permitted by ExternalIPNetworkCIDRs
    • add an externalIP corresponding to the ingress IP to the service
    • remove the 'allocateIngressIP' annotation

I was not intending for allocation of an ingress IP to be performed directly (at least not initially). Either an ingress IP would be allocated for a service in response to an annotation, or a service would specify one or more externalIPs and the admission controller would allocate the corresponding ingress IPs.

In response to your suggestion to avoid using the admission controller to allocate an ingress IP for each externalIP defined on a service: how would a standard controller be able to prevent multiple services from allocating the same ingress IP? I imagine a controller could record allocation status on the service, but services don't have a status field. Given that the ingress object appears poised to absorb responsibility for ingress IP allocation, should one be added? By using an admission controller, I was hoping to avoid needing one, since clashing allocations would be prevented on creation and update.

This solution is essentially a stop-gap while we wait for upstream ingress to evolve to meet the same requirement. Do you think this is a reasonable goal? Is there another way you would suggest accomplishing it?

@smarterclayton (Contributor):

Maybe a different question - if I had a Service of type LoadBalancer, and you were on bare metal, would I just expect it to get allocated one of these new-fangled magic external IPs? And then if I went to the cloud, would I expect to get a real AWS/GCE loadbalancer (because I'm kind of assuming that the trickery we do for external wouldn't work on AWS/GCE)? Or would I want to use this in GCE/AWS as well as a loadbalancer? It kind of sounds like this is bare metal L3/4 load balancing, which in theory services should do since we already have an IP.

Controllers are single threaded, so they keep their own state consistently and allocate (that's how the security allocator gives out unique UID blocks to each namespace that is created).

@smarterclayton (Contributor):

The security allocator uses namespaces as the authoritative source of "which IPs are handed out" and then uses the "allocator" utility code to manage tracking that. That's our standard pattern of "define a range in config, then hand out values out of it". An admin can override the security allocator by setting the annotation, in which case the allocator just marks that value as taken and moves on.

I don't want to get too much into allocation yet - @pravisankar knows a lot about it and can help explain the code if necessary, just want to understand the exact use case we have to solve so we can align it with existing APIs.

@smarterclayton (Contributor):

Ok, some concrete thoughts:

We have a big choice here:

  1. Implement "bare metal service loadbalancer", where user doesn't get to pick their IPs
  2. Implement "external IP allocation that end users can use", where a user does get to pick their IPs

I think the first is simpler than the latter, because we can make some assumptions:

  1. If the user can't hold on to an IP, the allocator is free to correct errors / races by overwriting to match the desired state (which means there are no edge cases / fixup scripts needed)
  2. The implementation is likely usable upstream
  3. We probably don't need to add a new API resource, because the allocator can handle it all (just needs a CIDR from config)
  4. We can alter the admission controller to allow the allocator controller to set arbitrary IPs, but not anyone else
  5. We don't need to add a new field, we can just use the type=LoadBalancer field
  6. When we move to ingresses and want to do L4 load balancing, the controller could be extended to watch both ingress and services
  7. We already have quota in place for type=LoadBalancer

If we do the latter, we have to add more API objects and do a bit more work:

  1. We'll need to have the "pool" -> "claim" model (we have to have a "pool" to perform locks against so we don't hand out the same IP twice), so two api objects just like PV and PVC
  2. We'll have to have fixup code to correct any gaps / mistakes made when things get out of sync, and the "binder" controller
  3. We'll have to alter the admission controller to allow IPs in the namespace, but we can't do it for the user.
  4. We'd need to add new quota objects for the new types

Adding API objects is always expensive - I think just from that measure if we can do "bare metal service loadbalancer" we get the end user feature without having to deal with some of the future compatibility, and we could change directions more easily later on.

@smarterclayton (Contributor) commented Jul 7, 2016

Re: pool vs claim - all allocators have to synchronize on something to ensure no duplicates. For IP addresses (like ports) we generally would use a single API object that contains the range and a bitmap of everything that has been allocated. The controller has to perform two duties (can be one or two controllers) - verify the bitmap is accurate (no double allocations), and give out IPs as they are needed. The claim is necessary so the user has an anchor (I own this IP). Unlike PVC, nothing is special about the IP except its value, so we don't need one object per IP. The binder has to go in a specific order - allocate from the pool (so that if a second controller starts it can't use that IP), then mark the user's IP as fulfilled (by setting status). It also has to handle rollback if the update failed for any reason.

The "repair" responsibility has to periodically run and verify that nothing funny has happened - usually that happens on controller start. The problem with repair is that it can't resolve any problems - an admin has to step in and fix them (since it could be data loss). That's the strongest advantage to the simple model above - user doesn't get special IPs, so the controller can be aggressive about fixing problems it finds.

@openshift-bot added the needs-rebase label on Jul 11, 2016
@marun (Contributor, author) commented Jul 11, 2016

@smarterclayton I think option 1 is preferable for all the reasons you describe. Would the following be acceptable?

  • a new config option will be added to define the ingress IP range
    • e.g. IngressIPNetworkCIDRs
    • separate from ExternalIPNetworkCIDRs: multiple services can use the same external IP, but as proposed an ingress IP is not intended to be shared. Separate allocation pools will make it easier to support separate validation schemes for each address type.
    • for simplicity, only a single contiguous range will be supported initially
  • a user on bare metal will be able to define a service of type LoadBalancer to request that an ingress IP be allocated for their service
  • a controller will watch for changes to services, and when a service of type LoadBalancer is first seen it will:
    • allocate an IP from the configured range (the allocation strategy will be based on the one used for ports)
    • update the service status with the assigned IP, e.g. status: {loadBalancer: {ingress: [{ip: x.x.x.x}]}}
    • add the ingress IP as an externalIP on the service to signal kube-proxy that traffic sent to the ingress IP should be forwarded to the service IP (a minimal sketch of this reconcile step follows after this list)
  • admission control will be used to prevent a service from defining an externalIP from the ingress range unless it has been allocated to the service as recorded in the status field
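A minimal sketch of the controller step proposed above, using simplified local stand-ins rather than the real kapi.Service types; the allocate callback, field names, and error handling are placeholders for illustration, not the code that was merged.

```go
package main

import (
	"errors"
	"fmt"
)

// Service is a simplified stand-in for kapi.Service, keeping only the
// fields the reconcile step below cares about.
type Service struct {
	Name        string
	Type        string   // "LoadBalancer", "ClusterIP", ...
	ExternalIPs []string // spec.externalIPs
	IngressIPs  []string // status.loadBalancer.ingress[].ip
}

// reconcile allocates an ingress IP for a LoadBalancer service that does
// not have one yet, records it in status, and mirrors it into
// spec.externalIPs so kube-proxy will forward traffic for that IP to the
// service.
func reconcile(svc *Service, allocate func() (string, error)) error {
	if svc.Type != "LoadBalancer" || len(svc.IngressIPs) > 0 {
		return nil // nothing to do
	}
	ip, err := allocate()
	if err != nil {
		return err // e.g. the range is full; the key would be requeued
	}
	svc.IngressIPs = append(svc.IngressIPs, ip)
	svc.ExternalIPs = append(svc.ExternalIPs, ip)
	return nil
}

func main() {
	svc := &Service{Name: "frontend", Type: "LoadBalancer"}
	err := reconcile(svc, func() (string, error) { return "172.29.0.1", nil })
	fmt.Println(err, svc.IngressIPs, svc.ExternalIPs)

	// A non-LoadBalancer service is left untouched.
	clusterIP := &Service{Name: "db", Type: "ClusterIP"}
	_ = reconcile(clusterIP, func() (string, error) { return "", errors.New("unused") })
	fmt.Println(clusterIP.IngressIPs)
}
```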

Open questions:

  • does deallocation require discussion or can the scheme for ports be used as-is?
    • Reuse of an ingress IP too soon after it has been deallocated could result in DNS pointing to the wrong service until the TTL expires.
  • should users be allowed to specify which IP they would like to be assigned (via loadBalancerIP)? Your comments suggest that user-specified IPs complicate things unnecessarily.

cc: @eparis @knobunc @pravisankar

@knobunc (Contributor) commented Jul 12, 2016

One silly question... I was talking to @DirectXMan12 and he wondered if the admission controller could allocate the externalIP directly and change the service. We weren't sure if that violated any rules about what an admission controller should do. And we weren't sure whether, if we allocated the IP in the admission controller and then had to rewind, we would ever free the allocated IP.

Would that work? Or is it illegal? Thanks.

@marun (Contributor, author) commented Jul 12, 2016

@knobunc There's nothing technically preventing the approach you're suggesting, but one of @smarterclayton's comments (#9454 (comment)) suggests that it's preferable to do as little as possible in admission control.

@smarterclayton (Contributor):

In general, we should only implement an admission controller if there is no way to accomplish the particular goal using a regular controller. Our goal is to have zero admission controllers in the long run.

> the allocation strategy will be based on that used for ports

The controller only has to allocate from memory; it does not need to round-trip allocations through etcd, as long as a couple of invariants are met. One, the controller on startup has to read the current state, check the status of all currently allocated services, allocate them in a deterministic order that another controller would observe (creation date, probably), and then start handling normal allocations. We have code to handle "wait for the initial list" through DeltaFIFO. However, if you do that, then you only need an in-memory copy because you're working in a consistent order - another controller running at the same time would be making the same decisions as you.
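A small sketch of the startup invariant described in the preceding paragraph: replay existing allocations in a deterministic order before handling live events. The types are local stand-ins rather than the real cache/DeltaFIFO machinery, and the tie-break rule is an assumption.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// allocatedService is a stand-in for a service observed in the initial
// list, carrying only the fields the replay step needs.
type allocatedService struct {
	Name      string
	IngressIP string // already-assigned ingress IP, "" if none
	Created   time.Time
}

// replayInitialList re-applies existing allocations in creation-date
// order so that any controller doing the same replay ends up with the
// same in-memory allocator state before normal processing starts.
func replayInitialList(services []allocatedService, markAllocated func(ip string)) {
	sort.Slice(services, func(i, j int) bool {
		if services[i].Created.Equal(services[j].Created) {
			return services[i].Name < services[j].Name // deterministic tie-break
		}
		return services[i].Created.Before(services[j].Created)
	})
	for _, svc := range services {
		if svc.IngressIP != "" {
			markAllocated(svc.IngressIP)
		}
	}
}

func main() {
	used := map[string]bool{}
	replayInitialList([]allocatedService{
		{Name: "b", IngressIP: "172.29.0.2", Created: time.Unix(200, 0)},
		{Name: "a", IngressIP: "172.29.0.1", Created: time.Unix(100, 0)},
	}, func(ip string) { used[ip] = true })
	fmt.Println(used)
}
```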

> add the ingress IP as an externalIP on the service to signal kube-proxy that traffic from the ingress IP should be sent to the service IP

We should upgrade the externalIP admission controller as you note, and also add the thing that was added to the new endpoints admission controller that lets an effective cluster admin set any IP they want into external (bypass the change). We can do that as a second part, but I think it's important so that cluster admins can fix things when they go wrong.

> does deallocation require discussion or can the scheme for ports be used as-is?

Generally we allocate from the first open spot, so while I think this is a reasonable concern, I'd say we could punt on this for a future iteration (where we keep a list of recently recovered IPs).

> should users be allowed to specify which IP they would like to be assigned (via loadBalancerIP)?

What does the existing service load balancer allocator do for GCE in the event the IP is not available - fail, or allocate another one? The controller can check whether the IP is free and use it, and we can let admins mutate the service status if necessary. We could also gate this with a permission check (the same one used for externalIP) so that only admins can set this value.

It's important to remember that the allocation controller has to take into account when an admin allocates an IP directly, and that we need to ensure it isn't confused about whether this was a bug or intentional. Arguably, we could tell admins to create a second IP range that is outside of the allocatable range and have them set themselves - the allocation controller must ignore ingress IPs not in its range anyway.
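A rough sketch of the externalIP admission check discussed above: an externalIP inside the ingress range is only accepted when it matches an IP the controller already recorded in the service's load-balancer status (the admin bypass mentioned above is omitted). Function and parameter names here are illustrative, not the merged admission plugin.

```go
package main

import (
	"fmt"
	"net"
)

// validateExternalIPs rejects externalIPs that fall inside the ingress
// CIDR unless they match an IP already allocated to the service (as
// reflected in its load-balancer status).
func validateExternalIPs(ingressRange *net.IPNet, externalIPs, allocated []string) error {
	allowed := map[string]bool{}
	for _, ip := range allocated {
		allowed[ip] = true
	}
	for _, raw := range externalIPs {
		ip := net.ParseIP(raw)
		if ip == nil {
			return fmt.Errorf("invalid externalIP %q", raw)
		}
		if ingressRange.Contains(ip) && !allowed[raw] {
			return fmt.Errorf("externalIP %q is in the ingress range but was not allocated to this service", raw)
		}
	}
	return nil
}

func main() {
	_, ingressRange, _ := net.ParseCIDR("172.29.0.0/16") // hypothetical ingress CIDR

	// Rejected: in the ingress range but never allocated.
	fmt.Println(validateExternalIPs(ingressRange, []string{"172.29.0.5"}, nil))
	// Accepted: matches the IP recorded in status.
	fmt.Println(validateExternalIPs(ingressRange, []string{"172.29.0.5"}, []string{"172.29.0.5"}))
	// Accepted: outside the ingress range, so covered by the existing externalIP rules.
	fmt.Println(validateExternalIPs(ingressRange, []string{"10.1.2.3"}, nil))
}
```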

@marun (Contributor, author) commented Jul 21, 2016

@smarterclayton Allocating from memory in a consistent order requires that the ordering of watch events be reproducible from a list call. I don't see how that is possible. The order of watch events is determined by when a change is accepted by etcd, but creationTimestamp (the obvious sorting key) is set in code. The two orderings have no deterministic relationship.

Here's a scenario assuming 2 controllers allocating from memory:

  • controllerX is started, receives nothing from its initial list call, and receives watch events (create serviceA, create serviceB)
  • controllerY is started and receives (serviceA, serviceB) from its initial list call

It would be possible for both serviceA and serviceB to be assigned the same IP if the sorted order of controllerY's list result happened to be (serviceB, serviceA) (e.g. if the creationTimestamp of B were earlier than A's).

Given this, I don't think allocating from memory would be supportable for more than a single controller. It would be necessary to resort to persistent storage to coordinate multiple controllers. Which is more important - multiple controllers or avoiding persistent storage?

@smarterclayton (Contributor):

Sort the list call before it is returned by the controller. In general the server is already sorting, so I don't think order is incorrect. In your watch calls you need to order based on something deterministic (create date) to determine whether to replace. In your initial setup you need to make decisions based on that order.

@marun (Contributor, author) commented Jul 28, 2016

Latest update is rough (and testing is still in progress) but feedback is welcome.

@openshift-bot removed the needs-rebase label on Jul 28, 2016
@@ -401,6 +401,11 @@ type MasterNetworkConfig struct {
// CIDR will be rejected. Rejections will be applied first, then the IP checked against one of the allowed CIDRs. You
// should ensure this range does not overlap with your nodes, pods, or service CIDRs for security reasons.
ExternalIPNetworkCIDRs []string `json:"externalIPNetworkCIDRs"`
// IngressIPNetworkCIDRs controls the range to assign ingress ips from for services of type LoadBalancer on bare
// metal. If empty, ingress ips will not be assigned. It may contain a single CIDR that will be allocated from.
// For security reasons, you should ensure that this range does not overlap with the CIDRS reserved for external ips,
Review comment (Contributor):

Nit: CIDRs (not CIDRS)

@marun (Contributor, author) commented Aug 8, 2016

@smarterclayton I'm satisfied with the unit test coverage. The integration test is now validating multiple controllers. There is the possibility of thrashing if multiple controllers coincide with IP exhaustion, but I'm assuming that's a rare enough case not to spend further effort on it.

There's still the issue of hack/verify-gofmt.sh failing. I can't reproduce locally. Any thoughts?

@marun (Contributor, author) commented Aug 8, 2016

@smarterclayton More questions:

  • How do we safely default to a random private range? Would it be enough to validate that it doesn't clash with the configuration for external IPs, nodes, pods and services?
  • My reading of your response about when to send events suggests that sending an event every time an ErrFull error is seen from the allocator is not ideal, since that could occur repeatedly and in a loop. What's the alternative - rate-limit the sending of events? Send only the first one? Or is logging sufficient?
  • Is it enough to log automatic reallocation? How should a user best be notified when their service's ingress IP allocation changes?

@marun force-pushed the tcp-routes branch 2 times, most recently from ade3128 to 95adb2b on August 8, 2016 at 13:53
@marun changed the title from "WIP: Support network ingress on arbitrary ports" to "Support network ingress on arbitrary ports" on Aug 8, 2016
@smarterclayton (Contributor):

> How do we safely default to a random private range? Would it be enough to validate that it doesn't clash with the configuration for external IPs, nodes, pods and services?

Yes. We should pick something similar enough to the existing ranges that it's recognizable.

> My reading of your response about when to send events suggests that sending an event every time an ErrFull error is seen from the allocator is not ideal, since that could occur repeatedly and in a loop. What's the alternative - rate-limit the sending of events? Send only the first one? Or is logging sufficient?

I think an event on ErrFull is relevant, although when we are in that state, how much work do we do before we find out the state is full? I.e. are we making a bunch of calls to the API server and then discovering we're full? I'm thinking about resync mostly - I don't want to write once per resync item. However, it's not the end of the world to just send the event on ErrFull and then open an issue to deal with it later.

> Is it enough to log automatic reallocation? How should a user best be notified when their service's ingress IP allocation changes?

An event on automatic reallocation in the user's namespace is appropriate (since it should be a one time event).

@marun (Contributor, author) commented Aug 8, 2016

@smarterclayton If the configured IP range is small relative to the number of services requiring allocation, the queue could cycle endlessly and generate an ErrFull event for each service pending allocation on each cycle. At present the only thing that would limit the rate of events being sent is that adding back to the work queue is rate limited. Is that sufficient?

@smarterclayton (Contributor):

Yes, that's sufficient.
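A toy model of the behaviour agreed on above: allocation failures put the service key back on the queue with exponential backoff, which bounds how often ErrFull events can be emitted. This is not the client-go workqueue API, just an illustration; the base delay and cap are arbitrary assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var ErrFull = errors.New("range is full")

// requeueDelay grows exponentially with consecutive failures and is
// capped, mimicking a rate-limited work queue.
func requeueDelay(failures int) time.Duration {
	d := 5 * time.Millisecond << uint(failures)
	if d > time.Second {
		return time.Second
	}
	return d
}

func main() {
	allocate := func() error { return ErrFull } // the range is exhausted
	failures := 0
	for attempt := 1; attempt <= 5; attempt++ {
		if err := allocate(); err != nil {
			failures++
			// A real controller would emit an ErrFull event here; the
			// growing requeue delay limits how often that can happen.
			fmt.Printf("attempt %d: %v, requeueing after %v\n", attempt, err, requeueDelay(failures))
		}
	}
}
```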

@marun force-pushed the tcp-routes branch 4 times, most recently from f23e93c to a07ccce on August 8, 2016 at 17:18
// metal. If empty, ingress ips will not be assigned. It may contain a single CIDR that will be allocated from.
// For security reasons, you should ensure that this range does not overlap with the CIDRs reserved for external ips,
// nodes, pods, or services.
IngressIPNetworkCIDR string `json:"ingressIPNetworkCIDR"`
Review comment (Contributor):

If I specify CIDR "0.0.0.0/32" that means no ingress - do we check for that?

Review comment (Contributor, author):

Done. I've disallowed 0.0.0.0 (see validation/master.go).

Review comment (Contributor):

If we're going to default this to on and provide a value, we have to be able to distinguish between unset and set. You can do that either by allowing 0.0.0.0/32 to mean "set, but deliberately empty", or by turning this value into a pointer.

However, we can do that in a follow up.

Review comment (Contributor, author):

Given that this is a string, isn't "" enough to mean unset?

I'm working on a followup that sets a default and also validates against overlap between IngressIPNetworkCIDR, ExternalIPNetworkCIDRs, ClusterNetworkCIDR and ServiceNetworkCIDR.

Regarding the default, is this the right place to add it?

https://github.com/openshift/origin/blob/master/pkg/cmd/server/start/network_args.go

Review comment (Contributor):

If you set a default, you need a way to distinguish between default and unset, so that if an admin doesn't want to use the default, they can say "no default".

Review comment (Contributor):

No, defaulting happens in pkg/cmd/server/api/v1/conversions.go, and there are a few other steps to handle "default for existing clusters" vs. "default for new clusters".

@@ -157,6 +157,12 @@ func ValidateMasterConfig(config *api.MasterConfig, fldPath *field.Path) Validat
}
}
}
if len(config.NetworkConfig.IngressIPNetworkCIDR) > 0 {
cidr := config.NetworkConfig.IngressIPNetworkCIDR
if _, ipNet, err := net.ParseCIDR(cidr); err != nil || ipNet.IP.String() == "0.0.0.0" {
Review comment (Contributor):

What if someone specifies 0.0.0.0 in ipv6? Use IP.IsUnspecified() instead.
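A hedged sketch of what the suggested check looks like: IP.IsUnspecified() covers both 0.0.0.0 and the IPv6 unspecified address ::. The function name and error messages below are illustrative, not the exact validation code in the PR.

```go
package main

import (
	"fmt"
	"net"
)

// validateIngressIPNetworkCIDR mirrors the validation discussed above:
// an empty value means "do not assign ingress IPs", anything else must
// parse as a CIDR and must not be based on the unspecified address.
func validateIngressIPNetworkCIDR(cidr string) error {
	if len(cidr) == 0 {
		return nil
	}
	_, ipNet, err := net.ParseCIDR(cidr)
	if err != nil {
		return fmt.Errorf("%q is not a valid CIDR: %v", cidr, err)
	}
	if ipNet.IP.IsUnspecified() {
		return fmt.Errorf("%q must not be based on the unspecified address", cidr)
	}
	return nil
}

func main() {
	for _, cidr := range []string{"", "172.29.0.0/16", "0.0.0.0/32", "::/64", "bogus"} {
		fmt.Printf("%q -> %v\n", cidr, validateIngressIPNetworkCIDR(cidr))
	}
}
```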

Review comment (Contributor, author):

Done.

@smarterclayton (Contributor):

One last issue (the IP.IsUnspecified()) and I'll merge this.

Commit message:
This change adds a new controller that allocates ips to services of
type load balancer from a range configured by IngressIPNetworkCIDR.
This is intended to support ip-based traffic ingress on bare metal.
@openshift-bot (Contributor):

Evaluated for origin test up to fbca710

@smarterclayton (Contributor):

LGTM [merge]

@openshift-bot (Contributor):

continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/7653/)

@openshift-bot (Contributor) commented Aug 8, 2016

continuous-integration/openshift-jenkins/merge Waiting: You are in the build queue at position: 2

@openshift-bot merged commit 6c50f20 into openshift:master on Aug 9, 2016
@openshift-bot (Contributor):

Evaluated for origin merge up to fbca710
