
Support network ingress on arbitrary ports #9454

Merged: 1 commit into openshift:master on Aug 9, 2016

Conversation

@marun (Contributor) commented Jun 21, 2016

Add a new ExternalIP type to enable tracking of allocated IPs and to make it possible for an external address to be allocated to a service automatically from the range indicated by ExternalIPNetworkCIDRs. The type is global in scope, with its name being an IP address, to ensure that an IP is allocated at most once. For simplicity, an IP will be dedicated to a single service rather than having the port space of an IP shared between many services.

TODO:

  • Decide what policy and quota should be applied to the new type
  • Prevent deletion of an ExternalIP if a service is still using it
  • Delete ExternalIPs that are no longer referenced by a service
  • Update the admission controller (done)
    • Return an error if a requested external IP is already in use
    • Create a new ExternalIP object for each requested external IP
  • Add support for automatically allocating an external IP for a service that defines an appropriate annotation (e.g. allocateExternalIP)
    • Store the automatically allocated IP address in the service's externalIP list. This should eventually be replaced by ingress and the use of a status field to separate spec/intent from cluster state. For now, provide a degree of separation by supporting the allocation of a single address and refusing to allocate if externalIPs are specified statically.

cc: @openshift/networking @smarterclayton

@openshift-bot added the needs-rebase label (PR has merge conflicts with HEAD) on Jun 24, 2016
@openshift-bot removed the needs-rebase label on Jun 28, 2016
@eparis (Member) commented Jun 28, 2016

@jsafrane I'd love it if you could review this work...

@marun (Contributor, author) commented Jul 6, 2016

[test]


// ServiceRef references the service (in any namespace) that is
// bound to this external ip address.
ServiceRef kapi.ObjectReference `json:"serviceRef"`
Review comment (Contributor):

Should be in a status struct, just like PVs. The pattern here is exactly like PVC and PV, except we don't want to repeat the mistake we made there (making the name a user would see confusing). How does a user request one of these IPs?

Review comment (Contributor):

Can you write out in the description of the PR how you expect these to be used (the exact sequence of interactions that leads to one of these being allocated)?

@marun (Contributor, author) commented Jul 7, 2016

@smarterclayton

My goal was to allow automatic allocation of an 'ingress IP' to a service, as follows:

  • a service defines an annotation like 'allocateIngressIP'
  • a standard controller watches services for the annotation
  • when the annotation is seen:
    • allocate an ingress IP from the range permitted by ExternalIPNetworkCIDRs
    • add an externalIP corresponding to the ingress IP to the service
    • remove the 'allocateIngressIP' annotation

I was not intending for allocation of an ingress IP to be performed directly (at least not initially). Either an ingress IP would be allocated for a service in response to an annotation, or a service would specify one or more externalIPs and the admission controller would allocate the corresponding ingress IPs.

In response to your suggestion to avoid using the admission controller to allocate an ingress IP for each externalIP defined on a service: how would a standard controller be able to prevent multiple services from allocating the same ingress IP? I imagine a controller could record allocation status on the service, but services don't have a status field. Given that the ingress object appears poised to absorb responsibility for ingress IP allocation, should one be added? By using an admission controller, I was hoping to avoid needing one, since clashing allocations would be prevented on creation and update.

This solution is essentially a stop-gap while we wait for upstream ingress to evolve to meet the same requirement. Do you think this is a reasonable goal? Is there another way you would suggest accomplishing it?

@smarterclayton (Contributor):

Maybe a different question - if I had a Service of type LoadBalancer, and you were on bare metal, would I just expect it to get allocated one of these new-fangled magic external IPs? And then if I went to the cloud, would I expect to get a real AWS/GCE loadbalancer (because I'm kind of assuming that the trickery we do for external wouldn't work on AWS/GCE)? Or would I want to use this in GCE/AWS as well as a loadbalancer? It kind of sounds like this is bare metal L3/4 load balancing, which in theory services should do since we already have an IP.

Controllers are single threaded, so they keep their own state consistently and allocate (that's how the security allocator gives out unique UID blocks to each namespace that is created).

@smarterclayton (Contributor):

The security allocator uses namespaces as the authoritative source of "which IPs are handed out" and then uses the "allocator" utility code to manage tracking that. That's our standard pattern of "define a range in config, then hand out values out of it". An admin can override the security allocator by setting the annotation, in which case the allocator just marks that value as taken and moves on.

I don't want to get too much into allocation yet - @pravisankar knows a lot about it and can help explain the code if necessary, just want to understand the exact use case we have to solve so we can align it with existing APIs.

@smarterclayton (Contributor):

Ok, some concrete thoughts:

We have a big choice here:

  1. Implement "bare metal service loadbalancer", where user doesn't get to pick their IPs
  2. Implement "external IP allocation that end users can use", where a user does get to pick their IPs

I think the first is simpler than the latter, because we can make some assumptions:

  1. If the user can't hold on to an IP, the allocator is free to correct errors / races by overwriting to match the desired state (which means there are no edge cases / fixup scripts needed)
  2. The implementation is likely usable upstream
  3. We probably don't need to add a new API resource, because the allocator can handle it all (just needs a CIDR from config)
  4. We can alter the admission controller to allow the allocator controller to set arbitrary IPs, but not anyone else
  5. We don't need to add a new field, we can just use the type=LoadBalancer field
  6. When we move to ingresses and want to do L4 load balancing, the controller could be extended to watch both ingress and services
  7. We already have quota in place for type=LoadBalancer

If we do the latter, we have to add more API objects and do a bit more work:

  1. We'll need to have the "pool" -> "claim" model (we have to have a "pool" to perform locks against so we don't hand out the same IP twice), so two api objects just like PV and PVC
  2. We'll have to have fixup code to correct any gaps / mistakes made when things get out of sync, and the "binder" controller
  3. We'll have to alter the admission controller to allow IPs in the namespace, but we can't do it for the user.
  4. We'd need to add new quota objects for the new types

Adding API objects is always expensive - I think just from that measure if we can do "bare metal service loadbalancer" we get the end user feature without having to deal with some of the future compatibility, and we could change directions more easily later on.

@smarterclayton (Contributor) commented Jul 7, 2016

Re: pool vs claim - all allocators have to synchronize on something to ensure no duplicates. For IP addresses (like ports) we generally would use a single API object that contains the range and a bitmap of everything that has been allocated. The controller has to perform two duties (can be one or two controllers) - verify the bitmap is accurate (no double allocations), and give out IPs as they are needed. The claim is necessary so the user has an anchor (I own this IP). Unlike PVC, nothing is special about the IP except its value, so we don't need one object per IP. The binder has to go in a specific order - allocate from the pool (so that if a second controller starts it can't use that IP), then mark the user's IP as fulfilled (by setting status). It also has to handle rollback if the update failed for any reason.

The "repair" responsibility has to periodically run and verify that nothing funny has happened - usually that happens on controller start. The problem with repair is that it can't resolve any problems - an admin has to step in and fix them (since it could be data loss). That's the strongest advantage to the simple model above - user doesn't get special IPs, so the controller can be aggressive about fixing problems it finds.

@openshift-bot added the needs-rebase label on Jul 11, 2016
@marun (Contributor, author) commented Jul 11, 2016

@smarterclayton I think option 1 is preferable for all the reasons you describe. Would the following be acceptable?

  • a new config option will be added to define the ingress IP range
    • e.g. IngressIPNetworkCIDRs
    • separate from ExternalIPNetworkCIDRs: multiple services can use the same external IP, but as proposed an ingress IP is not intended to be shared. Separate allocation pools will make it easier to support separate validation schemes for each address type.
    • for simplicity, only a single contiguous range will be supported initially
  • a user on bare metal will be able to define a service of type LoadBalancer to request that an ingress IP be allocated for their service
  • a controller will watch for changes to services, and when a service of type LoadBalancer is first seen it will:
    • allocate an IP from the configured range (the allocation strategy will be based on the one used for ports)
    • update the service status with the assigned IP, e.g. status: {loadBalancer: {ingress: [{ip: x.x.x.x}]}}
    • add the ingress IP as an externalIP on the service to signal kube-proxy that traffic sent to the ingress IP should be forwarded to the service IP (a minimal sketch of this reconcile step follows after this list)
  • admission control will be used to prevent a service from defining an externalIP from the ingress range unless it has been allocated to the service as recorded in the status field
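A minimal sketch of the controller step proposed above, using simplified local stand-ins rather than the real kapi.Service types; the allocate callback, field names, and error handling are placeholders for illustration, not the code that was merged.

```go
package main

import (
	"errors"
	"fmt"
)

// Service is a simplified stand-in for kapi.Service, keeping only the
// fields the reconcile step below cares about.
type Service struct {
	Name        string
	Type        string   // "LoadBalancer", "ClusterIP", ...
	ExternalIPs []string // spec.externalIPs
	IngressIPs  []string // status.loadBalancer.ingress[].ip
}

// reconcile allocates an ingress IP for a LoadBalancer service that does
// not have one yet, records it in status, and mirrors it into
// spec.externalIPs so kube-proxy will forward traffic for that IP to the
// service.
func reconcile(svc *Service, allocate func() (string, error)) error {
	if svc.Type != "LoadBalancer" || len(svc.IngressIPs) > 0 {
		return nil // nothing to do
	}
	ip, err := allocate()
	if err != nil {
		return err // e.g. the range is full; the key would be requeued
	}
	svc.IngressIPs = append(svc.IngressIPs, ip)
	svc.ExternalIPs = append(svc.ExternalIPs, ip)
	return nil
}

func main() {
	svc := &Service{Name: "frontend", Type: "LoadBalancer"}
	err := reconcile(svc, func() (string, error) { return "172.29.0.1", nil })
	fmt.Println(err, svc.IngressIPs, svc.ExternalIPs)

	// A non-LoadBalancer service is left untouched.
	clusterIP := &Service{Name: "db", Type: "ClusterIP"}
	_ = reconcile(clusterIP, func() (string, error) { return "", errors.New("unused") })
	fmt.Println(clusterIP.IngressIPs)
}
```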

Open questions:

  • does deallocation require discussion or can the scheme for ports be used as-is?
    • Reuse of an ingress IP too soon after it has been deallocated could result in DNS pointing to the wrong service until the TTL expires.
  • should users be allowed to specify which IP they would like to be assigned (via loadBalancerIP)? Your comments suggest that user-specified IPs complicate things unnecessarily.

cc: @eparis @knobunc @pravisankar

@knobunc (Contributor) commented Jul 12, 2016

One silly question... I was talking to @DirectXMan12 and he wondered if the admission controller could allocate the externalIP directly and change the service. We weren't sure if that violated any rules about what an admission controller should do. And we weren't sure whether, if we allocated the IP in the admission controller and then had to rewind, we would ever free the allocated IP.

Would that work? Or is it illegal? Thanks.

@marun (Contributor, author) commented Jul 12, 2016

@knobunc There's nothing technically preventing the approach you're suggesting, but one of @smarterclayton's comments (#9454 (comment)) suggests that it's preferable to do as little as possible in admission control.

@smarterclayton (Contributor):

In general, we should only implement an admission controller if there is no way to accomplish the particular goal using a regular controller. Our goal is to have zero admission controllers in the long run.

> the allocation strategy will be based on that used for ports

The controller only has to allocate from memory; it does not need to round-trip allocations through etcd, as long as a couple of invariants are met. One, the controller on startup has to read the current state, check the status of all currently allocated services, allocate them in a deterministic order that another controller would observe (creation date, probably), and then start handling normal allocations. We have code to handle "wait for the initial list" through DeltaFIFO. However, if you do that, then you only need an in-memory copy because you're working in a consistent order - another controller running at the same time would be making the same decisions as you.
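A small sketch of the startup invariant described in the preceding paragraph: replay existing allocations in a deterministic order before handling live events. The types are local stand-ins rather than the real cache/DeltaFIFO machinery, and the tie-break rule is an assumption.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// allocatedService is a stand-in for a service observed in the initial
// list, carrying only the fields the replay step needs.
type allocatedService struct {
	Name      string
	IngressIP string // already-assigned ingress IP, "" if none
	Created   time.Time
}

// replayInitialList re-applies existing allocations in creation-date
// order so that any controller doing the same replay ends up with the
// same in-memory allocator state before normal processing starts.
func replayInitialList(services []allocatedService, markAllocated func(ip string)) {
	sort.Slice(services, func(i, j int) bool {
		if services[i].Created.Equal(services[j].Created) {
			return services[i].Name < services[j].Name // deterministic tie-break
		}
		return services[i].Created.Before(services[j].Created)
	})
	for _, svc := range services {
		if svc.IngressIP != "" {
			markAllocated(svc.IngressIP)
		}
	}
}

func main() {
	used := map[string]bool{}
	replayInitialList([]allocatedService{
		{Name: "b", IngressIP: "172.29.0.2", Created: time.Unix(200, 0)},
		{Name: "a", IngressIP: "172.29.0.1", Created: time.Unix(100, 0)},
	}, func(ip string) { used[ip] = true })
	fmt.Println(used)
}
```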

> add the ingress IP as an externalIP on the service to signal kube-proxy that traffic from the ingress IP should be sent to the service IP

We should upgrade the externalIP admission controller as you note, and also add the thing that was added to the new endpoints admission controller that lets an effective cluster admin set any IP they want into external (bypass the change). We can do that as a second part, but I think it's important so that cluster admins can fix things when they go wrong.

> does deallocation require discussion or can the scheme for ports be used as-is?

Generally we allocate from the first open spot, so while I think this is a reasonable concern, I'd say we could punt on this for a future iteration (where we keep a list of recently recovered IPs).

> should users be allowed to specify which IP they would like to be assigned (via loadBalancerIP)?

What does the existing service load balancer allocator do for GCE in the event the IP is not available - fail, or allocate another one? The controller can check whether the IP is free and use it, and we can let admins mutate the service status if necessary. We could also gate this with a permission check (the same one used for externalIP) so that only admins can set this value.

It's important to remember that the allocation controller has to take into account when an admin allocates an IP directly, and that we need to ensure it isn't confused about whether this was a bug or intentional. Arguably, we could tell admins to create a second IP range that is outside of the allocatable range and have them set themselves - the allocation controller must ignore ingress IPs not in its range anyway.
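A rough sketch of the externalIP admission check discussed above: an externalIP inside the ingress range is only accepted when it matches an IP the controller already recorded in the service's load-balancer status (the admin bypass mentioned above is omitted). Function and parameter names here are illustrative, not the merged admission plugin.

```go
package main

import (
	"fmt"
	"net"
)

// validateExternalIPs rejects externalIPs that fall inside the ingress
// CIDR unless they match an IP already allocated to the service (as
// reflected in its load-balancer status).
func validateExternalIPs(ingressRange *net.IPNet, externalIPs, allocated []string) error {
	allowed := map[string]bool{}
	for _, ip := range allocated {
		allowed[ip] = true
	}
	for _, raw := range externalIPs {
		ip := net.ParseIP(raw)
		if ip == nil {
			return fmt.Errorf("invalid externalIP %q", raw)
		}
		if ingressRange.Contains(ip) && !allowed[raw] {
			return fmt.Errorf("externalIP %q is in the ingress range but was not allocated to this service", raw)
		}
	}
	return nil
}

func main() {
	_, ingressRange, _ := net.ParseCIDR("172.29.0.0/16") // hypothetical ingress CIDR

	// Rejected: in the ingress range but never allocated.
	fmt.Println(validateExternalIPs(ingressRange, []string{"172.29.0.5"}, nil))
	// Accepted: matches the IP recorded in status.
	fmt.Println(validateExternalIPs(ingressRange, []string{"172.29.0.5"}, []string{"172.29.0.5"}))
	// Accepted: outside the ingress range, so covered by the existing externalIP rules.
	fmt.Println(validateExternalIPs(ingressRange, []string{"10.1.2.3"}, nil))
}
```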

@marun (Contributor, author) commented Jul 21, 2016

@smarterclayton Allocating from memory in a consistent order requires that the ordering of watch events be reproducible from a list call. I don't see how that is possible. The order of watch events is determined by when a change is accepted by etcd, but creationTimestamp (the obvious sorting key) is set in code. The two orderings have no deterministic relationship.

Here's a scenario assuming 2 controllers allocating from memory:

  • controllerX is started, receives nothing from its initial list call, and receives watch events (create serviceA, create serviceB)
  • controllerY is started and receives (serviceA, serviceB) from its initial list call

It would be possible for both serviceA and serviceB to be assigned the same IP if the sorted order of controllerY's list result happened to be (serviceB, serviceA) (e.g. if the creationTimestamp of B were earlier than A's).

Given this, I don't think allocating from memory would be supportable for more than a single controller. It would be necessary to resort to persistent storage to coordinate multiple controllers. Which is more important - multiple controllers or avoiding persistent storage?

@smarterclayton (Contributor):

Sort the list call before it is returned by the controller. In general the server is already sorting, so I don't think order is incorrect. In your watch calls you need to order based on something deterministic (create date) to determine whether to replace. In your initial setup you need to make decisions based on that order.

@marun (Contributor, author) commented Jul 28, 2016

Latest update is rough (and testing is still in progress) but feedback is welcome.

@openshift-bot removed the needs-rebase label on Jul 28, 2016
@@ -401,6 +401,11 @@ type MasterNetworkConfig struct {
// CIDR will be rejected. Rejections will be applied first, then the IP checked against one of the allowed CIDRs. You
// should ensure this range does not overlap with your nodes, pods, or service CIDRs for security reasons.
ExternalIPNetworkCIDRs []string `json:"externalIPNetworkCIDRs"`
// IngressIPNetworkCIDRs controls the range to assign ingress ips from for services of type LoadBalancer on bare
// metal. If empty, ingress ips will not be assigned. It may contain a single CIDR that will be allocated from.
// For security reasons, you should ensure that this range does not overlap with the CIDRS reserved for external ips,
Review comment (Contributor):

Nit: CIDRs (not CIDRS)

@marun (Contributor, author) commented Aug 8, 2016

@smarterclayton I'm satisfied with the unit test coverage. The integration test is now validating multiple controllers. There is the possibility of thrashing if multiple controllers coincide with IP exhaustion, but I'm assuming that's a rare enough case not to spend further effort on it.

There's still the issue of hack/verify-gofmt.sh failing. I can't reproduce locally. Any thoughts?

@marun (Contributor, author) commented Aug 8, 2016

@smarterclayton More questions:

  • How do we safely default to a random private range? Would it be enough to validate that it doesn't clash with the configuration for external IPs, nodes, pods and services?
  • My reading of your response about when to send events suggests that sending an event every time an ErrFull error is seen from the allocator is not ideal, since that could occur repeatedly and in a loop. What's the alternative - rate-limit the sending of events? Send only the first one? Or is logging sufficient?
  • Is it enough to log automatic reallocation? How should a user best be notified when their service's ingress IP allocation changes?

@marun force-pushed the tcp-routes branch 2 times, most recently from ade3128 to 95adb2b on August 8, 2016 at 13:53
@marun changed the title from "WIP: Support network ingress on arbitrary ports" to "Support network ingress on arbitrary ports" on Aug 8, 2016
@smarterclayton (Contributor):

> How do we safely default to a random private range? Would it be enough to validate that it doesn't clash with the configuration for external IPs, nodes, pods and services?

Yes. We should pick something similar enough to the existing ranges that it's recognizable.

> My reading of your response about when to send events suggests that sending an event every time an ErrFull error is seen from the allocator is not ideal, since that could occur repeatedly and in a loop. What's the alternative - rate-limit the sending of events? Send only the first one? Or is logging sufficient?

I think an event on ErrFull is relevant, although when we are in that state, how much work do we do before we find out the state is full? I.e. are we making a bunch of calls to the API server and then discovering we're full? I'm thinking about resync mostly - I don't want to write once per resync item. However, it's not the end of the world to just send the event on ErrFull and then open an issue to deal with it later.

> Is it enough to log automatic reallocation? How should a user best be notified when their service's ingress IP allocation changes?

An event on automatic reallocation in the user's namespace is appropriate (since it should be a one time event).

@marun (Contributor, author) commented Aug 8, 2016

@smarterclayton If the configured IP range is small relative to the number of services requiring allocation, the queue could cycle endlessly and generate an ErrFull event for each service pending allocation on each cycle. At present the only thing that would limit the rate of events being sent is that adding back to the work queue is rate limited. Is that sufficient?

@smarterclayton (Contributor):

Yes, that's sufficient.
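A toy model of the behaviour agreed on above: allocation failures put the service key back on the queue with exponential backoff, which bounds how often ErrFull events can be emitted. This is not the client-go workqueue API, just an illustration; the base delay and cap are arbitrary assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var ErrFull = errors.New("range is full")

// requeueDelay grows exponentially with consecutive failures and is
// capped, mimicking a rate-limited work queue.
func requeueDelay(failures int) time.Duration {
	d := 5 * time.Millisecond << uint(failures)
	if d > time.Second {
		return time.Second
	}
	return d
}

func main() {
	allocate := func() error { return ErrFull } // the range is exhausted
	failures := 0
	for attempt := 1; attempt <= 5; attempt++ {
		if err := allocate(); err != nil {
			failures++
			// A real controller would emit an ErrFull event here; the
			// growing requeue delay limits how often that can happen.
			fmt.Printf("attempt %d: %v, requeueing after %v\n", attempt, err, requeueDelay(failures))
		}
	}
}
```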

@marun force-pushed the tcp-routes branch 4 times, most recently from f23e93c to a07ccce on August 8, 2016 at 17:18
// metal. If empty, ingress ips will not be assigned. It may contain a single CIDR that will be allocated from.
// For security reasons, you should ensure that this range does not overlap with the CIDRs reserved for external ips,
// nodes, pods, or services.
IngressIPNetworkCIDR string `json:"ingressIPNetworkCIDR"`
Review comment (Contributor):

If I specify CIDR "0.0.0.0/32" that means no ingress - do we check for that?

Review comment (Contributor, author):

Done. I've disallowed 0.0.0.0 (see validation/master.go).

Review comment (Contributor):

If we're going to default this to on and provide a value, we have to be able to distinguish between unset and set. You can do that either by allowing 0.0.0.0/32 to mean "set, but deliberately empty", or by turning this value into a pointer.

However, we can do that in a follow up.

Review comment (Contributor, author):

Given that this is a string, isn't "" enough to mean unset?

I'm working on a followup that sets a default and also validates against overlap between IngressIPNetworkCIDR, ExternalIPNetworkCIDRs, ClusterNetworkCIDR and ServiceNetworkCIDR.

Regarding the default, is this the right place to add it?

https://github.com/openshift/origin/blob/master/pkg/cmd/server/start/network_args.go

Review comment (Contributor):

If you set a default, you need a way to distinguish between default and unset, so that if an admin doesn't want to use the default, they can say "no default".

Review comment (Contributor):

No, defaulting happens in pkg/cmd/server/api/v1/conversions.go, and there are a few other steps to handle "default for existing clusters" vs. "default for new clusters".

@@ -157,6 +157,12 @@ func ValidateMasterConfig(config *api.MasterConfig, fldPath *field.Path) Validat
}
}
}
if len(config.NetworkConfig.IngressIPNetworkCIDR) > 0 {
cidr := config.NetworkConfig.IngressIPNetworkCIDR
if _, ipNet, err := net.ParseCIDR(cidr); err != nil || ipNet.IP.String() == "0.0.0.0" {
Review comment (Contributor):

What if someone specifies 0.0.0.0 in ipv6? Use IP.IsUnspecified() instead.
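A hedged sketch of what the suggested check looks like: IP.IsUnspecified() covers both 0.0.0.0 and the IPv6 unspecified address ::. The function name and error messages below are illustrative, not the exact validation code in the PR.

```go
package main

import (
	"fmt"
	"net"
)

// validateIngressIPNetworkCIDR mirrors the validation discussed above:
// an empty value means "do not assign ingress IPs", anything else must
// parse as a CIDR and must not be based on the unspecified address.
func validateIngressIPNetworkCIDR(cidr string) error {
	if len(cidr) == 0 {
		return nil
	}
	_, ipNet, err := net.ParseCIDR(cidr)
	if err != nil {
		return fmt.Errorf("%q is not a valid CIDR: %v", cidr, err)
	}
	if ipNet.IP.IsUnspecified() {
		return fmt.Errorf("%q must not be based on the unspecified address", cidr)
	}
	return nil
}

func main() {
	for _, cidr := range []string{"", "172.29.0.0/16", "0.0.0.0/32", "::/64", "bogus"} {
		fmt.Printf("%q -> %v\n", cidr, validateIngressIPNetworkCIDR(cidr))
	}
}
```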

Review comment (Contributor, author):

Done.

@smarterclayton (Contributor):

One last issue (the IP.IsUnspecified()) and I'll merge this.

Commit message:
This change adds a new controller that allocates ips to services of
type load balancer from a range configured by IngressIPNetworkCIDR.
This is intended to support ip-based traffic ingress on bare metal.
@openshift-bot (Contributor):

Evaluated for origin test up to fbca710

@smarterclayton (Contributor):

LGTM [merge]

@openshift-bot (Contributor):

continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/7653/)

@openshift-bot (Contributor) commented Aug 8, 2016

continuous-integration/openshift-jenkins/merge Waiting: You are in the build queue at position: 2

@openshift-bot merged commit 6c50f20 into openshift:master on Aug 9, 2016
@openshift-bot (Contributor):

Evaluated for origin merge up to fbca710
