Support network ingress on arbitrary ports #9454
Conversation
@jsafrane I'd love it if you could review this work.
[test]
// ServiceRef references the service (in any namespace) that is
// bound to this external ip address.
ServiceRef kapi.ObjectReference `json:"serviceRef"`
Should be in a status struct, just like PVs. The pattern here is exactly like PVC and PV, except we want to not make the mistake we made there (making the name a user would see confusing). How does a user request one of these IPs?
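For illustration only, a hedged Go sketch of the PV/PVC-style split being suggested: the controller-managed binding moves into a status struct, while user-settable fields stay in the spec. All names here are hypothetical, not the PR's final API, and ObjectReference stands in for kapi.ObjectReference from the snippet above.

```go
package api

// ObjectReference stands in for kapi.ObjectReference in the excerpt above.
type ObjectReference struct {
	Namespace string
	Name      string
}

// IngressIP sketches the spec/status split: the request lives in Spec, and
// the binding to a service is recorded by the controller in Status.
type IngressIP struct {
	Spec   IngressIPSpec
	Status IngressIPStatus
}

type IngressIPSpec struct {
	// IP is the address this object represents; the object name would
	// typically match it so an address can be claimed at most once.
	IP string
}

type IngressIPStatus struct {
	// ServiceRef is set by the controller once the address is bound.
	ServiceRef ObjectReference
}
```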
Can you write out in the description of the PR how you expect these to be used (the exact sequence of interactions that leads to one of these being allocated)?
My goal was to allow automatic allocation of an 'ingress ip' to a service, as follows:
I was not intending for allocation of an ingress ip to be performed directly (at least not initially). Either an ingress ip would be allocated for a service in response to an annotation, or a service would specify one or more externalIPs and the admission controller would allocate the corresponding ingress ips. In response to your comment about avoiding use of the admission controller to allocate an ingress ip for each externalIP defined on a service, how would a standard controller be able to prevent multiple services from allocating the same ingress ip? I imagine a controller could record allocation status on the service, but services don't have a status field. Given that the ingress object appears poised to absorb responsibility for ingress ip allocation, should one be added? By using an admission controller, I was hoping to avoid needing one, since clashing allocations would be prevented on creation and update. This solution is essentially a stop-gap while we wait for upstream ingress to evolve to meet the same requirement. Do you think this is a reasonable goal? Is there another way you would suggest accomplishing it?
Maybe a different question - if I had a Service of type LoadBalancer, and you were on bare metal, would I just expect it to get allocated one of these new-fangled magic external IPs? And then if I went to the cloud, would I expect to get a real AWS/GCE loadbalancer (because I'm kind of assuming that the trickery we do for external wouldn't work on AWS/GCE)? Or would I want to use this in GCE/AWS as well as a loadbalancer? It kind of sounds like this is bare metal L3/4 load balancing, which in theory services should do since we already have an IP.

Controllers are single threaded, so they keep their own state consistently and allocate (that's how the security allocator gives out unique UID blocks to each namespace that is created).
The security allocator uses namespaces as the authoritative source of "which ips are handed out" and then uses the "allocator" utility code to manage tracking that. That's our standard "I want to define in config a range, and then hand out values out of it". An admin can override the security allocator by setting the annotation, in which case the allocator just marks that value taken and moves on. I don't want to get too much into allocation yet - @pravisankar knows a lot about it and can help explain the code if necessary, I just want to understand the exact use case we have to solve so we can align it with existing APIs.
Ok, some concrete thoughts: We have a big choice here:
I think the first is simpler than the latter, because we can make some assumptions:
If we do the latter, we have to add more API objects and do a bit more work:
Adding API objects is always expensive - I think just from that measure if we can do "bare metal service loadbalancer" we get the end user feature without having to deal with some of the future compatibility, and we could change directions more easily later on.
Re: pool vs claim - all allocators have to synchronize on something to ensure no duplicates. For IP addresses (like ports) we generally would use a single API object that contains the range and a bitmap of everything that has been allocated. The controller has to perform two duties (can be one or two controllers) - verify the bitmap is accurate (no double allocations), and give out IPs as they are needed. The claim is necessary so the user has an anchor (I own this IP). Unlike PVC, nothing is special about the IP except its value, so we don't need one object per IP.

The binder has to go in a specific order - allocate from the pool (so that if a second controller starts it can't use that IP), then mark the user's IP as fulfilled (by setting status). It also has to handle rollback if the update failed for any reason. The "repair" responsibility has to periodically run and verify that nothing funny has happened - usually that happens on controller start. The problem with repair is that it can't resolve any problems - an admin has to step in and fix them (since it could be data loss). That's the strongest advantage to the simple model above - the user doesn't get special IPs, so the controller can be aggressive about fixing problems it finds.
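To make the bind ordering concrete, here is a minimal Go sketch of the sequence described above. `ipPool` and `serviceBinder` are hypothetical interfaces standing in for the real pool object and service client; nothing here is the PR's actual code.

```go
package allocator

import "net"

// ipPool and serviceBinder are hypothetical stand-ins for the shared pool
// object and the client that updates service status.
type ipPool interface {
	AllocateNext() (net.IP, error) // returns an "ErrFull"-style error when exhausted
	Release(net.IP) error
}

type serviceBinder interface {
	SetIngressIP(serviceKey string, ip net.IP) error // records the IP in service status
}

// bindIngressIP allocates from the shared pool first, so a second controller
// cannot hand out the same IP, then records the result on the service,
// rolling the pool back if the status update fails. A periodic repair pass
// is still needed to catch anything this ordering misses.
func bindIngressIP(pool ipPool, binder serviceBinder, serviceKey string) (net.IP, error) {
	ip, err := pool.AllocateNext()
	if err != nil {
		return nil, err
	}
	if err := binder.SetIngressIP(serviceKey, ip); err != nil {
		// Roll back so the address is not leaked until the next repair run.
		_ = pool.Release(ip)
		return nil, err
	}
	return ip, nil
}
```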
@smarterclayton I think option 1 is preferable for all the reasons you describe. Would the following be acceptable?
Open questions:
One silly question... I was talking to @DirectXMan12 and he wondered if the admission controller could do the allocation of the externalIP directly and change the service. We weren't sure if that violated any rules about what an admission controller should do. And we weren't sure, if we allocated the IP in the admission controller and then had to rewind, whether we would ever free the allocated IP. Would that work? Or is it illegal? Thanks.
@knobunc There's nothing technically preventing the approach you're suggesting, but one of @smarterclayton's comments (#9454 (comment)) suggests that it's preferable to do as little as possible in admission control.
In general, we should only implement an admission controller if there is no way to accomplish the particular goal using a regular controller. Our goal is to have zero admission controllers in the long run.
The controller only has to allocate from memory, it does not need to allocate to etcd and back, as long as a couple of invariants are met. One, the controller on startup has to read the current state, check the status of all currently allocated services, allocate them in a deterministic order that another controller would observe (creation date, probably), and then start handling normal allocations. We have code to handle "wait for the initial list" through DeltaFIFO. However, if you do that, then you only need an in memory copy because you're working in a consistent order - another controller at the same time would be making the same decisions as you.
We should upgrade the externalIP admission controller as you note, and also add the thing that was added to the new endpoints admission controller that lets an effective cluster admin set any IP they want into external (bypass the change). We can do that as a second part, but I think it's important so that cluster admins can fix things when they go wrong.
Generally we allocate from the first open spot, so while I think this is a reasonable concern, I'd say we could punt on this for a future iteration (where we keep a list of recently recovered IPs).
What does the existing service load balancer allocator do for GCE in the event the ip is not available? Fail, or allocate another one? The controller can do the check to see whether the IP is free and use it, and we can let admins mutate the service status if necessary. We could also gate this with a permission check (the same controller as externalIP) where only admins can set this value. It's important to remember that the allocation controller has to take into account when an admin allocates an IP directly, and that we need to ensure it isn't confused about whether this was a bug or intentional. Arguably, we could tell admins to create a second IP range that is outside of the allocatable range and have them set it themselves - the allocation controller must ignore ingress IPs not in its range anyway.
@smarterclayton Allocating from memory in a consistent order requires that the ordering of watch events be reproducible from a list call. I don't see how that is possible. The order of watch events is determined by when a change is accepted by etcd, but creationTimestamp (the obvious sorting key) is set in code. The two orderings have no deterministic relationship. Here's a scenario assuming 2 controllers allocating from memory:
It would be possible for both serviceA and serviceB to be assigned the same ip if the sorted order of controllerY's list result happened to be (serviceB, serviceA) (e.g. creationTimestamp of B was less than A). Given this, I don't think allocating from memory would be supportable for more than a single controller. It would be necessary to resort to persistent storage to coordinate multiple controllers. Which is more important - multiple controllers or avoiding persistent storage?
Sort the list call before it is returned by the controller. In general the server is already sorting, so I don't think order is incorrect. In your watch calls you need to order based on something deterministic (create date) to determine whether to replace. In your initial setup you need to make decisions based on that order.
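A rough sketch of what "replay the initial list in a deterministic order" could look like, using a stand-in struct rather than the real Service type; the tiebreak on name is an assumption to keep the ordering total when timestamps collide.

```go
package allocator

import (
	"sort"
	"time"
)

// existingService is a stand-in for the Service fields that matter here.
type existingService struct {
	Name              string
	CreationTimestamp time.Time
	IngressIP         string // already-allocated IP recorded on the service, if any
}

// replayAllocations rebuilds the in-memory allocation map from the initial
// list, processing services in a deterministic order (creation time, then
// name) so that two controllers starting from the same list reach the same
// conclusions about which IPs are taken.
func replayAllocations(services []existingService) map[string]string {
	sort.Slice(services, func(i, j int) bool {
		if !services[i].CreationTimestamp.Equal(services[j].CreationTimestamp) {
			return services[i].CreationTimestamp.Before(services[j].CreationTimestamp)
		}
		return services[i].Name < services[j].Name
	})
	allocated := map[string]string{} // ip -> service name
	for _, svc := range services {
		if svc.IngressIP == "" {
			continue
		}
		if _, taken := allocated[svc.IngressIP]; taken {
			continue // duplicate allocation; left for a repair pass or an admin
		}
		allocated[svc.IngressIP] = svc.Name
	}
	return allocated
}
```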
Latest update is rough (and testing is still in progress) but feedback is welcome.
@@ -401,6 +401,11 @@ type MasterNetworkConfig struct {
// CIDR will be rejected. Rejections will be applied first, then the IP checked against one of the allowed CIDRs. You
// should ensure this range does not overlap with your nodes, pods, or service CIDRs for security reasons.
ExternalIPNetworkCIDRs []string `json:"externalIPNetworkCIDRs"`
// IngressIPNetworkCIDRs controls the range to assign ingress ips from for services of type LoadBalancer on bare
// metal. If empty, ingress ips will not be assigned. It may contain a single CIDR that will be allocated from.
// For security reasons, you should ensure that this range does not overlap with the CIDRS reserved for external ips,
Nit: CIDRs
@smarterclayton I'm satisfied with the unit test coverage. The integration test is now validating multiple controllers. There is the possibility of thrashing if multiple controllers coincide with ip exhaustion, but I'm assuming that's a rare enough case not to spend further effort on it.
@smarterclayton More questions:
Yes. We should pick something similar enough to the existing ranges that it's recognizable.
I think an event on ErrFull is relevant, although when we are in that state, how much work do we do before we find out the state is full? I.e. are we making a bunch of calls to the API server and then discovering we're full? I'm thinking about resync mostly - I don't want to write once per resync item. However, it's not the end of the world to just send the event on ErrFull and then open an issue to deal with it later.
An event on automatic reallocation in the user's namespace is appropriate (since it should be a one time event).
@smarterclayton If the configured ip range is small relative to the number of services requiring allocation, the queue could cycle endlessly and generate an ErrFull event for each service pending allocation on each cycle. At present the only thing that would limit the rate of events being sent is that adding back to the work queue is rate limited. Is that sufficient?
Yes that's sufficient.
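For context, a sketch of the rate-limited requeue pattern being discussed, written against the upstream workqueue package (modern client-go import path; the PR-era code lived under Kubernetes' own util packages). The `allocate` and `emitEvent` functions are placeholders for the controller's real allocation and event-recording logic.

```go
package allocator

import (
	"errors"

	"k8s.io/client-go/util/workqueue"
)

// errFull stands in for the allocator's "range exhausted" error.
var errFull = errors.New("ingress IP range is full")

// processNextItem pulls one service key off a rate-limited queue, tries to
// allocate an IP for it, and requeues with backoff when the range is full.
// The rate limiter is what bounds how often an ErrFull event can be emitted
// for any one service.
func processNextItem(queue workqueue.RateLimitingInterface, allocate func(key string) error, emitEvent func(key, reason, msg string)) bool {
	item, quit := queue.Get()
	if quit {
		return false
	}
	key := item.(string)
	defer queue.Done(key)

	err := allocate(key)
	switch {
	case err == nil:
		queue.Forget(key) // clear backoff state on success
	case errors.Is(err, errFull):
		emitEvent(key, "IngressIPRangeFull", err.Error())
		queue.AddRateLimited(key) // retry with backoff in case IPs free up
	default:
		queue.AddRateLimited(key)
	}
	return true
}
```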
// metal. If empty, ingress ips will not be assigned. It may contain a single CIDR that will be allocated from.
// For security reasons, you should ensure that this range does not overlap with the CIDRs reserved for external ips,
// nodes, pods, or services.
IngressIPNetworkCIDR string `json:"ingressIPNetworkCIDR"`
If I specify CIDR "0.0.0.0/32" that means no ingress - do we check for that?
Done. I've disallowed 0.0.0.0 (see validation/master.go).
If we're going to default this to on and provide a value we have to be able to distinguish between unset and set. You can do that either by allowing 0.0.0.0/32 to mean "set, but deliberately empty", or turn this value into a pointer.
However, we can do that in a follow up.
Given that this is a string, isn't "" enough to mean unset?
I'm working on a followup that sets a default and also validates against overlap between IngressIPNetworkCIDR, ExternalIPNetworkCIDRs, ClusterNetworkCIDR and ServiceNetworkCIDR.
Regarding the default, is this the right place to add it?
https://github.com/openshift/origin/blob/master/pkg/cmd/server/start/network_args.go
If you set a default, you need a way to distinguish between default, and unset, so that if an admin doesn't want to use the default, they can say "no default".
No, defaulting happens in pkg/cmd/server/api/v1/conversions.go, and there are a few other steps to handle "default for existing clusters" vs. "default for new clusters"
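As a sketch of the distinction raised above, a pointer-typed field can separate "unset, apply the default" from "deliberately empty". The field shape and the default value shown are illustrative assumptions, not the actual conversions.go change.

```go
package config

// defaultIngressIPNetworkCIDR is a hypothetical default used only for
// illustration; the real default is chosen elsewhere.
const defaultIngressIPNetworkCIDR = "172.29.0.0/16"

// networkConfig sketches the pointer approach: nil means "unset, apply the
// default", while a pointer to "" means "deliberately disabled".
type networkConfig struct {
	IngressIPNetworkCIDR *string
}

func applyDefaults(c *networkConfig) {
	if c.IngressIPNetworkCIDR == nil {
		def := defaultIngressIPNetworkCIDR
		c.IngressIPNetworkCIDR = &def
	}
	// A non-nil empty string is left alone: the admin opted out of the default.
}
```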
@@ -157,6 +157,12 @@ func ValidateMasterConfig(config *api.MasterConfig, fldPath *field.Path) Validat
}
}
}
if len(config.NetworkConfig.IngressIPNetworkCIDR) > 0 {
cidr := config.NetworkConfig.IngressIPNetworkCIDR
if _, ipNet, err := net.ParseCIDR(cidr); err != nil || ipNet.IP == "0.0.0.0" {
What if someone specifies 0.0.0.0 in ipv6? Use IP.IsUnspecified() instead.
Done.
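For reference, a validation check along the lines the review asks for might look like this; it is a sketch, not the PR's exact code, and the function name and error messages are made up for illustration.

```go
package validation

import (
	"fmt"
	"net"
)

// validateIngressIPNetworkCIDR rejects CIDRs whose network address is the
// unspecified address in either IPv4 (0.0.0.0) or IPv6 (::) form.
func validateIngressIPNetworkCIDR(cidr string) error {
	if len(cidr) == 0 {
		return nil // unset: ingress IP allocation is disabled
	}
	_, ipNet, err := net.ParseCIDR(cidr)
	if err != nil {
		return fmt.Errorf("%q is not a valid CIDR: %v", cidr, err)
	}
	if ipNet.IP.IsUnspecified() {
		return fmt.Errorf("%q may not use the unspecified address", cidr)
	}
	return nil
}
```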
One last issue (the IP.IsUnspecified()) and I'll merge this.
This change adds a new controller that allocates IPs to services of type LoadBalancer from a range configured by IngressIPNetworkCIDR. This is intended to support IP-based traffic ingress on bare metal.
Evaluated for origin test up to fbca710
LGTM [merge]
continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/7653/)
continuous-integration/openshift-jenkins/merge Waiting: You are in the build queue at position: 2
Evaluated for origin merge up to fbca710
Add a new ExternalIP type to enable tracking of allocated IPs and make it possible to have an external address be allocated to a service automatically from the range indicated by ExternalIPNetworkCIDRs. The type is global in scope, with its name being an IP address, to ensure an IP is allocated at most once. For simplicity, an IP will be dedicated to a single service rather than having the port space of an IP shared between many services.
TODO:

- Update the admission controller (Done)
- Return an error if a requested external ip is already in use
- Create a new ExternalIP object for each requested external ip

cc: @openshift/networking @smarterclayton