Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve ipfailover configurability #11644

Merged
merged 1 commit into from
Dec 21, 2016
Merged

Conversation

pecameron
Copy link
Contributor

@pecameron pecameron commented Oct 28, 2016

Openshift 3.5 feature.

Add options to 'oadm ipfailover' to configure the check script and
notify scripts and also control the period the check script runs.

Keepalived periodically checks whether the application is running properly.
In the default case the test is a simple verification that something is listening
on the watch port. This PR permits the user to supply an additional check script
that is run in the ipfailover container context to verify that the application is
operating properly. For example, a web server can be tested by accessing the
watch port and verifying the response.

Whenever a node changes state to MASTER, BACKUP, or FAULT a notify
script can be called. This script has 3 parameters filled in by keepalived:
$1 - "GROUP"|"INSTANCE"
$2 - name of group or instance
$3 - target state of transition ("MASTER"|"BACKUP"|"FAULT")

--check-script="check_script"
The check script is a script in the keepalived container that verifies
the service is running properly. The script must return 0 for OK and 1 for FAIL.
These checks are in addition to verifying that the watch port is listening.

--notify-script="notify_script"
The notify script is a script in the keepalived container that is
called whenever the keepalived state transitions to (MASTER|BACKUP|FAULT)

--check-interval=
The check script is run every seconds. Default is 2.

Note: the scripts name is the full path to the script.

Fixes bug 1362163
https://trello.com/c/228zu7Br/267-5-improve-the-configurability-of-the-ipfailover-container-operations

Signed-off-by: Phil Cameron [email protected]

@smarterclayton
Copy link
Contributor

Is this a check script for the other container, or for the processes in the keepalived container itself?

@pecameron
Copy link
Contributor Author

@knobunc Here is the first draft. It covers the wants in the trello card. PTAL

@smarterclayton
Copy link
Contributor

The descriptions in the PR aren't enough for me to understand what the scripts are supposed to do. More detail in the doc.

Copy link
Contributor

@stevekuznetsov stevekuznetsov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is too late for this, but generating a script with a script ... there's got to be a better way.

# When the DC supplies an (non null) iptables chain
# (OPENSHIFT_HA_IPTABLES_CHAIN) make sure the rule to pass keepalived
# multicast (224.0.0.18) traffic is in the table.
HA_IPTABLES_CHAIN=${OPENSHIFT_HA_IPTABLES_CHAIN:-""}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On this line and all the others:

HA_IPTABLES_CHAIN="${OPENSHIFT_HA_IPTABLES_CHAIN:-""}"
                  ^                                  ^

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stevekuznetsov I shifted OPENSHIFT_HA_IPTABLES_CHAIN to here to be more consistent.
I didn't create this file and am just adding a few fields. I am not sure how far to go with this.

This is work in progress. Thanks for your comments and I will modify the code as suggested. I am not sure how far I should go beyond adding this feature. Three is discussion on converting this to a go program and using the go template to generate the file. However, that is future work (if we decide to do it.)

The scripting here generates the config file for keepalived.

@@ -10,7 +10,7 @@ source "$(dirname "${BASH_SOURCE[0]}")/utils.sh"

# Constants.
readonly CHECK_SCRIPT_NAME="chk_${HA_CONFIG_NAME//-/_}"
readonly CHECK_INTERVAL_SECS=2
readonly CHECK_INTERVAL_SECS=${HA_CHECK_INTERVAL}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

readonly CHECK_INTERVAL_SECS="${HA_CHECK_INTERVAL}"
                             ^                    ^

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -59,6 +59,11 @@ function generate_script_config() {
if [ "$port" = "0" ]; then
echo " script \"true\""
else
if [[ -n $HA_CHECK_SCRIPT ]]; then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if [[ -n "${HA_CHECK_SCRIPT}" ]]; then
         ^ ^               ^^

@@ -59,6 +59,11 @@ function generate_script_config() {
if [ "$port" = "0" ]; then
echo " script \"true\""
else
if [[ -n $HA_CHECK_SCRIPT ]]; then
echo " if [[ -f $HA_CHECK_SCRIPT ]]; then"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you generating a script here? This is pretty insane. This expansion of $HA_CHECK_SCRIPT will happen at echo time, not at runtime.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stevekuznetsov This is creating a config file for keepalived. One of the parameters to keepalived is the name of a check script that keepalived should call. The expansion must happen when the config file is created.

@@ -1,5 +1,6 @@
#!/bin/bash -e

# set -x
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed.

@pecameron
Copy link
Contributor Author

@smarterclayton - I added more to the description. See the trelo card and bug
https://trello.com/c/228zu7Br/267-5-improve-the-configurability-of-the-ipfailover-container-operations

@pecameron
Copy link
Contributor Author

@stevekuznetsov Is there more information you need here? Are there any further changes needed?

@@ -65,7 +89,7 @@ ADMIN_EMAILS=(${OPENSHIFT_HA_ADMIN_EMAILS:-"root@localhost"})
EMAIL_FROM="[email protected]"

# IP address of the SMTP server.
SMTP_SERVER=${OPENSHIFT_HA_SMTP_SERVER:-"127.0.0.1"}
SMTP_SERVER="${OPENSHIFT_HA_SMTP_SERVER:-"127.0.0.A1"}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stevekuznetsov Oops, I missed that. Its now fixed.

Copy link
Contributor

@stevekuznetsov stevekuznetsov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One typo, the rest of the Bash looks ... ok ... to me. I think it would be reasonable to open a card or an issue to investigate if we can achieve the same features without generating a Bash script from within a Bash script.

@pecameron
Copy link
Contributor Author

@stevekuznetsov You have "Changes requested". What additional changes would you like to have made?

@stevekuznetsov
Copy link
Contributor

@pecameron There is one more comment about an IP address #11644 (review)

I would also like to see if we can open an issue to investigate an alternative approach to this than generating a script from within a script.

@knobunc
Copy link
Contributor

knobunc commented Nov 18, 2016

@stevekuznetsov Yeah, @pecameron already talked to me about the possibility of changing over to a template file, or something less horrific. We opened the card https://trello.com/c/9w6pIkIU

@pecameron
Copy link
Contributor Author

@stevekuznetsov @knobunc This code is generating the keepalived config file. keepalived reads this file to configure itself. It may look like a script but it is the config file. Even if we move to a template, the resulting file will still look the same as this file. It will look like a script and it will still be the same config file.

OK, I am missing something here. Do we want to change keepalived to accept its configuration in a different format?

@knobunc
Copy link
Contributor

knobunc commented Nov 18, 2016

@pecameron Longer-term, we probably want to move it to go with a template. But that will be done as a feature later.

@pecameron
Copy link
Contributor Author

@stevekuznetsov I made the changes. PTAL

@stevekuznetsov
Copy link
Contributor

I do admit I'm not well versed on this but diffs like this are clearly writing Bash code into a file somewhere. Whether that's a Bash script to be run later or a config file that allows for Bash inline... we're generating executable Bash code dynamically. I feel like that is a very fragile approach, and, yes, it should be something that we think about later, it does not block this PR.

@pecameron
Copy link
Contributor Author

@knobunc There is 1 approved on this, Is there something to do before you can approve it as well? Is this ready to merge?

@knobunc
Copy link
Contributor

knobunc commented Nov 18, 2016

@pecameron it is ready to merge. I'm giving master until Monday, then I have a few to put in.

Copy link
Contributor

@knobunc knobunc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@knobunc
Copy link
Contributor

knobunc commented Nov 21, 2016

[test]

@knobunc
Copy link
Contributor

knobunc commented Nov 22, 2016

Failed:

In suite "github.com/openshift/origin/test/cmd/router", test case "test/cmd/router.sh:75: executing 'oadm ipfailover --virtual-ips=\"1.2.3.4\" --check-interval=1177 --dry-run -o yaml' expecting success and text 'value" failed:
=== BEGIN TEST CASE ===
test/cmd/router.sh:75: executing 'oadm ipfailover --virtual-ips="1.2.3.4" --check-interval=1177 --dry-run -o yaml' expecting success and text 'value: 1177'
FAILURE after 0.554s: test/cmd/router.sh:75: executing 'oadm ipfailover --virtual-ips="1.2.3.4" --check-interval=1177 --dry-run -o yaml' expecting success and text 'value: 1177': the output content test failed
Standard output from the command:

https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin_check/8687/consoleText

@@ -72,6 +72,9 @@ os::cmd::expect_success_and_text 'oadm ipfailover --virtual-ips="1.2.3.4" --dry-
os::cmd::expect_success_and_text 'oadm ipfailover --virtual-ips="1.2.3.4" --dry-run -o name' 'deploymentconfig/ipfailover'
os::cmd::expect_success_and_text 'oadm ipfailover --virtual-ips="1.2.3.4" --dry-run -o yaml' '1.2.3.4'
os::cmd::expect_success_and_text 'oadm ipfailover --virtual-ips="1.2.3.4" --iptables-chain="MY_CHAIN" --dry-run -o yaml' 'value: MY_CHAIN'
os::cmd::expect_success_and_text 'oadm ipfailover --virtual-ips="1.2.3.4" --check-interval=1177 --dry-run -o yaml' 'value: 1177'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stronger checks here, if possible, would look like:

$ oadm ipfailver .... # create an API object
$ oc get <object> -o jsonpath='{.path.to.check-interval}' # expecting 1177

Not sure if this is possible or how the ipfailover configuration is serialized in etcd, but that sort of check is more declarative and less fragile than grepping through some YAML.

@pecameron
Copy link
Contributor Author

pecameron commented Nov 22, 2016

I don't get this. The yaml does have 1177 and it didn't match. Do you
know what is wrong with this? Could it be the " " around the 1177?

FAILURE after 0.554s: test/cmd/router.sh:75: executing 'oadm ipfailover 
--virtual-ips="1.2.3.4" --check-interval=1177 --dry-run -o yaml' 
expecting success and text 'value: 1177': the output content test failed
Standard output from the command:
apiVersion: v1
items:
- apiVersion: v1
   kind: ServiceAccount
   metadata:
     creationTimestamp: null
     name: ipfailover
- apiVersion: v1
   kind: DeploymentConfig
   metadata:
     creationTimestamp: null
     labels:
       ipfailover: ipfailover
     name: ipfailover
   spec:
     replicas: 1
     strategy:
       resources: {}
       type: Recreate
     template:
       metadata:
         creationTimestamp: null
         labels:
           ipfailover: ipfailover
       spec:
         containers:
         - env:
           - name: OPENSHIFT_HA_CHECK_INTERVAL
             value: "1177" <<<<<<<<<<--------- It is here
           - name: OPENSHIFT_HA_CHECK_SCRIPT

@stevekuznetsov
Copy link
Contributor

Yes, it must be exactly that -- you are looking for the literal text value: 1177 in the YAML, but it does not exist. This is why I made the comment that I did about how to make this test more robust.

@stevekuznetsov
Copy link
Contributor

In addition, you're not checking that it's the check interval value that is set to 1177, nor that it isn't set to 11775, or that not-value is set to 1177... etc

@pecameron
Copy link
Contributor Author

Steve,
How do I code the line in the test?
oadm ipfailover --virtual-ips="1.2.3.4" --check-interval=1177 --dry-run
-o yaml' 'value: 1177'
would
oadm ipfailover --virtual-ips="1.2.3.4" --check-interval=1177 --dry-run
-o yaml' 'value: "1177"'

What do you think?

phil

=== BEGIN TEST CASE ===
test/cmd/router.sh:75: executing 'oadm ipfailover
--virtual-ips="1.2.3.4" --check-interval=1177 --dry-run -o yaml'
expecting success and text 'value: 1177'
FAILURE after 0.554s: test/cmd/router.sh:75: executing 'oadm ipfailover
--virtual-ips="1.2.3.4" --check-interval=1177 --dry-run -o yaml'
expecting success and text 'value: 1177': the output content test failed
Standard output from the command:
apiVersion: v1
items:
...

  • apiVersion: v1
    kind: DeploymentConfig
    ...
    - env:
    - name: OPENSHIFT_HA_CHECK_INTERVAL
    value: "1177" <<<<<---------
    - name: OPENSHIFT_HA_CHECK_SCRIPT
    - name: OPENSHIFT_HA_CONFIG_NAME

On 11/22/2016 08:56 AM, Steve Kuznetsov wrote:

@stevekuznetsov commented on this pull request.


In test/cmd/router.sh
#11644 (review):

@@ -72,6 +72,9 @@ os::cmd::expect_success_and_text 'oadm ipfailover --virtual-ips="1.2.3.4" --dry-
os::cmd::expect_success_and_text 'oadm ipfailover --virtual-ips="1.2.3.4" --dry-run -o name' 'deploymentconfig/ipfailover'
os::cmd::expect_success_and_text 'oadm ipfailover --virtual-ips="1.2.3.4" --dry-run -o yaml' '1.2.3.4'
os::cmd::expect_success_and_text 'oadm ipfailover --virtual-ips="1.2.3.4" --iptables-chain="MY_CHAIN" --dry-run -o yaml' 'value: MY_CHAIN'
+os::cmd::expect_success_and_text 'oadm ipfailover --virtual-ips="1.2.3.4" --check-interval=1177 --dry-run -o yaml' 'value: 1177'

Stronger checks here, if possible, would look like:

$ oadm ipfailver ....# create an API object
$ oc get -o jsonpath='{.path.to.check-interval}' # expecting 1177

Not sure if this is possible or how the ipfailover configuration is
serialized in etcd, but that sort of check is more declarative and
less fragile than |grep|ping through some YAML.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#11644 (review),
or mute the thread
https://github.com/notifications/unsubscribe-auth/ANUgep7ZXyhAFgYI7Q0qQ7zzqArU0cBMks5rAvR0gaJpZM4Kjh5W.

@stevekuznetsov
Copy link
Contributor

Yes, the following would work:

os::cmd::expect_success_and_text "..." 'value: "1177"'

But, for all of the reasons above I still think it is much better to not do --dry-run but actually serialize the object to etcd and use a jsonpath to get the value you want, then check that and that only. See an example in routes.sh.

@knobunc
Copy link
Contributor

knobunc commented Nov 30, 2016

[test]

@knobunc
Copy link
Contributor

knobunc commented Nov 30, 2016

[test] yum failed

@pecameron
Copy link
Contributor Author

@knobunc Still need help with this...
Thanks...

@knobunc
Copy link
Contributor

knobunc commented Dec 8, 2016

# cover github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/client/restclient
[WARNING] `go test` had the following output to stderr:
cover: internal error: block 268 overlaps block 270
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/client/restclient/request.go:#27803,#27927 /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/client/restclient/request.go:#27848,#27867

https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin_check/8905/consoleText

@smarterclayton
Copy link
Contributor

That's a spurious error. Doesn't actually cause failure.

@pecameron
Copy link
Contributor Author

pecameron commented Dec 9, 2016 via email

@knobunc
Copy link
Contributor

knobunc commented Dec 9, 2016

[test]

@pecameron
Copy link
Contributor Author

[test]

@knobunc
Copy link
Contributor

knobunc commented Dec 13, 2016

/go/src/github.com/openshift/origin/test/cmd/router.sh: line 82: unexpected EOF while looking for matching ''`

https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin_check/9400/consoleText

Add options to 'oadm ipfailover' to configure the check script and
notify scripts and also control the period the check script runs.

Keepalived periodically checks whether the application is running
properly.  In the default case the test is a simple verification
that something is listening on the watch port. This PR permits the
user to supply an additional check script that is run in the
ipfailover container context to verify that the application is
operating properly. For example, a web server can be tested by
accessing the watch port and verifying the response.

Whenever a node changes state to MASTER, BACKUP, or FAULT a notify
script can be called. This script has 3 parameters filled in by
keepalived:
$1 - "GROUP"|"INSTANCE"
$2 - name of group or instance
$3 - target state of transition ("MASTER"|"BACKUP"|"FAULT")

--check-script="check_script"
The check script is a script in the keepalived container that verifies
the service is running properly. These checks are in addition to verifying
that the watch port is listening.

--notify-script="notify_script"
The notify script is a script in the keepalived container that is
called whenever the keepalived state transitions to (MASTER|BACKUP|FAULT)

--check-interval=<n>
The check script is run every <n> seconds. Default is 2.

Note: the scripts name is the full path to the script.

Resolves 1362163

Signed-off-by: Phil Cameron <[email protected]>
@openshift-bot
Copy link
Contributor

Evaluated for origin test up to bce0038

@openshift-bot
Copy link
Contributor

continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/12335/) (Base Commit: fb0bbde)

@pecameron
Copy link
Contributor Author

@knobunc are we ready to merge this?

@knobunc
Copy link
Contributor

knobunc commented Dec 15, 2016

[merge]

1 similar comment
@knobunc
Copy link
Contributor

knobunc commented Dec 16, 2016

[merge]

@pecameron
Copy link
Contributor Author

@knobunc This failed again, passed once before you did the merge and failed after. Maybe another merge will work...

@pecameron
Copy link
Contributor Author

@knobunc https://bugzilla.redhat.com/show_bug.cgi?id=1362163 comment #6 is for the doc changes. The code is not yet in...

@knobunc
Copy link
Contributor

knobunc commented Dec 20, 2016

@knobunc
Copy link
Contributor

knobunc commented Dec 21, 2016

@openshift-bot
Copy link
Contributor

Evaluated for origin merge up to bce0038

@openshift-bot
Copy link
Contributor

openshift-bot commented Dec 21, 2016

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/12567/) (Base Commit: 05e4f0e) (Image: devenv-rhel7_5584)

@openshift-bot openshift-bot merged commit 3164ce5 into openshift:master Dec 21, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants