-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Diagnostics] Fix AnalyzeLogs to provide more clear debug message #18654
Conversation
@@ -146,7 +146,7 @@ func resolveServerIP(serverUrl string, fn dnsResolver) ([]string, error) { | |||
|
|||
func searchNodesForIP(nodes []kapi.Node, ips []string) types.DiagnosticResult { | |||
r := types.NewDiagnosticResult(MasterNodeName) | |||
r.Debug("DClu3005", fmt.Sprintf("Seaching for a node with master IP: %s", ips)) | |||
r.Debug("DClu3005", fmt.Sprintf("Searching for a node with master IP: %s", ips)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This debug message currently shows:
debug: map[origin-node:{origin-node false false false 0} openvswitch:{openvswitch true false true 0} iptables:{iptables true true true 0} docker:{docker true true true 0} etcd:{etcd true true true 0} kubernetes:{kubernetes false false false 0} origin-master:{origin-master false false false 0} atomic-openshift-master:{atomic-openshift-master false false false 0} atomic-openshift-node:{atomic-openshift-node true true true 0}]
Actually I wanted to delete it, but if I deleted it, the number DClu3005
will be absent in master_node.go
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no special reason to keep an identifier.
But, I think you are talking about the debug statement in locate_units.go
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you need to keep the message just to preserve DClu3005
tag. If the tag is present then it should point to unique error/debug but if the error/debug is not relevant any more then we should delete the error/debug. Don't keep stale code just to maintain these tags (we have done the same thing for networkcheck and aggregatedlogging diagnostics).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can always pinpoint the error/debug/warning without these tags, we just need ose release (oc version
) and filepath:line-number which are available. I think these tags are more confusing than useful. May be we should get rid of them?
@sosiouxme
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pravisankar The tags certainly serve no purpose for debug/info level at this time. The purpose in warn/error is to have something stable and semantically meaningless that's safe to grep in automation, as there is no machine-readable output at this time. The tags were originally intended for machine-readable output, which is a requested and long-neglected feature.
I doubt anyone's using the tags for that purpose now. I suppose overall it may be better to get rid of tags entirely. It would not be hard to retrofit them when fleshing out the machine-readable feature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry, I made a comment under wrong line. As @sosiouxme, I meant logger.Debug("DS1003", fmt.Sprintf("%v", systemdUnits))
. (And I removed it now.)
The message DClu3005
is really necessary until this bug was fixed https://bugzilla.redhat.com/show_bug.cgi?id=1368865
cc @sosiouxme |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nit otherwise LGTM
@@ -146,7 +146,7 @@ func resolveServerIP(serverUrl string, fn dnsResolver) ([]string, error) { | |||
|
|||
func searchNodesForIP(nodes []kapi.Node, ips []string) types.DiagnosticResult { | |||
r := types.NewDiagnosticResult(MasterNodeName) | |||
r.Debug("DClu3005", fmt.Sprintf("Seaching for a node with master IP: %s", ips)) | |||
r.Debug("DClu3005", fmt.Sprintf("Searching for a node with master IP: %s", ips)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you need to keep the message just to preserve DClu3005
tag. If the tag is present then it should point to unique error/debug but if the error/debug is not relevant any more then we should delete the error/debug. Don't keep stale code just to maintain these tags (we have done the same thing for networkcheck and aggregatedlogging diagnostics).
@@ -146,7 +146,7 @@ func resolveServerIP(serverUrl string, fn dnsResolver) ([]string, error) { | |||
|
|||
func searchNodesForIP(nodes []kapi.Node, ips []string) types.DiagnosticResult { | |||
r := types.NewDiagnosticResult(MasterNodeName) | |||
r.Debug("DClu3005", fmt.Sprintf("Seaching for a node with master IP: %s", ips)) | |||
r.Debug("DClu3005", fmt.Sprintf("Searching for a node with master IP: %s", ips)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can always pinpoint the error/debug/warning without these tags, we just need ose release (oc version
) and filepath:line-number which are available. I think these tags are more confusing than useful. May be we should get rid of them?
@sosiouxme
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for fixing this up. I have some requests and then we can merge this.
BTW, the log analysis doesn't check many patterns yet -- if there are useful log entries you'd like this to find for you, please suggest those or add a PR...
@@ -146,7 +146,7 @@ func resolveServerIP(serverUrl string, fn dnsResolver) ([]string, error) { | |||
|
|||
func searchNodesForIP(nodes []kapi.Node, ips []string) types.DiagnosticResult { | |||
r := types.NewDiagnosticResult(MasterNodeName) | |||
r.Debug("DClu3005", fmt.Sprintf("Seaching for a node with master IP: %s", ips)) | |||
r.Debug("DClu3005", fmt.Sprintf("Searching for a node with master IP: %s", ips)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no special reason to keep an identifier.
But, I think you are talking about the debug statement in locate_units.go
@@ -99,8 +99,13 @@ func (d AnalyzeLogs) Check() types.DiagnosticResult { | |||
if err := json.Unmarshal(bytes, &entry); err != nil { | |||
r.Debug("DS0003", fmt.Sprintf("Couldn't read the JSON for this log message:\n%s\nGot error %s", string(bytes), errStr(err))) | |||
} else { | |||
if lineCount > 500 && stampTooOld(entry.TimeStamp, timeLimit) { | |||
r.Debug("DS0004", fmt.Sprintf("Stopped reading %s log: timestamp %s too old", unitName, entry.TimeStamp)) | |||
old, unixTimestamp, err := stampTooOld(entry.TimeStamp, timeLimit) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a few suggestions to make here.
stampTooOld
is only ever used here. All it really accomplishes is making this logic easy to understand. Here you've turned the method into trinary logic (true/false/error) plus return value and then you have to process that trinary again upon receiving it. I think it would be simpler just to bring the method inline and use loop controls rather than return
.
old, unixTimestamp, err := stampTooOld(entry.TimeStamp, timeLimit) | ||
if err != nil { | ||
r.Error("DS0004", err, fmt.Sprintf("Stopped reading %s log, timestamp %s is invalid", unitName, entry.TimeStamp)) | ||
break |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Subtle difference here. This will end the loop the first time it sees a malformed log line (if that ever happens! not sure I've ever seen it). Previously it would only end the loop if there was an error and it had seen more than 500 lines.
It would make more sense to me to skip the log line with a continue
here and keep looking.
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: nak3, sosiouxme The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS Files:
You can indicate your approval by writing |
/retest Please review the full test history for this PR and help us cut down flakes. |
/test all [submit-queue is verifying that this PR is safe to merge] |
Automatic merge from submit-queue. |
Thank you. I think I should have squashed these commits, before merge. (I'm sorry!) |
That does make life easier when doing a git bisect, but otherwise isn't really that important, so don't worry about it. |
When we run
oc adm diagnostics AnalyzeLogs
, diagnostics commandoften misses error messsages in journal logs. So, as an admin we are
trying to find out the reason, but debug message gives us unclear
message:
So, this patch changes regarding AnayzeLogs as:
Here is the message after applied this patch: