Workflow sometimes fails when evaluating a 'when' expression #14150

Open
3 of 4 tasks
davidZal1992 opened this issue Feb 5, 2025 · 1 comment
Labels
area/controller Controller issues, panics type/bug

davidZal1992 commented Feb 5, 2025

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened? What did you expect to happen?

Hey team, we need some help with a strange issue.
We have a workflow with a nested DAG. The DAG acts as a sensor that collects artifacts (a JSON file) from our S3 bucket. Based on one of the JSON fields, we decide the next steps through the "when" field in our YAML.
We process a high volume of workflows (more than 19K per day), and every few days, with no consistent pattern, the step that evaluates the JSON fails with the following error:
Invalid 'when' expression ''TriggerDBK' == {{=jsonpath(tasks['collectArtifact'].outputs.parameters.MonitorDBK, '$.nextStep')}}': Invalid token: '{{=' (hint: try wrapping the affected expression in quotes ("))
When this happens, we see 3-4 workflows fail at the same time.
We checked the JSON output and everything is in place, both in the S3 bucket and in the Argo UI. The controller logs show nothing unusual, so we have no clear lead on why this happens.
Can you suggest any methods to debug this issue and understand what might be causing it?
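
One way to rule out the payload itself is to replay the comparison outside Argo. The sketch below is only an illustration and does not reproduce the controller's evaluation: it assumes the output parameter holds the raw text of MonitorDBK-output.json, and `check_next_step` is a hypothetical helper name. It reports whether the text parses as JSON and whether its top-level `nextStep` equals `TriggerDBK`.

```python
import json
from typing import Tuple


def check_next_step(raw: str, expected: str = "TriggerDBK") -> Tuple[bool, str]:
    """Return (matches, reason) for a raw MonitorDBK output parameter value.

    Hypothetical helper: it mimics the intent of the 'when' condition,
    it does not call Argo or reproduce the controller's evaluation.
    """
    if not raw.strip():
        return False, "empty payload"
    try:
        doc = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc.msg}"
    if not isinstance(doc, dict) or "nextStep" not in doc:
        return False, "no top-level 'nextStep' field"
    if doc["nextStep"] != expected:
        return False, f"nextStep is {doc['nextStep']!r}"
    return True, "ok"


if __name__ == "__main__":
    # Made-up sample payloads, not taken from a real run.
    for sample in ('{"nextStep": "TriggerDBK"}', '{"nextStep": "Skip"}', ""):
        print(check_next_step(sample))
```

Running this against the file downloaded from S3 for a failed workflow would show whether the stored payload ever differs from what a successful run produces.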

Artifact collected:
Image

Expression evaluation failed:

Image

The 'ignoreFail' step fails while evaluating its 'when' expression.
  - name: collectArtifact
    depends: "MonitorDBK"
    template: collect-prev-steps-artifacts
  - name: ignoreFail
    depends: "collectArtifact"
    when: "'TriggerDBK' == {{=jsonpath(tasks['collectArtifact'].outputs.parameters.MonitorDBK, '$.nextStep')}}"


#utils/collectArtifacts.yaml

  - name: collect-prev-steps-artifacts
    retryStrategy:
      limit: "3"
      retryPolicy: "Always"
      backoff:
        duration: "30"
    inputs:
      artifacts:
        - name: InputMonitorDBK
          path: "/tmp/MonitorDBK-output.json"
          mode: 0644
          archive:
            none: { }
          s3:
            endpoint: s3.amazonaws.com
            region: us-west-2
            bucket: "{{workflow.parameters.g_s3_bucket}}"
            key: "{{workflow.parameters.g_artifacts_s3_dir}}/MonitorDBK-output.json"
        - name: InputCreateRun
          path: "/tmp/CreateRun-output.json"
          mode: 0644
          archive:
            none: { }
          s3:
            endpoint: s3.amazonaws.com
            region: us-west-2
            bucket: "{{workflow.parameters.g_s3_bucket}}"
            key: "{{workflow.parameters.g_artifacts_s3_dir}}/CreateRun-output.json"
    outputs:
      parameters:
        - name: MonitorDBK
          valueFrom:
            path: "/tmp/MonitorDBK-output.json"
        - name: CreateRun
          valueFrom:
            path: "/tmp/CreateRun-output.json"
    container:
      image: docker.com/python:3.10
      command: [ sh, -c ]
      args: [ "date; echo do nothing" ]
      resources:
        requests:
          cpu: "50m"
          ephemeral-storage: "1Gi"
          memory: "100Mi"


Version(s)

v3.4.16

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

The 'ignoreFail' step fails while evaluating its 'when' expression.
  - name: collectArtifact
    depends: "MonitorDBK"
    template: collect-prev-steps-artifacts
  - name: ignoreFail
    depends: "collectArtifact"
    when: "'TriggerDBK' == {{=jsonpath(tasks['collectArtifact'].outputs.parameters.MonitorDBK, '$.nextStep')}}"

#utils/collectArtifacts.yaml
    - name: collect-prev-steps-artifacts
      retryStrategy:
        limit: "3"
        retryPolicy: "Always"
        backoff:
          duration: "30"
      inputs:
        artifacts:
          - name: InputMonitorDBK
            path: "/tmp/MonitorDBK-output.json"
            mode: 0644
            archive:
              none: { }
            s3:
              endpoint: s3.amazonaws.com
              region: us-west-2
              bucket: "{{workflow.parameters.g_s3_bucket}}"
              key: "{{workflow.parameters.g_artifacts_s3_dir}}/MonitorDBK-output.json"
          - name: InputCreateRun
            path: "/tmp/CreateRun-output.json"
            mode: 0644
            archive:
              none: { }
            s3:
              endpoint: s3.amazonaws.com
              region: us-west-2
              bucket: "{{workflow.parameters.g_s3_bucket}}"
              key: "{{workflow.parameters.g_artifacts_s3_dir}}/CreateRun-output.json"
      outputs:
        parameters:
          - name: MonitorDBK
            valueFrom:
              path: "/tmp/MonitorDBK-output.json"
          - name: CreateRun
            valueFrom:
              path: "/tmp/CreateRun-output.json"
      container:
        image: docker.com/python:3.10
        command: [ sh, -c ]
        args: [ "date; echo do nothing" ]
        resources:
          requests:
            cpu: "50m"
            ephemeral-storage: "1Gi"
            memory: "100Mi"

Logs from the workflow controller

Error node dataset-17100-1738598700[1].dataset-dag[4].dbk-dag(0)[3]: Invalid 'when' expression ''TriggerDBK' == {{=jsonpath(steps['collectArtifact'].outputs.parameters.MonitorDBK, '$.nextStep')}}': Invalid token: '{{=' (hint: try wrapping the affected expression in quotes ("))
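
To check whether the failures cluster on particular nodes or time windows, lines like the one above can be pulled out of the controller log. This is a minimal sketch, assuming every failure line has the same shape as the single line quoted here; `parse_when_error` is a hypothetical helper, and the regular expression is derived from that one line only.

```python
import re
from typing import Dict, Optional

# Pattern derived from the one controller log line quoted in this issue;
# other controller versions or log formats may need a different pattern.
WHEN_ERROR = re.compile(
    r"Error node (?P<node>\S+): Invalid 'when' expression '(?P<expr>.*)': "
    r"Invalid token: '(?P<token>[^']*)'"
)

# The log line from this issue, used as sample input.
SAMPLE = (
    "Error node dataset-17100-1738598700[1].dataset-dag[4].dbk-dag(0)[3]: "
    "Invalid 'when' expression ''TriggerDBK' == "
    "{{=jsonpath(steps['collectArtifact'].outputs.parameters.MonitorDBK, '$.nextStep')}}': "
    "Invalid token: '{{=' (hint: try wrapping the affected expression in quotes (\"))"
)


def parse_when_error(line: str) -> Optional[Dict[str, str]]:
    """Return the node, expression and offending token, or None if no match."""
    match = WHEN_ERROR.search(line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(parse_when_error(SAMPLE))
```

Grouping the parsed entries by node name or by log timestamp could show whether the 3-4 simultaneous failures have anything in common.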

Logs from your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded
@tooptoop4
Contributor

similar to #13799

@shuangkun shuangkun added the area/controller Controller issues, panics label Feb 22, 2025