Closed Bug 1678183 Opened 4 years ago Closed 3 years ago

Google Trust Services: Invalid ASN.1 encoding of singleExtensions in OCSP responses

Categories

(CA Program :: CA Certificate Compliance, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: awarner, Assigned: awarner)

Details

(Whiteboard: [ca-compliance] [ocsp-failure])

  1. How your CA first became aware of the problem

On 2020-11-05 we received a notification from PrimeKey regarding a new EJBCA version that includes fixes related to compliance requirements for public CAs. One of the two bugs, the inclusion of an empty SEQUENCE when no singleExtensions exist for a SingleResponse, affects one Google Trust Services CA.

  2. A timeline of the actions your CA took in response.

2019-11-28: The affected EJBCA installation was updated from version 7.2.1 to version 7.3.1 which included the bug.
2020-11-05: We received a notification from PrimeKey regarding a new EJBCA version that fixes this bug.
2020-11-06: We confirmed the issue and identified update windows for test and prod to test the update and roll it out to production.
2020-11-09: New EJBCA version rolled out in the test environment.
2020-11-12: New EJBCA version rolled out in the production environment.

  3. Whether your CA has stopped, or has not yet stopped, certificate issuance or the process giving rise to the problem or incident.

The issue was remediated with the rollout of the new EJBCA version.

  4. In a case involving certificates, a summary of the problematic certificates.

No problematic certificates were issued.

  5. In a case involving certificates, the complete certificate data for the problematic certificates.

No problematic certificates were issued.

  6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

The bug was introduced by the software vendor. We have not received any reports of user agents being unable to validate the affected OCSP responses. We follow Mozilla best practices and routinely use third-party resources such as revocationcheck.com to validate our responses; none of them identified this issue.

  7. List of steps your CA is taking to resolve the situation and ensure that such situation or incident will not be repeated in the future, accompanied with a binding timeline of when your CA expects to accomplish each of these remediation steps.

After triaging the bug, we confirmed the issue and identified test and update windows to roll out the new EJBCA version. We will continue to review release notes closely and conduct our own quality checks to help identify similar problems in the future.

Assignee: bwilson → awarner
Status: UNCONFIRMED → ASSIGNED
Type: enhancement → task
Ever confirmed: true
Whiteboard: [ca-compliance]

(In reply to Andy Warner from comment #0)

  1. How your CA first became aware of the problem

On 2020-11-05 we received a notification from PrimeKey regarding a new EJBCA version that includes fixes related to compliance requirements for public CAs.

This incident appears to have the same root cause as bug 1667944 reported on 2020-09-29 by GlobalSign. Does Google monitor Bugzilla for incident reports from other CAs? Was Google aware of bug 1667944?

  7. List of steps your CA is taking to resolve the situation and ensure that such situation or incident will not be repeated in the future, accompanied with a binding timeline of when your CA expects to accomplish each of these remediation steps.

After triaging the bug, we confirmed the issue and identified test and update windows to roll out the new EJBCA version. We will continue to review release notes closely and conduct our own quality checks to help identify similar problems in the future.

Could you provide more detail on the quality checks implemented? Are they similar to the "full asn1 comparison check of OCSP responses" described by GlobalSign?

Apple was also affected in bug 1669618, and they described their resolution steps with a similar level of detail as GlobalSign.

GTS regularly reviews m.d.s.p posts and Bugzilla incident reports. We noted GlobalSign’s incident report and scheduled a follow up action to determine whether our EJBCA CA was also affected. The investigation concluded later than it usually would have because EJBCA is not our primary CA platform and it was not in use for production issuance at that time.

Our primary quality checks are zlint and internal test suites, including use of openssl asn1parse to check encodings. We also have probers (continuous checks of live data) that cover the validity and correctness of OCSP responses.

Sorry Andy, I'm not sure how this bug got dropped on our end. However, it probably would have been noticed sooner had Google provided weekly updates.

From your response, it seems there are several areas for Google to improve:

  1. Ensure regular communication on all outstanding issues until GTS has received confirmation that the bug is closed.
  2. An examination and root cause analysis as to why "The investigation concluded later than it usually would", and what steps are being done to improve.
  3. A more detailed answer to the question from Comment #1 re: quality checks. Given that ZLint does not check OCSP responses, it's not clear why GTS mentioned it. Given the lack of context, it's also likely that the "internal test suites" also do not cover OCSP. OpenSSL's asn1parse doesn't really check encodings against the schema, just that data is BER encoded. You mention "continuous checks of live data", but don't really describe what the "correctness" checks are, which is, as best I can understand, what Comment #1 was going for.

I'm equally concerned that the responses in Comment #0 to Question 6 and Question 7 don't really demonstrate good practice at all. Question 6 appears to blame a software vendor and takes the position "We did everything we could", while Question 7 seems that GTS's approach to compliance is that it's "somebody else's problem" and that GTS' role is just reviewing release notes.

I'm hoping this was just a rush to publish an initial incident report, since this really does not seem to meet the level of Google's own postmortem culture.

Flags: needinfo?(awarner)
  1. We would appreciate clarification on the expectations around weekly updates for items that are considered remediated. We are aware of the expectations around updates for issues that are still actively being worked on, but it is quite common for it to take a while for Mozilla staff / peers to close out issues after resolution and we thought that updates simply saying "we're awaiting close" would simply create noise or not be well received and that does not seem to be the norm. If there is an expectation that CAs provide updates / checkins until an issue is fully closed, it would be good to have that clarified in the Mozilla incident guide.

  2. Our regular compliance review identified the issue as needing follow-up soon after GlobalSign's post. We did not prioritize the follow-up, as the initial assessment was that it was low risk, but merited further review. The additional review ought to have happened sooner, but based on the initial classification and other workloads at the time, it took longer than normal. The team has since onboarded and trained new team members, which reduces the risk of similar delays in the future.

  3. I interpreted the quality check question to be more of a general inquiry as it followed the discussion of updating running checks, so I provided an answer covering all post update checks. Given the specific query about 'full asn1 comparison' I should have addressed it more directly. We have a sequence size check in our internal test suite, but do not perform a before / after diff of the asn1 output for OCSP responses. This appears to be a potential area of improvement and will be discussed within the team.
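
As an illustration of the "before / after diff of the asn1 output" mentioned in item 3, here is a minimal sketch under stated assumptions. It is not GTS's or GlobalSign's tooling, and the function names (dump, structural_diff, tlv) and sample bytes are hypothetical. It renders each DER element as a tag/length line and compares the structure of a response captured before an upgrade with one captured after, so that a newly appearing element stands out even though both encodings parse cleanly. It assumes single-byte tags and definite lengths.

```python
import difflib
from typing import List


def dump(data: bytes, depth: int = 0) -> List[str]:
    """Render DER as one line per element: indentation, tag and length."""
    lines: List[str] = []
    offset = 0
    while offset < len(data):
        tag = data[offset]
        length = data[offset + 1]
        offset += 2
        if length & 0x80:  # long form: low 7 bits give the number of length bytes
            n = length & 0x7F
            length = int.from_bytes(data[offset:offset + n], "big")
            offset += n
        value = data[offset:offset + length]
        offset += length
        lines.append(f"{'  ' * depth}tag=0x{tag:02X} len={length}")
        if tag & 0x20:  # constructed element: recurse into its children
            lines.extend(dump(value, depth + 1))
    return lines


def structural_diff(before: bytes, after: bytes) -> List[str]:
    """Return only the added ('+ ') and removed ('- ') structure lines."""
    delta = difflib.ndiff(dump(before), dump(after))
    return [line for line in delta if line[:2] in ("+ ", "- ")]


def tlv(tag: int, value: bytes) -> bytes:
    """Encode a short-form (< 128 byte) element; used only to build sample inputs."""
    assert len(value) < 128
    return bytes([tag, len(value)]) + value


# Sample SingleResponse-shaped inputs with placeholder field contents.
_prefix = tlv(0x30, b"") + tlv(0x80, b"") + tlv(0x18, b"20201105000000Z")
BEFORE = tlv(0x30, _prefix)                             # no singleExtensions
AFTER = tlv(0x30, _prefix + tlv(0xA1, tlv(0x30, b"")))  # empty SEQUENCE appears

for line in structural_diff(BEFORE, AFTER):
    print(line)
```

Comparing only tags and lengths keeps fields that legitimately change between responses, such as timestamp contents and signature bytes, out of the diff, though variable-length fields would still need normalizing in practice.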

Our intent with the details on EJBCA was not to deflect blame, but rather to provide context. We use EJBCA as a backup. We keep it fully compliant, but considering that it is a backup, our primary focus is on our own CA software and we have a less frequent cadence for review of updates and testing for EJBCA. We choose to operate EJBCA as a backup, therefore we are obligated to keep it up to date and ensure our running configuration is fully compliant with applicable BRs and RFCs.

Flags: needinfo?(awarner)

It is unclear whether weekly updates are expected for issues pending a response. GTS is still awaiting feedback on our last response.

Flags: needinfo?(bwilson)

It does not appear that there are any follow-up questions. I will schedule this to be closed on or about this Friday 16-April-2021.

(In reply to Andy Warner from comment #5)

  1. We would appreciate clarification on the expectations around weekly updates for items that are considered remediated. We are aware of the expectations around updates for issues that are still actively being worked on, but it is quite common for it to take a while for Mozilla staff / peers to close out issues after resolution and we thought that updates simply saying "we're awaiting close" would simply create noise or not be well received and that does not seem to be the norm. If there is an expectation that CAs provide updates / checkins until an issue is fully closed, it would be good to have that clarified in the Mozilla incident guide.

So while I appreciate this request for Ben, I do think it highlights several problematic concerns.

As a CA that is following other CAs' updates, GTS would presumably be aware of the repeated discussions on requests for updates, and the concerns (and risk of removal of trust in some CAs) related to a lack of updates. Equally, a CA that prioritized operating beyond reproach would take the maximally restrictive approach, and assume that unless and until such clarification is given, more updates (and transparency) are better. If anything, it would demonstrate GTS has a process to ensure it reviews, every week, all open bugs, and ensures that all of them are making progress to confirmed closure.

If this was three years ago, this would seem like a reasonable request, but I must admit, it seems surprising to see this statement in 2021, which is what Comment #4 was getting at. I wanted to let Comment #5 to see if things had changed, and I'm encouraged to see Comment #6.

  2. Our regular compliance review identified the issue as needing follow-up soon after GlobalSign's post. We did not prioritize the follow-up, as the initial assessment was that it was low risk, but merited further review. The additional review ought to have happened sooner, but based on the initial classification and other workloads at the time, it took longer than normal. The team has since onboarded and trained new team members, which reduces the risk of similar delays in the future.

I think I still find this answer troubling, because it certainly feels essentially the same, in both substance and detail, as the problematic example called out at https://wiki.mozilla.org/CA/Responding_To_An_Incident#Incident_Report as "human error", with it being a "prioritization error" where "prioritization has been improved".

There was an opportunity here to respond more in substance in a transparent and beyond reproach way, which is what I tried to encourage in Comment #4, and it's disappointing to see that even with that, GTS failed to rise to the occasion. That GTS sees some compliance incidents as "low risk" that can be deprioritized is concerning, and this could have been an opportunity for GTS to share what it felt was more important than compliance and sharing with the community its steps to improve it. These are things that lower the degree of trust and confidence in a CA, and while individually might not be enough for the removal of trust, are the things in aggregate the CA should realize is the single greatest risk, because it is the greatest risk to end users.

Since it appears GTS is still struggling to rise to the level of other CAs, I think an example of a very clear "missed opportunity" in Comment #0 is that it was unclear whether or not GTS performed any investigation about its issuance and whether it did issue such responses. It simply appears as "Maybe we did, maybe we didn't, we didn't think it mattered so didn't bother to find out. We updated, so we think we've done everything we need to do". If that sort of response wouldn't be appropriate for, say, issuing an unconstrained intermediate CA certificate, then it equally isn't appropriate for this: the CA bears the burden of proof in demonstrating the severity, backed by the details, and evidence that the CA understands that every compliance issue is, itself, significant. I know this is an area GTS has struggled with in the past, and I'm concerned it's not improved.

  3. I interpreted the quality check question to be more of a general inquiry as it followed the discussion of updating running checks, so I provided an answer covering all post update checks. Given the specific query about 'full asn1 comparison' I should have addressed it more directly. We have a sequence size check in our internal test suite, but do not perform a before / after diff of the asn1 output for OCSP responses. This appears to be a potential area of improvement and will be discussed within the team.

Again, this is an answer that doesn't really provide assurance, because the context of questions like Comment #1 and Comment #2 gave specific examples of two other CA bugs that went into significantly more detail, even if with some probing, that should have set a baseline expectation showing how that interpretation was unreasonable.

I encourage Google to compare its Comment #0, filed 2020-11-18, versus Apple's https://bugzilla.mozilla.org/show_bug.cgi?id=1669618#c2 - a full month before, on 2020-10-13. Look at how the answers to Questions 6 and 7 are structured, and compare that with GTS's response, and it should be clear there's a significant difference.

Further, compare Comment #5, just 12 days ago, in discussing a "potential area of improvement and will be discussed within the team" (even though it predates Google's initial response by an entire month), as somehow new information, and further in light of https://wiki.mozilla.org/CA/Responding_To_An_Incident#Incident_Report discussions of binding timelines.

If this were a 2017-2018 report, I'd be far less concerned, but again, this is an area where GTS has seemed to struggle, and I think it's entirely reasonable to expect more, especially from an organization that quite literally wrote a book on post-mortem culture that this process is modeled after.

This may feel like it's being a very "blameful post-mortem of a blameless post-mortem", but I'm equally concerned that without this detailed critique, GTS may interpret Comment #7 as meaning "This was an excellent post-mortem and the model for the future". It isn't, and while I hope GTS does not have further incidents, I do look forward to seeing substantial improvement in substance, detail, and familiarity with the baseline expectations in future incidents.

Good comments and suggestions for going forward. Thanks, Ryan

Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Flags: needinfo?(bwilson)
Resolution: --- → FIXED
Summary: Google Trust Services - Invalid ASN.1 encoding of singleExtensions in OCSP responses → Google Trust Services: Invalid ASN.1 encoding of singleExtensions in OCSP responses
Product: NSS → CA Program
Whiteboard: [ca-compliance] → [ca-compliance] [ocsp-failure]