Closed Bug 1630040 Opened 4 years ago Closed 4 years ago

Google Trust Services: OCSP serving issue 2020-04-09

Categories

(CA Program :: CA Certificate Compliance, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: awarner, Assigned: awarner)

Details

(Whiteboard: [ca-compliance] [ocsp-failure])

User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.92 Safari/537.36

Steps to reproduce:

Actual results:

From 2020-04-08 16:25 UTC to 2020-04-09 05:40 UTC, Google Trust Services' EJBCA-based CAs (GIAG4, GIAG4ECC, GTSY1-4) served empty OCSP data, which led the OCSP responders to return "unauthorized" responses.

These CAs exist for issuance of custom certificate profiles and certificates for test sites for inactive roots. Our primary CAs (GTS CA 1O1 and GTS CA 1D2) were unaffected. The problem self-corrected, but we have added safeguards to prevent recurrence.

  1. How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in mozilla.dev.security.policy, a Bugzilla bug, or internal self-audit), and the time and date.

Monitoring detected the issue on 2020-04-08 at 16:35 UTC. The root cause was identified within hours. The issue was automatically remediated in the next generation-and-push-to-CDN cycle while debugging and fixes were ongoing.

  2. A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.

2020-04-08, 11:29 UTC - Scheduled system update begins
2020-04-08, 14:00 UTC - Incorrect OCSP archives are generated
2020-04-08, 15:03 UTC - Scheduled system update concludes
2020-04-08, 16:20 UTC - Incorrect OCSP responses pushed to CDN
2020-04-08, 16:35 UTC - First production monitoring alert fires
2020-04-08, 22:00 UTC - Correct OCSP archives are generated automatically
2020-04-09, 00:20 UTC - Correct OCSP responses pushed to CDN
2020-04-09, 05:40 UTC - Monitoring confirms all probes are passing

  3. Whether your CA has stopped, or has not yet stopped, issuing certificates with the problem. A statement that you have will be considered a pledge to the community; a statement that you have not requires an explanation.

The affected CAs are used only for infrequent, manual issuance of custom certificates. The only certificate issued during this period was a manually issued post-update test certificate to validate the upgrade that resolved the issue. The issue in question was also specific to refreshing OCSP responses, not to certificate issuance.

  4. A summary of the problematic certificates. For each problem: number of certs, and the date the first and last certs with that problem were issued.

No certificates were issued during this period aside from a manually issued post-update test certificate to validate the upgrade that resolved the issue. The test certificate was a valid and fully compliant issuance.

  5. The complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem.

No certificates were issued aside from the manually issued post-update test certificate to validate the upgrade.

  6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

Our process for creating OCSP responses and packaging them for serving is designed to fail if any sub-command fails, using the shell's set -e option. However, if a function call is part of an AND or OR list (i.e. using the '&&' or '||' control operators), set -e is suppressed inside the function.

The tool we use to fetch OCSP responses from EJBCA correctly returned a non-zero exit code (no OCSP responses were generated because EJBCA was not running), but because it was called inside a function that was itself invoked as part of an && list, set -e was suppressed, the error went unhandled, and the script wrongly packaged empty tar.gz files with no responses in them. The bug had existed for multiple years as a latent race condition that we had not previously encountered.
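For illustration, a minimal sketch of the shell behavior described above (simplified; the function and file names here are invented for the example, not taken from the actual pipeline):

    #!/bin/bash
    set -e

    fetch_and_package() {
      # The fetch tool exits non-zero (e.g. EJBCA is not running)...
      false
      # ...but because this function is invoked as part of an && list,
      # set -e is suppressed inside it and execution continues, packaging
      # an empty archive with no OCSP responses in it.
      tar -czf responses.tar.gz -T /dev/null
    }

    # The && list is what disables set -e inside the function; the empty
    # responses.tar.gz is then treated as ready to push.
    fetch_and_package && echo "pushing responses.tar.gz to CDN"

    # A bare call, by contrast, honors set -e and aborts at 'false':
    # fetch_and_package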

Quality tests are executed before publication to the CDN; however, those tests treat empty responses as a valid condition, because that is something that can and does legitimately happen.

This condition did not repeat on the following update of the OCSP responses, so the next update resolved the issue. Our monitoring caught the issue, enabling expedient root cause analysis and resolution.

  7. List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future, accompanied with a timeline of when your CA expects to accomplish these things.

No certificates were issued during this period aside from a valid, manually issued post-update test certificate to validate the upgrade.

The logic error that led to incorrect OCSP responses being served has been corrected, checked in, and deployed to production. Additionally, checks have been added to ensure that bad data cannot replace known-good data.

We reviewed all existing monitoring of response generation and publishing and found no gaps.

A review of similar code has also been conducted to ensure we do not have other instances where similar logic could incorrectly suppress errors.

The only unexpired, revoked certificates under these CAs are those used by our six demo sites.

Users or automation relying on these sites for testing may have interpreted the unauthorized responses to mean that these revoked demo certificates should be considered valid during the window in which bad data was served.

The issue was limited to OCSP handling; CRL data was correct during the same period.

No additional improvements are outstanding at this time.

Expected results:

Assignee: wthayer → awarner
Status: UNCONFIRMED → ASSIGNED
Type: defect → task
Ever confirmed: true
Whiteboard: [ca-compliance]
Summary: GTS - OCSP serving issue 2020-04-09 → Google Trust Services: OCSP serving issue 2020-04-09

Google Trust Services has had issues in the past with OCSP, as covered by Bug 1630079, and with CRLs, as covered by Bug 1581183. I understand that GTS has two separate CA stacks, as this bug draws attention to, but the previous OCSP bug seemed to affect the same infrastructure here.

What's not clear to me is why the test cases in that bug failed to catch this issue, and how testing is being improved here.

The situation of testing for OCSP also seems related to Bug 1522975, although I can understand the differences here between manual ceremonies and automation.

I think, given the pattern of incidents being seen here around OCSP, this incident would benefit from having a detailed description of GTS' OCSP processes, and importantly, a description of how and what GTS monitors to detect issues here.

Flags: needinfo?(awarner)

Google Trust Services is happy to provide additional context on this and the prior issues. In the context of this and the prior revocation-related incidents there are three certificate issuance “stacks”, not two:

-- Our “legacy” EJBCA deployment, which was used for the older and now turned-down GIAG2 Certificate Authorities.

-- Our “current” EJBCA deployment, which is used for the infrequent cases where we need to issue customized certificates and CAs; for example, the CAs used for the test sites. This deployment is based on an improved code base and shares little with the “legacy” system described above.

-- Our “primary” CA infrastructure, which is designed to support global TLS-related use cases and was written from scratch.

The oldest issue related to revocation checking was captured in bug 1522975. That was a human error in configuration, not an issue with automated revocation data generation and publishing. There is some overlap in terms of safeguards, namely that configs and live data are both zlint-ed on an ongoing, recurring basis.

After that, we experienced an issue in the legacy system, covered by bug 1630079. In that case the system encountered a bug in the Unix utility 'tar' which led to empty data sets feeding into the OCSP publishing pipeline. That issue led us to improve our production monitoring to catch this and other related issues across all of our “stacks”.

Later, while performing a regular internal review of our primary CA software and infrastructure, we discovered that the software for our primary CA did not include CRL entries for expired certificates during the interval after expiration. This issue is covered by bug 1581183.

In all of these “stacks”, revocation messages are produced by the CA and then pushed to our CDN. We have comprehensive monitoring of the production service and have focused our efforts on ensuring that this monitoring is robust, as it represents the real-world user experience with our services.

We also have various logic checks for correctness at various stages of revocation production; however, there are legitimate cases, for example a CA that has no unexpired certificates, where empty bundles may be expected. With that said, as a result of this incident we are adding a pre-push check with a CA whitelist for this condition, so that other variants of this issue would be caught before being pushed to the CDN.
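As a rough sketch of what such a pre-push guard could look like (illustrative only; the whitelist contents and helper name are invented, not the actual tooling):

    # Reject empty OCSP bundles unless the CA is on an explicit whitelist
    # of CAs that may legitimately have no unexpired certificates.
    EMPTY_BUNDLE_OK="EXAMPLE_DORMANT_CA_1 EXAMPLE_DORMANT_CA_2"

    check_bundle_not_empty() {
      local ca="$1" bundle="$2"
      if [ "$(tar -tzf "$bundle" 2>/dev/null | wc -l)" -gt 0 ]; then
        return 0                                # bundle contains responses
      fi
      for ok in $EMPTY_BUNDLE_OK; do
        [ "$ca" = "$ok" ] && return 0           # empty is expected for this CA
      done
      echo "ERROR: empty OCSP bundle for $ca; refusing to push" >&2
      return 1
    }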

Obviously we would have preferred not to have experienced any issues, but it is worth noting that our monitoring did catch the issue and correctly triggered an investigation by the appropriate team. That investigation resulted in the incident quickly being understood and addressed. The code associated with this pipeline has also been reviewed and improved, with the aim of addressing other potential weaknesses.

If other members of the community are willing to share how they handle error checking and monitoring of revocation data we would appreciate hearing about alternatives and additional actions we could consider.

Flags: needinfo?(awarner)

Thanks Andy, this is really useful context and detail. It’s also helpful that you emphasized that your monitoring and controls did catch this quickly, and similarly, remediation itself was very quick.

I was primarily focused on these events:

2020-04-08, 16:20 UTC - Incorrect OCSP responses pushed to CDN
2020-04-08, 16:35 UTC - First production monitoring alert fires

Bug 1630079 particularly came to mind because it revealed a lack of monitoring and pre-push testing. I’m encouraged to hear that monitoring caught this new issue this time, but I’m concerned a bad push was still possible. Knowing that the perfect can be the enemy of the good, I’m still keen to find more thoughts, ideas, and experience for what sort of controls and tests can and should exist here, to complement the monitoring by more effectively preventing, and not just detecting, issues.

That’s why in Comment #1 I’m trying to build a better understanding of the existing pre-push checks and what’s being added or improved here. It sounds like the current set of checks is just “>=1 response per issuer”, but I’m hoping you can expand with more detail if that’s not the case.

Flags: needinfo?(awarner)

Resetting severity to default of --.

We appreciate the additional input. We had a bit more in place than may have come through, and additional improvements are prepared for deployment in our next regular update.

The EJBCA pipeline also had basic preventative checks (pre-push) for nulls and serial number validation. We have enforced a very specific order for the generation of revocation data and confirmed that error codes and conditions are set and handled properly at each step. This ensures that an empty file will not be generated due to CA or database issues.

Our next production update will include additional pre-push (presubmit) checks for:

  • validation that the tar.gz is correct (i.e. it can be read)
  • validation that tar.gz contains at least one OCSP response
  • validation that each CA has at least one OCSP response
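For illustration, checks along these lines might look roughly like the following (a sketch only; the bundle path and the assumption of one directory per CA inside the archive are invented for the example):

    bundle="responses.tar.gz"

    # 1. The tar.gz must be readable at all.
    tar -tzf "$bundle" > /dev/null \
      || { echo "unreadable OCSP archive: $bundle" >&2; exit 1; }

    # 2. It must contain at least one OCSP response.
    [ "$(tar -tzf "$bundle" | wc -l)" -ge 1 ] \
      || { echo "archive contains no OCSP responses" >&2; exit 1; }

    # 3. Every CA must be represented by at least one response
    #    (assuming responses are stored under one directory per CA).
    for ca in GIAG4 GIAG4ECC GTSY1 GTSY2 GTSY3 GTSY4; do
      tar -tzf "$bundle" | grep -q "^$ca/" \
        || { echo "no OCSP responses for $ca in archive" >&2; exit 1; }
    done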

If you or other members of the community have additional suggestions, we'll happily consider further safeguards.

Flags: needinfo?(awarner)

While I don't have further suggestions for this particular bug, I will note that GTS had yet another production revocation incident, captured in Bug 1634795, that, along with those mentioned in Comment #1, fits into a pattern of GTS not maintaining revocation services effectively.

Flags: needinfo?(wthayer)

Andy: has the update described in comment #5 been deployed? If not, when is that expected?

Flags: needinfo?(wthayer) → needinfo?(awarner)

Sorry about the lag, I'm out on parental leave with a newborn and doing minimal work. The changes in comment 5 went into production at the end of April shortly after that update.

Flags: needinfo?(awarner)

Thank you Andy.

It appears that all questions have been answered and remediation is complete.

Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Product: NSS → CA Program
Whiteboard: [ca-compliance] → [ca-compliance] [ocsp-failure]