After investigating with AWS, we've identified the root cause of the increased rate of 500 errors from S3. No further errors should be occurring at this time, and we've received no new reports of them over the last couple of weeks.
Based on our investigation, we believe the errors and timeouts started occurring because the brokered S3 buckets in the platform resided in a large partition that could not efficiently process all of its keys during a lookup operation. In practice, S3 struggled to find any given bucket quickly and would intermittently return a timeout error, which surfaced as the 500 response some customers were experiencing.
AWS has created child partitions that account for our bucket naming scheme so that we no longer have this issue with our existing buckets or any future buckets.
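For applications that want to tolerate intermittent 500 responses like the ones described above, retrying with exponential backoff is a common client-side approach. The following is only a minimal sketch, not something taken from our platform code: `TransientServerError` and `call_with_retries` are hypothetical names, and the delay values are assumptions you would tune for your own workload.

```python
import random
import time


class TransientServerError(Exception):
    """Hypothetical stand-in for an intermittent 5xx response from a storage service."""


def call_with_retries(operation, max_attempts=5, base_delay=0.5,
                      max_delay=8.0, sleep=time.sleep):
    """Call operation(), retrying on TransientServerError with backoff and jitter.

    All parameter names and defaults here are illustrative assumptions.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientServerError:
            if attempt == max_attempts:
                raise  # out of attempts; surface the error to the caller
            # Double the delay ceiling on each attempt, capped at max_delay,
            # then pick a random delay below it so clients do not retry in lockstep.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))
```

To use it, wrap your own S3 request in a function that raises `TransientServerError` when the service returns a 5xx status, and pass that function as `operation`. AWS SDKs generally include their own retry handling, so a helper like this may only be needed if you call the service without one.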
If anyone has any questions or concerns about this, or continues to experience 500 error responses from their S3 bucket(s), please reach out to us at support@cloud.gov.