Data Cloud

Export your organization's data to any data warehouse

How it works

Swarmia will send your data as Parquet files to your storage bucket, starting with Amazon S3. This data can then be pulled into your data warehouse of choice, such as Snowflake. The full dataset is refreshed every 8 hours so that reasonably fresh data is available whenever it is pulled.

Data being exported

The data currently exported covers pull requests and deployments. The export contains multiple files:

  • pull_requests.parquet
  • deployments.parquet
  • authors.parquet
  • pull_request_reviews.parquet
  • pull_request_review_requests.parquet
  • pull_request_comments.parquet

The pull request data contains:

  • Author information
  • Contributing teams
  • Investment categories
  • Pull request cycle time components: cycle time, progress time, time in review, and time in merge
  • Timestamps of pull request events: first review time, first and last commit time, and so on
  • Pull request comments, review requests, and reviews, including who reviewed the pull request and when

The deployments data contains:

  • Linked pull requests
  • Deployment environment, application name, and version
  • Links to the deployment being fixed, if any
  • DORA metrics: change lead time, time to deploy, time to recovery, and change failure information
  • Involved teams and authors

You can read more about these metrics in Metrics & Definitions.

Getting started

The setup requires you to have an Amazon S3 storage bucket to which Swarmia can push data.

Set up the integration

  1. Navigate to Data Cloud settings
  2. Insert your S3 bucket credentials
  3. Save the settings. After this, you can test the integration from the "Export" button in the UI.

Considerations for the S3 bucket

Swarmia requires the following permissions on an S3 bucket and folder to be able to upload the exports:

  • s3:GetBucketLocation

  • s3:PutObject

Below is an example AWS IAM policy for the export bucket, where <bucket name> is the name of the bucket you provided in the Data Cloud settings.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject"
            ],
            "Resource": "arn:aws:s3:::<bucket name>/*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetBucketLocation"
            ],
            "Resource": "arn:aws:s3:::<bucket name>",
            "Condition": {
                "StringLike": {
                    "s3:prefix": [
                        "*"
                    ]
                }
            }
        }
    ]
}

If you access the data directly from the warehouse, you will most likely need additional permissions. The following IAM policy is the one recommended by Snowflake; it works in both directions, letting Swarmia upload data and Snowflake extract data from the same bucket.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:GetObjectVersion",
                "s3:DeleteObject",
                "s3:DeleteObjectVersion"
            ],
            "Resource": "arn:aws:s3:::<bucket name>/*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation"
            ],
            "Resource": "arn:aws:s3:::<bucket name>",
            "Condition": {
                "StringLike": {
                    "s3:prefix": [
                        "*"
                    ]
                }
            }
        }
    ]
}

Loading data to data warehouse

You can load the data into various data warehouse systems using their native S3 connection methods. The example below shows how to get started with one commonly used system, Snowflake.

Example: Loading data into Snowflake

The following SQL template can get you started with Swarmia Data Cloud and Snowflake. Use it to create the needed Snowflake stage, file format, and data tables so you can start building analytics on top of Swarmia data.

Make sure to set up access from Snowflake to the S3 bucket by following Snowflake's official instructions.

-------------------------------------------------------------------------------------------
-- Create the integration to AWS S3 bucket and the Snowflake stages

-- Create the storage integration for the AWS S3 bucket you have configured with Swarmia Data Cloud.
-- To create the role follow the Snowflake instructions in
-- https://docs.snowflake.com/en/user-guide/data-load-s3-config-storage-integration#step-2-create-the-iam-role-in-aws
-------------------------------------------------------------------------------------------
 
CREATE OR REPLACE STORAGE INTEGRATION swarmia_data_cloud_s3
 TYPE = EXTERNAL_STAGE
 STORAGE_PROVIDER = 'S3'
 ENABLED = TRUE
 STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::<id>:role/<role name>'
 STORAGE_ALLOWED_LOCATIONS = ('s3://<bucket name>');

-- Describe our Integration
DESCRIBE INTEGRATION swarmia_data_cloud_s3;
SHOW STORAGE INTEGRATIONS;

-- Create the parquet format for Swarmia data 
CREATE OR REPLACE FILE FORMAT swarmia_parquet
 TYPE = parquet;

-- Create the stage for the Swarmia exported data
CREATE OR REPLACE STAGE swarmia_data_cloud_s3_stage
URL = 's3://<bucket name>/'
STORAGE_INTEGRATION = swarmia_data_cloud_s3 -- created in previous step
FILE_FORMAT = swarmia_parquet;

-- View our Stages
SHOW STAGES;
DESCRIBE STAGE swarmia_data_cloud_s3_stage;
LIST @swarmia_data_cloud_s3_stage;
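
-- Optional sketch (not part of the original template): inspect the raw Parquet rows
-- straight from the stage before creating any tables. The v0/ path matches the
-- COPY commands used further below.
SELECT $1
FROM @swarmia_data_cloud_s3_stage/v0/pull_requests.parquet
  (FILE_FORMAT => 'swarmia_parquet')
LIMIT 10;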

---
-- Create the target tables
---
-- Create table for pull requests
CREATE OR REPLACE TABLE swarmia_pull_requests (
    id                              text not null,
    github_number                   int not null,
    author_id                       text not null,
    author_external_identifier      text not null,
    github_repository_full_name     text not null,
    pr_status                       text not null,
    labels                          text not null,
    owner_team_ids                  array,
    owner_team_names                array,
    investment_category_ids         array,
    investment_category_names       array,
    is_draft                        boolean,
    is_excluded                     boolean,
    github_created_at               timestamp,
    github_updated_at               timestamp,
    last_closed_or_merged_at        timestamp,
    first_commit_at                 timestamp,
    last_commit_at                  timestamp,
    opened_at                       timestamp,
    first_review_request_at         timestamp,
    first_reviewed_at               timestamp,
    first_approved_review_at        timestamp,
    last_approved_review_at         timestamp,
    cycle_time_seconds              int,
    progress_time_seconds           int,
    review_time_seconds             int,
    merge_time_seconds              int
);

CREATE OR REPLACE TABLE swarmia_deployments (
    id                             text not null,
    pull_request_ids               array,
    deployment_environment         text not null,
    deployed_at                    timestamp,
    fixes_deployment_id            text,
    fixes_version                  text,
    deployment_version             text,
    deployment_app_name            text,
    time_to_recovery_seconds       int,
    change_lead_time_seconds       int,
    time_to_deploy_seconds         int
);

CREATE OR REPLACE TABLE swarmia_authors (
    id                             text not null,
    name                           text,
    associated_emails              array
);

CREATE OR REPLACE TABLE swarmia_pull_request_comments (
    id                             text not null,
    pull_request_id                text not null,
    source_id                      text not null,
    body_text                      text not null,
    created_at                     timestamp not null,
    author_id                      text not null,
    author_is_bot                  boolean not null
);

CREATE OR REPLACE TABLE swarmia_pull_request_review_requests (
    id                             text not null,
    pull_request_id                text not null,
    requested_by_author_id         text not null,
    requested_from_author_id       text,
    requested_from_team_name       text,
    source_id                      text not null,
    requested_at                   timestamp not null
);

CREATE OR REPLACE TABLE swarmia_pull_request_reviews (
    id                             text not null,
    pull_request_id                text not null,
    author_id                      text not null,
    author_external_identifier     text not null,
    source_id                      text not null,
    state                          text not null,
    submitted_at                   timestamp not null,
    github_repository_full_name    text not null,
    source_repository_full_name    text not null,
    pull_request_review_request_id text
);

-------------------------------------------------------------------------------------------
-- Load data into the Snowflake tables
-------------------------------------------------------------------------------------------

-- Remove any old data the table has to prevent duplicates
TRUNCATE TABLE swarmia_pull_requests;
TRUNCATE TABLE swarmia_deployments;
TRUNCATE TABLE swarmia_authors;
TRUNCATE TABLE swarmia_pull_request_comments;
TRUNCATE TABLE swarmia_pull_request_review_requests;
TRUNCATE TABLE swarmia_pull_request_reviews;


-- Copy the whole dataset into the target tables
COPY INTO swarmia_pull_requests
FROM @swarmia_data_cloud_s3_stage/v0/pull_requests.parquet
    MATCH_BY_COLUMN_NAME='CASE_INSENSITIVE';

COPY INTO swarmia_deployments
FROM @swarmia_data_cloud_s3_stage/v0/deployments.parquet
  MATCH_BY_COLUMN_NAME='CASE_INSENSITIVE';

COPY INTO swarmia_authors
FROM @swarmia_data_cloud_s3_stage/v0/authors.parquet
  MATCH_BY_COLUMN_NAME='CASE_INSENSITIVE';

COPY INTO swarmia_pull_request_comments
FROM @swarmia_data_cloud_s3_stage/v0/pull_request_comments.parquet
  MATCH_BY_COLUMN_NAME='CASE_INSENSITIVE';

COPY INTO swarmia_pull_request_review_requests
FROM @swarmia_data_cloud_s3_stage/v0/pull_request_review_requests.parquet
  MATCH_BY_COLUMN_NAME='CASE_INSENSITIVE';

COPY INTO swarmia_pull_request_reviews
FROM @swarmia_data_cloud_s3_stage/v0/pull_request_reviews.parquet
  MATCH_BY_COLUMN_NAME='CASE_INSENSITIVE';

-------------------------------------------------------------------------------------------
-- Query the data!
-------------------------------------------------------------------------------------------
SELECT COUNT(*) FROM swarmia_pull_requests;
SELECT * FROM swarmia_deployments;

-- Monthly average change lead time for 'Engineers'-team pull requests created since 2023 that have been deployed into production environment.
SELECT
    ROUND(AVG(change_lead_time_seconds))    AS change_lead_time_seconds,
    DATE_TRUNC('month', github_created_at)  AS month
FROM swarmia_deployments d
JOIN swarmia_pull_requests p ON (array_contains(p.id::variant, d.pull_request_ids))
WHERE array_contains('Engineers'::variant, p.owner_team_names)
    AND d.deployment_environment='production'
    AND p.github_created_at >= '2023-01-01'::timestamp
GROUP BY 2
ORDER BY 2 DESC;
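
-------------------------------------------------------------------------------------------
-- Optional: schedule the reload to match the export cadence
-------------------------------------------------------------------------------------------
-- A minimal sketch, not part of Swarmia's official setup: since the export is refreshed
-- every 8 hours, you can reload a table on the same cadence with a Snowflake task.
-- <warehouse name> and the task name are placeholders you would replace with your own.
CREATE OR REPLACE TASK refresh_swarmia_pull_requests
  WAREHOUSE = <warehouse name>
  SCHEDULE = 'USING CRON 0 */8 * * * UTC'
AS
BEGIN
  -- Truncate first to prevent duplicates, then reload the full dataset.
  TRUNCATE TABLE swarmia_pull_requests;
  COPY INTO swarmia_pull_requests
  FROM @swarmia_data_cloud_s3_stage/v0/pull_requests.parquet
    MATCH_BY_COLUMN_NAME='CASE_INSENSITIVE';
END;

-- Tasks are created in a suspended state; resume the task to start the schedule.
ALTER TASK refresh_swarmia_pull_requests RESUME;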

Parquet schemas

All export schemas are defined as follows.

pull_requests

field type required description
id UTF8 Y UUID
github_number INT64 Y GitHub number, like 5532
author_id UTF8 Y Created by, UUID
title UTF8 Y Pull request title
description UTF8 N Pull request description
author_external_identifier UTF8 Y Created by nickname
github_repository_full_name UTF8 Y Repository's name
pr_status UTF8 Y One of the following: OPEN, CLOSED, MERGED
owner_team_ids Array, UTF8 Y Who owns the pull request
owner_team_names Array, UTF8 Y Names of the teams that own the pull request
investment_category_ids Array, UTF8 Y Investment categories the pull request belongs to
investment_category_names Array, UTF8 Y Linked investment categories names
labels Array, UTF8 Y Labels attached to the pull request
is_draft BOOLEAN Y Is the pull request a draft
is_excluded BOOLEAN Y Is the pull request excluded from metrics
github_created_at TIMESTAMP_MICROS Y Created at in GitHub
github_updated_at TIMESTAMP_MICROS Y Updated at in GitHub
last_closed_or_merged_at TIMESTAMP_MICROS N  
first_commit_at TIMESTAMP_MICROS N Time of when first commit took place
last_commit_at TIMESTAMP_MICROS N Time of when last commit took place
opened_at TIMESTAMP_MICROS N When the pull request was opened
first_review_request_at TIMESTAMP_MICROS N When the first review request was sent
first_reviewed_at TIMESTAMP_MICROS N When the pull request was first reviewed
first_approved_review_at TIMESTAMP_MICROS N When the pull request was first approved
cycle_time_started_at TIMESTAMP_MICROS Y When cycle time started
cycle_time_ended_at TIMESTAMP_MICROS N When cycle time ended
additions INT64 N How many lines were added
deletions INT64 N How many lines were deleted
cycle_time_seconds INT64 N Length of the pull request's cycle time
progress_time_seconds INT64 N How long the pull request was in progress
review_time_seconds INT64 N How long the pull request was in review
merge_time_seconds INT64 N How long it took to merge the pull request
linked_swarmia_issue_id UTF8 N Linked swarmia issue
is_revert BOOLEAN N Whether the pull request reverts an existing pull request
changed_files Array, UTF8 Y Array of changed file names

pull_request_review_requests

field type required description
id UTF8 Y UUID
pull_request_id UTF8 Y Linked pull request, UUID
requested_by_author_id UTF8 Y Who requested the review, UUID
requested_from_author_id UTF8 N Author the review was requested from, UUID
requested_from_team_name UTF8 N Team the review was requested from
source_id UTF8 Y e.g. GitHub's ID
requested_at TIMESTAMP_MICROS Y When the review was requested
 

pull_request_reviews


field type required description
id UTF8 Y UUID
pull_request_id UTF8 Y Linked pull request, UUID
author_id UTF8 Y Reviewed by, UUID
author_external_identifier UTF8 Y Reviewer's nickname
source_id UTF8 Y e.g. GitHub's ID
state UTF8 Y One of the following: approved, changes_requested, commented
submitted_at TIMESTAMP_MICROS Y When review was submitted
github_repository_full_name UTF8 Y Deprecated, repository name
source_repository_full_name UTF8 Y Repository name
pull_request_review_request_id UTF8 N Review request linked to review, UUID

pull_request_comments

field type required description
id UTF8 Y UUID
pull_request_id UTF8 Y Linked pull request, UUID
source_id UTF8 Y e.g. GitHub's ID
body_text UTF8 Y Comment content
created_at TIMESTAMP_MICROS Y When the comment was posted
author_id UTF8 Y Who posted the comment, UUID
author_is_bot BOOLEAN Y Is the author a bot

deployments

field type required description
id UTF8 Y UUID
pull_request_ids Array, UTF8 Y Linked pull requests within the deployment
deployment_environment UTF8 Y E.g. production
deployed_at TIMESTAMP_MICROS Y When the deployment was done
fixes_deployment_id UTF8 N If deployment fixes previous deployment
fixes_version UTF8 N Version of the fixed deployment
deployment_version UTF8 Y  
deployment_app_name UTF8 Y  
involved_team_ids Array, UTF8 Y  
involved_team_names Array, UTF8 Y  
involved_author_ids Array, UTF8 Y  
time_to_recovery_seconds INT64 N  
change_lead_time_seconds INT64 N  
time_to_deploy_seconds INT64 N  
is_change_failure BOOLEAN Y  
change_failure_source UTF8 N  

authors

field type required description
id UTF8 Y UUID
name UTF8 N Author's name
associated_emails Array, UTF8 Y All of the author's email addresses
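
To illustrate how the exported tables relate to each other, here is a hedged example query (not from the official template) that joins the review data to the author data loaded above via author_id, counting each reviewer's approved reviews per month.

-- Example only: approved reviews per reviewer per month,
-- joining swarmia_pull_request_reviews to swarmia_authors on author_id.
SELECT
    a.name                              AS reviewer,
    DATE_TRUNC('month', r.submitted_at) AS month,
    COUNT(*)                            AS approved_reviews
FROM swarmia_pull_request_reviews r
JOIN swarmia_authors a ON a.id = r.author_id
WHERE r.state = 'approved'
GROUP BY 1, 2
ORDER BY 2 DESC, 3 DESC;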