Export your organization's data to any data warehouse
How it works
Swarmia sends your data as Parquet files to your storage bucket, starting with Amazon S3. From there, the data can be pulled into the data warehouse of your choice, such as Snowflake. The full dataset is refreshed every 8 hours so that reasonably recent data is available whenever you pull it.
Data being exported
The export currently contains information about pull requests and deployments, split across multiple files:
pull_requests.parquet
deployments.parquet
authors.parquet
pull_request_reviews.parquet
pull_request_review_requests.parquet
pull_request_comments.parquet
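For reference, the six files above land in the bucket under a versioned prefix; the Snowflake COPY statements later in this guide read from `v0/`. A small Python sketch of the expected object keys:

```python
# The six Parquet files Swarmia exports, without extensions.
EXPORT_FILES = [
    "pull_requests",
    "deployments",
    "authors",
    "pull_request_reviews",
    "pull_request_review_requests",
    "pull_request_comments",
]

def export_keys(prefix: str = "v0") -> list:
    """Build the expected S3 object key for each exported file."""
    return [f"{prefix}/{name}.parquet" for name in EXPORT_FILES]
```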
The pull request data contains:
- Author information
- Contributing teams
- Investment categories
- Pull request cycle time components: cycle time, progress time, time in review, and time in merge
- Timestamps of pull request events: first review time, first and last commit time, etc.
- Pull request review activity: comments by authors, review requests, and when the pull request was reviewed and by whom
The deployments data contains:
- List of deployed PRs in the deployment
- Deployment information: version, environment, application
- DORA metrics: Time to deploy, Time to recovery, Change lead time
You can read more about these metrics in Metrics & Definitions.
Getting started
The setup requires you to have an Amazon S3 storage bucket to which Swarmia can push data.
Set up the integration
- Navigate to Data Cloud settings
- Insert your S3 bucket credentials
- Save the settings. After this, you can test the integration from the "Export" button in the UI.
Considerations for the S3 bucket
Swarmia requires the following permissions on the S3 bucket and folder to be able to upload the exports:
- s3:GetBucketLocation
- s3:PutObject
Below is an example AWS IAM policy for the export bucket, where <bucket name> is the name of the bucket you provided in the Data Cloud settings.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::<bucket name>/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketLocation"
      ],
      "Resource": "arn:aws:s3:::<bucket name>",
      "Condition": {
        "StringLike": {
          "s3:prefix": [
            "*"
          ]
        }
      }
    }
  ]
}
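If you want to sanity-check a policy document before saving the settings, a small sketch like the following parses the JSON and confirms both required actions are allowed (this is a convenience check only, not part of the Swarmia setup):

```python
import json

# The two actions Swarmia needs to upload exports.
REQUIRED_ACTIONS = {"s3:PutObject", "s3:GetBucketLocation"}

def granted_actions(policy_json: str) -> set:
    """Collect every action allowed by the policy's Allow statements."""
    policy = json.loads(policy_json)
    actions = set()
    for stmt in policy.get("Statement", []):
        if stmt.get("Effect") == "Allow":
            act = stmt.get("Action", [])
            actions.update([act] if isinstance(act, str) else act)
    return actions

def has_required_permissions(policy_json: str) -> bool:
    """True if the policy allows both actions Swarmia requires."""
    return REQUIRED_ACTIONS <= granted_actions(policy_json)
```

Note that this only inspects the Action lists; it does not validate the Resource ARNs against your actual bucket name.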
If you access the data directly from your warehouse, you will most likely need additional permissions. The following IAM policy is the one recommended by Snowflake, and it works in both directions: Swarmia uploads to the bucket and Snowflake extracts data from it.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:GetObjectVersion",
        "s3:DeleteObject",
        "s3:DeleteObjectVersion"
      ],
      "Resource": "arn:aws:s3:::<bucket name>/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": "arn:aws:s3:::<bucket name>",
      "Condition": {
        "StringLike": {
          "s3:prefix": [
            "*"
          ]
        }
      }
    }
  ]
}
Loading data to data warehouse
You can load data into various data warehouse systems using their native S3 connection methods. Here is a list of some commonly used systems with links to their documentation.
- BigQuery, by using BigQuery Data Transfer Service for Amazon S3
- AWS Redshift, by copying data from S3 with parquet format
- Snowflake, with bulk loading from S3
Example: Loading data into Snowflake
The following SQL template can get you started with Swarmia Data Cloud and Snowflake. Use it to create the needed Snowflake stages, file formats, and data tables so you can start building analytics on top of Swarmia data.
Make sure to set up the access from Snowflake to the S3 bucket by following Snowflake's official instructions.
-------------------------------------------------------------------------------------------
-- Create the integration to AWS S3 bucket and the Snowflake stages
-- Create the storage integration for the AWS S3 bucket you have configured with Swarmia Data Cloud.
-- To create the role follow the Snowflake instructions in
-- https://docs.snowflake.com/en/user-guide/data-load-s3-config-storage-integration#step-2-create-the-iam-role-in-aws
-------------------------------------------------------------------------------------------
CREATE OR REPLACE STORAGE INTEGRATION swarmia_data_cloud_s3
TYPE = EXTERNAL_STAGE
STORAGE_PROVIDER = 'S3'
ENABLED = TRUE
STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::<id>:role/<role name>'
STORAGE_ALLOWED_LOCATIONS = ('s3://<bucket name>');
-- Describe our Integration
DESCRIBE INTEGRATION swarmia_data_cloud_s3;
SHOW STORAGE INTEGRATIONS;
-- Create the parquet format for Swarmia data
CREATE OR REPLACE FILE FORMAT swarmia_parquet
TYPE = parquet;
-- Create the stage for the Swarmia exported data
CREATE OR REPLACE STAGE swarmia_data_cloud_s3_stage
URL = 's3://<bucket name>/'
STORAGE_INTEGRATION = swarmia_data_cloud_s3 -- created in previous step
FILE_FORMAT = swarmia_parquet;
-- View our Stages
SHOW STAGES;
DESCRIBE STAGE swarmia_data_cloud_s3_stage;
LIST @swarmia_data_cloud_s3_stage;
---
-- Create the target tables
---
-- Create table for pull requests
CREATE OR REPLACE TABLE swarmia_pull_requests (
id text not null,
github_number int not null,
author_id text not null,
author_external_identifier text not null,
github_repository_full_name text not null,
pr_status text not null,
labels text not null,
owner_team_ids array,
owner_team_names array,
investment_category_ids array,
investment_category_names array,
is_draft boolean,
is_excluded boolean,
github_created_at timestamp,
github_updated_at timestamp,
last_closed_or_merged_at timestamp,
first_commit_at timestamp,
last_commit_at timestamp,
opened_at timestamp,
first_review_request_at timestamp,
first_reviewed_at timestamp,
first_approved_review_at timestamp,
last_approved_review_at timestamp,
cycle_time_seconds int,
progress_time_seconds int,
review_time_seconds int,
merge_time_seconds int
);
CREATE OR REPLACE TABLE swarmia_deployments (
id text not null,
pull_request_ids array,
deployment_environment text not null,
deployed_at timestamp,
fixes_deployment_id text,
fixes_version text,
deployment_version text,
deployment_app_name text,
time_to_recovery_seconds int,
change_lead_time_seconds int,
time_to_deploy_seconds int
);
CREATE OR REPLACE TABLE swarmia_authors (
id text not null,
name text,
associated_emails array
);
CREATE OR REPLACE TABLE swarmia_pull_request_comments (
id text not null,
pull_request_id text not null,
source_id text not null,
body_text text not null,
created_at timestamp not null,
author_id text not null,
author_is_bot boolean not null
);
CREATE OR REPLACE TABLE swarmia_pull_request_review_requests (
id text not null,
pull_request_id text not null,
requested_by_author_id text not null,
requested_from_author_id text,
requested_from_team_name text,
source_id text not null,
requested_at timestamp not null
);
CREATE OR REPLACE TABLE swarmia_pull_request_reviews (
id text not null,
pull_request_id text not null,
author_id text not null,
author_external_identifier text not null,
source_id text not null,
state text not null,
submitted_at timestamp not null,
github_repository_full_name text not null,
source_repository_full_name text not null,
pull_request_review_request_id text
);
-------------------------------------------------------------------------------------------
-- Load data into the Snowflake tables
-------------------------------------------------------------------------------------------
-- Remove any old data from the tables to prevent duplicates
TRUNCATE TABLE swarmia_pull_requests;
TRUNCATE TABLE swarmia_deployments;
TRUNCATE TABLE swarmia_authors;
TRUNCATE TABLE swarmia_pull_request_comments;
TRUNCATE TABLE swarmia_pull_request_review_requests;
TRUNCATE TABLE swarmia_pull_request_reviews;
-- Copy the whole dataset into the target tables
COPY INTO swarmia_pull_requests
FROM @swarmia_data_cloud_s3_stage/v0/pull_requests.parquet
MATCH_BY_COLUMN_NAME='CASE_INSENSITIVE';
COPY INTO swarmia_deployments
FROM @swarmia_data_cloud_s3_stage/v0/deployments.parquet
MATCH_BY_COLUMN_NAME='CASE_INSENSITIVE';
COPY INTO swarmia_authors
FROM @swarmia_data_cloud_s3_stage/v0/authors.parquet
MATCH_BY_COLUMN_NAME='CASE_INSENSITIVE';
COPY INTO swarmia_pull_request_comments
FROM @swarmia_data_cloud_s3_stage/v0/pull_request_comments.parquet
MATCH_BY_COLUMN_NAME='CASE_INSENSITIVE';
COPY INTO swarmia_pull_request_review_requests
FROM @swarmia_data_cloud_s3_stage/v0/pull_request_review_requests.parquet
MATCH_BY_COLUMN_NAME='CASE_INSENSITIVE';
COPY INTO swarmia_pull_request_reviews
FROM @swarmia_data_cloud_s3_stage/v0/pull_request_reviews.parquet
MATCH_BY_COLUMN_NAME='CASE_INSENSITIVE';
-------------------------------------------------------------------------------------------
-- Query the data!
-------------------------------------------------------------------------------------------
SELECT COUNT(*) FROM swarmia_pull_requests;
SELECT * FROM swarmia_deployments;
-- Monthly average change lead time for pull requests by the 'Engineers' team,
-- created since 2023 and deployed to the production environment.
SELECT
ROUND(AVG(change_lead_time_seconds)) AS change_lead_time_seconds,
DATE_TRUNC('month', github_created_at) AS month
FROM swarmia_deployments d
JOIN swarmia_pull_requests p ON (array_contains(p.id::variant, d.pull_request_ids))
WHERE array_contains('Engineers'::variant, p.owner_team_names)
AND d.deployment_environment='production'
AND p.github_created_at >= '2023-01-01'::timestamp
GROUP BY 2
ORDER BY 2 DESC;
Parquet schemas
All export schemas are defined as follows.
pull_requests
field | type | required | description |
id | UTF8 | Y | UUID |
github_number | INT64 | Y | GitHub number, like 5532 |
author_id | UTF8 | Y | Created by, UUID |
title | UTF8 | Y | Pull request title |
description | UTF8 | N | Pull request description |
author_external_identifier | UTF8 | Y | Created by nickname |
github_repository_full_name | UTF8 | Y | Repository's name |
pr_status | UTF8 | Y | One of the following: OPEN, CLOSED, MERGED |
owner_team_ids | Array, UTF8 | Y | Who owns the pull request |
owner_team_names | Array, UTF8 | Y | Team names of the owning teams |
investment_category_ids | Array, UTF8 | Y | Investment categories the pull request belongs to |
investment_category_names | Array, UTF8 | Y | Names of the linked investment categories |
labels | Array, UTF8 | Y | Labels attached to the pull request |
is_draft | BOOLEAN | Y | Is the pull request draft |
is_excluded | BOOLEAN | Y | Is the pull request excluded from metrics |
github_created_at | TIMESTAMP_MICROS | Y | Created at in GitHub |
github_updated_at | TIMESTAMP_MICROS | Y | Updated at in GitHub |
last_closed_or_merged_at | TIMESTAMP_MICROS | N | |
first_commit_at | TIMESTAMP_MICROS | N | When the first commit took place |
last_commit_at | TIMESTAMP_MICROS | N | When the last commit took place |
opened_at | TIMESTAMP_MICROS | N | When the pull request was opened |
first_review_request_at | TIMESTAMP_MICROS | N | When the first review request was sent |
first_reviewed_at | TIMESTAMP_MICROS | N | When the pull request was first reviewed |
first_approved_review_at | TIMESTAMP_MICROS | N | When pull request was approved first time |
cycle_time_started_at | TIMESTAMP_MICROS | Y | When cycle time started |
cycle_time_ended_at | TIMESTAMP_MICROS | N | When cycle time ended |
additions | INT64 | N | How many lines were added |
deletions | INT64 | N | How many lines were deleted |
cycle_time_seconds | INT64 | N | How long was pull request cycle time |
progress_time_seconds | INT64 | N | How long pull request was in progress |
review_time_seconds | INT64 | N | How long pull request was in review |
merge_time_seconds | INT64 | N | How long it took to merge pull request |
linked_swarmia_issue_id | UTF8 | N | Linked swarmia issue |
is_revert | BOOLEAN | N | Whether the pull request reverts an existing pull request |
changed_files | Array, UTF8 | Y | Array of changed file names |
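As a rough illustration of how the timestamp and duration fields relate, cycle_time_seconds should correspond to the span between cycle_time_started_at and cycle_time_ended_at (the exact definitions live in Metrics & Definitions; the row below is made up):

```python
from datetime import datetime, timezone

def seconds_between(start, end):
    """Whole seconds between two timestamps, or None if either is missing."""
    if start is None or end is None:
        return None
    return int((end - start).total_seconds())

# Hypothetical pull request row using the schema's timestamp fields.
row = {
    "cycle_time_started_at": datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc),
    "cycle_time_ended_at": datetime(2024, 1, 2, 9, 0, tzinfo=timezone.utc),
}
cycle_time_seconds = seconds_between(
    row["cycle_time_started_at"], row["cycle_time_ended_at"]
)
# cycle_time_seconds == 86400 (one day)
```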
pull_request_review_requests
field | type | required | description |
id | UTF8 | Y | UUID |
pull_request_id | UTF8 | Y | Linked pull request, UUID |
requested_by_author_id | UTF8 | Y | Who requested the review |
requested_from_author_id | UTF8 | N | Author from whom the review was requested |
requested_from_team_name | UTF8 | N | Team from whom the review was requested |
source_id | UTF8 | Y | e.g. GitHub's ID |
requested_at | TIMESTAMP_MICROS | Y | When review was requested |
pull_request_reviews
field | type | required | description |
id | UTF8 | Y | UUID |
pull_request_id | UTF8 | Y | Linked pull request, UUID |
author_id | UTF8 | Y | Reviewed by, UUID |
author_external_identifier | UTF8 | Y | Reviewer's nickname |
source_id | UTF8 | Y | e.g. GitHub's ID |
state | UTF8 | Y | One of the following: approved, changes_requested, commented |
submitted_at | TIMESTAMP_MICROS | Y | When review was submitted |
github_repository_full_name | UTF8 | Y | Deprecated, repository name |
source_repository_full_name | UTF8 | Y | Repository name |
pull_request_review_request_id | UTF8 | N | Review request linked to review, UUID |
pull_request_comments
field | type | required | description |
id | UTF8 | Y | UUID |
pull_request_id | UTF8 | Y | Linked pull request, UUID |
source_id | UTF8 | Y | e.g. GitHub's ID |
body_text | UTF8 | Y | Comment content |
created_at | TIMESTAMP_MICROS | Y | When the comment was posted |
author_id | UTF8 | Y | Who posted the comment, UUID |
author_is_bot | BOOLEAN | Y | Is the author a bot |
deployments
field | type | required | description |
id | UTF8 | Y | UUID |
pull_request_ids | Array, UTF8 | Y | Linked pull requests within the deployment |
deployment_environment | UTF8 | Y | E.g. production |
deployed_at | TIMESTAMP_MICROS | Y | When the deployment was done |
fixes_deployment_id | UTF8 | N | If deployment fixes previous deployment |
fixes_version | UTF8 | N | Version of the fixed deployment |
deployment_version | UTF8 | Y | |
deployment_app_name | UTF8 | Y | |
involved_team_ids | Array, UTF8 | Y | |
involved_team_names | Array, UTF8 | Y | |
involved_author_ids | Array, UTF8 | Y | |
time_to_recovery_seconds | INT64 | N | |
change_lead_time_seconds | INT64 | N | |
time_to_deploy_seconds | INT64 | N | |
is_change_failure | BOOLEAN | Y | |
change_failure_source | UTF8 | N | |
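The is_change_failure flag can be used to derive a DORA-style change failure rate per environment. A minimal sketch over hypothetical in-memory rows (only the fields needed here are shown):

```python
# Hypothetical deployment rows mimicking deployments.parquet.
deployments = [
    {"id": "d1", "deployment_environment": "production", "is_change_failure": False},
    {"id": "d2", "deployment_environment": "production", "is_change_failure": True},
    {"id": "d3", "deployment_environment": "production", "is_change_failure": False},
    {"id": "d4", "deployment_environment": "production", "is_change_failure": False},
]

def change_failure_rate(rows, environment="production"):
    """Share of deployments in the environment flagged as change failures."""
    env_rows = [r for r in rows if r["deployment_environment"] == environment]
    if not env_rows:
        return None
    return sum(r["is_change_failure"] for r in env_rows) / len(env_rows)
```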
authors
field | type | required | description |
id | UTF8 | Y | UUID |
name | UTF8 | N | Author's name |
associated_emails | Array, UTF8 | Y | All email addresses associated with the author |
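The authors file is the lookup table for the author_id fields in the other exports. A small sketch joining hypothetical review rows to author names and counting reviews per reviewer:

```python
# Hypothetical rows mimicking authors.parquet and pull_request_reviews.parquet.
authors = [
    {"id": "a1", "name": "Alice"},
    {"id": "a2", "name": "Bob"},
]
reviews = [
    {"id": "r1", "author_id": "a1", "state": "approved"},
    {"id": "r2", "author_id": "a2", "state": "changes_requested"},
    {"id": "r3", "author_id": "a1", "state": "commented"},
]

# Index authors by id, then count reviews per reviewer name.
name_by_id = {a["id"]: a["name"] for a in authors}
counts = {}
for r in reviews:
    name = name_by_id.get(r["author_id"], "unknown")
    counts[name] = counts.get(name, 0) + 1
# counts == {"Alice": 2, "Bob": 1}
```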