Data cloud

How it works

Swarmia will send your data in the form of a Parquet file to your storage bucket. Currently Amazon S3 is supported. This data can then be pulled to your data warehouse of choice, such as Snowflake. The full dataset will be refreshed every 8 hours to ensure recent enough data when the exported files are pulled.

Exported data

Currently, pull requests, commits, and deployments data are being exported, split into multiple files:

  • pull_requests.parquet

  • deployments.parquet

  • authors.parquet

  • pull_request_reviews.parquet

  • pull_request_review_requests.parquet

  • pull_request_comments.parquet

  • commits.parquet

The pull request data contain:

  • Author information

  • Contributing teams

  • Investment categories

  • Pull request cycle time components: cycle time, progress time, time in review, and time in merge

  • Timestamps of pull request events: first review time, first and last commit time, etc...

  • Pull request comments by authors, review requests, when pull request was reviewed and by who

The deployments data contain:

The commits data contain:

  • Commit hash

  • Author information

  • Associated pull requests, i.e. which ones the commit participated

  • Committed and (if available) authored dates

  • Parent commits

More metrics details available in https://github.com/swarmia/knowledge-base/blob/main/getting-started/integrations/data-export/broken-reference/README.md.

Getting started

The setup requires you to have an Amazon S3 storage bucket to which Swarmia can push data.

Set up the integration

  1. Insert your S3 bucket credentials

  2. Save the settings. After this, you can test the integration from the "Export" button in the UI.

Considerations for the S3 bucket

Swarmia requires the following permissions on an S3 bucket and folder to be able to upload the exports:

  • s3:GetBucketLocation

  • s3:PutObject

An example of AWS IAM Policy for the export bucket, where bucket name is the name of the bucket you provided use in the Data Cloud settings.

If you will access the data directly from the warehouse you most likely will need additional permissions. The following IAM Policy is the one recommended by Snowflake, which works then in both directions: Swarmia to upload and Snowflake to extract data from the same bucket.

Loading data to data warehouse

You can load data into various warehouse systems using their native S3 connection methods. Here's a list of some commonly used systems with links to their respective documentations:

Example: Loading data into Snowflake

The following SQL template can get you started with Swarmia Data Cloud and Snowflake. Use this template to create necessary Snowflake stages, file formats, and data tables to start making analytics based on Swarmia data.

Make sure to set up the access from Snowflake to the S3 bucket by following Snowflake's official instructions.

Parquet schemas

All export schemas are defined as follow

pull_request

field

type

required

description

id

UTF8

Y

UUID

github_number

INT64

Y

GitHub number, like 5532

author_id

UTF8

Y

Created by, UUID

title

UTF8

Y

Pull request title

description

UTF8

N

Pull request description

author_external_identifier

UTF8

Y

Created by nickname

github_repository_full_name

UTF8

Y

Repository's name

pr_status

UTF8

Y

One of the following: OPEN, CLOSED, MERGED

owner_team_ids

Array, UTF8

Y

Who owns the pull request

owner_team_names

Array, UTF8

Y

Team names of the

investment_category_ids

Array, UTF8

Y

In which investment categories pull request belongs to

investment_category_names

Array, UTF8

Y

Linked investment categories names

labels

Array, UTF8

Y

Attached labels to pull request

is_draft

BOOLEAN

Y

Whether the PR is currently marked as draft in GitHub. If the PR was originally opened as draft, but has since been marked ready for review, this will be false.

is_excluded

BOOLEAN

Y

Whether the PR is currently excluded from metrics and alerts in Swarmia. This includes both rule-based exclusions and manual exclusions.

github_created_at

TIMESTAMP_MICROS

Y

When the PR was created on GitHub. This is unaffected by whether the PR was opened as draft or not.

github_updated_at

TIMESTAMP_MICROS

Y

Last time the PR data was updated in GitHub. This includes pushing code, reviews, comments, etc.

last_closed_or_merged_at

TIMESTAMP_MICROS

N

Most recent ClosedEvent for the PR, which covers both regular closing and merge actions. null if the PR hasn’t been closed or merged yet. If a closed PR is reopened, this goes back to null.

first_commit_at

TIMESTAMP_MICROS

N

The authored date (falling back to committed date) of the first commit associated with the PR.

last_commit_at

TIMESTAMP_MICROS

N

The authored date (falling back to committed date) of the last commit associated with the PR.

opened_at

TIMESTAMP_MICROS

N

When the PR was created on GitHub. This is unaffected by whether the PR was opened as draft or not.

first_review_request_at

TIMESTAMP_MICROS

N

Time of the first ReviewRequestedEvent for the PR.

first_reviewed_at

TIMESTAMP_MICROS

N

Time of the first PullRequestReview for the PR. This excludes reviews by bots, and comments by the author themselves.

first_approved_review_at

TIMESTAMP_MICROS

N

Time of the first PullRequestReview for the PR, with state approved. This excludes reviews by bots.

last_approved_review_at

TIMESTAMP_MICROS

N

Time of the last PullRequestReview for the PR, with state approved.

cycle_time_started_at

TIMESTAMP_MICROS

Y

Time of the first commit associated with the PR, or time when the PR was created on GitHub, whichever came earlier.

cycle_time_ended_at

TIMESTAMP_MICROS

N

Time when the PR was merged or closed. null if the PR is still open.

additions

INT64

N

How many lines were added

deletions

INT64

N

How many lines were deleted

cycle_time_seconds

INT64

N

Amount of seconds between start and end of cycle time. If the PR is still open, the current time is considered the end.

progress_time_seconds

INT64

N

Amount of seconds between start and end of in progress time. Start is the same as cycle time start time. End is defined as, in order of preference:

  • Start of review time

  • Start of merge time

  • Cycle time end time

  • Current time

review_time_seconds

INT64

N

Amount of seconds between start and end of review time. Start is defined as:

  • For PRs without a ReadyForReviewEvent or ReviewRequestedEvent, and which don’t have approving reviews, review time is not defined

  • Otherwise, start is, in order of preference:

    • Time of first ReviewRequestedEvent

    • Time of latest ReadyForReviewEvent

    • Time when PR was opened

End is defined as, in order of preference:

  • Start of merge time

  • End of cycle time

  • Current time

merge_time_seconds

INT64

N

Amount of seconds between start and end of merge time. Start is defined by, in order of precedence:

  • For merged and approved PRs, the time of the last approval

  • For merged PRs with review requested, but never given, merge time is not defined

  • For merged PRs without reviews requested, time of last commit, or PR open time, whichever comes last End is the same as the end of cycle time. If the PR is still open, the current time is considered the end.

linked_swarmia_issue_id

UTF8

N

Linked swarmia issue

is_revert

Boolean

N

Did the pull request revert existing pull request

changed_files

Array, UTF8

Array of changed file names

pull_request_review_requests

field

type

required

description

id

UTF8

Y

UUID

pull_request_id

UTF8

Y

Linked pull request, UUID

requested_by_atuhor_id

UTF8

Y

Who requested review

requested_from_author_id

UTF8

N

From whom review was requested

requested_from_team_name

UTF8

N

Team whom review was requested

source_id

UTF8

Y

e.g GitHub's ID

requested_at

TIMESTAMP_MICROS

Y

When review was requested

pull_request_reviews

field

type

required

description

id

UTF8

Y

UUID

pull_request_id

UTF8

Y

Linked pull request, UUID

author_id

UTF8

Y

Reviewed by, UUID

author_external_identifier

UTF8

Y

Reviewed by's nickname

source_id

UTF8

Y

e.g GitHub's ID

state

UTF8

Y

One of the following : approved, changes_requested, commented

submitted_at

TIMESTAMP_MICROS

Y

When review was submitted

github_repository_full_name

UTF8

Y

Deprecated, repository name

source_repository_full_name

UTF8

Y

Repository name

pull_request_review_request_id

UTF8

N

Review request linked to review, UUID

pull_request_comments

field

type

required

description

id

UTF8

Y

UUID

pull_request_id

UTF8

Y

Linked pull request, UUID

source_id

UTF8

Y

e.g GitHub's ID

body_text

UTF8

Y

Comment content

created_at

TIMESTAMP_MICROS

Y

When the comment was posted

author_id

UTF8

Y

Who posted the comment, UUID

author_is_bot

BOOLEAN

Y

Is the author a bot

deployments

field

type

required

description

id

UTF8

Y

UUID

pull_request_ids

Array, UTF8

Y

Linked pull requests requests within deployment

deployment_environment

INT64

Y

E.g production

deployed_at

UTF8

Y

When deployment was done

fixes_deployment_id

UTF8

N

If deployment fixes previous deployment

fixes_version

UTF8

N

Version of the fixed deployment

deployment_version

UTF8

Y

deployment_app_name

UTF8

Y

involved_team_ids

Array, UTF8

Y

involved_team_names

Array, UTF8

Y

involved_author_ids

Array, UTF8

Y

time_to_recovery_seconds

INT64

N

change_lead_time_seconds

INT64

N

time_to_deploy_seconds

INT64

N

is_change_failure

BOOLEAN

Y

change_failure_source

UTF8

N

authors

field

type

required

description

id

UTF8

Y

UUID

name

UTF8

N

Author's name

associated_emails

Array, UTF8

Y

Author's all email addresses

commits

field

type

required

description

id

UTF8

Y

UUID

commit_hash

UTF8

Y

Commit's SHA-1 git hash

author_id

UTF8

N

Who authored the commit, UUID

associated_pull_request_ids

Array, UTF8

N

Pull requests the commit was part of

committed_date

TIMESTAMP_MICROS

Y

Commit "last updated" date

authored_date

TIMESTAMP_MICROS

N

Original authored date

parents

Array, UTF8

N

Zero or more parent commit git hashes. Ordinary commits have one, root commits zero, and merge commits two or more parents.

Last updated

Was this helpful?