github-scraper used to scan repos owned by an org, clone them locally, look for a Dockerfile, and extract the FROM into a nice CSV for management

psenger/github-scraper

github-scraper

Purpose

github-scraper scans the repos owned by an org, clones them locally, looks for a Dockerfile in each, and extracts the FROM (base image) value into a CSV. Management can use the CSV in its reports, and you can use it to find a container running at the wrong version without asking the DevOps team to do it.

| Script | Purpose |
| --- | --- |
| scraper.js | Pulls all the repo data belonging to the org (filtered by GITHUB-TYPE) and stores it in the file ./data/<GITHUB-OUTFILE>. This file drives everything else. |
| build-masterlist.js | Reads ./data/<GITHUB-OUTFILE> and builds the CSV file ./data/<GITHUB-CSVFILE>. |
| build-inventory.js | Removes the clone directory ./out/, clones the repos into it, scans the clones for Dockerfiles, reads each one, and extracts the value matching ^FROM\s+(.*)\s*$ into a report called ./data/<GITHUB-INVENTORY>. |
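
The extraction step in build-inventory.js can be sketched as follows. This is a minimal illustration, not the script's actual code: the function name extractBaseImages and the sample Dockerfile are assumptions, and the `g` and `m` flags are added so that the pattern from the table matches every FROM line in a multi-line file.

```javascript
// Hypothetical helper, not taken from build-inventory.js.
// Returns the value of every FROM line found in the text of a Dockerfile.
function extractBaseImages(dockerfileText) {
  // Pattern from the table above. The `m` flag makes ^ and $ match per line;
  // the `g` flag finds every FROM line (for example in multi-stage builds).
  const pattern = /^FROM\s+(.*)\s*$/gm;
  const images = [];
  for (const match of dockerfileText.matchAll(pattern)) {
    // The greedy (.*) can swallow trailing spaces, so trim the capture.
    images.push(match[1].trim());
  }
  return images;
}

// Made-up sample input for illustration only.
const sample = [
  '# build stage',
  'FROM node:15-alpine AS build',
  'RUN npm ci',
  'FROM nginx:1.19',
].join('\n');

console.log(extractBaseImages(sample));
```

For the sample above the function returns the two FROM values, including the `AS build` alias, because the pattern captures everything after FROM on the line.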

Running

Required

  • A good internet connection
  • Node 15

Steps

  1. From the command prompt, run npm install
  2. Create a .env file with the environment variables listed in Variables
  3. From the command prompt, run npm run scraper (this writes the data file that the other scripts read)
  4. From the command prompt, run npm run build-masterlist
  5. From the command prompt, run npm run build-inventory
  6. Send your report to your boss, and then drink some coffee, or reach out to me, Philip A Senger (philip.a.senger@cngrgroup.com), for a job.

Additional Docs

Refer to OctoKit for the GitHub API.

Refer to dotenv for a better understanding of .env files.

Refer to GitHub Guides for GitHub.

Refer to Docker Docs for Docker.

Variables

This project reads its configuration from a .env file.

| Variable | Required | Default | Purpose |
| --- | --- | --- | --- |
| GITHUB-PAL-TOKEN | true | | Personal access token (create) |
| GITHUB-TIMEZONE | true | | The time zone (list) |
| GITHUB-ORG | true | | The org whose repos are scanned |
| GITHUB-TYPE | true | | Specifies the types of repositories you want returned. Can be one of all, public, private, forks, sources, member, internal. Default: all. If your organization is associated with an enterprise account using GitHub Enterprise Cloud or GitHub Enterprise Server 2.20+, type can also be internal. |
| GITHUB-CSVFILE | false | ./data/data.csv | The CSV master list file (built when build-masterlist is executed) |
| GITHUB-OUTFILE | false | ./data/data.json | Output from the scraper command, a full listing from GitHub |
| GITHUB-INVENTORY | false | ./data/inventory.csv | The results of scanning files in GitHub (in this repo, the Dockerfile FROM command) |
| GITHUB-SKIP-NAMES | false | '' | Any repos you want to skip while building the inventory |
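
The required and default columns above could be enforced with a small loader like the one below. This is a sketch under assumptions: loadConfig, DEFAULTS, and REQUIRED are hypothetical names that do not come from the project, and the sketch assumes the .env file has already been loaded into process.env (for example with dotenv's `require('dotenv').config()`). Because the variable names contain hyphens, they must be read with bracket notation, as in `process.env['GITHUB-ORG']`.

```javascript
// Defaults copied from the Variables table; the names below are illustrative.
const DEFAULTS = {
  'GITHUB-CSVFILE': './data/data.csv',
  'GITHUB-OUTFILE': './data/data.json',
  'GITHUB-INVENTORY': './data/inventory.csv',
  'GITHUB-SKIP-NAMES': '',
};

// Variables marked "true" in the Required column.
const REQUIRED = ['GITHUB-PAL-TOKEN', 'GITHUB-TIMEZONE', 'GITHUB-ORG', 'GITHUB-TYPE'];

// Hypothetical helper: merges the defaults with an env-like object and
// fails fast when a required variable is missing or empty.
function loadConfig(env = process.env) {
  const missing = REQUIRED.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing required variables: ${missing.join(', ')}`);
  }
  const config = {};
  for (const name of [...REQUIRED, ...Object.keys(DEFAULTS)]) {
    // Hyphenated names need bracket access; fall back to the table default.
    config[name] = env[name] !== undefined ? env[name] : DEFAULTS[name];
  }
  return config;
}

// Example with made-up values; a real run would pass process.env instead.
const config = loadConfig({
  'GITHUB-PAL-TOKEN': 'example-token',
  'GITHUB-TIMEZONE': 'UTC',
  'GITHUB-ORG': 'example-org',
  'GITHUB-TYPE': 'all',
});
console.log(config['GITHUB-OUTFILE']);
```

Passing the environment in as an argument keeps the helper easy to test, and the up-front check reports every missing variable at once rather than failing halfway through a run.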

Todo

  • The environment variables and the expected chaining of data files are problematic.
  • It might be nice to scan for repos owned by owners and/or orgs.
  • Extracting the shell commands would be good, since it would make the code more reusable.
  • The naming convention is not so good.
  • Linting and tests would be good.
  • Update build-masterlist to use the csv module and extract fields to environment variables.
  • Change GITHUB-ORG so it defaults to all.
