Script (read-only demo)
#!/usr/bin/env bash
# ==========================================================
# Azure Data Factory (ADF) - Deeper Dive (Read Definitions)
# Focus: Read pipeline + dataset JSON (still read-only)
# ==========================================================
set -euo pipefail
# ---- variables ----
RESOURCE_GROUP="covid-reporting-rg"
FACTORY_NAME="covid-reporting-adf-mike123"
# Optional: set these to auto-select a specific pipeline/dataset.
# If left empty, the script will pick the first one returned.
PIPELINE_NAME="${PIPELINE_NAME:-}"
DATASET_NAME="${DATASET_NAME:-}"
if ! command -v az >/dev/null 2>&1; then
  echo "Azure CLI (az) not found. Install it first: https://aka.ms/azure-cli"
  exit 1
fi
# Print a value, or a fallback message when the result is empty/null/an empty JSON collection.
print_or_none() {
  local output="$1"
  local none_msg="$2"
  if [[ -z "$output" || "$output" == "null" || "$output" == "[]" || "$output" == "{}" ]]; then
    echo "$none_msg"
  else
    printf '%s\n' "$output"
  fi
}
echo "=========================================================="
echo " Azure CLI - ADF Deeper Dive (Pipelines + Datasets)"
echo "=========================================================="
echo ""
# ----------------------------------------------------------
echo "STEP 0: Show current Azure subscription context"
echo "Command: az account show -o table"
echo ""
az account show -o table
# ----------------------------------------------------------
echo ""
echo "STEP 1: List all Data Factories (pick the one we want)"
echo "Command: az datafactory list -o table"
echo ""
az datafactory list -o table
# ----------------------------------------------------------
echo ""
echo "STEP 2: Confirm the Data Factory exists (and where)"
echo "Command: az datafactory show --name \"$FACTORY_NAME\" --resource-group \"$RESOURCE_GROUP\" -o table"
echo ""
az datafactory show \
  --name "$FACTORY_NAME" \
  --resource-group "$RESOURCE_GROUP" \
  -o table
# ----------------------------------------------------------
echo ""
echo "STEP 3: List pipelines (so we know what we can inspect)"
echo "Command: az datafactory pipeline list --factory-name \"$FACTORY_NAME\" --resource-group \"$RESOURCE_GROUP\" -o table"
echo ""
az datafactory pipeline list \
  --factory-name "$FACTORY_NAME" \
  --resource-group "$RESOURCE_GROUP" \
  -o table
# Capture pipeline names for validation/selection (macOS Bash 3.2 compatible)
PIPELINE_NAMES=()
while IFS= read -r line; do
  [[ -n "$line" ]] && PIPELINE_NAMES+=("$line")
done < <(az datafactory pipeline list \
  --factory-name "$FACTORY_NAME" \
  --resource-group "$RESOURCE_GROUP" \
  --query "[].name" -o tsv)
if (( ${#PIPELINE_NAMES[@]} == 0 )); then
  echo ""
  echo "No pipelines found in factory $FACTORY_NAME (resource group $RESOURCE_GROUP)."
  echo "Create a pipeline or update RESOURCE_GROUP/FACTORY_NAME and try again."
  exit 1
fi
echo ""
echo "Available pipelines:"
printf ' - %s\n' "${PIPELINE_NAMES[@]}"
# Auto-pick a pipeline if not provided, otherwise validate
if [[ -n "$PIPELINE_NAME" ]]; then
pipeline_found="false"
for name in "${PIPELINE_NAMES[@]}"; do
if [[ "$name" == "$PIPELINE_NAME" ]]; then
pipeline_found="true"
break
fi
done
if [[ "$pipeline_found" != "true" ]]; then
echo ""
echo "Pipeline '$PIPELINE_NAME' not found. Choose one of:"
printf ' - %s\n' "${PIPELINE_NAMES[@]}"
exit 1
fi
else
PIPELINE_NAME="${PIPELINE_NAMES[0]}"
fi
echo ""
echo "Selected pipeline for deep dive: $PIPELINE_NAME"
echo ""
# ----------------------------------------------------------
echo "STEP 4: Read the full pipeline JSON definition (save to file)"
echo "This is the source-of-truth definition: activities, parameters, references, etc."
echo ""
PIPELINE_JSON="pipeline_${PIPELINE_NAME}.json"
echo "Command: az datafactory pipeline show --factory-name \"$FACTORY_NAME\" --resource-group \"$RESOURCE_GROUP\" --name \"$PIPELINE_NAME\" -o json > $PIPELINE_JSON"
echo ""
az datafactory pipeline show \
  --factory-name "$FACTORY_NAME" \
  --resource-group "$RESOURCE_GROUP" \
  --name "$PIPELINE_NAME" \
  -o json > "$PIPELINE_JSON"
echo ""
echo "Saved pipeline JSON to: $PIPELINE_JSON"
echo ""
echo "Pipeline overview:"
az datafactory pipeline show \
  --factory-name "$FACTORY_NAME" \
  --resource-group "$RESOURCE_GROUP" \
  --name "$PIPELINE_NAME" \
  --query "{Name:name, Folder:folder.name, Description:description, Concurrency:concurrency}" \
  -o table
ACTIVITY_COUNT="$(az datafactory pipeline show \
  --factory-name "$FACTORY_NAME" \
  --resource-group "$RESOURCE_GROUP" \
  --name "$PIPELINE_NAME" \
  --query "activities[].name" -o tsv | awk 'END {print NR}')"
echo "Activity count: $ACTIVITY_COUNT"
# ----------------------------------------------------------
echo ""
echo "STEP 5: Show a human-friendly pipeline summary (activities)"
echo "Command: az datafactory pipeline show --factory-name \"$FACTORY_NAME\" --resource-group \"$RESOURCE_GROUP\" --name \"$PIPELINE_NAME\" --query \"activities[].{Name:name,Type:type}\" -o table"
echo ""
ACTIVITY_SUMMARY="$(az datafactory pipeline show \
  --factory-name "$FACTORY_NAME" \
  --resource-group "$RESOURCE_GROUP" \
  --name "$PIPELINE_NAME" \
  --query "activities[].{Name:name, Type:type}" \
  -o table)"
print_or_none "$ACTIVITY_SUMMARY" "(no activities found)"
# ----------------------------------------------------------
echo ""
echo "STEP 6: Show activity inputs/outputs and dependencies"
echo "Command: az datafactory pipeline show --factory-name \"$FACTORY_NAME\" --resource-group \"$RESOURCE_GROUP\" --name \"$PIPELINE_NAME\" --query \"activities[].{Activity:name,Type:type,Dataset:dataset.referenceName,Inputs:inputs[].referenceName,Outputs:outputs[].referenceName,DependsOn:dependsOn[].activity}\" -o jsonc"
echo ""
ACTIVITY_IO="$(az datafactory pipeline show \
  --factory-name "$FACTORY_NAME" \
  --resource-group "$RESOURCE_GROUP" \
  --name "$PIPELINE_NAME" \
  --query "activities[].{Activity:name, Type:type, Dataset:dataset.referenceName, Inputs:inputs[].referenceName, Outputs:outputs[].referenceName, DependsOn:dependsOn[].activity}" \
  -o jsonc)"
print_or_none "$ACTIVITY_IO" "(no activity inputs/outputs/dependencies found)"
# ----------------------------------------------------------
echo ""
echo "STEP 7: Show pipeline parameters (if any)"
echo "Command: az datafactory pipeline show --factory-name \"$FACTORY_NAME\" --resource-group \"$RESOURCE_GROUP\" --name \"$PIPELINE_NAME\" --query parameters -o json"
echo ""
PIPELINE_PARAMS="$(az datafactory pipeline show \
  --factory-name "$FACTORY_NAME" \
  --resource-group "$RESOURCE_GROUP" \
  --name "$PIPELINE_NAME" \
  --query "parameters" \
  -o json)"
print_or_none "$PIPELINE_PARAMS" "(no parameters found)"
# ----------------------------------------------------------
echo ""
echo "STEP 8: Show pipeline variables (if any)"
echo "Command: az datafactory pipeline show --factory-name \"$FACTORY_NAME\" --resource-group \"$RESOURCE_GROUP\" --name \"$PIPELINE_NAME\" --query variables -o json"
echo ""
PIPELINE_VARIABLES="$(az datafactory pipeline show \
  --factory-name "$FACTORY_NAME" \
  --resource-group "$RESOURCE_GROUP" \
  --name "$PIPELINE_NAME" \
  --query "variables" \
  -o json)"
print_or_none "$PIPELINE_VARIABLES" "(no variables found)"
# ----------------------------------------------------------
echo ""
echo "STEP 9: List datasets (so we know what we can inspect)"
echo "Command: az datafactory dataset list --factory-name \"$FACTORY_NAME\" --resource-group \"$RESOURCE_GROUP\" -o table"
echo ""
DATASET_LIST="$(az datafactory dataset list \
  --factory-name "$FACTORY_NAME" \
  --resource-group "$RESOURCE_GROUP" \
  -o table)"
print_or_none "$DATASET_LIST" "(no datasets found)"
# Capture dataset names for validation/selection (macOS Bash 3.2 compatible)
DATASET_NAMES=()
while IFS= read -r line; do
  [[ -n "$line" ]] && DATASET_NAMES+=("$line")
done < <(az datafactory dataset list \
  --factory-name "$FACTORY_NAME" \
  --resource-group "$RESOURCE_GROUP" \
  --query "[].name" -o tsv)
if (( ${#DATASET_NAMES[@]} == 0 )); then
  echo ""
  echo "No datasets found in factory $FACTORY_NAME (resource group $RESOURCE_GROUP)."
  echo "Create a dataset or update RESOURCE_GROUP/FACTORY_NAME and try again."
  exit 1
fi
echo ""
echo "Available datasets:"
printf ' - %s\n' "${DATASET_NAMES[@]}"
# Auto-pick a dataset if not provided, otherwise validate
if [[ -n "$DATASET_NAME" ]]; then
dataset_found="false"
for name in "${DATASET_NAMES[@]}"; do
if [[ "$name" == "$DATASET_NAME" ]]; then
dataset_found="true"
break
fi
done
if [[ "$dataset_found" != "true" ]]; then
echo ""
echo "Dataset '$DATASET_NAME' not found. Choose one of:"
printf ' - %s\n' "${DATASET_NAMES[@]}"
exit 1
fi
else
DATASET_NAME="${DATASET_NAMES[0]}"
fi
echo ""
echo "Selected dataset for deep dive: $DATASET_NAME"
echo ""
# ----------------------------------------------------------
echo "STEP 10: Read the full dataset JSON definition (save to file)"
echo "This shows type, linked service reference, and location/path/table details."
echo ""
DATASET_JSON="dataset_${DATASET_NAME}.json"
echo "Command: az datafactory dataset show --factory-name \"$FACTORY_NAME\" --resource-group \"$RESOURCE_GROUP\" --name \"$DATASET_NAME\" -o json > $DATASET_JSON"
echo ""
az datafactory dataset show \
  --factory-name "$FACTORY_NAME" \
  --resource-group "$RESOURCE_GROUP" \
  --name "$DATASET_NAME" \
  -o json > "$DATASET_JSON"
echo ""
echo "Saved dataset JSON to: $DATASET_JSON"
# ----------------------------------------------------------
echo ""
echo "STEP 11: Show dataset type + linked service (quick summary)"
echo "Command: az datafactory dataset show --factory-name \"$FACTORY_NAME\" --resource-group \"$RESOURCE_GROUP\" --name \"$DATASET_NAME\" --query \"{Name:name,Type:properties.type,LinkedService:properties.linkedServiceName.referenceName}\" -o table"
echo ""
az datafactory dataset show \
  --factory-name "$FACTORY_NAME" \
  --resource-group "$RESOURCE_GROUP" \
  --name "$DATASET_NAME" \
  --query "{Name:name, Type:properties.type, LinkedService:properties.linkedServiceName.referenceName}" \
  -o table
# ----------------------------------------------------------
echo ""
echo "STEP 12: Show dataset location/type-specific properties"
echo "Command: az datafactory dataset show --factory-name \"$FACTORY_NAME\" --resource-group \"$RESOURCE_GROUP\" --name \"$DATASET_NAME\" --query \"{Location:properties.location, Format:properties.type, Compression:properties.compressionCodec, ColumnDelimiter:properties.columnDelimiter, FirstRowAsHeader:properties.firstRowAsHeader}\" -o json"
echo ""
DATASET_TYPE_PROPS="$(az datafactory dataset show \
  --factory-name "$FACTORY_NAME" \
  --resource-group "$RESOURCE_GROUP" \
  --name "$DATASET_NAME" \
  --query "{Location:properties.location, Format:properties.type, Compression:properties.compressionCodec, ColumnDelimiter:properties.columnDelimiter, FirstRowAsHeader:properties.firstRowAsHeader}" \
  -o json)"
print_or_none "$DATASET_TYPE_PROPS" "(no type properties found)"
echo ""
echo "=========================================================="
echo " Deeper dive complete."
echo " Tip: Override PIPELINE_NAME and DATASET_NAME like this:"
echo " PIPELINE_NAME=pl_ingest_population_data DATASET_NAME=ds_population_raw_gz ./\"Azure Data Factory with Azure CLI - Part 2: Reading Pipeline & Dataset JSON.sh\""
echo "=========================================================="Full transcript (from the video)
So we're doing a deeper dive into Azure Data Factory with the Azure CLI. Here we are, doing the same thing as before: running a simple bash script and walking through its output in the terminal.
We went over the first steps in the last video: az account show -o table just shows our subscription, and az datafactory list shows all of our actual data factories. Here we're confirming the factory exists, and then we list the pipelines. Then we get to the command we actually want to show today: az datafactory pipeline show. This returns the pipeline definition, so we can look at what activities, outputs, and datasets are inside it, and we save all of that into this JSON file.
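If you have jq installed, a quick way to skim that saved file is to pull just the activity names and types. A minimal sketch, assuming the filename follows the pipeline_<name>.json pattern from STEP 4 and the pipeline name from the script's closing tip:

  jq '.activities[] | {name, type}' pipeline_pl_ingest_population_data.json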
Let's take a deeper look into this. Here's what we're actually looking at: the Validation, Get Metadata, Get File, and If Condition activities, and we can see all of them inside this JSON. Take the Validation activity, say: you can see everything you need to know, like the timeout, the actual dataset it runs against, and so on, along with things like the column count, whether the file exists, and its size. This becomes very useful when you need to quickly reference what you're actually working on.
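To jump straight to one activity, such as that Validation activity, you can filter the saved JSON by type. A sketch (the exact fields inside each activity vary by activity type):

  jq '.activities[] | select(.type == "Validation")' pipeline_pl_ingest_population_data.json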
We can also see the ifTrueActivities, which are simply what we see in the designer: Copy Population and Delete Source. You can see them here as well. And the Send Email activity sits in the false branch. There it is. Now you can give this to an AI, which obviously works much better with JSON than it would with a UI.
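The branch activities live under the If Condition's typeProperties, so both branches can be listed by name. A sketch (the // [] fallback guards a branch that isn't defined):

  jq '.activities[] | select(.type == "IfCondition")
      | {name, onTrue: [(.typeProperties.ifTrueActivities // [])[].name],
               onFalse: [(.typeProperties.ifFalseActivities // [])[].name]}' pipeline_pl_ingest_population_data.json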
Next, the script does a quick count that tells you how much is in there, and the summary steps show specific data in the terminal: the same data again, just more focused. I'll put the link to the repository in the description so you can use the script yourself, give it to an AI, and shape it the way you need. Then here's another command. It references the factory name and the resource group, and then this is essentially the meat of it: --query parameters -o json. That just gives you the pipeline's parameters. You go to the portal, and there you go: same exact data.
The next one is the same thing, except it queries variables instead.
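Both blobs are also in the saved pipeline file, so one jq call can show them side by side. A minimal sketch, using the same assumed filename as above:

  jq '{parameters, variables}' pipeline_pl_ingest_population_data.json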
And now let's say we want to query our datasets. Here's what we have: ds_population_raw_gz and ds_population_raw_tsv. So let's go ahead and look inside one. Basically, again, it's just the metadata. There's your gzip with the Optimal compression level. There you go.
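That compression detail is in the saved dataset file too. A sketch using the gzip dataset name from the script's closing tip (key names depend on the dataset type):

  jq '.properties | {type, compressionCodec, compressionLevel}' dataset_ds_population_raw_gz.json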
And now you can very quickly reference what's inside a pipeline.