Docs: 'Ingest from other data sources' missing critical filename/dedup details #2096

## Summary

The [Ingest from other data sources](https://learn.microsoft.com/en-us/cloud-computing/finops/toolkit/hubs/deploy#ingest-from-other-data-sources) section of the deploy tutorial documents the folder path convention but omits several details that are required for the Hub's ingestion pipeline to work correctly. Following the doc literally led to silent data loss in our GCP → FOCUS → Hub pipeline.

Gaps we hit while integrating a GCP Cloud Billing FOCUS view into the Hub:

### 1. Filename convention is not documented (critical)

The doc says nothing about filenames, but the ingestion pipeline splits each filename on `__` (double underscore) to derive `ingestionId` and `originalFileName`. Source: [`Analytics/app.bicep:1806-1811`](https://github.com/microsoft/finops-toolkit/blob/dev/src/templates/finops-hub/modules/Microsoft.FinOpsHubs/Analytics/app.bicep#L1806-L1811).

The pre-ingest cleanup ([`Analytics/app.bicep:1352`](https://github.com/microsoft/finops-toolkit/blob/dev/src/templates/finops-hub/modules/Microsoft.FinOpsHubs/Analytics/app.bicep#L1352)) drops all extents tagged with the same `drop-by:<folderPath>` that don't share the current run's `ingestionId`. If you upload files without the `__` separator, every file gets its own derived `ingestionId` (the full filename), so each file's ingestion drops all previously ingested files in the same folder, leaving only whatever was ingested last.

Concrete impact: we uploaded 12 Parquet files per month; 10 rows out of ~2100 survived in `Costs_final_v1_2` per run.
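To make the failure mode concrete, here is a toy Python model of the cleanup as described above. It is not the Hub's code: `derive_ingestion_id`, `surviving_rows`, the folder path, and the row counts are all hypothetical, and the split rule is only our reading of the Bicep source.

```python
def derive_ingestion_id(filename: str) -> str:
    """Toy model: text before the first '__', or the full name if there is none."""
    return filename.split("__", 1)[0]


def surviving_rows(folder: str, files: list[tuple[str, int]]) -> int:
    """Ingest (filename, row_count) pairs in order and return the rows left."""
    tag = f"drop-by:{folder}"
    extents: list[tuple[str, str, int]] = []  # (tag, ingestionId, rows)
    for filename, rows in files:
        ingestion_id = derive_ingestion_id(filename)
        # Pre-ingest cleanup: drop same-tag extents from any other ingestionId.
        extents = [e for e in extents if e[0] != tag or e[1] == ingestion_id]
        extents.append((tag, ingestion_id, rows))
    return sum(rows for _, _, rows in extents)


# Hypothetical shards for one month folder.
bare = [("part-0.parquet", 100), ("part-1.parquet", 200), ("part-2.parquet", 10)]
prefixed = [(f"run1__{name}", rows) for name, rows in bare]

print(surviving_rows("focuscost/2024/01", bare))      # 10: only the last file survives
print(surviving_rows("focuscost/2024/01", prefixed))  # 310: all shards share run1
```

The same model also shows the delta problem: appending a single `run2__delta.parquet` file after the `run1` shards leaves only the delta's rows.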

**Suggested doc fix:** Add a subsection:
- Filename pattern: `<ingestionId>__<originalFileName>.parquet`
- `ingestionId` must be stable across all files in a single logical run (e.g. a timestamp + GUID)
- Separator is two underscores
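A minimal sketch of what an integrator-side naming helper could look like under this convention. The function names and the timestamp-plus-GUID layout are our own choices for illustration, not something the Hub prescribes.

```python
import uuid
from datetime import datetime, timezone

SEPARATOR = "__"  # two underscores, per the pipeline's split


def new_ingestion_id() -> str:
    """One id per logical run: UTC timestamp plus a GUID (hypothetical layout)."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{stamp}-{uuid.uuid4()}"


def hub_blob_name(ingestion_id: str, original_file_name: str) -> str:
    """Build '<ingestionId>__<originalFileName>.parquet'."""
    if SEPARATOR in ingestion_id:
        # A separator inside the id would change where the filename is split.
        raise ValueError("ingestionId must not contain '__'")
    return f"{ingestion_id}{SEPARATOR}{original_file_name}.parquet"


# Generate the id once, then reuse it for every shard in the run.
run_id = new_ingestion_id()
names = [hub_blob_name(run_id, f"part-{i}") for i in range(3)]
```

The important part is that `run_id` is created once per run and shared by every file; generating it per file would reproduce the original problem.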

### 2. Dedup requires full-month replacement, not deltas

The doc hints at this with "overriding datasets should consistently land in the same folder path so they're overwritten each time," but it's easy to miss. The implication isn't stated: if you write only incremental/delta rows to the month folder, the pre-ingest cleanup wipes the rest of the month's extents, and Kusto ends up with just the delta.

**Suggested doc fix:** Add an explicit note: "When new rows arrive for an existing month, you must re-write the entire month's data in that folder (with a new ingestionId). Incremental deltas are not supported; they will result in data loss."

### 3. Tables referenced in step 4 are outdated

Step 4 says "update the **Costs_raw** and **Costs_final_v1_0** tables, and **Costs_transform_v1_0**, **Costs_v1_0**, and **Costs** functions". Current Hub deployments use the `v1_2` variants (`Costs_final_v1_2`, `Costs_transform_v1_2`, `Costs_v1_2`). The doc wasn't updated when the schema was bumped.

### 4. Empty Parquet shards cause 8-minute ingestion stalls

Not a documentation gap per se, but worth calling out: tools that produce partitioned Parquet output (BigQuery `EXPORT DATA`, Spark `write.parquet`, etc.) commonly emit header-only empty shards (~8 KB on a typical FOCUS schema). Kusto rejects these with `BadRequest_NoRecordsOrWrongFormat: 0 bytes`, and each failure is retried 3× at 120s intervals (~8 minutes wasted per empty file). With ~50% empty shards, a typical run's Hub pipeline takes 1–2 hours instead of minutes.

**Suggested doc fix:** Add a tip: "Filter out empty shards (below ~10 KB on a typical FOCUS schema) before uploading to the ingestion container to avoid retry storms in the Hub pipeline."
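As a rough illustration of the filter, the sketch below partitions shards by size before upload. The 10 KB threshold is just the figure from our FOCUS schema and will differ for other schemas; `split_shards` and `local_shards` are hypothetical helper names. Checking the row count in the Parquet footer would be more robust than a size heuristic, but is not shown here.

```python
from pathlib import Path

# Assumption: header-only shards were ~8 KB for us, so 10 KB is a safe cutoff.
EMPTY_SHARD_THRESHOLD_BYTES = 10 * 1024


def split_shards(
    shards: list[tuple[str, int]],
    threshold: int = EMPTY_SHARD_THRESHOLD_BYTES,
) -> tuple[list[str], list[str]]:
    """Partition (name, size_in_bytes) pairs into (upload, skip) name lists."""
    upload = [name for name, size in shards if size >= threshold]
    skip = [name for name, size in shards if size < threshold]
    return upload, skip


def local_shards(directory: str) -> list[tuple[str, int]]:
    """List '*.parquet' files in a local staging directory with their sizes."""
    return sorted((p.name, p.stat().st_size) for p in Path(directory).glob("*.parquet"))


# Example with made-up sizes: one real shard, one header-only shard.
upload, skip = split_shards([("part-0.parquet", 250_000), ("part-1.parquet", 8_100)])
```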

### 5. Manifest content specification is ambiguous

The doc says "Create an empty `manifest.json` file." The pipeline actually accepts any valid JSON. The `IngestionQueries` pipeline writes a `settings.json`-derived payload, so "empty" is a convention, not a requirement. Worth clarifying so integrators know they can embed audit metadata (e.g. `ingestionId`, exported timestamp).
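For example, an integrator might write a manifest like the one built below. The field names (`ingestionId`, `exportedAt`, `files`) are purely illustrative; as far as we can tell the Hub does not read them, and only the validity of the JSON matters.

```python
import json
from datetime import datetime, timezone


def build_manifest(ingestion_id: str, file_names: list[str]) -> str:
    """Return manifest.json content carrying audit metadata (hypothetical fields)."""
    payload = {
        "ingestionId": ingestion_id,
        "exportedAt": datetime.now(timezone.utc).isoformat(),
        "files": sorted(file_names),
    }
    return json.dumps(payload, indent=2)


manifest_text = build_manifest("run1", ["run1__part-1.parquet", "run1__part-0.parquet"])
```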

### 6. Stale source-blob cleanup is the integrator's responsibility

When integrators use the `<ingestionId>__<originalFileName>.parquet` convention with a fresh `ingestionId` per run (recommended by point 1), each run's Parquet files have different filenames than previous runs. The Hub pipeline's drop-by cleanup operates on **Kusto extents** but **does not delete source blobs** in the ingestion container.

Consequence: stale Parquets from previous runs accumulate in the target folder. On the next manifest trigger, the Hub lists **all** Parquets in the folder (old + new), ingests every one, then uses the drop-by cleanup to keep only one `ingestionId`'s extents. Which `ingestionId` wins depends on blob processing order. Often it is the older (lexicographically earlier) one, which silently reverts the data to the prior run.

Concrete impact we observed: after changing the FOCUS view's `BilledCost` semantics, triggering a full re-export, and confirming the new Parquets were uploaded, `Costs_final_v1_2` still showed the old values. The Hub had picked up both old-ingestionId and new-ingestionId files, and the older-timestamped run's extents won.

**Suggested doc fix:** Add a step: "Before uploading a new run's Parquets, delete all existing Parquets in the target folder that don't share the current run's ingestionId. The Hub pipeline does not clean up source blobs, so leftover files from prior runs can race with new ones and silently revert your data."
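The selection logic for that cleanup step could look roughly like this. It only decides which blob names are stale; listing and deleting blobs is left to whatever storage client the integrator already uses. `stale_blob_names` and the example paths are hypothetical.

```python
def stale_blob_names(existing: list[str], current_ingestion_id: str) -> list[str]:
    """Return Parquet blobs in the target folder that belong to other runs."""
    prefix = f"{current_ingestion_id}__"
    stale = []
    for name in existing:
        file_name = name.rsplit("/", 1)[-1]  # strip any folder path
        if file_name.endswith(".parquet") and not file_name.startswith(prefix):
            stale.append(name)
    return stale


# Hypothetical listing of one month folder before uploading run2's files.
listing = [
    "focuscost/2024/01/run1__part-0.parquet",
    "focuscost/2024/01/run1__part-1.parquet",
    "focuscost/2024/01/run2__part-0.parquet",
    "focuscost/2024/01/manifest.json",
]
to_delete = stale_blob_names(listing, "run2")
```

Here `to_delete` contains the two `run1` blobs; the manifest and the current run's file are left alone.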

Alternatively, integrators could use fixed filenames (`part-0.parquet`, `part-1.parquet`, …) that overwrite in place, matching the Azure Cost Management export pattern. This works only if the shard count is stable across runs; otherwise leftover shards cause the same issue.

## Repro context

GCP Cloud Billing detailed export → BigQuery FOCUS view → Parquet exports to GCS → streamed to Azure ingestion container → Hub pipeline → Kusto `Costs_final_v1_2`. Verified against `Hub` database on a FinCops-hosted Fabric eventhouse.
