The augmented medallion architecture - Silver Layer

"Silver" layer

Published on : April 15, 2025

|   Lastly edited on : April 15, 2025

|   11 minutes read

Lirav DUVSHANI

Lirav DUVSHANI


The Silver layer - Where the data is cleaned/curated

Silver Overview

Purpose of the layer

The main purpose of this layer is to provide standardized cleaned and structured data, optimized for analytical use.


The secondary purposes of this layer are to :

  • Schema enforcement : Enforce schema to ensure consistency with expected format
  • Deduplication & Filtering : Ensure data is accurate and relevant for analytical use

Characteristics of the layer

  • Storage should be optimized for analytical use
    • The data should be stored in a query-friendly structures, including primary key definition, partitionning and indexing
  • Allowed Data Structures
    • Structured, can be converted from semi-structured and unstructured data
  • Granularity of data
    • The data should be detailed, at the same level of granularity as the source data
  • Useful metadata information
    • All information available in bronze layer
      • The extraction timestamp
      • The load in bronze layer timestamp (also called ingestion timestamp)
      • (if available) The last update of records in the source system
    • The load in silver layer timestamp

Good practices

In this layer, the transformation applied are mostly cleaning and technical formatting of data (like semi-structured data flattening). For all these transformations, it is very useful to keep inside the Silver layer structure both the original and cleaned data.

This ensures that during data analysis, not only the result of cleaning is available but also the original value prior to cleaning.


As in all the other layers, tracking as much metadata information as possible is very beneficial for data analysis purposes and for monitoring reporting. Among these metadata information, the most important ones would be the source extraction timestamp and the ingestion timestamp.

Standard transformations applied in this layer

  • Structure-level Selection and transformation
    • Data flattening : Transform semi-structured data in a more user and query friendly format
    • Last Available Data selection : Selection of the last version of any record
  • Column-level transformations
    • Cleaning of the source data
      • Technical cleaning : Empty values, which should not be allowed
      • Functional cleaning : Incoherent values should be cleaned, like invalid dates or any type of incoherent entries
    • Standardization of referentials
      • Transposition of source data into custom referential : When multiple sources are present

Transformation CategoryTransformationUse CaseBefore CleaningAfter Cleaning
Technical CleaningEmpty valuesFill empty metric values with 0NULL
1
0
1
Functional CleaningValid DateDefault to NULL if date not coherent2024-13-13
2024-02-12
NULL
2024-02-12
Standardisation of referentialUnify country names into codeTranspose the country names to ISO 3166 alpha-2France
Fr.
Germ.
GERMANY
FR
FR
GE
GE
Data flatteningEmpty valuesFill empty metric values with 0{"sector": "Retail", "earnings_range": "<10M"}Into 2 columns
sector = Retail
earnings_range = <10M

Transformation Examples

All the examples are provided considering a storage in a database, e.g. in tables.

Data Flattening

Flattening of a field can be done in multiple way :

  • Stored inside a single table
  • Stored in multiple tables
Flattening inside a single table

Here is an example of a customer table with a key named "Customer ID".


Bronze Table - Customer

Customer IDCustomer NameCustomer TagsIngestion Timestamp
1Customer A{"sector": "Industry", "earnings_range": ">100M"}2025-01-02T03:55:02.728Z
2Customer B{"sector": "IT", "earnings_range": "10-100M"}2025-01-02T03:55:02.728Z
3Customer C{"sector": "Retail", "earnings_range": "<10M"}2025-01-02T03:55:02.728Z
1Customer A{"sector": "Industry", "earnings_range": ">100M"}2025-01-03T03:52:03.127Z
3Customer C{"sector": "Retail", "earnings_range": "<100M"
, "billing_currency": "EUR"}
2025-01-04T04:02:42.009Z

Silver Table - Customer

Customer IDCustomer NameSectorEarnings RangeBilling CurrencyIngestion TimestampSilver Load Timestamp
1Customer AIndustry>100M2025-01-02T03:55:02.728Z2025-01-05T05:12:02.000Z
2Customer BIT10-100M2025-01-02T03:55:02.728Z2025-01-05T05:12:02.000Z
3Customer CRetail<10M2025-01-02T03:55:02.728Z2025-01-05T05:12:02.000Z
1Customer AIndustry>100M2025-01-03T03:52:03.127Z2025-01-05T05:12:02.000Z
3Customer CRetail<100MEUR2025-01-04T04:02:42.009Z2025-01-05T05:12:02.000Z

All the tags have been transformed as columns providing information about a customer.

Flattening in multiple tables

Here is an example of a customer table with a key named "Customer ID".


Bronze Table - Customer

Customer IDCustomer NameCustomer TagsIngestion Timestamp
1Customer A{"sector": "Industry", "earnings_range": ">100M"}2025-01-02T03:55:02.728Z
2Customer B{"sector": "IT", "earnings_range": "10-100M"}2025-01-02T03:55:02.728Z
3Customer C{"sector": "Retail", "earnings_range": "<10M"}2025-01-02T03:55:02.728Z
1Customer A{"sector": "Industry", "earnings_range": ">100M"}2025-01-03T03:52:03.127Z
3Customer C{"sector": "Retail", "earnings_range": "<100M"
, "billing_currency": "EUR"}
2025-01-04T04:02:42.009Z

Silver Table - Customer (Main)

Customer IDCustomer NameIngestion TimestampSilver Load Timestamp
1Customer A2025-01-02T03:55:02.728Z2025-01-05T05:12:02.000Z
2Customer B2025-01-02T03:55:02.728Z2025-01-05T05:12:02.000Z
3Customer C2025-01-02T03:55:02.728Z2025-01-05T05:12:02.000Z
1Customer A2025-01-03T03:52:03.127Z2025-01-05T05:12:02.000Z
3Customer C2025-01-04T04:02:42.009Z2025-01-05T05:12:02.000Z

Silver Table - Customer (Tags)

Customer IDSectorEarnings RangeBilling CurrencyIngestion TimestampSilver Load Timestamp
1Industry>100M2025-01-02T03:55:02.728Z2025-01-05T05:12:02.000Z
2IT10-100M2025-01-02T03:55:02.728Z2025-01-05T05:12:02.000Z
3Retail<10M2025-01-02T03:55:02.728Z2025-01-05T05:12:02.000Z
1Industry>100M2025-01-03T03:52:03.127Z2025-01-05T05:12:02.000Z
3Retail<100MEUR2025-01-04T04:02:42.009Z2025-01-05T05:12:02.000Z

All the tags have been transformed as columns providing information about a customer.


Last Available Data selection

The last available data selection transformation provides a Silver table which contains only the last available relevant information. Thus, the Silver table could be identical to the table in the source system, with a single record per key.


Here is an example of a product table with a key named "Product ID".


Bronze Table - Customer

Product IDProduct NameProduct CategoryIngestion Timestamp
1Product AWood2025-01-02T03:55:02.728Z
2Product BIT Service2025-01-02T03:55:02.728Z
3Product CClothes2025-01-02T03:55:02.728Z
1Product AFloors - Wood2025-01-03T03:52:03.127Z
3Product CWomen's clothes2025-01-04T04:02:42.009Z

Silver Table - Customer

Product IDProduct NameProduct CategoryIngestion TimestampSilver Load Timestamp
2Product BIT Service2025-01-02T03:55:02.728Z2025-01-05T05:12:02.000Z
1Product AFloors - Wood2025-01-03T03:52:03.127Z2025-01-05T05:12:02.000Z
3Product CWomen's clothes2025-01-04T04:02:42.009Z2025-01-05T05:12:02.000Z

Only the last record for "Product ID" 1 and 3, based on the ingestion timestamp, have been selected and loaded into the silver table


Data Cleaning

Whatever the format of the source data in Bronze, in the Silver layer, the data should be clean and ready to used to analysis, reporting and any kind of analysis.


Here is an example of a purchase order table with a key named "Purchase Order ID".

This presents all the steps from Bronze, including the selection of the latest available data.


Bronze Table - Order

Purchase Order IDCustomer NameProduct NameOrder StatusOrder Unit Of MeasureOrder QuantityBilling CurrencyBilled AmountIngestion Timestamp
PO-1Customer 1Product AInvoicedPC3EUR10,022025-01-02T03:55:02.728Z
PO-2Customer 1Product BOrderedPC1002USD35,992025-01-02T03:55:02.728Z
PO-3Customer 2Product CPlanned30GBP20,002025-01-02T03:55:02.728Z
PO-4Customer 1Product APlannedEUR2025-01-02T03:55:02.728Z
PO-2Customer 2Product CReceivedPC1002USD35,992025-01-04T04:02:42.009Z

In the load of the Silver table, here we'll consider that only the last available data will be stored in Silver. Thus, a first step will only select the latest information for each order.


Silver Table - Orders (Latest Data) before cleaning

Purchase Order IDCustomer NameProduct NameOrder StatusOrder Unit Of MeasureOrder QuantityBilling CurrencyBilled AmountIngestion Timestamp
PO-1Customer 1Product AInvoicedPC3EUR10,022025-01-02T03:55:02.728Z
PO-2Customer 2Product CReceivedPC1002USD35,992025-01-04T04:02:42.009Z
PO-3Customer 2Product CPlanned30GBP20,002025-01-02T03:55:02.728Z
PO-4Customer 1Product APlannedEUR2025-01-02T03:55:02.728Z

Silver Table - Orders (Latest Data) after cleaning


The cleaning of the orders presented here is as follows :

  • All Unit of measure and currency columns should be defaulted to #, for analytical purposes
  • All quantity and amounts should be defaulted to 0, for analytical purposes
  • The order quantity is defaulted when no order unit of measure available

Purchase Order IDCustomer NameProduct NameOrder StatusOrder Unit Of MeasureOrder QuantityBilling CurrencyBilled Amount
PO-1Customer 1Product AInvoicedPC3EUR10,02
PO-2Customer 2Product CReceivedPC1002USD35,99
PO-3Customer 2Product CPlanned#
(defaulted from ∅)
0
(Cleaned from 30)
GBP20,00
PO-4Customer 1Product APlanned#
(defaulted from ∅)
0
(defaulted from ∅)
EUR0 (defaulted from ∅)

All the cleaning done here can also be linked to the data quality layer to provide direct feedback to the data stewards and owners

Referential Transposition

The source data can be very heteregenous even on similar concepts where a similar referential data (like product, customer, ...) is maintained in multiple operational tools.

In the Silver layer, data should be aligned to enable usage of the data regardless of its source.

The goal here is to provide standard axis of analysis which match a general definition of an axis for all the data stored in Silver.


Here is an example of purchase orders linked to a delivery country which is provided not normalized at all in the data source. A specific algorithm is thus implemented to clean this data into a more normalized referential.

This presents all the steps from Bronze, including the selection of the latest available data.


Bronze Table - Purchase Orders


Purchase Order IDDelivery CountryIngestion Timestamp
PO-1FR2025-01-02T03:55:02.728Z
PO-2FranCe2025-01-02T03:55:02.728Z
PO-3FR.2025-01-02T03:55:02.728Z
PO-4GE2025-01-02T03:55:02.728Z

Parameter Table - Country transposition table


Depending on the use case, either an intelligent algorithm can be develop to clean the data, or a transposition can be coded to clean the data from the known delivery country filled in the source to the standard referential.

In this example, we consider using a parameter table as follows :

Delivery Country (original)Delivery Country (cleaned)
FRFrance
FranCeFrance
FR.France
GEGermany

Silver Table - Orders after cleaning


Purchase Order IDDelivery Country (original)Delivery Country (cleaned)Ingestion Timestamp
PO-1FRFrance2025-01-02T03:55:02.728Z
PO-2FranCeFrance2025-01-02T03:55:02.728Z
PO-3FR.France2025-01-02T03:55:02.728Z
PO-4GEGermany2025-01-02T03:55:02.728Z

All the cleaning done here can also be linked to the data quality layer to provide direct feedback to the data stewards and owners




This article is part of a series of article describing the augmented medallion architecture.