2023-09 September Reads

Created by
Marc Leprince
Created time
Aug 9, 2023 3:24 PM
Last edited time
Oct 2, 2023 7:41 PM
  • Advanced SQL Topics
  • Case for Data Mesh https://martinfowler.com/articles/data-monolith-to-mesh.html
    • A good, comprehensive read on the high-level architecture and the conceptual drivers for data mesh over existing approaches
    • My favorite line: each domain should be responsible for representing its org’s data truthfully
    • While intellectually strong, I believe this will be hard to do in practice, because data efforts are like government infrastructure: they get funded, scoped, built, and maintained as one thing. I don’t see a clear way to fund a mesh.
    • If every domain gets a cut of the existing data budget and splits out the resources so every domain handles the same problems, I don’t see how you could justify that spend to the board.
    • They’re getting the answers they need today - why would they spend 10x the money and effort to achieve, what, a lack of problems with the existing approach?
    • Another thing: the level of scrutiny around lineage and governance can chain a project down for years, and that may not be necessary - you don’t build a train to transport DoorDash meals; you’re better off with a Vespa or a small car. Variability in the complexity of domains and their implementation requirements will lead to wildly different data results.
    • Another reason: the data lake is straightforward in terms of org, funding, and specialty skills - I don’t see how splintering it makes data mesh more achievable. It becomes an unanswered management question.
  • https://www.datamesh-architecture.com/
  • https://dataproducthinking.substack.com/p/the-problems-in-the-modern-data-stack
    • One of the complaints is the constant reworking of data pipelines - and I must agree:
      • I’m taking Azure data engineering certification courses right now, and the SQL queries are by their very nature schema-dependent. You build a schema right into your query to read data in efficiently (product ID, product name, price…)
      • If an upstream data source changes - say the following month it adds/changes data fields, or even reorders them - someone has to update the script that reads it in, like the query below
      • SELECT *
        FROM
            OPENROWSET(
                BULK 'https://datalaked25gb9m.dfs.core.windows.net/files/sales/csv/20*.csv',
                FORMAT = 'CSV',
                PARSER_VERSION = '2.0'
            ) WITH (
                -- hand-maintained schema: every column name, type, and width is baked
                -- into the query, so any drift in the files breaks the read
                SalesOrderNumber VARCHAR(10) COLLATE Latin1_General_100_BIN2_UTF8,
                SalesOrderLineNumber INT,
                OrderDate DATE,
                CustomerName VARCHAR(25) COLLATE Latin1_General_100_BIN2_UTF8,
                EmailAddress VARCHAR(64) COLLATE Latin1_General_100_BIN2_UTF8,
                Item VARCHAR(30) COLLATE Latin1_General_100_BIN2_UTF8, -- too narrow: see error below
                Quantity INT,
                UnitPrice DECIMAL(18,2),
                TaxAmount DECIMAL(18,2)
            ) AS [result]
        
        -- ERROR: String or binary data would be truncated while reading column of type 'VARCHAR'
        -- Truncated value: "Short-Sleeve Classic Jersey, XL"
        -- ^^^ 31 chars, one more than Item VARCHAR(30) allows
      • and NO clue what the downstream impacts are - I see why Delta Lake, and the schema layer OVER the raw data, is such a popular message now (see the sketch below)
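      • For contrast, here is a minimal sketch of the same read against a Delta table - the /sales/delta/ folder is hypothetical. Synapse serverless SQL can query Delta with FORMAT = 'DELTA', and the column names and types come out of the Delta transaction log instead of a hand-maintained WITH clause.
      • SELECT TOP 10 *
        FROM
            OPENROWSET(
                BULK 'https://datalaked25gb9m.dfs.core.windows.net/files/sales/delta/',
                FORMAT = 'DELTA'
            ) AS [result]

        -- no WITH clause needed: the schema travels with the data in the Delta log,
        -- so an upstream column add/rename surfaces here without editing the query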