Created by: Marc Leprince
Created time: Aug 9, 2023 3:24 PM
Last edited time: Oct 2, 2023 7:41 PM
- Advanced SQL Topics
- Case for Data Mesh https://martinfowler.com/articles/data-monolith-to-mesh.html
- A good, comprehensive read on the high-level architecture and the conceptual drivers for data mesh over existing approaches
- My favorite line was that each domain should be responsible for representing its org's data truthfully
- While intellectually strong, I believe this will be hard to do in practice, because data efforts are like government infrastructure: they get funded, scoped, built, and maintained. I don't see a clear path to funding a mesh.
- If every domain gets a cut of the existing data budget and each splits out its own resources to handle the same problems independently, I don't see how you could justify that spend to the board.
- They are getting the answers they need today; why would they spend 10x the effort to fix a lack of problems with the existing approach?
- Another thing: the level of scrutiny demanded by lineage and governance can tie projects up for years, and that may not be necessary. You don't build a train to transport DoorDash meals; you're better off with a Vespa or a small car. Variability in domain complexity and implementation requirements will lead to wildly different data results.
- Another reason: the data lake is straightforward in terms of org structure, funding, and specialty skills. I don't see how splintering this makes data mesh more achievable; it becomes an unanswered management question.
- https://www.datamesh-architecture.com/
- https://dataproducthinking.substack.com/p/the-problems-in-the-modern-data-stack
- One of the complaints is the constant reworking of the data pipelines - and I must agree
- I'm taking Azure data engineering certification courses right now, and all the SQL queries are by their very nature schema dependent. You build a schema right into your query to read the data in efficiently (product ID, product name, price…)
- If an upstream data source changes, or next month adds, changes, or even reorders data fields, some guy needs to update the script that reads it in
- with NO clue what the downstream impacts are. I see why Delta Lake, and the schema layer OVER the raw data, is such a popular message now. The query below hard-codes the schema; the error it hits, and a sketch of the fix, follow it.
SELECT *
FROM OPENROWSET(
    BULK 'https://datalaked25gb9m.dfs.core.windows.net/files/sales/csv/20*.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0'
) WITH (
    SalesOrderNumber VARCHAR(10) COLLATE Latin1_General_100_BIN2_UTF8,
    SalesOrderLineNumber INT,
    OrderDate DATE,
    CustomerName VARCHAR(25) COLLATE Latin1_General_100_BIN2_UTF8,
    EmailAddress VARCHAR(64) COLLATE Latin1_General_100_BIN2_UTF8,
    Item VARCHAR(30) COLLATE Latin1_General_100_BIN2_UTF8,
    Quantity INT,
    UnitPrice DECIMAL(18,2),
    TaxAmount DECIMAL(18,2)
) AS [result]
-- ERROR: String or binary data would be truncated while reading column of type 'VARCHAR'
-- Truncated value: "Short-Sleeve Classic Jersey, XL"
-- ^^^ 31 characters, but Item is declared VARCHAR(30)
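A minimal fix sketch: widen the Item column so the 31-character value fits. The new length of 50 is my guess at headroom, not something from the course material; the real point is that every upstream change forces exactly this kind of manual edit.

SELECT *
FROM OPENROWSET(
    BULK 'https://datalaked25gb9m.dfs.core.windows.net/files/sales/csv/20*.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0'
) WITH (
    SalesOrderNumber VARCHAR(10) COLLATE Latin1_General_100_BIN2_UTF8,
    SalesOrderLineNumber INT,
    OrderDate DATE,
    CustomerName VARCHAR(25) COLLATE Latin1_General_100_BIN2_UTF8,
    EmailAddress VARCHAR(64) COLLATE Latin1_General_100_BIN2_UTF8,
    Item VARCHAR(50) COLLATE Latin1_General_100_BIN2_UTF8, -- widened from 30 so "Short-Sleeve Classic Jersey, XL" fits
    Quantity INT,
    UnitPrice DECIMAL(18,2),
    TaxAmount DECIMAL(18,2)
) AS [result]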
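For contrast, a sketch of the schema-over-raw-data idea: Synapse serverless SQL can read a Delta Lake table with OPENROWSET(FORMAT = 'DELTA'), where the schema comes from the Delta transaction log instead of being hard-coded in the query. The .../sales/delta/ path here is hypothetical, reusing the storage account from above.

SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://datalaked25gb9m.dfs.core.windows.net/files/sales/delta/', -- hypothetical Delta table path
    FORMAT = 'DELTA' -- schema is read from the Delta log, so no WITH clause is needed
) AS [result]

If the producer adds a column next month, this query should keep working, because the schema travels with the data instead of living in every consumer's script.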