2023-09 September Reads

Created by
Marc Leprince
Created time
Aug 9, 2023 3:24 PM
Last edited time
Oct 2, 2023 7:41 PM
  • Advanced SQL Topics
  • Case for Data Mesh https://martinfowler.com/articles/data-monolith-to-mesh.html
    • A good, comprehensive read on the high-level architecture and the conceptual drivers for data mesh over existing approaches
    • My favorite line: each domain should be responsible for representing its org’s data truthfully
    • While intellectually strong, I believe this will be hard to do in practice, because data efforts are like government infrastructure: they get funded, scoped, built, and maintained as one thing. I don’t see a clear way to fund a mesh.
    • If every domain gets a cut of the existing data budget and splits out the resources so every domain handles the same problems, I don’t see how you could justify that spend to the board.
    • They’re getting the answers they need today - why would they spend 10x the money and effort to achieve, what, a lack of problems with the existing approach?
    • Another thing: the level of scrutiny around lineage and governance can chain a project down for years, and that may not be necessary - you don’t build a train to transport DoorDash meals; you’re better off with a Vespa or a small car. Variability in the complexity of domains and their implementation requirements will lead to wildly different data results.
    • Another reason: the data lake is straightforward in terms of org, funding, and specialty skills - I don’t see how splintering it makes data mesh more achievable. It becomes an unanswered management question.
  • https://www.datamesh-architecture.com/
  • https://dataproducthinking.substack.com/p/the-problems-in-the-modern-data-stack
    • One of the complaints is the constant reworking of data pipelines - and I must agree:
      • I’m taking Azure data engineering certification courses right now, and the SQL queries are by their very nature schema-dependent. You build a schema right into your query to read data in efficiently (product ID, product name, price…)
      • If an upstream data source changes - say the following month it adds/changes data fields, or even reorders them - someone has to update the script that reads it in, like the query below
      • SELECT *
        FROM
            OPENROWSET(
                BULK 'https://datalaked25gb9m.dfs.core.windows.net/files/sales/csv/20*.csv',
                FORMAT = 'CSV',
                PARSER_VERSION = '2.0'
            ) WITH (
                -- hand-maintained schema: every column name, type, and width is baked
                -- into the query, so any drift in the files breaks the read
                SalesOrderNumber VARCHAR(10) COLLATE Latin1_General_100_BIN2_UTF8,
                SalesOrderLineNumber INT,
                OrderDate DATE,
                CustomerName VARCHAR(25) COLLATE Latin1_General_100_BIN2_UTF8,
                EmailAddress VARCHAR(64) COLLATE Latin1_General_100_BIN2_UTF8,
                Item VARCHAR(30) COLLATE Latin1_General_100_BIN2_UTF8, -- too narrow: see error below
                Quantity INT,
                UnitPrice DECIMAL(18,2),
                TaxAmount DECIMAL(18,2)
            ) AS [result]
        
        -- ERROR: String or binary data would be truncated while reading column of type 'VARCHAR'
        -- Truncated value: "Short-Sleeve Classic Jersey, XL"
        -- ^^^ 31 chars, one more than Item VARCHAR(30) allows
      • and NO clue what the downstream impacts are - I see why Delta Lake, and the schema layer OVER the raw data, is such a popular message now (see the sketch below)
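      • For contrast, here is a minimal sketch of the same read against a Delta table - the /sales/delta/ folder is hypothetical. Synapse serverless SQL can query Delta with FORMAT = 'DELTA', and the column names and types come out of the Delta transaction log instead of a hand-maintained WITH clause.
      • SELECT TOP 10 *
        FROM
            OPENROWSET(
                BULK 'https://datalaked25gb9m.dfs.core.windows.net/files/sales/delta/',
                FORMAT = 'DELTA'
            ) AS [result]

        -- no WITH clause needed: the schema travels with the data in the Delta log,
        -- so an upstream column add/rename surfaces here without editing the query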