fa
Feedback
There will be no singularity

There will be no singularity

رفتن به کانال در Telegram

Smartface, technologies and decay @antonrevyako

نمایش بیشتر
1 957
مشترکین
اطلاعاتی وجود ندارد24 ساعت
-17 روز
-530 روز
آرشیو پست ها
https://jepsen.io/analyses/mysql-8.0.34 Reason for not using MySQL #1234: MySQL’s “Repeatable Read” does not satisfy PL-2.99 Repeatable Read

New week, new "special behavior" in Snowflake 🙂 In Snowflake, there is a special function IDENTIFIER() that allows referencing objects using strings and session variables:

SELECT
  * 
FROM
  IDENTIFIER($my_table)
Documentation can be found here: https://docs.snowflake.com/en/sql-reference/identifier-literal Additionally, you can reference an object in the column list:

SELECT 
  IDENTIFIER($my_table).* 
FROM 
  IDENTIFIER($my_table)
Wait, that's not the "special behavior" yet... 1)

IDENTIFIER($my_table).*
works, but

IDENTIFIER($my_table).column_name 
does not This looks very strange... But until you try this: 2)

SELECT 
  IDENTIFIER($my_table1).*, 
  IDENTIFIER($my_table2).* 
FROM 
  IDENTIFIER($my_table1),
  IDENTIFIER($my_table2)
... and you'll get twice of all the columns 🙂 Because IDENTIFIER(whats_ever).* gets replaced by just *! As a result, the above query is converted to

SELECT 
  *,
  * 
FROM 
  IDENTIFIER($my_table1),
  IDENTIFIER($my_table2);
PS: send your reactions here 🙂

Don't be alarmed. It's time to rewrite the name of the channel in English 🙂

A bit of explanation in my own words: 1) Venn diagrams describe interactions between sets. However, when we perform a JOIN, w
A bit of explanation in my own words: 1) Venn diagrams describe interactions between sets. However, when we perform a JOIN, we don't get a single set but a set of sets (a table). Therefore, Venn diagrams aren't suitable for describing what actually happens after a JOIN. 2) In such illustrations (even if supplemented with tables), usually only one, almost the rarest case is depicted: unique values in both columns and an equality condition for the join. There are no repetitions, no NULLs, and no complex join conditions. Previously, I created a static analyzer for PostgreSQL - holistic.dev. Besides type inference, I wanted to infer two more important parameters: constraints and the row count class. If we have information about unique indexes and foreign keys, we can infer the uniqueness of values, NULLability, and the row count class. What are these classes? NONE, ONE, ONE_OR_NONE, MANY, MANY_OR_NONE. Why are they needed? For instance, automatic asserts in code can be created. These exist in the Node.js driver. Or if you store your SQL in separate files, automatic checks can be made to ensure nothing broke during query or schema updates and to detect Row Explosion. So, outputting these classes and constraints were the most complex parts of the entire compiler :) All because of JOIN. Maybe that's why my eye still twitches when I see such explainers... People might look at this and then receive bills for 50 million... Conclusion: Venn diagrams are suitable for explaining UNION/INTERSECT/EXCEPT. For JOIN, this picture by Andreas Martinson is more suitable.

Lads, can you please like the post of who is using Snowflake? Or other analytical databases. Maybe you remember, I used to make a service for static analysis of PostgreSQL (holistic.dev) and posted stuff about PG. For quite some time now, I've been doing an observability platform for Snowflake (dwh.dev) with my team, and I'm fully immersed in this area. And would love to understand how interested you guys are in this at all. Is it worth sharing details about creating such software? Or is it necessary to bring back SQL-WTF about PostgreSQL? In short, tell us what you would like to see. PS: Yes, by the way, we have Snowflake chat rooms (RU/EN) and a channel with Snowflake news.

photo content

Сейчас будет скучный пост про стартапы и бизнес, Потому что в ChatGPT вчера добавили поддержку PDF файлов (при чем тут это?) При том, что десятки инди-стартапов вчера дружно построились и пошли умирать. Все эти pdf.ai, ChatPDF, AskYourPDF внезапно стали не нужны. Прикольно, что все это очень смахивает на историю с Amazon Basics. Вы же в курсе, как работает Амазон? Если они в аналитике видят выстреливший продукт от независимого поставщика - они быстро его клонируют и выпускают под брендом Amazon Basics. И впихивают свой клон в топ поиска. А потом наваливаются всей своей логистическо-ритейлово-маркетиноговой мощью. Оригинальный продукт через месяц умирает. За это их много раз таскали по судам, антимонопольным разбирательствам и даже на слушания в Конгрессе. После чего они... просто научились лучше прятаться. Теперь копии продаются под "независимыми" фейковыми брендами. Если вы делаете тупой враппер для GPT в горизонтальной нише - OpenAI убьет вас так же, как Amazon Basics убивает инди-ритейл. Я почти уверен, что они посматривают на свою API-аналитику и как только видят где-то "хоккейную клюшку" и легко реплицируемый продукт - добавляют его в виде фичи. В английском бизнес-жаргоне пару лет назад стал популярен классный термин "moat". Буквальный перевод - "крепостной ров". Moat - это то, что защитит вас от конкурентов и platform-риска. Это ваш момент импульса (тот самый, который не дает гироскопу отклоняться от оси). Moat - это то, почему нельзя просто взять и скопировать бизнес-модель. Это не обязательно фича продукта - возможно это преданная аудитория. Или вы "go to" бренд в своей нише. А возможно ваш продукт плотно и незаметно интегрируется в критические бизнес-процессы клиента, и слезть с него тупо невозможно. Ну или вы - AWS и берете такие баблищи за траффик что переезжать очень больно. Если у вас нет moat - бизнес под угрозой. В общем, хватит пилить врапперы для ЖПТшки. Пилите врапперы "with a twist". Ну, например, я не знаю, "PDF-чатилка для адвокатов да еще и совместимая с GDPR". AI is a feature, not a product ☝️

I wrote an article about the Snowflake update and how startups are getting screwed by it, but my graphomania didn't fit into the damn LinkedIn, so I had to split it into three. I won't torture you here either, so here's a link to the whole article: https://dwh.dev/blog/startup-killers and you can like it on LinkedIn here, here and here (please). There's just an article from Alex on the same topic (forward next post). "What are you going to do if Snowflake (Amazon, Google, name it) did the same thing?" - every investor's favorite question...

Is such functionality useful in the user interface? At first glance, yes. But could it happen that data used to flow one way, then changed? Could PII stop flowing from upstream? Why not? Anything can happen. Would tags set on downstream columns still make sense in a month, a quarter, a year? Nobody knows. Suddenly, a new problem arises: regular review of tags and masking policy. Reviewing becomes more complicated as no comments can be added to tag and masking policy set process. How will all this look and perform when there are hundreds of downstream objects? So, How Do Startups Survive Now? You might think I'm nitpicking? Many independent data lineage products work the same features as Snowflake now, and that not bothered them (it's in the past now). Ultimately, many users are satisfied with the information provided by DBT 🙂 It's also clear that Snowflake simply bolted an interface onto the lineage data they already had. It's crucial to understand: lineage in Snowflake is the result of dynamic analysis. All independent data lineage services don't have Snowflake's data and build their work based on static analysis. Dynamic and static analyses are different approaches, not replacements. For data startups, it seems things that lie beyond Snowflake's capabilities or demand far more effort than extracting metrics from query engines should now take the forefront. For instance, as soon as the need arises to view lineage from BI dashboards up to sources or to obtain columnar lineage for DBT, Snowflake won't be of much help. Or static analysis of SQL queries. Developing this is time-consuming and expensive (as noticed across all dwh.dev competitors). It seems vendors' efforts will now focus on AI rather than static analysis. Will it become more challenging for modern data stack startups to survive? Yes, certainly. Will it kill them? The ones that thrived on basic functionality - undoubtedly. And How About the Clients? The vendor provided good enough functionality. What's next is up to them 🙂

COLD-BLOODED STARTUP KILLER (part 1) Amid the soap opera starring Sam Altman, everyone forgot that just a week prior, Open AI held a developer conference that became a tombstone for numerous startups. Of course, this isn't a unique occurrence. All major vendors do it. How many applications has Apple buried? Too many to count. For instance, just a few days before the Open AI conference, another major vendor dug some graves. On November 2nd, an online Snowflake conference took place. Here it is: Snowday agenda A few presentations have ruined a significant portion of the services working with Snowflake. Everything took a hit: data quality, data security, data catalog, cost management, text2sql generators, and, of course, data lineage. Data Quality In my view, data quality-first services suffered the most. Now, there's no need to pay $50k/y for an interface to run SQL queries via cron. Now, in Snowflake, you can do this:
CREATE DATA METRIC FUNCTION INVALID_EMAIL_COUNT (ARG_T TABLE(ARG_C1 STRING))
  RETURNS NUMBER AS
  ...
;

ALTER TABLE t SET DATA_METRIC_SCHEDULE = 'USING CRON */5 * * * * UTC';
ALTER TABLE t ADD DATA METRIC FUNCTION INVALID_EMAIL_COUNT ON (EMAIL);
And obtain a table with results. Then, create an ALERT that fetches results from this table and sends you emails if something goes wrong. With Snowflake now having Cortex Anomaly Detection and Time-Series Forecasting, complex metric-related tasks can be handled within Snowflake itself. Data Security The new Trust Center section contains information about account issues (account access settings), network policies, and new DATABASE ROLE objects. Also there's a new object: PRIVACY POLICY. A cool feature adding noise to aggregation functions, adapting to the dataset's size, and guarding against differencing attacks. Data Catalog Universal Search: AI-based metadata search in your database. The demo looks impressive: natural language search yields relevant results even without column's comments. Cost Management A new section in the admin panel lets you view expense dynamics, the most expensive queries (grouped by hash, irrespective of query parameters' values), lists of rarely used materialized views, tables with clustering keys, and more. Text2sql Generators Snowflake Copilot 🙂 Code generation happens almost instantly. Data Lineage Now, tables (for other objects, nothing is known) have a lineage tab. You can view object-to-object dependencies as a graph, expanding it one object at a time. Details Altogether, it's called Snowflake Horizon. The vendor's logic is clear: give clients maximum convenience without requiring additional purchases. Is it bad? Good! But not entirely. While all this looks great, it's not enough to make clients' pain disappear… DATA METRIC FUNCTION lacks an interface, and setting up Slack notifications will require coding. Universal Search lacks essential features present in data catalogs, like built-in chats for database objects 🙂 Cost management/optimization for startups involves rewriting queries (getespresso.ai) or tuning the warehouse (select.dev). Snowflake Copilot generates poor queries, needing requests for rewrite. It's unclear if these queries are validated against reality. And more on lineage: Bird's-eye view only Takes several seconds to open the next level of dependencies and can only be done for one object at a time Object view. Columnar Lineage looks like a table in a separate modal window and takes 5 seconds to load for 5 dependent objects in 3 downstream levels. It's unclear where the relationships shown in the demo come from. CTAS? INSERT/UPDATE/MERGE? Over what time period? DYNAMIC TABLE only? Does it work for COPY INTO in PIPES? What about VIEWS? The demo demonstrates a rather specific case: applying PII tags (and dependent masking policy) to columns from downstream objects by clicking in the interface. However, in Snowflake, tags don't automatically spread to dependent objects. If a tag is assigned to a column, anything downstream won't be affected.

Scalar functions name resolution special behavior in Snowflake. I never get tired of repeating that writing a universal compiler for different SQL dialects is impossible. Today, I will tell you about the behavior of name resolution in Snowflake inside CREATE VIEW. When you execute queries, Snowflake looks for the objects specified in the query in the schemas specified in SEARCH_PATH. You can check them like this:
SELECT current_schemas();
But it seems like a good idea to make the creation of VIEWs independent of SEARCH_PATH. Otherwise, we will get different results when we work with VIEWs at different SEARCH_PATH. The documentation says the following:
The SEARCH_PATH is not used inside views or UDFs. All unqualifed objects in a view or UDF definition will be resolved in the view’s or UDF’s schema only.
That's great! And it works!
CREATE OR REPLACE DATABASE db1;
CREATE SCHEMA sh1;

CREATE TABLE public.t1(c1 int);

CREATE VIEW sh1.v1 AS
SELECT * FROM t1;
will return
SQL compilation error: Object 'DB1.SH1.T1' does not exist or not authorized.
And not only for tables. For any objects, except … scalar functions!
CREATE OR REPLACE DATABASE db1;
CREATE SCHEMA sh1;
CREATE TABLE sh1.t1(c1 int);
INSERT INTO sh1.t1(c1) VALUES (1);

CREATE FUNCTION public.test()
RETURNS NUMBER
LANGUAGE SQL
AS '1';

CREATE VIEW sh1.v1 AS
SELECT *, test() c2 FROM t1;

SELECT * FROM sh1.v1;
will return
C1 C2
1   1
Strange behavior, don't you agree? In PostgreSQL, for example, it works like this: when creating a VIEW, all objects without schema specification are searched in the public schema. If you want a different schema, specify it by hand. But maybe I'm being picky? Let's add one more thing…
CREATE FUNCTION sh1.test()
RETURNS NUMBER
LANGUAGE SQL
AS '2';

SELECT * FROM sh1.v1;
will return
C1 C2
1    2
Oops… If there is a scalar function in the schema where VIEW is created, it will be used. If not, the function from PUBLIC will be used. I.e. if you didn't specify a schema for a scalar function from the PUBLIC schema while creating a VIEW, in a schema other than PUBLIC, then to corrupt the data in your database it is enough to create a function with the same name in the corresponding schema… At dwh.dev, we know about a lot of these nuances. So you are unlikely to find a better data lineage for Snowflake 🙂 linkedin likes goto https://www.linkedin.com/feed/update/urn:li:activity:7132776122580152320/