
Workshop/Tutorial I - 14:00–16:00 UTC
bzaitlen
13:05
https://docs.google.com/presentation/d/e/2PACX-1vS-0EPOCZezbpgJnSaHAAjspUifVruGS1wYQqV_ud-MWMG4FL28GCA2WSfPkGL5zjnGJpjdRKVRNIDa/pub?start=false&loop=false&delayms=3000
Sheikh Alam
28:42
Which language is used as the core language for Dask, like Scala for Spark?
James Bourbeau (he/him)
28:50
Python
Matthew Powers (speaker)
30:23
I’ve dug around in Spark / Scala code a lot and find the Dask codebase way easier to follow. Love Scala code, but it’s kind of mind-bending, haha.
deepak tunuguntla
30:56
I understand your love for Dask now ;)
Martin Durant (speaker)
32:27
When doing dataframes or arrays in Dask, the execution on the parts happens directly in NumPy (or CuPy, …) and pandas (or cuDF): Python talking to Python.
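A minimal sketch of that point, with made-up column names, showing that each Dask partition is just an ordinary pandas object:

```python
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"x": range(10), "y": range(10)})  # ordinary pandas data
ddf = dd.from_pandas(pdf, npartitions=2)              # split into 2 partitions

# Each partition is a plain pandas DataFrame; per-partition work is pandas calling pandas.
print(type(ddf.partitions[0].compute()))  # <class 'pandas.core.frame.DataFrame'>
print(ddf.x.sum().compute())              # per-partition sums run in pandas, then combine
```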
Marc-Andre Lemburg
33:01
There was a talk about a new, faster dataframe implementation called "Polars" yesterday. Is it possible to use this with Dask, or is there consideration to move in that direction?
Sheikh Alam
33:46
I was thinking it must be written in C/C++. But I just visited GitHub and it is 99% Python 😍.
Martin Durant (speaker)
33:59
I don’t know of anyone having investigated it. If it supports the pandas API, it could work well, but we also need to think about how to pass around serialized dataframe chunks.
Marc-Andre Lemburg
36:28
The Polars API is similar, but not the same, as I understood from the presentation (https://pola-rs.github.io/polars-book/user-guide/introduction.html). It focuses a lot on SQL-style queries, so it would be ideal for ETL. It also uses Arrow for data exchange.
Martin Durant (speaker)
37:09
As we can swap out pandas for cuDF, maybe worth trying.
Marc-Andre Lemburg
37:30
Sounds like it's worth an experiment :-)
Zhuchang Zhan
38:18
So if I have a cluster of N cores, do I create a cluster of N+1 processes, where the scheduler and one worker share a process so I have N workers, or do I create an N-process cluster with N-1 workers?
James Bourbeau (he/him)
38:36
It’s certainly in scope for Dask. For example, integrating with Polars came up in this issue https://github.com/dask/dask/issues/8074
Martin Durant (speaker)
39:30
You normally reckon on one worker thread per core. The scheduler should not need a lot of CPU in typical one-machine workflows. On a cluster (Kubernetes, etc.), the scheduler normally has its own container.
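For the single-machine case, a minimal sketch; the worker and thread counts below are illustrative, not a recommendation:

```python
from dask.distributed import Client, LocalCluster

# One worker thread per core is the usual starting point; numbers here are illustrative.
cluster = LocalCluster(n_workers=4, threads_per_worker=2)  # e.g. an 8-core machine
client = Client(cluster)
print(client.dashboard_link)  # scheduler diagnostics dashboard
```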
Zhuchang Zhan
41:27
So the scheduler may need a lot of CPU for, say, a 500-core multi-node machine? By “the scheduler normally has its own container”, do you mean I should put the scheduler on a separate node?
Carlos Emilio Buenaventura
43:07
Is there any “managed” cluster in the cloud (à la EMR) for Dask?
Martin Durant (speaker)
43:39
When I mentioned containers, I was talking about cluster systems in which everything lives in containers, such as Kubernetes. On a big single machine with big jobs, yes, you might find that the scheduler needs significant resources of its own. The difference between N and N+1 is probably not important in that case!
Martin Durant (speaker)
44:01
> “managed” cluster in the cloud
Martin Durant (speaker)
44:08
There are MANY deployment models!
deepak tunuguntla
44:19
Does using different engines affect the reading and writing speeds?
James Bourbeau (he/him)
44:28
There are a variety of ways to deploy Dask clusters (including managed solutions). See https://docs.dask.org/en/stable/how-to/deploy-dask-clusters.html for more information.
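As one common pattern among those options, you can start the scheduler and workers yourself and point a client at the scheduler; the host name and port below are placeholders:

```python
# Manual deployment sketch; addresses below are placeholders.
# On the scheduler node/container:   dask-scheduler
# On each worker node/container:     dask-worker tcp://<scheduler-host>:8786
from dask.distributed import Client

client = Client("tcp://scheduler-host:8786")  # connect to the running scheduler
```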
Martin Durant (speaker)
44:48
Yes, the engine matters, but I would say both are “fast” and “feature complete”
Sheikh Alam
44:54
Is there any performance gain from Dask’s Parquet support over plain PyArrow?
Sheikh Alam
45:12
Or is it more about compatibility with Dask?
Zhuchang Zhan
45:14
Thanks Martin!
Eray Aslan
45:25
Is there any work going on to make the scheduler more scalable? We've had some workflows where the scheduler became the bottleneck (think a few million embarrassingly parallel jobs) and prevented us from migrating to Dask.
Martin Durant (speaker)
45:51
Dask uses pyarrow (or fastparquet), so you may get parallelism and partition-wise (out-of-core) processing, with some additional overhead, but the raw byte processing is the same
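For reference, the engine is selectable when reading (the path below is a placeholder, and the fastparquet engine is assumed to be available in the Dask versions discussed here):

```python
import dask.dataframe as dd

# Same Parquet bytes either way; the engine keyword chooses the underlying reader.
ddf_arrow = dd.read_parquet("data/*.parquet", engine="pyarrow")  # placeholder path
ddf_fastp = dd.read_parquet("data/*.parquet", engine="fastparquet")
```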
Martin Durant (speaker)
46:05
> making the scheduler more scalable?
Martin Durant (speaker)
46:25
Yes, we think about it a lot! And we try to optimise graphs better.
Sheikh Alam
46:37
Thanks
Radovan Kavicky
47:45
Thank you! It was perfect!
Marc-Andre Lemburg
50:07
Thanks for the answers on Polars, Martin and James.
Radovan Kavicky
52:44
OK, so it continues... here we go.
James Bourbeau (he/him)
01:01:38
At the Dask Summit earlier this year there was a workshop on improving Dask’s scheduler performance https://www.youtube.com/watch?v=fXbcY0y0Fa0, Eray you may find that workshop interesting
James Bourbeau (he/him)
01:02:19
(There are also a bunch of other great videos from the Dask Summit on Dask’s YouTube channel https://www.youtube.com/c/Dask-dev/videos)
Eray Aslan
01:03:37
> At the Dask Summit earlier this year...
Eray Aslan
01:05:00
Thanks for the answer and all your work. We really would like to get rid of managing and reading error messages from the JVM :)
James Munroe
01:05:21
For the very large dataset problems where the aggregated metadata file itself gets very large, has there been any consideration of using a traditional relational DBMS as an alternative to a flat file for the metadata layer?
Bob Fahr
01:06:00
What are the feature differences between Pyarrow and Fastparquet? I've only used Pyarrow.
Majid alDosari
01:06:11
What's the point of using Parquet for intermediate results if you're going to keep data processing in Python? You can just generically 'pickle' Python objects. There is still a bit of risk translating back and forth between some usable memory representation and the serialized format (Parquet). Mind that maybe you don't want tabular storage.
Majid alDosari
01:11:22
@james, I leveraged such a technique: 'cache' arbitrary (small) metadata (from big reads) and formalize it into a 'schema'. The schema structure follows your data-processing DAG.
Rick Zamora
01:13:29
> What are the feature differences between Pyarrow and Fastparquet? I've only used Pyarrow.

They are just two different libraries that allow us to read and write the Parquet format. I have heard of some people avoiding pyarrow because the install size was too large, but I’m not sure if that is still an issue.

> What's the point of using parquet for intermediate results if you're going to keep data processing in python? You can just generically 'pickle' python objects.

The parquet API is designed with persistence in mind, but I’m sure you can write a custom workflow to leverage pickle instead (for certain applications).
Martin Durant (speaker)
01:15:30
Pickle wouldn’t allow you to, for example, subset your saved data by column and partition. Also, it doesn’t compress very well, so writing to disk or sending over the network might be slow. Finally, with pickle you need to match the library versions of the system loading the data very exactly, whereas Parquet is a standard format.
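A sketch of the column/partition subsetting Martin mentions; the path, column names, and filter value are hypothetical:

```python
import dask.dataframe as dd

# Read only two columns, and only the row groups whose statistics can satisfy the
# filter; a single pickled blob would have to be loaded in full instead.
ddf = dd.read_parquet(
    "s3://bucket/events/",           # hypothetical dataset path
    columns=["user_id", "amount"],   # column projection
    filters=[("year", "==", 2021)],  # partition / row-group pruning
)
```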
Roger Webb - KJ7LNE
01:20:16
Data files are great for staging, but I worry that these data lake schemes are a symptom of traditional DBMS not keeping up with cloud architecture. In theory, all DBMS use clusters of files in the back end, and they do all the transactional/atomic operations and indexing/metadata. I've been hesitant to adopt the data lake system as it feels like I should be spending that time tuning the DBMS. Is the data lake here to stay, or is it a stopgap until systems like Aurora catch up and "cloudify" the DB space? Thoughts?
Martin Durant (speaker)
01:21:20
^ please ask in discussion, the future is a big unknown!
Marc-Andre Lemburg
01:23:06
What about Apache Druid ?
Roger Webb - KJ7LNE
01:23:10
For sure. Apologies, I was unaware there was a specific time set aside for discussion. The future isn't known, but development is happening, and I'm genuinely curious whether others are more hip to the development path of Aurora (or other DBMS attempts to adapt to the cloud / large data sets). I was hoping, at a conference like this, that there might be someone wishing to discuss what may be coming in the next couple of years.
Majid alDosari
01:23:34
@roger, I have the same sentiment. I advocate doing as much processing as possible in a 'db'. I only reach into Python (or Spark) if it's something that I can't describe in SQL.
Majid alDosari
01:24:17
Some cloud DBs are insanely scalable.
Martin Durant (speaker)
01:25:05
For sure, if your data is in some DB and you want to do SQL on it, leverage the DB - they are really well optimized! Python is best for its flexibility and for connecting to a wide range of tools, such as ML.
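If you do then need to pull DB tables into Dask for the flexible side of a pipeline, something along these lines is possible (the table name, connection URI, index column, and partition count are placeholders; requires SQLAlchemy):

```python
import dask.dataframe as dd

# Push the SQL-friendly work to the database first, then read the (already reduced)
# table into Dask for the flexible / ML side of the pipeline.
ddf = dd.read_sql_table(
    "aggregated_results",                  # hypothetical table name
    "postgresql://user:pass@host/dbname",  # placeholder connection URI
    index_col="id",
    npartitions=8,
)
```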
James Munroe
01:27:01
That sounds so much like query optimization!
Marc-Andre Lemburg
01:28:00
Thank you for the overview !
Daniel Gonzalez
01:28:06
Thank you
Alex Delgado
01:28:17
Can Dask read and write Delta Lake files?
matteo pallini (he/him/his)
01:28:22
Thanks great talk!
Roger Webb - KJ7LNE
01:29:43
Bingo! That's the balance I'm trying to find. The reporting is out of a DB. My data is too big to do it all in the DB. I've been dancing around where the line needs to be.
Veronika
01:29:56
Great talk! Can you explain a bit about the differences between PySpark/Scala and Dask?
matteo pallini (he/him/his)
01:30:17
+1 to Veronika's question
Radovan Kavicky
01:31:42
I already thanked you... but it deserves it yet again... really excellent talk(s), and it was perfect! As for questions... what do you think about other Parquet-based solutions, e.g. Dremio.io and so on (is the open-source solution in this area still competitive)?
Roger Webb - KJ7LNE
01:31:58
Is there a status update on dask/dask-cloudprovider as it relates to AWS Batch? I'm using Dask to multithread on EC2 but am manually wiring up my AWS Batch integration. Do you have plans here?
Martin Durant (speaker)
01:32:41
^ perhaps an issue on the cloudprovider repo, or a question in slack/discourse?
Roger Webb - KJ7LNE
01:32:57
https://github.com/dask/dask-cloudprovider/issues/73 Last comment is "Any news?"
Marc-Andre Lemburg
01:32:58
IMO, there are already plenty of database systems around, so it's better to focus on things where Dask can shine, e.g. managing workflows, integrating lots of systems (both on input and output), doing smart compute partitioning, etc.
Bob Fahr
01:33:41
@Veronika from my experience, PySpark seems to handle memory use better than Dask for large datasets, especially when you have limited resources. Also, the logging is aggregated in Spark, so it's easier to see when you have problems.
James Bourbeau (he/him)
01:34:56
https://docs.dask.org/en/stable/spark.html
James Bourbeau (he/him)
01:35:04
^ High-level Dask / Spark comparison
Matthew Powers (speaker)
01:36:04
Think that’s a fair assessment @Bob
Gabriel Vicent Jover Mañas
01:36:09
Is Xarray integrated with Dask?
Sheikh Alam
01:36:20
Any suggestions on choosing between Dask and Spark for a project?
James Bourbeau (he/him)
01:36:32
Xarray uses Dask under the hood for scaling to larger-than-memory datasets
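For example, asking xarray to chunk on open yields Dask-backed arrays; the file name, variable name, and chunk size below are hypothetical:

```python
import xarray as xr

# Passing chunks= makes each variable a lazy dask.array instead of an in-memory numpy array.
ds = xr.open_dataset("ocean_temps.nc", chunks={"time": 100})  # hypothetical file and chunking
print(type(ds["temperature"].data))                           # a dask array, not numpy
time_mean = ds["temperature"].mean(dim="time").compute()      # computed only on demand
```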
Marc-Andre Lemburg
01:36:40
A more technical question: Are there any recommendations regarding how to size the Parquet files / do partitioning in order to get maximum performance with Dask?
Gabriel Vicent Jover Mañas
01:37:56
thanks
Matthew Powers (speaker)
01:38:00
The Dask vs Spark selection depends on the team, data sizes, and objective. From what I’ve seen, some teams really don’t like Scala / the JVM; others do. Dask provides access to some libraries that’ll make some operations a lot easier.
Veronika
01:38:36
Awesome thanks for the answers and link 🙂
James Munroe
01:39:02
Thanks for a great set of presentations and good discussion.
Majid alDosari
01:39:41
I'll add that Dask is a *much* lighter ask than dealing with a Spark cluster.
Johannes Steenbuck
01:39:45
At what point of an ETL project would you recommend switching from classical pandas DataFrames to Dask? Or from what expected data size do you recommend setting up your ETL with Dask from the beginning? PS: great presentation, thanks a lot.
Majid alDosari
01:40:58
It's very smooth to go from pandas > Dask on one machine > Dask on a powerful machine > a Dask cluster.
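A sketch of how little the code changes along that path; the file patterns and column names are made up:

```python
# pandas version: single machine, everything in memory.
import pandas as pd
df = pd.read_csv("events-2021-01.csv")
result = df.groupby("user_id").amount.sum()

# Dask version: nearly identical code, now lazy and parallel, and it runs unchanged
# against the threaded scheduler, a LocalCluster, or a multi-node cluster.
import dask.dataframe as dd
ddf = dd.read_csv("events-*.csv")
result = ddf.groupby("user_id").amount.sum().compute()
```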
Matthew Powers (speaker)
01:41:05
I generally use the 5x RAM-to-dataset-size rule of thumb for pandas (not scientific). So if my machine has 32 GB of RAM, I will only use pandas if the dataset is smaller than 6.4 GB.
Matthew Powers (speaker)
01:41:30
But I always write everything in Dask because I’ve found it’s nice to have stuff scalable.
Matthew Powers (speaker)
01:41:47
Even for stuff I think is small: then something changes, and it’s too big, haha.
Marc-Andre Lemburg
01:41:59
Great, thanks for the advice.
Bob Fahr
01:43:39
One other thing to consider when choosing which tool to use is the data types in your Parquet files. There are some type differences between Spark and Dask/pandas and PyArrow that can cause unexpected results if you have a heterogeneous environment.
Marc-Andre Lemburg
01:50:19
Is there a limit on how much data you can pass around between tasks? Is it better to keep data of around 1 MB in data storage or pass it in via parameters?
Marc-Andre Lemburg
01:51:19
(serialization would be via Arrow, I suppose)
Marc-Andre Lemburg
01:54:33
Great, thanks. My use case is processing lots of smaller CSV files (up to 5 MB), which, as I learned now, would not really perform well as Parquet files (one per CSV file).
Majid alDosari
01:54:47
I'll just add that it sucks that we have to deal with quite a few data-type systems: NumPy, pandas old/new types, Python, Arrow/Parquet.
Brian
01:55:07
Handling S3 connection limits: you mentioned this briefly; can you talk about some solutions? Is it just retrying a bunch of times?
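One hedged possibility, assuming s3fs/botocore underneath, is to tune the retry behaviour via storage_options; the path is a placeholder and the retry values are illustrative, not recommendations:

```python
import dask.dataframe as dd

# storage_options is forwarded to fsspec/s3fs; config_kwargs reaches botocore's Config,
# so throttled S3 requests are retried with backoff instead of failing outright.
ddf = dd.read_parquet(
    "s3://bucket/table/",  # placeholder path
    storage_options={
        "config_kwargs": {"retries": {"max_attempts": 10, "mode": "adaptive"}},
    },
)
```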
Giles Weaver
01:56:05
Is there any coordinated effort towards consistency between pyarrow/fastparquet and pandas/dask data types?
Majid alDosari
01:57:15
...oh, and SQL data types, ODBC data types...
Majid alDosari
02:02:00
Converting everything to strings and then doing inference was the only way I was able to generally deal with the problem when moving data from one system to another.
Marc-Andre Lemburg
02:02:28
That's why we still use CSV everywhere :-)
Matthew Powers (speaker)
02:02:49
I change from CSV => Parquet whenever possible
Matthew Powers (speaker)
02:03:08
But I come across it all the time as well.
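A minimal conversion sketch, which also touches the many-small-CSVs case from earlier by repartitioning before the write; the paths and partition size are illustrative:

```python
import dask.dataframe as dd

# Read many small CSVs, coalesce them into fewer, larger partitions, then write one
# Parquet dataset rather than one tiny Parquet file per input CSV.
ddf = dd.read_csv("incoming/*.csv")            # placeholder glob
ddf = ddf.repartition(partition_size="100MB")  # avoid lots of tiny Parquet files
ddf.to_parquet("curated/events/")              # placeholder output directory
```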
Majid alDosari
02:03:39
The visions library needs more credit.
Matthew Powers (speaker)
02:03:52
There is a Dask Slack too
Martin Durant (speaker)
02:03:59
The dashboard? That’s all Bokeh.
Matthew Powers (speaker)
02:04:00
Where you can find us ;)
Giles Weaver
02:04:20
Thanks!
Marc-Andre Lemburg
02:04:21
Many thanks to all the speakers ! It was a real pleasure discussing with you.
Bob Fahr
02:04:27
Thanks very much, really enjoyed it!
Majid alDosari
02:04:44
thanks!
Ruud Wassink
02:05:00
Thanks for the great talk and discussion!
Radovan Kavicky
02:05:10
It was awesome!
Marc-Andre Lemburg
02:05:28
👏👏👏