Doing Data Science the Cloud and Distributed Way

See how we're iterating on the way we do early-stage exploratory data science with large datasets.

Humberto Corona, Data Scientist

Sergio Gonzalez Sanz, Data Scientist

Posted on Nov 04, 2016

Our team has been building data science products together for almost one year now. We believe that data scientists and data engineers should work closely together, yet we understand the differences in the environments and tasks that each of us performs.

In this vein, we have iterated on the way we do data science, especially early-stage exploratory data science with large datasets. We have also established a flexible framework for reviewing and sharing data science knowledge within and across teams. This framework allows us to carry out data science tasks in a production-ready environment; to hold our work to a higher standard via peer reviews; and to use distributed computing frameworks such as Spark or Hadoop, so we can build machine learning models in the cloud with large datasets.

We’ve explored an array of different frameworks in our team and we’d like to share the pros and cons of each of them below.

Jupyter Notebooks

A few of us in the team really like Python because it is great for early prototyping. Libraries such as scikit-learn allow us to iterate on ideas really quickly. We also like using Jupyter Notebooks, where you can integrate text, graphs, code, and data in a single human-readable file. However, we have found a few drawbacks with this method.

To start with, Jupyter Notebooks do not render in GitHub Enterprise, which we need for reviewing and versioning our work (we solved this by doing rubber-duck reviews on markdown exports of the code). An even bigger problem is that this approach doesn't help when using the tools we need to perform data science on really large datasets. Using EMR clusters, reading S3 files, or having immutable experiments is not straightforward this way.

EC2 / EMR steps

As software engineers, we have also tried a more standard approach used in the software industry. Using Scala and Spark, we have built GitHub repos to tackle a wide range of data science problems. Since we work in a production-ready environment, we have used code reviews and standard testing techniques to guarantee the quality of our code. Additionally, we have deployed Jenkins for Continuous Integration. In order to obtain results or create models, we usually deploy our code as fat JARs that then run on Docker, on top of EC2 or EMR, depending on our needs.
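To make this concrete, below is a minimal sketch (in Scala) of the kind of Spark job we would package as a fat JAR and run on EMR; the bucket, paths, and column layout are invented purely for illustration.

    // Illustrative Spark job: count interactions per item from raw CSV events on S3.
    import org.apache.spark.{SparkConf, SparkContext}

    object ExampleJob {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("example-job")
        val sc = new SparkContext(conf)

        val events = sc.textFile("s3://example-bucket/events/*.csv")
        val counts = events
          .map(_.split(","))
          .map(fields => (fields(0), 1L)) // assume the first column is an item id
          .reduceByKey(_ + _)

        counts.saveAsTextFile("s3://example-bucket/output/item-counts")
        sc.stop()
      }
    }

Once assembled into a fat JAR, a job like this is submitted to the cluster with spark-submit, which is exactly the slow build-deploy-run loop discussed next.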

The problem with this approach is that it is not data-science friendly. Data science requires a lot of experimentation and tuning. The process of creating a JAR, deploying it, and running it is incredibly time-consuming and tedious. An alternative would be using config files with parameters that we can change on each run of our programs to get different results. Nevertheless, it is not as flexible or dynamic as data scientists would expect.
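To make that alternative concrete, here is a hedged sketch of externalising run parameters with the Typesafe Config library; the parameter names are hypothetical.

    // Illustrative only: read experiment parameters from application.conf so a new run
    // needs an edited config file rather than code changes.
    import com.typesafe.config.ConfigFactory

    object ExperimentConfig {
      private val config = ConfigFactory.load() // loads application.conf from the classpath

      // Hypothetical parameter names, not from a real pipeline
      val inputPath: String      = config.getString("experiment.input-path")
      val numFactors: Int        = config.getInt("experiment.num-factors")
      val regularization: Double = config.getDouble("experiment.regularization")
    }

Even then, every tweak still means another batch run rather than an interactive session, which is why this never felt flexible enough for exploratory work.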

Zeppelin Notebooks

Zeppelin is an interactive web-based notebook (similar to Jupyter) that is being developed by the Apache community. Its main features are the large number of different programming languages it supports and the flexibility it has to incorporate new languages in its interpreter. Its built-in visualization tools are also one of its highlights.

Zeppelin simplifies the running of large-scale experiments across large datasets in the cloud (for example, on AWS). It automatically provides a SparkContext and a SQLContext, so data scientists don't have to initialize them manually. It is also possible to load JARs and libraries from your local filesystem or from a Maven repository.
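As a hedged example, a quick exploration in a Zeppelin Spark paragraph might look like the sketch below; sqlContext is one of the contexts Zeppelin injects, while the S3 path and column names are invented for illustration.

    // Sketch of a Zeppelin Spark paragraph: sqlContext is already available,
    // so exploring a dataset on S3 takes only a few lines.
    import sqlContext.implicits._

    val clicks = sqlContext.read.json("s3://example-bucket/clicks/2016-11/*.json")

    clicks
      .groupBy("country")
      .count()
      .orderBy($"count".desc)
      .show(10) // Zeppelin can also render this result with its built-in charts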

The integration of Zeppelin Notebooks within the software development process has proven to be difficult, however. High quality code is a standard at Zalando and we like to peer-review our notebooks. Unfortunately, just like Jupyter, Zeppelin Notebooks are not rendered by GitHub. We have sometimes opted for exporting the notebooks as .html or .json, but the review process then becomes tiresome and tricky.

Conclusion

We are still iterating with small improvements over our current setup for data science, especially making it closer to our engineering processes. As new tools come out, we will test them to see how they fit our workflow, and as our products change, the type of data science work we do will shift.

[Diagram: our current approach to exploratory and early-stage data science]

Our current approach for exploratory and early-stage data science can be seen in the image above. What we've described is only a snapshot of our current state and of the processes we have put in place to get there.

We know many teams have gone through the same process and we would like to hear about your experiences. You can contact us via Twitter at @totopampin and @S_Gonzalez_S to share your own processes and feedback.


