Thursday, May 23, 2019

Tech Book Face Off: Python For Data Analysis Vs. Python Data Science Handbook

I'm starting to dabble in machine learning. (You know it's all the rage now.) As with anything new, I find it most effective to pick out a couple of books on the subject and start learning the landscape and the details straight away. Online resources are good for an introduction, or to find answers to specific questions on how to get a particular task done, but they don't hold a candle to the depth and focus that you can find from reading about a subject in a well-written book. Since I'd already had some general exposure to machine learning in college, I wanted to work through a couple of books that focused on how to do data analysis and machine learning in a practical sense with a real language and modern tools. Python with Pandas and Scikit-Learn has a huge community and plenty of active development right now, so that's the route I went with for this pair of books. I selected Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython by Wes McKinney to get the details of using the Pandas data analysis package from the author of the package himself. Then I chose Python Data Science Handbook: Essential Tools for Working with Data by Jake VanderPlas to get more coverage of Pandas from another perspective and expand into some of the Scikit-Learn tools available for machine learning. Let's see how these two books stack up for learning to make sense of large amounts of data.


Python for Data Analysis

This book covers all of the fundamentals of doing data analysis with Python using IPython, Jupyter Notebooks, Matplotlib graphing, and the main data analysis packages: NumPy and Pandas. It stops short of going into the other major data analysis and machine learning library, Scikit-Learn, because it had already filled over 500 pages with the intricate details of NumPy and Pandas. Wes McKinney is the original author of the Pandas library, so we're getting all of those details straight from the source.

The book starts out with the perfunctory chapters on installing Python and other packages, how to use IPython and Jupyter Notebooks, and running through the basic Python language features. It's filler chapters like these in nearly every programming book out there that make me think that I no longer need to read introductory books on new languages. I can just go directly into books on applications of any given language, confident that they'll introduce me to the syntax and features I need to know anyway. Including this material isn't wrong, exactly, but the result is an awful lot of books with the same extra introductory material filling up pages that will mostly go unread.

Then there's a big chapter on using NumPy before moving on to Pandas for the rest of the book, with a chapter on the Matplotlib graphing library thrown in somewhere in the middle. The main focus is on Pandas, which is a huge library with tons of invaluable features for messing around with data. The book covers everything from reading and writing data to cleaning it, combining and merging it in various ways, doing complex calculations on it with aggregation and groupby operations, and working with time series and categorical data.
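
To give a flavor of what those operations look like in practice, here is a minimal sketch of a clean, merge, and groupby pipeline. The two tables and their column names are made up for illustration and do not come from the book.

```python
import pandas as pd

# Hypothetical toy data: one order has a missing amount.
orders = pd.DataFrame({
    "customer_id": [1, 2, 1, 3, 2],
    "amount": [20.0, 35.5, None, 12.0, 40.0],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "region": ["east", "west", "east"],
})

# Cleaning: drop rows where the amount is missing.
cleaned = orders.dropna(subset=["amount"])

# Merging: attach each customer's region to their orders.
merged = cleaned.merge(customers, on="customer_id", how="left")

# Aggregation: total order amount per region.
totals = merged.groupby("region")["amount"].sum()
print(totals)
```

Each step returns a new object rather than modifying the input, so the intermediate results stay available for inspection.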

The number and types of operations you can do on a data set with Pandas are pretty incredible, and that makes Pandas an excellent library to learn to use well. As McKinney says in the book,
During the course of doing data analysis and modeling, a significant amount of time is spent on data preparation: loading, cleaning, transforming, and rearranging. Such tasks are often reported to take up 80% or more of an analyst's time.
With all of that time spent on low-level data tasks, Pandas makes the life of a data scientist so much easier and more enjoyable. Data can be cleaned and transformed much more easily and reliably, and you can get down to making inferences about the data quickly.

Beyond covering all of the ins and outs of Pandas, McKinney sprinkles in a few good tips on other tools that can speed up your data analysis tasks. For instance,
If you work with large quantities of data locally, I would encourage you to explore PyTables and h5py to see how they can suit your needs. Since many data analysis problems are I/O-bound (rather than CPU-bound), using a tool like HDF5 can massively accelerate your applications.

Other than these scattered tips, the book is actually fairly dry and uninspiring. It reads a lot like the (excellent) online documentation for Pandas, but doesn't add too much more than that. Even most of the examples for different features are just drab randomly generated numbers with boring labels. You could just as easily read the online docs and get all of the same material. It may be a little nicer to have it all in book form so that you can sit down and focus on it, but that's a slight advantage. I was hoping for something more, that secret sauce that you sometimes find in books on software libraries, to make the book a greater value than just reading the online docs.

The book does have a chapter at the end that goes through some extended examples of data wrangling with publicly available data sets, which is a nice way of bringing everything together, but it's a small part of a large book. All in all, it's a no-nonsense, comprehensive exploration of the Pandas library, but not too much more than that. I wouldn't recommend it because there are better options out there that add something more than the online documentation can give you, like the next book.

Python Data Science Handbook


The Python Data Science Handbook covers most of what Python for Data Analysis does with somewhat less depth, but then goes much further into using Scikit-Learn to analyze data sets with machine learning techniques. The book is split into five large chapters, only the first of which delves into introductory minutiae by introducing the IPython interpreter. Thankfully, the book assumes you know Python already and doesn't bore the reader with another summary of lists, dicts, and comprehensions.

The next few chapters cover the use of NumPy, Pandas, and Matplotlib, and while the Pandas material is somewhat reduced from Python for Data Analysis, the Matplotlib material actually gets into the cartography drawing capabilities of this library. So, there are trade-offs in the number of topics covered in this book, as I would say the author gives more breadth while sacrificing some depth. The last chapter explores a good amount of Scikit-Learn with explanations and discussions of ten different machine learning models. This chapter added significantly to the book, grounding the features explored in the previous chapters with machine learning applications on real data sets of hand-written digits, bicycle traffic, and facial recognition. Seeing how different models performed better or worse in different applications was fascinating and enlightening.

The writing style of Jake VanderPlas was much more engaging as well. While reading the book, I felt like I was being guided by a mentor who wanted to make sure I understood the reasons behind different decisions, and why things should be done a certain way. While Python for Data Analysis focused on the "what" and "how" of programming with Pandas, the Python Data Science Handbook really addressed the "why" of data science programming, from explaining some of the reasons behind little decisions:
One guiding principle of Python code is that "explicit is better than implicit." The explicit nature of loc and iloc make them very useful in maintaining clean and readable code; especially in the case of integer indexes, I recommend using these both to make code easier to read and understand, and to prevent subtle bugs due to the mixed indexing/slicing convention.
to carefully describing the big issues with training machine learning models:

The general behavior we would expect from a learning curve is this: A model of a given complexity will overfit a small dataset: this means the training score will be relatively high, while the validation score will be relatively low. A model of a given complexity will underfit a large dataset: this means that the training score will decrease, but the validation score will increase. A model will never, except by chance, give a better score to the validation set than the training set: this means the curves should keep getting closer together but never cross.
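
The behavior described in that passage can be reproduced with a small experiment. This is only a sketch under my own assumptions: the data is a synthetic noisy sine wave, the model is a cubic polynomial fit with NumPy, the score is R², and the helper names are hypothetical. Scikit-Learn ships its own learning_curve utility in sklearn.model_selection, which is not used here so that the example depends on NumPy alone.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 is perfect, lower is worse."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def learning_curve_points(x_train, y_train, x_val, y_val, degree, sizes):
    """Fit a polynomial on growing subsets and score it on both sets."""
    train_scores, val_scores = [], []
    for n in sizes:
        coeffs = np.polyfit(x_train[:n], y_train[:n], degree)
        train_pred = np.polyval(coeffs, x_train[:n])
        val_pred = np.polyval(coeffs, x_val)
        train_scores.append(r2_score(y_train[:n], train_pred))
        val_scores.append(r2_score(y_val, val_pred))
    return np.array(train_scores), np.array(val_scores)

# Synthetic data (an assumption for illustration): noisy sine wave.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=700)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.5, size=700)
x_train, y_train = x[:500], y[:500]
x_val, y_val = x[500:], y[500:]

sizes = [4, 10, 30, 100, 500]
train_scores, val_scores = learning_curve_points(
    x_train, y_train, x_val, y_val, degree=3, sizes=sizes
)

# With only 4 points, a cubic passes through every training point, so the
# training score is essentially 1 while the validation score is much lower.
for n, tr, va in zip(sizes, train_scores, val_scores):
    print(f"n={n:4d}  train R^2={tr:7.3f}  validation R^2={va:7.3f}")
```

As the training set grows, the printed training score should drop from its perfect start while the validation score climbs toward it, which is the narrowing gap the quoted passage describes.
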
This conversationally instructive style was quite comfortable, and made the whole book an enjoyable read, even though the material was understandably complicated with a lot of different features and concerns to think about. VanderPlas helped it all go down easily. It was a lot to take in, but it was never overwhelming. He also had plenty of words of encouragement, knowing that when real problems with data arise, it could get discouraging:
Real-world datasets are noisy and heterogeneous, may have missing features, and may include data in a form that is difficult to map to a clean [n_samples, n_features] matrix. Before applying any of the methods discussed here, you must first extract these features from your data; there is no formula for how to do this that applies across all domains, and thus this is where you as a data scientist must exercise your own intuition and expertise.
It's easy to tell that I much preferred this book over Python for Data Analysis, and I would recommend anyone looking into data science and machine learning take a look at the Python Data Science Handbook. It's a great overview of the subject, and you'll be able to get up and running with Python quickly, experimenting with some real applications of machine learning, and learning some of the critical issues of feature engineering and model validation.
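
The loc and iloc advice quoted earlier in this section is easy to see in a tiny example. This is a minimal sketch with a made-up Series whose integer labels deliberately differ from its positions, which is exactly the case where label-based and position-based indexing are easy to mix up.

```python
import pandas as pd

# Hypothetical Series: integer labels 1, 3, 5 sit at positions 0, 1, 2.
data = pd.Series(["a", "b", "c"], index=[1, 3, 5])

# loc always uses the explicit index labels.
by_label = data.loc[1]           # label 1 -> 'a'
label_slice = data.loc[1:3]      # labels 1 through 3, inclusive -> 'a', 'b'

# iloc always uses the implicit integer positions.
by_position = data.iloc[1]       # position 1 -> 'b'
position_slice = data.iloc[1:3]  # positions 1 and 2 -> 'b', 'c'

print(by_label, by_position)
print(list(label_slice), list(position_slice))
```

The same number means two different things depending on the accessor, so spelling out loc or iloc removes the ambiguity for whoever reads the code next.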

Only the Beginning


These two books, Python for Data Analysis and Python Data Science Handbook, clearly only scratch the surface of machine learning. They teach you how to use the main Python libraries for data analysis and machine learning, but they don't go much further than that. There's a ton more stuff to learn about how to do machine learning well and what goes on under the hood in all of these various models. I've got my eye on more machine learning books like Python Machine Learning by Sebastian Raschka, Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron, and The Elements of Statistical Learning by Trevor Hastie et al., among many others. There's a vast amount of literature out there now on machine learning, covering everything from practical applications to the theoretical underpinnings of the models. Suffice it to say, this is only the beginning of the exploration.
