Four Useful Books for Learning Data Science

I was listening to an old episode of Partially Derivative, a podcast on data science and the news. One of the hosts mentioned that we’re now living in the “golden age of data science instruction” and learning materials. I couldn’t agree more. Each month, most publishers seem to have another book on the subject, and people are writing exciting blog posts about what they’re learning and doing. I wanted to outline a few of the books that helped me along the way, in the order I approached them.

Nudging and Data Science

I’ve recently been reading a great book on how people make decisions and what organizations can do to help folks make better choices. That book is Nudge. What is a nudge? The authors describe a nudge as anything that can influence the way we make decisions. Take the primacy effect, for instance, namely the idea that order matters in a series of items. We’re more likely to recall the first or last option in a list of items simply because of their positions.

You Probably Need a Database

When I see organizations using and talking about their data, they love to present the tools they’re using to handle and wrangle it. You’ve probably heard terms like Hadoop, Spark, Shark, PostgreSQL, MySQL, MongoDB, and, more rarely, Excel. (If you haven’t, there’s a good list to look up on Wikipedia.) I agree that taming data takes good tools, but I will argue that the tools you use depend on the scale of your data.

Getting Started With Tabular Data

I look at data every day. If I had to go back to a past version of myself to give him advice, I’d offer this: make it a rule to fit your data into a box. There are plenty of mathematical techniques out there for analyzing data, but to effectively apply them to your particular data, your data needs to fit the following format:

- Data consists of rows and columns.
- Your data should be viewable using any common spreadsheet application.
- Each row represents an instance of data (in other words, each row represents one object under study, be it a person, a spammy email, or a photograph with a face to be recognized).
- Each column represents a feature, or something that we can use to describe the instance (this could be a person’s height, the number of occurrences of the word “FREE” in a spammy email, or the length of a detected edge in a picture of a face).

When you encounter some new data, it’s best to strive to fit it into that framework.
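Here is a rough sketch of that box in Python, using only the standard library. The records, the column names (email_id, free_count, is_spam), and the to_csv helper are all made up for illustration; the point is only that each dictionary is one instance (a row) and each key is one feature (a column), and that the result is CSV text any spreadsheet application can open.

```python
import csv
import io

# Hypothetical data: each dict is one instance (row),
# and each key is one feature (column).
emails = [
    {"email_id": 1, "free_count": 3, "is_spam": True},
    {"email_id": 2, "free_count": 0, "is_spam": False},
]


def to_csv(rows):
    """Return the rows as CSV text: a header line, then one line per instance."""
    columns = list(rows[0].keys())  # assumes every row has the same features
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_csv(emails)
print(csv_text)
```

Saving csv_text to a file with a .csv extension is one way to check the "viewable in a spreadsheet" rule. If a record is missing a feature or carries an extra one, that is usually a sign the data does not fit the box yet.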