Data science (I?)

I’m not sure if I’ll actually blog this book in detail – I might, later on, but for now I’ll just cover it extremely lazily, by adding links to topics covered which I figured I wanted to include in this post.

The book is ‘okay’ – it’ll allow (relatively) non-technical (management) people to at least begin to understand what sort of tasks the more technical guys are spending time on (and how to prioritize critical resources, and engage with the nerds!), and it might also give the data guys a few more tools to use when confronted with a specific issue. I really liked the book’s emphasis on conceptualizing data as a strategic asset. On the other hand, I imagine some parts of the book will often be close to painful to read for people who have spent at least a few semesters dealing with stats-related topics in the past: this is the sort of book which is at least in part written for people who might not be completely clear on what a statistical hypothesis test is, which discusses text mining without at any point even mentioning the existence of regular expressions, and which discusses causal evaluation without mentioning topics like IV estimation.

Although there are some major gaps, the level of coverage is not really all that bad; I hope to refer to at least some of the more technical material included in the book in my future work, but it’s not clear at this point how relevant this stuff will actually end up being long-term.

Links (…in random order, I did not have the book in front of me as I was writing this post so this is just a collection of links/topics I could recall being potentially worth including here):

Training, validation, and test sets
Cross-validation (statistics)
Statistical classification
Tree model
Decision tree pruning
Random forest
Naive Bayes classifier
Data mining
Zipf’s law (not covered, but relevant to some parts of the coverage)
Nearest neighbor search
Cluster analysis
Jaccard index
Bias–variance tradeoff
Hierarchical clustering
Boosting (machine learning)
Ensemble learning
Feature (machine learning)
Feature selection
Curse of dimensionality
Regularization (mathematics)
Association rule learning
Labeled data
Dimensionality reduction
Supervised learning/Unsupervised learning
Model selection
Rubin causal model (not covered, but relevant to some parts of the coverage)
Regression discontinuity design (not covered, but relevant to some parts of the coverage)
Lift (data mining)
Receiver operating characteristic
Stepwise regression
Grid search (hyperparameter optimization).


October 4, 2019 - Posted by | Books, Mathematics, Statistics
