2 Introduction - The Choice that Doesn’t Matter

The very first (and intimidating) choice a novice hacker faces is which programming language to learn. Unfortunately, the medium popularly summed up as the internet offers a lot of really, really good advice on the matter. The problem, however, is that this advice does not necessarily agree on which language is best for research. In the realm of data science – get accustomed to that label if you are a scientist who works with data – the debate basically comes down to two languages: the R Language for Statistical Computing and Python.

At least to me, there is only one valid piece of advice: It simply does NOT matter. If you stick around in data science long enough, you will eventually get in touch with both languages and, in turn, learn both. There is a huge overlap in what you can do with either of those languages. R came out of the rather specific domain of statistics 25+ years ago and made its way to a more general programming language thanks to 15K+ extension packages (and counting). Built by a mathematician, Python continues to be as general purpose as it’s ever been. But it got more scientific, thanks to extension packages of its own such as pandas, SciPy or NumPy. As a result, there is a huge overlap in what both languages can do, and both will extend your horizon in unprecedented fashion if you have not used a full-fledged programming language for your analysis before.

R: “dplyr smokes pandas.” Python: “But Keras is better for ML!” Language wars can be entertaining, sometimes spectacular, but most of the time they are just useless…

But why is there such a heartfelt debate online if it doesn’t matter? Let’s pick a random argument from this debate: R is easier to set up and Python is better for machine learning. If you have worked with Java or another environment that’s rather tricky to get going, you are hardened and might not cherish easy onboarding. If you got frustrated before you really started, you might feel otherwise. You may just have been unlucky making guesses about a not so well documented paragraph while trying to reproduce a nifty machine learning blog post, just because you installed the wrong version of Python or didn’t manage to make sense of virtualenv right from the beginning.

The point is, rest assured: if you just start doing analytics using a programming language, both languages are guaranteed to carry you a long way. There is no way to tell for sure which one will be the more dominant language 10 years from now, or whether both will still be around holding their ground the way they do now. But once you have reached a decent software carpentry level in either language, it will help you a lot in learning the other. If your peers work with R, start with R. If your close community works with Python, start with Python. If you are in for the longer run, either language will help you understand the concepts and ideas of programming with data. Trust me, there will be a natural opportunity to get to know the other.

2.1 Why Should a Researcher Work Like a Software Engineer?

First of all, because everybody and their grandmothers seem to do it. Statistical computing continues to be on the rise in many branches of research.

Source code can be a tremendously sharp, unambiguous and international communication channel.

Second, because it’s reproducible. Code has become a tremendous communication channel. Your web scraper does not work? Instead of reaching out in a clumsy but wordy cry for help, posting the source code of what you have tried so far will often get you good answers within hours on platforms like Stack Overflow or Cross Validated. Or think of feature requests: after a little code ping pong with the package author, your wish eventually becomes clearer. Let alone chats with colleagues and co-authors. Sharing code just works. Academic journals have found that out in the meantime, too. Many outlets require you to make the data and source code behind your work available. Social Science Data Editors is a bleeding edge project at the time of writing this, but is already referred to by top notch journals like American Economic Review (AER).

Third, because it scales and automates. Automation is not only convenient, like when you want to download data, process it, create the same visualization and put it on your website any given Sunday. Automation is also inevitable, like when you have to gather daily updates from different outlets or work through thousands of .pdfs.
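To make the scaling argument a bit more tangible, here is a minimal sketch in Python. It assumes a hypothetical folder of CSV files (say, daily downloads) that all share a numeric column called `value`. The function name `summarize_folder`, the folder layout and the column name are made up for illustration and are not taken from any particular project.

```python
import csv
from pathlib import Path
from statistics import mean


def summarize_folder(folder, column="value"):
    """Return one summary row per CSV file found in `folder`.

    `column` is the (assumed) name of a numeric column present in every file.
    """
    rows = []
    # sorted() keeps the processing order stable across runs
    for path in sorted(Path(folder).glob("*.csv")):
        with path.open(newline="") as handle:
            values = [float(record[column]) for record in csv.DictReader(handle)]
        rows.append({
            "file": path.name,
            "n": len(values),
            # an empty file has no mean, so we record None instead
            "mean": mean(values) if values else None,
        })
    return rows
```

Calling `summarize_folder("downloads")` on such a folder returns one small dictionary per file. Whether the folder holds three files or three thousand, the code stays the same, which is the whole point.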

Last but not least, because of things you couldn’t do without being an absolute guru (if at all) if it weren’t for programming. Take visualization. Go, check these D3 Examples. Now, try to do that in Excel. If you managed to do these things in Excel, it’d make you an absolute spreadsheet visualization Jedi, probably at the expense of other time-consuming skills you could have mastered. The moral of the story is, with decent, carpentry level programming skills – that’d be the upfront investment – you can already do so many spectacular things while not really specializing and staying very flexible.

2.2 How to Read this Book?

Hacking for Social Sciences is written based on the experience of helping students and seasoned researchers of different fields with their data management, processing and communication of results. A part of the book contains the information I wish I had when I started a PhD in economics. Part of the book is written years after said PhD was completed and with the hindsight of 10+ years in academia. Every page of the book is written with the belief that the future is OPEN and it is up to our generation of researchers to shape it.

“The ministry warns: The future is open,” taken from a 2020 ad campaign on Open Access by the German ministry for education and research (pdf).

If you came to cherry pick, you’re welcome, too.

2.3 Requirements - What You Need to Know to Make the Most of This Book

Though the book offers many entry points to the reader and strives to make advanced considerations accessible, certain prior knowledge will provide readers with a kickstart and help avoid frustration. Prior experience with a scripting language such as the R Language for Statistical Computing, Python or JavaScript, as well as familiarity with console / terminal basics, will help you make the most of the book.

Though Programming with data does contain its share of easy to follow introductory examples, there are definitely better resources for a systematic introduction to any of the scripting languages mentioned above. That is, this book does not intend to be a systematic introduction to R or Python, because there is so much good material out there already.