Advice for Getting into Data

Last week one of my friends wrote me an email looking for advice.  Here’s the body of the email:

is there a basic way to teach oneself basic data management or even just database access skills? Like SQL?

Don’t wanna learn to code, just to look at DATA

I’ve included my response to them below in case it’s useful to anyone else:

Okay, if you’re looking to get into data and research, there are a few things to think about — it’s a pretty big world and there are lots of different areas that you could focus on depending on what’s most interesting to you.

In general, when it comes to “doing research with data” there are a few different skills that might be required:

  • Data summarization and visualization: Aggregating data into a way that makes it interpretable and constructing visualizations that communicate information about the data to others
  • Statistical skills: Making inferences from data is something humans do very naturally, but unfortunately it’s also a thing that we frequently do incorrectly because our intuitions very often lead us astray. Being familiar with basic statistical concepts is critical to both not looking like a fool when drawing lessons from data as well as for evaluating the work of others
  • Data organization and cleaning: often, the hardest part of a data analysis project is actually obtaining the relevant data and then getting it into a format that’s usable. In practice, you will never be presented with a data set that’s “clean” and ready to analyze with some summary statistics — you’ll need to combine multiple datasets together, remove duplicate records, and decide what to do with data that are partially missing.

When it comes to the fastest way to get started, Excel is probably your best bet. It’s ubiquitous, especially in government and it’s really powerful. If you can get a dataset in CSV format you can generally open it up in excel and start to get a sense of what’s going on really quickly. Here you’re going to want to focus on how to use pivot tables effectively which will allow you to flexibly look at different summaries of the data very quickly.

Excel can do a lot of things, but unfortunately, some people try to push it too far (see the example from the British government using one excel file to track their Covid case counts and having it break when it got too big).

Once you want to start working with data that are “bigger” (i.e., more than a few hundred thousand rows) or more complicated you’re going to want to start branching out into tools outside of excel. Depending on what you’re trying to do, you might go a few different directions:

  • SQL is the lingua-franca of data analytics and is a powerful tool. However, it’s most useful when you’re working inside of an organization that uses a SQL database to manage their data — government agencies, for example, generally use SQL databases internally, but they never make direct access to those databases available to external users — so if you’re looking to work for a watchdog group that’s doing a lot of FOIA requests, SQL may not help you if you don’t have a way to get the data that you’re working with into a database.
  • R & Python are the next two on the list of tools for working with data. R was originally built as a language for doing statistical computing but has now developed to the point where it’s a great all-purpose tool for data cleaning and management (plus statistics and data visualization).  Python is a general-purpose programming language but there are a number of libraries that have been developed for working with data that make it a great choice as well.

Here’s where I’ll pause to say that you shouldn’t be scared of “learning to code” when it comes to these tools. Trust me, just because you know a little bit of SQL and/or python, no one is going to make you become a software engineer! In fact, I often tell people that build complex excel workbooks that they are in fact building software and writing code, just in a tool that makes it difficult for other people to review the work. What is an index(match(),match()) function in excel if not code?

So, that being said, learning just-enough SQL, R, and Python to achieve what you want to do is not actually that hard, and shouldn’t feel like a step-change from other analysis techniques, but more like a continuous extension of other data skills. It doesn’t take much to appreciate how much better it is to build charts in ggplot than to build them in excel. Through code, you can make your analyses repeatable and easily modifiable in a way that unlocks a step-change in productivity for folks working with data.

There are tons of (free) online courses to teach you all of these technical skills — I’d just dive in and start figuring out what you like and are interested in.

When it comes to the statistical side, that might be both more important and more difficult to learn. What you want to do is to get a grasp of the basics of the sorts of statistical analyses that scientists do in the real world (what’s a p-value, what’s a confidence interval, how to interpret regression coefficients) and (more importantly) what the implicit assumptions those analyses are making and how those can go wrong. Things like “correlation is not causation” is trivial-sounding but very important, and sufficiently subtle that in practice real scientists make this error frequently. I don’t know of any great resources to learn this from scratch, but a few minutes of online googling turned up the following which seem like plausibly useful introductions:

Okay, that’s a lot of info but hopefully a start for you. Let me know what questions you have!

About the author

By michael