Dealing With Monster Data

So my data in its raw form is very large … years and years with millions of contracts each and hundreds of fields of data on each contract.

But, of course I don’t need all that information. I just want average monthly contracting expenditures. It’d be super helpful of course if DoD just provided this, but no such luck: My choices are yearly (too broad) or what this contract-level data (too narrow, but I could work with it).

That’s a pretty big task in general, to turn billions of data points into a couple hundred, and becomes more irritating with multi-gigabyte files. Some years of data were so large that Excel simply refuses to open them, and even those I could seemed to set my laptop back about 30 years (http://www.usatoday.com/story/tech/gaming/2015/01/08/oregon-trail-internet-archive/21450885/).

So, here was my solution, and hopefully, if anyone else is dealing with huge data, here are some things that may help you:

  • You can split .csv files (or really most anything) via Terminal (if you have a Mac). It’s pretty simple and turns multi-gigabyte files into megabytes, which will prevent your laptop from crashing. Here’s an easy how to: http://monchito.com/blog/autosplit-csv
  • From those smaller files, it’s then an easy copy/paste to recreate another large book with only the raw data you need in Excel, knocking the extraneous fields (Contractor fax number?).
  • The final step would be then to create a *final* Excel document with only the condensed data you need.

Hope this helps someone. Good luck all!

This entry was posted in Uncategorized. Bookmark the permalink.

Leave a Reply

Your email address will not be published. Required fields are marked *