Tuesday, July 7, 2020

Large Files and GitHub

There’s one challenge anyone working in data analysis will encounter at some point in their work: GitHub’s file size limit, which is 100 MB per file. Due to the nature of the work, large files are par for the course (I’m looking at you, half-gig CSV files), and GitHub is an industry standard for just about everyone. How can analysts reconcile files that are too large with storage that is too small? In this post, I’ll cover four of the most common approaches to dealing with the conundrum:

  1. Store the file using Git Large File Storage (Git LFS)
  2. Access the database without saving it locally
  3. Cut your large file into smaller files
  4. Don’t push the file to GitHub (by use of the .gitignore file)

1. Git Large File Storage (LFS)

Your best option for preserving the integrity of your project and your commit history is to make use of Git LFS. It’s easy to declare which files you want tracked, and you can then continue working normally within the git workflow you’re already familiar with. When you commit and push, Git LFS intercepts the designated files and migrates them to the LFS server, leaving pointers in your GitHub repository that reference those files on the LFS server. After installing Git LFS on your local machine, you only need three commands to initialize LFS in the desired repository and track all CSV files therein, shown below. Execute these in each local repository where you plan to use Git LFS:

$ git lfs install
$ git lfs track "*.csv"
$ git add .gitattributes

And you’re done. It’s just that easy. Now you can commit and push like you normally would, and your data will be stored and connected to the repository. You can also see which files LFS is tracking with the following command:

$ git lfs ls-files

Lastly, if you’ve already committed the file to your repository, you can use git lfs migrate import to move the file into LFS and rewrite it out of your git history.

There are two potential drawbacks to this option: 1) LFS itself has a storage and bandwidth ceiling, which can be raised by paying $5/month/data pack, a data pack being 50 GB of storage and 50 GB of bandwidth, and 2) anyone who clones the repository also needs Git LFS installed, or they’ll end up with the small pointer files instead of the actual data.

2. Access Data Without Saving It Locally


This option will only be viable for some projects, namely those where data can be accessed remotely via API/(private?) remote server/conjuring, and ideally where it’s data you have some control over, or data that exists in a relatively fixed state. Rather than saving the data you need as a file and then writing your code around that local file, query the data remotely, in code, inline, and load the results into your list/array/DataFrame. Don’t save it in a file on your side at all.
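As a minimal sketch of this approach in Python, using only the standard library (the URL and column names here are hypothetical, not from any real dataset), you can pull a remote CSV straight into memory instead of saving it to disk:

```python
import csv
import io
from urllib.request import urlopen


def parse_csv_text(text):
    """Parse CSV text already held in memory into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def load_remote_csv(url):
    """Fetch a CSV over HTTP and parse it without writing anything to disk."""
    with urlopen(url) as resp:
        return parse_csv_text(resp.read().decode("utf-8"))


# In a real project you would call, e.g.:
# rows = load_remote_csv("https://example.com/data/prices.csv")  # hypothetical URL
```

With Pandas installed, the same idea is a one-liner, since pandas.read_csv accepts a URL directly; either way, nothing large ever lands in your repository.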

I’m calling this approach the Indiana Jones approach, because it feels very caution-thrown-to-the-wind for a lot of reasons.

For one, if the data is migrated, or if your database undergoes a schema migration (that is, if the way you interact with the database/file to get that data changes), you must rewrite your code to get the same information you started with, no negotiations.

For seconds, if the database/file goes offline, all subsequent code is useless until it comes back online.


For thirds, if the data changes, the data changes, including, potentially, the trends you initially sussed out and regressed and drew conclusions from.

In essence, you are trusting in whatever higher power you believe in that the data will be the way it was when you left it, in every way, shape, and form: the URL is the same, and the API is the same, and it’s still, for instance, a MySQL database, and all the features are the same, and all the data belonging to those features is the same. Maybe you own this external place where the data exists, and then you can make your own bets with destiny and fate and nature, but otherwise, you are asking a question the universe has already answered:

No man ever steps in the same river twice, for it’s not the same river, and he’s not the same man.

Heraclitus, bemoaning the loss of his beloved, a 5 gig DB of stock values

You stand a huge risk of your data not being the same when you come back to it, in which case you will have to significantly alter your code.

3. Make Smaller Files

Another option would be to make smaller files out of your big file. To use a Python example, you could import your data (in a separate place from your Jupyter Notebook), save it as a Pandas DataFrame, cut that into smaller DataFrames, and export each of the smaller ones as a separate file. Then you can delete the very large file that was creating the problem in the first place.
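A minimal sketch of that splitting step, using only the standard library so the idea is visible without Pandas (the chunk size and file naming are arbitrary choices, not anything GitHub requires):

```python
import csv
import os


def split_csv(path, rows_per_chunk, out_dir):
    """Split one large CSV into numbered chunk files, repeating the header in each."""
    os.makedirs(out_dir, exist_ok=True)
    out_paths = []
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        chunk, idx = [], 0
        for row in reader:
            chunk.append(row)
            if len(chunk) == rows_per_chunk:
                out_paths.append(_write_chunk(out_dir, idx, header, chunk))
                chunk, idx = [], idx + 1
        if chunk:  # leftover rows that didn't fill a full chunk
            out_paths.append(_write_chunk(out_dir, idx, header, chunk))
    return out_paths


def _write_chunk(out_dir, idx, header, rows):
    out_path = os.path.join(out_dir, f"chunk_{idx:03d}.csv")
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return out_path
```

With Pandas, the equivalent is slicing with df.iloc[i:i + n] in a loop and calling to_csv on each slice.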

Effective, but potentially a less elegant solution: you’ll have more file clutter, and do you really want data that logically belongs together as a group to be separated? This strategy could also fail to solve your problem, because even though each piece now fits under the per-file limit, the repository as a whole stays just as big: GitHub places a hard limit on repository size at 100 GB, and it encourages users to keep repositories under 1 GB.

4. Don’t Push The File to GitHub (.gitignore)

After some consideration, you may decide that you don’t need this large file to be sent up to GitHub at all. There are three situations you might find yourself in, depending on what stage of stage/commit/push you’re at: (1) you haven’t committed the file, (2) you’ve committed the file but not pushed, and (3) you’ve committed and pushed the file to GitHub.

If you find yourself in the first (1) case, simply add the file to the repository’s .gitignore file using whatever editor you like, or just use

$ echo "big_file.sql" >> .gitignore

If, as in case (2), this is a file you’ve already committed to the project, but haven’t pushed to the remote repository, you can remove it by clearing the cache and then adding the file to .gitignore.

$ git rm --cached big_file.sql
$ echo "big_file.sql" >> .gitignore

But if you’re in case (3) and you’ve committed the file and pushed to the remote repository, you’ll need to A) clean up the repository’s git history using the git filter-branch command, B) add the file to .gitignore, and finally C) force push those changes. For example, if you wanted to get rid of big_file.sql located at Users/me/myproject/big_file.sql, you would need to

A) Execute git filter-branch

$ git filter-branch --force --index-filter \
  "git rm --cached --ignore-unmatch big_file.sql" \
  --prune-empty --tag-name-filter cat -- --all

(Note that the path passed to git rm is relative to the repository root, so Users/me/myproject/big_file.sql becomes just big_file.sql when the repository lives at Users/me/myproject.)

B) Add the file to .gitignore

$ echo "big_file.sql" >> .gitignore
$ git add .gitignore
$ git commit -m "Add big_file.sql to .gitignore"

C) Force push those changes

$ git push origin --force --all


Bring it in for the TL;DR: a quick list of pros and cons for each strategy for dealing with GitHub’s file size limit.



Git LFS

  • Pro: Integrates with Git workflow
  • Pro: Keeps data with project
  • Con: 1 GB limit on storage and bandwidth each
  • Con: $5/50 GB additional

Interact with API/Database

  • Pro: The freshest data?
  • Pro: You don’t have to download Git LFS
  • Con: Lack of control of state of data
  • Con: Threatens integrity of project

Make Smaller Files

  • Pro: Keeps data with project
  • Con: You have to cut up files
  • Con: Might exceed GitHub repository size limit anyway

Don’t push to GitHub (.gitignore)

  • Pro: Forgo the issue entirely
  • Con: Data not attached to project on GitHub

Source and Credit: https://datalingo.wordpress.com/2020/03/21/large-files-and-github/
