bfg-repo-cleaner
Removes large or troublesome blobs like git-filter-branch does, but faster. Written in Scala.
BFG Repo-Cleaner by rtyley: a simpler, faster alternative to git-filter-branch for deleting big files and removing passwords from Git history.
I found that BFG is much faster than the original git-filter-branch.
We have multiple SVN repositories to migrate into an even larger number of Git repositories, which involves merging and splitting some repository folders.
During the process I need to remove a set of root folders, and I'd like to remove them from the whole history.
I tried BFG's --delete-folders option and it works fine for a single folder, but I did not find a way to delete multiple folders. Is it even possible? Or do I have to call BFG in a loop, once for each folder I want to remove?
Thanks for any help.
Source: (StackOverflow)
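For the multi-folder question above, one possible approach is to combine the folder names into a single glob, in the same brace form that a --delete-files command further down this page uses ('{...}.sql'). The following is only a sketch: the helper name folders_glob, the folder names and the repository path are hypothetical, and it assumes BFG accepts brace alternatives for --delete-folders.

```shell
#!/bin/sh
# Build a single brace glob such as "{docs-old,vendor,tmp}" from folder names.
# Assumption: BFG matches --delete-folders against folder names (not paths)
# and accepts brace alternatives in its glob syntax.
folders_glob() {
  if [ "$#" -eq 0 ]; then
    echo "usage: folders_glob NAME..." >&2
    return 1
  fi
  if [ "$#" -eq 1 ]; then
    # A single name needs no braces.
    printf '%s\n' "$1"
    return 0
  fi
  glob=""
  for name in "$@"; do
    # Append with a comma separator once the list is non-empty.
    glob="${glob:+$glob,}$name"
  done
  printf '{%s}\n' "$glob"
}

# Hypothetical invocation (jar path, folder names and repo are placeholders):
#   java -jar bfg.jar --delete-folders "$(folders_glob docs-old vendor tmp)" my-repo.git
```

Quoting the result keeps the calling shell from expanding the braces before BFG sees them.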
Suppose I have a giant repo for an as-of-yet unpublished software product called "Hammerstein", written by the famous German software company "Apfel" of which I am an employee.
One day, "Apfel" spins out the Hammerstein division and sells it to the even more famous company "Oráculo", which renames "Hammerstein" to "Reineta" as a matter of national pride and decides to open source it.
Agreements mandate that all references to "Apfel" and "Hammerstein" be replaced by "Oráculo" and "Reineta", respectively, in the repository.
All filenames, all commit messages, everything must be replaced.
So, for example: src/core/ApfelCore/main.cpp must become src/core/OraculoCore/main.cpp.
The commit message that says "Add support for Apfel Groupware Server" must become "Add support for Oraculo Groupware Server".
The strings ApfelServerInstance* local_apfel, #define HAMMERSTEIN and Url("http://apfel.de") must become OraculoServerInstance* local_oraculo, #define REINETA, etc.
This applies to files that are not in HEAD anymore as well.
What is the simplest and most pain-free method to achieve it with minimal manual intervention (so that it can be applied in batch to a potentially large number of repositories/assets)?
- BFG can replace the strings, but it seems to only have a --delete-file option, not a --rename-file option, and even then it does not take patterns as an argument
- This approach seems to work only for HEAD and not for the whole history; I have had no luck using it with --tree-filter
Source: (StackOverflow)
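The substitutions described in the renaming question above can be sketched as a small stream filter. This covers only the text-mapping step; the function name rename_terms is hypothetical, and the three spellings handled (capitalised, lower case, upper case) are the ones that appear in the question's examples.

```shell
#!/bin/sh
# Map Apfel -> Oraculo and Hammerstein -> Reineta on stdin, preserving the
# three spellings seen in the examples. ASCII "Oraculo" is used, as in the
# question's target file names.
rename_terms() {
  sed -e 's/Apfel/Oraculo/g' \
      -e 's/apfel/oraculo/g' \
      -e 's/APFEL/ORACULO/g' \
      -e 's/Hammerstein/Reineta/g' \
      -e 's/hammerstein/reineta/g' \
      -e 's/HAMMERSTEIN/REINETA/g'
}

# git filter-branch passes each commit message to --msg-filter on stdin, so
# the same sed expressions could be inlined there (illustrative only, run on
# a throwaway clone):
#   git filter-branch --msg-filter \
#     'sed -e s/Apfel/Oraculo/g -e s/Hammerstein/Reineta/g' -- --all
```

Renaming paths and rewriting file contents across history would need an additional tree or index filter, which this sketch does not attempt.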
I have somehow deeply borked my entire repository (used only by me) and could use some assistance in sorting it out.
Here is what I did. I realized that my commit history contained some files with credentials that I did not want just lying around. So, I decided to be legit and try to use the BFG Repo-Cleaner to fix these issues. I added all the credential files to .gitignore, and moved on to trying to scrub them out of the history. Following the documentation, I executed these commands:
git clone --mirror myrepo.git
java -jar bfg.jar --delete-files stuffthatshouldbedeleted.txt myrepo.git
At this point, BFG told me that x number of files had been found and removed. Sweet.
cd myrepo.git
git reflog expire --expire=now --all
git gc --prune=now --aggressive
git push
According to the terminal logs, it updated the repo. So far so good, right? I pop into my GitHub account and, after a few clicks, find the credentials still there, file and all, in my history. I go back and try the same set of commands, but using this line instead of the file remover:
java -jar bfg.jar --replace-text passwords.txt myrepo.git
where passwords.txt is a file containing all the credential strings I would like gone. Again, the BFG logs indicate that it has fixed several instances. I push, check, and the credentials are still there, sitting on GitHub. I notice that the SHA-1 hashes of all my commits have changed, so presumably BFG did something, just not what I wanted it to do.
At this point, I give up and try to get back to work, figuring I'll sort it out later. I do some work, try to push, and get a weird merge conflict (you are 50 ahead and 50 behind on commits). What? I try to pull and merge, and suddenly every single commit in my git history is duplicated in name, and some of them are just blank. I check my GitHub network graph, and it looks like there is a second branch starting from my initial commit that exactly mirrors all of my commits and has been zippered in with my last commit (I have never branched, just been linearly chugging along).
I can't revert to a previous commit, because they are all chronologically duplicated. My credentials are still in there, with twice as many instances now, and my history is doubled and very confusing to read. When I run BFG from the beginning now, cloning and mirroring the repo anew, it tells me that there are no credentials in it, despite the fact that I can see them on GitHub. I could really use some help in understanding what happened and how, if at all, I can get back to a sane state.
I am considering just deleting the entire repo and starting anew. I really don't want to do that.
TL;DR: Tried using BFG, somehow duplicated half-baked versions of all commits in my repo, can't untangle them, and to add insult to injury, BFG did nothing and claims it has done its job.
Source: (StackOverflow)
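One way to check whether a credential string still appears anywhere in the history of a given clone, as in the situation above, is git's pickaxe search (git log -S). The sketch below is illustrative: the function name secret_in_history and its arguments are hypothetical, and it only inspects the repository it is pointed at, not what a hosting service displays.

```shell
#!/bin/sh
# Succeed (exit 0) if any commit reachable from any ref adds or removes an
# occurrence of the literal string $2 in the repository at $1.
secret_in_history() {
  [ -n "$(git -C "$1" log --all --oneline -S"$2")" ]
}

# Hypothetical usage against a freshly made mirror clone:
#   if secret_in_history myrepo.git 'the-credential'; then
#     echo "still present in history"
#   fi
```

Running it on a fresh clone rather than an old working copy avoids being misled by commits that only exist locally.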
My BitBucket repo had a whole bunch of large files in it that weren't needed. I removed them and then wanted to clear them out of the history to shrink the repo, which had gotten too big.
I ran BFG repo cleaner which reported 1755 files found and processed - all the ones I was expecting.
Ran the final git gc as instructed here: https://rtyley.github.io/bfg-repo-cleaner/
All fine - the .git folder shrunk to 17% of its original size.
Pushed it back up and the repo size as reported by BitBucket actually got larger!
Not sure what went wrong as all seemed to behave correctly up to that point.
Any advice gratefully received as I really don't want to have to recreate the repo to bring down the size.
Thanks
Source: (StackOverflow)
Very basic git question:
I uploaded some compromising information to Github and am using bfg to clean the repo. I followed the documentation and performed the following actions:
$ git clone --mirror git://example.com/some-big-repo.git
$ bfg --replace-text passwords.txt my-repo.git
I received the following output:
Found 233 objects to protect
Found 9 commit-pointing refs : HEAD, refs/heads/experimental, refs/heads/master, ...
Protected commits
-----------------
These are your protected commits, and so their contents will NOT be altered:
* commit 497fc1c8 (protected by 'HEAD')
Cleaning
--------
Found 80 commits
Cleaning commits: 100% (80/80)
Cleaning commits completed in 301 ms.
BFG aborting: No refs to update - no dirty commits found??
I'd like to see if the private information was cleared from my repo but I'm not sure how to check the files in the mirrored repo. Any ideas?
Source: (StackOverflow)
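A mirror clone is a bare repository, so there is no working tree to browse, but the stored files can still be listed and printed by ref. The helpers below are a minimal sketch with hypothetical names (list_files, show_file); the repository path, ref and file path are whatever applies to your clone.

```shell
#!/bin/sh
# List every path recorded at ref $2 in the (bare or normal) repository at $1.
list_files() {
  git -C "$1" ls-tree -r --name-only "$2"
}

# Print the content of path $3 as stored at ref $2 in the repository at $1.
show_file() {
  git -C "$1" show "$2:$3"
}

# Hypothetical usage:
#   list_files my-repo.git refs/heads/master
#   show_file  my-repo.git refs/heads/master config/settings.txt
```

Note that the BFG output quoted above ends with "no dirty commits found", which suggests that run did not rewrite anything, so the files would be expected to look unchanged.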
My repo is forked from an open sourced project, so I don't want to modify the commits before the ForkPoint tag. I've tried the BFG Repo Cleaner but it doesn't allow me to specify a range.
I want to
- Go through the history in ForkPoint..HEAD^
- Rewrite the commits to delete all files larger than 10M
How to remove unused objects from a git repository? says it should be something like this
BADFILES=$(find . -type f -size +10M -exec echo -n "'{}' " \;)
git filter-branch --index-filter \
"git rm -rf --cached --ignore-unmatch $BADFILES" ForkPoint..HEAD^
but wouldn't BADFILES only contain the files that exist in HEAD?
For instance, if I've mistakenly committed a HUGE_FILE and then later made another commit that removes that file, the BADFILES search wouldn't find the HUGE_FILE, since find doesn't see it in the current working tree.
Edit1: Now I'm considering using BFG on a clone, then moving my fork onto the original ForkPoint. Would these be the right commands, given fatRepo and slimRepo?
mkdir merger ; cd merger ; git init
git remote add fat ../fatRepo
git remote add slim ../slimRepo
git fetch --all
git checkout fat/ForkPoint
git cherry-pick slim/ForkPoint..slim/branchHead
Edit2: Cherry-picking didn't work because cherry-picking can't handle merges in slimRepo. Can I somehow crush down the history of slimRepo, and simply merge onto fatRepo/ForkPoint?
git <turn into a single commit> slim/rootNode..slim/ForkPoint
git checkout fat/ForkPoint
git merge slim/branchHead
Source: (StackOverflow)
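The concern in the question above, that find only sees files in the current working tree, can be addressed by asking git itself which blobs exist in a revision range. The following is a sketch under stated assumptions: the function name big_blobs is hypothetical, it must be run inside the repository, and paths containing unusual characters are not handled specially.

```shell
#!/bin/sh
# Print the paths of all blobs larger than $2 bytes that are reachable from
# the revision range $1 (for example ForkPoint..HEAD), including files that
# were later deleted and no longer exist in the working tree.
big_blobs() {
  git rev-list --objects "$1" |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    awk -v min="$2" '
      # Keep blobs above the threshold and strip the "type size " prefix.
      $1 == "blob" && $2 + 0 > min + 0 { sub(/^[^ ]+ [^ ]+ /, ""); print }
    ' |
    sort -u
}

# Hypothetical usage, 10 MiB threshold:
#   big_blobs ForkPoint..HEAD 10485760
```

The resulting list could then feed the git rm --cached --ignore-unmatch index filter shown in the question, instead of the find-based BADFILES.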
We are currently migrating our SVN repo to Git (hosted at Bitbucket).
I used SubGit to import all our branches and history into a bare repo that I have locally on my (Windows) PC.
The repo is quite big (7.42 GB after the import) because it also contains SVN information, such as revision numbers, to allow a two-way sync between Git and SVN (I'm only interested in a one-way SVN-to-Git migration).
I created a local clone of the imported bare repo and pushed all the branches to Bitbucket.
After a couple of hours (!) the repo was fully uploaded. Bitbucket then gave me warnings about the repo size. I checked the size and it was 1.1 GB. That's not as big as the imported bare repo, but still too big for a fast repository.
After playing around with BFG I managed to remove some large DLL/SQL export files using these commands on the bare repo (I only use the clone for pushing without all the SVN-related refs):
java -jar bfg.jar --delete-files '{''specialized 2015''','''specialized,''insert-pcreeks''}.sql' --no-blob-protection
java -jar bfg.jar --delete-files 'Incara.*.dll' --no-blob-protection Incara.git
git reflog expire --expire=now --all && git gc --prune=now --aggressive
This took a while, and afterwards the git_find_big.sh script no longer showed these large SQL files. But after pushing things back to Bitbucket (as a new repo, not as a force push) the repo only got bigger (1.8 GB).
Can you provide a possible explanation for this behavior?
I don't know if it matters, but we used a non-standard branch/tag model in SVN. This resulted in branches like /refs/heads/archive/some/path/to/branch. These branches seemed to work just fine, and removing them did not affect the size either.
In addition to these problems, I noticed some XML files showing up in the git_find_big.sh output:
size,pack,SHA,location
12180,1011,56731c772febd7db11de5a66674fe6a1a9ec00a7 repository/frontend.xml
12074,1002,0cefaee608c06621adfa4a9120ed7ef651076c33 repository/frontend.xml
12073,1002,a1c36cf49ec736a7fc069dcc834b784ada4b6a06 repository/frontend.xml
12073,1002,1ba5bd92817347739d3fba375fc42641016a5c1d repository/frontend.xml
12073,1002,e9182762bfc5849bc6645fdd6358265c3930779f repository/frontend.xml
12073,1002,dff5733d67cb0306534ac41a4c55b3bbaa436a2e repository/frontend.xml
12072,1002,8ee628f645ce53d970c3cf9fdae8d2697224e64c repository/frontend.xml
12072,1002,1266dee72b33f7a05ca67488c485ea8afc323615 repository/frontend.xml
These files contain the frontend logic of the web platform we are using and are indeed quite big.
But they should be treated as text, right? So I don't understand why they show up as separate objects in the above output. Am I right that this should not be happening?
The SVN import also resulted in some empty commits (for example when SVN creates or moves a branch it needs a new commit). I guess these can only be removed using filter-branch?
Sorry, I have a lot of questions!
Could someone help me with this?
Thanks,
Piet
Source: (StackOverflow)
After a coder added binary files indiscriminately, how can I slim down a git repository by removing not only the problematic files but also their history in the tree?
I tried using BFG, but because it works on a mirrored bare repository I had difficulty piecing together the whole workflow and needed to gather answers from different places on the web.
Source: (StackOverflow)
Reading the instructions for BFG Repo-Cleaner, the workflow seems to be:
- clone the repo using the --mirror option
- strip the repo from unwanted items using bfg
- use git gc to physically remove the items
- do a push of the cleaned repo
However, then it is unclear to me whether you need to remove your own copy of the working directory and do a fresh clone, or whether you can just do a pull to get the clean repo/history? At the moment I am the only one who uses the repo.
Source: (StackOverflow)
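For the workflow question above, a plain pull into an old working copy would merge the old history back in, which is the kind of duplication described in an earlier question on this page. One alternative to deleting and re-cloning is to fetch and hard-reset the existing clone onto the rewritten branch. This is only a sketch: the function name resync_clone is hypothetical, and a hard reset discards local commits and uncommitted changes, so it suits the single-user case described.

```shell
#!/bin/sh
# Make the clone at $1 match the rewritten remote branch origin/$2.
# WARNING: discards local commits and uncommitted changes on the current branch.
resync_clone() {
  git -C "$1" fetch -q origin &&
    git -C "$1" reset -q --hard "origin/$2"
}

# Hypothetical usage after the cleaned history has been pushed:
#   resync_clone ~/work/myrepo master
```

Old objects remain in that clone's object store until they are expired and garbage-collected, so a fresh clone is still the simplest way to be sure nothing old lingers.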
git filter-branch was taking a long time. Happily, I found BFG Repo-Cleaner.
But it is unexpectedly changing the contents of my last commit.
$ git clone --mirror example.com:/repo.git
$ cd repo.git
$ git log HEAD^!
commit 5f737d28756d4854d25899632abffe7cca2c7423
Author: Paul Draper <paul@example.com>
Date: Sat Jan 24 19:31:47 2015 -0700
Fix /contact and /folderEntries/listFoldersSimple
$ git diff --stat HEAD^!
cake/app/controllers/folder_entries_controller.php | 1 +
And now I clean.
$ java -jar ~/bfg-1.12.0.jar -b 1M
...
In total, 161797 object ids were changed. Full details are logged here:
...
$ git log HEAD^!
commit 3ff700cebe32497423435b416ea11169b7fcbf90
Author: Paul Draper <paul@example.com>
Date: Sat Jan 24 19:31:47 2015 -0700
Fix /contact and /folderEntries/listFoldersSimple
Former-commit-id: 5f737d28756d4854d25899632abffe7cca2c7423
$ git diff --stat HEAD^!
cake/app/controllers/folder_entries_controller.php | 1 +
.../lucidchart-tools/caja/ant-jars/guava-r09.jar | Bin 0 -> 1141964 bytes
.../caja/ant-jars/guava-r09.jar.REMOVED.git-id | 1 -
cake/app/lucidchart-tools/caja/ant-jars/js.jar | Bin 0 -> 1122370 bytes
.../caja/ant-jars/js.jar.REMOVED.git-id | 1 -
.../lucidchart-tools/caja/ant-jars/pluginc-src.jar | Bin 0 -> 5172676 bytes
.../caja/ant-jars/pluginc-src.jar.REMOVED.git-id | 1 -
.../app/lucidchart-tools/caja/ant-jars/pluginc.jar | Bin 0 -> 2959487 bytes
.../caja/ant-jars/pluginc.jar.REMOVED.git-id | 1 -
.../lucidchart-tools/caja/ant-jars/xercesImpl.jar | Bin 0 -> 1229125 bytes
.../caja/ant-jars/xercesImpl.jar.REMOVED.git-id | 1 -
cake/app/lucidchart-tools/jsdoc/rhino/js.jar | Bin 0 -> 1111429 bytes
.../jsdoc/rhino/js.jar.REMOVED.git-id | 1 -
cake/app/lucidchart-tools/selenium/chromedriver | Bin 0 -> 5778064 bytes
.../selenium/chromedriver.REMOVED.git-id | 1 -
.../selenium/selenium-server-standalone-2.37.0.jar | Bin 0 -> 34730734 bytes
...ium-server-standalone-2.37.0.jar.REMOVED.git-id | 1 -
.../selenium-server-standalone-2.42.2-mod.jar | Bin 0 -> 34873583 bytes
...server-standalone-2.42.2-mod.jar.REMOVED.git-id | 1 -
.../selenium/selenium-server-standalone-2.42.2.jar | Bin 0 -> 34823352 bytes
...ium-server-standalone-2.42.2.jar.REMOVED.git-id | 1 -
.../lucidchart-tools/test-runner-1.0-SNAPSHOT.jar | Bin 0 -> 9732125 bytes
.../test-runner-1.0-SNAPSHOT.jar.REMOVED.git-id | 1 -
.../CommandLine/Scaffolders/DefaultScaffolder.phar | Bin 0 -> 4404199 bytes
.../DefaultScaffolder.phar.REMOVED.git-id | 1 -
.../WebPICmdLine/Microsoft.Web.Deployment.dll | Bin 0 -> 1201991 bytes
.../Microsoft.Web.Deployment.dll.REMOVED.git-id | 1 -
cake/app/vendors/aws.phar | Bin 0 -> 6784935 bytes
cake/app/vendors/aws.phar.REMOVED.git-id | 1 -
.../tcpdf/fonts/dejavu-fonts-ttf-2.33/status.txt | 6657 +++++
.../status.txt.REMOVED.git-id | 1 -
cake/app/vendors/tcpdf/tcpdf.php | 28808 +++++++++++++++++++
cake/app/vendors/tcpdf/tcpdf.php.REMOVED.git-id | 1 -
.../img/onboarding-chart/04_shape manager.gif | Bin 0 -> 1413721 bytes
.../04_shape manager.gif.REMOVED.git-id | 1 -
cake/app/webroot/img/onboarding-chart/05_share.gif | Bin 0 -> 1341876 bytes
.../onboarding-chart/05_share.gif.REMOVED.git-id | 1 -
.../js/closure/usage/rhino/javadoc/index-all.html | 12027 ++++++++
.../rhino/javadoc/index-all.html.REMOVED.git-id | 1 -
cake/app/webroot/js/closure/usage/rhino/js-14.jar | Bin 0 -> 1471932 bytes
.../closure/usage/rhino/js-14.jar.REMOVED.git-id | 1 -
cake/app/webroot/js/closure/usage/rhino/js.jar | Bin 0 -> 1134765 bytes
.../js/closure/usage/rhino/js.jar.REMOVED.git-id | 1 -
.../js/closure/usage/rhino/testsrc/tests.tar.gz | Bin 0 -> 1778543 bytes
.../rhino/testsrc/tests.tar.gz.REMOVED.git-id | 1 -
cake/app/webroot/js/mathquill/font/Symbola.svg | 5102 ++++
.../js/mathquill/font/Symbola.svg.REMOVED.git-id | 1 -
.../webroot/js/templates/SoyToJsSrcCompiler.jar | Bin 0 -> 2154164 bytes
.../SoyToJsSrcCompiler.jar.REMOVED.git-id | 1 -
cake/app/webroot/persona-pages/img/gif-v3.gif | Bin 0 -> 1570363 bytes
.../persona-pages/img/gif-v3.gif.REMOVED.git-id | 1 -
.../webroot/persona-pages/img/interactive-gif.gif | Bin 0 -> 1434134 bytes
.../img/interactive-gif.gif.REMOVED.git-id | 1 -
cake/build/closure/compiler.jar | Bin 0 -> 6007184 bytes
cake/build/closure/compiler.jar.REMOVED.git-id | 1 -
.../lucidchart-mobile-sliders-landscape-4.png | Bin 0 -> 1718536 bytes
...t-mobile-sliders-landscape-4.png.REMOVED.git-id | 1 -
.../lucidchart-mobile-sliders-portrait-4.png | Bin 0 -> 1614308 bytes
...rt-mobile-sliders-portrait-4.png.REMOVED.git-id | 1 -
.../Versions/A/OCHamcrestIOS | Bin 0 -> 3671740 bytes
.../Versions/A/OCHamcrestIOS.REMOVED.git-id | 1 -
.../OCMockitoIOS.framework/Versions/A/OCMockitoIOS | Bin 0 -> 1299132 bytes
.../Versions/A/OCMockitoIOS.REMOVED.git-id | 1 -
.../Versions/A/CrashReporter | Bin 0 -> 1432156 bytes
.../Versions/A/CrashReporter.REMOVED.git-id | 1 -
chart-ios/libFlurry_6.0.0.a | Bin 0 -> 3819300 bytes
chart-ios/libFlurry_6.0.0.a.REMOVED.git-id | 1 -
67 files changed, 52595 insertions(+), 33 deletions(-)
All of these extra files are ones that I want removed.
Why are all these files being changed in my latest commit?
Source: (StackOverflow)
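The diff above is consistent with the HEAD protection that BFG reports in another question on this page ("These are your protected commits, and so their contents will NOT be altered"): if earlier commits are cleaned but HEAD keeps its files, HEAD appears to re-add them. As a sketch for spotting such files before running BFG, the helper below lists blobs in HEAD above a size threshold; the name big_in_head is hypothetical and it must be run inside the repository.

```shell
#!/bin/sh
# Print paths in HEAD whose blob size exceeds $1 bytes.
# git ls-tree -l prints "<mode> <type> <sha> <size>", a tab, then the path.
big_in_head() {
  git ls-tree -r -l HEAD |
    awk -F'\t' -v min="$1" '{
      split($1, meta, " ")          # meta[2] = object type, meta[4] = size
      if (meta[2] == "blob" && meta[4] + 0 > min + 0) print $2
    }'
}

# Hypothetical usage, matching the 1M threshold of the run above:
#   big_in_head 1048576
```

Files listed this way could be removed in an ordinary commit first, so that the protected HEAD no longer contains them when BFG runs.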
Bitbucket is warning me that my Git repository is over 1 GB. Actually, the Repository details page says it is 1.7 GB. That's crazy. I must have included large data files in version control. My local repository is in fact 10 GB, which means that I have at least been using .gitignore successfully, to some extent, to exclude big files from version control.
Next, I followed the tutorial here https://confluence.atlassian.com/display/BITBUCKET/Reduce+repository+size and tried to delete unused large data. The command git count-objects -v at the top level folder of my repo returned the following:
returned the following:
count: 5149
size: 1339824
in-pack: 11352
packs: 2
size-pack: 183607
prune-packable: 0
garbage: 0
size-garbage: 0
The size-pack 183607 KB is much smaller than 1.7 GB. I was a bit perplexed.
Next I downloaded the BFG Repo-Cleaner https://rtyley.github.io/bfg-repo-cleaner and ran the command java -jar bfg-1.12.3.jar --strip-blobs-bigger-than 100M at the top level directory to remove files bigger than 100 MB from all commits except the latest. However, BFG returned the following message:
Warning : no large blobs matching criteria found in packfiles
- does the repo need to be packed?
Repeating the same for 50M resulted in the same.
Does this mean that all the files larger than 50 MB are in the latest commit? In Source code browser in Bitbucket, I looked at folders that contain large data files but those files are not included (successfully ignored).
Could anyone explain briefly what is the source of confusion about the repository size and existence of large files in the repo?
Source: (StackOverflow)
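When reading the git count-objects -v output quoted above, it may help to note that both size (loose objects) and size-pack (packed objects) are reported in KiB, so the repository's object storage is their sum rather than size-pack alone. A small sketch that totals the two follows; the function name total_mib is hypothetical.

```shell
#!/bin/sh
# Read `git count-objects -v` output on stdin and print the combined size of
# loose ("size") and packed ("size-pack") objects in whole MiB.
total_mib() {
  awk -F': ' '
    $1 == "size" || $1 == "size-pack" { kib += $2 }
    END { print int(kib / 1024) }
  '
}

# Hypothetical usage inside a repository:
#   git count-objects -v | total_mib
```

With the figures from the question (size 1339824 and size-pack 183607) this gives 1487 MiB, most of it in loose objects, which may be why BFG asked whether the repo needs to be packed.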
I have used the BFG Repo-Cleaner to remove a large file from a git repository:
java -jar ../bfg-1.11.8.jar --delete-folders escrow application.git
cd application.git
git reflog expire --expire=now --all
git gc --prune=now --aggressive
cd ..
mkdir clone
cd clone
git clone file:///home/damian/temp/TCLIPG-4370/test/application.git
I used this script (http://stubbisms.wordpress.com/2009/07/10/git-script-to-show-largest-pack-objects-and-trim-your-waist-line/) to check my repository before and after running BFG Repo-Cleaner. It shows that the escrow directory has been removed, and the cleaned repository is also smaller than the original.
Everything looks ok, but how can I verify that all my commits are the same? Would I have to create a script with git-for-each-ref and compare the commits, with the same name, in the two repositories, to verify that BFG has worked correctly?
Any suggestions would be greatly appreciated.
Source: (StackOverflow)
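One rough way to compare the original and cleaned repositories from the question above is to print a per-commit fingerprint that should survive the rewrite (author, author timestamp and subject line) for the same ref in both and compare the results. This is a sketch, not a complete verification: the function name history_fingerprint is hypothetical, and it deliberately ignores commit ids, trees and message bodies, all of which are expected to change (BFG output elsewhere on this page shows a Former-commit-id line being appended to messages).

```shell
#!/bin/sh
# Print "author|author-timestamp|subject" for every commit reachable from
# ref $2 in the repository at $1, newest first.
history_fingerprint() {
  git -C "$1" log --format='%an|%at|%s' "$2"
}

# Hypothetical usage: compare the same branch before and after cleaning.
#   history_fingerprint original.git    refs/heads/master > before.txt
#   history_fingerprint application.git refs/heads/master > after.txt
#   cmp before.txt after.txt && echo "same commit sequence"
```

Looping over the output of git for-each-ref, as the question suggests, would extend the same check to every branch and tag.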
I have cleaned my repo with BFG Repo Cleaner using the following procedure:
$ git clone --mirror git://example.com/some-big-repo.git
$ java -jar bfg.jar --strip-biggest-blobs 500 some-big-repo.git
$ cd some-big-repo.git
$ git reflog expire --expire=now --all
$ git gc --prune=now --aggressive
$ git push
I can see that my local repo has shrunk by 1 GB. Great. The problem now, which I haven't been able to find any information on, is that I would also like to shrink the GitHub repo. How can I achieve this?
git push didn't work, and I also tried git push origin --force --all which gave me this error message: error: --all and --mirror are incompatible
Source: (StackOverflow)
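The error quoted above can be explained by how the clone was made: git clone --mirror records remote.origin.mirror = true in the clone's configuration, so git push already acts as a mirror push and refuses an additional --all. The check below is a minimal sketch; the function name is_mirror_remote is hypothetical.

```shell
#!/bin/sh
# Succeed if remote $2 (default: origin) of the repository at $1 is
# configured as a mirror, as `git clone --mirror` sets it up.
is_mirror_remote() {
  [ "$(git -C "$1" config --get "remote.${2:-origin}.mirror")" = "true" ]
}

# Hypothetical follow-up for a mirror clone: force the push without --all.
#   is_mirror_remote some-big-repo.git && git -C some-big-repo.git push --force
```

Whether the hosted repository's reported size drops after such a push also depends on the host's own garbage collection, which this sketch cannot influence.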
How do I delete only one directory using BFG?
The help says:
delete folders with the specified names (eg '.svn', '*-tmp' - matches on folder name, not path within repo)
Which seems to mean that --delete-folders "config" will match all folders named config, anywhere in the repository.
Source: (StackOverflow)
When using BFG Repo-Cleaner, is there any way to avoid having everyone do a fresh clone? With a large team and multiple branches, this is difficult to organize. I am willing to run BFG multiple times should something be reintroduced, as long as I don't have to make everyone re-clone the repo.
I'm thinking of removing the files (i.e. private keys) from history, adding them to the .gitignore file, pushing, and having the team rebase their branches.
Hoping Roberto Tyley sees this and can offer some advice.
Cheers!
Source: (StackOverflow)