Recovering From a Bungled Git LFS Migration
As soon as I realized I had made a mistake while migrating away from Netlify’s Git LFS-based Large Media, I created an Org file to organize the recovery effort. LFS works by replacing files with pointers. I had these pointers going all the way back to when I first enabled the feature, but I was missing many files from older commits. In the end, getting back to a working repository took many failed attempts at automation with PowerShell, quite a bit of manual conflict resolution, and writing off a fair bit of data as irrecoverable.
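For readers unfamiliar with the format, an LFS pointer is a small text file that records the hash and size of the real file. The Python sketch below is an illustration rather than part of the recovery scripts: it parses a pointer following the version 1 layout of the Git LFS specification. The `LfsPointer` type and `parse_pointer` function are hypothetical names, and the `size` value in the example is invented.

```python
from typing import NamedTuple


class LfsPointer(NamedTuple):
    oid: str   # SHA-256 digest of the real file's contents
    size: int  # size of the real file in bytes


def parse_pointer(text: str) -> LfsPointer:
    """Parse the text of a Git LFS pointer file (spec v1)."""
    # Each non-empty line is "<key> <value>".
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line)
    algorithm, _, digest = fields["oid"].partition(":")
    if algorithm != "sha256":
        raise ValueError(f"unexpected hash algorithm: {algorithm}")
    return LfsPointer(oid=digest, size=int(fields["size"]))


# Hypothetical pointer: the oid is the object hash used as an example
# later in this post; the size is made up.
example = (
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:3bcd5d08e22d628abf41ff6a253952e77b11dc83ff150903a6e2ff17f7148c82\n"
    "size 12345\n"
)
print(parse_pointer(example))
```

When the object behind `oid` is missing from every LFS store, the pointer is all that is left of the file, which is the situation described here.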
Cataloging the files
git lfs ls-files -a showed me all LFS entries in the repository, but for some reason I
can’t remember, that wasn’t enough. Some sleuthing showed that enabling lfs.skipdownloaderrors
would allow me to check out revisions with missing LFS entries. The errors are logged, so I
could do this (pointing the rebase at the last commit before I enabled LFS):
PowerShell
git config --local lfs.skipdownloaderrors true
git rebase -i e166a6^
git lfs logs > ../lfs-logs
git lfs logs doesn’t show the logs themselves but a list of filenames. I saved it for easy reference and narrowed it down to just the errors:
PowerShell
$allLogs = Get-Content ..\lfs-logs | %{ Get-Content ..\blog\.git\lfs\logs\$_ | where { $_ -like "*Error downloading source*" } }
This produced 453,000 lines thanks to my prior attempts to repair the history. Since I knew most of those were repeated, I sorted and de-duplicated them in Emacs. That brought the count down to 1,900. The files I saw now fell into two categories: a large number of old social media previews (that I didn’t care about losing), and around 10–15 unused images I wanted to save if I could. That included two post images which I was fortunately able to find in old backups.
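The sorting and de-duplication was done in Emacs, but the same reduction can be sketched in Python. This is an illustration under assumptions: `unique_sorted` and `split_by_marker` are hypothetical helpers, the sample lines are invented and do not reproduce the real Git LFS log format, and the `/teasers/` marker is borrowed from the filter used later in this post.

```python
def unique_sorted(lines):
    """Strip whitespace, drop blanks and duplicates, and return the rest sorted."""
    return sorted({line.strip() for line in lines if line.strip()})


def split_by_marker(lines, marker="/teasers/"):
    """Separate lines that mention the marker from those that do not."""
    matching = [line for line in lines if marker in line]
    rest = [line for line in lines if marker not in line]
    return matching, rest


# Invented sample lines; real Git LFS log entries look different.
sample = [
    "Error downloading source/assets/images/teasers/a.png",
    "Error downloading source/assets/images/photo.jpg",
    "Error downloading source/assets/images/teasers/a.png",
    "",
]
unique = unique_sorted(sample)
teasers, others = split_by_marker(unique)
print(len(sample), "lines ->", len(unique), "unique:",
      len(teasers), "teasers,", len(others), "other")
```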
Getting related commits
Given an object hash (i.e. the hash of the file being pointed to), I was able to get a list of commits pointing to it like so:
Output
$ git log --no-show-signature '--pretty=format:%H %aI' -S 3bcd5d08e22d628abf41ff6a253952e77b11dc83ff150903a6e2ff17f7148c82
f0d9ae066f23c6cc5dc097ef2cc30e0afa83011f 2021-06-20T21:11:05+05:30
ceb7ae814862b50231440301d6a65fc7be4afb4f 2021-06-23T17:56:35+05:30
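To use that output from a script, each line can be split into a commit hash and a parsed timestamp. The following Python sketch assumes the `%H %aI` format shown above; `parse_log_output` and `commits_touching` are hypothetical helper names, and the second one simply wraps the same `git log` invocation with `subprocess`.

```python
import subprocess
from datetime import datetime


def parse_log_output(output):
    """Turn '<commit> <ISO 8601 author date>' lines into (hash, datetime) pairs."""
    commits = []
    for line in output.splitlines():
        if not line.strip():
            continue
        commit, date = line.split(" ", 1)
        commits.append((commit, datetime.fromisoformat(date.strip())))
    return commits


def commits_touching(oid, repo="."):
    """Run the git log command from the post for one object hash."""
    result = subprocess.run(
        ["git", "-C", repo, "log", "--no-show-signature",
         "--pretty=format:%H %aI", "-S", oid],
        capture_output=True, text=True, check=True,
    )
    return parse_log_output(result.stdout)


# The two lines of output shown above.
sample = (
    "f0d9ae066f23c6cc5dc097ef2cc30e0afa83011f 2021-06-20T21:11:05+05:30\n"
    "ceb7ae814862b50231440301d6a65fc7be4afb4f 2021-06-23T17:56:35+05:30\n"
)
for commit, when in parse_log_output(sample):
    print(commit[:7], when.isoformat())
```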
The teasers
I began by tackling the social media images. I was perfectly happy to replace the missing files with a placeholder, given that they were automatically generated. In order to avoid repeating work, I took a multi-step approach with PowerShell and Python (all inside Org with Babel):
- Parse the errors into a mapping. The keys were the paths the LFS objects were meant for; the values recorded the object identifier, the associated commit, and the commit date.
PowerShell
$map = @{}
foreach ($line in (Get-Content lfs-missing)) {
    if (!($line -like "*/teasers/*")) { continue }
    if ($line -like "Error downloading object:*") { continue }
    $parts = $line -split " "
    $image = $parts[2]
    $hash = $parts[3].Substring(1, $parts[3].Length - 2) # get rid of parentheses
    # The first line of the git log output is the commit recorded for this object
    $commitinfo = ((git log --no-show-signature '--pretty=format:%H %aI' -S $hash) -split "`n")[0] -split " "
    $commit = $commitinfo[0]
    $date = $commitinfo[1]
    $map[$image] = @{
        "hash" = $hash
        "commit" = $commit
        "date" = $date
    }
}
ConvertTo-Json $map
I saved the output as teasers-map.json.
- Flatten the mapping into a list so that it can be iterated over chronologically:
PowerShell
$map = Get-Content -Raw teasers-map.json | ConvertFrom-Json -AsHashTable
$serialized = @()
foreach ($k in $map.Keys) {
    $v = $map[$k]
    $serialized += @{ "path" = $k; "commit" = $v["commit"]; "hash" = $v["hash"]; "date" = $v["date"] }
}
# Caveat: -InputObject hands Sort-Object the whole array as a single object,
# so nothing is actually reordered here.
return Sort-Object -InputObject $serialized -Property date,path | ConvertTo-Json
- Group the same data by commit:
PowerShell
$map = Get-Content -Raw teasers-map.json | ConvertFrom-Json -AsHashTable
$byCommit = @{}
foreach ($k in $map.Keys) {
    $v = $map[$k]
    $commit = $v["commit"]
    if (!$byCommit.ContainsKey($commit)) {
        $byCommit[$commit] = @{
            "date" = $v["date"]
            "paths" = @()
        }
    }
    $byCommit[$commit]["paths"] += $k
}
$serialized = @()
foreach ($k in $byCommit.Keys) {
    $serialized += @{
        "commit" = $k
        "date" = $byCommit[$k]["date"]
        "paths" = $byCommit[$k]["paths"]
    }
}
return $serialized | ConvertTo-Json
- Having saved the above output as teasers-by-commit.json, use Python to sort it again, because it turned out PowerShell wouldn’t do it correctly:
Python
import json
from pathlib import Path

raw = Path("teasers-by-commit.json").read_text()
parsed = json.loads(raw)
# Newest commit first
ordered = sorted(parsed, key=lambda c: c["date"], reverse=True)
print(json.dumps(ordered, indent=2))
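The grouping and ordering steps above can also be sketched in a single Python pass. One caveat motivates the variation: comparing ISO 8601 strings only matches chronological order when every timestamp carries the same UTC offset, so this sketch parses the dates before sorting. It assumes the mapping has the shape produced in the first step; `group_by_commit` and `newest_first` are hypothetical names, and the paths, hashes, and commit IDs in the sample are invented.

```python
from collections import defaultdict
from datetime import datetime


def group_by_commit(teasers_map):
    """Collect every path under the commit it was recorded against."""
    grouped = defaultdict(lambda: {"date": None, "paths": []})
    for path, info in teasers_map.items():
        entry = grouped[info["commit"]]
        entry["date"] = info["date"]
        entry["paths"].append(path)
    return [
        {"commit": commit, "date": entry["date"], "paths": sorted(entry["paths"])}
        for commit, entry in grouped.items()
    ]


def newest_first(commits):
    """Sort by the parsed timestamp rather than by the raw string."""
    return sorted(
        commits,
        key=lambda c: datetime.fromisoformat(c["date"]),
        reverse=True,
    )


# Invented sample with two different UTC offsets.
teasers_map = {
    "teasers/a.png": {"hash": "aaa", "commit": "c1", "date": "2021-06-20T21:11:05+05:30"},
    "teasers/b.png": {"hash": "bbb", "commit": "c1", "date": "2021-06-20T21:11:05+05:30"},
    "teasers/c.png": {"hash": "ccc", "commit": "c2", "date": "2021-06-20T20:00:00+00:00"},
}
ordered = newest_first(group_by_commit(teasers_map))
print([c["commit"] for c in ordered])
```

In the sample, `c2` is the later commit in absolute time even though its date string compares lower, which is the case that string sorting would get wrong. If all timestamps share one offset, as they may well have in the original data, the two approaches agree.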
From here, it gets murky. I tried many different approaches. Automation just didn’t work. Sometimes it would turn runs of unrelated commits into empty commits. The rest of the time, I’d end up with (expected) conflicts because of rewriting early history that collided with rewriting later history, and the script couldn’t automatically account for those. I ended up splitting the work into several scripts that I ran manually, having copied the sorted list of commits into a new file I could manually work through:
- Move to the next commit to edit, with a hack to turn interactive rebasing into non-interactive rebasing using sd (an alternative to sed):
PowerShell
$commits = (Get-Content -Raw ../teasers-by-commit-sorted-remaining.json | ConvertFrom-Json -AsHashtable)
$current = $commits[0]
$hash = (git rev-parse --short $current["commit"])
# Git runs this command on the rebase todo file instead of opening an editor
[Environment]::SetEnvironmentVariable("GIT_SEQUENCE_EDITOR", "sd -s `"pick $hash`" `"edit $hash`"", "Process")
Write-Output "Editing $($hash): $($env:GIT_SEQUENCE_EDITOR)"
git rebase -i "$($hash)^"
Remove-Item Env:\GIT_SEQUENCE_EDITOR
- Overwrite the missing teasers in the current commit:
PowerShell
$ErrorActionPreference = "Stop"
$commits = (Get-Content -Raw ../teasers-by-commit-sorted-remaining.json | ConvertFrom-Json -AsHashtable)
$files = $commits[0].paths
Write-Output "Files: $files"
foreach ($file in $files) {
    Copy-Item ..\teasers-placeholder-image.png $file -Force -Verbose
}
touch @files
git add @files
git commit --amend --no-edit
git rebase --continue
- Remove the newly-added teasers from the commit which ultimately disabled the social media images:
PowerShell
$ErrorActionPreference = "Stop"
$commits = (Get-Content -Raw ../teasers-by-commit-sorted-remaining.json | ConvertFrom-Json -AsHashtable)
$files = $commits[0].paths
Write-Output "Files: $files"
git rm @files
git commit --amend --no-edit
git rebase --continue
- Remove the current commit from the list:
PowerShell
$ErrorActionPreference = "Stop"
$path = "../teasers-by-commit-sorted-remaining.json"
$commits = (Get-Content -Raw $path | ConvertFrom-Json -AsHashtable)
$withoutCurrent = $commits | Select-Object -Skip 1
Copy-Item -Force $path "$($path).old"
Set-Content $path (ConvertTo-Json $withoutCurrent)
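The sequence-editor hack in the first of these scripts works because Git runs the command named in GIT_SEQUENCE_EDITOR with the path of the rebase todo file appended, so any program that rewrites that file in place can stand in for the interactive editor. As an illustration of the rewrite that the sd command performs, here is a Python sketch; `mark_for_edit` is a hypothetical name and the commit subjects in the sample todo list are invented.

```python
def mark_for_edit(todo, short_hash):
    """Change `pick <hash>` to `edit <hash>` for one commit in a rebase todo list."""
    lines = []
    for line in todo.splitlines():
        parts = line.split(" ", 2)
        if len(parts) >= 2 and parts[0] == "pick":
            listed = parts[1]
            # Abbreviated hashes may differ in length, so accept a prefix
            # match in either direction.
            if listed.startswith(short_hash) or short_hash.startswith(listed):
                parts[0] = "edit"
                line = " ".join(parts)
        lines.append(line)
    return "\n".join(lines) + "\n"


# Invented todo list using the abbreviated hashes of the two commits
# shown earlier; the subjects are made up.
todo = (
    "pick f0d9ae0 Add a post\n"
    "pick ceb7ae8 Fix a typo\n"
)
print(mark_for_edit(todo, "ceb7ae8"), end="")
```

Unlike a plain string replacement, this only touches lines whose first word is `pick`, so a commit subject that happens to contain the same text is left alone.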
In the end, however, despite much repetition of these steps and lots of manual resolution of conflicts… I still had missing objects. I gave up and removed all teasers from the repository’s history with git-filter-repo:
PowerShell
python C:\App\Scoop\apps\git-filter-repo\current\git-filter-repo --invert-paths --path-glob 'source/assets/images/teasers'
The remaining images
I had to put in more work for the images I cared about. I couldn’t successfully automate much. I just took the list of objects and went through it manually with a lot of git log --grep to discover new commit hashes each time I finished a rebase.
Even after that, I still got errors about missing objects when I tried to push the branch. I removed a few more commits with git-filter-repo. I pushed again and got more errors. I fixed the commits that the errors were indirectly referencing. I pushed again. Suddenly, it worked!
Tidying up the branches
After eight days of effort, I had finally pushed a new branch to GitLab. It showed a 33-page diff with all sorts of inexplicable changes. Locally, though, I could see that only the LFS configuration had changed; clearly, GitLab was struggling to relate the two branches. I switched to the new one by:
- Renaming the old branch.
- Renaming the new branch to main.
- Setting the default branch in GitLab to main.
- Updating the branch protection rules in GitLab.
- Updating Netlify to build from main.
- Redeploying to Netlify.
- Deleting the old branch or branches locally.
I put a backup of the old working directory (with the old history) in a safe place, just in case.