We need to talk about Sitecore Blobs

Recently, I noticed that one of our set of Sitecore 10.1 databases had grown out of control. Whilst it was a Sitecore instance with multiple separate websites each with a considerable amount of media, the size it had grown to, did not compute. Not only that, but the Web database was considerably larger than the Master database, which to anyone who knows Sitecore, is a bit of a surprise. As the Web database contains only the latest published content, you would expect it to be smaller than Master, or the same at best.

To identify where the space was being used, we ran a database script which unsurprisingly highlighted the Blobs table as the location of huge amounts of data.

SELECT 
    t.NAME AS TableName,
    s.Name AS SchemaName,
    p.rows AS RowCounts,
    SUM(a.total_pages) * 8 AS TotalSpaceKB, 
    SUM(a.used_pages) * 8 AS UsedSpaceKB, 
    (SUM(a.total_pages) - SUM(a.used_pages)) * 8 AS UnusedSpaceKB
FROM 
    sys.tables t
INNER JOIN      
    sys.indexes i ON t.OBJECT_ID = i.object_id
INNER JOIN 
    sys.partitions p ON i.object_id = p.OBJECT_ID AND i.index_id = p.index_id
INNER JOIN 
    sys.allocation_units a ON p.partition_id = a.container_id
LEFT OUTER JOIN 
    sys.schemas s ON t.schema_id = s.schema_id
WHERE 
    t.NAME NOT LIKE 'dt%' 
    AND t.is_ms_shipped = 0
    AND i.OBJECT_ID > 255 
GROUP BY 
    t.Name, s.Name, p.Rows
ORDER BY 
    t.Name
SQL script output showing Blobs table has 380,000 rows and around 250GB

The Blobs table is where Sitecore stores media items by default, as binary large objects (blobs). The Blob table contained over 380,000 rows! Whilst I was not aware how much media content editors had added over the years, this felt like an unlikely number. A quick Google reveals a few blog articles similar to this one which demonstrates that this is a common problem. The older articles refer to a simple database scripts that you can run, whilst newer ones mention the Sitecore Admin page where you can “Cleanup Blobs”. As I have been through this pain recently on one of our 10.1 instances, I thought it could be useful to explain my findings with relation to previous articles, and hopefully assist someone else in understanding the differences.

First off I want to clarify that I am not saying any of these articles are wrong, they were accurate at the time or accurate for a specific Sitecore version or situation. I want to try to give the next person a good starting point from which to solve their own Sitecore Blob problem, and to learn from my mistakes.

WARNING:

Always backup databases before running destructive scripts such as the ones supplied in this article. Always understand your own situation and evaluate whether the solution is right for you. Most importantly, don’t blame me if you delete all of your blobs. There are several scripts referenced in this article which demonstrate the process I went through, including ones which caused harm. I have added comments at the start of those scripts to warn against running them.

The problem

What you probably already knew is that Sitecore stores media as blobs in the database. What you may not realise is that when a media item is deleted, or if the media item is updated by attaching a new file, the original blob is not removed from the database. In those instances, the blob becomes orphaned and is inaccessible in the Sitecore interface.

In our case, we had a process that was generating media items on a schedule, overwriting previous versions of the file. This meant we were not only orphaning blobs through editor actions but also automatically through a Sitecore task with over 100 blobs per day. Before too long, the database was was almost 250GB. Whilst the huge database was not directly causing any problems, it needed to be dealt with to prevent causing problems in the future or racking up unnecessary hosting costs.

graph showing a Sitecore Web database space utilization. 249GB used, 51GB remaining.
Sitecore Web database utilisation

What did I do?

I Googled. One of the first articles I stumbled across was 7 years old and referenced a database script that you can use to remove the orphaned blobs. Naturally I was happy to find a solution, but I wasn’t going to run a 7 year old script when my version of Sitecore was very different to that of the article. However, the article did mention the newly added feature “cleanup blobs”. I checked, it still exists in Sitecore 10.1 (Control panel > Administration tools > Database cleanup). So this felt like the obvious answer. It is a feature provided by Sitecore to clean up your Sitecore instance.

The Sitecore database cleanup page
The Sitecore database cleanup page

I had several goes at this. The first time, very little happened. After a short amount of time, the window said it had finished but nothing appeared to have changed. A further Google revealed that this was because the default SQL execution timeout is very low and it simply isn’t long enough to cleanup any blobs. Sitecore do warn about this timeout, and suggest increasing it temporarily. I tried this, increasing initially to 10 hours, then 3 days, then 10 days. Each time the process never completed or was interrupted/disconnected somehow.

At this point I felt that a SQL script was the way to go. I just needed an accurate one for the version I was running. The next article from Dean OBrien, was much more recent, from last year. The warning at the end of the article did concern me – “Caution: This script doesn’t fit all versions. I tried this on a 9.3 database and it was deleting blobs which were still in use” (Mark Lowe).

Wanting a quick(ish) answer to which SQL script was right for me, I fired up dotpeek and decompiled the Sitecore DLL used by the cleanup blobs feature. After navigating through many abstracted classes and methods I found the main bulk of the SQL query.

Unfortunately it was building up part of the query elsewhere and it quickly became confusing to what the final query would look like. Reading the main part showed huge similarities to what Sitecore had provided Dean in the previous article. The big difference, and what I believe accounted for Mark Lowe’s loss of data, was that the later script caters for two different formats of blob id. It appears that at some point, Sitecore has changed whether the reference contains curly braces or not, when referenced from the various *Fields tables.

I constructed the SQL query as best as I could understand, which gave me:

​
## DO NOT USE THIS SCRIPT. IT IS ONLY TO DEMONSTRATE 
## THE STEPS I TOOK. THIS ONE WILL DELETE VALID BLOBS

DECLARE @r INT;
DECLARE @batchsize INT;
DECLARE @BlobId UNIQUEIDENTIFIER;

create table #UnusedBlobIDs (
	ID UNIQUEIDENTIFIER PRIMARY KEY (ID)
);

SET @r = 1;
SET @batchsize=1000;
SET @BlobId='FF8A2D01-8A77-4F1B-A966-65806993CD31';

WITH [BlobFields] ([FieldId])
AS
(   SELECT [SharedFields].[ItemId]
	FROM [SharedFields]
	WHERE
    [SharedFields].[FieldId] = @BlobID
    AND [SharedFields].[Value] = 1
	UNION
	SELECT [VersionedFields].[ItemId]
	FROM [VersionedFields]
	WHERE
    [VersionedFields].[FieldId] = @BlobID
    AND [VersionedFields].[Value] = 1
	UNION
	SELECT [UnversionedFields].[ItemId]
	FROM [UnversionedFields]
	WHERE
    [UnversionedFields].[FieldId] = @BlobID
    AND [UnversionedFields].[Value] = 1
	UNION
	SELECT [ArchivedFields].[ArchivalId]
	FROM [ArchivedFields]
	WHERE
    [ArchivedFields].[FieldId] = @BlobID
    AND [ArchivedFields].[Value] = 1),
	
	[ExistingBlobs] ([BlobId])
	AS
	(SELECT [Blobs].[BlobId]
	FROM [Blobs]
    JOIN [SharedFields]
	ON '{{' + CONVERT(NVARCHAR(MAX), [Blobs].[BlobId]) + '}}' = [SharedFields].[Value]
    JOIN [BlobFields]
	ON [SharedFields].[FieldId] = [BlobFields].[FieldId]
	UNION
	SELECT [Blobs].[BlobId]
	FROM [Blobs]
    JOIN [SharedFields]
	ON CONVERT(NVARCHAR(MAX), [Blobs].[BlobId]) = [SharedFields].[Value]
    JOIN [BlobFields]
	ON [SharedFields].[FieldId] = [BlobFields].[FieldId]
	UNION
	SELECT [Blobs].[BlobId]
	FROM [Blobs]
    JOIN [VersionedFields]
	ON '{{' + CONVERT(NVARCHAR(MAX), [Blobs].[BlobId]) + '}}' = [VersionedFields].[Value]
    JOIN [BlobFields]
	ON [VersionedFields].[FieldId] = [BlobFields].[FieldId]
	UNION
	SELECT [Blobs].[BlobId]
	FROM [Blobs]
    JOIN [VersionedFields]
	ON CONVERT(NVARCHAR(MAX), [Blobs].[BlobId]) = [VersionedFields].[Value]
    JOIN [BlobFields]
	ON [VersionedFields].[FieldId] = [BlobFields].[FieldId]
	UNION
	SELECT [Blobs].[BlobId]
	FROM [Blobs]
    JOIN [UnversionedFields]
	ON '{{' + CONVERT(NVARCHAR(MAX), [Blobs].[BlobId]) + '}}' = [UnversionedFields].[Value]
    JOIN [BlobFields]
	ON [UnversionedFields].[FieldId] = [BlobFields].[FieldId]
	UNION
	SELECT [Blobs].[BlobId]
	FROM [Blobs]
    JOIN [UnversionedFields]
	ON CONVERT(NVARCHAR(MAX), [Blobs].[BlobId]) = [UnversionedFields].[Value]
    JOIN [BlobFields]
	ON [UnversionedFields].[FieldId] = [BlobFields].[FieldId]
	UNION
	SELECT [Blobs].[BlobId]
	FROM [Blobs]
    JOIN [ArchivedFields]
	ON '{{' + CONVERT(NVARCHAR(MAX), [Blobs].[BlobId]) + '}}' = [ArchivedFields].[Value]
    JOIN [BlobFields]
	ON [ArchivedFields].[FieldId] = [BlobFields].[FieldId]
	UNION
	SELECT [Blobs].[BlobId]
	FROM [Blobs]
    JOIN [ArchivedFields]
	ON CONVERT(NVARCHAR(MAX), [Blobs].[BlobId]) = [ArchivedFields].[Value]
    JOIN [BlobFields]
	ON [ArchivedFields].[FieldId] = [BlobFields].[FieldId])
	
	INSERT INTO #UnusedBlobIDs (ID) SELECT DISTINCT  [Blobs].[BlobId]
    FROM [Blobs]
	WHERE NOT EXISTS
	(  SELECT NULL
	FROM [ExistingBlobs]
	WHERE [ExistingBlobs].[BlobId] = [Blobs].[BlobId])
	
WHILE @r > 0
	BEGIN
		BEGIN TRANSACTION;
		DELETE TOP (@batchsize) FROM [Blobs] where [Blobs].[BlobId] IN (SELECT ID from #UnusedBlobIDs);
		SET @r = @@ROWCOUNT;
		COMMIT TRANSACTION;
	END

DROP TABLE #UnusedBlobIDs;

I tried running it directly in SSMS. The first time I ran without the delete clause and added Select count(ID) from #UnusedBlobIDs, to understand how many rows it was going to delete. I compared this with the count of the rows in the Blobs table.

Number of Unused Blob Ids

The number was big but not everything (251K out of 387k), it seemed good. However, I later discovered after running it for real, that blob id is not unique. This meant that the number I got out of my initial select was inaccurate – it was a distinct count of blob ids. It was going to delete more items that I was expecting and when it finished 20 hours later, my Blobs table was completely empty!

face palm

Obviously I was working on a backup, so all was not lost. I just had to import 250GB and go again. I re-examined the script and found that the first section (BlobFields) was evaluating to 0 rows. As this is later inner joined with the existing blobs, it effectively marks everything for deletion. I took a closer look at what the first section was doing. It seems that it is trying to identify the Sitecore fields which are set as Blob types. On any Sitecore field you can set this option:

Screenshot of Sitecore content editor showing the Blob setting on a field definition item

The list of Blob fields is then used to inner join in the latter part of the script so that only fields with blob ids where the field type is actually a blob type. It feels that this check is highly unnecessary for accurately identifying orphaned blob items as Guids are to all intent and purposes unique. I suspect this clause is here for performance reasons as the string comparison that occurs on the *Fields tables is probably more expensive than the comparison of uniqueidentifier types.

With this in mind I decided that the worst that would happen if I removed this check, is I would retain a tiny amount of the orphaned blobs and it might take a bit longer to run. I modified the script to remove the BlobFields element and went again. My modified script now looked like this:

​
## DO NOT USE THIS SCRIPT. IT IS ONLY TO DEMONSTRATE 
## THE STEPS I TOOK. THIS ONE WILL DELETE VALID BLOBS

DECLARE @r INT;
DECLARE @batchsize INT;

create table #UnusedBlobIDs (
	ID UNIQUEIDENTIFIER PRIMARY KEY (ID)
);

SET @r = 1;
SET @batchsize=1000;

WITH [ExistingBlobs] ([BlobId])
	AS
	(SELECT [Blobs].[BlobId]
	FROM [Blobs]
    JOIN [SharedFields]
	ON '{{' + CONVERT(NVARCHAR(MAX), [Blobs].[BlobId]) + '}}' = [SharedFields].[Value]
	UNION
	SELECT [Blobs].[BlobId]
	FROM [Blobs]
    JOIN [SharedFields]
	ON CONVERT(NVARCHAR(MAX), [Blobs].[BlobId]) = [SharedFields].[Value]
	UNION
	SELECT [Blobs].[BlobId]
	FROM [Blobs]
    JOIN [VersionedFields]
	ON '{{' + CONVERT(NVARCHAR(MAX), [Blobs].[BlobId]) + '}}' = [VersionedFields].[Value]
	UNION
	SELECT [Blobs].[BlobId]
	FROM [Blobs]
    JOIN [VersionedFields]
	ON CONVERT(NVARCHAR(MAX), [Blobs].[BlobId]) = [VersionedFields].[Value]
	UNION
	SELECT [Blobs].[BlobId]
	FROM [Blobs]
    JOIN [UnversionedFields]
	ON '{{' + CONVERT(NVARCHAR(MAX), [Blobs].[BlobId]) + '}}' = [UnversionedFields].[Value]
	UNION
	SELECT [Blobs].[BlobId]
	FROM [Blobs]
    JOIN [UnversionedFields]
	ON CONVERT(NVARCHAR(MAX), [Blobs].[BlobId]) = [UnversionedFields].[Value]
	UNION
	SELECT [Blobs].[BlobId]
	FROM [Blobs]
    JOIN [ArchivedFields]
	ON '{{' + CONVERT(NVARCHAR(MAX), [Blobs].[BlobId]) + '}}' = [ArchivedFields].[Value]
	UNION
	SELECT [Blobs].[BlobId]
	FROM [Blobs]
    JOIN [ArchivedFields]
	ON CONVERT(NVARCHAR(MAX), [Blobs].[BlobId]) = [ArchivedFields].[Value])
	
	INSERT INTO #UnusedBlobIDs (ID) SELECT DISTINCT  [Blobs].[BlobId]
    FROM [Blobs]
	WHERE NOT EXISTS
	(  SELECT NULL
	FROM [ExistingBlobs]
	WHERE [ExistingBlobs].[BlobId] = [Blobs].[BlobId])
	
WHILE @r > 0
	BEGIN
		BEGIN TRANSACTION;
		DELETE TOP (@batchsize) FROM [Blobs] where [Blobs].[BlobId] IN (SELECT ID from #UnusedBlobIDs);
		SET @r = @@ROWCOUNT;
		COMMIT TRANSACTION;
	END

DROP TABLE #UnusedBlobIDs;

This time I was left with around 4000 blobs. I was happy that they didn’t all get deleted like last time, but fairly quickly I noticed media was missing in the media library. A final look at the script and it was obvious what I had done wrong. I had forgotten that I would need to un-escape the curly braces. As I had taken the script from a string in a decompiled DLL, there were double curly braces. Running this directly in SSMS was not equivalent. So I reimported the DB again, modified the script, and tried again. The final script looked like this:

​
DECLARE @r INT;
DECLARE @batchsize INT;

create table #UnusedBlobIDs (
	ID UNIQUEIDENTIFIER PRIMARY KEY (ID)
);

SET @r = 1;
SET @batchsize=1000;

WITH [ExistingBlobs] ([BlobId])
	AS
	(SELECT [Blobs].[BlobId]
	FROM [Blobs]
    JOIN [SharedFields]
	ON '{' + CONVERT(NVARCHAR(MAX), [Blobs].[BlobId]) + '}' = [SharedFields].[Value]
	UNION
	SELECT [Blobs].[BlobId]
	FROM [Blobs]
    JOIN [SharedFields]
	ON CONVERT(NVARCHAR(MAX), [Blobs].[BlobId]) = [SharedFields].[Value]
	UNION
	SELECT [Blobs].[BlobId]
	FROM [Blobs]
    JOIN [VersionedFields]
	ON '{' + CONVERT(NVARCHAR(MAX), [Blobs].[BlobId]) + '}' = [VersionedFields].[Value]
	UNION
	SELECT [Blobs].[BlobId]
	FROM [Blobs]
    JOIN [VersionedFields]
	ON CONVERT(NVARCHAR(MAX), [Blobs].[BlobId]) = [VersionedFields].[Value]
	UNION
	SELECT [Blobs].[BlobId]
	FROM [Blobs]
    JOIN [UnversionedFields]
	ON '{' + CONVERT(NVARCHAR(MAX), [Blobs].[BlobId]) + '}' = [UnversionedFields].[Value]
	UNION
	SELECT [Blobs].[BlobId]
	FROM [Blobs]
    JOIN [UnversionedFields]
	ON CONVERT(NVARCHAR(MAX), [Blobs].[BlobId]) = [UnversionedFields].[Value]
	UNION
	SELECT [Blobs].[BlobId]
	FROM [Blobs]
    JOIN [ArchivedFields]
	ON '{' + CONVERT(NVARCHAR(MAX), [Blobs].[BlobId]) + '}' = [ArchivedFields].[Value]
	UNION
	SELECT [Blobs].[BlobId]
	FROM [Blobs]
    JOIN [ArchivedFields]
	ON CONVERT(NVARCHAR(MAX), [Blobs].[BlobId]) = [ArchivedFields].[Value])
	
	INSERT INTO #UnusedBlobIDs (ID) SELECT DISTINCT  [Blobs].[BlobId]
    FROM [Blobs]
	WHERE NOT EXISTS
	(  SELECT NULL
	FROM [ExistingBlobs]
	WHERE [ExistingBlobs].[BlobId] = [Blobs].[BlobId])
	
WHILE @r > 0
	BEGIN
		BEGIN TRANSACTION;
		DELETE TOP (@batchsize) FROM [Blobs] where [Blobs].[BlobId] IN (SELECT ID from #UnusedBlobIDs);
		SET @r = @@ROWCOUNT;
		COMMIT TRANSACTION;
	END

DROP TABLE #UnusedBlobIDs;

This time all appears much better. I was left with around 35000 blobs and a random check of items in the media library didn’t reveal any anomalies.

The weigh in

A final comparison of the numbers reveals that our slimmed down Web database has gone from 249GB to 22GB, and from 387k blobs to 31k blobs. Very satisfying.

graph showing a Sitecore Web database space utilization. 22GB used, 228GB remaining.
Sitecore Web database utilisation after running the SQL script

A final recommended step would be to shrink the database. Until this is done, the database allocates the original size. To shrink the database I ran:

​
DBCC SHRINKDATABASE (N'database-name-web');
​

This can take a significant amount of time to run (for me it was around 10 hours when running at a high number of DTUs), and if it is interrupted, you have to start the shrink again.

Patience

I thought it might be useful to demonstrate how long the script can take to run. Everyone’s situation will be different, the size of the blob table, the number of orphaned blobs, the processing power available to your SQL Server instance, will all affect how long the script will take. Whilst I was able to get the script to complete in around 6 or 7 hours at one point, I did scale the Azure SQL to 1600 DTUs. Generally, what I experienced was that the script will fill however much available processing power is free. However, I have a suspicion that Azure treated the long running script as an issue and tried to manage it. Often after a few hours, the utilisation would drop right down to ~200 DTUs like it was being throttled. When this happened, I rescaled (which killed the script) then I ran again. As the script deletes in batches, it can just be rerun as many times as needed until all orphaned blobs are deleted.

Occasionally I would also experience connection drops which would terminate the script. This can be frustrating when you don’t detect this for some hours or if it happens moments after you go to sleep (wasted sleeping hours).

To give you an idea of the time to expect, I ran against one of my environments without scaling up the database. The spec was:

  • Azure SQL S1 (20 CTUs)
  • Sitecore Master DB
  • 153,000 orphaned blob Ids
  • 224,000 orphaned rows in the Blobs table
  • 161GB before
  • 30GB after

Whilst an S1 is not very big, it is the size the Master database normally runs on in our non-production environment so I wanted to see how it would perform without adding additional hosting cost. It managed to run successfully without any connection drop outs and it took…

42 hours!

The graph below shows the duration of the cleanup script on the S1 instance.

Graph showing the space used in the database and the DTU utilisation. The space used goes down over time and the DTU utilisation is at around 100% the whole time.
Graph showing the space used in the database and the DTU utilisation

Shrinking the database took a further 40 hours.

BlobFields

I may have solved my problem but one thing was still bugging me. Why did I need to remove the section about BlobFields? Did I misunderstand what it was trying to do, or is there something wrong with my specific instance of Sitecore? I couldn’t leave this unanswered.

So what is the script doing. It is checking each of the tables that contains field data (SharedFields, VersionedFields, UnversionedFields and ArchivedFields) and looking for an item that has the Blob field set to 1 (true). My expectation is that this should return items that are based on the File item which has Blob ticked on the Blob field and is used by things such as versioned PDF, versioned Image etc. There is also a separate unversioned blob field too.

System Blob field in the content editor
System Blob field in the content editor

The Field Ids for the two Blob fields are:

  • {40E50ED9-BA07-4702-992E-A912738D32DC}
  • {DBBE7D99-1388-4357-BB34-AD71EDF18ED3}

Those Ids can be found in other legacy SQL scripts. If I perform a query on a known media item I can see one of the blob fields listed with a value, which points to a blob id in the Blobs table.

Result of a SQL query showing the field values of a single media item. The blob field value is highlighted
Result of a SQL query showing the field values of a single media item. The blob field value is highlighted.

So we can safely assume that in our original SQL script we would expect the result of the BlobFields query to contain {40E50ED9-BA07-4702-992E-A912738D32DC}. This means that we would expect there to be a row in one of the *Fields tables with an ItemId (or ArchivalId in the ArchiveFields table) with the value 40E50ED9-BA07-4702-992E-A912738D32DC.

Interestingly this is not the case, and presumably is the reason why the first part of the SQL query fails to return any results. The next question is whether this is just our instance of Sitecore in a corrupted state or if other Sitecore instances return the same result. I checked another instance and it too returned no results for the Blob field or when running the BlobFields portion of the query. A script I found on a Gist by Michael West had made me believe that the BlobFields section of the script MUST return something. If it didn’t, the script would only ever delete all of your blobs. His script was very similar to mine, the only real difference is the batch deleting of blobs in the one I had taken from the decompiled DLL. This brings me back to the one part of the script I never fully resolved. – {ExternalBlobFieldsIdsMarker}

dotpeak: decompiled code showing where the SQL script has a placeholder replaced with Ids
dotpeak: decompiled code showing where the SQL script has a placeholder replaced with Ids

In the DLL, there was a string replace for {ExternalBlobFieldsIdsMarker} with the values returned from another method. This placeholder just so happens to JOIN in the BlobFields query. Originally it seemed that blobFieldsIds was always null, which is why I disregarded this replacement. However, there is an additional entry point to the parent method where the Ids are not null and instead come from GetResourceBlobFieldIds(). It’s a tough read but my current working assumption is that the default Blob field Ids are returned from this method and that this may be something that changed in v10. If this were the case, it would explain how Michael West’s script could have worked, presumably in v9 the BlobFields query would have returned results. If anyone has the answer to this, I’d love to hear from you.

Bug or a feature?

In writing this article I found this Sitecore Knowledgebase article which talks about the issue of orphaned blobs and says it is fixed in v6.6. I can only assume they mean they provided the cleanup tool in that version, rather than fixing the underlying issue. I admit I don’t know the ins and outs of why Sitecore handles media in this way, but it would seem sensible to have a mechanism that deletes blobs when their media items are deleted or replaced. At the very least it feels that part of the database set up should involve creating a job that routinely removes orphaned blobs. Maybe this is a set up step I have missed, or has been added since we created the instance – let me know if it is.

Summary

I am not the first to write about orphaned blobs, I probably wont be the last. But hopefully, from going through my experience I may have helped explain why there are differences in the various posts you may have found on this subject. If nothing else, I will probably help my future self when I try to deal with this issue again.

In my next article I write about automating the cleanup of blobs when using Azure SQL database, so we never have to worry about this again.

One thought on “We need to talk about Sitecore Blobs

Leave a comment