Coding Kata – Anagram Solving Algorithm for SQL Server

A coding challenge is set here to find the anagrams in a list of over 330,000 words – http://codekata.com/kata/kata06-anagrams/ . This solution uses SQL Server to sort the letters in each word by their ASCII value, then uses a SQL Server group by (ASCII Total) clause to find the anagrams.

The challenge states there are 20,683 anagram, this solution found 20,210 of them, 97%. Apparently its possible to find all anagrams using a Ruby algorithm in less than 2 seconds, this solution in SQL Server takes about 40 seconds. It’s dynamic, tables are constructed at runtime to cater for the longest words, but that makes the code quite ugly. If tables were created for a given maximum word length then 1/2 the code needed, this solution is flexible but at a complexity cost.

Anagram Search Algorithm Summary

To explain the anagram search process, two words are used as examples, ‘retreats’ and ‘treaters’

Step 1

The word list file from the challenge website was saved to a text file using notepad, a table is defined in SQL Server with a single Unicode column then the text file is bulk inserted into the database table.

Step 2

A dynamic SELECT INTO statement is constructed where the ASCII value of each letter in a word is substringed into a new column as shown below:

Word Word Len ASCII_1 ASCII_2 ASCII_3 ASCII_4 ASCII_5 ASCII_6 ASCII_7 ASCII_8
retreats 8 114 101 116 114 101 97 116 115
treaters 8 116 114 101 97 116 101 114 115

Step 3

A dynamic un-pivot is used to create rows from the ASCII columns as shown below, the rows are sorted by their ASCII value:

ASCII_Total Word ASCII_Value
874 retreats 97
874 retreats 101
874 retreats 101
874 retreats 114
874 retreats 114
874 retreats 115
874 retreats 116
874 retreats 116
874 treaters 97
874 treaters 101
874 treaters 101
874 treaters 114
874 treaters 114
874 treaters 115
874 treaters 116
874 treaters 116

Step 4

A dynamic pivot is used to construct a map for each word, sorted by the ASCII value of each letter, shown below:

ASCII_Total Word ASCII_ValueMap
874 retreats 116 116 115 114 114 101 101 97
874 treaters 116 116 115 114 114 101 101 97

Step 5

A results table is created using GROUP BY to construct an XML representation of all words that are anagrams of each other, shown below:

ASCII_Total ASCII_ValueMap Anagrams Words WordLength
20 116 116 115 114 114 101 101 97 2 <Word>retreats</Word><Word>treaters</Word> 8

Results

The challenge asks 2 questions, what is the most words in a single anagram, and which anagram words contain the most letters:

Most Words in a Single Anagram – Top 20

AnagramResults1x

Most Letters in Anagram Words – Top 20

AnagramResults2

T-SQL Code – Double Click to Highlight, Copy and Paste into Query Analyser


USE tempdb;
GO

--------------------------------------------------------------------------------------------------------
-- Parameter - Set the anagram word file name and location
DECLARE @Words VARCHAR(500) = 'C:\Words\Words.txt';

--------------------------------------------------------------------------------------------------------
-- STEP 1 - Create word table and import the word file
DECLARE @nSQL nvarchar(max);

IF OBJECT_ID('tempdb..t0') IS NOT NULL DROP TABLE tempdb..t0; 

CREATE TABLE [dbo].[t0]([Word] [nvarchar](450) NOT NULL) ON [PRIMARY];
CREATE CLUSTERED INDEX [idx_t0] ON [dbo].[t0] ([Word] ASC) ON [PRIMARY]

SELECT @nSQL = 'BULK INSERT tempdb.dbo.t0 FROM ' + '''' + @Words + '''' + ' WITH (ROWTERMINATOR = ' + '''' + '\n' + '''' + ' , CODEPAGE = ' + '''' + '1252' + '''' + ');';
EXEC sp_executesql @nSQL;
GO 

--------------------------------------------------------------------------------------------------------
-- STEP 2 - Dynamic table definition  - A field for each letter, based on longest word length
DECLARE @nSQL nvarchar(max);
IF OBJECT_ID('tempdb..t1') IS NOT NULL DROP TABLE tempdb..t1; 

WITH CTE1 AS 
(
	SELECT MIN(LEN(Word)) AS MinWordLen, MAX(LEN(Word)) AS MaxWordLen
	FROM tempdb.dbo.t0
), 
CTE2 (WordPos) AS
(
	SELECT 1 AS WordPos
	UNION ALL
	SELECT WordPos + 1
	FROM CTE2
	JOIN CTE1 ON 1 = 1 
	WHERE WordPos < CTE1.MaxWordLen 
),
CTE3 AS
(
	SELECT '
		; WITH CTE AS (SELECT Word, LEN(Word) AS WordLen'  AS CmdStr
	FROM CTE1
),
CTE4 AS
(
	SELECT '
		, ISNULL(ASCII(SUBSTRING(Word,' + CAST(WordPos AS VARCHAR(3)) + ' ,1)),0) AS ASCII_' + CAST(WordPos AS VARCHAR(3)) AS CmdStr, WordPos 
	FROM CTE2
), 
CTE5A AS
(
	SELECT '
		0 '  AS CmdStr, 1000000 AS WordPos
	FROM CTE1
), 
CTE5B AS
(
	SELECT '
		 + CASE ASCII_' + CAST(WordPos AS VARCHAR(3)) + ' WHEN 39 THEN 0 ELSE ASCII_' + CAST(WordPos AS VARCHAR(3)) + ' END' AS CmdStr, 1000001 AS WordPos
	FROM CTE2
), 
CTE6A AS
(
	SELECT '
		FROM tempdb.dbo.t0) SELECT *, '  AS CmdStr
	FROM CTE1
), 
CTE6C AS
(
	SELECT '
		AS ASCII_Total INTO tempdb..t1 FROM CTE;'  AS CmdStr
	FROM CTE1
), 
CTE7 AS
(
	SELECT CmdStr, WordPos
	FROM CTE4
	UNION ALL
	SELECT CmdStr, 0 AS WordPos  
	FROM CTE3
	UNION ALL
	SELECT CmdStr, WordPos  
	FROM CTE5A
	UNION ALL
	SELECT CmdStr, WordPos  
	FROM CTE5B
	UNION ALL
	SELECT CmdStr, 1000 AS WordPos  
	FROM CTE6A
	UNION ALL
	SELECT CmdStr, 2000001 AS WordPos  
	FROM CTE6C
)

SELECT @nSQL = STUFF((SELECT ' ' + CmdStr FROM CTE7 ORDER BY WordPos FOR XML PATH, TYPE).value('.[1]', 'nvarchar(max)'), 1, 1, '');
EXEC sp_executesql @nSQL;
GO

--------------------------------------------------------------------------------------------------------
--  STEP 3 - Create Horizontal ASCII Values Map, a row for each letters ASCII_Value
IF OBJECT_ID('tempdb..t2') IS NOT NULL DROP TABLE t2; 
DECLARE @nSQL NVARCHAR(MAX);

SELECT @nSQL = 
';WITH CTE1 (ASCII_Total, PossibleAnagramCount) AS
(
	SELECT ASCII_Total, COUNT(*) AS PossibleAnagramCount 
	FROM tempdb..t1 x1
	GROUP BY ASCII_Total
	HAVING COUNT(*) > 1
),
CTE2 AS
(
	SELECT *
	FROM tempdb..t1
)
SELECT x1.ASCII_Total, x1.PossibleAnagramCount, x2.Word, x2.ASCII_Value 
INTO tempdb..t2
FROM CTE1 x1
JOIN ( '

DECLARE @MaxWordLen INT;
SELECT @MaxWordLen = MAX(LEN(Word))
FROM tempdb.dbo.t0;

WITH CTE0 (Position) AS
(
	SELECT 1 Position
	UNION ALL
	SELECT Position + 1 
	FROM CTE0
	WHERE Position < @MaxWordLen
)
SELECT @nSQL = @nSQL + CASE Position 
	WHEN 1 THEN 'SELECT ASCII_Total, Word, ASCII_' + CAST(Position AS VARCHAR(3)) +  ' AS ASCII_Value FROM CTE2'
	ELSE ' UNION ALL SELECT ASCII_Total, Word, ASCII_' + CAST(Position AS VARCHAR(3)) +  ' AS ASCII_Value FROM CTE2'
	END 
FROM CTE0;

SELECT @nSQL = @nSQL +
') x2 ON x1.ASCII_Total = x2.ASCII_Total
WHERE ASCII_Value > 0
AND ASCII_Value <> 39
ORDER BY x1.ASCII_Total'

EXEC sp_executesql @nSQL;

CREATE CLUSTERED INDEX [idx_t2] ON [dbo].[t2] (Word, [ASCII_Value] ASC) ON [PRIMARY];

GO

--------------------------------------------------------------------------------------------------------
--  STEP 4 - Create Vertical ASCII Values Map - 1 row for all letters in a word, Sorted by ASCII_Value
IF OBJECT_ID('tempdb..t3') IS NOT NULL DROP TABLE tempdb..t3; 

WITH CTE AS
(
	SELECT x2.ASCII_Total, Word,  
	(
		SELECT CAST(x1.ASCII_Value AS VARCHAR(3)) + ' '
		FROM tempdb..t2 x1
		WHERE x1.ASCII_Total = x2.ASCII_Total AND x1.Word = x2.Word
		ORDER BY x1.ASCII_Value DESC
		FOR XML PATH ('')
	) AS ASCII_ValueMap
	FROM tempdb..t2 x2
)

SELECT ASCII_Total, Word, CAST(ASCII_ValueMAP AS VARCHAR(400)) AS ASCII_ValueMap -- Support words with up to 100 letters 
INTO tempdb..t3
FROM CTE
GROUP BY ASCII_Total, Word, CAST(ASCII_ValueMAP AS VARCHAR(400)) 
ORDER BY ASCII_Total, Word;

CREATE CLUSTERED INDEX IDX_T3_1 ON tempdb..t3(ASCII_ValueMap, ASCII_Total, Word);
CREATE NONCLUSTERED INDEX IDX_T3_2 ON tempdb..t3(ASCII_Total, ASCII_ValueMap);

GO

--------------------------------------------------------------------------------------------------------
-- STEP 5 - Create Results Table
IF OBJECT_ID('tempdb..t4') IS NOT NULL DROP TABLE tempdb..t4; 

;WITH CTE1 AS
(
	SELECT ASCII_Total, ASCII_ValueMap, REPLACE(Word,CHAR(39),'') AS Word, COUNT(*) AS NotImportant
	FROM tempdb..t3
	GROUP BY ASCII_Total, ASCII_ValueMap, REPLACE(Word,CHAR(39),'')
),
CTE2 AS
(
	SELECT ASCII_Total, ASCII_ValueMap, COUNT(*) AS Anagrams 
	FROM CTE1
	GROUP BY ASCII_Total, ASCII_ValueMap
),
CTE3 AS
(
	SELECT x2.*,
		(
				SELECT x1.Word
				FROM CTE1 x1
				WHERE x1.ASCII_ValueMap = x2.ASCII_ValueMap 
				ORDER BY Word
				FOR XML PATH('')	
		) AS Words,
		AnagramGroup = ROW_NUMBER() OVER(ORDER BY x2.ASCII_Total, x2.ASCII_ValueMap)
	FROM CTE2 x2
	WHERE Anagrams > 1
)

SELECT ASCII_Total, ASCII_ValueMap, Anagrams, Words, (CHARINDEX('<', Words,2) - CHARINDEX('>', Words,1) - 1) AS WordLength 
INTO tempdb.dbo.t4
FROM CTE3
ORDER BY Anagrams DESC;

CREATE CLUSTERED INDEX IDX_T4_1 ON tempdb..t4(Anagrams);
CREATE NONCLUSTERED INDEX IDX_T4_2 ON tempdb..t4(WordLength);

--------------------------------------------------------------------------------------------------------
-- Results - Top 20 most word anagrams & top 20 word length anagrams
SELECT TOP 20 *
FROM tempdb.dbo.t4
ORDER BY Anagrams DESC

SELECT TOP 20 *
FROM tempdb.dbo.t4
ORDER BY WordLength DESC

--------------------------------------------------------------------------------------------------------
-- Tidy up
--IF OBJECT_ID('tempdb..t4') IS NOT NULL DROP TABLE t4; 
--IF OBJECT_ID('tempdb..t3') IS NOT NULL DROP TABLE t3; 
--IF OBJECT_ID('tempdb..t2') IS NOT NULL DROP TABLE t2; 
--IF OBJECT_ID('tempdb..t1') IS NOT NULL DROP TABLE t1; 
--IF OBJECT_ID('tempdb..t0') IS NOT NULL DROP TABLE t0; 

GO




Categories: General

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: