Removing Characters in SQL

murph

Limp Gawd
Joined
Mar 17, 2003
Messages
190
I have a table of ~ 90,000 records. Each record has an SSN field and a incident number field. Can anyone provide a good reference or the like to show how to remove all non-alphanumeric characters?

Right now I have:
Code:
  CREATE PROCEDURE sp_DoStuff AS
  UPDATE My_table SET incident_num = REPLACE(incident_num, ' ', ''),  <-- Remove whitespace
  ssn = REPLACE(ssn,'-','') <-- Remove the dashes from the SSN field

I'm sure this isn't close to being correct or efficient but it works for the time being. I just would like some other input on the topic.

Thanks in advance.
 
I used to try to do things like this using stored procedures, but I have long since abandoned that method in favor of outside-the-database processing. Even with 100,000 records, it would take me far less time to write the processing code in PHP/Java/whatever, pipe all those records out to my chosen language, process them, and pipe it all back in than it would to write and debug a sproc.

Do you happen to have a way to do this outside of the database? I don't happen to know the typical string manipulation functions in SQL variants all that well, but I imagine it will take a lot of "replaces" to get it done, where a simple regexp in an outside language could do the same job.

If you don't have a way of doing that, what database environment are you using? I think the string functions vary from database to database. What you probably want to look at, though, is the TRANSLATE function (see here for example). It will probably be more concise than using a bunch of REPLACEs.
 
Time to bust out an updateable cursor...

Code:
DECLARE @ssn AS varchar(50)
DECLARE @incident _numAS varchar(50)
DECLARE Cur CURSOR FORWARD_ONLY SCROLL_LOCKS
  FOR SELECT ssn, incident _num FROM My_table WHERE incident_num LIKE '%[^a-z0-9]%' OR ssn LIKE '%[^a-z0-9]%' 
  FOR UPDATE OF ssn, incident _num

OPEN Cur
FETCH NEXT FROM Cur\
  INTO @ssn, @incident_num

WHILE @@FETCH_STATUS = 0
BEGIN
  WHILE PATINDEX('%[^a-z0-9]%', @ssn) > 0 
  BEGIN
    SET @ssn= REPLACE(@ssn, SUBSTRING(@ssn, PATINDEX('%[^a-z0-9]%', @ssn), 1), '')
  END

  WHILE PATINDEX('%[^a-z0-9]%', @incident _num) > 0 
  BEGIN
    SET @ssn= REPLACE(@incident _num, SUBSTRING(@incident _num, PATINDEX('%[^a-z0-9]%', @incident _num), 1), '')
  END

  UPDATE My_table
    SET ssn = @ssn, incident _num = @incident _num
    WHERE CURRENT OF Cur 

  FETCH NEXT FROM Cur
    INTO @ssn, @incident_num
END

CLOSE Cur
DEALLOCATE Cur

EDIT: That's T-SQL, which will work for SQL Server. Your DBMS may vary, but the basic concepts remain the same.

EDIT2: Added "INTO @ssn, @incident_num" to the FETCH statements, which I missed before.
 
Hammer:

That's great, thank you. The issue now is that I'm sure it will work, but I need to learn what it all means! I haven't gotten quite that far yet, but thank you either way.
 
Code:
DECLARE @ssn AS varchar(50)
DECLARE @incident _numAS varchar(50)
DECLARE Cur [COLOR=DarkOrange]<- cursor name[/COLOR] CURSOR FORWARD_ONLY [COLOR=DarkOrange]<- can only FETCH the NEXT record (can't go backwards or random, only forward)[/COLOR] SCROLL_LOCKS  [COLOR=DarkOrange]<- locks all the rows when you open the cursor, so UPDATEs and DELETEs are guaranteed to work[/COLOR] 
  FOR  [COLOR=DarkOrange]<- specify what your going to grab ->[/COLOR] SELECT ssn, incident _num FROM My_table WHERE incident_num LIKE '%[^a-z0-9]%' OR ssn LIKE '%[^a-z0-9]%' 
  FOR UPDATE OF ssn, incident _num [COLOR=DarkOrange]<- specify what columns can be UPDATEd ->[/COLOR]

OPEN Cur [COLOR=DarkOrange]<- OPEN the cursor, which locks all the rows in it[/COLOR]
FETCH NEXT FROM Cur [COLOR=DarkOrange]<- Get the next row (when you OPEN, the cursor is position before the first (if any) row)[/COLOR]
  INTO @ssn, @incident_num [COLOR=DarkOrange]<- forgot to include this before; place the data from the row into the corresponding variables[/COLOR]

WHILE [COLOR=DarkOrange]<- loop to process all rows in the cursor[/COLOR] @@FETCH_STATUS = 0 [COLOR=DarkOrange]<- check to see if the last FETCH was successful (a FETCH made when there are no rows remaining will not succeed)[/COLOR]
BEGIN [COLOR=DarkOrange]<- Open a block[/COLOR]
  WHILE [COLOR=DarkOrange]<- Loop to catch every non-alpha-numeric character in the variable[/COLOR] PATINDEX[COLOR=DarkOrange]<- find the index of a pattern in a string[/COLOR]('%[^a-z0-9]%'[COLOR=DarkOrange]<- matches to first index containing a non-alpha-numeric character[/COLOR], @ssn) > 0 [COLOR=DarkOrange]<- 0 indicates the pattern wasn't found[/COLOR]
  BEGIN [COLOR=DarkOrange]<- Open another block[/COLOR]
    SET @ssn= REPLACE(@ssn, SUBSTRING(@ssn, PATINDEX('%[^a-z0-9]%', @ssn), 1), '') [COLOR=DarkOrange]<- Remove the found non-alpha-numeric character anywhere it is found in the variable (the loop is due to the possibility that there may be multiple flavors of non-alpha-numeric characters found in the variable)[/COLOR] 
  END [COLOR=DarkOrange]<- Close the block, making this the end of the inner WHILE loop[/COLOR]

[COLOR=DarkOrange]Rinse and repeat for incident _num[/COLOR]
  WHILE PATINDEX('%[^a-z0-9]%', @incident _num) > 0 
  BEGIN
    SET @ssn= REPLACE(@incident _num, SUBSTRING(@incident _num, PATINDEX('%[^a-z0-9]%', @incident _num), 1), '')
  END

  UPDATE My_table
    SET ssn = @ssn, incident _num = @incident _num
    WHERE CURRENT OF Cur [COLOR=DarkOrange]<- specifies that it UPDATEs the current row in the cursor[/COLOR]

  FETCH NEXT FROM Cur
    INTO @ssn, @incident_num
END  [COLOR=DarkOrange]<- Close the block, making this the end of the outer WHILE loop[/COLOR]

CLOSE Cur [COLOR=DarkOrange]<- Close the cursor, releasing all locks it held[/COLOR]
DEALLOCATE Cur [COLOR=DarkOrange]<- Deallocate cursor, releasing all resources it required[/COLOR]

Note: I forgot to include "INTO @ssn, @incident_num" in the previous version.

From SQL Server Books Online:
PATINDEX
Returns the starting position of the first occurrence of a pattern in a specified expression, or zeros if the pattern is not found, on all valid text and character data types.

Syntax
PATINDEX ( '%pattern%' , expression )

Arguments
pattern

Is a literal string. Wildcard characters can be used; however, the % character must precede and follow pattern (except when searching for first or last characters). pattern is an expression of the short character data type category.

expression

Is an expression, usually a column that is searched for the specified pattern. expression is of the character string data type category.

Edit: Chose a more visible comment color.
Edit2: Comments completed.
 
Thank you for the information. I really appreciate it. This will definitely help out a tremendous amount.
 
Back
Top