Python Text Manipulation (Splitting)
Strings in Python are objects, and because of Python’s strong operator support, simple tasks like concatenating two strings together can be done easily using the + operator. Strings are also sequences, and that means that you can address a single character or a sequence of characters (a slice) without resorting to an additional function to extract the characters you want.
But what about other operations, like changing the case of a string or splitting a text record where fields are delimited by a single character into separate fields?
Here’s an example: Let’s say you want to replace a portion of text in a Python string with another portion of text, for example, replacing the word “cat” with “dog” in the expression “the cat sat on the mat”. Although you can find the location of the word using the index() method, you cannot replace it in place. Strings are immutable: you cannot replace the characters in the sequence directly, even if the replacement characters are the same length as the characters you’re replacing. You could do it by finding the location of the word, slicing out everything up to the word and everything after the word, and then reassembling the string.
Messy though, isn’t it? What happens if the original expression is “the Cat sat on the mat”, or “The Cat Sat on the Mat”? Worse still, what happens if you want to replace either “Cat” or “Mouse” with “Elephant”? You’d have to account for all the different cases of the words and all the different possible words in order to make the process work.
Alternatively, you could use regular expressions. Regular expressions are really just another language that allows you to specify what you want to be replaced. The language runs from the simple act of replacing “cat” with “dog”, through to replacing “cat” or “mouse” with “elephant”, irrespective of case, right up to replacing “cat” with “dog” but only if the cat is “sat” on the “mat”, and only if it’s all in lowercase.
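The cases just described can be sketched with the re module. This is a minimal illustration rather than a definitive recipe: the sample sentences and the variable names (any_case, only_on_mat) are invented for the example, and the lookahead pattern is just one way to express “only if the cat sat on the mat”.

```python
import re

text = "The Cat sat on the mat with a mouse"

# Replace "cat" or "mouse" with "elephant", irrespective of case.
any_case = re.sub(r"\b(cat|mouse)\b", "elephant", text, flags=re.IGNORECASE)

# Replace "cat" with "dog" only when it is followed by " sat on the mat".
# The match is case-sensitive, so only the all-lowercase phrase qualifies.
pattern = r"cat(?= sat on the mat)"
only_on_mat = re.sub(pattern, "dog", "the cat sat on the mat")
unchanged = re.sub(pattern, "dog", "The Cat Sat on the Mat")

print(any_case)
print(only_on_mat)
print(unchanged)
```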
For most of the basic string manipulations, you can rely on either the built-in functionality offered by the string object type or the string module. For regular expressions, you use the re module. This section discusses both of these modules.
Basic String Manipulation
The string module defines a host of useful functions for manipulating, extracting, and mutating strings. This section describes the string module in detail, including the built-in methods and operators for manipulating strings.
Throughout this section, it’s important to remember that you must capture the return values from all of these functions: strings are immutable, so Python never changes a string in place, even when you call a method on a specific string object.
Finding String Segments
To find the location of a particular string within another string, use the index() function. For example, to find the string “cat” in the string “the cat sat on the mat”, use
loc = string.index('the cat sat on the mat', 'cat')
The loc variable should now contain the value 4 since the word you are looking for appears at the fifth character in the string (remember that string indices start at 0).
The index function also accepts a third argument: the index within the string where the search should start. For example, if you are looking for “at”, but aren’t interested in the one in “cat”, you could use
lastat = string.index('the cat sat on the mat', 'at', 8)
Finally, you can look for a string starting at the end of the string and working backward, rather than starting from the beginning of the string using the rindex function. Note that the order of the string is not affected, only where you start looking for the text. For example, to pick up the word “on”, starting from the end, use the following:
rloc = string.rindex('The cat sat on the mat', 'on')
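As a quick check of the searches above, here is a small sketch that assumes Python 3, where index() and rindex() are called as methods on the string itself rather than as string-module functions. The variable names are illustrative only.

```python
text = "the cat sat on the mat"

loc = text.index("cat")        # first occurrence of "cat"
next_at = text.index("at", 8)  # start searching at index 8, skipping "cat"
rloc = text.rindex("on")       # search backward from the end of the string

# index() raises ValueError when the text is absent; find() returns -1 instead.
missing = text.find("dog")

print(loc, next_at, rloc, missing)
```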
Replacing String Segments
In Python, the string object does not support assignment, so the following command still does not work:
text[4:7] = 'tyrannosaurus rex'
A TypeError exception will be raised if you try this. Instead, the most straightforward solution is to use slices to extract the text before and after the portion of text that you want to replace and then use concatenation to reassemble the string, as follows:
text = text[:4] + 'tyrannosaurus rex' + text[7:]
Alternatively, if you can be sure that the string you are searching for can be found without skipping over elements, you can use the string module’s replace() function:
replace(text, old, new[, max])
This function replaces old with new in the string text, either as many times as it is seen or max times if the argument is supplied. You can therefore rewrite the preceding example as:
text = string.replace(text, 'cat', 'tyrannosaurus rex')
You can also use the replace() method directly on the string, again capturing the return value:
text = text.replace('cat', 'tyrannosaurus rex')
Note, however, that searches always start from the beginning of the string. Replacing “at” once changes “cat”; you can’t start the search from anyplace other than the start. If you want to replace a word that appears at the end of a string, either use a regular expression or use slices to make the change only in the component of the original string that you want to modify. For example,
text = text[:-3] + string.replace(text[-3:], 'at', 'oon')
changes the text variable to read “the cat sat on the moon”.
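The replacement techniques in this section can be compared side by side. The sketch below assumes Python 3, where replace() is a string method, and uses illustrative variable names; it is meant only to show how the count argument and slicing change the outcome.

```python
text = "the cat sat on the mat"

# Slice out the old word and reassemble the string around the new one.
sliced = text[:4] + "tyrannosaurus rex" + text[7:]

# replace() with no count changes every occurrence.
all_replaced = text.replace("at", "oon")

# With a count of 1, only the first occurrence (inside "cat") changes.
first_only = text.replace("at", "oon", 1)

# Restrict the replacement to the last three characters using slices.
last_only = text[:-3] + text[-3:].replace("at", "oon")

print(sliced, all_replaced, first_only, last_only, sep="\n")
```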
Splitting
The Python split() function comes as part of the string module and allows you to split a string by a single or multi-character sequence. For example, you might want to split a list of fields separated by commas into a list of separate fields. The function splits a string into component parts, returning a list. The general format of the split() function is
split(text[, expr[, max]])
The text argument is the string that you want to split and expr is the separator. If you do not supply expr, Python assumes you want to split by whitespace. For example, you can convert a sentence of text into a list of individual words as follows:
words = string.split('I pushed the button but nothing happened')
You can also use the split() function to extract individual fields from a character-delimited string such as
fields = string.split('rod:TA266:$23.99', ':')
If the max argument is supplied, it operates in the same way as the Perl version, limiting the number of times that the split operation occurs. If you want to perform a regular expression-based split, use the split() function that comes with the re module:
import re
fields = re.split(r'[\s,:;]+', text)
The preceding statement splits text wherever it sees a run of one or more whitespace characters, commas, colons, or semicolons. See the section “Regular Expressions” later in this chapter.
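To make the different kinds of split concrete, here is a brief sketch assuming Python 3, where split() is a string method. The record “rod:TA266:$23.99” follows the example above with the separators tightened, and the mixed-delimiter input is made up for illustration.

```python
import re

# No separator: split on runs of whitespace.
words = "I pushed the button but nothing happened".split()

# Single-character separator.
fields = "rod:TA266:$23.99".split(":")

# A max of 1 stops after the first split; the rest stays in one piece.
limited = "rod:TA266:$23.99".split(":", 1)

# Regular-expression split on whitespace, commas, colons, or semicolons.
mixed = re.split(r"[\s,:;]+", "alpha, beta;gamma:delta  epsilon")

print(words, fields, limited, mixed, sep="\n")
```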
Joining
Although you concatenate two strings together in Python using the + operator, the process can be quite tedious if you have a number of strings that you want to join together using the same character or sequence. For example, imagine that you want to reassemble the sentence you split earlier. Because you don’t know the length of the list of words, you’d ordinarily need to use a loop:
s = ''
for word in words:
    s += word + ' '
Using this technique, you end up with a useless space on the end of the string. The solution is to use the join() function in the string module. The join function joins together all the elements in a list using the same separator, but it only places the separator between the elements that it’s concatenating, as in the following example:
names = string.join(['martin', 'sohail', 'wendy', 'rikke'], ', ')
As you can see, the first argument is the list or tuple that you want to join together, and the second argument is the separator you want to place between each element pair. The value of names is:
'martin, sohail, wendy, rikke'
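For comparison, the sketch below shows the loop and the join side by side. It assumes Python 3, where join() is a method of the separator string and takes the sequence as its argument, which reverses the argument order described above. The names looped and sentence are illustrative.

```python
words = ["I", "pushed", "the", "button"]

# The loop approach leaves a trailing separator behind.
looped = ""
for word in words:
    looped += word + " "

# join() places the separator only between elements.
sentence = " ".join(words)
names = ", ".join(["martin", "sohail", "wendy", "rikke"])

# Every element must be a string, so convert other types first.
numbers = ", ".join(str(n) for n in [1, 2, 3])

print(repr(looped), repr(sentence), names, numbers, sep="\n")
```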
Trimming
You can trim the leading and/or trailing whitespace from a string in Python using the lstrip(), rstrip(), and strip() functions. The mnemonic here is that lstrip() strips the left whitespace, rstrip() strips the right whitespace, and strip() strips both:
text = ' leading space trailing '
string.lstrip(text) # returns 'leading space trailing '
string.rstrip(text) # returns ' leading space trailing'
string.strip(text) # returns 'leading space trailing'
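The same trimming can be tried out as follows. This sketch assumes Python 3, where the three functions are string methods. The optional argument naming the characters to strip is a feature of those methods and is not covered in the text above.

```python
text = "   leading space trailing   "

left = text.lstrip()   # removes leading whitespace only
right = text.rstrip()  # removes trailing whitespace only
both = text.strip()    # removes whitespace from both ends

# Optionally name the characters to strip instead of whitespace.
dashed = "--title--".strip("-")

print(repr(left), repr(right), repr(both), repr(dashed), sep="\n")
```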
Changing Case
Python supports a range of case translation functions through the string module. These functions change the case and return the new version of the string. Table 10-4 lists these functions.
For example, to change a word to lowercase, use
lctext = string.lower(text)
Function | Description |
string.lower(text) | Change all characters in the text to lowercase |
string.upper(text) | Change all characters in the text to uppercase |
string.capitalize(text) | Change the first character in the text to uppercase |
string.capwords(text) | Change the first character of each word in the text to uppercase |
string.swapcase(text) | Swap the case of all characters in the text (lowercase to uppercase, uppercase to lowercase) |
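The functions in the table can be exercised with a short sketch. It assumes Python 3, where lower(), upper(), capitalize(), and swapcase() are string methods, while capwords() is still provided by the string module. The sample text is invented for the example.

```python
import string

text = "the cat SAT on the mat"

lowered = text.lower()
uppered = text.upper()
# capitalize() uppercases the first character and lowercases the rest.
capitalized = text.capitalize()
# capwords() capitalizes each whitespace-separated word.
capworded = string.capwords(text)
swapped = text.swapcase()

print(lowered, uppered, capitalized, capworded, swapped, sep="\n")
```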
Translating Characters
Although there are times when all you want to do is change the case of a string, there are also times when you want to translate individual characters. For example, imagine you have the string “ffffff” when you need the string “aaaaaa”. You could do the translation manually, but there is an easier way: use the translate function in the string module.
The translate function accepts two arguments: a string and a translation table that maps the characters you want to find to the replacement characters you want to use. This mapping needs to be generated by the maketrans() function, which itself accepts two arguments: the list of characters to be translated from and the corresponding characters that you want them translated to. The two lists must be the same length.
For example, you can change all lowercase characters to uppercase using
string.translate(text, string.maketrans('abcdefghijklmnopqrstuvwxyz', 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'))
The two strings are matched up position by position: in the preceding example, the first character in the first string is “a” and the first character in the second string is “A”, so when the string is translated, each occurrence of “a” is replaced with “A”.
Let’s consider a different example, translating lowercase characters to their opposite number, i.e., “a” to “z” and “z” to “a”:
string.translate(text, string.maketrans('abcdefghijklmnopqrstuvwxyz', 'zyxwvutsrqponmlkjihgfedcba'))
Using the preceding example, the result of translating “the cat sat on the mat” is “gsv xzg hzg lm gsv nzg”!
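The reversed-alphabet translation can be verified with a small sketch. It assumes Python 3, where the table is built with str.maketrans() and applied with the translate() method of the string, rather than with the string-module functions. The names table, encoded, and decoded are illustrative.

```python
lower = "abcdefghijklmnopqrstuvwxyz"

# Map each letter to its opposite number: a <-> z, b <-> y, and so on.
table = str.maketrans(lower, lower[::-1])

encoded = "the cat sat on the mat".translate(table)
# The mapping is its own inverse, so translating again restores the text.
decoded = encoded.translate(table)

# The single-character case from the start of this section.
fixed = "ffffff".translate(str.maketrans("f", "a"))

print(encoded, decoded, fixed, sep="\n")
```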
Standard Character Definitions
In addition to the functions you’ve already seen, the string module also defines some constant string sequences that can be used with many of the functions described in this section. Table 10-5 lists the constants that you can use within functions like maketrans() to refer to specific groups of characters.
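As a hedged illustration of using such constants with maketrans(), the sketch below assumes Python 3 and its constant names (string.ascii_lowercase, string.ascii_uppercase, string.digits). These names may differ from the ones listed in the table for the Python version this section describes.

```python
import string

# Build the lowercase-to-uppercase table from constants instead of literals.
upper_table = str.maketrans(string.ascii_lowercase, string.ascii_uppercase)
shouted = "the cat sat on the mat".translate(upper_table)

# A third argument to str.maketrans() lists characters to delete outright.
digit_stripper = str.maketrans("", "", string.digits)
no_digits = "TA266".translate(digit_stripper)

print(shouted)
print(no_digits)
```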