(Programming Python) Directory Tools


Translation of Programming Python, 3rd Edition
The latest version is on the wiki: http://wiki.woodpecker.org.cn/moin/PP3eD
Contributions to the translation and its revision are welcome.


4.3. Directory Tools



One of the more common tasks in the shell utilities domain is applying an operation to a set of files in a directory (a "folder" in Windows-speak). By running a script on a batch of files, we can automate (that is, script) tasks we might otherwise have to run repeatedly by hand.


For instance, suppose you need to search all of your Python files in a development directory for a global variable name (perhaps you've forgotten where it is used). There are many platform-specific ways to do this (e.g., the grep command in Unix), but Python scripts that accomplish such tasks will work on every platform where Python works: Windows, Unix, Linux, Macintosh, and just about any other platform commonly used today. If you simply copy your script to any machine you wish to use it on, it will work regardless of which other tools are available there.


4.3.1. Walking One Directory



The most common way to go about writing such tools is to first grab a list of the names of the files you wish to process, and then step through that list with a Python for loop, processing each file in turn. The trick we need to learn here, then, is how to get such a directory list within our scripts. There are at least three options: running shell listing commands with os.popen, matching filename patterns with glob.glob, and getting directory listings with os.listdir. They vary in interface, result format, and portability.


4.3.1.1. Running shell listing commands with os.popen



Quick: how did you go about getting directory file listings before you heard of Python? If you're new to shell tools programming, the answer may be "Well, I started a Windows file explorer and clicked on stuff," but I'm thinking here in terms of less GUI-oriented command-line mechanisms (and answers submitted in Perl and Tcl get only partial credit).


On Unix, directory listings are usually obtained by typing ls in a shell; on Windows, they can be generated with a dir command typed in an MS-DOS console box. Because Python scripts may use os.popen to run any command line that we can type in a shell, this is the most general way to grab a directory listing inside a Python program. We met os.popen in the prior chapter; it runs a shell command string and gives us a file object from which we can read the command's output. To illustrate, let's first assume the following directory structures (yes, I have both dir and ls commands on my Windows laptop; old habits die hard):


C:\temp>dir /B
about-pp.html
python1.5.tar.gz
about-pp2e.html
about-ppr2e.html
newdir

C:\temp>ls
about-pp.html about-ppr2e.html python1.5.tar.gz
about-pp2e.html newdir

C:\temp>ls newdir
more temp1 temp2 temp3




The newdir name is a nested subdirectory in C:\temp here. Now, scripts can grab a listing of file and directory names at this level by simply spawning the appropriate platform-specific command line and reading its output (the text normally thrown up on the console window):


C:\temp>python
>>> import os
>>> os.popen('dir /B').readlines( )
['about-pp.html\n', 'python1.5.tar.gz\n', 'about-pp2e.html\n',
'about-ppr2e.html\n', 'newdir\n']



Lines read from a shell command come back with a trailing end-of-line character, but it's easy enough to slice off with a for loop or list comprehension expression as in the following code:


>>> for line in os.popen('dir /B').readlines( ):
...     print line[:-1]
...
about-pp.html
python1.5.tar.gz
about-pp2e.html
about-ppr2e.html
newdir

>>> lines = [line[:-1] for line in os.popen('dir /B')]
>>> lines
['about-pp.html', 'python1.5.tar.gz', 'about-pp2e.html',
'about-ppr2e.html', 'newdir']



One subtle thing: notice that the object returned by os.popen has an iterator that reads one line per request (i.e., per next( ) method call), just like normal files, so calling the readlines method is optional here unless you really need to extract the result list all at once (see the discussion of file iterators earlier in this chapter). For pipe objects, the effect of iterators is even more useful than simply avoiding loading the entire result into memory all at once: readlines will block the caller until the spawned program is completely finished, whereas the iterator might not.
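The iterator point can be sketched as a small helper. This is only an illustration written for Python 3 (where os.popen still exists), not the Python 2 used in this section's transcripts; the helper name shell_listing and the choice of dir /B on Windows versus ls elsewhere are my own assumptions, and the path quoting is deliberately minimal.

```python
import os

def shell_listing(dirpath='.'):
    """Return the entry names in dirpath by running the platform's listing command."""
    # Assumption: 'dir /B' on Windows and 'ls' everywhere else.
    command = 'dir /B "%s"' if os.name == 'nt' else 'ls "%s"'
    pipe = os.popen(command % dirpath)
    try:
        # Iterating the pipe fetches one line per step rather than
        # waiting for readlines() to collect the whole output first.
        return [line.rstrip('\n') for line in pipe]
    finally:
        pipe.close()
```

Calling shell_listing('.') should give roughly what the transcripts above show, minus the end-of-line characters. Note that the sketch strips with rstrip('\n') rather than slicing off the last character, so a final line without a newline is left intact.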


The dir and ls commands let us be specific about filename patterns to be matched and directory names to be listed; again, we're just running shell commands here, so anything you can type at a shell prompt goes:


>>> os.popen('dir *.html /B').readlines( )
['about-pp.html\n', 'about-pp2e.html\n', 'about-ppr2e.html\n']

>>> os.popen('ls *.html').readlines( )
['about-pp.html\n', 'about-pp2e.html\n', 'about-ppr2e.html\n']

>>> os.popen('dir newdir /B').readlines( )
['temp1\n', 'temp2\n', 'temp3\n', 'more\n']

>>> os.popen('ls newdir').readlines( )
['more\n', 'temp1\n', 'temp2\n', 'temp3\n']




These calls use general tools and work as advertised. As I noted earlier, though, the downsides of os.popen are that it requires using a platform-specific shell command and it incurs a performance hit to start up an independent program. The following two alternative techniques do better on both counts.


4.3.1.2. The glob module



The term globbing comes from the * wildcard character in filename patterns; per computing folklore, a * matches a "glob" of characters. In less poetic terms, globbing simply means collecting the names of all entries in a directory (files and subdirectories) whose names match a given filename pattern. In Unix shells, globbing expands filename patterns within a command line into all matching filenames before the command is ever run. In Python, we can do something similar by calling the glob.glob built-in with a pattern to expand:


>>> import glob
>>> glob.glob('*')
['about-pp.html', 'python1.5.tar.gz', 'about-pp2e.html', 'about-ppr2e.html',
'newdir']

>>> glob.glob('*.html')
['about-pp.html', 'about-pp2e.html', 'about-ppr2e.html']

>>> glob.glob('newdir/*')
['newdir\\temp1', 'newdir\\temp2', 'newdir\\temp3', 'newdir\\more']



The glob call accepts the usual filename pattern syntax used in shells (e.g., ? means any one character, * means any number of characters, and [] is a character selection set).[*] The pattern should include a directory path if you wish to glob in something other than the current working directory, and the module accepts either Unix or DOS-style directory separators (/ or \). Also, this call is implemented without spawning a shell command and so is likely to be faster and more portable across all Python platforms than the os.popen schemes shown earlier.


[*] In fact, glob just uses the standard fnmatch module to match name patterns; see the fnmatch description later in this chapter for more details.
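To make the three pattern forms concrete, here is a small Python 3 sketch that builds a scratch directory and expands one pattern of each kind. The file names and the helper name demo_patterns are invented for illustration; only glob.glob, os.path, and tempfile come from the standard library.

```python
import glob
import os
import tempfile

def demo_patterns():
    """Create a scratch directory and expand a few patterns inside it."""
    scratch = tempfile.mkdtemp()
    for name in ('about-pp.html', 'about-pp2e.html', 'notes.txt', 'temp1', 'temp2'):
        open(os.path.join(scratch, name), 'w').close()

    def expand(pattern):
        # glob returns names with their directory prefix; keep the base names, sorted
        paths = glob.glob(os.path.join(scratch, pattern))
        return sorted(os.path.basename(path) for path in paths)

    return {
        '*.html': expand('*.html'),      # * matches any run of characters
        'temp?': expand('temp?'),        # ? matches exactly one character
        '*[0-9]*': expand('*[0-9]*'),    # [] matches one character from a set
    }
```

The results are sorted here only to make them predictable; glob itself promises no particular order.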


Technically speaking, glob is a bit more powerful than described so far. In fact, using it to list files in one directory is just one use of its pattern-matching skills. For instance, it can also be used to collect matching names across multiple directories, simply because each level in a passed-in directory path can be a pattern too:


C:\temp>python
>>> import glob
>>> for name in glob.glob('*examples/L*.py'): print name
...
cpexamples\Launcher.py
cpexamples\Launch_PyGadgets.py
cpexamples\LaunchBrowser.py
cpexamples\launchmodes.py
examples\Launcher.py
examples\Launch_PyGadgets.py
examples\LaunchBrowser.py
examples\launchmodes.py

>>> for name in glob.glob(r'*\*\visitor_find*.py'): print name
...
cpexamples\PyTools\visitor_find.py
cpexamples\PyTools\visitor_find_quiet2.py
cpexamples\PyTools\visitor_find_quiet1.py
examples\PyTools\visitor_find.py
examples\PyTools\visitor_find_quiet2.py
examples\PyTools\visitor_find_quiet1.py




In the first call here, we get back filenames from two different directories that match the *examples pattern; in the second, both of the first directory levels are wildcards, so Python collects all possible ways to reach the base filenames. Using os.popen to spawn shell commands achieves the same effect only if the underlying shell or listing command does too.
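The same directory-level wildcard idea can be tried without the book's examples tree. In this Python 3 sketch the tree layout and the function names build_tree and launchers are made up for the demonstration; the pattern itself mirrors the '*examples/L*.py' call above.

```python
import glob
import os
import tempfile

def build_tree(root):
    """Populate root with a few empty files in sibling directories."""
    for rel in ('examples/Launcher.py', 'cpexamples/Launcher.py',
                'examples/other.py', 'docs/Launcher.py'):
        path = os.path.join(root, *rel.split('/'))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, 'w').close()

def launchers(root):
    """Match L*.py files in every directory under root whose name ends in 'examples'."""
    pattern = os.path.join(root, '*examples', 'L*.py')   # a wildcard at the directory level too
    return sorted(os.path.relpath(path, root) for path in glob.glob(pattern))
```

After build_tree(root), launchers(root) picks up Launcher.py from both examples and cpexamples, and skips docs because that directory name does not match the first pattern component.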


4.3.1.3. The os.listdir call



The os module's listdir call provides yet another way to collect filenames in a Python list. It takes a simple directory name string, not a filename pattern, and returns a list containing the names of all entries in that directory (both simple files and nested directories) for use in the calling script:


>>> os.listdir('.')
['about-pp.html', 'python1.5.tar.gz', 'about-pp2e.html', 'about-ppr2e.html',
'newdir']

>>> os.listdir(os.curdir)
['about-pp.html', 'python1.5.tar.gz', 'about-pp2e.html', 'about-ppr2e.html',
'newdir']

>>> os.listdir('newdir')
['temp1', 'temp2', 'temp3', 'more']



This too is done without resorting to shell commands and so is portable to all major Python platforms. The result is in no particular order (but can be sorted with the list sort method), contains base filenames without their directory path prefixes, and includes the names of both files and directories at the listed level.


To compare all three listing techniques, let's run them here side by side on an explicit directory. They differ in some ways but are mostly just variations on a theme: os.popen sorts names and returns end-of-lines, glob.glob accepts a pattern and returns filenames with directory prefixes, and os.listdir takes a simple directory name and returns names without directory prefixes:


>>> os.popen('ls C:\PP3rdEd').readlines( )
['README.txt\n', 'cdrom\n', 'chapters\n', 'etc\n', 'examples\n',
'examples.tar.gz\n', 'figures\n', 'shots\n']

>>> glob.glob('C:\PP3rdEd\*')
['C:\\PP3rdEd\\examples.tar.gz', 'C:\\PP3rdEd\\README.txt',
'C:\\PP3rdEd\\shots', 'C:\\PP3rdEd\\figures', 'C:\\PP3rdEd\\examples',
'C:\\PP3rdEd\\etc', 'C:\\PP3rdEd\\chapters', 'C:\\PP3rdEd\\cdrom']

>>> os.listdir('C:\PP3rdEd')
['examples.tar.gz', 'README.txt', 'shots', 'figures', 'examples', 'etc',
'chapters', 'cdrom']



Of these three, glob and listdir are generally better options if you care about script portability, and listdir seems fastest in recent Python releases (but gauge its performance yourself; implementations may change over time).
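One way to gauge that performance yourself is the standard timeit module. The following Python 3 sketch is just one possible harness: the function name time_listings and the repeat count are arbitrary choices of mine, and the numbers it returns will vary by machine, directory size, and Python release.

```python
import glob
import os
import timeit

def time_listings(dirpath='.', repeats=200):
    """Return the seconds each technique takes to list dirpath `repeats` times."""
    pattern = os.path.join(dirpath, '*')
    return {
        'os.listdir': timeit.timeit(lambda: os.listdir(dirpath), number=repeats),
        'glob.glob': timeit.timeit(lambda: glob.glob(pattern), number=repeats),
    }
```

Keep in mind that the two calls are not exactly equivalent: a '*' pattern skips names that begin with a dot, whereas os.listdir returns them, so compare timings on a directory where that difference does not matter to you.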


4.3.1.4. Splitting and joining listing results



In the last example, I pointed out that glob returns names with directory paths, whereas listdir gives raw base filenames. For convenient processing, scripts often need to split glob results into base files or expand listdir results into full paths. Such translations are easy if we let the os.path module do all the work for us. For example, a script that intends to copy all files elsewhere will typically need to first split off the base filenames from glob results so that it can add different directory names on the front:


>>> dirname = r'C:\PP3rdEd'
>>> for file in glob.glob(dirname + '/*'):
...     head, tail = os.path.split(file)
...     print head, tail, '=>', ('C:\\Other\\' + tail)
...
C:\PP3rdEd examples.tar.gz => C:\Other\examples.tar.gz
C:\PP3rdEd README.txt => C:\Other\README.txt
C:\PP3rdEd shots => C:\Other\shots
C:\PP3rdEd figures => C:\Other\figures
C:\PP3rdEd examples => C:\Other\examples
C:\PP3rdEd etc => C:\Other\etc
C:\PP3rdEd chapters => C:\Other\chapters
C:\PP3rdEd cdrom => C:\Other\cdrom




Here, the names after the => represent names that files might be moved to. Conversely, a script that means to process all files in a different directory than the one it runs in will probably need to prepend listdir results with the target directory name before passing filenames on to other tools:


>>> for file in os.listdir(dirname):
...     print os.path.join(dirname, file)
...
C:\PP3rdEd\examples.tar.gz
C:\PP3rdEd\README.txt
C:\PP3rdEd\shots
C:\PP3rdEd\figures
C:\PP3rdEd\examples
C:\PP3rdEd\etc
C:\PP3rdEd\chapters
C:\PP3rdEd\cdrom
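Both translations can be wrapped in tiny helpers. The sketch below is Python 3 and uses ntpath, the Windows flavor of os.path, so that it behaves the same on any platform; in a real script you would normally just use os.path. The function names retarget and expand are hypothetical.

```python
import ntpath   # the Windows implementation of os.path, importable on any platform

def retarget(path, newdir):
    """Swap the directory part of a Windows-style path for newdir."""
    head, tail = ntpath.split(path)       # ('C:\\PP3rdEd', 'README.txt')
    return ntpath.join(newdir, tail)

def expand(dirname, names):
    """Turn listdir-style base names into full paths under dirname."""
    return [ntpath.join(dirname, name) for name in names]
```

For instance, retarget(r'C:\PP3rdEd\README.txt', r'C:\Other') gives the C:\Other\README.txt name shown after the => above, and expand(r'C:\PP3rdEd', ['shots', 'etc']) rebuilds the full paths of the second listing.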



4.3.2. Walking Directory Trees



As you read the prior section, you may have noticed that all of the preceding techniques return the names of files in only a single directory. What if you want to apply an operation to every file in every directory and subdirectory in an entire directory tree?


For instance, suppose again that we need to find every occurrence of a global name in our Python scripts. This time, though, our scripts are arranged into a module package: a directory with nested subdirectories, which may have subdirectories of their own. We could rerun our hypothetical single-directory searcher manually in every directory in the tree, but that's tedious, error prone, and just plain not fun.


Luckily, in Python it's almost as easy to process a directory tree as it is to inspect a single directory. We can either write a recursive routine to traverse the tree, or use one of two tree-walker utilities built into the os module. Such tools can be used to search, copy, compare, and otherwise process arbitrary directory trees on any platform that Python runs on (and that's just about everywhere).


4.3.2.1. The os.path.walk visitor



To make it easy to apply an operation to all files in a tree hierarchy, Python comes with a utility that scans trees for us and runs a provided function at every directory along the way. The os.path.walk function is called with a directory root, function object, and optional data item, and walks the tree at the directory root and below. At each directory, the function object passed in is called with the optional data item, the name of the current directory, and a list of filenames in that directory (obtained from os.listdir). Typically, the function we provide (often referred to as a callback function) scans the filenames list to process files at each directory level in the tree.


That description might sound horribly complex the first time you hear it, but os.path.walk is fairly straightforward once you get the hang of it. In the following code, for example, the lister function is called from os.path.walk at each directory in the tree rooted at '.' (the current directory). Along the way, lister simply prints the directory name and all the files at the current level (after prepending the directory name). It's simpler in Python than in English:


>>> import os
>>> def lister(dummy, dirname, filesindir):
...     print '[' + dirname + ']'
...     for fname in filesindir:
...         print os.path.join(dirname, fname)         # handle one file
...
>>> os.path.walk('.', lister, None)
[.]
.\about-pp.html
.\python1.5.tar.gz
.\about-pp2e.html
.\about-ppr2e.html
.\newdir
[.\newdir]
.\newdir\temp1
.\newdir\temp2
.\newdir\temp3
.\newdir\more
[.\newdir\more]
.\newdir\more\xxx.txt
.\newdir\more\yyy.txt




In other words, we've coded our own custom (and easily changed) recursive directory listing tool in Python. Because this may be something we would like to tweak and reuse elsewhere, let's make it permanently available in a module file, as shown in Example 4-4, now that we've worked out the details interactively.


Example 4-4. PP3E\System\Filetools\lister_walk.py
# list file tree with os.path.walk
import sys, os

def lister(dummy, dirName, filesInDir):              # called at each dir
    print '[' + dirName + ']'
    for fname in filesInDir:                         # includes subdir names
        path = os.path.join(dirName, fname)          # add dir name prefix
        if not os.path.isdir(path):                  # print simple files only
            print path

if __name__ == '__main__':
    os.path.walk(sys.argv[1], lister, None)          # dir name in cmdline





This is the same code except that directory names are filtered out of the filenames list by consulting the os.path.isdir test in order to avoid listing them twice (see, it's been tweaked already). When packaged this way, the code can also be run from a shell command line. Here it is being launched from a different directory, with the directory to be listed passed in as a command-line argument:


C:\...\PP3E\System\Filetools>python lister_walk.py C:\Temp
[C:\Temp]
C:\Temp\about-pp.html
C:\Temp\python1.5.tar.gz
C:\Temp\about-pp2e.html
C:\Temp\about-ppr2e.html
[C:\Temp\newdir]
C:\Temp\newdir\temp1
C:\Temp\newdir\temp2
C:\Temp\newdir\temp3
[C:\Temp\newdir\more]
C:\Temp\newdir\more\xxx.txt
C:\Temp\newdir\more\yyy.txt




The walk paradigm also allows functions to tailor the set of directories visited by changing the file list argument in place. The library manual documents this further, but it's probably more instructive to simply know what walk truly looks like. Here is its actual Python-coded implementation for Windows platforms (at the time of this writing), with comments added to help demystify its operation:


def walk(top, func, arg): # top is the current dirname
    try:
        names = os.listdir(top)           # get all file/dir names here
    except os.error:                      # they have no path prefix
        return
    func(arg, top, names)                 # run func with names list here
    exceptions = ('.', '..')
    for name in names:                    # step over the very same list
        if name not in exceptions:        # but skip self/parent names
            name = join(top, name)        # add path prefix to name
            if isdir(name):
                walk(name, func, arg)     # descend into subdirs here




Notice that walk generates filename lists at each level with os.listdir, a call that collects both file and directory names in no particular order and returns them without their directory paths. Also note that walk uses the very same list returned by os.listdir and passed to the function you provide in order to later descend into subdirectories (variable names). Because lists are mutable objects that can be changed in place, if your function modifies the passed-in filenames list, it will impact what walk does next. For example, deleting directory names will prune traversal branches, and sorting the list will order the walk.


4.3.2.2. The os.walk generator



In recent Python releases, a new directory tree walker has been added which does not require a callback function to be coded. This new call, os.walk, is instead a generator function; when used within a for loop, each time through it yields a tuple containing the current directory name, a list of subdirectories in that directory, and a list of nondirectory files in that directory.


Recall that generators have a .next( ) method implicitly invoked by for loops and other iteration contexts; each call forces the walker to the next directory in the tree. Essentially, os.walk replaces the os.path.walk callback function with a loop body, and so it may be easier to use (though you'll have to judge that for yourself).


For example, suppose you have a directory tree of files and you want to find all Python source files within it that reference the Tkinter GUI module. The traditional way to accomplish this with os.path.walk requires a callback function run at each level of the tree:


>>> import os
>>> def atEachDir(matchlist, dirname, fileshere):
        for filename in fileshere:
            if filename.endswith('.py'):
                pathname = os.path.join(dirname, filename)
                if 'Tkinter' in open(pathname).read( ):
                    matchlist.append(pathname)

>>> matches = []
>>> os.path.walk(r'D:\PP3E', atEachDir, matches)
>>> matches
['D:\\PP3E\\dev\\examples\\PP3E\\Preview\\peoplegui.py', 'D:\\PP3E\\dev\\
examples\\PP3E\\Preview\\tkinter101.py', 'D:\\PP3E\\dev\\examples\\PP3E\\
Preview\\tkinter001.py', 'D:\\PP3E\\dev\\examples\\PP3E\\Preview\\
peoplegui_class.py', 'D:\\PP3E\\dev\\examples\\PP3E\\Preview\\
tkinter102.py', 'D:\\PP3E\\NewExamples\\clock.py', 'D:\\PP3E\\NewExamples
\\calculator.py']




This code loops through all the files at each level, looking for files with .py at the end of their names and which contain the search string. When a match is found, its full name is appended to the results list object, which is passed in as an argument (we could also just build a list of .py files and search each in a for loop after the walk). The equivalent os.walk code is similar, but the callback function's code becomes the body of a for loop, and directory names are filtered out for us:


>>> import os
>>> matches = []
>>> for (dirname, dirshere, fileshere) in os.walk(r'D:\PP3E'):
        for filename in fileshere:
            if filename.endswith('.py'):
                pathname = os.path.join(dirname, filename)
                if 'Tkinter' in open(pathname).read( ):
                    matches.append(pathname)

>>> matches
['D:\\PP3E\\dev\\examples\\PP3E\\Preview\\peoplegui.py', 'D:\\PP3E\\dev\\examples\\
PP3E\\Preview\\tkinter101.py', 'D:\\PP3E\\dev\\examples\\PP3E\\Preview\\
tkinter001.py', 'D:\\PP3E\\dev\\examples\\PP3E\\Preview\\peoplegui_class.py', 'D:\\
PP3E\\dev\\examples\\PP3E\\Preview\\tkinter102.py', 'D:\\PP3E\\NewExamples\\
clock.py', 'D:\\PP3E\\NewExamples\\calculator.py']




If you want to see what's really going on in the os.walk generator, call its next( ) method manually a few times as the for loop does automatically; each time, you advance to the next subdirectory in the tree:


>>> gen = os.walk('D:\PP3E')
>>> gen.next( )
('D:\\PP3E', ['proposal', 'dev', 'NewExamples', 'bkp'], ['prg-python-2.zip'])
>>> gen.next( )
('D:\\PP3E\\proposal', [], ['proposal-programming-python-3e.doc'])
>>> gen.next( )
('D:\\PP3E\\dev', ['examples'], ['ch05.doc', 'ch06.doc', 'ch07.doc', 'ch08.doc',
'ch09.doc', 'ch10.doc', 'ch11.doc', 'ch12.doc', 'ch13.doc', 'ch14.doc', ...more...
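If you try this on Python 3, note that generators there no longer have a next( ) method; the built-in next function does the same job. Here is a self-contained sketch that builds its own two-level scratch tree (the directory and file names, and the function name first_two_steps, are invented for the demonstration) and steps the generator by hand:

```python
import os
import tempfile

def first_two_steps():
    """Build a tiny tree and advance os.walk over it manually, two steps."""
    root = tempfile.mkdtemp()
    os.mkdir(os.path.join(root, 'sub'))
    open(os.path.join(root, 'top.txt'), 'w').close()
    open(os.path.join(root, 'sub', 'inner.txt'), 'w').close()

    gen = os.walk(root)
    first = next(gen)     # Python 3 spelling of gen.next( )
    second = next(gen)    # advances to the only subdirectory
    return root, first, second
```

The first step yields the root with ['sub'] as its subdirectory list and ['top.txt'] as its file list; the second yields the sub directory with no subdirectories and ['inner.txt']. A third call to next would raise StopIteration, which is how a for loop knows to stop.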




The os.walk generator has more features than I will demonstrate here. For instance, additional arguments allow you to specify a top-down or bottom-up traversal of the directory tree, and the list of subdirectories in the yielded tuple can be modified in-place to change the traversal in top-down mode, much as for os.path.walk. See the Python library manual for more details.
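As a rough illustration of the in-place pruning just mentioned, the Python 3 sketch below skips any subdirectory whose name appears in a skip list and visits the rest in sorted order. The function name walk_pruned and the default skip names are assumptions of mine, not anything the library prescribes.

```python
import os

def walk_pruned(root, skip=('.git', '__pycache__')):
    """Yield file paths under root, never descending into directories named in skip."""
    for dirname, subdirs, files in os.walk(root, topdown=True):
        # Slice assignment changes the very list os.walk consults next,
        # so removed names are pruned and the remainder is visited in order.
        subdirs[:] = sorted(name for name in subdirs if name not in skip)
        for fname in sorted(files):
            yield os.path.join(dirname, fname)
```

Rebinding the name instead (subdirs = [...]) would have no effect on the traversal, because os.walk would still hold the original list. Pruning like this works only in top-down mode; in a bottom-up walk the subdirectories have already been visited by the time their parent is yielded.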


So why the new call? Is the new os.walk easier to use than the traditional os.path.walk? Perhaps, if you need to distinguish between subdirectories and files in each directory (os.walk gives us two lists rather than one) or can make use of a bottom-up traversal or other features. Otherwise, it's mostly just the trade of a function for a for loop header. You'll have to judge for yourself whether this is more natural or not; we'll use both forms in this book.


4.3.2.3. Recursive os.listdir traversals



The os.path.walk and os.walk tools do tree traversals for us, but it's sometimes more flexible and hardly any more work to do it ourselves. The following script recodes the directory listing script with a manual recursive traversal function (a function that calls itself to repeat its actions). The mylister function in Example 4-5 is almost the same as lister in Example 4-4 but calls os.listdir to generate file paths manually and calls itself recursively to descend into subdirectories.


Example 4-5. PP3E\System\Filetools\lister_recur.py
# list files in dir tree by recursion


import sys, os

def mylister(currdir):
    print '[' + currdir + ']'
    for file in os.listdir(currdir):                # list files here
        path = os.path.join(currdir, file)          # add dir path back
        if not os.path.isdir(path):
            print path
        else:
            mylister(path)                          # recur into subdirs

if __name__ == '__main__':
    mylister(sys.argv[1])                           # dir name in cmdline





This version is packaged as a script too (this is definitely too much code to type at the interactive prompt); its output is identical when run as a script:


C:\...\PP3E\System\Filetools>python lister_recur.py C:\Temp
[C:\Temp]
C:\Temp\about-pp.html
C:\Temp\python1.5.tar.gz
C:\Temp\about-pp2e.html
C:\Temp\about-ppr2e.html
[C:\Temp\newdir]
C:\Temp\newdir\temp1
C:\Temp\newdir\temp2
C:\Temp\newdir\temp3
[C:\Temp\newdir\more]
C:\Temp\newdir\more\xxx.txt
C:\Temp\newdir\more\yyy.txt




But this file is just as useful when imported and called elsewhere:


C:\temp>python
>>> from PP3E.System.Filetools.lister_recur import mylister
>>> mylister('.')
[.]
.\about-pp.html
.\python1.5.tar.gz
.\about-pp2e.html
.\about-ppr2e.html
[.\newdir]
.\newdir\temp1
.\newdir\temp2
.\newdir\temp3
[.\newdir\more]
.\newdir\more\xxx.txt
.\newdir\more\yyy.txt




We will make better use of most of this section's techniques in later examples in Chapter 7 and in this book at large. For example, scripts for copying and comparing directory trees use the tree-walker techniques listed previously. Watch for these tools in action along the way. If you are interested in directory processing, also see the discussion of Python's old grep module in Chapter 7; it searches files and can be applied to all files in a directory when combined with the glob module, but it simply prints results and does not traverse directory trees by itself.


4.3.3. Rolling Your Own find Module



Another way to go hierarchical is to collect files into a flat list all at once. In the second edition of this book, I included a section on the now-defunct find standard library module, which was used to collect a list of matching filenames in an entire directory tree (much like a Unix find command). Although it returned a flat list, find differed from the single-directory tools described earlier: it returned pathnames of matching files nested in subdirectories all the way to the bottom of a tree.


This module is now gone; the os.walk and os.path.walk tools described earlier are recommended as easier-to-use alternatives. On the other hand, it's not completely clear why the standard find module fell into deprecation; it's a useful tool. In fact, I used it often; it is nice to be able to grab a simple linear list of matching files in a single function call and step through it in a for loop. The alternatives still seem a bit more code-y and tougher for beginners to digest.


Not to worry though, because instead of lamenting the loss of a module, I decided to spend 10 minutes whipping up a custom equivalent. In fact, one of the nice things about Python is that it is usually easy to do by hand what a built-in tool does for you; many built-ins are just conveniences. The module in Example 4-6 uses the standard os.path.walk call described earlier to reimplement a find operation for use in Python scripts.


Example 4-6. PP3E\PyTools\find.py
#!/usr/bin/python
##############################################################################
# custom version of the now deprecated find module in the standard library:
# import as "PyTools.find"; equivalent to the original, but uses os.path.walk,
# has no support for pruning subdirs in the tree, and is instrumented to be
# runnable as a top-level script; uses tuple unpacking in function arguments;
##############################################################################

import fnmatch, os

def find(pattern, startdir=os.curdir):
    matches = []
    os.path.walk(startdir, findvisitor, (matches, pattern))
    matches.sort( )
    return matches

def findvisitor((matches, pattern), thisdir, nameshere):
    for name in nameshere:
        if fnmatch.fnmatch(name, pattern):
            fullpath = os.path.join(thisdir, name)
            matches.append(fullpath)

if __name__ == '__main__':
    import sys
    namepattern, startdir = sys.argv[1], sys.argv[2]
    for name in find(namepattern, startdir): print name





There's not much to this file; but calling its find function provides the same utility as the deprecated find standard module and is noticeably easier than rewriting all of this file's code every time you need to perform a find-type search. Because this file is instrumented to be both a script and a library, it can be run or called.

该文件没什么东西;但是它的find函数所提供的功能,与作废的find标准模块相同,并且当你需要执行find类型的搜索时,比起每次重写该文件的所有代码,使用它的find函数明显更容易。因为此文件既是脚本也是库,所以既可以运行也可以调用。

For instance, to process every Python file in the directory tree rooted in the current working directory, I simply run the following command line from a system console window. I'm piping the script's standard output into the more command to page it here, but it can be piped into any processing program that reads its input from the standard input stream:

例如,处理当前工作目录为根的目录树下的每个Python文件,我只需在系统控制台窗口中运行以下命令行。这里我把脚本的标准输出管道到more命令进行分页,但它也可以管道到任何读取标准输入流的处理程序:

python find.py *.py . | more



For more control, run the following sort of Python code from a script or interactive prompt (you can also pass in an explicit start directory if you prefer). In this mode, you can apply any operation to the found files that the Python language provides:

为了实施更多控制,可运行以下这类脚本,在脚本中也行,在交互提示符下也行(如果你喜欢,你也可以传入一个明确的开始目录)。在这种模式下,您可以对找到的文件应用任何Python语言所提供的操作:

from PP3E.PyTools import find
for name in find.find('*.py'):
    ...do something with name...




Notice how this avoids the nested loop structure you wind up coding with os.walk and the callback functions you implement for os.path.walk (see the earlier examples), making it seem conceptually simpler. Its only obvious downside is that your script must wait until all matching files have been found and collected; os.walk yields results as it goes, and os.path.walk calls your function along the way.

请注意,这样做避免了用os.walk编码时的嵌套循环结构,也避免了为os.path.walk实现的回调函数(见前面的例子),概念上更简单。它唯一明显的缺点是,你的脚本必须等待所有匹配的文件被找到和收集; 而os.walk会边执行边产生结果,而os.path.walk会沿途调用你的函数。
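
The waiting cost mentioned above goes away if the search is written as a generator. The following is a minimal sketch, assuming Python 3 (where os.path.walk no longer exists) rather than the Python 2 used in this section's listings; the name ifind is hypothetical and is not part of the book's examples.

```python
import fnmatch
import os

def ifind(pattern, startdir=os.curdir):
    """Yield matching paths one at a time while the tree is still being walked.

    ifind is a hypothetical name used only for this sketch.
    """
    for thisdir, subdirs, files in os.walk(startdir):
        # test directory names as well as file names, as the
        # os.path.walk visitor in find.py does
        for name in subdirs + files:
            if fnmatch.fnmatch(name, pattern):
                yield os.path.join(thisdir, name)
```

A caller can still loop with a single for statement, for example `for name in ifind('*.py'): print(name)`, but results arrive as they are found and are not sorted.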

Here's a more concrete example of our find module at work: the following system command line lists all Python files in directory D:\PP3E whose names begin with the letter c or t (it's being run in the same directory as the find.py file). Notice that find returns full directory paths that begin with the start directory specification.

下面是我们的find模块更具体的应用例子:以下系统命令行列出目录D:\PP3E下的所有Python文件,其文件名以字母c或t开始(它运行于find.py文件所在目录)。请注意,find返回完整的目录路径,会以指定的开始目录开头。

C:\Python24>python find.py [ct]*.py D:\PP3E
D:\PP3E\NewExamples\calculator.py
D:\PP3E\NewExamples\clock.py
D:\PP3E\NewExamples\commas.py
D:\PP3E\dev\examples\PP3E\Preview\tkinter001.py
D:\PP3E\dev\examples\PP3E\Preview\tkinter101.py
D:\PP3E\dev\examples\PP3E\Preview\tkinter102.py



And here's some Python code that does the same find but also extracts base names and file sizes for each file found:

以下的一些Python代码做了同样的find,但是同时对找到的每个文件提取了基本名字和文件大小:

>>> import os
>>> from find import find
>>> for name in find('[ct]*.py', r'D:\PP3E'):
...     print os.path.basename(name), '=>', os.path.getsize(name)
...
calculator.py => 14101
clock.py => 11000
commas.py => 2508
tkinter001.py => 62
tkinter101.py => 235
tkinter102.py => 421




As a more useful example, I use the following simple script to clean out any old output text files located anywhere in the book examples tree. I usually run this script from the example's root directory. I don't really need the full path to the find module in the import here because it is in the same directory as this script itself; if I ever move this script, though, the full path will be required:

下面是个更为有用的例子,我用以下的简单脚本来清除书中examples目录树下,所有旧的输出文本文件。我通常在示例的根目录下运行此脚本。在这里的导入中,我其实并不需要find模块的完整路径,因为find模块和该脚本本身是在同一目录;但如果我一旦移动这个脚本,就需要完整的路径:

C:\...\PP3E>type PyTools\cleanoutput.py
import os                                  # delete old output files in tree
from PP3E.PyTools.find import find         # only need full path if I'm moved
for filename in find('*.out.txt'):
    print filename
    if raw_input('View?') == 'y':
        print open(filename).read()
    if raw_input('Delete?') == 'y':
        os.remove(filename)


C:\temp\examples>python %X%\PyTools\cleanoutput.py
.\Internet\Cgi-Web\Basics\languages.out.txt
View?
Delete?
.\Internet\Cgi-Web\PyErrata\AdminTools\dbaseindexed.out.txt
View?
Delete?y




To achieve such code economy, the custom find module calls os.path.walk to register a function to be called per directory in the tree and simply adds matching filenames to the result list along the way.

为了经济地完成这样的代码,自定义find模块调用了os.path.walk来注册一个函数,树上的每个目录都要调用该函数,而该函数只是沿途将匹配的文件名添加到结果列表。
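
For readers on Python 3, where os.path.walk is no longer available, the same idea can be expressed with os.walk. This is only a sketch under that assumption, not the book's own code; it keeps the find name and arguments of Example 4-6 and, like the original visitor, matches directory names as well as file names.

```python
import fnmatch
import os

def find(pattern, startdir=os.curdir):
    """Return a sorted list of paths under startdir whose base names match pattern."""
    matches = []
    for thisdir, subdirs, files in os.walk(startdir):
        # os.walk hands back the names in each directory, so no
        # callback function needs to be registered
        for name in subdirs + files:
            if fnmatch.fnmatch(name, pattern):
                matches.append(os.path.join(thisdir, name))
    matches.sort()
    return matches
```

The nested loop replaces the callback, but callers still see one function call that returns a flat list.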

New here, though, is the fnmatch module, yet another Python standard library module that performs Unix-like pattern matching against filenames. This module supports common operators in name pattern strings: * (to match any number of characters), ? (to match any single character), and [...] and [!...] (to match any character inside the bracket pairs, or not); other characters match themselves.[*] If you haven't already noticed, the standard library is a fairly amazing collection of tools.

不过fnmatch模块是新的内容:它是另一个Python标准库模块,对文件名执行Unix的模式匹配。该模块支持名字模式串中的普通操作:*(匹配任意多个字符)、?(匹配任意单个字符),及[...]和[!...](匹配方括号内的任意单个字符,或不匹配);其他字符匹配它们自己[*]。不知您有没有注意到,标准库是个相当惊人的工具集合。

[*] Unlike the re module, fnmatch supports only common Unix shell matching operators, not full-blown regular expression patterns; to understand why this matters, see Chapter 18 for more details.

[*] 与re模块不同的是,fnmatch仅支持普通的Unix shell匹配操作,而不是全面的正则表达式模式;想要理解有什么区别,请详见第18章。
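
The pattern operators listed above can be tried without touching the filesystem. The short sketch below assumes Python 3 and uses a made-up list of names; it only calls functions that the fnmatch and re modules provide.

```python
import fnmatch
import re

# sample names, invented for this illustration
names = ['calculator.py', 'clock.py', 'tkinter001.py', 'spam.py', 'clock.txt']

# [ct]*.py: first character is c or t, then anything, then ".py"
ct_files = fnmatch.filter(names, '[ct]*.py')

# [!ct]*.py: first character is anything except c or t
other_files = fnmatch.filter(names, '[!ct]*.py')

# ? stands for exactly one character
one_char = fnmatch.fnmatchcase('a.py', '?.py')
two_chars = fnmatch.fnmatchcase('ab.py', '?.py')

# translate() converts a shell-style pattern into a regular expression
# string, which the re module can then use
regex = fnmatch.translate('*.py')
matches_regex = re.match(regex, 'find.py') is not None
```

Here ct_files holds the three names that start with c or t and end in .py, other_files holds only spam.py, and one_char is true while two_chars is false.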

Incidentally, find.find is also roughly equivalent to platform-specific shell commands such as find -print on Unix and Linux, and dir /B /S on DOS and Windows. Since we can usually run such shell commands in a Python script with os.popen, the following does the same work as find.find but is inherently nonportable and must start up a separate program along the way:

顺便说一句,find.find也与Unix和Linux上的find -print、DOS和Windows上的dir /B /S这些平台专用的shell命令大致等效。由于我们通常可以在Python脚本中用os.popen运行这样的shell命令,以下代码做了与find.find相同的工作,但其本质上是不可移植的,并且必须沿途启动独立的程序:

>>> import os
>>> for line in os.popen('dir /B /S').readlines( ): print line,
...
C:\temp\about-pp.html
C:\temp\about-pp2e.html
C:\temp\about-ppr2e.html
C:\temp\newdir
C:\temp\newdir\temp1
C:\temp\newdir\temp2
C:\temp\newdir\more
C:\temp\newdir\more\xxx.txt




The equivalent Python metaphors, however, work unchanged across platforms, one of the implicit benefits of writing system utilities in Python:

但是等效的Python隐喻却可以不加修改地跨平台运行:这就是用Python编写系统工具隐含的好处之一:

C:\...> python find.py * .

>>> from find import find
>>> for name in find(pattern='*', startdir='.'): print name



Finally, if you come across older Python code that fails because there is no standard library find to be found, simply change find-module imports in the source code to, say:

最后,如果您遇到较老的Python代码因为找不到标准库find而失败,只需简单地将源码中的find模块导入语句改为:

from PP3E.PyTools import find



rather than:

代替:

import find



The former form will find the custom find module in the book's example package directory tree. And if you are willing to add the PP3E\PyTools directory to your PYTHONPATH setting, all original import find statements will continue to work unchanged.

前者的形式会找到自定义的find模块,它位于本书的example包目录树。如果您愿意将PP3E\PyTools目录加入到您的PYTHONPATH设置中,则原来所有的import find语句可以保持不变。

Better still, do nothing at all: most find-based examples in this book automatically pick the alternative by catching import exceptions just in case they are run on a more modern Python and their top-level files aren't located in the PyTools directory:

更好的是什么也不做:本书大多数基于find的例子会自动选择替代方法,如果它们运行于一个更现代的Python,并且它们的顶层文件不在PyTools目录中,它们会捕获导入异常,从而作出选择:

try:
    import find
except ImportError:
    from PP3E.PyTools import find




The find module may be gone, but it need not be forgotten.

find模块可以消失,但它不应该被忘记。

Python Versus csh

Python与csh

If you are familiar with other common shell script languages, it might be useful to see how Python compares. Here is a simple script in a Unix shell language called csh that mails all the files in the current working directory with a suffix of .py (i.e., all Python source files) to a hopefully fictitious address:

如果你熟悉其他常见的shell脚本语言,看看它们与Python的比较可能是有益的。这里是个简单脚本,是用被称为csh的Unix shell语言写的,它会将当前工作目录中的所有以.py为后缀的文件(即所有的Python源文件),邮寄到一个地址,希望该地址不是真的:

#!/bin/csh
foreach x (*.py)
    echo $x
    mail eric@halfabee.com -s $x < $x
end



The equivalent Python script looks similar:

等效的Python脚本类似于:

#!/usr/bin/python
import os, glob
for x in glob.glob('*.py'):
    print x
    os.system('mail eric@halfabee.com -s %s < %s' % (x, x))



but is slightly more verbose. Since Python, unlike csh, isn't meant just for shell scripts, system interfaces must be imported and called explicitly. And since Python isn't just a string-processing language, character strings must be enclosed in quotes, as in C.

但稍微冗长。因为Python与csh不同,它不只是用于shell脚本,其系统接口必须显式地导入并调用。而且由于Python不仅仅是个字符串处理语言,字符串必须放在引号内,就像C语言。

Although this can add a few extra keystrokes in simple scripts like this, being a general-purpose language makes Python a better tool once we leave the realm of trivial programs. We could, for example, extend the preceding script to do things like transfer files by FTP, pop up a GUI message selector and status bar, fetch messages from an SQL database, and employ COM objects on Windows, all using standard Python tools.

虽然这会在简单脚本中增加这样一些额外的按键,但是,一旦我们离开简单程序的领域,Python作为一个通用的语言,将成为一个更好的工具。例如,我们可以使用标准的Python工具,来扩展前面的脚本,让它做些像通过FTP传文件、弹出一个GUI消息选择器和状态栏、从SQL数据库获取信息,和使用Windows的COM对象这样的事情。

Python scripts also tend to be more portable to other platforms than csh. For instance, if we used the Python SMTP interface to send mail instead of relying on a Unix command-line mail tool, the script would run on any machine with Python and an Internet link (as we'll see in Chapter 14, SMTP only requires sockets). And like C, we don't need $ to evaluate variables; what else would you expect in a free language?

比起csh,Python脚本也更容易移植到其他平台。例如,如果我们使用了Python的SMTP接口发送邮件,而不是依赖于Unix命令行工具mail,脚本将可运行于任何带Python和Internet连接的机器上(在第14章我们将看到SMTP只需要套接口)。就像C语言,我们不需要用$对变量求值;对于一个免费的语言,您还有什么其他期望呢?
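
To make the portability point concrete, here is a rough sketch of the same mailer written against the standard smtplib module instead of the Unix mail command. It assumes Python 3; the function names build_message and mail_files, the sender address argument, and the default 'localhost' server are all hypothetical choices for illustration, and a reachable SMTP server is needed before anything is actually sent.

```python
import glob
import smtplib
from email.message import EmailMessage

def build_message(path, to_addr, from_addr):
    """Wrap one file's text in a mail message whose subject is the filename."""
    msg = EmailMessage()
    msg['Subject'] = path
    msg['From'] = from_addr
    msg['To'] = to_addr
    with open(path, encoding='utf-8', errors='replace') as f:
        msg.set_content(f.read())
    return msg

def mail_files(pattern, to_addr, from_addr, host='localhost'):
    """Mail every file matching pattern; host is an assumed SMTP server name."""
    with smtplib.SMTP(host) as server:
        for path in glob.glob(pattern):
            print(path)
            server.send_message(build_message(path, to_addr, from_addr))
```

A call such as `mail_files('*.py', 'eric@halfabee.com', 'me@example.com')` would then play the role of the csh loop, with the second address standing in for your own.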



