Parsing a Tree-Structured File with Python

Date: 2019-11-19


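Background:
This is an optimized version of the algorithm from the earlier blog post "Python 解析树状结构文件" (Parsing a Tree-Structured File with Python).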

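Core idea:

Create a list to store parent-node information. Whenever a line starting with Tab + name is read, store that line's parent-node information in prefixList[number of Tabs]; that is, prefixList[i] holds the parent node whose line has i leading Tabs.

When a line starting with Tab + ptr is read, a leaf node has been reached. Its parent path (the prefix) must be prefixList[0] + ... + prefixList[number of Tabs - 1], so the final result is the prefix plus the current leaf's information.

When a line starting with Tab + name is read again, one of the parent nodes has changed for the leaves that follow. We simply overwrite prefixList[number of Tabs], because no later node needs its previous value.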
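
The idea above can be sketched as a small self-contained function. This is an illustrative sketch only, not the author's code: the function name `flatten_tree`, the `max_depth` parameter, the regular expressions used to extract the bracketed values, and the shortened sample input are all assumptions made for the example.

```python
import re

def flatten_tree(lines, max_depth=30):
    """Turn tab-indented name/ptr lines into dotted paths (hypothetical helper)."""
    # prefix_list[i] holds the parent name last seen at a depth of i Tabs.
    # Assumption: the tree is never deeper than max_depth levels.
    prefix_list = [""] * max_depth
    result = []
    for line in lines:
        depth = len(line) - len(line.lstrip("\t"))  # number of leading Tabs
        text = line.strip()
        if text.startswith("name"):
            # A parent node: overwrite the slot for this depth.
            match = re.search(r"\[(.*?)\]", text)
            if match:
                prefix_list[depth] = match.group(1)
        elif text.startswith("ptr"):
            # A leaf: its parents are the slots at depths 0 .. depth - 1.
            match = re.match(r"ptr->(.*?)-->\[(.*?)\]", text)
            if match:
                path = ".".join(prefix_list[:depth] + [match.group(1)])
                result.append(path + "[" + match.group(2) + "]")
    return result


# A shortened, made-up sample in the same shape as the trace file.
sample = [
    "name: [1]",
    "{",
    "\tname:[11]",
    "\t{",
    "\t\tptr->111-->[value0]",
    "\t}",
    "\tname:[12]",
    "\t{",
    "\t\tptr->121-->[value1]",
    "\t}",
    "}",
]
print(flatten_tree(sample))
```

Note that when `name:[12]` is read, it simply replaces `11` in slot 1; nothing has to be popped or re-read, which is the point of the approach.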

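Implementation:

To simulate a debug trace, create a text file 1.txt with the following content (nesting depth is given by the number of leading Tabs):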

service[hi]
name: [1]
{
	name:[11]
	{
		name: [111]
		{
			ptr->1111-->[value0]
			ptr->1112-->[value1]
		}
		name: [112]
		{
			name: [1121]
			{
				ptr->111211-->[value2]
			}

		}
	}
	name:[12]
	{
		ptr->121-->[value3]
	}
	name:[13]
	{
		ptr->131-->[value4]
	}
}
service[Jeff]
name: [1]
{
	name:[11]
	{
		name: [111]
		{
			ptr->1111-->[value0]
			ptr->1112-->[value1]
		}
		name: [112]
		{
			name: [1121]
			{
				ptr->111211-->[value2]
			}

		}
	}
	name:[12]
	{
		ptr->121-->[value3]
	}
	name:[13]
	{
		ptr->131-->[value4]
	}
}

The parsing program is as follows:

1. common.py

'''
Created on 2012-5-28

@author: Jeff_Yu
'''

def getValue(string, key1, key2):
    """
    get the value between key1 and key2 in string
    """
    index1 = string.find(key1)
    if index1 == -1:
        return ""
    # skip the whole of key1 (it may be longer than one character)
    # and look for key2 only after it
    start = index1 + len(key1)
    index2 = string.find(key2, start)
    if index2 == -1:
        return ""

    value = string[start:index2]
    return value

def getFiledNum(string, key, begin):
    """
    get the number of key in string from begin position
    """
    keyNum = 0
    start = begin

    while True:
        index = string.find(key, start)
        if index == -1:
            break

        keyNum = keyNum + 1
        # continue searching just past the occurrence that was found
        start = index + 1

    return keyNum
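
Because getFiledNum(data, '\t', 0) counts every Tab in the line, a Tab inside a value would inflate the computed depth. One possible alternative, shown here only as a sketch, counts just the leading Tabs. The helper name `count_leading_tabs` and the sample lines are hypothetical and not part of the original common.py.

```python
def count_leading_tabs(line):
    """Return how many Tab characters the line starts with (hypothetical helper)."""
    return len(line) - len(line.lstrip("\t"))


print(count_leading_tabs("\t\tname: [111]\n"))    # 2
print(count_leading_tabs("\tptr->a-->[x\ty]\n"))  # 1: the Tab inside the value is ignored
print("\tptr->a-->[x\ty]\n".count("\t"))          # 2: counting all Tabs in the line
```

For the sample file 1.txt both approaches give the same depth, since Tabs only appear as indentation there.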

2. main.py

'''
Created on 2012-6-1

@author: Jeff_Yu
'''

import common

fileNameRead = "1.txt"
fileNameWrite = "Result_" + fileNameRead
writeList = []
# prefixList[i] holds the parent node name found at a depth of i Tabs;
# the first name always starts with 0 Tabs
prefixList = [""] * 30

with open(fileNameRead, 'r') as fr:
    for data in fr:
        # find the service name
        if data.startswith("service"):
            # reset the parent prefixes for each service
            prefixList = [""] * 30
            writeList.append(data.rstrip('\n') + '\n')
            continue

        # find name: remember it as the parent at this depth
        if "name" in data:
            tabNumOfData = common.getFiledNum(data, '\t', 0)
            value = common.getValue(data, '[', ']')
            prefixList[tabNumOfData] = value + "."

        # find ptr: a leaf whose parents sit at depths 0 .. tabNumOfLeaf - 1
        if "ptr" in data:
            tabNumOfLeaf = common.getFiledNum(data, '\t', 0)
            valueOfLeaf = common.getValue(data, '[', ']')
            nameOfLeaf = common.getValue(data, '>', '-->')
            leafPartString = nameOfLeaf + "[" + valueOfLeaf + "]"

            finalString = "".join(prefixList[:tabNumOfLeaf]) + leafPartString

            # append line to writeList
            writeList.append(finalString + "\n")

# write writeList to the result file in one call
with open(fileNameWrite, 'w') as fw:
    fw.writelines(writeList)

Parsing result, Result_1.txt:

service[hi]
1.11.111.1111[value0]
1.11.111.1112[value1]
1.11.112.1121.111211[value2]
1.12.121[value3]
1.13.131[value4]
service[Jeff]
1.11.111.1111[value0]
1.11.111.1112[value1]
1.11.112.1121.111211[value2]
1.12.121[value3]
1.13.131[value4]

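The real trace file is more complicated than this one. Because it involves company information, the actual implementation is not posted here, but the core idea is the same as above.

This version is far more efficient: parsing a 5 MB file used to take over two minutes and now takes only about one second.

This version makes the following optimizations:

1. String concatenation is replaced with the form all = '%s%s%s%s' % (str0, str1, str2, str3).

2. The content to be written is collected in a list and written in one go at the end with f.writelines(list).

3. The algorithm reduces the number of times the file is read: useful information is saved as soon as it is read, so the file never has to be re-read.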
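
The first two points can be illustrated with a short sketch. It uses io.StringIO as a stand-in for the result file so that it runs without touching the disk; the variable names and sample values are assumptions made for the example, not taken from the real trace parser.

```python
import io

parents = ["1.", "11.", "111."]   # sample parent prefixes (illustrative values)
leaf = "1111[value0]"

# Point 1: build the line in one formatting step instead of repeated "+".
line = "%s%s%s%s" % (parents[0], parents[1], parents[2], leaf)
# "".join produces the same string for any number of parts.
assert line == "".join(parents + [leaf])

# Point 2: collect the output lines in a list and write them with one call.
write_list = ["service[hi]\n", line + "\n"]
out = io.StringIO()               # stands in for open("Result_1.txt", "w")
out.writelines(write_list)
print(out.getvalue())
```

Note that writelines() does not add line separators itself, which is why each entry in the list already ends with "\n".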