Convert Chinese numerals in file names to Arabic numerals.

Background

I obtained a penetration testing tutorial, written by the expert Micropoor, from a public account. The original source (requires a VPN) is: https://micropoor.blogspot.com/2019/01/php8.html

(Updated on 2019-02-23) Micropoor also publishes the tutorial on GitHub, which is the most complete source: https://github.com/Micropoor/Micro8. You can use the script in this article to rename those files while keeping them in order. Soapffz has mirrored the project on Gitee; see the article "git clone too slow? Use Gitee as a proxy for elegant acceleration (the attempt failed, so it became a filler post)".

You can download the mirror directly with:

git clone https://gitee.com/soapffz/Micro8.git

Micropoor also posts updates on the Kanxue forum: https://www.kanxue.com/book-section_list-38.htm

However, the lesson number in each file name is written as a Chinese numeral in the form "第xx课" (Lesson xx), so the files do not sort in lesson order. We are going to batch rename them like this:

For example, rename "msf 的前生今世（第三十一课）.pdf" to "31-msf 的前生今世.pdf"

Code Implementation

Extract Chinese Numerals for Sorting

First, we need to extract the Chinese numerals for the lesson number using the re regular expression library:

import re
cc = 'msf的前生今世（第三十一课）.pdf'
r1 = re.compile(u'[一二三四五六七八九十]{1,}')
print(r1.findall(cc))

This prints ['三十一'], so the Chinese numeral of the lesson number has been extracted.
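A title can itself contain Chinese numerals (for example "第四季"), in which case findall returns more than one run. The sketch below assumes the lesson number is always the last run in the name, which holds for the parenthesized "（第xx课）" suffix. The helper name last_numeral and the second file name are hypothetical, made up for illustration.

```python
import re

# Same character class as above: digits 一..九 plus 十
NUMERAL_RUN = re.compile(r'[一二三四五六七八九十]+')

def last_numeral(filename):
    """Return the last run of Chinese numerals in filename, or None if there is none."""
    runs = NUMERAL_RUN.findall(filename)
    return runs[-1] if runs else None

print(last_numeral('msf的前生今世（第三十一课）.pdf'))                 # 三十一
print(last_numeral('基于白名单执行payload第四季（第七十四课）.pdf'))  # 七十四, not 四
print(last_numeral('readme.pdf'))                                      # None
```

Returning None instead of raising lets the caller skip files that carry no lesson number.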

Extract Lesson Names

Reference link: https://www.cnblogs.com/lzhc/p/8744299.html

Next we need the lesson name without the lesson number, which means removing the full-width parentheses and the text inside them:

import re
cc = 'msf 的前生今世（第三十一课）.pdf'
# Non-greedy match: remove each pair of full-width parentheses and its contents
a = re.sub(r"（.*?）", "", cc)
print(a)

Output: msf 的前生今世.pdf
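One caveat with the pattern above is that it removes every parenthesized part, including one that belongs to the title. The sketch below compares it with a narrower pattern that only removes a "（第xx课）" tag. Both function names and the sample file name with "（上）" are hypothetical, chosen only to show the difference.

```python
import re

def strip_all_parentheses(filename):
    """Remove every full-width parenthesized part (the behavior shown above)."""
    return re.sub(r'（.*?）', '', filename)

def strip_lesson_tag(filename):
    """Remove only a tag of the form （第<numerals>课）, keeping other parentheses."""
    return re.sub(r'（第[一二三四五六七八九十]+课）', '', filename)

name = '提权（上）实战（第三课）.pdf'  # hypothetical file name
print(strip_all_parentheses(name))  # 提权实战.pdf
print(strip_lesson_tag(name))       # 提权（上）实战.pdf
```

If none of your titles contain parentheses of their own, the two patterns give the same result.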

Convert Chinese Numerals to Arabic Numerals

Reference article: https://segmentfault.com/a/1190000013048884

Consider the following points:

1. Zero ("零") is only a placeholder in a Chinese numeral string and can be ignored during processing.
2. "一十" is usually shortened to "十", so a "十" with no digit before it is read as "一十".
3. The number before each of the units "千" (thousand), "百" (hundred), and "十" (ten) is a single digit.
4. The number before the unit "万" (ten thousand) is built according to point 3.
5. The number before the unit "亿" (hundred million) is built according to points 3 and 4, and may itself contain "亿".
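Points 4 and 5 amount to cutting the string at the last "亿" and then at the last "万" that follows it. The sketch below shows only that cutting step, using str.rpartition. The function name split_sections is hypothetical, and unlike the full conversion it leaves "零" in place.

```python
def split_sections(chn):
    """Split a numeral into (part before 亿, part before 万, remainder) substrings."""
    # rpartition returns ('', '', chn) when the separator is absent
    high, _, rest = chn.rpartition('亿')
    mid, _, low = rest.rpartition('万')
    return high, mid, low

print(split_sections('一亿九千三百五十二万六千七百五十四'))
# ('一', '九千三百五十二', '六千七百五十四')
print(split_sections('一百万亿零七'))
# ('一百万', '', '零七')  the 万 before 亿 stays inside the first part
```

The first part can still contain "万" or "亿", which is why it has to be converted recursively, while the other two parts are plain numbers below ten thousand.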

Code implementation:

digit = {'一': 1, '二': 2, '三': 3, '四': 4, '五': 5, '六': 6, '七': 7, '八': 8, '九': 9}

def _trans(s):
    num = 0
    if s:
        idx_q, idx_b, idx_s = s.find('千'), s.find('百'), s.find('十')
        if idx_q != -1:
            num += digit[s[idx_q - 1:idx_q]] * 1000
        if idx_b != -1:
            num += digit[s[idx_b - 1:idx_b]] * 100
        if idx_s != -1:  # A missing digit before "十" counts as "一"
            num += digit.get(s[idx_s - 1:idx_s], 1) * 10
        if s[-1] in digit:
            num += digit[s[-1]]
    return num

def trans(chn):
    chn = chn.replace('零', '')
    idx_y, idx_w = chn.rfind('亿'), chn.rfind('万')
    if idx_w < idx_y:
        idx_w = -1
    num_y, num_w = 100000000, 10000
    if idx_y != -1 and idx_w != -1:
        return trans(chn[:idx_y]) * num_y + _trans(chn[idx_y + 1:idx_w]) * num_w + _trans(chn[idx_w + 1:])
    elif idx_y != -1:
        return trans(chn[:idx_y]) * num_y + _trans(chn[idx_y + 1:])
    elif idx_w != -1:
        return _trans(chn[:idx_w]) * num_w + _trans(chn[idx_w + 1:])
    return _trans(chn)

Testing:

print(trans('十'))  # 10
print(trans('六百五十四'))  # 654
print(trans('五十六万零一十'))  # 560010
print(trans('一亿九千三百五十二万六千七百五十四'))  # 193526754
print(trans('一百万亿零七'))  # 100000000000007
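To cross-check these values independently, one option is a second converter built on a different technique. The sketch below is not the article's trans(); it is a single left-to-right pass that keeps three running values, and the names cn_to_int, DIGITS and SMALL_UNITS are made up for illustration. It assumes well-formed input.

```python
DIGITS = {'零': 0, '一': 1, '二': 2, '三': 3, '四': 4,
          '五': 5, '六': 6, '七': 7, '八': 8, '九': 9}
SMALL_UNITS = {'十': 10, '百': 100, '千': 1000}

def cn_to_int(text):
    """Single-pass conversion of a well-formed Chinese numeral (cross-check only)."""
    total = 0    # value already closed off by 万 or 亿
    section = 0  # value built from 千/百/十 since the last 万 or 亿
    pending = 0  # most recent digit not yet attached to a unit
    for ch in text:
        if ch in DIGITS:
            pending = DIGITS[ch]
        elif ch in SMALL_UNITS:
            # A unit with no digit before it (a bare 十) counts as one of that unit
            section += (pending or 1) * SMALL_UNITS[ch]
            pending = 0
        elif ch == '万':
            total += (section + pending) * 10000
            section = pending = 0
        elif ch == '亿':
            # 亿 scales everything read so far, including an earlier 万 section
            total = (total + section + pending) * 100000000
            section = pending = 0
        else:
            raise ValueError(f"unexpected character {ch!r}")
    return total + section + pending

for sample in ['十', '六百五十四', '五十六万零一十',
               '一亿九千三百五十二万六千七百五十四', '一百万亿零七']:
    print(sample, cn_to_int(sample))
```

If both converters agree on every sample, a typo in either one is less likely to go unnoticed.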

Code Summary

Reference article: https://www.cnblogs.com/kba977/p/3533367.html

#!/usr/bin/python
# -*- coding:utf-8 -*-

'''
@author: soapffz
@function: Convert Chinese Numerals in File Names to Arabic Numerals
@Description: For example, rename "msf 的前生今世（第三十一课）.pdf" to "31-msf 的前生今世.pdf"
@time: 2019-02-06
'''

import os
import re

digit = {'一': 1, '二': 2, '三': 3, '四': 4,
         '五': 5, '六': 6, '七': 7, '八': 8, '九': 9}

def _trans(s):
    num = 0
    if s:
        idx_q, idx_b, idx_s = s.find('千'), s.find('百'), s.find('十')
        if idx_q != -1:
            num += digit[s[idx_q - 1:idx_q]] * 1000
        if idx_b != -1:
            num += digit[s[idx_b - 1:idx_b]] * 100
        if idx_s != -1:
            num += digit.get(s[idx_s - 1:idx_s], 1) * 10
        if s[-1] in digit:
            num += digit[s[-1]]
    return num

def trans(chn):
    chn = chn.replace('零', '')
    idx_y, idx_w = chn.rfind('亿'), chn.rfind('万')
    if idx_w < idx_y:
        idx_w = -1
    num_y, num_w = 100000000, 10000
    if idx_y != -1 and idx_w != -1:
        return trans(chn[:idx_y]) * num_y + _trans(chn[idx_y + 1:idx_w]) * num_w + _trans(chn[idx_w + 1:])
    elif idx_y != -1:
        return trans(chn[:idx_y]) * num_y + _trans(chn[idx_y + 1:])
    elif idx_w != -1:
        return _trans(chn[:idx_w]) * num_w + _trans(chn[idx_w + 1:])
    return _trans(chn)

if __name__ == "__main__":
    for filename in os.listdir("."):
        # splitext copes with extra dots or no dot at all, unlike filename.split(".")
        ext = os.path.splitext(filename)[1]
        if ext == '.pdf':
            # Remove the full-width parentheses and the lesson number inside them
            name = re.sub(r"（.*?）", "", filename)
            # The lesson number is the last run of Chinese numerals in the file name
            cn_number = re.findall(r'[一二三四五六七八九十]+', filename)[-1:]
            if cn_number:  # Skip files with no Chinese numeral, e.g. ones renamed by hand
                number = trans(cn_number[0])
                new_filename = str(number) + "-" + name
                os.rename(filename, new_filename)

After running the script, a file such as "msf 的前生今世（第三十一课）.pdf" is renamed to "31-msf 的前生今世.pdf".

However, if you download the files directly from Micropoor's website or lsh4ck's website, the file names will be like this:

第一百课：HTTP 隧道 reDuh 第四季.pdf
第七十四课：基于白名单 regsvcs.exe 执行 payload 第四季.pdf

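Testing shows that the code above does not handle lesson numbers of one hundred and above, because the character class in its regular expression does not include "百". These file names also put the lesson number before a full-width colon instead of inside parentheses. Modify the main block as follows: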

if __name__ == "__main__":
    # Matches stems such as "第一百课：HTTP 隧道 reDuh 第四季"
    pattern = re.compile(r'第([一二三四五六七八九十零百千万亿]+)课：(.*)')
    for filename in os.listdir("."):
        portion = os.path.splitext(filename)  # (stem, extension)
        if portion[1] == '.pdf':
            match = pattern.match(portion[0])
            if match:  # Skip files not named "第…课：…", e.g. ones already renamed
                number = trans(match.group(1))  # also works for single-character numbers such as "一"
                new_filename = str(number) + "-" + match.group(2) + ".pdf"
                os.rename(filename, new_filename)

You can compare the two versions; only the main block has changed. With it, the two files above are renamed to:

100-HTTP 隧道 reDuh 第四季.pdf
74-基于白名单 regsvcs.exe 执行 payload 第四季.pdf

Do not run the script in a directory that mixes already renamed files with files that still have their original names.
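One way to make a mixed directory less risky is to compute the renames first and only print them. The sketch below assumes that an already renamed file starts with digits followed by "-", as in "31-msf 的前生今世.pdf". The names plan_renames, ALREADY_RENAMED and demo_new_name are hypothetical, and demo_new_name is a stand-in for the real conversion logic.

```python
import re

# Assumed shape of a file that has already been renamed, e.g. "74-xxx.pdf"
ALREADY_RENAMED = re.compile(r'^\d+-')

def plan_renames(filenames, make_new_name):
    """Return (old, new) pairs without touching the disk.

    make_new_name is a caller-supplied function that returns the new name,
    or None when the file should be left alone.
    """
    plan = []
    taken = set(filenames)
    for old in filenames:
        if ALREADY_RENAMED.match(old):
            continue  # renamed on an earlier run
        new = make_new_name(old)
        if new is None or new == old:
            continue
        if new in taken:
            raise ValueError(f"{new!r} already exists; refusing to overwrite")
        taken.add(new)
        plan.append((old, new))
    return plan

def demo_new_name(name):
    """Stand-in converter for the demonstration only."""
    return {'第七十四课：a.pdf': '74-a.pdf'}.get(name)

for old, new in plan_renames(['74-b.pdf', '第七十四课：a.pdf', 'notes.txt'], demo_new_name):
    print(old, '->', new)  # 第七十四课：a.pdf -> 74-a.pdf
```

In real use you would pass os.listdir(".") as the file list, review the printed plan, and only then call os.rename on each pair.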