Python script gets killed only when analyzing a large dataset.

I wrote a Python script to analyze a large dataset of chess games. I've been testing it with a small dataset, about 1 MB, and it works without a problem. Running the same code on a larger dataset (about 10 GB), it runs for a few minutes but always crashes around the same point (though never at the exact same moment). It seems like the system is killing the process once it exceeds some resource threshold. I am on a brand new M2 MacBook Pro with 16 GB of RAM. I've tried restarting my computer and I've reinstalled Python from several different sources, all to no avail.

When it crashes, the shell prints nothing more than "zsh: killed". Here is the Python script in question:

print('opening file')
games = open("games.txt", "r")
gamesArray = []
print("reading file")
for textLine in games:
    gamesArray.append(textLine)
posArray = []
print("dissecting games")
ct=0
for game in gamesArray:
    ct+=1
    if (ct % 10000 == 0):
        print("dissecting games: " + str(ct / len(gamesArray) * 100) + "%")
    positions = game.split('.')
    pos = ''
    for state in positions:
        pos+=state
        if (len(pos) > 5):
            posArray.append(pos)
        

print("converting positions to set")
posArraySet = set(posArray)
print("removing duplicates")
posArrayUnique = (list(posArraySet))
posFreqArray = []

print("counting duplicates")
ct = 0
for i in posArrayUnique:
    ct+=1
    if (ct % 10000 == 0):
        print("counting duplicates " + str(100 * (ct / len(posArrayUnique))) + "%")
    posFreqArray.append({"position": i, "ct": posArray.count(i)})

# print("counting variations")
posFreqArray = sorted(posFreqArray, key=lambda x: x['ct'])
# the below is all unnecessary lol
# ct = 0
# for i in range(0,len(posFreqArray)):
#     ct += 1
#     if (ct % 100 == 0):
#         print(str(100*(ct/len(posFreqArray))))
#     for n in range(0, len(posFreqArray)):
#         if (i != n):
#             if (posFreqArray[i]["position"] in posFreqArray[n]["position"]):
#                 posFreqArray[i]["ct"] += posFreqArray[n]["ct"]
print('writing to data.txt')
outputFile = open("data.txt", "w")
output = ''
for i in posFreqArray:
    output += str(i) + '\n'
outputFile.write(output)
print("done")

I'm on macOS 13.2.1 and Python 3.11.

Any help would be greatly appreciated. Thanks in advance!

I think the problem is the process's virtual memory limit being exceeded; the same thing is happening to me with another Python script.
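One way to check whether memory really is the culprit is to print the process's peak resident set size at a few points in the script. A minimal sketch using the standard-library resource module (note the platform quirk: ru_maxrss is reported in bytes on macOS but in kilobytes on Linux):

```python
import resource

# Peak resident set size the process has reached so far.
# macOS reports this in bytes; Linux reports it in kilobytes.
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak memory so far: {peak}")
```

Sprinkling a line like this after each stage (reading the file, building posArray, counting) would show which step is ballooning before the kernel kills the process.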

  • I have a similar issue. I had a leftover Google Colab subscription from my semester and ran my program there. It turns out the problem (in my case at least) was the system RAM being exceeded. I would recommend looking into Python's gc (garbage collection) module and explicitly freeing things you are no longer using. For example, gamesArray looks like a resource that could be freed, since it isn't used anywhere besides the first loop. If the problem still persists, you may need to make a trade-off or outsource the runtime to a server with a higher RAM capacity.
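To make the suggestion above concrete, here is a minimal sketch (with a small stand-in list rather than the real 10 GB gamesArray) of dropping the only reference to a large list once it has been consumed, then asking the collector to clean up:

```python
import gc

# Stand-in for the real gamesArray built from games.txt.
lines = ["1. e4 e5 2. Nf3"] * 3

# Consume the lines into positions, as the first loop in the script does.
positions = [p for line in lines for p in line.split('.')]

del lines             # drop the only reference to the big list
freed = gc.collect()  # returns the count of unreachable objects collected
print(f"collector reported {freed} objects")
```

Note that del alone already lets CPython reclaim a reference-counted list immediately; gc.collect() mainly helps with reference cycles, so the real win here is releasing gamesArray (and posArray, once the set is built) as soon as each is no longer needed.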