The WWDC25 session "Explore large language models on Apple silicon with MLX" talks about using your own data to fine-tune a large language model, but it doesn't explain what kind of data can be used; it just shows the command to run and how to point it at the data folder. Can I use PDFs, Word documents, or Markdown files to train the model? Are there any code examples on GitHub that demonstrate how to do this?
Data used for MLX fine-tuning
The process is explained here: https://github.com/ml-explore/mlx-examples/tree/main/lora#Custom-Data
and examples of the JSON files are here: https://github.com/ml-explore/mlx-examples/tree/main/lora/data
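Each line of those files is a standalone JSON object with a "text" key (the JSON Lines format, one example per line, not a single JSON array). For illustration only, the contents below are invented, but a train.jsonl in this format would look like:

```
{"text": "Q: What is MLX? A: MLX is an array framework for machine learning on Apple silicon."}
{"text": "Q: What file format does the LoRA example expect? A: JSON Lines, one JSON object per line."}
```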
Note the specific format, and imagine how much work is required to create these files; it's not an easy feat. But then that's why Scale AI exists: to put the Global South (Venezuela and Chile, for example) to work for $5 a day doing this labeling remotely. This is the dark underbelly of deep learning, along with the existential threat of global warming from its exponentially increasing energy requirements.
The GitHub page explains the format fairly well. I've added info from my eBook content after saving the Pages version as plain text. I created the train.jsonl file by breaking the text file into lines and wrapping each line in '{"text": "My text info"}', where "My text info" is that line's content. I fudged the valid.jsonl file by copying in a small part of train.jsonl. The biggest problem was finding special characters in my original text that weren't JSON-compatible (e.g. the tab character). This was just for adding 'text' content to the model using LoRA.
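If it helps anyone following the same route: the special-character problem goes away if you let a JSON library do the serialization instead of wrapping each line by hand. Here's a minimal Python sketch; the file names and the 10% validation split are my own assumptions, not something from the video or the repo:

```python
import json
import random
from pathlib import Path

# Placeholder paths; adjust to your own export.
INPUT_PATH = "book.txt"      # plain-text export from Pages
OUT_DIR = Path("data")
VALID_FRACTION = 0.1         # hold out ~10% of lines for validation

with open(INPUT_PATH, encoding="utf-8") as f:
    # One training example per non-empty line of the export.
    lines = [line.strip() for line in f if line.strip()]

random.seed(0)               # reproducible split
random.shuffle(lines)
n_valid = max(1, int(len(lines) * VALID_FRACTION))
valid, train = lines[:n_valid], lines[n_valid:]

OUT_DIR.mkdir(parents=True, exist_ok=True)

def write_jsonl(path: Path, rows: list[str]) -> None:
    with open(path, "w", encoding="utf-8") as out:
        for text in rows:
            # json.dumps escapes tabs, quotes, newlines, and other
            # characters that break hand-built JSON strings.
            out.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")

write_jsonl(OUT_DIR / "train.jsonl", train)
write_jsonl(OUT_DIR / "valid.jsonl", valid)
print(f"Wrote {len(train)} train / {len(valid)} valid examples.")
```

The LoRA example then picks up train.jsonl and valid.jsonl from the directory you pass via its --data flag (data/ by default, if I remember the README correctly).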