Efficiently Importing Data

This article describes how you can efficiently import data into a Core Data application and turn the data into managed objects to save to a persistent store. It discusses some of the fundamental Cocoa patterns you should follow, and patterns that are specific to Core Data.

Cocoa Fundamentals

In common with many other situations, when you use Core Data to import a data file it is important to remember that the “normal rules” of Cocoa application development apply. If you import a data file that you have to parse in some way, it is likely you will create a large number of temporary objects. These can take up a lot of memory and lead to paging. Just as you would with a non-Core Data application, you can use local autorelease pool blocks to put a bound on how many additional objects reside in memory. For more about the interaction between Core Data and memory management, see “Reducing Memory Overhead.”
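For example, if you parse a file line by line, you might wrap the work for each line in its own autorelease pool block so that temporary objects are released as you go rather than accumulating for the duration of the import. This is only a minimal sketch; fileContents stands in for the string you have read from the file.

NSArray *lines = [fileContents componentsSeparatedByString:@"\n"];
for (NSString *line in lines) {
    @autoreleasepool {
        // Parse the line here. Any autoreleased temporary objects are
        // released when this pool block ends, rather than accumulating
        // for the whole import.
    }
}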

You should also avoid repeating work unnecessarily. One subtle case is creating a predicate whose value changes on each pass through a loop. If you create the predicate as shown in the following example, you not only create a new predicate every time through the loop, you also parse a format string every time.

// Loop over employeeIDs, creating and parsing a new predicate on each pass.
for (NSString *anID in employeeIDs) {
    NSPredicate *predicate = [NSPredicate predicateWithFormat:@"employeeID == %@", anID];
    // ... configure and execute a fetch request using this predicate ...
}

To create a predicate from a formatted string, the framework must parse the string and create instances of predicate and expression objects. If you are using the same form of a predicate many times over but changing the value of one of the constant value expressions on each use, it is more efficient to create a predicate once and then use variable substitution (see “Creating Predicates”). This technique is illustrated in the following example.

NSPredicate *predicate = [NSPredicate predicateWithFormat:@"employeeID == $EMPLOYEE_ID"];

for (NSString *anID in employeeIDs) {
    NSDictionary *variables = @{ @"EMPLOYEE_ID" : anID };
    NSPredicate *localPredicate = [predicate predicateWithSubstitutionVariables:variables];
    // ... configure and execute a fetch request using localPredicate ...
}

Reducing Peak Memory Footprint

If you import a large amount of data into a Core Data application, you should make sure you keep your application’s peak memory footprint low by importing the data in batches and purging the Core Data stack between batches. The relevant issues and techniques are discussed in “Core Data Performance” (particularly “Reducing Memory Overhead”) and “Object Lifetime Management,” but they’re summarized here for convenience.

Importing in batches

First, you should typically create a separate managed object context for the import, and set its undo manager to nil. (Contexts are not particularly expensive to create, so if you cache your persistent store coordinator you can use different contexts for different working sets or distinct operations.)

NSManagedObjectContext *importContext = [[NSManagedObjectContext alloc] init];
NSPersistentStoreCoordinator *coordinator = <#Get the coordinator#>;
[importContext setPersistentStoreCoordinator:coordinator];
[importContext setUndoManager:nil];

(If you have an existing Core Data stack, you can get the persistent store coordinator from another managed object context.) Setting the undo manager to nil means that:

  1. You don’t waste effort recording undo actions for changes (such as insertions) that will not be undone;

  2. The undo manager doesn’t maintain strong references to changed objects, and so doesn’t prevent them from being deallocated (see “Change and Undo Management”).

You should import data and create corresponding managed objects in batches (the optimum batch size depends on how much data is associated with each record and how low you want to keep the memory footprint). Process each batch within its own autorelease pool block, and at the end of each batch save the managed object context (using save:). (Until you save, the context must keep strong references to all the pending changes you’ve made to the inserted objects.)
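Putting these pieces together, the import loop might look something like the following sketch. The batches array and the code that turns each record into a managed object are hypothetical placeholders for your own parsing code; importContext is the context created above.

for (NSArray *batch in batches) {
    @autoreleasepool {
        for (id record in batch) {
            // Create and populate a managed object for this record
            // in the import context.
        }

        // Save at the end of the batch so that the context can release the
        // strong references it holds to the pending changes.
        NSError *saveError = nil;
        if (![importContext save:&saveError]) {
            // Handle the error appropriately for your application.
        }
    }
}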

Dealing with strong reference cycles

Managed objects with relationships nearly always create unreclaimable strong reference cycles. If during the import you create relationships between objects, you need to break the cycles so that the objects can be deallocated when they’re no longer needed. To do this, you can either turn the objects into faults, or reset the whole context. For a complete discussion, see “Breaking Relationship Strong Reference Cycles.”
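As an illustration, after each save you might turn the objects you have finished with back into faults, or reset the context entirely if you no longer need anything it contains. This is a minimal sketch, assuming the importContext from above and a hypothetical batchObjects array holding the managed objects created in the current batch:

// Option 1: Turn individual objects back into faults. Passing NO for
// mergeChanges discards their in-memory property values and releases the
// strong references held through their relationships.
for (NSManagedObject *object in batchObjects) {
    [importContext refreshObject:object mergeChanges:NO];
}

// Option 2: If you are finished with everything the context contains,
// reset it to release all of its registered managed objects.
[importContext reset];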

Implementing Find-or-Create Efficiently

A common technique when importing data is to follow a "find-or-create" pattern, where you set up some data from which to create a managed object, determine whether the managed object already exists, and create it if it does not.

There are many situations where you may need to find existing objects (objects already saved in a store) for a set of discrete input values. A simple solution is to loop over the values and, for each value in turn, execute a fetch to determine whether there is a matching persisted object. This pattern does not scale well. If you profile your application with this pattern, you typically find the fetch to be one of the more expensive operations in the loop (compared to just iterating over a collection of items). Even worse, this pattern turns an O(n) problem into an O(n^2) problem.
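For illustration, the pattern to avoid looks something like the following sketch, which assumes the Employee entity, employeeID attribute, and aMOC managed object context used later in this article. Each pass through the loop performs its own round trip to the store:

// Inefficient: executes one fetch per ID.
for (NSString *anID in employeeIDs) {
    NSFetchRequest *request = [[NSFetchRequest alloc] init];
    [request setEntity:
            [NSEntityDescription entityForName:@"Employee" inManagedObjectContext:aMOC]];
    [request setPredicate:[NSPredicate predicateWithFormat:@"employeeID == %@", anID]];

    NSError *fetchError = nil;
    NSArray *matches = [aMOC executeFetchRequest:request error:&fetchError];
    if ([matches count] == 0) {
        // No existing object for this ID; create a new Employee here.
    }
}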

It is much more efficient—when possible—to create all the managed objects in a single pass, and then fix up any relationships in a second pass. For example, if you import data that you know does not contain any duplicates (say because your initial data set is empty), you can just create managed objects to represent your data and not do any searches at all. Or if you import "flat" data with no relationships, you can create managed objects for the entire set and weed out (delete) any duplicates before saving, using a single large IN predicate, as sketched below.
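The de-duplication step might look something like the following sketch, where importedIDs (the IDs you just imported) and newEmployees (the corresponding newly inserted managed objects) are hypothetical, and Employee and employeeID are the entity and attribute used in the rest of this article. This version keeps the objects already in the store and discards the newly created duplicates:

// Find, with a single fetch, the imported IDs that already exist in the store.
NSFetchRequest *request = [[NSFetchRequest alloc] init];
[request setEntity:
        [NSEntityDescription entityForName:@"Employee" inManagedObjectContext:aMOC]];
[request setPredicate:[NSPredicate predicateWithFormat:@"employeeID IN %@", importedIDs]];
// Evaluate the fetch against the store only, ignoring the unsaved objects
// that were just inserted into the context.
[request setIncludesPendingChanges:NO];

NSError *fetchError = nil;
NSArray *existingEmployees = [aMOC executeFetchRequest:request error:&fetchError];
NSSet *existingIDs = [NSSet setWithArray:[existingEmployees valueForKey:@"employeeID"]];

// Delete the newly created objects that duplicate persisted ones.
for (NSManagedObject *employee in newEmployees) {
    if ([existingIDs containsObject:[employee valueForKey:@"employeeID"]]) {
        [aMOC deleteObject:employee];
    }
}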

If you do need to follow a find-or-create pattern—say because you're importing heterogeneous data where relationship information is mixed in with attribute information—you can optimize how you find existing objects by reducing to a minimum the number of fetches you execute. How to accomplish this depends on the amount of reference data you have to work with. If you are importing 100 potential new objects and have only 2000 in your database, fetching all of the existing objects and caching them may not represent a significant penalty (especially if you have to perform the operation more than once). However, if you have 100,000 items in your database, the memory pressure of keeping those objects cached may be prohibitive.

You can use a combination of an IN predicate and sorting to reduce your use of Core Data to a single fetch request. Suppose, for example, you want to take a list of employee IDs (as strings) and create Employee records for all those not already in the database. Consider this code, where Employee is an entity with an employeeID attribute, and listOfIDsAsString is a newline-separated string of the IDs for which you want to add objects if they do not already exist in a store.

First, separate and sort the IDs (strings) of interest.

// Get the IDs to parse, in sorted order.
NSArray *employeeIDs = [[listOfIDsAsString componentsSeparatedByString:@"\n"]
        sortedArrayUsingSelector:@selector(compare:)];

Next, create a predicate using IN with the array of ID strings, and a sort descriptor that ensures the results are returned in the same order as the array of ID strings. (The IN is equivalent to an SQL IN operation, where the left-hand side must appear in the collection specified by the right-hand side.)

// Create the fetch request to get all Employees matching the IDs.
NSFetchRequest *fetchRequest = [[NSFetchRequest alloc] init];
[fetchRequest setEntity:
        [NSEntityDescription entityForName:@"Employee" inManagedObjectContext:aMOC]];
[fetchRequest setPredicate: [NSPredicate predicateWithFormat:@"(employeeID IN %@)", employeeIDs]];
 
// make sure the results are sorted as well
[fetchRequest setSortDescriptors:
        @[[[NSSortDescriptor alloc] initWithKey: @"employeeID" ascending:YES]]];

Finally, execute the fetch.

NSError *error;
NSArray *employeesMatchingNames = [aMOC executeFetchRequest:fetchRequest error:&error];

You end up with two sorted arrays—one with the employee IDs passed into the fetch request, and one with the managed objects that matched them. To process them, you walk the sorted lists following these steps:

  1. Get the next ID and the next fetched Employee. If the ID doesn’t match that Employee’s employeeID (or there are no fetched Employees left), the object is missing from the store, so create a new Employee for that ID and move on to the next ID.

  2. If the ID does match, the object already exists, so move on to both the next ID and the next Employee.

Regardless of how many IDs you pass in, you only execute a single fetch, and the rest is just walking the result set.
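A sketch of that walk might look like the following; the creation of a new Employee is left as a placeholder comment because it depends on your model and import code:

// Walk the two sorted arrays in parallel. Both are ordered by employee ID,
// so a simple merge identifies the IDs with no matching fetched object.
NSUInteger matchIndex = 0;
NSUInteger matchCount = [employeesMatchingNames count];

for (NSString *anID in employeeIDs) {
    NSManagedObject *employee =
            (matchIndex < matchCount) ? employeesMatchingNames[matchIndex] : nil;

    if (employee != nil && [[employee valueForKey:@"employeeID"] isEqual:anID]) {
        // This ID already has a persisted Employee; advance to the next match.
        matchIndex++;
    } else {
        // No persisted Employee for this ID; create one here.
    }
}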

The listing below shows the complete fetch code for the preceding example.

// Get the IDs to parse, in sorted order.
NSArray *employeeIDs = [[listOfIDsAsString componentsSeparatedByString:@"\n"]
        sortedArrayUsingSelector:@selector(compare:)];
 
// Create the fetch request to get all Employees matching the IDs.
NSFetchRequest *fetchRequest = [[NSFetchRequest alloc] init];
[fetchRequest setEntity:
        [NSEntityDescription entityForName:@"Employee" inManagedObjectContext:aMOC]];
[fetchRequest setPredicate: [NSPredicate predicateWithFormat: @"(employeeID IN %@)", employeeIDs]];
 
// Make sure the results are sorted as well.
[fetchRequest setSortDescriptors:
    @[ [[NSSortDescriptor alloc] initWithKey: @"employeeID" ascending:YES] ]];
// Execute the fetch.
NSError *error = nil;
NSArray *employeesMatchingNames = [aMOC executeFetchRequest:fetchRequest error:&error];
if (employeesMatchingNames == nil) {
    // Handle the error appropriately for your application.
}