CreateML/CoreML Issues with Large Dataset

Hello All,

I'm developing a machine learning model for image classification, which requires managing an exceptionally large dataset comprising over 18,000 classes. I've encountered several hurdles while using Create ML, and I would appreciate any insights or advice from those who have faced similar challenges.

Current Issues:

  1. Create ML Failures with Large Datasets:

    • When using Create ML, the process often fails with errors such as "Failed to create CVPixelBufferPool." This issue appears when handling particularly large volumes of data.
  2. Custom Implementation Struggles:

    • To bypass some of the limitations of Create ML, I've developed a custom solution that uses MLImageClassifier from the CreateML framework in my own SwiftUI macOS app.
    • Initially I hit the same errors as in Create ML, but I found I could get past the "extracting features" stage without crashing by using a timer to cancel and restart the job every 30 seconds. This is the only way I've been able to finish the extraction phase on the large dataset, but it produces many errors in the console if I let it run too long.
  3. Lack of Progress Reporting:

    • Using MLJob<MLImageClassifier>, I've noticed that progress reporting stalls after the feature extraction phase. Although system resources indicate activity, there is no programmatic feedback on what is occurring.

Things I've Tried:

  • Data Validation: Ensured that all images in the dataset are valid and non-corrupted, which helps prevent unnecessary issues from faulty data.
  • Custom Implementation with CreateML Framework: Developed a custom solution using the MLImageClassifier within the CreateML framework to gain more control over the training process.
  • Timer-Based Workaround: Employed a workaround using a timer to cancel and restart the job every 30 seconds to move past the "extracting features" phase, allowing progress even with larger datasets.
  • Monitoring System Resources: Observed ongoing system resource usage when process feedback stalled, confirming background processing activity despite the lack of progress reporting.
  • Subset Testing: Successfully created and tested a model on a subset of the data, which validated the approach worked for smaller datasets and could produce a functioning model.
  • Router Model Concept: Considered training multiple models for different subsets of data and implementing a "router" model to decide which specialized model to utilize based on input characteristics.
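
The router idea in the last bullet depends on first splitting the 18,000 classes into groups small enough to train one model each. A minimal sketch of that split, assuming the class labels are the directory names under the training folder. The names `shardLabels` and `shardIndex` and the shard size of 1,000 are illustrative only; they do not come from the app or from Create ML:

```swift
import Foundation

/// Splits class labels into fixed-size shards, one shard per specialized model.
/// Labels are sorted first so the same input always yields the same shards.
func shardLabels(_ labels: [String], maxPerShard: Int) -> [[String]] {
    precondition(maxPerShard > 0, "maxPerShard must be positive")
    let sorted = labels.sorted()
    return stride(from: 0, to: sorted.count, by: maxPerShard).map { start in
        Array(sorted[start..<min(start + maxPerShard, sorted.count)])
    }
}

/// Maps each label to the index of the shard (and therefore the model) that owns it.
func shardIndex(for shards: [[String]]) -> [String: Int] {
    var index: [String: Int] = [:]
    for (i, shard) in shards.enumerated() {
        for label in shard { index[label] = i }
    }
    return index
}

// Synthetic labels standing in for the real class directory names.
let labels = (0..<18_000).map { "class_\($0)" }
let shards = shardLabels(labels, maxPerShard: 1_000)
let owner = shardIndex(for: shards)
print("shards: \(shards.count), largest: \(shards.map(\.count).max() ?? 0)")
```

A router would then be trained to predict the shard index, with one classifier per shard. Whether that two-stage setup keeps accuracy acceptable is untested here.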

What I Need Help With:

  • Handling Large Datasets:

    • I'm seeking strategies or best practices for effectively utilizing Create ML with large datasets.
    • Any guidance on memory management or alternative methodologies would be immensely helpful.
  • Improving Progress Reporting:

    • I'm looking for ways to obtain more consistent and programmatic progress updates during the training and testing phases.
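
One mechanism that might give finer-grained updates than the checkpoints publisher is observing the job's `Progress` object directly through key-value observing. The sketch below shows only the observation mechanism, using a stand-in `Progress` instead of a real `job.progress`. `ProgressWatcher` is a hypothetical helper name, and whether Create ML actually updates `fractionCompleted` during the training phase is an open question, not something this confirms:

```swift
import Combine
import Foundation

/// Records every change to a Progress object's fractionCompleted via KVO.
final class ProgressWatcher {
    private(set) var fractions: [Double] = []
    private var subscription: AnyCancellable?

    init(_ progress: Progress) {
        // The KVO publisher emits the current value on subscription,
        // then one value per change; duplicates are filtered out.
        subscription = progress
            .publisher(for: \.fractionCompleted)
            .removeDuplicates()
            .sink { [weak self] fraction in
                self?.fractions.append(fraction)
            }
    }
}

// Stand-in for `job.progress`; with a real MLJob you would pass job.progress.
let progress = Progress(totalUnitCount: 10)
let watcher = ProgressWatcher(progress)
progress.completedUnitCount = 5
progress.completedUnitCount = 10
print(watcher.fractions)
```

KVO callbacks arrive on whichever thread updates the `Progress`, so a real app would hop to the main queue before touching UI state.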

I'm working on an M1 Pro Mac (Apple Silicon) with 32 GB of RAM and am fully invested in the Apple ecosystem. I'd be very grateful for any advice or experience you could share to help overcome these challenges.

Thank you!

I've pasted the relevant code below:

func go() {
    if self.trainingSession == nil {
        self.trainingSession = createTrainingSession()
    }

    if self.startTime == nil {
        self.startTime = Date()
    }

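    // Avoid try!: a failed resume (for example right after a cancel) should not crash the app.
    do {
        job = try MLImageClassifier.resume(self.trainingSession)
    } catch {
        print("Failed to resume training session: \(error.localizedDescription)")
        return
    }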

    job.phase
        .receive(on: RunLoop.main)
        .sink { phase in
            self.phase = phase
        }
        .store(in: &cancellables)

    job.checkpoints
        .receive(on: RunLoop.main)
        .sink { checkpoint in
            self.state = "\(checkpoint)\n\(self.job.progress)"
            // The 0.2 offset can push the sum past 1.0, so clamp it for display.
            self.progress = min(1.0, self.job.progress.fractionCompleted + 0.2)
            self.updateTimeEstimates()
        }
        .store(in: &cancellables)

    job.result
        .receive(on: DispatchQueue.main)
        .sink(receiveCompletion: { completion in
            switch completion {
            case .failure(let error):
                print("Training Failed: \(error.localizedDescription)")
            case .finished:
                print("🎉🎉🎉🎉 TRAINING SESSION FINISHED!!!!")
                self.trainingFinished = true
            }
        }, receiveValue: { classifier in
            Task {
                await self.saveModel(classifier)
            }
        })
        .store(in: &cancellables)
}

private func createTrainingSession() -> MLTrainingSession<MLImageClassifier> {
    do {
        print("Initializing training Data...")
        let trainingData: MLImageClassifier.DataSource = .labeledDirectories(at: trainingDataURL)
        let modelParameters = MLImageClassifier.ModelParameters(
            validation: .split(strategy: .automatic),
            augmentation: self.augmentations,
            algorithm: .transferLearning(
                featureExtractor: .scenePrint(revision: 2),
                classifier: .logisticRegressor
            )
        )
        let sessionParameters = MLTrainingSessionParameters(
            sessionDirectory: self.sessionDirectoryURL,
            reportInterval: 1,
            checkpointInterval: 100,
            iterations: self.numberOfIterations
        )
        print("Initializing training session...")

        let trainingSession: MLTrainingSession<MLImageClassifier>
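        // Use the non-percent-encoded path for both checks; path() percent-encodes by default,
        // which breaks file lookups for directories whose names contain spaces.
        if FileManager.default.fileExists(atPath: self.sessionDirectoryURL.path) && isSessionCreated(atPath: self.sessionDirectoryURL.path) {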
            do {
                trainingSession = try MLImageClassifier.restoreTrainingSession(sessionParameters: sessionParameters)
            }
            catch {
                print("error resuming, exiting.... \(error.localizedDescription)")
                fatalError()
            }
        }
        else {
            trainingSession = try MLImageClassifier.makeTrainingSession(
                trainingData: trainingData,
                parameters: modelParameters,
                sessionParameters: sessionParameters
            )
        }

        return trainingSession
    } catch {
        print("Failed to initialize training session: \(error.localizedDescription)")
        fatalError()
    }
}