Classify hand poses and actions with Create ML
With Create ML, giving your app the ability to understand the expressiveness of the human hand has never been easier. Discover how you can build off the support for Hand Pose Detection in Vision and train custom Hand Pose and Hand Action classifiers using the Create ML app and framework. Learn how simple it is to collect data, train a model, and integrate it with Vision, Camera, and ARKit to create a fun, entertaining app experience. To learn more about Create ML and related concepts around model training, check out “Build an Action Classifier with Create ML” from WWDC20. And don't miss “Build dynamic iOS apps with the Create ML framework” to learn how your models can be trained on-the-fly and on device from within your app.
-
♪ ♪ Hi, and welcome to "Classify Hand Poses and Actions with Create ML." I'm Nathan Wertman. And today, I'll be joined by my colleagues, Brittany Weinert and Geppy Parziale. Today, we're going to be talking about classifying hand positions. But before we dig into that, let's talk about the hand itself. With over two dozen bones, joints, and muscles, the hand is an engineering marvel. In spite of this complexity, the hand is one of the first tools infants use to interact with the world around them. Babies learn the basics of communication using simple hand movements before they are able to speak. Once we learn to speak, our hands continue to play a role in communication. They shift to adding emphasis and expression. In the past year, our hands have become more important than ever to bring people closer. In 2020, the Vision Framework introduced Hand Pose Detection, which allows developers to identify hands in the frame, as well as each of the 21 identifiable joints present in the hand. This is a great tool if you are trying to identify if a hand exists or where a hand is in the frame, but it can be a challenge if you're trying to classify what the hand is doing. While the expressive capabilities of the hand are limitless, for the rest of this session, I'd like to focus on short, single-handed poses, like stop, and quiet, and peace, and short, single-handed actions, like back up, go away, and come here. I just referred to hand poses and actions. How about some more concrete definitions? Well, on the one hand, we have poses, which are meaningful as a still image. Even though these two videos are paused, the intent of the subject is clearly expressed. Think of a pose like an image. And on the other hand, we have an action, which requires movement to fully express meaning. The meaning of these two actions is unclear. Looking at a single frame simply isn't sufficient. But with a series of frames over time, like a video or a live photo, the meaning of the action is obvious. A friendly "hello" and "come here." With that cleared up, I'm excited to introduce two new Create ML templates this year, Hand Pose Classification and Hand Action Classification. These new templates allow you to train Hand Pose and Action models using either the Create ML app or the Create ML framework. These models are compatible with macOS Big Sur and later, as well as iOS and iPadOS 14 and later.
And new for this year, we've added the ability to train models on iOS devices using the Create ML framework. You can learn more about this in the "Build Dynamic iOS Apps with the Create ML Framework" session. First, I'd like to talk about Hand Pose Classification, which allows you to easily train machine learning models to classify hand positions detected by the Vision Framework. Since you are in charge of training the model, you define which poses your app should classify to best suit its needs. Let me give you a brief demo of a trained model in action. Starting from a simple prototype app, I was able to easily integrate a Hand Pose Classifier model. My app can now classify hand poses and show the corresponding emoji and confidence of the classified pose. It classifies the hand poses One as well as Two. But you'll notice that all positions which it does not recognize are classified as part of the background. This includes the Open Palm pose, which I'd like to add support for. I'm going to hand this model off to my colleague Brittany in a moment, to show you how to integrate a Hand Pose Classifier model into your app. Before I do, I wanna add support for the Open Palm pose. It'll be really easy, but we should talk about how a model is trained first.
Just like all other Create ML projects, it's very simple to integrate a Hand Pose Classifier into your app. The process has three steps: collect and categorize training data, train the model, and integrate the model into your application. Let's start by talking about collecting training data. For a Hand Pose Classifier, you will need images. Remember, poses are fully expressive as images. These images should be categorized into folders whose names match the pose present in the images. Here we have two poses which we would like to identify, One and Two, as well as a Background class.
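As a rough sketch, the training data for the model in this demo might be laid out like this (the folder and file names here are purely illustrative):

HandPoseTrainingData/
    One/
        one-001.jpg
        one-002.jpg
        ...
    Two/
        two-001.jpg
        ...
    Background/
        background-001.jpg
        ...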
The Background class is a catch-all category for poses which your app doesn't care about correctly identifying. In my demo, this includes many hand positions which are not One or Two. A well-defined Background class helps your app know when your user is not making an important pose. There are two types of images which make up a Background class. First, we have a random assortment of hand poses which are not the important poses you'd like your app to classify. These poses should include a diverse set of skin tones, ages, genders, and lighting conditions. Second, we have a set of positions which are very similar to the expressions you'd like your app to classify. These transitional poses frequently occur as the user is moving their hand towards one of the expressions your app cares about. When I raise my hand to give an Open Palm pose, notice I transition through several positions which are similar to but not quite what I want my app to consider an Open Palm. These positions occur as I lower my arm afterward as well. This isn't unique to the Open Palm. The same type of transitional poses occur as I raise my arm to give a Two pose, as well as when I lower it. All of these transitional poses should be added into the Background class, along with the random poses. This combination allows the model to properly differentiate the poses your app cares about and all other background poses.
With the training data collected and categorized, it's now time to train our model using the Create ML app. So let's get our hands dirty. I'm gonna start with the existing Create ML project I used to train the model in my previous demo. The training results looked good, so I'm expecting this model to perform fairly well. Fortunately, the Create ML app allows you to preview your model before integrating it into your app. On the Preview tab, you'll find that, for Hand Pose Classifiers, we've added Live Preview capability this release. Live Preview takes advantage of the FaceTime camera to show you predictions in real-time. Using Live Preview, we can verify that this model correctly classifies the poses One and Two. And I would like it to correctly classify the Open Palm as well, but it currently classifies that pose as part of the Background class. In the Data Source that I used to train this model, notice that it does not include an Open Palm class, only classes for One, Two, and Background. Let's train a new model which supports Open Palm now. First, I'm gonna create a new model source for this.
I have a data set which includes an Open Palm class that I'd like to use for this training. I will select this data set.
Jumping into this new data source, we find that it now includes an entry for the Open Palm as well as the classes from the previous data set. Back on the model source, I'd like to add a few augmentations to extend the training data and make my model more robust.
That's it. It's time to hit Train.
Before the training can begin, Create ML needs to do some preliminary image processing, as well as feature extraction. We told Create ML to train for 80 iterations. This is a good starting point, but you may need to tweak that number based on your data set. This process will take some time. Fortunately, I've already trained a model. Let me grab that now. Live Preview shows that our newly-trained model now correctly identifies the Open Palm pose. And just to be sure, I'm gonna verify that it continues to identify the One and the Two pose.
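If you prefer to train in code rather than in the Create ML app, the same template is exposed by the Create ML framework as MLHandPoseClassifier. The snippet below is only a minimal sketch under that assumption: the directory paths are placeholders, and you should check the MLHandPoseClassifier documentation for the exact initializer, data-source cases, and parameters you need.

import CreateML
import Foundation

// Placeholder paths; point these at your own data set and output location.
let trainingDirectory = URL(fileURLWithPath: "/path/to/HandPoseTrainingData")
let outputURL = URL(fileURLWithPath: "/path/to/HandPoseClassifier.mlmodel")

// Train a Hand Pose Classifier from folders named after each pose class
// (One, Two, Open Palm, Background).
let classifier = try MLHandPoseClassifier(
    trainingData: .labeledDirectories(at: trainingDirectory))

// Check how the model did on the training and validation splits.
print(classifier.trainingMetrics)
print(classifier.validationMetrics)

// Save the trained model so it can be added to an Xcode project.
try classifier.write(to: outputURL)

Either way, the result is a Core ML model file you can drop into an Xcode project, which is exactly what happens next.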
Wasn't that easy? I'm gonna send this model to my colleague, Brittany, and she will talk about integrating it into her app. Thanks, Nathan, for the model. Hello. I'm Brittany Weinert. And I'm a member of the Vision Framework team.
When I first learned about Hand Pose Classification, I immediately thought, I can use this to create special effects with my hands.
I know that CoreML, to classify hand poses, and Vision, to detect and track the hands, will be the perfect technologies to use together. Let's see if we can give ourselves superpowers. I've already created a first draft of the pipeline for a demo that can do just that. Let's review it.
First, we're going to have a camera providing a stream of frames, and we're going to use each frame for a Vision request to detect the location and key points of the hands in the frame. VNDetectHumanHandPoseRequest will be the request we're using. It will return a HumanHandPoseObservation for each hand it finds in the frame. The data we will send to the CoreML Hand Pose Classification model is an MLMultiArray, which we get from the HumanHandPoseObservation's keypointsMultiArray method. Our Hand Pose Classifier will then return to us the top estimated hand pose label with its confidence score, which we can then use to determine action within the app. Now that we've gone over the high-level details of the app, let's look at the code. Let's start out by looking at how to use Vision to detect hands in a frame. For what we want to do, we only need one instance of the VNDetectHumanHandPoseRequest, and we only need to detect one hand, so we set the maximumHandCount to one.
If you set the maximumHandCount and there are more hands in the frame than specified, the algorithm will detect the most prominent and central hands in the frame instead. The default value for maximumHandCount is two.
We recommend setting the revision here, so you're not surprised by updates to the request later. But if you always want to opt in for the latest algorithm supported by the SDK you're linked against, you don't have to set it. Also, as a note, we will be doing the detection on every frame retrieved by the ARSession via ARKit, but this is only one way to grab frames from a camera feed. You may use whichever method you like. AVCaptureOutput would also be a useful alternative.
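For reference, here is a minimal sketch of the AVCaptureOutput route, assuming a simple helper class of our own (CameraFeed is not part of any framework); each pixel buffer it produces would be handed to the same Vision request handler described next.

import AVFoundation
import CoreMedia

final class CameraFeed: NSObject, AVCaptureVideoDataOutputSampleBufferDelegate {
    private let session = AVCaptureSession()
    private let videoOutput = AVCaptureVideoDataOutput()
    private let processingQueue = DispatchQueue(label: "CameraFeedQueue")

    func start() throws {
        guard let camera = AVCaptureDevice.default(for: .video) else { return }
        let input = try AVCaptureDeviceInput(device: camera)
        if session.canAddInput(input) { session.addInput(input) }
        videoOutput.setSampleBufferDelegate(self, queue: processingQueue)
        if session.canAddOutput(videoOutput) { session.addOutput(videoOutput) }
        session.startRunning()
    }

    func captureOutput(_ output: AVCaptureOutput,
                       didOutput sampleBuffer: CMSampleBuffer,
                       from connection: AVCaptureConnection) {
        guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }
        // Feed pixelBuffer to a VNImageRequestHandler, just like the ARFrame's capturedImage.
    }
}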
For every frame received, we need to create a VNImageRequestHandler, which handles all the requests on a given image. The results property on the hand pose request will be populated with VNHumanHandPoseObservations, up to a max hand number of one, as we specified on the request earlier.
If the request detects no hand poses, we might want to clear out any effects currently being displayed. Otherwise, we will have a single hand observation.
Next, we want to predict what our hand pose is using our CoreML model. We don't want to do a prediction every single frame, as we don't want the rendering of the effect to be jittery. Doing a prediction at intervals creates a smoother user experience. When we want to make a prediction, we start by passing the MLMultiArray to the Hand Pose CoreML model, and we retrieve the top label and confidence from the single prediction returned.
I want to trigger changes to the effects being displayed only when a label is predicted with a high level of confidence. This also is key to protecting against behavior where the effect may switch on and off too quickly and become jittery. Here, Background classification is helping us by allowing us to keep the confidence threshold very high.
If One is predicted with great confidence, we can set the effectNode to render. If One isn't predicted with great confidence, I want to stop the effect on the screen to match what my hand is doing. Let's test out what we have. If I make my hand into the One pose, it should trigger a single energy beam effect. Very cool! The model could tell that I made the pose One and triggered the effect. Although it would be even cooler if it followed my finger. Even better if it rendered at a specific point on my finger. Let's go back to the code and change it. What we need to do is feed the key point location of the hand to the graphic asset, which means using the view to translate the normalized key points into the camera view space. You may also want to consider pruning which key points you save by looking at the confidence scores. Here, I only care about the index fingertip. We need to translate the key point to the coordinate space, as Vision uses normalized coordinates. Also, Vision's origin point is in the lower left corner of an image, so keep that in mind when you're doing the conversion.
Finally, let's save the index location, and if no key point was found, we default to nil. Let's look at the code responsible for rendering the effect and how I can adjust it to follow my finger. We want to find the spot where the location of the graphical object is being set. setLocationForEffects is being asynchronously called every frame. As a default, we set the effect to appear at the center of the view. Switching it out for the indexFingerTipLocation CGPoint from earlier, we can get the intended effect.
Awesome! This is starting to look cool. Let's take it one more step. To create a more interesting graphical story surrounding superpowers, it would be good to utilize a few more of the Hand Pose Classifications in our application. In this case, we'll pick the classification Two and Open Palm. I've already extended my application to take action when both of these poses are detected. Here, I'm centering the energy beam to appear at the tip of my index finger, as shown before, for the pose One. Two energy beams at the tip of my middle and index finger for the pose Two. And the last energy beam is triggered by the Hand Pose Open Palm and is anchored between a key point at the bottom of my middle finger and the wrist key point.
All right. Everything Nathan and I have introduced covers the steps of fully integrating your own Hand Pose Classification model. There is one more new feature in Vision that you may find helpful, so let me introduce to you an API that might help with triggering and controlling this app's functionality. Vision is introducing a new property that allows users to differentiate between the left hands and right hands on the HumanHandPoseObservation: chirality. This is an Enum that indicates which hand the HumanHandPoseObservation most likely represents and can be one of three values: left hand, right hand, and unknown. You can probably guess the meaning behind the first two values, but the unknown value would only appear if an older version of the HumanHandPoseObservation were deserialized, and the property had never been set. As Nathan mentioned earlier, you can get more information on Vision hand pose detection by referring back to the WWDC 2020 session, "Detect Body and Hand Pose with Vision." As a side note, for each hand detected in a frame, the underlying algorithm will try to predict the chirality of each hand separately. This means that the prediction of one hand does not affect the prediction of other hands in the frame. Let me show you an example of what code using chirality can look like. We've already covered the setup for creating and running a VNDetectHumanHandPoseRequest. After performing the request, the observations will have the Enum property chirality, and you can use it to take action on or sort through Vision hand pose observations like so.
Everything thus far has been about how to use Hand Pose Classification. But as Nathan mentioned earlier, Hand Action Classification is another new technology this year. Here's Geppy to talk to you about it. Thanks, Brittany. Hello, my name is Geppy Parziale, and I'm a machine learning engineer from the Create ML team. In addition to Hand Pose Classification, this year, Create ML introduces a new template to perform Hand Action Classification, and I'm going to show you how to use it in your apps.
For this reason, I will extend Brittany's superpowers demo with some hand actions and highlight some important distinctions between hand poses and hand actions.
Please refer to the session "Build an Action Classifier with Create ML" from WWDC 2020 for additional information and comparison, since Hand Action and Body Action are two very similar tasks. But now, let me explain what a hand action is.
A hand action consists of a sequence of hand poses that the ML model needs to analyze during the motion of the hand. The number of poses within a sequence should be large enough to capture the entire hand action from start to end.
You use video to capture hand actions. Training a Hand Action Classifier is nearly identical to training a Hand Pose Classifier, as Nathan showed us earlier, with some minor differences.
While a static image represents a hand pose, videos are used to capture and represent hand actions. So to train a Hand Action Classifier, you use short videos, where each video represents a hand action. These videos can be organized into folders, where each folder name represents an action class.
And remember to include a Background class containing videos with actions dissimilar to the actions you want the classifier to recognize.
As an alternative representation, you can put all your example video files in a single folder.
Then, you add an annotation file, using either a CSV or JSON format.
Each entry in the annotation file contains the name of the video file, the associated class, and the starting and ending times of the hand action.
Also, in this case, remember to include a Background class.
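As a purely illustrative example, a CSV annotation file for this layout could look something like the following; the file names, class labels, and column names here are hypothetical, so check the Create ML documentation for the exact format your data source expects.

video,label,start,end
come_here_01.mov,ComeHere,0.0,1.5
go_away_01.mov,GoAway,0.2,1.7
random_motion_01.mov,Background,0.0,1.5

A JSON file carrying the same fields per entry works equally well.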
Remember to train the model with videos of roughly the same length. You provide the action duration as a training parameter, and Create ML then randomly samples a consecutive run of frames according to the value you provide. You can also provide the video frame rate and the number of training iterations. In addition to that, the app offers different types of data augmentation that help the model generalize better and increase its accuracy. In particular, Time Interpolate and Frame Drop are the two augmentations added to Hand Action Classification to provide video variation closer to real use cases.
So I already trained a Hand Action Classifier for my demo-- let's see it in action. Well, since I'm a superhero, I need some source of energy.
Here is mine. Here, I use hand pose to visualize my source of energy. But now, let me use my superpower to activate it.
In this case, I'm using hand action. This is cool.
And now, Hand Pose and Hand Action Classifier are executing concurrently. I'm taking advantage of the new chirality feature from Vision and use my left hand for hand poses and my right hand for hand action.
This is super cool. So this is possible because of the optimization that Create ML applies at training time to every model to unleash all the power of the Apple Neural Engine.
And now, let me go back to the real world and explain how to integrate the Create ML Hand Action Classifier into my demo.
Let's look at the input of the model first. When you integrate a Hand Action Classifier into your app, you need to make sure to provide the model with the correct number of expected hand poses. My model is expecting a MultiArray of size 45 by 3 by 21, as I can inspect in the Xcode preview. Here, 45 is the number of poses the classifier needs to analyze to recognize the action. 21 is the number of joints provided by the Vision Framework for each hand. Finally, 3 covers the x and y coordinates and the confidence value for each joint. Where does the 45 come from? That's the prediction window size, and it depends on the video length and the frame rate of the videos used at training time.
In my case, I decided to train my model with videos recorded at 30 fps and 1.5 seconds long. This means the model was trained with 45 video frames per hand action, so during inference, the model is expecting the same number of hand poses. An additional consideration needs to be taken into account with respect to the frequency of the arriving hand poses at inference time. It's very important that the rate of the hand poses presented to the model during inference matches the rate of the poses used to train the model. In my demo, I used ARKit. So I had to halve the number of arriving poses each second, since ARKit provides frames at 60 fps, and my classifier was trained with videos at 30 fps.
Otherwise, the classifier could provide wrong predictions.
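Put another way, the prediction window is just the training frame rate multiplied by the action duration, which is worth sanity-checking when you wire up your own capture pipeline:

// 30 fps * 1.5 s = 45 poses per prediction window,
// the first dimension of the model's input MLMultiArray (45 x 3 x 21).
let trainingFrameRate = 30.0
let actionDuration = 1.5
let predictionWindowSize = Int(trainingFrameRate * actionDuration)  // 45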
Let's jump now into the source code to show you how to implement this. First, I use a counter to reduce the rate of the poses arriving from Vision from 60 fps to 30 fps, so that it matches the frame rate my model expects in order to work properly.
Then, I get the array containing joints and chirality for each hand in the scene. Next, I discard the key points from my left hand since, in my demo, I use my right hand to activate some of the effects with the hand action.
Okay, now I need to accumulate hand poses for the classifier. To do so, I use a FIFO queue and accumulate 45 hand poses and make sure the queue always contains the last 45 poses. The queue is initially empty. When a new hand pose arrives, I add it to the queue, and I repeat this step until the queue is full. Once the queue is full, I can start to read its entire content. I could read the queue every time I receive a new hand pose from Vision. But remember, now I'm processing 30 frames per second. And depending on the use case, this could be a waste of resources.
So I use another counter to read the queue after a defined number of frames.
You should choose the queue sampling rate as a trade-off between the responsiveness of the application and the number of predictions per second you want to obtain.
At this point, I read the entire sequence of 45 hand poses, organized in an MLMultiArray, and input it to the classifier to predict the hand action.
Then, I extract the prediction label and the confidence value. Finally, if the confidence value is larger than our defined threshold, I add the particle effects to the scene. So remember, when you integrate the Create ML Hand Action Classifier in your app, make sure to input the sequence of hand poses at the frame rate the model is expecting. Match the frame rate of the videos used to train the classifier. Use a first-in-first-out queue to collect hand poses for the model prediction. Read the queue at the right frame rate. I'm looking forward to seeing all the cool applications you will build with the hand action models trained with Create ML. And now, back to Nathan for some final considerations and a recap. Thanks, Geppy. You and Brittany did a great job on your app. I'm excited to try it out. But before things get out of hand, here are a few things you should keep in mind to ensure a high-quality experience for your users. Be mindful of how far away the hand is from the camera. The distance should be kept under 11 feet, or 3 1/2 meters, for best results. It's also best to avoid extreme lighting conditions, either too dark or too light. Bulky, loose, or colorful gloves can make it difficult to accurately detect the hand pose, which may affect classification quality. As with all machine learning tasks, the quality and quantity of your training data are key. For the Hand Pose Classifier shown in this session, we used 500 images per class. For the Hand Action Classifier, we used 100 videos per class, but the data requirements for your use case may differ. What is most important is that you collect enough training data to capture the expected variation your model will see in your app. Now feels like a good time for a recap. So what have we learned? Well, starting in 2021, you can build apps which interpret expressions of the human hand. We discussed the differences between the two categories of hand expressions, poses and actions. We talked about how to prepare training data, including a Background class, for use in the Create ML app to train a model. We talked about how to integrate a trained model into an app. And finally, we talked about incorporating multiple models into a single app and using chirality to differentiate the hands. Obviously, today's demo only scratches the surface. The Vision Framework is a powerful technology for detecting hand presence, pose, position, and chirality. Create ML is a fun and easy way to train classifiers for hand poses and hand actions. When used together, they provide deep insights into one of humankind's most powerful and expressive tools, and we can't wait to see what you do with them. Bye. [upbeat music]
-
9:31 - Detecting hands in a frame
func session(_ session: ARSession, didUpdate frame: ARFrame) {
    let pixelBuffer = frame.capturedImage
    let handPoseRequest = VNDetectHumanHandPoseRequest()
    handPoseRequest.maximumHandCount = 1
    handPoseRequest.revision = VNDetectHumanHandPoseRequestRevision1

    let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, options: [:])
    do {
        try handler.perform([handPoseRequest])
    } catch {
        assertionFailure("Hand Pose Request failed: \(error)")
    }

    guard let handPoses = handPoseRequest.results, !handPoses.isEmpty else {
        // No effects to draw, so clear out current graphics
        return
    }
    let handObservation = handPoses.first
    // handObservation is passed on to the hand pose prediction code below.
}
-
11:03 - Predicting hand pose
if frameCounter % handPosePredictionInterval == 0 {
    guard let keypointsMultiArray = try? handObservation.keypointsMultiArray() else { fatalError() }
    let handPosePrediction = try model.prediction(poses: keypointsMultiArray)
    let confidence = handPosePrediction.labelProbabilities[handPosePrediction.label]!
    if confidence > 0.9 {
        renderHandPoseEffect(name: handPosePrediction.label)
    }
}

func renderHandPoseEffect(name: String) {
    switch name {
    case "One":
        if effectNode == nil {
            effectNode = addParticleNode(for: .one)
        }
    default:
        removeAllParticleNode()
    }
}
-
12:25 - Getting tip of index finger to use as anchor
let landmarkConfidenceThreshold: Float = 0.2
let indexFingerName = VNHumanHandPoseObservation.JointName.indexTip
let width = viewportSize.width
let height = viewportSize.height

if let indexFingerPoint = try? observation.recognizedPoint(indexFingerName),
   indexFingerPoint.confidence > landmarkConfidenceThreshold {
    let normalizedLocation = indexFingerPoint.location
    indexFingerTipLocation = CGPoint(x: normalizedLocation.x * width,
                                     y: normalizedLocation.y * height)
} else {
    indexFingerTipLocation = nil
}
-
15:47 - Getting hand chirality
// Working with chirality
let handPoseRequest = VNDetectHumanHandPoseRequest()
try handler.perform([handPoseRequest])
let detectedHandPoses = handPoseRequest.results!

for hand in detectedHandPoses where hand.chirality == .right {
    // Take action on every right hand, or prune the results
}
-
22:16 - Hand action classification by accumulating queue of hand poses
var queue = [MLMultiArray]()

// . . .

frameCounter += 1
if frameCounter % 2 == 0 {
    let hands: [(MLMultiArray, VNHumanHandPoseObservation.Chirality)] = getHands()
    for (pose, chirality) in hands where chirality == .right {
        queue.append(pose)
        queue = Array(queue.suffix(queueSize))
        queueSamplingCounter += 1
        if queue.count == queueSize && queueSamplingCounter % queueSamplingCount == 0 {
            let poses = MLMultiArray(concatenating: queue, axis: 0, dataType: .float32)
            let prediction = try? handActionModel?.prediction(poses: poses)
            guard let label = prediction?.label,
                  let confidence = prediction?.labelProbabilities[label] else { continue }
            if confidence > handActionConfidenceThreshold {
                DispatchQueue.main.async {
                    self.renderer?.renderHandActionEffect(name: label)
                }
            }
        }
    }
}