As we mentioned in the previous post, a complete code walk-through for the Kinect SDK is beyond the scope of this blog post. Microsoft has some great documentation and support for getting up and running with the SDK. That said, let’s cover the main event loop for the TouchGestures source code we are providing on top of Microsoft’s code…
The basic steps in our touch-as-depth software can be summarized by the flowchart at left. For each mode of operation (SurfaceCalibration, ExtentsCalibration, and MissionMode), the data flow is slightly different. The flowchart displayed is that of the typical MissionMode flow loop. Let’s go over these one by one…
Depth information is gathered frame-by-frame from the Kinect. These depth points are stored in the depth buffer. Next, the base interaction surface is found in relation to the location of the Kinect (as determined by the SurfaceCalibration routine). During MissionMode, on the first pass through, all the depth data is adjusted relative to the calibration matrix. In all subsequent frames, this depth data is compared against the surface to determine whether or not there is a “touch” event. The raw touch-from-depth data is then filtered using a custom mean filter that weeds out some of the extraneous noise…some, but not all. 8-connected component labeling is used to disambiguate one finger from another. Particle analysis is then performed to figure out exactly which points are “touch” points and which are noise…at this point, the data should not be noisy. Now that we have the touch points, we need to make sure they line up with the spots in the real world. The raw points as seen by the Kinect are then transformed relative to the tracked space (as determined by the ExtentsCalibration routine). Finally, a filter is applied to the frame-to-frame behavior of the points to determine their mapped keyboard and mouse events. Rinse and repeat…as fast as possible. Now let’s go through each of the steps in more detail…
Once you have placed your Kinect relative to the surface (a table, a wall, a chair, etc.) you want to track touch on, you need to get some very accurate readings about depth. This is a one-time calibration…unless you move your Kinect or change the surface, in which case you will need to re-calibrate. The SurfaceCalibration mode is responsible for modeling the interaction surface over multiple frames and a fixed spatial resolution of 640×480 depth points. The methodology of modeling the surface and extracting touch information that we use is explained very clearly by Andy Wilson of Microsoft Research. Unfortunately, the process does not calculate depth information adaptively. Hence, any objects placed on the interaction surface during calibration will be modeled as being part of the touch space. It is important to keep the field-of-view devoid of interference from direct light sources and objects (even a few seconds of interference would result in errors of ~10% in calibration).
SurfaceCalibration mode involves a single step as soon as the depth buffer is ready and locked. This step contains the routine (Gelib_DoSurfaceCalibration) that translates the depth values to millimeters and stores the cumulative sum over a period. The number of frames over which to track the depth values (defined by the DUMPFOR parameter) and the frame interval (defined by the DUMPEVERY parameter) are specified in the header file TouchGestures.h. The calibration routine shows a progress bar. Once the depth values are gathered, the final depth value for each pixel (one of 640×480) is calculated by averaging (dividing each pixel’s accumulated value by DUMPFOR). The average depth values are stored in the file DepthAverage.dump (our file format uses a semicolon-space-newline delimitation).
SurfaceCalibration.exe application may be terminated as soon as this file is generated (or as soon as the progress bar indicates completion). At this point, the Kinect should be calibrated relative to your surface and you are ready to tell the Kinect what the boundaries of your tracked space are…for that you switch to:
During the ExtentsCalibration mode, the system works in partial Mission mode. Touch gestures are processed but not translated and emulated as mouse events. Instead, touch points are recorded for a certain time period and the corresponding co-ordinates are logged in a file.
Upon initialization and the depth-buffer-filled event, the first step is to read the surface calibration depth values as an IMAGE structure from the stored file DepthAverage.dump. The IMAGE data structure is defined as a 2D unsigned short integer array with sizes specified as DHEIGHT (640, 480).
The image analysis and processing routines are similar to those of the MissionMode and will be explained in that section. The additional function (CALIBRATION Gelib_CalibrateExtents( CALIBRATION C )) displays and logs the touch points in sequence. The five points are iterated through a counter (cIteration), and the calibration mode swaps to regular mode (qswap = 0) once the calibration is completed. The Calout data structure contains the calibration reference points. These points are stored in the Calibration.dat file. The data flow switches to touch display mode henceforth (it displays transformed touch points and does not translate and emulate mouse events).
Transforming touch points: An affine transform of the touch points is performed to translate the touch points from the depth space to the table space…
The depth space is defined as the 2D region as seen by the Kinect depth camera (this is different from the 2D video space seen by the Kinect). This is expected to be a subset of the table space which is the physical 2D region of the glass top over the table. The transforms applied are:
1. Translation T[C0(x,y)] towards and relative to the table space origin.
2. Scale S[M(x,y)] towards the upper boundaries of the table space.
The above process determines factors to calculate the transformation offset and scale factors. The translation offset and scale factors are then applied to any incoming touch particle set over a loop…
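The translate-then-scale mapping can be sketched as follows (the Point and DepthToTable names are hypothetical; in the source, the offset and scale factors are derived from the Calout calibration data):

```cpp
#include <cassert>

struct Point { float x, y; };

// Hypothetical holder for the calibration-derived offset and scale factors.
struct DepthToTable {
    Point offset;  // T[C0(x,y)]: depth-space location of the table space origin
    Point scale;   // S[M(x,y)]: ratio of table extent to depth extent

    // Applied to every incoming touch particle each frame:
    // translate to the table origin, then scale to the table boundaries.
    Point map(Point p) const {
        return { (p.x - offset.x) * scale.x,
                 (p.y - offset.y) * scale.y };
    }
};
```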
The gestureCanvas image is used for debugging and visualizing the touch points in real time by assigning each gesture point to a square 5×5 super pixel.
The Mission Mode is the run-time mode that tracks touch gestures in real time and translates them into keyboard and mouse events for the Windows 7 OS.
Reading Surface Calibration Values
During system boot-up (Frames = 0), after the global run-time variables are initialized, the surface calibration values are read off of DepthAverage.dump into gSurface, an image data structure that holds the average depth values of the table surface obtained during calibration. Following this, the calibration reference co-ordinates obtained from the ExtentsCalibration mode are read back from the file Calibration.dat and stored in the structure Calout. For each calibration co-ordinate obtained from the file, a rectangular envelope bounding box is applied and the co-ordinates are redefined. This adjusts for any rotational shifts in the Z-axis during calibration (please note: this is not a transform, but a linear extrapolation).
Calculating Touch from Depth
The output of the Gelib_CalcTouchImage function is an image structure containing binary touch information (a 640×480 array of 0s and 255s). The function takes as inputs a memory pointer to the buffer containing the depth information passed to the application from the SDK, the bounding box co-ordinates for the calibration points (required to exclude points outside the touch space), and constants (such as FINGER) passed through the GUI.
The touch image is calculated by first reducing the depth buffer and extracting the masked depth values (by a logical bitwise &0x0ffff). Again, following Andy Wilson’s work, the classification of depth values as touch is explained in detail here.
The values of Dmax and Dmin are calculated as shown in the code snippet above. Dmax is the maximum value of depth beyond which any depth value is classified as not belonging to the touch range. Dmin is the minimum depth value below which depth values do not count as meaningful touch values. The small range of values between Dmin and Dmax is classified as a valid touch point. (Note: to avoid spurious calibration parameters being counted in the process, any depth value equaling 0 or greater than 1200 is classified as a non-touch point.) Dmax can be assumed to be the actual surface value as calculated during the surface calibration procedure. However, owing to temporal depth noise, a tolerance parameter (DSURFACE_TOL) is included in the calculation of Dmax. The thickness of your finger dictates the Dmin calculation; we include the parameter FINGER to determine Dmin.
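Putting those rules together, a per-pixel touch test might look like this (the DSURFACE_TOL and FINGER names follow the text, but the numeric values are illustrative guesses):

```cpp
#include <cassert>
#include <cstdint>

constexpr uint16_t DSURFACE_TOL = 4;   // mm of temporal surface noise (assumed value)
constexpr uint16_t FINGER       = 15;  // mm of finger thickness (assumed value)

// surfaceMm: average surface depth for this pixel from SurfaceCalibration.
// depthMm:   masked live depth for this pixel (depth & 0x0ffff).
// Returns 255 for a touch pixel, 0 otherwise (the binary touch image).
uint8_t classifyTouch(uint16_t depthMm, uint16_t surfaceMm) {
    if (depthMm == 0 || depthMm > 1200)       // spurious readings are non-touch
        return 0;
    uint16_t dmax = surfaceMm - DSURFACE_TOL; // just above the noisy surface
    uint16_t dmin = dmax - FINGER;            // one finger-thickness closer
    return (depthMm >= dmin && depthMm <= dmax) ? 255 : 0;
}
```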
2D Spatial Depth Filter
The kernel size is fixed at 9×9 by default (although this is configurable using the GUI). A binary threshold is applied following the two stages of filtering. The threshold is adjustable from the MissionMode GUI; a value of 96 to 128 worked well at the time of testing.
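A minimal sketch of a box mean filter followed by the binary threshold is shown below. Note that the actual code runs two filtering stages; this collapses them into a single pass for clarity, and the default threshold of 112 is simply the midpoint of the range quoted above:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Mean-filter the binary touch image with a k x k box kernel, then threshold.
// Pixels outside the image contribute zero (edges are naturally attenuated).
std::vector<uint8_t> meanThreshold(const std::vector<uint8_t>& img,
                                   int w, int h, int k = 9, int thresh = 112) {
    std::vector<uint8_t> out(img.size(), 0);
    int r = k / 2;
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            int sum = 0;
            for (int dy = -r; dy <= r; ++dy)
                for (int dx = -r; dx <= r; ++dx) {
                    int yy = y + dy, xx = x + dx;
                    if (yy >= 0 && yy < h && xx >= 0 && xx < w)
                        sum += img[yy * w + xx];
                }
            // Average over the full kernel area, then binarize.
            out[y * w + x] = (sum / (k * k) >= thresh) ? 255 : 0;
        }
    }
    return out;
}
```

Isolated noise pixels average down to nearly zero and are removed, while the interior of a finger-sized blob survives the threshold.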
Connected Component Labeling
The process of identifying blobs of connected particles (in this case, separating one finger from another in the touch space) and labeling them progressively for each frame is known as Connected Component Labeling (CCL for short). The algorithm used here is 8-connected (Moore neighborhood) component labeling. Check out the Wikipedia entry for more info. Our implementation borrows heavily from the standard algorithm…
The CCL algorithm is run in two passes. The first pass labels each valid pixel incrementally depending on the 8-connectivity rule. The second pass creates an equivalence table for labels that are associated with the same particle but have a different primary value (a primary value is the dominant value in the particle). The connected-8 kernel is defined as…
First Pass: For each pixel (represented by the solid black cell), the surrounding connected pixels are compared with the current pixel. The smallest label above 0 in the connected-8 kernel is assigned as the new label for the current pixel. If no nonzero label exists in the kernel (the pixel is surrounded by zeros), the pixel is assigned a fresh label from an incrementing counter.
Second pass: The first pass output is rerun to derive the equivalence table. The equivalence table associates boundary particles that may have a different label than the primary value of the particle. The same procedure of 8-connecting is applied and the table structure is formed based on the variance in pixel values between the current pixel and the surrounding pixels. Once the association is derived, the equivalence table is normalized and the frame is redrawn. The resulting output is a well-connected and a uniquely labeled blob set.
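Here’s a compact sketch of two-pass 8-connected labeling. It resolves the equivalence table with a small union-find structure rather than the frame rescan described above, but the output (uniquely labeled blobs) is the same:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct CCL {
    std::vector<int> parent{0};  // equivalence table; index 0 is background

    int find(int a) { while (parent[a] != a) a = parent[a] = parent[parent[a]]; return a; }
    void unite(int a, int b) { parent[find(a)] = find(b); }

    std::vector<int> label(const std::vector<uint8_t>& img, int w, int h) {
        std::vector<int> lab(img.size(), 0);
        parent.assign(1, 0);
        for (int y = 0; y < h; ++y)
            for (int x = 0; x < w; ++x) {
                if (!img[y * w + x]) continue;
                // First pass: take the smallest label among the 4 already
                // visited 8-neighbors (W, NW, N, NE); record equivalences.
                int best = 0;
                const int dx[4] = {-1, -1, 0, 1}, dy[4] = {0, -1, -1, -1};
                for (int n = 0; n < 4; ++n) {
                    int xx = x + dx[n], yy = y + dy[n];
                    if (xx < 0 || xx >= w || yy < 0) continue;
                    int l = lab[yy * w + xx];
                    if (!l) continue;
                    if (!best) best = l;
                    else unite(l, best);
                }
                if (!best) { best = (int)parent.size(); parent.push_back(best); }
                lab[y * w + x] = best;
            }
        // Second pass: replace each provisional label by its representative,
        // yielding one unique label per connected blob.
        for (int& l : lab) if (l) l = find(l);
        return lab;
    }
};
```

A U-shaped blob is the classic case this handles: its two arms receive different provisional labels on the first pass, and the equivalence resolution merges them on the second.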
Particle analysis deals with performing morphology- and intensity-based parameter measurements on binary images. Typical operations involve measuring areas, lengths, coordinates, chords and axes, shape equivalence, and shape features. Before particle analysis, histogram-based thresholding and equalization is done to remove small particles and noise. Currently, the threshold is a fixed parameter of 32, which means that any particle with an area of fewer than 32 pixels is ignored for gesture calculations.
We limit the particle analysis operations in this application to a “bounding-box and center point” approach. More accurate measurements are unnecessary considering the errors in touch registration. For each particle identified as a label, the bounding box and center values are found as the maximum and minimum values in X and Y direction.
The particle analysis structure P returns the measured values for each valid particle back to the application.
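The bounding-box-and-center-point measurement can be sketched like so (the Particle fields are hypothetical stand-ins for the application’s P structure; the 32-pixel size gate follows the text):

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <vector>

constexpr int SMALL_PARTICLE = 32;  // particles under 32 px are ignored

struct Particle {
    int minX, minY, maxX, maxY, area;
    int cx() const { return (minX + maxX) / 2; }  // bounding-box center
    int cy() const { return (minY + maxY) / 2; }
};

// Scan the labeled frame once, growing a bounding box per label, then
// drop anything below the small-particle threshold.
std::vector<Particle> measure(const std::vector<int>& lab, int w, int h) {
    std::map<int, Particle> byLabel;
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            int l = lab[y * w + x];
            if (!l) continue;
            auto it = byLabel.find(l);
            if (it == byLabel.end())
                byLabel[l] = Particle{x, y, x, y, 1};
            else {
                Particle& p = it->second;
                p.minX = std::min(p.minX, x); p.maxX = std::max(p.maxX, x);
                p.minY = std::min(p.minY, y); p.maxY = std::max(p.maxY, y);
                ++p.area;
            }
        }
    std::vector<Particle> out;
    for (auto& kv : byLabel)
        if (kv.second.area >= SMALL_PARTICLE)
            out.push_back(kv.second);
    return out;
}
```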
Detecting Gestures and Emulating I/O Device Actions
The particle structure is used as the input to the gesture filter’s switch-case structure. Based on the number of particles detected (the label count from the particle analysis routine), the gesture filter redirects I/O control (typically mouse actions or keyboard events) to the corresponding functions. These functions process gestures temporally and spatially, so the accuracy of gesture detection depends on well-trained coefficients. These coefficients are currently hard-wired in the source code; they could instead be exposed as controls during the Surface Calibration process or as a completely separate calibration/training process, and they may be determined as OTP (hard-wired) values, dynamic linear values, or intelligence-based values (such as the output of a neural network). At the time of writing, up to 5 particles per frame are handled by the gesture filter (up to five fingers, or five cases in the switch-case structure), although there is no limit to the number of particles (the NUM_PARTICLES parameter) that the particle analysis method can process and the gesture filter can handle.
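The dispatch itself reduces to a switch on the particle count. Cases 0, 1, 4, and 5 follow the text; mapping the two-finger zoom to case 2 and the three-finger gesture to case 3 is our assumption, and the Gesture enum is a hypothetical name:

```cpp
#include <cassert>

enum class Gesture { Idle, SelectMove, Zoom, ThreeFinger, Pan };

// Route the frame's particle count to a gesture handler.
Gesture dispatch(int particleCount) {
    switch (particleCount) {
        case 0:  return Gesture::Idle;        // wait/reset; release mouse buttons
        case 1:  return Gesture::SelectMove;  // select, move, delete, mouse clicks
        case 2:  return Gesture::Zoom;        // two-finger zoom (assumed case number)
        case 3:  return Gesture::ThreeFinger; // shift-drag gesture (assumed case number)
        case 4:                               // two fingers merged into one particle
        case 5:  return Gesture::Pan;         // five-finger pan
        default: return Gesture::Idle;        // beyond 5 particles: treat as idle
    }
}
```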
For multi-user interactions, it may be necessary to process multiple particles through deeper particle measurements (palm recognition etc.). This would also require the Windows Multipoint SDK for multi-mouse interactions. This should get easier in the future.
For the moment, our gesture detection mechanism is designed in a way to facilitate smooth hand off between gestures and gesture fade-in and fade-out. As an example, there may be cases where a two-finger zoom gesture may transition and fade-out from 2 particles to 1 particle followed by 0. These transitions have been accounted for in the gesture tracking mechanism. Here’s a breakdown of how each gesture works…
Idle mode: During idle mode, the gesture filter and class variables are initialized and the filter waits for a valid gesture particle set. Any non-zero particle count is assumed to be a valid gesture. In addition, all emulation events are reset (mouse buttons are released). In order to avoid spurious gesture resets to case 0 (noisy frames), a time-out counter is set. For any time-out value beyond 3 (3 frames), the previous gesture is assumed to have completed. Certain gestures like Select/Move have special time-out values that are unique to the behavior of those interactions. Every gesture case also keeps track of the previous gesture in the pGesture variable (which acts a lot like a state machine).
Select/Move mode: This mode (case 1) handles gestures such as Select/Move component and Delete component, as well as (left and right) mouse click gestures.
Handling Mouse Right Click: The right click event is handled based on a timed gesture response. This is required to differentiate it from gesture transitions as well as from other single-particle gestures.
The select gesture is valid when the spatial difference between the current touch point and the previous touch point (from the previous frame) is less than a tolerance value defined by SE_TOLERANCE. The tolerance value is currently set to a strict 1. To distinguish spurious or stray touch points, we track the gesture for stability. If the current touch point is stable for 10 frames, the gesture is locked down by setting the mouse pointer position at the translated position, and a “mouse left-click and hold” event is sent to the operating system. On the 11th frame (special case), the mouse pointer is jittered by a pixel in the Y direction to ensure that the component information (usually shown on mouse hover) is displayed.
The component move gesture is opposite in implementation to the select gesture: we check if the difference in coordinates between the current and previous touch points is above a certain threshold (SE_TOLERANCE + 2) and proceed to send the “mouse left-click and hold” event to the OS. The delete component gesture is also implemented under the same category, wherein we check if the difference between the current and previous touch coordinates is above the delete threshold of DE_TOLERANCE. Once the threshold limit is hit, the delete gesture identifier proceeds to a binary-encoded state machine that tracks successive gestures (back-and-forth movement of the finger). After 3 successive move gestures (back and forth), the gesture recognition loop sends the “Delete” key event to the OS.
On function entry, the floating point distance between the two fingers is calculated. This is required to differentiate between a zoom-in and a zoom-out gesture. For each frame, the current distance between the fingers and the previous frame’s distance between the fingers are compared. If the difference is positive, the gesture is a zoom-out. In this case, the mouse-scroll event is sent to the OS with a positive WHEEL_DELTA. A zoom counter in each case is incremented to verify switching from one zoom mode to the other.
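The per-frame zoom decision can be sketched as follows (WHEEL_DELTA is the standard Windows constant, 120; actually sending the scroll event, e.g. via SendInput, is stubbed out here):

```cpp
#include <cassert>
#include <cmath>

constexpr int WHEEL_DELTA_VAL = 120;  // standard Windows WHEEL_DELTA

// Euclidean distance between the two finger touch points.
float fingerDistance(float x1, float y1, float x2, float y2) {
    return std::sqrt((x2 - x1) * (x2 - x1) + (y2 - y1) * (y2 - y1));
}

// Positive difference vs. the previous frame (fingers spreading apart) is
// treated as a zoom-out and mapped to a positive wheel delta, per the text;
// otherwise a negative delta is sent for the opposite zoom direction.
int wheelDelta(float currDist, float prevDist) {
    return (currDist - prevDist > 0) ? WHEEL_DELTA_VAL : -WHEEL_DELTA_VAL;
}
```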
Pan: Cases 4 and 5 are used in tandem for the pan mode. The pan gesture is a 5-finger gesture; however, there may be instances where two fingers combine into one particle and yield a 4-particle gesture. In those cases, the gesture filter operates in the same pan canvas mode. The start of the gesture, however, requires 5 unique particles…
The actual mouse position is determined as the average X and Y position of the (4 or) 5 particles passed as parameters to the Pan Canvas function. The pan canvas gesture is easily susceptible to temporal noise owing to the inconsistency in determining a fixed mouse pointer coordinate. To nullify this problem, a 2-point frame average of the coordinates is calculated as the final pan point. Whenever the current touch point changes position, the movement delta values in the X and Y directions are calculated. If the delta values exceed a threshold (PAN_TOLERANCE), the pan gesture is simulated by a mouse right-click-and-hold event.
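The pan-point smoothing described above can be sketched as follows (PAN_TOLERANCE’s value here is an illustrative guess):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

constexpr float PAN_TOLERANCE = 3.0f;  // movement threshold in px (assumed value)

struct Pt { float x, y; };

// Raw pan point: mean of the 4-5 particle centers for this frame.
Pt meanPoint(const std::vector<Pt>& centers) {
    Pt m{0, 0};
    for (const Pt& c : centers) { m.x += c.x; m.y += c.y; }
    m.x /= centers.size(); m.y /= centers.size();
    return m;
}

// 2-point frame average of the current and previous raw pan points,
// suppressing temporal noise in the pointer coordinate.
Pt smooth(Pt curr, Pt prev) { return { (curr.x + prev.x) / 2, (curr.y + prev.y) / 2 }; }

// True when the smoothed movement delta exceeds the tolerance, which would
// trigger the right-click-and-hold pan emulation.
bool panMoved(Pt now, Pt before) {
    return std::fabs(now.x - before.x) > PAN_TOLERANCE ||
           std::fabs(now.y - before.y) > PAN_TOLERANCE;
}
```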
It is important to identify the optimum relative positioning of the 3 fingers on the touch space. For natural interaction, the index finger, middle finger, and thumb are used; the middle finger is used as the actual pointer position for the gesture. To check the validity of this condition, a conditional check is performed on the 3 coordinates passed to the function. Following the classification of the touch points, the final touch point (unfortunately, it’s the middle finger) is extracted and checked for motion across frames using the same method described for the move gesture. The Shift key is also held in addition to the “Left-Click-Hold” event. The 3-finger hold gesture is implemented as well for the sake of uniformity in behavior.