Power Glove Gesture Recognition

Foreword:

Perhaps everyone who has used the Nintendo Power Glove, whether for entertainment or development purposes, has gained insight into its usefulness as a gesture-sensing device - something that "knows" what you are trying to say with your body. Of course, even the Power Glove is limited to hand and wrist input, but it is clearly a far more natural input device than the keyboard, mouse, or even the touch screen. To be more quantitative, the glove offers fully 8 channels of completely independent input, as opposed to 4 for the mouse or joystick (X, Y, and 2 buttons) and (essentially) 1 for the keyboard. It is as though you are playing an 8-string guitar that your computer can "hear" perfectly (although some of the strings only have 4 frets).

What I want to implement is a program that takes the flood of glove input and boils it down to easily-interpreted commands like "punch", "move left", "move forward", "twist left", "twist right", "fist", and so on. It is much easier to write a useful application when this filtering has already been taken care of.

I see the first and most important application of this implementation to be virtual reality. Simply put, VR is the attempt to make a computer "look and feel" more like real life and less like a stupid machine. The glove puts people in closer contact with computer mechanics, and perhaps more importantly, it puts the machine in closer contact with the human user! This much power will probably find other applications as well, and I list some potential ones later in the text.

This paper gives a brief description of how I will implement an excellent gesture recognition system. At this point I'm not asking for input on what you think of this, so if you don't like it, ignore it. The discussion is very technical and contains a lot of "stream of consciousness" writing. If this bothers you, I'm sorry; I don't have a lot of time to polish it at this point. I need to start coding as quickly as possible. If it bugs you that I'm releasing this before the code is done, well, read the afterword, then go back to your private, selfish excuse for a life. The rest of us are getting down to work.

Section I - How simple should recognized gestures be?

Fingers: Thumb, Index, Middle, Ring. Location: X, Y, Z, and Rotation. A gesture consists of a finger "delta" and/or a location "delta" ("delta" = "change in glove input"). A more subtle observation: we may wish to ignore the delta of any of the eight parameters. Do finger deltas need to be handled differently from location deltas? (My answer turns out to be an emphatic "NO".)

A big issue is one of gesture "combination". Should two identical rapid gestures be interpreted as a "double-click"? I think so (though later I'll reject this). The concept can be extended like so: say the gestures listed in the foreword can be sensed. If several gestures occur rapidly, there might be a single entry in the event queue like "punch-twirl-punch-punch", which would be distinguished from "punch", "twirl", "punch", "punch" (4 separate queue events) by the delay between the gestures. Clearly a queue of gestures would be required, similar to an event queue for mouse clicks.
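Just to make the queue idea concrete, here is a rough C++ sketch of what the client's side of such a queue might look like. Every name in it is mine and purely illustrative - the real queue (described in Section III) carries pointers to gesture records, so an integer handle stands in here for whatever the engine actually delivers:

    // Illustrative only: a client draining a gesture queue, just as it would
    // drain a mouse event queue. All names are mine, not part of the library.
    #include <cstdio>
    #include <queue>

    typedef int GestureHandle;                // stand-in for the real queue entry

    std::queue<GestureHandle> gestureQueue;   // filled by the recognition engine

    void pumpGestures() {
        while (!gestureQueue.empty()) {
            GestureHandle g = gestureQueue.front();
            gestureQueue.pop();
            std::printf("received gesture %d\n", g);   // client dispatches on g here
        }
    }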
Another possibility is the "shifting" of gestures. That is, if a gesture does not use a parameter delta (in other words, an input channel is ignored in a particular gesture), that parameter could be used as a kind of "shift key". The most logical extension would work like this:

- The user picks a "shift" gesture, such as holding the thumb close to the palm.
- The shift gesture's parameter (in this case the thumb delta) cannot be used in any of the "regular" gestures. (This is to prevent "overlap"; see below for more.)
- Now any regular gesture can be shifted, giving twice as many fundamental gestures.
- The number of fundamental gestures could in fact be doubled a few more times by adding additional "shifting" gestures, in the manner of the Shift and Ctrl keys on most PC keyboards. (One idea I had was to change the meaning of gestures based on the "quadrant" the user is pointing at - upper left, upper right, lower left, and lower right. E.g. "fist" can be broken into "upper left fist", "lower right fist", etc.)

HOWEVER, using shifting gestures encourages modality in the client application, which is usually very frustrating to the end-user learning an application. The human mind, really, only operates in one "mode". I imagine that this is disturbing to some of you. User interfaces have progressed past the "insert mode", "delete mode", "change mode" type. Equally offensive to me is the "just hold down the Shift key to modify commands in such-and-such a way" approach. (E.g. shift-left-mouse-button to copy, ctrl-left-mouse-button to move, or Shift-F10 to print and Ctrl-F10 for print preview.) The human mind can't handle layers like that, but the computer will store them forever. How many times have you moved a file when you wanted to copy it, or printed a file when all you wanted was a preview? This happens most frequently to beginning users, and to users who have not used a program in a while. So those of you who think that "shifting" gestures are neat had better grow up. If I choose to implement them, it will be for a better reason than that.

Another basic problem is that, to be effective, a shifting gesture must lock out an entire channel of input, which may limit the variety and intuitiveness of the "unshifted" gestures. (What's the difference between a "shifted fist" and an unshifted "fist" if the shifting gesture is the thumb tucked into the palm?) More and more I'm thinking even one shift gesture is probably too much. I might want to make it optional for certain applications.

It would be useful for the gesture recognition system to do automatic "globbing". E.g. the "punch-twirl-punch-punch" command would automatically be shortened to something like "knockout". This is different from gesture combination in that it does not directly involve time. Example: "punch" followed later by another "punch" could be globbed to "twopunch". This way, "punch", "knockout", and "twopunch" would all be separate gestures, even though they all involve a rapid thrust toward the screen. The client app would not need to do any special interpretation to receive them correctly. There is more about shifting, combination, and globbing later in this article.
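To pin the globbing idea down a little, here is a rough C++ sketch of a glob table that collapses a trailing run of gestures into a composite name. The two entries simply echo the examples above; the names and the lookup scheme are mine, not necessarily how the final code will do it:

    // Illustrative "globbing" lookup: if the tail of the gesture history matches
    // a pattern, report the glob's name instead of the last raw gesture.
    #include <algorithm>
    #include <map>
    #include <string>
    #include <vector>

    typedef std::vector<std::string> Sequence;

    // The two entries echo the examples in the text above.
    const std::map<Sequence, std::string> globTable = {
        { { "punch", "twirl", "punch", "punch" }, "knockout" },
        { { "punch", "punch" },                   "twopunch" },
    };

    // Returns the glob name for the trailing run of gestures, or the most recent
    // gesture unchanged. (A real engine would prefer the longest matching pattern.)
    std::string glob(const Sequence& history) {
        for (const auto& entry : globTable) {
            const Sequence& pat = entry.first;
            if (history.size() >= pat.size() &&
                std::equal(pat.begin(), pat.end(), history.end() - pat.size()))
                return entry.second;
        }
        return history.empty() ? std::string() : history.back();
    }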
Section II - Uses of this implementation

The intended implementation would allow the end user to define the gestures he/she is most comfortable with. What the computer is supposed to do when a gesture is received is not the subject of this article. There are a myriad of applications that could use gesture receipt to trigger a function. Ideas:

- Virtual Reality
  - Communicating between multiple users
  - Constructing rooms & objects
  - Movement within the world
  - "Training" of autonomous moving objects
- Disabled persons
  - Simplified communication (speech synthesis, text generation)
  - Therapy, muscular exercise
- Windowing/User Interfaces
  - Sizing, moving, selecting windows and data chunks
- Education
  - Training for sign language
  - Visual computer programming
  - Introductions to computers
- Multimedia applications
  - Moving between subjects
- Games
- Control of unusual peripherals (robotic arms)

Perhaps with the glove and good gesture recognition, we can break out of "flat" computer interfaces and put computers within the reach of more people. Keep your fingers crossed!

Section III - The implementation itself

We need a way to "name" the gestures without using the keyboard. One way is to have a built-in gesture set designed for entering alphabetic characters (similar to the way hearing-impaired people fingerspell proper names). Another way is to just have a long list of gestures that can be assigned, and you select the gesture you want to define menu-style. I think we want a combination of the above - I'll supply a list, but users can also add their own. Keep in mind that these ASCII names do not go into the event queue; they are there solely for the convenience of the application. I imagine that once the gestures are loaded from disk and any modifications are made by the end user, the names can be removed from memory. They only need to be reloaded if the user wants to change a definition. The app refers to the gestures via a pointer (or "handle"). (Side note: I've read some discussion on CompuServe about whether to use #defines or constant strings to refer to the gestures. NEITHER OF THESE METHODS IS FLEXIBLE ENOUGH.)

Clearly I need a way of saving/loading gestures/gesture-sets to/from disk. I will avoid "binary dumps"; however, don't expect the files to be easily interpretable. (Although you never know!)

Okay. I am assuming that the glove will be sampled at regular intervals. The length of this interval is not terribly important, but most applications are going to need real-time input. When I refer to a "click" or a "time click" below, I mean the time between glove samples. I suppose you could still do gesture recognition if the glove is sampled irregularly, as long as the time between samples is recorded, or maybe a time stamp is put on each sample, or something like that. It would still be a pain in the butt compared to regular sampling.

The basic objects are shown below:

Sample: One UNSIGNED byte for each of the following: X, Y, Z, Rotation, Thumb, Index, Middle, Ring. (Practically speaking, the straight glove data.)

Gesture: One word-sized pointer to a Recognizable.

Recognizable: One SIGNED byte for each of the following: delta X, delta Y, delta Z, delta Rotation, delta Thumb, delta Index, delta Middle, delta Ring. An 8-bit value with each bit signifying whether the corresponding "channel" is to be used or ignored for the gesture. One size_t time (in clicks) for the length of the gesture.

EventQueue: A queue of Gestures.

SampleBuffer: A random-access array of Samples (updated after each click). It must hold as many samples as the longest Recognizable in the GestureList.

GestureList: A sorted list of Recognizable objects. The sorting key is the length of the gesture.
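To give you a feel for the sizes involved, here is one way these objects might look in C++. The field names are mine, and the real source may lay things out differently:

    #include <cstddef>

    // Illustrative layout of the basic objects; names are mine.

    struct Sample {                 // straight glove data, one unsigned byte per channel
        unsigned char x, y, z, rot;
        unsigned char thumb, index, middle, ring;
    };

    struct Recognizable {
        signed char   delta[8];     // dX, dY, dZ, dRot, dThumb, dIndex, dMiddle, dRing
        unsigned char useMask;      // bit set = channel is used for this gesture
        std::size_t   length;       // gesture length, in clicks
    };

    typedef Recognizable* Gesture;  // a Gesture is just the address of its Recognizable

    // EventQueue:   a queue of Gesture
    // SampleBuffer: a random-access array of Sample, long enough for the longest Recognizable
    // GestureList:  Recognizables sorted by length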
The main engine works like so:

1. A new Sample is added to the SampleBuffer.
2. Step through the GestureList:
   a. Compute a delta vector for the current Recognizable (if it is not the same as the last one). This is done by subtracting each of the current Sample values from the Sample values N clicks ago (which will be present in the SampleBuffer), where N is the length of the current Recognizable.
   b. See if the delta vector lies within Epsilon of the current Recognizable's delta vector. Epsilon is a constant vector to allow for "close enough" user input. I will supply the appropriate Epsilon vector for the Power Glove after experimenting.
   c. If 2b is true, add the address of the Recognizable to the event queue. The address of the Recognizable is in fact a Gesture.
3. Continue forever!

Hmmm. This isn't quite right. Depending on how quickly the user makes a gesture, it might make it to the queue several times, instead of just once as hoped. We need to amend the data structures.

Recognizable: Same as above, plus an 8-byte "work vector" and a 1-bit flag to say whether the gesture is currently being sensed (the "sensed bit").

Gesture: Same as above, plus a 1-bit value that is TRUE when the gesture is first recognized and FALSE when the gesture has been released.

In case you can't tell, event queue messages now take the form "punch(sense)" followed later by "punch(release)". This is similar to the way mouse "click-hold-and-drag" operations work - one message is sent when the gesture is first "seen" and another is sent when the gesture is finally released. Hmmm, rather than setting and resetting a bit to indicate whether the gesture is turning on or off, why not just have the client app assume that identical queue messages will be sent in pairs? The first one will always be the "turning on" message, and it must be followed by another identical "turning off" message at a later time. But the messages may not occur one right after another. Does that put too much responsibility on the client? Hmmm. It would be nice to have just a single pointer in the queue, but I'm leaning toward keeping the extra bit. Many client apps will want to ignore the "release" message, and they would just have to check that bit to distinguish the irrelevant messages. Otherwise they'll have to keep track...

I'll modify the engine as follows:

2. Step through the GestureList:
   a. Compute a delta vector for the current Recognizable (if it is not the same as the last one). This is done by subtracting each of the current Sample values from the Sample values N clicks ago, where N is the length of the current Recognizable. If the "sensed bit" is set, subtract the current values from the work vector (rather than from the sample N clicks ago).
   b. See if the delta vector lies within Epsilon of the current Recognizable's delta vector, as before.
   c. If 2b is true, and the "sensed bit" is not set, then set the "sensed bit" and add the address of the Recognizable to the event queue, making sure it is a "turning on" gesture. Also, copy the Sample from N clicks ago into the work vector!
   d. If 2b is false, and the "sensed bit" is set, then send a "turning off" queue message.
3. Continue forever and ever!

It should be plainly obvious how users can create their own gestures using JUST THE GLOVE! They'll hit one of the glove buttons to start a gesture, make the gesture, then hit the glove button again. All you need to store are the sample deltas and the length of the gesture. What could be easier?
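For the curious, here is a standalone C++ sketch of one pass of the amended engine (one call per click, after the newest Sample has been appended to the buffer). It re-declares the structures locally so it can stand on its own, with the work vector and sensed bit folded in; all names, the tolerance numbers, and the container choices are mine, not the final design:

    #include <cstddef>
    #include <cstdlib>
    #include <deque>
    #include <vector>

    struct Sample { unsigned char ch[8]; };      // X, Y, Z, Rot, Thumb, Index, Middle, Ring

    struct Recognizable {
        signed char   delta[8];                  // expected change per channel
        unsigned char useMask;                   // bit set = channel matters
        unsigned int  length;                    // gesture length, in clicks
        unsigned char work[8];                   // "work vector" (filled on sensing)
        bool          sensed;                    // the "sensed bit"
    };

    struct Event { Recognizable* gesture; bool turningOn; };

    // Per-channel tolerance ("Epsilon"); these numbers are made up, to be tuned.
    static const int EPSILON[8] = { 8, 8, 8, 8, 4, 4, 4, 4 };

    // Call once per click, after the caller appends the newest Sample to history
    // (that is step 1 of the engine); history is ordered oldest to newest.
    void engineStep(const std::vector<Sample>& history,
                    std::vector<Recognizable>& gestures,
                    std::deque<Event>& queue)
    {
        if (history.empty()) return;
        const Sample& now = history.back();
        for (std::size_t g = 0; g < gestures.size(); ++g) {      // 2. step through the list
            Recognizable& r = gestures[g];
            if (history.size() <= r.length) continue;            // not enough samples yet
            const Sample& past = history[history.size() - 1 - r.length];
            bool match = true;
            for (int c = 0; c < 8 && match; ++c) {
                if (!(r.useMask & (1 << c))) continue;            // ignored channel
                int base = r.sensed ? r.work[c] : past.ch[c];     // 2a. pick the baseline
                int d = (int)now.ch[c] - base;                    //     compute the delta
                if (std::abs(d - r.delta[c]) > EPSILON[c])        // 2b. within Epsilon?
                    match = false;
            }
            if (match && !r.sensed) {                             // 2c. first sighting
                r.sensed = true;
                for (int c = 0; c < 8; ++c) r.work[c] = past.ch[c];
                Event e = { &r, true };
                queue.push_back(e);
            } else if (!match && r.sensed) {                      // 2d. gesture released
                r.sensed = false;
                Event e = { &r, false };
                queue.push_back(e);
            }
        }
    }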
Glove gestures as I have defined them are rather "elemental". It is not possible to define a gesture like "move hand up, wiggle your index finger, move toward the screen, and twist your wrist left". However, you COULD define 4 or 5 elemental gestures that would be "added up" by the client application and interpreted as a single gesture! I will include one or more relatively simple examples of this with the code, but I'm not sure if they'll be of the combination or globbing type.

That about covers it. I'm still not sure about shifting, combination, or globbing. I have some deeper questions about these concepts which I have not really addressed here. If you have understood everything I said, you probably have the same questions!

I'll probably use Borland's C++ CLASSLIB library to handle the queue, buffer, and array in the first cut. I'll probably release that version mainly to show beginning OOP programmers how a class library is used. (I know there's a lot of you out there.) But the real version will not use CLASSLIB, for three reasons: 1) portability, 2) execution speed, and 3) memory usage. Good reasons, don't you think? The only disadvantage is perhaps maintainability. But this is a REAL-TIME application; compactness in both speed and size is paramount. I promise you that it will be object-oriented as well. I really want this thing to reach a wide audience. If you doubt my credentials, read MTP.BIO in the COMART forum on CompuServe.

Afterword:

The system will be freeware, with the provision that you give me credit if you use any part of the source code for any purpose. It will be copyrighted. It is my opinion that good software is only of value to the intellectual community when it is accessible. I am making every effort to put this tool in a position where it will be used. Please use it! In the spirit of Richard Stallman's work, I am permitting you to take advantage of my ideas. If you make any significant improvements to the engine, please let me know.

If my gesture recognition system sounds good to you, here's how you can pay me back. Think about what gesture sets will be useful to you. When the code is done, use it to create the gesture sets you want. But I WANT TO SEE WHAT YOU HAVE DONE. This is only fair. You may sell your application if you must, but you still owe it to me to let me have your gesture sets. It's the only way I can see the effect of my effort. It will encourage me to stay focused on this important area. There's obviously no way I can force you to do this without investing big $$$, which I don't have. I just want you to "Be Nice to Me", as Todd Rundgren so aptly put it (in 1971).

Also, if you're willing to hire me to work on Virtual Reality construction tools or let me help in constructing actual Virtual Worlds, please contact me (see below). My present job is very boring, and this kind of stuff is one of the few things that stimulates me.

Mark T. Pflaging

Home:
7651 S. Arbory Lane
Laurel, MD 20707
(301)-498-5840

Work:
Cambridge Scientific Abstracts
7200 Wisconsin Avenue
Bethesda, MD 20814