The patent application, titled “CONTENT DISTRIBUTION REGULATION BY VIEWING USER,” would allow distribution companies to monitor user media usage and charge users when more than an agreed-upon number of people view the content.
Here is how it works. Distributors send their material to consumption devices such as televisions, set-top boxes and digital displays. The content comes with an associated license that includes a per-user viewing option. When this option is selected, the number of viewers in the room is monitored via a device such as a webcam or Kinect so that charges can be issued accordingly.
The patent application explains:
The technology, briefly described, is a content presentation system and method allowing content providers to regulate the presentation of content on a per-user-view basis. Content is distributed to consuming devices, such as televisions, set-top boxes and digital displays, with an associated license option on the number of individual consumers or viewers allowed to consume the content. The limitation may comprise a number of user views, a number of user views over time, a number of simultaneous user views, views tied to user identities, views limited to user age or any variation or combination thereof, all tied to the number of actual content consumers allowed to view the content. Consumers are presented with a content selection and a choice of licenses allowing consumption of the content. In one embodiment, a license manager on the consuming device or on a content providers system manages license usage and content consumption. The users consuming the content on a display device are monitored so that if the number of user-views licensed is exceeded, remedial action may be taken.
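The license logic described in that passage is straightforward to sketch. Here is a minimal, hypothetical illustration of a per-user-view license check; all names (`ViewLicense`, `check`, the actions returned) are my own assumptions, not anything from the patent itself:

```python
from dataclasses import dataclass


@dataclass
class ViewLicense:
    max_viewers: int  # number of simultaneous user views licensed

    def check(self, detected_viewers: int) -> str:
        """Return an action based on how many viewers were detected."""
        if detected_viewers <= self.max_viewers:
            return "play"
        # License exceeded: the patent suggests "remedial action",
        # such as pausing playback or offering an upgraded license.
        return "pause_and_offer_upgrade"


license_ = ViewLicense(max_viewers=2)
print(license_.check(2))  # -> play
print(license_.check(4))  # -> pause_and_offer_upgrade
```

The interesting part, of course, is not this trivial comparison but how `detected_viewers` is produced, which is where the capture-device sections below come in.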
A deeper look at the patent application reveals a good deal of information that would seem to suggest that Microsoft is serious about using the Kinect for the task of user monitoring:
The capture device 58 may include one or more image sensors for capturing images and videos. An image sensor may comprise a CCD image sensor or a CMOS sensor. In some embodiments, capture device 58 may include an IR CMOS image sensor. The capture device 58 may also include a depth camera (or depth sensing camera) configured to capture video with depth information including a depth image that may include depth values via any suitable technique including, for example, time-of-flight, structured light, stereo image, or the like.
The following paragraph seems to be referring to very Kinect-esque technology:
The image camera component 32 may include an IR light component 34, a three-dimensional (3-D) camera 36, and an RGB camera 38 that may be used to capture the depth image of a capture area. For example, in time-of-flight analysis, the IR light component 34 of the capture device 58 may emit an infrared light onto the capture area and may then use sensors to detect the backscattered light from the surface of one or more targets and objects in the capture area using, for example, the 3-D camera 36 and/or the RGB camera 38. In some embodiments, pulsed infrared light may be used such that the time between an outgoing light pulse and a corresponding incoming light pulse may be measured and used to determine a physical distance from the capture device 58 to a particular location on the targets or objects in the capture area. Additionally, the phase of the outgoing light wave may be compared to the phase of the incoming light wave to determine a phase shift. The phase shift may then be used to determine a physical distance from the capture device to a particular location on the targets or objects.
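The time-of-flight principle in that paragraph boils down to one formula: the distance to a target is half the pulse's round-trip travel time multiplied by the speed of light. A rough illustration (the 20 ns figure is just an example, not from the patent):

```python
SPEED_OF_LIGHT = 299_792_458.0  # metres per second


def tof_distance(round_trip_seconds: float) -> float:
    """Distance to the target, given the IR pulse's round-trip time.

    The light travels out and back, so the one-way distance is half
    the round trip.
    """
    return SPEED_OF_LIGHT * round_trip_seconds / 2.0


# A viewer sitting about 3 m from the sensor returns the pulse in ~20 ns.
print(tof_distance(20e-9))  # -> roughly 3.0 metres
```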
The next line leaves little in doubt:
In another example, the capture device 58 may use structured light to capture depth information. In such an analysis, patterned light (i.e., light displayed as a known pattern such as grid pattern or a stripe pattern) may be projected onto the capture area via, for example, the IR light component 34. Upon striking the surface of one or more targets (or objects) in the capture area, the pattern may become deformed in response. Such a deformation of the pattern may be captured by, for example, the 3-D camera 36 and/or the RGB camera 38 and analyzed to determine a physical distance from the capture device to a particular location on the targets or objects.
However, it would seem that Microsoft also leaves the door open for other types of capture devices:
In some embodiments, two or more different cameras may be incorporated into an integrated capture device. For example, a depth camera and a video camera (e.g., an RGB video camera) may be incorporated into a common capture device. In some embodiments, two or more separate capture devices of the same or differing types may be cooperatively used. For example, a depth camera and a separate video camera may be used, two video cameras may be used, two depth cameras may be used, two RGB cameras may be used or any combination and number of cameras may be used. In one embodiment, the capture device 58 may include two or more physically separated cameras that may view a capture area from different angles to obtain visual stereo data that may be resolved to generate depth information. Depth may also be determined by capturing images using a plurality of detectors that may be monochromatic, infrared, RGB, or any other type of detector and performing a parallax calculation. Other types of depth image sensors can also be used to create a depth image.
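The "parallax calculation" mentioned at the end of that passage is the classic stereo-vision relation: depth equals focal length times camera baseline divided by the pixel disparity between the two views. A small sketch, with arbitrary example values of my own choosing:

```python
def stereo_depth(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Depth in metres from two horizontally separated cameras.

    focal_px     -- focal length expressed in pixels
    baseline_m   -- distance between the two cameras in metres
    disparity_px -- horizontal shift of the same point between the images
    """
    if disparity_px <= 0:
        raise ValueError("point must be visible in both cameras")
    return focal_px * baseline_m / disparity_px


# e.g. 600 px focal length, 7.5 cm baseline, 15 px disparity:
print(stereo_depth(600, 0.075, 15))  # -> 3.0 metres
```

Nearer objects produce larger disparities, which is why this works at living-room ranges but degrades at a distance.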
The patent also covers the monitoring of users via microphone:
As shown in FIG. 7, capture device 58 may include a microphone 40. The microphone 40 may include a transducer or sensor that may receive and convert sound into an electrical signal. In one embodiment, the microphone 40 may be used to reduce feedback between the capture device 20 and the computing environment 54. Additionally, the microphone 40 may be used to receive audio signals that may also be provided by the user to control applications such as life recording applications or the like that may be executed by the computing environment 54.
It would seem that the Kinect would be the device of choice, as it is able to perform most, if not all, of the required tasks right out of the box – but it is worth noting that the patent leaves a lot of doors open: the system could use other motion capture devices (webcams, for example) just as easily as it could use the Kinect. The section of the patent devoted to processing is a little more revealing:
The capture device 58 may be in communication with the computing environment 54 via a communication link 46. The communication link 46 may be a wired connection including, for example, a USB connection, a FireWire connection, an Ethernet cable connection, or the like and/or a wireless connection such as a wireless 802.11b, g, a, or n connection. The computing environment 54 may provide a clock to the capture device 58 that may be used to determine when to capture, for example, a scene via the communication link 46. In one embodiment, the capture device 58 may provide the images captured by, for example, the 3-D camera 36 and/or the RGB camera 38 to the computing environment 54 via the communication link 46.
 As shown in FIG. 7, computing environment 612 includes image and audio processing engine 194 in communication with operating system 196. Image and audio processing engine 194 includes gesture recognizer engine 190, structure data 198, processing unit 191, and memory unit 192, all in communication with each other. Image and audio processing engine 194 processes video, image, and audio data received from capture device 58. To assist in the detection and/or tracking of objects, image and audio processing engine 194 may utilize structure data 198 and gesture recognition engine 190.
 Processing unit 191 may include one or more processors for executing object, facial, and voice recognition algorithms. In one embodiment, image and audio processing engine 194 may apply object recognition and facial recognition techniques to image or video data. For example, object recognition may be used to detect particular objects (e.g., soccer balls, cars, or landmarks) and facial recognition may be used to detect the face of a particular person. Image and audio processing engine 194 may apply audio and voice recognition techniques to audio data. For example, audio recognition may be used to detect a particular sound. The particular faces, voices, sounds, and objects to be detected may be stored in one or more memories contained in memory unit 192.
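Putting those pieces together, the monitoring loop implied by the patent would detect faces in each captured frame, compare the count against the license, and act when it is exceeded. A toy sketch of that loop; the "detector" here is a stand-in (pre-listed faces per frame), not real recognition code, and every name is my own invention:

```python
def monitor(frames_of_faces, licensed_views):
    """Yield an action per frame based on how many faces were detected."""
    for faces in frames_of_faces:
        yield "ok" if len(faces) <= licensed_views else "remedial_action"


# Three frames: one, two, then three people in front of the display.
frames = [["alice"], ["alice", "bob"], ["alice", "bob", "carol"]]
print(list(monitor(frames, licensed_views=2)))
# -> ['ok', 'ok', 'remedial_action']
```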
To me, it all sounds more than a little creepy, and it is a timely reminder to keep checking the fine print in terms-of-service agreements. When something costs less, it may have other strings attached.
Obviously, Microsoft aims to take a hands-off approach with this technology. It wants people's own equipment, rather than live humans, to monitor users – if consumers are aware of the consequences that come with cheaper content, so be it. The problem for me is that it gets people too used to seeing the little light that tells them their device is recording – so when it is on when it shouldn't be, people are far less likely to take notice.
[Source; Image credit: Spare Candy]