CAIP TECHNICAL ABSTRACT

 

CURRENT RESEARCH

 

 

TR-274 Towards Robust Speech Recognition: Linear and Cepstral Domain Adaptation

Hidden Markov Models (HMMs) have been used with considerable success in speech recognition technology. These systems perform well when used in conditions similar to those under which the acoustic models were trained. However, mismatches severely degrade performance. The mismatch can arise from inter-speaker variation, environmental variation, or both. Speaker variation stems from factors such as dialect differences and vocal tract length. Environmental variation stems from microphone variations, channel noise, additive acoustic noise and reverberation.  This thesis aims to develop an algorithm that adapts the acoustic models to a new environment and/or speaker, improving the performance of the speech recognizer under mismatched conditions. An algorithm that adapts the acoustic models in the linear spectral domain, called Linear Domain Linear Regression (LDLR), is proposed. This algorithm is compared with other adaptation techniques such as Maximum Likelihood Linear Regression (MLLR) and Maximum A Posteriori (MAP) adaptation. A hybrid algorithm that uses LDLR in conjunction with MLLR is also proposed.  Experiments show that under mismatched conditions, the recognition word error rate decreases when LDLR adaptation is used. The hybrid technique that uses both MLLR and LDLR does better than any single form of adaptation. For the 20 dB additive noise case, the hybrid adaptation technique reduces the word error rate by 65% compared to the unadapted models. With reverberation present along with the additive noise, the word error rate is reduced by 53%.
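As background for the comparison above, MLLR is commonly described as applying a shared affine transform to the Gaussian mean vectors of the acoustic models. The following is a minimal sketch of that transform step only. The function name `adapt_means` and the sample matrices are illustrative assumptions, the transform is taken as given rather than estimated from adaptation data, and nothing here represents the LDLR algorithm proposed in the report.

```python
def adapt_means(means, A, b):
    """Apply the affine transform mu_hat = A * mu + b to each mean vector.

    means: list of mean vectors (lists of floats)
    A:     square transform matrix as a list of rows (assumed already estimated)
    b:     bias vector
    """
    adapted = []
    for mu in means:
        adapted.append([
            sum(a_ij * m_j for a_ij, m_j in zip(row, mu)) + b_i
            for row, b_i in zip(A, b)
        ])
    return adapted


# Hypothetical two-dimensional example: scale the first dimension and shift both.
example = adapt_means([[1.0, 2.0]], [[2.0, 0.0], [0.0, 1.0]], [0.5, -1.0])
```

In a real system the matrix and bias would be chosen to maximize the likelihood of the adaptation data, typically with one transform shared across many Gaussians.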

 

TR-273 Supplementary Features for Improving Automatic Speech Recognition

Most state-of-the-art Automatic Speech Recognition (ASR) systems take acoustic features and their first-order derivatives as input and model speech using Hidden Markov Models (HMMs).  In this research we have used the Hidden Markov Model Tool Kit (HTK) version 3.1 to build a phone-based speech recognizer.  We compare the performance of the ASR system for context-independent and context-dependent phonemes, which are modeled using multiple Gaussian mixtures.  It was observed that system performance improved as the number of mixtures increased, provided enough training data was available.  The best performance was observed when we used 5 mixtures to model triphones, with the whole TIMIT data set (recorded at Texas Instruments (TI) and transcribed at the Massachusetts Institute of Technology (MIT)) used for training.

In addition to cepstral parameters, we also investigate three other Supplementary Features (SFs):  Periodicity, Zero Crossing Rate (ZCR) and the ratio of low-frequency energy to total energy.  We demonstrate that these SFs improve the accuracy of ASR.  Various combinations of SFs and Mel-Frequency Cepstral Coefficients (MFCCs), along with their first-order derivatives, were studied and compared with the performance of standard MFCC-based systems.  Results suggest that for optimal recognition performance a combination of SFs and MFCCs is more advantageous than either used individually.  We observe further improvements in the accuracy and noise robustness of the ASR system when the last four MFCCs and their corresponding first derivatives are replaced by the SFs and their derivatives, respectively.
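Two of the supplementary features named above have simple textbook definitions, and the sketch below computes them for a single frame of samples. It is an illustration under assumptions: the function names, the cutoff frequency parameter and the use of a naive DFT are not taken from the report, which does not state its exact feature definitions, and the periodicity feature is omitted.

```python
import cmath
import math


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def low_band_energy_ratio(frame, sample_rate, cutoff_hz):
    """Spectral energy below cutoff_hz divided by total spectral energy."""
    n = len(frame)
    if n == 0:
        return 0.0
    total = 0.0
    low = 0.0
    for k in range(n // 2 + 1):
        # Naive DFT coefficient for bin k (fine for short illustrative frames).
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        # Interior bins stand for a positive and a negative frequency.
        is_edge = k == 0 or (n % 2 == 0 and k == n // 2)
        power = abs(coeff) ** 2 * (1 if is_edge else 2)
        total += power
        if k * sample_rate / n < cutoff_hz:
            low += power
    return low / total if total > 0 else 0.0
```

A frame that alternates in sign on every sample gives a zero crossing rate of 1 and almost no low-band energy, while a constant frame gives a rate of 0 and a low-band ratio close to 1.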

 

TR-272 Statistical Modeling of User Input in a Multimodal Speech and Graphics Environment

In a communication act, whether between humans or between a computer system and a user, where the human uses multiple modalities to convey information, there exists a pattern in which the user integrates these modalities to arrive at a combined meaning.  This research work attempts to study this pattern of integration and develop a computational model of it that can be used by multimodal systems to gain knowledge of the user’s behavior.  With this model, multimodal systems can more accurately integrate the continuous data streams from the user’s different input channels to interpret the combined meaning.

The multimodal system developed in this research is capable of responding to the user’s speech, touch, pen-tablet and mouse inputs.  We build a multimodal model that determines the temporal synchrony between these input modalities during a multimodal interaction and integrate it with the system to reduce the search region in the gesture, which is required to resolve the ambiguities in the user’s spoken utterance.

  

TR-271 A Peer-to-Peer Approach to Web Service Discovery

Web Services are emerging as a dominant paradigm for constructing and composing distributed business applications and enabling enterprise-wide interoperability. A critical factor in the overall utility of Web Services is a scalable, flexible and robust discovery mechanism. This paper presents a peer-to-peer (P2P) indexing system and associated P2P storage that supports large-scale, decentralized, real-time search capabilities. The presented system supports complex queries containing partial keywords and wildcards. Furthermore, it guarantees that all existing data elements matching a query will be found with bounded costs in terms of number of messages and number of nodes involved. The key innovation is a dimension-reducing indexing scheme that effectively maps the multidimensional information space to physical peers. The design and an experimental evaluation of the system are presented.
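One generic way to reduce a multidimensional keyword space to a single index is a space-filling curve. The sketch below uses a Z-order (bit-interleaving) curve purely as a stand-in, since the abstract does not name the mapping it uses. The two-letter keyword encoding, the function names and the equal-range peer assignment are all hypothetical choices made for illustration.

```python
def keyword_coord(word):
    """Map a keyword to an integer coordinate from its first two letters.

    Keywords sharing a prefix get nearby coordinates, so a partial keyword
    such as "co*" corresponds to a contiguous coordinate range.
    """
    letters = [c for c in word.lower() if "a" <= c <= "z"][:2]
    while len(letters) < 2:
        letters.append("a")
    return (ord(letters[0]) - ord("a")) * 26 + (ord(letters[1]) - ord("a"))


def interleave_bits(x, y, bits=10):
    """Z-order index: bits of x go to even positions, bits of y to odd ones."""
    index = 0
    for i in range(bits):
        index |= ((x >> i) & 1) << (2 * i)
        index |= ((y >> i) & 1) << (2 * i + 1)
    return index


def peer_for(keyword_a, keyword_b, n_peers, bits=10):
    """Assign a two-keyword data element to one of n_peers equal index ranges."""
    index = interleave_bits(keyword_coord(keyword_a), keyword_coord(keyword_b), bits)
    return (index * n_peers) >> (2 * bits)
```

Because each peer owns a contiguous slice of the one-dimensional index, a query that covers a region of the keyword space only needs to contact the peers whose slices intersect that region.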

 

TR-270 An Ink and Gesture Based Annotation Architecture for the Internet

Significant strides have been made in document technology, such as the standardization of object interfaces and event models to access and manipulate the properties of digital documents. There has also been considerable progress in pen-based computing for recognition of digital ink on desktops and handheld devices.  With the advent of powerful tablet PCs, new applications for pen and ink seem to be mushrooming every day [18].  There has been research on annotation systems for the web since the start of the Internet and web documents, resulting in several complex architectures for annotating and personalizing web pages [1, 5, 6, 9, 11, 13].  Annotations are the digital form of marks that are added to documents in a paper-based environment, for instance, highlighted text, text notes and ink marks on paper margins.

The above-mentioned advances in document technology have necessitated further research on meta-data markup or annotation architectures for digital documents, specifically pen-based annotation systems. This thesis presents an attempt to leverage the new standards of Dynamic HTML and the Document Object Model (DOM, proposed by the World Wide Web Consortium or W3C), which are gradually being implemented by popular browsers, to build a prototype of an ink annotation system with common components across browsers. The primary goals of this study are to provide users with a standard tool to annotate web pages with freeform ink and to semantically link the user-drawn ink annotations with underlying document elements like text and images.  Further, it provides recognition techniques for ink gestures that can help in modifying the rendered ink for annotation operations such as editing, resizing and association.  The main components of the system are the ink capture and dynamic rendering module, an ink understanding module that recognizes and associates ink with the underlying document, and annotation storage and retrieval modules.

 
TR-269 AutoMate: Enabling Autonomic Grid Applications

The increasing complexity, heterogeneity and dynamism of networks, systems, services and applications have made our computational/information infrastructure brittle, unmanageable and insecure. This has necessitated the investigation of a new paradigm for design, development and deployment based on strategies used by biological systems to deal with complexity, heterogeneity, and uncertainty, i.e. autonomic computing. This paper introduces the AutoMate project and describes its key components. The overall objective of AutoMate is to investigate key technologies to enable the development of autonomic Grid applications that are context aware and are capable of self-configuring, self-composing, self-optimizing and self-adapting. Specifically, it will investigate the definition of autonomic components, the development of autonomic applications as dynamic compositions of autonomic components, and the design of key enhancements to existing Grid middleware and runtime services to support these applications.

 

TR-268 Integrating Grid Services using the DISCOVER Middleware

The growth of the Internet and the advent of the computational Grid have made it possible to develop and deploy advanced services on the Grid.  These services build on high-end computational resources and communication technologies, and enable seamless and collaborative access to resources, applications and data. However, problem solving on the Grid requires combining these services in a seamless manner, and getting existing services to interoperate presents many challenges, as these services typically have customized architectures and implementations and build on different enabling technologies. This paper presents the design and implementation of a DISCOVER middleware substrate for integrating Grid services, and describes the integration of Globus and DISCOVER services using this middleware. An experimental evaluation of the middleware substrate is presented.

 

TR-267 Semantic Consistency Optimization in Heterogeneous Virtual Environments

Abstract Collaborative virtual environments with heterogeneous computing resources and user preferences often reduce data fidelity to accommodate such heterogeneity.  Given the resource limitations and user preferences, the problem is to optimize the fidelity degradation so as to achieve maximum semantic consistency across the different data representations.  Consistency maximization can be formulated as an integer programming problem, wherein the constraints are resource limitations and user preferences.  We consider several formulations of the problem, some of which do not enforce topological constraints in the degraded representation, while others do.  The solutions to this problem result in reduced amounts of distributed data, which conserves network bandwidth and other system resources.  Experimental results and proposed topics for further research are also presented.

 

TR-266 Scalable Keyword Searches with Guarantees in Peer-to-Peer Storage Systems

Abstract The ability to support complex keyword searches is important for any information storage and sharing system.  While peer-to-peer storage systems are gaining popularity, existing systems either support keyword lookup without any guarantees or do not allow keyword searches at all.  In the former systems, the cost of a query is not bounded and existing matches in the system may not be found.  The latter systems (data lookup systems) do provide guarantees and bounds, but do not allow keyword searches.  These systems only support information lookup by name.  In this paper, we present an innovative approach to building a peer-to-peer storage system that provides the flexibility of keyword search systems while providing the guarantees and bounds of data lookup systems.  The system guarantees that all existing data elements that match a query will be found with reasonable costs in terms of number of messages and number of nodes involved.  Complex queries containing partial keywords and wildcards are supported.  An experimental evaluation of the system is presented.

 

TR-265 Who’s in Charge Here?  Communicating Across Unequal Computer Platforms, Ivan Marsic, Maria Velez, Marilyn Tremaine, Bogdan Dorohonceanu, Allan Krebs, Aleksandra Sarcevic

Abstract Personal data assistants are being used in the field to collect data and to communicate with others both in the field and in the office.  The individual in the office invariably has a laptop or a high-end personal workstation and thus significantly more compute power, more screen real estate and higher-volume input devices such as a mouse and keyboard. It is therefore useful to know what impact these differences have on work communications.  Four different platform combinations involving a PC and a PDA were used to examine the effect of communicating via heterogeneous computer platforms.  The PC platform used a mouse, keyboard and 3-dimensional screen display.  The PDA platform used a stylus, soft buttons and a 2-dimensional screen display.  A variation of the Tetris wall-building game called Slow Tetris was used as the subjects' task.  An in-depth analysis of the communication exchanges found that subjects with mixed platforms had the most communication problems.  Additionally, control of the task stayed with the person having the richer display, and collaboration and mitigating politeness statements were exchanged most often between partners when direction-giving authority was given to the person with the inferior display.  Tasks directed by persons with PCs were carried out significantly faster than tasks directed by persons with PDAs.  The slow task performance may have been one of the reasons for the authority switches observed.

 

TR-264 Ivan Marsic, Xiaodong Sun, Carlos Correa, and Tongyin Liu, Maintaining State Consistency Across Heterogeneous Collaborative Applications, March 2002

Abstract—The proliferation of computer devices and wireless networks allows the users to access information and collaborate with others from anywhere, using the device that best matches the current context. Device capabilities constrain the application implementation, which implies that the user interface and the shared information will differ across devices. Some shared information may be omitted or abstracted to fit the device capabilities and, as a result, the application state differs on different devices. Therefore, there arises a problem of consistency management of the application state across different platforms. Application state is determined by the data structures that represent the application’s data and its user interface and we assume that the application state can be represented as a graph data structure. We present a set of conditions on the rules relating the states in different implementations of an application and derive an algorithm which maintains the state consistency under the user interactions. We also illustrate an application scenario that benefits from heterogeneous state representations.

 

TR-263 Liang Cheng, Network Awareness for Heterogeneous Data Networks, March, 2002 (Ph.D)

Abstract - Network awareness, which is defined as the capability of network devices and applications to be aware of network characteristics, is the basis for network quality-of-service (QoS) provisions and network management.  The necessity of network awareness in heterogeneous data networks is illustrated by several experimental studies, such as multimedia collaboration, QoS provisioning, and cluster computing in heterogeneous data networks.

Existing techniques of network awareness are studied in three research areas:  link-type awareness, link-bandwidth awareness, and service awareness.  Considering the limitations and/or unsuitability in their applications to various heterogeneous data networks, original techniques in these areas are presented:  (i) fuzzy reasoning for wireless awareness; (ii) accurate bandwidth measurement in digital subscriber line (xDSL) networks; and (iii) service awareness in mobile ad hoc networks (MANETs).

 A novel piecewise framework for network awareness service (NAS) for efficient integration of various network awareness techniques in heterogeneous data networks is presented.  Analytical results on the performance of the NAS framework demonstrate that it has significant advantages over traditional unitary frameworks in terms of reducing wireless bandwidth consumption and saving battery energy of mobile devices.  An original study of statistical properties of session throughputs in wireless local area networks exemplifies the feasibility of applying predictions in the NAS framework.

 

TR-262 Ashutosh Morde, An Application for Voice Controlled Driving Directions, February, 2002 (masters thesis)

Abstract - Access to online information is crucial for people on the move.  This information can be provided to the user through the phone.  Voice as an access mechanism is direct and allows additional tasks to be performed while the hands or eyes are busy.  PhoneBrowser is a tool that allows the user to browse the web by speech control over an ordinary telephone.

An application for voice controlled driving directions was developed using the PhoneBrowser.  The user is guided through the process of providing the origin and destination address to the system, which then queries an online database to access the driving directions.  The user has control over the order and rate at which the information is provided to him – he can ask the system to go to step n or repeat a previous step.  Early experiments with students identified the problems with the temporal attributes of the synthesized speech.  It was difficult for the students to recollect steps simultaneously with the telephone communications; the application was extended to display turn-by-turn maps and driving directions on an Ipaq.  The information displayed on the Ipaq is synchronized to that requested by the user over the phone.  A framework for collaborative browsing, which can be extended for multimodal interfaces using the PhoneBrowser, was also established.

 

TR-261 Shashank Sathyanarayana, A Method for Electronic Mail Dictation & Retrieval Over the Telephone, July 2001, (masters thesis)

Speech is the preferred medium of communication between humans.  Of late, speech technology has become increasingly important in the computing world as it is used to improve existing user interfaces and to support new means of human interaction with computers.  One such use of speech technology is the ability to browse the web over the phone, known as voice browsing.  Apart from regular browsing, voice-browsing technology can be extended to provide anytime accessibility to traditionally deskbound applications.  This thesis discusses a real-time implementation that integrates voice-browsing technology with a traditional desktop application, viz. electronic mail.  The technologies involved in sending and receiving electronic mail over the telephone are discussed.  At its simplest, the system consists of a speech recognizer to transcribe to perform transcription are examined.  Experiments are performed to measure the confusability existing in the grammar.  Words that are most prone to misrecognition in the grammar are noted.  A mechanism for correcting dictation misrecognitions using this knowledge is explored.  Usability studies are conducted on the application.  The study and the user responses are discussed.

 

TR-260, Christopher Alvino, Acoustical Source Location of Multiple Talkers in Reverberant Environments, April 2001, (masters thesis)

Acoustical source location is a topic of interest in the fields of signal processing and acoustics.  The theory of acoustical location of a single source is well developed.  Unfortunately, the reverberation that exists in closed environments severely degrades the accuracy of the source location estimates.  The focus of this thesis is on performing acoustical source location of multiple moving talkers in reverberant environments.  An outlier algorithm designed to combat the negative effects of reverberation is validated and tested experimentally.  A framework is presented for a speech/non-speech detector to be used in a teleconferencing environment so that audio and visual sensors are not aimed at the locations of non-speech sounds.  In the last two chapters, this thesis explains the theory of estimating the locations of two or more simultaneous sources.  Experimental results of this algorithm are described along with the results of other suggested methods.  Finally, suggestions are made on how this work can be continued and improved.

 

TR-259, Rares F. Boian and Grigore C. Burdea, WorldToolKit vs. Java 3D:  A Performance Comparison, 4/26/01

This report compares the performance of WorldToolKit (releases 8 and 9) and Java3D (versions 1.1.3 and 1.2) running a VR simulation on a dual-processor 450 MHz PC.  The simulation was designed to run in several configurations having different interaction levels (no interaction, fly-through and haptic interaction), rendering modes (wireframe, Gouraud and textured), graphics modes (mono and stereo) and scene complexities (5,000-50,000 polygons).  Results show that overall Java3D is faster than WTK in terms of frame refresh rates.  However, WTK has a more uniform frame rendering time, which results in more predictable visual feedback.

 
TR-258, Daniel Nagy, Online Language Acquisition in Multimodal Environment, 9/12/00

As a test vehicle we picked the Information Kiosk task.  It is very common to place self-service information kiosks in areas where large numbers of visitors are expected, such as international airports, railroad terminals, museums and famous research centers.  Kiosks at transportation hubs provide visitors with travel information about the city or country they are located in; in museums and research centers they can act as virtual tour guides (a very nice and impressive example is the kiosk network at the State Hermitage Museum in St. Petersburg, Russia by IBM).

The kiosks are aimed at random users, so the designers are motivated to equip the terminals with natural user interfaces.  Since the users are not expected to know any particular command language, typed or spoken input, if it is considered at all, has to deal with unconstrained natural language.  Redundant multimodal input is very useful, for many reasons.  Different information can be best expressed in different modalities:  it is more convenient to express spatial information by pointing, while pointing is not very well suited for selection from a very large number of named entities.  Some circumstances may prevent the user from using one modality or another; users may have disabilities, one cannot talk in a noisy environment, etc.  Finally, such kiosks are devices for both information access and entertainment (infotainment:  a term widely used by post-modern philosophers), and the ability of the machine to respond to different input stimuli increases its entertaining capability.

 

TR-257, George V. Popescu, Design & Performance Analysis of a Virtual Reality-Based Telerehabilitation System, 1/2001

In recent years the area of medical VR applications has continuously expanded, addressing new domains such as home healthcare, clinical neuropsychology, and rehabilitation.  The research presented here explores the use of Virtual Reality (VR) for telerehabilitation applications.  A prototype platform for VR-based telerehabilitation was defined first.  The main component of the platform is the hand force feedback unit.  A programming library, the Rutgers Haptic Library, was developed for modeling hand haptic interactions.  The software was used to build real-time VR simulations that involve elastic and plastic deformations and physical modeling.

The VR-based platform was the basic component of the telerehabilitation architectures we developed.  These architectures use Virtual Reality as an advanced interface for therapy as well as to enable communication between the therapist at the clinic/hospital and the remote patient or group of patients.  The first prototype supports offline interaction between the therapist and the VR-enabled patient site.  This “store and forward” system uses a Client/Server architecture.  The client (patient home) runs VR rehabilitation exercises with force feedback and collects patient data.  The exercises simulate physical and functional rehabilitation routines.  Patient data are forwarded to the server (clinic site), which stores medical records and runs data analysis software.  System performance over several types of connections was measured in laboratory experiments.  The guidelines extracted from these experiments help size the system in terms of recorded data and number of concurrent users.  Clinical trials were conducted at the Stanford University Medical School.  Data collected during these trials indicate that patients’ level of effort and grasping strength increased after using the VR-based rehabilitation system.

“Store and forward” systems are insufficient for implementing the whole range of potential telerehabilitation services.  The second architecture developed in this thesis uses a Shared Virtual Environment to enable real-time patient-therapist interaction.  The prototype system allows the therapist to perform remote physical therapy and collect patient data.  Simulated physical interactions between therapist and patient were implemented using force feedback.

 

TR-256, John Sucec and Ivan Marsic, The Scalability Tradeoff of Topology Dispersion for Mobile Ad hoc Networks, 11/10/00

 

This paper addresses the issue of scalability, with respect to increasing node count, in mobile ad hoc networks.  In particular, scalability of on-demand routing protocols is considered in detail.  The scalability of routing protocols is characterized as tradeoffs between several criteria.  As a result of evaluating these tradeoffs, a heuristic known here as topology dispersion is proposed for on-demand routing protocols.  Analysis and simulation results show that topology dispersion can afford scalability in terms of locally available routing information, at the expense of a moderate increase in control packet overhead, over the range of mid-size networks (50 to 400 nodes).

 

TR-255, Phillip Stanley-Marbell and Michael Hsiao, Fast, Cycle-Accurate Energy Estimation for Networks of Embedded Systems, 11/00

Increased demand to investigate the effect of software on energy usage in embedded systems, such as networks of sensors, requires cycle-accurate energy estimators.  The analysis of such systems necessitates a means of simulating the computation and communication costs both in individual nodes and globally across the entire network.  Presented is a fast, flexible, cycle-accurate architectural simulator for networks of embedded systems that models a popular commercial microcontroller family, the Hitachi SuperH RISC architecture.  The simulator enables cycle-accurate power dissipation analyses through both instruction-level power analysis and circuit activity estimation, and permits both functional and energy cost simulation of a wired communication link.  By providing a flexible range of simulation detail and amortizing portions of the simulation cost across several simulated devices, it permits simulation of realistic applications on networks of embedded systems.  The simulator provides hitherto unavailable functionality, while being over an order of magnitude faster than a contemporary state-of-the-art power-estimating simulator.

 

TR-254, John Sucec and Ivan Marsic, Selecting Query Scope in On-Demand Routing Protocols Based on Path Count Estimation, 11/28/00
 
To discover a route to a peer node, an on-demand routing protocol may initiate a network-wide broadcast of a route request packet, perhaps after initially confining the dissemination of its request to neighboring nodes.  However, significant savings in packet overhead can be achieved by incrementally increasing the query radius rather than resorting prematurely to a query radius equal to the network diameter.

This paper presents a method to estimate the number of currently active pairs of communicating nodes (i.e., the number of active paths) in a mobile ad hoc network.  The method is entirely distributed and requires only modest additional overhead in each Hello message transmission.  Based on the estimate of the network path count, a decision can be made locally at a node performing route discovery as to how the query radius should be incremented.  By selecting the correct route query radius, the number of packet transmissions required for route discovery can be minimized.

The simulation results reported herein pertain to an on-demand routing protocol that incrementally increases its query radius from one hop to two hops prior to initiating a network-wide broadcast of the query.  The results indicate that such a method can substantially reduce the route request packet overhead, on average, as compared with a route discovery process that resorts immediately to network-wide broadcast if the initial hop query fails.  The path count estimate is applied to detect instances where incrementing the hop radius from one to two hops will likely yield little or no savings.  This allows the routing protocol to skip the incremental query expansion process when the expected savings in route request packet overhead is low, thus improving the protocol performance in terms of route discovery delay.
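The trade-off between incremental query expansion and immediate network-wide broadcast can be illustrated with a toy model. The sketch below is not the protocol or the simulator from the report. It assumes a static graph, no route caches, that only the target answers a request, and that every node closer than the query radius transmits the request exactly once; the function names and the sample topology are hypothetical.

```python
from collections import deque


def hop_distances(graph, source):
    """Breadth-first search giving the hop count from source to each node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def flood_cost(dist, radius):
    """Transmissions in one flood: nodes closer than `radius` hops send once."""
    return sum(1 for d in dist.values() if d < radius)


def discover(graph, source, target, radii):
    """Total transmissions until the target is reached, or None if never found.

    `radii` lists the query radii to try in order; None means network-wide.
    """
    dist = hop_distances(graph, source)
    network_wide = len(graph)  # a radius this large reaches every node
    total = 0
    for r in radii:
        radius = network_wide if r is None else r
        total += flood_cost(dist, radius)
        if dist.get(target, float("inf")) <= radius:
            return total
    return None


# Hypothetical five-node chain: A - B - C - D - E
chain = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "E"], "E": ["D"]}
near_incremental = discover(chain, "A", "C", [1, 2, None])  # 3 transmissions
near_broadcast = discover(chain, "A", "C", [None])          # 5 transmissions
far_incremental = discover(chain, "A", "E", [1, 2, None])   # 8 transmissions
far_broadcast = discover(chain, "A", "E", [None])           # 5 transmissions
```

In this toy chain, expansion saves transmissions when the target is two hops away but costs more than a direct broadcast when the target is four hops away, which is the kind of case where skipping the incremental step pays off.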

 

TR-253, John Sucec and Ivan Marsic, Assessing the Scalability of On-Demand Routing Protocols, in Terms of Locally Available Routing Information, in Mobile Ad hoc Networks, 12/2001

This paper evaluates the ability of on-demand routing protocols to efficiently perform route discovery in mobile ad hoc networks (MANETs) as the network diameter, node count and number of active routes are increased.  Specifically, the probability of route discovery at various hop count distances from a requesting node is determined via simulation, where route discovery at a small hop distance represents efficient route discovery and a large hop distance represents costly route discovery.  Effective route caching heuristics are essential for efficient route discovery.  However, to minimize the likelihood of stale route cache entries, retrofitting a heartbeat protocol to the routing protocols being considered is proposed.  The routing paradigms under consideration are source routing with and without neighbor discovery.  Simulations, presented herein, provide quantitative evidence that source routing with neighbor discovery is the most robust approach.  The efficiency of on-demand route discovery in a hybrid routing environment, with a proactive routing radius of 2 hops, is also considered.

 
TR-252, James Kleban, Chris Alvino, Shashank Sathyanarayana, Ashutosh Morde, James L. Flanagan, Digital Implementation of a Source Location and Sound Capture System for Hands-Free Video Teleconferencing, 10/31/00

A real-time system for hands-free video teleconferencing is implemented on a Texas Instruments TMS320C6201 DSP.  The acoustic signals from the microphones are digitally beamformed, resulting in a high-quality audio signal.  The source location estimate required by the beamformer is determined by the Cross Power Spectrum Phase (CPSP) technique, with the CPSP algorithm modified for the fixed-point DSP.  The inaccuracies due to the presence of ambient acoustic noise are minimized by implementing an adaptive energy tracker, which rejects a frame of audio data if its energy is less than the estimated background noise.  Reduction in source location inaccuracies by the technique of TDOA outlier removal was also tested.  A recursive least squares algorithm is used to smooth the source location estimates.
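The frame rejection idea described above can be sketched in a few lines. This is an illustration under assumptions rather than the DSP implementation: the class name, the exponential smoothing factor `alpha`, the acceptance `margin` and the rule of updating the noise estimate only on rejected frames are all hypothetical, and the sketch uses floating point rather than fixed-point arithmetic.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(x * x for x in frame) / len(frame)


class EnergyTracker:
    """Tracks a background noise estimate and rejects low-energy frames."""

    def __init__(self, initial_noise, alpha=0.9, margin=2.0):
        self.noise = initial_noise  # running background noise energy
        self.alpha = alpha          # smoothing factor (assumed value)
        self.margin = margin        # required ratio above the noise (assumed)

    def accept(self, frame):
        """Return True if the frame should be passed to source location."""
        energy = frame_energy(frame)
        if energy > self.margin * self.noise:
            return True
        # Frame looks like background: fold it into the noise estimate.
        self.noise = self.alpha * self.noise + (1.0 - self.alpha) * energy
        return False
```

Only frames that pass `accept` would be handed to the delay estimation stage, so quiet frames cannot produce spurious location estimates.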

 

TR-251, James Theodore Kleban, Combined Acoustic and Visual Processing for Video Conferencing Systems, 9/29/00 (thesis)

The goal in hands-free videoconferencing is to accomplish, in real time, high quality sound pickup with automatic source location of the current talker.  Signal processing techniques mitigate the deleterious effects of interfering sources, room reverberation, and background noise.  A new real-time digitally processed system utilizes increased processing power to implement acoustic source location and sound capture for higher fidelity videoconferencing.  The talker location system, however, is subject to inaccuracies when using the acoustic input alone.  An additional source of information, images obtained from the application’s video camera, can enhance talker positioning.  Image processing algorithms are available that detect and localize people in a room by their faces.  Face locations can complement the talker positions found using acoustic signal processing.  Integrating the two modes of information provides a more efficient way of locating a speaker in a room than either input modality on its own.  Improved position estimate accuracy will lead to higher quality sound and video.

 

This thesis outlines the construction of an improved, fully digital, real-time source location and sound capture system for teleconferencing.  Algorithms using face detection to provide image-based locations of people in a room are presented.  Offline experiments investigate face detection/localization techniques, and examine the integration of visual and audio information.  A method for detecting and locating multiple talkers, which is difficult to do with acoustic data alone, is presented.  Systems combining image and acoustic processing for later real-time implementation are examined.

 
TR-250, Manpreet Kaur, Integration of Gaze and Speech for Multimodal Human Computer Interaction, 9/29/00 (PhD thesis)

 

Most commonly used human-computer interfaces do not take advantage of the many communication channels humans use to communicate in verbal and non-verbal ways.  The Rutgers University CAIP Center, under the NSF STIMULATE program, has been conducting research to establish, quantify and evaluate techniques for designing synergistic combinations of human-machine communication modalities like sight, sound and touch.  An initial system using these modalities has been implemented at CAIP, and it has been seen that even with our simplistic integration scheme and imperfect component technologies, there are obvious performance advantages to be gained from the use of multiple modalities.  The research described in this thesis is a systematic evaluation and characterization of gaze as an input modality, and its integration with speech.  The overall goal of the research was to explore the use of gaze and speech as input modalities for HCI, and to understand the natural integration patterns typically occurring in the combined use of the two.  An exhaustive characterization of the use of gaze as an input modality for human-computer interaction has been carried out.  The relationships between object selection time and distance, size, and index of difficulty, as defined by Fitts' law, have been studied.  The speech and gaze experiments described in this work provide detailed timing correlations of speech and gaze, both in a natural environment and in a computer-based system, which can be used to answer the questions of when and how to integrate the two modalities.  The integration of speech and gaze has been studied under linguistically different command structures like the use of labels (Move This There).  It has been found that there are fundamental relationships between gaze and speech events; for example, gaze always precedes speech, though the lead time is seen to vary with command structure.  The effect of command structure is seen to decrease in a computer-based environment.
Also, it has been demonstrated that the detailed knowledge of the timing relationship of the speech and gaze patterns for command specification and command execution can be used for error resolution for a more robust multimodal interface.
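For readers unfamiliar with the index of difficulty mentioned in this abstract, the sketch below uses Fitts' original formulation, ID = log2(2D/W), with selection time modeled as MT = a + b * ID. The abstract does not say which formulation the thesis uses, and the coefficients `a` and `b` here are hypothetical placeholders that would normally be fitted to measured selection times.

```python
import math


def index_of_difficulty(distance, width):
    """Fitts' original index of difficulty, in bits: log2(2D / W)."""
    return math.log2(2.0 * distance / width)


def predicted_selection_time(distance, width, a=0.2, b=0.1):
    """Linear Fitts' law model MT = a + b * ID.

    a (seconds) and b (seconds per bit) are hypothetical values for
    illustration only; they are not results from the thesis.
    """
    return a + b * index_of_difficulty(distance, width)
```

Doubling the target distance or halving the target width each adds one bit to the index of difficulty, which the model converts into a fixed increment of selection time.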

 

TR-249, Dwight Macomber, Deborah Grove, Joseph French, Report of Microphone Array Experiments Performed at CAIP in January 2000, 10/31/00

The general purpose of the experiments was to evaluate microphone array performance under conditions approximating those typical of teleconference environments.  Conference Room 601 of the CoRE Building on the Busch Campus of Rutgers University was used to measure the behavior of a 32-microphone wall mounted matched-filter array (MFA).  Measurements were made to evaluate behavior with a source located at the array focus, as well as with the source located at increasing distances from the focus position.  Impulse responses (IR’s) were measured from twelve different off-focus source positions along a line in one direction and from seven positions in another direction.  Measurements were also made to judge the sensitivity of the array to changes to the room environment. 

A second set of measurements was taken in CoRE Room 732.  The goals of these measurements were (1) to determine whether the acoustic field in a real room approximates a diffuse reverberant field, and (2) to obtain some preliminary indications of the practicability of a ceiling-mounted microphone array for frequencies at the audio mid-range and above.

 

TR-248, John Susec and Ivan Marsic, An Efficient Distributed Network-Wide Broadcast Algorithm for Mobile Ad Hoc Networks, 7/5/00
 
In this paper, an algorithm for efficient network-wide broadcast (NWB) in mobile ad hoc networks (MANETs) is proposed.  The algorithm is performed in an asynchronous and distributed manner by each network node.  The algorithm requires only limited topology knowledge and is therefore suitable for reactive MANET routing protocols.  Simulations show that the proposed algorithm is on average 3-4 times as efficient as brute force flooding.  Further, simulations show that, in terms of minimizing packet transmissions, the proposed algorithm compares favorably over a wide range of network sizes with a greedy algorithm using global topology knowledge.  The application of the algorithm to route discovery in on-demand routing protocols is discussed in detail.  Proofs of the algorithm’s reliability and of the intractability of solving for a minimum sized transmitter set to perform NWB are also given.
 
TR-247, Amit Chhabra, Spatially Selective Sound Capture Using a Workstation Matched-filter Microphone Array for a Small Office Environment, 9/8/00 – Masters Thesis

Modern speech-based systems require the “capture” and digitization of speech with an increasingly higher degree of fidelity, i.e., obtaining speech comparable to that of a human talker.  Additionally, it is desirable to provide this capability in a manner that does not hinder human movement, i.e., to capture clear, high-quality speech samples without using any body-worn equipment.  To this end, spatially selective sound capture techniques must be used to remove interfering noise (e.g., HVAC systems, reverberation, and other talkers) from the desired speech captured in real rooms.  The spatially selective sound capture technique of interest in this work is the Matched-filter Array (MFA) technique, used as a means of removing noise from competing sources and reverberant noise in a room.  The technique’s performance is measured by its SNR improvement over that of a single microphone and the delay-sum beamformer (DSB) technique.  The technique was implemented on a real-time computing system, and experiments were performed in a small office environment.  The results yielded an improvement of 6.4 dB above a single microphone and approximately 2 dB above the DSB.  The technique is noted as a promising method to deliver the results needed for modern speech-based systems.

 

TR-246, Sumathi Gopal, Aristotle and the Knowledge Web, 8/30/00 – Thesis

Aristotle is a Distributed Learning System.  A Distributed Learning System is an educational tool that supplements classroom training or even replaces it.  Currently, the audience of Aristotle comprises all users of the RUNet2000 intra-network of Rutgers University.  The current project prototype of Aristotle is a Rutgers University freshman course in General Biology – Biology 101.  The current implementation comprises two portions – The Virtual Labs, and The Online Classroom.  Aristotle was commissioned during the fall of 1999.

This thesis concerns the development of the Online Classroom.  It elaborates on the program utilities that were developed to generate the various kinds of files required to launch the classroom, and the design and implementation of the novel concept of the Knowledge Web.

The Online Classroom is unique in its class, as it renders each lecture into topics and keywords.  Each topic/keyword presentation comprises a video clip, transcript, images and a definition.  Moreover, the video, transcript and images are fully synchronized.  This is achieved by means of SMIL, Real Time Streaming Protocol (RTSP), RealPlayer G2 and RealServer of Real Networks, Inc.

 

The second portion of the thesis is The Knowledge Web.  This is a tool developed specifically for the online classroom, to ease the process of meta-cognition.  It is a graphical representation of knowledge/information.  Two applications – The Knowledge Web Composer and the Knowledge Web Navigator – have been designed to construct and navigate knowledge webs.  Details of the design and implementation are discussed in this thesis report.

 

Aristotle uses the available Internet resources to ease the learning process of students.  Distributed Learning Systems overcome several problems associated with traditional classrooms.  The Knowledge Web tries to compensate for the absence of a live teacher.  Details of the design of the Online Classroom of Aristotle and the Knowledge Web are presented in this report.

 

TR-245 Ivan Marsic, Attila Medl, and James Flanagan, Natural Communication with Information Systems, 4/20/00

 

Pervasive networking and sophisticated computing open opportunities for collaborative information processing independent of time and space.  In this instance the information system becomes an enhancer of human intellect, as well as a mediator for communication among participants.  The human user favors the sensory dimensions of sight, sound, and touch as primary channels of communication.  Machines that can accommodate these modes promise flexibilities and functionalities that transcend the traditional mouse and keyboard.  This paper describes research to establish human-computer interfaces that capture attributes of natural face-to-face communication.  An experimental multimodal system is developed to study several aspects of natural style human-computer communication.  While as yet primitive, the technologies of image and gaze processing, hands-free conversation, and force feedback tactile transduction are combined and used simultaneously for manipulating objects in a shared workspace.  Software agents fuse the sensory signals to estimate and interpret user intent.  Current areas of experimental application include disaster relief/crisis management, telemedicine/rehabilitation, and mobile office/wearable computers.

 

TR-244 assigned but no report has been received.

 

TR-243 – Helmuth Trefftz and Grigore Burdea, 3-D Tracking Calibration of a Polhemus Long Ranger Used with the Baron Workbench, 14/8/00

 

Applications involving a workbench such as the Barco Baron [1] typically require a larger working envelope than desktop VR applications.  In order to provide accurate tracking of the users’ hands in such an environment, a Polhemus Long Ranger 3-D magnetic tracker was acquired.  The Long Ranger provides a larger radius of interaction, but the accuracy of the signal is adversely affected by metallic surfaces in the proximity of its working area.  This project describes a procedure to reduce these errors, based on samples taken at regular spatial intervals in the working volume.

 

TR-242 – Chengwei Feng, Deborah Silver, Karen G. Bemis, Peter Rona, Acoustic Imaging Manual:  Object Segmentation and Feature Quantification, 12/6/99

 

In this work, our goal is to apply visualization techniques that will facilitate the study of simulation and experimental datasets of hydrothermal plumes.  Hydrothermal plumes are bodies of hot water containing mineral particles that discharge from vents on the sea floor.  They rise buoyantly, in some cases up to hundreds of meters, and disperse heat and chemicals into the ocean.  This report describes enhancements to our existing feature extraction, tracking and quantification system.  It focuses on feature extraction (object segmentation) and feature quantification procedures.  The simulation dataset is a numerical model based on a large eddy simulation in the field of computational fluid dynamics (CFD).  Experimental datasets are acoustic images of thermal plumes, which record intensity of backscatter from the particulate matter suspended in the plumes.  The experimental datasets were recorded at different locations and times.  Our new approaches, including skeleton and centerline representation and quantification, provide additional visual information.  A centerline and a skeleton are useful shape abstractions of features, which are regions of interest in the dataset.  Both capture the essential topology of the feature but at different levels of detail.  The centerline plays an important role in defining the qualitative and quantitative behavior of features and their evaluations in the field of plume study.  The skeleton is a thinned version of the original object, much like the skeleton of a human figure.  It is related to the medial axis, which is the locus of points centered with respect to the boundaries of the feature.  We extract the skeleton of a feature using a distance transformation method.  In our approach, users can control the density of the skeleton points by interactively changing the thinness parameter.  Centerlines are computed from these scattered skeletal points using appropriate averaging, connecting and smoothing techniques.  
Quantification provides analytical calculations for the features.  Our quantifications, including center of area, center of mass, local maximum sets, mass, area, and volume, are computed based on the centerline representation.  Understanding of simulation and experimental datasets is achieved by applying visualization and quantification techniques.  Quantification parameters from both the simulation dataset and the acoustic imaging dataset, including plume volume, cross-sectional area, centerline location, surface area and isosurfaces at percentages of maximum backscatter intensity, are being used to derive elements of plume behavior including expansion with height, dilution, and mechanisms of entrainment of surrounding seawater.

 

TR-241 – Chengwei Feng, Xin Wang, Deborah Silver, User Interface for Feature Extraction, Tracking and Quantification System, 12/6/99

 

3D time-varying datasets tend to be extremely large and complex, and standard visualization techniques such as isosurface and volume rendering provide no facilities to manipulate features of interest in the dataset.  X. Wang presented a feature-based approach in his PhD thesis [2].  This allows users to extract regions of interest and then visualize, track, isolate and quantify their evolution.  The tracking and quantitative information can be used to enhance the visualization of the dataset.  A feature-based approach can significantly improve and facilitate the processing of massive datasets.  This report centers on the integration of the feature extraction, tracking and quantification programs into a modular GUI environment.  With the integrated user interface, users can interact with features of interest and focus on their evolution and quantification without being distracted by the implementation details.  A complete feature extraction, tracking and quantification system has been integrated into AVS.  The user interface has been applied to experimental and simulation datasets from oceanography and CFD.

 

TR-240 – Deborah M. Grove, Time & Frequency Data for On/Off Focus Matched Filter Processing, 11/5/99

 

This report presents results of matched filter array processing for a real conference room.  Hands-free communication for situations such as teleconferencing and distance learning allows individuals to move around a room without needing wearable microphones.  However, room conditions cause speech to be distorted by multiple reflections throughout the volume.  Matched filter array processing uses room reflections in its algorithm rather than trying to remove them.  Using stored impulse responses between a source and each sensor as a matched filter produces an acute spatial selectivity on the order of 30 cm and mitigates the effects of reverberation.  Previous reports have documented this behavior without examining how the impulse responses cause this selectivity.  Here, eight randomly spaced sensors are used to measure impulse responses.  The impulse responses and resulting matched filter array response are studied in time and frequency for both on-focus and off-focus conditions.  The influence of the room transfer function on the frequency response of the matched filter array is shown.  Random sensor placement for matched filter processing in the time domain is explained by inspection of individual matched filters via impulse responses.  It is seen that the frequency focal volume immediately degrades off-focus for the high frequencies first and that the off-focus condition causes a time delay in the array response.  Truncation of the impulse responses is applied to minimize anticausal echo.  Examination of these truncated responses reveals insights into tracking changes to the matched filters around the focus.
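The matched filter principle described in this abstract can be sketched in a few lines. The sketch assumes that each sensor signal is filtered with the time-reversed impulse response measured from the focus to that sensor, and that the filtered outputs are summed. It also assumes all stored impulse responses have equal length, so that their correlation peaks align. The function names are hypothetical and this is not the report's implementation.

```python
def convolve(x, h):
    """Direct-form linear convolution of two sample sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def matched_filter_array(sensor_signals, impulse_responses):
    """Filter each sensor signal with its time-reversed impulse response and sum.

    Assumes all signals share one length and all impulse responses share one
    length, so the per-sensor outputs line up sample by sample.
    """
    total = None
    for signal, response in zip(sensor_signals, impulse_responses):
        filtered = convolve(signal, response[::-1])
        if total is None:
            total = filtered
        else:
            total = [a + b for a, b in zip(total, filtered)]
    return total
```

For a unit impulse emitted at the focus, each sensor signal equals its own impulse response, so every filtered output is an autocorrelation. The peaks add coherently at a delay of one filter length minus one sample, while the side lobes differ from sensor to sensor.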

 

TR-239 – Boi Sletterink, A Managing Agent for Sharing Multiple Modalities, 8/31/99

 

Multimodal user interfaces promise more natural man-machine communication, as well as improved accessibility for disabled people.  When multimodal user interfaces are extended beyond single-application scenarios, for example to provide a multimodal interface on a desktop, applications have to share access to input and output devices.  This requires an agent that arbitrates between applications and low-level drivers.  This thesis proposes a design for communication between driver, manager and application, and discusses a prototype implementation and preliminary testing results.

 

TR-238 – DongSuk Yuk, Robust Speech Recognition Using Neural Networks and Hidden Markov Models – Adaptations Using Non-linear Transformations, 8/31/99, (PhD Dissertation)

 

When the training and testing conditions are not similar, statistical speech recognition algorithms suffer from severe degradation in recognition accuracy.  Even when the underlying distributions from which data is generated are the same, the observed distributions may vary because of the interference from acoustical environments where systems are actually used.  Another source of variability comes from speakers themselves where the produced sound is different between speakers.  This research concerns robustness issues in statistical speech recognition, especially when the training and the testing data distributions are not matched.

 

Since the parameters of recognizers are estimated from training examples, it would be better to use data collected from the testing environments.  However, collecting enough such data to reliably estimate the parameters of recognizers is a very expensive task.  In this research, a transformation approach based upon neural networks is studied to handle the training and testing condition mismatches.  Neural networks can be used for situations where speech feature vectors are non-linearly distorted, such as in noisy reverberant speech or telephone speech.  By using a neural network, the adaptation process requires only a small amount of training data.  First, a neural network is applied to the computation of an inverse distortion function.  This type of network requires simultaneously recorded input and target data pairs for training.  Traditionally, neural networks are trained to minimize the mean squared error between the network output and the corresponding target value.  However, minimizing the mean squared error does not guarantee maximum recognition accuracy.  Therefore, a new objective function for the neural network is proposed, which makes use of the conditional probabilities that come from hidden Markov model (HMM) based recognizers.  It maximizes the likelihood of the data from testing environments, and allows global optimization of the neural network when used for the transformation of data, or for the adaptation of recognizers to an operating environment.  In the latter case, the parameters of recognizers (i.e. mean vectors and covariance matrices) are transformed to best match the data distribution.  The new algorithm is evaluated on a large vocabulary continuous speech recognition task.

 

TR-237 – Harvey Ray and Deborah Silver, A Memory Efficient Architecture for Real-Time Parallel and Perspective Direct Volume Rendering, 8/4/99

 

Real-time visualization of large three-dimensional datasets demands high performance and thus pushes storage, processing, and data communication requirements to the limits of current technology.  General purpose high-performance parallel processing has been used to visualize these datasets; however, these solutions are not tractable because of the cost of the machines.  High-end graphics workstations have recently achieved interactive rates on moderate sized datasets using texture mapping; however, texture mapping memory is typically limited and texturing hardware does not directly support 3D gradients.  Several custom architectures have been proposed to address the shortcomings of other approaches.  These solutions promise unprecedented price-performance ratios.  Recently, two specialized architectures have been proposed for interactive visualization of 256 x 256 x 256 datasets.  One achieves 30 Hz for parallel projections and the other achieves an average of 10 Hz for both parallel and perspective projections.  This paper introduces the Resample And Composite Engine (RACE) architecture, a new high-performance general purpose volume graphics architecture that targets 20 – 24 Hz average performance for both perspective and parallel rendering with as few as 4 rendering pipelines for 256 x 256 x 256 datasets using 100 MHz SDRAM memories.  It requires anywhere from 33 – 50% less voxel bandwidth than other recent approaches.  It will achieve 40 – 48 Hz average performance for 256 x 256 x 128 datasets.  The RACE architecture will support antialiasing of perspective images (casting multiple rays per pixel).  We believe that this work further validates the need for specialized direct volume rendering hardware for voxel processing.
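A back-of-the-envelope calculation shows how the quoted frame rates relate to the quoted memory clock and pipeline count. The sketch rests on a simplifying assumption that is not stated in the abstract: each pipeline fetches one voxel per memory clock cycle and each voxel is fetched exactly once per frame. The function name is hypothetical.

```python
def max_frame_rate(voxels, pipelines, clock_hz):
    """Upper bound on frames per second if every voxel is fetched once per
    frame and each pipeline fetches one voxel per clock (an assumption)."""
    return pipelines * clock_hz / voxels


# Figures quoted in the abstract: 4 pipelines, 100 MHz SDRAM.
full_volume = max_frame_rate(256 ** 3, 4, 100e6)          # roughly 23.8 Hz
half_volume = max_frame_rate(256 * 256 * 128, 4, 100e6)   # roughly 47.7 Hz
```

Under this assumption the bound falls at the top of the 20 – 24 Hz range quoted for 256 x 256 x 256 datasets. Halving the volume doubles the bound, in line with the 40 – 48 Hz figure quoted for 256 x 256 x 128 datasets.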

 

TR-236 -- D. Sinder, Speech Synthesis Using an Aeroacoustic Fricative Model, (PhD Dissertation), 7/12/99

 

Progress in advanced computer speech interfaces is limited in part due to incomplete knowledge of the physics of speech production.  Unvoiced speech sounds such as fricatives are an important example.  These sounds are produced by “turbulent” air motion in the vocal tract.  A proper understanding of how unvoiced sounds are produced is thus far lacking because the speech community has for the most part limited its physical picture of air motion in the vocal tract shape, lung pressure, and other speech parameters is not at all clear.

 

A considerable body of work has been produced on the subject of aeroacoustics, which is the study of the interaction between sound and non-acoustic air motions such as turbulence.  The purpose of this dissertation is to apply ideas from aeroacoustics and unsteady aerodynamics to produce a model of the aeroacoustic source associated with turbulent flow in the vocal tract.  Particular emphasis has been given to producing a model suitable for articulatory speech synthesis.  This requirement led to the development of reduced-complexity modeling of turbulent flow such that the computational requirements are not far in excess of those needed for existing transmission-line computations of speech signals.

 

The essential result from aeroacoustic theory incorporated into this work is that of Howe; his result relates the motion of vorticity through a duct of changing cross-section to the plane wave sound field generated by that motion.  This relation is used to compute the value of an acoustic pressure source in the duct.  The aeroacoustic theory implicitly incorporates the source spectrum, level, impedance, and spatial distribution, assuming the behavior of vorticity and the vocal tract shape are known.

 

Due to its complexity, obtaining detailed information about the vorticity distribution of any turbulent flow entails a high cost in time and resources, whether the approach is computational or experimental.  Fortunately, this problem has received enough attention that it is possible to parameterize the essential features of the vorticity field in the vocal tract into a jet model which requires a minimum of computational effort.  Such a jet model is presented here.  It prescribes the motion of vorticity based upon criteria which determine the location of jet formation (flow separation) in the vocal tract, the geometry of the location where the jet is formed, and the local airflow speed at the jet formation location.

 

The new jet model and aeroacoustic source description were incorporated into a transmission line model for duct acoustics.  The result is an engineering solution for a new fricative model which combines low-cost computation with judicious application of fundamental physics.  Two sets of validation studies were conducted to test the computational method.  The first synthesized the sound produced by steady airflow in a pipe with axial area variations.  The pipe geometry and jet speed were matched to those of a quiet aeroacoustic pipe flow facility.  The pressure spectrum measured at the pipe exit compared favorably to the pressure spectrum computed for the simulated system.  The second validation study tested the method by synthesizing unvoiced speech sounds, both in isolation and in vowel context.  The results show the strong potential for this approach to produce high quality unvoiced speech without the need to estimate source strength, spectra, or location for different vocal tract geometries.  That is, the synthesis of unvoiced sounds is gained automatically from the articulatory description.

 

TR-235 – M. Krishnamoorthy, Toward Robust Speech Recognition:  Speaker and Environmental Adaptation in the Linear Spectral Domain 1, 2000

 

There has been considerable success in speech recognition technology and there are commercial products available today.  Hidden Markov Models (HMMs) have been used predominantly in speech recognition technology.  These systems perform well in a quiet environment and when the speech signal is recorded using a close-talking microphone.  However, mismatches between the training and testing environments severely degrade the performance.  The mismatch can be due to inter-speaker variation, environmental variation, or both.  Speaker variation can arise from factors such as dialect differences and vocal tract lengths.  Environmental variation can arise from microphone variations, additive acoustic noise and channel noise.

 

Current recognition systems solve this problem using a technique called adaptation.  Conventional adaptation schemes work in the cepstral domain.  This thesis aims at developing an algorithm which can adapt the models/features in the linear spectral domain.  For this purpose, an adaptation technique called Linguistic Tree based Maximum Likelihood Linear Regression (LT-MLLR) is used to differentiate between the performance in the cepstral and linear spectral domains.  The advantages of using the linear features are explained along with experimental results.

 

A representative result addresses noisy speech with 20dB SNR, collected from a microphone array, at a distance of about 5.5 meters from the speaker, in a reverberant environment.  Automatic recognition of this signal yields an accuracy of about 8% on the clean speech models.  The linear domain adaptation technique developed here produced an accuracy of about 66%.

 

TR-234 -- J. Flanagan, D. Yuk, M. Krishnamoorthy, K. Dayanidhi, A Neural Network System for Robust Large-Vocabulary Continuous Speech Recognition in Variable Acoustic Environments, 1/15/99

 

Hidden Markov Models (HMM’s) have to date been accepted as an effective classification method for large vocabulary continuous speech recognition.  Most existing HMM-based recognition systems, such as the DARPA-sponsored SPHINX and DECIPHER, are designed to operate on “high-quality close-talking speech”.  They require consistency in sound capturing equipment and in acoustic environments between training and testing sessions, the so-called “matched” conditions.  When the testing condition differs from the training condition, the performance of these recognizers is typically degraded if they are not retrained to cope with new environments [Che et al, 1992, 1993].

 

Usually, a retraining of HMM-based recognizers is complex and time-consuming.  It requires collection of speech data again under corresponding conditions and reestimation of HMM’s parameters based on new speech material.  Particularly great time and effort are needed to retrain a recognizer which operates in a speaker-independent mode, which is the mode of greatest general interest.

 

The broad objective of this research is to explore the emerging microphone array technology for distant-talking speech recognition in practical reverberant, noisy environments.  For this purpose, a system of microphone array and neural network (MANN) has been developed as a robust front-end for speech recognition.  The MANN system has two synergistic components:  (1) Speech enhancement by microphone arrays and (2) Adaptation by neural network processors.  By using the MANN system, existing DARPA speech recognition systems can be directly deployed in adverse day-to-day applications.  That is, speech recognizers need not be retrained.  Furthermore, the distant-talking MANN system frees the user from the encumbrance of hand-held, body-worn, or tethered microphone equipment.  It thus enables deployment of DARPA speech recognition technology in hands-busy/eyes-busy and/or distant-talking applications.

 

The combined advantages of microphone arrays and neural network (NN) computing are used to expand the capabilities of DARPA speech recognition technology to application environments where users must not be encumbered by body-worn or hand-held microphones, and must have freedom of movement.  (Examples include Combat Information Centers, large group conferences, and mobile hands-busy eyes-busy maintenance tasks).

 

The approach allows the neural network to ‘learn’ the reverberant distortion and noise interference of the acoustic environment, and to transform speech-feature data (such as cepstrum coefficients) obtained from a distant-talking microphone array to those corresponding to a high-quality, close-talking microphone system.  The performance of the speech recognizer can therefore be elevated in the hostile acoustic environment without retraining the recognizer.

 

The neural network learns the characteristics of a specific unfavorable environment by adapting its weights through direct comparison of the distant-talking signal to a close-talking calibration signal.  Thereafter performance of the speech recognizer under distant-talking reverberant conditions can be elevated and made comparable to that of the close-talking, acoustically favorable condition.

 

In Part I, the initial work involved with the line array and the stereo data neural networks is described.  The main idea of the MANN system is explained and some pilot experimental results are shown.  In Part II, the Matched Filter Array (MFA) is discussed.  In Part III, the combination of the MFA and the Maximum Likelihood Linear Regression (MLLR) is described.  Finally, in Part IV, the Maximum Likelihood Neural Networks (MLNN) are explained.

 

TR-233 -- Piyush Modi, Discriminative Utterance Verification by Integrating Multiple Confidence Measures:  A Unified Training and Testing Approach (PhD Dissertation)

 

Robustness to acoustic and language variabilities is the most challenging problem facing automatic speech recognition systems today.  It is becoming increasingly important for these systems to assign a measure of “goodness” or a confidence measure (CM) to their output.  The use of these measures to validate a recognized hypothesis is usually referred to as utterance verification (UV).

 

In this thesis, we present a novel UV framework for exploiting the complementary properties of different sources of information and their integration in a unified system.  The proposed framework uses a single objective function that integrates multiple knowledge sources and also acts as a loss function to train the entire parameter set of the UV system.  A discriminative minimum verification error training algorithm is developed to optimize the parameters of both the objective function and the knowledge sources.  To demonstrate the utility of our framework we have developed a UV system that integrates two acoustic-based knowledge sources.  Experimental results on a connected digits task show that the UV with multiple confidence measures (UV-MCM) outperforms state-of-the-art systems that rely on each CM individually.

 

TR-232 -- J. L. Flanagan, Autodirective Sound Capture; Toward Smarter Conference Rooms

 

TR-231 -- S. Juth, Collaboration Components for Programming Real-Time Synchronous Groupware Applications, 10/98

 

It is a well-known fact that developing software applications can be very complex and difficult.  Developing multi-user applications is even more challenging since such systems typically require additional features such as session management, synchronizing users’ states, providing user awareness, etc.  In the past people have tried to ease this additional burden by using toolkits that support these multi-user requirements.  This thesis presents a different approach for building collaborative applications that takes advantage of the JavaBean specification from JavaSoft, Inc.  A suite of collaboration components has been implemented as JavaBeans in order to provide the typical features found in real-time, synchronous groupware applications.  By using the collaboration components, the multi-user application developer only has to focus on developing the features specific to the application while leaving the multi-user aspects (e.g. concurrency control, session management, awareness of remote users) to the components.  This research also provides a specification for developing collaboration-aware Beans that can be used with the components.

 

TR-230 -- M. J. A. Andre, Multimodal Human-Computer Interaction, 8/26/98

The influence of computers on people’s daily lives is increasing and the need for simpler interfaces to use computers is emerging.  Current human-machine communication systems predominantly use keyboard and mouse inputs which inadequately approximate human abilities for communication.  More natural communication technologies are capable of freeing computer users from the keyboard and mouse.

 

This work presents a prototype of a multimodal interface featuring fusion of multiple modalities for human-computer interaction.  The three modalities integrated are a speech recognizer and synthesizer, a tactile glove, and a gaze tracker.  The application used for this system is a collaborative whiteboard application extended to a military mission planning system.  The design and implementation of the whole system and the methods applied are described and preliminary results of the real-time multimodal fusion are analyzed.

 

TR-229 -- J. Ray, R. Samtaney and N.J. Zabusky, Shock-in Competition and A Model for Circulation Deposition in Shock Interactions with Heavy Prolate Cylinders, 8/31/98

 

We identify two different modes of interaction for planar shocks accelerating heavy prolate gaseous cylinders. These modes arise from different interactions of the incident and transmitted shocks on the leeward side of the cylinder and yield different vorticity deposition mechanisms. We model the net baroclinic circulation generated on the interface by both the shocks and validate the model via numerical simulations of the Euler equations. The principal parameters governing the interaction are the Mach number of the shock (M), the ratio of the density of the gas cylinder to the ambient gas density (η, η > 1), γ0, γb (the ratios of specific heats of the two gases), λ (the aspect ratio), and tT / tI (a time ratio which characterizes the mode of interaction). In the range 1.2 ≤ M ≤ 3.5, 1.54 ≤ η ≤ 5.04 and λ = 1.5 and 3.0, our model predicts within 10% of the simulation results.

 

TR-228 -- Samir Chennoukh, Daniel Sinder, James Flanagan, Voice Mimic System (Hyper-computing Design Project Final Report), 8/21/98

The research undertaken in this part of the HPCD project aims to advance fundamental understanding of human speech production and coalesces the problems of speech synthesis, speech recognition, and low bit-rate speech coding into a compact parametric framework.  The approach uses a computationally-intensive technique of speech analysis and synthesis to gain more understanding of the acoustics of speech.

 

The research aims to design a voice mimic system which can adapt parameters, moment by moment, for an articulatory model to duplicate an arbitrary speech input.  Using an articulatory speech synthesizer, the input speech is restored at the output of the system from the obtained set of articulatory parameters.  Articulatory synthesis has been studied using both linear-acoustic and fluid-dynamic models of speech generation.  Such simulation based on a physiological model of speech production requires a knowledge of the optimal number of geometrical, acoustical and mechanical parameters in order to account for the complexity of speech production.

 

Research in speech production collects vast amounts of data on the vocal system, its mechanics, the acoustic signal and the information which it encodes.  The analysis of this data should establish the characteristics of the speech production device in order to model it.  Due to the difficulty of obtaining articulatory data, techniques for estimating the vocal tract area function directly from the speech signal are of interest in studies of the speech production process and as the basis for efficient coding of the speech signal.  The problem of estimating the vocal tract shape from the acoustic speech signal is often referred to as the inverse problem.  This is a difficult problem because of non-uniqueness of the acoustic-geometry relation.  The problem is far from solved.  An overview of the state of the art is given in Chapter 2 regarding modeling of the vocal tract and the knowledge required to estimate the vocal tract shape from the speech signal.  Although brief, the overview relates clearly the main components of our research.

 

Chapter 3 illustrates our efforts to solve the inverse problem by an optimization procedure using an articulatory codebook.  A codebook is used to obtain the first estimate of the vocal tract shape that may produce a given combination of acoustic parameters.  It must be designed such that it spans the natural articulatory space of a speaker.  Furthermore, sampling of the space must be fine enough so that an acoustic entry always exists very close to the global optimum.  Such codebooks require a large set of matching pairs of vocal tract and acoustic parameters.  However, as the codebook size increases, searching it becomes increasingly time consuming.  This chapter is devoted to the different techniques and algorithms developed to access such large codebooks and to solve the non-uniqueness of the articulatory trajectories which follow from the non-uniqueness of the acoustic-geometric relation.
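As a rough illustration of the codebook lookup described above, the sketch below does an exhaustive nearest-neighbour search over (acoustic, articulatory) pairs. It is not the access technique developed in the report, which is not detailed in this abstract; the entry layout, the formant values, and the two-element articulatory vectors are hypothetical.

```python
def nearest_entry(codebook, acoustic_target):
    """Return the codebook entry whose acoustic vector is closest to the target."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    # Exhaustive search: the cost grows linearly with the codebook size.
    return min(codebook, key=lambda entry: sq_dist(entry["acoustic"], acoustic_target))

# Hypothetical entries: three formant frequencies in Hz paired with made-up
# articulatory parameter vectors (values are for illustration only).
codebook = [
    {"acoustic": (300.0, 2300.0, 3000.0), "articulatory": (0.1, 0.9)},
    {"acoustic": (700.0, 1200.0, 2500.0), "articulatory": (0.8, 0.3)},
    {"acoustic": (350.0, 800.0, 2400.0), "articulatory": (0.2, 0.1)},
]

# First estimate of the vocal tract shape for a measured set of formants.
first_estimate = nearest_entry(codebook, (680.0, 1250.0, 2450.0))
```

The linear scan makes the trade-off in the text concrete: a finer sampling of the articulatory space gives a better first estimate but a proportionally slower search, which is the motivation for faster access schemes.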

 

Chapter 4 proposes another concept for solving the inverse problem.  It consists of developing a unique and continuous acoustic-to-articulatory mapping that will uncover the information encoded in the signal about the vowel, the consonant and the vowel-consonant coarticulation.  Obviously, the concept requires more understanding of the acoustic-articulatory relation, an understanding which is still far from complete.  However, this approach has proven to be a fast, efficient and robust method for acoustic-to-articulatory mapping.  The concept consists of a mapping from vocal tract shape to formant frequencies whose relationship is kept linear, in the sense that one gesture describing the variation of the vocal tract shape gives a rectilinear formant trajectory.  Thus, the estimation of the model shape from the formant frequencies according to this concept becomes a simple interpolation mapping.

 

The limiting factor in the quality of voice mimic systems is the accuracy with which articulatory speech synthesizers model fricative production.  Since the physical process of fricative production is not well understood, the problem of obtaining an articulatory description from an acoustic signal is especially difficult for these sounds.  Computational studies of speech production using fluid dynamics have the potential to provide much insight into fricative production.  In Chapter 5, numerical simulation of flows in idealized vocal tract configurations is described, as well as physical experiments in the same geometries.  Source terms from aero-acoustic theory are compared with experimental results to identify which terms (or combination of terms) most accurately reflect the noise sources in the vocal tract.  An understanding of the proper source terms will allow the development of reduced models which can be implemented in traditional synthesizers.  The improved synthesis models will ultimately improve the performance of the voice mimic system.  Chapters 6 and 7 give a summary and conclusions, respectively, of the present research.

 

TR-227 -- Prabhu Raghaven, Speaker & Environment Adaptation in Continuous Speech Recognition, 6/24/98

 

Hidden Markov Models (HMMs) have been used with considerable success in continuous speech recognition.  It is well known that high accuracy can be obtained when the HMM system is trained and tested in a quiet environment and the speech signal is acquired from a close-talking microphone.  However, mismatches between training and testing environment severely degrade performance.  Two major sources of mismatches are speaker and environment variability.  Speaker variation is typically caused by different speaking styles and other physiological differences between speakers, such as vocal tract lengths, etc.  Environment variability includes channel distortion, such as that which affects telephone speech, additive noise, and reverberation which results when the microphone is far away from the speaker.

 

The goal of this report is to explore different adaptation algorithms that mitigate the effects of speaker and environmental variability for speech recognition.  The adaptation algorithm closely examined in this report is a Linguistic Tree based Maximum Likelihood Linear Regression (LT-MLLR).  Speech Recognition experiments using the LT-MLLR for speaker and environment adaptation are given.

 

It is shown that the LT-MLLR algorithm is superior to other adaptation algorithms discussed.  For speaker adaptation, a 30% reduction is achieved over the baseline word error rate (WER) using this algorithm.  In addition it is shown that the use of Matched Filter Array Processing (MFA) with LT-MLLR reduces the WER of distant-talking speech with high reverberation.  In the case when the reverberation time is as high as 0.9s, the WER is reduced from 57.89% to 19.41%, a reduction of 66.47%.  [This work was supported by DARPA Contract DABT63-93-C-0037.]
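The 66.47% figure quoted above is a relative reduction, not a difference in percentage points. A short check of the arithmetic, using only the two WER values given in the abstract (the function name is illustrative):

```python
def relative_reduction(baseline, adapted):
    """Relative reduction, in percent, of an error rate with respect to a baseline."""
    return 100.0 * (baseline - adapted) / baseline

# WER values from the abstract: 57.89% before and 19.41% after adaptation.
reduction = relative_reduction(57.89, 19.41)
print(f"{reduction:.2f}%")
```

The absolute drop is 57.89 - 19.41 = 38.48 percentage points, and dividing by the baseline of 57.89 gives the relative figure.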

 

TR-226 -- Xin Wang and Deborah Silver, Visualizing Time-Varying Features

 

Visualization of time-varying 3D data is difficult because of the immense amount of high dimensional data to process and assimilate.  Feature extraction and tracking techniques can greatly reduce the data size and complexity, and thus help scientists identify and quantify important regions and events.  In this paper, we propose a feature-based framework to visualize time varying datasets, and discuss some visualization approaches to enhance the scientists’ ability to grasp 3D time varying patterns of isolated features.  These techniques utilize tracking data which has been computed over the dataset.  The turbulent data set is used to demonstrate the ideas.

 

TR-225 -- George Patounakis, Mourad Bouzit, and Greg Burdea, Study of the Electromechanical Bandwidth of the Rutgers Master, 29 May 1998

 

The Rutgers Master II's (RMII) current mechanical bandwidth is about 5Hz for a 16 psi to 64 psi pressure change.  This is not good enough to render haptic environments.  The goal of this study is to see what factors influence the bandwidth of the RMII.  The experiments involved using higher pressure inputs to the RMII, changing the length of the tube leading from the RMII interface box to the piston, changing the cross section of the tube from the RMII to the piston, checking for correlation between the readings from the force sensor on the piston shaft and the pressure sensor on the pressure regulator, and varying the air flow.  The above tests were done using the Buzmatics SPCJR pneumatic servo controller.  Subsequently, the bandwidth was measured for the MATRIX 751 valve from Amatrix Corporation, by changing the number of microvalves controlled in parallel.

 

TR-224 -- Chewei Che, Automatic Speaker Recognition System for Telephone Speech, 5/19/98

 

It is well known that high accuracy can be obtained for speaker recognition systems operated under quiet laboratory environments.  The performance of the system degrades when the application is constrained to be text-independent using telephone conversational speech.  Factors such as a mismatch of training/testing handset, and inherent variation of speech from different talkers contribute to the level of difficulty of the task.  This report is aimed at developing robust speaker recognition systems to combat those variabilities and still maintain high performance.  The general framework of the system is based on modeling the speaker with statistical analysis of the speech signal.  Two systems developed by the author are presented.  The first system uses concatenated phoneme Hidden Markov Models (HMMs) and is operated in a text-prompted mode.  The system has been evaluated with the YOHO voice verification corpus in terms of both speaker verification and closed-set speaker identification.  It is shown that by using 10 seconds of testing speech, error rates of 0.09% for male and 0.29% for female talkers are obtained for speaker identification with a total population of 138 talkers.  For speaker verification, under the 0% false rejection condition, the system achieves a state-of-the-art false acceptance rate of 0.09% for male and 0% for female talkers.  The second system utilizes a Vector Quantization based Gaussian Speaker Model (VQGSM) and is operated with no context constraint.  The system was evaluated using the Switchboard corpus and yields competitive speaker recognition accuracy.

 

TR-223 -- V. George Popescu and Greg Burdea, Research on Full Body Modeling and Force Feedback, 4/24/98

 

This report presents the guidelines of a full-body haptic suit design and simulation system which integrates a real-time hand sensing and force-feedback device (RMII) with full-body modeling software (JACK).  The RMII system and the Polhemus Fastrak sensor were initially integrated with the JACK full-body simulation software.  The system is evaluated in terms of bandwidth and increase in simulation realism after the addition of the haptic component.  Next the concept of a Virtual Human Agent enhanced with haptic feedback is presented.  The proposed concept is illustrated by a simulation developed using the RMII system, JACK software and a speech recognition engine.  The conceptual design of a full-body haptic suit is explored subsequently.  Solutions provided by state-of-the-art technology are evaluated.  New ideas are also explored to meet the demands of the haptic suit design.  The design of a new force-feedback device for the elbow concludes the report.

 

TR-222 -- A. Sharma, Real-Time Systems Support for Multimedia Applications, May 1998

 

New multimedia applications, e.g. teleconferencing, video-on-demand, Internet telephony, and many more are revolutionizing every aspect of human life.  These applications demand real-time performance, better digital signal processing algorithms, elaborate multimedia communication protocols, high-speed networks, and powerful platforms.

 

Traditionally, the embedded software to support multimedia applications, e.g. digital signal processing tasks, network switching and routing, is developed in a custom way.  The software is fine-tuned in several iterations to get the needed real-time performance.  This approach has several serious drawbacks:  tedious design, which becomes even more difficult for multiprocessors; lack of flexibility, as a minor change in specification may necessitate a total redesign; and slower response to asynchronous events, leading to a degradation in the system’s real-time support.  In contrast, a kernel or operating system approach allows more flexible software development and real-time support.  In the presence of a kernel layer, multimedia application writers need not worry about task coordination or careful timing of events.  Since the underlying software to support multimedia applications is still of an embedded nature, a complex general-purpose operating system like Unix would be inappropriate.  We architect a lightweight kernel layer that enhances real-time support for multimedia applications.

 

One crucial requirement for real-time support is to increase the predictability of the system.  So, while architecting the different subsystems of the layer, we have attempted to increase the predictability of the system.  For scheduling, we separate the unpredictable external events from the synchronous computation or communication.  Having one scheduling scheme for all three makes a system more unpredictable than it needs to be.  This resulted in a novel heterogeneous scheduling scheme with functional separation of control flow, data flow, and computation, in contrast with conventional spatial hierarchical scheduling schemes.  When designing the real-time communication and synchronization primitives, we used profile information to provide hints via constructs like real-time semaphores to increase the predictability of the system.  Being in the embedded domain, we do not emphasize real-time support for memory management, though we do provide special notions of region-based management and callback routines.  We made our kernel customizable with a service table that allows a custom set of scheduling, synchronization, or memory management schemes to be plugged in.  A theoretical model for the scheduling scheme is provided, with two algorithms to operate on it.  A comparative empirical study was performed on our scheduling scheme and related alternative schemes.  The study found that two hybrid schemes, with a good mix of the basic schemes, provide a good combination of scalability, resiliency to load, and functional separation.
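The service-table idea mentioned above can be sketched as a dictionary of replaceable policy functions. This is only an illustration of the plug-in mechanism: the class, method, and task names are hypothetical, and the two policies shown (first-in first-out and fixed priority) are generic textbook schemes, not the heterogeneous or hybrid schemes studied in the report.

```python
from collections import deque

def fifo_scheduler(ready):
    """Pick the task that has waited longest."""
    return ready.popleft()

def priority_scheduler(ready):
    """Pick the task with the smallest priority number."""
    task = min(ready, key=lambda t: t["priority"])
    ready.remove(task)
    return task

class Kernel:
    """Toy kernel layer whose policies live in a replaceable service table."""

    def __init__(self):
        self.services = {"schedule": fifo_scheduler}  # default policy
        self.ready = deque()

    def plug_in(self, name, implementation):
        """Replace one entry of the service table with a custom scheme."""
        self.services[name] = implementation

    def add_task(self, name, priority):
        self.ready.append({"name": name, "priority": priority})

    def run(self):
        """Dispatch every ready task and return the order in which they ran."""
        order = []
        while self.ready:
            order.append(self.services["schedule"](self.ready)["name"])
        return order

def demo(scheduler=None):
    kernel = Kernel()
    if scheduler is not None:
        kernel.plug_in("schedule", scheduler)
    for name, priority in [("audio", 2), ("video", 1), ("net", 3)]:
        kernel.add_task(name, priority)
    return kernel.run()

fifo_order = demo()
priority_order = demo(priority_scheduler)
```

Because the dispatch loop only looks up `services["schedule"]`, swapping the policy changes the run order without touching the rest of the kernel, which is the kind of customization the service table is meant to allow.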

 

TR-221 -- Daniel V. Rabinkin, Optimum Sensor Placement for Microphone Arrays,

 

Microphone arrays can be used for high-quality sound pickup in reverberant and noisy environments.  Sound capture using conventional single microphone methods suffers severe degradation under these conditions.  The beamforming capabilities of microphone array systems allow highly directional sound capture, providing enhanced signal-to-noise ratio (SNR) when compared to single microphone performance.
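The SNR benefit of beamforming can be illustrated with a delay-and-sum sketch, the simplest form of array processing. It assumes the per-microphone arrival delays are known whole numbers of samples and that the noise is independent from microphone to microphone; the signal, the delays, and the noise level are synthetic and chosen only for illustration.

```python
import math
import random

def delay_and_sum(channels, delays):
    """Advance each channel by its integer sample delay, then average."""
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[n + d] for ch, d in zip(channels, delays)) / len(channels)
        for n in range(length)
    ]

def power(x):
    """Mean squared value of a sequence."""
    return sum(v * v for v in x) / len(x)

rng = random.Random(1)
n_samples, delays = 2000, [0, 3, 6, 9]  # hypothetical arrival delays in samples
source = [math.sin(2 * math.pi * 0.01 * n) for n in range(n_samples)]

clean, noisy = [], []
for d in delays:
    delayed = [0.0] * d + source[: n_samples - d]  # source as seen at this microphone
    clean.append(delayed)
    noisy.append([s + rng.gauss(0.0, 1.0) for s in delayed])

aligned_clean = delay_and_sum(clean, delays)
output = delay_and_sum(noisy, delays)

# Residual noise power at a single microphone versus at the array output.
single_noise = power([noisy[0][n] - clean[0][n] for n in range(len(output))])
array_noise = power([o - c for o, c in zip(output, aligned_clean)])
```

After alignment the source adds coherently while the independent noise averages down, so with four microphones the residual noise power is expected to fall to roughly a quarter of the single-microphone value.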

 

The overall performance of an array system is governed by its ability to locate and track sound sources and its ability to capture sound from desired spatial volumes.  These abilities are strongly affected by the spatial placement of microphone sensors.  A met