TR-274
Towards Robust Speech
Recognition: Linear and Cepstral Domain Adaptation
Hidden Markov Models (HMMs) have been used with considerable success in
speech recognition technology. These systems perform well when the system is
used in conditions similar to the one used to train the acoustic models.
However, mismatches severely degrade the performance. The mismatch can be
because of inter speaker variation or environmental variation or both.
Speaker variation can be because of factors such as dialect differences and
vocal tract lengths. Environmental variation can be because of microphone
variations, channel noise, additive acoustic noise and reverberation. This
thesis aims at developing an algorithm that can adapt the acoustic models to
a new environment and/or speaker, to improve the performance of the speech
recognizer under mismatched conditions. An algorithm that adapts the
acoustic models in the linear spectral domain, called Linear Domain Linear
Regression (LDLR) is proposed. This algorithm is compared with other
adaptation techniques such as Maximum Likelihood Linear Regression (MLLR)
and Maximum A-Posteriori (MAP). A hybrid algorithm that uses LDLR in
conjunction with MLLR is also proposed. Experiments show that under
mismatched conditions, recognition word error rate decreases when using LDLR
adaptation. The hybrid technique that uses both MLLR and LDLR does better
than any other single form of adaptation. For the 20 dB additive noise case,
the hybrid adaptation technique reduces the word error rate by 65% as
compared to the unadapted models. With reverberation present along with the
additive noise, the word error rate reduces by 53%.
TR-273
Supplementary Features
for Improving Automatic Speech Recognition
Most of the state of
the art Automatic Speech Recognition (ASR) systems use acoustic features and
their first order derivatives as input and are modeled using Hidden Markov
Models (HMM). In this research we have used the Hidden Markov Model Tool
Kit (HTK) version 3.1 to build a phone based speech recognizer. We compare
the performance of the ASR system for context independent and context
dependent phonemes which are modeled using multiple Gaussian mixtures. It
was observed that the system performance improved with increase in mixtures
when enough training data was available. Best performance was observed when
we used 5 mixtures to model tri-phones with the whole TIMIT data set
(recorded at Texas Instruments (TI) and transcribed at the Massachusetts
Institute of Technology (MIT)) used for training.
In addition to
cepstral parameters, we also investigate three other Supplementary Features
(SF’s): Periodicity, Zero Crossing Rate (ZCR) and Ratio of low frequency
energy to total energy. We demonstrate that these SF’s improve accuracy of
ASR. Various combinations of SF and Mel-Frequency Cepstral Coefficients (MFCC)
along with their first order derivatives were studied and compared with the
performance of the standard MFCC based systems. Results suggest that for
optimal recognition performance a combination of SF and MFCC is more
advantageous than either of them used individually. We observe further
improvements in accuracy and noise robustness in the ASR when the last four
MFCCs and their corresponding first derivatives were replaced by the SF’s
and their derivatives respectively.
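Two of the supplementary features named above can be sketched as follows. This is a minimal illustration, not the report's implementation: the function names, the naive DFT, and the `cutoff_hz` parameter are assumptions, since the abstract does not state how the features were computed or where the low-frequency band ends.

```python
import cmath
import math


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def low_freq_energy_ratio(frame, sample_rate, cutoff_hz):
    """Energy below cutoff_hz divided by total energy (naive O(n^2) DFT).

    cutoff_hz is a hypothetical parameter; the abstract gives no value.
    """
    n = len(frame)
    low = 0.0
    total = 0.0
    for k in range(n):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(frame)
        )
        power = abs(coeff) ** 2
        total += power
        # Bins above n/2 mirror the negative frequencies of a real signal.
        freq = min(k, n - k) * sample_rate / n
        if freq < cutoff_hz:
            low += power
    return low / total if total > 0 else 0.0
```

A voiced, low-pitched frame would be expected to give a small zero crossing rate and an energy ratio near one, while a noisy or fricative frame would show the opposite pattern.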
TR-272
Statistical Modeling
Of User Input In a Multimodal Speech and Graphics Environment:
In a communication
act, whether it be ‘between humans’ or ‘between a computer system and a
user’, where multiple modalities are used by the human to convey
information, there exists a pattern in which these modalities are integrated
by the user to arrive at a combined meaning. This research work attempts to
study this pattern of integration and develop a computational model of the
pattern that can be used by multimodal systems to gain knowledge of the user’s
behavior. Having this model, the multimodal systems can more accurately
integrate the continuous data streams from the user’s different input
channels to interpret the combined meaning.
The multimodal system
developed in this research is capable of responding to speech, touch,
pen-tablet and mouse inputs of the user. We build a multimodal model that
determines the temporal synchrony between these input modalities during a
multimodal interaction and integrate it with the system to reduce the search
region in the gesture which is required to resolve the ambiguities in the
user’s spoken utterance.
TR-271
A Peer-to-Peer
Approach to Web Service Discovery:
Web
Services are emerging as a dominant paradigm for constructing and composing
distributed business applications and enabling enterprise-wide
interoperability. A critical factor to the overall utility of Web Services
is a scalable, flexible and robust discovery mechanism. This paper presents a
peer-to-peer (P2P) indexing system and associated P2P storage that supports
large-scale, decentralized, real-time search capabilities. The presented
system supports complex queries containing partial keywords and wildcards.
Furthermore, it guarantees that all existing data elements matching a query
will be found with bounded costs in terms of number of messages and number
of nodes involved. The key innovation is a dimension reducing indexing
scheme that effectively maps the multidimensional information space to
physical peers. The design and an experimental evaluation of the system are
presented.
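The idea of a dimension-reducing index can be illustrated with a short sketch. The abstract does not name the mapping it uses, so the Z-order (bit-interleaving) curve below is a swapped-in stand-in chosen for brevity, and `interleave_bits` and `peer_for_index` are hypothetical helpers rather than part of the described system.

```python
def interleave_bits(x, y, bits):
    """Map a 2-D coordinate to a 1-D Z-order index by interleaving bits."""
    index = 0
    for i in range(bits):
        index |= ((x >> i) & 1) << (2 * i)       # x fills even bit positions
        index |= ((y >> i) & 1) << (2 * i + 1)   # y fills odd bit positions
    return index


def peer_for_index(index, num_peers, bits):
    """Assign a 1-D index to a peer by splitting the index range evenly."""
    index_space = 1 << (2 * bits)
    return index * num_peers // index_space


# Two keyword coordinates (assumed already mapped to integers) become
# one index, which in turn selects the responsible peer.
idx = interleave_bits(3, 0, bits=2)
peer = peer_for_index(idx, num_peers=4, bits=2)
```

Because nearby points in the two-dimensional space tend to receive nearby indices, a query over a region of the keyword space touches a limited set of index ranges, and therefore a limited set of peers.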
TR-270
An Ink and Gesture
Based Annotation Architecture for the Internet:
Significant strides
have occurred in document technology like the standardization of object
interfaces and event models to access and manipulate the properties of
digital documents. There has also been considerable progress in pen based
computing for recognition of digital ink in desktops and handheld devices.
With the advent of powerful tablet PCs, new applications for pen and ink
seem to be mushrooming every day [18]. There has been research on
annotation systems for the web right from the start of the Internet and web
documents resulting in several complex architectures for annotating and
personalizing web pages [1, 5, 6, 9, 11, 13]. Annotations are the digital
form of marks that are added to documents in a paper-based environment, for
instance, highlighted text, text notes and ink marks on paper margins.
The above-mentioned
advances in document technology have necessitated further research on
meta-data markup or annotation architectures for digital documents,
specifically pen-based annotation systems. This thesis presents an attempt
to leverage the new standards of Dynamic HTML and the Document Object Model
(DOM, proposed by the World Wide Web consortium or W3C) that are being
gradually implemented by popular browsers, to build a prototype of an ink
annotation system with common components across browsers. The primary goals
in this study are to provide users with a standard tool to annotate web
pages with freeform ink and semantically link the user drawn ink annotations
with underlying document elements like text and images. Further, it
provides recognition techniques for ink gestures that can help in modifying
the rendered ink for annotation operations such as editing, resizing and
association. The main components of the system are the ink capture and
dynamic rendering module, an ink understanding module that recognizes and
associates ink with the underlying document and annotation storage and
retrieval modules.
TR-269
AutoMate: Enabling
Autonomic Grid Applications
The increasing
complexity, heterogeneity and dynamism of networks, systems, services and
applications have made our computational/information infrastructure brittle,
unmanageable and insecure. This has necessitated the investigation of a new
paradigm for design, development and deployment based on strategies used by
biological systems to deal with complexity, heterogeneity, and uncertainty,
i.e. autonomic computing. This paper introduces the AutoMate project and
describes its key components. The overall objective of AutoMate is to
investigate key technologies to enable the development of autonomic Grid
applications that are context aware and are capable of self-configuring,
self-composing, self-optimizing and self-adapting. Specifically, it will
investigate the definition of autonomic components, the development of
autonomic applications as dynamic composition of autonomic components, and
the design of key enhancements to existing Grid middleware and runtime
services to support these applications.
TR-268
Integrating Grid
Services using the DISCOVER Middleware
The growth of the Internet and the advent of the computational Grid have
made it possible to develop and deploy advanced services on the Grid. These
services build on high-end computational resources, communication
technologies and enable seamless and collaborative access to resources,
applications and data. However, problem solving on the Grid requires
combining these services in a seamless manner, and getting existing
services to interoperate presents many challenges, as these services
typically have customized architectures and implementations, and build on
different enabling technologies. This paper presents the design and
implementation of a DISCOVER middleware substrate for integrating Grid
services, and describes the integration of Globus and DISCOVER services
using this middleware. An experimental evaluation of the middleware
substrate is presented.
TR-267
Semantic
Consistency Optimization in Heterogeneous Virtual Environments
Abstract
Collaborative virtual environments with heterogeneous computing resources
and user preferences often reduce data fidelity to accommodate such
heterogeneity. Given the resource limitations and user preferences, the
problem is to optimize the fidelity degradation so as to achieve maximum
semantic consistency across the different data representations. Consistency
maximization can be formulated as an integer-programming problem, wherein
constraints are resource limitations and user preferences. We consider
several formulations of the problem, some of which do not enforce
topological constraints in degraded representation, while others do. The
solutions to this problem result in reduced amounts of distributed data
which conserve network bandwidth and other system resources. Experimental
results and proposed topics for further research are also presented.
TR-266
Scalable Keyword Searches with Guarantees in Peer-to-Peer
Storage Systems
Abstract
The ability to support complex keyword searches is important
for any information storage and sharing system. While peer-to-peer storage
systems are gaining popularity, existing systems either support keyword
lookup without any guarantees or do not allow keyword searches at all. In
the former systems, the cost of a query is not bounded and existing matches in
the system may not be found. The latter systems (data lookup systems) do
provide guarantees and bounds, but do not allow keyword searches. These
systems only support information lookup by name. In this paper, we present
an innovative approach to building a Peer-to-Peer storage system that
provides the flexibility of keyword search systems while providing the
guarantees and bounds of data lookup systems. The system guarantees that
all existing data elements that match a query will be found with reasonable
costs in terms of number of messages and number of nodes involved. Complex
queries containing partial keywords and wildcards are supported. An
experimental evaluation of the system is presented.
TR-265
Who’s in Charge Here? Communicating Across Unequal Computer
Platforms, Ivan Marsic, Maria Velez, Marilyn Tremaine, Bogdan Dorohonceanu,
Allan Krebs, Aleksandra Sarcevic
Abstract
Personal data
assistants are being used in the field to collect data and to communicate
with others both in the field and in the office. The individual in the
office invariably has a laptop or a high-end personal workstation and thus,
significantly more compute power, more screen real estate and higher volume
input devices such as a mouse and keyboard. It is therefore useful to know
what impact these differences have on work communications. Four different
platform combinations involving a PC and a PDA were used to examine the
effect of communicating via heterogeneous computer platforms. The PC
platform used a mouse, keyboard and 3-dimensional screen display. The PDA
platform used a stylus, soft buttons and a 2-dimensional screen display. A
variation of the Tetris wall-building game called Slow Tetris was used as
the subjects' task. An in-depth analysis of the communication exchanges
found that subjects with mixed platforms had the most communication
problems. Additionally, control of the task stayed with the person having
the richer display, and collaboration and mitigating politeness statements
were most often exchanged between partners when the direction-giving authority was
given to the person with the inferior display. Tasks directed by persons
with PCs were carried out significantly faster than tasks directed by
persons with PDAs. The slow task performance may have been one of the
reasons for the authority switches observed.
TR-264
Ivan Marsic, Xiaodong Sun, Carlos Correa, and Tongyin Liu,
Maintaining State Consistency Across Heterogeneous Collaborative
Applications, March 2002
Abstract—The
proliferation of computer devices and wireless networks allows the users to
access information and collaborate with others from anywhere, using the
device that best matches the current context. Device capabilities constrain
the application implementation, which implies that the user interface and
the shared information will differ across devices. Some shared information
may be omitted or abstracted to fit the device capabilities and, as a
result, the application state differs on different devices. Therefore, there
arises a problem of consistency management of the application state across
different platforms. Application state is determined by the data structures
that represent the application’s data and its user interface and we assume
that the application state can be represented as a graph data structure. We
present a set of conditions on the rules relating the states in different
implementations of an application and derive an algorithm which maintains
the state consistency under the user interactions. We also illustrate an
application scenario that benefits from heterogeneous state representations.
TR-263
Liang Cheng, Network Awareness for Heterogeneous Data
Networks, March, 2002 (Ph.D)
Abstract
- Network awareness, which is defined as the capability of network devices
and applications to be aware of network characteristics, is the basis for
network quality-of-service (QoS) provisions and network management. The
necessity of network awareness in heterogeneous data networks is illustrated
by several experimental studies, such as multimedia collaboration, QoS
provisioning, and cluster computing in heterogeneous data networks.
Existing techniques of network awareness are studied in
three research areas: link-type awareness, link-bandwidth awareness, and
service awareness. Considering the limitations and/or unsuitability in
their applications to various heterogeneous data networks, original
techniques in these areas are presented: (i) fuzzy reasoning for
wireless awareness; (ii) accurate bandwidth measurement in digital
subscriber line (xDSL) networks; and (iii) service awareness in
mobile ad hoc networks (MANET).
A novel piecewise framework for network awareness service
(NAS) for efficient integration of various network awareness techniques in
heterogeneous data networks is presented. Analytical results on the
performance of the NAS framework demonstrate that it has significant
advantages over traditional unitary frameworks in terms of reducing wireless
bandwidth consumption and saving battery energy of mobile devices. An
original study of statistical properties of session throughputs in wireless
local area networks exemplifies the feasibility of applying predictions in
the NAS framework.
TR-262
Ashutosh Morde, An Application for Voice Controlled Driving
Directions, February, 2002 (masters thesis)
Abstract - Access to online information is crucial for people on the move.
This information can be provided to the user through the phone. Voice as an
access mechanism is direct and allows additional tasks to be performed while
the hands or eyes are busy. PhoneBrowser is a tool that allows the user to
browse the web by speech control over an ordinary telephone.
An
application for voice controlled driving directions was developed using the
PhoneBrowser. The user is guided through the process of providing the
origin and destination address to the system, which then queries an online
database to access the driving directions. The user has control over the
order and rate at which the information is provided to him – he can ask the
system to go to step n or repeat a previous step. Early experiments with
students identified the problems with the temporal attributes of the
synthesized speech. It was difficult for the students to recollect steps
simultaneously with the telephone communications; the application was
extended to display turn-by-turn maps and driving directions on an Ipaq.
The information displayed on the Ipaq is synchronized to that requested by
the user over the phone. A framework for collaborative browsing, which can
be extended for multimodal interfaces using the PhoneBrowser, was also
established.
TR-261
Shashank Sathyanarayana, A Method for Electronic Mail
Dictation & Retrieval Over the Telephone, July 2001, (masters thesis)
Speech is the preferred medium of communication between humans. Of late,
speech technology is becoming increasingly important in the computing world
as it is used to improve existing user interfaces and to support new means of
human interaction with computers. One such use of speech technology is the
ability to browse the web over the phone, known as voice browsing. Apart
from regular browsing, voice-browsing technology can be extended to provide
anytime accessibility to traditionally deskbound applications. This thesis
discusses a real-time implementation of integrating voice-browsing
technology with a traditional desktop application, viz. electronic mail.
The technologies involved in sending and receiving electronic mail over the
telephone are discussed. At its simplest, the system consists of a speech
recognizer to transcribe to perform transcription are examined. Experiments
are performed to measure the confusability existing in the grammar. Words
that are most prone to misrecognition in the grammar are noted. A mechanism
for correcting dictation misrecognitions using this knowledge is explored.
Usability studies are conducted on the application. The study and the user
responses are discussed.
TR-260,
Christopher Alvino, Acoustical Source Location of Multiple Talkers in
Reverberant Environments, April 2001, (masters thesis)
Acoustical source location is a topic
of interest in the fields of signal processing and acoustics. The theory of
acoustical location of a single source is well developed. Unfortunately,
the reverberation that exists in closed environments severely degrades the
accuracy of the source location estimates. The focus of this thesis will be
on performing acoustical source location of multiple moving talkers in
reverberant environments. An outlier algorithm designed to combat the
negative effects of reverberation is validated and tested experimentally. A
framework is presented for a speech/non-speech detector to be used in a
teleconferencing environment such that audio and visual sensors are not
aimed at the locations of non-speech sounds. In the last two chapters, this
thesis explains the theory of estimating the locations of two or more
simultaneous sources. Experimental results of this algorithm are
described along with the results of other suggested methods. Finally,
suggestions are made on how this work can be continued and improved.
TR-259,
Rares F.
Boian and Grigore C. Burdea, WorldToolKit vs. Java 3D: A Performance
Comparison,
4/26/01
This report compares the performance of WorldToolKit (releases 8 and 9) and
Java3D (versions 1.1.3 and 1.2) running a VR simulation on a dual-processor
450 MHz PC. The simulation was designed to run in several configurations
having different interaction levels (no interaction, fly-through and haptic
interaction), rendering modes (wireframe, Gouraud and textured), graphics
modes (mono and stereo) and scene complexities (5,000-50,000 polygons).
Results show that overall Java3D is faster than WTK in terms of frame
refresh rates. However, WTK has a more uniform frame rendering time, which
results in more predictable visual feedback.
TR-258,
Daniel Nagy, Online Language Acquisition in Multimodal Environment,
9/12/00
As a test vehicle we picked the
Information Kiosk task. It is very common to place self-service information
kiosks in areas where large numbers of visitors are expected, such as
international airports, railroad terminals, museums and famous research
centers. Kiosks at transportation hubs provide visitors with travel
information about the city or country they are located in; in museums and
research centers they can act as virtual tour guides (a very nice and
impressive example is the kiosk network at the State Hermitage Museum in St.
Petersburg, Russia by IBM).
The kiosks are aimed at random users; therefore the designers are motivated
to equip the terminals with natural user interfaces, since the users are not
expected to know any particular command language. If typed or spoken input
is considered at all, it has to deal with unconstrained natural language.
Redundant multimodal input is very useful, for many reasons. Different
information can be best expressed in different modalities: it is more
convenient to express spatial information by pointing, while pointing is not
very well suited for selection from a very large number of named entities.
Some circumstances may prevent the user from using one modality or another;
users may have disabilities, one cannot talk in a noisy environment, etc.
Finally, such kiosks are devices for both information access and
entertainment (infotainment: a term widely used by post-modern
philosophers), and the ability of the machine to respond to different input
stimuli increases its entertaining capability.
TR-257, George
V. Popescu, Design & Performance Analysis of a Virtual Reality-Based
Telerehabilitation System, 1/2001
In recent years the area of medical VR applications has continuously
expanded, addressing new domains such as home healthcare, clinical
neuropsychology, and rehabilitation. The research presented here explores
the use of Virtual Reality (VR) for telerehabilitation applications. A
prototype platform for VR-based telerehabilitation was defined first. The
main component of the platform is the hand force feedback unit. A
programming library, the Rutgers Haptic Library, was developed for
modeling hand haptic interactions. The software was used to build real-time
VR simulations that involve elastic and plastic deformations and physical
modeling.
The VR-based platform was the basic
component of the telerehabilitation architectures we developed. These
architectures use Virtual Reality as an advanced interface for therapy as
well as to enable communication between the therapist at the clinic/hospital
and the remote patient or group of patients. The first prototype supports
offline interaction between the therapist and the VR-enabled patient site.
This “store and forward” system uses a Client/Server architecture. The
client (patient home) runs VR rehabilitation exercises with force feedback
and collects patient data. The exercises simulate physical and functional
rehabilitation routines. Patient data are forwarded to the server (clinic
site), which stores medical records and runs data analysis software. System
performance over several types of connections was measured in laboratory
experiments. The guidelines extracted from these experiments help in sizing
the system in terms of recorded data and number of concurrent users.
Clinical trials were conducted at the Stanford University Medical School.
Data collected during these trials indicate that patients’ level of effort
and grasping strength increased after using the VR-based rehabilitation
system.
“Store and forward” systems are insufficient for implementing the whole
range of potential telerehabilitation services. The second architecture
developed in this thesis uses a Shared Virtual Environment to enable
real-time patient-therapist interaction. The prototype system allows the
therapist to perform remote physical therapy and collect patient data.
Simulated physical interactions between therapist and patient were
implemented using force feedback.
TR-256, John
Sucec and Ivan Marsic, The Scalability Tradeoff of Topology Dispersion
for
Mobile Ad hoc Networks,
11/10/00
This paper
addresses the issue of scalability, with respect to increasing node count,
in mobile ad hoc networks. In particular, scalability of on-demand routing
protocols is considered in detail. The scalability of routing protocols is
characterized as tradeoffs between several criteria. As a result of
evaluating these tradeoffs, a heuristic known here as topology dispersion
is proposed for on-demand routing protocols. Analysis and simulation
results show that topology dispersion can afford scalability in terms of
locally available routing information, at the expense of a moderate increase
in control packet overhead, over the range of mid-size networks (50 to 400
nodes).
TR-255,
Phillip Stanley-Marbell and Michael Hsiao, Fast, Cycle-Accurate Energy
Estimation for Networks of Embedded Systems, 11/00
Increased demand to investigate the
effect of software on energy usage in embedded systems, such as networks of
sensors, requires cycle-accurate energy estimators. The analysis of such
systems necessitates a means of simulating the computation and communication
costs both in individual nodes and globally across the entire network.
Presented is a fast, flexible,
cycle-accurate architectural simulator for networks of embedded systems
that models a popular commercial microcontroller family, the Hitachi SuperH
RISC architecture. The simulator enables cycle-accurate power dissipation
analyses through both instruction level power analysis and circuit activity
estimation, and permits both functional and energy cost simulation of a
wired communication link. By providing a flexible range of simulation
detail and amortizing portions of the simulation cost across several
simulated devices, it permits simulation of realistic applications on
networks of embedded systems. The simulator provides hitherto unavailable
functionality, while being over an order of magnitude faster than a
contemporary state of the art power-estimating simulator.
TR-254,
John Sucec and Ivan Marsic, Selecting Query Scope in On-Demand Routing
Protocols Based on Path Count Estimation, 11/28/00
To discover a route to a peer node, an on-demand routing protocol may
initiate a network-wide broadcast of a route request packet, perhaps after
initially confining the dissemination of its request to neighboring nodes.
However, significant savings in packet overhead can be achieved by
incrementally increasing the query radius rather than resorting prematurely
to a query radius equal to the network diameter.
This paper presents a method to estimate the number of currently active
pairs of communicating nodes (i.e., the number of active paths) in a
mobile ad hoc network. The method is entirely distributed and requires only
modest additional overhead in each Hello message transmission. Based on the
estimate of the network path count, a decision can be made locally at a node
performing route discovery as to how the query radius should be
incremented. By selecting the correct route query radius, the number of
packet transmissions required for route discovery can be minimized.
The simulation results reported
herein pertain to an on-demand routing protocol that incrementally increases
its query radius from one hop to two hops prior to initiating a network-wide
broadcast of the query. The results indicate that such a
method can substantially reduce the route request packet overhead, on
average, as compared with a route discovery process that resorts immediately
to network-wide broadcast if the initial hop query fails. The path count
estimate is applied to detect instances where incrementing the hop radius
from one to two hops will likely yield little or no savings. This allows
the routing protocol to skip the incremental query expansion process when
the expected savings in route request packet overhead is low, thus
improving the protocol performance in terms of route discovery delay.
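The trade-off behind incremental query expansion can be illustrated with a small sketch that counts transmissions. It is a simplified model under stated assumptions and not the protocol simulated in the paper: the graph representation, the hop-limited flooding rule, and the names `flood_cost` and `expanding_ring_search` are hypothetical, and a query succeeds only when it reaches the target itself (route caches are ignored).

```python
from collections import deque


def flood_cost(graph, source, ttl):
    """Count transmissions of a hop-limited flood and the nodes it reaches.

    graph is an adjacency dict {node: [neighbors]}. A node forwards the
    request once if it is fewer than ttl hops from the source.
    """
    dist = {source: 0}
    queue = deque([source])
    transmissions = 0
    while queue:
        node = queue.popleft()
        if dist[node] >= ttl:
            continue  # hop limit exhausted: receives but does not forward
        transmissions += 1
        for nbr in graph[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return transmissions, set(dist)


def expanding_ring_search(graph, source, target, radii):
    """Total transmissions when the query radius grows through radii."""
    total = 0
    for ttl in radii:
        sent, reached = flood_cost(graph, source, ttl)
        total += sent
        if target in reached:
            return total
    return None  # target not found within the largest radius


# A five-node chain 0-1-2-3-4, used only as a toy topology.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
near = expanding_ring_search(chain, 0, 2, [1, 2, 4])
far = expanding_ring_search(chain, 0, 4, [1, 2, 4])
direct = flood_cost(chain, 0, 4)[0]
```

In this toy topology a nearby target is found more cheaply by expanding the radius step by step, whereas a distant target costs more than a single wide query would have. Deciding in advance which case is likely is the role the abstract assigns to the path count estimate.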
TR-253,
John Sucec and Ivan Marsic, Assessing the Scalability of On-Demand
Routing Protocols, in Terms of Locally Available Routing Information, in
Mobile Ad hoc Networks, 12/2001
This paper evaluates the ability of
on-demand routing protocols to efficiently perform route discovery in mobile
ad hoc networks (MANETs) as the network diameter, node count and number of
active routes are increased. Specifically, the probability of route
discovery at various hop count distances from a requesting node is
determined via simulation, where route discovery at a small hop distance
represents efficient route discovery and a large hop distance represents
costly route discovery. Effective route caching heuristics are essential
for efficient route discovery. However, to minimize the likelihood of stale
route cache entries, retrofitting a heartbeat protocol to the routing
protocols being considered is proposed. The routing paradigms under
consideration are source routing with and without neighbor discovery.
Simulations, presented herein, provide quantitative evidence that source
routing with neighbor discovery is the most robust approach. The efficiency
of on-demand route discovery in a hybrid routing environment, with a
proactive routing radius of 2 hops, is also considered.
TR-252,
James Kleban, Chris Alvino, Shashank Sathyanarayana, Ashutosh Morde, James
L. Flanagan, Digital Implementation of a Source Location and Sound
Capture System for Hands-Free Video Teleconferencing, 10/31/00
A real time system for hands free
video teleconferencing is implemented on a Texas Instruments TMS320C6201 DSP.
The acoustic signals from the microphones are digitally beamformed thereby
resulting in a high quality audio signal. The source location estimate
required by the beamformer is determined by the Cross Power Spectrum Phase (CPSP)
technique, with the CPSP algorithm modified for the fixed point DSP. The
inaccuracies due to the presence of ambient acoustic noise are minimized by
implementing an adaptive energy tracker which rejects a frame of audio data
if its energy is less than the estimated background noise. Reduction in
source location inaccuracies by the technique of TDOA outlier removal was
also tested. A recursive Least Squares Algorithm is used to smooth out the
source location estimates.
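The frame-rejection step described above can be sketched as follows. This is a minimal floating-point illustration rather than the fixed-point DSP implementation: the exponential smoothing of the noise estimate, the `alpha` and `margin` parameters, and the class name `EnergyTracker` are assumptions not taken from the report.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame of audio samples."""
    return sum(x * x for x in frame) / len(frame)


class EnergyTracker:
    """Rejects frames whose energy is not above the background estimate."""

    def __init__(self, initial_noise, alpha=0.9, margin=2.0):
        self.noise = initial_noise  # running background-noise energy
        self.alpha = alpha          # smoothing factor (assumed value)
        self.margin = margin        # acceptance threshold factor (assumed)

    def accept(self, frame):
        energy = frame_energy(frame)
        if energy < self.margin * self.noise:
            # Treat the frame as background: refine the estimate, reject.
            self.noise = self.alpha * self.noise + (1 - self.alpha) * energy
            return False
        return True


tracker = EnergyTracker(initial_noise=0.01)
quiet_ok = tracker.accept([0.05] * 10)   # low-energy frame
loud_ok = tracker.accept([1.0] * 10)     # high-energy frame
```

Only accepted frames would be passed on to the source location estimator, so quiet frames dominated by ambient noise do not produce spurious location estimates.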
TR-251,
James Theodore Kleban, Combined Acoustic and Visual Processing for Video
Conferencing Systems, 9/29/00 (thesis)
The goal in hands-free videoconferencing is to accomplish, in real time,
high quality sound pickup with automatic source location of the current
talker. Signal processing techniques mitigate the deleterious effects of
interfering sources, room reverberation, and background noise. A new
real-time digitally processed system utilizes increased processing power to
implement acoustic source location and sound capture for higher fidelity
videoconferencing. The talker location system, however, is subject to
inaccuracies when using the acoustic input alone. An additional source of
information, images obtained from the application’s video camera, can
enhance talker positioning. Image processing algorithms are available that
detect and localize people in a room by their faces. Face locations can
complement the talker positions found using acoustic signal processing.
Integrating the two modes of information provides a more efficient way of
locating a speaker in a room than either input modality on its own.
Improved position estimate accuracy will lead to higher quality sound and
video.
This thesis outlines the construction of an improved fully digital real-time
source location and sound capture system for teleconferencing. Algorithms using
face detection to provide image-based locations of people in a room are
presented. Offline experiments investigate face detection/localization
techniques, and examine the integration of visual and audio information. A
method for determining and locating multiple talkers, difficult to do with
acoustic data alone, is presented. Systems combining image and acoustic
processing for later real-time implementation are examined.
TR-250,
Manpreet Kaur, Integration of Gaze and Speech for Multimodal Human
Computer Interaction, 9/29/00 (PhD thesis)
Most commonly used human-computer interfaces do not take advantage of the
many communication channels humans use to communicate in verbal and
non-verbal ways. The Rutgers University CAIP Center, under the NSF
STIMULATE program, has been conducting research to establish, quantify and
evaluate techniques for designing synergistic combinations of human-machine
communication modalities like sight, sound and touch. An initial system
using these modalities has been implemented at CAIP, and it has been seen
that even with our simplistic integration scheme and imperfect component
technologies, there are obvious performance advantages to be gained from the
use of multiple modalities. The research described in this thesis is a
systematic evaluation and characterization of gaze as an input modality, and
its integration with speech. The overall goal of the research was to
explore the use of gaze and speech as input modalities for HCI, and to
understand the natural integration patterns typically occurring in the
combined use of the two. Exhaustive characterization of the use of gaze as
an input modality for human-computer interaction has been done.
The relationships between object selection time and distance, size, and index of
difficulty, as defined by Fitts' law, have been studied. The speech and gaze
experiments described in this work provide detailed timing correlations of
speech and gaze, both in a natural environment, and a computer-based system
which can be used to answer the questions of when and how to
integrate the two modalities. The integration of speech and gaze has been
studied under linguistically different command structures like the use of
labels (Move This There). It has been found that there are
fundamental relationships between gaze and speech events e.g. gaze always
precedes speech, though the time is seen to vary with command structure.
The effect of command structure is seen to decrease in a computer-based
environment. Also, it has been demonstrated that the detailed knowledge of
the timing relationship of the speech and gaze patterns for command
specification and command execution can be used for error resolution for a
more robust multimodal interface.
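For readers unfamiliar with the index of difficulty referred to above, the following is a small sketch of Fitts' original formulation, ID = log2(2D/W), together with the linear selection-time model MT = a + b * ID. The abstract does not say which variant of Fitts' law the thesis used, and the coefficients a and b below are hypothetical placeholders, not measured values.

```python
import math


def index_of_difficulty(distance, width):
    """Fitts' original index of difficulty, in bits: log2(2D / W)."""
    return math.log2(2.0 * distance / width)


def predicted_selection_time(distance, width, a, b):
    """Linear Fitts model MT = a + b * ID; a and b must be fitted to data."""
    return a + b * index_of_difficulty(distance, width)


# Target 8 units away and 2 units wide, with made-up coefficients (seconds).
example_id = index_of_difficulty(8.0, 2.0)
example_time = predicted_selection_time(8.0, 2.0, a=0.2, b=0.1)
```

Doubling the distance or halving the target size each adds one bit of difficulty under this formulation.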
TR-249,
Dwight Macomber, Deborah Grove, Joseph French, Report of Microphone Array
Experiments Performed at CAIP in January 2000, 10/31/00
The general purpose of the
experiments was to evaluate microphone array performance under conditions
approximating those typical of teleconference environments. Conference Room
601 of the CoRE Building on the Busch Campus of Rutgers University was used
to measure the behavior of a 32-microphone wall mounted matched-filter array
(MFA). Measurements were made to evaluate behavior with a source located at
the array focus, as well as with the source located at increasing distances
from the focus position. Impulse responses (IR’s) were measured from twelve
different off-focus source positions along a line in one direction and from
seven positions in another direction. Measurements were also made to judge
the sensitivity of the array to changes to the room environment.
A second set of measurements was taken in CoRE Room 732. The goal of these
measurements was (1) to determine whether the acoustic field in a real room
approximates a diffuse reverberant field, and (2) to obtain some preliminary
indications of the practicability of a ceiling-mounted microphone array for
frequencies at the audio mid-range and above.
TR-248,
John Susec and Ivan Marsic, An Efficient Distributed Network-Wide
Broadcast Algorithm for
Mobile Ad Hoc Networks, 7/5/00
In this paper, an algorithm for efficient network-wide broadcast (NWB) in
mobile ad hoc networks (MANETs) is proposed. The algorithm is performed in
an asynchronous and distributed manner by each network node. The algorithm
requires only limited topology knowledge, and therefore is suitable for
reactive MANET routing protocols.
Simulations show that the proposed algorithm is on average 3-4 times as
efficient as brute force flooding. Further, simulations show that the
proposed algorithm compares favorably over a wide range of network sizes,
with a greedy algorithm using global topology knowledge, in terms of
minimizing packet transmissions. The application of the algorithm to route
discovery in on-demand routing protocols is discussed in detail. Proofs of
the algorithm’s reliability and of the intractability of solving for a
minimum sized transmitter set to perform NWB are also given.
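The two baselines named in the abstract can be sketched as follows. This is not the proposed distributed algorithm, whose details the abstract does not give. It only illustrates brute-force flooding (every node rebroadcasts once) and a simple greedy transmitter selection that assumes global topology knowledge. The function names and the example topology are hypothetical.

```python
def flooding_transmissions(adj, source):
    """Count transmissions when every node that hears the packet rebroadcasts once."""
    reached = {source}
    frontier = [source]
    while frontier:
        node = frontier.pop()
        for neighbor in adj[node]:
            if neighbor not in reached:
                reached.add(neighbor)
                frontier.append(neighbor)
    return len(reached)


def greedy_transmitters(adj, source):
    """Greedily pick transmitters, each time choosing the covered node whose
    transmission reaches the most still-uncovered nodes."""
    transmitters = [source]
    covered = {source} | set(adj[source])
    while True:
        best, best_gain = None, 0
        for node in sorted(covered):
            if node in transmitters:
                continue
            gain = len(set(adj[node]) - covered)
            if gain > best_gain:
                best, best_gain = node, gain
        if best is None:
            break
        transmitters.append(best)
        covered |= set(adj[best])
    return transmitters


# Hub node 0 with four neighbors; node 5 is reachable only through node 4.
topology = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0, 5], 5: [4]}
flood_count = flooding_transmissions(topology, 0)
greedy_set = greedy_transmitters(topology, 0)
```

On this toy topology flooding uses one transmission per node, while the greedy selection needs only the source and node 4.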
TR-247,
Amit Chhabra, Spatially Selective Sound Capture Using a Workstation
Matched-filter Microphone Array for a Small Office Environment, 9/8/00 –
Masters Thesis
Modern speech-based systems require the “capture” and digitization of speech
with an increasingly higher degree of fidelity, i.e., obtaining speech
comparable to that of a human talker. Additionally, it is desirable to
provide this capability in a manner that does not hinder human movement,
i.e., it is desired to provide capture of clear and high-quality speech
samples without using any body-worn equipment. To this end, spatially
selective sound capture techniques must be used to remove interfering noise
(e.g., HVAC systems, reverberation, and other talkers) from the desired
speech captured in real rooms. The spatially selective sound capture
technique of interest in this work is the Matched-filter Array (MFA)
technique as a means of removing noise from competing sources and
reverberant noise in a room. The technique’s performance is measured by its
SNR improvement over that of a single microphone and the delay-sum
beamformer (DSB) technique. The technique was implemented on a real-time
computing system, and experiments were performed in a small office
environment. The results yielded an improvement of 6.4 dB above a single
microphone and approximately 2 dB above the DSB. The technique is noted as
a promising method to deliver the necessary results needed for modern
speech-based systems.
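As context for the comparison above, here is a minimal sketch of the delay-sum beamformer baseline and of what a decibel gain means as a power ratio. It is not the matched-filter array processing from the thesis. Integer sample delays, the function names, and the toy two-channel signal are assumptions made for illustration.

```python
def delay_and_sum(channels, delays):
    """Advance each channel by its integer delay (in samples) and average them."""
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[d + t] for ch, d in zip(channels, delays)) / len(channels)
        for t in range(length)
    ]


def db_to_power_ratio(gain_db):
    """Convert an SNR gain in dB to the equivalent power ratio."""
    return 10.0 ** (gain_db / 10.0)


source = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
near_mic = list(source)           # the wavefront arrives here first
far_mic = [0.0, 0.0] + source     # and two samples later here
aligned = delay_and_sum([near_mic, far_mic], [0, 2])
single_mic_gain = db_to_power_ratio(6.4)
```

After alignment the two copies of the source add coherently, whereas uncorrelated noise would be averaged down. A 6.4 dB improvement corresponds to a signal-to-noise power ratio roughly 4.4 times higher.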
TR-246,
Sumathi Gopal, Aristotle and the Knowledge Web, 8/30/00 – Thesis
Aristotle is a Distributed Learning System. A Distributed Learning System
is an educational tool that supplements classroom training or even replaces
it. Currently, the audience of Aristotle comprises all users of
the RUNet2000 intra-network of Rutgers University. The current project
prototype of Aristotle is a Rutgers University freshman course in General
Biology – Biology 101. The current implementation comprises two portions:
The Virtual Labs and The Online Classroom. Aristotle was
commissioned during the fall of 1999.
This thesis concerns the development of the Online Classroom. It elaborates on the
program utilities that were developed to generate the various kinds of files
required to launch the classroom, and the design and implementation of the
novel concept of the Knowledge Web.
The Online Classroom is unique in its class, as it renders each lecture into
topics and keywords. Each topic/keyword presentation comprises a
video clip, transcript, images and a definition. Moreover, the video,
transcript and images are fully synchronized. This is achieved by means
of SMIL, Real Time Streaming Protocol (RTSP), RealPlayer G2 and RealServer
of Real Networks, Inc.
The second portion of the thesis is The Knowledge Web. This is a tool
developed specifically for
the online classroom, to facilitate the process of meta-cognition. It is a
graphical representation of knowledge/information. Two applications – The
Knowledge Web Composer and the Knowledge Web Navigator have been designed to
construct and navigate knowledge webs. Details of the design and
implementation have been discussed in this thesis report.
Aristotle uses available Internet resources to ease the learning process of students.
Distributed Learning Systems overcome several problems associated with
traditional classrooms. The Knowledge Web tries to compensate for the
absence of a live teacher. Details of the design of the Online Classroom of
Aristotle and the Knowledge Web have been presented in this report.
TR-245
Ivan Marsic, Attila Medl, and James Flanagan, Natural Communication with
Information Systems, 4/20/00
Pervasive networking and sophisticated computing open opportunities for
collaborative information processing independent of time and space. In this
instance the information system becomes an enhancer of human intellect, as
well as a mediator for communication among participants. The human user
favors the sensory dimensions of sight, sound, and touch as primary channels
of communication. Machines that can accommodate these modes promise
flexibilities and functionalities that transcend the traditional mouse and
keyboard. This paper describes research to establish human-computer
interfaces that capture attributes of natural face-to-face communication.
An experimental multimodal system is developed to study several aspects of
natural style human-computer communication. While as yet primitive, the
technologies of image and gaze processing, hands-free conversation, and
force feedback tactile transduction are combined and used simultaneously for
manipulating objects in a shared workspace. Software agents fuse the
sensory signals to estimate and interpret user intent. Current areas of
experimental application include disaster relief/crisis management,
telemedicine/rehabilitation, and mobile office/wearable computers.
TR-244
assigned but no report has been received.
TR-243
– Helmuth Trefftz and Grigore Burdea, 3-D Tracking Calibration of a
Polhemus Long Ranger Used with the Baron Workbench, 8/14/00
Applications involving a Workbench such as the Barco Baron one [1] typically
require a larger working envelope than desktop VR applications. In order to
provide for accurate tracking of the users’ hands in such an environment, a
Polhemus Long Ranger 3-D magnetic tracker was acquired. The Long Ranger
provides a larger radius of interaction but the accuracy of the signal is
adversely affected by metallic surfaces in the proximity of its working
area. This project describes a procedure to reduce these errors, based on
samples taken at regular space intervals in the working volume.
TR-242
– Chengwei Feng, Deborah Silver, Karen G. Bemis, Peter Rona, Acoustic
Imaging Manual: Object Segmentation and Feature Quantification, 12/6/99
In this work, our goal is to apply visualization techniques that will
facilitate the study of simulation and experimental datasets of hydrothermal
plumes. Hydrothermal plumes are bodies of hot water containing mineral
particles that discharge from vents on the sea floor. They rise buoyantly,
in some cases up to hundreds of meters, and disperse heat and chemicals into
the ocean. This report is intended to describe enhancements to our existing
feature extraction, tracking and quantification system. It focuses on
feature extraction (object segmentation) and feature quantification
procedures. The simulation dataset is a numerical model based on a large
eddy simulation in the field of computational fluid dynamics (CFD).
Experimental datasets are acoustic images of thermal plumes, which record
intensity of backscatter from the particulate matter suspended in the
plumes. The experimental datasets were recorded at different locations and
times. Our new approaches, including skeleton and centerline representation
and quantification, provide additional visual information. A centerline and
a skeleton are useful shape abstractions of features, which are regions of
interest in the dataset. Both capture the essential topology of the
feature but at different levels of detail. The centerline plays an
important role in defining the qualitative and quantitative behavior of
features and their evaluations in the field of plume study. The skeleton is
a thinned version of the original object, much like the skeleton of a human
figure. It is related to the medial-axis which is the locus of points
centered with respect to the boundaries of the feature. We extract the
skeleton of a feature using a distance transformation method. In our
approach, users can control the density of the skeleton points by
interactively changing the thinness parameter. Centerlines are
computed from these scattered skeletal points using appropriate averaging,
connecting and smoothing techniques. Quantification provides analytical
calculations for the features. Our quantifications, including center of
area, center of mass, local maximum sets, mass, area, and volume are
computed based on the centerline representation. Understanding of
simulation and experimental datasets is achieved by applying visualization
and quantification techniques. Quantification parameters from both the
simulation dataset and the acoustic imaging datasets, including plume volume,
cross-sectional area, centerline location, surface area and isosurfaces at
percentages of maximum backscatter intensity, are being used to derive
elements of plume behavior including expansion with height, dilution, and
mechanisms of entrainment of surrounding seawater.
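A few of the quantities listed above (mass, volume, center of mass) can be sketched for a segmented feature stored as a set of voxels. The report computes its quantifications from the centerline representation. The direct voxel summation below, the dictionary layout, and the function name are simplifying assumptions for illustration only.

```python
def quantify_feature(voxels, voxel_volume=1.0):
    """Summarize a feature given as {(x, y, z): scalar value} (hypothetical layout)."""
    mass = sum(voxels.values())
    volume = len(voxels) * voxel_volume
    # Center of mass: voxel positions weighted by their scalar values.
    center = tuple(
        sum(pos[axis] * value for pos, value in voxels.items()) / mass
        for axis in range(3)
    )
    return {"mass": mass, "volume": volume, "center_of_mass": center}


# Two-voxel toy feature; values stand in for backscatter intensity.
feature = {(0, 0, 0): 1.0, (2, 0, 0): 3.0}
summary = quantify_feature(feature)
```

Tracking such summaries across time steps is one way to follow how a feature grows or drifts.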
TR-241
– Chengwei Feng, Xin Wang, Deborah Silver, User Interface for Feature
Extraction, Tracking and Quantification System, 12/6/99
3D time-varying datasets tend to be extremely large and complex, and standard visualization
techniques such as isosurface and volume rendering provide no facilities to
manipulate features of interest in the dataset. X. Wang presented a
feature-based approach in his PhD thesis [2]. This allows users to extract
regions of interest and then visualize, track, isolate and quantify their
evolution. The tracking and quantitative information can be used to enhance
the visualization of the dataset. A feature-based approach can
significantly improve and facilitate the processing of massive datasets.
This report centers on the integration of the feature extraction, tracking
and quantification programs into a modular GUI environment. With the
integrated user interface, users can interact with features of interest and
focus on their evolution and quantification without being distracted by the
implementation details. A complete feature extraction, tracking and
quantification system has been integrated into AVS. The user interface has
been applied to experimental and simulation datasets from oceanography and
CFD.
TR-240
– Deborah M. Grove, Time & Frequency Data for On/Off Focus Matched Filter
Processing, 11/5/99
TR-239
– Boi Sletterink, A Managing Agent for Sharing Multiple Modalities, 8/31/99
Multimodal user interfaces promise more natural man-machine communication,
as well as improved accessibility for disabled people. When extending
multimodal user interfaces beyond single-application scenarios, such as providing a
multimodal interface for a desktop, applications have to share access to
input and output devices. This requires an agent that arbitrates between
applications and low-level drivers. This thesis proposes a design for
communication between driver, manager and application, and discusses a
prototype implementation and preliminary testing results.
TR-238
– DongSuk Yuk, Robust Speech Recognition Using Neural Networks and Hidden
Markov Models – Adaptations Using Non-linear Transformations, 8/31/99, (PhD
Dissertation)
When the training and testing conditions are not similar, statistical speech
recognition algorithms suffer from severe degradation in recognition
accuracy. Even when the underlying distributions from which data is
generated are the same, the observed distributions may vary because of the
interference from acoustical environments where systems are actually used.
Another source of variability comes from speakers themselves where the
produced sound is different between speakers. This research concerns
robustness issues in statistical speech recognition, especially when the
training and the testing data distributions are not matched.
Since the parameters of recognizers are estimated from training examples, it
would be better to use data collected from the testing environments.
However, collecting enough such data to reliably estimate the parameters of
recognizers is a very expensive
task. In this research, a transformation approach based upon neural
networks is studied to handle the training and testing condition
mismatches. Neural networks can be used for situations where speech feature
vectors are non-linearly distorted, such as in noisy reverberant speech or
telephone speech. By using a neural network, the adaptation process
requires a small amount of training data. First, a neural network is
applied to the computation of an inverse distortion function. This type of
network requires simultaneously recorded input and target data pairs for
training. Traditionally, neural networks are trained to minimize the mean
squared error between the network output and the corresponding target
value. However, minimizing the mean squared error does not guarantee
maximum recognition accuracy. Therefore, a new objective function for the
neural network is proposed, which makes use of the conditional probabilities
that come from hidden Markov model (HMM) based recognizers. It maximizes
the likelihood of the data from testing environments, and allows global
optimization of the neural network when used for the transformation of data,
or for the adaptation of recognizers to an operating environment. In the
latter case, the parameters of recognizers (i.e. mean vectors and covariance
matrices) are transformed to best match the data distribution. The new
algorithm is evaluated on a large vocabulary continuous speech recognition
task.
TR-237
Harvey Ray and Deborah Silver, A Memory Efficient Architecture for
Real-Time Parallel and Perspective Direct Volume Rendering,
8/4/99
Real-time visualization of large three-dimensional datasets demands high
performance and thus pushes storage, processing, and data communication
requirements to the limits of current technology. General purpose
high-performance parallel processing has been used to visualize these
datasets; however, these solutions are not tractable because of the cost of
the machines. High-end graphics workstations have recently achieved
interactive rates on moderate sized datasets using texture mapping; however,
texture mapping memory is typically limited and texturing hardware does not
directly support 3D gradients. Several custom architectures have been
proposed to address the shortcomings of other approaches. These solutions
promise unprecedented price-performance ratios. Recently, two specialized
architectures have been proposed for interactive visualization of 256³
datasets. One achieves 30 Hz for parallel projections and the other
achieves an average of 10 Hz for both parallel and perspective
projections. This paper introduces the Resample And Composite Engine
(RACE) architecture, a new high-performance general purpose volume
graphics architecture that targets 20 – 24 Hz average performance for
both perspective and parallel rendering with as few as 4 rendering pipelines
for 256³ datasets using 100 MHz SDRAM memories. This is
anywhere from 33 – 50% less voxel bandwidth than other recent approaches.
It will achieve 40 – 48 Hz average performance for 256 x 256 x 128
datasets. The RACE architecture will support antialiasing of perspective
images (casting multiple rays per pixel). We believe that this work further
validates the need for specialized direct volume rendering hardware for
voxel processing.
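The frame rates quoted above can be sanity-checked with a back-of-the-envelope calculation. The check is not taken from the paper. It assumes each rendering pipeline reads one voxel per memory clock cycle and that every voxel is read exactly once per frame, which gives an upper bound on the achievable frame rate.

```python
def max_frame_rate(dims, pipelines, clock_hz):
    """Upper bound on frames per second under the one-voxel-per-clock assumption."""
    voxels = dims[0] * dims[1] * dims[2]
    return pipelines * clock_hz / voxels


# 4 pipelines with 100 MHz memories, as stated in the abstract.
rate_full = max_frame_rate((256, 256, 256), pipelines=4, clock_hz=100e6)
rate_half = max_frame_rate((256, 256, 128), pipelines=4, clock_hz=100e6)
```

Under these assumptions the bound is just under 24 Hz for a 256 x 256 x 256 dataset and just under 48 Hz for 256 x 256 x 128, in line with the upper ends of the ranges the abstract targets.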
TR-236
-- D. Sinder, Speech Synthesis Using an Aeroacoustic Fricative Model, (PhD
Dissertation), 7/12/99
Progress in advanced computer speech interfaces is limited in part due to
incomplete knowledge of the physics of speech production. Unvoiced speech
sounds such as fricatives are an important example. These sounds are
produced by “turbulent” air motion in the vocal tract. A proper
understanding of how unvoiced sounds are produced is thus far lacking
because the speech community has for the most part limited its physical
picture of air motion in the vocal tract shape, lung pressure, and other
speech parameters is not at all clear.
A considerable body of work has been produced on the subject of aeroacoustics,
which is the study of the interaction between sound and non-acoustic air
motions such as turbulence. The purpose of this dissertation is to apply
ideas from aeroacoustics and unsteady aerodynamics to produce a model of the
aeroacoustic source associated with turbulent flow in the vocal tract.
Particular emphasis has been given to produce a model suitable for
articulatory speech synthesis. This requirement led to the development of
reduced-complexity modeling of turbulent flow such that the computational
requirements are not far in excess of those needed for existing
transmission-line computations of speech signals.
The essential result from aeroacoustic theory incorporated into this work is
that of Howe; his result relates the motion of vorticity through a duct of
changing cross-section to the plane wave sound field generated by that
motion. This relation is used to compute the value of an acoustic pressure
source in the duct. The aeroacoustic theory implicitly incorporates the
source spectrum, level, impedance, and spatial distribution, assuming the
behavior of vorticity and the vocal tract shape are known.
Due to its complexity, obtaining detailed information about the vorticity
distribution of any turbulent flow entails a high cost in time and
resources, whether the approach is computational or experimental.
Fortunately, this problem has received enough attention that it is possible
to parameterize the essential features of the vorticity field in the vocal
tract into a jet model which requires a minimum of computational effort.
Such a jet model is presented here. It prescribes the motion of vorticity
based upon criteria which determine the location of jet formation (flow
separation) in the vocal tract, the geometry of the location where the jet
is formed, and the local airflow speed at the jet formation location.
The new jet model and aeroacoustic source description were incorporated into
a transmission line model for duct acoustics. The
result is an engineering solution for a new fricative model which
combines low-cost computation with judicious application of fundamental
physics. Two sets of validation studies were conducted to test the
computational method. The first synthesized the sound produced by steady
airflow in a pipe with axial area variations. The pipe geometry and jet
speed were matched to those of a quiet aeroacoustic pipe flow facility. The
pressure spectrum measured at the pipe exit compared favorably to the
pressure spectrum computed for the simulated system. The second validation
study tested the method by synthesizing unvoiced speech sounds, both in
isolation and in vowel context. The results show the strong potential for
this approach to produce high quality unvoiced speech without the need to
estimate source strength, spectra, or location for different vocal tract
geometries. That is, the synthesis of unvoiced sounds is gained
automatically from the articulatory description.
TR-235
– M. Krishnamoorthy, Toward Robust Speech Recognition: Speaker and
Environmental Adaptation in the Linear Spectral Domain 1, 2000
There has been considerable success in speech recognition technology and
there are commercial products available today. Hidden Markov Models (HMMs)
have been used predominantly in speech recognition technology. These
systems perform well in a quiet environment and when the speech signal is
recorded using a close-talking microphone. However, mismatches between
training and testing environment severely degrade the performance. The
mismatch can be because of inter speaker variation or environmental
variation or both. Speaker variation can be because of factors such as
dialect differences and vocal tract lengths. Environmental variation can be
because of microphone variations, additive acoustic noise and channel noise.
Current recognition systems solve this problem using a technique called
adaptation. Conventional adaptation schemes work in the cepstral
domain. This thesis aims at developing an algorithm which can adapt the
models/features in the linear spectral domain. For this purpose an
adaptation technique called Linguistic Tree based Maximum Likelihood Linear
Regression (LT-MLLR) is used, to differentiate between the performance in
the cepstral and linear spectral domains. The advantages of using the
linear features are explained along with experimental results.
A representative result addresses noisy speech with 20 dB SNR, collected from a
microphone array, at a distance of about 5.5 meters from the speaker, in a
reverberant environment. Automatic recognition of this signal yields an
accuracy of about 8% on the clean speech models. The linear domain
adaptation technique developed here produced an accuracy of about 66%.
TR-234
-- J. Flanagan, D. Yuk, M. Krishnamoorthy, K. Dayanidhi, A Neural Network
System for Robust Large-Vocabulary Continuous Speech Recognition in Variable
Acoustic Environments,
1/15/99
Hidden Markov Models (HMMs) have to date been accepted as an effective
classification method for large vocabulary continuous speech recognition.
Most existing HMM-based recognition systems, such as DARPA sponsored
SPHINX and DECIPHER, are designed to operate on “high-quality close-talking
speech”. They require consistency in sound capturing equipment, and in
acoustic environments between training and testing sessions, the so-called
“matched” conditions. When the testing condition differs from the training
condition, the performance of these recognizers is typically degraded if
they are not retrained to cope with new environments [Che et al, 1992,
1993].
Usually, a retraining of HMM-based recognizers is complex and
time-consuming. It requires collection of speech data again under
corresponding conditions and reestimation of HMM parameters based on new
speech material. Particularly great time and effort are needed to retrain a
recognizer which operates in a speaker-independent mode, which is the mode
of greatest general interest.
The broad objective of this research is to explore the emerging microphone
array technology for distant-talking speech recognition in practical
reverberant, noisy environments. For this purpose, a system of microphone
array and neural network (MANN) has been developed as a robust front-end for
speech recognition. The MANN system has two synergistic components: (1)
Speech enhancement by microphone arrays and (2) Adaptation by neural network
processors. By using the MANN system, existing DARPA speech recognition
systems can be directly deployed in adverse day-to-day
applications. That is, speech recognizers need not be retrained.
Furthermore, the distant-talking MANN system frees the user from encumbrance
of hand-held, body-worn, or tethered microphone equipment. It thus enables
deployment of DARPA speech recognition technology in hands-busy/eyes-busy
and/or distant-talking applications.
The combined advantages of microphone arrays and neural network (NN)
computing are used to expand the capabilities of DARPA speech recognition
technology to application environments where users must not be encumbered by
body-worn or hand-held microphones, and must have freedom of movement.
(Examples include Combat Information Centers, large group conferences, and
mobile hands-busy eyes-busy maintenance tasks).
The approach allows the neural network to ‘learn’ the reverberant distortion
and noise interference of the acoustic environment, and to transform
speech-feature data (such as cepstrum coefficients) obtained from a
distant-talking microphone array to those corresponding to a high-quality,
close-talking microphone system. The performance of the speech recognizer
can therefore be elevated in the hostile acoustic environment without
retraining the recognizer.
The neural network learns the characteristics of a specific unfavorable
environment by adapting its weights through direct comparison of the
distant-talking signal to a close-talking calibration signal. Thereafter
performance of the speech recognizer under distant-talking reverberant
conditions can be elevated and made comparable to that of the close-talking,
acoustically favorable condition.
In Part I, the initial work involved with the line array and the stereo data
neural networks is described. The main idea of the MANN system is explained
and some pilot experimental results are shown. In Part II, the Matched
Filter Array (MFA) is discussed. In Part III, the combination of the MFA
and the Maximum Likelihood Linear Regression (MLLR) is described. Finally,
in Part IV, the Maximum Likelihood Neural Networks (MLNN) are explained.
TR-233
-- Piyush Modi, Discriminative Utterance Verification by Integrating
Multiple Confidence Measures: A Unified Training and Testing Approach (PhD
Dissertation)
Robustness to acoustic and language variabilities is the most challenging
problem facing automatic speech recognition systems today. It is becoming
increasingly important for these systems to assign a measure of “goodness” or
a confidence measure (CM) to their output. The use of these measures to
validate a recognized hypothesis is usually referred to as utterance
verification (UV).
In this thesis, we present a novel UV framework for exploiting the
complementary properties of different sources of information and their
integration in a unified system. The proposed framework uses a single
objective function that integrates multiple knowledge sources and also acts
as a loss function to train the entire parameter set of the UV system. A
discriminative minimum verification error training algorithm is developed to
optimize the parameters of both the objective function and the knowledge
sources. To demonstrate the utility of our framework we have developed a UV
system that integrates two acoustic based knowledge sources. Experimental
results on a connected digits task show that the UV with multiple confidence
measures (UV-MCM) outperforms state-of-the-art systems that rely on using
each CM individually.
TR-232
-- J. L. Flanagan, Autodirective Sound Capture; Toward Smarter Conference
Rooms
TR-231
-- S. Juth, Collaboration Components for Programming Real-Time
Synchronous Groupware Applications, 10/98
It is a well-known fact that developing software applications can be very
complex and difficult. Developing multi-user applications is even more
challenging since such systems typically require additional features such as
session management, synchronizing users’ states, providing user awareness,
etc. In the past people have tried to ease this additional burden by using
toolkits that support these multi-user requirements. This thesis presents a
different approach for building collaborative applications that takes
advantage of the JavaBean specification from JavaSoft, Inc. A suite of
collaboration components has been implemented as JavaBeans in order to
provide the typical features found in real-time, synchronous groupware
applications. By using the collaboration components, the multi-user
application developer only has to focus on developing the features specific
to the application while leaving the multi-user aspects (e.g. concurrency
control, session management, and awareness of remote users) to the
components. This research also provides a specification for developing
collaboration-aware Beans that can be used with the components.
TR-230
-- M. J. A. Andre, Multimodal Human-Computer Interaction, 8/26/98
The influence of computers on people’s daily lives is increasing and the
need for simpler interfaces to use computers is emerging. Current
human-machine communication systems predominantly use keyboard and mouse
inputs which inadequately approximate human abilities for communication.
More natural communication technologies are capable of freeing computer
users from the keyboard and mouse.
This work presents a prototype of a multimodal interface featuring fusion of
multiple modalities for human-computer interaction. The three modalities
integrated are a speech recognizer and synthesizer, a tactile glove, and a
gaze tracker. The application used for this system is a collaborative
whiteboard application extended to a military mission planning system. The
design and implementation of the whole system and the methods applied are
described and preliminary results of the real-time multimodal fusion are
analyzed.
TR-229
-- J. Ray, R. Samtaney and N.J. Zabusky, Shock-in Competition and A Model
for Circulation Deposition in Shock Interactions with Heavy Prolate
Cylinders, 8/31/98
We identify two different modes of interaction for planar shocks
accelerating heavy prolate gaseous cylinders. These modes arise from
different interactions of the incident and transmitted shocks on the leeward
side of the cylinder and yield different vorticity deposition mechanisms. We
model the net baroclinic circulation generated on the interface by both the
shocks and validate the model via numerical simulations of the Euler
equations. The principal parameters governing the interaction are the Mach
number of the shock (M), the ratio of the density of the gas cylinder to the
ambient gas density (η, η > 1), γ0, γb (the ratios of specific heats of the
two gases), λ (the aspect ratio), and tT / tI (a time ratio which
characterizes the mode of interaction). In the range 1.2 ≤ M ≤ 3.5,
1.54 ≤ η ≤ 5.04 and λ = 1.5 and 3.0, our model predicts within 10% of the
simulation results.
TR-228
-- Samir Chennoukh, Daniel Sinder, James Flanagan, Voice Mimic System
(Hyper-computing Design Project Final Report), 8/21/98
The research undertaken in this part of the HPCD project aims to advance
fundamental understanding of human speech production and to coalesce the
problems of speech synthesis, speech recognition, and low bit-rate speech
coding into a compact parametric framework. The approach uses a
computationally-intensive technique of speech analysis and synthesis to gain
more understanding of the acoustics of speech.
The research aims to design a voice mimic system which can adapt parameters,
moment by moment, for an articulatory model to duplicate an arbitrary speech
input. Using an articulatory speech synthesizer, the input speech is
restored at the output of the system from the obtained set of articulatory
parameters. Articulatory synthesis has been studied using both
linear-acoustic and fluid-dynamic models of speech generation. Such
simulation based on a physiological model of speech production requires a
knowledge of the optimal number of geometrical, acoustical and mechanical
parameters in order to account for the complexity of speech production.
Research in speech production collects vast amounts of data on the vocal
system, its mechanics, the acoustic signal and the information which it
encodes. The analysis of this data should establish the characteristics of
the speech production device in order to model it. Due to the difficulty of
obtaining articulatory data, techniques for estimating the vocal tract area
function directly from the speech signal are of interest in studies of the
speech production process and as the basis for efficient coding of the
speech signal. The problem of estimating the vocal tract shape from the
acoustic speech signal is often referred to as the inverse problem. This is
a difficult problem because of non-uniqueness of the acoustic-geometry
relation. The problem is far from solved. An overview of the state of the
art is given in Chapter 2 regarding modeling of the vocal tract and the
knowledge required to estimate the vocal tract shape from the speech
signal. Although brief, the overview relates clearly the main components of
our research.
Chapter 3 illustrates our efforts to solve the inverse problem by an
optimization procedure using an articulatory codebook. A codebook is used
to obtain the first estimate of the vocal tract shape that may produce a
given combination of acoustic parameters. It must be designed such that it
spans the natural articulatory space of a speaker. Furthermore, sampling of
the space must be fine enough so that an acoustic entry always exists very
close to the global optimum. Such codebooks require a large set of matching
pairs of vocal tract and acoustic parameters. However, as the codebook size
increases, searching it becomes increasingly time consuming. This chapter
is devoted to the different techniques and algorithms developed to access
such large codebooks and to solve the non-uniqueness of the articulatory
trajectories which follow from the non-uniqueness of the acoustic-geometric
relation.
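The codebook lookup described above can be illustrated with a minimal sketch. It assumes a toy codebook of made-up (formant, vocal tract parameter) pairs and a plain Euclidean nearest-neighbour search; the names CODEBOOK and first_estimate and all numeric values are hypothetical and are not taken from the report. The exhaustive scan also shows why search time grows with codebook size.

```python
import math

# Hypothetical toy codebook (values invented for illustration only).
# Each entry pairs an acoustic vector (two formant frequencies in Hz)
# with a vocal tract parameter vector.
CODEBOOK = [
    ((700.0, 1200.0), (0.9, 0.2)),
    ((300.0, 2300.0), (0.1, 0.8)),
    ((500.0, 1500.0), (0.5, 0.5)),
]


def first_estimate(acoustic, codebook=CODEBOOK):
    """Return the vocal tract vector whose acoustic entry is nearest (Euclidean).

    This is an exhaustive linear scan, so its cost grows in proportion to
    the number of codebook entries.
    """
    best = min(codebook, key=lambda entry: math.dist(entry[0], acoustic))
    return best[1]


if __name__ == "__main__":
    # The result would serve only as a starting point for an optimization
    # procedure, which is not sketched here.
    print(first_estimate((320.0, 2250.0)))
```

A real system would presumably replace the linear scan with an indexed or clustered search; that choice is outside the scope of this sketch.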
Chapter 4 proposes another concept for solving the inverse problem. It
consists of developing a unique and continuous acoustic-to-articulatory
mapping that will uncover the information encoded in the signal about the
vowel, the consonant and the vowel-consonant coarticulation. Obviously, the
concept requires more understanding of the acoustic-articulatory relation,
an understanding which is still far from complete. However, this approach
has proven to be a fast, efficient and robust method for
acoustic-to-articulatory mapping. The concept consists of a mapping from
vocal tract shape to formant frequencies whose relationship is kept linear,
in the sense that one gesture describing the variation of the vocal tract
shape gives a rectilinear formant trajectory. Thus, the estimation of the
model shape from the formant frequencies according to this concept becomes a
simple interpolation mapping.
The limiting factor in the quality of voice mimic systems is in the accuracy
with which articulatory speech synthesizers model fricative production.
Since the physical process of fricative production is not well understood,
the problem of obtaining an articulatory description from an acoustic signal
is especially difficult for these sounds. Computational studies of speech
production using fluid dynamics have the potential to provide much insight
into fricative production. In Chapter 5, numerical simulation of flows in
idealized vocal tract configurations is described, as well as physical
experiments in the same geometries. Source terms from aero-acoustic theory
are compared with experimental results to identify which terms (or
combination of terms) most accurately reflect the noise sources in the vocal
tract. An understanding of the proper source terms will allow the
development of reduced models which can be implemented in traditional
synthesizers. The improved synthesis models will ultimately improve the
performance of the voice mimic system. Chapters 6 and 7 give a summary and
conclusions, respectively, of the present research.
TR-227
-- Prabhu Raghaven, Speaker & Environment Adaptation in Continuous Speech
Recognition, 6/24/98
Hidden Markov Models (HMMs) have been used with considerable success in
continuous speech recognition. It is well known that high accuracy can be
obtained when the HMM system is trained and tested in a quiet environment
and the speech signal is acquired from a close-talking microphone. However,
mismatches between training and testing environment severely degrade
performance. Two major sources of mismatches are speaker and environment
variability. Speaker variation is typically caused by different speaking
styles and physiological differences between speakers, such as vocal
tract length. Environment variability includes channel distortion,
such as that which affects telephone speech, additive noise, and
reverberation, which results when the microphone is far away from the
speaker.
The goal of this report is to explore different adaptation algorithms that
mitigate the effects of speaker and environmental variability for speech
recognition. The adaptation algorithm closely examined in this report is a
Linguistic Tree based Maximum Likelihood Linear Regression (LT-MLLR).
Speech Recognition experiments using the LT-MLLR for speaker and environment
adaptation are given.
It is shown that the LT-MLLR algorithm is superior to other adaptation
algorithms discussed. For speaker adaptation, a 30% reduction is achieved
over the baseline word error rate (WER) using this algorithm. In addition
it is shown that the use of Matched Filter Array Processing (MFA) with LT-MLLR
reduces the WER of distant-talking speech with high reverberation. In the
case when the reverberation time is as high as 0.9s, the WER is reduced from
57.89% to 19.41%, a relative reduction of 66.47%. [This work was supported by DARPA
Contract DABT63-93-C-0037.]
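As a check on the arithmetic, the 66.47% figure is consistent with a reduction measured relative to the baseline WER. The short sketch below recomputes it from the two WER values quoted above; the function name relative_reduction is a hypothetical helper, not something from the report.

```python
def relative_reduction(baseline_wer: float, adapted_wer: float) -> float:
    """Relative WER reduction, expressed as a percentage of the baseline WER."""
    return 100.0 * (baseline_wer - adapted_wer) / baseline_wer


if __name__ == "__main__":
    # WER values quoted in the abstract (reverberation time of 0.9 s).
    print(f"{relative_reduction(57.89, 19.41):.2f}%")  # prints "66.47%"
```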
TR-226
-- Xin Wang and Deborah Silver, Visualizing Time-Varying Features
Visualization of time-varying 3D data is difficult because of the immense
amount of high dimensional data to process and assimilate. Feature
extraction and tracking techniques can greatly reduce the data size and
complexity, and thus help scientists identify and quantify important regions
and events. In this paper, we propose a feature-based framework to
visualize time varying datasets, and discuss some visualization approaches
to enhance the scientists’ ability to grasp 3D time varying patterns of
isolated features. These techniques utilize tracking data which has been
computed over the dataset. The turbulent data set is used to demonstrate
the ideas.
TR-225
-- George Patounakis, Mourad Bouzit, and Greg Burdea, Study of the
Electromechanical Bandwidth of the Rutgers Master, 29 May 1998
The Rutgers Master II's (RMII) current mechanical bandwidth is about 5Hz for
a 16 psi to 64 psi pressure change. This is not good enough to render
haptic environments. The goal of this study is to see what factors
influence the bandwidth of the RMII. The experiments involved using higher
pressure inputs to the RMII, changing the length of the tube leading from
the RMII interface box to the piston, changing the cross section of the tube
from the RMII to the piston, checking for correlation between the readings
from the force sensor on the piston shaft and the pressure sensor on the
pressure regulator, and varying the air flow. The above tests were done
using the Buzmatics SPCJR pneumatic servo controller. Subsequently, the
bandwidth was measured for the MATRIX 751 valve from Amatrix Corporation, by
changing the number of microvalves controlled in parallel.
TR-224
-- Chewei Che, Automatic Speaker Recognition System for Telephone Speech,
5/19/98
It is well known that high accuracy can be obtained for speaker recognition
systems operated in quiet laboratory environments. The performance of
the system degrades when the application is constrained to be
text-independent using telephone conversational speech. Factors such as a
mismatch of training/testing handset, and inherent variation of speech from
different talkers contribute to the level of difficulty of the task. This
report is aimed at developing robust speaker recognition systems to combat
those variabilities and still maintain high performance. The general
framework of the system is based on modeling the speaker with statistical
analysis of the speech signal. Two systems developed by the author are
presented. The first system uses concatenated phoneme Hidden Markov Models
(HMMs) and is operated in a text-prompted mode. The system has been
evaluated with the YOHO voice verification corpus in terms of both speaker
verification and closed-set speaker identification. It is shown that by
using 10 seconds of testing speech, error rates of 0.09% for male and
0.29% for female speakers are obtained for speaker identification with a
total population of 138 talkers. For speaker verification, under the 0% false
rejection condition, the system achieves state-of-the-art false acceptance
rates of 0.09% for male and 0% for female speakers. The second
system utilizes a Vector Quantization based Gaussian Speaker Model (VQGSM)
and is operated with no context constraint. The system was evaluated using
the Switchboard corpus and yields competitive speaker recognition accuracy.
TR-223
-- V. George Popescu and Greg Burdea, Research on Full Body Modeling and
Force Feedback, 4/24/98
This report presents the guidelines of a full-body haptic suit design and
simulation system which integrates a real-time hand sensing and
force-feedback device (RMII) with full-body modeling software (JACK). The
RMII system and the Polhemus Fastrak sensor were initially integrated with
“Jack” full-body simulation software. The system is evaluated in terms of
bandwidth and increase in simulation realism after the addition of the
haptic component. Next the concept of a Virtual Human Agent enhanced with
haptic feedback is presented. The proposed concept is illustrated by a
simulation developed using the RMII system, JACK software and a speech
recognition engine. The conceptual design of a full body haptic suit is
explored subsequently. Solutions provided by state-of-the-art technology
are evaluated. New ideas are also explored to meet the demands of the
Haptic Suit design. The design of a new force-feedback device for the elbow
concludes the report.
TR-222
-- A. Sharma, Real-Time Systems Support for Multimedia Applications, May
1998
New multimedia applications, e.g. teleconferencing, video-on-demand,
Internet telephony, and many more are revolutionizing every aspect of human
life. These applications demand real-time performance, better digital
signal processing algorithms, elaborate multimedia communication protocols,
high-speed networks, and powerful platforms.
Traditionally the embedded software to support multimedia applications, e.g.
digital signal processing tasks, network switching and routing is developed
in a custom way. The software is fine-tuned in several iterations to get
the needed real-time performance. This approach has several serious
drawbacks: tedious design which becomes even more difficult for
multiprocessors; lack of flexibility, as a minor change in specification may
necessitate a total redesign; slower response to asynchronous events,
leading to a degradation in the system’s real-time support. In contrast, a
kernel or operating system approach allows more flexible software
development and real-time support. In the presence of a kernel layer,
multimedia application writers need not worry about task coordination
or careful timing of events. Since the underlying software to support
multimedia applications is still of an embedded nature, a complex
general-purpose operating system like Unix would be inappropriate. We
architect a lightweight kernel layer that enhances real-time support for
multimedia
applications.
One crucial requirement for the real-time support of a system is to increase
the predictability in the system. So, while architecting the different
subsystems of the layer, we have attempted to increase the predictability
of the system. For scheduling, we separate the unpredictable external
events from the synchronous computation or communication. Having one
scheduling scheme for all three makes a system more unpredictable than it
needs to be. This resulted in a novel heterogeneous scheduling
scheme with functional separation of control flow, data flow, and
computation, in contrast with conventional spatial hierarchical scheduling
schemes. When designing the real-time communication and synchronization
primitives, using the profile information, we have provided hints via
constructs like real-time semaphores to increase the predictability in the
system. Working in the embedded domain, we do not emphasize real-time support
for memory management, though we do provide special notions of region-based
management and callback routines. We made our kernel customizable with a
service table that allows a custom set of scheduling,
synchronization, or memory management schemes to be plugged in. A
theoretical model for the
scheduling scheme is provided, with two algorithms to operate on it. A
comparative empirical study was performed on our scheduling scheme and
related alternate schemes. The study found that two hybrid schemes, with
a good mix of the basic schemes, provide a good combination of scalability,
resiliency to load, and functional separation.
TR-221
-- Daniel V. Rabinkin, Optimum Sensor Placement for Microphone Arrays,
Microphone arrays can be used for high-quality sound pickup in reverberant
and noisy environments. Sound capture using conventional single microphone
methods suffers severe degradation under these conditions. The beamforming
capabilities of microphone array systems allow highly directional sound
capture, providing enhanced signal-to-noise ratio (SNR) when compared to
single microphone performance.
The overall performance of an array system is governed by its ability to
locate and track sound sources and its ability to capture sound from desired
spatial volumes. These abilities are strongly affected by the spatial
placement of microphone sensors. A met |