Matching Items (47)

Description
Online programming communities are widely used by programmers for troubleshooting and various problem-solving tasks. The large and ever-increasing volume of posts in these communities demands more effort to read and comprehend, making it harder to find relevant information. In my thesis, I designed and studied an alternative approach that uses interactive network visualization to represent relevant search results for online programming discussion forums.
I conducted a user study to evaluate the effectiveness of this approach. The results show that users were able to identify relevant information more precisely through the visual interface than with the traditional list-based approach. The network visualization provided effective search-result navigation support that facilitated users' tasks and improved query quality for successive queries. Subjective evaluation also showed that visualizing search results conveys more semantic information in an efficient manner and makes searching more effective.
Contributors: Mehta, Vishal Vimal (Author) / Hsiao, Ihan (Thesis advisor) / Walker, Erin (Committee member) / Sarwat, Mohamed (Committee member) / Arizona State University (Publisher)
Created: 2015

Description
Similarity search in high-dimensional spaces is popular in applications such as image processing, time series analysis, and genome data. In higher dimensions, the curse of dimensionality undermines the effectiveness of most index structures, giving way to approximate methods such as Locality Sensitive Hashing (LSH) for answering similarity searches. In addition to range searches and k-nearest neighbor searches, there is a need to answer negative queries formed by excluded regions in high-dimensional data. Although there has been a slew of LSH variants that improve efficiency, reduce storage, and provide better accuracy, none of these techniques can answer queries in the presence of excluded regions.
This thesis provides a novel approach to handle such negative queries. This is achieved by creating a prefix-based hierarchical index structure. First, the higher-dimensional space is projected to a lower-dimensional space. Then, a one-dimensional ordering is developed while retaining the hierarchical traits. The algorithm intelligently prunes irrelevant candidates while answering queries in the presence of excluded regions. Whereas naive LSH would need to filter the negative query results out of the main results, the new algorithm minimizes the need to fetch those redundant results in the first place. Experimental results show that this reduces post-processing cost, thereby reducing query processing time.
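The sketch below illustrates the post-filtering that a naive LSH pipeline would need for such negative queries: hash points with random-hyperplane LSH, collect the candidates that share the query's bucket, and then discard candidates that fall inside the excluded region. It is a minimal illustration of the baseline behavior the thesis improves upon, not the prefix-based hierarchical index itself; all names and parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def lsh_signature(points, hyperplanes):
    """Random-hyperplane LSH: one bit per hyperplane (sign of the projection)."""
    return (points @ hyperplanes.T > 0).astype(int)

def query_with_exclusion(data, query, excluded_center, excluded_radius, hyperplanes):
    """Return candidates in the query's LSH bucket, minus points inside the excluded ball."""
    sigs = lsh_signature(data, hyperplanes)
    q_sig = lsh_signature(query[None, :], hyperplanes)[0]
    candidates = data[np.all(sigs == q_sig, axis=1)]
    # Naive post-processing for the negative query: drop candidates inside the excluded region.
    keep = np.linalg.norm(candidates - excluded_center, axis=1) > excluded_radius
    return candidates[keep]

data = rng.normal(size=(1000, 32))      # 1,000 points in 32 dimensions
planes = rng.normal(size=(8, 32))       # 8 random hyperplanes -> 8-bit bucket signatures
q = rng.normal(size=32)
result = query_with_exclusion(data, q, excluded_center=q + 0.5, excluded_radius=1.0, hyperplanes=planes)
print(result.shape)
```

The thesis's index avoids fetching those excluded candidates in the first place instead of filtering them out afterwards.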
Contributors: Bhat, Aneesha (Author) / Candan, Kasim Selcuk (Thesis advisor) / Davulcu, Hasan (Committee member) / Sapino, Maria Luisa (Committee member) / Sarwat, Mohamed (Committee member) / Arizona State University (Publisher)
Created: 2016

Description
Skyline queries are a well-established technique used in multi-criteria decision applications. There has been recent interest in the research community in computing skylines efficiently, but the problem of presenting a skyline that takes the user's preferences into account remains open. Each user has varying interest in each attribute, and hence a "one size fits all" methodology might not satisfy all users. True user satisfaction can be obtained only when the skyline is tailored specifically for each user based on their preferences.
This research investigates the problem of preference-aware skyline processing, which consists of inferring the preferences of users and computing a skyline specific to each user that takes those preferences into account. This research proposes a model that transforms the data from a given space to a user preferential space where each attribute represents a preference of the user. This study proposes two techniques, "Preferential Skyline Processing" and "Latent Skyline Processing", to efficiently compute preference-aware skylines in the user preferential space. Finally, extensive experiments and performance analysis confirm the correctness of the recommendations and the algorithms' ability to outperform naïve approaches.
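Both proposed techniques build on the basic skyline operator. Below is a minimal sketch of a skyline computation under "smaller is better" semantics, before any transformation into the user preferential space; the data and attribute choices are illustrative only.

```python
def skyline(points):
    """Return the skyline of a list of tuples, assuming smaller is better in every attribute."""
    def dominates(a, b):
        # a dominates b if a is no worse in all attributes and strictly better in at least one.
        return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

hotels = [(50, 2.0), (60, 1.0), (40, 3.0), (55, 2.5)]   # (price, distance to destination)
print(skyline(hotels))   # (50, 2.0), (60, 1.0), (40, 3.0) survive; (55, 2.5) is dominated by (50, 2.0)
```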
Contributors: Rathinavelu, Sriram (Author) / Candan, Kasim Selcuk (Thesis advisor) / Davulcu, Hasan (Committee member) / Sarwat, Mohamed (Committee member) / Arizona State University (Publisher)
Created: 2014

Description
This dissertation addresses access management problems that occur in both emergency and outpatient clinics, with the objective of allocating the available resources to improve performance measures while considering the relevant trade-offs. Patient willingness-to-wait (WtW) behavior for outpatient appointments is estimated through statistical analyses of data, and two main settings are considered: allocation of the limited booking horizon to patients of different priorities by using time windows in an outpatient setting that accounts for patient behavior, and allocation of hospital beds to admitted Emergency Department (ED) patients. For each chapter, a different approach based on the problem context is developed, and its performance is analyzed using analytical and simulation models. Real hospital data are used in the analyses to provide evidence that the methodologies introduced are beneficial for addressing real-life problems and that real improvements are achievable with the suggested policies.
This dissertation starts with the outpatient clinic context, developing an effective resource allocation mechanism that can improve patient access to clinic appointments. I first identify patient behavior in terms of willingness to wait for an outpatient appointment. Two statistical models are developed to estimate the patient WtW distribution using data on booked appointments and appointment requests. Several analyses are conducted on simulated data to assess the effectiveness and accuracy of the estimates.
The dissertation then introduces a time-windows-based policy that uses appointment delay as a lever and exploits patient behavior to improve access. The policy allocates the available capacity to patients of different priorities by dividing the booking horizon into time intervals available to each priority group, strategically delaying lower-priority patients (a sketch of this idea follows the abstract).
Finally, patient routing between the ED and inpatient units is studied to improve patient access to hospital beds. The strategy that captures the trade-off between patient safety and quality of care is characterized as a threshold policy. Simulation experiments built with real data collected from a hospital illustrate the improvement achievable by implementing a strategy that accounts for this safety-quality trade-off.
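A minimal sketch of the time-windows idea referenced above: lower-priority requests may only book into later parts of the horizon, so near-term capacity is protected for higher-priority patients. The window boundaries, priorities, and slot counts are illustrative, not the dissertation's calibrated policy.

```python
# Booking windows per priority: lower-priority requests only see later parts of the horizon.
WINDOWS = {1: (0, 28), 2: (7, 28), 3: (14, 28)}   # priority -> (first day, last day) offsets

def book(priority, free_slots):
    """Book the earliest free slot inside the priority's window; return the day or None."""
    start, end = WINDOWS[priority]
    for day in range(start, end):
        if free_slots.get(day, 0) > 0:
            free_slots[day] -= 1
            return day
    return None

slots = {d: 2 for d in range(28)}   # 2 appointment slots per day over a 4-week horizon
print(book(1, slots), book(3, slots), book(2, slots))   # books days 0, 14, and 7 respectively
```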
Contributors: Kilinc, Derya (Author) / Gel, Esma (Thesis advisor) / Pasupathy, Kalyan (Committee member) / Sefair, Jorge (Committee member) / Sir, Mustafa (Committee member) / Yan, Hao (Committee member) / Arizona State University (Publisher)
Created: 2019

Description
Graphs are commonly used visualization tools in a variety of fields. Algorithms have been proposed that claim to improve the readability of graphs by reducing edge crossings, adjusting edge length, or other means. However, little research has been done to determine which of these algorithms best suit human perception of particular graph properties. This thesis explores four different graph properties: average local clustering coefficient (ALCC), global clustering coefficient (GCC), number of triangles (NT), and diameter. For each of these properties, three different graph layouts are applied to represent three different approaches to graph visualization: multidimensional scaling (MDS), force-directed (FD), and tsNET. In a series of studies conducted through the crowdsourcing platform Amazon Mechanical Turk, participants are tasked with discriminating between two graphs in order to determine their just noticeable differences (JNDs) for the four graph properties and the three layout algorithms. These results are analyzed using previously established methods presented by Rensink et al. and Kay and Heer. The average JNDs are analyzed using a linear model that determines whether the property-layout pair appears to follow Weber's Law, and the individual JNDs are run through a log-linear model to determine whether the individual variance of the participants' JNDs can be modeled. The models are evaluated using the R² score to determine whether they adequately explain the data, and they are compared using the pairwise Mann-Whitney U-test to determine whether the layout has a significant effect on the perception of the graph property. These tests indicate that the data collected in the studies cannot always be modeled well with either the linear or the log-linear model, which suggests that some properties may not follow Weber's Law. Additionally, the layout algorithm is not found to have a significant impact on the perception of some of these properties.
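A minimal sketch of the Weber's-Law-style analysis described above: regress mean JND on the base value of a graph property and inspect the fit via R². The data values below are fabricated purely for illustration and do not come from the study.

```python
import numpy as np

# Illustrative data: base values of a graph property and the mean JND measured at each base.
base = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
jnd = np.array([0.012, 0.021, 0.033, 0.041, 0.052])

# Weber's Law predicts JND proportional to the base value; fit JND = a*base + b and check R^2.
a, b = np.polyfit(base, jnd, 1)
pred = a * base + b
r2 = 1 - np.sum((jnd - pred) ** 2) / np.sum((jnd - jnd.mean()) ** 2)
print(f"slope (Weber fraction) = {a:.3f}, intercept = {b:.4f}, R^2 = {r2:.3f}")
```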
Contributors: Clayton, Benjamin (Author) / Maciejewski, Ross (Thesis advisor) / Kobourov, Stephen (Committee member) / Sefair, Jorge (Committee member) / Arizona State University (Publisher)
Created: 2019

Description
Breeding seeds to include desirable traits (increased yield, drought/temperature resistance, etc.) is a growing and important method of establishing food security. However, beyond breeder intuition, few decision-making tools exist that can provide breeders with credible evidence for deciding which seeds to progress to further stages of development. This thesis develops a chance-constrained knapsack optimization model, which a breeder can use to make better decisions about seed progression and to reduce the level of risk in their selections. The model's objective is to select seed varieties out of a larger pool and maximize the average yield of the "knapsack" while meeting a risk criterion. Two models are created for different cases. The first is a risk-reduction model, which seeks to reduce the risk of obtaining a poor yield while still maximizing total yield. The second model considers the possibility of adverse environmental effects and seeks to mitigate the negative effects they could have on total yield. In practice, breeders can use these models to better quantify uncertainty in selecting seed varieties.
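A minimal sketch of the risk-reduction idea as a chance-constrained knapsack, assuming independent, normally distributed yields so that the chance constraint reduces to a mean-minus-z-sigma bound. The variety data, capacity, and thresholds are illustrative, and the brute-force search stands in for a proper solver.

```python
from itertools import combinations
from math import sqrt

# Illustrative seed varieties: (mean yield, yield standard deviation).
varieties = [(10, 3), (12, 5), (9, 1), (11, 4), (8, 2)]
K = 3          # knapsack capacity: number of varieties to progress
z = 1.645      # z-score for a 95% chance constraint
Y_MIN = 20     # yield that must be achieved with 95% probability

best = None
for subset in combinations(range(len(varieties)), K):
    mean = sum(varieties[i][0] for i in subset)
    std = sqrt(sum(varieties[i][1] ** 2 for i in subset))   # independence assumption
    # Chance constraint P(total yield >= Y_MIN) >= 95% becomes mean - z*std >= Y_MIN.
    if mean - z * std >= Y_MIN and (best is None or mean > best[0]):
        best = (mean, subset)

print(best)   # highest-mean subset that satisfies the risk criterion, or None if infeasible
```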
Contributors: Ozcan, Ozkan Meric (Author) / Armbruster, Dieter (Thesis advisor) / Gel, Esma (Thesis advisor) / Sefair, Jorge (Committee member) / Arizona State University (Publisher)
Created: 2019

Description
In this work, I propose a novel, unsupervised framework titled SATLAB to label satellite images, given a classification task at hand. Existing models for satellite image classification, such as DeepSAT and DeepSAT-V2, rely on deep learning models that are label-hungry and require a significant amount of training data. Since manual curation of labels is expensive, I ensure that SATLAB requires zero training labels. SATLAB can work in conjunction with several generative and unsupervised machine learning models by allowing them to be seamlessly plugged into its architecture. I devise three operating modes for SATLAB (manual, semi-automatic, and automatic) that require varying levels of human intervention in creating the domain-specific labeling functions for each image, which can be utilized by candidate generative models such as Snorkel as well as by other unsupervised learners in SATLAB. Unlike existing supervised learning baselines, which only extract textural features from satellite images, SATLAB supports the extraction of both textural and geospatial features, and I empirically show that geospatial features enhance the classification F1-score by 33%. I build SATLAB on top of Apache Sedona in order to leverage its rich set of spatial query processing operators for extracting geospatial features from satellite raster images. I evaluate SATLAB on a target binary classification task that distinguishes slum from non-slum areas, over a repository of 100K satellite images captured by the Sentinel satellite program. My 5-fold cross-validation (CV) experiments show that SATLAB achieves a competitive F1-score (0.6) using 0% labeled data, while the best supervised learning baseline achieves a 0.74 F1-score using 80% labeled data. I also show that Snorkel outperforms alternative generative and unsupervised candidate models that can be plugged into SATLAB by 33% to 71% w.r.t. F1-score and by 3 times to 73 times w.r.t. latency. I also show that downstream classifiers trained using the labels generated by SATLAB are comparable in quality (0.63 F1) to their counterpart classifiers (0.74 F1) trained on manually curated labels.
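The manual and semi-automatic modes hinge on domain-specific labeling functions that a generative model such as Snorkel can combine into probabilistic labels. Below is a minimal, Snorkel-style sketch in plain Python: the feature names and thresholds are assumptions for illustration, not SATLAB's actual schema, and the majority vote stands in for Snorkel's label model.

```python
SLUM, NON_SLUM, ABSTAIN = 1, 0, -1

# Illustrative labeling functions over a per-image feature dictionary.
# Feature names (building_density, road_length_km, ndvi) are hypothetical.
def lf_dense_buildings(x):
    return SLUM if x["building_density"] > 0.7 else ABSTAIN

def lf_formal_roads(x):
    return NON_SLUM if x["road_length_km"] > 5.0 else ABSTAIN

def lf_vegetation(x):
    return NON_SLUM if x["ndvi"] > 0.4 else ABSTAIN

def weak_label(x, lfs=(lf_dense_buildings, lf_formal_roads, lf_vegetation)):
    """Naive majority vote over non-abstaining labeling functions (a stand-in for Snorkel's label model)."""
    votes = [lf(x) for lf in lfs if lf(x) != ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)

image = {"building_density": 0.82, "road_length_km": 1.3, "ndvi": 0.12}
print(weak_label(image))   # 1 (slum) for this illustrative feature vector
```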
Contributors: Aggarwal, Shantanu (Author) / Sarwat, Mohamed (Thesis advisor) / Zou, Jia (Committee member) / Boscovic, Dragan (Committee member) / Arizona State University (Publisher)
Created: 2022

Description
Virtualization technologies are widely used in modern computing systems to deliver shared resources to heterogeneous applications. Virtual Machines (VMs) are the basic building blocks for Infrastructure as a Service (IaaS), and containers are widely used to provide Platform as a Service (PaaS). Although it is generally believed that containers have less overhead than VMs, an important tradeoff that has not been thoroughly studied is the effectiveness of performance isolation, i.e., to what extent the virtualization technology prevents applications from affecting each other's performance when they share resources using separate VMs or containers. Such isolation is critical to providing performance guarantees for applications consolidated using VMs or containers. This paper provides a comprehensive study of performance isolation for three widely used virtualization technologies, full virtualization, para-virtualization, and operating-system-level virtualization, using Kernel-based Virtual Machine (KVM), Xen, and Docker containers as representative implementations of these technologies. The results show that containers generally have less performance loss (up to 69% and 41% compared to KVM and Xen in network latency experiments, respectively) and better scalability (up to 83.3% and 64.6% faster than KVM and Xen when increasing the number of VMs/containers to 64, respectively), but they also suffer from much worse isolation (up to 111.8% and 104.92% slowdown compared to KVM and Xen when adding a disk stress test to TeraSort experiments under the full-usage (FU) scenario, respectively). Resource reservation tools help the virtualization technologies achieve better performance (up to 85.9% better disk performance in TeraSort under the FU scenario), but cannot help them avoid all interference.
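A minimal sketch of the isolation metric itself: the percentage slowdown of a victim workload when a stress test runs alongside it. In the study the victim and the stressor live in separate VMs or containers on shared hardware; here they are plain processes purely to illustrate how slowdown figures of this kind are computed, and on a lightly loaded multi-core machine the measured slowdown may be close to zero.

```python
import time
from multiprocessing import Process

def workload(n=2_000_000):
    """Victim workload: a fixed amount of CPU-bound work."""
    s = 0
    for i in range(n):
        s += i * i
    return s

def stressor(seconds=5):
    """Co-located stress test: burn CPU for a fixed duration."""
    end = time.time() + seconds
    while time.time() < end:
        pass

def timed_run():
    start = time.time()
    workload()
    return time.time() - start

if __name__ == "__main__":
    baseline = timed_run()                  # victim running alone
    p = Process(target=stressor)
    p.start()
    contended = timed_run()                 # victim running next to the stressor
    p.join()
    print(f"slowdown under contention: {100 * (contended - baseline) / baseline:.1f}%")
```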
Contributors: Huang, Zige (Author) / Zhao, Ming (Thesis advisor) / Sarwat, Mohamed (Committee member) / Wang, Ruoyu (Committee member) / Arizona State University (Publisher)
Created: 2019

Description
Researchers and practitioners have widely studied road network traffic data in areas such as urban planning, traffic prediction, and spatio-temporal databases. For instance, researchers use such data to evaluate the impact of road network changes. Unfortunately, collecting large-scale, high-quality urban traffic data requires tremendous effort because participating vehicles must install Global Positioning System (GPS) receivers and administrators must continuously monitor these devices. Some urban traffic simulators attempt to generate such data with different features. However, they suffer from two critical issues: (1) scalability: most of them offer only a single-machine solution, which is not adequate for producing large-scale data, and the simulators that can generate traffic in parallel do not balance the load well among machines in a cluster; (2) granularity: many simulators do not consider microscopic traffic behavior, including traffic lights, lane changing, and car following. This paper proposes GeoSparkSim, a scalable traffic simulator that extends Apache Spark to generate large-scale road network traffic datasets with microscopic traffic simulation. The proposed system seamlessly integrates with a Spark-based spatial data management system, GeoSpark, to deliver a holistic approach that allows data scientists to simulate, analyze, and visualize large-scale urban traffic data. To implement microscopic traffic models, GeoSparkSim employs a simulation-aware vehicle partitioning method that partitions vehicles among different machines such that each machine has a balanced workload. The experimental analysis shows that GeoSparkSim can simulate the movements of 200 thousand cars over an extensive road network (250 thousand road junctions and 300 thousand road segments).
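A minimal sketch of the load-balancing idea behind simulation-aware vehicle partitioning: greedily assign road-network regions (with their vehicle counts) to the currently least-loaded machine. The region names and counts are illustrative, and this greedy heuristic is only a stand-in for GeoSparkSim's actual partitioner.

```python
import heapq

def balance_vehicles(vehicle_counts_per_region, num_partitions):
    """Greedily assign regions to partitions, placing the next-largest region
    on the currently least-loaded partition (longest-processing-time heuristic)."""
    # Min-heap of (current load, partition id, assigned regions).
    partitions = [(0, p, []) for p in range(num_partitions)]
    heapq.heapify(partitions)
    for region, count in sorted(vehicle_counts_per_region.items(), key=lambda kv: -kv[1]):
        load, pid, regions = heapq.heappop(partitions)
        regions.append(region)
        heapq.heappush(partitions, (load + count, pid, regions))
    return sorted(partitions, key=lambda t: t[1])

regions = {"downtown": 90_000, "suburb_a": 40_000, "suburb_b": 35_000, "highway": 35_000}
for load, pid, assigned in balance_vehicles(regions, num_partitions=2):
    print(f"machine {pid}: {assigned} ({load} vehicles)")
```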
Contributors: Fu, Zishan (Author) / Sarwat, Mohamed (Thesis advisor) / Pedrielli, Giulia (Committee member) / Sefair, Jorge (Committee member) / Arizona State University (Publisher)
Created: 2019

Description
The shift in focus of manufacturing systems to high-mix, low-volume production poses a challenge to both efficient scheduling of manufacturing operations and effective assessment of production capacity. This thesis considers the problem of scheduling a set of jobs that require machine and worker resources to complete their manufacturing operations. Although planners in manufacturing contexts typically focus solely on machines, schedules that consider only machining requirements may be problematic during implementation because machines need skilled workers and cannot run unsupervised. The model used in this research will be beneficial in these environments, as planners would be able to determine more realistic assignments and operation sequences to minimize the total time required to complete all jobs. This thesis presents a mathematical formulation for the concurrent scheduling of machines and workers that can optimally schedule a set of jobs while accounting for changeover times between operations. The mathematical formulation is based on disjunctive constraints that capture the conflict between operations when trying to schedule them on the same machine or worker. An additional formulation extends the previous one to consider how cross-training may impact production capacity and, for a given budget, to provide training recommendations for specific workers and operations that reduce the makespan. If training a worker is advantageous for increasing production capacity, the model recommends the best time window in which to complete it so that overlaps with work assignments are avoided. It is assumed that workers can perform tasks involving the recently acquired skills as soon as training is complete. As an alternative to the mixed-integer programming formulations, this thesis provides a math-heuristic approach that fixes the order of some operations based on Largest Processing Time (LPT) and Shortest Processing Time (SPT) procedures while allowing the exact formulation to find the optimal schedule for the remaining operations. Computational experiments include using the solution to the no-training problem as a starting feasible solution for the training problem. Although the models provided are general, the manufacturing of printed circuit boards is used as a case study.
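A minimal sketch of the disjunctive constraints described above, using a standard big-M linearization in PuLP for two operations that compete for the same machine/worker. The processing times and the big-M value are illustrative, and this is not the thesis's full formulation (no changeover times or cross-training).

```python
from pulp import LpProblem, LpMinimize, LpVariable, LpBinary

p = {"A": 3, "B": 5}    # illustrative processing times for two operations
M = 100                  # big-M constant, larger than any feasible schedule length

prob = LpProblem("two_ops_shared_resource", LpMinimize)
s = {j: LpVariable(f"start_{j}", lowBound=0) for j in p}
Cmax = LpVariable("makespan", lowBound=0)
y = LpVariable("A_before_B", cat=LpBinary)   # 1 if A precedes B on the shared resource

prob += Cmax                                  # objective: minimize the makespan
for j in p:
    prob += Cmax >= s[j] + p[j]               # makespan covers every operation's completion
# Disjunctive pair: exactly one ordering holds on the shared machine/worker.
prob += s["A"] + p["A"] <= s["B"] + M * (1 - y)
prob += s["B"] + p["B"] <= s["A"] + M * y

prob.solve()
print(Cmax.value(), {j: s[j].value() for j in p})   # makespan 8: the operations run back to back
```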
Contributors: Adams, Katherine Bahia (Author) / Sefair, Jorge (Thesis advisor) / Askin, Ronald (Thesis advisor) / Webster, Scott (Committee member) / Arizona State University (Publisher)
Created: 2019