Reading List
Provenance: Overview
- Provenance and Scientific Workflows: Challenges and Opportunities Susan Davidson and Juliana Freire. In Proceedings of ACM SIGMOD International Conference on Management of Data, 2008. Tutorial resources
- Provenance for Computational Tasks: A Survey Juliana Freire, David Koop, Emanuele Santos, and Claudio T. Silva. In IEEE Computing in Science & Engineering, 2008.
- Lineage retrieval for scientific data processing: a survey. R. Bose and J. Frew. ACM Computing Surveys, 37(1):1-28, 2005.
- Provenance in Databases: Past, Current, and Future W. Tan. IEEE Data Engineering Bulletin.
- A survey of data provenance in e-science, Yogesh L. Simmhan, Beth Plale, Dennis Gannon, SIGMOD Record, September, 2005.
Provenance in Databases
Would like to present: Avishek (10)
Would like to critique:
- Provenance in Databases: Past, Current, and Future W. Tan. IEEE Data Engineering Bulletin. (short overview)
- Curated Databases W. Tan, P. Buneman, J. Cheney, S. Vansummeren. ACM Symposium on Principles of Database Systems (PODS), 2008.
- Provenance as Dependency Analysis James Cheney, Amal Ahmed, Umut A. Acar. DBPL 2007: 138-152
- Database Provenance Tutorial W. Tan and P. Buneman
Provenance Management: Storage, Indexing and Querying
Would like to present: <<add your name and rate your preference from 1-10, 1=least interested, 10 = very interested>>
Would like to critique: Mark Valentine (5), David Koop (5)
- Querying and Creating Visualizations by Analogy. Carlos E. Scheidegger, Huy T. Vo, David Koop, Juliana Freire and Claudio T. Silva. IEEE Transactions on Visualization and Computer Graphics, 13(6), pp. 1560-1567, 2007. Best paper in IEEE Visualization 2007.
- Efficient Provenance Storage. Adriane Chapman, H. V. Jagadish, and Prakash Ramanan. SIGMOD 2008.
- Efficient lineage tracking for scientific workflows. Thomas Heinis, Gustavo Alonso. SIGMOD Conference 2008: 1007-1018
- Querying and Managing Provenance through User Views in Scientific Workflows. Olivier Biton, Sarah Cohen Boulakia, Susan B. Davidson, Carmem S. Hara. ICDE 2008: 1072-1081
- Querying Business Processes. Catriel Beeri, Anat Eyal, Simon Kamenkovich, Tova Milo. VLDB 2006: 343-354
Provenance/Workflow/Graph Indexing
Would like to present: Mark Valentine (10), David Koop (10)
Would like to critique: Avishek Saha (10)
- Algorithmics and Applications of Tree and Graph Searching D. Shasha, J. T. L. Wang, and R. Giugno. PODS 2002.
- Graph Indexing: Tree + Delta >= Graph P. Zhao, J. X. Yu, and P. S. Yu. VLDB 2007.
- Closure-Tree: An Index Structure for Graph Queries H. He and A. K. Singh. ICDE 2006.
Some more papers:
- Efficient Matching and Indexing of Graph Models in Content-Based Retrieval by Stefano Berretti, Alberto Del Bimbo, Enrico Vicario, IEEE TPAMI 2001
- Computing Frequent Graph Patterns from Semistructured Data by N. Vanetik, E. Gudes, S. E. Shimony ICDM 2002
- Graph indexing based on discriminative frequent structure analysis by Xifeng Yan, Philip S. Yu, Jiawei Han TODS 2004
- Graph Indexing: A Frequent Structure-based Approach by Xifeng Yan, Philip S. Yu, Jiawei Han SIGMOD 2004
- Graph Database Indexing Using Structured Graph Decomposition by David W. Williams, Jun Huan, Wei Wang ICDE 2007
- Towards graph containment search and indexing by Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han, Dong-Qing Zhang, Xiaohui Gu, VLDB 2007
- Treepi: A novel graph indexing method by S Zhang, M Hu, J Yang ICDE 2007
- Summarization Graph Indexing: Beyond Frequent Structure-based Approach by Lei Zou, Lei Chen, Huaming Zhang, Yansheng Lu, and Qiang Lou
Presentation:
Video lecture:
- Mining, Indexing, and Searching Graphs in Large Data Sets by Jiawei Han Nature 2007
Provenance Mining
Would like to present: Parasaran Raman (10)
Would like to critique: Zhan Wang (10)
- VisComplete: Automating Suggestions for Visualization Pipelines. David Koop, Carlos E. Scheidegger, Steven P. Callahan, Huy T. Vo, Juliana Freire and Claudio T. Silva. In IEEE Transactions on Visualization and Computer Graphics, 14(6), pp. 1691-1698, 2008.
- A First Study on Clustering Collections of Workflow Graphs E. Santos, L. Lins, J. P. Ahrens, J. Freire, C. Silva. In Proceedings of IPAW, pp. 160-173, 2008
- Process Mining Based on Clustering: A Quest for Precision. A.K. Alves de Medeiros, A. Guzzo, G. Greco, W.M.P. van der Aalst, A.J.M.M. Weijters, B. van Dongen, and D. Saccà. In A. ter Hofstede, B. Benatallah, and H.-Y. Paik, editors, BPM 2007 Workshops, LNCS 4928: 17–29, 2008.
- Mining and Reasoning on Workflows. Greco et al. TKDE 2005
Provenance Applications: Publications
Would like to present: Mark Valentine (5), Pravin (5)
Would like to critique: Zhan Wang (10)
- Reproducible Research. Sergey Fomel and Jon F. Claerbout. CiSE 11(1): 5-7, Jan.-Feb. 2009. DOI: 10.1109/MCSE.2009.14
- Reproducible Research: A Bioinformatics Case Study Robert Gentleman. Bioconductor Project Working Papers. Working Paper 3. (May 2004).
- An Introduction to the Dataverse Network as an Infrastructure for Data Sharing. Gary King. Sociological Methods and Research. Vol. 32, No. 2 (November 2007): pp. 173-199.
Provenance: Security and Privacy
Would like to present: Parasaran Raman (10), Zhan Wang (5), Komal (10)
Would like to critique: Mark Valentine (10), David Koop (5)
- Securing provenance. U. Braun, A. Shinnar, and M. Seltzer. In HotSec’08, 2008.
- The Case of the Fake Picasso: Preventing History Forgery with Secure Provenance, Ragib Hasan, Radu Sion, and Marianne Winslett, USENIX FAST 2009
- Introducing Secure Provenance: Problems and Challenges, Ragib Hasan, Radu Sion, Marianne Winslett, in ACM StorageSS 2007.
- TAPIDO: Trust and Authorization via Provenance and Integrity in Distributed Objects. A. Cirillo, R. Jagadeesan, C. Pitcher, and J. Riely. In European Symposium on Programming (ESOP), Lecture Notes in Computer Science, Springer, 2008.
- Evidence-Based Audit. Jeffrey A. Vaughan, Limin Jia, Karl Mazurak, Steve Zdancewic. CSF 2008: 177-191
- SELinks: End to end security for Web applications. Hicks, Swamy, and Corcoran. Project Web Site
Storing Scientific Data
Would like to present: Parasaran Raman (5), Zhan Wang (10), Komal (10)
Would like to critique: Parasaran Raman (10)
- To BLOB or Not To BLOB: Large Object Storage in a Database or a Filesystem? Russell Sears; Catharine Van Ingen; Jim Gray. CIDR 2007
- The Sloan Digital Sky Survey Science Archive: Migrating a Multi-Terabyte Astronomical Archive from Object to Relational DBMS. Aniruddha R. Thakar, Alexander S. Szalay, Peter Z. Kunszt, Jim Gray. CoRR cs.DB/0403020: (2004)
- Life Under Your Feet: An End-to-End Soil Ecology Sensor Network, Database, Web Server, and Analysis Service. Katalin Szlavecz, Andreas Terzis, Stuart Ozer, Razvan Musaloiu-E, Joshua Cogan, Sam Small, Randal Burns, Jim Gray, Alex Szalay
Web schema matching and schema integration
Would like to present: Thanh Nguyen (10), Avishek Saha (10)
Would like to critique: Parasaran Raman (10), Huong Nguyen (10), Avishek Saha (5)
Thanh:
- An interactive clustering-based approach to integrating source query interfaces on the deep Web Wensheng Wu, Clement Yu, AnHai Doan, Weiyi Meng, SIGMOD 2004
- Automatic complex schema matching across Web query interfaces Bin He, Kevin Chuan Chang, ACM Trans. Database Syst. 2006
Avishek:
- Web-scale Data Integration: You can only afford to Pay As You Go. Jayant Madhavan, Shawn R. Jeffery, Shirley Cohen, Xin (Luna) Dong, David Ko, Cong Yu, Alon Halevy. CIDR 2007
- Data Integration with Uncertainty Xin Dong, Alon Y. Halevy, Cong Yu. VLDB 2007
Some more papers
- A survey of approaches to automatic schema matching. Erhard Rahm and Philip A. Bernstein, VLDB Journal 2001
- A Survey of Schema-based Matching Approaches. Pavel Shvaiko and Jerome Euzenat, JoDS 2005
- Why is schema matching tough and what can we do about it? Avigdor Gal. ACM SIGMOD Record
- Wise-integrator: An automatic integrator of web search interfaces for e-commerce. Hai He and Weiyi Meng. VLDB 2003
- Holistic query interface matching using parallel schema matching. W. Su, J. Wang, and F. Lochovsky. ICDE '06
- Corpus-based schema matching. Jayant Madhavan, Philip A. Bernstein, Anhai Doan, Alon Halevy. ICDE 05
- A Robust Approach to Schema Matching over Web Query Interfaces. Jin Pei, Jun Hong, David Bell. ICDE 06
- Statistical Schema Matching across Web Query Interfaces Bin He, Kevin Chen-Chuan Chang, SIGMOD 2003
- Merging Source Query Interfaces on Web Databases, Eduard Dragut, ICDE 06
Some more papers on Dataspaces:
- From databases to dataspaces: a new abstraction for information management by Michael Franklin, Alon Halevy, David Maier, SIGMOD 2005
- A first tutorial on dataspaces by Michael Franklin, Alon Halevy, David Maier, VLDB 2008
Presentation:
- Thanh
- Avishek
Querying Diverse Data
Relational data on the Web
Would like to present: Huong Nguyen (10), Avishek Saha (10), Pravin (10), Komal (10)
Would like to critique: Parasaran Raman (5), Thanh Nguyen (10), Ramesh (5), Zhan Wang (10)
Pravin
- WebTables: exploring the power of tables on the web. Michael J. Cafarella, Alon Y. Halevy, Daisy Zhe Wang, Eugene Wu, Yang Zhang: PVLDB 1(1): 538-549 (2008)
- Uncovering the Relational Web. Michael J. Cafarella, Alon Y. Halevy, Yang Zhang, Daisy Zhe Wang, Eugene Wu. WebDB 2008
Huong
- Mining database structure; or, how to build a data quality browser. Tamraparni Dasu, Theodore Johnson, S. Muthukrishnan, Vladislav Shkapenyuk. SIGMOD 2002
- Information-theoretic tools for mining database structure from large data sets. Periklis Andritsos, Renee J. Miller and Panayiotis Tsaparas. SIGMOD 2004
Some More Papers
- Duplicate Record Detection: A Survey. Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, Vassilios S. Verykios. IEEE TKDE, 2007
- Efficient Discovery of Functional and Approximate Dependencies Using Partitions Yka Huhtala, Juha Karkkainen, Pasi Porkka, and Hannu Toivonen. In Proc. IEEE Intl. conf. on Data Engineering, 1998.
- Mining Association Rules between Sets of Items in Large Databases Rakesh Agrawal, Tomasz Imielinski, Arun Swami. SIGMOD 1993
- LIMBO: Scalable Clustering of Categorical Data. Periklis Andritsos, Panayiotis Tsaparas, Renée J. Miller, and Kenneth C. Sevcik. In EDBT 2004.
Presentation:
Data integration on the fly (or almost...)
Would like to present: Zhan Wang (5)
Would like to critique: Mark Valentine (10), Avishek Saha (10), Ramesh (10)
Zhan:
- From databases to dataspaces: a new abstraction for information management. Michael Franklin, Alon Halevy, David Maier. SIGMOD Record, 2005
- Indexing dataspaces. Xin Dong and Alon Halevy. SIGMOD 2007.
Some more papers:
- Pay-as-you-go user feedback for dataspace systems. Shawn R. Jeffery, Michael J. Franklin, Alon Y. Halevy. SIGMOD Conference 2008: 847-860
- Bootstrapping pay-as-you-go data integration systems. Anish Das Sarma, Xin Dong, Alon Y. Halevy, SIGMOD Conference 2008: 861-874.
- Building Community Wikipedias: A Human-Machine Approach. P. DeRose, X. Chai, B. Gao, W. Shen, A. Doan, P. Bohannon, J. Zhu. ICDE-08.
- The Case for a Structured Approach to Managing Unstructured Data. A. Doan, J. F. Naughton, A. Baid, X. Chai, F. Chen, T. Chen, E. Chu, P. DeRose, B. Gao, C. Gokhale, J. Huang, W. Shen, B. Vuong. CIDR-09.
Presentation PPT:
Usable query interfaces for structured data
Mark:
- Discover: keyword search in relational databases. Vagelis Hristidis, Yannis Papakonstantinou. VLDB 2002.
- Bidirectional Expansion For Keyword Search on Graph Databases. Varun Kacholia, Shashank Pandit, Soumen Chakrabarti, S Sudarshan, Rushi Desai and Hrishikesh Karambelkar, VLDB 2005
Komal:
- Keyword Searching and Browsing in databases using BANKS. Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe, Soumen Chakrabarti, S. Sudarshan. ICDE 2002
- Effective keyword search in relational databases. Fang Liu, Clement Yu, Weiyi Meng, Abdur Chowdhury. SIGMOD 2006, pp. 563-574.
Presentation:
Snippet Generation and Ranking
- A system for query-specific document summarization. Ramakrishna Varadarajan, Vagelis Hristidis. CIKM, 2006 (Ramesh: Will present this)
- Fast generation of result snippets in web search. Andrew Turpin, Yohannes Tsegay, David Hawking, Hugh E. Williams. ACM SIGIR, 2007 (Ramesh: Will present this)
- Object-level ranking: bringing order to Web objects. Zaiqing Nie, Yuanzhi Zhang, Ji-Rong Wen, Wei-Ying Ma. WWW, 2005
- Page quality: in search of an unbiased web ranking. Junghoo Cho, Sourashis Roy, Robert E. Adams. SIGMOD, 2005
The Deep Web
Would like to present: Huong Nguyen (10), Pravin (10), Ramesh (10), Zhan Wang (10), Komal (10)
Would like to critique: Thanh Nguyen (10), Avishek Saha (10), Ramesh (10), Komal (10)
- Google's Deep Web crawl. Jayant Madhavan, David Ko, Lucja Kot, Vignesh Ganapathy, Alex Rasmussen, Alon Y. Halevy. PVLDB 1(2): 1241-1252 (2008) (*Ramesh will present this)
- Siphoning Hidden-Web Data through Keyword-Based Interfaces. Luciano Barbosa and Juliana Freire. In Proceedings of Brazilian Symposium on Databases (SBBD), 2004. (*Huong will present this)
- Instance-based schema matching for web databases by domain-specific query probing. Jiying Wang, Ji-Rong Wen, Fred Lochovsky, Wei-Ying Ma. VLDB 2004
- Query Selection Techniques for Efficient Crawling of Structured Web Sources. Ping Wu, Ji-Rong Wen, Huan Liu, Wei-Ying Ma. ICDE 2006 (*Pravin will present this)
Information Extraction
Zhan will present (a very long paper; only an overview is needed):
- Information extraction Sunita Sarawagi. FnT Databases, 1(3), 2008.
Parasaran will present:
- On the Provenance of Non-Answers to Queries over Extracted Data. J. Huang, T. Chen, A. Doan, J. Naughton. VLDB-08.
Mark will present:
- Information Extraction From Wikipedia: Moving Down the Long Tail Fei Wu, Raphael Hoffmann, Daniel S. Weld
- Intelligence in Wikipedia Daniel S. Weld, Fei Wu, Eytan Adar
Thanh will present:
- Semantic annotation of unstructured and ungrammatical text. Matthew Michelson and Craig A. Knoblock. IJCAI 2005
--- --- ---
- Domain adaptation of information extraction models. Rahul Gupta and Sunita Sarawagi. In SIGMOD Record, 2008.
- Information Extraction Challenges in Managing Unstructured Data. AnHai Doan et al. SIGMOD Record, Winter 08, Special Issue on Managing Information Extraction.
- Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data John Lafferty, Andrew McCallum, Fernando Pereira, ICML 2001.
- 2D Conditional Random Fields for Web Information Extraction. Jun Zhu, Wei-Ying Ma. ICML 2005
- Simultaneous record detection and attribute labeling in web data extraction Jun Zhu, Zaiqing Nie, Ji-Rong Wen, Bo Zhang, Wei-Ying Ma. KDD 2006
- Web Data Extraction Based on Partial Tree Alignment Yanhong Zhai, Bing Liu. WWW 2005