##################################################################
SIMMR RECIPE DATASET version 1.0

Jermsak Jermsurawong and Nizar Habash

Computational Approaches to Modeling Language Lab, Computer Science
New York University Abu Dhabi (NYUAD), United Arab Emirates

##################################################################

(c) 2015 New York University Abu Dhabi

##################################################################

This dataset is provided for research purposes only. For commercial
use, please contact both authors, who will connect you with the
NYUAD technology transfer office.

##################################################################

Simplified Ingredient Merging Map in Recipes (SIMMR) represents a recipe as a dependency tree whose leaves (terminal nodes) are the recipe ingredients and whose internal nodes are the recipe instructions.
The SIMMR dataset contains SIMMR representations of English food recipes. The representations are parsed from the text recipes of the Carnegie Mellon University Recipe Database, which are annotated in the Minimal Instruction Language for the Kitchen (MILK) (Tasse and Smith 2008).

###########
DESCRIPTION
###########

The SIMMR dataset contains 260 recipes, split into three files for training, development, and testing.
The files are 'simmr.train.txt', 'simmr.develop.txt', and 'simmr.test.txt', with 126, 56, and 78 recipes respectively.

Each data file contains a set of SIMMR recipes. Each recipe is a simple tree, represented as an adjacency list in which every node line names that node's parent.
The first line of each data file gives the number of recipes, R. The R recipes follow.
Each recipe begins with a line of two tab-separated strings:
	- recipe<N>, where 0 <= N <= R-1
	- <recipe name>

One line follows, containing two space-separated integers, G and S.
G ingredient lines follow, listing G ingredient nodes.
S instruction lines then follow, listing S instruction nodes.

Each ingredient line consists of the following tab-separated fields:
	- ing<g>, the ingredient node index, where 0 <= g <= G-1
	- inst<s>, the ingredient node's parent, an instruction node with index s, where 0 <= s <= S-1
	- <ingredient description>

Each instruction line similarly consists of the following tab-separated fields:
	- inst<s>, the instruction node index, where 0 <= s <= S-1
	- inst<s+n>, the instruction node's parent, which is the n-th subsequent instruction node (n >= 1)
	- <instruction>

The last instruction line of each recipe has ROOT as its parent.
Recipes are separated by a line of the form #-------------------------------.
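
The constraints on instruction lines described above (indices run in order, each parent is a later instruction, and the last instruction attaches to ROOT) can be checked mechanically. The following Python sketch is illustrative only and is not part of the dataset distribution. The function name instruction_parents_valid and the (index, parent) pair input are assumptions made for this example, as is the reading that n >= 1 for every non-final instruction.

```python
from typing import List, Tuple


def instruction_parents_valid(instructions: List[Tuple[str, str]]) -> bool:
    """Check the parent fields of one recipe's instruction lines.

    `instructions` holds (index, parent) pairs in file order,
    e.g. [("inst0", "inst1"), ("inst1", "ROOT")].
    """
    last = len(instructions) - 1
    for s, (index, parent) in enumerate(instructions):
        # Indices are expected to run inst0 .. inst<S-1> in file order.
        if index != f"inst{s}":
            return False
        if s == last:
            # The final instruction must attach to ROOT.
            if parent != "ROOT":
                return False
            continue
        # Assumption: every other parent is inst<s+n> with n >= 1.
        if not parent.startswith("inst") or not parent[4:].isdigit():
            return False
        if not s < int(parent[4:]) <= last:
            return False
    return True
```

Calling instruction_parents_valid([("inst0", "inst1"), ("inst1", "ROOT")]) returns True, while a list whose last parent is not ROOT returns False.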

A sample SIMMR data file with 2 recipes looks as follows:

2
recipe0	RecipeName0
3 2
ing0	inst0	IngredientDescription0
ing1	inst1	IngredientDescription1
ing2	inst1	IngredientDescription2
inst0	inst1	Instruction0
inst1	ROOT	Instruction1
#-------------------------------
recipe1	RecipeName1
5 4
ing0	inst0	IngredientDescription0
ing1	inst0	IngredientDescription1
ing2	inst1	IngredientDescription2
ing3	inst1	IngredientDescription3
ing4	inst3	IngredientDescription4
inst0	inst1	Instruction0
inst1	inst2	Instruction1
inst2	inst3	Instruction2
inst3	ROOT	Instruction3
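
A file in this layout can be read with a short parser. The Python sketch below is not part of the dataset distribution. The names parse_simmr, Recipe, and Node are hypothetical, and the code assumes that a file follows the layout described above exactly: a count line, tab-separated fields, and separator lines beginning with '#-'.

```python
from typing import List, NamedTuple


class Node(NamedTuple):
    index: str   # node label, e.g. "ing0" or "inst1"
    parent: str  # parent label, e.g. "inst1" or "ROOT"
    text: str    # ingredient description or instruction text


class Recipe(NamedTuple):
    recipe_id: str
    name: str
    ingredients: List[Node]
    instructions: List[Node]


def parse_simmr(text: str) -> List[Recipe]:
    """Parse the full contents of one SIMMR data file."""
    # Skip blank lines and '#-' separators; the G and S counts drive the parse.
    lines = [ln for ln in text.splitlines()
             if ln.strip() and not ln.startswith("#-")]
    recipes: List[Recipe] = []
    pos = 1  # lines[0] holds the recipe count R
    for _ in range(int(lines[0])):
        recipe_id, name = lines[pos].split("\t", 1)
        g, s = (int(x) for x in lines[pos + 1].split())
        pos += 2
        nodes = []
        for line in lines[pos:pos + g + s]:
            fields = line.split("\t", 2)
            fields += [""] * (3 - len(fields))  # tolerate missing trailing fields
            nodes.append(Node(*fields))
        pos += g + s
        recipes.append(Recipe(recipe_id, name, nodes[:g], nodes[g:]))
    return recipes


# A one-recipe file in the layout shown above.
SAMPLE = (
    "1\n"
    "recipe0\tRecipeName0\n"
    "3 2\n"
    "ing0\tinst0\tIngredientDescription0\n"
    "ing1\tinst1\tIngredientDescription1\n"
    "ing2\tinst1\tIngredientDescription2\n"
    "inst0\tinst1\tInstruction0\n"
    "inst1\tROOT\tInstruction1\n"
)

recipes = parse_simmr(SAMPLE)
for node in recipes[0].ingredients + recipes[0].instructions:
    print(node.index, "->", node.parent)
```

To read a real file, pass its contents to parse_simmr, for example the text of 'simmr.train.txt'. The file encoding is not stated in this README, so it would need to be checked before opening the file.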

##############
KNOWN PROBLEMS
##############

SIMMR is built from MILK annotations. When an ingredient listed in a text recipe is not accounted for in the MILK annotations, the resulting SIMMR tree contains a lone ingredient node with no instruction parent to link to.
The following recipes contain lone ingredient nodes:
'simmr.train.txt'
	- 58 Fall Harvest Baked Apples
	- 63 Best Macaroni Salad
	- 68 Garlic Chicken Stir Fry
	- 87 Cabin Hash
	- 103 Grandmas Carrot Cake
	- 105 Chai Butter Cookies
'simmr.develop.txt'
	- 24 German Chocolate Brownie Cookies
	- 41 Cookie Jar Sugar Cookies
'simmr.test.txt'
	- 5 Fried Apple Pies
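
Recipes of this kind can be flagged before further processing. This README does not state how the parent field of a lone ingredient is written in the data files, so the Python sketch below simply reports every ingredient whose parent field is not a valid instruction index inst0 .. inst<S-1>. That criterion, the function name lone_ingredients, and the (index, parent) pair input are all assumptions to verify against the files.

```python
from typing import List, Tuple


def lone_ingredients(ingredients: List[Tuple[str, str]],
                     num_instructions: int) -> List[str]:
    """Return the indices of ingredients without a valid instruction parent.

    `ingredients` holds (index, parent) pairs, e.g. ("ing0", "inst0").
    Assumption: a lone ingredient is any ingredient whose parent field
    is not one of inst0 .. inst<S-1>, where S is `num_instructions`.
    """
    valid = {f"inst{s}" for s in range(num_instructions)}
    return [index for index, parent in ingredients if parent not in valid]
```

With one instruction in the recipe, lone_ingredients([("ing0", "inst0"), ("ing1", "")], 1) reports only ing1.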

This dataset includes one additional recipe that was not part of the paper: recipe 55, Grandmas Wheat Germ Cookies, in the development set.
 
###############
FURTHER READING
###############

Please cite the following paper in any work that uses the SIMMR dataset:

Jermsak Jermsurawong and Nizar Habash. Predicting the Structure of Cooking Recipes. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 781–786, Lisbon, Portugal, 17-21 September 2015.

http://www.aclweb.org/anthology/D/D15/D15-1090.pdf

@InProceedings{jermsurawong-habash:2015:EMNLP,
  author    = {Jermsurawong, Jermsak  and  Habash, Nizar},
  title     = {Predicting the Structure of Cooking Recipes},
  booktitle = {Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing},
  month     = {September},
  year      = {2015},
  address   = {Lisbon, Portugal},
  publisher = {Association for Computational Linguistics},
  pages     = {781--786},
  url       = {http://aclweb.org/anthology/D15-1090}
}



##########
REFERENCES
##########

Dan Tasse and Noah A. Smith. 2008. Sour cream: Toward semantic processing of recipes. Technical Report CMU-LTI-08-005, Carnegie Mellon University, Pittsburgh, PA.
